One of the most exciting developments in database theory in recent years is the entropic framework for cardinality bounds, which emerges from deep connections between database theory and information theory. In this blog post, I explain the fundamental concepts and key steps for estimating join size via information inequalities. This post is based on a lecture by Dan Suciu during the Simons Institute program Logic and Algorithms in Database Theory and AI. The technique is developed in a line of work including AGM13, GLVV12, GM14, ANS16, ANS17. See Suc23 for a wonderful survey of the literature.

The intuition behind using information theory to estimate join output
size is that the *entropy of a relation*, a notion to be made
precise later, carries information about the relation. Inequalities over
entropic functions have also been studied for decades, so we can
leverage the results to derive bounds for database queries.

We begin by reviewing several basic definitions in information
theory. First, the *entropy* h(X) of a finite random variable X is defined as follows:

h(X) \stackrel{\text{def}}{=} -\sum_i p_i \log p_i

where p_i is the probability of the
outcome X_i, and \log has base 2. Intuitively, the entropy
measures the *randomness* of X:
if X is deterministic, then:

p_i = \begin{cases} 1 & \text{when } i = c \\ 0 & \text{when } i \neq c \end{cases}

for some constant c. Therefore h(X)= - 1 \log 1 = 0. If X is uniform (as random as it gets), then h(X) = - m \times \frac{1}{m} \times \log \frac{1}{m} = \log m, where m is the size of the support for X.

When \mathbf{X} is a set of
variables, h(\mathbf{X}) denotes the
entropy over the joint distribution over the set. We write h(\mathbf{X}\mathbf{Y}) for h(\mathbf{X}\cup \mathbf{Y}), and h(XY) for h(\{X,
Y\}). For a set of n random
variables \mathbf{X}, we define the
*entropic vector* as a function from any subset of \mathbf{X} to the entropy of their joint
distribution:

(h(\mathbf{X}_S))_{S \subseteq [n]} \in \mathbb{R}^{2^n}_+

As an example, let’s consider the distribution over X,Y,Z as shown in the table below:

X | Y | Z | p |
---|---|---|---|

0 | 0 | 0 | 1/4 |

0 | 1 | 1 | 1/4 |

1 | 0 | 1 | 1/4 |

1 | 1 | 0 | 1/4 |

The entropic vector h then has values as shown in the Hasse diagram below:

We next define the *conditional entropy* as an analog to the
conditional distribution:

h(\mathbf{V}\mid \mathbf{U}) \stackrel{\text{def}}{=} h(\mathbf{UV}) - h(\mathbf{U})

Importantly, *the conditional entropy is not an entropic
vector*! In other words, there may not always be a distribution
whose entropies corresponds to a given conditional entropy. Instead, the
conditional entropy can be equivalently defined as the expectation of
the entropy of \mathbf{V} given \mathbf{U}:

h(\mathbf{V\mid U}) = \mathop{{}\mathbb{E}}_\mathbf{u}[h(\mathbf{V}\mid \mathbf{U}=\mathbf{u})]

The reader can verify this is indeed equivalent to the definition on our example above.

One final bit of notation we need is the *conditional mutual
information*:

I_h(\mathbf{V};\mathbf{W}\mid\mathbf{U}) \stackrel{\text{def}}{=} h(\mathbf{UV}) + h(\mathbf{UW}) - h(\mathbf{UVW}) - h(\mathbf{U})

The key property is that \mathbf{V} and \mathbf{W} are independent given \mathbf{U} iff I_h(\mathbf{V};\mathbf{W}\mid\mathbf{U})=0. Rearranging in terms of conditional entropies can perhaps make the definition more intuitive:

\begin{align*} I_h(\mathbf{V;W \mid U}) & = (h(\mathbf{UV}) - h(\mathbf{U})) - (h(\mathbf{UVW}) - h(\mathbf{UW})) \\ & = h(\mathbf{V\mid U}) - h(\mathbf{V\mid UW}) \end{align*}

Here we see that I_h measures how much less random \mathbf{V} becomes once we know \mathbf{W}, if we already knew \mathbf{U}. Or, how much more information we gain about \mathbf{V} after conditioning on \mathbf{W}. Note that the definition is symmetric: I_h(\mathbf{V;W\mid U}) = I_h(\mathbf{W;V\mid U}).

From these definitions, the three fundamental entropic inequalities,
called *Shannon inequalities*, fall out very naturally:

\begin{align*} 0 & \leq h(\mathbf{U}) \leq \log |\text{supp}(\mathbf{U})| \\ 0 & \leq h(\mathbf{U \mid V}) \\ 0 & \leq I_h(\mathbf{V;W\mid U}) \end{align*}

where |\text{supp}(\mathbf{U})| is
the size of the support of \mathbf{U},
i.e. number of possible outcomes. The second inequality is called
*monotonicity*, because it is equivalent to h(\mathbf{V}) \leq h(\mathbf{UV}), which says
the distribution over more variables becomes more random. The last
inequality is called *submodularity* (a.k.a. “law of diminishing
returns”). If we replace \mathbf{V}
with \delta, we can rewrite the
inequality as h(\mathbf{U\cup W} \cup \delta)
- h(\mathbf{U\cup W}) \leq h(\mathbf{U}\cup\delta) -
h(\mathbf{U}) which is exactly the law of diminishing
returns.

OK, enough about entropies! How can we use them for databases? The key idea is to see a relation as a distribution over its tuples. Our example before (copied here) assigns the same probability to every tuple in the relation, so it defines a uniform distribution over the tuples.

X | Y | Z | p |
---|---|---|---|

0 | 0 | 0 | 1/4 |

0 | 1 | 1 | 1/4 |

1 | 0 | 1 | 1/4 |

1 | 1 | 0 | 1/4 |

Then, we can use the first Shannon inequality to connect the entropy
of a relation to its size, because the relation R(X, Y, Z) is exactly the support \text{supp}(XYZ), so h(XYZ) \leq \log |R|. Similarly, we can
connect the conditional entropy to *degree constraints*. For two
sets of attributes \mathbf{U, V} and
values \mathbf{u}, define the
*degree* of \mathbf{V} given
\mathbf{U = u}, written \text{deg}(\mathbf{V \mid U=u}), as the
number of tuples with distinct \mathbf{V} values given \mathbf{U=u}. For example, \text{deg}(XY\mid Z=0)=2, \text{deg}(X\mid YZ=01)=1, and \text{deg}(XYZ\mid \emptyset) = |R| = 4. The
max degree of \mathbf{V} given \mathbf{U}, written \max \text{deg}(\mathbf{V \mid U}), is the
maximum of \text{deg}(\mathbf{V \mid U =
u}) over all \mathbf{u \in U}.
Then we have:

\begin{align*} h(\mathbf{V \mid U}) & = \mathop{{}\mathbb{E}}_\mathbf{u}[h(\mathbf{V}\mid \mathbf{U}=\mathbf{u})] \\ & \leq \max_\mathbf{u} h(\mathbf{V}\mid \mathbf{U}=\mathbf{u}) \\ & \leq \max_\mathbf{u} \log (\text{deg}(\mathbf{V} \mid \mathbf{U=u})) \\ & = \log (\max \text{deg}(\mathbf{V\mid U})) \end{align*}

With this, we can translate bounds on the size of relations and degree constraints into bounds on entropies. Note that degree constraints generalize functional dependencies (FDs). For example an FD X \rightarrow Y can be expressed as \text{deg}(Y\mid X) \leq 1. This means we can take advantage of functional dependencies (e.g. primary keys) when deriving bounds on query output size. Let’s try it!

Consider the following query:

Q(X, Y, Z, U) = R(X, Y) \wedge S(Y, Z) \wedge T(Z, U) \wedge A(X, Z, U) \wedge B(X, Y, U)

And suppose we know |R| = |S| = |T| =
N, but we don’t how large A and
B can be. Then Q can be as large as the cartesian product of
R and T, so |Q| \leq
N^2. But suppose we also know the FDs XZ \rightarrow U and YU \rightarrow X hold, we can use the Shannon
inequalities to derive a tighter bound. First, consider a uniform
distribution over the *output* of Q. Then h(XY) \leq
\log |\pi_{XY}Q|\leq \log |R| because the projection of the
output on XY must be in the input R(X, Y), and similarly for S(Y, Z) and T(Z,
U). By the same reasoning, h(U\mid XZ)
\leq \log \max \text{deg}_Q(U \mid XZ) \leq \log \max \text{deg}_A (U
\mid XZ), because the degree of any (sets of) attribute(s) in the
output is bounded by the degree in the input. We can now derive:

\begin{align*} & \log |R| + \log |S| + \log |T| + \log \max \text{deg}_A (U \mid XZ) + \log \max \text{deg}_B (X \mid YU) \\ \geq & h(XY) + h(YZ) + h(ZU) + h(U\mid XZ) + h(X \mid YU) \\ \geq & \underline{h(XYZ) + h(Y)} + h(ZU) + h(U\mid XZ) + h(X \mid YU) \\ \geq & h(XYZ) + \underline{h(YZU)} + h(U\mid XZ) + h(X \mid YU) \\ \geq & h(XYZ) + h(YZU) + \underline{h(U\mid XYZ) + h(X \mid YZU)} \\ = & 2h(XYZU) = 2 \log(|Q|) \end{align*}

All inequalities after the first one follow from submodularity, and the second-to-last equality follows from the definition of conditional entropy. Removing the \log by exponentiation on both sides, we get:

|Q| \leq \sqrt{|R|\cdot |S| \cdot |T| \cdot \max \text{deg}(U \mid XZ) \cdot \max \text{deg}(X\mid YU)} = N^{\frac{3}{2}}

And it’s tighter than the naïve N^2 bound!

Of course, to use these entropic bounds in a database system, we can’t ask a database theorist to find a proof every time we want to run a new query. Instead we need an automated way to compute a bound, given a set of input sizes and degree constraints. In the next post, we will see how to do exactly that by encoding the constraints, together with the Shannon inequalities, into a linear program (LP). In fact, we can even extract a proof from the LP, and use the proof to derive a query evaluation algorithm that meets the maximum output size bound!