One of the most exciting developments in database theory in recent years is the entropic framework for cardinality bounds, which emerges from deep connections between database theory and information theory. In this blog post, I explain the fundamental concepts and key steps for estimating join size via information inequalities. This post is based on a lecture by Dan Suciu during the Simons Institute program Logic and Algorithms in Database Theory and AI. The technique is developed in a line of work including AGM13, GLVV12, GM14, ANS16, ANS17. See Suc23 for a wonderful survey of the literature.
The intuition behind using information theory to estimate join output size is that the entropy of a relation, a notion to be made precise later, carries information about the relation. Inequalities over entropic functions have also been studied for decades, so we can leverage the results to derive bounds for database queries.
We begin by reviewing several basic definitions in information theory. First, the entropy of a finite random variable is defined as follows:
where is the probability of the outcome , and has base 2. Intuitively, the entropy measures the randomness of : if is deterministic, then:
for some constant . Therefore . If is uniform (as random as it gets), then , where is the size of the support for .
When is a set of variables, denotes the entropy over the joint distribution over the set. We write for , and for . For a set of random variables , we define the entropic vector as a function from any subset of to the entropy of their joint distribution:
As an example, let’s consider the distribution over as shown in the table below:
X | Y | Z | p |
---|---|---|---|
0 | 0 | 0 | 1/4 |
0 | 1 | 1 | 1/4 |
1 | 0 | 1 | 1/4 |
1 | 1 | 0 | 1/4 |
The entropic vector then has values as shown in the Hasse diagram below:
We next define the conditional entropy as an analog to the conditional distribution:
Importantly, the conditional entropy is not an entropic vector! In other words, there may not always be a distribution whose entropies corresponds to a given conditional entropy. Instead, the conditional entropy can be equivalently defined as the expectation of the entropy of given :
The reader can verify this is indeed equivalent to the definition on our example above.
One final bit of notation we need is the conditional mutual information:
The key property is that and are independent given iff . Rearranging in terms of conditional entropies can perhaps make the definition more intuitive:
Here we see that measures how much less random becomes once we know , if we already knew . Or, how much more information we gain about after conditioning on . Note that the definition is symmetric: .
From these definitions, the three fundamental entropic inequalities, called Shannon inequalities, fall out very naturally:
where is the size of the support of , i.e. number of possible outcomes. The second inequality is called monotonicity, because it is equivalent to , which says the distribution over more variables becomes more random. The last inequality is called submodularity (a.k.a. “law of diminishing returns”). If we replace with , we can rewrite the inequality as which is exactly the law of diminishing returns.
OK, enough about entropies! How can we use them for databases? The key idea is to see a relation as a distribution over its tuples. Our example before (copied here) assigns the same probability to every tuple in the relation, so it defines a uniform distribution over the tuples.
X | Y | Z | p |
---|---|---|---|
0 | 0 | 0 | 1/4 |
0 | 1 | 1 | 1/4 |
1 | 0 | 1 | 1/4 |
1 | 1 | 0 | 1/4 |
Then, we can use the first Shannon inequality to connect the entropy of a relation to its size, because the relation is exactly the support , so . Similarly, we can connect the conditional entropy to degree constraints. For two sets of attributes and values , define the degree of given , written , as the number of tuples with distinct values given . For example, , , and . The max degree of given , written , is the maximum of over all . Then we have:
With this, we can translate bounds on the size of relations and degree constraints into bounds on entropies. Note that degree constraints generalize functional dependencies (FDs). For example an FD can be expressed as . This means we can take advantage of functional dependencies (e.g. primary keys) when deriving bounds on query output size. Let’s try it!
Consider the following query:
And suppose we know , but we don’t how large and can be. Then can be as large as the cartesian product of and , so . But suppose we also know the FDs and hold, we can use the Shannon inequalities to derive a tighter bound. First, consider a uniform distribution over the output of . Then because the projection of the output on must be in the input , and similarly for and . By the same reasoning, , because the degree of any (sets of) attribute(s) in the output is bounded by the degree in the input. We can now derive:
All inequalities after the first one follow from submodularity, and the second-to-last equality follows from the definition of conditional entropy. Removing the by exponentiation on both sides, we get:
And it’s tighter than the naïve bound!
Of course, to use these entropic bounds in a database system, we can’t ask a database theorist to find a proof every time we want to run a new query. Instead we need an automated way to compute a bound, given a set of input sizes and degree constraints. In the next post, we will see how to do exactly that by encoding the constraints, together with the Shannon inequalities, into a linear program (LP). In fact, we can even extract a proof from the LP, and use the proof to derive a query evaluation algorithm that meets the maximum output size bound!