When comparing algorithms, we usually look at their asymptotic complexity. Even Obama knows Bubble sort is no good, because it runs in O(n^2), compared to the O(n \log n) of many other algorithms like Merge sort. You might have also heard of the \Omega(n \log n) lower bound for any comparison-based sorting algorithm. So does that mean Merge sort is optimal? Well, yes and no: it is worst-case optimal in the sense that no (comparison) sort can run faster than O(n \log n) in the worst case, but certain algorithms, like Timsort, are faster in other cases. In fact, if you check out that Wikipedia page, you might be surprised to see that even Timsort is not optimal, and people have been making faster sorting algorithms as recently as 2022!
So is there really such a thing as a perfect algorithm? We say an algorithm is instance optimal if it has the best possible asymptotic complexity on every input instance, including the worst case, the best case, and any "average case". In other words, the algorithm is essentially perfect in terms of asymptotics. You don't hear about instance-optimal algorithms often, because they are exceedingly rare. In fact, the more recently popularized notion of universal optimality is weaker than instance-optimality, but has a much catchier name. But if you work in databases like I do, you are in luck, because we are blessed with perfection: there is an instance-optimal algorithm for computing nearly any SQL query you could ever hope to write, at least in theory.
As a warm-up, we'll look at the simple cases. Suppose we want to
compute SELECT * FROM R WHERE R.x < 10. We can perform a
scan over R and output any row satisfying the condition,
taking O(n) time. This is
instance-optimal (ignoring indexing), as any algorithm would have to
read all the input, which takes linear time. For joining two tables
SELECT * FROM R, S WHERE R.x = S.x, the naive nested loop
join takes O(n^2), which is worst-case
optimal but not instance optimal; if we use hash join and ignore
collisions, we'd spend linear time building a hash table over
S, then scan through R while probing into
S to produce the results. This is essentially instance
optimal: we'd take O(|IN| + |OUT|)
where |IN| is the total size of the
inputs and |OUT| is the total output
size, while any algorithm has to spend that much time to read the input
and produce the output. We can show instance optimality in a similar way
for all other common relational algebra operators, if we pretend hash
table operations take constant time.
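To make the hash join argument concrete, here is a rough Python sketch, representing relations as lists of dicts (the attribute names and toy rows are made up): building the hash table is one linear scan over S, and probing is one linear scan over R where the inner loop only does extra work for rows it actually outputs.

```python
from collections import defaultdict

def hash_join(r, s, key):
    """Join r and s on r[key] = s[key] in O(|IN| + |OUT|) time, ignoring collisions."""
    # Build phase: one linear scan over s to build a hash table on the join key.
    index = defaultdict(list)
    for s_row in s:
        index[s_row[key]].append(s_row)
    # Probe phase: one linear scan over r; the inner loop runs once per output row.
    out = []
    for r_row in r:
        for s_row in index[r_row[key]]:
            out.append({**r_row, **s_row})
    return out

# SELECT * FROM R, S WHERE R.x = S.x
R = [{"x": 1, "a": "r1"}, {"x": 2, "a": "r2"}]
S = [{"x": 1, "b": "s1"}, {"x": 1, "b": "s2"}]
print(hash_join(R, S, "x"))  # two output rows, both with x = 1
```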
Time to celebrate? Not so fast. The issue comes when we consider more
complicated queries. In particular, if we join together more than 2
tables, the intermediate result can be asymptotically larger than the
output. Consider
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y: it's
possible to find input tables such that both R
\bowtie S and S \bowtie T have
\Omega(n^2) size, but the final result
is linear (try constructing such an input yourself; one example is in
Figure 1 in our paper).
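To make this concrete, here is one such construction in Python (not necessarily the one in Figure 1; the padding attributes a and d are made up):

```python
n = 1000

# Every row of R has x = 0, so it joins with every S row that has x = 0.
R = [{"x": 0, "a": i} for i in range(n)]

# Half of S has x = 0 (blowing up R ⋈ S), the other half has y = 0
# (blowing up S ⋈ T), but no single S row has both x = 0 and y = 0.
S = [{"x": 0, "y": j + 1} for j in range(n // 2)] + \
    [{"x": k + 1, "y": 0} for k in range(n // 2)]

# Every row of T has y = 0, so it joins with every S row that has y = 0.
T = [{"y": 0, "d": i} for i in range(n)]

# |R ⋈ S| = |S ⋈ T| = n^2 / 2, while the three-way join is empty: no S row
# can match both an R row (which needs x = 0) and a T row (which needs y = 0).
# Add a few matching rows across all three tables if you want a nonempty,
# still-linear output.
```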
In this case, no matter in what order we join the tables, hash join will
have to run in \Omega(n^2). And in
fact, so will any mainstream database system, because all of them
implement some form of binary join algorithm that joins two tables
at a time. However, there's a 40-plus-year-old algorithm, discovered by
Mihalis Yannakakis in 1981, that is guaranteed to run in O(|IN| + |OUT|) for almost all multi-table
joins found in practice. We'll do a deep dive into Yannakakis' algorithm
in the rest of this post. To keep things simple, we will only be talking
about joins from now on. In particular, we'll only consider inner
equijoins (where the join conditions are always equalities).
The first thing to know about Yannakakis' algorithm is that it only
works on a class of join queries called \alpha-acyclic (or simply
acyclic). Traditionally, query acyclicity is defined using a
certain hypergraph associated with the query, which is quite
cool but also rather confusing. Instead, we'll use a more intuitive
definition using a regular graph that can be derived immediately from
the SQL query: the query graph for a join query has a node for
each relation and an edge for each equality predicate. For example, the
query graph for
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y is a
line:
R
│
S
│
T
And if we add another predicate,
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y AND T.z = R.z,
we get a triangle:
  R
 / \
S ─── T
You might be guessing that a query is acyclic if its query graph
doesn't have cycles. Almost! But there's one more wrinkle: a query can
be written in different ways to "break" cycles. Consider
SELECT * FROM A, B, C WHERE A.x = B.x AND B.x = C.x AND A.x = C.x.
Its query graph is also a triangle (which is a cycle):
  A
 / \
B ─── C
But one of the equality predicates is actually redundant: if we
already have A.x = B.x and B.x = C.x, the last
equality A.x = C.x is implied by transitivity, and we can
drop it to break the cycle:
  A
 /
B ─── C
So the full definition of acyclicity is: a query is acyclic if it can be rewritten into an equivalent query whose query graph has no cycles. Now, how do we know if a query can be rewritten to have an acyclic graph? The short answer is that we can first add all implied predicates, then try to drop sets of redundant edges until we get a tree. I've also put a slightly longer and better way to do this at the end of this post, but all we need to know for now is that there is some way to massage every acyclic query so that its query graph is a tree.
The reason we require the query to be acyclic is that we'll need the tree-shaped query graph as an execution plan to guide Yannakakis' algorithm. In the literature, people also call this a join tree, which is unfortunate because the operator tree used in regular binary join query plans is also called a join tree. The important detail to remember is that, in our join trees here, every node corresponds to a table, while in the tree-shaped binary join plan, each internal node is a binary join operator, and each leaf is a table. In any case, given any join tree for an acyclic query, Yannakakis' algorithm works in three steps:
1. Bottom-up semijoin pass: traversing the join tree from the leaves to the root, for each node R, remove from R any tuples that do not join with its children (a.k.a. semijoin-reduce R with its children).
2. Top-down semijoin pass: traversing the tree from the root to the leaves, for each node R, semijoin-reduce R with its parent.
3. Final join pass: going bottom-up again, join each node with its children to compute the output.

Let's work out an example. Suppose we have the query
SELECT *
FROM R1, R2, R3, R4
WHERE R1.x = R2.x
AND R2.y = R3.y
AND R2.y = R4.y
whose query graph is as follows:
  R1
  │
  R2
 /  \
R3    R4
We start by traversing the tree bottom-up. The leaves R3
and R4 do not have children, so we do nothing. The first
internal node we encounter is R2, whose children are
R3 and R4. To semijoin reduce R2,
we can build a hash table for each of R3 and R4,
keyed on their y attributes. Then we scan R2
while probing into the hash tables of R3 and
R4: if a row fails to find a match in either hash table, it
will be removed from R2. When we're done, every row in
R2 will join with some row in each of R3 and
R4. Next, we move up to R1 and repeat the
same: after building a hash table for the reduced R2, we
scan R1 to remove bad rows from it.
Then, we repeat similar semijoin reductions in the top-down pass:
build a hash table for (what's left of) R1 and further
reduce R2 with it, and finally reduce R3 and
R4 with the remaining R2.
In the final join pass, we go bottom-up again to compute the result: R_2' = R_2 \bowtie R_4;\, R_2'' = R_2' \bowtie R_3;\, \text{out} = R_1 \bowtie R_2''
The key to Yannakakis' algorithm is that the two semijoin passes remove all dangling tuples, which are redundant rows that do not contribute to any result in the final output. They do so in linear time, because each semijoin reduction requires building a hash table on one side (linear time ignoring collisions) and scanning the other side (also linear time). This is where the |IN| factor in the run time complexity comes from. In the final bottom-up pass, every pairwise join produces an intermediate result no larger than the final output (because all dangling tuples are gone and no useless results are produced), so overall the run time is bounded by O(|IN| + |OUT|).
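To make the three passes concrete, here is a rough Python sketch specialized to this query and join tree, again representing relations as lists of dicts (the attribute names and toy rows are made up; a real system would of course not execute row-at-a-time Python):

```python
from collections import defaultdict

def semijoin(left, right, key):
    """Keep only the rows of `left` that match some row of `right` on `key`."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] in keys]

def hash_join(left, right, key):
    """Hash join of two relations on `key` (same as the earlier sketch)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**lrow, **rrow} for lrow in left for rrow in index[lrow[key]]]

def yannakakis(r1, r2, r3, r4):
    """The three passes on the example join tree: R1 - R2 - {R3, R4}."""
    # Bottom-up semijoin pass: reduce each node with its children.
    r2 = semijoin(semijoin(r2, r3, "y"), r4, "y")
    r1 = semijoin(r1, r2, "x")
    # Top-down semijoin pass: reduce each node with its parent.
    r2 = semijoin(r2, r1, "x")
    r3 = semijoin(r3, r2, "y")
    r4 = semijoin(r4, r2, "y")
    # Final join pass, bottom-up; every intermediate result is now bounded
    # by the output size because all dangling tuples are gone.
    out = hash_join(r2, r4, "y")
    out = hash_join(out, r3, "y")
    out = hash_join(out, r1, "x")
    return out

# Toy input (the non-join attributes a, c, d are made up):
R1 = [{"x": 1, "a": 10}, {"x": 2, "a": 20}]
R2 = [{"x": 1, "y": 1}, {"x": 3, "y": 2}]
R3 = [{"y": 1, "c": 100}]
R4 = [{"y": 1, "d": 1000}, {"y": 9, "d": 9000}]
print(yannakakis(R1, R2, R3, R4))  # a single output row with x = 1, y = 1
```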
Fun fact: Yannakakis' algorithm is basically the same algorithm as the linear time arc consistency algorithm in Constraint Satisfaction, or the message-passing algorithm for belief propagation over tree-shaped networks.
But if Yannakakis' algorithm is so perfect, why haven't you heard of it? More importantly, why does no one implement it in modern DB systems? Well, that's the topic for another post. And a teaser: I will show you four different ways to improve Yannakakis' algorithm in the next post!
I promised a better way to check if a query is acyclic -- here it is.
First, we need to enhance our query graph a little bit, and annotate
each edge with the number of join predicates on that edge. For
example, the query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x
would have the following (weighted) query graph:
R
│ 2
S
│ 1
T
Note the edge between R and S has a weight
of 2 because there are 2 join predicates between them. We only count
each unique join predicate once, so if we write R.x = S.x
twice in the WHERE clause it would still contribute a
weight of 1. Next, we will add in any transitively implied predicates to
get the complete query graph. For our example query, the
predicate R.x = T.x is implied by R.x = S.x
and S.x = T.x, and after adding it, we get the query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x AND R.x = T.x
whose graph is:
    R
 2 / \ 1
  S ─── T
     1
Now, we take any maximum spanning tree of this complete query graph, for example:
    R
 2 /
  S ─── T
     1
If this tree corresponds to a query that is equivalent to the
original one, then the query is acyclic. In this example, the
corresponding query is exactly the same as the original one. In general,
if we want to check if two SQL join queries are equivalent, we can group
the joined attributes into equivalence classes, and if we end up
with the same grouping it means the queries are equivalent. For example,
looking at the original query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x:
- R.x = S.x puts R.x and S.x into the same class
- R.a = S.a puts R.a and S.a into the same class
- S.x = T.x puts T.x into the same class as S.x

So we end up with the groups [R.x, S.x, T.x] and
[R.a, S.a]. Try this with the completed query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x AND R.x = T.x
and you'll get the same grouping (ordering does not matter).
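Putting the whole check together, here is a rough Python sketch, assuming the equality predicates are given as pairs of "Relation.attribute" strings (all the helper names here are made up): it computes the equivalence classes with a union-find, builds the complete weighted query graph from them, takes a maximum spanning tree, and compares the grouping induced by the tree's predicates against the original one.

```python
from itertools import combinations

def equivalence_classes(predicates):
    """Union-find grouping of attributes induced by the equality predicates."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for lhs, rhs in predicates:
        parent[find(lhs)] = find(rhs)
    groups = {}
    for attr in list(parent):
        groups.setdefault(find(attr), set()).add(attr)
    return {frozenset(g) for g in groups.values()}

def relation(attr):
    return attr.split(".")[0]

def is_acyclic(relations, predicates):
    classes = equivalence_classes(predicates)
    # Complete query graph: the weight of edge (R, S) is the number of
    # equivalence classes containing attributes of both R and S (i.e. the
    # number of unique predicates between them after adding implied ones).
    weight = {}
    for cls in classes:
        touched = {relation(a) for a in cls}
        for r, s in combinations(sorted(touched), 2):
            weight[(r, s)] = weight.get((r, s), 0) + 1
    # Maximum spanning tree over the relations (Kruskal on descending weights).
    parent = {r: r for r in relations}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    tree_edges = []
    for (r, s), w in sorted(weight.items(), key=lambda kv: -kv[1]):
        if find(r) != find(s):
            parent[find(r)] = find(s)
            tree_edges.append((r, s))
    # Predicates of the tree query: for each tree edge and each equivalence
    # class shared by its endpoints, equate one attribute from each side.
    tree_predicates = []
    for r, s in tree_edges:
        for cls in classes:
            r_attrs = sorted(a for a in cls if relation(a) == r)
            s_attrs = sorted(a for a in cls if relation(a) == s)
            if r_attrs and s_attrs:
                tree_predicates.append((r_attrs[0], s_attrs[0]))
    # Acyclic iff the tree query is equivalent to the original one,
    # i.e. both induce the same attribute grouping.
    return equivalence_classes(tree_predicates) == classes

# The running example: acyclic.
print(is_acyclic(["R", "S", "T"],
                 [("R.x", "S.x"), ("R.a", "S.a"), ("S.x", "T.x")]))  # True

# The "triangle" whose third edge is implied by transitivity: acyclic.
print(is_acyclic(["A", "B", "C"],
                 [("A.x", "B.x"), ("B.x", "C.x"), ("A.x", "C.x")]))  # True

# A genuine cycle over three different attributes: not acyclic.
print(is_acyclic(["A", "B", "C"],
                 [("A.x", "B.x"), ("B.y", "C.y"), ("C.z", "A.z")]))  # False
```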