Based on "Database Theory in Action: Yannakakis's Algorithm", to appear at ICDT 2026.
Last time I covered one of my favorite results from database theory: Yannakakis's algorithm for acyclic joins. The algorithm is very simple, yet achieves instance-optimality on a large class of queries, meaning it's basically perfect as far as asymptotic complexity goes. So why doesn't every database implement it? The main reason is that the algorithm is actually not very fast in practice - typically 2-3x slower than a plain hash join. Let's look at an example to understand this. Consider the natural join:
Q = R_1(x_1, x_2) \bowtie R_2(x_2, x_3) \bowtie R_3(x_3, x_4) \bowtie R_4(x_4, x_5)
Here I'm using R_1(x_1, x_2) to imply the schema of R_1 is \{x_1, x_2\}. In SQL the query looks like this:
SELECT * FROM R1, R2, R3, R4
WHERE R1.x2 = R2.x2 AND R2.x3 = R3.x3
AND R3.x4 = R4.x4
This is an acyclic query, and its join tree is just a line:
R1
│
R2
│
R3
│
R4
Remember that Yannakakis's algorithm requires 2 semijoin passes to
reduce the relations, before a final pass to actually join them and
output the results. Let's follow the execution in slow motion. We'll
consider input relations such that every tuple successfully joins with
the next relation, e.g. R_1 = R_2 = R_3 = R_4
= \{(1, 1), (2, 2), \ldots, (n, n)\}. The bottom-up pass starts
with building a hash table for R_4
(let's call that T4), mapping each x_4 value to the corresponding x_5's. It looks like this on our input:
T4[1] = [1]
T4[2] = [2]
...
T4[n] = [n]
Note that each key is mapped to an array of values, because in general the same key can appear in multiple tuples, but for our input every x_4 maps to a unique x_5.
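To make this concrete, here is a runnable Python sketch (with a hypothetical small n, not from the paper) of the example input and the construction of T4:
n = 5  # hypothetical size of the example relations
R1 = R2 = R3 = R4 = [(i, i) for i in range(1, n + 1)]

# Build the hash table T4 for R4: map each x4 to the list of matching x5's.
T4 = {}
for (x4, x5) in R4:
    T4.setdefault(x4, []).append(x5)
# On this input, T4 == {1: [1], 2: [2], ..., n: [n]}.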
Now, to compute R_3 \ltimes R_4,
we'll loop over R_3 while probing into
T4, and keep the tuples if the probe succeeds:
for (x3, x4) in R3:
    if T4[x4] is not nil:
        T3[x3].push(x4)
Here I'm already applying a small optimization: instead of collecting
the semijoin result into a buffer R_3' and making another scan to build
the hash table T3, I'm building T3 directly on the fly. I'm also
abusing the notation a little bit and using
T3[x3].push(x4) to mean "if x3 is already
mapped to an array, insert x4 to it, otherwise map
x3 to a new array containing just x4".
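For comparison, here is a runnable Python sketch of the unoptimized two-step version described above, continuing from the T4 built earlier (the buffer name R3_prime is mine):
# Step 1: the semijoin R3 ⋉ R4, materialized into a temporary buffer.
R3_prime = [(x3, x4) for (x3, x4) in R3 if x4 in T4]

# Step 2: a second scan over the buffer to build the hash table T3.
T3 = {}
for (x3, x4) in R3_prime:
    T3.setdefault(x3, []).append(x4)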
On our input instance, the semijoin doesn't really do anything -
every tuple in R_3 survives, and
T3 looks exactly like T4. The same will happen
for the other two semijoins on the bottom-up pass, namely R_2 \ltimes R_3 and R_1 \ltimes R_2. In fact, the top-down pass
will also be the same. Overall, we will run 6 useless semijoins, each
requiring n probes and n inserts into a hash table. The final join
pass will take n hash lookups per join.
In total, the algorithm takes 6n + 3n =
9n lookups and 6n inserts. In
contrast, regular hash join will only require building 3 hash tables
(3n inserts) and probing into them
(3n probes).
To be fair, the example above uses the worst possible input for the algorithm - because no tuple can be removed, the semijoin passes are completely useless. This points to a natural remedy, which is exactly the idea of the first enhancement: reduce the semijoin overhead with Bloom filters.
A Bloom filter is a probabilistic data structure that approximately stores a set of elements. After the filter is built, one can ask "is this element in the set?", and get an answer of either "maybe yes" or "definitely not". If you haven't seen it before I encourage you to check out the Wikipedia page - it's actually really simple and elegant! But all you need to know now is that a Bloom filter can store a very large set very compactly: a filter for an in-memory DB table can usually fit in the CPU cache, drastically accelerating both insertions and lookups. For Yannakakis's algorithm, we can store the semijoin result in a Bloom filter instead of a hash table:
for (x3, x4) in R3:
    if x4 in R4_filter:
        R3_filter.insert(x3)
On our example, 6n of the total 9n lookups are now done on Bloom filters, and 3n of the inserts are into Bloom filters. If the filter fits in the cache, both lookups and inserts will be orders-of-magnitude faster than the corresponding operations on hash tables, which likely require memory accesses with bad locality. And in the average case, the semijoin passes will throw away bad tuples, further accelerating the final join pass.
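If you want to play with the idea, here is a minimal Bloom filter sketch in Python - a toy, hash-based stand-in for the cache-resident filters real engines use, not the implementation from any particular system:
class BloomFilter:
    def __init__(self, m=1 << 20, k=3):
        self.m, self.k = m, k              # m bits, k hash functions
        self.bits = bytearray(m // 8)      # the bit array, backed by bytes

    def _positions(self, x):
        # Derive k bit positions by seeding Python's built-in hash differently.
        return [hash((seed, x)) % self.m for seed in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, x):
        # True means "maybe in the set", False means "definitely not".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(x))

# Used exactly like the pseudocode above, with R3 and R4 from our example:
R4_filter = BloomFilter()
for (x4, x5) in R4:
    R4_filter.insert(x4)

R3_filter = BloomFilter()
for (x3, x4) in R3:
    if x4 in R4_filter:
        R3_filter.insert(x3)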
The idea of using compact filters to accelerate joins is not new: SQLite has been doing that for several years, and bitmap filters have been in SQLServer for even longer. It's just that people have only recently realized that these filters pair really well with the theoretical guarantees of Yannakakis's algorithm.
While Bloom filters accelerate Yannakakis's algorithm "horizontally"
at the level of each semijoin operation, the next idea works
"vertically" by replacing semijoins all together. The key insight is
that, in most cases, we don't actually need the millions of tuples
coming out of the join, but rather some aggregate information over them.
For example, we might want to know the best rated movie in some
category, or the average rating of some director. In SQL, this means
instead of SELECT *, we usually GROUP BY a
small number of columns and output some aggregates per group. Using our
slightly contrived example, we'll be querying:
SELECT x1, max(x5)
FROM R1, R2, R3, R4
WHERE ...
GROUP BY x1
For this query, we can squeeze all 3 passes of Yannakakis's algorithm
into a single pass! The idea is that instead of running semijoins, we
push down the aggregation to run as we build the hash table for each
relation. Starting with R4:
for (x4, x5) in R4:
    if T4[x4] is nil or T4[x4] < x5:
        T4[x4] = x5
Note that x4 is now mapped to a single value instead of
an array, because we only need to keep one max per group.
We do the same for R3:
for (x3, x4) in R3:
    x5 = T4[x4]
    if x5 is not nil
       and (T3[x3] is nil or T3[x3] < x5):
        T3[x3] = x5
And also the same for R2:
for (x2, x3) in R2:
    x5 = T3[x3]
    if x5 is not nil
       and (T2[x2] is nil or T2[x2] < x5):
        T2[x2] = x5
Finally, the same for R1 before you get bored:
for (x1, x2) in R1:
    x5 = T2[x2]
    if x5 is not nil
       and (T1[x1] is nil or T1[x1] < x5):
        T1[x1] = x5
Overall, instead of running 6 additional semijoins, we are running
just 3 steps, each performing a join and a group-by together; at the
end, the query answer is simply the contents of T1, one (x1, max(x5))
pair per group. Another
advantage of this approach is that each step can be formulated as a SQL
query, making it applicable to any database without touching the
internal engine. To compute T4, we can create a view:
CREATE VIEW T4 AS (
  SELECT x4, max(x5) AS x5
  FROM R4
  GROUP BY x4
)
Then we join T4 with R3 for
T3:
CREATE VIEW T3 AS (
  SELECT x3, max(x5) AS x5
  FROM R3, T4
  WHERE R3.x4 = T4.x4
  GROUP BY x3
)
Of course, we can't do this for every query - we have no choice but
to output all tuples for our SELECT * query. In general,
this method requires the query to be relation-dominated, which
basically says that the GROUP BY attributes must all come from the
same relation (our GROUP BY x1 qualifies, since x1 comes from R_1
alone; grouping by, say, x1 and x5 together would not). Another way
to understand this approach is to
think of aggregate pushdown as a way of "creating" new primary keys -
after an aggregation, the GROUP BY attributes automatically
become a primary key; and since joins on primary keys will always be
O(n), the entire algorithm runs in
linear time.
While the GROUP BY trick only applies when there are
enough aggregates, we will see a more general way to "squeeze together"
the passes of Yannakakis's algorithm. But this method is better
understood as an optimization of the classic hash join, so we will tweak
our example a little. We'll use the same query, but now consider the
input:
\begin{aligned}
R_1 &= \{(2, 0), (4, 0), \ldots, (2n, 0)\} \\
R_2 &= \{(0, 2), (0, 4), \ldots, (0, 2n)\} \\
R_3 &= \{(1, 0), (3, 0), \ldots, (2n-1, 0)\} \\
R_4 &= \{(0, 1), (0, 3), \ldots, (0, 2n-1)\}
\end{aligned}
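In Python terms (with a hypothetical concrete n), the same relations can be written as:
n = 1000  # hypothetical size
R1 = [(2 * i, 0) for i in range(1, n + 1)]      # (2, 0), (4, 0), ..., (2n, 0)
R2 = [(0, 2 * i) for i in range(1, n + 1)]      # (0, 2), (0, 4), ..., (0, 2n)
R3 = [(2 * i - 1, 0) for i in range(1, n + 1)]  # (1, 0), (3, 0), ..., (2n-1, 0)
R4 = [(0, 2 * i - 1) for i in range(1, n + 1)]  # (0, 1), (0, 3), ..., (0, 2n-1)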
On these relations, both R_1 \bowtie
R_2 and R_3 \bowtie R_4 will
produce n^2 tuples, but the final
result is empty. Now suppose we join with the plan (R_1 \bowtie (R_2 \bowtie (R_3 \bowtie
R_4))), meaning to first join R_3 with R_4, then join the result with R_2, and finally with R_1. We will spend O(n^2) time computing R_3 \bowtie R_4, only to realize nothing
joins with R_2 and throw it all away.
Let's look at the code:
for (x3, x4) in R3:
    if T4[x4] is not nil:
        for x5 in T4[x4]:
            T3[x3].insert((x4, x5))

for (x2, x3) in R2:
    if T3[x3] is not nil:
        # the code below will never run
        ...
Our mistake is that we are "expanding" the matching results from
T4[x4] into T3 too early: every single tuple
in R_3 shares x_4 = 0, so the entire R_4 will match and is duplicated for every
entry in T3, causing the quadratic blowup. The key idea of
our 3rd enhancement is to delay the expansion of matching results, and
instead only store a pointer to the matches:
for (x3, x4) in R3:
    if T4[x4] is not nil:
        T3[x3].insert(&T4[x4])
Now we no longer have 2 nested loops for the first join, and this join is guaranteed to run in linear time. We will repeat the same pattern for joining with R_2 and R_1, and eventually we will end up with a nested representation of the join results. For this example, we can stop right after the join with R_2 because the result is empty; in general, we will need to unnest by following the pointers in a top-down pass. This still requires 2 passes - a join pass and an unnest pass - but the unnest pass no longer touches any hash tables.
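To make the nested representation concrete, here is a runnable Python sketch on the relations above, where ordinary references to buckets play the role of pointers; I also keep the join key next to each reference (a detail glossed over in the pseudocode) so that the unnest pass can print full tuples:
# Right-deep build with delayed expansion: each bucket entry stores the join key
# plus a reference to the matching bucket one level down, never a copy of it.
T4 = {}
for (x4, x5) in R4:
    T4.setdefault(x4, []).append(x5)

T3 = {}
for (x3, x4) in R3:
    if x4 in T4:
        T3.setdefault(x3, []).append((x4, T4[x4]))   # reference, not expansion

T2 = {}
for (x2, x3) in R2:
    if x3 in T3:
        T2.setdefault(x2, []).append((x3, T3[x3]))

nested = []   # the nested join result
for (x1, x2) in R1:
    if x2 in T2:
        nested.append((x1, x2, T2[x2]))

# Unnest pass: follow the references top-down; no hash table is probed here.
for (x1, x2, bucket2) in nested:
    for (x3, bucket3) in bucket2:
        for (x4, bucket4) in bucket3:
            for x5 in bucket4:
                print(x1, x2, x3, x4, x5)
On our bad input, T2 (and hence nested) comes out empty, so the unnest pass does nothing and the whole thing runs in linear time.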
The plan (R_1 \bowtie (R_2 \bowtie (R_3 \bowtie R_4))) we considered in the previous section is known as a right-deep plan, because if we draw out the operator tree, it goes down to the right. Most databases, however, tend to favor left-deep plans, because such a plan allows for a pipelined implementation without materializing intermediate results. For our example query, consider the left-deep plan (((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4), which means we first join R_1 with R_2, then join the result with R_3, and finally with R_4. The execution engine will first build a hash table each for R_2, R_3, R_4, then iterate over them with nested loops:
for (x1, x2) in R1:
    for x3 in T2[x2]:
        for x4 in T3[x3]:
            for x5 in T4[x4]:
                print(x1, x2, x3, x4, x5)
Now the large intermediate results are implicitly hidden in the loop nests. Because there's nothing to materialize, we can't really create any nested representation either. Instead, we will "piggyback" semijoins on the loop nest:
for (x1, x2) in R1:
    if T2[x2] is nil:
        R1.del((x1, x2))
        continue
    for x3 in T2[x2]:
        if T3[x3] is nil:
            T2.del((x2, x3))
            continue
        for x4 in T3[x3]:
            if T4[x4] is nil:
                T3.del((x3, x4))
                continue
            for x5 in T4[x4]:
                print(x1, x2, x3, x4, x5)
Here we added code to check if a hash probe will fail, and if so delete the offending tuple from the iterating relation and move on to the next tuple. Deletion is safe, because we know any future probes with the same tuple are doomed to fail again. This way we are essentially running the semijoins on the fly and on demand; the deletions are also fast thanks to good temporal locality, and most hash tables implement deletion by setting a single "tombstone" bit to indicate the entry is invalidated. It is easy to see why the algorithm runs in linear time: every time a tuple is visited, it either eventually contributes to an output result, or gets deleted; overall this leads to O(|\textsf{IN}|+|\textsf{OUT}|) complexity.
I skipped over many details above. For example, the predicate transfer work implements some pretty neat SIMD acceleration for Bloom filters; the aggregate pushdown paper precisely defines when the optimization applies; our nested Yannakakis paper defines an entire new nested relational algebra and describes a vectorized implementation; and for the last approach (which we call TreeTracker Join), our paper introduces an additional twist to jump over loops when a variable used in a hash probe was introduced in an outer loop. If this post whets your appetite, you should check out our very short survey paper for pointers to the literature and start digging from there. There's also an exciting new paper, to appear at CIDR 2026, on how SQLServer "accidentally" implemented Yannakakis's algorithm - it hadn't come out yet when we wrote our survey. I hope you now see that optimal join algorithms are not that mysterious after all, and that you should demand your DB vendor start considering them! Or better yet, if you are a DB maintainer, you should consider implementing one of the many approaches here, and talk to us researchers so we can better help you.