When comparing algorithms, we usually look at their asymptotic complexity. Even Obama knows Bubble sort is no good, because it runs in O(n^2), compared to the O(n \log n) of many other algorithms like Merge sort. You might have also heard of the \Omega(n \log n) lower bound for any comparison-based sorting algorithm. So does that mean Merge sort is optimal? Well, yes and no: it is worst-case optimal in the sense that no (comparison) sort can run faster than O(n \log n) in the worst case, but certain algorithms, like Timsort, are faster in other cases. In fact, if you check out that Wikipedia page, you might be surprised to see that even Timsort is not optimal, and people have been making faster sorting algorithms as recently as 2022!
So is there really such a thing as a perfect algorithm? We say an algorithm is instance optimal if it has the best possible asymptotic complexity on every input instance, including the worst case, the best case, and any "average case". In other words, the algorithm is essentially perfect in terms of asymptotics. You don't hear about instance-optimal algorithms often, because they are exceedingly rare. In fact, the more recently popularized notion of universal optimality is weaker than instance-optimality, but has a much catchier name. But if you work in databases like I do, you are in luck, because we are blessed with perfection: there is an instance-optimal algorithm for computing nearly any SQL query you could ever hope to write, at least in theory.
As a warm-up, we'll look at the simple cases. Suppose we want to
compute SELECT * FROM R WHERE R.x < 10. We can perform a
scan over R and output any row satisfying the condition,
taking O(n) time. This is
instance-optimal (ignoring indexing), as any algorithm would have to
read all the input, which takes linear time. For joining two tables
SELECT * FROM R, S WHERE R.x = S.x, the naive nested loop
join takes O(n^2), which is worst-case
optimal but not instance optimal; if we use hash join and ignore
collisions, we'd spend linear time building a hash table over
S, then scan through R while probing into
S to produce the results. This is essentially instance
optimal: we'd take O(|IN| + |OUT|)
where |IN| is the total size of the
inputs and |OUT| is the total output
size, while any algorithm has to spend that much time to read the input
and produce the output. We can show instance optimality in a similar way
for all other common relational algebra operators, if we pretend hash
table operations take constant time.
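To make the hash join argument concrete, here is a rough Python sketch, representing relations as lists of dicts (the attribute names and toy rows are made up): building the hash table is one linear scan over S, and probing is one linear scan over R where the inner loop only does extra work for rows it actually outputs.

```python
from collections import defaultdict

def hash_join(r, s, key):
    """Join r and s on r[key] = s[key] in O(|IN| + |OUT|) time, ignoring collisions."""
    # Build phase: one linear scan over s to build a hash table on the join key.
    index = defaultdict(list)
    for s_row in s:
        index[s_row[key]].append(s_row)
    # Probe phase: one linear scan over r; the inner loop runs once per output row.
    out = []
    for r_row in r:
        for s_row in index[r_row[key]]:
            out.append({**r_row, **s_row})
    return out

# SELECT * FROM R, S WHERE R.x = S.x
R = [{"x": 1, "a": "r1"}, {"x": 2, "a": "r2"}]
S = [{"x": 1, "b": "s1"}, {"x": 1, "b": "s2"}]
print(hash_join(R, S, "x"))  # two output rows, both with x = 1
```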
Time to celebrate? Not so fast. The issue comes when we consider more
complicated queries. In particular, if we join together more than 2
tables, the intermediate result can be asymptotically larger than the
output. Consider
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y: it's
possible to find input tables such that both R
\bowtie S and S \bowtie T have
\Omega(n^2) size, but the final result
is linear (try constructing such an input yourself; one example is in
Figure 1 in our paper).
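To make this concrete, here is one such construction in Python (not necessarily the one in Figure 1; the padding attributes a and d are made up):

```python
n = 1000

# Every row of R has x = 0, so it joins with every S row that has x = 0.
R = [{"x": 0, "a": i} for i in range(n)]

# Half of S has x = 0 (blowing up R ⋈ S), the other half has y = 0
# (blowing up S ⋈ T), but no single S row has both x = 0 and y = 0.
S = [{"x": 0, "y": j + 1} for j in range(n // 2)] + \
    [{"x": k + 1, "y": 0} for k in range(n // 2)]

# Every row of T has y = 0, so it joins with every S row that has y = 0.
T = [{"y": 0, "d": i} for i in range(n)]

# |R ⋈ S| = |S ⋈ T| = n^2 / 2, while the three-way join is empty: no S row
# can match both an R row (which needs x = 0) and a T row (which needs y = 0).
# Add a few matching rows across all three tables if you want a nonempty,
# still-linear output.
```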
In this case, no matter in what order we join the tables, hash join will
have to run in \Omega(n^2). And in
fact, so will any mainstream database system, because all of them
implement some form of binary join algorithm that joins two tables
at a time. However, there's a 40-plus-year-old algorithm, discovered by
Mihalis Yannakakis in 1981, that is guaranteed to run in O(|IN| + |OUT|) for almost all multi-table
joins found in practice. We'll do a deep dive into Yannakakis' algorithm
in the rest of this post. To keep things simple, we will only be talking
about joins from now on. In particular, we'll only consider inner
equijoins (where the join conditions are always equalities).
The first thing to know about Yannakakis' algorithm is that it only
works on a class of join queries called \alpha-acyclic (or simply
acyclic). Traditionally, query acyclicity is defined using a
certain hypergraph associated with the query, which is quite
cool but also rather confusing. Instead, we'll use a more intuitive
definition using a regular graph that can be derived immediately from
the SQL query: the query graph for a join query has a node for
each relation and an edge for each equality predicate. For example, the
query graph for
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y is a
line:
R
│
S
│
T
And if we add another predicate,
SELECT * FROM R, S, T WHERE R.x = S.x AND S.y = T.y AND T.z = R.z,
we get a triangle:
  R
 / \
S ─── T
You might be guessing that a query is acyclic if its query graph
doesn't have cycles. Almost! But there's one more wrinkle: a query can
be written in different ways to "break" cycles. Consider
SELECT * FROM A, B, C WHERE A.x = B.x AND B.x = C.x AND A.x = C.x.
Its query graph is also a triangle (which is a cycle):
  A
 / \
B ─── C
But one of the equality predicates is actually redundant: if we
already have A.x = B.x and B.x = C.x, the last
equality A.x = C.x is implied by transitivity, and we can
drop it to break the cycle:
  A
 /
B ─── C
So the full definition of acyclicity is: a query is acyclic if it can be rewritten into an equivalent query whose query graph has no cycles. Now, how do we know if a query can be rewritten to have an acyclic graph? The short answer is that we can first add all implied predicates, then try to drop sets of redundant edges until we get a tree. I've also put a slightly longer and better way to do this at the end of this post, but all we need to know for now is that there is some way to massage every acyclic query so that its query graph is a tree.
The reason we require the query to be acyclic is that we'll need the tree-shaped query graph as an execution plan to guide Yannakakis' algorithm. In the literature, people also call this a join tree, which is unfortunate because the operator tree used in regular binary join query plans is also called a join tree. The important detail to remember is that, in our join trees here, every node corresponds to a table, while in the tree-shaped binary join plan, each internal node is a binary join operator, and each leaf is a table. In any case, given any join tree for an acyclic query, Yannakakis' algorithm works in three steps:
1. Bottom-up semijoin pass: traversing the join tree from the leaves to the root, for each node R, remove from R any tuples that do not join with its children (a.k.a. semijoin-reduce R with its children).
2. Top-down semijoin pass: traversing the tree from the root to the leaves, for each node R, semijoin-reduce R with its parent.
3. Final join pass: going bottom-up again, join each node with its children to compute the output.

Let's work out an example. Suppose we have the query
SELECT *
FROM R1, R2, R3, R4
WHERE R1.x = R2.x
AND R2.y = R3.y
AND R2.y = R4.y
whose query graph is as follows:
  R1
  │
  R2
 /  \
R3    R4
We start by traversing the tree bottom-up. The leaves R3
and R4 do not have children, so we do nothing. The first
internal node we encounter is R2, whose children are
R3 and R4. To semijoin reduce R2,
we can build a hash table for each of R3 and R4,
keyed on their y attributes. Then we scan R2
while probing into the hash tables of R3 and
R4: if a row fails to find a match in either hash table, it
will be removed from R2. When we're done, every row in
R2 will join with some row in each of R3 and
R4. Next, we move up to R1 and repeat the
same: after building a hash table for the reduced R2, we
scan R1 to remove bad rows from it.
Then, we repeat similar semijoin reductions in the top-down pass:
build a hash table for (what's left of) R1 and further
reduce R2 with it, and finally reduce R3 and
R4 with the remaining R2.
In the final join pass, we go bottom-up again to compute the result: R_2' = R_2 \bowtie R_4;\, R_2'' = R_2' \bowtie R_3;\, \text{out} = R_1 \bowtie R_2''
The key to Yannakakis' algorithm is that the two semijoin passes remove all dangling tuples, which are redundant rows that do not contribute to any result in the final output. They do so in linear time, because each semijoin reduction requires building a hash table on one side (linear time ignoring collisions) and scanning the other side (also linear time). This is where the |IN| factor in the run time complexity comes from. In the final bottom-up pass, every pairwise join produces an intermediate result no larger than the final output (because all dangling tuples are gone and no useless results are produced), so overall the run time is bounded by O(|IN| + |OUT|).
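To make the three passes concrete, here is a rough Python sketch specialized to this query and join tree, again representing relations as lists of dicts (the attribute names and toy rows are made up; a real system would of course not execute row-at-a-time Python):

```python
from collections import defaultdict

def semijoin(left, right, key):
    """Keep only the rows of `left` that match some row of `right` on `key`."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] in keys]

def hash_join(left, right, key):
    """Hash join of two relations on `key` (same as the earlier sketch)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**lrow, **rrow} for lrow in left for rrow in index[lrow[key]]]

def yannakakis(r1, r2, r3, r4):
    """The three passes on the example join tree: R1 - R2 - {R3, R4}."""
    # Bottom-up semijoin pass: reduce each node with its children.
    r2 = semijoin(semijoin(r2, r3, "y"), r4, "y")
    r1 = semijoin(r1, r2, "x")
    # Top-down semijoin pass: reduce each node with its parent.
    r2 = semijoin(r2, r1, "x")
    r3 = semijoin(r3, r2, "y")
    r4 = semijoin(r4, r2, "y")
    # Final join pass, bottom-up; every intermediate result is now bounded
    # by the output size because all dangling tuples are gone.
    out = hash_join(r2, r4, "y")
    out = hash_join(out, r3, "y")
    out = hash_join(out, r1, "x")
    return out

# Toy input (the non-join attributes a, c, d are made up):
R1 = [{"x": 1, "a": 10}, {"x": 2, "a": 20}]
R2 = [{"x": 1, "y": 1}, {"x": 3, "y": 2}]
R3 = [{"y": 1, "c": 100}]
R4 = [{"y": 1, "d": 1000}, {"y": 9, "d": 9000}]
print(yannakakis(R1, R2, R3, R4))  # a single output row with x = 1, y = 1
```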
Fun fact: Yannakakis' algorithm is basically the same algorithm as the linear time arc consistency algorithm in Constraint Satisfaction, or the message-passing algorithm for belief propagation over tree-shaped networks.
But if Yannakakis' algorithm is so perfect, why haven't you heard of it? More importantly, why does no one implement it in modern DB systems? Well, that's the topic for another post. And a teaser: I will show you four different ways to improve Yannakakis' algorithm in the next post!
I promised a better way to check if a query is acyclic -- here it is.
First, we need to enhance our query graph a little bit, and annotate
each edge with the number of join predicates on that edge. For
example, the query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x
would have the following (weighted) query graph:
R
│ 2
S
│ 1
T
Note the edge between R and S has a weight
of 2 because there are 2 join predicates between them. We only count
each unique join predicate once, so if we write R.x = S.x
twice in the WHERE clause it would still contribute a
weight of 1. Next, we will add in any transitively implied predicates to
get the complete query graph. For our example query, the
predicate R.x = T.x is implied by R.x = S.x
and S.x = T.x, and after adding it, we get the query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x AND R.x = T.x
whose graph is:
    R
 2 / \ 1
  S ─── T
     1
Now, we take any maximum spanning tree of this complete query graph, for example:
    R
 2 /
  S ─── T
     1
If this tree corresponds to a query that is equivalent to the
original one, then the query is acyclic. In this example, the
corresponding query is exactly the same as the original one. In general,
if we want to check if two SQL join queries are equivalent, we can group
the joined attributes into equivalence classes, and if we end up
with the same grouping it means the queries are equivalent. For example,
looking at the original query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x:
- R.x = S.x puts R.x and S.x into the same class
- R.a = S.a puts R.a and S.a into the same class
- S.x = T.x puts T.x into the same class as S.x

So we end up with the groups [R.x, S.x, T.x] and
[R.a, S.a]. Try this with the completed query
SELECT * FROM R, S, T WHERE R.x = S.x AND R.a = S.a AND S.x = T.x AND R.x = T.x
and you'll get the same grouping (ordering does not matter).
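Putting the whole check together, here is a rough Python sketch, assuming the equality predicates are given as pairs of "Relation.attribute" strings (all the helper names here are made up): it computes the equivalence classes with a union-find, builds the complete weighted query graph from them, takes a maximum spanning tree, and compares the grouping induced by the tree's predicates against the original one.

```python
from itertools import combinations

def equivalence_classes(predicates):
    """Union-find grouping of attributes induced by the equality predicates."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for lhs, rhs in predicates:
        parent[find(lhs)] = find(rhs)
    groups = {}
    for attr in list(parent):
        groups.setdefault(find(attr), set()).add(attr)
    return {frozenset(g) for g in groups.values()}

def relation(attr):
    return attr.split(".")[0]

def is_acyclic(relations, predicates):
    classes = equivalence_classes(predicates)
    # Complete query graph: the weight of edge (R, S) is the number of
    # equivalence classes containing attributes of both R and S (i.e. the
    # number of unique predicates between them after adding implied ones).
    weight = {}
    for cls in classes:
        touched = {relation(a) for a in cls}
        for r, s in combinations(sorted(touched), 2):
            weight[(r, s)] = weight.get((r, s), 0) + 1
    # Maximum spanning tree over the relations (Kruskal on descending weights).
    parent = {r: r for r in relations}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    tree_edges = []
    for (r, s), w in sorted(weight.items(), key=lambda kv: -kv[1]):
        if find(r) != find(s):
            parent[find(r)] = find(s)
            tree_edges.append((r, s))
    # Predicates of the tree query: for each tree edge and each equivalence
    # class shared by its endpoints, equate one attribute from each side.
    tree_predicates = []
    for r, s in tree_edges:
        for cls in classes:
            r_attrs = sorted(a for a in cls if relation(a) == r)
            s_attrs = sorted(a for a in cls if relation(a) == s)
            if r_attrs and s_attrs:
                tree_predicates.append((r_attrs[0], s_attrs[0]))
    # Acyclic iff the tree query is equivalent to the original one,
    # i.e. both induce the same attribute grouping.
    return equivalence_classes(tree_predicates) == classes

# The running example: acyclic.
print(is_acyclic(["R", "S", "T"],
                 [("R.x", "S.x"), ("R.a", "S.a"), ("S.x", "T.x")]))  # True

# The "triangle" whose third edge is implied by transitivity: acyclic.
print(is_acyclic(["A", "B", "C"],
                 [("A.x", "B.x"), ("B.x", "C.x"), ("A.x", "C.x")]))  # True

# A genuine cycle over three different attributes: not acyclic.
print(is_acyclic(["A", "B", "C"],
                 [("A.x", "B.x"), ("B.y", "C.y"), ("C.z", "A.z")]))  # False
```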