Acknowledgements All ideas in this post are due to @mabokhamis and @hung-q-ngo; I’m just writing them down so that I don’t have to bother them when I forget.
In an earlier post I talked about how to use the AGM bound to bound the run time of the Generic Join algorithm. It turns out we can sometimes bound the run time of recursive queries as well. Consider the transitive closure program in Datalog:
P(x, y) :- E(x, y).
P(x, y) :- E(x, z), P(z, y).
During semi-naive evaluation, we will compute a delta relation every iteration as . The set difference can be done in linear time, so we will focus on the join only.
If we look at all iterations, we’ll be computing
.
Factoring out E
, we get
,
where
is the relation
at iteration
.
Since
must be contained in the final output
,
i.e. ,
at this point we can say the whole semi-naive algorithm runs in
.
But turns out we can do better.
To reduce clutter, I’ll write
for
.
Now take a closer look at the join
Q(x, y, z) :- E(x, z), P(z, y)
. For the moment, let’s also
add
into the join to make a triangle
Q'(x, y, z) :- E(x, z), P(z, y), O(x, y)
. With this, we can
use the AGM bound to get
,
which is a tighter bound than the above. I now claim we can also use
this bound for
.
The key is that the execution of Generic Join for
is exactly the same as that for
.
Consider the variable ordering z, x, y
. The GJ loop for
is the following:
for z in E.z ^ P.z
for x in E[z].x ^ O.x
for y in P[z].y ^ O[x].y
output(x, y, z)
Since is the final output of the transitive closure program, we have the invariant . Therefore, we can remove the intersections with on both inner loops, and the run time would remain the same since does not filter out any value. With removed, the nested loop now computes exactly , taking the same time as .