To compute R \bowtie_{x} S, first choose one of R or S to build hash on. Suppose we build on S:
First loop over S to build hash table:
= HashTable::new()
h
for s in S:
if h.has_key(s.x):
h[x].push(s)else:
= Vec::new()
h[x] h[x].push(s)
Then, loop over R while probing into S to find matches:
for r in R:
if h.has_key(r.x):
for s in h[x]:
print(r.concat(s))
Try generalize the code above to the case where we join on multiple attributes.
Let us first review the merge sort algorithm. It is a divide-and-conquer algorithm that sorts an array by recursively splitting it into halves, sorting each half, and then merging the sorted halves back together.
def merge_sort(A):
if len(A) <= 1:
return A
= len(A) // 2
mid = merge_sort(A[:mid])
left = merge_sort(A[mid:])
right return merge(left, right)
The merge function takes two sorted arrays and merges them into a single sorted array.
def merge(A, B):
= []
output = 0
i = 0
j
while i < len(A) and j < len(B):
if A[i] < B[j]:
output.append(A[i])+= 1
i else:
output.append(B[j])+= 1
j
while i < len(A):
output.append(A[i])+= 1
i while j < len(B):
output.append(B[j])+= 1
j return output
First consider a simple case where we join two relations R and S, each with a single attribute x. In this case, we can sort both relations on x and then merge them to find matches.
def merge_join(R, S):
= merge_sort(R)
R_sorted = merge_sort(S)
S_sorted
return merge(R_sorted, S_sorted)
The general case is similar, we just need to retrieve the join attributes and carefully handle cartesian products.
def merge_join(R, S, join_attrs):
= merge_sort(R, join_attrs)
R_sorted = merge_sort(S, join_attrs)
S_sorted
= []
output = 0
i = 0
j
while i < len(R_sorted) and j < len(S_sorted):
if R_sorted[i][join_attrs] < S_sorted[j][join_attrs]:
+= 1
i elif R_sorted[i][join_attrs] > S_sorted[j][join_attrs]:
+= 1
j else:
# Found a match, take the cartesian product
= i
i_start = j
j_start = ...
i_end = ...
j_end for k in range(i_start, i_end):
for l in range(j_start, j_end):
output.append(R_sorted[k].concat(S_sorted[l]))return output
To join many relations, we can join a pair of relations at a time: ((((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4) \bowtie \cdots). The order of the joins matters, because the total run time is proportional to the total size of the intermediate results. Example: suppose R_1 \bowtie R_2 is empty, but |R_2 \bowtie R_3| = O(N^2). Then joining R_1 and R_2 first will result in a total run time of O(N) (why is it not instant?), while joining R_2 and R_3 first will result in a total run time of O(N^2).
But there are also examples where any join order will be suboptimal. Example in class.
Luckily, there is a (very old) algorithm, called Yannakakis algorithm, that always run in time O(|IN| + |OUT|), for a class of queries called acyclic queries.
The predicate graph of a query has a vertex for each relation, and an edge for each join predicate. A query is acyclic if it can be rewritten into another query with an acyclic predicate graph. But how do we know (efficiently) if it is possible to rewrite a query into an acyclic one? We’ll come back to this later.
Assume a query is acyclic, and we have rewritten it into one with an acyclic predicate graph. Then the Yannakakis algorithm works as follows:
The key property is that, after the two semijoin passes, every tuple that remains in any relation must contribute to an output.
This algorithm runs in time O(|IN| + |OUT|), because each semijoin operation takes O(|IN|) time, and the final hash join takes O(|OUT|) time.