SQL is the language for databases. A SQL program is also called a query. Each query takes one or more tables as input, and produces one table as output. A table is a collection of rows, where each row has the same number of columns. Each column has a name and a type, and stores the same type of data.
| name | age | weight |
|---|---|---|
| Casa | 9 | 12 |
| Kira | 7 | 10 |
| Toby | 17 | 40 |
| Maya | 10 | 34 |
Here’s the simplest SQL query:
SELECT name
FROM petsIt returns a table containing a single column, namely
name.
We can also return multiple columns, or evaluate expressions over them:
SELECT name, weight / (age + 5)
FROM petsConditions in the WHERE clause can be used to filter out
rows:
SELECT name
FROM pets
WHERE age < 10The query can be understood as a for-loop:
for (name, age, weight) in pets:
if age < 10:
print(name)An aggregate function takes in a whole column and returns a single value:
-- also try MAX, SUM, or COUNT
SELECT MIN(age)
FROM petsAggregate functions can also be applied by group:
| name | age | weight | kind |
|---|---|---|---|
| Casa | 9 | 12 | cat |
| Kira | 7 | 10 | cat |
| Toby | 17 | 40 | dog |
| Maya | 10 | 34 | dog |
SELECT kind, max(age)
FROM pets
GROUP BY kindThe rows will get collected into different groups, one per different
value of kind. Here we will get two groups:
| name | age | weight | kind |
|---|---|---|---|
| Casa | 9 | 12 | cat |
| Kira | 7 | 10 | cat |
| name | age | weight | kind |
|---|---|---|---|
| Toby | 17 | 40 | dog |
| Maya | 10 | 34 | dog |
Then, the aggregate function is applied to each group, reducing it to one row each:
| kind | max(age) |
|---|---|
| cat | 12 |
| dog | 17 |
Note that every column in the GROUP BY clause
must also appear in the SELECT clause, and any
other column in the SELECT clause must be
aggregated. The following query is invalid because weight
neither appears in GROUP BY nor is aggregated.
SELECT kind, weight, max(age)
FROM pets
GROUP BY kindA more formal way to represent database queries is relational algebra. Relational algebra consists of a set of relation (a.k.a. table) operators. Each operator takes one or more tables as input, and produces an output table.
The selection operator \sigma_p(t) filters table t by condition p, e.g. \sigma_{age < 10}(\text{pets}) corresponds to:
SELECT *
FROM pets
WHERE age < 10Here SELECT * means “output all columns”.
The projection operator \pi_e(t) evaluates (“maps”) an expression e over each row of t. For example, \pi_{\text{age} + 5, \text{weight} * 2}(t) corresponds to:
SELECT age + 5, weight * 2
FROM petsThe aggregation operator \gamma_{x, y, F(z)} groups a table by columns x, y and aggregates each group by function F. \gamma_{\text{age}, \max(\text{weight})}(\text{pets}) corresponds to:
SELECT age, MAX(weight)
FROM pets
GROUP BY ageNote that grouping and aggregation is done in one operator, because grouping alone does not produce a valid table (it produces a “mini table” per group).
Here are the relevant operators so far, and which part of a SQL query
they correspond to. Confusingly, although \sigma is called the selection
operator, it actually corresponds to the WHERE clause,
whereas the projection \pi
corresponds to the SELECT clause.1
| Name | Notation | Meaning | SQL |
|---|---|---|---|
| selection | \sigma_p(t) | filter by condition p | WHERE |
| projection | \pi_{e(x, y)}(t) | map an expression e over x, y | SELECT |
| aggregation | \gamma_{x, F(y)}(t) | group by x, aggregate over y using F | GROUP BY & SELECT |
So far we’ve only been using one table in the FROM
clause. We can also supply multiple tables. Suppose we break down our
pets table into separate ones linked by ID:
| ID | name |
|---|---|
| 1 | Casa |
| 2 | Kira |
| ID | age |
|---|---|
| 1 | 9 |
| 2 | 7 |
To query across tables, we need to join them:
SELECT name, age
FROM Pname, Page
WHERE Pname.ID = Page.IDNote that we can just use name and age in
the output because it is unambiguous which table each column comes from,
but in the WHERE clause we have to use
Pname.ID and Page.ID to specify which
ID columns we are comparing.
Let’s break down the query to understand what’s going on. First, if
we drop the WHERE clause and SELECT all
columns:
SELECT *
FROM Pname, PageThis takes the Cartesian product of the tables. Namely it produces a table containing all possible pairings of the rows:
| Pname.ID | name | Page.ID | age |
|---|---|---|---|
| 1 | Casa | 1 | 9 |
| 1 | Casa | 2 | 7 |
| 2 | Kira | 1 | 9 |
| 2 | Kira | 2 | 7 |
Cartesian product has its own relational algebra operator \times; the current query computes exactly \text{Pname} \times \text{Page}.
SELECT *
FROM Pname, Page
WHERE Pname.ID = Page.IDAdding back the WHERE clause, we filter down the
Cartesian product to keep only the rows where the IDs match:
| Pname.ID | name | Page.ID | age |
|---|---|---|---|
| 1 | Casa | 1 | 9 |
| 2 | Kira | 2 | 7 |
In relational algebra: \sigma_{\text{Pname}.\text{ID} = \text{Page}.\text{ID}}(\text{Pname} \times \text{Page})
SELECT name, age
FROM Pname, Page
WHERE Pname.ID = Page.IDFinally, we keep only the name and age
columns in the output:
| name | age |
|---|---|
| Casa | 9 |
| Kira | 7 |
In relational algebra: \pi_{\text{name}, \text{age}}(\sigma_{\text{Pname}.\text{ID} = \text{Page}.\text{ID}}(\text{Pname} \times \text{Page}))
Joins can be understood as running nested loops over the tables:
for Pname.ID, name in Pname:
for Page.ID, age in Page:
if Pname.ID == Page.ID:
print(name, age)If the column being joined on is a unique identifier, you can also think of the join as looping over one table while looking up from the other:
for ID, name in Pname:
age = Page[ID]
print(name, age)So far we’ve been implementing our computation with single queries. Sometimes it’s useful, or necessary, to compose multiple subqueries into a large one.
Suppose we add an owner column to our pets
table. The following query finds people with either cats or dogs:
SELECT owner
FROM pets
WHERE pets.kind = 'cat' OR pets.kind = 'dog'But the following query does not find people with both cats and dogs (why?):
SELECT owner
FROM pets
WHERE pets.kind = 'cat' AND pets.kind = 'dog'Instead, we can INTERSECT two subqueries:
SELECT owner FROM pets WHERE kind = 'cat'
INTERSECT
SELECT owner FROM pets WHERE kind = 'dog'Similarly, we can also take the union/difference of two tables:
-- also try EXCEPT
SELECT owner FROM pets WHERE kind = 'cat'
UNION
SELECT owner FROM pets WHERE kind = 'dog'When working with subqueries, it can be helpful to store the query result in variables. There are several different ways to do that in SQL.
First, we can just create a new table storing the query result:
CREATE TABLE catpeople AS
SELECT owner FROM pets WHERE kind = 'cat';
CREATE TABLE dogpeople AS
SELECT owner FROM pets WHERE kind = 'dog';
SELECT * FROM catpeople
INTERSECT
SELECT * FROM dogpeopleThis stores the query result in “global variables” that are also accessible to other queries.
To store the result in a local variable accessible to a single query
only, use WITH:
WITH catpeople AS (
SELECT owner FROM pets WHERE kind = 'cat'
), dogpeople AS (
SELECT owner FROM pets WHERE kind = 'dog'
)
SELECT * FROM catpeople
INTERSECT
SELECT * FROM dogpeopleNote there are no semicolons after the WITH clauses
because the above is considered one single query.
Sometimes we want to keep data stored in variables automatically updated as the underlying table changes. For this we can create views:
CREATE VIEW catpeople AS
SELECT owner FROM pets WHERE kind = 'cat';
CREATE VIEW dogpeople AS
SELECT owner FROM pets WHERE kind = 'dog';Below, Lisa will appear in the result when the query runs the second
time after the insertions, even though we didn’t explicitly update
catpeople and dogpeople.
SELECT * FROM catpeople
INTERSECT
SELECT * FROM dogpeople;
INSERT INTO pets values ('Carlo', 5, 20, 'dog', 'Lisa');
INSERT INTO pets values ('Zoe', 3, 8, 'cat', 'Lisa');
SELECT * FROM catpeople
INTERSECT
SELECT * FROM dogpeople;Subqueries are very helpful for complex queries. Suppose we want to
find pet kinds with average age > 10. The following query is wrong
because aggregate cannot appear in the WHERE clause:2
SELECT kind, AVG(age) FROM pets
WHERE AVG(age) > 10
GROUP BY kindInstead, we can first calculate the average age per kind:
CREATE TABLE averages AS
SELECT kind, AVG(age) AS a
FROM pets
GROUP BY kind;Then we filter that table by the average age:
SELECT *
FROM averages
WHERE averages.a > 10.0;Or we can use the WITH syntax:
WITH averages AS (
SELECT kind, AVG(age) AS a
FROM pets GROUP BY kind
)
SELECT * FROM averages WHERE averages.a > 10.0;For this particular query, SQL also has a convenient
HAVING clause:
SELECT kind, AVG(age)
FROM pets
GROUP BY kind
HAVING AVG(age) > 10It’s like WHERE, but the condition is evaluated on each
group after grouping so we can use aggregate functions.
Subqueries are even more powerful when nested inside another query. To appreciate this, let’s try to find the oldest cat (“argmax”). We’ll first do this the “hacky way” to pick up some SQL features:
SELECT name, age
FROM pets
WHERE kind = 'cat'
ORDER BY age DESC
LIMIT 1The new bits here are ORDER BY age DESC – which sorts
the output by age from large to small – and
LIMIT 1 – which limits the output to the first row. This is
hacky because SQL is agnostic to ordering in general, but here we rely
on the order.
To do this properly, we can use a subquery. First, we find the age of the oldest cat:
SELECT max(age)
FROM pets
WHERE kind = 'cat'Then, we store this in a variable for use in another query:
WITH oldage AS (
SELECT max(age) AS a
FROM pets
WHERE kind = 'cat'
)
SELECT name, age
FROM pets, oldage
WHERE kind = 'cat'
AND age = oldage.aIn this case, because the oldage subquery stores a
single value (a.k.a. a scalar), we can use it directly as a
nested subquery:
SELECT name, age
FROM pets
WHERE kind = 'cat'
AND age = (SELECT max(age)
FROM pets
WHERE kind = 'cat')Remember, this only works because the inner query is a scalar! Also note that, unlike the hacky query, this query returns all pets that are oldest.
Now, let’s move on to the more general problem: how can we find the
oldest pet(s) per kind? First we can remove the
kind = 'cat' filter:
SELECT name, age
FROM pets
WHERE age = (SELECT max(age) FROM pets)But that won’t work, because the inner query now returns the overall oldest age.
To fix it, we need to filter the inner query by the current pet kind considered by the outer query:
SELECT name, age
FROM pets as p1
WHERE age = (SELECT max(age)
FROM pets as p2
WHERE p2.kind = p1.kind)The idea is that, the inner query is evaluated once per row of the outer query, because the WHERE clause of the outer query runs once per row. Just like a nested loop.
We can also use nested queries in the SELECT clause:
SELECT kind, (SELECT max(age)
FROM pets as p2
WHERE p2.kind = p1.kind)
FROM pets as p1Which is a convoluted way to implement “max age per kind”.3
Let’s see two other ways to find old pets, while introducing a bit more SQL features along the way.
SELECT name, age
FROM pets as p1
WHERE NOT EXISTS (SELECT *
FROM pets as p2
WHERE p2.age > p1.age
AND p2.kind = p2.kind)This query is rather self-explanatory: an oldest cat is one where there is no cat older than it.
We can also use the ALL keyword:
SELECT name, age
FROM pets as p1
WHERE age >= ALL (SELECT age
FROM pets as p2
WHERE p2.kind = p1.kind)Here, age >= ALL (SELECT ...) is true if and only if
age >= x for all values of x
returned by the subquery.
Similarly, we can also use ANY:
SELECT name, age
FROM pets as p1
WHERE not age < ANY (SELECT age
FROM pets as p2
WHERE p2.kind = p1.kind)Naturally, the oldest cat shouldn’t be younger than any other cat.
There’s nothing left to learn in SQL, and we’ll cover exactly that now. Wait, what?
Consider the following queries. What do you expect them to return?
SELECT COUNT(*) FROM r;
SELECT COUNT(*) FROM r WHERE r.x = r.x;
SELECT COUNT(*) FROM r WHERE NOT (r.x <> r.x);
SELECT COUNT(*) FROM r WHERE r.x = r.x OR r.x = r.x;Normally, all queries should return the size of the input table
r, but not if r contains nothing; and
by nothing, I mean NULL.
In SQL, the NULL value stands for the absence of a
value. There are many occasions when you might want to use
NULL:
In general, NULL means “missing information”.
SQL treats NULL specially during query evaluation. To
understand that, we need to look at SQL operations more formally.
First, there are two kinds of values in SQL:
Data values are like numbers, dates, strings,
etc.—the stuff you’re actually interested in storing in your tables. A
special data value is the NULL value, representing
“nothingness” or “unknown”.
Logical values are your familiar booleans true and
false, used by the engine in intermediate calculations. A special
logical value is the UNKNOWN value, representing “unknown”.
You can think of it as just the logical counterpart to
NULL.
There are 3 kinds of operators in SQL:
Data operators:
1 + 2
<data value> <data operator> <data value>
The entire expression evaluates to itself another data value.
Predicates:
1 < 2
<data value> <predicate operator> <data value>
The entire expression evaluates to a logical value.
Logical connectives:
true AND false
<logical value> <logical operator> <logical value>
The entire expression evaluates to itself another logical value.
Conveniently summarized:
| Operator kind | Example | Input -> Output |
|---|---|---|
| Data | + - * / | data \to data |
| Predicate | > < = | data \to logic |
| Logical | AND OR NOT | logic \to logic |
Some systems support “if/case statements” that goes from logic to data, but you won’t need those very often.
These operators all behave as you would expect if NULL
is not involved. When NULL is involved,
there are special rules:
| Kind | Example | Output on missing input |
|---|---|---|
| Data | + - * / | NULL |
| Predicate | > < = | UNKNOWN |
| Logical | AND OR NOT | 3 Valued Logic |
For data: if either operand is NULL,
the result will also be NULL
For predicate: if any argument is NULL,
the output is UNKNOWN4
For logical, we apply 3 valued logic
3 Valued Logic
| AND | T | F | U |
|---|---|---|---|
| T | T | F | U |
| F | F | F | F |
| U | U | F | U |
| OR | T | F | U |
|---|---|---|---|
| T | T | T | T |
| F | T | F | U |
| U | T | U | U |
| NOT | T | F | U |
|---|---|---|---|
| F | T | U |
Don’t memorize these – they fall naturally out of “short-circuiting”: if we can already deduce the result from the non-missing operand, we use that, otherwise it’s UNKOWN.
Finally, SQL will return a row only if the condition in
WHERE is true on that row. In other words, it
discards rows when the condition is false or
UNKNOWN.
Let’s now revisit the examples using these rules.
SELECT *
FROM R
WHERE R.x=R.x; -- NULL=NULL => UNKNOWNWhere R.x is NULL, the WHERE
clause evaluates to UNKNOWN, so that row is
excluded. The return table is not necessarily the same
as R.
SELECT *
FROM R
WHERE R.x = R.x
OR R.x <> R.x; -- NULL <> NULL => UNKNOWNSame as above.
SELECT *
FROM R
WHERE null = null; -- NULL=NULL => UNKNOWNSame as above. It’s just explicit this time.
SELECT *
FROM R
WHERE null <> null; -- NULL <> NULL => UNKNOWNSame as above. It’s just explicit this time.
SELECT *
FROM R
-- NULL <> NULL => UNKNOWN
-- NOT UNKNOWN => still UNKNOWN
WHERE NOT null <> null;This time we have a nested operation. Just follow through with the
rules and you’ll find when WHERE evaluates to
UNKNOWN.