In this homework we will put our SQL-fu to use, getting a peek into the dark underworld of our city. We will be exploring the public Crime Dataset of Los Angeles. Buckle up, because what lies ahead may not be what you expect.
The dataset is hosted on data.gov, home to hundreds of thousands of public datasets spanning all domains of American life. I highly recommend poking around the website to see what you can find. The LA Crime dataset we're working with is apparently the second most-viewed dataset on the website.
Start by downloading the CSV (Comma-Separated Values) data from this page. Once the download finishes, load the data into SQLite Browser as follows (these instructions are for macOS; if you use Linux/Windows, please refer to the SQLite documentation):
Create a new database with File > New Database. It will ask you for a name and a location to save the database file. Give it any name you like and click "Save". Another window titled "Edit table definition" will pop up - that is for creating new tables. Ignore it by clicking Cancel.
Now, we will import our dataset as a table into the database. Click File > Import > Table from CSV file. Locate the CSV file you just downloaded and click Open. Another window pops up with a preview of the data. Change the Table name to "crime" and make sure the Field separator is set to "," (comma); you should see a spreadsheet-like preview of the data with many rows and columns. Click OK.
Now we have our data in the database! Take a look around by clicking on "Browse Data". This should display a spreadsheet showing the entire table. Note that the table is very wide, with many columns - bad design! Also note that there are NULLs literally everywhere. So this data is a bit dirty, and we will clean it up first. But to clean it up, we first need to understand what's in the data. Let's go column by column.
The first column is a bit mysterious and contains a bunch of random numbers. Usually, this is some kind of ID used to uniquely identify each data record. In other words, it should form a primary key of the table. Let's test out our hypothesis:
Is the `DR_NO` column a primary key?
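Here's one way to test it (a sketch; if the three counts below agree, `DR_NO` has no duplicates and no NULLs, so it can serve as a primary key):

```sql
SELECT COUNT(*)              AS total_rows,     -- all rows
       COUNT(DR_NO)          AS non_null_keys,  -- rows where DR_NO is not NULL
       COUNT(DISTINCT DR_NO) AS distinct_keys   -- unique DR_NO values
FROM crime;
```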
The second and third columns are more self-explanatory: `Date Rptd` probably means the date reported, and `Date OCC` is the date the crime actually occurred. After each date we also see a time, but it looks like the time is always 12 AM? Let's check if that's the case. To do that, we first need to pull the time part out of the string.
Write a SQL query to extract the time out of each of the two columns. You may ask AI to find the appropriate SQLite feature for this. Tip: when you have spaces and other weird characters in column names, you need to quote them:
SELECT "Date Rptd" FROM crime;
Now, write another query to see if the time is always 12 AM.
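A sketch, building on the extraction above (the exact literal depends on how the timestamps are formatted in your import):

```sql
-- count rows whose time part is anything other than midnight
SELECT COUNT(*) AS not_midnight
FROM crime
WHERE substr("Date Rptd", instr("Date Rptd", ' ') + 1) <> '12:00:00 AM';
```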
Spoiler: turns out the time is always 12 AM, so we can just drop that.
There is a small problem: SQLite expects dates to be stored as `YYYY-MM-DD`, but in this dataset they are stored as `MM/DD/YYYY`.
Now moving on to the `TIME OCC` column. You guessed it from the name: it's the time the crime actually occurred. If you read a few entries you'll realize they are using "military time", i.e. 2100 means 21:00.

Combine the `DATE OCC` column and the `TIME OCC` column into the SQLite datetime format `YYYY-MM-DD HH:MM`.
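Here's a hedged sketch, assuming `DATE OCC` looks like `03/01/2020 12:00:00 AM` and `TIME OCC` holds military time such as `2130`:

```sql
SELECT substr("DATE OCC", 7, 4) || '-' ||                  -- year
       substr("DATE OCC", 1, 2) || '-' ||                  -- month
       substr("DATE OCC", 4, 2) || ' ' ||                  -- day
       substr(printf('%04d', "TIME OCC"), 1, 2) || ':' ||  -- hour (zero-padded)
       substr(printf('%04d', "TIME OCC"), 3, 2)            -- minute
       AS occ_datetime
FROM crime
LIMIT 10;
```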
The next two columns are `AREA` and `AREA NAME`. If you sort by `AREA` (by clicking on it), you'll see that the `AREA` column is probably some kind of ID for `AREA NAME`. But it's not a primary key for this table! Check that `AREA` uniquely determines `AREA NAME`.
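One way to check the dependency (an empty result means every `AREA` maps to exactly one `AREA NAME`):

```sql
SELECT AREA
FROM crime
GROUP BY AREA
HAVING COUNT(DISTINCT "AREA NAME") > 1;
```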
Next comes a column named `Rpt Dist No`. Compare it to `AREA` and you'll find a pattern: check that `AREA` always appears at the front of `Rpt Dist No`.
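A sketch of one possible check, assuming `AREA` is the 4-digit `Rpt Dist No` with its last two digits dropped (the casts guard against the CSV import storing numbers as text):

```sql
-- 0 mismatches means AREA is always the front of Rpt Dist No
SELECT COUNT(*) AS mismatches
FROM crime
WHERE CAST("Rpt Dist No" AS INTEGER) / 100 <> CAST(AREA AS INTEGER);
```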
The column `Part 1-2` seems to suggest that some crimes have a part 1 and a part 2, but I actually couldn't figure out a way to link the two parts of the same crime together. So if you find a way, please let me know!
After that we have `Crm Cd`, which should determine `Crm Cd Desc`, and you can check that in the usual way. I'm not 100% sure what `Mocodes` are, but from searching online they seem to be "modus operandi" codes describing specific activities involved in the crime. The next columns of age, sex, and descent are self-explanatory. The premise, weapon, and status columns are also similar to what we have seen.
But now we see four columns `Crm Cd 1` through `Crm Cd 4` again! This seems to suggest certain incidents involve multiple crimes. But this way of keeping track of multiple crimes is pretty awkward, as evidenced by the NULLs everywhere.
Finally, we have some columns recording the location and coordinates of the crimes. Then we’re done! Phew!
So we've said the organization of the table isn't ideal - there are way too many columns! We know the solution is to decompose the table into multiple ones. Given that we have a primary key, there is actually a rather extreme way of decomposing the table: for every non-key column, create a two-column table containing `DR_NO` and that column.
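For instance, one piece of this decomposition might look like the following sketch (`crime_weapon` is a hypothetical name, and `Weapon Cd` stands in for whatever the weapon-code column is called in your import):

```sql
-- one two-column piece: the primary key plus a single attribute
CREATE TABLE crime_weapon AS
SELECT DR_NO, "Weapon Cd" FROM crime;
```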
There is another nice side effect of this extreme decomposition: if there is a NULL in an entry, we can just drop that row without really losing information.
For each table in your decomposition, drop any row containing NULLs.
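For example, for the hypothetical `crime_weapon` piece above:

```sql
DELETE FROM crime_weapon WHERE "Weapon Cd" IS NULL;
```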
After dropping the NULLs, how can you reconstruct the original table?
Decomposing tables like that is usually not necessary, and it can complicate our queries. Usually it suffices to remove strict independencies and dependencies, as we have done in class. There isn't really any useful strict independency in this table, apart from some trivial ones: if we decomposed using only the trivial independencies, we would end up with the extreme decomposition above. So we won't do that.
Instead, we will deal with the dependencies. For example, because we know `AREA` determines `AREA NAME`, we can "factor out" the `AREA NAME` column into a separate table. I.e., we keep the `AREA` column, and make a new table with columns `AREA` and `AREA NAME`.
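A sketch of the factoring (the `DROP COLUMN` step requires SQLite 3.35 or newer):

```sql
-- the lookup table: one row per (AREA, AREA NAME) pair
CREATE TABLE area AS
SELECT DISTINCT AREA, "AREA NAME" FROM crime;

-- then drop the now-redundant column from crime
ALTER TABLE crime DROP COLUMN "AREA NAME";
```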
Now do the same for all other strict dependencies in the table except for those from the primary key.
There is one dependency that's very special: recall that `Rpt Dist No` determines `AREA`, so we would be factoring them out into a separate table. But that table would look a bit silly: its `AREA` column simply drops the last 2 digits from the `Rpt Dist No` column. This is a case where database views can be very helpful: instead of actually creating the physical table, we create a virtual table that computes the `AREA` code from `Rpt Dist No`.
Create a view that computes `AREA` directly from `Rpt Dist No` but returns the same result as `SELECT DISTINCT "Rpt Dist No", AREA FROM crime`.
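A sketch of such a view, again assuming `AREA` is `Rpt Dist No` with the last two digits dropped:

```sql
CREATE VIEW dist_area AS
SELECT DISTINCT "Rpt Dist No",
       CAST("Rpt Dist No" AS INTEGER) / 100 AS AREA
FROM crime;
```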
The last bit to clean up is the issue with the crime codes: every row has a `Crm Cd` column as well as additional `Crm Cd 1` through `Crm Cd 4` columns. Also, it seems `Crm Cd` is always contained in one of the `Crm Cd n` columns and therefore might be redundant.
The current way of storing crime codes is quite awkward, and that's an artifact of using only one table to store all the information. But we're using a database, and we can use more tables!
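For example, we can consolidate the four columns into one tall table with a row per (incident, code) pair - a sketch, assuming `Crm Cd` itself is redundant as observed above:

```sql
CREATE TABLE code AS
SELECT DR_NO, "Crm Cd 1" AS "Crm Cd" FROM crime WHERE "Crm Cd 1" IS NOT NULL
UNION
SELECT DR_NO, "Crm Cd 2" FROM crime WHERE "Crm Cd 2" IS NOT NULL
UNION
SELECT DR_NO, "Crm Cd 3" FROM crime WHERE "Crm Cd 3" IS NOT NULL
UNION
SELECT DR_NO, "Crm Cd 4" FROM crime WHERE "Crm Cd 4" IS NOT NULL;
```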
At this point, our table is more or less clean! Your table schema should look something like this:
```
crime(DR_NO, Date Rptd, Date OCC, TIME OCC, Rpt Dist No,
      Part 1-2, mocodes, age, sex, descent,
      Premis Cd, Weapon Cd, Status,
      Location, Cross Street, Lat, Lon)
dist_area(Rpt Dist No, AREA)    (virtual table)
area(AREA, AREA_NAME)
code(DR_NO, Crm Cd)             (this consolidates all crime codes)
code_desc(Crm Cd, Crm Cd Desc)
premise(Premis Cd, Premis Desc)
weapon(Weapon Cd, Weapon Desc)
status(Status, Status Desc)
```
We can create these new tables in two different ways:

1. Use `CREATE TABLE` to actually create them "physically"
2. Create them as views/"virtual tables"

Let's try the first approach first.
Be careful to save your work. For example, you can first save a version of the DB containing both the original table and the new ones.

Now save two different copies of the DB file: one with the original crime table created directly from the CSV, and another with only the decomposed tables. Compare their sizes.
The benefit of creating the table physically is that we can save space, and we can query the decomposed tables directly. The issue is that, if the city of LA sends us new data, it’s probably in the form of the original table, and we would have to manually insert those into our new schema. That’s where views come in:
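For example, here's a sketch of one lookup table defined as a view instead (assuming the raw `crime` table, with `Weapon Desc` still in it, is kept around). Because a view is recomputed on every query, newly imported raw rows show up in it automatically:

```sql
CREATE VIEW weapon AS
SELECT DISTINCT "Weapon Cd", "Weapon Desc" FROM crime;
```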
With the data cleaned, now we can ask some interesting questions! For example:
What are the most popular weapons?
What are the least popular weapons?
How many different weapons are there?
What’s the “most dangerous” neighborhood?
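For instance, the first question could be answered with something like this sketch (assuming the decomposed schema above, where `crime` keeps `Weapon Cd` and the `weapon` table holds the descriptions):

```sql
SELECT "Weapon Desc", COUNT(*) AS n
FROM crime
JOIN weapon USING ("Weapon Cd")
GROUP BY "Weapon Cd"
ORDER BY n DESC
LIMIT 10;
```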
There's also a nice plot feature in SQLite Browser that lets you do some quick visualization. For example:
What’s the crime distribution over hours of the day?
What’s the crime distribution over months of a year?
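A sketch for the hourly distribution, assuming `TIME OCC` is military time (the cast guards against a text-typed import):

```sql
SELECT CAST("TIME OCC" AS INTEGER) / 100 AS hour, COUNT(*) AS n
FROM crime
GROUP BY hour
ORDER BY hour;
```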
My favorite one is this: plot `LAT` against `LON`, and you might recognize the shape. You'll need to remove 0.0, which denotes missing values, and it also helps to subtract min(lat) and min(lon) from lat and lon (you'll need nested queries; see the sketch at the end). The visualization could be a little laggy due to having too many points.

If you're really up for it, use a proper visualization library together with the SQLite library for your favorite language to run some more analysis. Make some interactive maps, go wild!
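As a starting point, here's a sketch of the nested query for that coordinate plot (assuming the columns are named `LAT` and `LON`):

```sql
-- drop missing coordinates, then shift everything toward the origin
SELECT LAT - (SELECT MIN(LAT) FROM crime WHERE LAT <> 0.0) AS lat,
       LON - (SELECT MIN(LON) FROM crime WHERE LON <> 0.0) AS lon
FROM crime
WHERE LAT <> 0.0 AND LON <> 0.0;
```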