An introduction to SQL
Alkistis Pourtsidou
Queen Mary, University of London
Image credit: Hayden Planetarium, 2014
WHAT IS SQL?
✦ SQL stands for Structured Query Language
✦ Used to query (“talk to”) a database server
✦ Data manipulation, database creation
✦ Almost all companies use databases to store their data
✦ Have a look at the Indeed Job Search Engine…tens of
thousands of jobs mentioning SQL!
✦ https://www.indeed.co.uk/
WHY DO WE NEED DATABASES?
✦ Concurrency: multiple simultaneous changes to data
✦ Data changes regularly
✦ Large data sets but only need subsets
✦ Sharing large data sets
✦ Rapid queries
✦ Data web interfaces (dynamic data)
WAYS TO USE SQL
✦ Standard console command (e.g. mysql -u user -p dbname)
✦ GUI interfaces often available
✦ Interfaces to many programming languages (Python, R, …)
✦ SQLite - use SQL without a database server: this is what we
are going to use in our tutorial
✦ PostgreSQL - the world’s most advanced open source
database: we’ll see how that works later on
MORE ABOUT DATABASES
✦ A database server can contain many databases
✦ Basically, databases are collections of tables with rows
(observations) and columns (variables)
✦ Limited mathematical operations available
✦ Very good at combining information from several related tables
✦ We’ll explore the above (and more) in detail
duction to
SQL Finding Your Way Around
EXPLORING the Server
THE SERVER
✦ A given server can support multiple databases
✦ EachSince
database contains
a single servermany tables many databases, each
can support
containing many tables, with each table having a variety
✦ Each table contains many columns
of columns, it’s easy to get lost when you’re working with
databases.
✦ But keeping These
things commands
under control is will help figure out what’s
straightforward!
available:
I SHOW DATABASES;
I SHOW TABLES IN database;
I SHOW COLUMNS IN table;
I DESCRIBE table; - shows the columns and their
types
duction to
SQL Variable Types
VARIABLE TYPES
✦ SQLSQL
supports a variety
supports a veryoflarge
different
numberformats for storing
of di↵erent information
formats for
internal storage of information.
Numeric
I INTEGER, SMALLINT, BIGINT
I NUMERIC(w,d), DECIMAL(w,d) - numbers with width
w and d decimal places
I REAL, DOUBLE PRECISION - machine and database
dependent
I FLOAT(p) - floating point number with p binary
digits of precision
VARIABLE TYPES
uction to
SQL Variable Types (cont’d)
✦ SQL supports a variety of different formats for storing information
Character
I CHARACTER(L) - a fixed-length character of length L
I CHARACTER VARYING(L) or VARCHAR(L) - supports
maximum length of L
Binary
I BIT(L), BIT VARYING(L) - like corresponding
characters
I BINARY LARGE OBJECT(L) or BLOB(L)
Temporal
I DATE
I TIME
I TIMESTAMP
SQL: THE BASICS
✦ Enough intro! Let’s dive into SQL with hands-on examples
✦ We will use SQLite3, which is part of Python, for details see
https://www.pythoncentral.io/introduction-to-sqlite-in-python/
✦ SQLite is an embedded SQL database engine. It doesn’t have a
separate server process, which makes it really easy to use,
immediately.
✦ Using Python (in the form of a Jupyter notebook) to run our SQL
code allows us to use Pandas for importing our results and make
everything look nice and clear!
SQL: THE BASICS
SQL: THE BASICS
✦ Let’s first define a function that takes our query, stored as a string, as an input.
✦ Then shows the result as a formatted data frame (we’ll see this in action in a
bit…)
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
✦ As we saw, the database has two tables, TRIPS and
STATIONS. We will first work with the TRIPS table. Let’s see
what kind of information it contains:
SQL: THE BASICS
✦ See all columns:
SQL: THE BASICS
✦ See specific columns:
SQL: THE BASICS
SQL: THE BASICS
✦ DESC: Descending
SQL: THE BASICS
✦ The longest trip lasts
a bit less than 3
hours.
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
✦ So far we’ve been looking at queries pulling data from the
TRIPS table
✦ But as you might remember there’s also the STATIONS table
✦ The STATIONS table contains information about every station
in the Hubway network
✦ It also includes an id column referenced by the TRIPS table
✦ So these tables can be combined to extract useful information
✦ Let’s have a look…
SQL: THE BASICS
ID is a unique identifier for each station, corresponding to the
start_station and end_station columns in the TRIPS table
SQL: THE BASICS
✦ Let’s say we want to know which station is the most
popular starting point
✦ For that we need to combine information from both the TRIPS
and STATIONS tables
✦ We will use the JOIN command
SQL: THE BASICS
✦ We will use SELECT to return the station column from the stations table using the
table.column syntax, i.e. stations.station in our case
✦ We also return the COUNT of the number of rows from the trips table
✦ To tell the database how the stations and trips tables are connected, we’ll use JOIN and
ON.
✦ JOIN specifies which tables should be connected
✦ ON specifies which columns in each table are related
✦ INNER JOIN means rows will only be returned where there is a much in the columns
specified by ON
✦ Tables are connected ON trips.start_station = stations.id
✦ Then we group by the station column so that COUNT will give the number of trips for
each station separately
✦ Finally we ORDER BY descending order
SQL: THE BASICS
SQL: THE BASICS
SQL: THE BASICS
✦ Let’s slightly expand this query to see which are the most
popular round-trip stations:
EXERCISES/TASKS
✦ Code up the queries we just learned in Jupyter and reproduce
the results
✦ How many trips lasted more than half an hour? (this induces
extra charges)
✦ Which bike was used for the least total time?
✦ Did registered or casual users take more round trips?
✦ Pick up any publicly available database and play with it!
PostgreSQL
✦ PostgreSQL: “the world’s most advanced open source relational
database
✦ Active development for 30 years now!
✦ www.postgresql.org
✦ Installation: For MAC OS I strongly recommend using
Postgress.app, see https://www.calhoun.io/how-to-install-
postgresql-9-6-on-mac-os-x/
✦ For other systems, see https://www.dataquest.io/blog/sql-
intermediate/ and https://www.systems.ethz.ch/sites/default/files/
ex1a_postgresql_jupyter_setup.pdf (not tested…)
PostgreSQL
✦ First we need to create new user, database, and tables
✦ Follow the instructions in https://www.dataquest.io/blog/sql-
intermediate/ to run psql, create a new user named ‘oracle’ (or
another name of your preference) and a new database
✦ The new database contains consumer complaints
✦ It has two tables: one for bank account complaints and one
for credit card complaints
PostgreSQL
✦ We need to populate these with actual data!
✦ We will use data from here https://data.world/dataquest/bank-
and-credit-card-complaints
✦ Again, follow the instructions in https://www.dataquest.io/blog/
sql-intermediate/ to load the data
✦ They are CSV files
✦ They have identical fields: complaint_id, date_received,
product, …, issue, consumer_complaint_narrative, etc.
PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions
✦ One to run queries and one to run commands
PostgreSQL
✦ Before having a look and playing with the data, we need to
create two helper functions
✦ One to run queries and one to run commands
PostgreSQL
✦ OK, now let’s test everything works OK.
✦ First let’s see how the credit card complaints table looks like.
PostgreSQL
✦ Then let’s get the number of records using the COUNT
function
✦ Works well! (try the bank account complaints table too)
PostgreSQL
✦ How to deal with NULL values
✦ Let’s see how many records in each table have null values for
the consumer complaint narrative field
✦ When comparing a column to null (no value), we cannot use
arithmetic operators. Instead we use IS NULL / IS NOT NULL.
PostgreSQL: Views
✦ So we just saw a large amount of records had null values for
the consumer complaint narrative field.
✦ Instead of having to filter on this field later, we’ll create a view
with this subset only.
✦ Syntax is simple:
PostgreSQL: Views
✦ Let’s have a look:
PostgreSQL: String Concatenation
✦ Extremely useful, combines two or more strings (text values)
together to form a single string
✦ For example say we have a “month” field and a “year” field but
we need to show “month-year” instead
✦ Syntax:
✦ Let’s try it out with our credit card complaints table
✦ Let’s select complaint_id, product, company, and concatenate
separated by a hyphen
PostgreSQL: String Concatenation
PostgreSQL: Subqueries
✦ Subqueries (“inline views”) create a mini view within a single
query
✦ The best way to understand how they work is via an example:
TASKS
✦ Read about UNION/UNION ALL, and put them in action using
different views of the banking data
✦ Same with INTERSECT/EXCEPT
✦ Explore subqueries, for example by reproducing the
“Subqueries in action” examples in https://www.dataquest.io/
blog/sql-intermediate/
✦ Go through the SQL/Pandas tutorial in https://
www.dataquest.io/blog/python-pandas-databases/
REFERENCES
✦ Introduction to SQL https://www.stat.berkeley.edu/~spector/
sql.pdf
✦ https://www.w3schools.com/sql/
✦ http://www.sql-tutorial.net/
✦ https://www.kaggle.com/learn/sql
✦ https://www.dataquest.io/blog/sql-basics/
✦ https://www.dataquest.io/blog/sql-intermediate/
✦ https://www.dataquest.io/blog/python-pandas-databases/