DS 5110 – Lecture 5
SQL Part III
Roi Yehoshua
Agenda
Accessing SQL from Python
Data warehouses and lakes
OLAP
Window functions
2 Roi Yehoshua, 2024
Accessing SQL from a Programming Languages
To access SQL from a programming language you need a database driver
A driver converts application commands into a format that the DBMS can understand
Typical commands include
Connection commands (e.g., open or close a connection to the database)
SQL commands
Handling cursors, which allow you to traverse through result sets
Transaction management commands
Types of drivers
Native drivers – provided by specific database vendors
ODBC drivers – provide a standard interface to communicate with databases
JDBC drivers – specific to Java-based applications
3 Roi Yehoshua, 2024
ODBC
Open Database Connectivity (ODBC) is a standard API for accessing DBMS
Independent of any specific DBMS or operating system
ODBC drivers exist for most of the commercial databases
4 Roi Yehoshua, 2024
MySQL Connector/Python
Python Database API (DB-API) defines a standard interface for accessing relational
databases from Python programs
Different packages in Python implement this interface for different databases
e.g., mysql.connector for MySQL, sqlite3 for SQLite, pyodbc for ODBC
mysql.connector is the recommended driver for interacting with MySQL
Developed by the MySQL group at Oracle
Install it via pip
pip install mysql-connector-python
If you’ve installed Python with the Anaconda distribution, you should already have it
5 Roi Yehoshua, 2024
MySQL Connector/Python
The general workflow of a Python program that interacts with a MySQL database:
Connect to the MySQL server
Execute a SQL query
Fetch the results
Inform the database if any table changes were made (by committing the changes)
Close the connection to the server
6 Roi Yehoshua, 2024
Establishing a Connection with MySQL Server
To connect to a MySQL server, call the connect() function in mysql.connector module
This function gets 4 parameters: host, user, password and database name
It returns a MySQLConnection object
You should always close the connection in the end by calling conn.close()
You should never hard-code your login credentials directly in a Python script
7 Roi Yehoshua, 2024
The Cursor Object
In order to execute SQL queries in Python, you need a cursor object
A cursor object allows you to traverse over database records
To create a cursor, use the cursor() method of your connection object
Then you execute a SQL query by calling cursor.execute(query)
If the query returns rows, you can retrieve them using one of cursor’s fetch methods:
fetchall() - retrieves all the rows from the result as a list of tuples
fetchone() - retrieves the next row of the result as a tuple
Returns None if no more rows are available
fetchmany(n) - retrieves the next n rows from the result as a list of tuples (n defaults to 1)
Returns an empty list if no more rows are available
8 Roi Yehoshua, 2024
Reading Records from a Table
The following example selects all the records from the instructor table:
• The result variable holds the
records returned from.fetchall()
• It’s a list of tuples representing
individual records from the table
9 Roi Yehoshua, 2024
Reading Records from a Table
To get specific attributes in each row, specify their indexes in the returned tuples
For example, to print only the instructor names:
10 Roi Yehoshua, 2024
Cursor as an Iterator
To process the rows in the result one at a time, you can use the cursor as an iterator:
11 Roi Yehoshua, 2024
More Complex Queries
You can make your select queries as complex as you want using the same methods
Example: Find the courses that were taken by the highest number of students
12 Roi Yehoshua, 2024
Cursor Properties
The cursor has a few useful properties that provide information about the result set
column_names – returns the list containing the column names of the result set
rowcount – the number of rows in the result set
13 Roi Yehoshua, 2024
SQL Injection Attack
A common attack that enables execution of malicious SQL statements in the DB
By insertion (“injection”) of a SQL query via the input data from the client
Suppose you write a script that checks if a given username exists in the users table
You construct the following SQL query:
username = # read from the input
query = "SELECT * FROM users WHERE username = '" + username + "'"
If the user, instead of entering their username, enters:
Then the query becomes:
query = "SELECT * FROM users WHERE username = '' or 1 = 1 --'"
The where clause is now always true and the entire users table is returned
14 Roi Yehoshua, 2024
SQL Injection Attack
Most databases allow execution of multiple SQL statements separated by semicolon
This allows the hacker to inject whole SQL statements into the query
For example, the hacker could enter the following string:
The resulting query would be:
query = "SELECT * FROM users WHERE username = ''; drop table users; --"
This query would result in deleting the entire users table!
Solution: use parameterized queries whenever user input is involved in the query
15 Roi Yehoshua, 2024
SQL Injection Attack
Example:
16 Roi Yehoshua, 2024
Parameterized Query
A parameterized query is a query that uses placeholders (%) for attribute values
Strings passed to the placeholders are correctly escaped by the library at runtime
e.g., each quotation mark is doubled
You can pass a parameterized query to cursor.execute() as follows:
17 Roi Yehoshua, 2024
Parameterized Query
If a user tries to sneak in some problematic characters, the resulting statement will
cause no harm since each quotation mark will be doubled:
select * from instructor where name = 'X'' or ''Y'' = ''Y'
This query will return an empty set
18 Roi Yehoshua, 2024
Inserting New Records
To insert data, pass the insert into command to the cursor’s execute() method
For example, to add a new instructor:
You must call conn.commit() at the end, otherwise your changes will be lost!
Unless you turn on automatic commits by setting conn.autocommit = True
19 Roi Yehoshua, 2024
Inserting New Records
Verifying that the new record was added to the table:
20 Roi Yehoshua, 2024
Inserting New Records
If the values to be inserted come from an external source (e.g., the user), again you
should use parameters in the SQL statement:
21 Roi Yehoshua, 2024
Inserting a Bulk of Records
You can insert multiple records using the executemany() method
It accepts two parameters:
A query that contains placeholders for the records that need to be inserted
A list that contains all records that you wish to insert
22 Roi Yehoshua, 2024
Update and Delete
Updating and deleting work the same way, just pass the SQL to cursor.execute()
You can use the cursor rowcount attribute to check how many records were affected:
23 Roi Yehoshua, 2024
Calling Stored Procedures from Python
To call a stored procedure use the callproc() method of the Cursor object
cursor.callproc(procedure_name, args=())
Then, you can call the stored_results() method to get an iterator with the result set
The rows in the result can be read by calling the fetchall() method
Example:
24 Roi Yehoshua, 2024
Loading Data into a DataFrame
Pandas provides a read_sql() method that reads an SQL query into a DataFrame
For example, loading the instructor table into a DataFrame:
25 Roi Yehoshua, 2024
Loading Data into a DataFrame
You can use the params argument to pass a list of parameters to the query
For example, the following displays all the students that took the course CS-319
26 Roi Yehoshua, 2024
Data Warehouse
A specialized type of database designed for analysis of large volumes of data
Integrates data from various sources and arranges them in order to optimize queries
Stores both current and historical data
Data is usually added or deleted, but not updated
OLAP (Online Analytical Processing) tools allow analysis of multi-dimensional data
Popular data warehouses: Snowflake, Amazon Redshift, Google BigQuery, Teradata
27 Roi Yehoshua, 2024
OLTP vs. OLAP
28 Roi Yehoshua, 2024
Multidimensional Data
A data warehouse contains two primary types of tables:
Fact tables store quantitative data (facts) about a business process
For example, a table that stores daily sales:
sales(date, product_id, store_id, total_sales_amount, total_units_sold)
Each row in the table represents a combination of dimensions (date, product and store) and
the corresponding measures (total sales amount and units sold)
Dimension tables store descriptive attributes (dimensions) related to the data
For example, a product dimension table would include details about each product
e.g., id, name, category, price
Dimension tables are typically denormalized to make the design simpler
Data modeled as dimension + measures attributes is called multidimensional data
29 Roi Yehoshua, 2024
Star Schema
The resulting schema is a star schema
The fact table is connected to multiple peripheral tables (dimension tables)
Advantages
Queries are simpler and run faster compared to normalized structures (less joins)
Flexible design that allows to add new dimensions without making extensive changes
30 Roi Yehoshua, 2024
Snowflake Schema
A variant of star schema with multiple levels of dimension tables
The dimension tables are usually normalized
Reduces data redundancy but increases the number of tables and complexity of joins
31 Roi Yehoshua, 2024
Data Lakes
Storage systems that can store vast amounts of data in its native format
Including structured data (e.g., databases) and unstructured data (e.g., images, text)
Require less upfront effort, but more effort during querying
Use “schema-on-read” approach instead of “schema-on-write”
You store the data as-is and define the schema only when you need to read/process it
Usually built on top of distributed storage systems such as Hadoop
which can scale out by adding more nodes to the system
Integrate well with big data tools such as Apache Spark
Popular data lakes:
Amazon S3
Azure Data Lake Storage
Google Cloud Storage
32 Roi Yehoshua, 2024
Data Lakes
Source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for
33 Roi Yehoshua, 2024
OLAP (Online Analytical Processing)
Set of tools that provide analysis of data from multiple dimensions
Allows users to interactively analyze and explore data from different points of view
Used in BI (Business Intelligence) to uncover insights and patterns that are not
immediately apparent with traditional, two-dimensional databases
Types of OLAP
MOLAP (Multidimensional OLAP): Uses multidimensional databases
ROLAP (Relational OLAP): Uses relational databases to store data and a multidimensional
model at the front end
HOLAP (Hybrid OLAP): Combines the features of both MOLAP and ROLAP
34 Roi Yehoshua, 2024
Data Cube
The primary unit of storage and analysis in OLAP
A multi-dimensional grid of data
Each dimension represents a different aspect of the data
Actual data points (measures) reside within the cube
Example: sales_data (date, item_type, location, sales_amount)
35 Roi Yehoshua, 2024
OLAP Operations
The main OLAP operations
Roll-up (aggregation)
Drill-down (de-aggregation)
Slice
Dice
Pivot (rotating)
36 Roi Yehoshua, 2024
Roll-Up
Aggregate the data by ascending the level in the dimension hierarchy
Example: aggregate sales by countries instead of cities
37 Roi Yehoshua, 2024
Roll-Up in SQL
Can be achieved by using GROUP BY
For example, aggregating daily sales to monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS monthly_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date);
date item_type location sale_amount year month total_sales
2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2023 1 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290
38 Roi Yehoshua, 2024
The ROLLUP Operator
Enhances the grouping capability of your queries by providing subtotals for
hierarchical combinations of columns in the GROUP BY
For example, using ROLLUP to get yearly and monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date) WITH ROLLUP;
date item_type location sale_amount year month total_sales
2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2022 NULL 1670
2022-02-03 Clothing New York 320 2023 1 770
2023-01-05 Electronics Los Angeles 480 2023 NULL 770
2023-01-10 Clothing Los Angeles 290 NULL NULL 2440
39 Roi Yehoshua, 2024
Drill-Down
Break down the data by descending the level in the dimension hierarchy
Example: aggregate the sales by months instead of quarters
40 Roi Yehoshua, 2024
Drill-Down in SQL
Just add another level to the GROUP BY
For example, breaking down monthly sales by location:
SELECT YEAR(date) AS year, MONTH(date) AS month, location, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location;
date item_type location sale_amount year month location total_sales
2022-01-01 Electronics New York 500 2022 1 New York 800
2022-01-02 Clothing New York 300 2022 2 New York 870
2022-02-01 Electronics New York 550 2023 1 Los Angeles 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290
41 Roi Yehoshua, 2024
Slice
View a slice of the cube by fixing some of the dimensions
Example: viewing sales for a specific quarter
42 Roi Yehoshua, 2024
Slice in SQL
Use WHERE to fix one of the dimensions and group by the other dimensions
Example: viewing sales for the year 2022
SELECT item_type, location, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY item_type, location;
43 Roi Yehoshua, 2024
Dice
Select two or more dimensions to get a sub-cube
Example: viewing sales for a specific location, quarter, and item type
44 Roi Yehoshua, 2024
Dice in SQL
Use WHERE to fix two or more dimensions and group by the other dimensions
Example: viewing sales for the year 2022 in Chicago
SELECT item_type, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022 AND location = 'Chicago'
GROUP BY item_type;
45 Roi Yehoshua, 2024
Pivot
Rotate the data axes to view the data from a different perspective
Example: show the item types in the columns instead of the rows
46 Roi Yehoshua, 2024
Pivot in SQL
SQL doesn’t directly support the pivot operation like some specialized OLAP systems
You can achieve a pivot using conditional aggregation
For example, pivoting monthly sales for different item types:
SELECT MONTH(date) AS Month,
SUM(CASE WHEN item_type = 'Electronics' THEN sale_amount ELSE 0 END) AS electronics_sales,
SUM(CASE WHEN item_type = 'Clothing' THEN sale_amount ELSE 0 END) AS clothing_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY MONTH(date);
47 Roi Yehoshua, 2024
Other OLAP Operations
Split
Break out one dimension into two separate dimensions for finer granularity
Drill across
Navigate from one cube to another cube that shares one or more dimensions
Drill-through
Access the detailed data that constitutes a certain higher-level aggregate value
48 Roi Yehoshua, 2024
The CUBE Operator
An extension to the GROUP BY clause
Generates a result set that represents all combinations of all aggregations
In contrast to ROLLUP that represents aggregations from the most detailed level to the total
For each grouping, the result contains NULL for attributes not present in the
grouping
Supported by many relational databases (but not MySQL )
49 Roi Yehoshua, 2024
The CUBE Operator
date location sale_amount year month location total_sales
2022 1 New York 800
2022-01-01 New York 500
2022 2 Los Angeles 870
2022-01-02 New York 300
2022 1 NULL 800
2022-02-01 Los Angeles 550 2022 2 NULL 870
2022-02-03 Los Angeles 320 NULL 1 New York 800
NULL 2 Los Angeles 870
SELECT YEAR(date) AS year, MONTH(date) AS month, 2022 NULL New York 800
location, SUM(sale_amount) AS total_sales 2022 NULL Los Angeles 870
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location WITH CUBE; NULL NULL New York 800
NULL NULL Los Angeles 870
2022 NULL NULL 1670
NULL 1 NULL 800
NULL 2 NULL 870
NULL NULL NULL 2440
50 Roi Yehoshua, 2024
Class Exercise
Compute the total number of student enrollments in each department’s courses
over the years, with subtotals by year and a grand total across all years
51 Roi Yehoshua, 2024
Solution
52 Roi Yehoshua, 2024
Window Functions in SQL
Perform calculations across a set of table rows that are related to the current row
e.g., calculating running totals / moving average
Return a single value for each row from the query
Unlike aggregate functions which return a single value for each group
General syntax:
<window_function>(<arguments>) OVER (
[PARTITION BY <partition_expression(s)>]
[ORDER BY <order_expression(s)>]
[frame_specification]
)
The OVER clause defines the rows on which the window function operates
An optional PARTITION BY clause divides the rows into groups
The ORDER BY clause determines the order of the rows within each partition
Frame specification defines which rows are included in the frame (subset of the current row)
53 Roi Yehoshua, 2024
Example: Moving Average
The AVG() function can be used as a window function to get a moving average
For example, assume that we have a table with daily sales figures
We can calculate a 3-day moving average for sales as follows:
54 Roi Yehoshua, 2024
Frame Specifications
Define the set of rows included in the window relative to the current row
Has two components:
Frame start: where the frame starts relative to the current row
Frame end: where the frame ends relative to the current row
ROWS, RANGE, or GROUPS: determine how the frame boundaries are defined
ROWS: Defines the frame by a specific number of rows
RANGE: Defines the frame by value range
GROUPS: Used with ordered set aggregate functions
Frame boundary:
UNBOUNDED PRECEDING/FOLLOWING: The frame starts/ends at the first row of the partition
N PRECEDING/FOLLOWING: The frame starts/ends N rows before the current row
CURRENT ROW: The frame starts or ends at the current row
55 Roi Yehoshua, 2024
Common Window Functions
Ranking functions
RANK() assigns a rank to each row based on a specified column’s values
DENSE_RANK() is similar to RANK() but without gaps between rank values for rows with the
same rank
ROW_NUMBER() assigns a unique sequential integer to rows
Aggregate functions
Almost all aggregate functions (e.g., SUM() and AVG()) can be used as window functions
Navigation functions
LEAD() returns a value from a subsequent row
LAG() returns a value from a preceding row
FIRST_VALUE() returns a value from the first row of the window frame
LAST_VALUE() returns a value from the last row of the window frame
56 Roi Yehoshua, 2024
Rank Function
The RANK() function assigns a rank to each row of a partition or the entire result set
It should be used with ORDER BY to sort the rows into the desired order
For example, we can rank the students based on their total credits:
57 Roi Yehoshua, 2024
PARTITION BY
PARTITION BY divides the result set into partitions and the window function is
applied to each partition independently
For example, we can rank students by the total credits within each department:
58 Roi Yehoshua, 2024
Class Exercise
For each course in the Fall 2017 semester rank the students based on their grades
If two students have the same grade, they should get the same rank
Expected columns in the result: ‘course_id’, ‘student_id’, ‘grade’, ‘student_rank’
59 Roi Yehoshua, 2024
Solution
60 Roi Yehoshua, 2024