Posted on Jun 27

WINDOW FUNCTIONS FOR DATA ANALYSIS

#postgressql #analytics #data #datascience

They allow you to perform calculations across a set of table rows that are somehow related to the current row.

Window functions can be used to

Rank Rows.
Calculate cumulative totals.
Find the difference between consecutive rows in a dataset.

Window functions return a value for each row while still providing information from the related rows.

ROW_NUMBER ()

Assign a unique row number to each row in the result set.

In a real world scenario it can help us track which order each customer made first, second.......

Assigns a unique number to each row, starting from 1 based on the order specified by the ORDER BY clause.

The number will reset for each position if PARTITION BY is used.

Assign a unique row number to each order based on the order date and we want to reset numbering for each customer

SELECT o.order_id, c.first_name, c.second_name, o.order_date, ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS row_num FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

ROW_NUMBER () : Assigns a unique number to each order.

PARTITION BY o.customer_id : Ensures that the row numbering starts fresh for each customer.

Query will list orders for each customer showing their row number (1,2,3---) in the sequence of orders.

Ranking the orders globally based on order date without resetting the numbering for each customer

What is needed just remove the PARTITION BY Clause

SQL Query Without Resetting Row Number:

SELECT o.order_id, c.first_name, c.second_name, o.order_date, ROW_NUMBER() OVER (ORDER BY o.order_date) AS row_num FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

Without PARTITION BY the numbering is continous for all orders across customers based on their order_date.

RANK() AND DENSE_RANK()

RANK() assigns a rank to each row, with ties getting the same rank but leaving gaps in subsequent ranks.

DENSE_RANK() works similarly but without leaving gaps in the ranking.

RANK() SQL QUERY

Rank the customers based on the total quantity of books as they are ordered

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity, RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS rank FROM orders o JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.first_name, c.second_name;

RANK() assigns a rank based on the SUM(o.quantity) in descending order.

If two customers have the same total quantity ordered they will receive the same rank and the next rank will have a gap. *Two customers rank 1 will result in the next customer being ranked 3rd.

USING DENSE_RANK()

Assigns a rank without gaps for ties.

Assigns a rank to each row but it does not leave gaps in the rankings if there are ties.

Calculate the dense rank of customers based on the total quantity of books they ordered

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity, DENSE_RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS dense_rank FROM orders o JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.first_name, c.second_name ORDER BY dense_rank;

If two customers are tied they will receive the same rank but the next customer will receive the next consecutive rank.

1, 1, next will be 2nd

This key difference in RANK() and DENSE_RANK() is crucial for how you want to treat tied values in your analysis.

LEAD() AND LAG()

LEAD()

LEAD() access next row's value.

It is used to access a row that follows the current row at a specific physical offset.

Generally employed to compare the value of the current row with the value of the next row following the current row.

Compare quantity ordered by each customer in the current row with the quantity ordered in the next row

SELECT o.order_id, o.customer_id, o.quantity, LEAD(o.quantity) OVER (ORDER BY o.order_id) AS next_quantity FROM orders o;

LEAD(o.quantity) allows you to access the quantity of the next row for each customer.

Query gives the quantity ordered by the customer in the current row and the quantity ordered by the same customer in the next row.

For the last row for each customer the next quantity will be NULL because there is no next row.

LAG()

Access previous rows value

It is crucial for analyzing trends or behavior change over time.

Allows you to access data from a previous row within the same result set and is crucial for comparing values in the current row with values in the preceding row.

Operates on partitions created by the PARTITION BY clause.

Compare quantity ordered by each customer in the current row with the quantity ordered in the previous row using the LAG() function

SELECT o.order_id, o.customer_id, o.quantity, LAG(o.quantity) OVER (ORDER BY o.order_id) AS prev_quantity FROM orders o;

LAG(o.quantity) allows you to access the quantity of the previous row for each customer

The query shows the quantity ordered in the current row and the quantity ordered in the previous row for the same customer.

First row previous quantity will be NULL as there is no previous row.

NTILE() FUNCTION

Partitions data into specified number of buckets.

Crucial for data analysis and reporting as it allows users to efficiently distribute rows and analyze data in a structured manner.

We want to divide customers into 2 groups (quartiles) based on their total order quantity.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity, NTILE(2) OVER (ORDER BY SUM(o.quantity) DESC) AS quantity_tile FROM orders o JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.first_name, c.second_name ORDER BY quantity_tile;

NTILE(2) divides customers into two equal groups(quartiles) based on their total quantity ordered.

Tile 1 : Top 50% of customers
Tile 2 : Next 50% of customers

PARTITION BY

Divides result set into partitions to apply window functions independently.

This clause divides the result set into partitions and the window function works independently within each partition.

calculate the total quantity of orders for each customer and the average price of the books ordered by each customer.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity, AVG(b.price) AS avg_price, SUM(o.quantity) OVER (PARTITION BY o.customer_id) AS total_order_quantity FROM orders o JOIN customers c ON o.customer_id = c.customer_id JOIN books b ON o.book_id = b.book_id GROUP BY c.first_name, c.second_name, o.quantity, o.customer_id ORDER BY c.first_name;

SUM(o.quantity) gives total quantity ordered by each customer.

PARTITION BY o.customer_id ensures the total order quantity is calculated for each individual customer.

DEV Community