Cybertec Training: Data Analysis
Hans-Jürgen Schönig
www.postgresql-support.de

Introduction

Scope of this training
- Importing data
- Simple aggregations
- Windowing and analytics
- Analyzing time series
- Managing incomplete data
- Writing custom aggregates

Importing data
Loading data
Things to consider when importing data:
- There are many ways to import data
- Avoid mini-transactions for performance reasons
- In case of large data sets, speed is a major issue
- There is life after importing

Importing a simple data set
A simple data structure . . .

    test=# CREATE TABLE t_test (a int, b int);
    CREATE TABLE

Let us add 10,000 rows now . . .

    INSERT INTO t_test VALUES (1, 2);
    ...
    INSERT INTO t_test VALUES (1, 2);

Using one transaction to import things

    BEGIN;
    INSERT INTO t_test VALUES (1, 2);
    ...
    INSERT INTO t_test VALUES (1, 2);
    COMMIT;

Observations
- Performance can vary depending on hardware
- Longer transactions can be WAY faster
- PostgreSQL has to flush every transaction to disk
- Most of the time is burned by flushing

Changing durability requirements
Performance will skyrocket . . .

    SET synchronous_commit TO off;
    INSERT INTO t_test VALUES (1, 2);
    ...
    INSERT INTO t_test VALUES (1, 2);

The reason is that PostgreSQL no longer has to flush every transaction.
We are trading "durability" for performance.

Use bulk loads
- Loading single rows is usually a bad idea
- Use COPY to do bulk loading
- COPY can load data A LOT faster than INSERT due to significantly smaller overhead

A simple COPY
This time 10 million lines are imported (10,000 rows are not enough):

    COPY t_test FROM stdin;
    1    2
    1    2
    ...
    \.

Note the performance difference (rows per second). There is no need to check
column lists, the existence of the table, etc. anymore -> higher throughput.

COPY: Observations to be made
- In the default configuration (checkpoint_segments = 3) you will see a steady up and down of I/O speed
- This is caused by checkpoints happening in the background
- Data has to go to the transaction log to "repair" data files in case of a crash
- Performance is limited by writing data twice
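As a rough sketch of how those I/O bursts could be smoothed out (assuming an older
release where checkpoint_segments still exists; newer versions use max_wal_size
instead), one might allow more WAL between checkpoints in postgresql.conf:

    # hypothetical postgresql.conf fragment - values are examples only
    checkpoint_segments = 32            # pre-9.5 parameter; replaced by max_wal_size later
    checkpoint_completion_target = 0.9  # spread checkpoint I/O over a longer period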
COPY: Using a transaction log bypass
Writing to the transaction log can be avoided in some cases:

    BEGIN;
    TRUNCATE t_test;
    COPY t_test FROM stdin;
    ...
    COMMIT;

COPY: Why the bypass works
- TRUNCATE will schedule the removal of the data file on COMMIT
- COPY will start writing to a new data file
- Concurrency is not an issue because TRUNCATE locks the table
- PostgreSQL can take the old or the new data file on COMMIT or ROLLBACK
- There is no need to actually repair a data file anymore

COPY: More on WAL-bypassing

    BEGIN;
    CREATE TABLE ...

WAL bypassing only works if you are not using streaming replication.
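A minimal sketch of the same bypass with a freshly created table (the table name
and file path are made up for illustration):

    BEGIN;
    CREATE TABLE t_import (a int, b int);   -- created inside the same transaction
    COPY t_import FROM '/tmp/file.txt';     -- so the data itself needs no WAL
    COMMIT;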
Freezing rows
Before you import data, run . . .

    ALTER TABLE t_test SET (autovacuum_enabled = off);

Then compare the timing of the first

    SELECT count(*) FROM t_test;

with the second one.

Observations
- The first run is a lot slower
- During the first run, writes will happen
- No more writes from the second run on
- PostgreSQL sets bits in the background

The purpose of hint bits
When a row is accessed for the first time, PostgreSQL checks whether it can be seen
by everybody. A hint bit is set so that PostgreSQL does not have to go through the
expensive visibility check again next time. This is an issue for big data sets.
Fixing after-import performance
To set hint bits straight away, do a

    test=# COPY t_test FROM '/tmp/file.txt' FREEZE;

Be careful, it only works in certain cases . . .

    ERROR: cannot perform FREEZE because the table was not created
           or truncated in the current subtransaction

VACUUM and hint bits
- Bits can be set on an entire block as well (not just on rows)
- VACUUM will set those hint bits
- However, block bits usually do not speed things up as much as row-level bits
- For a heavily used read-only database system, vacuuming data can actually make sense (not to reclaim space)
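A minimal sketch of such a maintenance run on the demo table (FREEZE additionally
marks all rows as visible to everybody right away):

    VACUUM FREEZE t_test;   -- sets visibility/hint bits so later reads stay read-only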
Importing some test data
Here is some test data:

    CREATE TABLE t_oil (country text, year int, production int);
    COPY t_oil FROM PROGRAM 'curl www.cybertec.at/secret/oil.txt';

Simple aggregations

Basic aggregation

    test=# SELECT country, avg(production) FROM t_oil GROUP BY 1;
        country    |          avg
    ---------------+-----------------------
     USA           | 9141.3478260869565217
     Saudi Arabien | 7641.8260869565217391
    (2 rows)

GROUP BY is needed for aggregates
A GROUP BY clause is needed because otherwise groups cannot be built:

    test=# SELECT country, avg(production) FROM t_oil;
    ERROR:  column "t_oil.country" must appear in the GROUP BY clause
            or be used in an aggregate function
    LINE 1: SELECT country, avg(production) FROM t_oil;

HAVING: Filtering on aggregated data

    test=# SELECT country, avg(production)
           FROM t_oil
           GROUP BY 1
           HAVING avg(production) > 8000;
     country |          avg
    ---------+-----------------------
     USA     | 9141.3478260869565217
    (1 row)

NOTE that an alias is not allowed in a HAVING clause.

Windowing and analytics
The purpose of windowing
An analogy: your car is not valuable because it is good; it is valuable because it is
better than the ones driven by your friends. This is what windowing does: it puts the
current row in relation to all rows in the reference group.

Windowing vs GROUP BY
- GROUP BY has been designed to reduce the amount of data and turn it into aggregated values
- Windowing is used to compare values and put them into relation
- Windowing is used along with aggregate functions (e.g. sum, count, avg, min, max, . . . )

A simple aggregate: Average values

    SELECT *, avg(production) OVER () FROM t_oil;
     country | year | production |      avg
    ---------+------+------------+---------------
     USA     | 1965 |       9014 | 8391.58695652
     USA     | 1966 |       9579 | 8391.58695652
     USA     | 1967 |      10219 | 8391.58695652
     USA     | 1968 |      10600 | 8391.58695652
     USA     | 1969 |      10828 | 8391.58695652
    ...

What does the result mean?
"Give me all rows and the average 'over' all rows in the table."
Logically it is the same as . . .

    SELECT *, (SELECT avg(production) FROM t_oil) AS avg FROM t_oil;

However, subselects can be very nasty if the task is a more complex one.

OVER()-clauses can define order
Calculate the maximum production up to a certain point:

    SELECT *, max(production) OVER (ORDER BY year)
    FROM t_oil
    WHERE country = 'Saudi Arabien';

Saudi Arabia is a so-called "swing producer". Note that max stays up even if
production declines.
OVER()-clauses can form groups
Averages for each country:

    SELECT *, avg(production) OVER (PARTITION BY country) FROM t_oil;
        country    | year | production |      avg
    ---------------+------+------------+---------------
     Saudi Arabien | 1965 |       2219 | 7641.82608695
     Saudi Arabien | 1966 |       2615 | 7641.82608695
    ...
     USA           | 1965 |       9014 | 9141.34782608
     USA           | 1966 |       9579 | 9141.34782608
    ...

Forming groups
- Data is split into groups
- Each row shows the average of all rows in its group
- Note that we got one group (= window) per country

OVER() can contain order and groups

    SELECT *, max(production) OVER (PARTITION BY country ORDER BY year)
    FROM t_oil;

In this case we get the maximum up to a given point; this is done for each country.

Abstracting window-clauses

    SELECT *, min(production) OVER (w),
              max(production) OVER (w),
              count(production) OVER (w)
    FROM t_oil
    WINDOW w AS (PARTITION BY country ORDER BY year);

- The same clause can be used for many columns
- Many window-clauses may exist (w, w2, w3, etc.)

rank() and dense_rank()
- Data can be ranked according to some order
- In case of duplicates, rank() gives 1, 2, 2, 2, 5
- dense_rank() gives 1, 2, 2, 2, 3
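A minimal sketch against the t_oil data, showing both functions side by side:

    SELECT year, production,
           rank()       OVER (ORDER BY production DESC),
           dense_rank() OVER (ORDER BY production DESC)
    FROM t_oil
    WHERE country = 'USA';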
Moving rows: lag
- ORDER BY defines the direction in which to "move" the row
- The number defines the offset

    SELECT *, lag(production, 1) OVER (ORDER BY year)
    FROM t_oil
    WHERE country = 'USA';
     country | year | production |  lag
    ---------+------+------------+-------
     USA     | 1965 |       9014 |
     USA     | 1966 |       9579 |  9014
     USA     | 1967 |      10219 |  9579
     USA     | 1968 |      10600 | 10219

Calculating the change in production
This is a very easy thing to do now:

    SELECT *, production - lag(production, 1) OVER (ORDER BY year)
    FROM t_oil
    WHERE country = 'USA';
     country | year | production | ?column?
    ---------+------+------------+----------
     USA     | 1965 |       9014 |
     USA     | 1966 |       9579 |      565
     USA     | 1967 |      10219 |      640
     USA     | 1968 |      10600 |      381
lead is the opposite of lag
- lag is the same as lead(. . . , -1)
- lag pushes elements down
- lead pushes elements up
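A minimal sketch, mirroring the lag example above:

    SELECT *, lead(production, 1) OVER (ORDER BY year)   -- next year's production
    FROM t_oil
    WHERE country = 'USA';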
Moving entire rows

    SELECT *, lag(t_oil, 1) OVER (ORDER BY year)
    FROM t_oil
    WHERE country = 'USA';
     country | year | production |       lag
    ---------+------+------------+------------------
     USA     | 1965 |       9014 |
     USA     | 1966 |       9579 | (USA,1965,9014)
     USA     | 1967 |      10219 | (USA,1966,9579)
     USA     | 1968 |      10600 | (USA,1967,10219)

The composite type can then be dissected by wrapping the current query in a
subselect, as sketched below.
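A minimal sketch of such a wrapping subselect (the alias l is made up for
illustration; the parentheses around l are required to reach into the composite):

    SELECT year, production, (l).production AS prev_production
    FROM (SELECT *, lag(t_oil, 1) OVER (ORDER BY year) AS l
          FROM t_oil
          WHERE country = 'USA') AS x;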
Works for more than just one column

    SELECT *, lag((year, production), 1) OVER (ORDER BY year)
    FROM t_oil
    WHERE country = 'USA';
     country | year | production |     lag
    ---------+------+------------+--------------
     USA     | 1965 |       9014 |
     USA     | 1966 |       9579 | (1965,9014)
     USA     | 1967 |      10219 | (1966,9579)
     USA     | 1968 |      10600 | (1967,10219)

This is the perfect foundation to build custom aggregates to solve complex problems.

Splitting data into equal parts
- ntile can split your data into n equally sized blocks
- ntile(4) will therefore give you a nice quantile distribution
- An ORDER BY is needed to achieve that

Here is how it works . . .

    SELECT year, production, ntile(4) OVER (ORDER BY production)
    FROM t_oil
    WHERE country = 'USA'
    ORDER BY 3, 2 DESC;
     year | production | ntile
    ------+------------+-------
     2000 |       7733 |     1
     1999 |       7731 |     1
    ...
     1966 |       9579 |     2
     1989 |       9159 |     2
    ...
     1972 |      11185 |     4

Work can proceed from there

    SELECT ntile, min(production), max(production)
    FROM (SELECT year, production, ntile(4) OVER (ORDER BY production)
          FROM t_oil
          WHERE country = 'USA') AS x
    GROUP BY 1
    ORDER BY 1;

The query returns nice quantiles

     ntile |  min  |  max
    -------+-------+-------
         1 |  6734 |  7733
         2 |  8011 |  9579
         3 |  9736 | 10231
         4 | 10247 | 11297
    (4 rows)

Moving averages
- More sophisticated frame-clauses are needed
- The average is done for 2 years (= current + previous one)

    SELECT *, avg(production) OVER (ORDER BY year
                  ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING)
    FROM t_oil
    WHERE country = 'Saudi Arabien';
        country    | year | production |    avg
    ---------------+------+------------+------------
     Saudi Arabien | 1965 |       2219 |  2219.0000
     Saudi Arabien | 1966 |       2615 |  2417.0000
     Saudi Arabien | 1967 |       2825 |  2720.0000

Combining joins, aggregates, and windowing
Combining data
To combine data we need to import some more data:

    CREATE TABLE t_president (name text, start_year int,
                              end_year int, party text);

Some input data
A list of American presidents and their terms:

    test=# COPY t_president FROM PROGRAM 'curl www.cybertec.at/secret/president.txt';
    COPY 9

The format is not too nice for analysis.
Input data: American presidents

    SELECT * FROM t_president;
           name        | start_year | end_year |   party
    -------------------+------------+----------+------------
     Lyndon B. Johnson |       1963 |     1969 | Democrat
     Richard M. Nixon  |       1969 |     1974 | Republican
     Gerald Ford       |       1974 |     1977 | Republican
     Jimmy Carter      |       1977 |     1981 | Democrat
     Ronald W. Reagan  |       1981 |     1989 | Republican
     George H. W. Bush |       1989 |     1993 | Republican
     Bill Clinton      |       1993 |     2001 | Democrat
     George W. Bush    |       2001 |     2009 | Republican
     Barack Obama      |       2009 |     2017 | Democrat

The challenge: Adjust the format
LATERAL can come to the rescue:

    SELECT name, party, year
    FROM t_president AS x,
         LATERAL (SELECT *
                  FROM generate_series(x.start_year, x.end_year - 1) AS year) AS y
    LIMIT 8;

The output is:

           name        |   party    | year
    -------------------+------------+------
     Lyndon B. Johnson | Democrat   | 1963
     Lyndon B. Johnson | Democrat   | 1964
     Lyndon B. Johnson | Democrat   | 1965
     Lyndon B. Johnson | Democrat   | 1966
     Lyndon B. Johnson | Democrat   | 1967
     Lyndon B. Johnson | Democrat   | 1968
     Richard M. Nixon  | Republican | 1969
     Richard M. Nixon  | Republican | 1970
     Richard M. Nixon  | Republican | 1971
     Richard M. Nixon  | Republican | 1972
Which party is better for oil?
- The following way to solve the problem is definitely not the only one
- There may well be factors other than the president's party behind this kind of data
- Keep in mind: it is just an SQL exercise
Putting things together (1)

    CREATE VIEW v AS
        WITH b AS (SELECT name, party, year
                   FROM t_president AS x,
                        LATERAL (SELECT *
                                 FROM generate_series(x.start_year,
                                                      x.end_year - 1) AS year) AS y)
        SELECT a.*, party,
               production - lag(production, 1) OVER (ORDER BY a.year) AS lag
        FROM t_oil AS a, b
        WHERE a.year = b.year
              AND country = 'USA';

What we got so far

    SELECT * FROM v;
     country | year | production |   party    | lag
    ---------+------+------------+------------+------
     USA     | 1965 |       9014 | Democrat   |
     USA     | 1966 |       9579 | Democrat   |  565
     USA     | 1967 |      10219 | Democrat   |  640
     USA     | 1968 |      10600 | Democrat   |  381
     USA     | 1969 |      10828 | Republican |  228
     USA     | 1970 |      11297 | Republican |  469

Making use of NULL
- Remember: NULL is ignored by aggregate functions
- We can use that to do "partial counts"

    SELECT party, lag,
           CASE WHEN lag > 0 THEN 1 END AS up,
           CASE WHEN lag < 0 THEN 1 END AS down
    FROM v
    ORDER BY year;

Which gives us . . .

       party    | lag  | up | down
    ------------+------+----+------
     Democrat   |      |    |
     Democrat   |  565 |  1 |
     Democrat   |  640 |  1 |
     Democrat   |  381 |  1 |
     Republican |  228 |  1 |
     Republican |  469 |  1 |
     Republican | -141 |    |    1
     Republican |   29 |  1 |
     Republican | -239 |    |    1
     Republican | -485 |    |    1

We can move on from there easily

    SELECT party,
           count(CASE WHEN lag > 0 THEN 1 END) AS up,
           count(CASE WHEN lag < 0 THEN 1 END) AS down
    FROM v
    GROUP BY party;
       party    | up | down
    ------------+----+------
     Democrat   |  9 |    8
     Republican | 10 |   18
    (2 rows)

Handling missing data

Preparing our sample data

    test=# UPDATE t_oil
           SET production = NULL
           WHERE year IN (1998, 1999)
                 AND country = 'USA'
           RETURNING *;
     country | year | production
    ---------+------+------------
     USA     | 1998 |
     USA     | 1999 |
    (2 rows)
Challenges ahead
- How can we make lead and lag work again?
- How can we fill the gaps?
- How can we control this behavior in a more efficient way?

Turning to frame-clauses once again
- One idea is to just use the average of some previous values
- However, you might also want to turn to interpolation or outright guesswork
- A custom aggregate might help
A "lazy" idea
- Create an array with some historic values
- Apply a function to this array

    SELECT year, production,
           array_agg(production) OVER (ORDER BY year
                  ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
    FROM t_oil
    WHERE country = 'USA';

Which gives us . . .

    ... snip ...
     1995 |       8322 | {8868,8583,8389,8322}
     1996 |       8295 | {8583,8389,8322,8295}
     1997 |       8269 | {8389,8322,8295,8269}
     1998 |            | {8322,8295,8269,NULL}
     1999 |            | {8295,8269,NULL,NULL}
     2000 |       7733 | {8269,NULL,NULL,7733}
     2001 |       7669 | {NULL,NULL,7733,7669}
     2002 |       7626 | {NULL,7733,7669,7626}
     2003 |       7400 | {7733,7669,7626,7400}
    ... snip ...

Applying a function
A simple function could look like this:

    SELECT avg(x)
    FROM unnest('{8295,8269,NULL,NULL}'::int4[]) AS x;
              avg
    -----------------------
     8282.0000000000000000
    (1 row)

A query could therefore look like this

    SELECT *, (SELECT avg(x) FROM unnest(array_agg) AS x)
    FROM (SELECT year, production,
                 array_agg(production) OVER (ORDER BY year
                        ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
          FROM t_oil
          WHERE country = 'USA') AS y
    OFFSET 32 LIMIT 4;
     year | production |       array_agg       |     avg
    ------+------------+-----------------------+-------------
     1997 |       8269 | {8389,8322,8295,8269} | 8318.750000
     1998 |            | {8322,8295,8269,NULL} | 8295.333333
     1999 |            | {8295,8269,NULL,NULL} | 8282.000000
     2000 |       7733 | {8269,NULL,NULL,7733} | 8001.000000
Defining an aggregate
- Defining an aggregate is really the more desirable way
- It is far cleaner
- CREATE AGGREGATE is your friend
A simple example
The aggregate can be created like this:

    CREATE FUNCTION my_final(int[]) RETURNS numeric AS
    $$
        SELECT avg(x) FROM unnest($1) AS x;
    $$ LANGUAGE sql;

    CREATE AGGREGATE artificial_avg(int)
    (
        SFUNC = array_append,
        STYPE = int[],
        INITCOND = '{}',
        FINALFUNC = my_final
    );

Using our new aggregate

    SELECT year, production,
           artificial_avg(production) OVER (ORDER BY year
                  ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
    FROM t_oil
    WHERE country = 'USA';

The aggregate can be used just like any other aggregate in the system.
Finally

Thank you for your attention
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
www.postgresql-support.de
Hans-Jürgen Schönig
