10 Reasons To Start Your Analytics Project with PostgreSQL Satoshi Nagayasu @snaga HKOSCon 2016
Agenda • Collecting Data / Database Federation • Building Data Warehouse and Data Mart • Writing Queries / SQL Features • Performance • In-Database Analytics
Collecting Data / Database Federation Foreign Data Wrapper Unlogged Table
Foreign Data Wrapper • Connects external data sources (RDBMS, NoSQL, files, etc) to the PostgreSQL executor. • Allows SELECT/INSERT/UPDATE/DELETE operations for external tables. PostgreSQL Oracle MySQL HDFS https://wiki.postgresql.org/wiki/Foreign_data_wrappers
Unlogged Table • Does not record XLOG. • Has better performance compared to regular table. • Will be truncated after crash recovery. http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html
Building Data Warehouse and Data Mart Materialized Views Transactional DDLs
Materialized View • Defines a view with caching records. • Allows to avoid running complicated queries and aggregations every time. • Requires updating cache by the users. Table View Table Table Materialized View Table Query Query Cache
Transactional DDLs • Most of DDLs can be performed in transaction in PostgreSQL. • Schema can be modified with keeping atomicity even online. (commit or rollback) • Transactional DDLs would help DBAs manage their schema easier.
Writing Queries / SQL Features Rich SQL features Compatibility with SQL standard
Writing Queries / SQL Features • Rich SQL features – Subqueries – WITH clauses (Common Table Expressions, CTEs) – Many aggregation functions – Window functions • JSON support • Compatibility with the SQL standard
WITH clause • Defines a temporary table for a query. • May make a better performance compared to using the same subquery more than once. WITH foo AS ( SELECT ... FROM ... GROUP BY ... ) SELECT ... FROM foo WHERE ... UNION ALL SELECT ... FROM foo WHERE ...; https://www.postgresql.org/docs/9.5/static/queries-with.html
Many Aggregations • New in 9.4 – percentile_cont() – percentile_disc() – mode() – rank() – dense_rank() – percent_rank() – cume_dist() • New in 9.5 – ROLLUP() – CUBE() – GROUPING SETS() https://www.postgresql.org/docs/9.5/static/functions-aggregate.html
ROLLUP • Calculates total/subtotal values
CUBE • Calculates for all combinations of the specified columns
GROUPING SETS • Runs multiple GROUP BY queries at once Two GROUP BYs at once.
JSON data type testdb=# create table t1 ( j jsonb ); CREATE TABLE testdb=# insert into t1 values ('{ "key1": "value1", "key2": "value2" }'); INSERT 0 1 testdb=# select * from t1; j -------------------------------------- {"key1": "value1", "key2": "value2"} (1 row) testdb=# select j->>'key2' key2 from t1; key2 -------- value2 (1 row)
JSON data type testdb=# select n_nationkey,n_name from nation where n_nationkey = 12; n_nationkey | n_name -------------+--------------------------- 12 | JAPAN (1 row) testdb=# select jsonb_build_object('n_nationkey', n_nationkey, 'n_name', n_name) from nation where n_nationkey = 12; jsonb_build_object ------------------------------------------------------------ {"n_name": "JAPAN ", "n_nationkey": 12} (1 row)
JSON data type Operator Description 9.4 -> Get an element by key as a JSON object ->> Get an element by key as a text object #> Get an element by path as a JSON object #>> Get an element by path as a text object <@, @> Evaluate whether a JSON object contains a key/value pair ? Evaluate whether a JSON object contains a key or a value ?| Evaluate whether a JSON object contains ANY of keys or values ?& Evaluate whether a JSON object contains ALL of keys or values 9.5 || Insert or Update an element to a JSON object - Delete an element by key from a JSON object #- Delete an element by path from a JSON object http://www.postgresql.org/docs/9.5/static/functions-json.html
JSON data type • Allows to collect data without defining schema. • “Schema-less”, “Schema on Read” or “Schema- later”. • Still accessible with SQL. JSON Data Type Fluentd pg-Json plugin View (Schema) App App Fluentd
Performance 3 types of Join Full text search (n-gram) Table Partition BRIN Index Table Sample Parallel Queries
3 types of Join • Nested Loop (NL) Join – Works good when joining small number of records between tables with indexes. • Merge Join • Hash Join – Works better than NL when joining large number of records between large tables.
Full-text search (n-gram) • Splits a text into N-char tokens and build an index. – Pg_trgm: Tri-gram (3-char) – Pg_bigm: Bi-gram (2-char) • CJK has lots of 2-char words, so Bi-gram may be useful rather than Tri-gram. – CJK: Chinese, Japanese and Korean. Pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html Pg_bigm: http://pgbigm.osdn.jp/index_en.html
Pg_bigm performance • Wikipedia title data (2,789,266 records) – https://dumps.wikimedia.org/zhwiki/20160601/ – zhwiki-20160601-pages-articles-multistream-index.txt.bz2 zhwikidb=> select * from zhwiki_index where title like '%香港%'; id1 | id2 | title ----------+-------+---------------------------------------- 5693863 | 2087 | 香港特別行政區基本法第二十三條 11393231 | 4323 | 香港特别行政区 12830042 | 5085 | 香港大学列表 14349335 | 6088 | 香港行政区划 14349335 | 6090 | 香港行政區劃 14349335 | 6091 | 香港十八区 14349335 | 6092 | 香港十八區 16084672 | 7168 | 香港兒童文學作家 18110426 | 8206 | 北區 (香港) 18110426 | 8236 | 東區 (香港) 19537078 | 9528 | 香港專業教育學院 19537078 | 9567 | 香港中文大學
Pg_bigm performance Aggregate (actual time=481.512..481.541 rows=1 loops=1) -> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1) Filter: (title ~~ '%香港電影%'::text) Rows Removed by Filter: 2788949 Planning time: 0.125 ms Execution time: 481.654 ms (6 rows) select count(*) from zhwiki_index where title like '%香港電影%';
Pg_bigm performance Aggregate (actual time=1.790..1.792 rows=1 loops=1) -> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317 loops=1) Recheck Cond: (title ~~ '%香港電影%'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=191 -> Bitmap Index Scan on zhwiki_index_title_idx (actual time=0.258..0.258 rows=318 loops=1) Index Cond: (title ~~ '%香港電影%'::text) Planning time: 0.103 ms Execution time: 1.833 ms (9 rows) select count(*) from zhwiki_index where title like '%香港電影%'; 481.6ms → 1.8ms. 200x faster than a regular LIKE.
Table Partition • Table Partitioning by Range or List – Called “Constraint Exclusion” • Does not scan unnecessary partitions – Determined by the “constraints”. • Is able to eliminate “full table scan” for large tables entirely. https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
BRIN Index • Block Range INdex (New in 9.5) – Holds "summary“ data, instead of raw data. – Reduces index size tremendously. – Also reduces creation/maintenance cost. – Needs extra tuple fetch to get the exact record. 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN Elapsedtime(ms) Index Creation 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN NumberofBlocks Index Size 0 2 4 6 8 10 12 14 16 18 Btree BRIN Elapsedtime(ms) Select 1 record https://gist.github.com/snaga/82173bd49749ccf0fa6c
BRIN Index • Structure of BRIN Index Table File Block Range 1 (128 Blocks) Block Range 2 Block Range 3 Block Range Min. Value Max. Value 1 1992-01-02 1992-01-28 2 1992-01-27 1992-02-08 3 1992-02-08 1992-02-16 … … … Holds only min/max values for “Block Ranges”, 128 blocks each. (in case a date column)
TABLESAMPLE • Allows to get approximate results for aggregations by sampling. • BERNOULLI – Accurate – Sample by Tuple • SYSTEM – Performance – Sample by Block http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
TABLESAMPLE • Calculating the average of total price. – The actual value and the approximate ones
TABLESAMPLE Without TABLESAMPLE 1787ms SYSTEM Sampl. 22ms BERNOULLI Sampl. 405ms
Parallel Queries • The leader process cooperates with those worker processes for: – Sequential scan – Joins (Nested Loop & Hash) – Aggregations • Will be shipped with 9.6 – 9.6 is beta2 as of today Leader Worker Worker Client Data Read & Examine Query Result Launch & Gather
Parallel Aggregation Performance & Scalability • count(*) on 30M rows – Shows a good parallel scalability
In-Database Analytics User Defined Functions Apache MADlib
In-Database Analytics • In-Database Analytics? – Performs analytics workload in the database without pulling the data out of the server. • Advantages of In-Database Analytics – No need to move “BigData” between server and client for analytics. – Higher performance hardware resources (CPU, memory, storage) compared to client PCs.
In-Database Analytics • User defined functions – PL/Python, PL/R, PL/v8, ... or C lang. – Allow you to run (almost) any logics within the database. • Apache MADlib – Machine Learning Library for PostgreSQL
UDF by Python CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu;
UDF by Python CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu; testdb=# select * from dumpenv() order by 1 limit 10; column1 | column2 --------------------+----------------------- G_BROKEN_FILENAMES | 1 HISTCONTROL | ignoredups HISTSIZE | 1000 HOME | /home/snaga HOSTNAME | localhost.localdomain LANG | ja_JP.UTF-8 LC_COLLATE | C LC_CTYPE | C LC_MESSAGES | C LC_MONETARY | C (10 rows)
Apache MADlib • An Open Source Machine Learning Library – Can run in PostgreSQL, Greenplum Database and Apache HAWQ. – Supports many ML algorithms. http://madlib.incubator.apache.org/
Others Strict type checking and constraints. Industry Standard Interface (for BI tools)
Others • Strict type checking and constraints. – Avoid “Garbage in, garbage out.” • Industry Standard Interface (for BI tools) – ODBC, JDBC
Summary • PostgreSQL has already had lots of features that help your analytics project – In terms of productivity and performance. • And more “BigData” features are coming in the future release. – Parallel query must be a big-shot. • Let’s start your analytic project with PostgreSQL and join our community.  – PostgreSQL 9.6 beta2 is available now!
Resources • http://www.postgresql.org • http://wiki.postgresql.org • http://planet.postgresql.org • http://pgcon.org
pgDay Asia 2016 • pgDay Asia 2016 / FOSSASIA 2016 – March 17-19 in Singapore • Speakers: – 19+ speakers from 9 countries • Sessions: – 19 Regular Sessions. – Plus, lightning talks • Attendees: – Around 100 attendees
pgDay Asia 2017 • FOSSASIA 2017 (March, 2017) – Probably, the same format, in the same season, in the same region. • Do not miss the next one! – Will be better and bigger.  • Join us at: – http://pgday.asia – https://www.facebook.com/pgdayasia
Q&A

10 Reasons to Start Your Analytics Project with PostgreSQL

  • 1.
    10 Reasons ToStart Your Analytics Project with PostgreSQL Satoshi Nagayasu @snaga HKOSCon 2016
  • 2.
    Agenda • Collecting Data/ Database Federation • Building Data Warehouse and Data Mart • Writing Queries / SQL Features • Performance • In-Database Analytics
  • 3.
    Collecting Data /Database Federation Foreign Data Wrapper Unlogged Table
  • 4.
    Foreign Data Wrapper •Connects external data sources (RDBMS, NoSQL, files, etc) to the PostgreSQL executor. • Allows SELECT/INSERT/UPDATE/DELETE operations for external tables. PostgreSQL Oracle MySQL HDFS https://wiki.postgresql.org/wiki/Foreign_data_wrappers
  • 5.
    Unlogged Table • Doesnot record XLOG. • Has better performance compared to regular table. • Will be truncated after crash recovery. http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html
  • 6.
    Building Data Warehouse andData Mart Materialized Views Transactional DDLs
  • 7.
    Materialized View • Definesa view with caching records. • Allows to avoid running complicated queries and aggregations every time. • Requires updating cache by the users. Table View Table Table Materialized View Table Query Query Cache
  • 8.
    Transactional DDLs • Mostof DDLs can be performed in transaction in PostgreSQL. • Schema can be modified with keeping atomicity even online. (commit or rollback) • Transactional DDLs would help DBAs manage their schema easier.
  • 9.
    Writing Queries /SQL Features Rich SQL features Compatibility with SQL standard
  • 10.
    Writing Queries /SQL Features • Rich SQL features – Subqueries – WITH clauses (Common Table Expressions, CTEs) – Many aggregation functions – Window functions • JSON support • Compatibility with the SQL standard
  • 11.
    WITH clause • Definesa temporary table for a query. • May make a better performance compared to using the same subquery more than once. WITH foo AS ( SELECT ... FROM ... GROUP BY ... ) SELECT ... FROM foo WHERE ... UNION ALL SELECT ... FROM foo WHERE ...; https://www.postgresql.org/docs/9.5/static/queries-with.html
  • 12.
    Many Aggregations • Newin 9.4 – percentile_cont() – percentile_disc() – mode() – rank() – dense_rank() – percent_rank() – cume_dist() • New in 9.5 – ROLLUP() – CUBE() – GROUPING SETS() https://www.postgresql.org/docs/9.5/static/functions-aggregate.html
  • 13.
  • 14.
    CUBE • Calculates forall combinations of the specified columns
  • 15.
    GROUPING SETS • Runsmultiple GROUP BY queries at once Two GROUP BYs at once.
  • 16.
    JSON data type testdb=#create table t1 ( j jsonb ); CREATE TABLE testdb=# insert into t1 values ('{ "key1": "value1", "key2": "value2" }'); INSERT 0 1 testdb=# select * from t1; j -------------------------------------- {"key1": "value1", "key2": "value2"} (1 row) testdb=# select j->>'key2' key2 from t1; key2 -------- value2 (1 row)
  • 17.
    JSON data type testdb=#select n_nationkey,n_name from nation where n_nationkey = 12; n_nationkey | n_name -------------+--------------------------- 12 | JAPAN (1 row) testdb=# select jsonb_build_object('n_nationkey', n_nationkey, 'n_name', n_name) from nation where n_nationkey = 12; jsonb_build_object ------------------------------------------------------------ {"n_name": "JAPAN ", "n_nationkey": 12} (1 row)
  • 18.
    JSON data type OperatorDescription 9.4 -> Get an element by key as a JSON object ->> Get an element by key as a text object #> Get an element by path as a JSON object #>> Get an element by path as a text object <@, @> Evaluate whether a JSON object contains a key/value pair ? Evaluate whether a JSON object contains a key or a value ?| Evaluate whether a JSON object contains ANY of keys or values ?& Evaluate whether a JSON object contains ALL of keys or values 9.5 || Insert or Update an element to a JSON object - Delete an element by key from a JSON object #- Delete an element by path from a JSON object http://www.postgresql.org/docs/9.5/static/functions-json.html
  • 19.
    JSON data type •Allows to collect data without defining schema. • “Schema-less”, “Schema on Read” or “Schema- later”. • Still accessible with SQL. JSON Data Type Fluentd pg-Json plugin View (Schema) App App Fluentd
  • 20.
    Performance 3 types ofJoin Full text search (n-gram) Table Partition BRIN Index Table Sample Parallel Queries
  • 21.
    3 types ofJoin • Nested Loop (NL) Join – Works good when joining small number of records between tables with indexes. • Merge Join • Hash Join – Works better than NL when joining large number of records between large tables.
  • 22.
    Full-text search (n-gram) •Splits a text into N-char tokens and build an index. – Pg_trgm: Tri-gram (3-char) – Pg_bigm: Bi-gram (2-char) • CJK has lots of 2-char words, so Bi-gram may be useful rather than Tri-gram. – CJK: Chinese, Japanese and Korean. Pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html Pg_bigm: http://pgbigm.osdn.jp/index_en.html
  • 23.
    Pg_bigm performance • Wikipediatitle data (2,789,266 records) – https://dumps.wikimedia.org/zhwiki/20160601/ – zhwiki-20160601-pages-articles-multistream-index.txt.bz2 zhwikidb=> select * from zhwiki_index where title like '%香港%'; id1 | id2 | title ----------+-------+---------------------------------------- 5693863 | 2087 | 香港特別行政區基本法第二十三條 11393231 | 4323 | 香港特别行政区 12830042 | 5085 | 香港大学列表 14349335 | 6088 | 香港行政区划 14349335 | 6090 | 香港行政區劃 14349335 | 6091 | 香港十八区 14349335 | 6092 | 香港十八區 16084672 | 7168 | 香港兒童文學作家 18110426 | 8206 | 北區 (香港) 18110426 | 8236 | 東區 (香港) 19537078 | 9528 | 香港專業教育學院 19537078 | 9567 | 香港中文大學
  • 24.
    Pg_bigm performance Aggregate (actualtime=481.512..481.541 rows=1 loops=1) -> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1) Filter: (title ~~ '%香港電影%'::text) Rows Removed by Filter: 2788949 Planning time: 0.125 ms Execution time: 481.654 ms (6 rows) select count(*) from zhwiki_index where title like '%香港電影%';
  • 25.
    Pg_bigm performance Aggregate (actualtime=1.790..1.792 rows=1 loops=1) -> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317 loops=1) Recheck Cond: (title ~~ '%香港電影%'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=191 -> Bitmap Index Scan on zhwiki_index_title_idx (actual time=0.258..0.258 rows=318 loops=1) Index Cond: (title ~~ '%香港電影%'::text) Planning time: 0.103 ms Execution time: 1.833 ms (9 rows) select count(*) from zhwiki_index where title like '%香港電影%'; 481.6ms → 1.8ms. 200x faster than a regular LIKE.
  • 26.
    Table Partition • TablePartitioning by Range or List – Called “Constraint Exclusion” • Does not scan unnecessary partitions – Determined by the “constraints”. • Is able to eliminate “full table scan” for large tables entirely. https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
  • 27.
    BRIN Index • BlockRange INdex (New in 9.5) – Holds "summary“ data, instead of raw data. – Reduces index size tremendously. – Also reduces creation/maintenance cost. – Needs extra tuple fetch to get the exact record. 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN Elapsedtime(ms) Index Creation 0 50,000 100,000 150,000 200,000 250,000 300,000 Btree BRIN NumberofBlocks Index Size 0 2 4 6 8 10 12 14 16 18 Btree BRIN Elapsedtime(ms) Select 1 record https://gist.github.com/snaga/82173bd49749ccf0fa6c
  • 28.
    BRIN Index • Structureof BRIN Index Table File Block Range 1 (128 Blocks) Block Range 2 Block Range 3 Block Range Min. Value Max. Value 1 1992-01-02 1992-01-28 2 1992-01-27 1992-02-08 3 1992-02-08 1992-02-16 … … … Holds only min/max values for “Block Ranges”, 128 blocks each. (in case a date column)
  • 29.
    TABLESAMPLE • Allows toget approximate results for aggregations by sampling. • BERNOULLI – Accurate – Sample by Tuple • SYSTEM – Performance – Sample by Block http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
  • 30.
    TABLESAMPLE • Calculating theaverage of total price. – The actual value and the approximate ones
  • 31.
  • 32.
    Parallel Queries • Theleader process cooperates with those worker processes for: – Sequential scan – Joins (Nested Loop & Hash) – Aggregations • Will be shipped with 9.6 – 9.6 is beta2 as of today Leader Worker Worker Client Data Read & Examine Query Result Launch & Gather
  • 33.
    Parallel Aggregation Performance &Scalability • count(*) on 30M rows – Shows a good parallel scalability
  • 34.
    In-Database Analytics User DefinedFunctions Apache MADlib
  • 35.
    In-Database Analytics • In-DatabaseAnalytics? – Performs analytics workload in the database without pulling the data out of the server. • Advantages of In-Database Analytics – No need to move “BigData” between server and client for analytics. – Higher performance hardware resources (CPU, memory, storage) compared to client PCs.
  • 36.
    In-Database Analytics • Userdefined functions – PL/Python, PL/R, PL/v8, ... or C lang. – Allow you to run (almost) any logics within the database. • Apache MADlib – Machine Learning Library for PostgreSQL
  • 37.
    UDF by Python CREATEOR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu;
  • 38.
    UDF by Python CREATEOR REPLACE FUNCTION dumpenv(OUT text, OUT text) RETURNS SETOF record AS $$ import os for e in os.environ: plpy.notice(str(e) + ": " + os.environ[e]) yield(e, os.environ[e]) $$ LANGUAGE plpythonu; testdb=# select * from dumpenv() order by 1 limit 10; column1 | column2 --------------------+----------------------- G_BROKEN_FILENAMES | 1 HISTCONTROL | ignoredups HISTSIZE | 1000 HOME | /home/snaga HOSTNAME | localhost.localdomain LANG | ja_JP.UTF-8 LC_COLLATE | C LC_CTYPE | C LC_MESSAGES | C LC_MONETARY | C (10 rows)
  • 39.
    Apache MADlib • AnOpen Source Machine Learning Library – Can run in PostgreSQL, Greenplum Database and Apache HAWQ. – Supports many ML algorithms. http://madlib.incubator.apache.org/
  • 40.
    Others Strict type checkingand constraints. Industry Standard Interface (for BI tools)
  • 41.
    Others • Strict typechecking and constraints. – Avoid “Garbage in, garbage out.” • Industry Standard Interface (for BI tools) – ODBC, JDBC
  • 42.
    Summary • PostgreSQL hasalready had lots of features that help your analytics project – In terms of productivity and performance. • And more “BigData” features are coming in the future release. – Parallel query must be a big-shot. • Let’s start your analytic project with PostgreSQL and join our community.  – PostgreSQL 9.6 beta2 is available now!
  • 43.
    Resources • http://www.postgresql.org • http://wiki.postgresql.org •http://planet.postgresql.org • http://pgcon.org
  • 44.
    pgDay Asia 2016 •pgDay Asia 2016 / FOSSASIA 2016 – March 17-19 in Singapore • Speakers: – 19+ speakers from 9 countries • Sessions: – 19 Regular Sessions. – Plus, lightning talks • Attendees: – Around 100 attendees
  • 48.
    pgDay Asia 2017 •FOSSASIA 2017 (March, 2017) – Probably, the same format, in the same season, in the same region. • Do not miss the next one! – Will be better and bigger.  • Join us at: – http://pgday.asia – https://www.facebook.com/pgdayasia
  • 49.