PostgreSQL and Compressed Documents Aleksander Alekseev a.alekseev@postgrespro.ru
A few words about me ● I live in Moscow, Russia; ● Develop software since 2007; ● Contribute to PostgreSQL since 2015; ● Work in Postgres Professional company; ● Interests: OSS, functional programming, electronics, SDR, distributed systems, blogging, podcasting; ● https://eax.me/ & http://devzen.ru/ ;
In this talk ● On data compression in general; ● Compressing JSONB; ● Indexing Protobuf; ● Ideas for new projects; ● Fun facts;
Fun fact! I was informed that I’m giving this talk only yesterday. Sorry for raw slides :)
ZSON
ZSON ● An extension for transparent JSONB compression; ● A dictionary of common strings is created based on your data (re-learning is also supported); ● This dictionary is used to replace strings to 16-bit codes; ● Data is compressed in memory and on the disk; ● In some cases it gives 10% more TPS; ● Free and open source software (MIT license); ●
How JSONB looks like
JSONB problems ● Redundancy; ● Disk space; ● Memory; ● => IO & TPS;
The idea ● Step 1: replace common strings to 16-bit codes; ● Step 2: compress using PGLZ as usual;
zson_learn zson_learn( tables_and_columns text[][], max_examples int default 10000, min_length int default 2, max_length int default 128, min_count int default 2) Example: select zson_learn('{{"table1", "col1"}, {"table2", "col2"}}');
zson_extract_strings
Other ZSON internals
Encoding
pg_protobuf
What it has to do with Star Wars?
Protocol Buffers Protocol Buffers is a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data. -- Wikipedia
Person.proto
Protobuf
These two images were borrowed from ● http://shop.oreilly.com/product/0636920032175.do ● https://martin.kleppmann.com/
Fun fact! ● The attribute `required` was removed in Protobuf 3; ● All fields are optional now;
pg_protobuf ● Protobuf support for PostgreSQL; ● Like ZSON but even better; ● No shared dictionaries; ● No learning/re-learning steps; ● Requires changes in the application; ● Free and open source software (MIT license); ●
pg_protobuf: example create extension pg_protobuf; create table heroes (x bytea); create function hero_name(x bytea) returns text as $$ begin return protobuf_get_string(x, 1); end $$ language 'plpgsql' immutable; create index hero_name_idx on heroes using btree(hero_name(x)); select protobuf_decode(x) from heroes where hero_name(x) = 'foo';
Fun facts!
Data layout
Order matters
NULLs are free* ● Tuple header size: 23 bytes; ● With alignment: 24 bytes; ● Null mask is placed right after the header; ● Result: up to 8 nullable columns cost nothing; ● Also: buy one NULL, get 7 NULLs for free!
Alignment and B-tree Index entries are 8-bytes aligned.
Timetz vs timestamptz ● timetz: int64 (timestamp) + int32 (timezone); ● timestamptz: always an int64, UTC time; ● Result: time takes more space then date + time;
TOAST ● PGLZ: more or less same speed and ratio as ZLIB; ● Heuristic: if beginning of the attribute is compressed well, compress it; ● Works out-of-the-box for large string-like attributes;
Ideas
Ideas: ZSON ● Use PGLZ directly, don’t rely on PostgreSQL heuristics; ● Use more than one dictionary for different tables / columns; ● Same idea for TEXT / XML / whatever;
Ideas: pg_protobuf ● Add an ability to modify Protobuf data; ● Write a tool that will generate PL/pgSQL procedures for accessing Protobuf fields; ● Support unsigned types: uint, fixed32, fixed64; ● Support fields with [packet=true] attribute (in Protobuf 3 - by default); Fun fact! There are no unsigned integer types in PostgreSQL.
Ideas: more extensions! ● pg_thrift, pg_avro, pg_capnproto, pg_messagepack, …; ● An extension with pluggable compression algorithms;
Links ● https://github.com/afiskon/zson ● https://github.com/afiskon/pg_protobuf ● https://github.com/google/protobuf ● https://eax.me/postgresql-extensions/ (in Russian) ● https://eax.me/cpp-protobuf/ (in Russian) ● https://afiskon.github.io/pgconf2017-talk.html
We are hiring! ● https://postgrespro.ru/jobs
Thank you for your attention! ● a.alekseev@postgrespro.ru ● https://afiskon.github.io/ ● https://postgrespro.com/ ● https://github.com/postgrespro/

PostgreSQL and Compressed Documents (pgconf.ru 2018)

  • 1.
    PostgreSQL and Compressed Documents AleksanderAlekseev a.alekseev@postgrespro.ru
  • 2.
    A few wordsabout me ● I live in Moscow, Russia; ● Develop software since 2007; ● Contribute to PostgreSQL since 2015; ● Work in Postgres Professional company; ● Interests: OSS, functional programming, electronics, SDR, distributed systems, blogging, podcasting; ● https://eax.me/ & http://devzen.ru/ ;
  • 3.
    In this talk ●On data compression in general; ● Compressing JSONB; ● Indexing Protobuf; ● Ideas for new projects; ● Fun facts;
  • 4.
    Fun fact! I wasinformed that I’m giving this talk only yesterday. Sorry for raw slides :)
  • 5.
  • 6.
    ZSON ● An extensionfor transparent JSONB compression; ● A dictionary of common strings is created based on your data (re-learning is also supported); ● This dictionary is used to replace strings to 16-bit codes; ● Data is compressed in memory and on the disk; ● In some cases it gives 10% more TPS; ● Free and open source software (MIT license); ●
  • 7.
  • 8.
    JSONB problems ● Redundancy; ●Disk space; ● Memory; ● => IO & TPS;
  • 9.
    The idea ● Step1: replace common strings to 16-bit codes; ● Step 2: compress using PGLZ as usual;
  • 10.
    zson_learn zson_learn( tables_and_columns text[][], max_examples intdefault 10000, min_length int default 2, max_length int default 128, min_count int default 2) Example: select zson_learn('{{"table1", "col1"}, {"table2", "col2"}}');
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
    What it hasto do with Star Wars?
  • 16.
    Protocol Buffers Protocol Buffersis a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data. -- Wikipedia
  • 17.
  • 18.
  • 19.
    These two imageswere borrowed from ● http://shop.oreilly.com/product/0636920032175.do ● https://martin.kleppmann.com/
  • 20.
    Fun fact! ● Theattribute `required` was removed in Protobuf 3; ● All fields are optional now;
  • 21.
    pg_protobuf ● Protobuf supportfor PostgreSQL; ● Like ZSON but even better; ● No shared dictionaries; ● No learning/re-learning steps; ● Requires changes in the application; ● Free and open source software (MIT license); ●
  • 22.
    pg_protobuf: example create extensionpg_protobuf; create table heroes (x bytea); create function hero_name(x bytea) returns text as $$ begin return protobuf_get_string(x, 1); end $$ language 'plpgsql' immutable; create index hero_name_idx on heroes using btree(hero_name(x)); select protobuf_decode(x) from heroes where hero_name(x) = 'foo';
  • 23.
  • 24.
  • 25.
  • 26.
    NULLs are free* ●Tuple header size: 23 bytes; ● With alignment: 24 bytes; ● Null mask is placed right after the header; ● Result: up to 8 nullable columns cost nothing; ● Also: buy one NULL, get 7 NULLs for free!
  • 27.
    Alignment and B-tree Indexentries are 8-bytes aligned.
  • 28.
    Timetz vs timestamptz ●timetz: int64 (timestamp) + int32 (timezone); ● timestamptz: always an int64, UTC time; ● Result: time takes more space then date + time;
  • 29.
    TOAST ● PGLZ: moreor less same speed and ratio as ZLIB; ● Heuristic: if beginning of the attribute is compressed well, compress it; ● Works out-of-the-box for large string-like attributes;
  • 30.
  • 31.
    Ideas: ZSON ● UsePGLZ directly, don’t rely on PostgreSQL heuristics; ● Use more than one dictionary for different tables / columns; ● Same idea for TEXT / XML / whatever;
  • 32.
    Ideas: pg_protobuf ● Addan ability to modify Protobuf data; ● Write a tool that will generate PL/pgSQL procedures for accessing Protobuf fields; ● Support unsigned types: uint, fixed32, fixed64; ● Support fields with [packet=true] attribute (in Protobuf 3 - by default); Fun fact! There are no unsigned integer types in PostgreSQL.
  • 33.
    Ideas: more extensions! ●pg_thrift, pg_avro, pg_capnproto, pg_messagepack, …; ● An extension with pluggable compression algorithms;
  • 34.
    Links ● https://github.com/afiskon/zson ● https://github.com/afiskon/pg_protobuf ●https://github.com/google/protobuf ● https://eax.me/postgresql-extensions/ (in Russian) ● https://eax.me/cpp-protobuf/ (in Russian) ● https://afiskon.github.io/pgconf2017-talk.html
  • 35.
    We are hiring! ●https://postgrespro.ru/jobs
  • 36.
    Thank you foryour attention! ● a.alekseev@postgrespro.ru ● https://afiskon.github.io/ ● https://postgrespro.com/ ● https://github.com/postgrespro/