Postgres Indexing and Toward Big Data Application Jordan Huang @ Synergies 1
Outline 1. Postgres history 2. Data storage and indexing 3. Toward big data: related techniques 2
Background ● 1982: from UC Berkeley Ingres project. ○ Also : vi editor, tcp-ip stack, freeBSD, spark. ● 1988: a prototype version call post-gres at the 1988’s ACM SIGMOD conference. ● 1989: launch first version post-gres. ● 1994: postgres95 ● 1996: rename postgres95 to PostgresSQL 3
Enable postgres on Mac brew install postgres brew services start postgresql 4
pgcli . 5
Enable pgcli on Mac . pip install pgcli 6
SQL File Could Be Found at https://github.com/good5dog5/Slides/tree/master/intro_2_postgres 7
List database 8
Process view 9
Create test tables 10
Created order matter ! CREATE TABLE post ( post_id SERIAL PRIMARY KEY, thread_id INTEGER NOT NULL REFERENCES thread (thread_id), account_id INTEGER NOT NULL REFERENCES account (account_id), …. ); CREATE TABLE account ( account_id SERIAL PRIMARY KEY, …. ); CREATE TABLE thread ( thread_id SERIAL PRIMARY KEY, ... ); 11
Postgres data storage (PGDATA) 12
插入 fake data i gen_random_data.sql ● 100 個 users ● 1000 個 threads ● 100000 posts 13
Table data size select info.table_name as table_name, pg_size_pretty(pg_total_relation_size(quote_ident(info.table_name))) as total_size, pg_size_pretty(pg_indexes_size(quote_ident(info.table_name))) as index_size, pg_size_pretty(pg_relation_size(quote_ident(info.table_name))) as table_size, pc.oid, pc.relfilenode from information_schema.tables as info left join pg_class as pc on info.table_name = pc.relname where table_schema = 'public' Q: 尚未建 Index, 為何 index_size > 0 ? 14
Table = sequence of pages 15
Each Page Contains ● Page header: 關於此 page 的 metadata ● Tuple: 放實際資料的地方,從 page 的 尾端開始塞 ● ItemID Data : 一個 pointer array,指向 對應的 tuple ● Table 中的某比 record 可以用 (page ID, item ID) 去指涉 16
Query 1: ● 全表掃描 (Sequential scan) 之後再根 據條件返回子集 17
Sequential Scan Filtered subset 18
Query 1: with index ● 避免全表掃描 ● 沒特別指定的話 default 用 B-tree 才存 放 index 結構 19
Indexing 問 values -> 回答 (which page, which itemID) 20
Indexing 問 values -> 回答 (which page, which itemID) 21
B-Tree ● Postgres default indexing structure ● 中介節點存放索引 ● 只有 leaf 才存 values (page itemID) ● leaf 以雙向 linked-list 串接,可加速 range query(? ● Pg 用的是 High-Concurrency b-tree 22
Bitmap Index Scan ● 讀取 index(es) ● 動態建立 bitmap (array) ● Boolean 運算 ● 最後才實際到 pages 拿資料 ● 由於是以 physical order 到 page 拿資 料,若有 order by, 則會需要額外的 sort operation [source] 23
Query 2 24
Query 2: with indexing Was 32 ms 25
Indexing Drawbacks ● Extra index files. ● Update / insert data requires extra write to index files 26
Partial Index Why have index entries for record we don’t want ? 27
Partial Index Why have index entries for record we don’t want ? 28
Partial Index Why have index entries for record we don’t want ? 29
Query 2: with partial index Was 0.882 ms 30
Other index methods in Postgres : hash index ● Store hash of values. ● Reduce size and increase speed of processing. ● 無法做 value comparison ● 需處理碰撞 (碰撞發生時需 rebuild 整個 index) ● 實務上使用價值低 31
Other index methods in Postgres Generalized Inverted Indexes (GIN) ● map many values to one row ● 作全文檢索用 32
Toward Big Data: Related techniques 33
Database Parallelism 34
Database Partition ● 單台機器 ● 大表 -> 多個小表 35
Database Sharding ● 多台機器 ● 小表 -> 分散在多個機器 36
Postgres Parallel features 37
Postgres SELECT in Parallel 38
Reference ● Slide: 10 Reasons why you should prefer PostgreSQL to MySQL ● Slide: Bloat and Fragmentation in PostgreSQL ● Blog : Introduction to PostgreSQL physical storage ● Slide:PostgreSQL Internals Through Pictures ● Slide: Huge pages and databases: Working with abundant memory in modern servers ● Slide : Parallelism in PostgreSQL 11 ● Book : SQL Performance Explained ● Video: PostgreSQL Indexing : How, why, and when. 39
Appendix 40
DDL Define the DB structure and DB scheme 1. CREATE 2. ALERT 3. DROP 4. TRUNCATE 5. COMMENT 6. RENAME 41
DML Manage data within schema objects 1. SELECT 2. INSERT 3. UPDATE 4. DELETE 5. MERGE 6. CALL : Call a PL / SQL or java subprogram 7. EXPLAIN / PLAIN 8. LOCK TABLE: control concurrency 42
DCL Access control management 1. GRANT: 發給 user access right 2. REVOKE: 收回 user’s access right 43
TCL Manage the changes made by DML, It allows statements to be grouped together into logical transaction. 1. COMMIT: 發給 user access right 2. SAVEPOINT: 收回 user’s access right 3. ROLLBACK 4. SET TRANSACTION 44
MVCC (Multi Version Concurrency Control) 45
MVCC (Multi Version Concurrency Control) 46
DB Indexing Think of them as being much like a book index. A book index is useful because it's sorted, so you find something fast. It points to the pages the item is on. Same for a db. The table isn't usually sorted, but the index is and points to the required rows If you're a book editor and the author adds, edits, or removes a section, you need to update the index Want to lookup a word by it's suffix? You're gonna need to scan the whole book, or have an index of reversed words Database indexes work the same way 47
DB Indexing Index Leaf Nodes and Corresponding Table Data 48

Postgres indexing and toward big data application

  • 1.
    Postgres Indexing andToward Big Data Application Jordan Huang @ Synergies 1
  • 2.
    Outline 1. Postgres history 2.Data storage and indexing 3. Toward big data: related techniques 2
  • 3.
    Background ● 1982: fromUC Berkeley Ingres project. ○ Also : vi editor, tcp-ip stack, freeBSD, spark. ● 1988: a prototype version call post-gres at the 1988’s ACM SIGMOD conference. ● 1989: launch first version post-gres. ● 1994: postgres95 ● 1996: rename postgres95 to PostgresSQL 3
  • 4.
    Enable postgres onMac brew install postgres brew services start postgresql 4
  • 5.
  • 6.
    Enable pgcli onMac . pip install pgcli 6
  • 7.
    SQL File CouldBe Found at https://github.com/good5dog5/Slides/tree/master/intro_2_postgres 7
  • 8.
  • 9.
  • 10.
  • 11.
    Created order matter! CREATE TABLE post ( post_id SERIAL PRIMARY KEY, thread_id INTEGER NOT NULL REFERENCES thread (thread_id), account_id INTEGER NOT NULL REFERENCES account (account_id), …. ); CREATE TABLE account ( account_id SERIAL PRIMARY KEY, …. ); CREATE TABLE thread ( thread_id SERIAL PRIMARY KEY, ... ); 11
  • 12.
  • 13.
    插入 fake data igen_random_data.sql ● 100 個 users ● 1000 個 threads ● 100000 posts 13
  • 14.
    Table data size select info.table_nameas table_name, pg_size_pretty(pg_total_relation_size(quote_ident(info.table_name))) as total_size, pg_size_pretty(pg_indexes_size(quote_ident(info.table_name))) as index_size, pg_size_pretty(pg_relation_size(quote_ident(info.table_name))) as table_size, pc.oid, pc.relfilenode from information_schema.tables as info left join pg_class as pc on info.table_name = pc.relname where table_schema = 'public' Q: 尚未建 Index, 為何 index_size > 0 ? 14
  • 15.
    Table = sequenceof pages 15
  • 16.
    Each Page Contains ●Page header: 關於此 page 的 metadata ● Tuple: 放實際資料的地方,從 page 的 尾端開始塞 ● ItemID Data : 一個 pointer array,指向 對應的 tuple ● Table 中的某比 record 可以用 (page ID, item ID) 去指涉 16
  • 17.
    Query 1: ● 全表掃描(Sequential scan) 之後再根 據條件返回子集 17
  • 18.
  • 19.
    Query 1: withindex ● 避免全表掃描 ● 沒特別指定的話 default 用 B-tree 才存 放 index 結構 19
  • 20.
    Indexing 問 values ->回答 (which page, which itemID) 20
  • 21.
    Indexing 問 values ->回答 (which page, which itemID) 21
  • 22.
    B-Tree ● Postgres defaultindexing structure ● 中介節點存放索引 ● 只有 leaf 才存 values (page itemID) ● leaf 以雙向 linked-list 串接,可加速 range query(? ● Pg 用的是 High-Concurrency b-tree 22
  • 23.
    Bitmap Index Scan ●讀取 index(es) ● 動態建立 bitmap (array) ● Boolean 運算 ● 最後才實際到 pages 拿資料 ● 由於是以 physical order 到 page 拿資 料,若有 order by, 則會需要額外的 sort operation [source] 23
  • 24.
  • 25.
    Query 2: withindexing Was 32 ms 25
  • 26.
    Indexing Drawbacks ● Extraindex files. ● Update / insert data requires extra write to index files 26
  • 27.
    Partial Index Why haveindex entries for record we don’t want ? 27
  • 28.
    Partial Index Why haveindex entries for record we don’t want ? 28
  • 29.
    Partial Index Why haveindex entries for record we don’t want ? 29
  • 30.
    Query 2: withpartial index Was 0.882 ms 30
  • 31.
    Other index methodsin Postgres : hash index ● Store hash of values. ● Reduce size and increase speed of processing. ● 無法做 value comparison ● 需處理碰撞 (碰撞發生時需 rebuild 整個 index) ● 實務上使用價值低 31
  • 32.
    Other index methodsin Postgres Generalized Inverted Indexes (GIN) ● map many values to one row ● 作全文檢索用 32
  • 33.
    Toward Big Data:Related techniques 33
  • 34.
  • 35.
    Database Partition ● 單台機器 ●大表 -> 多個小表 35
  • 36.
    Database Sharding ● 多台機器 ●小表 -> 分散在多個機器 36
  • 37.
  • 38.
    Postgres SELECT inParallel 38
  • 39.
    Reference ● Slide: 10Reasons why you should prefer PostgreSQL to MySQL ● Slide: Bloat and Fragmentation in PostgreSQL ● Blog : Introduction to PostgreSQL physical storage ● Slide:PostgreSQL Internals Through Pictures ● Slide: Huge pages and databases: Working with abundant memory in modern servers ● Slide : Parallelism in PostgreSQL 11 ● Book : SQL Performance Explained ● Video: PostgreSQL Indexing : How, why, and when. 39
  • 40.
  • 41.
    DDL Define the DBstructure and DB scheme 1. CREATE 2. ALERT 3. DROP 4. TRUNCATE 5. COMMENT 6. RENAME 41
  • 42.
    DML Manage data withinschema objects 1. SELECT 2. INSERT 3. UPDATE 4. DELETE 5. MERGE 6. CALL : Call a PL / SQL or java subprogram 7. EXPLAIN / PLAIN 8. LOCK TABLE: control concurrency 42
  • 43.
    DCL Access control management 1.GRANT: 發給 user access right 2. REVOKE: 收回 user’s access right 43
  • 44.
    TCL Manage the changesmade by DML, It allows statements to be grouped together into logical transaction. 1. COMMIT: 發給 user access right 2. SAVEPOINT: 收回 user’s access right 3. ROLLBACK 4. SET TRANSACTION 44
  • 45.
    MVCC (Multi VersionConcurrency Control) 45
  • 46.
    MVCC (Multi VersionConcurrency Control) 46
  • 47.
    DB Indexing Think ofthem as being much like a book index. A book index is useful because it's sorted, so you find something fast. It points to the pages the item is on. Same for a db. The table isn't usually sorted, but the index is and points to the required rows If you're a book editor and the author adds, edits, or removes a section, you need to update the index Want to lookup a word by it's suffix? You're gonna need to scan the whole book, or have an index of reversed words Database indexes work the same way 47
  • 48.
    DB Indexing Index LeafNodes and Corresponding Table Data 48