Postgres indexing and toward big data application

Postgres Indexing and Toward Big Data Application Jordan Huang @ Synergies 1

Outline 1. Postgres history 2. Data storage and indexing 3. Toward big data: related techniques 2

Background ● 1982: from UC Berkeley Ingres project. ○ Also : vi editor, tcp-ip stack, freeBSD, spark. ● 1988: a prototype version call post-gres at the 1988’s ACM SIGMOD conference. ● 1989: launch first version post-gres. ● 1994: postgres95 ● 1996: rename postgres95 to PostgresSQL 3

Enable postgres on Mac brew install postgres brew services start postgresql 4

Enable pgcli on Mac . pip install pgcli 6

SQL File Could Be Found at https://github.com/good5dog5/Slides/tree/master/intro_2_postgres 7

Created order matter ! CREATE TABLE post ( post_id SERIAL PRIMARY KEY, thread_id INTEGER NOT NULL REFERENCES thread (thread_id), account_id INTEGER NOT NULL REFERENCES account (account_id), …. ); CREATE TABLE account ( account_id SERIAL PRIMARY KEY, …. ); CREATE TABLE thread ( thread_id SERIAL PRIMARY KEY, ... ); 11

Postgres data storage (PGDATA) 12

插入 fake data i gen_random_data.sql ● 100 個 users ● 1000 個 threads ● 100000 posts 13

Table data size select info.table_name as table_name, pg_size_pretty(pg_total_relation_size(quote_ident(info.table_name))) as total_size, pg_size_pretty(pg_indexes_size(quote_ident(info.table_name))) as index_size, pg_size_pretty(pg_relation_size(quote_ident(info.table_name))) as table_size, pc.oid, pc.relfilenode from information_schema.tables as info left join pg_class as pc on info.table_name = pc.relname where table_schema = 'public' Q: 尚未建 Index, 為何 index_size > 0 ? 14

Each Page Contains ● Page header: 關於此 page 的 metadata ● Tuple: 放實際資料的地方，從 page 的尾端開始塞 ● ItemID Data : 一個 pointer array，指向對應的 tuple ● Table 中的某比 record 可以用 (page ID, item ID) 去指涉 16

Query 1: ● 全表掃描 (Sequential scan) 之後再根據條件返回子集 17

Sequential Scan Filtered subset 18

Query 1: with index ● 避免全表掃描 ● 沒特別指定的話 default 用 B-tree 才存放 index 結構 19

Indexing 問 values -> 回答 (which page, which itemID) 20

Indexing 問 values -> 回答 (which page, which itemID) 21

B-Tree ● Postgres default indexing structure ● 中介節點存放索引 ● 只有 leaf 才存 values (page itemID) ● leaf 以雙向 linked-list 串接，可加速 range query(? ● Pg 用的是 High-Concurrency b-tree 22

Bitmap Index Scan ● 讀取 index(es) ● 動態建立 bitmap (array) ● Boolean 運算 ● 最後才實際到 pages 拿資料 ● 由於是以 physical order 到 page 拿資料，若有 order by, 則會需要額外的 sort operation [source] 23

Query 2: with indexing Was 32 ms 25

Indexing Drawbacks ● Extra index files. ● Update / insert data requires extra write to index files 26

Partial Index Why have index entries for record we don’t want ? 27

Query 2: with partial index Was 0.882 ms 30

Other index methods in Postgres : hash index ● Store hash of values. ● Reduce size and increase speed of processing. ● 無法做 value comparison ● 需處理碰撞 (碰撞發生時需 rebuild 整個 index) ● 實務上使用價值低 31

Other index methods in Postgres Generalized Inverted Indexes (GIN) ● map many values to one row ● 作全文檢索用 32

Toward Big Data: Related techniques 33

Database Partition ● 單台機器 ● 大表 -> 多個小表 35

Database Sharding ● 多台機器 ● 小表 -> 分散在多個機器 36

Postgres SELECT in Parallel 38

Reference ● Slide: 10 Reasons why you should prefer PostgreSQL to MySQL ● Slide: Bloat and Fragmentation in PostgreSQL ● Blog : Introduction to PostgreSQL physical storage ● Slide:PostgreSQL Internals Through Pictures ● Slide: Huge pages and databases: Working with abundant memory in modern servers ● Slide : Parallelism in PostgreSQL 11 ● Book : SQL Performance Explained ● Video: PostgreSQL Indexing : How, why, and when. 39

DDL Define the DB structure and DB scheme 1. CREATE 2. ALERT 3. DROP 4. TRUNCATE 5. COMMENT 6. RENAME 41

DML Manage data within schema objects 1. SELECT 2. INSERT 3. UPDATE 4. DELETE 5. MERGE 6. CALL : Call a PL / SQL or java subprogram 7. EXPLAIN / PLAIN 8. LOCK TABLE: control concurrency 42

DCL Access control management 1. GRANT: 發給 user access right 2. REVOKE: 收回 user’s access right 43

TCL Manage the changes made by DML, It allows statements to be grouped together into logical transaction. 1. COMMIT: 發給 user access right 2. SAVEPOINT: 收回 user’s access right 3. ROLLBACK 4. SET TRANSACTION 44

MVCC (Multi Version Concurrency Control) 45

MVCC (Multi Version Concurrency Control) 46

DB Indexing Think of them as being much like a book index. A book index is useful because it's sorted, so you find something fast. It points to the pages the item is on. Same for a db. The table isn't usually sorted, but the index is and points to the required rows If you're a book editor and the author adds, edits, or removes a section, you need to update the index Want to lookup a word by it's suffix? You're gonna need to scan the whole book, or have an index of reversed words Database indexes work the same way 47

DB Indexing Index Leaf Nodes and Corresponding Table Data 48

Postgres indexing and toward big data application

More Related Content

What's hot

Similar to Postgres indexing and toward big data application

Recently uploaded

Postgres indexing and toward big data application