GPGPU Accelerates PostgreSQL NEC OSS Promotion Center The PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> (Tw: @kkaigai)
Self Introduction ▌Name: KaiGai Kohei ▌Company: NEC OSS Promotion Center ▌Like: Processor with many cores ▌Dislike: Processor with little cores ▌Background: HPC  OSS/Linux  SAP  GPU/PostgreSQL ▌Tw: @kkaigai ▌My Jobs  SELinux (2004~) • Lockless AVC、JFFS2 XATTR, ...  PostgreSQL (2006~) • SE-PostgreSQL, Security Barrier View, Writable FDW, ...  PG-Strom (2012~) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2
Approach for Performance Improvement DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 3 Scale-Out Scale-Up Homogeneous Scale-up Heterogeneous Scale-Up +
Characteristics of GPU (Graphic Processor Unit) ▌Characteristics  Larger percentage of ALUs on chip  Relatively smaller percentage of cache and control logic Advantages to simple calculation in parallel, but not complicated logic  Much higher number of cores per price • GTX750Ti (640core) with $150 GPU CPU Model Nvidia Tesla K20X Intel Xeon E5-2670 v3 Architecture Kepler Haswell Launch Nov-2012 Sep-2014 # of transistors 7.1billion 3.84billion # of cores 2688 (simple) 12 (functional) Core clock 732MHz 2.6GHz, up to 3.5GHz Peak Flops (single precision) 3.95TFLOPS 998.4GFLOPS (with AVX2) DRAM size 6GB, GDDR5 768GB/socket, DDR4 Memory band 250GB/s 68GB/s Power consumption 235W 135W Price $3,000 $2,094 DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 4 SOURCE: CUDA C Programming Guide (v6.5)
How GPU works – Example of reduction algorithm ●item[0] step.1 step.2 step.4step.3 Computing the sum of array: 𝑖𝑡𝑒𝑚[𝑖] 𝑖=0…𝑁−1 with N-cores of GPU ◆ ● ▲ ■ ★ ● ◆ ● ● ◆ ▲ ● ● ◆ ● ● ◆ ▲ ■ ● ● ◆ ● ● ◆ ▲ ● ● ◆ ● item[1] item[2] item[3] item[4] item[5] item[6] item[7] item[8] item[9] item[10] item[11] item[12] item[13] item[14] item[15] Total sum of items[] with log2N steps Inter core synchronization by HW support
Semiconductor Trend (1/2) – Towards Heterogeneous ▌Moves to CPU/GPU integrated architecture from multicore CPU ▌Free lunch for SW by HW improvement will finish soon  No utilization of semiconductor capability, unless SW is not designed with conscious of HW characteristics. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 6 SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
Semiconductor Trend (2/2) – Dark Silicon Problem ▌Background of CPU/GPU integrated architecture  Increase of transistor density > Reduction of power consumption  Unable to supply power for all the logic because of chip cooling A chip has multiple logics with different features, to save the peak power consumption. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 7 SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
RDBMS and its bottleneck (1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 8 Storage Processor Data RAM Data Size > RAM Data Size < RAM Storage Processor Data RAM In the future? Processor Wide Band RAM Non- volatile RAM Data
World of current memory bottleneck Join, Aggregation, Sort, Projection, ... [strategy] • burstable access pattern • parallel algorithm World of traditional disk-i/o bottleneck SeqScan, IndexScan, ... [strategy] • reduction of i/o (size, count) • distribution of disk (RAID) RDBMS and its bottleneck (2/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 9 Processor RAM Storage bandwidth: multiple hundreds GB/s bandwidth: multiple GB/s
PG-Strom ▌What is PG-Strom  An extension designed for PostgreSQL  Off-loads a part of SQL workloads to GPU for parallel/rapid execution  Three workloads are supported: Full-Scan, Hash-Join, Aggregate (At the moment of Nov-2014, beta version) ▌Concept  Automatic GPU native code generation from SQL query, by JIT compile  Asynchronous execution with CPU/GPU co-operation. ▌Advantage  Works fully transparently from users • It allows to use peripheral software of PostgreSQL, including SQL syntax, backup, HA or drivers.  Performance improvement with GPU+OSS; very low cost ▌Attention  Right now, in-memory data store is assumed DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 10
PostgreSQL PG-Strom Architecture of PG-Strom (1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 11 GPU Code Generator Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg GPU Program Manager PG-Strom OpenCL Server Message Queue
Architecture of PG-Strom (2/2) ▌Elemental technology① OpenCL  Parallel computing framework on heterogeneous processors  Can use for CPU parallel, not only GPU of NVIDIA/AMD  Includes run-time compiler in the language specification ▌Elemental technology② Custom-Scan Interface  Feature to implement scan/join by extensions, as if it is built-in logic for SQL processing in PostgreSQL.  A part of functionalities got merged to v9.5, discussions are in-progress for full functionalities. ▌Elemental technology③ Row oriented data structure  Shared buffer of PostgreSQL as DMA source  Data format translation between row and column, even though column format is optimal for GPU performance. Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Elemental technology① OpenCL (1/2) ▌Characteristics of OpenCL  Use abstracted “OpenCL device” for CPU/MIC, not only GPUs • Even though it follows characteristics of GPUs below.... Three types of memory layer (global, local, constant) Concept of workgroup; synchronous execution inter-threads  Just-in-time compile from source code like C-language to the platform specific native code ▌Comparison with CUDA  We don’t need to develop JIT compile functionality by ourselves, and want more people to try, so adopted OpenCL at this moment. Page. 13 CUDA OpenCL Advantage • Detailed optimization • Latest feature of NVIDIA GPU • Driver stability • Multiplatform support • Built-in JIT compiler • CPU parallel support Issues • Unavailable on AMD, Intel • Driver stability The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Elemental technology① OpenCL (2/2) Page. 14 GPU Source code (text) OpenCL runtime OpenCL Interface OpenCL runtime OpenCL runtime OpenCL Devices N x computing units Global Memory Local Memory Constant Memory cl_program object cl_kernel object DMA buffer-3 DMA buffer-2 DMA buffer-1clBuildProgram() clCreateKernel() Command Queue Just-in-Time Compile Look-up kernel functions Input a pair of kernel and data into command queue The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Automatic GPU native code generation Page. 15 postgres=# SET pg_strom.show_device_kernel = on; SET postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat; QUERY PLAN -------------------------------------------------------------------------------------------- HashAggregate Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)) Group Key: t0.cat -> Custom (GpuPreAgg) Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer, NULL::double precision, NULL::double precision, NULL::text, pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x) Bulkload: On Kernel Source: #include "opencl_common.h" #include "opencl_gpupreagg.h" #include "opencl_textlib.h" : <...snip...> static bool gpupreagg_qual_eval(__private cl_int *errcode, __global kern_parambuf *kparams, __global kern_data_store *kds, __global kern_data_store *ktoast, size_t kds_index) { pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index); pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index); return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8)); } The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
How PG-Strom processes SQL workloads① – Hash-Join Page. 16 Inner relation Outer relation Inner relation Outer relation Hash table Hash table Next step Next step All CPU does is just references the result of relations join Hash table search by CPU Projection by CPU Parallel Projection Parallel Hash- table search Existing Hash-Join implementation GpuHashJoin implementation The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Benchmark (1/2) – Simple Tables Join [condition of measurement] ▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...]; ▌ All the tables are preliminary loaded ▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980 Page. 17 178.7 334.0 513.6 713.1 941.4 1181.3 1452.2 1753.4 29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Simple Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
How PG-Strom processes SQL workloads② – Aggregate ▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU ▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size, GPU has advantage to make “partial aggregate” for each million rows. ▌Key performance factor is reducing the job of CPU GroupAggregate Sort SeqScan Tbl_1 Result Set several rows several millions rows several millions rows count(*), avg(X), Y X, Y X, Y X, Y, Z (table definition) GroupAggregate Sort GpuScan Tbl_1 Result Set several rows several hundreds rows several millions rows sum(nrows), avg_ex(nrows, psum), Y Y, nrows, psum_X X, Y X, Y, Z (table definition) GpuPreAgg Y, nrows, psum_X several hundreds rows SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y; The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18
Benchmark (2/2) – Aggregation + Tables Join [condition of measurement] ▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT; ▌ Other conditions are same with the previous measurement Page. 19 157.4 238.2 328.3 421.7 525.5 619.3 712.8 829.2 44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7 0 100 200 300 400 500 600 700 800 900 1000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Aggregation + Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
As an aside... Let’s back to the development history prior to the remaining elemental technology DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 20
Development History of PG-Strom DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 21 2011 Feb KaiGai moved to Germany May PGconf 2011 2012 Jan The first prototype May Tried to use pseudo code Aug Proposition of background worker and writable FDW 2013 Jul NEC admit PG-Strom development Nov Proposition of CustomScan API 2014 Feb Moved to OpenCL from CUDA Jun GpuScan, GpuHashJoin and GpuSort were implemented Sep Drop GpuSort, instead of GpuPreAgg Nov working beta-1 gets available First prototype announced “Strom” comes from Germany term; where I lived in at that time
Inspiration – May-2011 DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 22 http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language PGconf 2011 @ Ottawa, Canada You’re brave. Unavailable to run on regular query, not only PL functions?
Inspired by PgOpencl Project DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 23 A B C D E summary: GPU works well towards large BLOB, like image data. Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011 Tim Child (3DMashUp) Even small width of values, GPU will work well if we have transposition.
Made a prototype – Christmas vacation in 2011  CUDA based ( Now, OpenCL)  FDW (Foreign Data Wrapper) based ( Now, CustomScan API)  Column oriented data structure ( Now, row-oriented data) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 24 CPU vanilla PostgreSQL PostgreSQL + PG-Strom Iterationofrowfetch andevaluation multiple rows at once Async DMA & Async Kernek Exec CUDA Compiler : Fetch a row from buffer : Evaluation of WHERE clause CPU GPU Synchronization SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10; code for GPU Auto GPU code generation, Just-in-time compile Query response time reduction GPU Device
 A set of APIs to show external data source as like a table of PostgreSQL  FDW driver generates rows on demand when foreign tables are referenced.  Extension can have arbitrary data structure and logic to process the data.  It is also allowed to scan internal data with GPU acceleration! What is FDW (Foreign Data Wrapper) QueryExecutor Regular Table Foreign Table Foreign Table Foreign Table File FDW Oracle FDW PG-Strom FDW storage Regular Table storage QueryPlanner QueryParser Exec Exec Exec Exec SQL Query DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 25 CSV Files ....... .... .... ... ... ... . ... ... . .... .. .... .. . .. .. . . . .. . Other DBMS
Implementation at that time ▌Problem  Only foreign tables can be accelerated with GPUs.  Only full table-scan can be supported.  Foreign tables were read-only at that time.  Slow-down of 1st time query because of device initialization. summary: Not easy to use PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26 ForeignTable(pgstrom) value a[] rowmap value b[] value c[] value d[] <not used> <not used> Table: my_schema.ft1.b.cs 10300 {10.23, 7.54, 5.43, … } 10100 {2.4, 5.6, 4.95, … } ② Calculation ① Transfer ③ Write-Back Table: my_schema.ft1.c.cs {‘2010-10-21’, …} {‘2011-01-23’, …} {‘2011-08-17’, …} 10100 10200 10300 Shadow tables on behalf of each column of the foreign table managed by PG-Strom. Each items are stored closely in physical, using large array data type of element data.
PostgreSQL Enhancement (1/2) – Writable FDW ▌Foreign tables had supported read access only (~v9.2)  Data load onto shadow tables was supported using special function Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 27 PostgreSQL Data Source Query Planning SELECT DELETE UPDATE INSERT FDW Driver Enhancement of this interface
PostgreSQL Enhancement (2/2) – Background Worker ▌Slowdown on first time of each query  Time for JIT compile of GPU code  caching the built binaries  Time for initialization of GPU devices  initialization once by worker process Good side effect: it works with run-time that doesn’t support concurrent usage. ▌Background Worker  An interface allows to launch background processes managed by extensions  Dynamic registration gets supported on v9.4 also. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 28 PostmasterProcess PostgreSQL Backend Process Built-in BG workers Extension’s BG workers fork(2) on connection fork(2) on startup IPC:sharedmemory,... fork(2) on startup / demand 5432 Port
Design Pivot ▌Is the FDW really appropriate framework for PG-Strom?  It originated from a feature to show external data as like a table. [benefit] • It is already supported on PostgreSQL [Issue] • Less transparency for users / applications • More efficient execution path than GPU; like index-scan if available • Workloads expects for full-table scan; like tables join ▌CUDA or OpenCL?  First priority is many trials by wider range of people DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 29 CUDA OpenCL Good • Fine grained optimization • Latest feature of NVIDIA GPU • Reliability of OpenCL drivers • Multiplatform support • Built-in JIT compiler • CPU parallel support Bad • No support on AMD, Intel • Reliability of OpenCL drivers
Elemental Technology② – Custom-Scan Interface Page. 30 Table.A Table to be scanned clause id > 200 Path① SeqScan cost=400 Path② TidScan unavailable Path③ IndexScan cost=10 Path④ CustomScan (GpuScan) cost=40 Built-in Logics Table.X Table.Y Tables to be joined Path① NestLoop cost=8000 Path② HashJoin cost=800 Path③ MergeJoin cost=1200 Path④ CustomScan (GpuHashJoin) cost=500 Built-in Logics clause x.id = y.id Path③ IndexScan cost=10 Table.A WINNER! Table.X Table.Y Path④ CustomScan (GpuHashJoin) cost=500 clause x.id = y.id WINNER! It allows extensions to offer alternative scan or join logics, in addition to the built-in ones The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL clause id > 200
Yes!! Custom-Scan Interface got merged to v9.5 Page. 31 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL Minimum consensus is, an interface that allows extensions to implement custom logic to replace built-in scan logics. ↓ Then, we will discuss the enhancement of the interface for tables join on the developer’s community.
Enhancement of Custom-Scan Interface The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32 Table.X Table.Y Tables to be joined Path ① NestLoop Path ② HashJoin Path ③ MergeJoin Path ④ CustomScan (GpuHashJoin) Built-in logics clause x.id = y.id X Y X Y NestLoop HashJoin X Y MergeJoin X Y GpuHashJoin Result set of tables join It looks like a relation scan on the result set of tables join  It replaces a node to be join by foreign-/custom-scan.  Example: A FDW that scan on the result set of remote join. ▌Join replacement by foreign-/custom-scan
Abuse of Custom-Scan Interface ▌Usual interface contract  GpuScan/SeqScan fetches rows, then passed to upper node and more ...  TupleTableSlot is used to exchange record row by row manner. ▌Beyond the interface contract  In case when both of parent and child nodes are managed by same extension, nobody prohibit to exchange chunk of data with its own data structure.  “Bulkload: On”  multiple (usually, 10K~100K) rows at once Page. 33 postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1; QUERY PLAN ----------------------------------------------------------------------------- Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106) hash clause 1: (t0.aid = t1.aid) Bulkload: On -> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73) -> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37) hash keys: aid Buckets: 46000 Batches: 1 Memory Usage: 99.97% -> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37) Planning time: 0.220 ms (9 rows) The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PostgreSQL PG-Strom Element Technology③ – Row oriented data structure Page. 34 Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg DMA Transfer (via PCI-E bus) GPU Program Manager PG-Strom OpenCL Server Message Queue T-Tree Columnar Cache GPU Code Generator Cache construction (rowcolumn) Materialize the result (columnrow) Prior implementation had table cache with column-oriented data structure The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Why GPU well fit column-oriented data structure Page. 35 SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made Core Core Core Core Core Core Core Core Core Core SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels coalesced memory access Global Memory (DRAM) Wide Memory Bandwidth (256-384bits) WARP: A set of processor cores that share instruction pointer. It is usually 32. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Format of PostgreSQL tuples Page. 36 struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; } t_choice; /* current TID of this or newer tuple */ ItemPointerData t_ctid; /* number of attributes + various flags */ uint16 t_infomask2; /* various flag bits, see below */ uint16 t_infomask; /* sizeof header incl. bitmap, padding */ uint8 t_hoff; /* ^ - 23 bytes - ^ */ /* bitmap of NULLs -- VARIABLE LENGTH */ bits8 t_bits[1]; /* MORE DATA FOLLOWS AT END OF STRUCT */ }; HeapTupleHeader NULL bitmap (if has null) OID of row (if exists) Padding 1st Column 2nd Column 4th Column Nth Column No data if NULL Variable length fields makes unavailable to predicate offset of the later fields. No NULL Bitmap if all the fields are not NULL We usually have no OID of row The worst data structure for GPU processor The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Nightmare of columnar cache (1/2) ▌列指向キャッシュと行列変換 Page. 37 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ------------------------------------------------------------------------------ Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 144 total time to load: 584.67ms total time to materialize: 7245.14ms  70% of total execution time!! average time in send-mq: 37us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470 DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144 kernel exec: total: 441.71ms, avg: 3067us, count: 144 -> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1) -> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1) Execution time: 10525.754 ms The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Nightmare of columnar cache (2/2) Page. 38 Translation from table (row-format) to the columnar-cache (only at once) Translation from PG- Strom internal (column-format) to TupleTableSlot (row-format) for data exchange in PostgreSQL [Hash-Join] Search the relevant records on inner/outer relations. Breakdown of execution time You cannot see the wood for the trees (木を見て森を見ず) Optimization in GPU also makes additional data-format translation cost (not ignorable) on the interface of PostgreSQL The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
A significant point in GPU acceleration (1/2) Page. 39 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer T-Tree columnar cache Result Buffer TupleTableSlot Task of CPU RowColumn translation on cache construction Very heavy loads CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to column-format, only referenced data shall be transformed. Task of CPU ColumnRow translation everytime when we returns the result to PostgreSQL. Very heavy loads Task of GPU Operations on the data with column-format. It is optimized data according to GPU’s nature In case of column-format The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
A significant point in GPU acceleration (2/2) Page. 40 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer Result Buffer TupleTableSlot Task of CPU Visibility check prior to DMA translation Light Load CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to row-format, all data shall be transformed. A little heavy load Task of CPU Set pointer of the result buffer Very Light Load Task of GPU Operations on row-data. Although not an optimal data, massive cores covers its performance In case of row-format No own columnar cache The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Integration with shared-buffer of PostgreSQL ▌列指向キャッシュと行列変換 Page. 41 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ----------------------------------------------------------- Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 145 total time for inner load: 29.80ms total time for outer load: 812.50ms total time to materialize: 1527.95ms  Reduction dramatically average time in send-mq: 61us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619 DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290 proj kernel exec: total: 264.47ms, avg: 1823us, count: 145 main kernel exec: total: 622.83ms, avg: 4295us, count: 145 -> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1) -> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1) Execution time: 5161.017 ms  Much better response time Performance degrading little bit The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PG-Strom’s current ▌Supported logics  GpuScan ... Full-table scan with qualifiers  GpuHashJoin ... Hash-Join with GPUs  GpuPreAgg ... Preprocess of aggregation with GPUs ▌Supported data-types  Integer (smallint, integer, bigint)  Floating-point (real, float)  String (text, varchar(n), char(n))  Date and time (date, time, timestamp)  NUMERIC ▌Supported functions  arithmetic operators for above data-types  comparison operators for above data-types  mathematical functions on floating-points  Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR Page. 42 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PG-Strom’s future ▌Logics will be supported  Sort  Aggregate Push-down  ... anything else? ▌Data types will be supported  ... anything else? ▌Functions will be supported  LIKE, Regular Expression?  Date/Time extraction?  PostGIS?  Geometrics?  User defined functions? (like pgOpenCL) ▌Parallel Scan Page. 43 Shared Buffer block-0 block-Nblock-N/3 block-2N/3 Range Partitioning & Parallel Scan Current Future Not easy to supply GPU enough data with single CPU core in spite of on-memory store. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Our target segment ▌1st target application: BI tools are expected ▌Up-to 1.0TB data: SMB company or branches of large company  Because it needs to keep the data in-memory ▌GPU and PostgreSQL: both of them are inexpensive. Page. 44 High-end data analysis appliance Commercial or OSS RDBMS large response-time / batch process small response-time / real-time smallerdata-sizelargerdata-size < 1.0TB > 1.0TB Cost advantage Performance advantage PG-Strom Hadoop? The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
(Ref) ROLAP and MOLAP ▌Expected PG-Strom usage  Backend RDBMS for ROLAP / acceleration of ad-hoc queries  Cube construction for MOLAP / acceleration of batch processing Page. 45 ERP SCM CRM Finance DWH DWH world of transaction (OLTP) world of analytics (OLAP) ETL ROLAP: DB summarizes the data on demand by BI tools MOLAP: BI tools reference information cube; preliminary constructed The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Expected Usage (1/2) – OLTP/OLAP Integration ▌Why separated OLTP/OLAP system  Integration of multiple data source  Optimization for analytic workloads ▌How PG-Strom will improve...  Parallel execution of Join / Aggregate; being key of performance  Elimination or reduction of OLAP / ETL, thus smaller amount of TCO DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 46 ERPCRMSCM BI OLTP database OLAP database ETL OLAP CubesMaster / Fact Tables BI PG-Strom Performance Key • Join master and fact tables • Aggregation functions
Expected Usage (2/2) – Let’s investigate with us ▌Which region PG-Strom can work for? ▌Which workloads PG-Strom shall fit? ▌What workloads you concern about?  PG-Strom Project want to lean from the field. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 47
How to use (1/3) – Installation Prerequisites ▌OS: Linux (RHEL 6.x was validated) ▌PostgreSQL 9.5devel (with Custom-Plan Interface) ▌PG-Strom Module ▌OpenCL Driver (like nVIDIA run-time) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 48 shared_preload_libraries = '$libdir/pg_strom‘ shared_buffers = <enough size to load whole of the database> Minimum configuration of PG-Strom postgres=# SET pg_strom.enabled = on; SET Turn on/off PG-Strom at run-time
How to use (2/3) – Build, Install and Starup DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 49 [kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom [kaigai@saba ~]$ cd pg_strom [kaigai@saba pg_strom]$ make && make install [kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf [kaigai@saba ~]$ pg_ctl start server starting [kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server" LOG: starting background worker process "PG-Strom OpenCL Server" LOG: database system was shut down at 2014-11-09 17:45:51 JST LOG: autovacuum launcher started LOG: database system is ready to accept connections LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB) LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB) LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed LOG: PG-Strom: Device "GeForce GTX 980" was installed LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB) LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB) LOG: Starting PG-Strom OpenCL Server LOG: PG-Strom: 24 of server threads are up
How to use (3/3) – Deployment on AWS Page. 50 Search by “strom” ! AWS GPU Instance (g2.2xlarge) CPU Xeon E5-2670 (8 xCPU) RAM 15GB GPU NVIDIA GRID K2 (1536core) Storage 60GB of SSD Price $0.898/hour, $646.56/mon (*) Price for on-demand instance on Tokyo region at Nov-2014 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Future of PG-Strom DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 51
Dilemma of Innovation DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 52 SOURCE: The Innovator's Dilemma, Clayton M. Christensen Performance demanded at the high end of the market Performance demanded at the low end of the marker or in a new emerging segment ProductPerformance Time
Move forward with community DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 53
(Additional comment after the conference) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 54 check it out! https://github.com/pg-strom/devel Oops, I forgot to say in the conference! PG-Strom is open source.
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)

GPGPU Accelerates PostgreSQL (English)

  • 1.
    GPGPU Accelerates PostgreSQL NECOSS Promotion Center The PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> (Tw: @kkaigai)
  • 2.
    Self Introduction ▌Name: KaiGaiKohei ▌Company: NEC OSS Promotion Center ▌Like: Processor with many cores ▌Dislike: Processor with little cores ▌Background: HPC  OSS/Linux  SAP  GPU/PostgreSQL ▌Tw: @kkaigai ▌My Jobs  SELinux (2004~) • Lockless AVC、JFFS2 XATTR, ...  PostgreSQL (2006~) • SE-PostgreSQL, Security Barrier View, Writable FDW, ...  PG-Strom (2012~) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2
  • 3.
    Approach for PerformanceImprovement DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 3 Scale-Out Scale-Up Homogeneous Scale-up Heterogeneous Scale-Up +
  • 4.
    Characteristics of GPU(Graphic Processor Unit) ▌Characteristics  Larger percentage of ALUs on chip  Relatively smaller percentage of cache and control logic Advantages to simple calculation in parallel, but not complicated logic  Much higher number of cores per price • GTX750Ti (640core) with $150 GPU CPU Model Nvidia Tesla K20X Intel Xeon E5-2670 v3 Architecture Kepler Haswell Launch Nov-2012 Sep-2014 # of transistors 7.1billion 3.84billion # of cores 2688 (simple) 12 (functional) Core clock 732MHz 2.6GHz, up to 3.5GHz Peak Flops (single precision) 3.95TFLOPS 998.4GFLOPS (with AVX2) DRAM size 6GB, GDDR5 768GB/socket, DDR4 Memory band 250GB/s 68GB/s Power consumption 235W 135W Price $3,000 $2,094 DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 4 SOURCE: CUDA C Programming Guide (v6.5)
  • 5.
    How GPU works– Example of reduction algorithm ●item[0] step.1 step.2 step.4step.3 Computing the sum of array: 𝑖𝑡𝑒𝑚[𝑖] 𝑖=0…𝑁−1 with N-cores of GPU ◆ ● ▲ ■ ★ ● ◆ ● ● ◆ ▲ ● ● ◆ ● ● ◆ ▲ ■ ● ● ◆ ● ● ◆ ▲ ● ● ◆ ● item[1] item[2] item[3] item[4] item[5] item[6] item[7] item[8] item[9] item[10] item[11] item[12] item[13] item[14] item[15] Total sum of items[] with log2N steps Inter core synchronization by HW support
  • 6.
    Semiconductor Trend (1/2)– Towards Heterogeneous ▌Moves to CPU/GPU integrated architecture from multicore CPU ▌Free lunch for SW by HW improvement will finish soon  No utilization of semiconductor capability, unless SW is not designed with conscious of HW characteristics. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 6 SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
  • 7.
    Semiconductor Trend (2/2)– Dark Silicon Problem ▌Background of CPU/GPU integrated architecture  Increase of transistor density > Reduction of power consumption  Unable to supply power for all the logic because of chip cooling A chip has multiple logics with different features, to save the peak power consumption. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 7 SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
  • 8.
    RDBMS and itsbottleneck (1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 8 Storage Processor Data RAM Data Size > RAM Data Size < RAM Storage Processor Data RAM In the future? Processor Wide Band RAM Non- volatile RAM Data
  • 9.
    World of currentmemory bottleneck Join, Aggregation, Sort, Projection, ... [strategy] • burstable access pattern • parallel algorithm World of traditional disk-i/o bottleneck SeqScan, IndexScan, ... [strategy] • reduction of i/o (size, count) • distribution of disk (RAID) RDBMS and its bottleneck (2/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 9 Processor RAM Storage bandwidth: multiple hundreds GB/s bandwidth: multiple GB/s
  • 10.
    PG-Strom ▌What is PG-Strom An extension designed for PostgreSQL  Off-loads a part of SQL workloads to GPU for parallel/rapid execution  Three workloads are supported: Full-Scan, Hash-Join, Aggregate (At the moment of Nov-2014, beta version) ▌Concept  Automatic GPU native code generation from SQL query, by JIT compile  Asynchronous execution with CPU/GPU co-operation. ▌Advantage  Works fully transparently from users • It allows to use peripheral software of PostgreSQL, including SQL syntax, backup, HA or drivers.  Performance improvement with GPU+OSS; very low cost ▌Attention  Right now, in-memory data store is assumed DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 10
  • 11.
    PostgreSQL PG-Strom Architecture of PG-Strom(1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 11 GPU Code Generator Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg GPU Program Manager PG-Strom OpenCL Server Message Queue
  • 12.
    Architecture of PG-Strom(2/2) ▌Elemental technology① OpenCL  Parallel computing framework on heterogeneous processors  Can use for CPU parallel, not only GPU of NVIDIA/AMD  Includes run-time compiler in the language specification ▌Elemental technology② Custom-Scan Interface  Feature to implement scan/join by extensions, as if it is built-in logic for SQL processing in PostgreSQL.  A part of functionalities got merged to v9.5, discussions are in-progress for full functionalities. ▌Elemental technology③ Row oriented data structure  Shared buffer of PostgreSQL as DMA source  Data format translation between row and column, even though column format is optimal for GPU performance. Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 13.
    Elemental technology① OpenCL(1/2) ▌Characteristics of OpenCL  Use abstracted “OpenCL device” for CPU/MIC, not only GPUs • Even though it follows characteristics of GPUs below.... Three types of memory layer (global, local, constant) Concept of workgroup; synchronous execution inter-threads  Just-in-time compile from source code like C-language to the platform specific native code ▌Comparison with CUDA  We don’t need to develop JIT compile functionality by ourselves, and want more people to try, so adopted OpenCL at this moment. Page. 13 CUDA OpenCL Advantage • Detailed optimization • Latest feature of NVIDIA GPU • Driver stability • Multiplatform support • Built-in JIT compiler • CPU parallel support Issues • Unavailable on AMD, Intel • Driver stability The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 14.
    Elemental technology① OpenCL(2/2) Page. 14 GPU Source code (text) OpenCL runtime OpenCL Interface OpenCL runtime OpenCL runtime OpenCL Devices N x computing units Global Memory Local Memory Constant Memory cl_program object cl_kernel object DMA buffer-3 DMA buffer-2 DMA buffer-1clBuildProgram() clCreateKernel() Command Queue Just-in-Time Compile Look-up kernel functions Input a pair of kernel and data into command queue The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 15.
    Automatic GPU nativecode generation Page. 15 postgres=# SET pg_strom.show_device_kernel = on; SET postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat; QUERY PLAN -------------------------------------------------------------------------------------------- HashAggregate Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)) Group Key: t0.cat -> Custom (GpuPreAgg) Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer, NULL::double precision, NULL::double precision, NULL::text, pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x) Bulkload: On Kernel Source: #include "opencl_common.h" #include "opencl_gpupreagg.h" #include "opencl_textlib.h" : <...snip...> static bool gpupreagg_qual_eval(__private cl_int *errcode, __global kern_parambuf *kparams, __global kern_data_store *kds, __global kern_data_store *ktoast, size_t kds_index) { pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index); pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index); return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8)); } The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 16.
    How PG-Strom processesSQL workloads① – Hash-Join Page. 16 Inner relation Outer relation Inner relation Outer relation Hash table Hash table Next step Next step All CPU does is just references the result of relations join Hash table search by CPU Projection by CPU Parallel Projection Parallel Hash- table search Existing Hash-Join implementation GpuHashJoin implementation The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 17.
    Benchmark (1/2) –Simple Tables Join [condition of measurement] ▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...]; ▌ All the tables are preliminary loaded ▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980 Page. 17 178.7 334.0 513.6 713.1 941.4 1181.3 1452.2 1753.4 29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Simple Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 18.
    How PG-Strom processesSQL workloads② – Aggregate ▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU ▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size, GPU has advantage to make “partial aggregate” for each million rows. ▌Key performance factor is reducing the job of CPU GroupAggregate Sort SeqScan Tbl_1 Result Set several rows several millions rows several millions rows count(*), avg(X), Y X, Y X, Y X, Y, Z (table definition) GroupAggregate Sort GpuScan Tbl_1 Result Set several rows several hundreds rows several millions rows sum(nrows), avg_ex(nrows, psum), Y Y, nrows, psum_X X, Y X, Y, Z (table definition) GpuPreAgg Y, nrows, psum_X several hundreds rows SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y; The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18
  • 19.
    Benchmark (2/2) –Aggregation + Tables Join [condition of measurement] ▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT; ▌ Other conditions are same with the previous measurement Page. 19 157.4 238.2 328.3 421.7 525.5 619.3 712.8 829.2 44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7 0 100 200 300 400 500 600 700 800 900 1000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Aggregation + Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 20.
    As an aside... Let’sback to the development history prior to the remaining elemental technology DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 20
  • 21.
    Development History ofPG-Strom DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 21 2011 Feb KaiGai moved to Germany May PGconf 2011 2012 Jan The first prototype May Tried to use pseudo code Aug Proposition of background worker and writable FDW 2013 Jul NEC admit PG-Strom development Nov Proposition of CustomScan API 2014 Feb Moved to OpenCL from CUDA Jun GpuScan, GpuHashJoin and GpuSort were implemented Sep Drop GpuSort, instead of GpuPreAgg Nov working beta-1 gets available First prototype announced “Strom” comes from Germany term; where I lived in at that time
  • 22.
    Inspiration – May-2011 DBTech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 22 http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language PGconf 2011 @ Ottawa, Canada You’re brave. Unavailable to run on regular query, not only PL functions?
  • 23.
    Inspired by PgOpenclProject DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 23 A B C D E summary: GPU works well towards large BLOB, like image data. Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011 Tim Child (3DMashUp) Even small width of values, GPU will work well if we have transposition.
  • 24.
    Made a prototype– Christmas vacation in 2011  CUDA based ( Now, OpenCL)  FDW (Foreign Data Wrapper) based ( Now, CustomScan API)  Column oriented data structure ( Now, row-oriented data) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 24 CPU vanilla PostgreSQL PostgreSQL + PG-Strom Iterationofrowfetch andevaluation multiple rows at once Async DMA & Async Kernek Exec CUDA Compiler : Fetch a row from buffer : Evaluation of WHERE clause CPU GPU Synchronization SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10; code for GPU Auto GPU code generation, Just-in-time compile Query response time reduction GPU Device
  • 25.
     A setof APIs to show external data source as like a table of PostgreSQL  FDW driver generates rows on demand when foreign tables are referenced.  Extension can have arbitrary data structure and logic to process the data.  It is also allowed to scan internal data with GPU acceleration! What is FDW (Foreign Data Wrapper) QueryExecutor Regular Table Foreign Table Foreign Table Foreign Table File FDW Oracle FDW PG-Strom FDW storage Regular Table storage QueryPlanner QueryParser Exec Exec Exec Exec SQL Query DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 25 CSV Files ....... .... .... ... ... ... . ... ... . .... .. .... .. . .. .. . . . .. . Other DBMS
  • 26.
    Implementation at thattime ▌Problem  Only foreign tables can be accelerated with GPUs.  Only full table-scan can be supported.  Foreign tables were read-only at that time.  Slow-down of 1st time query because of device initialization. summary: Not easy to use PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26 ForeignTable(pgstrom) value a[] rowmap value b[] value c[] value d[] <not used> <not used> Table: my_schema.ft1.b.cs 10300 {10.23, 7.54, 5.43, … } 10100 {2.4, 5.6, 4.95, … } ② Calculation ① Transfer ③ Write-Back Table: my_schema.ft1.c.cs {‘2010-10-21’, …} {‘2011-01-23’, …} {‘2011-08-17’, …} 10100 10200 10300 Shadow tables on behalf of each column of the foreign table managed by PG-Strom. Each items are stored closely in physical, using large array data type of element data.
  • 27.
    PostgreSQL Enhancement (1/2)– Writable FDW ▌Foreign tables had supported read access only (~v9.2)  Data load onto shadow tables was supported using special function Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 27 PostgreSQL Data Source Query Planning SELECT DELETE UPDATE INSERT FDW Driver Enhancement of this interface
  • 28.
    PostgreSQL Enhancement (2/2)– Background Worker ▌Slowdown on first time of each query  Time for JIT compile of GPU code  caching the built binaries  Time for initialization of GPU devices  initialization once by worker process Good side effect: it works with run-time that doesn’t support concurrent usage. ▌Background Worker  An interface allows to launch background processes managed by extensions  Dynamic registration gets supported on v9.4 also. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 28 PostmasterProcess PostgreSQL Backend Process Built-in BG workers Extension’s BG workers fork(2) on connection fork(2) on startup IPC:sharedmemory,... fork(2) on startup / demand 5432 Port
  • 29.
    Design Pivot ▌Is theFDW really appropriate framework for PG-Strom?  It originated from a feature to show external data as like a table. [benefit] • It is already supported on PostgreSQL [Issue] • Less transparency for users / applications • More efficient execution path than GPU; like index-scan if available • Workloads expects for full-table scan; like tables join ▌CUDA or OpenCL?  First priority is many trials by wider range of people DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 29 CUDA OpenCL Good • Fine grained optimization • Latest feature of NVIDIA GPU • Reliability of OpenCL drivers • Multiplatform support • Built-in JIT compiler • CPU parallel support Bad • No support on AMD, Intel • Reliability of OpenCL drivers
  • 30.
    Elemental Technology② –Custom-Scan Interface Page. 30 Table.A Table to be scanned clause id > 200 Path① SeqScan cost=400 Path② TidScan unavailable Path③ IndexScan cost=10 Path④ CustomScan (GpuScan) cost=40 Built-in Logics Table.X Table.Y Tables to be joined Path① NestLoop cost=8000 Path② HashJoin cost=800 Path③ MergeJoin cost=1200 Path④ CustomScan (GpuHashJoin) cost=500 Built-in Logics clause x.id = y.id Path③ IndexScan cost=10 Table.A WINNER! Table.X Table.Y Path④ CustomScan (GpuHashJoin) cost=500 clause x.id = y.id WINNER! It allows extensions to offer alternative scan or join logics, in addition to the built-in ones The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL clause id > 200
  • 31.
    Yes!! Custom-Scan Interfacegot merged to v9.5 Page. 31 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL Minimum consensus is, an interface that allows extensions to implement custom logic to replace built-in scan logics. ↓ Then, we will discuss the enhancement of the interface for tables join on the developer’s community.
  • 32.
    Enhancement of Custom-ScanInterface The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32 Table.X Table.Y Tables to be joined Path ① NestLoop Path ② HashJoin Path ③ MergeJoin Path ④ CustomScan (GpuHashJoin) Built-in logics clause x.id = y.id X Y X Y NestLoop HashJoin X Y MergeJoin X Y GpuHashJoin Result set of tables join It looks like a relation scan on the result set of tables join  It replaces a node to be join by foreign-/custom-scan.  Example: A FDW that scan on the result set of remote join. ▌Join replacement by foreign-/custom-scan
  • 33.
    Abuse of Custom-ScanInterface ▌Usual interface contract  GpuScan/SeqScan fetches rows, then passed to upper node and more ...  TupleTableSlot is used to exchange record row by row manner. ▌Beyond the interface contract  In case when both of parent and child nodes are managed by same extension, nobody prohibit to exchange chunk of data with its own data structure.  “Bulkload: On”  multiple (usually, 10K~100K) rows at once Page. 33 postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1; QUERY PLAN ----------------------------------------------------------------------------- Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106) hash clause 1: (t0.aid = t1.aid) Bulkload: On -> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73) -> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37) hash keys: aid Buckets: 46000 Batches: 1 Memory Usage: 99.97% -> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37) Planning time: 0.220 ms (9 rows) The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 34.
    PostgreSQL PG-Strom Element Technology③ –Row oriented data structure Page. 34 Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg DMA Transfer (via PCI-E bus) GPU Program Manager PG-Strom OpenCL Server Message Queue T-Tree Columnar Cache GPU Code Generator Cache construction (rowcolumn) Materialize the result (columnrow) Prior implementation had table cache with column-oriented data structure The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 35.
    Why GPU wellfit column-oriented data structure Page. 35 SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made Core Core Core Core Core Core Core Core Core Core SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels coalesced memory access Global Memory (DRAM) Wide Memory Bandwidth (256-384bits) WARP: A set of processor cores that share instruction pointer. It is usually 32. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 36.
    Format of PostgreSQLtuples Page. 36 struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; } t_choice; /* current TID of this or newer tuple */ ItemPointerData t_ctid; /* number of attributes + various flags */ uint16 t_infomask2; /* various flag bits, see below */ uint16 t_infomask; /* sizeof header incl. bitmap, padding */ uint8 t_hoff; /* ^ - 23 bytes - ^ */ /* bitmap of NULLs -- VARIABLE LENGTH */ bits8 t_bits[1]; /* MORE DATA FOLLOWS AT END OF STRUCT */ }; HeapTupleHeader NULL bitmap (if has null) OID of row (if exists) Padding 1st Column 2nd Column 4th Column Nth Column No data if NULL Variable length fields makes unavailable to predicate offset of the later fields. No NULL Bitmap if all the fields are not NULL We usually have no OID of row The worst data structure for GPU processor The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 37.
    Nightmare of columnarcache (1/2) ▌列指向キャッシュと行列変換 Page. 37 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ------------------------------------------------------------------------------ Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 144 total time to load: 584.67ms total time to materialize: 7245.14ms  70% of total execution time!! average time in send-mq: 37us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470 DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144 kernel exec: total: 441.71ms, avg: 3067us, count: 144 -> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1) -> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1) Execution time: 10525.754 ms The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 38.
    Nightmare of columnarcache (2/2) Page. 38 Translation from table (row-format) to the columnar-cache (only at once) Translation from PG- Strom internal (column-format) to TupleTableSlot (row-format) for data exchange in PostgreSQL [Hash-Join] Search the relevant records on inner/outer relations. Breakdown of execution time You cannot see the wood for the trees (木を見て森を見ず) Optimization in GPU also makes additional data-format translation cost (not ignorable) on the interface of PostgreSQL The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 39.
    A significant pointin GPU acceleration (1/2) Page. 39 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer T-Tree columnar cache Result Buffer TupleTableSlot Task of CPU RowColumn translation on cache construction Very heavy loads CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to column-format, only referenced data shall be transformed. Task of CPU ColumnRow translation everytime when we returns the result to PostgreSQL. Very heavy loads Task of GPU Operations on the data with column-format. It is optimized data according to GPU’s nature In case of column-format The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 40.
    A significant pointin GPU acceleration (2/2) Page. 40 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer Result Buffer TupleTableSlot Task of CPU Visibility check prior to DMA translation Light Load CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to row-format, all data shall be transformed. A little heavy load Task of CPU Set pointer of the result buffer Very Light Load Task of GPU Operations on row-data. Although not an optimal data, massive cores covers its performance In case of row-format No own columnar cache The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 41.
    Integration with shared-bufferof PostgreSQL ▌列指向キャッシュと行列変換 Page. 41 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ----------------------------------------------------------- Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 145 total time for inner load: 29.80ms total time for outer load: 812.50ms total time to materialize: 1527.95ms  Reduction dramatically average time in send-mq: 61us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619 DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290 proj kernel exec: total: 264.47ms, avg: 1823us, count: 145 main kernel exec: total: 622.83ms, avg: 4295us, count: 145 -> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1) -> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1) Execution time: 5161.017 ms  Much better response time Performance degrading little bit The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 42.
    PG-Strom’s current ▌Supported logics GpuScan ... Full-table scan with qualifiers  GpuHashJoin ... Hash-Join with GPUs  GpuPreAgg ... Preprocess of aggregation with GPUs ▌Supported data-types  Integer (smallint, integer, bigint)  Floating-point (real, float)  String (text, varchar(n), char(n))  Date and time (date, time, timestamp)  NUMERIC ▌Supported functions  arithmetic operators for above data-types  comparison operators for above data-types  mathematical functions on floating-points  Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR Page. 42 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 43.
    PG-Strom’s future ▌Logics willbe supported  Sort  Aggregate Push-down  ... anything else? ▌Data types will be supported  ... anything else? ▌Functions will be supported  LIKE, Regular Expression?  Date/Time extraction?  PostGIS?  Geometrics?  User defined functions? (like pgOpenCL) ▌Parallel Scan Page. 43 Shared Buffer block-0 block-Nblock-N/3 block-2N/3 Range Partitioning & Parallel Scan Current Future Not easy to supply GPU enough data with single CPU core in spite of on-memory store. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 44.
    Our target segment ▌1sttarget application: BI tools are expected ▌Up-to 1.0TB data: SMB company or branches of large company  Because it needs to keep the data in-memory ▌GPU and PostgreSQL: both of them are inexpensive. Page. 44 High-end data analysis appliance Commercial or OSS RDBMS large response-time / batch process small response-time / real-time smallerdata-sizelargerdata-size < 1.0TB > 1.0TB Cost advantage Performance advantage PG-Strom Hadoop? The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 45.
    (Ref) ROLAP andMOLAP ▌Expected PG-Strom usage  Backend RDBMS for ROLAP / acceleration of ad-hoc queries  Cube construction for MOLAP / acceleration of batch processing Page. 45 ERP SCM CRM Finance DWH DWH world of transaction (OLTP) world of analytics (OLAP) ETL ROLAP: DB summarizes the data on demand by BI tools MOLAP: BI tools reference information cube; preliminary constructed The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 46.
    Expected Usage (1/2)– OLTP/OLAP Integration ▌Why separated OLTP/OLAP system  Integration of multiple data source  Optimization for analytic workloads ▌How PG-Strom will improve...  Parallel execution of Join / Aggregate; being key of performance  Elimination or reduction of OLAP / ETL, thus smaller amount of TCO DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 46 ERPCRMSCM BI OLTP database OLAP database ETL OLAP CubesMaster / Fact Tables BI PG-Strom Performance Key • Join master and fact tables • Aggregation functions
  • 47.
    Expected Usage (2/2)– Let’s investigate with us ▌Which region PG-Strom can work for? ▌Which workloads PG-Strom shall fit? ▌What workloads you concern about?  PG-Strom Project want to lean from the field. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 47
  • 48.
    How to use(1/3) – Installation Prerequisites ▌OS: Linux (RHEL 6.x was validated) ▌PostgreSQL 9.5devel (with Custom-Plan Interface) ▌PG-Strom Module ▌OpenCL Driver (like nVIDIA run-time) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 48 shared_preload_libraries = '$libdir/pg_strom‘ shared_buffers = <enough size to load whole of the database> Minimum configuration of PG-Strom postgres=# SET pg_strom.enabled = on; SET Turn on/off PG-Strom at run-time
  • 49.
    How to use(2/3) – Build, Install and Starup DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 49 [kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom [kaigai@saba ~]$ cd pg_strom [kaigai@saba pg_strom]$ make && make install [kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf [kaigai@saba ~]$ pg_ctl start server starting [kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server" LOG: starting background worker process "PG-Strom OpenCL Server" LOG: database system was shut down at 2014-11-09 17:45:51 JST LOG: autovacuum launcher started LOG: database system is ready to accept connections LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB) LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB) LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed LOG: PG-Strom: Device "GeForce GTX 980" was installed LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB) LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB) LOG: Starting PG-Strom OpenCL Server LOG: PG-Strom: 24 of server threads are up
  • 50.
    How to use(3/3) – Deployment on AWS Page. 50 Search by “strom” ! AWS GPU Instance (g2.2xlarge) CPU Xeon E5-2670 (8 xCPU) RAM 15GB GPU NVIDIA GRID K2 (1536core) Storage 60GB of SSD Price $0.898/hour, $646.56/mon (*) Price for on-demand instance on Tokyo region at Nov-2014 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 51.
    Future of PG-Strom DBTech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 51
  • 52.
    Dilemma of Innovation DBTech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 52 SOURCE: The Innovator's Dilemma, Clayton M. Christensen Performance demanded at the high end of the market Performance demanded at the low end of the marker or in a new emerging segment ProductPerformance Time
  • 53.
    Move forward withcommunity DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 53
  • 54.
    (Additional comment afterthe conference) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 54 check it out! https://github.com/pg-strom/devel Oops, I forgot to say in the conference! PG-Strom is open source.