SQL Server Engine: CPU Cache as the New RAM (DBA Level 400)
About me An independent SQL Consultant A user of SQL Server from version 2000 onwards with 12+ years experience. Speaker, both at UK user group events and at conferences. I have a passion for understanding how the database engine works at a deep level.
“Everything fits in memory, so performance is as good as it will get. It fits in memory therefore end of story”
Demonstration #1 A Simple Query That Defies Conventional Wisdom
Demonstration #1 Repeated With 'Bigger' Hardware: CPU 6 core 2.0 GHz (Sandybridge), 48 GB quad-channel 1333 MHz DDR3 memory. Hyper-threading enabled, unless specified otherwise. A warm large object cache was used in all tests to remove storage as a factor.
Which SELECT Statement Has The Lowest Elapsed Time? 17.41 MB column store vs. 51.7 MB column store.

Non-sorted column store:

WITH generator AS (
    SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
    FROM  (SELECT a = 1 FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigNoSort
FROM   generator
CROSS JOIN [dbo].[DimDate] d

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigNoSort

SELECT CalendarQuarter
      ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigNoSort] f
JOIN   [DimDate] d ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

Sorted column store:

WITH generator AS (
    SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
    FROM  (SELECT a = 1 FROM master.dbo.syscolumns) c1
    CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
      ,CAST(((id - 1) % 1048576) AS money) AS Price1
      ,CAST(((id - 1) % 1048576) AS money) AS Price2
      ,CAST(((id - 1) % 1048576) AS money) AS Price3
INTO   FactInternetSalesBigSorted
FROM   generator
CROSS JOIN [dbo].[DimDate] d

CREATE CLUSTERED INDEX ccsi ON FactInternetSalesBigSorted ( OrderDateKey )

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigSorted WITH (DROP_EXISTING = ON)

SELECT CalendarQuarter
      ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3])
FROM   [dbo].[FactInternetSalesBigSorted] f
JOIN   [DimDate] d ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter

Which is the fastest?
The Case of The Two Column Store Index Sizes. SQL Server query tuning 101: the optimizer will always use the smallest data structure it can find to satisfy the query, right?
How Well Do The Queries Using The Two Column Stores Scale? [Chart: elapsed time (ms) versus degree of parallelism, 2 to 24, for the non-sorted and sorted column stores.] The data creation statement was scaled using TOP 300,000 to create 1,095,600,000 rows.
Can We Use All Available CPU Resource? [Chart: percentage CPU utilization versus degree of parallelism, 2 to 24, for the non-sorted and sorted column stores.] Memory access should be 100% CPU intensive?!?
Wait Statistics Do Not Help Here! Stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data.
Spin Locks Do Not Provide Any Clues Either. Executes in 775 ms for a warm column store object pool. 12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles. Total spins: 293,491.

SELECT [CalendarQuarter]
      ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3])
FROM  [FactInternetSalesBig] f
JOIN  [DimDate] d ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
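The cycle budget above can be sanity-checked in a few lines; a back-of-envelope sketch using only the figures quoted on this slide:

```python
# Cycle budget for the 775 ms query on 12 cores at 2.0 GHz, versus the
# observed spinlock spin count: spins consume a vanishingly small fraction.
cores = 12
clock_hz = 2_000_000_000          # 2.0 GHz
elapsed_s = 0.775

available_cycles = cores * clock_hz * elapsed_s   # 18.6 billion cycles
total_spins = 293_491
spin_fraction = total_spins / available_cycles    # well under 0.01%
```

Even if every spin burned a full cycle, spinning cannot account for where the CPU time is going.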
Well Documented Tools: Dynamic management views and functions, performance counters, extended events. Not all wait events, latches and spin locks are well documented, if documented at all. Takeaway: these tools are not going to tell us definitively where our CPU time is going!
The Problem With Query Plan Costs. Assumptions: The buffer cache is cold. IO cannot be performed in parallel. Data in different columns is never correlated (improved on in SQL Server 2014). Hash distribution is always uniform. Etc. Costs are based on the amount of time it took a developer's machine to complete certain operations, a Dell OptiPlex (according to legend).
Response Time = Wait Time + Service Time. We need something to help us find out what is happening here.
A Not So Well Documented Tool
Introducing the Windows Performance Toolkit. Comes as part of the Windows Assessment and Deployment Kit. Traces are created via xperf and Windows Performance Recorder. It utilises Event Tracing for Windows (ETW). Visualise traces using Windows Performance Analyzer. The real power is being able to stack walk the database engine.
Public Symbols These are labels contained in .pdb files that provide information on what programming construct generated a piece of machine code, also known as debug symbols. Caution: public symbols are version specific, down to CU level !
Obtaining An ETW Trace, Stack Walking The Database Engine: xperf -on base -stackwalk profile -> run the SQL statement -> xperf -d stackwalk.etl -> analyse the trace in WPA.
Basic xperf Command Line Syntax: xperf -on <kernel flags | kernel groups> [-stackwalk <stackwalk kernel providers>]. Kernel groups are groups of flags, not to be confused with Windows kernel groups. Takeaway: kernel groups make life easier.
ETW Tracing Is Generally Lightweight. This comes from the Premier Field Engineer blog post "Identifying the cause of SQL Server IO bottlenecks using XPerf": "it reduced my IO throughput from 3300 MB/s to 300 MB/s"

XPERF -on PROC_THREAD+LOADER+
    FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+
    FILENAME+FILE_IO+FILE_IO_INIT+
    DISK_IO+HARD_FAULTS+DPC+
    INTERRUPT+CSWITCH+PROFILE+DRIVERS+DISPATCHER
-stackwalk MiniFilterPreOpInit+MiniFilterPostOpInit+
    CSWITCH+PROFILE+ThreadCreate+ReadyThread+
    DiskReadInit+DiskWriteInit+DiskFlushInit+
    FileCreate+FileCleanup+FileClose+FileRead+FileWrite
-BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

Takeaway: start off with the smallest set of kernel providers you can get away with and add incrementally.
The Database Engine Composition ( SQL Server 2012 onwards ) Database Engine Language Processing: sqllang.dll Runtime: sqlmin.dll, sqltst.dll, qds.dll SQLOS: sqldk.dll, sqlos.dll
Demonstration #2 Stack Walking an ETW Trace Of Our Simple Query
Demonstration #3 Investigating IO Sizes Via an ETW Trace File
Demonstration #4 Quantifying The Cost Of Extracting Row Values During A Table Scan
Core 2 Architecture. [Diagram: two cores, each with a 32KB L1 instruction cache, 32KB L1 data cache and 256K unified L2 cache; the CPU talks to the north bridge and south bridge over the front side bus.]
Core 2 Architecture  Latency when talking to IO and memory controllers even before accessing memory or IO.  Single-level TLB.  Hyper-threading via NetBurst, which delivered poor performance.  The design was not modular: a 4-core CPU was two 2-core CPUs "bolted together".  Only the last generation, "Dunnington", had an L3 cache.
Core i Series Generation 1 'Nehalem'. [Diagram: cores with 32KB L1 instruction/data caches and 256K unified L2 caches, a shared L3 cache, and an un-core housing the TLB, memory controller, QPI links and power/clock logic.]  Integrated memory controller.  NetBurst replaced with a new hyper-threading technology.  The Translation Lookaside Buffer (TLB) for caching logical to physical memory mappings has two levels.  Front side bus replaced by the QuickPath Interconnect (QPI).  Genuine modular design.
Core i Series Generation 2 'Sandybridge'. [Diagram: as Nehalem, plus an L0 uop cache per core, a bi-directional ring bus linking the cores to the L3 cache, and an integrated PCIe 2.0 controller in the un-core.]  Integrated PCIe 2.0 controller.  Level 0 uop cache.  L3 cache connected to the cores via a bi-directional ring bus.  Advanced Vector eXtensions v1.  Data Direct IO.
Cache Lines (64 bytes on a modern x86 CPU)  The unit of transfer between memory and the CPU.  Cache lines are used to create "cache entries", which are tagged with the requested memory location.  Data structures should be cache line aligned in order to avoid split register operations.  If different cores write to the same cache line, the line effectively "bounces" between their caches => try to avoid threads sharing cache lines.
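To make the line granularity concrete, here is a small illustrative sketch (the 64-byte line size is the current x86 norm) of which line an address falls in, and how to pad per-thread structures so two of them can never share a line:

```python
# Cache-line arithmetic: 64-byte lines assumed, as on current x86 CPUs.
CACHE_LINE = 64

def line_index(addr: int) -> int:
    """Which cache line a byte address falls in."""
    return addr // CACHE_LINE

def padded_size(struct_size: int) -> int:
    """Round a structure size up to a cache-line boundary, so adjacent
    per-thread structures (e.g. counters) never false-share a line."""
    return -(-struct_size // CACHE_LINE) * CACHE_LINE
```

Two 8-byte counters packed back to back share one line; padding each to `padded_size(8)` bytes keeps them on separate lines at the cost of some memory.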
CPU Cache, Memory and IO Sub-System Latency. [Diagram: the latency ladder runs from ~1 ns for the L1 caches, ~10 ns for L2/L3, ~100 ns for main memory, 10 to 100 us for flash, to ~10 ms for disk.] But memory is the new flash storage and CPU cache is the new RAM . . . Slide borrowed from Thomas Kejser with his kind permission.
The CPU Cache Hierarchy: Latencies In CPU Cycles
L1 cache, sequential access: 4
L1 cache, in-page random access: 4
L1 cache, full random access: 4
L2 cache, sequential access: 11
L2 cache, in-page random access: 11
L2 cache, full random access: 11
L3 cache, sequential access: 14
L3 cache, in-page random access: 18
L3 cache, full random access: 38
Main memory: 167
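These cycle costs fold into a rough average-memory-access-time model. A hedged sketch: only the cycle latencies come from the list above; the hit rates passed in below are made-up illustrative values:

```python
# Average memory access time (AMAT) sketch using the cycle latencies above:
# L1 ~4, L2 ~11, L3 (full random) ~38, main memory ~167 cycles.
L1, L2, L3, DRAM = 4, 11, 38, 167

def amat(h1: float, h2: float, h3: float) -> float:
    """Probability-weighted access cost: each miss falls through to the
    next, slower level of the hierarchy."""
    return (h1 * L1
            + (1 - h1) * h2 * L2
            + (1 - h1) * (1 - h2) * h3 * L3
            + (1 - h1) * (1 - h2) * (1 - h3) * DRAM)
```

With good locality (say 90/80/70% hit rates) the average access costs a handful of cycles; as the hit rates collapse, the 167-cycle memory term quickly dominates.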
Main Memory Is Holding The CPU Back, Solutions:  Leverage the pre-fetcher as much as possible.  Larger CPU caches.  L4 cache => Crystalwell eDRAM.  DDR4 memory.  Bypass main memory.  Stacked memory: hybrid memory cubes (Micron, Intel etc.), high bandwidth memory (AMD, Hynix).
What The Pre-Fetcher Loves Column Store index Sequential scan
What The Pre-Fetcher Hates ! Hash Table Can this be improved ?
Bypassing Main Memory With Data Direct IO. [Diagram: in the old world, IO lands in main memory before the core can touch it; in the new world with Data Direct IO, IO is delivered straight into the L3 cache.]
Transactions/s With and Without Data Direct IO. [Chart: single-socket IO performance, transactions/sec with 2, 4, 6 and 8 x 10 GbE interfaces, Xeon 5600 versus Xeon E5.]
The Case Of The Two Column Store Index Sizes. Could the CPU cycles required to access the CPU cache versus main memory be a factor? Let's dig deeper.
Call Stack For Query Against Column Store On Non-Pre-Sorted Data. Hash agg lookup weight: 65,329.87. Column store scan weight: 28,488.73.
Where Is The Bottleneck In The Plan? [Plan diagram: control flow and data flow.] The stack trace is indicating that the bottleneck is right here.
Does The Stream Aggregate Perform Any Better? (sorted column store)  The query takes seven minutes and nine seconds to run.  Performance is killed by a huge row-mode sort prior to the stream aggregate!
Call Stack For Query Against Column Store On Pre-Sorted Data. Hash agg lookup weight: now 275.00, before 65,329.87. Column store scan weight: now 45,764.07, before 28,488.73.
The Case Of The Two Column Stores. The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why?
SQL Server and Large Memory Pages, How Does It Work? [Diagram: the core's DTLB (1st level) and STLB (2nd level) resolve logical to physical mappings in ~10s of CPU cycles; on a miss, the walk of the page translation table in memory costs 160+ CPU cycles.]
The TLB Without Large Page Support: 32 TLB entries x 4 KB pages = 128 KB of logical to physical memory mappings covered* (*Nehalem through to Ivybridge architectures).
The TLB With Large Page Support: the same 32 TLB entries x 2 MB pages increase the coverage from 128 KB to 64 MB! Fewer trips off the CPU to the page table.
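The TLB reach arithmetic behind these two slides is straightforward; a sketch using the 32-entry figure quoted above:

```python
# TLB reach: 32 entries covering 4 KB pages versus the same 32 entries
# covering 2 MB large pages.
ENTRIES = 32
KB, MB = 1024, 1024 * 1024

small_page_reach = ENTRIES * 4 * KB   # 128 KB of mappings covered
large_page_reach = ENTRIES * 2 * MB   # 64 MB of mappings covered
reach_multiplier = large_page_reach // small_page_reach   # 512x
```

The same number of TLB entries covers 512 times more address space, which is why large pages cut the number of page-table walks so sharply.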
What Difference Do Large Pages Make? (single 6-core socket) Large page support off: 130 s elapsed time, 222,220 page lookups/s. Large page support on: 93 s elapsed time, 263,766 page lookups/s. A 28% saving.
CPU Pipeline Architecture. [Diagram: allocation at the front end, retirement at the back end.]  A "pipeline" of logical slots runs through the processor.  The front end can issue four micro-ops per clock cycle.  The back end can retire up to four micro-ops per clock cycle.
Pipeline 'Bubbles' Are Bad!  Empty slots are referred to as "bubbles".  Causes of front end bubbles: bad speculation; CPU stalls; data dependencies, e.g. A = B + C then E = A + D, where the second add must wait for A.  Back end bubbles can be due to excessive demand for specific types of execution unit.
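One way to see why data dependencies matter: the depth of the dependency chain bounds how much work the core can overlap. A hypothetical sketch, comparing a serial sum (each add depends on the previous one) with a pairwise tree reduction of the same values:

```python
# Dependency-chain depth: a serial sum of n values is a chain of n-1
# dependent adds; a pairwise (tree) reduction has depth ceil(log2(n)),
# exposing independent adds the out-of-order core can execute in parallel.
import math

def serial_depth(n: int) -> int:
    return n - 1

def tree_depth(n: int) -> int:
    return math.ceil(math.log2(n)) if n > 1 else 0
```

Both orderings compute the same sum, but the tree shape leaves far fewer cycles where the pipeline has nothing independent to issue.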
A Basic NUMA Architecture Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Remote memory access Local memory access Local memory access NUMA Node 0 NUMA Node 1
Four and Eight Node NUMA QPI Topologies (Nehalem i7 onwards). [Diagram: 4-socket and 8-socket QPI interconnect topologies with IO hubs.] With 18-core Xeons in the offing, these topologies will become increasingly rare.
Remote Node Memory Access Comes At A Cost: an additional 20% overhead when accessing "foreign" memory (figure from coreinfo).
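That penalty composes linearly with the fraction of accesses that land on the remote node. A small hedged model; the ~20% penalty is the figure quoted above, the 100 ns local latency is purely illustrative:

```python
# Blended memory latency when a fraction of accesses hit the remote NUMA
# node, using the ~20% remote-access penalty quoted on this slide.
def blended_latency(local_ns: float, remote_fraction: float,
                    penalty: float = 0.20) -> float:
    remote_ns = local_ns * (1 + penalty)
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns
```

This is why NUMA-aware scheduling matters: keeping `remote_fraction` near zero keeps the blended latency at the local figure.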
Local Vs Remote Memory Access and Thread Locality. How does SQLOS schedule hyper-threads in relation to physical cores? (6 cores per socket) [Diagram: CPU sockets 0 and 1, cores 0 to 5.]
Making Use Of CPU Stalls With Hyper-Threading (Nehalem i7 onwards): 1. Session 1 performs an index seek; the B-tree pages are not in the CPU cache, causing a last level cache miss. 2. A CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory. 3. The "dead" CPU stall cycles give the physical core the opportunity to run a second thread.
Hyper-Threading Scalability For The 'Sorted' Column Store Index. [Chart: CPU utilisation climbs roughly linearly from 5% at DOP 2 to 59% at DOP 24.]  ~5% CPU utilisation per core.  ~60% of each core utilised by the first hyper-thread.
The Rationale Behind SQL OS: one scheduler per logical processor.  The OS has no concept of SQL resources such as latches.  "Hard" kernel context switches become soft user-mode context switches.  The SQL OS scheduler assigns threads by prioritizing L2 cache hits and reuse over "fairness".
The LOGCACHE_ACCESS Spin Lock. [Diagram: sessions T0..Tn take the spin lock to obtain a buffer offset (cache line), allocate a slot in the log buffer (slots 1 to 127) and MemCpy the slot content; on WRITELOG the log writer drains the log flush queue via the async I/O completion port and signals the thread which issued the commit.] This spin lock is the ultimate bottleneck for ACID transactions in SQL Server. Slide borrowed from Thomas Kejser with his kind permission.
Spinlocks and Context Switches: the log writer needs to be able to free the spin lock's cache line as fast as possible.
Keeping The Rest Of The Database Engine Away From The Log Writer Thread: the log writer thread is always assigned to the first free CPU core.
A Simple Test To Create LOGCACHE_ACCESS Spin Lock Pressure  Create a table that ensures there is one row per page, in order to eliminate PAGELATCH_EX latch activity.  Insert 1,000 rows into the table; the SPIDs for the 22 sessions should fall in the range 1 to 1,000.  A procedure performs a single-row update 10,000 times for the SPID of the session it is running in.
Results: default CPU affinity mask: 234,915,993 spins. Core 0 removed from the affinity mask: 10,554,325 spins. An order of magnitude saving!
 The log writer thread is always assigned to core 0.  By isolating the rest of the database engine from core 0, the log writer does not have to contend with so many context switches when handing off and receiving back the log cache access spin lock. How Has This Drop In Spins Been Achieved ?
Hypothesis. Column store on non-pre-sorted data: the segment scan probes the hash table at random, so lookups are likely to land at the high-latency end of the cache hierarchy (random access). Column store on pre-sorted data: the segment scan probes the hash table sequentially, so lookups are likely to land at the low-latency end of the cache hierarchy (sequential access).
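The hypothesis can be mimicked with a toy model: count how often consecutive probes switch to a different key, a crude proxy for hopping between hash-table cache lines. The key distribution below is made up (8 distinct "quarters"), purely to illustrate the effect of pre-sorting:

```python
# Toy model: hash-aggregate probe locality for sorted versus unsorted
# key streams. A "key switch" between consecutive probes stands in for a
# jump to a different hash-table cache line.
import random

def key_switches(keys):
    return sum(1 for a, b in zip(keys, keys[1:]) if a != b)

random.seed(42)
unsorted_keys = [random.randrange(8) for _ in range(10_000)]  # 8 quarters
sorted_keys = sorted(unsorted_keys)
```

With sorted input the probes arrive in long runs of the same key (at most 7 switches for 8 distinct values), while the unsorted stream switches thousands of times, which is the access pattern the pre-fetcher and cache hierarchy punish.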
The Case Of The 60% CPU Utilisation Ceiling. Column store scans are pre-fetcher friendly; hash joins and hash aggregates are not. Can a hash join or aggregate keep up with a column store scan?
Introducing Intel VTune Amplifier XE Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set. Refer to Appendix D for an overview of what “General exploration” provides.
VTune Can Go One Better Than WPT, Down To CPU Cycle Level ! 181,578,272,367 versus 466,000,699 clock cycles !!!
This Is What The CPU Stall (LLC Miss) Picture Looks Like Against DOP

DOP   Non-sorted        Sorted
2     13,200,924        3,000,210
4     30,902,163        1,200,084
6     161,411,298       16,203,164
8     1,835,828,499     29,102,037
10    2,069,544,858     34,802,436
12    4,580,720,628     35,102,457
14    2,796,495,741     48,903,413
16    3,080,615,628     64,204,494
18    3,950,376,507     63,004,410
20    4,419,593,391     85,205,964
22    4,952,446,647     68,404,788
24    5,311,271,763     72,605,082
Does The OrderDateKey Column Fit In The L3 Cache?

Table Name                  Column Name   Size (KB)
FactInternetSalesBigNoSort  OrderDateKey  1,786,182
FactInternetSalesBigNoSort  Price1        3,871
FactInternetSalesBigNoSort  Price2        3,871
FactInternetSalesBigNoSort  Price3        3,871
FactInternetSalesBigSorted  OrderDateKey  738
FactInternetSalesBigSorted  Price1        2,965,127
FactInternetSalesBigSorted  Price2        2,965,127
FactInternetSalesBigSorted  Price3        2,965,127

No: the L3 cache is 20 MB in size.
The Case Of The Two Column Store Index Sizes: Conclusion Turning the memory access on the hash aggregate table from random to sequential probes = CPU savings > cost of scanning an enlarged column store
Batch Mode Hash Joins: Skew Is Also A Factor In Batch Mode Hash Join Performance. Row mode: each thread runs its own hash join fed by exchanges; it is expensive to repartition the build and probe inputs, and data skew reduces parallelism. Batch mode: all threads build into and probe a shared hash table; no repartitioning is needed, and data skew actually speeds up processing. Slide borrowed from Thomas Kejser with his kind permission.
The Case Of The 60% CPU Utilisation Ceiling: still not solved. Tuning 101: if you cannot max out the CPU capacity or IO bandwidth, there must be some form of contention at play . . .
What Does Digging Deep Into The Call Stack Tell Us ? Throttling !
The Case Of The 60% CPU Utilisation Ceiling: Conclusion. The hash aggregate cannot keep up with the column store scan, so the batch engine throttles the column store scan by issuing sleep system calls! The Integration Services engine does something very similar; refer to "back pressure".
The Case of The CPU Pressure Point. Where are the pressure points on the CPU, and what can be done to resolve them?
Making Efficient Use Of The CPU In The "In Memory" World. Front-end pressure: the front end issues fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls, bad speculation, data dependencies). Back-end pressure: retirement is throttled due to pressure on back-end resources (port saturation).
Lots Of KPIs To Choose From, Which To Select?  CPU cycles per retired instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad.  Front end bound: reflects the front end under-supplying the back end with work; this figure should be as close to zero as possible.  Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units.
These Are The Pressure Point Statistics For The 'Sorted' Column Store. [Chart: CPI, front end bound and back end bound versus degree of parallelism, 2 to 24.] Refer to Appendix C for the formulae from which these metrics are derived.
Which Parts Of The Database Engine Are Suffering Backend Pressure ? Results obtained for a degree of parallelism of 24.
Back-end Pressure Can Manifest As "Port Saturation" (*Sandybridge core). The hot function CBagAggregateExpression::TryAggregateUsingQE_Pure* leans on the 256-bit FMUL and Blend execution units for the query:

SELECT [CalendarQuarter]
      ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3])
FROM  [FactInternetSalesBig] f
JOIN  [DimDate] d ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Port Saturation Analysis. Hit ratio per port: port 0: 0.46, port 1: 0.45, port 2: 0.16, port 3: 0.17, port 4: 0.1, port 5: 0.55. A ratio of 0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60%, we may start to see this on ports 0, 1 and 5.
Back end Pressure Can we help the back end keep up with the front end ?
Single Instruction Multiple Data (SIMD)  A class of CPU instruction that processes multiple data points simultaneously.  A form of vectorised processing.  Once CPU stalls are minimised, the challenge becomes processing data on the CPU (the rate of instruction retirement) as fast as possible.
Lowering Clock Cycles Per Instruction By Leveraging SIMD. Using conventional processing, adding two arrays of four elements each requires four instructions: A(1) + B(1) = C(1), A(2) + B(2) = C(2), A(3) + B(3) = C(3), A(4) + B(4) = C(4).
Lowering Clock Cycles Per Instruction By Leveraging SIMD. Using a SIMD ("single instruction, multiple data") instruction, the same addition of A(1..4) and B(1..4) into C(1..4) can be performed with a single instruction.
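Python has no vector registers, but the lane idea above can be sketched: the chunked version below performs one "operation" per 4-wide chunk instead of one per element. This is conceptual only, there is no real speedup in Python; the point is the per-chunk versus per-element operation count:

```python
# Scalar add: one operation per element.
def add_scalar(a, b):
    return [x + y for x, y in zip(a, b)]

# "SIMD-style" add: one operation per 4-element lane group, mimicking a
# 4-wide vector instruction.
def add_simd4(a, b):
    out = []
    for i in range(0, len(a), 4):           # one "instruction" per chunk
        out.extend(x + y for x, y in zip(a[i:i + 4], b[i:i + 4]))
    return out
```

For two 8-element arrays the scalar version conceptually issues 8 adds, the 4-wide version only 2, which is the cycles-per-element saving the real AVX hardware delivers.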
Intel Advanced Vector eXtensions: registers getting wider, instruction set getting richer.
2011, Westmere (32 nm): SSE 4.2, DDR3, PCIe2, 87 GFLOPS
2012, Sandy Bridge (32 nm): AVX (256-bit registers), DDR3, PCIe3, 185 GFLOPS
2013, Ivy Bridge (22 nm): ~225 GFLOPS
2014, Haswell (22 nm): AVX2 (new instructions), DDR4, PCIe3, ~500 GFLOPS
2015, Broadwell (14 nm): GFLOPS tbd
Future, Skylake (14 nm): AVX 3.2 (512-bit registers), DDR4, PCIe4, GFLOPS tbd
Does The SQL Server Database Engine Leverage SIMD Instructions?  VTune Amplifier does not provide the option to pick out Streaming SIMD Extensions (SSE) integer events.  However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
The Case Of The CPU Pressure Point: Conclusion. The batch mode engine needs to leverage SIMD. If the hash aggregate could "crunch" more data per clock cycle, it would stand a better chance of keeping up with the column store scan! Just about all other column store database engines use SIMD; Microsoft, please can the batch engine leverage SIMD . . .
What About AMD?  AMD came up with the 3DNow! SIMD instruction set.  It is losing the single-threaded performance race to Intel: will this change with Excavator?  AMD innovations: first 64-bit implementation of the x86 architecture; first to break the 1 GHz barrier for an x86 CPU; first to release native dual and quad core x86 processors.
A Look Into The Future: CPU Technology  Moore’s law runs out of steam => transition away from CMOS  L4 caches becoming more prevalent.  Aggressive vectorisation by leveraging GPU cores on the CPU die.  On CPU package NIC functionality.  Rise of the ARM processor in the data centre. CPU
A Look Into The Future: Storage and Memory  Rise of:  NVMe and NVMe over fabrics.  Stacked memory technology.  Increase in NAND flash density through TLC and 3D NAND.  Mass storage and main memory converge.
A Look Into The Future: Database Engines  Column store technology becomes the de facto standard for data warehouses and marts.  Adoption of more MonetDB innovations into popular database engines:  Database cracking  Query recycling
Which Memory Your Data Is In Matters! Locality Matters! Where is my data? In the L1 data cache? The L2 unified cache? The L3 cache? Hopefully not all the way out in main memory?!
Going Off-CPU Results In Hitting A Performance Brick Wall.  This affects both data access and logical-to-physical memory mapping (TLB) lookups.  Cache lines associated with singleton spin locks need to be freed up as fast as possible.  The impact of the above has been demonstrated.
The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine.  Everything is good from a CPU cache hit, fetch and decode perspective: the front end is fine.  The back end is struggling to keep up with the front end.  The emphatic answer is to process multiple data items per CPU clock cycle => leverage SIMD.
chris1adkin@yahoo.co.uk http://uk.linkedin.com/in/wollatondba ChrisAdkin8
Appendix A: Instruction Execution And The CPU Front / Back Ends Cache Fetch Decode Execute Branch Predict Decoded Instruction Buffer Execute Reorder And Retire Front end Back end
Appendix B - The CPU Front / Back Ends In Detail Front end Back end
Appendix C - CPU Pressure Points, Important Calculations
 Front end bound (smaller is better)
 IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
 Bad speculation
 (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)
 Retiring
 UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)
 Back end bound
 1 - (Front end bound + Bad speculation + Retiring)
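The four formulas translate directly into code. A sketch of the slot accounting (4 issue slots per core clock, matching the pipeline slides earlier); the counter values used in any example run are invented:

```python
# Top-down pipeline-slot accounting per Appendix C: each core clock
# provides 4 issue slots; every slot is attributed to exactly one of
# front-end bound, bad speculation, retiring or back-end bound.
def topdown(clocks, idq_uops_not_delivered, uops_issued,
            uops_retired_slots, recovery_cycles):
    slots = 4 * clocks
    front_end = idq_uops_not_delivered / slots
    bad_spec = (uops_issued - uops_retired_slots
                + 4 * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    back_end = 1 - (front_end + bad_spec + retiring)
    return front_end, bad_spec, retiring, back_end
```

The four fractions always sum to 1, so a high back-end-bound figure means the missing slots are not explained by fetch stalls or mis-speculation, which is exactly the batch engine's symptom.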
Appendix D - VTune Amplifier General Exploration: an illustration of what the "General exploration" analysis capability of the tool provides.

Sql server engine cpu cache as the new ram

  • 1.
    L3 Cache DBA Level400 C P U L2 Cache L1 Cache Core Core L2 Cache L1 Cache
  • 2.
    About me An independentSQL Consultant A user of SQL Server from version 2000 onwards with 12+ years experience. Speaker, both at UK user group events and at conferences. I have a passion for understanding how the database engine works at a deep level.
  • 3.
    “Everything fits inmemory, so performance is as good as it will get. It fits in memory therefore end of story”
  • 4.
    Demonstration #1 A SimpleQuery That Defies Conventional Wisdom
  • 5.
    Demonstration #1 RepeatedWith ‘Bigger’ Hardware CPU 6 core 2.0 Ghz (Sandybridge) Warm large object cache used in all tests to remove storage as a factor. CPU 6 core 2.0 Ghz (Sandybridge) 48 Gb quad channel 1333 Mhz DDR3 memory Hyper-threading enabled, unless specified otherwise.
  • 6.
    Which SELECT StatementHas The Lowest Elapsed Time ? 17.41Mb column store Vs. 51.7Mb column store WITH generator AS ( SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a) FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1 CROSS JOIN master.dbo.syscolumns c2 ) SELECT d.DateKey AS OrderDateKey ,CAST(((id - 1) % 1048576) AS money ) AS Price1 ,CAST(((id - 1) % 1048576) AS money ) AS Price2 ,CAST(((id - 1) % 1048576) AS money ) AS Price3 INTO FactInternetSalesBigNoSort FROM generator CROSS JOIN [dbo].[DimDate] d CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigNoSort SELECT CalendarQuarter ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [dbo].[FactInternetSalesBigNoSort] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter WITH generator AS ( SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a) FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1 CROSS JOIN master.dbo.syscolumns c2 ) SELECT d.DateKey AS OrderDateKey ,CAST(((id - 1) % 1048576) AS money ) AS Price1 ,CAST(((id - 1) % 1048576) AS money ) AS Price2 ,CAST(((id - 1) % 1048576) AS money ) AS Price3 INTO FactInternetSalesBigSorted FROM generator CROSS JOIN [dbo].[DimDate] d CREATE CLUSTERED INDEX ccsi ON FactInternetSalesBigNoSorted ( OrderDateKey ) CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigNoSorted WITH (DROP_EXISTING = ON) SELECT CalendarQuarter ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [dbo].[FactInternetSalesBigSorted] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter The fastest ? The fastest ?
  • 7.
    The Case ofThe Two Column Store Index Sizes SQL Server query tuning 101 The optimizer will always use the smaller data structures it can find to satisfy the query, right ?
  • 8.
    How Well DoThe Queries Using The Two Column Stores Scale ? 0 10000 20000 30000 40000 50000 60000 70000 80000 2 4 6 8 10 12 14 16 18 20 22 24 Time(ms) Degree of Parallelism Non-sorted column store Sorted column store Data creation statement scaled using top 300,000 to create 1,095,600,000 rows.
  • 9.
    Can We UseAll Available CPU Resource ? 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PercentageCPUUtilization Degree of Parallelism Non-sorted Sorted Memory access should be 100% CPU intensive ?!?
  • 10.
    Wait Statistics DoNot Help Here ! Stats are for the query ran with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data.
  • 11.
    Spin Locks DoNot Provide Any Clues Either Executes in 775 ms for a warm column store object pool 12 cores x 2.0 Ghz x 0.775 = 1,860,000,000 CPU cycles Total spins 293,491 SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 12.
    Well Documented Tools Dynamicmanagement views and functions Performance counters Extended events Not all wait events, latches and spin locks are well documented if documented at all. Takeaway: These tools are not going to tell us definitively where our CPU time is going !!!
  • 13.
    The Problem WithQuery Plan Costs Assumptions: The buffer cache is cold. IO cannot be performed in parallel. Data in different columns is never correlated ( improved on in SQL 2014 ). Hash distribution is always uniform. Etc. . . . Based on the amount of time it took a developers machine to complete certain operations, a Dell OptiPlex ( according to legend ).
  • 14.
    Response Time Wait Time + ServiceTime We need something to help find out what is happening here.
  • 15.
    A Not SoWell Documented Tool
  • 16.
    Introducing the WindowsPerformance Toolkit Comes as part of the Windows Assessment and Deployment Kit. Traces are created via xperf and windows performance recorder. It utilises Event Tracing for Windows (ETW) Visualise traces using windows performance analyser. The real power is being able to stack walk the database engine
  • 17.
    Public Symbols These arelabels contained in .pdb files that provide information on what programming construct generated a piece of machine code, also known as debug symbols. Caution: public symbols are version specific, down to CU level !
  • 18.
    Obtaining An ETWTrace Stack Walking The Database Engine xperf –on base –stackwalk profile xperf –d stackwalk.etl WPA SQL Statement
  • 19.
    Basic xperf CommandLine Syntax xperf –on < kernel flags | kernel groups > [ -stackwalk < stackwalk kernel providers ] Kernel groups are groups of flags and not to be confused with Windows kernel groups Takeaway: kernel groups make life easier
  • 20.
    ETW Tracing IsGenerally Light Weight This comes from the Premier Field Engineer Blog: Identifying the cause of SQL Server IO bottlenecks using XPerf, it reduced my IO throughput from 3300 Mb/s to 300 Mb/s XPERF -on PROC_THREAD+LOADER+ FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+ FILENAME+FILE_IO+FILE_IO_INIT+ DISK_IO+HARD_FAULTS+DPC+ INTERRUPT+CSWITCH+PROFILE+DRIVERS+DISPATCHER -stackwalk MiniFilterPreOpInit+MiniFilterPostOpInit+ CSWITCH+PROFILE+ThreadCreate+ReadyThread+ DiskReadInit+DiskWriteInit+DiskFlushInit+ FileCreate+FileCleanup+FileClose+FileRead+FileWrite -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular Takeaway: start off with the smallest set of kernel providers you can get away with and add incrementally.
  • 21.
    The Database EngineComposition ( SQL Server 2012 onwards ) Database Engine Language Processing: sqllang.dll Runtime: sqlmin.dll, sqltst.dll, qds.dll SQLOS: sqldk.dll, sqlos.dll
  • 22.
    Demonstration #2 Stack Walking anETW Trace Of Our Simple Query
  • 23.
  • 24.
    Demonstration #4 Quantifying TheCost Of Extracting Row Values During A Table Scan
  • 25.
    North bridge Core Core 2Architecture L3 Cache L1 Instruction Cache 32KB L2 Unified Cache 256K L1 Data Cache 32KB Core Core L1 Instruction Cache 32KB L2 Unified Cache 256K L1 Data Cache 32KB Core C P U South bridge Front side bus
  • 26.
    Core 2 Architecture Latency when talking to IO and memory controllers even before accessing memory or IO.  Single level TLB.  Hyper-threading via NetBurst, this delivered poor performance.  Design was not module, 4 CPU was 2 x 2 core CPUs “Bolted together”.  Only last generation “Dunnington” had a L3 cache.
  • 27.
Core i Series Generation 1 'Nehalem' ( diagram ): cores each with L1 Instruction Cache 32KB, L1 Data Cache 32KB and L2 Unified Cache 256K; shared L3 Cache; un-core with power and clock, TLB, QPI and memory controller; memory bus.  Integrated memory controller.  NetBurst replaced with a new hyper-threading technology.  The Translation Lookaside Buffer ( TLB ) for caching logical to physical memory mappings has two levels.  Front side bus replaced by the QuickPath Interconnect ( QPI ).  Genuine modular design.
  • 28.
Core i Series Generation 2 'Sandybridge' ( diagram ): cores each with L0 uop cache, L1 Instruction Cache 32KB, L1 Data Cache 32KB and L2 Unified Cache 256K; L3 Cache connected via a bi-directional ring bus; un-core with power and clock, TLB, QPI, memory controller and PCIe 2.0 controller.  Integrated PCIe 2.0 controller.  Level 0 uop cache.  L3 cache connected to the cores via a bi-directional ring bus.  Advanced Vector eXtensions v1.  Data Direct IO.
  • 29.
Cache Lines  Unit of transfer between memory and the CPU; on current Intel architectures a cache line is 64 bytes.  Cache lines are used to create "cache entries", which are tagged with the requested memory location.  Data structures should be cache line aligned in order to avoid split register operations.  If different cores write to the same cache line, the line effectively "bounces" between their caches => try to avoid threads sharing cache lines.
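The advice about threads sharing cache lines can be sketched in C. This is an illustrative sketch only, not database engine code; the `padded_counter` struct and the 64-byte line size are assumptions for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: padding per-thread counters out to a full
 * 64-byte cache line. Without the padding, adjacent counters can land
 * on the same line, and writes from different cores 'bounce' that line
 * between their caches (false sharing). */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the line */
};

/* An array element can never share a line with its neighbour if its
 * size is a whole multiple of the line size. */
int no_false_sharing(size_t elem_size)
{
    return elem_size % CACHE_LINE == 0;
}
```

Each thread then updates its own `padded_counter` slot without invalidating the lines its sibling threads are writing to.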
  • 30.
CPU Cache, Memory and IO Sub-System Latency ( diagram ): latency rises from 1ns at the L1 caches through 10ns, 100ns, 10us and 100us down to 10ms at the IO sub-system . . . but, memory is the new flash storage and CPU cache is the new RAM . . . Slide borrowed from Thomas Kejser with his kind permission.
  • 31.
The CPU Cache Hierarchy Latencies In CPU Cycles: L1 cache sequential access: 4 | L1 cache in-page random access: 4 | L1 cache full random access: 4 | L2 cache sequential access: 11 | L2 cache in-page random access: 11 | L2 cache full random access: 11 | L3 cache sequential access: 14 | L3 cache in-page random access: 18 | L3 cache full random access: 38 | Main memory: 167
  • 32.
Main Memory Is Holding The CPU Back, Solutions . . .  Leverage the pre-fetcher as much as possible.  Larger CPU caches.  L4 cache => Crystalwell eDRAM.  DDR4 memory.  Bypass main memory.  Stacked memory:  Hybrid Memory Cubes ( Micron, Intel etc ).  High Bandwidth Memory ( AMD, Hynix ).
  • 33.
What The Pre-Fetcher Loves: a sequential scan of a column store index
  • 34.
What The Pre-Fetcher Hates: a hash table ! Can this be improved ?
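The contrast between the two access patterns can be sketched in C. This is a hedged illustration, with `sum_sequential` and `sum_indirect` as hypothetical stand-ins for a column store scan and a hash table probe respectively:

```c
#include <assert.h>
#include <stddef.h>

/* Sequential scan: the hardware pre-fetcher can predict the upcoming
 * cache lines and stream them in ahead of the loads. */
long sum_sequential(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Indirect access, as a hash-table probe does: the address of the next
 * load depends on data, so the pre-fetcher cannot help and each cache
 * miss pays the full memory latency. */
long sum_indirect(const long *a, const size_t *idx, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}
```

Both functions compute the same result; on a large array, only the second one defeats the pre-fetcher.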
  • 35.
Bypassing Main Memory With Data Direct IO ( diagram ): in the old world, IO is delivered to the core via main memory; in the new world with Data Direct IO, it is delivered straight into the L3 cache.
  • 36.
Transactions/s With and Without Data Direct IO ( chart ): single socket IO performance, transactions/sec for the Xeon 5600 versus the Xeon E5 at 2, 4, 6 and 8 x 10GbE.
  • 37.
The Case Of The Two Column Store Index Sizes Could the CPU cycles required to access the CPU cache versus main memory be a factor ? . . . let's dig deeper
  • 38.
Call Stack For Query Against Column Store On Non-Pre-Sorted Data Hash agg lookup weight: 65,329.87 Column store scan weight: 28,488.73
  • 39.
Where Is The Bottleneck In The Plan ? Control flow Data flow The stack trace indicates that the bottleneck is right here
  • 40.
Does The Stream Aggregate Perform Any Better ?  The query takes seven minutes and nine seconds to run.  Performance is killed by a huge row mode sort prior to the stream aggregate !!! ( sorted column store )
  • 41.
Call Stack For Query Against Column Store On Pre-Sorted Data Hash agg lookup weight:  now 275.00  before 65,329.87 Column store scan weight:  now 45,764.07  before 28,488.73
  • 42.
The Case Of The Two Column Stores The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why ?
  • 43.
SQL Server and Large Memory Pages, How Does It Work ? ( diagram ): core with a TLB comprising a DTLB ( 1st level ) and STLB ( 2nd level ), ~10s of CPU cycles to consult; un-core with power and clock, QPI and memory controller; page translation table in memory, 160+ CPU cycles to walk.
  • 44.
The TLB Without Large Page Support ( diagram ) TLB: 32 x 4KB pages => 128KB of logical to physical memory mappings covered* * Nehalem through to Ivybridge architectures
  • 45.
The TLB With Large Page Support ( diagram ) TLB: 32 x 2MB pages => logical to physical memory mapping coverage is increased from 128KB to 64MB !!! Fewer trips off the CPU to the page table
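The arithmetic behind the two TLB slides is simple enough to sketch; `tlb_reach` is an illustrative helper, assuming the 32-entry TLB described above:

```c
#include <assert.h>

/* TLB reach = number of entries x page size. The same 32 entries cover
 * 128KB with 4KB pages but 64MB with 2MB large pages, so far fewer
 * page-table walks have to leave the CPU. */
unsigned long long tlb_reach(unsigned entries, unsigned long long page_size)
{
    return (unsigned long long)entries * page_size;
}
```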
  • 46.
What Difference Do Large Pages Make ? ( single 6 core socket ) Large page support off: 130s elapsed time, 222,220 page lookups/s. Large page support on: 93s elapsed time, 263,766 page lookups/s => a 28% saving.
  • 47.
CPU Pipeline Architecture  A "pipeline" of logical slots runs through the processor, from allocation at the front end to retirement at the back end.  The front end can issue four micro-ops per clock cycle.  The back end can retire up to four micro-ops per clock cycle.
  • 48.
Pipeline "Bubbles" Are Bad !  Empty slots are referred to as "bubbles".  Causes of front end bubbles:  Bad speculation.  CPU stalls.  Data dependencies, e.g. A = B + C followed by E = A + D.  Back end bubbles can be due to excessive demand for specific types of execution unit.
  • 49.
    A Basic NUMAArchitecture Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Remote memory access Local memory access Local memory access NUMA Node 0 NUMA Node 1
  • 50.
Four and Eight Node NUMA QPI Topologies ( Nehalem i7 onwards ) ( diagram ): four and eight CPU sockets connected via QPI links and IO hubs. With 18 core Xeons in the offing, these topologies will become increasingly rare.
  • 51.
Remote Node Memory Access Comes At A Cost An additional 20% overhead when accessing "foreign" memory ! ( from coreinfo )
  • 52.
Local Vs Remote Memory Access and Thread Locality How does SQLOS schedule hyper-threads in relation to physical cores ? ( 6 cores per socket )
  • 53.
Making Use Of CPU Stalls With Hyper-Threading ( Nehalem i7 onwards ) 1. Session 1 performs an index seek; the B-tree pages are not in the CPU cache. 2. A CPU stall takes place ( 160+ clock cycles ) whilst the page is retrieved from memory, resulting in a last level cache miss. 3. The "dead" CPU stall cycles give the physical core the opportunity to run a 2nd thread.
  • 54.
Hyper-Threading Scalability For The "Sorted" Column Store Index ( chart ): CPU utilisation climbs near linearly with the degree of parallelism, from 5% at DOP 2 to 59% at DOP 24.  5% CPU utilisation per core  60% of each core utilised by first hyper thread
  • 55.
The Rationale Behind SQL OS SQL OS Layer: One Scheduler Per Logical Processor  The OS has no concept of SQL Server resources such as latches.  "Hard" context switches become soft user mode context switches.  SQL OS schedules threads by prioritising L2 cache hits and reuse over "fairness".
  • 56.
The LOGCACHE_ACCESS Spin Lock ( diagram ): sessions T0 .. Tn take the buffer offset ( cache line ), allocate a slot in the LOGBUFFER ( slots 1 to 127 ) and MemCpy the slot content; the log writer drains the log flush queue via the writer queue and an async I/O completion port, then signals the thread which issued the commit ( WRITELOG ). This spin lock is the ultimate bottleneck for ACID transactions in SQL Server. Slide borrowed from Thomas Kejser with his kind permission.
  • 57.
Spinlocks and Context Switches The log writer thread needs to be able to free the spin lock cache line as fast as possible.
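Why the cache line matters can be seen from what a spin lock actually is. The sketch below is a minimal C11 test-and-set spinlock, not the SQL Server implementation; every spinning core must pull the flag's cache line across to itself, which is exactly the traffic the log writer wants to minimise:

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal test-and-set spinlock. The atomic flag lives on one cache
 * line; each acquire attempt is a write, so the line ping-pongs
 * between the caches of every core that is spinning on it. */
typedef struct {
    atomic_flag flag;
} spinlock_t;

void spin_lock(spinlock_t *l)
{
    /* busy-wait until the current holder clears the flag */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

The fewer cores contending, the faster the lock's cache line can be handed back to the log writer, which is the rationale behind the affinity mask experiment that follows.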
  • 58.
Keeping The Rest Of The Database Engine Away From The Log Writer Thread The log writer thread is always assigned to the first free CPU core.
  • 59.
A Simple Test To Create LOGCACHE_ACCESS Spin Lock Pressure  Create a table that ensures there is one row per page, in order to eliminate PAGELATCH_EX latch activity.  Insert 1000 rows into the table; the SPIDs for 22 sessions should fall in the range 1 -> 1000.  A procedure performs a single row update 10,000 times for the SPID of the session it is running in.
  • 60.
Results ( spins ): default CPU affinity mask: 234,915,993; core 0 removed from affinity mask: 10,554,325. An order of magnitude saving !!!
  • 61.
How Has This Drop In Spins Been Achieved ?  The log writer thread is always assigned to core 0.  By isolating the rest of the database engine from core 0, the log writer does not have to contend with so many context switches when handing off and receiving back the log cache access spin lock.
  • 62.
  • 63.
Hypothesis Column store on non-pre-sorted data: the segment scan is sequential , but the hash key value probes are random , so the hash table is likely to be at the high latency end of the cache hierarchy. Column store on pre-sorted data: the hash table is likely to be at the low latency end of the cache hierarchy.
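A crude way to quantify this hypothesis is to count how often consecutive probes move to a different hash bucket. `bucket_switches` is a hypothetical helper for illustration; with pre-sorted keys, runs of identical keys keep hitting the same bucket, which stays hot in the L1 cache:

```c
#include <assert.h>
#include <stddef.h>

/* Count how many times consecutive probes land in different buckets.
 * Sorted keys produce long runs in the same bucket (entry stays hot);
 * unsorted keys jump to a different, likely cold, bucket almost every
 * probe. */
size_t bucket_switches(const unsigned *keys, size_t n, unsigned nbuckets)
{
    size_t switches = 0;
    for (size_t i = 1; i < n; i++)
        if (keys[i] % nbuckets != keys[i - 1] % nbuckets)
            switches++;
    return switches;
}
```

On the sorted key stream below only two of five probe transitions change bucket; on the unsorted stream, all five do.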
  • 64.
The Case Of The 60% CPU Utilisation Ceiling Column store scans are pre-fetcher friendly; hash joins and hash aggregates are not. Can a hash join or aggregate keep up with a column store scan ?
  • 65.
Introducing Intel VTune Amplifier XE Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set. Refer to Appendix D for an overview of what "General exploration" provides.
  • 66.
VTune Can Go One Better Than WPT, Down To CPU Cycle Level ! 181,578,272,367 versus 466,000,699 clock cycles !!!
  • 67.
This Is What The CPU Stall Picture Looks Like ( LLC misses by degree of parallelism ): DOP 2: non-sorted 13,200,924, sorted 3,000,210 | DOP 4: non-sorted 30,902,163, sorted 1,200,084 | DOP 6: non-sorted 161,411,298, sorted 16,203,164 | DOP 8: non-sorted 1,835,828,499, sorted 29,102,037 | DOP 10: non-sorted 2,069,544,858, sorted 34,802,436 | DOP 12: non-sorted 4,580,720,628, sorted 35,102,457 | DOP 14: non-sorted 2,796,495,741, sorted 48,903,413 | DOP 16: non-sorted 3,080,615,628, sorted 64,204,494 | DOP 18: non-sorted 3,950,376,507, sorted 63,004,410 | DOP 20: non-sorted 4,419,593,391, sorted 85,205,964 | DOP 22: non-sorted 4,952,446,647, sorted 68,404,788 | DOP 24: non-sorted 5,311,271,763, sorted 72,605,082
  • 68.
Does The OrderDateKey Column Fit In The L3 Cache ? ( table name / column name / size ) FactInternetSalesBigNoSort: OrderDateKey 1786182, Price1 3871, Price2 3871, Price3 3871. FactInternetSalesBigSorted: OrderDateKey 738, Price1 2965127, Price2 2965127, Price3 2965127. No , the L3 cache is 20Mb in size
  • 69.
The Case Of The Two Column Store Index Sizes: Conclusion Turning the memory access on the hash aggregate table from random to sequential probes => the CPU savings exceed the cost of scanning an enlarged column store
  • 70.
Skew Is Also A Factor In Batch Mode Hash Join Performance ( diagram ) Row mode: each thread builds its own hash join partition, with build and probe inputs repartitioned via exchanges; repartitioning inputs is expensive and data skew reduces parallelism. Batch mode: threads share one hash table; no repartitioning is needed and data skew speeds up processing. Slide borrowed from Thomas Kejser with his kind permission.
  • 71.
The Case Of The 60% CPU Utilisation Ceiling Still not solved. Tuning 101: if you cannot max out the CPU capacity or IOPS bandwidth, there must be some form of contention at play . . .
  • 72.
What Does Digging Deep Into The Call Stack Tell Us ? Throttling !
  • 73.
The Case Of The 60% CPU Utilisation Ceiling: Conclusion The hash aggregate cannot keep up with the column store scan. The batch engine therefore throttles the column store scan by issuing sleep system calls !!! The Integration Services engine does something very similar; refer to "back pressure".
  • 74.
The Case Of The CPU Pressure Point Where are the pressure points on the CPU and what can be done to resolve them ?
  • 75.
Making Efficient Use Of The CPU In The "In Memory" World ( diagram ) Front end pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ). Back end pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
  • 76.
Lots Of KPIs To Choose From, Which To Select ?  CPU Cycles Per Retired Instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad.  Front end bound: reflects the front end under-supplying the back end with work; this figure should be as close to zero as possible.  Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units.
  • 77.
These Are The Pressure Point Statistics For The "Sorted" Column Store ( chart ): CPI, front end bound and back end bound KPI values ( 0 to 0.8 ) against degrees of parallelism 2 to 24. Refer to Appendix C for the formulae from which these metrics are derived.
  • 78.
Which Parts Of The Database Engine Are Suffering Back End Pressure ? Results obtained for a degree of parallelism of 24.
  • 79.
Back End Pressure Can Manifest As "Port Saturation" ( *Sandybridge core ): the 256-bit FMUL and Blend units, exercised by CBagAggregateExpression:: TryAggregateUsingQE_Pure * SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 80.
Port Saturation Analysis ( hit ratio by port ): port 0: 0.46 | port 1: 0.45 | port 2: 0.16 | port 3: 0.17 | port 4: 0.1 | port 5: 0.55. 0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60% we may start to see this on ports 0, 1 and 5.
  • 81.
Back End Pressure Can we help the back end keep up with the front end ?
  • 82.
Single Instruction Multiple Data ( SIMD )  A class of CPU instruction that processes multiple data points simultaneously.  A form of vectorised processing.  Once CPU stalls are minimised, the challenge becomes processing data on the CPU ( rate of instruction retirement ) as fast as possible.
  • 83.
Lowering Clock Cycles Per Instruction By Leveraging SIMD Using conventional processing, adding two arrays together, each comprising four elements ( A(1) + B(1) = C(1) through A(4) + B(4) = C(4) ), requires four instructions.
  • 84.
Lowering Clock Cycles Per Instruction By Leveraging SIMD Using SIMD "single instruction multiple data" commands, the addition of A(1..4) + B(1..4) = C(1..4) can be performed using a single instruction.
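The two slides above can be sketched with GCC/Clang vector extensions, used here as a portable stand-in for AVX intrinsics (an assumption; a real batch engine would use hand-written or compiler-generated SSE/AVX code):

```c
#include <assert.h>

/* GCC/Clang vector extension: a 16-byte vector of four floats.
 * One vector addition adds all four lanes in a single operation. */
typedef float v4sf __attribute__((vector_size(16)));

v4sf add_simd(v4sf a, v4sf b)
{
    return a + b;   /* compiles to a single packed-add instruction */
}

/* Conventional processing: four separate scalar additions. */
void add_scalar(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}
```

Both produce identical results; the SIMD version retires a quarter of the additions per element, which is the CPI improvement the slide is after.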
  • 85.
Intel Advanced Vector eXtensions ( roadmap ): 2011 Westmere: 32 nm, SSE 4.2, DDR3, PCIe2, 87 GFLOPS | 2012 Sandy Bridge: 32 nm, AVX ( 256 bit registers ), DDR3, PCIe3, 185 GFLOPS | 2013 Ivy Bridge: 22 nm, ~225 GFLOPS | 2014 Haswell: 22 nm, AVX2 ( new instructions ), DDR4, PCIe3, ~500 GFLOPS | 2015 Broadwell: 14 nm, tbd GFLOPS | Future Skylake: 14 nm, AVX 3.2 ( 512 bit registers ), DDR4, PCIe4, tbd GFLOPS. AVX registers getting wider, instruction set getting richer
  • 86.
Does The SQL Server Database Engine Leverage SIMD Instructions ?  VTune Amplifier does not provide the option to pick out Streaming SIMD Extensions ( SSE ) integer events.  However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
  • 87.
The Case Of The CPU Pressure Point: Conclusion The batch mode engine needs to leverage SIMD. If the hash aggregate could "crunch" more data per clock cycle, it would stand a better chance of keeping up with the column store scan ! Just about all other column store database engines use SIMD; please can the batch engine leverage SIMD, Microsoft . . .
  • 88.
    What About ? AMD came up with the 3D Now SIMD instruction set.  Its losing the single threaded performance race to Intel:  Will this change with Excavator ?  AMD innovations:  1st 64 bit implementation of the x86 architecture  1st the break the 1Ghz barrier for a x86 CPU  1st to release native dual and quad core x86 processors
  • 89.
A Look Into The Future: CPU Technology  Moore's law runs out of steam => transition away from CMOS.  L4 caches becoming more prevalent.  Aggressive vectorisation by leveraging GPU cores on the CPU die.  On-package NIC functionality.  Rise of the ARM processor in the data centre.
  • 90.
    A Look IntoThe Future: Storage and Memory  Rise of:  NVMe and NVMe over fabrics.  Stacked memory technology.  Increase in NAND flash density through TLC and 3D NAND.  Mass storage and main memory converge.
  • 91.
    A Look IntoThe Future: Database Engines  Column store technology becomes the de facto standard for data warehouses and marts.  Adoption of more MonetDB innovations into popular database engines:  Database cracking  Query recycling
  • 93.
Which Memory Your Data Is In Matters ! Locality Matters ! ( diagram ) Where is my data ? In the L1 data cache ? The L2 unified cache ? The L3 cache ? Hopefully not out on the memory bus ?!?
  • 94.
Going Off CPU Results In Hitting A Performance Brick Wall  Affects data and logical memory mapping information.  Cache lines associated with singleton spinlocks need to be freed up as fast as possible.  The impact of the above has been demonstrated.
  • 95.
The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine  Everything is good from a CPU cache hit, fetch and decode perspective.  The back end is struggling to keep up with the front end.  The emphatic answer is to process multiple data items per CPU clock cycle => leverage SIMD.
  • 97.
  • 99.
Appendix A: Instruction Execution And The CPU Front / Back Ends ( diagram ) Front end: branch predict, fetch from cache, decode into the decoded instruction buffer. Back end: execute, reorder and retire.
  • 100.
Appendix B - The CPU Front / Back Ends In Detail ( diagram ) Front end Back end
  • 101.
Appendix C - CPU Pressure Points, Important Calculations  Front end bound ( smaller is better )  IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)  Bad speculation  (UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)  Retiring  UOPS_RETIRED.RETIRE_SLOTS / (4 * Clock ticks)  Back end bound ( when front end bound and bad speculation are zero, this equals 1 - Retiring )  1 – (Front end bound + Bad speculation + Retiring)
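The Appendix C formulae reduce to plain arithmetic over counter values; a sketch, with `topdown_calc` as an illustrative helper and the counter values in the example as hypothetical inputs:

```c
#include <assert.h>

/* Top-down breakdown: four issue slots per clock tick, so the
 * denominator is always 4 * clock ticks. The four fractions partition
 * the slot budget between front end stalls, wasted speculative work,
 * usefully retired uops, and back end stalls. */
struct topdown {
    double fe_bound, bad_spec, retiring, be_bound;
};

struct topdown topdown_calc(double idq_uops_not_delivered,
                            double uops_issued, double uops_retired,
                            double recovery_cycles, double clock_ticks)
{
    double slots = 4.0 * clock_ticks;
    struct topdown t;
    t.fe_bound = idq_uops_not_delivered / slots;
    t.bad_spec = (uops_issued - uops_retired
                  + 4.0 * recovery_cycles) / slots;
    t.retiring = uops_retired / slots;
    /* whatever remains of the slot budget is back end bound */
    t.be_bound = 1.0 - (t.fe_bound + t.bad_spec + t.retiring);
    return t;
}
```

With 250 clock ticks ( 1000 slots ), 100 undelivered slots, 850 uops issued, 600 retired and 12.5 recovery cycles, the breakdown is 10% front end bound, 30% bad speculation, 60% retiring and ~0% back end bound.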
  • 102.
Appendix D - VTune Amplifier General Exploration An illustration of what the "General exploration" analysis capability of the tool provides.

Editor's Notes

  • #11 Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. CXPACKET waits can be discounted 99.9 % of the time.
  • #19 Xperf can provide deep insights into the database engine that other tools cannot, in this case we can walk the stack associated with query execution and observe the total CPU consumption up to any point in the stack in milliseconds
  • #39 This slide is the “Big reveal”, for the column store created on the pre-sorted data, there are four points of interest when comparing the two call stacks: The hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude. We see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack. The hash aggregate lookup for the query on the “Non-sorted” column store is throttling back the column store scan more than its counterpart in the other call stack, this is based on the number of times this is called. The scan for the sorted column store is twice as expensive as that for the non-sorted column store.
  • #42 This slide is the “Big reveal”, for the column store created on the pre-sorted data, there are four points of interest when comparing the two call stacks: The hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude. We see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack. The hash aggregate lookup for the query on the “Non-sorted” column store is throttling back the column store scan more than its counterpart in the other call stack, this is based on the number of times this is called. The scan for the sorted column store is twice as expensive as that for the non-sorted column store.
  • #47 These results are for the scan of a 118Gb table against a cold buffer cache, with max memory for the instance set to 40Gb.
  • #50 Note the difference in latency between accessing the on CPU cache and main memory, accessing main memory incurs a large penalty in terms of lost CPU cycles, this is important and one of the drivers behind the new optimizer mode that was introduced in SQL Server 2012 in order to support column store indexes.
  • #51 Note the difference in latency between accessing the on CPU cache and main memory, accessing main memory incurs a large penalty in terms of lost CPU cycles, this is important and one of the drivers behind the new optimizer mode that was introduced in SQL Server 2012 in order to support column store indexes.
  • #59 Keeping the rest of the database engine away from core 0 allows the spin lock cache line to be freed up as fast as possible.
  • #71 Thread numbers taken away… Removed bottom
  • #78 The results for this graph have been obtained using a single instance with large pages, LPIM, 40Gb max memory, the ‘Sorted’ column store index and a warm large object cache.
  • #79 The results for this graph have been obtained using a single instance with large pages, LPIM, 40Gb max memory, the ‘Sorted’ column store index and a warm large object cache.