SIMD Programming Introduction Champ Yen (嚴梓鴻) champ.yen@gmail.com http://champyen.blogspt.com
Agenda 2 ● What & Why SIMD ● SIMD in different Processors ● SIMD for Software Optimization ● What is important in SIMD? ● Q & A Link to these slides: https://goo.gl/Rc8xPE
What is SIMD (Single Instruction Multiple Data) 3

Scalar version: one point per iteration (one lane)
for(y = 0; y < height; y++){
    for(x = 0; x < width; x++){
        //process 1 point
        out[x] = a[x] + b[x];
    }
    a += width; b += width; out += width;
}

NEON version: 8 points per iteration
for(y = 0; y < height; y++){
    for(x = 0; x < width; x += 8){
        //process 8 points simultaneously
        uint16x8_t va, vb, vout;
        va = vld1q_u16(a + x);
        vb = vld1q_u16(b + x);
        vout = vaddq_u16(va, vb);
        vst1q_u16(out + x, vout);
    }
    a += width; b += width; out += width;
}
Why do we need to use SIMD? 4
Why & How do we use SIMD? 5 Image Processing Scientific Computing Gaming Deep Neural Network
SIMD in different Processors - CPU 6 ● x86 − MMX − SSE − AVX/AVX-512 ● ARM (Application profile) − v5 DSP Extension − v6 SIMD − v7 NEON − v8 Advanced SIMD (NEON) https://software.intel.com/sites/landingpage/IntrinsicsGuide/ http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
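For comparison with the NEON loop on slide 3, a minimal sketch (my own, assuming 16-bit unsigned data and a width that is a multiple of 8) of the same element-wise add written with x86 SSE2 intrinsics:

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Adds two rows of 16-bit values, 8 lanes per iteration. */
void add_row_sse2(const uint16_t *a, const uint16_t *b, uint16_t *out, int width)
{
    for (int x = 0; x < width; x += 8) {
        __m128i va   = _mm_loadu_si128((const __m128i *)(a + x));
        __m128i vb   = _mm_loadu_si128((const __m128i *)(b + x));
        __m128i vout = _mm_add_epi16(va, vb);   /* 8 x 16-bit additions at once */
        _mm_storeu_si128((__m128i *)(out + x), vout);
    }
}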
SIMD in different Processors - GPU 7 ● SIMD − AMD GCN − ARM Mali ● Wavefront − Nvidia − Imagination PowerVR Rogue
SIMD in different Processors - DSP 8 ● Qualcomm Hexagon 600 HVX ● Cadence IVP P5 ● Synopsys EV6x ● CEVA XM4
SIMD Optimization 9 (layered view) Software → SIMD Framework / Programming Model → SIMD Hardware Design
SIMD Optimization 10 ● Auto/Semi-Auto Method ● Compiler Intrinsics ● Specific Framework/Infrastructure ● Coding in Assembly ● What are the difficult parts of SIMD Programming?
SIMD Optimization – Auto/Semi-Auto 11 ● compiler − auto-vectorization optimization options − #pragma hints − IR optimization ● Cilk Plus/OpenMP/OpenACC
Serial code:
for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; }
SIMD pragma:
#pragma omp simd
for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; }
https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization
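As a rough sketch of how this is used in practice (function name and flag choices are my own), the OpenMP SIMD pragma can be honored without full OpenMP threading, and the compiler can report what it vectorized:

/* Build examples:
 *   gcc   -O3 -fopenmp-simd -fopt-info-vec        add.c
 *   clang -O3 -fopenmp-simd -Rpass=loop-vectorize add.c
 */
void add_arrays(float *A, const float *B, const float *C, int N)
{
    #pragma omp simd
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];
}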
SIMD Optimization – Intrinsics 12 ● Arch-Dependent Intrinsics − Intel SSE/AVX − ARM NEON / MIPS ASE − Vector-based DSPs ● Common Intrinsics − GCC/Clang vector intrinsics − OpenCL / SIMD.js ● Take the vector width into consideration ● Portability between compilers SIMD.js example from: https://01.org/node/1495
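A minimal sketch of the "common intrinsics" idea using the GCC/Clang vector extension (the typedef name is my own); the generic + is lowered to whatever SIMD unit the target provides:

#include <stdint.h>

/* 8 x uint16_t = 128-bit generic vector type. */
typedef uint16_t u16x8 __attribute__((vector_size(16)));

u16x8 add8(u16x8 a, u16x8 b)
{
    return a + b;   /* compiled to NEON, SSE, AVX, ... depending on the target */
}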
SIMD Optimization – Intrinsics 13
4x4 Matrix Multiplication, ARM NEON Example (http://www.fixstars.com/en/news/?p=125)
//...
//Load matrix B into four vectors
uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;
vectorB1 = vld1_u16(B[0]);
vectorB2 = vld1_u16(B[1]);
vectorB3 = vld1_u16(B[2]);
vectorB4 = vld1_u16(B[3]);
//Temporary vectors to use when calculating the dot product
uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;
// For each row in A...
for (i = 0; i < 4; i++){
    //Multiply the rows in B by each value in A's row
    vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
    vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
    vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
    vectorT4 = vmul_n_u16(vectorB4, A[i][3]);
    //Add them together
    vectorT1 = vadd_u16(vectorT1, vectorT2);
    vectorT1 = vadd_u16(vectorT1, vectorT3);
    vectorT1 = vadd_u16(vectorT1, vectorT4);
    //Output the dot product
    vst1_u16(C[i], vectorT1);
}
//...
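The multiply/add pairs above can also be fused using NEON multiply-accumulate; a sketch of the same row loop rewritten with vmla_n_u16 (variable names reused from the example, acc is my own):

// For each row in A: accumulate B's rows scaled by that row's elements
for (i = 0; i < 4; i++){
    uint16x4_t acc = vmul_n_u16(vectorB1, A[i][0]);
    acc = vmla_n_u16(acc, vectorB2, A[i][1]);   /* acc += vectorB2 * A[i][1] */
    acc = vmla_n_u16(acc, vectorB3, A[i][2]);
    acc = vmla_n_u16(acc, vectorB4, A[i][3]);
    vst1_u16(C[i], acc);
}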
SIMD Optimization – Data Parallel Frameworks 14 ● OpenCL/CUDA/C++ AMP ● OpenVX/Halide ● SIMD-Optimized Libraries − Apple Accelerate − OpenCV − ffmpeg/x264 − fftw/Ne10 (figures: core idea of the Halide language; work-items in OpenCL)
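As an illustration of the work-item model these frameworks expose, a minimal OpenCL C kernel sketch for the element-wise add from slide 3 (kernel name and argument layout are my own); each work-item handles one element, and the driver groups work-items onto the device's SIMD lanes:

__kernel void vec_add(__global const ushort *a,
                      __global const ushort *b,
                      __global ushort *out)
{
    size_t i = get_global_id(0);   /* one work-item per output element */
    out[i] = a[i] + b[i];
}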
SIMD Optimization – Coding in Assembly 15 ● WHY !!!!??? ● Extreme Performance Optimization − precise code size/cycle/register usage ● The difficulty depends on ISA design and assembler
SIMD Optimization – The Difficult Parts 16 ● Finding Parallelism in the Algorithm ● Portability between different intrinsics − sse-to-neon − neon-to-sse ● Boundary handling − Padding, Predication, Fallback (see the sketch after this slide) ● Divergence − Predication, Fallback to scalar ● Register Spilling − multiple stages, number of variables + braces, assembly ● Non-Regular Access/Processing Patterns/Dependencies − Multi-stage processing / Reduction ISA / Enhanced DMA ● Unsupported Operations − Division, high-level functions (e.g. math functions) ● Floating-Point − Unsupported / cross-device compatibility
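As a concrete example of the boundary-handling point, the usual pattern is a vector main loop plus a scalar fallback for the tail; a sketch reusing the NEON add from slide 3, with width no longer required to be a multiple of 8:

#include <arm_neon.h>
#include <stdint.h>

void add_row(const uint16_t *a, const uint16_t *b, uint16_t *out, int width)
{
    int x = 0;
    /* vector body: 8 lanes per iteration */
    for (; x + 8 <= width; x += 8) {
        uint16x8_t va = vld1q_u16(a + x);
        uint16x8_t vb = vld1q_u16(b + x);
        vst1q_u16(out + x, vaddq_u16(va, vb));
    }
    /* scalar fallback for the remaining 0..7 elements */
    for (; x < width; x++)
        out[x] = a[x] + b[x];
}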
What is important in SIMD? 17 TLP Programming ILP Programming Data Parallel Algorithm Architecture
What is important in SIMD? 18 ● ISA design ● Memory Model ● Thread / Execution Model ● Scalability / Extensibility ● The Trends
What is important in SIMD? ISA Design 19 ● Superscalar vs. VLIW ● Vector register design − Unified or dedicated float/predication/accumulator registers ● ALU ● Inter-Lane − reduction, swizzle, pack/unpack ● Multiply ● Memory Access − Load/Store, Scatter/Gather, LUT, DMA ... ● Scalar ↔ Vector
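To make the inter-lane point concrete, a sketch of a horizontal sum reduction with NEON pairwise adds (n assumed to be a multiple of 4; on ARMv8 a single vaddvq_u32 would replace the final steps):

#include <arm_neon.h>
#include <stdint.h>

uint32_t sum_u32(const uint32_t *p, int n)
{
    uint32x4_t acc = vdupq_n_u32(0);
    for (int i = 0; i < n; i += 4)
        acc = vaddq_u32(acc, vld1q_u32(p + i));          /* per-lane accumulation */
    uint32x2_t half = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    half = vpadd_u32(half, half);                        /* pairwise add across lanes */
    return vget_lane_u32(half, 0);
}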
What is important in SIMD? Memory Model 20 ● DATA! DATA! DATA! − pulling data from DRAM into registers has high latency! ● Buffers Between Host Processor and SIMD Processor − handled by the driver / on the device side ● TCM − DMA − relatively simple hardware, control, and understanding − fixed cycle counts (good for simulation/estimation) ● Cache − prefetch − transparent − portable − performance scalability − simple code flow ● Mixed − powerful
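On the cache side, prefetching can hide part of that DRAM latency; a minimal sketch using the GCC/Clang builtin (the prefetch distance of 64 elements is an arbitrary assumption to tune per target):

#include <stdint.h>

void add_row_prefetched(const uint16_t *a, const uint16_t *b, uint16_t *out, int width)
{
    for (int x = 0; x < width; x++) {
        __builtin_prefetch(a + x + 64, 0);   /* hint: these will be read soon */
        __builtin_prefetch(b + x + 64, 0);
        out[x] = a[x] + b[x];
    }
}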
What is important in SIMD? Thread / Execution Model 21 ● The flow to trigger the DSP to work − Qualcomm FastRPC w/ IDL Compiler − CEVA Link − TI DSP Link − Synopsys EV6x SW ecosystem
What is important in SIMD? Scalability/Extensibility 22 ● What to do when ONE core doesn’t meet the performance requirement? ● HWA/Co-processor Interface ● Add another core? − Cost − Multicore Programming Model
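Before adding cores to the hardware, the usual first step in software is to combine thread-level and vector-level parallelism on the cores already there; a sketch with OpenMP (no particular scheduling assumed):

#include <stdint.h>

void add_image(const uint16_t *a, const uint16_t *b, uint16_t *out,
               int width, int height)
{
    /* rows across threads (TLP), elements within a row across SIMD lanes */
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        #pragma omp simd
        for (int x = 0; x < width; x++)
            out[y * width + x] = a[y * width + x] + b[y * width + x];
    }
}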
What is important in SIMD? Trends 23
Vector width keeps growing over time:
− 64-bit: Hexagon / ARM DSP / MIPS ASE / MMX
− 128/256-bit: CEVA DSP / SSE / AVX
− 512-bit: AVX-512 / IVP P5 / EV6x / HVX
− 1024/2048-bit: HVX / ARM SVE
Along the same time axis, ILP/TLP counts increase and programming shifts from autovectorization toward explicit vector programming.
Thank You, Q & A 24
