Richard Thomson legalize@xmission.com @LegalizeAdulthd github.com/LegalizeAdulthood
SIMD  Single  Instruction  Multiple  Data
SIMD Exploits Data Parallelism  Image Processing  Array Processing  Scientific Computing  3D Graphics
Brief History of CPU SIMD

  Year  Extension  Register Size
  1997  MMX        64 bits
  1999  SSE        128 bits
  2001  SSE2       128 bits
  2004  SSE3       128 bits
  2006  SSE4       128 bits
  2008  AVX        256 bits
  2015  AVX-512    512 bits

Data Types
  • 8-bit integers
  • 16-bit integers
  • 32-bit integers
  • 64-bit integers
  • 16-bit floats
  • 32-bit floats
  • 64-bit floats
  • Multiple smaller quantities are packed into registers ("multiple data")
  • Alignment requirements on data
  • Older extensions do not support all data types

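The "packing" on this slide is simple arithmetic: the lane count is the register width divided by the element width. A minimal sketch follows; the helper name `lanes` is made up for illustration, and it assumes 8-bit bytes and IEEE 32-bit/64-bit `float`/`double`, which holds on the x86 targets discussed here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many elements of type T fit in a SIMD register
// that is register_bits wide. Assumes 8 bits per byte.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// A 128-bit SSE register viewed as different packed types.
static_assert(lanes<std::int8_t>(128) == 16, "sixteen 8-bit integers");
static_assert(lanes<std::int32_t>(128) == 4, "four 32-bit integers");
static_assert(lanes<double>(128) == 2, "two 64-bit floats");
```

The same register holds more lanes as the element type gets smaller, which is why narrow integer types benefit most from a given register width.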
Alignment C++11

    #include <iostream>

    struct alignas(16) foo
    {
        int i;                 // 4 bytes
        int j;                 // 4 bytes
        alignas(4) char s[3];  // 3 bytes
        short q;               // 2 bytes
    };

    int main()
    {
        // outputs 16:
        std::cout << alignof(foo) << '\n';
    }

Alignment C++03

    // pre-C++11

    // MSVC:
    struct __declspec(align(16)) foo
    {
        // ...
    };

    // gcc (the attribute goes before the struct name or after the closing brace):
    struct __attribute__((aligned(16))) foo
    {
        // ...
    };

Boost.Align
  • Handles heap allocation of aligned memory
  • Query the alignment requirements of a type
  • Declare alignment to the compiler portably

Compiler Intrinsics
  • A function whose implementation is handled directly by the compiler
  • SIMD registers exposed as data types
      __m64, __m128, __m128d, __m128i, etc.
  • SIMD instructions exposed as intrinsic functions
      _m_paddb, _m_paddd, _m_paddsb, etc.
  • Register allocation, instruction scheduling and addressing modes handled by the compiler
  • Proper alignment of operands is assumed

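To make the slide's points concrete, here is a minimal sketch using SSE intrinsics: `__m128` is the register type, `_mm_load_ps`, `_mm_add_ps` and `_mm_store_ps` are the intrinsic functions, and the compiler picks the actual registers. The function name `add_arrays_sse` is hypothetical, and the sketch assumes an x86 target with SSE enabled (the default on x86-64), a length that is a multiple of four, and 16-byte aligned pointers.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Adds two float arrays element by element, four lanes per instruction.
// Assumptions: n is a multiple of 4 and every pointer is 16-byte aligned,
// because _mm_load_ps and _mm_store_ps require aligned addresses.
void add_arrays_sse(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(out));
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             // load 4 packed floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 additions at once
    }
}
```

Callers would typically declare their buffers with `alignas(16)` so the alignment assumption holds; for unaligned data, `_mm_loadu_ps` and `_mm_storeu_ps` are the unaligned counterparts.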
Options Available

  Option                   Pros                 Cons
  Assembly                 + Direct control     - Hard to program
  Intrinsics               + Pure C/C++         - Hard to program
  Class Library            + Easier to program  - Less control
  Automatic Vectorization                       - Very little control

Proposed Boost.Simd
  • https://github.com/NumScale/boost.simd
  • Seems promising; easier to program without loss of control?
  • I had problems using it on Windows (issue #189)
  • Abstracts away the different sizes of registers as packs
  • Provides facilities to deal with alignment
  • Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
  • Single code base can target multiple extensions
  • Templates expand to calls to intrinsics

Group Exercise
  • Convert BasicMandel to use intrinsics
  • AVX packs eight 32-bit floats into a single 256-bit register
  • AVX Intrinsics:
      #include <immintrin.h>
      __m256 _mm256_add_ps(__m256 a, __m256 b)
      __m256 _mm256_mul_ps(__m256 a, __m256 b)
      __m256 _mm256_sub_ps(__m256 a, __m256 b)
      __m256 _mm256_load_ps(float const *c)
      __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
      __m256i _mm256_castps_si256(__m256 a)
  • Intel Intrinsics Guide

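One possible shape for the exercise is sketched below. BasicMandel itself is not reproduced in these slides, so `mandel_scalar` is a hypothetical stand-in for it, and `mandel_avx8` and the `TARGET_AVX` macro are likewise names invented here. The sketch assumes an AVX-capable x86 CPU and 32-byte aligned inputs. It counts iterations in float lanes for brevity, so it does not need `_mm256_castps_si256`; a version that counts in integer lanes would.

```cpp
#include <immintrin.h>  // AVX intrinsics
#include <cassert>
#include <cstdint>

// Assumption: on GCC/Clang this attribute enables AVX for one function, so
// the file builds without -mavx. Other compilers use their own switches.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical scalar reference: iterations of z = z*z + c, starting from
// z = 0, performed while |z|^2 <= 4, capped at max_iter.
int mandel_scalar(float cr, float ci, int max_iter) {
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Same computation for eight points at once. cr and ci must point to eight
// floats on a 32-byte boundary, as _mm256_load_ps requires.
TARGET_AVX
void mandel_avx8(const float* cr, const float* ci, int max_iter, int* out) {
    assert(reinterpret_cast<std::uintptr_t>(cr) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(ci) % 32 == 0);
    const __m256 x0 = _mm256_load_ps(cr);
    const __m256 y0 = _mm256_load_ps(ci);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();  // per-lane counts, kept as floats
    for (int n = 0; n < max_iter; ++n) {
        __m256 zr2 = _mm256_mul_ps(zr, zr);
        __m256 zi2 = _mm256_mul_ps(zi, zi);
        // All-ones mask in lanes with |z|^2 <= 4. The ordered compare is
        // false for NaN, so lanes that overflowed stay inactive.
        __m256 active = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LE_OQ);
        if (_mm256_movemask_ps(active) == 0) break;  // every lane escaped
        // Add 1.0 only where the mask is set.
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));
        __m256 zri = _mm256_mul_ps(zr, zi);
        zi = _mm256_add_ps(_mm256_add_ps(zri, zri), y0);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), x0);
    }
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, count);
    for (int i = 0; i < 8; ++i) out[i] = static_cast<int>(tmp[i]);
}
```

The key change from the scalar loop is that the per-point `while` condition becomes a per-lane mask: the loop runs until every lane has escaped, and the mask decides which lanes still accumulate iterations. Float counters are exact only up to 2^24 iterations, which is another reason a production version might count in integer lanes instead.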

SIMD Processing Using Compiler Intrinsics
