SIMD Processing Using Compiler Intrinsics

Richard Thomson legalize@xmission.com @LegalizeAdulthd github.com/LegalizeAdulthood

SIMD  Single  Instruction  Multiple  Data

SIMD Exploits Data Parallelism
- Image Processing
- Array Processing
- Scientific Computing
- 3D Graphics

Brief History of CPU SIMD

Year  Extension  Register Size
1997  MMX        64 bits
1999  SSE        128 bits
2001  SSE2       128 bits
2004  SSE3       128 bits
2006  SSE4       128 bits
2008  AVX        256 bits
2015  AVX-512    512 bits

Data Types
- 8-bit integers
- 16-bit integers
- 32-bit integers
- 64-bit integers
- 16-bit floats
- 32-bit floats
- 64-bit floats
- Multiple smaller quantities are packed into registers ("multiple data")
- Alignment requirements on data
- Older extensions do not support all data types
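To make the packing concrete, here is a minimal sketch that computes how many elements of a given type fit in a register of a given width. The helper name `lanes` is hypothetical and only illustrates the arithmetic; it says nothing about which extensions actually support each type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements of type T that fit in a
// SIMD register that is register_bits wide.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits)
{
    return register_bits / (8 * sizeof(T));
}

// A 128-bit SSE register holds four 32-bit floats...
static_assert(lanes<float>(128) == 4, "4 floats per 128-bit register");
// ...or sixteen 8-bit integers.
static_assert(lanes<std::uint8_t>(128) == 16, "16 bytes per 128-bit register");
// A 256-bit AVX register holds eight 32-bit floats.
static_assert(lanes<float>(256) == 8, "8 floats per 256-bit register");
```

The same arithmetic explains why wider registers pay off most for small element types.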

Alignment C++11

    #include <iostream>

    struct alignas(16) foo {
        int i;                 // 4 bytes
        int j;                 // 4 bytes
        alignas(4) char s[3];  // 3 bytes
        short q;               // 2 bytes
    };

    int main() {
        // outputs 16:
        std::cout << alignof(foo) << '\n';
    }

Alignment C++03

    // pre-C++11
    // MSVC:
    struct __declspec(align(16)) foo {
        // ...
    };
    // gcc: the attribute goes right after the struct keyword
    // (or after the closing brace)
    struct __attribute__((aligned(16))) foo {
        // ...
    };
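The two compiler-specific spellings above are often hidden behind a single macro. The following is a minimal sketch of that idea; the macro name `SIMD_ALIGN`, the struct `Vec4`, and the helper `is_aligned_to` are hypothetical names chosen for illustration, and only MSVC-compatible and GCC-compatible compilers are assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical portability macro: pick the alignment spelling
// that the current compiler understands.
#if defined(_MSC_VER)
#  define SIMD_ALIGN(n) __declspec(align(n))
#elif defined(__GNUC__)
#  define SIMD_ALIGN(n) __attribute__((aligned(n)))
#else
#  error "SIMD_ALIGN: unknown compiler, add its alignment syntax here"
#endif

// Four floats, aligned so they can be loaded into one 128-bit register.
struct SIMD_ALIGN(16) Vec4 {
    float x, y, z, w;
};

// Hypothetical helper: true if p is a multiple of n bytes.
inline bool is_aligned_to(const void* p, std::size_t n)
{
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}
```

With C++11 or later, `alignas(16)` makes the macro unnecessary.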

Boost.Align
- Handles heap allocation of aligned memory
- Query the alignment requirements of a type
- Declare alignment to the compiler portably

Compiler Intrinsics
- A function whose implementation is handled directly by the compiler.
- SIMD registers exposed as data types
  - __m64, __m128, __m128d, __m128i, etc.
- SIMD instructions exposed as intrinsic functions
  - _m_paddb, _m_paddd, _m_paddsb, etc.
- Register allocation, instruction scheduling and addressing modes handled by the compiler
- Proper alignment of operands is assumed
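As an illustration of these points, here is a minimal sketch that adds four packed floats using SSE intrinsics on an x86 target. The struct `Float4` and the function `add4` are hypothetical names; `__m128`, `_mm_load_ps`, `_mm_add_ps` and `_mm_store_ps` are standard SSE intrinsics from `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: four floats with the 16-byte alignment that
// the aligned load/store intrinsics assume.
struct alignas(16) Float4 {
    float v[4];
};

// Adds the four lanes of a and b with a single packed add.
Float4 add4(const Float4& a, const Float4& b)
{
    // The aligned load below may fault on a misaligned address.
    assert(reinterpret_cast<std::uintptr_t>(a.v) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b.v) % 16 == 0);

    const __m128 va = _mm_load_ps(a.v);     // load 4 floats into a register
    const __m128 vb = _mm_load_ps(b.v);
    const __m128 sum = _mm_add_ps(va, vb);  // one instruction, four additions

    Float4 result;
    _mm_store_ps(result.v, sum);            // write 4 floats back to memory
    return result;
}
```

Note that the code never names a register or an addressing mode; the compiler chooses them, as the slide says.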

Options Available

Approach                  Pros                 Cons
Assembly                  + Direct control     - Hard to program
Intrinsics                + Pure C/C++         - Hard to program
Class Library             + Easier to program  - Less control
Automatic Vectorization                        - Very little control
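For comparison with the intrinsics approach, here is a sketch of the kind of loop that automatic vectorization targets. The function name `saxpy` is an illustrative choice. Whether the compiler actually emits SIMD instructions for it depends on the compiler, the optimization level and the target flags, which is the "very little control" trade-off listed above.

```cpp
#include <cassert>
#include <cstddef>

// Plain scalar code: y[i] = a * x[i] + y[i].
// An optimizing compiler may turn this loop into packed SIMD
// instructions on its own, but nothing in the source guarantees it.
void saxpy(float a, const float* x, float* y, std::size_t n)
{
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Compiler vectorization reports are the usual way to check whether a given loop was vectorized.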

Proposed Boost.Simd
- https://github.com/NumScale/boost.simd
- Seems promising; easier to program without loss of control?
- I had problems using it on Windows (issue #189)
- Abstracts away the different sizes of registers as packs
- Provides facilities to deal with alignment
- Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
- Single code base can target multiple extensions
- Templates expand to calls to intrinsics

Group Exercise
- Convert BasicMandel to use intrinsics
- AVX packs 8 32-bit floats into a single 256-bit register
- AVX Intrinsics:
  - #include <immintrin.h>
  - __m256 _mm256_add_ps(__m256 a, __m256 b)
  - __m256 _mm256_mul_ps(__m256 a, __m256 b)
  - __m256 _mm256_sub_ps(__m256 a, __m256 b)
  - __m256 _mm256_load_ps(float const *c)
  - __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
  - __m256i _mm256_castps_si256(__m256 a)
- Reference: Intel Intrinsics Guide
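One possible shape for the inner loop of this exercise is sketched below. BasicMandel itself is not shown in these slides, so the function name `mandel8`, its signature, and the escape test are assumptions, not the exercise's reference solution. The `AVX_TARGET` macro is also hypothetical; it lets GCC and Clang compile this one function for AVX without a global `-mavx` flag. The code needs a CPU that supports AVX.

```cpp
#include <immintrin.h>  // AVX: __m256 and the _mm256_* intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical macro: enable AVX for a single function on GCC/Clang.
#if defined(__GNUC__)
#  define AVX_TARGET __attribute__((target("avx")))
#else
#  define AVX_TARGET
#endif

// Iterates z = z*z + c for 8 points at once, starting from z = 0.
// counts[k] receives the number of iterations for which point k
// still satisfied |z|^2 <= 4. cre and cim must be 32-byte aligned.
AVX_TARGET void mandel8(const float* cre, const float* cim,
                        int max_iter, int* counts)
{
    assert(reinterpret_cast<std::uintptr_t>(cre) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(cim) % 32 == 0);

    const __m256 cr = _mm256_load_ps(cre);
    const __m256 ci = _mm256_load_ps(cim);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 n = _mm256_setzero_ps();  // per-lane counts, kept as floats
    // All-ones mask: every lane starts out active.
    __m256 active = _mm256_cmp_ps(zr, zr, _CMP_EQ_OQ);

    for (int i = 0; i < max_iter; ++i) {
        const __m256 zr2 = _mm256_mul_ps(zr, zr);
        const __m256 zi2 = _mm256_mul_ps(zi, zi);
        const __m256 mag2 = _mm256_add_ps(zr2, zi2);
        // A lane stays active only while |z|^2 <= 4; once it escapes
        // it never becomes active again.
        active = _mm256_and_ps(active,
                               _mm256_cmp_ps(mag2, four, _CMP_LE_OQ));
        if (_mm256_movemask_ps(active) == 0) {
            break;  // all 8 points have escaped
        }
        // Add 1.0 to the count of each active lane, 0.0 to the others.
        n = _mm256_add_ps(n, _mm256_and_ps(active, one));
        const __m256 zrzi = _mm256_mul_ps(zr, zi);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), cr);    // re(z*z) + re(c)
        zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), ci);  // im(z*z) + im(c)
    }

    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, n);
    for (int k = 0; k < 8; ++k) {
        counts[k] = static_cast<int>(tmp[k]);
    }
}
```

The counts are accumulated as floats because the 256-bit integer add arrived with AVX2, not AVX. The mask produced by `_mm256_cmp_ps` replaces the scalar `if` that a non-SIMD loop would use to stop iterating a single point.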
