High Performance GPU Computing with Ruby
Prasun Anand
About me
● SciRuby contributor
● Google Summer of Code 2016, 2017
● GeneNetwork project
● Ruby Grant 2017
● Projects:
○ JRuby port of NMatrix
○ ArrayFire gem
○ RbCUDA
SciRuby
SciRuby has been pushing Ruby for scientific computing. Popular Rubygems:
● NMatrix
● Daru
● Mixed_models
● Nyaplot
● IPython Notebook
CUDA and OpenCL
CUDA is a parallel computing platform and programming model limited to NVIDIA hardware. OpenCL (Open Computing Language) is an open standard supported across GPU hardware from multiple vendors.
Af_Array
An array of numbers is stored as an Af_Array object. The data itself is stored on the GPU.

[1] pry(main)> a = ArrayFire::Af_Array.new 2, [2,2], [1,2,3,4]
No Name Array
[2 2 1 1]
Offsets: [0 0 0 0]
Strides: [1 2 4 4]
    1.0000     3.0000
    2.0000     4.0000
=> #<ArrayFire::Af_Array:0x000000020aeab8>
[2] pry(main)> b = a + a
No Name Array
[2 2 1 1]
Offsets: [0 0 0 0]
Strides: [1 2 4 4]
    2.0000     6.0000
    4.0000     8.0000
=> #<ArrayFire::Af_Array:0x000000020625c8>
[3] pry(main)> b = a * a
No Name Array
[2 2 1 1]
Offsets: [0 0 0 0]
Strides: [1 2 4 4]
    1.0000     9.0000
    4.0000    16.0000
=> #<ArrayFire::Af_Array:0x00000001fe6f90>
VALUE arf_init(int argc, VALUE* argv, VALUE self) {
  afstruct* afarray;
  Data_Get_Struct(self, afstruct, afarray);

  /* argv[0] = number of dimensions, argv[1] = shape, argv[2] = elements. */
  dim_t ndims = (dim_t)NUM2LONG(argv[0]);
  dim_t* dimensions = (dim_t*)malloc(ndims * sizeof(dim_t));
  dim_t count = 1;
  for (size_t index = 0; index < ndims; index++) {
    dimensions[index] = (dim_t)NUM2LONG(RARRAY_AREF(argv[1], index));
    count *= dimensions[index];
  }

  double* host_array = (double*)malloc(count * sizeof(double));
  for (size_t index = 0; index < count; index++) {
    host_array[index] = (double)NUM2DBL(RARRAY_AREF(argv[2], index));
  }

  /* Copy the host buffer to the device as a double-precision (f64) array. */
  af_create_array(&afarray->carray, host_array, ndims, dimensions, f64);

  free(dimensions);
  free(host_array);
  return self;
}
static VALUE arf_ew_add(VALUE left_val, VALUE right_val) {
  afstruct* left;
  afstruct* right;
  afstruct* result = ALLOC(afstruct);

  Data_Get_Struct(left_val, afstruct, left);
  Data_Get_Struct(right_val, afstruct, right);

  /* Element-wise addition runs on the device; the last argument enables batch mode. */
  af_add(&result->carray, left->carray, right->carray, true);

  return Data_Wrap_Struct(CLASS_OF(left_val), NULL, arf_free, result);
}
BLAS and LAPACK
BLAS functions:
● matmul
● transpose
LAPACK functions:
● det
● inverse
● norm
● qr
● cholesky
● svd
● lu
=> #<ArrayFire::Af_Array:0x00000001591db0>
[3] pry(main)> result = ArrayFire::BLAS.matmul(left, right, :AF_MAT_NONE, :AF_MAT_NONE)
No Name Array
[3 2 1 1]
  -39.0000   -74.0000
   68.0000   -17.0000
   86.0000   118.0000
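For the LAPACK routines listed earlier, a minimal usage sketch; the ArrayFire::LAPACK module and its method names are assumptions mirroring the ArrayFire::BLAS pattern above, not confirmed gem API:

# Hypothetical sketch: ArrayFire::LAPACK is an assumed module name.
a = ArrayFire::Af_Array.new 2, [2,2], [4.0, 2.0, 2.0, 3.0]
d = ArrayFire::LAPACK.det(a)        # determinant, assumed binding over af_det
l, u, p = ArrayFire::LAPACK.lu(a)   # LU factorization, assumed binding over af_lu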
Statistics
● mean
● median
● variance
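A minimal usage sketch for these routines, assuming an ArrayFire::Statistics module; the module and method names are assumptions, not confirmed gem API:

# Hypothetical sketch: module and method names assumed.
a = ArrayFire::Af_Array.new 1, [4], [1.0, 2.0, 3.0, 4.0]
m = ArrayFire::Statistics.mean(a)       # assumed binding over af_mean
v = ArrayFire::Statistics.variance(a)   # assumed binding over af_var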
Random Engine
Random number generators. Engine types:
● :AF_RANDOM_ENGINE_PHILOX_4X32_10
● :AF_RANDOM_ENGINE_THREEFRY_2X32_16
● :AF_RANDOM_ENGINE_MERSENNE_GP11213
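A sketch of selecting an engine and drawing uniform random numbers; the RandomEngine class and randu call are assumptions based on ArrayFire's C API (af_create_random_engine, af_random_uniform), not confirmed gem API:

# Hypothetical sketch: class and method names assumed.
engine = ArrayFire::RandomEngine.new :AF_RANDOM_ENGINE_PHILOX_4X32_10, 42
a = ArrayFire::Af_Array.randu 2, [3,3], engine   # assumed: uniform random 3x3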
Benchmarks
● AMD FX 8350 octa-core processor
● NVIDIA GTX 750 Ti GPU
● CUDA backend
● Double dtype
10,000x faster than NMatrix-Ruby
100,000x faster than NMatrix-Ruby-BLAS
10x faster than NMatrix-Ruby-LAPACK
RbCUDA
Custom Kernels
Scientific software often requires custom kernel code suited to its needs. RbCUDA can run a kernel from a Ruby file, on the fly: the kernel source is compiled dynamically and executed on the GPU to manipulate the array pointers.
vadd_kernel_src = <<-EOS
extern "C" {
  __global__ void matSum(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < 100)
      c[tid] = a[tid] + b[tid];
  }
}
EOS

f = compile(vadd_kernel_src)
puts f.path
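The slide stops at compilation; a hedged sketch of the subsequent launch, assuming driver-API-style bindings in RbCUDA (every call name below is an assumption for illustration, mirroring cuModuleLoad / cuModuleGetFunction / cuLaunchKernel):

# Hypothetical sketch: RbCUDA method names assumed.
mod  = RbCUDA::Driver.cu_module_load(f.path)                 # load compiled module
func = RbCUDA::Driver.cu_module_get_function(mod, "matSum")  # fetch kernel handle
# Launch 100 blocks of 1 thread each; a_gpu, b_gpu, c_gpu are device pointers.
RbCUDA::Driver.cu_launch_kernel(func, 100, 1, 1, 1, 1, 1, 0, nil, [a_gpu, b_gpu, c_gpu])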
GPU Array
A generic pointer is used to handle an array of elements on the GPU, with memory copied from CPU to GPU and vice versa. It is interfaced with NMatrix and NArray.
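A sketch of the intended round trip; the GPUArray allocation and copy method names below are assumptions for illustration, not confirmed RbCUDA API:

# Hypothetical sketch: method names assumed.
require 'nmatrix'
n      = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
d_arr  = RbCUDA::GPUArray.new(n.size)   # assumed: allocate device memory
d_arr.copy_from(n)                      # assumed: CPU -> GPU
result = d_arr.copy_to_nmatrix          # assumed: GPU -> CPU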
cuBLAS, cuSOLVER, and cuRAND
● BLAS routines
● Matrix decomposition routines
● Random number generators
Benchmarks
● AMD FX 8350 octa-core processor
● NVIDIA GTX 750 Ti GPU
● Double dtype
1,000,000x faster than NMatrix-Ruby-BLAS
Future Work
● Image processing APIs and indexers
● Multiple dtypes
● RbCUDA is under active development, funded by the Ruby Association (Ruby Grant 2017).
● https://github.com/arrayfire/arrayfire-rb
● https://github.com/prasunanand/rbcuda
Contributions are welcome!
Acknowledgements
1. Pjotr Prins
2. Pradeep Garigipati
Thank You
GitHub: prasunanand
Twitter: @prasun_anand
Blog: prasunanand.com