Runtime Code Generation and Data Management for Heterogeneous Computing in Java

Juan José Fumero, Toomas Remmelg, Michel Steuwer, Christophe Dubach
The University of Edinburgh

Principles and Practices of Programming on the Java Platform 2015
Outline

1. Introduction
2. API
3. Runtime Code Generation
4. Data Management
5. Results
6. Conclusion
Heterogeneous Computing

• NBody App (NVIDIA SDK): ~105x speedup over the sequential version
• LU Decomposition (Rodinia Benchmark): ~10x speedup over 32 OpenMP threads
Cool, but how do we program it?
Example in OpenCL (host code, part 1)

// Create host buffers
int *A, ...;
// Initialization
...
// Platform
cl_uint numPlatforms = 0;
cl_platform_id *platforms;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
platforms = (cl_platform_id *) malloc(numPlatforms * sizeof(cl_platform_id));
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
cl_uint numDevices = 0;
cl_device_id *devices;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
// Allocate space for each device
devices = (cl_device_id *) malloc(numDevices * sizeof(cl_device_id));
// Fill in devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
cl_context context;
context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdQ;
cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
cl_mem d_A, d_B, d_C;
d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
...
// Check errors
...
Example in OpenCL (host code, part 2)

const char *sourceFile = "kernel.cl";
source = read_source(sourceFile);
program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &status);
cl_int buildErr;
buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
// Create a kernel
kernel = clCreateKernel(program, "vecadd", &status);

status  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

size_t globalWorkSize[1] = { ELEMENTS };
size_t local_item_size[1] = { 5 };  // unused: a NULL local size lets the runtime choose

clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

// Free memory
Example in OpenCL (kernel)

__kernel void vecadd(
    __global int *a,
    __global int *b,
    __global int *c) {
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

• A "hello world" app is ~250 lines of code (including error checking)
• Low-level, device-specific code
• Requires knowledge of the target architecture
• If the GPU/accelerator changes, re-tuning is required
OpenCL programming is hard and error-prone!
Higher levels of abstraction
Programming for Heterogeneous Computing
Similar Works

• Sumatra (discontinued): Stream API for HSAIL
• AMD Aparapi: Java API for OpenCL
• NVIDIA Nova: functional programming language for CPU/GPU
• Copperhead: subset of Python that can be executed on heterogeneous platforms
Our Approach

Three levels of abstraction:
• Parallel skeletons: an API based on a functional programming style (map/reduce)
• A high-level optimising library that rewrites operations to target specific hardware
• OpenCL code generation and a runtime with data management for heterogeneous architectures
Example: Saxpy

// Computation function
ArrayFunc<Tuple2<Float, Float>, Float> mult =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

// Prepare the input
Tuple2<Float, Float>[] input = new Tuple2[size];
for (int i = 0; i < input.length; ++i) {
    input[i] = new Tuple2<>((float) (i * 0.323), (float) (i + 2.0));
}

// Computation
Float[] output = mult.apply(input);

If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:

map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
Our Approach: Overview

ArrayFunc implements Function and is specialised by execution strategy (see the Java sketch after this slide):

Map.apply():        for (i = 0; i < size; ++i) out[i] = f.apply(in[i]);
MapThreads.apply(): for (thread : threads) thread.performMapSeq();
MapOpenCL.apply():  copyToDevice(); execute(); copyToHost();
Reduce, ...
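To make the hierarchy concrete, here is a minimal, runnable Java sketch of the skeleton classes from the diagram above. The class and method names (ArrayFunc, Map, MapThreads) follow the slide; the generics, constructors, and thread partitioning are our own illustrative assumptions, not the actual JPAI implementation.

import java.util.function.Function;

// Base skeleton: an array function wrapping the user's element-wise lambda.
abstract class ArrayFunc<T, R> implements Function<T[], R[]> {
    protected final Function<T, R> f;
    protected ArrayFunc(Function<T, R> f) { this.f = f; }
}

// Sequential map: apply f to every element in a plain loop.
class Map<T, R> extends ArrayFunc<T, R> {
    Map(Function<T, R> f) { super(f); }
    @Override @SuppressWarnings("unchecked")
    public R[] apply(T[] in) {
        Object[] out = new Object[in.length];
        for (int i = 0; i < in.length; ++i)
            out[i] = f.apply(in[i]);
        return (R[]) out;
    }
}

// Multi-threaded map: split the index range across Java threads, each of
// which performs a sequential map over its chunk (performMapSeq on the slide).
class MapThreads<T, R> extends ArrayFunc<T, R> {
    private final int nThreads;
    MapThreads(Function<T, R> f, int nThreads) { super(f); this.nThreads = nThreads; }
    @Override @SuppressWarnings("unchecked")
    public R[] apply(T[] in) {
        Object[] out = new Object[in.length];
        Thread[] threads = new Thread[nThreads];
        int chunk = (in.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; ++t) {
            final int start = t * chunk;
            final int end = Math.min(in.length, start + chunk);
            threads[t] = new Thread(() -> {
                for (int i = start; i < end; ++i)
                    out[i] = f.apply(in[i]);
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return (R[]) out;
    }
}

// MapOpenCL.apply() would follow the rewrite rule from the previous slide:
// copyToDevice(); execute(); copyToHost(); -- omitted here because it needs
// the OpenCL runtime and code generator described on the following slides.

public class SkeletonDemo {
    public static void main(String[] args) {
        Integer[] in = {1, 2, 3, 4};
        ArrayFunc<Integer, Double> mul = new MapThreads<>(x -> x * 2.5, 2);
        Object[] out = mul.apply(in);  // element type is erased to Object[] in this sketch
        System.out.println(out[3]);    // prints 10.0
    }
}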
Runtime Code Generation: Workflow

From the Java source (map.apply(f)) and its bytecode, the Graal VM builds the Graal IR (a control-flow graph with dataflow edges) and turns it into an OpenCL kernel in four steps:

1. Type inference
2. IR generation
3. Optimizations
4. Kernel generation

Java bytecode (excerpt):
  10: aload_2
  11: iload_3
  12: aload_0
  13: getfield
  16: aaload
  18: invokeinterface #apply
  23: aastore
  24: iinc
  27: iload_3

Resulting OpenCL kernel (skeleton):
  __kernel void ...(__global float* input, __global float* output) { ...; }
Runtime Code Generation: IR Optimization

The Graal IR of the lambda is progressively simplified. The initial graph carries the boxed calls (Invoke#Integer.intValue, Invoke#Double.valueOf) around the DoubleConvert, Const(2.0), and * nodes; after inlining, these become explicit IsNull / GuardingPi(NullCheckException) guards plus Box and Unbox nodes; further optimization removes the boxing and null checks, leaving only Param, DoubleConvert, Const(2.0), *, and Return. The final graph corresponds to:

inline double lambda0(int p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}
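As a rough illustration of the final kernel-generation step, the following toy Java sketch walks an expression tree shaped like the optimized graph above (Param, Const, DoubleConvert, *) and emits C source text. The node classes and emit logic are our own simplified assumptions; the real system generates OpenCL from the full Graal IR.

// A toy illustration of kernel generation: walk an expression tree shaped
// like the optimized IR above and emit the corresponding C function text.
public class ToyKernelGen {
    interface Node { String emit(); }

    static class Param implements Node {
        final String name;
        Param(String name) { this.name = name; }
        public String emit() { return name; }
    }

    static class Const implements Node {
        final double value;
        Const(double value) { this.value = value; }
        public String emit() { return Double.toString(value); }
    }

    static class Convert implements Node {  // e.g. the DoubleConvert node
        final String targetType; final Node input;
        Convert(String targetType, Node input) { this.targetType = targetType; this.input = input; }
        public String emit() { return "(" + targetType + ") " + input.emit(); }
    }

    static class BinOp implements Node {    // e.g. the * node
        final String op; final Node left, right;
        BinOp(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        // No parenthesization needed for this single-operator example.
        public String emit() { return left.emit() + " " + op + " " + right.emit(); }
    }

    static String genFunction(String name, String retType, String paramType,
                              String paramName, Node body) {
        return retType + " " + name + "(" + paramType + " " + paramName + ") {\n"
             + "    return " + body.emit() + ";\n"
             + "}";
    }

    public static void main(String[] args) {
        // Param(p0) -> DoubleConvert -> * Const(2.0) -> Return
        Node body = new BinOp("*", new Convert("double", new Param("p0")), new Const(2.0));
        System.out.println(genFunction("lambda0", "double", "int", "p0", body));
        // Prints: double lambda0(int p0) { return (double) p0 * 2.0; }
    }
}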
OpenCL Code Generated

double lambda0(float p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

__kernel void lambdaComputationKernel(
        __global float *p0,
        __global int *p0_index_data,
        __global double *p1,
        __global int *p1_index_data) {
    int p0_dim_1 = 0; int p1_dim_1 = 0;
    int gs = get_global_size(0);
    int loop_1 = get_global_id(0);
    for (;; loop_1 += gs) {
        int p0_len_dim_1 = p0_index_data[p0_dim_1];
        bool cond_2 = loop_1 < p0_len_dim_1;
        if (cond_2) {
            float auxVar0 = p0[loop_1];
            double res = lambda0(auxVar0);
            p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
        } else { break; }
    }
}
Main Components

• API (user code): parallel pattern composition
• Accelerator runtime: Java threads execution, deoptimisation (*), skeleton/pattern rewriting, runtime buffer management, code cache management
• Graal bridge: code generator, optimizer, OpenCL JIT compilation
Investigation of the Runtime for Black-Scholes

Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]

[Figure: breakdown of the total runtime (%) into unmarshalling, copy to CPU, GPU execution, copy to GPU, marshalling, and Java overhead]

• Un/marshalling the data takes up to 90% of the total time
• The computation step should be the dominant one

This is not acceptable. Can we do better?
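To see where that 90% goes, the sketch below shows what marshalling a Tuple2<Float, Float>[] into the flat primitive buffer OpenCL expects involves: one object dereference and two field copies per element on the way in, plus a fresh object allocation per element on the way back. The Tuple2F class and helper names are hypothetical, for illustration only.

import java.nio.FloatBuffer;

// Hypothetical pair type, analogous to the Tuple2 used in the examples.
class Tuple2F {
    final float _1, _2;
    Tuple2F(float a, float b) { _1 = a; _2 = b; }
}

public class MarshalDemo {
    // Marshalling: flatten an array of boxed pairs into one primitive buffer.
    // Every element costs an object dereference plus two field copies.
    static FloatBuffer marshal(Tuple2F[] input) {
        FloatBuffer buf = FloatBuffer.allocate(2 * input.length);
        for (Tuple2F t : input) {
            buf.put(t._1);
            buf.put(t._2);
        }
        buf.flip();
        return buf;
    }

    // Unmarshalling: rebuild boxed objects from the primitive buffer,
    // allocating a fresh Tuple2F per element.
    static Tuple2F[] unmarshal(FloatBuffer buf, int n) {
        Tuple2F[] out = new Tuple2F[n];
        for (int i = 0; i < n; ++i)
            out[i] = new Tuple2F(buf.get(2 * i), buf.get(2 * i + 1));
        return out;
    }

    public static void main(String[] args) {
        Tuple2F[] in = { new Tuple2F(1f, 2f), new Tuple2F(3f, 4f) };
        FloatBuffer flat = marshal(in);
        Tuple2F[] back = unmarshal(flat, in.length);
        System.out.println(back[1]._2);  // 4.0
    }
}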
Custom Array Type: PArray

Programmer's view: a PArray<Tuple2<Float, Double>> is an array of tuples, indexed 0 .. n-1.

Graal-OCL VM view: one FloatBuffer holds all the float fields and one DoubleBuffer holds all the double fields (a structure-of-arrays layout).

With this layout, un/marshalling operations are not necessary.
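Here is a minimal sketch of how such a PArray could be backed by NIO buffers, hard-wired to Tuple2<Float, Double>. The PArray name and the FloatBuffer/DoubleBuffer backing come from the slide; the put/get API and the fixed specialization are simplifying assumptions, since the real implementation handles arbitrary tuple types.

import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

// Simplified pair type standing in for Tuple2<Float, Double>.
class Tuple2FD {
    final float _1; final double _2;
    Tuple2FD(float a, double b) { _1 = a; _2 = b; }
}

// Sketch of a PArray specialized to Tuple2<Float, Double>: the fields of all
// elements live in two flat primitive buffers, so the data can be handed to
// OpenCL without any marshalling step.
public class PArrayFD {
    private final FloatBuffer floats;
    private final DoubleBuffer doubles;
    private final int size;

    public PArrayFD(int size) {
        this.size = size;
        this.floats = FloatBuffer.allocate(size);
        this.doubles = DoubleBuffer.allocate(size);
    }

    public void put(int i, Tuple2FD t) {
        floats.put(i, t._1);   // scatter each field into its own buffer
        doubles.put(i, t._2);
    }

    public Tuple2FD get(int i) {
        // Reconstruct the pair view on demand; no bulk conversion needed.
        return new Tuple2FD(floats.get(i), doubles.get(i));
    }

    public int size() { return size; }

    public static void main(String[] args) {
        PArrayFD pArray = new PArrayFD(4);
        pArray.put(2, new Tuple2FD(2.0f, 4.0));  // mirrors the slide's example
        System.out.println(pArray.get(2)._2);    // 4.0
    }
}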
Example of JPAI

ArrayFunc<Tuple2<Float, Double>, Double> f =
    new MapFunction<>(t -> 2.5f * t._1() + t._2());

PArray<Tuple2<Float, Double>> input = new PArray<>(size);

for (int i = 0; i < size; ++i) {
    input.put(i, new Tuple2<>((float) i, (double) i + 2));
}

PArray<Double> output = f.apply(input);

A call such as pArray.put(2, new Tuple2<>(2.0f, 4.0)) writes 2.0f directly into the underlying FloatBuffer and 4.0 into the DoubleBuffer; no boxed Tuple2 objects are stored.
Setup

• 5 applications
• Comparison with Java sequential code (Graal-compiled)
• AMD and NVIDIA GPUs
• Java arrays vs. custom PArray
• Java threads
Java Threads Execution

[Figure: speedup vs. Java sequential (0x-6x) for Saxpy, K-Means, Black-Scholes, N-Body, and Monte Carlo (small and large inputs), with 1, 2, 4, 8, and 16 Java threads]

CPU: Intel(R) Core(TM) i7-4770K @ 3.50GHz
OpenCL GPU Execution

[Figure: speedup vs. Java sequential (log scale, 0.1x-1000x) for Saxpy, K-Means, Black-Scholes, N-Body, and Monte Carlo (small and large inputs), comparing the marshalling and optimized (PArray) versions on each GPU; Saxpy on the small input drops to 0.004x, while highlighted optimized speedups reach 10x, 12x, and 70x]

GPUs: AMD Radeon R9 295, NVIDIA GeForce GTX Titan Black
.zip(Conclusions).map(Future)

Present
• We have presented an API that enables heterogeneous computing in Java
• A custom array type reduces the overhead of transferring data
• A runtime system runs heterogeneous applications from within Java

Future
• Runtime data type specialization
• Code generation for multiple devices
• Runtime scheduling (where is the best place to run the code?)
Thanks so much for your attention

This work was supported by a grant from:

Juan Fumero
Email: juan.fumero@ed.ac.uk
Webpage: http://homepages.inf.ed.ac.uk/s1369892/
