Marawacc: A Framework for Heterogeneous Computing in Java
Juan Fumero, Michel Steuwer, Christophe Dubach
The University of Edinburgh
UK Many-Core Developer Conference 2016

Outline: Motivation, Marawacc-API, Runtime Code Generation, Runtime Management, Results, Conclusion
Motivation
Marawacc: our approach
Three levels of abstraction
Marawacc API
Example: Saxpy in Java

    float[] v1 = new float[size];
    float[] v2 = new float[size];
    float[] result = new float[size];

    for (int i = 0; i < size; i++) {
        result[i] = alpha * v1[i] + v2[i];
    }
Example: Saxpy in Java (with the Marawacc API)

    Float[] v1 = new Float[size];
    Float[] v2 = new Float[size];

    ArrayFunc<Tuple2<Float, Float>, Float> f;
    f = new MapFunction<>(t -> alpha * t._1() + t._2());

    Float[] result = f.zip(v1, v2).apply();
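To make the zip-then-map semantics of the example above concrete, here is a minimal plain-Java sketch that runs sequentially on the CPU. The class `ZipMapSketch`, the `Pair` type and the `zipMap` helper are hypothetical stand-ins invented for illustration; they are not part of the Marawacc API.

```java
import java.util.Arrays;
import java.util.function.Function;

class ZipMapSketch {
    // Hypothetical stand-in for Tuple2; not the Marawacc class.
    static final class Pair<A, B> {
        final A first;
        final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
    }

    // Pairs v1[i] with v2[i], then applies f to every pair, one element at a time.
    static Float[] zipMap(Float[] v1, Float[] v2, Function<Pair<Float, Float>, Float> f) {
        if (v1.length != v2.length) {
            throw new IllegalArgumentException("input arrays must have the same length");
        }
        Float[] out = new Float[v1.length];
        for (int i = 0; i < v1.length; i++) {
            out[i] = f.apply(new Pair<>(v1[i], v2[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        float alpha = 2.0f;
        Float[] v1 = {1.0f, 2.0f, 3.0f};
        Float[] v2 = {10.0f, 20.0f, 30.0f};
        // Same computation as the saxpy lambda: alpha * x + y per element.
        Float[] result = zipMap(v1, v2, t -> alpha * t.first + t.second);
        System.out.println(Arrays.toString(result)); // prints [12.0, 24.0, 36.0]
    }
}
```

Because the lambda has no dependencies between elements, every iteration of the loop is independent, which is what allows a framework to map each element to a separate GPU thread.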
Runtime Code Generation
Runtime Code Generation: Workflow
Inside the Graal VM, the Java source (Map.apply(f)) is available as Java bytecode and is turned into an OpenCL kernel in four steps:
1. Type inference
2. IR generation (CFG + data flow, Graal IR)
3. Optimizations
4. Kernel generation

Java bytecode (excerpt):

    ...
    10: aload_2
    11: iload_3
    12: aload_0
    13: getfield
    16: aaload
    18: invokeinterface #apply
    23: aastore
    24: iinc
    27: iload_3
    ...

OpenCL kernel (skeleton):

    void kernel(global float* input, global float* output) { ...; ...; }
Runtime Code Generation
MapFunction<Integer, Double>(x -> x * 2.0)
[Figure: three Graal IR graphs for this lambda at successive stages. Node labels that appear: StartNode, Param, IsNull, GuardingPi (NullCheckException), MethodCallTarget, Invoke#Integer.intValue, Unbox, DoubleConvert, Const (2.0), *, Box, Invoke#Double.valueOf, Return.]
Generated code:

    inline double lambda0(int p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }
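The difference between the boxed lambda the programmer writes and the primitive function that is generated can be mirrored in plain Java. This is only an illustration of the effect; the class and field names are made up here, and nothing below is produced by Graal or Marawacc.

```java
import java.util.function.Function;
import java.util.function.IntToDoubleFunction;

class UnboxingSketch {
    // What the programmer writes: boxed Integer in, boxed Double out.
    // Calling it implies a null check, an unbox, the multiply, and a box.
    static final Function<Integer, Double> BOXED = x -> x * 2.0;

    // Primitive equivalent, shaped like the generated lambda0:
    // int in, double out, no objects involved.
    static final IntToDoubleFunction PRIMITIVE = p0 -> {
        double cast1 = (double) p0;
        double result2 = cast1 * 2.0;
        return result2;
    };

    public static void main(String[] args) {
        System.out.println(BOXED.apply(21));             // prints 42.0
        System.out.println(PRIMITIVE.applyAsDouble(21)); // prints 42.0
    }
}
```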
Marawacc: Runtime Management
Where is the time spent?
Black-Scholes benchmark. Float[] ⇒ Tuple2<Float, Float>[]
[Figure: stacked bar showing the amount of total runtime in % (axis from 0.0 to 1.0), split into Java overhead, Marshaling, CopyToGPU, GPU Execution, CopyToCPU and Unmarshaling.]
- Un/marshaling data takes up to 90% of the time
- The computation step should be dominant
This is not acceptable. Can we do better?
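The marshaling and unmarshaling steps measured above amount to element-by-element copies between boxed Java objects and flat primitive arrays. The following sketch shows the kind of work involved; `MarshalSketch`, `Pair`, `marshal` and `unmarshal` are hypothetical names chosen for illustration and do not reproduce the framework's internal code.

```java
import java.util.Arrays;

class MarshalSketch {
    // Hypothetical stand-in for one Tuple2<Float, Float> element.
    static final class Pair {
        final Float first;
        final Float second;
        Pair(Float first, Float second) { this.first = first; this.second = second; }
    }

    // Marshal: copy the boxed fields of every tuple into flat primitive arrays
    // that could be handed to a device. Row 0 holds the first fields, row 1 the second.
    static float[][] marshal(Pair[] input) {
        float[] firsts = new float[input.length];
        float[] seconds = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            firsts[i] = input[i].first;   // unboxing on every element
            seconds[i] = input[i].second;
        }
        return new float[][] {firsts, seconds};
    }

    // Unmarshal: copy flat primitive results back into boxed Java objects.
    static Float[] unmarshal(float[] raw) {
        Float[] out = new Float[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i];              // boxing on every element
        }
        return out;
    }

    public static void main(String[] args) {
        Pair[] input = { new Pair(1f, 2f), new Pair(3f, 4f) };
        float[][] flat = marshal(input);
        System.out.println(Arrays.toString(flat[0]));            // prints [1.0, 3.0]
        System.out.println(Arrays.toString(unmarshal(flat[1]))); // prints [2.0, 4.0]
    }
}
```

Both loops touch every element once before and once after the kernel runs, so for cheap kernels the copies can easily outweigh the computation itself.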
Custom Array Type: PArray
Programmer's view: PArray<Tuple2<Float, Double>>

    index:   0        1        2        ...  n-1
             Tuple2   Tuple2   Tuple2   ...  Tuple2
             float    float    float    ...  float
             double   double   double   ...  double

Graal-OCL VM view:

    index:         0        1        2        ...  n-1
    FloatBuffer:   float    float    float    ...  float
    DoubleBuffer:  double   double   double   ...  double

With this layout, un/marshal operations are not necessary.
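The layout above is a structure-of-arrays: one primitive buffer per tuple field, with a tuple-like accessor on top. A minimal sketch of that idea using standard `java.nio` buffers follows. `PArraySketch` and its methods are hypothetical and are not the real PArray class; heap buffers from `allocate` are used here for simplicity, and how the real implementation allocates its buffers is not stated in the slides.

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

class PArraySketch {
    private final FloatBuffer floats;   // all first fields, stored contiguously
    private final DoubleBuffer doubles; // all second fields, stored contiguously
    private final int size;

    PArraySketch(int size) {
        this.size = size;
        this.floats = FloatBuffer.allocate(size);
        this.doubles = DoubleBuffer.allocate(size);
    }

    // Programmer's view: write element i as a (float, double) pair.
    void put(int i, float first, double second) {
        floats.put(i, first);
        doubles.put(i, second);
    }

    // Programmer's view: read the fields of element i.
    float first(int i) { return floats.get(i); }
    double second(int i) { return doubles.get(i); }
    int size() { return size; }

    // Runtime's view: the flat primitive buffers, usable without any conversion pass.
    FloatBuffer floatBuffer() { return floats; }
    DoubleBuffer doubleBuffer() { return doubles; }

    public static void main(String[] args) {
        PArraySketch input = new PArraySketch(3);
        for (int i = 0; i < input.size(); i++) {
            input.put(i, i + 0.5f, i * 10.0);
        }
        System.out.println(input.first(2) + " " + input.second(2)); // prints 2.5 20.0
    }
}
```

Since the data is already stored as primitives in per-field buffers, there is no per-element boxing or unboxing between the Java side and the device side.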
Saxpy example

    Float[] v1 = new Float[size];
    Double[] v2 = new Double[size];

    ArrayFunc<Tuple2<Float, Double>, Double> f;
    f = new MapFunction<>(t -> alpha * t._1() + t._2());

    Double[] result = f.zip(v1, v2).apply();
Saxpy with our Custom PArrays

    Float[] v1 = new Float[size];
    Double[] v2 = new Double[size];
    PArray input = new PArray(v1, v2);

    ArrayFunc<Tuple2<Float, Double>, Double> f;
    f = new MapFunction<>(t -> alpha * t._1() + t._2());
    PArray<Double> output = f.apply(input);
Results
OpenCL GPU Execution
AMD R9 and NVIDIA GeForce GTX Titan
[Figure: two panels (AMD, Nvidia) showing speedup vs. Java sequential on a log scale from 0.1 to 1000, comparing the Marshalling and Optimized versions on small and large inputs of Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo.]
Comparison with OpenCL C++
AMD R9 and NVIDIA GeForce GTX Titan
[Figure: two panels (AMD, NVIDIA) showing speedup over sequential code on a log scale, comparing Marawacc, Aparapi and OpenCL C++ on small and large inputs of Saxpy, K-Means, Black-Scholes, N-Body and MonteCarlo.]
.zip(Conclusions).map(Future)
Present
- We have presented the Marawacc framework for programming GPUs from Java
- Custom array type to reduce overheads when transforming the data
- Runtime system to run heterogeneous applications within Java
Future
- Code generation for multiple devices
- Runtime scheduling (where is the best place to run the code?)
Thanks so much for your attention
Juan Fumero <juan.fumero@ed.ac.uk>
OpenCL code generated

    double lambda0(float p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }

    kernel void lambdaComputationKernel(
            global float* p0,
            global int* p0_index_data,
            global double* p1,
            global int* p1_index_data) {
        int p0_dim_1 = 0;
        int p1_dim_1 = 0;
        int gs = get_global_size(0);
        int loop_1 = get_global_id(0);
        for (;; loop_1 += gs) {
            int p0_len_dim_1 = p0_index_data[p0_dim_1];
            bool cond_2 = loop_1 < p0_len_dim_1;
            if (cond_2) {
                float auxVar0 = p0[loop_1];
                double res = lambda0(auxVar0);
                p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
            } else {
                break;
            }
        }
    }
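The loop in the generated kernel starts at the work-item's global id and advances by the global size, so any number of work-items covers the whole input. The sketch below emulates that access pattern sequentially in Java. It is a simplification under stated assumptions: the index-data arrays and the output offset of the real kernel are left out, and `StrideLoopSketch` and `workItem` are illustrative names only.

```java
import java.util.Arrays;

class StrideLoopSketch {
    // Java counterpart of the generated lambda0: widen to double, multiply by 2.0.
    static double lambda0(float p0) {
        return (double) p0 * 2.0;
    }

    // Emulates one work-item: begin at globalId, step by globalSize,
    // stop once the index passes the end of the input.
    static void workItem(int globalId, int globalSize, float[] in, double[] out) {
        for (int i = globalId; i < in.length; i += globalSize) {
            out[i] = lambda0(in[i]);
        }
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f, 4f, 5f};
        double[] out = new double[in.length];
        int globalSize = 2; // fewer work-items than elements, on purpose
        for (int id = 0; id < globalSize; id++) {
            workItem(id, globalSize, in, out); // a GPU would run these concurrently
        }
        System.out.println(Arrays.toString(out)); // prints [2.0, 4.0, 6.0, 8.0, 10.0]
    }
}
```

With a global size of 2, work-item 0 handles indices 0, 2 and 4 and work-item 1 handles indices 1 and 3, so the kernel stays correct even when the launch size is smaller than the array.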