Leveraging the GPU on Spark
Tobias Polzer, Friedrich-Alexander University Erlangen-Nuremberg; Josef Adersberger, QAware GmbH
May 17, 2017


  1. Leveraging the GPU on Spark — Tobias Polzer, Friedrich-Alexander University Erlangen-Nuremberg; Josef Adersberger, QAware GmbH. May 17, 2017. 1 / 26

  2. Contents: Motivation · Challenges · Prototype Architecture · Benchmarks · Conclusions · The Way Forward. 2 / 26


  4. Motivation
     ◮ Initial motivation: time series analysis in Chronix
     ◮ Accelerating operations with high arithmetic intensity is “easy”:
       ◮ copy from Spark to an accelerated native application
       ◮ compute…
       ◮ copy the results back
     3 / 26


  7. Motivation
     ◮ What if intermediate results need to be exchanged? (e.g. in outlier detection)
     ◮ More generally: accelerate operations with low arithmetic intensity
     ◮ Typically CPU ↔ GPU transfers are slow, while GPU RAM is fast
     ◮ Can we just keep the data on the GPU all the time?
     4 / 26
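The trade-off on this slide can be made concrete with a back-of-envelope break-even check. The sketch below uses illustrative numbers (the PCIe bandwidth and timings are assumptions, not measurements from the talk): offloading a single operation only pays off when the CPU time saved exceeds the cost of copying the data to the device and back.

```java
// Break-even sketch for GPU offload (all numbers are illustrative):
// an offload wins only if (cpu time - gpu time) > two PCIe copies.
public class OffloadBreakEven {
    static boolean offloadPaysOff(long bytes, double pcieGBps,
                                  double cpuSec, double gpuSec) {
        double transferSec = 2.0 * bytes / (pcieGBps * 1e9); // copy out and back
        return cpuSec - gpuSec > transferSec;
    }

    public static void main(String[] args) {
        long bytes = 1L << 30; // 1 GiB of time series data
        // high arithmetic intensity: lots of compute per byte -> offload wins
        System.out.println(offloadPaysOff(bytes, 12.0, 10.0, 0.5));  // true
        // low arithmetic intensity: savings smaller than the copies -> offload loses
        System.out.println(offloadPaysOff(bytes, 12.0, 0.15, 0.01)); // false
    }
}
```

Keeping the data resident on the GPU removes the `2.0 * bytes` term from every operation after the first, which is exactly why the question on this slide matters for low-intensity workloads.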


  10. Challenges: GPU ↔ Java
     ◮ Project Sumatra aimed for deep integration into HotSpot. It didn’t happen (the project is “currently inactive”).
     ◮ OpenCL and CUDA are native APIs; interfacing via JNI is possible but tedious
     ◮ A standard way of GPU acceleration for Java has yet to emerge
     ◮ Many publications, but few publish code
     5 / 26
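To make the "tedious" point concrete, here is a minimal sketch of what calling one native function through JNI involves: a Java-side `native` declaration, a hand-written C stub with a mangled name, and a separately compiled shared library that must be on the library path. The library name `vecadd` is invented for illustration and is not expected to exist.

```java
// JNI boilerplate for a single native vector-add entry point.
public class VecAdd {
    // The C side must export a function with exactly this mangled signature:
    //   JNIEXPORT void JNICALL Java_VecAdd_add
    //     (JNIEnv *, jclass, jfloatArray, jfloatArray, jfloatArray);
    public static native void add(float[] a, float[] b, float[] out);

    // Loading can fail at runtime; there is no compile-time check that the
    // native library matches the Java declarations.
    public static boolean nativeAvailable() {
        try {
            System.loadLibrary("vecadd"); // looks up libvecadd on java.library.path
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("native vecadd available: " + nativeAvailable());
    }
}
```

Multiply this per native function, add array pinning and error handling, and the appeal of a higher-level wrapper becomes obvious.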


  13. Challenges: Transpilers
     There are two serious transpilers publicly available:
     ◮ Rootbeer (Java → CUDA)
     ◮ Aparapi (Java → OpenCL)
     Both could use some love…
     6 / 26


  17. Challenges: jocl/jcuda
     Near 1:1 wrappers around OpenCL/CUDA
     ◮ Very flexible in usage
     ◮ Direct OpenCL usage makes runtime code generation easy
     ◮ Buffer management with exceptions but without proper destructors is awkward
     Currently the only reasonable choices.
     7 / 26
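The awkwardness of buffer management without destructors can be sketched in a few lines: jocl-style device buffers must be released by hand, so every code path, including exception paths, has to funnel through a `try/finally`. `FakeBuffer` below is an invented stand-in for a real `cl_mem` handle.

```java
// Manual release discipline, as forced by wrappers without RAII/destructors.
public class BufferDiscipline {
    static class FakeBuffer {
        boolean released = false;
        void release() { released = true; } // stands in for clReleaseMemObject
    }

    // Returns the buffer so callers can verify it was released.
    static FakeBuffer computeWithBuffer(boolean fail) {
        FakeBuffer buf = new FakeBuffer();
        try {
            if (fail) throw new RuntimeException("kernel launch failed");
            // ... enqueue kernel, read back results ...
        } catch (RuntimeException e) {
            // swallowed for the demo; real code would propagate
        } finally {
            buf.release(); // the only place the buffer is guaranteed to be freed
        }
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(computeWithBuffer(true).released);  // true even on failure
        System.out.println(computeWithBuffer(false).released); // true on success
    }
}
```

In C++ the `finally` block would simply be a destructor; in Java every buffer-owning scope has to repeat this pattern, which is what makes it error-prone at scale.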

  18. Challenges: CUDA vs. OpenCL
     CUDA
     ◮ has a mature ecosystem
     ◮ needs separate compilation
     ◮ works only on Nvidia GPUs
     OpenCL
     ◮ “works” on lots of devices (CPUs, GPUs, FPGAs, etc.)
     ◮ supports JIT compilation of kernels (from C)
     ◮ most implementations are fragile/quirky
     8 / 26


  21. Challenges: GPU ↔ Spark
     ◮ Project Tungsten (theoretically)
     ◮ IBM GPUEnabler (a Tungsten prototype?)
       ◮ looks promising
       ◮ but is mostly undocumented
       ◮ uses internal Spark APIs
       ◮ had randomly failing tests
       ◮ its example code runs faster on the CPU
     9 / 26


  23. Prototype Architecture: CLRDD

       CLRDD[T](val wrapped: RDD[CLPartition[T]]) extends RDD[T]

     ◮ One CLPartition yields one context and an iterator of binary chunks
     ◮ The context provides asynchronous methods on chunks
     ◮ Provides GPU functions on the RDD
     ◮ The user can choose caching on the GPU at runtime
     ◮ If data is not cached on the GPU, it is streamed as needed
     10 / 26
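The stream-or-cache behavior described on this slide can be sketched independently of OpenCL. In the toy model below a partition is a re-readable source of binary chunks; enabling caching keeps the chunks resident (standing in for GPU memory), otherwise each pass re-materializes them. All names (`ChunkPartition`, `sumPass`) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Toy model of a CLPartition: an iterator of binary chunks, with opt-in caching.
public class ChunkPartition {
    private final Supplier<Iterator<byte[]>> source; // re-readable chunk source
    private List<byte[]> cached = null;              // "device-resident" copy

    ChunkPartition(Supplier<Iterator<byte[]>> source) { this.source = source; }

    // User opts into caching at runtime, as on the slide.
    void cache() {
        cached = new ArrayList<>();
        source.get().forEachRemaining(cached::add);
    }

    // One "GPU pass": here it just sums every byte of every chunk.
    long sumPass() {
        Iterator<byte[]> it = (cached != null) ? cached.iterator() : source.get();
        long sum = 0;
        while (it.hasNext()) for (byte b : it.next()) sum += b;
        return sum;
    }

    public static void main(String[] args) {
        Supplier<Iterator<byte[]>> src =
            () -> Arrays.asList(new byte[]{1, 2}, new byte[]{3}).iterator();
        ChunkPartition p = new ChunkPartition(src);
        long streamed = p.sumPass(); // chunks streamed for this pass only
        p.cache();
        long resident = p.sumPass(); // chunks reused from the cache
        System.out.println(streamed + " " + resident); // 6 6
    }
}
```

The point of the design is that both passes compute the same result; caching only changes where the chunks live between operations.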

  24. Prototype Architecture: Storage
     ◮ All useful operations on CLRDD[T] require a typeclass instance CLType[T]
     ◮ A minimal definition includes the OpenCL type and a mapping to/from ByteBuffer storage
     ◮ Optionally: OpenCL arithmetics
     ◮ Macro-generated instances for all primitive vector/tuple types
     11 / 26
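The minimal CLType contract, a device type name plus a ByteBuffer mapping, can be sketched as a small Java interface. `CLTypeLike` and `FloatCLType` are invented stand-ins for the macro-generated Scala instances mentioned on the slide.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal typeclass-style contract: OpenCL type name + raw-byte (de)serialization.
interface CLTypeLike<T> {
    String clName();                   // type name used in kernel source, e.g. "float"
    int sizeOf();                      // bytes per element in device storage
    void put(ByteBuffer buf, T value); // Java value -> device-format bytes
    T get(ByteBuffer buf);             // device-format bytes -> Java value
}

public class FloatCLType implements CLTypeLike<Float> {
    public String clName() { return "float"; }
    public int sizeOf() { return 4; }
    public void put(ByteBuffer buf, Float v) { buf.putFloat(v); }
    public Float get(ByteBuffer buf) { return buf.getFloat(); }

    public static void main(String[] args) {
        FloatCLType ct = new FloatCLType();
        ByteBuffer buf = ByteBuffer.allocate(ct.sizeOf())
                                   .order(ByteOrder.LITTLE_ENDIAN); // match device layout
        ct.put(buf, 3.5f);
        buf.flip(); // rewind for reading
        System.out.println(ct.clName() + " " + ct.get(buf)); // float 3.5
    }
}
```

With this contract in place, generic code can move any `T` through byte buffers and splice the right type name into generated kernels without knowing `T` concretely, which is what the Scala prototype's macros automate.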

  25. Prototype Architecture: Operations
     Operations are represented as composable case classes that can generate a kernel source:

       case class MapReduceKernel[A, B](
         f: MapKernel[A, B],
         reduceBody: String,
         identity: String,
         cpu: Boolean
       )(implicit val clA: CLType[A], implicit val clB: CLType[B])
         extends CLProgramSource {
         def generateSource(supply: Iterator[String]): Array[String] = ...
       }
     12 / 26
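The idea of composing kernel source from operation objects can be shown without OpenCL: map and reduce bodies are plain C snippets that get spliced into a source template at runtime. The template below is invented to show the shape of the approach; it is not the prototype's actual generated code.

```java
// Runtime kernel-source generation from composable map/reduce pieces.
public class MapReduceSource {
    final String mapBody, reduceBody, identity, inType, outType;

    MapReduceSource(String mapBody, String reduceBody, String identity,
                    String inType, String outType) {
        this.mapBody = mapBody; this.reduceBody = reduceBody;
        this.identity = identity; this.inType = inType; this.outType = outType;
    }

    // Splices user-supplied C snippets into an OpenCL-C-style template.
    String generateSource() {
        return outType + " map_op(" + inType + " x) { " + mapBody + " }\n"
             + outType + " reduce_op(" + outType + " x, " + outType + " y) { "
             + reduceBody + " }\n"
             + "__kernel void run(__global " + inType + "* in, __global "
             + outType + "* out, int n) {\n"
             + "  " + outType + " acc = " + identity + ";\n"
             + "  for (int i = 0; i < n; ++i) acc = reduce_op(acc, map_op(in[i]));\n"
             + "  out[0] = acc;\n"
             + "}\n";
    }

    public static void main(String[] args) {
        MapReduceSource sum = new MapReduceSource(
            "return x;", "return x+y;", "0", "float", "float");
        System.out.println(sum.generateSource());
    }
}
```

Because OpenCL compiles kernels from C source at runtime, a string generated this way can be handed straight to the driver, which is what makes the composable-case-class design practical.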


  27. Prototype Architecture: Functions on the GPU
     High-level functions that are implemented:
     ◮ One-to-one map functions (in-place/copying):

       crdd.map[Byte]("return x%2;")

     ◮ Simple reduction:

       def sum(implicit num: Numeric[T]): T = {
         val clT = implicitly[CLType[T]]
         reduce(MapReduceKernel(
           MapKernel.identity[T], // first map
           "return x+y;",         // then reduce
           clT.zeroName,          // string zero
           useCPU                 // algorithm selection
         )(clT, clT),             // explicit typeclasses
         num.zero, ((x: T, y: T) => num.plus(x, y)))
       }
     13 / 26
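As a plain-Java reference for what the two snippets on this slide compute: the map applies `x % 2` elementwise, and `sum` is an identity map followed by a `+` reduction starting from the type's zero. A CPU oracle like this is handy when checking GPU results.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// CPU reference semantics for the GPU map and sum shown above.
public class ReferenceSemantics {
    static int[] mapMod2(int[] xs) {       // mirrors crdd.map[Byte]("return x%2;")
        return IntStream.of(xs).map(x -> x % 2).toArray();
    }

    static int sum(int[] xs) {             // mirrors identity map + "return x+y;" reduce
        return IntStream.of(xs).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        int[] xs = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(mapMod2(xs))); // [1, 0, 1, 0, 1]
        System.out.println(sum(xs));                      // 15
    }
}
```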

  28. Prototype Architecture: Functions on the GPU
     ◮ Many-to-one sliding window map:

       def movingAverage(width: Int)(implicit clT: CLType[T])
           //polymorphic return type, e.g. CLRDD[(Double,Double)]
           : CLRDD[clT.doubleCLInstance.elemType] = {
         val clRes = clT.doubleCLInstance
         sliding[clT.doubleCLInstance.elemType](
           s"""${clRes.clName} res = ${clRes.zeroName};
              for(int i=0; i<$width; ++i)
                res += convert_${clRes.clName}(GET(i));
              return res/$width;""",
           width, 1, // width, stride
           (clT.doubleCLInstance.selfInstance,
            clT.doubleCLInstance.elemClassTag) //just scala things...
         )
       }
     14 / 26
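A CPU reference for the sliding-window kernel above: with stride 1, each output element is the mean of `width` consecutive inputs, so an input of length n yields n − width + 1 outputs. The inner loop mirrors the generated OpenCL body (`res += GET(i)`, then `res / width`).

```java
import java.util.Arrays;

// CPU oracle for the many-to-one sliding-window moving average.
public class MovingAverage {
    static double[] movingAverage(double[] xs, int width) {
        double[] out = new double[xs.length - width + 1];
        for (int i = 0; i < out.length; ++i) {
            double res = 0;                 // res = zero
            for (int j = 0; j < width; ++j) // res += GET(j)
                res += xs[i + j];
            out[i] = res / width;           // return res / width
        }
        return out;
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        System.out.println(Arrays.toString(movingAverage(xs, 2))); // [1.5, 2.5, 3.5]
    }
}
```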
