
Tornado VM: Running Java on GPUs and FPGAs Juan Fumero, PhD - PowerPoint PPT Presentation

Tornado VM: Running Java on GPUs and FPGAs. Juan Fumero, PhD, http://jjfumero.github.io. QCon-London 2020, 3rd March 2020.


  1. Tornado VM: Running Java on GPUs and FPGAs Juan Fumero, PhD http://jjfumero.github.io QCon-London 2020, 3rd March 2020

  2. Agenda
     1. Motivation & Background
     2. TornadoVM
        • API - examples
        • Runtime & Just In Time Compiler
        • Live Task Migration
        • Demos
     3. Performance Results
     4. Related Work & Future Directions
     5. Conclusions

  3. Who am I?
     Dr. Juan Fumero
     Lead Developer of TornadoVM
     Postdoc @ University of Manchester
     juan.fumero@manchester.ac.uk
     @snatverk

  4. Motivation

  5. Why should we care about GPUs/FPGAs, etc.?
     • CPU: Intel Ice Lake (10nm). 8 cores HT, AVX (512 SIMD). ~1 TFlops* (including the iGPU). TDP ~28 W. Source: Intel docs
     • GPU: NVIDIA GP100 (Pascal, 16nm). 60 SMs, 64 cores each. 3584 FP32 cores. 10.6 TFlops (FP32). TDP ~300 W. Source: NVIDIA docs
     • FPGA: Intel FPGA Stratix 10 (14nm). Reconfigurable hardware. ~10 TFlops. TDP ~225 W. Source: Intel docs

  6. What is a GPU? Graphics Processing Unit
     Contains a set of Stream Multiprocessor cores (SMx)
     * Pascal arch.: 60 SMx
     * ~3500 CUDA cores
     Users need to know:
     A) Programming model (normally CUDA or OpenCL)
     B) Details about the architecture are essential to achieve performance
     * Non-sequential consistency, manual barriers, etc.
     Source: NVIDIA docs

  7. What is an FPGA? Field Programmable Gate Array
     You can configure the design of your hardware after manufacturing.
     It is like having "your algorithms directly wired on hardware" with only the parts you need.

  8. Current Computer Systems & Prog. Lang.

  9. Ideal System for Managed Languages

  10. TornadoVM

  11. Demo: Kinect Fusion with TornadoVM
      * Computer Vision Application
      * ~7K LOC
      * Thousands of OpenCL LOC generated.
      https://github.com/beehive-lab/kfusion-tornadovm

  12. TornadoVM Overview
      • API: Tasks = Methods; Annotations; Task-Schedulers = Group of Methods

  13. TornadoVM Overview
      • API: Tasks = Methods; Annotations; Task-Schedulers = Group of Methods
      • Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation

  14. TornadoVM Overview
      • API: Tasks = Methods; Annotations; Task-Schedulers = Group of Methods
      • Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      • Execution Engine: Bytecode interpreter; Device Drivers

  15. TornadoVM Overview
      • API: Tasks = Methods; Annotations; Task-Schedulers = Group of Methods
      • Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      • Compiler / Execution Engine: Bytecode interpreter; Just-In-Time Compiler; Graal JIT Extensions; Device Drivers; Device's heap

  16. TornadoVM Overview
      • API: Tasks = Methods; Annotations; Task-Schedulers = Group of Methods
      • Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      • Compiler / Execution Engine: Bytecode interpreter; Just-In-Time Compiler; Graal JIT Extensions; Device Drivers; Device's heap

      • OpenJDK 8 > 141
      • OpenJDK 11
      • GraalVM 19.3.0
      • OpenCL >= 1.2
      • Support for:
        • NVIDIA GPUs
        • Intel HD Graphics
        • AMD GPUs
        • Intel Altera FPGAs
        • Xilinx FPGAs
        • Multi-core CPUs

  17. Tornado API – example

          class Compute {
              public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                     Matrix2DFloat C, final int size) {
                  for (int i = 0; i < size; i++) {
                      for (int j = 0; j < size; j++) {
                          float sum = 0.0f;
                          for (int k = 0; k < size; k++) {
                              sum += A.get(i, k) * B.get(k, j);
                          }
                          C.set(i, j, sum);
                      }
                  }
              }
          }

  18. Tornado API – example
      We add the parallel annotation as a hint for the compiler.

          class Compute {
              public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                     Matrix2DFloat C, final int size) {
                  for (@Parallel int i = 0; i < size; i++) {
                      for (@Parallel int j = 0; j < size; j++) {
                          float sum = 0.0f;
                          for (int k = 0; k < size; k++) {
                              sum += A.get(i, k) * B.get(k, j);
                          }
                          C.set(i, j, sum);
                      }
                  }
              }
          }

  19. Tornado API – example

          class Compute {
              public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                     Matrix2DFloat C, final int size) {
                  for (@Parallel int i = 0; i < size; i++) {
                      for (@Parallel int j = 0; j < size; j++) {
                          float sum = 0.0f;
                          for (int k = 0; k < size; k++) {
                              sum += A.get(i, k) * B.get(k, j);
                          }
                          C.set(i, j, sum);
                      }
                  }
              }
          }

          TaskSchedule ts = new TaskSchedule("s0");
          ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
            .streamOut(matrixC)
            .execute();

  20. Tornado API – example

          class Compute {
              public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                     Matrix2DFloat C, final int size) {
                  for (@Parallel int i = 0; i < size; i++) {
                      for (@Parallel int j = 0; j < size; j++) {
                          float sum = 0.0f;
                          for (int k = 0; k < size; k++) {
                              sum += A.get(i, k) * B.get(k, j);
                          }
                          C.set(i, j, sum);
                      }
                  }
              }
          }

          TaskSchedule ts = new TaskSchedule("s0");
          ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
            .streamOut(matrixC)
            .execute();

      To run:

          $ tornado Compute

      The tornado command is just an alias for java plus all the parameters needed to enable TornadoVM.
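To see what the mxm method on the slides computes without a TornadoVM installation, here is a minimal plain-Java sketch. MatrixSketch is a hypothetical stand-in for Matrix2DFloat (it is not the TornadoVM class), and mxmParallelRows uses java.util.stream only to illustrate why the outer loop can be marked parallel: each (i, j) iteration writes a different element of C. It does not show how TornadoVM itself executes the loop.

```java
import java.util.stream.IntStream;

// Hypothetical stand-in for TornadoVM's Matrix2DFloat: a square, row-major float matrix.
final class MatrixSketch {
    private final int size;
    private final float[] data;

    MatrixSketch(int size) {
        this.size = size;
        this.data = new float[size * size];
    }

    float get(int i, int j) { return data[i * size + j]; }

    void set(int i, int j, float value) { data[i * size + j] = value; }
}

final class MxmSketch {
    // The loop nest from the slides, run sequentially on the CPU.
    static void mxm(MatrixSketch a, MatrixSketch b, MatrixSketch c, int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a.get(i, k) * b.get(k, j);
                }
                c.set(i, j, sum);
            }
        }
    }

    // Same computation with the rows spread over CPU threads. This is safe because
    // every (i, j) pair writes its own element of c and only reads a and b.
    static void mxmParallelRows(MatrixSketch a, MatrixSketch b, MatrixSketch c, int size) {
        IntStream.range(0, size).parallel().forEach(i -> {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a.get(i, k) * b.get(k, j);
                }
                c.set(i, j, sum);
            }
        });
    }
}
```

The inner k loop stays sequential in both versions because it accumulates into a single sum, which matches the slides leaving that loop without the annotation.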

  21. Demo: Running Matrix Multiplication https://github.com/jjfumero/qconlondon2020-tornadovm

  22. TornadoVM Compiler & Runtime Overview

  23. TornadoVM & Dynamic Languages

  24. TornadoVM & Dynamic Languages

  25. Demo 2: Node.js example https://github.com/jjfumero/qconlondon2020-tornadovm

  26. TornadoVM Compiler & Runtime Overview

  27. TornadoVM Compiler & Runtime Overview

  28. TornadoVM JIT Compiler Specializations

  29. FPGA Specializations

          void compute(float[] input, float[] output) {
              for (@Parallel int i = 0; …) {
                  for (int j = 0; ...) {
                      // Computation
                  }
              }
          }

      From slowdowns without specializations to 240x speedups with automatic specializations on Intel FPGAs

  30. TornadoVM: VM in a VM

  31. TornadoVM: VM in a VM

  32. TornadoVM Bytecodes - Example

  33. TornadoVM Bytecodes - Example

  34. TornadoVM Bytecodes - Example

  35. TornadoVM Bytecodes - Example

  36. TornadoVM Bytecodes - Example

  37. TornadoVM Bytecodes - Example

  38. TornadoVM Bytecodes - Example

  39. TornadoVM Bytecodes - Example

  40. Batch Processing: 16GB into 1GB GPU

  41. Batch Processing: 16GB into 1GB GPU

  42. Batch Processing: 16GB into 1GB GPU
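The batch-processing slides carry only a title, so the following is a generic sketch of the idea rather than TornadoVM's mechanism: a data set larger than device memory is split into chunks that each fit, and every chunk is copied in, processed, and copied out in turn. All names here (BatchSketch, planBatches, scaleInBatches) are hypothetical, and a plain Java array stands in for the device buffer.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSketch {
    // Splits totalElements into consecutive {offset, length} chunks of at most batchElements.
    static List<long[]> planBatches(long totalElements, long batchElements) {
        if (totalElements < 0 || batchElements <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and batch size positive");
        }
        List<long[]> batches = new ArrayList<>();
        for (long offset = 0; offset < totalElements; offset += batchElements) {
            long length = Math.min(batchElements, totalElements - offset);
            batches.add(new long[] {offset, length});
        }
        return batches;
    }

    // Multiplies every element by factor, one batch at a time, reusing a single
    // fixed-size buffer that plays the role of the limited device memory.
    static void scaleInBatches(float[] data, int batchElements, float factor) {
        float[] deviceBuffer = new float[batchElements];
        for (long[] batch : planBatches(data.length, batchElements)) {
            int offset = (int) batch[0];
            int length = (int) batch[1];
            System.arraycopy(data, offset, deviceBuffer, 0, length); // copy in
            for (int i = 0; i < length; i++) {
                deviceBuffer[i] *= factor;                           // compute
            }
            System.arraycopy(deviceBuffer, 0, data, offset, length); // copy out
        }
    }
}
```

With the sizes from the slide title, planBatches(16L << 30, 1L << 30) yields 16 chunks when both arguments are counted in bytes.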

  43. Live Task Migration

  44. Dynamic Reconfiguration

  45. Dynamic Reconfiguration

  46. Dynamic Reconfiguration

  47. How is the decision made?
      • End-to-end: including JIT compilation time
      • Peak Performance: without JIT and after warming-up
      • Latency: does not wait for all threads to finish
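The difference between the first two policies can be sketched as a choice of cost function over per-device measurements. This is an illustrative model only, not TornadoVM's implementation: PolicySketch, DeviceTiming, and the enum constants are hypothetical names, and the latency policy is left out because it depends on racing live threads rather than on comparing recorded timings.

```java
import java.util.Comparator;
import java.util.List;

final class PolicySketch {
    // Hypothetical names for two of the policies listed on the slide.
    enum Policy { END_TO_END, PEAK_PERFORMANCE }

    // Hypothetical record of what was measured for one device.
    static final class DeviceTiming {
        final String device;
        final long jitMillis;    // time spent in JIT compilation
        final long kernelMillis; // execution time after warm-up

        DeviceTiming(String device, long jitMillis, long kernelMillis) {
            this.device = device;
            this.jitMillis = jitMillis;
            this.kernelMillis = kernelMillis;
        }
    }

    // End-to-end counts compilation; peak performance looks at the warmed-up kernel only.
    static long cost(DeviceTiming t, Policy policy) {
        return policy == Policy.END_TO_END ? t.jitMillis + t.kernelMillis : t.kernelMillis;
    }

    // Returns the name of the device with the lowest cost under the given policy.
    static String select(List<DeviceTiming> timings, Policy policy) {
        return timings.stream()
                .min(Comparator.comparingLong((DeviceTiming t) -> cost(t, policy)))
                .orElseThrow(() -> new IllegalArgumentException("no devices measured"))
                .device;
    }
}
```

Under this model, a device with a long compilation time but a fast kernel can lose under END_TO_END and still win under PEAK_PERFORMANCE.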

  48. Demo Live Task Migration – Server/Client App https://github.com/jjfumero/qconlondon2020-tornadovm

  49. New compilation tier for Heterogeneous Systems

  50. New compilation tier for Heterogeneous Systems

  51. Related Work

  52. Related Work (in the Java context)

      | Project    | Production-Ready | Supported Devices       | Live Task Migration | Compiler Specializations | Dynamic Languages |
      |------------|------------------|-------------------------|---------------------|--------------------------|-------------------|
      | Sumatra    | No               | AMD GPUs                | No                  | No                       | No                |
      | Marawacc   | No               | Multi-core, GPUs        | No                  | No                       | No                |
      | JaBEE      | No               | NVIDIA GPUs             | No                  | No                       | No                |
      | RootBeer   | No               | NVIDIA GPUs             | No                  | No                       | No                |
      | Aparapi    | Yes              | GPUs, multi-core        | No                  | No                       | No                |
      | IBM GPU J9 | Yes              | NVIDIA GPUs             | No                  | No                       | No                |
      | grCUDA     | No (*)           | NVIDIA GPUs             | No                  | No                       | Yes               |
      | TornadoVM  | Not yet (*)      | Multi-core, GPUs, FPGAs | Yes                 | Yes                      | Yes               |

  53. Related Work (in the Java context)

      | Project    | Production-Ready | Supported Devices       | Live Task Migration | Compiler Specializations | Dynamic Languages |
      |------------|------------------|-------------------------|---------------------|--------------------------|-------------------|
      | Sumatra    | No               | AMD GPUs                | No                  | No                       | No                |
      | Marawacc   | No               | Multi-core, GPUs        | No                  | Yes                      | Yes               |
      | JaBEE      | No               | NVIDIA GPUs             | No                  | No                       | No                |
      | RootBeer   | No               | NVIDIA GPUs             | No                  | No                       | No                |
      | Aparapi    | Yes              | GPUs, multi-core        | No                  | No                       | No                |
      | IBM GPU J9 | Yes              | NVIDIA GPUs             | No                  | No                       | No                |
      | grCUDA     | No (*)           | NVIDIA GPUs             | No                  | No                       | Yes               |
      | TornadoVM  | Not yet (*)      | Multi-core, GPUs, FPGAs | Yes                 | Yes                      | Yes               |

  54. Ok, cool! What about performance?

  55. Performance
      * TornadoVM achieves up to 7.7x speedup over the best device chosen statically.
      * Speedups of more than 4500x over sequential Java in the best case.
      Devices:
      - NVIDIA GTX 1060
      - Intel FPGA Nallatech 385a
      - Intel Core i7-7700K

  56. Performance on GPUs, iGPUs, and CPUs

  57. More details in our papers! https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md

  58. Limitations & Future Work

  59. Limitations
      We inherit limitations from the underlying Programming Model:
      • No object support (except for a few cases)
      • No recursion
      • No dynamic memory allocation (*)
      • No support for exceptions (*)
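As an illustration of how these restrictions shape code, the sketch below contrasts a method that uses recursion and exceptions with a loop-based rewrite that works only on caller-allocated primitive arrays. It is plain Java with hypothetical names (KernelStyleSketch, powRecursive, powKernelStyle), and it makes no claim about exactly which constructs a given TornadoVM version accepts.

```java
final class KernelStyleSketch {
    // Ordinary Java style: recursion plus an exception for invalid input.
    // Both features appear in the list of limitations above.
    static float powRecursive(float base, int exp) {
        if (exp < 0) {
            throw new IllegalArgumentException("negative exponent");
        }
        return exp == 0 ? 1.0f : base * powRecursive(base, exp - 1);
    }

    // Kernel-friendly style: no recursion, no allocation, no exceptions, no objects
    // beyond primitive arrays. The caller allocates both arrays up front.
    static void powKernelStyle(float[] in, float[] out, int exp) {
        for (int i = 0; i < in.length; i++) {
            float acc = 1.0f;
            for (int k = 0; k < exp; k++) {
                acc *= in[i];
            }
            out[i] = acc;
        }
    }
}
```

Input validation that would normally throw has to move to the host side before the kernel-style method is called.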

  60. Future Work
      • GPU/FPGA full capabilities
      • Exploitation of Tier-memories such as local memory (in progress)
      • Policies for energy efficiency
      • Multi-device within a task-schedule
      • More parallel skeletons (reductions, stencil, scan, filter, …)
      • PTX Backend for NVIDIA

  61. Current Applicability of TornadoVM

  62. EU H2020 E2Data Project https://e2data.eu/
      "End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware"
      European Union's Horizon H2020 research and innovation programme under grant agreement No 780245
