TornadoVM: Running Java on GPUs and FPGAs. Juan Fumero, PhD. http://jjfumero.github.io QCon London 2020, 3rd March 2020
Agenda
1. Motivation & Background
2. TornadoVM: API examples; Runtime & Just-In-Time Compiler; Live Task Migration; Demos
3. Performance Results
4. Related Work & Future Directions
5. Conclusions
2
Who am I? Dr. Juan Fumero Lead Developer of TornadoVM Postdoc @ University of Manchester juan.fumero@manchester.ac.uk @snatverk 3
Motivation 4
Why should we care about GPUs/FPGAs, etc.?
- CPU: Intel Ice Lake (10 nm), 8 cores with HT, AVX-512 SIMD, ~1 TFlop/s (including the iGPU), TDP ~28 W (source: Intel docs)
- GPU: NVIDIA GP100, Pascal (16 nm), 60 SMs with 64 cores each, 3584 FP32 cores, 10.6 TFlop/s (FP32), TDP ~300 W (source: NVIDIA docs)
- FPGA: Intel FPGA Stratix 10 (14 nm), reconfigurable hardware, ~10 TFlop/s, TDP ~225 W (source: Intel docs)
5
What is a GPU? A Graphics Processing Unit contains a set of Streaming Multiprocessor cores (SMx); the Pascal architecture, for example, has 60 SMx and ~3500 CUDA cores. Users need to know: A) a programming model (normally CUDA or OpenCL); B) details of the architecture, which are essential to achieve performance (non-sequential memory consistency, manual barriers, etc.). Source: NVIDIA docs. 6
What is an FPGA? Field Programmable Gate Array. You can configure the design of your hardware after manufacturing. It is like having "your algorithms directly wired in hardware", with only the parts you need. 7
Current Computer Systems & Prog. Lang. 8
Ideal System for Managed Languages 9
TornadoVM 10
Demo: Kinect Fusion with TornadoVM * Computer Vision Application * ~7K LOC * Thousands of OpenCL LOC generated. https://github.com/beehive-lab/kfusion-tornadovm 11
TornadoVM Overview
- API: Tasks = methods; Task-Schedules = groups of methods; annotations
- Runtime: data-flow analysis & optimizer; TornadoVM bytecode generation
- Execution Engine: bytecode interpreter, device drivers, device's heap
- Just-In-Time Compiler: Graal JIT compiler extensions
15
TornadoVM Overview: requirements & supported hardware
- JDK: OpenJDK 8 (u141 or later), OpenJDK 11, or GraalVM 19.3.0
- OpenCL >= 1.2
- Support for: NVIDIA GPUs, Intel HD Graphics, AMD GPUs, Intel (Altera) FPGAs, Xilinx FPGAs, multi-core CPUs
16
Tornado API – example

    class Compute {
        public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                               Matrix2DFloat C, final int size) {
            for (int i = 0; i < size; i++) {
                for (int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++) {
                        sum += A.get(i, k) * B.get(k, j);
                    }
                    C.set(i, j, sum);
                }
            }
        }
    }
17
Tornado API – example. We add the @Parallel annotation as a hint for the compiler:

    class Compute {
        public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                               Matrix2DFloat C, final int size) {
            for (@Parallel int i = 0; i < size; i++) {
                for (@Parallel int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++) {
                        sum += A.get(i, k) * B.get(k, j);
                    }
                    C.set(i, j, sum);
                }
            }
        }
    }
18
Tornado API – example. Using the annotated mxm kernel, we build and execute a TaskSchedule:

    TaskSchedule ts = new TaskSchedule("s0");
    ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
      .streamOut(matrixC)
      .execute();
19
Tornado API – example. To run:

    $ tornado Compute

The tornado command is just an alias for java plus all the parameters needed to enable TornadoVM. 20
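As a correctness baseline for the mxm kernel above, the same computation can be written as self-contained sequential Java on flat, row-major float[] arrays. This is not TornadoVM code (Matrix2DFloat is a TornadoVM API type and is not used here); it is only a plain-Java sketch against which the accelerated output can be checked:

```java
// Sequential baseline for the mxm kernel, using flat row-major
// float[] arrays instead of TornadoVM's Matrix2DFloat type.
class ComputeBaseline {
    static void mxm(float[] a, float[] b, float[] c, final int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    // A[i][k] * B[k][j] in row-major layout
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum; // C[i][j]
            }
        }
    }
}
```

Comparing the device result element-wise against this baseline (within a small epsilon, since floating-point reduction order may differ) is a simple way to validate a TornadoVM run.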
Demo: Running Matrix Multiplication https://github.com/jjfumero/qconlondon2020-tornadovm 21
TornadoVM Compiler & Runtime Overview 22
TornadoVM & Dynamic Languages 23
Demo 2: Node.js example https://github.com/jjfumero/qconlondon2020-tornadovm 25
TornadoVM Compiler & Runtime Overview 26
TornadoVM JIT Compiler Specializations 28
FPGA Specializations

    void compute(float[] input, float[] output) {
        for (@Parallel int i = 0; ...) {
            for (int j = 0; ...) {
                // Computation
            }
        }
    }

From slowdowns without specializations to 240x speedup with automatic specializations on Intel FPGAs. 29
TornadoVM: VM in a VM 30
TornadoVM Bytecodes - Example 32
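The bytecode listings on these slides were shown as images and are not reproduced in this text. As an illustrative sketch only (bytecode names such as COPY_IN, LAUNCH and STREAM_OUT follow the TornadoVM publications; the operand layout here is invented for illustration), the matrix-multiplication task schedule s0.t0 from earlier could lower to something like:

```
BEGIN                     // open execution context for task-schedule s0
COPY_IN    matrixA        // copy read-only input to the device heap
COPY_IN    matrixB
LAUNCH     s0.t0          // JIT-compile mxm on first use, then run the kernel
STREAM_OUT matrixC        // copy the result back to the host on every run
END                       // barrier: wait for completion, close context
```

The TornadoVM bytecode interpreter replays this orchestration per execution, which is what later enables batching and task migration without recompiling the user's Java code.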
Batch Processing: 16GB into 1GB GPU 40
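The splitting itself is done by the TornadoVM runtime (the batch size is requested on the TaskSchedule, e.g. a batch("512MB") call in the TornadoVM API), which replays the copy/launch bytecodes once per chunk. As a rough, self-contained model of the partitioning arithmetic only (class and method names here are illustrative, not TornadoVM API):

```java
// Illustrative chunking arithmetic for batch processing: split a buffer
// that does not fit on the device into fixed-size batches; the last
// batch may be smaller than the rest.
class BatchPlanner {
    // Number of batches needed to stream totalBytes through a device
    // that can hold batchBytes at a time (ceiling division).
    static long numBatches(long totalBytes, long batchBytes) {
        return (totalBytes + batchBytes - 1) / batchBytes;
    }

    // Size in bytes of the i-th batch (0-indexed).
    static long batchSize(long totalBytes, long batchBytes, long i) {
        long offset = i * batchBytes;
        return Math.min(batchBytes, totalBytes - offset);
    }
}
```

With a 16 GB input and 1 GB device batches, this yields 16 full batches; each batch is copied in, computed, and copied out before the next one starts.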
Live Task Migration 43
Dynamic Reconfiguration 44
How is the decision made? Three policies are available:
- End-to-end: total time, including JIT compilation time
- Peak performance: kernel time only, measured after warming up (JIT compilation excluded)
- Latency: returns as soon as the first device finishes; does not wait for all threads to finish
47
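A simplified, self-contained model of how the first two policies could rank devices from measured timings (the real decision logic lives inside the TornadoVM runtime; the class, method, and policy names below are illustrative, not TornadoVM API). Latency is not modeled separately here: in a static timing model it behaves like end-to-end, the difference being that the runtime returns as soon as the first device finishes rather than waiting for all of them:

```java
// Toy model of device selection for dynamic reconfiguration. Each
// device reports a JIT-compilation time and a kernel execution time;
// the policy decides which timing matters.
class DevicePicker {
    // Returns the index of the device the given policy would select.
    static int pick(long[] compileMs, long[] execMs, String policy) {
        int best = 0;
        for (int d = 1; d < execMs.length; d++) {
            long cur, bst;
            if (policy.equals("END_2_END")) {
                // End-to-end: JIT compilation counts against the device.
                cur = compileMs[d] + execMs[d];
                bst = compileMs[best] + execMs[best];
            } else {
                // Peak performance: kernel time only, after warm-up.
                cur = execMs[d];
                bst = execMs[best];
            }
            if (cur < bst) best = d;
        }
        return best;
    }
}
```

Note how the two policies can disagree: a GPU with a long compilation time but a fast kernel loses end-to-end yet wins on peak performance, which is exactly why warm-up matters for accelerators.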
Demo Live Task Migration – Server/Client App https://github.com/jjfumero/qconlondon2020-tornadovm 48
New compilation tier for Heterogeneous Systems 49
Related Work 51
Related Work (in the Java context)

    Project      Production-Ready   Supported Devices          Live Task Migration   Compiler Specializations   Dynamic Languages
    Sumatra      No                 AMD GPUs                   No                    No                         No
    Marawacc     No                 Multi-core, GPUs           No                    Yes                        Yes
    JaBEE        No                 NVIDIA GPUs                No                    No                         No
    RootBeer     No                 NVIDIA GPUs                No                    No                         No
    Aparapi      Yes                GPUs, multi-core           No                    No                         No
    IBM GPU J9   Yes                NVIDIA GPUs                No                    No                         No
    grCUDA       No (*)             NVIDIA GPUs                No                    No                         Yes
    TornadoVM    Not yet (*)        Multi-core, GPUs, FPGAs    Yes                   Yes                        Yes
53
Ok, cool! What about performance? 54
Performance
- TornadoVM achieves up to 7.7x speedup over the best single device chosen statically.
- Up to >4500x speedup over sequential Java.
Hardware: NVIDIA GTX 1060, Intel FPGA Nallatech 385a, Intel Core i7-7700K. 55
Performance on GPUs, iGPUs, and CPUs 56
More details in our papers! https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md 57
Limitations & Future Work 58
Limitations We inherit limitations from the underlying Programming Model: • No object support (except for a few cases) • No recursion • No dynamic memory allocation (*) • No support for exceptions (*) 59
Future Work
- Exploit full GPU/FPGA capabilities
- Exploit the device memory hierarchy, e.g. local memory (in progress)
- Policies for energy efficiency
- Multi-device execution within a task-schedule
- More parallel skeletons (reductions, stencil, scan, filter, ...)
- PTX backend for NVIDIA GPUs
60
Current Applicability of TornadoVM 61
EU H2020 E2Data Project https://e2data.eu/ "End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware." Funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780245. 62