Tornado VM: Running Java on GPUs and FPGAs
Juan Fumero, PhD
http://jjfumero.github.io
QCon-London 2020, 3rd March 2020

Agenda
1. Motivation & Background
2. TornadoVM
   • API - examples
   • Runtime & Just In Time Compiler
   • Live Task Migration
   • Demos
3. Performance Results
4. Related Work & Future Directions
5. Conclusions

Who am I?
Dr. Juan Fumero
Lead Developer of TornadoVM
Postdoc @ University of Manchester
juan.fumero@manchester.ac.uk
@snatverk

Motivation
Why should we care about GPUs/FPGAs, etc.?
• CPU: Intel Ice Lake (10nm); 8 cores HT, AVX (512 SIMD); ~1 TFlops* (including the iGPU); TDP ~28 W. Source: Intel docs
• GPU: NVIDIA GP 100, Pascal, 16nm; 60 SMs, 64 cores each; 3584 FP32 cores; 10.6 TFlops (FP32); TDP ~300 W. Source: NVIDIA docs
• FPGA: Intel FPGA Stratix 10 (14nm); reconfigurable hardware; ~10 TFlops; TDP ~225 W. Source: Intel docs

What is a GPU? Graphics Processing Unit
Contains a set of Stream Multiprocessor cores (SMx)
* Pascal arch.: 60 SMx
* ~3500 CUDA cores
Users need to know:
A) The programming model (normally CUDA or OpenCL)
B) Details about the architecture, which are essential to achieve performance
* Non-sequential consistency, manual barriers, etc.
Source: NVIDIA docs

What is an FPGA? Field Programmable Gate Array
You can configure the design of your hardware after manufacturing.
It is like having "your algorithms directly wired on hardware" with only the parts you need.
Current Computer Systems & Prog. Lang.
Ideal System for Managed Languages
TornadoVM

Demo: Kinect Fusion with TornadoVM
* Computer Vision Application
* ~7K LOC
* Thousands of OpenCL LOC generated.
https://github.com/beehive-lab/kfusion-tornadovm
TornadoVM Overview

Software stack, from top to bottom:
• API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations
• Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
• Execution Engine: Bytecode interpreter; Device Drivers; Device's heap
• Compiler: Just-In-Time Compiler; Graal JIT Extensions

Requirements and supported devices:
• OpenJDK 8 > 141
• OpenJDK 11
• GraalVM 19.3.0
• OpenCL >= 1.2
• Support for:
  • NVIDIA GPUs
  • Intel HD Graphics
  • AMD GPUs
  • Intel Altera FPGAs
  • Xilinx FPGAs
  • Multi-core CPUs
Tornado API – example

We start from a sequential matrix multiplication:

    class Compute {
        public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                               Matrix2DFloat C, final int size) {
            for (int i = 0; i < size; i++) {
                for (int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++) {
                        sum += A.get(i, k) * B.get(k, j);
                    }
                    C.set(i, j, sum);
                }
            }
        }
    }

We add the parallel annotation as a hint for the compiler:

    class Compute {
        public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                               Matrix2DFloat C, final int size) {
            for (@Parallel int i = 0; i < size; i++) {
                for (@Parallel int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++) {
                        sum += A.get(i, k) * B.get(k, j);
                    }
                    C.set(i, j, sum);
                }
            }
        }
    }

The method is then registered as a task in a task schedule and executed:

    TaskSchedule ts = new TaskSchedule("s0");
    ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
      .streamOut(matrixC)
      .execute();

To run:

    $ tornado Compute

The tornado command is just an alias to java plus all the parameters needed to enable TornadoVM.
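Why the two outer loops can carry the parallel hint may be easier to see in a plain-Java sketch. The code below is not TornadoVM code and does not reflect how TornadoVM compiles the kernel: it uses flattened float arrays instead of Matrix2DFloat and Java parallel streams as a stand-in, and the class and method names are hypothetical. It only illustrates that each (i, j) iteration writes a distinct output element, so the iterations are independent.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical stand-in, NOT the TornadoVM runtime: parallel streams are used
// only to show why the two outer loops are safe to run in parallel.
final class ParallelMxmSketch {

    // Sequential reference on flattened row-major arrays.
    static void mxmSequential(float[] a, float[] b, float[] c, int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
    }

    // Each (i, j) pair writes a distinct element of c and only reads a and b,
    // so the iterations can run in any order or at the same time.
    static void mxmParallel(float[] a, float[] b, float[] c, int size) {
        IntStream.range(0, size * size).parallel().forEach(idx -> {
            int i = idx / size;
            int j = idx % size;
            float sum = 0.0f;
            // The reduction over k stays sequential, as in the slide.
            for (int k = 0; k < size; k++) {
                sum += a[i * size + k] * b[k * size + j];
            }
            c[idx] = sum;
        });
    }

    public static void main(String[] args) {
        int size = 4;
        float[] a = new float[size * size];
        float[] b = new float[size * size];
        for (int idx = 0; idx < size * size; idx++) {
            a[idx] = idx;
            b[idx] = size * size - idx;
        }
        float[] seq = new float[size * size];
        float[] par = new float[size * size];
        mxmSequential(a, b, seq, size);
        mxmParallel(a, b, par, size);
        System.out.println("results match: " + Arrays.equals(seq, par));
    }
}
```

The inner k loop is left without a parallel hint in the slide as well: it accumulates into a single sum, so its iterations are not independent.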
Demo: Running Matrix Multiplication
https://github.com/jjfumero/qconlondon2020-tornadovm

TornadoVM Compiler & Runtime Overview
TornadoVM & Dynamic Languages

Demo 2: Node.js example
https://github.com/jjfumero/qconlondon2020-tornadovm

TornadoVM Compiler & Runtime Overview
TornadoVM JIT Compiler Specializations
FPGA Specializations

    void compute(float[] input, float[] output) {
        for (@Parallel int i = 0; …) {
            for (int j = 0; ...) {
                // Computation
            }
        }
    }

From slowdowns without specializations to 240x with automatic specializations on Intel FPGAs
TornadoVM: VM in a VM
TornadoVM Bytecodes - Example
Batch Processing: 16GB into 1GB GPU
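The slide text gives only the title, so the following is one plausible reading rather than a description of TornadoVM's implementation: the input is split into fixed-size batches that each fit in device memory and are processed one after another. The class and method names are hypothetical, and the sketch only plans the offsets and lengths; it does no device transfers.

```java
// Hypothetical sketch of batch planning; not TornadoVM code.
final class BatchSplitSketch {

    // Returns one {offset, length} pair (in bytes) per batch, covering
    // [0, totalBytes) with batches of at most batchBytes each.
    static long[][] planBatches(long totalBytes, long batchBytes) {
        if (totalBytes < 0 || batchBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and batch size positive");
        }
        // Ceiling division: a partial final batch still needs its own slot.
        int count = Math.toIntExact((totalBytes + batchBytes - 1) / batchBytes);
        long[][] batches = new long[count][2];
        for (int b = 0; b < count; b++) {
            long offset = b * batchBytes;
            batches[b][0] = offset;
            batches[b][1] = Math.min(batchBytes, totalBytes - offset);
        }
        return batches;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // The sizes from the slide title: 16GB of data, 1GB per batch.
        long[][] batches = planBatches(16 * gib, gib);
        for (long[] batch : batches) {
            System.out.println("offset=" + batch[0] + " length=" + batch[1]);
        }
    }
}
```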
Live Task Migration
Dynamic Reconfiguration
How is the decision made?
• End-to-end: including JIT compilation time
• Peak Performance: without JIT and after warming-up
• Latency: does not wait for all threads to finish
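The first two policies can be read as different cost functions over the same measurements. The sketch below is an illustration under that assumption, not TornadoVM's implementation: the class, enum, and field names are hypothetical, and the timings in main are made-up numbers. The latency policy is not modelled, since it depends on which device finishes first rather than on comparing completed runs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of policy-based device selection; not TornadoVM code.
final class DevicePolicySketch {

    enum Policy { END_TO_END, PEAK_PERFORMANCE }

    // Per-device measurements: JIT compilation time and a warmed-up run time.
    static final class Timing {
        final double jitMillis;
        final double warmRunMillis;

        Timing(double jitMillis, double warmRunMillis) {
            this.jitMillis = jitMillis;
            this.warmRunMillis = warmRunMillis;
        }
    }

    // End-to-end counts compilation; peak performance ignores it.
    static double cost(Policy policy, Timing t) {
        return policy == Policy.END_TO_END
                ? t.jitMillis + t.warmRunMillis
                : t.warmRunMillis;
    }

    // Returns the name of the device with the lowest cost under the policy,
    // or null if the map is empty.
    static String pick(Policy policy, Map<String, Timing> timings) {
        String best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Timing> e : timings.entrySet()) {
            double c = cost(policy, e.getValue());
            if (c < bestCost) {
                bestCost = c;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up timings, for illustration only.
        Map<String, Timing> timings = new LinkedHashMap<>();
        timings.put("cpu", new Timing(50, 400));
        timings.put("gpu", new Timing(300, 40));
        timings.put("fpga", new Timing(5000, 20));
        System.out.println("end-to-end: " + pick(Policy.END_TO_END, timings));
        System.out.println("peak: " + pick(Policy.PEAK_PERFORMANCE, timings));
    }
}
```

With these illustrative numbers the two policies disagree: a device with a long compilation time can lose end-to-end yet win once compilation is excluded.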
Demo: Live Task Migration – Server/Client App
https://github.com/jjfumero/qconlondon2020-tornadovm

New compilation tier for Heterogeneous Systems
Related Work
Related Work (in the Java context)

Project     | Production-Ready | Supported Devices       | Live Task Migration | Compiler Specializations | Dynamic Languages
Sumatra     | No               | AMD GPUs                | No                  | No                       | No
Marawacc    | No               | Multi-core, GPUs        | No                  | Yes                      | Yes
JaBEE       | No               | NVIDIA GPUs             | No                  | No                       | No
RootBeer    | No               | NVIDIA GPUs             | No                  | No                       | No
Aparapi     | Yes              | GPUs, multi-core        | No                  | No                       | No
IBM GPU J9  | Yes              | NVIDIA GPUs             | No                  | No                       | No
grCUDA      | No (*)           | NVIDIA GPUs             | No                  | No                       | Yes
TornadoVM   | Not yet (*)      | Multi-core, GPUs, FPGAs | Yes                 | Yes                      | Yes
Ok, cool! What about performance?

Performance
* TornadoVM performs up to 7.7x better than the best statically selected device.
* Up to >4500x faster than sequential Java
Hardware:
- NVIDIA GTX 1060
- Intel FPGA Nallatech 385a
- Intel Core i7-7700K

Performance on GPUs, iGPUs, and CPUs

More details in our papers!
https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md

Limitations & Future Work
Limitations
We inherit limitations from the underlying programming model:
• No object support (except for a few cases)
• No recursion
• No dynamic memory allocation (*)
• No support for exceptions (*)

Future Work
• GPU/FPGA full capabilities
• Exploitation of memory tiers such as local memory (in progress)
• Policies for energy efficiency
• Multi-device within a task-schedule
• More parallel skeletons (reductions, stencil, scan, filter, …)
• PTX Backend for NVIDIA
Current Applicability of TornadoVM

EU H2020 E2Data Project
https://e2data.eu/
"End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware"
European Union’s Horizon H2020 research and innovation programme under grant agreement No 780245