Vector: A High-Level Programming Language for GPU Computing Harry Lee (hhl2114), Howard Mao (zm2169), Zachary Newman (zjn2101), Sidharth Shanker (sps2133), Jonathan Yu (jy2432)
The Problem
● GPUs have gained the ability to perform general-purpose computing tasks, so-called GPGPU
● GPGPU is now the workhorse of High-Performance Computing
● Current GPGPU languages, CUDA and OpenCL, are not very beginner-friendly and operate at a low level of abstraction (see the raw-CUDA sketch below)
  ○ Explicit copying of memory to and from the GPU
  ○ Explicit choice of warp size
● GPU programming often follows common patterns, like map or reduce, but without first-class functions there is no reusable way to implement these patterns
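For contrast, here is a minimal sketch of what squaring an array looks like in plain CUDA: the programmer explicitly allocates device memory, copies data in, picks a block size, and copies the results back. The kernel and sizes here are illustrative, not taken from the slides.

    // Plain CUDA boilerplate for a single element-wise operation.
    #include <cuda_runtime.h>

    __global__ void square_kernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // explicit thread indexing
        if (i < n) out[i] = in[i] * in[i];
    }

    void square_on_gpu(float *host_data, int n) {
        float *d_in, *d_out;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_in, bytes);                                    // explicit GPU allocation
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, host_data, bytes, cudaMemcpyHostToDevice);  // explicit copy to GPU
        int block = 256;                                             // explicit block-size choice
        square_kernel<<<(n + block - 1) / block, block>>>(d_out, d_in, n);
        cudaMemcpy(host_data, d_out, bytes, cudaMemcpyDeviceToHost); // explicit copy back
        cudaFree(d_in);
        cudaFree(d_out);
    }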
The Solution: Vector
● Memory implicitly copied to and from the GPU on an ad-hoc basis
● Automatic warp size selection
● Lightweight parallel-for syntax instead of defining kernels
● Map and Reduce implemented as higher-order functions
● Compiles to CUDA
Syntax
● Mostly C-like syntax
● Extensions for GPU computing and some syntactic sugar
Arrays
● Support for n-dimensional arrays
● Arrays created on both CPU and GPU
● Arrays are reference counted
● Data automatically copied to the GPU if accessed in GPU statements
● Automatically copied back to the CPU if accessed in CPU code (see the runtime sketch below)

    int a[3, 4, 5];
    x := a[i, j, k];
    a[i, j, k] = x;
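A rough sketch of how a runtime might track one reference-counted array and copy it lazily between CPU and GPU. The struct and function names here are hypothetical, not the actual Vector runtime.

    // Hypothetical per-array bookkeeping (illustrative, not the real Vector runtime).
    #include <cuda_runtime.h>

    struct vec_array {
        void  *host;          // CPU buffer
        void  *device;        // GPU buffer, allocated on first GPU access
        size_t bytes;
        int    refcount;      // arrays are reference counted
        bool   host_dirty;    // CPU copy is newer than GPU copy
        bool   device_dirty;  // GPU copy is newer than CPU copy
    };

    // Called before a GPU statement touches the array.
    void *array_to_device(vec_array *a) {
        if (a->device == nullptr) cudaMalloc(&a->device, a->bytes);
        if (a->host_dirty) {
            cudaMemcpy(a->device, a->host, a->bytes, cudaMemcpyHostToDevice);
            a->host_dirty = false;
        }
        a->device_dirty = true;   // assume the GPU may write to it
        return a->device;
    }

    // Called before CPU code reads the array.
    void *array_to_host(vec_array *a) {
        if (a->device_dirty) {
            cudaMemcpy(a->host, a->device, a->bytes, cudaMemcpyDeviceToHost);
            a->device_dirty = false;
        }
        return a->host;
    }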
For and Parallel For (pfor)
● For loop uses iterator statements instead of explicit incrementing as in C, so "i = 0; i < 5; i += 2" becomes "i in 0:5:2"
● Pfor loop uses the same syntax, but each iteration runs in a separate thread on the GPU (see the lowering sketch below)
● For loop also supports "for each"-style syntax to iterate over the elements of an array

    for (i in 0:5:2, j in 0:4) {
        // some code
    }

    for (x in arr) {
        // some code
    }

    pfor (i in 0:5:2, j in 0:4) {
        // some GPU code
    }
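A rough sketch of the kind of CUDA kernel the pfor above could compile to: one GPU thread per (i, j) iteration, with the loop indices recovered from the thread id. The actual generated code and launch configuration are chosen by the compiler.

    // One thread per (i, j) pair of `pfor (i in 0:5:2, j in 0:4)` (illustrative lowering).
    __global__ void pfor_body(/* captured variables would be passed here */) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        const int n_i = 3;                  // i in 0:5:2  ->  i = 0, 2, 4
        const int n_j = 4;                  // j in 0:4    ->  j = 0, 1, 2, 3
        if (idx >= n_i * n_j) return;       // extra threads do nothing
        int i = (idx / n_j) * 2;            // recover i from the flat thread index
        int j = idx % n_j;                  // recover j
        // ... body of the pfor, with i and j in scope ...
    }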
Map and Reduce
● Higher-order functions
● Must be generated at compile time, since function pointers are not guaranteed to work in CUDA (see the generated-kernel sketch below)
● Map takes a function f and an array a and returns an array b where b[i] = f(a[i])
● Reduce takes a function f and an array a and returns the result of applying f to pairs of elements of a, then to pairs of the results, and so on; f must be associative and commutative

    __device__ float square(float x) {
        return x * x;
    }

    int[] another_function(int inputs[]) {
        squares := @map(square, inputs);
        return squares;
    }

    __device__ int add(int x, int y) {
        return x + y;
    }

    int another_function(int inputs[]) {
        sum := @reduce(add, inputs);
        return sum;
    }
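Because device function pointers are avoided, each @map call site can be compiled into its own kernel that calls the user's __device__ function directly. A sketch of the kind of kernel `@map(square, inputs)` could expand to (names are illustrative, not the exact codegen):

    // Specialized kernel for `@map(square, inputs)`; `square` is called directly,
    // so no device function pointer is needed.
    __device__ float square(float x) { return x * x; }

    __global__ void map_square(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = square(in[i]);   // b[i] = f(a[i])
    }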
Implementation Details
● Scanner/parser in ocamllex and ocamlyacc
● Generator takes the AST and produces CPU code inline
● Generation of GPU code is deferred until the end
● Environment stores variables in scope and other state
● Runtime library implements arrays and iterators
Lessons Learned
● Group dynamics are important: a good balance between leader and team members
● It's better to segment building the compiler by feature than by phase of the compiler. It's very hard to predict exactly what the grammar should be before implementing code generation.
● Communication with teammates is very important. Enforcing a consistent coding style (especially with respect to indentation) avoids problems down the line.
● OCaml tools (and the functional programming paradigm in general) are really great for writing compilers.
● Start early
And Now a Demo!!! Mandelbrot set generator on CPU and GPU
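The demo code itself is not in the slides; for reference, here is a minimal CUDA escape-time Mandelbrot kernel of the kind such a demo might run on the GPU (view window, image size, and iteration cap are all illustrative):

    // One thread per pixel; each thread iterates z = z^2 + c until escape or max_iter.
    __global__ void mandelbrot(int *iters, int width, int height, int max_iter) {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;

        float cx = -2.0f + 3.0f * px / width;    // map pixel to the complex plane
        float cy = -1.5f + 3.0f * py / height;

        float x = 0.0f, y = 0.0f;
        int i = 0;
        while (x * x + y * y <= 4.0f && i < max_iter) {
            float xt = x * x - y * y + cx;
            y = 2.0f * x * y + cy;
            x = xt;
            i++;
        }
        iters[py * width + px] = i;              // escape iteration count for this pixel
    }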
CPU vs GPU performance