gpucc: An Open-Source GPGPU Compiler
Jingyue Wu (jingyue@google.com), Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Artem Belevich, Robert Hundt
One-Slide Overview
● Motivation
  ○ Lack of a state-of-the-art platform for CUDA compiler and HPC research
  ○ Binary dependencies, performance tuning, language features, bug turnaround times, etc.
● Solution
  ○ gpucc: the first fully-functional, open-source, high-performance CUDA compiler, based on LLVM and supporting C++11 and C++14
  ○ Developed and tuned several general and CUDA-specific optimization passes
● Results highlights (compared with nvcc)
  ○ Up to 51% faster on internal end-to-end benchmarks
  ○ On par on open-source benchmarks
  ○ Compile time is 8% faster on average and 2.4x faster for pathological compilations
Mixed-Mode CUDA Code (foo.cu)

CPU/host:

  template <int N>
  void host(float *x) {
    float *y;
    cudaMalloc(&y, 4*N);
    cudaMemcpy(y, x, ...);
    kernel<N><<<16, 128>>>(y);
    ...
  }

GPU/device:

  template <int N>
  __global__ void kernel(float *y) {
    ...
  }
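For reference, here is a self-contained version of this example that actually compiles; the kernel body, the copy size/direction, and the synchronization are my additions, not part of the original slide:

  #include <cuda_runtime.h>

  template <int N>
  __global__ void kernel(float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) y[i] *= 2.0f;                          // assumed body: scale each element
  }

  template <int N>
  void host(float *x) {
    float *y;
    cudaMalloc(&y, 4 * N);                            // 4 bytes per float
    cudaMemcpy(y, x, 4 * N, cudaMemcpyHostToDevice);  // filled-in size and direction
    kernel<N><<<16, 128>>>(y);                        // 16 blocks of 128 threads
    cudaDeviceSynchronize();
    cudaFree(y);
  }

A caller would instantiate the template, e.g. host<1024>(x).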
gpucc Architecture (Current and Interim)

  Mixed-mode input file (e.g., the foo.cu example above)
    → Host-device splitter, which produces:
      ○ Device code → device code generator (Clang → IR optimizer → NVPTX codegen) → PTX assembly
      ○ Host code → host compilation, with the PTX assembly embedded → fat binary
Clang Integration (WIP and Long-Term)

Pipeline: mixed-mode input file → device compilation → PTX assembly → host compilation → fat binary

● Major issues with the separate compilation
  ○ Source-to-source translation is complex and fragile
  ○ Long compilation time
● Clang driver instead of physical code splitting (by Artem Belevich)
  ○ $ clang foo.cu ...
  ○ $ clang -x cuda <file> ...
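As of today's upstream Clang, a CUDA compile additionally names the target GPU and links the CUDA runtime; the flag spellings below reflect current upstream Clang and may differ from the WIP version shown above:

  $ clang++ foo.cu -O3 --cuda-gpu-arch=sm_35 -L/usr/local/cuda/lib64 -lcudart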
CPU vs GPU Characteristics

CPU
● Designed for general purposes
● Optimized for latency
● Heavyweight hardware threads
  ○ Branch prediction
  ○ Out-of-order execution
  ○ Superscalar
● Small number of cores per die

GPU
● Designed for rendering
● Optimized for throughput
● Lightweight hardware threads
● Massive parallelism
  ○ Can trade latency for throughput
Major Optimizations in gpucc
● Straight-line scalar optimizations
● Inferring memory spaces
● Loop unrolling and function inlining
● Memory-space alias analysis
● Speculative execution
● Bypassing 64-bit divisions
Straight-Line Scalar Optimizations

  for (long x = 0; x < 3; ++x) {
    for (long y = 0; y < 3; ++y) {
      float *p = &a[(c + y) + (b + x) * n];
      ... // load from p
    }
  }

[Figure: a 3x3 tile of the 2-D array a (row length n), anchored at row b, column c]
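As a point of reference, here is a compilable distillation of this running example; the kernel signature, the accumulation into s, and the unroll pragmas are my additions:

  __global__ void tile3x3(const float *a, float *out, long n, long b, long c) {
    float s = 0.0f;
    #pragma unroll
    for (long x = 0; x < 3; ++x) {
      #pragma unroll
      for (long y = 0; y < 3; ++y) {
        s += a[(c + y) + (b + x) * n];  // the nine loads the next slides unroll
      }
    }
    *out = s;  // keep the loads live so they are not optimized away
  }

Compiling this at -O3 and inspecting the PTX shows the address computations the following slides walk through.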
Straight-Line Scalar Optimizations

Loop unrolling turns the nested loop into nine explicit address computations:

  p0 = &a[c + b * n];
  p1 = &a[c + 1 + b * n];
  p2 = &a[c + 2 + b * n];
  p3 = &a[c + (b + 1) * n];
  p4 = &a[c + 1 + (b + 1) * n];
  p5 = &a[c + 2 + (b + 1) * n];
  p6 = &a[c + (b + 2) * n];
  p7 = &a[c + 1 + (b + 2) * n];
  p8 = &a[c + 2 + (b + 2) * n];
Straight-Line Scalar Optimizations

Computed naively, each address needs its own chain of arithmetic; for p8 alone:

  b + 2
  (b + 2) * n
  c + 2
  c + 2 + (b + 2) * n

The unrolled code leaves two kinds of redundancy on the table:
● Addressing mode (base + imm): p8 could be computed as &a[c + (b + 2) * n] + 2
● Injured redundancy: (b + 2) * n could be computed as (b + 1) * n + n

Three passes attack these:
● Pointer arithmetic reassociation
● Straight-line strength reduction
● Global reassociation
Pointer Arithmetic Reassociation

Before:

  p0 = &a[c + b * n];
  p1 = &a[c + 1 + b * n];
  p2 = &a[c + 2 + b * n];
  p3 = &a[c + (b + 1) * n];
  p4 = &a[c + 1 + (b + 1) * n];
  p5 = &a[c + 2 + (b + 1) * n];
  p6 = &a[c + (b + 2) * n];
  p7 = &a[c + 1 + (b + 2) * n];
  p8 = &a[c + 2 + (b + 2) * n];

After:

  p0 = &a[c + b * n];
  p1 = &p0[1];
  p2 = &p0[2];
  p3 = &a[c + (b + 1) * n];
  p4 = &p3[1];
  p5 = &p3[2];
  p6 = &a[c + (b + 2) * n];
  p7 = &p6[1];
  p8 = &p6[2];
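The payoff is that the constant offsets now fold into the load's base+immediate addressing mode. A sketch of the PTX this enables (register numbering is illustrative, not actual gpucc output):

  ld.global.f32 %f0, [%rd0];     // load through p0
  ld.global.f32 %f1, [%rd0+4];   // p1: +1 float = +4 bytes, folded into the immediate
  ld.global.f32 %f2, [%rd0+8];   // p2: +2 floats = +8 bytes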
Straight-Line Strength Reduction

Before:

  x = (base + C0) * stride
  y = (base + C1) * stride

After:

  x = (base + C0) * stride
  y = x + (C1 - C0) * stride
Straight-Line Strength Reduction

Applied to the running example (base = b, stride = n, with C0 = 0, C1 = 1, C2 = 2), the second and third multiplies become adds:

Before:

  x0 = b * n;
  p0 = &a[c + x0];
  p1 = &p0[1];
  p2 = &p0[2];
  x1 = (b + 1) * n;
  p3 = &a[c + x1];
  p4 = &p3[1];
  p5 = &p3[2];
  x2 = (b + 2) * n;
  p6 = &a[c + x2];
  p7 = &p6[1];
  p8 = &p6[2];

After:

  x0 = b * n;
  p0 = &a[c + x0];
  p1 = &p0[1];
  p2 = &p0[2];
  x1 = x0 + n;
  p3 = &a[c + x1];
  p4 = &p3[1];
  p5 = &p3[2];
  x2 = x1 + n;
  p6 = &a[c + x2];
  p7 = &p6[1];
  p8 = &p6[2];
Global Reassociation

Starting point (after pointer reassociation and strength reduction):

  x0 = b * n;
  p0 = &a[c + x0];
  p1 = &p0[1];
  p2 = &p0[2];
  x1 = x0 + n;
  p3 = &a[c + x1];
  p4 = &p3[1];
  p5 = &p3[2];
  x2 = x1 + n;
  p6 = &a[c + x2];
  p7 = &p6[1];
  p8 = &p6[2];

Step 1: name the array indices:

  i0 = c + x0;
  i1 = c + x1;

Step 2: reassociate i1 through the definition of x1:

  i1 = c + x1 = c + (x0 + n) = (c + x0) + n = i0 + n

Step 3: since p0 = &a[i0] and p3 = &a[i1] = &a[i0 + n], rewrite p3 as an offset from p0:

  p3 = &p0[n];

Applying the same rewrite to p6 yields the final form:

  x0 = b * n;
  p0 = &a[c + x0];
  p1 = &p0[1];
  p2 = &p0[2];
  p3 = &p0[n];
  p4 = &p3[1];
  p5 = &p3[2];
  p6 = &p3[n];
  p7 = &p6[1];
  p8 = &p6[2];
Summary of Straight-Line Scalar Optimizations

Before:

  p0 = &a[c + b * n];
  p1 = &a[c + 1 + b * n];
  p2 = &a[c + 2 + b * n];
  p3 = &a[c + (b + 1) * n];
  p4 = &a[c + 1 + (b + 1) * n];
  p5 = &a[c + 2 + (b + 1) * n];
  p6 = &a[c + (b + 2) * n];
  p7 = &a[c + 1 + (b + 2) * n];
  p8 = &a[c + 2 + (b + 2) * n];

After:

  x0 = b * n;
  p0 = &a[c + x0];
  p1 = &p0[1];
  p2 = &p0[2];
  p3 = &p0[n];
  p4 = &p3[1];
  p5 = &p3[2];
  p6 = &p3[n];
  p7 = &p6[1];
  p8 = &p6[2];

Design doc: https://goo.gl/4Rb9As
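These straight-line passes were contributed to upstream LLVM. Assuming current upstream pass names (which may have drifted since this talk: SeparateConstOffsetFromGEP for pointer reassociation, StraightLineStrengthReduce, and NaryReassociate for global reassociation), they can be exercised in isolation with opt:

  $ opt -passes='separate-const-offset-from-gep,slsr,nary-reassociate' -S in.ll -o out.ll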
Inferring Memory Spaces

PTX load/store instructions:
● Special
  ○ ld.shared/st.shared
  ○ ld.global/st.global
  ○ ...
● Generic
  ○ ld/st
  ○ Overhead in checking which memory space the address falls in (e.g., ~10% slower than ld.shared)
  ○ Alias analysis suffers

[Figure: GPU memory hierarchy — each block (processor) runs many threads that share a per-block shared memory; all blocks access a common global memory]
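A minimal CUDA sketch of why inference matters (the kernel, names, and sizes are my own illustration, not from the talk): buf lives in shared memory, but once its address flows through a plain float*, the front end must emit generic ld/st unless the compiler can prove which space the pointer refers to.

  __device__ float sum2(float *p) {   // p is a generic pointer
    return p[0] + p[1];               // generic ld, unless the space is inferred
  }

  __global__ void smooth(const float *in, float *out) {
    __shared__ float buf[128];
    buf[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    // After inlining sum2, gpucc's memory-space inference can prove the
    // pointer refers into buf and emit ld.shared instead of generic ld.
    if (threadIdx.x + 1 < 128)
      out[threadIdx.x] = sum2(&buf[threadIdx.x]);
  }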