Towards a Usable Programming Model for GPGPU Dr. Orion Sky Lawlor - PowerPoint PPT Presentation

Towards a Usable Programming Model for GPGPU Dr. Orion Sky Lawlor lawlor@alaska.edu U. Alaska Fairbanks 2011-04-19 http://lawlor.cs.uaf.edu/ 1 8

Obligatory Introductory Quote “He who controls the past, controls the future.” George Orwell, 1984 2

In Parallel Programming... “He who controls the writes, controls performance.” “He who controls the past, controls the future.” George Orwell, 1984 Orion Lawlor, 2011 3

Talk Outline  Existing parallel programming models  Who controls the writes?  Charm++ and Charm--  Charm-style GPGPU  Conclusions 4

Existing Model: Superscalar  Hardware parallelization of a sequential programming model  Fetch future instructions  Need good branch prediction  Runtime Dependency Analysis  Load/store buffer for mem-carried  Rename away false dependencies  RAR, WAR, WAW, -> RAW <-  Now “solved”: low future gain 5

Spacetime Data Arrows “space” (memory, node) Read Write “time” (program order) 6

Read After Write Dependency Read Write Artificial Instruction Read Boundary Write 7

Read After Read: No Problem! Read Write Artificial Instruction Read Boundary Write 8

Existing Model: Shared Memory  OpenMP, threads, shmem  “Just” let different processors access each others' memory  HW: Cache coherence • false sharing, cache thrashing  SW: Synchronization • locks, semaphores, fences, ...  Correctness is a huge issue  Weird race conditions abound  New bugs in 10+ year old code 9

Gather: Works Fine Distributed Reads Centralized Writes 10

Scatter: Tough to Synchronize! Oops! 11

Existing Model: Message Passing  MPI, sockets  Explicit control of parallel reads (send) and writes (recv)  Far fewer race conditions  Programmability is an issue  Raw byte-based interface (C style)  High per-message cost (alpha)  Synchronization issues: when does MPI_Send block? 12

Existing Model: SIMD  SSE, AVX, and GPU  Single Instruction, Multiple Data  Far fewer fetches & decodes  Far higher arithmetic intensity  CPU: Programmability N/A  Assembly language (hello, 1984!)  mmintrin.h wrappers: _mm_add_ps  Or pray for automagic compiler!  GPU: Programmability OK  Graphics side: GLSL, HLSL, Cg  GPGPU: CUDA, OpenCL, DX CS 13

NVIDIA CUDA  CPU calls a GPU “kernel” with a “block” of threads • Now fully programmable (throw/catch, virtual methods, recursion, etc)  Read and write memory anywhere • Zero protection against multithreaded race conditions  Manual control over a small __shared__ memory region  Only runs on NVIDIA hardware (OpenCL is portable... sorta) 14

OpenGL: GL Shading Language  Mostly programmable (loops, etc)  Can read anywhere in “textures”, only write to “framebuffer” (2D/3D arrays) • Reads go through “texture cache”, so performance is good (iff locality) • Writes are on space-filling curve • Writes are controlled by the graphics driver • So cannot have synchronization bugs!  Rich selection of texture filtering (array interpolation) modes • Includes mipmaps, for multigrid  GLSL can run OK on every modern GPU 15 (well, except Intel...)

GLSL vs CUDA GLSL Programs 16

GLSL vs CUDA Arbitrary writes CUDA Programs GLSL Programs Mipmaps; texture writes 17

GLSL vs CUDA Correct CUDA Programs Programs GLSL Programs 18

GLSL vs CUDA Correct CUDA Programs Programs GLSL High Programs Performance Programs 19

GPU/CPU Convergence  GPU, per socket:  SIMD: 16-32 way  SMT: 2-50 way (register limited)  SMP: 4-36 way  CPUs will get there, soon!  SIMD: 8 way AVX (64-way SWAR)  SMT: 2 way Intel; 4 way IBM  SMP: 6-8 way/socket already • Intel has shown 48 way chips  Biggest difference: CPU has branch prediction & superscalar! 20

CUDA: Memory Output Bandwidth t = 4000ns / kernel + bytes * 0.0125 ns / byte s / B G 0 8 : h t d i w d n a b t u p t u o l e n r e K Kernel startup latency: 4us 21 NVIDIA GeForce GTX 280, fixed 128 threads per block

Charm++ and “Charm--”

Existing Model: Charm++  Chares send each other messages  Runtime system does delivery  Scheduling!  Migration with efficient forwarding  Cheap broadcasts  Runtime system schedules Chares  Overlap comm and compute  Programmability still an issue  Per-message overhead, even with message combining library  Collect up your messages (SDAG?)  Cheap SMP reads? SIMD? GPU? 23

One Charm++ Method Invocation Chare Messages Read internal Receive one message state (but in what order?) Entry Method Update internal state Send messages Between send and receive: migration, 24 checkpointing, ...

The Future: SIMD  AVX, SSE, AltiVec, GPU, etc  Thought experiment  Imagine a block of 8 chares living in one SIMD register • Deliver 8 messages at once (!)  Or imagine 100K chares living in GPU RAM  Locality (mapping) is important!  Branch divergence penalty  Struct-of-Arrays member storage • xxxxxxxx yyyyyyyy zzzzzzzz 25 • Members of 8 separate chares!

Vision: Charm-- Stencil array [2D] stencil { public: float data; [entry] void average( float nbors[4]=fetchnbors()) { data=0.25*( nbors[0]+ nbors[1]+ nbors[2]+ nbors[3]); } }; 26

Vision: Charm-- Explained array [2D] stencil { public: Assembled into GPU arrays or SSE vectors float data; [entry] void average( float nbors[4]=fetchnbors()) { data=0.25*( nbors[0]+ nbors[1]+ nbors[2]+ nbors[3]); } }; 27

Vision: Charm-- Explained array [2D] stencil { public: Broadcast out to blocks of array elements float data; [entry] void average( float nbors[4]=fetchnbors()) { data=0.25*( nbors[0]+ nbors[1]+ nbors[2]+ nbors[3]); } }; 28

Vision: Charm-- Explained array [2D] stencil { public: Hides local synchronized reads, network, and domain boundaries float data; [entry] void average( float nbors[4]=fetchnbors()) { data=0.25*( nbors[0]+ nbors[1]+ nbors[2]+ nbors[3]); } }; 29

Vision: Charm-- Springs array [1D] sim_spring { public: float restlength; [entry] void netforce( sim_vertex ends[2]=fetch_ends()) { vec3 along=ends[1].pos-ends[0].pos; float f=-k*(length(along)-restlength); vec3 F=f*normalize(along); ends[0].netforce+=F; ends[1].netforce-=F; } }; 30

One Charm-- Method Invocation Chare “Mainchare” Read (on GPU) (on CPU) internal Fetch together states multiple messages Update internal states Send off network messages 31

Noncontiguous Communication  Run scatter kernel Network Data Buffer  Or fold into fetch GPU Target Buffer 32

Key Charm-- Design Features  Multiple chares receive message at once  Runtime block-allocates incoming and outgoing message storage  Critical for SIMD, GPU, SMP  Receive multiple messages in one entry point  Minimize roundtrip to GPU  Explicit support for timesteps  E.g., double-buffer message storage 33

Charm-- Not Shown  Lots of work in “mainchare”  Controls decomposition & comms  Set up “fetch”  Still lots of network work  Gather & send off messages  Distribute incoming messages  Division of labor?  Application scientist writes Chare  Computer scientist writes Mainchare 34

Related Work  Charm++ Accelerator API [Wesolowski]  Pipeline CUDA copy, queue kernels  Good backend for Charm--  Intel ArBB: SIMD from kernel  Based on RapidMind  But GPU support?  My “GPGPU” library  Based on GLSL 35

The Future

The Future: Memory Bandwidth  Today: 1TF/s, but only 0.1TB/s  Don't communicate, recompute  multistep stencil methods  “fetch” gets even more complex!  64-bit -> 32-bit -> 16-bit -> 8?  Spend flops scaling the data  Split solution + residual storage • Most flops use fewer bits, in residual  Fight roundoff with stochastic rounding • Add noise to improve precision 37

Conclusions  C++ is dead. Long live C++!  CPU and GPU on collision course  SIMD+SMT+SMP+network  Software is the bottleneck  Exciting time to build software!  Charm-- model  Support ultra-low grainsize chares • Combine into SIMD blocks at runtime  Simplify programmer's life  Add flexibility for runtime system  BUT must scale to real applications! 38

Towards a Usable Programming Model for GPGPU Dr. Orion Sky Lawlor - PowerPoint PPT Presentation

Towards a Usable Programming Model for GPGPU Dr. Orion Sky Lawlor lawlor@alaska.edu U. Alaska Fairbanks 2011-04-19 http://lawlor.cs.uaf.edu/ 1 8 Obligatory Introductory Quote He who controls the past, controls the future. George

Welcome! Global Agenda: 1. GPGPU (1) : Introduction, architecture, concepts 2. GPGPU (2) :

Welcome! Todays Agenda: GPU Execution Model GPGPU Flow GPGPU Low Level Notes

Parallel Incep+on MPP Databases GPGPU Kyle Dunn Me Data nerd for Recovering HPC/GPGPU

Welcome! Todays Agenda: Introduction to GPGPU Example: Voronoi Noise GPGPU

Welcome! Todays Agenda: Practical GPGPU: Verlet Fluid GPGPU Algorithms Optimizing

Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.2015 Efficient

Towards Usable Privacy in Cross-System Personalization Yang Wang CMU Usable Privacy and Security

GPGPU Programming in Haskell with Accelerate Trevor L. McDonell University of New South Wales

Visual Design Matters no alternative to design desirable usable useful Original photo by

K E D b . D a L a t a B a s e Jordan Vincent XML processing using GPGPU Jordan

GPGPU: General-Purpose Computation on GPUs Prekshu Ajmera 03d05006 Overview 1. Motivation: Why

K Pre-Post Cloud Tutorial for the use of GPGPU instances RIKEN R-CCS MARCH 29, 2019 About this

A New Two- -Scale Mix Model: Towards Scale Mix Model: Towards a Multi a Multi- - A New Two A

Shader Programming Shader Programming vs CUDA vs CUDA Tien-Tsin Wong The Chinese University of

Usable security for control systems Jun Ho Huh Honeywell ACS Labs, research scientist

Blind and Human: Exploring More Usable Audio CAPTCHA Designs Valerie Fanelle, Sepideh Karimi,

Course Introduction Dr. Mattox Beckman University of Illinois at Urbana-Champaign Department of

Rethinking)the) landscape0 Carys Swanwick Landscape Forward Symposium Newcastle 18 th March

Augmented reality and perception of analogue and digital images and maps Piotr A. Werner Spatial

Tutorial: Constraint Programming for the Vehicle Routing Problem (slides) Conference Paper

Big Data and Classification Paul Balas Content Architect 303Computing A Dystopian Future

Embrace The Past How Software Evolution Lets You Understand Large Codebases @AdamTornhill

The Challenge of Big Data for the Social Sciences Kenneth Cukier Professor Kenneth Benoit

CSE 505: Programming Languages CSE 505: Programming Languages Instructor: Craig Chambers But