GRAPHICS PROCESSOR PROGRAMMING IN CUDA Tamás Budavári / The Johns Hopkins University 7/18/2012
How I got into this? Galaxy correlation function: 8 bins, a histogram of distances. State-of-the-art method: dual-tree traversal. ISSAC at HiPACC 7/18/2012
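To make the "histogram of distances" concrete, here is a minimal brute-force sketch in C. It is not the dual-tree method named on the slide, and the function name `pair_histogram`, the 2D points, and the equal-width bins are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: histogram of pair separations for n 2D points.
 * counts[b] receives the number of pairs whose distance r satisfies
 * b*w <= r < (b+1)*w, where w = rmax / nbins. Pairs with r >= rmax
 * are ignored. */
void pair_histogram(const double *x, const double *y, size_t n,
                    double rmax, size_t nbins, unsigned long *counts)
{
    assert(rmax > 0.0 && nbins > 0);
    double w = rmax / (double)nbins;

    for (size_t b = 0; b < nbins; ++b)
        counts[b] = 0;

    /* Visit every unordered pair (i, j) exactly once: O(n^2) work. */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            double dx = x[i] - x[j];
            double dy = y[i] - y[j];
            double r2 = dx * dx + dy * dy;

            /* Compare squared distances with squared bin edges,
             * which avoids taking a square root per pair. */
            size_t b = 0;
            while (b < nbins) {
                double edge = (double)(b + 1) * w;
                if (r2 < edge * edge)
                    break;
                ++b;
            }
            if (b < nbins)
                counts[b]++;
        }
    }
}
```

Every pair is independent of every other, which is what makes this kind of counting a natural fit for data-parallel hardware.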
What if? 800 × 800 bins
Extending SQL Server: dedicated service for direct access; shared-memory IPC with on-the-fly data transform.
User-Defined Functions: pair counts computed on the GPU; returns the 2D histogram as a table (i, j, cts); calculate the correlation function in SQL.
Multiple GPUs in Parallel: several C# proxies to launch jobs on more cards; non-blocking SQL routines.
Async SQL Interface
Baryon Acoustic Oscillations (BAO): 600 trillion galaxy pairs; C for CUDA on GPUs. Tian, Neyrinck, TB & Szalay (2011).
Outline: Parallelism; Hardware; Programming; Multithreading; Coding for GPUs (CUDA, Thrust, …).
Parallelism. Data parallel: same processing on different pieces of data. Task parallel: simultaneous processing on the same data.
On all levels of the hierarchy: clouds, clusters, machines, cores, threads.
Scalability. Scale up (vertically): add resources to a node, e.g., bigger memory, faster processor, … Scale out (horizontally): use more of the threads, cores, machines, clusters, clouds, …
Cluster
High-Performance Computing: traditional HPC clusters; launching jobs on a cluster of machines; use MPI (Message Passing Interface) to communicate among nodes.
Queuing Systems: used for batch jobs on computer clusters; fair scheduling of user jobs; group policies. Several systems: Portable Batch System (PBS), Condor, etc.
Computer
Classification of Parallel Computers: Flynn’s Taxonomy
SISD: Single Instruction, Single Data. Classical von Neumann machines; single-threaded codes. (arstechnica.com)
SIMD: Single Instruction, Multiple Data. On x86: MMX (Matrix Math eXtension, an unofficial expansion), SSE (Streaming SIMD Extensions), …and more… GPU programming!! (arstechnica.com)
Amdahl’s Law of Parallelism. Speedup on $N$ processors: $S(N) = \frac{T(1)}{T(N)} = \frac{1}{(1-p) + p/N}$, where $p$ is the parallelizable fraction of the run time, so $S(N) \to \frac{1}{1-p}$ as $N \to \infty$. Before looking into parallelism, speed up the serial code, to figure out the max speedup, i.e., $N \to \infty$.
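A small numeric sketch of Amdahl's law in C may help. Here `p` is the fraction of the single-processor run time that can be parallelized and `n` is the number of processors; the function names are illustrative choices, not from the slides.

```c
#include <assert.h>

/* Amdahl's law: speedup on n processors when a fraction p of the
 * single-processor run time can be parallelized. */
double amdahl_speedup(double p, double n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound on the speedup as n grows without limit (needs p < 1). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with p = 0.9 the bound is a factor of 10 no matter how many processors are added, which is why the serial part deserves attention first.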
Chip
Moore’s Law
New Limitation is Energy! Power to compute the same thing? A CPU is 10× less efficient than a digital signal processor (DSP); a DSP is 10× less efficient than a custom chip. New design: multicores with slower clocks. But the interconnect is expensive; need simpler components.
Emerging Architectures. Andrew Chien: 10×10 to replace the 90/10 rule. Custom modules on chip, cf. SoC in cellphones. Statistics on a video codec module?
Emerging Architectures. Andrew Chien: 10×10 to replace the 90/10 rule. Custom modules on chip, cf. SoC in cellphones. Scientific analysis on such specialized units?
GPUs Evolved to be General Purpose. Virtual world: simulation of real physics. C for CUDA and OpenCL. 512 cores; 25k threads, running 1 billion/sec. Old algorithms built on the wrong assumption: today processing is free but memory is slow. New programming paradigm!
New Moore’s Law: in the number of cores; faster than ever.
Programming
Programming Languages: no one language to rule them all, and many to choose from.
Assembly: low-level (almost) machine code; different for each computer.
The “C” Language: higher level but still close to hardware, i.e., fast. Pointers! Many things are written in C: operating systems, other languages, …
Java. Pros: memory management with garbage collection; Just-In-Time compilation from ‘bytecode’. Cons: not so great performance; hard to include legacy codes; new language features were an afterthought.
Python: scripting to glue things together; easy to wrap legacy codes; lots of scientific modules and plotting; good for prototyping.
Etc…: Perl, Lisp, Matlab, Haskell, Mathematica, OCaml, IDL, Erlang, R, your favorite here…
Programming in C: skeleton of an application
Programming in C. Files: headers (*.h), source (*.c). Building an application: compile source, link object files.
Using Pointers
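The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the basic idea of pointers in C: a function receives the addresses of two variables and changes them in place. The name `swap_ints` is an illustrative choice.

```c
#include <assert.h>
#include <stddef.h>

/* Exchange two ints through their addresses. */
void swap_ints(int *a, int *b)
{
    assert(a != NULL && b != NULL);
    int tmp = *a;   /* read the value a points to */
    *a = *b;        /* write through the pointer a */
    *b = tmp;       /* write through the pointer b */
}
```

A caller passes addresses with the `&` operator, e.g. `swap_ints(&u, &v);`, after which the values of `u` and `v` are exchanged.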
Arrays: dynamic arrays; memory allocation; freeing memory; pointer arithmetic.
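The three ingredients listed here (allocation, pointer arithmetic, freeing) can be sketched in a few lines of C. The helper name `make_ramp` is hypothetical and only serves the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and fill them with 0, 1, 2, ... using pointer
 * arithmetic. Returns NULL if the allocation fails; the caller owns
 * the memory and must release it with free(). */
double *make_ramp(size_t n)
{
    assert(n > 0);
    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return NULL;

    /* v + n points one past the last element; p - v is the index of p. */
    for (double *p = v; p != v + n; ++p)
        *p = (double)(p - v);
    return v;
}
```

Note that `v[i]` and `*(v + i)` refer to the same element, and that every successful `malloc` should be paired with exactly one `free`.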
Matrix, etc… Pointer to pointers: data allocated in v, pointers in A, for 2D indexing. One can have a matrix, tensor, … or jagged arrays, …
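A possible C sketch of this layout follows: the data live in one contiguous block `v`, and `A` holds one pointer per row so that `A[i][j]` works. The function names `matrix_alloc` and `matrix_free` are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate rows*cols zero-initialized doubles in one block v, plus an
 * array A of row pointers into v. Returns NULL on failure. */
double **matrix_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double *v = calloc(rows * cols, sizeof *v);
    double **A = malloc(rows * sizeof *A);
    if (v == NULL || A == NULL) {
        free(v);        /* free(NULL) is a no-op */
        free(A);
        return NULL;
    }
    for (size_t i = 0; i < rows; ++i)
        A[i] = v + i * cols;   /* row i starts i*cols elements into v */
    return A;
}

/* Release both the data block and the row pointers. */
void matrix_free(double **A)
{
    if (A != NULL) {
        free(A[0]);     /* A[0] is the start of the data block v */
        free(A);
    }
}
```

Keeping the data contiguous means the whole matrix can be copied in one transfer, which matters when moving it to a GPU. A jagged array would instead give each row its own allocation of a different length.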
Concurrency: parallel actions
Data Parallel Techniques. “Embarrassingly parallel”: decoupled problems, independent processing. MapReduce: Map, Reduce.
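The map and reduce steps can be sketched serially in C to show the shape of the pattern. This is not the MapReduce framework itself, and all names below are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Map: apply f to every element. Each output depends on one input only,
 * so the iterations could run in any order or in parallel. */
void map_array(const double *in, double *out, size_t n, double (*f)(double))
{
    assert(f != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = f(in[i]);
}

/* Reduce: combine all elements with a binary operation, starting from init. */
double reduce_array(const double *in, size_t n, double init,
                    double (*op)(double, double))
{
    assert(op != NULL);
    double acc = init;
    for (size_t i = 0; i < n; ++i)
        acc = op(acc, in[i]);
    return acc;
}

static double square(double x) { return x * x; }
static double add(double a, double b) { return a + b; }

/* Example pipeline: sum of squares, using scratch (n doubles) as the
 * intermediate buffer between the map and reduce steps. */
double sum_of_squares(const double *in, double *scratch, size_t n)
{
    map_array(in, scratch, n, square);
    return reduce_array(scratch, n, 0.0, add);
}
```

The map step is the embarrassingly parallel part. The reduce step can also be parallelized when the operation is associative, typically by combining partial results in a tree.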
The Elevator Problem: people on multiple levels press the button…
Mutual Exclusion: multiple processes or threads access shared resources in critical sections, e.g., call the elevator when it’s time to go. Locking: elevators, etc…
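A minimal sketch of locking a critical section in C, assuming POSIX threads are available (on many systems this is built with the `-pthread` compiler flag). The shared counter and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Shared state guarded by a single lock (illustrative names). */
static long shared_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long iters = *(const long *)arg;
    for (long i = 0; i < iters; ++i) {
        pthread_mutex_lock(&count_lock);     /* enter critical section */
        ++shared_count;                      /* only one thread at a time */
        pthread_mutex_unlock(&count_lock);   /* leave critical section */
    }
    return NULL;
}

/* Start nthreads workers that each increment the counter iters times,
 * wait for all of them, and return the final count. */
long count_with_lock(int nthreads, long iters)
{
    assert(nthreads > 0 && nthreads <= 16 && iters >= 0);
    pthread_t tid[16];
    int started = 0;

    shared_count = 0;
    for (int t = 0; t < nthreads; ++t) {
        if (pthread_create(&tid[t], NULL, worker, &iters) != 0)
            break;
        ++started;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return shared_count;
}
```

Without the lock, two threads could read the same old value and both write back old + 1, losing an update. With it, the result is the number of started threads times `iters`.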
Dining Philosophers: five silent philosophers sit at the table and alternate between eating and thinking. Each needs both forks, left and right, to eat, and they must be picked up one by one! Infinite food in front of them. How can they all think & eat forever?
Parallel Threads
Threading: concurrent parallelism in a machine
Parallelism. Data parallel: same processing on different pieces of data. Task parallel: simultaneous processing on the same data.
Comparing Chips
Hybrid Architecture (figure labels: launch, run, sync)
Programming GPGPUs: CUDA (low-level & high-level); OpenCL; DirectCompute (DirectX, etc…); C++ AMP (new! Accelerated Massive Parallelism).
CUDA
Projects on CUDA Zone
Currently Available: GPU-optimized sorting, RNG, BLAS, FFT, Hadamard...; SDK w/ examples; Nsight debugger! Imaging routines; Python w/ PyCUDA; High-level C++ programming with
Fermi: previous generation. 20-series Tesla cards, e.g., C2050; 400+ series GeForce cards, e.g., GTX 480. IEEE-754 arithmetic: standard floating point, same as in the CPUs.
Kepler: latest generation; more efficient, more cores; GTX 680 has 1536 cores.