  1. GRAPHICS PROCESSOR PROGRAMMING IN CUDA Tamás Budavári / The Johns Hopkins University 7/18/2012

  2. How did I get into this? • Galaxy correlation function, 8 bins • Histogram of distances • State-of-the-art method • Dual-tree traversal ISSAC at HiPACC 7/18/2012

  3. What if? 800 × 800 bins

  4. Extending SQL Server • Dedicated service for direct access • Shared-memory IPC w/ on-the-fly data transform

  5. User-Defined Functions • Pair counts computed on the GPU • Returns 2D histogram as a table (i, j, cts) • Calculate the correlation fn in SQL


  7. Multiple GPUs in Parallel • Several C# proxies to launch jobs on more cards • Non-blocking SQL routines

  8. Async SQL Interface

  9. Baryon Acoustic Oscillations • 600 trillion galaxy pairs (Tian, Neyrinck, TB & Szalay 2011) • C for CUDA on GPUs

  10. Outline • Parallelism • Hardware • Programming • Multithreading • Coding for GPUs • CUDA, Thrust, …

  11. Parallelism • Data parallel • Same processing on different pieces of data • Task parallel • Simultaneous processing on the same data

  12. On all levels of the hierarchy • Clouds • Clusters • Machines • Cores • Threads

  13. Scalability • Scale up (vertically) • Add resources to a node: bigger memory, faster processor, … • Scale out (horizontally) • Use more of the threads, cores, machines, clusters, clouds, …

  14. Cluster

  15. High-Performance Computing • Traditional HPC clusters • Launching jobs on a cluster of machines • Use MPI to communicate among nodes • Message Passing Interface

  16. Queuing Systems • Used for batch jobs on computer clusters • Fair scheduling of user jobs • Group policies • Several systems • Portable Batch System (PBS) • Condor, etc…

  17. Computer

  18. Classification of Parallel Computers • Flynn’s Taxonomy

  19. SISD • Single Instruction, Single Data • Classical von Neumann machines • Single-threaded code (image: arstechnica.com)

  20. SIMD • Single Instruction, Multiple Data • On x86 • MMX: MultiMedia eXtensions • SSE: Streaming SIMD Extensions (image: arstechnica.com) • …and more… • GPU programming!!

  21. Amdahl’s Law of Parallelism • Speedup: S_P = T(1) / T(N) • With parallel fraction p: T(N) = T(1) · ((1 − p) + p/N), so S_P = 1 / ((1 − p) + p/N) • As N → ∞, S_P → 1/(1 − p) • Before looking into parallelism, speed up the serial code; the serial fraction (1 − p) sets the maximum speedup

  22. Chip

  23. Moore’s Law

  24. New Limitation is Energy! • Power to compute the same thing? • CPU is 10× less efficient than a digital signal processor • DSP is 10× less efficient than a custom chip • New design: multicores with slower clocks • But the interconnect is expensive • Need simpler components

  25. Emerging Architectures • Andrew Chien: 10×10 to replace the 90/10 rule • Custom modules on chip, cf. SoC in cellphones • Statistics on a video codec module?

  26. Emerging Architectures • Andrew Chien: 10×10 to replace the 90/10 rule • Custom modules on chip, cf. SoC in cellphones • Scientific analysis on such specialized units?

  27. GPUs Evolved to be General Purpose • Virtual world: simulation of real physics • C for CUDA and OpenCL • 512 cores, 25k threads, running 1 billion/sec • Old algorithms were built on the wrong assumption • Today processing is free but memory is slow • New programming paradigm!

  28. New Moore’s Law • In the number of cores • Faster than ever

  29. Programming

  30. Programming Languages • No one language to rule them all • And many to choose from

  31. Assembly • Low-level (almost) machine code • Different for each computer

  32. The “C” Language • Higher level but still close to hardware, i.e., fast • Pointers! • Many things written in C • Operating systems • Other languages, …

  33. Java • Pros • Memory management with garbage collection • Just-in-time compilation from ‘bytecode’ • Cons • Not so great performance • Hard to include legacy codes • New language features were an afterthought

  34. Python • Scripting to glue things together • Easy to wrap legacy codes • Lots of scientific modules and plotting • Good for prototyping

  35. Etc… • Perl • Lisp • Matlab • Haskell • Mathematica • OCaml • IDL • Erlang • R • Your favorite here…

  36. Programming in C • Skeleton of an application

  37. Programming in C • Files • Headers *.h • Source *.c • Building an application • Compile source • Link object files

  38. Using Pointers
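The pointer examples on this slide were an image; a minimal sketch of the idea (function names `increment` and `demo_pointers` are ours):

```c
#include <assert.h>

/* A pointer holds the address of a variable; dereferencing with *
   reads or writes the value stored at that address. */
void increment(int *x) {   /* pass by address: the callee can */
    *x += 1;               /* modify the caller's variable     */
}

int demo_pointers(void) {
    int a = 41;
    int *p = &a;       /* p points at a */
    assert(*p == 41);  /* reading a through p */
    increment(&a);     /* a becomes 42, modified in place */
    return *p;         /* p still points at a, so this is 42 */
}
```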

  39. Arrays • Dynamic arrays • Memory allocation • Freeing memory • Pointer arithmetic
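The bullets above can be sketched in a few lines of C (the function `sum_squares` is ours, for illustration):

```c
#include <stdlib.h>

/* Dynamic array: allocate, fill and walk it with pointer
   arithmetic, then free it. */
double sum_squares(int n) {
    double *v = (double *)malloc(n * sizeof(double)); /* allocation */
    if (v == NULL) return -1.0;
    for (int i = 0; i < n; i++)
        *(v + i) = (double)i * i;  /* pointer arithmetic: same as v[i] */
    double s = 0.0;
    for (double *p = v; p < v + n; p++)  /* walk the array by pointer */
        s += *p;
    free(v);   /* every malloc needs a matching free */
    return s;
}
```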

  40. Matrix, etc… • Pointers to pointers • Data allocated in v • Pointers in A • For 2D indexing • One can have • Matrix, tensor, … • Jagged arrays, …
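A sketch of the scheme described above: the data lives in one contiguous block `v`, and the pointer array `A` holds the row addresses, so `A[i][j]` gives 2D indexing (the helper names are ours):

```c
#include <stdlib.h>

/* Matrix via pointer-to-pointers: contiguous data plus row pointers. */
double **matrix_alloc(int rows, int cols) {
    double *v  = (double *)malloc(rows * cols * sizeof(double)); /* data */
    double **A = (double **)malloc(rows * sizeof(double *)); /* row ptrs */
    if (!v || !A) { free(v); free(A); return NULL; }
    for (int i = 0; i < rows; i++)
        A[i] = v + i * cols;   /* row i starts at offset i*cols in v */
    return A;
}

void matrix_free(double **A) {
    free(A[0]);  /* frees the contiguous data block v */
    free(A);     /* frees the row-pointer array */
}
```

The same trick with rows of different lengths yields jagged arrays; an extra level of pointers yields tensors.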

  41. Concurrency • Parallel actions

  42. Data Parallel Techniques • “Embarrassingly Parallel” • Decoupled problems, independent processing • MapReduce • Map • Reduce
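The map/reduce pattern can be sketched with function pointers in C (all names here are ours; a serial illustration of the structure, not a distributed framework):

```c
/* Map applies f to every element independently -- that step is
   embarrassingly parallel.  Reduce folds the results into one value. */
typedef double (*map_fn)(double);
typedef double (*reduce_fn)(double, double);

double map_reduce(const double *in, int n,
                  map_fn f, reduce_fn r, double init) {
    double acc = init;
    for (int i = 0; i < n; i++)   /* each f(in[i]) is independent, */
        acc = r(acc, f(in[i]));   /* so the map step parallelizes   */
    return acc;
}

static double square(double x) { return x * x; }
static double add(double a, double b) { return a + b; }
```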

  43. The Elevator Problem • People on multiple levels • Press the button…

  44. Mutual Exclusion • Multiple processes or threads • Access shared resources in critical sections • E.g., call the elevator when it’s time to go • Locking • Elevators, etc…
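A minimal sketch of a locked critical section with POSIX threads (function names are ours; link with -lpthread):

```c
#include <pthread.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* the shared resource */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

/* Run nthreads workers (up to 16) and return the final count. */
long run_counters(int nthreads) {
    pthread_t t[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return counter;  /* without the mutex, some updates would be lost */
}
```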

  45. Dining Philosophers • Five silent philosophers sit at the table • Alternate between eating and thinking • Need both forks, left & right, to eat • Must be picked up one by one! • Infinite food in front of them • How can they all think & eat forever?
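One classic deadlock-free answer is resource ordering: every philosopher picks up the lower-numbered fork first, so a cycle of waiting can never form. A bounded pthreads sketch (all names are ours; the slide itself gives no code):

```c
#include <pthread.h>

#define N 5
#define MEALS 100

static pthread_mutex_t fork_mtx[N];
static int meals[N];

static void *philosopher(void *arg) {
    int id = (int)(long)arg;
    int right = (id + 1) % N;
    int first  = id < right ? id : right;   /* lower-numbered fork */
    int second = id < right ? right : id;   /* higher-numbered fork */
    for (int m = 0; m < MEALS; m++) {
        pthread_mutex_lock(&fork_mtx[first]);   /* one by one, */
        pthread_mutex_lock(&fork_mtx[second]);  /* in fixed order */
        meals[id]++;                            /* eating */
        pthread_mutex_unlock(&fork_mtx[second]);
        pthread_mutex_unlock(&fork_mtx[first]);
    }
    return NULL;
}

/* Let everyone eat MEALS times; returns the total number of meals. */
int dine(void) {
    pthread_t t[N];
    for (int i = 0; i < N; i++) {
        pthread_mutex_init(&fork_mtx[i], NULL);
        meals[i] = 0;
    }
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)(long)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    int total = 0;
    for (int i = 0; i < N; i++) total += meals[i];
    return total;   /* N * MEALS if nobody starved or deadlocked */
}
```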

  46. Parallel Threads

  47. Threading • Concurrent parallelism in a machine

  48. Parallelism • Data parallel • Same processing on different pieces of data • Task parallel • Simultaneous processing on the same data

  49. Comparing Chips

  50. Hybrid Architecture • CPU launches kernels → GPU runs them → host syncs
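The launch/run/sync pattern above, written out as a minimal CUDA vector-add (an illustrative, untested sketch; the kernel and variable names are ours, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) ha[i] = 1.0f;

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);  /* host -> device */
    cudaMemcpy(db, ha, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);  /* launch: asynchronous */
    cudaDeviceSynchronize();                       /* sync: wait for the GPU */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    printf("c[0] = %f\n", hc[0]);                  /* 1 + 1 = 2 */
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hc);
    return 0;
}
```

The kernel launch returns immediately, which is what makes the two "run" boxes in the figure overlap with host work until the explicit sync.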

  51. Programming GPGPUs • CUDA • Low-level & high-level • OpenCL • DirectCompute • DirectX, etc… • C++ AMP (new!) • Accelerated Massive Parallelism

  52. CUDA

  53. Projects on CUDA Zone

  54. Currently Available • GPU-optimized sorting, RNG, BLAS, FFT, Hadamard… • SDK w/ examples • Nsight debugger! • Imaging routines • Python w/ PyCUDA • High-level C++ programming with Thrust
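A taste of the high-level style that Thrust enables (an illustrative, untested sketch assuming CUDA's bundled Thrust headers): sorting and summing a vector on the GPU without writing any kernels.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void) {
    thrust::device_vector<float> d(4);   /* lives in GPU memory */
    d[0] = 3.f; d[1] = 1.f; d[2] = 4.f; d[3] = 2.f;

    thrust::sort(d.begin(), d.end());    /* GPU sort, STL-style API */
    float total = thrust::reduce(d.begin(), d.end(), 0.0f);  /* GPU sum */

    printf("min = %f, sum = %f\n", (float)d[0], total);
    return 0;
}
```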

  55. Fermi • Previous generation • 20-series Tesla cards, e.g., C2050 • 400+ series GeForce cards, e.g., GTX 480 • IEEE-754 arithmetic • Standard floating point • Same as in the CPUs

  56. Kepler • Latest generation • More efficient, more cores • GTX 680 has 1536 cores
