

  1. CUB: A pattern of “collective” software design, abstraction, and reuse for kernel-level programming
     Duane Merrill, Ph.D., NVIDIA Research

  2. What is CUB?
     1. A design model for collective kernel-level primitives: how to make reusable software components for SIMT groups (warps, blocks, etc.)
     2. A library of collective primitives: block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
     3. A library of global primitives built from the collectives: device-reduce, device-sort, device-scan, etc., demonstrating collective composition, performance, and performance-portability
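
To give these a concrete shape before diving in, here is a minimal sketch of one such collective, a block-wide sum via cub::BlockReduce. The 128-thread block size and int element type are illustrative assumptions, not anything the slides prescribe:

```cuda
#include <cub/cub.cuh>

// Minimal sketch: each 128-thread block reduces its tile to one aggregate.
__global__ void BlockSumKernel(const int *d_in, int *d_out)
{
    // Specialize the collective for this block shape and item type
    typedef cub::BlockReduce<int, 128> BlockReduceT;

    // Shared-memory scratch required by the collective
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    // Each thread contributes one item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Collectively compute the block-wide sum (result is valid in thread 0)
    int aggregate = BlockReduceT(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}
```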

  3. Outline
     1. Software reuse
     2. SIMT collectives: the “missing” CUDA abstraction layer
     3. The soul of collective component design
     4. Using CUB’s collective primitives
     5. Making your own collective primitives
     6. Other Very Useful Things in CUB
     7. Final thoughts

  4-5. Software reuse
     Abstraction & composability are fundamental design principles:
     - Reduce redundant programmer effort: save time, energy, and money; reduce buggy software
     - Encapsulate complexity: empower productivity-oriented programmers
     - Insulate programmers from the changing capabilities of the underlying hardware (NVIDIA has produced nine different CUDA GPU architectures since 2008!)
     Software reuse empowers a durable programming model

  6. Outline (recap)
     1. Software reuse
     2. SIMT collectives: the “missing” CUDA abstraction layer
     3. The soul of collective component design
     4. Using CUB’s collective primitives
     5. Making your own collective primitives
     6. Other Very Useful Things in CUB
     7. Final thoughts

  7. Parallel programming is hard…

  8-9. No, cooperative parallel programming is hard…
     - Parallel decomposition and grain sizing
     - Bookkeeping control structures
     - Synchronization
     - Memory access conflicts, coalescing, etc.
     - Deadlock, livelock, and data races
     - Occupancy constraints from SMEM, RF, etc.
     - Plurality of state
     - Algorithm selection and instruction scheduling
     - Plurality of flow control (divergence, etc.)
     - Special hardware functionality, instructions, etc.

  10. CUDA today
      [Diagram: Application → CUDA function stub → threadblock, threadblock, threadblock, …]

  11. Software abstraction in CUDA
      [Diagram: Application → CUDA function stub → kernel threadblocks]
      PROBLEM: virtually every CUDA kernel written today is cobbled together from scratch. This is a tunability, portability, and maintenance concern.

  12. Software abstraction in CUDA
      [Diagram: Application (scalar interface) → kernel function stub → collective interface → collective functions (BlockLoad, BlockSort, BlockStore) within each threadblock]
      Collective software components reduce development cost and hide complexity, bugs, etc.
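
A minimal sketch of the tile-sorting pipeline in this diagram, assuming 128 threads with 4 items each and using cub::BlockRadixSort to stand in for the generic BlockSort box:

```cuda
#include <cub/cub.cuh>

// Sketch of BlockLoad -> BlockSort -> BlockStore over one tile per block
// (128 threads x 4 items per thread is an illustrative tile shape).
template <int BLOCK_THREADS = 128, int ITEMS_PER_THREAD = 4>
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE>            BlockLoadT;
    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    typedef cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                            cub::BLOCK_STORE_TRANSPOSE>           BlockStoreT;

    // The collectives run in sequence, so their scratch space can be unioned
    __shared__ union {
        typename BlockLoadT::TempStorage      load;
        typename BlockRadixSortT::TempStorage sort;
        typename BlockStoreT::TempStorage     store;
    } temp_storage;

    int tile_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    int items[ITEMS_PER_THREAD];

    BlockLoadT(temp_storage.load).Load(d_in + tile_offset, items);
    __syncthreads();  // barrier between collectives that reuse shared memory

    BlockRadixSortT(temp_storage.sort).Sort(items);
    __syncthreads();

    BlockStoreT(temp_storage.store).Store(d_out + tile_offset, items);
}
```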

  13. What do these applications have in common?
      [Figure: four worked examples: parallel radix sort, parallel sparse graph traversal, parallel BWT compression, parallel SpMV]

  14. What do these applications have in common? Block-wide prefix-scan:
      - Scan for partitioning: parallel radix sort
      - Scan for enqueueing: parallel sparse graph traversal
      - Scan for reduction (move-to-front): parallel BWT compression
      - Scan for solving segmented recurrences: parallel SpMV
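
The common building block, a block-wide exclusive prefix sum, looks roughly like this with cub::BlockScan (a sketch; 128-thread blocks and one item per thread are my assumptions):

```cuda
#include <cub/cub.cuh>

// Sketch: block-wide exclusive prefix sum, one item per thread.
__global__ void BlockPrefixSumKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int thread_data = d_data[idx];

    // Each thread receives the sum of all lower-ranked threads' inputs
    BlockScanT(temp_storage).ExclusiveSum(thread_data, thread_data);

    d_data[idx] = thread_data;
}
```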

  15. Examples of parallel scan data flow
      [Figure: two scan data-flow networks, 16 threads contributing 4 items each.
       Left: Brent-Kung hybrid (work-efficient: ~130 binary ops, depth 15).
       Right: Kogge-Stone hybrid (depth-efficient: ~170 binary ops, depth 12).]

  16. CUDA today
      Kernel programming is complicated
      [Diagram: Application → CUDA function stub → threadblock, threadblock, threadblock, …]


  18. Outline (recap)
     1. Software reuse
     2. SIMT collectives: the “missing” CUDA abstraction layer
     3. The soul of collective component design
     4. Using CUB’s collective primitives
     5. Making your own collective primitives
     6. Other Very Useful Things in CUB
     7. Final thoughts

  19-22. Collective composition
      CUB primitives are easily nested & sequenced
      [Diagram, built up over four frames: the application's CUDA stub launches threadblocks that each run a BlockSort; BlockSort is composed from BlockRadixRank and BlockExchange; BlockRadixRank is built on BlockScan, which is in turn built on WarpScan]
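
To illustrate the nesting, a block-wide exclusive sum can be hand-assembled from cub::WarpScan in much the same way BlockScan itself is composed. The helper below is a hypothetical sketch (the name and the 128-thread / 4-warp block shape are my assumptions, not CUB's internals verbatim):

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: build a block-wide exclusive sum out of warp-wide
// scans, assuming a 128-thread block (4 warps of 32 threads).
__device__ int CustomBlockExclusiveSum(int input)
{
    typedef cub::WarpScan<int> WarpScanT;
    __shared__ typename WarpScanT::TempStorage warp_storage[4];
    __shared__ int warp_aggregates[4];

    int warp_id = threadIdx.x / 32;
    int output, warp_aggregate;

    // Step 1: each warp scans its own 32 inputs
    WarpScanT(warp_storage[warp_id]).ExclusiveSum(input, output, warp_aggregate);
    if (threadIdx.x % 32 == 0)
        warp_aggregates[warp_id] = warp_aggregate;
    __syncthreads();

    // Step 2: offset each warp's result by the aggregates of preceding warps
    // (a serial loop is fine for only 4 warps)
    int warp_prefix = 0;
    for (int w = 0; w < warp_id; ++w)
        warp_prefix += warp_aggregates[w];

    return warp_prefix + output;
}

__global__ void CustomScanKernel(int *d_data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[idx] = CustomBlockExclusiveSum(d_data[idx]);
}
```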

  23-24. Tunable composition
      Flexible grain size and parallel width (the “shape” of the composition remains the same)
      [Diagram: the same BlockSort composition, re-tuned for a different tile shape]

  25-28. Tunable composition
      Algorithmic-variant selection (again, parallel width and the composition “shape” are unchanged)
      [Diagram, repeated over four frames: the same BlockSort composition with different algorithmic variants selected]
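
In CUB these knobs surface as template parameters: parallel width, grain size, and the algorithmic variant are compile-time choices, so retargeting a kernel means changing constants rather than rewriting it. A sketch (the parameter values shown are illustrative, not tuned):

```cuda
#include <cub/cub.cuh>

// Sketch: tile shape and algorithmic variant are compile-time tuning knobs.
template <
    int                     BLOCK_THREADS,     // parallel width
    int                     ITEMS_PER_THREAD,  // grain size
    cub::BlockScanAlgorithm SCAN_ALGORITHM>    // algorithmic variant
__global__ void TunableScanKernel(int *d_data)
{
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGORITHM> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int tile_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    int items[ITEMS_PER_THREAD];

    // Blocked arrangement: each thread owns a contiguous run of items
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        items[i] = d_data[tile_offset + threadIdx.x * ITEMS_PER_THREAD + i];

    BlockScanT(temp_storage).ExclusiveSum(items, items);

    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_data[tile_offset + threadIdx.x * ITEMS_PER_THREAD + i] = items[i];
}

// e.g., a work-efficient raking variant:
//   TunableScanKernel<128, 4, cub::BLOCK_SCAN_RAKING><<<grid, 128>>>(d_data);
// vs. a depth-efficient warp-scans variant:
//   TunableScanKernel<256, 2, cub::BLOCK_SCAN_WARP_SCANS><<<grid, 256>>>(d_data);
```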

  29. CUB: device-wide performance-portability
      vs. Thrust and NPP across the last 4 major NVIDIA architecture families (Tesla, Fermi, Kepler, Maxwell)
      [Figure: four bar charts comparing throughput on Tesla C1060, C2050, and K20C:
       global radix sort (billions of 32b keys/sec): CUB vs. Thrust v1.7.1;
       global prefix scan (billions of 32b items/sec): CUB vs. Thrust v1.7.1;
       global histogram (billions of 8b items/sec): CUB vs. NPP;
       global partition-if (billions of 32b inputs/sec): CUB vs. Thrust v1.7.1]
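
For reference, the device-wide primitives charted above are invoked with CUB's two-pass pattern: the first call only reports how much temporary storage is needed. A sketch for cub::DeviceScan (error handling omitted):

```cuda
#include <cub/cub.cuh>

// Sketch: CUB's two-pass invocation for a device-wide exclusive prefix sum.
void ExclusiveSumOnDevice(int *d_in, int *d_out, int num_items)
{
    void   *d_temp_storage     = nullptr;
    size_t  temp_storage_bytes = 0;

    // Pass 1: with a null scratch pointer, only the required size is computed
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Pass 2: run the scan
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}
```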
