

  1. Comment on bitonic merging; more CUDA performance tuning. CSE 6230: HPC Tools & Apps, Tue Sep 18, 2012

  2. ๏ Comment on bitonic merging, including ideas & hints for Lab 3. Note: some figures are taken from the Grama et al. book (2003), http://www-users.cs.umn.edu/~karypis/parbook/ . This book is also available online through the GT library – see our course website.

  3. Source: Grama et al. (2003)

  4. Summary so far: bitonicMerge(bitonic sequence) == sorted. Q: How do we get a bitonic sequence?
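As a concrete reference, this step can be sketched in host-side C++ (an illustration, not the course's CUDA kernel; the name bitonicMerge follows the slide):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of bitonicMerge: given a bitonic sequence in
// a[lo, lo+n), produce an ascending sorted sequence. n must be a
// power of two. Each level is one stage of "+" comparators:
// compare elements n/2 apart and keep (min, max).
void bitonicMerge(std::vector<float>& a, std::size_t lo, std::size_t n) {
    if (n <= 1) return;
    std::size_t half = n / 2;
    for (std::size_t i = lo; i < lo + half; ++i)
        if (a[i] > a[i + half])
            std::swap(a[i], a[i + half]);   // "+" comparator: (min, max)
    // Both halves are now bitonic and max(first half) <= min(second half),
    // so merging each half independently yields a sorted whole.
    bitonicMerge(a, lo, half);
    bitonicMerge(a, lo + half, half);
}
```

Using the "-" comparator (max, min) instead gives a descending merge; a full bitonic sort builds the bitonic input by merging subsequences in alternating directions.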

  5. Source: Grama et al. (2003)

  6. “⊕” = (min, max); “⊖” = (max, min). Source: Grama et al. (2003)

  10. Source: Grama et al. (2003)

  12. Bitonic sort parallel complexity (work-depth)?
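One standard way to answer this (a sketch; notation mine, with D = depth/span and W = work): a bitonic merge of n keys has log n comparator stages of n/2 independent compare-exchanges each, and the sort recursively builds the bitonic input before merging.

```latex
\begin{aligned}
D_{\mathrm{merge}}(n) &= \log n, &
W_{\mathrm{merge}}(n) &= \Theta(n \log n),\\
D_{\mathrm{sort}}(n) &= D_{\mathrm{sort}}(n/2) + \log n = \Theta(\log^2 n), &
W_{\mathrm{sort}}(n) &= 2\,W_{\mathrm{sort}}(n/2) + \Theta(n \log n) = \Theta(n \log^2 n).
\end{aligned}
```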

  13. [Figure: element indices 0–15 written in binary, 0000 through 1111]

  14. Block layout (p = 4): log p steps: comm req’d; log(n/p) steps: no comm. [Figure: element indices 0–15 in binary]

  15. Block layout (p = 4), costs: rounds of communication = O(log n); pairwise exchanges per round = O(P); words sent per exchange = O(n/P); total words sent = O(n log n). log p steps: comm req’d; log(n/p) steps: no comm. [Figure: element indices 0–15 in binary]

  17. Cyclic layout (p = 4): log(n/p) steps: no comm; log p steps: comm req’d. [Figure: element indices 0–15 in binary]

  18. These (block or cyclic) examples are binary-exchange algorithms. Question: can we get the “best” of both schemes?

  19. “Transpose” layout (p = 4): log p steps: no comm; all-to-all exchange; log(n/p) steps: no comm. [Figure: element indices 0–15 in binary]

  20. “Transpose” (p = 4), costs: rounds of communication = 1; pairwise exchanges per round = O(P²); words sent per exchange = O(n/P²); total words sent = O(n). log p steps: no comm; all-to-all exchange; log(n/p) steps: no comm. [Figure: element indices 0–15 in binary]

  22. Cyclic → all-to-all exchange → block ≡ matrix transpose:
      Cyclic:           Block:
       0  1  2  3        0  4  8 12
       4  5  6  7        1  5  9 13
       8  9 10 11        2  6 10 14
      12 13 14 15        3  7 11 15
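A small self-check of this equivalence (a hypothetical C++ illustration for the slide's n = 16, p = 4 case; in an MPI setting the transpose is what a single MPI_Alltoall realizes):

```cpp
#include <cassert>

// Relabeling a cyclic layout as a block layout is exactly a p x (n/p)
// matrix transpose, which one all-to-all exchange performs.
bool cyclicToBlockIsTranspose() {
    const int p = 4, m = 4;              // m = n/p keys per process
    int cyclic[4][4], block[4][4];
    for (int r = 0; r < p; ++r)
        for (int c = 0; c < m; ++c)
            cyclic[r][c] = c * p + r;    // cyclic: key = local_pos * p + rank
    for (int r = 0; r < p; ++r)
        for (int c = 0; c < m; ++c)
            block[c][r] = cyclic[r][c];  // the all-to-all == transpose
    for (int r = 0; r < p; ++r)
        for (int c = 0; c < m; ++c)
            if (block[r][c] != r * m + c)  // block: rank r holds [r*m, r*m + m)
                return false;
    return true;
}
```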

  23. Cost comparison:
                                “Binary exchange” (block or cyclic)   “Transpose” (cyclic → all-to-all → block)
      Rounds of communication   O(log n)                              1
      Pairwise exchanges/round  O(P)                                  O(P²)
      Total pairwise exchanges  O(P log n)                            O(P²)
      Words sent per exchange   O(n/P)                                O(n/P²)
      Total words sent          O(n log n)                            O(n)

  24. ๏ More CUDA tuning: occupancy and ILP. References:
      http://developer.nvidia.com/cuda/get-started-cuda-cc
      http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
      http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
      http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf

  25. https://piazza.com/class#fall2012/cse6230/52

  30. Occupancy. Occupancy = active warps / maximum active warps. Remember: resources are allocated for the entire block, and resources are finite; using too many resources per thread may limit occupancy. Potential occupancy limiters: register usage, shared memory usage, block size. Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.

  33. /opt/cuda-4.0/cuda/bin/nvcc -arch=sm_20 --ptxas-options=-v -O3 \
          -o bitmerge-cuda.o -c bitmerge-cuda.cu
      ptxas info : Compiling entry function '_Z12bitonicSplitjPfj' for 'sm_20'
      ptxas info : Function properties for _Z12bitonicSplitjPfj
          0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
      ptxas info : Used 8 registers, 52 bytes cmem[0]
      icpc -O3 -g -o bitmerge timer.o bitmerge.o bitmerge-seq.o \
          bitmerge-cilk.o bitmerge-cuda.o \
          -L/opt/cuda-4.0/cuda/bin/../lib64 \
          -Wl,-rpath /opt/cuda-4.0/cuda/bin/../lib64 -lcudart

  34. Occupancy limiters: registers. Check register usage by compiling with --ptxas-options=-v. Fermi has 32K registers per SM. Example 1: kernel uses 20 registers per thread (+1 implicit); active threads = 32K/21 = 1560 > 1536, thus occupancy = 1. Example 2: kernel uses 63 registers per thread (+1 implicit); active threads = 32K/64 = 512; 512/1536 = 0.333 occupancy. You can control register usage with the nvcc flag --maxrregcount. Occupancy = (active warps) / (max active warps). Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
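The arithmetic in these two examples can be sketched as follows (a back-of-envelope model using the slide's numbers, not the exact hardware rule — real allocation has per-warp granularity):

```cpp
#include <algorithm>
#include <cassert>

// Register-limited occupancy on Jinx's Fermi SMs: 32K registers per SM
// and 1536 max resident threads (48 warps x 32 threads), per the slide.
double regLimitedOccupancy(int regsPerThread) {
    const int regsPerSM  = 32 * 1024;
    const int maxThreads = 48 * 32;   // 48 warps/SM * 32 threads/warp = 1536
    int active = std::min(regsPerSM / regsPerThread, maxThreads);
    return static_cast<double>(active) / maxThreads;
}
```

regLimitedOccupancy(21) caps out at 1536 threads (occupancy 1), while regLimitedOccupancy(64) allows only 512 (occupancy ≈ 0.33), matching the two examples above.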

  38. Recall: reduction example.

  41. Recall: reduction example. b = 256 threads/block ⇒ shmem = 256 × (4 bytes/int) = 1024 bytes.
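What each block does in that example can be sketched on the host (names are mine; on the GPU each loop iteration is one step over shared memory, separated by __syncthreads()):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the tree reduction a single CUDA block performs in shared
// memory: b values are halved each step, so b = 256 takes log2(256) = 8 steps.
int blockReduce(std::vector<int> s) {   // s.size() must be a power of two
    for (std::size_t stride = s.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)   // these adds run in parallel on the GPU
            s[i] += s[i + stride];                 // barrier between steps on the GPU
    return s[0];
}
```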

  42. Occupancy limiters: shared memory. Check shared memory usage by compiling with --ptxas-options=-v (it reports shared memory per block). Fermi has either 16K or 48K of shared memory. Example 1 (48K shared memory): kernel uses 32 bytes of shared memory per thread; 48K/32 = 1536 threads ⇒ occupancy = 1. Example 2 (16K shared memory): kernel uses 32 bytes per thread; 16K/32 = 512 threads ⇒ occupancy = 0.333. Don’t use too much shared memory, and choose the L1/shared configuration appropriately. Occupancy = (active warps) / (max active warps). Jinx’s Fermi GPUs: 48 max active warps/SM, 32 threads/warp.
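The shared-memory limit can be sketched the same way (my illustration; smemPerBlock is what --ptxas-options=-v reports, and the per-SM total is 48K or 16K depending on the L1/shared configuration):

```cpp
#include <algorithm>
#include <cassert>

// Shared-memory-limited occupancy on Fermi: resident blocks per SM are
// bounded by (shared memory per SM) / (shared memory per block).
double smemLimitedOccupancy(int smemPerSM, int smemPerBlock, int threadsPerBlock) {
    const int maxThreads = 48 * 32;                 // 1536 resident threads/SM
    int residentBlocks = smemPerSM / smemPerBlock;  // blocks that fit at once
    int active = std::min(residentBlocks * threadsPerBlock, maxThreads);
    return static_cast<double>(active) / maxThreads;
}
```

With 32 bytes/thread and 256-thread blocks (8 KB/block), the 48K configuration fits 6 blocks = 1536 threads (occupancy 1), while 16K fits 2 blocks = 512 threads (≈ 0.33), matching the slide's two examples.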
