
Autotuning OpenCL Workgroup Size for Stencil Patterns



  1. Autotuning OpenCL Workgroup Size for Stencil Patterns

  2. Chris Cummins http://chriscummins.cc

  3. Stencils & Workgroup size

  4. Stencils & Workgroup size

  5. input stencil output

  6. element border region input stencil output

  7. 10^6 elements 10^6 border regions input stencil output

  8. 10^6 elements 10^6 border regions input stencil output Multiple independent computations

  9. 10^6 elements 10^6 border regions input stencil output Multiple (overlapping) memory accesses

  10. element border region input stencil output

  11. element border region kernel input stencil output

  12. element work-item border region kernel input stencil output

  13. [Figure labels: Matrix, Tile, Workgroup, Work-item, Border region, wr (workgroup rows), wc (workgroup columns)]

  14. Stencils & Workgroup size

  15. Stencils & Workgroup size

  16. [Figure labels: Matrix, Tile, Workgroup, Work-item, Border region, wr (workgroup rows), wc (workgroup columns)]

  17. Workgroup size affects: mapping to SIMD hardware; device occupancy; local memory utilisation.
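
One way to see the link between workgroup size and local memory use is to count how many elements a workgroup has to load. The sketch below assumes each workgroup stages its own tile plus the surrounding border region in local memory; the function names and the border widths (north, south, east, west) are illustrative and not taken from the slides.

```python
def tile_elements(wc, wr, north, south, east, west):
    """Elements a wc x wr workgroup must load: its own tile plus the border."""
    return (wc + west + east) * (wr + north + south)


def redundancy(wc, wr, north, south, east, west):
    """Ratio of elements loaded to elements computed (1.0 means no overhead)."""
    return tile_elements(wc, wr, north, south, east, west) / (wc * wr)


# Two workgroup shapes with the same 256 work-items and a border of 1 on each side.
print(tile_elements(64, 4, 1, 1, 1, 1))   # (64 + 2) * (4 + 2) = 396
print(tile_elements(16, 16, 1, 1, 1, 1))  # (16 + 2) * (16 + 2) = 324
```

Under this assumption, two shapes with the same number of work-items can have different local memory footprints, which is one reason the shape matters as well as the total size.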

  18. Pop Quiz!

  19. What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on: 1. AMD HD7990? 2. Nvidia GTX Titan? 3. Intel i7-3820?

  20. What is the best workgroup size for … Gaussian blur, 512px x 512px, floats, on: 1. AMD HD7990? 64 x 4 2. Nvidia GTX Titan? 96 x 4 3. Intel i7-3820? 40 x 24

  21. What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running: 1. Sobel edge detection? 2. Heat equation? 3. Game of life?

  22. What is the best workgroup size for … Nvidia GTX 590, 4096 x 4096 elements running: 1. Sobel edge detection? 256 x 2 2. Heat equation? 128 x 2 3. Game of life? 32 x 6

  23. What is the best workgroup size for … 1. Intel i5-2430, game of life, 4096 x 4096? 2. Nvidia GTX 690, threshold, 512 x 512? 3. Intel i7-3820, NMS, 512 x 512?

  24. What is the best workgroup size for … 1. Intel i5-2430, game of life, 4096 x 4096? 196 x 20 2. Nvidia GTX 690, threshold, 512 x 512? 32 x 4 3. Intel i7-3820, NMS, 512 x 512? 88 x 8

  25. One size does not fit all!

  26. Choosing workgroup size depends on: 1. Device 2. Program 3. Dataset

  27. Optimisation space [Figure: performance plotted against workgroup rows and cols]

  28. Same stencil! Different device!

  29. Same device! Different stencil!

  30. Workgroup Size + Stencils 1. Non-linear, non-continuous 2. Device, program, dataset 3. Not all values are legal
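
The point that not all values are legal can be sketched as a simple filter. OpenCL reports limits such as `CL_KERNEL_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_WORK_ITEM_SIZES`, and OpenCL 1.x additionally requires the global size to be divisible by the workgroup size. The helper below is a hypothetical illustration: the limits are passed in as plain numbers and the divisibility check is optional, since some frameworks round the global size up instead.

```python
def is_legal(wc, wr, max_wg_size, max_item_sizes, global_size=None):
    """Return True if a wc x wr workgroup passes the assumed device limits."""
    if wc < 1 or wr < 1:
        return False
    if wc * wr > max_wg_size:  # total work-items per workgroup
        return False
    if wc > max_item_sizes[0] or wr > max_item_sizes[1]:  # per-dimension caps
        return False
    if global_size is not None:  # optional OpenCL 1.x divisibility rule
        cols, rows = global_size
        return cols % wc == 0 and rows % wr == 0
    return True


# Hypothetical device: at most 256 work-items per workgroup.
legal = [(wc, wr) for wc in (32, 64, 128) for wr in (2, 4, 8)
         if is_legal(wc, wr, 256, (1024, 1024))]
print(legal)  # [(32, 2), (32, 4), (32, 8), (64, 2), (64, 4), (128, 2)]
```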

  31. Autotuning

  32. Set a workgroup size Execute and time program

  33. Set a workgroup size Execute and time program Set a workgroup size Execute and time program

  34. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program

  35. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program

  36. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program … (continue until done / bored) Pick the best one you tried

  37. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program … (continue until done / bored) Pick the best one you tried (iterative compilation)
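
The loop on this slide can be written down as a minimal sketch. Here `measure` is a hypothetical callback that runs the program with a given workgroup size and returns its runtime in seconds; `time_program` shows one way to build such a callback around any runnable. The runtimes in the example are made-up stand-ins, not measurements from the talk.

```python
import time


def time_program(run, wg_size):
    """Execute run(wg_size) once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    run(wg_size)
    return time.perf_counter() - start


def iterative_search(measure, candidates):
    """Try every candidate workgroup size and keep the fastest one."""
    best_size, best_time = None, float("inf")
    for wg_size in candidates:
        elapsed = measure(wg_size)
        if elapsed < best_time:
            best_size, best_time = wg_size, elapsed
    return best_size, best_time


# Stand-in measurements (made-up numbers, not results from the talk).
fake_runtimes = {(32, 4): 2.0, (64, 4): 1.5, (128, 2): 1.8}
print(iterative_search(fake_runtimes.get, list(fake_runtimes)))  # ((64, 4), 1.5)
```

With a real program, `measure` could be `lambda size: time_program(run_stencil, size)`, where `run_stencil` is whatever launches the kernel. Every candidate costs one full execution, so the cost grows with the number of sizes tried.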

  38. BAD!

  39. BAD! Takes a loooong time

  40. BAD! Takes a loooong time. Must be repeated for every new “x”: device, dataset, program

  41. Let’s improve

  42. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program … (continue until done / bored) Pick the best one you tried

  43. Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program Set a workgroup size Execute and time program … (continue until done / bored) Pick the best one you tried. Each “Set a workgroup size, Execute and time program” pair = 1 data point

  44. Collect data points → Extract “features” → Train machine learning classifier; Extract “features” → Input to classifier
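
A minimal sketch of this pipeline, using a 1-nearest-neighbour classifier as a stand-in because it needs no external library. The slides do not say which classifiers were used at this point, and the feature vectors and labels below are invented purely for illustration.

```python
import math


def train(data_points):
    """'Training' for 1-nearest-neighbour just stores (features, best_size) pairs."""
    return list(data_points)


def predict(model, features):
    """Return the workgroup size of the stored data point closest to features."""
    nearest = min(model, key=lambda point: math.dist(point[0], features))
    return nearest[1]


# Invented data points: (feature vector, best workgroup size found by search).
model = train([
    ((32, 1, 18), (64, 4)),
    ((4, 1, 18), (40, 24)),
])

# Features of an unseen scenario go straight into the classifier.
print(predict(model, (28, 1, 18)))  # (64, 4)
```

Prediction costs one lookup instead of a fresh search, which is the improvement the following slides describe.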

  45. GOOD!

  46. GOOD! Can make predictions on unseen “x”: device, dataset, program

  47. GOOD! Can make predictions on unseen “x”: device, dataset, program. Many unanswered questions …

  48. Questions: 1. What features do we need? 2. What programs do we train on? 3. How do we make predictions?

  49. Questions: 1. What features do we need? 2. What programs do we train on? 3. How do we make predictions?

  50. 1. Device 2. Kernel 3. Dataset

  51. 1. Device 2. Kernel 3. Dataset

  52. or How many compute units? How much memory? Cache size? etc.

  53. 1. Device 2. Kernel 3. Dataset

  54. 1. Device 2. Kernel 3. Dataset

  55. 1. Device 2. Kernel 3. Dataset

  56. [Figure: stencil centred on x(i,j) with corners x(i-2,j+2), x(i+2,j+2), x(i-2,j-2), x(i+2,j-2) and border extents Sn, Ss, Sw, Se] How big is border region? What shape is it? How many instructions? What type of instructions? etc.

  57. 1. Device 2. Kernel 3. Dataset

  58. 1. Device 2. Kernel 3. Dataset

  59. 1. Device 2. Kernel 3. Dataset

  60. How big is the data? What type is the input? What type is the output?

  61. 1. Device 2. Kernel 3. Dataset

  62. 1. Device 2. Kernel 3. Dataset
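
The three feature groups could be flattened into one numeric vector along these lines. This is only a sketch: the field names follow the questions asked on the slides (compute units, memory, cache size, border region, instruction count, data size and types), but the exact feature set and all example values are assumptions.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class DeviceFeatures:
    compute_units: int
    global_mem_bytes: int
    cache_bytes: int


@dataclass(frozen=True)
class KernelFeatures:
    border_north: int
    border_south: int
    border_east: int
    border_west: int
    instruction_count: int


@dataclass(frozen=True)
class DatasetFeatures:
    cols: int
    rows: int
    input_type_bytes: int
    output_type_bytes: int


def feature_vector(device, kernel, dataset):
    """Concatenate device, kernel and dataset features into one flat tuple."""
    return astuple(device) + astuple(kernel) + astuple(dataset)


# Illustrative values only, not real device or kernel measurements.
vector = feature_vector(
    DeviceFeatures(compute_units=8, global_mem_bytes=2**30, cache_bytes=2**15),
    KernelFeatures(1, 1, 1, 1, instruction_count=40),
    DatasetFeatures(cols=512, rows=512, input_type_bytes=4, output_type_bytes=4),
)
print(len(vector))  # 12
```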

  63. Questions: 1. What features do we need? 2. What programs do we train on? 3. How do we make predictions?

  64. Questions: 1. What features do we need? ✓ 2. What programs do we train on? 3. How do we make predictions?

  65. 1. Learn by example 2. Learn by exploration

  66. 1. Learn by example: Use benchmark programs; Hope that they are representative 2. Learn by exploration

  67. 1. Learn by example 2. Learn by exploration

  68. 1. Learn by example 2. Learn by exploration: Create own benchmarks; Explore (the huge!) program space

  69. Questions: 1. What features do we need? ✓ 2. What programs do we train on? 3. How do we make predictions?

  70. Questions: 1. What features do we need? ✓ 2. What programs do we train on? ✓ 3. How do we make predictions?

  71. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  72. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  73. 32 x 4 128 x 2 48 x 12 Predict category (optimal workgroup size) for scenario

  74. 32 x 4 128 x 2 48 x 12 Predict category (optimal workgroup size) for scenario

  75. 32 x 4 128 x 2 48 x 12 Predict category (optimal workgroup size) for scenario

  76. 32 x 4 128 x 2 48 x 12 (incorrect!) Predict category (optimal workgroup size) for scenario

  77. 32 x 4 (invalid!) 128 x 2 48 x 12 Predict category (optimal workgroup size) for scenario

  78. Fallback Handlers 1. Baseline 2. Random 3. Nearest Neighbour

  79. Fallback Handlers 1. Baseline: “pick something we know is safe” 2. Random 3. Nearest Neighbour

  80. Fallback Handlers 1. Baseline 2. Random: “pick a random value” 3. Nearest Neighbour

  81. Fallback Handlers 1. Baseline 2. Random 3. Nearest Neighbour: “pick the closest value we think will work”
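
The three fallback handlers can be sketched as interchangeable functions that are consulted only when the classifier predicts an invalid workgroup size. The names, the `SAFE_SIZE` constant and the use of Euclidean distance in (cols, rows) space are assumptions made for this illustration.

```python
import random

SAFE_SIZE = (4, 4)  # hypothetical size assumed to be legal everywhere


def baseline(predicted, legal_sizes):
    """Pick something we know is safe."""
    return SAFE_SIZE


def random_fallback(predicted, legal_sizes):
    """Pick a random legal value."""
    return random.choice(sorted(legal_sizes))


def nearest_neighbour(predicted, legal_sizes):
    """Pick the legal size closest to the prediction (squared distance)."""
    return min(legal_sizes,
               key=lambda s: (s[0] - predicted[0]) ** 2 + (s[1] - predicted[1]) ** 2)


def choose(predicted, legal_sizes, fallback):
    """Use the prediction if it is legal, otherwise defer to the fallback."""
    if predicted in legal_sizes:
        return predicted
    return fallback(predicted, legal_sizes)


legal_sizes = {(4, 4), (32, 4), (64, 4)}
print(choose((128, 8), legal_sizes, nearest_neighbour))  # (64, 4)
print(choose((128, 8), legal_sizes, baseline))           # (4, 4)
```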

  82. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  83. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  84. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  85. Predict runtime of program for workgroup size Search for lowest runtime
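
With a runtime regressor, choosing a workgroup size becomes a search over predictions instead of real executions. In this sketch `predict_runtime` is a hypothetical trained model taking a feature vector and a candidate size; the toy predictor below is a placeholder with no connection to the results in the talk.

```python
def best_by_runtime(predict_runtime, features, candidates):
    """Return the candidate workgroup size with the lowest predicted runtime."""
    return min(candidates, key=lambda size: predict_runtime(features, size))


def toy_predictor(features, size):
    """Placeholder model: predicted runtime grows with distance from 64 x 4."""
    return abs(size[0] - 64) + abs(size[1] - 4)


candidates = [(32, 4), (64, 4), (128, 2)]
print(best_by_runtime(toy_predictor, (), candidates))  # (64, 4)
```

Restricting `candidates` to legal sizes means this approach never returns an invalid value, unlike the classifier.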

  86. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  87. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  88. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  89. Predict speedup of workgroup size A over B for program Search for highest speedup
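
The speedup variant can be sketched the same way, with one twist: the baseline size has a speedup of 1.0 over itself, so it is returned whenever no candidate is predicted to beat it. `predict_speedup` is a hypothetical model of the speedup of a candidate size over the baseline size, and the numbers are invented.

```python
def best_by_speedup(predict_speedup, features, candidates, baseline):
    """Return the size with the highest predicted speedup over the baseline."""
    best_size, best_speedup = baseline, 1.0  # baseline over itself is 1.0
    for size in candidates:
        speedup = predict_speedup(features, size, baseline)
        if speedup > best_speedup:
            best_size, best_speedup = size, speedup
    return best_size


# Invented predictions of speedup over a (64, 4) baseline.
fake_speedups = {(32, 4): 0.8, (128, 2): 1.3}


def fake_model(features, size, baseline):
    return fake_speedups[size]


print(best_by_speedup(fake_model, (), list(fake_speedups), (64, 4)))  # (128, 2)
```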

  90. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  91. 1. Classifier 2. Runtime Regressor 3. Speedup Regressor

  92. Questions: 1. What features do we need? ✓ 2. What programs do we train on? ✓ 3. How do we make predictions?

  93. Questions: 1. What features do we need? ✓ 2. What programs do we train on? ✓ 3. How do we make predictions? ✓

  94. Experiment

  95. Implementation: Modified SkelCL stencil pattern; Python server process for autotuning; 5 classifiers, random forest regressor

  96. Experimental Setup: 6 stencil benchmarks + synthetic; 7 different GPUs & CPUs; 4 dataset sizes; Exhaustive search of workgroup size space for each

  97. Results
