
A. Baniasadi N. J. Dimopoulos University of Victoria Presenter: S. - PowerPoint PPT Presentation



  1. A. Jooya, A. Baniasadi, N. J. Dimopoulos (University of Victoria). Presenter: S. Agathos (University of Ioannina)

  2. The Goal
     - Introduce a fast, low-cost and effective approach to optimizing resource allocation in GPUs.
     - Produces the optimum configuration in 84% of the cases.
     - Produces the second-best configuration in the rest of the cases (less than 3.5% error).
     - Reduces the number of explorations by as much as 78%.

  3. Outline
     - GPU Architecture
     - Plackett and Burman Design Method
     - Knapsack Optimization Technique
     - Proposed Method
     - Results
     - Conclusion

  4. GPU Architecture
     - Figure: a user program split into a serial section (CPU) and a parallel section (GPU).

  5. GPU Architecture
     - Figure: an array of streaming multiprocessors (SMs) connected to off-chip DRAM memory.
     - Streaming multiprocessor: warp scheduler, warp pool (warps), register file, SIMD pipeline, shared memory.
     - Memory hierarchy: L1 cache, constant cache, texture cache.

  6. Design Parameters
     - Number of SMs
     - SIMD pipeline width
     - Warp size
     - Texture cache size
     - L1 cache size
     - Constant cache size
     - Number of memory controllers
     - Register file size
     - …

  7. Parameters Under Study
     - Number of SMs
     - SIMD pipeline width
     - Warp size
     - Texture cache size
     - L1 cache size
     - Constant cache size
     - Number of memory controllers
     - Register file size
     - …

  8. Proposed method
     - Suggests the best application-specific configuration for different available chip budgets.
     - Plackett & Burman design: measures the effect of each parameter on performance.
     - Knapsack problem: determines the configuration of parameters based on their effect on performance, such that the configuration
       - leads to the optimum performance
       - meets the budget

  9. Plackett & Burman Design (PB)
     - N parameters; each one takes L values
     - PB considers only the min and max values of each parameter
     - X experiments (X is the next multiple of 4 strictly greater than N)
     - PB with fold-over also captures the effect of two-parameter interactions (doubles the number of experiments)
     - Example: 4 parameters give X = 8, so fold-over needs 16 experiments

  10. PB Design Table
      - Table: 16 experiments (exp 1-16) over the four parameters A, B, C, D, each set to its high (+) or low (-) value, with the measured performance T1 to T16 of each experiment.
      - The effects S_A, S_B, S_C, S_D of the parameters are computed from the sign columns and the T values.
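The construction behind such a table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: it assumes the standard 8-run Plackett-Burman generator row, supports at most 7 parameters, and uses the common convention that a parameter's effect is the absolute value of the sum of sign × performance over all experiments. The function names `pb_design` and `pb_effects` and the synthetic performance model are hypothetical.

```python
# Standard Plackett-Burman generator row for the 8-run design (7 columns).
PB8_GENERATOR = [+1, +1, +1, -1, +1, -1, -1]


def pb_design(num_params, foldover=True):
    """Return a list of experiments; each experiment is a list of +1/-1 levels.

    Sketch limited to the 8-run design, i.e. at most 7 parameters.
    """
    n = len(PB8_GENERATOR)
    if not 1 <= num_params <= n:
        raise ValueError("this sketch supports 1 to 7 parameters")
    # Rows 1..7 are cyclic shifts of the generator; row 8 is all low values.
    rows = [[PB8_GENERATOR[(col - shift) % n] for col in range(n)]
            for shift in range(n)]
    rows.append([-1] * n)
    # Keep only as many columns as there are parameters.
    rows = [row[:num_params] for row in rows]
    if foldover:
        # Fold-over: append every experiment with all signs flipped.
        rows += [[-level for level in row] for row in rows]
    return rows


def pb_effects(design, perf):
    """Effect of each parameter: |sum over experiments of sign * performance|."""
    num_params = len(design[0])
    return [abs(sum(row[j] * t for row, t in zip(design, perf)))
            for j in range(num_params)]


# Synthetic (made-up) performance model: A matters most, C a little, B and D not at all.
design = pb_design(4)
perf = [10 + 5 * a + 1 * c for a, b, c, d in design]
effects = pb_effects(design, perf)
print(len(design), effects)
```

With 4 parameters the design has 8 base experiments plus 8 fold-over experiments, and the synthetic effects single out A and C, which is the kind of ranking the knapsack step consumes.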

  11. Knapsack Problem
      - A constrained optimization problem
      - $v_j$ = value of an item of type $j$; determined by the PB design ($S_x$)
      - $w_j$ = weight of an item of type $j$; transistor count of unit $j$
      - $b_j$ = upper bound on the availability of items of type $j$
      - $C$ = capacity of the knapsack; maximum number of transistors
      - Select a number $x_j$ of items of each type so as to

        $$\text{maximize } z = \sum_{j=1}^{n} v_j x_j \quad \text{subject to} \quad \sum_{j=1}^{n} w_j x_j \le C, \qquad 1 \le x_j \le b_j \text{ and integer}, \quad j \in N = \{1, \dots, n\}.$$
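As a rough illustration of this formulation, the sketch below solves a small instance by exhaustive enumeration (not an ILP solver, which the slides use). The values are the PB effects listed for the NN benchmark on slide 15, and the unit sizes and bounds follow slide 16. The per-unit transistor costs are an assumption: they are obtained by scaling the costs on slide 13 linearly with size, which the slides do not state. The function name `solve_bounded_knapsack` is hypothetical.

```python
from itertools import product


def solve_bounded_knapsack(values, weights, upper_bounds, capacity):
    """Maximize sum(v_j * x_j) subject to sum(w_j * x_j) <= capacity,
    with 1 <= x_j <= b_j and x_j integer, by trying every combination."""
    best_x, best_z = None, None
    ranges = [range(1, b + 1) for b in upper_bounds]
    for x in product(*ranges):
        weight = sum(w * xi for w, xi in zip(weights, x))
        if weight > capacity:
            continue  # violates the transistor budget
        z = sum(v * xi for v, xi in zip(values, x))
        if best_z is None or z > best_z:
            best_x, best_z = x, z
    return best_x, best_z


# Order of parameters: MCB, DL1 cache, Const cache, Register file.
values = [1054553, 1508787, 5307, 535]      # PB effects for NN (slide 15)
weights = [
    0.3,          # one memory controller (slide 13)
    87 * 8 / 32,  # ASSUMPTION: 8 KB DL1 unit, linear scaling of the 32 KB cost
    52 * 0.5 / 32,  # ASSUMPTION: 512 B constant-cache unit, linear scaling
    170 * 4 / 32,   # ASSUMPTION: 4 KB register-file unit, linear scaling
]
upper_bounds = [4, 4, 2, 2]                 # 1-4 MCB, 8-32 KB, 512 B-1 KB, 4-8 KB
budget = 67                                 # million transistors

x, z = solve_bounded_knapsack(values, weights, upper_bounds, budget)
print(x, z)
```

Enumeration is only practical because each parameter has a handful of units; for larger instances an ILP or dynamic-programming solver would be the natural choice.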

  12. GPU configuration

      | Parameter | Value |
      | --- | --- |
      | Number of shaders | 30 |
      | Shader clock frequency | 1.3 GHz |
      | Max threads per shader | 1024 |
      | SIMD pipeline width | 32 |
      | Warp width | 32 |
      | Scheduling | PDOM |
      | Max CTA/shader | 8 |

  13. Transistor costs

      | Parameter | Size/number | Cost (million transistors) |
      | --- | --- | --- |
      | Memory controller | 1 | 0.3 |
      | DL1 cache | 32 KB | 87 |
      | Constant cache | 32 KB | 52 |
      | Register file | 32 KB | 170 |

  14. Benchmarks and PB design values

      | Benchmark | MCB | DL1 cache | Constant cache | Register file |
      | --- | --- | --- | --- | --- |
      | AES | 1-3 | 1 KB, 2 KB | 512 B-8 KB | 4 KB, 8 KB |
      | MonteCarlo | 1-5 | 32 KB, 64 KB | 1 KB, 2 KB | 8 KB, 16 KB |
      | LIB | 3-5 | 32 KB-128 KB | 64 KB-256 KB | 2 KB, 4 KB |
      | Ray | 1-3 | 1 KB, 2 KB | 1 KB-32 KB | 8 KB, 16 KB |
      | NN | 1-4 | 8 KB-32 KB | 512 B, 1 KB | 4 KB, 8 KB |
      | Scan | 1-3 | 512 B-2 KB | N/A | 2 KB-8 KB |
      | Srad | 1-6 | 2 KB-256 KB | N/A | 4 KB-16 KB |
      | BlackScholes | 1-4 | 2 KB-8 KB | N/A | 4 KB, 8 KB |
      | Hotspot | 1, 2 | 1 KB, 2 KB | N/A | 8 KB, 16 KB |
      | Matrix | 1-5 | 1 KB, 2 KB | N/A | 4 KB, 8 KB |
      | Backprop | 1-3 | 1 KB, 2 KB | N/A | 4 KB, 8 KB |
      | FWT | 1-4 | 4 KB-32 KB | N/A | 4 KB, 8 KB |
      | LPS | 1-3 | 1 KB, 2 KB | N/A | 2 KB, 4 KB |

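  15. PB results (R = rank of the parameter's effect within the benchmark)

      | Benchmark | MCB | R | DL1 cache | R | Const cache | R | Register file | R |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | AES | 57100 | 1 | 17742 | 4 | 35500 | 2 | 18922 | 3 |
      | MonteCarlo | 445073 | 1 | 293537 | 2 | 46469 | 3 | 26229 | 4 |
      | LIB | 1482938 | 3 | 2284182 | 2 | 1051960 | 4 | 6037368 | 1 |
      | Ray | 34441615 | 2 | 38417897 | 1 | 1405 | 4 | 4684461 | 3 |
      | NN | 1054553 | 2 | 1508787 | 1 | 5307 | 3 | 535 | 4 |
      | Scan | 12372 | 1 | 9708 | 2 | N/A | | 1748 | 3 |
      | Srad | 139159 | 1 | 2125 | 2 | N/A | | 117 | 3 |
      | BlackScholes | 3110257 | 1 | 142613 | 2 | N/A | | 29369 | 3 |
      | Hotspot | 433926 | 1 | 133392 | 2 | N/A | | 4584 | 3 |
      | Matrix | 13693 | 1 | 7931 | 2 | N/A | | 1775 | 3 |
      | Backprop | 14864 | 1 | 12020 | 2 | N/A | | 476 | 3 |
      | FWT | 127799 | 1 | 73903 | 2 | N/A | | 459 | 3 |
      | LPS | 340704 | 1 | 136082 | 2 | N/A | | 16736 | 3 |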
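The rank columns follow directly from the effect magnitudes: the parameter with the largest PB effect gets rank 1. A minimal sketch, using the AES and LIB effect values from this slide (the helper name `rank_effects` is hypothetical):

```python
def rank_effects(effects):
    """Map each parameter name to its rank; rank 1 is the largest effect."""
    ordered = sorted(effects, key=effects.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# PB effects as listed on the slide.
aes = {"MCB": 57100, "DL1 cache": 17742, "Const cache": 35500, "Register file": 18922}
lib = {"MCB": 1482938, "DL1 cache": 2284182, "Const cache": 1051960, "Register file": 6037368}

print(rank_effects(aes))
print(rank_effects(lib))
```

For AES this puts the number of memory controllers first and the DL1 cache last, while for LIB the register file dominates.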

  16. Example: Knapsack result for NN benchmark
      - Explored budget: 44-135 million transistors
        - MCB: 1-4 units
        - DL1 cache: 8-32 KB
        - Const cache: 512 B-1 KB
        - Register file: 4-8 KB
      - Region 2: 56-67 million transistors
        - MCB: 4 units
        - DL1 cache: 2 units (16 KB)
        - Const cache: 1 unit (512 B)
        - Register file: 1 unit (4 KB)
      - Figure: number of units of each parameter versus transistor count.

  17. Execution Platform
      - GPGPU-Sim 2.1
      - Computing resource: Hermes cluster (Westgrid)
      - 88 nodes, dual-socket X5550 (@ 2.66 GHz)
      - Exhaustive simulation time: 11 days, 8 hours, 40 minutes and 56 seconds (on a single node)
      - Proposed method time: 1 day, 7 hours, 48 minutes and 21 seconds (on a single node)
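The two single-node times can be compared with a short calculation. This is only arithmetic on the figures quoted on this slide; the resulting ratio describes simulation time and is a different measure from the reduction in the number of explorations reported elsewhere in the talk.

```python
from datetime import timedelta

# Single-node times quoted on the slide.
exhaustive = timedelta(days=11, hours=8, minutes=40, seconds=56)
proposed = timedelta(days=1, hours=7, minutes=48, seconds=21)

speedup = exhaustive / proposed          # timedelta / timedelta gives a float
time_saved = 1 - proposed / exhaustive   # fraction of simulation time avoided

print(f"exhaustive: {exhaustive.total_seconds():.0f} s")
print(f"proposed:   {proposed.total_seconds():.0f} s")
print(f"speedup:    {speedup:.2f}x, time saved: {time_saved:.1%}")
```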

  18. ILP Result Validation
      - Explored budget: 44-135 million transistors
        - MCB: 1-4 units
        - DL1 cache: 8-32 KB
        - Const cache: 512 B-1 KB
        - Register file: 4-8 KB
      - Region 2: 56-66 million transistors
        - MCB: 4 units
        - DL1 cache: 2 units (16 KB)
        - Const cache: 1 unit (512 B)
        - Register file: 1 unit (4 KB)
      - Figure: performance (IPC) of the configurations in Region 2, labeled by their transistor counts (65.4 to 66.8 million), with the ILP-suggested configuration and the optimum marked.

  19. ILP Result Validation
      - Figure: IPC of the best configuration versus the ILP-suggested configuration for each benchmark (AES, BlackScholes, Ray, Srad, FWT, LPS, Scan, BackProp, MonteCarlo, HotSpot, NN, LIB, Matrix).

  20. Mismatch Region Details
      - Figure: IPC of the best configuration, the ILP-suggested configuration and the other configurations in regions where the ILP suggestion is not the best one.
      - Panels shown: AES (Region 1) and Srad (Region 5); annotated errors: 0.66% and 1.2%.

  21. Conclusion
      - The proposed method delivers the optimum-performing configuration in 48 out of 57 cases.
      - In the other nine cases, performance lags the optimum by less than 3.5%.
      - Reduces the number of explorations by as much as 78%.

  22. Questions?
