A. Baniasadi N. J. Dimopoulos University of Victoria Presenter: S. - PowerPoint PPT Presentation

A. Jooya, A. Baniasadi, N. J. Dimopoulos (University of Victoria)
Presenter: S. Agathos (University of Ioannina)

The Goal
• Introduce a fast, low-cost and effective approach to optimizing resource allocation in GPUs.
• Produces the optimum configuration in 84% of the cases.
• Produces the second-best configuration in the remaining cases (less than 3.5% error).
• Reduces the number of explorations by as much as 78%.

Outline
• GPU Architecture
• Plackett and Burman Design Method
• Knapsack Optimization Technique
• Proposed Method
• Results
• Conclusion

GPU Architecture
[Figure: a user program with serial and parallel sections, mapped to the CPU and the GPU]

GPU Architecture
[Figure: an array of streaming multiprocessors (SMs) connected to off-chip DRAM. Each SM contains a warp scheduler, a warp pool, a register file, a SIMD pipeline and a memory hierarchy made up of shared memory, L1 cache, constant cache and texture cache]

Design Parameters
• Number of SMs
• SIMD pipeline width
• Warp size
• Texture cache size
• L1 cache size
• Constant cache size
• Number of memory controllers
• Register file size
• ….

Parameters Under Study
• Number of SMs
• SIMD pipeline width
• Warp size
• Texture cache size
• L1 cache size
• Constant cache size
• Number of memory controllers
• Register file size
• ….

Proposed method
• Suggests the best application-specific configuration for each available chip budget.
• Plackett & Burman design: measures the effect of each parameter on performance.
• Knapsack problem: determines the configuration of parameters, based on their effect on performance, such that it
  • leads to the optimum performance
  • meets the budget

Plackett & Burman Design (PB)
• N parameters, each taking L values
• PB considers only the minimum and maximum value of each parameter
• X experiments (X is the next multiple of 4 strictly greater than N)
• PB with fold-over also captures the effect of two interacting parameters, at the cost of doubling the number of experiments
• Example: 4 parameters, 16 experiments

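PB Design Table

exp   A  B  C  D   Perf        exp   A  B  C  D   Perf
 1    +  +  +  -   T1           9    -  -  -  +   T9
 2    -  +  +  +   T2          10    +  -  -  -   T10
 3    -  -  +  +   T3          11    +  +  -  -   T11
 4    +  -  -  +   T4          12    -  +  +  -   T12
 5    -  +  -  -   T5          13    +  -  +  +   T13
 6    +  -  +  -   T6          14    -  +  -  +   T14
 7    +  +  -  +   T7          15    -  -  +  -   T15
 8    -  -  -  -   T8          16    +  +  +  +   T16

Experiments 9 to 16 are the fold-over of experiments 1 to 8 (all signs flipped). The last row of the table gives the effects S_A, S_B, S_C, S_D: each one is the sum of the Perf column with the signs of that parameter's column applied.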
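The construction of the design table and the effect sums can be sketched in a few lines of Python. This is a minimal sketch, not the authors' tooling: it assumes the standard 8-run Plackett and Burman generator row (+ + + - + - -), and the function names `pb8_foldover`, `effects` and `ranks` are hypothetical. The performance values in the usage part are synthetic, chosen only to show the mechanics.

```python
from typing import List, Sequence

# Standard generator row of the 8-run Plackett-Burman design (7 columns).
PB8_FIRST_ROW = [+1, +1, +1, -1, +1, -1, -1]


def pb8_foldover(num_params: int) -> List[List[int]]:
    """Return a 16-run PB design with fold-over for up to 7 parameters."""
    if not 1 <= num_params <= 7:
        raise ValueError("the 8-run PB design supports 1 to 7 parameters")
    rows = []
    row = PB8_FIRST_ROW[:]
    for _ in range(7):
        rows.append(row[:num_params])
        row = [row[-1]] + row[:-1]          # cyclic right shift
    rows.append([-1] * num_params)          # eighth run: all minimum values
    rows += [[-s for s in r] for r in rows]  # fold-over: flip every sign
    return rows


def effects(design: Sequence[Sequence[int]], perf: Sequence[float]) -> List[float]:
    """Effect of each parameter: sum of perf values with the column's signs."""
    num_params = len(design[0])
    return [sum(run[j] * t for run, t in zip(design, perf))
            for j in range(num_params)]


def ranks(effect_values: Sequence[float]) -> List[int]:
    """Rank parameters by effect magnitude (1 = most significant)."""
    order = sorted(range(len(effect_values)),
                   key=lambda j: abs(effect_values[j]), reverse=True)
    result = [0] * len(effect_values)
    for position, j in enumerate(order, start=1):
        result[j] = position
    return result


# Synthetic example: performance depends strongly on A, less on B and C.
design = pb8_foldover(4)
perf = [10 * a + 3 * b + 1 * c for a, b, c, _d in design]
s = effects(design, perf)
r = ranks(s)
```

In a real study each entry of `perf` would come from one simulation run with every parameter set to its minimum (-) or maximum (+) value.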

Knapsack Problem
• A constrained optimization problem
• $v_j$ = value of an item of type $j$; determined by the PB design ($S_x$)
• $w_j$ = weight of an item of type $j$; transistor count of unit $j$
• $b_j$ = upper bound on the availability of items of type $j$
• $C$ = capacity of the knapsack; maximum number of transistors

Select a number $x_j$ of items of each type so as to

\[
\text{maximize } z = \sum_{j=1}^{n} v_j x_j
\quad \text{subject to} \quad
\sum_{j=1}^{n} w_j x_j \le C, \qquad
1 \le x_j \le b_j \text{ and integer}, \quad j \in N = \{1, \dots, n\}.
\]
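To make the formulation concrete, here is a minimal sketch of a solver for this bounded knapsack problem. It simply enumerates every integer vector with 1 <= x_j <= b_j, which is practical only for a handful of parameters with small bounds; it stands in for an ILP solver and is not the implementation used in the talk. The function name `solve_bounded_knapsack` and all numbers in the usage part are illustrative assumptions.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def solve_bounded_knapsack(
    values: Sequence[float],
    weights: Sequence[float],
    bounds: Sequence[int],
    capacity: float,
) -> Tuple[Optional[Tuple[int, ...]], Optional[float]]:
    """Maximize sum(v_j * x_j) subject to sum(w_j * x_j) <= capacity
    and 1 <= x_j <= b_j (integer). Returns (x, z), or (None, None)
    when even the smallest configuration exceeds the capacity."""
    best_x, best_z = None, None
    for x in product(*(range(1, b + 1) for b in bounds)):
        total_weight = sum(w * xj for w, xj in zip(weights, x))
        if total_weight > capacity:
            continue  # violates the transistor budget
        z = sum(v * xj for v, xj in zip(values, x))
        if best_z is None or z > best_z:
            best_x, best_z = x, z
    return best_x, best_z


# Illustrative numbers only (not taken from the slides).
x, z = solve_bounded_knapsack(
    values=[5, 3, 1], weights=[2, 3, 1], bounds=[3, 2, 4], capacity=10
)
infeasible = solve_bounded_knapsack(
    values=[5, 3, 1], weights=[2, 3, 1], bounds=[3, 2, 4], capacity=5
)
```

With these numbers the smallest configuration (1, 1, 1) already weighs 6, so the remaining capacity of 4 is best spent on two more units of the first item.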

GPU configuration

Parameter                 Value
Number of shaders         30
Shader clock frequency    1.3 GHz
Max threads per shader    1024
SIMD pipeline width       32
Warp width                32
Scheduling                PDOM
Max CTA/shader            8

Transistor costs

Parameter            Size/number    Cost (million transistors)
Memory controller    1              0.3
DL1 cache            32 KB          87
Constant cache       32 KB          52
Register file        32 KB          170

Benchmarks and PB design values

Benchmark       MCB    DL1 cache         Constant cache     Register file
AES             1-3    1 KB, 2 KB        512 B - 8 KB       4 KB, 8 KB
MonteCarlo      1-5    32 KB, 64 KB      1 KB, 2 KB         8 KB, 16 KB
LIB             3-5    32 KB - 128 KB    64 KB - 256 KB     2 KB, 4 KB
Ray             1-3    1 KB, 2 KB        1 KB - 32 KB       8 KB, 16 KB
NN              1-4    8 KB - 32 KB      512 B, 1 KB        4 KB, 8 KB
Scan            1-3    512 B - 2 KB      N/A                2 KB - 8 KB
Srad            1-6    2 KB - 256 KB     N/A                4 KB - 16 KB
BlackScholes    1-4    2 KB - 8 KB       N/A                4 KB, 8 KB
Hotspot         1, 2   1 KB, 2 KB        N/A                8 KB, 16 KB
Matrix          1-5    1 KB, 2 KB        N/A                4 KB, 8 KB
Backprop        1-3    1 KB, 2 KB        N/A                4 KB, 8 KB
FWT             1-4    4 KB - 32 KB      N/A                4 KB, 8 KB
LPS             1-3    1 KB, 2 KB        N/A                2 KB, 4 KB

PB results (R = rank of the parameter's effect; 1 is the most significant)

Benchmark       MCB         R    DL1 cache    R    Const cache    R    Register file    R
AES             57100       1    17742        4    35500          2    18922            3
MonteCarlo      445073      1    293537       2    46469          3    26229            4
LIB             1482938     3    2284182      2    1051960        4    6037368          1
Ray             34441615    2    38417897     1    1405           4    4684461          3
NN              1054553     2    1508787      1    5307           3    535              4
Scan            12372       1    9708         2    N/A                 1748             3
Srad            139159      1    2125         2    N/A                 117              3
BlackScholes    3110257     1    142613       2    N/A                 29369            3
Hotspot         433926      1    133392       2    N/A                 4584             3
Matrix          13693       1    7931         2    N/A                 1775             3
Backprop        14864       1    12020        2    N/A                 476              3
FWT             127799      1    73903        2    N/A                 459              3
LPS             340704      1    136082       2    N/A                 16736            3

Example: Knapsack result for NN benchmark
[Figure: number of units of each parameter versus transistor count]

Budget range: 44 - 135 million transistors
• MCB: 1-4 units
• DL1 cache: 8-32 KB
• Const cache: 512 B - 1 KB
• Register file: 4-8 KB

Region 2: 56 - 67 million transistors
• MCB: 4 units
• DL1 cache: 2 units (16 KB)
• Const cache: 1 unit (512 B)
• Register file: 1 unit (4 KB)
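As a rough sanity check, the transistor count of the Region 2 configuration can be estimated from the transistor-cost slide. The sketch below assumes that cost scales linearly with size from the 32 KB (or 1 controller) reference points, which the slides do not state; the names `REFERENCE_COSTS` and `config_cost` are hypothetical.

```python
KB = 1024

# (reference amount, cost in million transistors) from the transistor-cost slide.
REFERENCE_COSTS = {
    "memory_controller": (1, 0.3),    # per controller
    "dl1_cache": (32 * KB, 87.0),     # per 32 KB
    "const_cache": (32 * KB, 52.0),   # per 32 KB
    "register_file": (32 * KB, 170.0),  # per 32 KB
}


def config_cost(config: dict) -> float:
    """Estimated cost in million transistors.
    ASSUMPTION: cost is proportional to size or count."""
    total = 0.0
    for name, amount in config.items():
        ref_amount, ref_cost = REFERENCE_COSTS[name]
        total += ref_cost * amount / ref_amount
    return total


# Region 2 configuration suggested for the NN benchmark.
nn_region2 = {
    "memory_controller": 4,
    "dl1_cache": 16 * KB,
    "const_cache": 512,
    "register_file": 4 * KB,
}
nn_region2_cost = config_cost(nn_region2)
```

Under that linear-scaling assumption the estimate comes to roughly 66.8 million transistors, which falls inside the 56 - 67 million budget range quoted for Region 2.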

Execution Platform
• GPGPU-Sim 2.1
• Computing resource: Hermes cluster (Westgrid)
  • 88 nodes, dual-socket X5550 (@ 2.66 GHz)
• Exhaustive simulation time: 11 days, 8 hours, 40 minutes and 56 seconds (on a single node)
• Proposed method time: 1 day, 7 hours, 48 minutes and 21 seconds (on a single node)
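The two single-node times can be compared directly. The short calculation below uses only the durations quoted on this slide; note that it measures simulation time, which is a different quantity from the reduction in the number of explorations reported elsewhere in the talk.

```python
from datetime import timedelta

# Durations quoted on the slide (single node).
exhaustive = timedelta(days=11, hours=8, minutes=40, seconds=56)
proposed = timedelta(days=1, hours=7, minutes=48, seconds=21)

# Dividing one timedelta by another yields a plain float.
speedup = exhaustive / proposed
time_saved = 1 - proposed / exhaustive

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
```

With these figures the proposed method needs a little under one eighth of the exhaustive simulation time.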

ILP Result Validation
[Figure: performance (IPC) of the configurations in Region 2, labeled by their transistor counts, with the ILP-suggested and optimum configurations marked]

Budget range: 44 - 135 million transistors
• MCB: 1-4 units
• DL1 cache: 8-32 KB
• Const cache: 512 B - 1 KB
• Register file: 4-8 KB

Region 2: 56 - 66 million transistors
• MCB: 4 units
• DL1 cache: 2 units (16 KB)
• Const cache: 1 unit (512 B)
• Register file: 1 unit (4 KB)

ILP Result Validation
[Figure: IPC of the best configuration and of the ILP-suggested configuration for each benchmark. Panels: AES, BlackScholes, RAY, Srad; FWT, LPS, Scan, BackProp; MonteCarlo, HotSpot, NN, LIB, Matrix]

Mismatch Region Details
[Figure: IPC of the best, ILP-suggested and other configurations for AES, BlackScholes, RAY and Srad, with detail panels for AES (Region 1) and Srad (Region 5); annotated errors: 0.66% and 1.2%]

Conclusion
• The proposed method delivers the best-performing configuration in 48 out of 57 cases.
• In the other nine cases, performance lags the optimum by less than 3.5%.
• It reduces the number of explorations by as much as 78%.

Questions?
