Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs

Evaluating and Mitigating Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs. Saumay Dublish, Vijay Nagarajan, Nigel Topham, The University of Edinburgh. ISPASS 2017, 25th April, Santa Rosa, California.


  1. Evaluating and Mitigating Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs. Saumay Dublish, Vijay Nagarajan, Nigel Topham, The University of Edinburgh. ISPASS 2017, 25th April, Santa Rosa, California.

  2. Multithreading on GPUs
     • Cores hide memory latencies with concurrent execution
     [Diagram: host CPU to GPU; hardware kernel scheduler; four cores; DRAM]

  3. Multithreading on GPUs
     • Memory-intensive applications: latencies grow and appear in the critical path
     • Bandwidth bottleneck
     [Diagram: host CPU to GPU; hardware kernel scheduler; four cores; DRAM]

  4. Deeper Memory Hierarchy
     • Small caches and high multithreading
     • High cache miss rates
     • Bandwidth filtering at L1 and at L2
     • Distributed bandwidth bottleneck
     [Diagram: four cores, each with an L1; L2; DRAM]

  5. Deeper Memory Hierarchy
     • L2 roundtrip latency ~300 cycles (2-3x higher)
     • Small caches and high multithreading
     • High cache miss rates
     • Bandwidth filtering at L1 and at L2
     • Distributed bandwidth bottleneck
     [Diagram: four cores, each with an L1; L2; DRAM]

  6. Deeper Memory Hierarchy
     • L2 roundtrip latency ~200 cycles
     • Small caches and high multithreading
     • High cache miss rates
     • Bandwidth filtering at L1 and at L2
     • Identify and mitigate bottlenecks across the memory hierarchy
     [Diagram: four cores, each with an L1; L2; DRAM]

  7. Goals
     • Characterize: Understand the bandwidth bottlenecks across different levels of the memory hierarchy, such as L1, L2 and DRAM
     • Cause: Investigate the architectural causes of congestion
     • Effect: Design-space exploration to evaluate the effect of mitigating congestion
     • Proposal: Use cause-and-effect analysis to present cost-effective configurations of the memory hierarchy

  8. Experimental Environment
     • Platform: GPGPU-Sim (v3.2.2), GPUWattch (McPAT)
     • Benchmark suites: Rodinia, Parboil, MapReduce

  9. Baseline Configuration
     • NVIDIA GTX 480 GPU
     • 15 SMs
     • Private L1 data cache (16 KB; 32 MSHRs)
     • Shared L2 cache (768 KB; 32 MSHRs/bank)
     • L1-L2 interconnect (crossbar; 32+32 bytes)
     • DRAM (384-bit bus width)

  10. Latency Tolerance
      [Figure: performance versus latency curve for memory-intensive benchmarks. Labels: performance plateau (latency tolerance); latency appears in the critical path]

  11. Latency Tolerance
      [Figure: performance versus latency curve. Labels: performance plateau; 120 cycles and 220 cycles; ideal L2 access latency; ideal DRAM access latency; added latencies due to increasing congestion]

  12. Latency Tolerance
      [Figure: performance versus latency curve. Labels: performance plateau; baseline memory latencies (1x); far from saturation; 120 cycles and 220 cycles; ideal L2 access latency; ideal DRAM access latency; practically possible to improve performance; theoretically possible]
      Observations about "baseline memory latencies":
      1. Baseline memory latencies are critically higher than the performance plateau latencies
      2. Baseline memory latencies are critically higher than the ideal access latencies to L2/DRAM

  13. Infinite Bandwidth
      [Figure: annotated value 2.37x]

  14. Infinite Bandwidth
      • Significant congestion in the cache hierarchy
      [Figure: annotated values 2.37x and 1.15x]

  15. Understanding Bandwidth Bottleneck
      • While the bandwidth provided decreases at the lower levels of the memory hierarchy, bandwidth demand does not reduce proportionally.
      • This leads to a bandwidth skew between adjacent levels.
      • As a result, requests queue up in the memory hierarchy for long durations, causing congestion.
      • L2 access queues are full for 46% of their usage lifetime.
      • DRAM access queues are full for 39% of their usage lifetime.
      [Diagram: core; L1 access queue; L1; L2 access queue; L2; DRAM access queue; DRAM; bandwidth decreasing from top to bottom]
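
The bandwidth skew on this slide can be illustrated with a toy model. The sketch below is not from the talk: `queue_full_fraction` and its parameters are hypothetical names, and the model assumes a single bounded access queue with fixed per-cycle demand and supply. It only shows that a queue fed faster than it drains spends most of its lifetime full.

```python
def queue_full_fraction(demand, supply, capacity, cycles):
    """Fraction of cycles in which a bounded access queue is full.

    demand   -- requests the upper level tries to enqueue per cycle (assumed fixed)
    supply   -- requests the lower level can dequeue per cycle (assumed fixed)
    capacity -- number of queue entries
    """
    occupancy = 0
    full_cycles = 0
    for _ in range(cycles):
        # Requests that do not fit are simply held back by the upper level.
        occupancy = min(capacity, occupancy + demand)
        if occupancy == capacity:
            full_cycles += 1
        # The lower level drains what its bandwidth allows.
        occupancy = max(0, occupancy - supply)
    return full_cycles / cycles


if __name__ == "__main__":
    # Demand twice the supply: the queue fills and stays full.
    print(queue_full_fraction(demand=2, supply=1, capacity=8, cycles=100))
    # Matched demand and supply: the queue never fills.
    print(queue_full_fraction(demand=1, supply=1, capacity=8, cycles=100))
```

Real access queues see bursty, data-dependent traffic, so the 46% and 39% figures on the slide cannot be reproduced by a fixed-rate model like this one.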

  16. Causes of congestion
      • Structural hazards
      • Back pressure
      [Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]

  17. Causes of congestion: structural hazards
      • Prolonged contention for cache resources such as MSHRs or replaceable cache lines.
      • Pending requests must complete and relinquish the resources.
      • Therefore, new miss requests get serialized, increasing the memory latencies even more.
      • Result: high cache hit latencies
      [Diagram: core; L1 (hit?); miss; L1 MSHR (full: structural hazard); L2; L2 MSHR; DRAM]
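
A minimal sketch of the serialization effect described on this slide, under simplifying assumptions that are not from the talk: one new miss arrives per cycle, every miss holds one MSHR for a fixed `miss_latency`, and a miss that finds all MSHRs busy waits for the earliest one to be released. The function name and parameters are hypothetical.

```python
import heapq


def average_miss_latency(num_misses, num_mshrs, miss_latency):
    """Average latency per miss, including time spent waiting for a free MSHR."""
    release_times = []  # min-heap: cycle at which each busy MSHR is released
    total_latency = 0
    for arrival in range(num_misses):  # one miss arrives per cycle
        if len(release_times) < num_mshrs:
            start = arrival  # an MSHR has never been used yet
        else:
            # All MSHRs have been allocated: reuse the one released earliest.
            start = max(arrival, heapq.heappop(release_times))
        finish = start + miss_latency
        heapq.heappush(release_times, finish)
        total_latency += finish - arrival
    return total_latency / num_misses


if __name__ == "__main__":
    for mshrs in (4, 8):
        print(mshrs, average_miss_latency(num_misses=8, num_mshrs=mshrs, miss_latency=10))
```

With too few MSHRs the later misses pay the service latency plus the wait for a release, which is the extra memory latency the slide attributes to structural hazards.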

  18. Causes of congestion: back pressure
      • Cascading effect of structural hazards
      • The higher level gets throttled
      • Eventually throttles core performance
      • Result: restricted parallelism on cores
      [Diagram: core (stall; independent compute?); L1; L1 MSHR; L2; L2 MSHR; DRAM; blocked paths marked with X]
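
The cascading effect can be sketched with a chain of bounded queues. This is an illustrative model only: `simulate_back_pressure`, the queue capacities and the drain rates are assumptions, not values from the talk. A request moves down one level only if the queue below has room, so a slow bottom level eventually fills every queue above it and stalls the core.

```python
def simulate_back_pressure(cycles, capacities, drain_rates, issue_rate=1):
    """Count core stalls in a chain of bounded queues.

    capacities[i]  -- entries in the queue at level i (0 = closest to the core)
    drain_rates[i] -- requests level i can pass downward per cycle
    """
    occupancy = [0] * len(capacities)
    core_stalls = 0
    for _ in range(cycles):
        # Drain from the bottom level upward so freed entries can be reused.
        for level in reversed(range(len(capacities))):
            movable = min(occupancy[level], drain_rates[level])
            if level + 1 < len(capacities):
                # Back pressure: limited by free space in the queue below.
                movable = min(movable, capacities[level + 1] - occupancy[level + 1])
                occupancy[level + 1] += movable
            occupancy[level] -= movable
        # The core issues into the top queue, or stalls if it is full.
        for _ in range(issue_rate):
            if occupancy[0] < capacities[0]:
                occupancy[0] += 1
            else:
                core_stalls += 1
    return core_stalls


if __name__ == "__main__":
    # The bottom level drains half as fast as the core issues.
    print(simulate_back_pressure(20, capacities=[4, 4], drain_rates=[2, 1], issue_rate=2))
    # Matched rates at every level.
    print(simulate_back_pressure(20, capacities=[4, 4], drain_rates=[2, 2], issue_rate=2))
```

In the first configuration the top-level queue itself could keep up with the core; the stalls come entirely from the level below it.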

  19. Causes of congestion: L1 cache stalls
      [Figure: breakdown of L1 cache stalls; annotated values 41% and 11%. Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]

  20. Causes of congestion: L1 cache stalls
      Major causes of stalls at L1:
      1. L1 MSHR: 41% (structural hazards)
      2. L2 back pressure: 48% (back pressure)
      [Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]

  21. Causes of congestion: L2 cache stalls
      Major causes of stalls at L2:
      1. Crossbar (response path): 42% (back pressure)
      2. DRAM: 35% (back pressure)
      [Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]

  22. Mitigating congestion: classifying the design space
      • Category 1: Operate at peak throughput
        • Minimize stalls by exploiting the existing peak throughput
        • e.g. MSHRs, access queue size
      • Category 2: Increase peak throughput
        • Minimize stalls by increasing the peak throughput
        • e.g. crossbar flit size, DRAM bus width
      [Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]

  23. Identifying the Design Space
      • L1 parameters
        • L1 miss queue
        • L1 MSHR
        • Memory pipeline width
      • L2 parameters
        • L2 miss/response queue
        • L2 MSHR
        • L2 data port width
        • L2 banks
        • Flit size (crossbar)
      • DRAM parameters
        • Scheduler queue
        • Banks
        • Bus width
      [Diagram: core; L1; L1 MSHR; L2; L2 MSHR; DRAM]
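
One way to picture the design-space grouping is as a configuration table scaled per level. The sketch below is illustrative: the dictionary layout, the key names and `scale_levels` are hypothetical, and it covers only the parameters whose baseline values appear on the Baseline Configuration slide. It also assumes the crossbar's "32+32 bytes" means a 32-byte flit in each direction.

```python
import copy

# Baseline values quoted on the Baseline Configuration slide (GTX 480).
# Parameters without a stated baseline value are left out.
BASELINE = {
    "l1": {"mshr_entries": 32},
    "l2": {
        "mshr_entries_per_bank": 32,
        "crossbar_flit_bytes": 32,  # assumption: 32 bytes per direction
    },
    "dram": {"bus_width_bits": 384},
}


def scale_levels(config, levels, factor):
    """Return a copy of config with every parameter of the given levels scaled."""
    scaled = copy.deepcopy(config)
    for level in levels:
        for name in scaled[level]:
            scaled[level][name] *= factor
    return scaled


if __name__ == "__main__":
    # Scale the L1 and L2 groups together, leaving DRAM at baseline.
    print(scale_levels(BASELINE, ["l1", "l2"], 4))
```

Grouping parameters by level makes it easy to express both the isolated and the combined scaling experiments as one-line variations of the same baseline.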

  24. Mitigating congestion: scaling L1 parameters by 4x
      [Figure: annotated value 4%]

  25. Mitigating congestion: scaling L1 parameters by 4x
      • Improving bandwidth in isolation can lead to even more congestion at the lower levels
      [Figure: annotated values -7%, -13%, -33%, -25% and 4%]

  26. Mitigating congestion: core frequency scaling on a real GTX 480
      • Up to 23% slowdown
      • Improving bandwidth in isolation can lead to even more congestion at the lower levels

  27. Mitigating congestion: scaling L2 parameters by 4x
      • Shows the criticality of the L2 bandwidth
      [Figure: annotated value 59%]

  28. Mitigating congestion: scaling DRAM parameters by 4x (HBM)
      [Figure: annotated value 11%]

  29. Mitigating congestion: scaling L1 and L2 parameters by 4x
      • A case for synergistic scaling!
      [Figure: annotated values 226%, 212%, 59%, -13%, 69% and 4%]
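
A rough arithmetic check of the synergy claim, under a loudly stated assumption: the transcript does not say which series each annotation belongs to, so the sketch assumes that 4% is the gain from scaling L1 alone, 59% the gain from scaling L2 alone, and 69% the gain from scaling both. `combined_if_independent` is a hypothetical helper that composes gains as if the two changes did not interact.

```python
def combined_if_independent(*gains_percent):
    """Compose percentage gains multiplicatively, as if they were independent."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0


# Assumed reading of the slide annotations (not confirmed by the transcript).
l1_only, l2_only, l1_and_l2 = 4.0, 59.0, 69.0

if __name__ == "__main__":
    expected = combined_if_independent(l1_only, l2_only)
    print(f"independent composition: {expected:.2f}%, annotated combined gain: {l1_and_l2:.2f}%")
```

Under that reading the combined gain exceeds what independent composition would give, which is consistent with the slide's case for scaling adjacent levels together.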
