
Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs using Machine Learning

Saumay Dublish (Synopsys Inc.), Vijay Nagarajan and Nigel Topham (The University of Edinburgh). HPCA 2019, Washington D.C., USA.


  1. Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs using Machine Learning. Saumay Dublish*, Vijay Nagarajan‡, Nigel Topham‡ (*Synopsys Inc., ‡The University of Edinburgh). HPCA 2019, Washington D.C., USA, 19th February 2019.

  2. GPU Architecture: Overview
     [Figure: SMs, each with an L1 cache, connected to a shared L2 and DRAM]
     • GPUs are throughput-oriented systems
     • Focus on overall system throughput
     • Rely on high levels of multithreading
     • Implemented by switching across warps
     • Overlap latency with useful execution

  3. GPU Architecture: Consequence of increasing TLP
     [Figure: SMs, each with an L1 cache, connected to a shared L2 and DRAM]
     • Increasing TLP is not always useful
     • Leads to cache thrashing
     • Leads to bandwidth bottlenecks
     • Results in high levels of congestion
     • Latencies tend to be very high!
     Can such high latencies be hidden?

  4. Hiding Latencies in GPUs: Harnessing concurrency
     [Figure: two timelines of load latency versus execution. Instruction concurrency (intra-warp): independent instructions issued after a LOAD overlap its latency until a DEPENDENCY is reached. Warp concurrency (inter-warp): LOADs and independent instructions from several warps overlap one another's load latency.]

  6. Hiding Latencies in GPUs: Harnessing concurrency
     [Figure: the instruction-concurrency (intra-warp) and warp-concurrency (inter-warp) timelines, in which independent instructions overlap LOAD latency until a DEPENDENCY is reached.]
     Works well in compute-intensive applications.

  7. The Case of Limited Parallelism: Fewer independent operations
     [Figure: the instruction-concurrency (intra-warp) and warp-concurrency (inter-warp) timelines of load latency versus execution, for code with fewer independent operations between a LOAD and its DEPENDENCY.]

  9. The Case of Limited Parallelism: Fewer independent operations
     [Figure: the warp-concurrency (inter-warp) timeline with many more warps issuing LOADs.]
     • Higher load latency due to congestion
     • Impractically large number of warps required to completely hide latency
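
To put a rough number on the last annotation of slide 9, here is a back-of-the-envelope sketch. It assumes a simplified model that is not taken from the talk: every warp issues a fixed number of cycles of independent work and then stalls on a load, so the scheduler needs enough other warps to fill that stall. The function name `warps_needed` and all cycle counts are illustrative assumptions, not measurements.

```python
import math


def warps_needed(load_latency_cycles: int, independent_cycles_per_warp: int) -> int:
    """Warps required so that some warp always has work while a load is in flight.

    Simplified model (an assumption, not from the slides): a warp issues
    `independent_cycles_per_warp` cycles of work, then stalls for
    `load_latency_cycles`. The stalled warp itself plus enough other warps
    to cover its stall must be resident.
    """
    if independent_cycles_per_warp <= 0:
        raise ValueError("a warp must do at least one cycle of independent work")
    return 1 + math.ceil(load_latency_cycles / independent_cycles_per_warp)


# Illustrative numbers only: a moderate and a congested load latency,
# each with plenty of and with little independent work per warp.
for latency in (200, 800):
    for work in (20, 4):
        print(f"latency={latency:4d} work={work:2d} -> {warps_needed(latency, work)} warps")
```

Under this toy model, less independent work per warp and a longer (congested) latency multiply each other, which is the situation the slide calls impractical.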

  10. Need For Balance: Tension between TLP and memory system performance
      • Increase TLP to improve concurrency, but latency worsens
      • Reduce TLP to reduce latency, but concurrency worsens
      [Figure: memory performance versus concurrency]

  11. Need For Balance (continued): ☓ Memory performance, ✓ Concurrency

  12. Need For Balance (continued): ☓ Concurrency, ✓ Memory performance

  13. Need For Balance (continued): ✓ Memory performance, ✓ Concurrency
      Optimal system throughput with balanced TLP and memory performance

  14. Outline
      • Problem Statement: Balancing TLP and memory performance
      • Prior state-of-the-art: CCWS and PCAL warp schedulers
      • Pitfalls in prior techniques: Iterative search, prone to local optima
      • Goals: Computing the best warp scheduling decisions
      • Proposal: Poise
      • Results: Experimental results
      • Conclusion: Key takeaways

  15. Prior state-of-the-art
      [Figure: warps sharing the L1 cache, labelled "Cache Thrashing" and "Memory Congestion"]

  16. Prior state-of-the-art: Cache-conscious wavefront scheduling (CCWS)
      [Figure: warps (some marked ☓) and the L1 cache]
      • Limits the degree of multithreading
      • Reduces cache thrashing
      • Relieves congestion
      Shortcomings
      • Restricted coupling of warps with cache performance
      • Underutilization of shared memory resources
      • Dynamic policy has significant performance and cost overheads
      • Static policy burdens the user with the task of profiling every workload

  17. Prior state-of-the-art: Priority-based cache allocation (PCAL)
      Alters parallelism independently of memory system performance
      [Figure: warps (some marked ☓) and the L1 cache]

  18. Prior state-of-the-art: Priority-based cache allocation (PCAL)
      [Figure: warps and the L1 cache, with vital warps (W1, W2, W3) and cache-polluting warps (W1, W2); plot axes: vital warps, cache-polluting warps]

  19. Prior state-of-the-art: Priority-based cache allocation (PCAL)
      • Vital warps (N)
        • Determine degree of multithreading
      • Cache-polluting warps (p)
        • Subset of vital warps
        • Ability to allocate and evict the L1 cache
        • Reduce cache contention
      • Warp-tuple { N, p }
      [Plot axes: vital warps, cache-polluting warps]
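
The warp-tuple { N, p } on slide 19 can be pictured as a small data structure. The sketch below is only an illustration: the class name `WarpTuple`, its method names, and the convention that the lowest warp ids are the vital and cache-polluting ones are all assumptions made here. The slide does not say how accesses from the remaining vital warps are handled, so the sketch only reports permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarpTuple:
    """Warp-tuple {N, p}: N vital warps, of which p are cache-polluting."""

    n_vital: int      # N: warps allowed to be scheduled (degree of multithreading)
    p_polluting: int  # p: vital warps allowed to allocate and evict in the L1 cache

    def __post_init__(self):
        # Cache-polluting warps are a subset of the vital warps.
        if not 0 <= self.p_polluting <= self.n_vital:
            raise ValueError("cache-polluting warps must be a subset of vital warps")

    def may_schedule(self, warp_id: int) -> bool:
        # Assumption: warps 0..N-1 are the vital ones.
        return 0 <= warp_id < self.n_vital

    def may_allocate_l1(self, warp_id: int) -> bool:
        # Assumption: warps 0..p-1 are the cache-polluting ones.
        return 0 <= warp_id < self.p_polluting


# Usage: with {N=8, p=2}, warp 1 runs and may fill the L1, warp 5 runs but
# may not allocate in the L1, and warp 9 is not scheduled at all.
t = WarpTuple(n_vital=8, p_polluting=2)
for w in (1, 5, 9):
    print(w, t.may_schedule(w), t.may_allocate_l1(w))
```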

  20. Limitations of PCAL
      • Heuristic-based iterative searches are slow in hardware
      • Prone to local optima in the presence of multiple performance peaks
      • These two limitations lead to sub-optimal solutions
      [Plot: a local optimum in the space of vital warps and cache-polluting warps]
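
The local-optimum problem on slide 20 can be reproduced with a toy example. The sketch below is not PCAL's actual heuristic: it is a generic greedy hill climb over warp-tuples (N, p), run on a synthetic two-peak performance surface invented here for illustration. The names `hill_climb` and `toy_perf` and every number in them are assumptions.

```python
def hill_climb(perf, start, max_n):
    """Greedy iterative search over warp-tuples (N, p) with 1 <= p <= N <= max_n.

    Each step moves to the best of the four neighbouring tuples and stops when
    no neighbour improves on the current one. Returns (tuple, steps taken).
    """
    current, steps = start, 0
    while True:
        n, p = current
        neighbours = [(n + dn, p + dp)
                      for dn, dp in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 1 <= p + dp <= n + dn <= max_n]
        best = max(neighbours, key=perf, default=current)
        if perf(best) <= perf(current):
            return current, steps
        current, steps = best, steps + 1


def toy_perf(t):
    """Synthetic surface: a low peak at (4, 1) and a higher peak at (12, 6)."""
    n, p = t
    peak_a = 1.0 - 0.1 * (abs(n - 4) + abs(p - 1))
    peak_b = 1.5 - 0.1 * (abs(n - 12) + abs(p - 6))
    return max(peak_a, peak_b, 0.0)


MAX_N = 16
found, steps = hill_climb(toy_perf, start=(2, 1), max_n=MAX_N)
all_tuples = [(n, p) for n in range(1, MAX_N + 1) for p in range(1, n + 1)]
global_best = max(all_tuples, key=toy_perf)
print("hill climb stops at", found, "after", steps, "steps")  # (4, 1) after 2 steps
print("exhaustive search finds", global_best)                 # (12, 6)
```

Starting near the lower peak, the greedy search settles there and never sees the higher one, while the exhaustive scan that does find it has to evaluate every tuple. Those are the two costs the slide lists.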

  21. Goals: How to find the best warp-tuple?
      • Balance TLP and memory performance
      • Avoid local optima
      • Converge expeditiously
      • Low sampling and hardware overhead
      • Avoid burdening the user
      [Plot: "Best warp-tuple?" in the space of vital warps and cache-polluting warps]

  22. Proposal: Poise
      A technique to dynamically balance TLP and memory system performance
      • Machine Learning Framework (supervised learning): profiled kernels form the training stage and dataset (sample input: feature set; sample output: best warp-tuple) for a regression model, which yields feature weights
      • Hardware Inference Engine (runtime prediction): for an unseen user application, a runtime input sample and the feature weights (passed via the compiler) produce the Poise prediction, followed by a local search that gives the best warp-tuple
      [Figure: Poise, a system overview]
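
The two-stage flow on slide 22 (predict a warp-tuple from feature weights, then refine it with a local search) might look roughly like the sketch below. It is only a sketch under stated assumptions: a plain linear model stands in for whatever regression the authors use, the three-element feature vector and both weight vectors are made-up numbers, and `predict_tuple`, `local_search` and `measured_perf` are hypothetical names.

```python
def predict_tuple(features, weights_n, weights_p, max_n):
    """Linear prediction of (N, p) from sampled features, clamped to a valid tuple."""
    n = round(sum(w * x for w, x in zip(weights_n, features)))
    p = round(sum(w * x for w, x in zip(weights_p, features)))
    n = min(max(n, 1), max_n)  # keep 1 <= N <= max_n
    p = min(max(p, 1), n)      # keep 1 <= p <= N
    return n, p


def local_search(perf, centre, max_n, radius=1):
    """Evaluate only the valid tuples within `radius` of the prediction; keep the best."""
    cn, cp = centre
    candidates = [(n, p)
                  for n in range(cn - radius, cn + radius + 1)
                  for p in range(cp - radius, cp + radius + 1)
                  if 1 <= p <= n <= max_n]
    return max(candidates, key=perf)


# Hypothetical inputs: a bias term plus two sampled runtime features, and
# weights that would come from offline training on profiled kernels.
features = [1.0, 0.5, 0.25]
weights_n = [4.0, 8.0, 16.0]
weights_p = [1.0, 4.0, 8.0]


def measured_perf(t):
    """Stand-in for sampling performance at a tuple; peaks at (12, 6)."""
    return -(abs(t[0] - 12) + abs(t[1] - 6))


predicted = predict_tuple(features, weights_n, weights_p, max_n=16)
best = local_search(measured_perf, predicted, max_n=16)
print("predicted", predicted, "-> refined", best)
```

The design point this illustrates is that the prediction lands near a good tuple in one step, so the search only has to sample a handful of neighbours instead of walking the whole space.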
