  1. IEE5008 Autumn 2012 Memory Systems
     Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs
     Garrido Platero, Luis Angel
     Department of Electronics Engineering, National Chiao Tung University
     luis.garrido.platero@gmail.com

  2. Outline
     - Introduction
     - Background of GPGPUs
       - Hardware
       - Programming and execution model
     - GPU memory hierarchy
       - Issues and limitations
     - Locality on GPUs

  3. Outline
     - State-of-the-art solutions for memory access scheduling on GPUs
       - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
       - A GPGPU Compiler for Memory Optimization and Parallelism Management
       - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
     - Conclusion: comparison and analysis
     - References

  4. Introduction
     - GPGPUs as a major focus of attention in the heterogeneous computing field
     - Design philosophy of GPGPUs
       - Bulk-synchronous programming model
       - Characteristics of target applications
     - Design space of GPUs: diverse and multi-variable
     - Major bottleneck: the memory hierarchy
       - The focus of this work

  5. Background of GPGPUs
     - Hardware of GPUs: general view

  6. Background of GPGPUs
     - Hardware of GPUs: the SMX (streaming multiprocessor)

  7. GPU Memory Hierarchy
     - Five distinct memory spaces:
       - Constant
       - Texture
       - Shared
       - Private (registers/local)
       - Global
     - Issues and limitations of the memory hierarchy:
       - Small cache capacities
       - The concern is bandwidth, not latency
       - How can the memory resources be used well? (see the sketch below)
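To make the memory spaces concrete, here is a minimal CUDA sketch (a hypothetical kernel, not taken from the slides) that touches four of the five spaces; texture memory is omitted for brevity. It assumes a launch with at most 256 threads per block.

```
__constant__ float coeff[4];          // constant memory: cached, read-only

__global__ void spaces_demo(const float *in, float *out, int n)
{
    __shared__ float tile[256];       // shared memory: on-chip, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];    // global -> shared
    __syncthreads();                  // every thread reaches this barrier

    if (i < n) {
        float acc = 0.0f;             // private memory: a register
        for (int k = 0; k < 4; ++k)
            acc += coeff[k] * tile[threadIdx.x];
        out[i] = acc;                 // register -> global
    }
}
```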

  8. Locality on GPUs
     - Locality on GPUs is defined in close relation to the programming model. Why?
     - Locality on GPUs appears as:
       - Within-warp locality
       - Within-block (cross-warp) locality
       - Cross-instruction data reuse
     - A locality model suited to GPUs (illustrated below)
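A hypothetical one-dimensional stencil shows the first two forms of locality side by side:

```
__global__ void stencil3(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Within-warp locality: the 32 threads of a warp read consecutive
    // addresses, so their loads fall into a handful of cache lines.
    // Within-block (cross-warp) locality: in[i] is also read by the
    // neighbouring threads i-1 and i+1, which may belong to other warps.
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
```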

  9. State-of-the-Art Solutions
     - Many techniques exist; an exhaustive survey would be enormous
     - This survey brings forward the most recent and most instructive ones
     - Types of techniques:
       - Static: rely on compiler analysis, applied before execution
       - Dynamic: rely on architectural enhancements at run time
       - Static + dynamic: combine the strengths of both approaches

  10. State-of-the-Art Solutions
     - Three papers in this survey:
       - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
       - A GPGPU Compiler for Memory Optimization and Parallelism Management
       - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

  11. Shared Memory Bank Conflicts
     - Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
     - Objective of the mechanism: avoid on-chip shared memory bank conflicts
       - On-chip memories of GPUs are heavily banked
     - Impact on performance: conflicts produce varying access latencies (example below)
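The paper attacks conflicts in hardware, but the software-visible pattern is the classic one sketched here (a hypothetical kernel; it assumes 32 banks of 4-byte words and a 32x32 thread block):

```
#define TILE 32

// The shared-memory read below is column-wise: the 32 threads of a warp
// (threadIdx.x = 0..31) access addresses 128 bytes apart, so they all hit
// the same bank and the access serializes 32-way.
__global__ void column_read(float *out)
{
    __shared__ float tile[TILE][TILE];        // 32-way bank conflict
    // __shared__ float tile[TILE][TILE + 1]; // padding makes it conflict-free

    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```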

  12. Shared Memory Bank Conflicts
     - This work makes the following contributions:
       - A careful analysis of the impact of GPU on-chip shared memory bank conflicts
       - A novel elastic pipeline design that alleviates those conflicts
       - A co-designed bank-conflict-aware warp scheduling technique to assist the elastic pipeline
     - Pipeline stall reductions of up to 64%, leading to overall system performance gains

  13. Shared Memory Bank Conflicts
     - The core keeps per-warp state; warp instructions are issued in round-robin order
     - A bank conflict has two consequences:
       - It blocks the upstream pipeline stages
       - It introduces a bubble into the M1 stage

  14. Shared Memory Bank Conflicts
     - Elastic pipeline: two modifications
       - Buses that allow forwarding
       - Turning the two NONMEMx stages into a FIFO
     - The warp scheduling policy is modified as well

  15. Shared Memory Bank Conflicts
     - Avoids the issues caused by out-of-order instruction commit
     - Needs to know whether upcoming instructions will create conflicts
     - Prevents the queues from saturating
     - Mechanism for modifying the warp scheduling policy

  16. Shared Memory Bank Conflicts
     - Experimental platform: GPGPU-Sim, a cycle-accurate simulator for GPUs
     - Three categories of pipeline stalls:
       - Warp scheduling failures
       - Shared memory bank conflicts
       - Other

  17. Shared Memory Bank Conflicts
     - Warp scheduling can fail when stalls are not hidden by parallel execution:
       - Barrier synchronization
       - Warp control flow
     - Performance comparison across four configurations:
       - Baseline GPU
       - GPU + elastic pipeline
       - GPU + elastic pipeline + the novel scheduler
       - Theoretical upper bound

  18. Memory Opt. and Parallelism Management
     - Paper: A GPGPU Compiler for Memory Optimization and Parallelism Management
     - Two major challenges:
       - Effective utilization of the GPU memory hierarchy
       - Judicious management of parallelism
     - Compiler-based approach:
       - Analyzes memory access patterns
       - Applies an optimization procedure

  19. Memory Opt. and Parallelism Management
     - Considers an appropriate way to parallelize an application
     - Why compiler techniques?
     - Purposes of the optimization procedure:
       - Increase memory coalescing
       - Improve the usage of shared memory
       - Balance the amount of parallelism against the memory optimizations
       - Distribute memory traffic among the off-chip memory partitions

  20. Memory Opt. and Parallelism Management
     - Input: naive kernel code; output: optimized kernel code
     - Re-schedules threads depending on memory access behavior
     - Turns uncoalesced accesses into coalesced ones by staging data through shared memory (sketch below)
     - The compiler analyzes data reuse and assesses the benefit of using shared memory
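A hand-written matrix transpose (a hypothetical stand-in for the compiler's output, not code from the paper) shows the staging idea: both the global load and the global store are coalesced, and the strided access happens only inside shared memory. It assumes a (TILE, TILE) thread block and an n x n matrix.

```
#define TILE 32

__global__ void transpose(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];    // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```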

  21. Memory Opt. and Parallelism Management
     - When data sharing is detected, the compiler can:
       - Merge thread blocks
       - Merge threads (illustrated below)
     - When should thread blocks be merged rather than just threads?
     - Experimental setup:
       - Various GPU descriptions
       - Cetus: a source-to-source compiler infrastructure
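A hypothetical before/after fragment of thread merging: the merged thread produces two outputs, so a value loaded into a register is reused instead of being fetched from global memory twice.

```
// Before merging: one output per thread; coef[col] is loaded by every row.
__global__ void scale(const float *in, const float *coef,
                      float *out, int width)
{
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = coef[col] * in[row * width + col];
}

// After merging two threads along y: coef[col] is loaded once into a
// register and reused for both rows, halving that stream of global loads.
__global__ void scale_merged(const float *in, const float *coef,
                             float *out, int width)
{
    int row = 2 * blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float c = coef[col];                       // loaded once, reused twice
    out[row * width + col]       = c * in[row * width + col];
    out[(row + 1) * width + col] = c * in[(row + 1) * width + col];
}
```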

  22. Memory Opt. and Parallelism Management
     - Main limitation: the compiler cannot change the algorithm's structure
     - Biggest performance increases come from:
       - Thread merging
       - Thread-block merging

  23. Use of Demand-Fetched Caches
     - Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
     - Caches are highly configurable
     - Main contributions:
       - A characterization of application performance, with a taxonomy
       - An algorithm that identifies an application's memory access patterns
     - Does the presence of a cache help or hurt performance?

  24. Use of Demand-Fetched Caches
     - It is bandwidth that matters, not latency
     - Estimated traffic: what does it indicate?
     - A thread block runs on a single SM; remember, L1 caches are not coherent
     - Configurability of L1 caches (examples below):
       - Capacity
       - ON/OFF
       - Caching individual accesses or not
     - Kernels are analyzed independently
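On the Fermi-class hardware the paper targets, both knobs are exposed to the programmer; a minimal sketch (hypothetical kernel name):

```
#include <cuda_runtime.h>

__global__ void touch(float *p) { p[threadIdx.x] += 1.0f; }

int main()
{
    // Capacity: L1 and shared memory split one 64 KB array on Fermi;
    // this requests the 48 KB-L1 / 16 KB-shared split for this kernel.
    cudaFuncSetCacheConfig(touch, cudaFuncCachePreferL1);

    // ON/OFF: compiling with `nvcc -Xptxas -dlcm=cg` makes global loads
    // bypass L1 entirely (per compilation unit, not per instruction).
    float *p;
    cudaMalloc(&p, 32 * sizeof(float));
    touch<<<1, 32>>>(p);
    cudaDeviceSynchronize();
    cudaFree(p);
    return 0;
}
```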

  25. Use of Demand-Fetched Caches
     - Classification of kernels into three groups, by dominant memory type:
       - Texture and constant
       - Shared memory
       - Global memory

  26. Use of Demand-Fetched Caches
     - No correlation between hit rates and performance
     - Analysis of the traffic generated from L2 to L1
     - The impact of cache line size:
       - Only a fraction of each line may actually be needed from DRAM
       - Yet the whole line is brought into L1 (worked example below)
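As a rough worked example under the Fermi parameters the paper assumes (128-byte L1 lines, 32-byte uncached segments): if the 32 threads of a warp each read 4 bytes from scattered addresses, a cached load pulls a full 128-byte line per access, 32 x 128 = 4096 bytes, while uncached loads move only 32 x 32 = 1024 bytes. That is a 4x traffic difference that hit rates alone would never reveal.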

  27. Use of Demand-Fetched Caches
     - Changes in L2-to-L1 memory traffic
     - The algorithm can reveal the access patterns:
       - Maps memory addresses to thread IDs
       - Estimates the traffic
       - Determines which instructions to cache

  28. Use of Demand-Fetched Caches
     - Steps of the algorithm:
       - Analyze the access patterns, based on the thread ID
       - Estimate the memory traffic (the number of bytes moved per thread)
       - Determine which instructions will use the cache
     - How to decide whether or not to cache an instruction?

  29. Use of Demand-Fetched Caches
     - Four possible cases (decision sketch below):
       - Cache-on traffic = cache-off traffic
       - Cache-on traffic < cache-off traffic
       - Cache-on traffic > cache-off traffic
       - Unknown access addresses
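One plausible host-side reading of that decision rule (a hypothetical helper, not the paper's implementation):

```
// Traffic figures are the estimated bytes moved between L2 and L1 for one
// static load instruction, with the cache on versus off.
enum class CacheDecision { On, Off, DontCare };

CacheDecision decide(long long trafficOn, long long trafficOff,
                     bool addressesKnown)
{
    if (!addressesKnown)                 // case 4: pattern unresolved at
        return CacheDecision::On;        // compile time; a conservative
                                         // strategy leaves caching on, an
                                         // aggressive one would turn it off
    if (trafficOn < trafficOff) return CacheDecision::On;   // case 2
    if (trafficOn > trafficOff) return CacheDecision::Off;  // case 3
    return CacheDecision::DontCare;                         // case 1: equal
}
```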

  31. Use of Demand-Fetched Caches
     - Keeping caches on for all kernels: 5.8% improvement on average
     - Conservative strategy: 16.9%
     - Aggressive strategy: 18.0%
