IEE5008 – Autumn 2012 Memory Systems
Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs
Garrido Platero, Luis Angel
Department of Electronics Engineering, National Chiao Tung University
luis.garrido.platero@gmail.com
Luis Garrido 2012
Outline
- Introduction
- Background of GPGPUs
  - Hardware
  - Programming and execution model
- GPU memory hierarchy
  - Issues and limitations
- Locality on GPUs
Luis Garrido NCTU IEE5008 Memory Systems 2012
Outline (cont.)
- State-of-the-art solutions for memory access scheduling on GPUs
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimization and Parallelism Pipeline
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Conclusion: comparison and analysis
- References
Introduction
- GPGPUs as a major focus of attention in heterogeneous computing
- Design philosophy of GPGPUs
  - Bulk-synchronous programming model
  - Characteristics of applications
- Design space of GPUs: diverse and multi-variable
- Major bottleneck: the memory hierarchy (the focus of this work)
Background of GPGPUs
- Hardware of GPUs: general view
Background of GPGPUs
- Hardware of GPUs: the SMX
GPU's Memory Hierarchy
- Five memory spaces: constant, texture, shared, private, global
- Issues and limitations of the memory hierarchy
  - Caches are small
  - The primary concern is bandwidth, not latency
  - How to make good use of the memory resources?
Locality on GPUs
- Locality in GPUs is defined in close relation to the programming model. Why?
- Locality in GPUs as:
  - Within-warp locality
  - Within-block (cross-warp) locality
  - Cross-instruction data reuse
- A model suited to GPUs
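These locality classes can be illustrated with a small sketch (the trace format and line size here are assumptions for illustration, not from the survey): given accesses tagged with their block and warp, a reuse of a cache line is within-warp if the same warp touches the line again, within-block if a different warp of the same block does, and cross-block otherwise.

```python
def classify_reuse(trace, line_size=128):
    """Classify each cache-line reuse in an access trace.

    trace: list of (block_id, warp_id, address) tuples in program order.
    Returns counts of within-warp reuse, within-block (cross-warp)
    reuse, and cross-block touches of the same line.
    """
    last_toucher = {}  # cache line -> (block_id, warp_id) of last access
    counts = {"within_warp": 0, "within_block": 0, "cross_block": 0}
    for block, warp, addr in trace:
        line = addr // line_size
        prev = last_toucher.get(line)
        if prev is not None:
            prev_block, prev_warp = prev
            if prev_block == block and prev_warp == warp:
                counts["within_warp"] += 1
            elif prev_block == block:
                counts["within_block"] += 1
            else:
                counts["cross_block"] += 1
        last_toucher[line] = (block, warp)
    return counts
```

For example, a trace `[(0, 0, 0), (0, 0, 64), (0, 1, 0), (1, 0, 0)]` yields one reuse of each class: the second access hits the same line from the same warp, the third from a sibling warp of the same block, the fourth from another block.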
State-of-the-Art Solutions
- Many techniques exist; an exhaustive survey would be huge
- This work brings forward the most recent and most instructive ones
- Types of techniques:
  - Static: rely on compiler techniques, applied before execution
  - Dynamic: rely on architectural enhancements at run time
  - Static + dynamic: combine the best of both approaches
State-of-the-Art Solutions
- Three papers in this survey:
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimization and Parallelism Pipeline
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
Shared Memory Bank Conflicts
- Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
- Objective of the mechanism: avoid on-chip shared memory bank conflicts
  - On-chip memories of GPUs are heavily banked
- Impact on performance: varying latencies as a result of memory bank conflicts
Shared Memory Bank Conflicts
- This work makes the following contributions:
  - A careful analysis of the impact of GPU on-chip shared memory bank conflicts
  - A novel elastic pipeline design that alleviates on-chip shared memory bank conflicts
  - A co-designed bank-conflict-aware warp scheduling technique that assists the elastic pipeline
  - A pipeline-stall reduction of up to 64%, improving overall system performance
Shared Memory Bank Conflicts
- The core keeps information about the warps
- Instructions of warps are issued in round-robin fashion
- A bank conflict has two consequences:
  - It blocks the upstream pipeline
  - It introduces a bubble into the M1 stage
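The cost of a conflict grows with its degree: the shared memory access serializes into as many replays as the most-contended bank receives requests. A minimal sketch, assuming 32 banks with 4-byte word interleaving and a 32-thread warp (the usual NVIDIA layout):

```python
from collections import Counter

def conflict_degree(word_indices, num_banks=32):
    """Degree of a shared-memory bank conflict for one warp access.

    word_indices: the 4-byte word index each thread of the warp accesses.
    The access replays once per request on the most-contended bank,
    so degree 1 means conflict-free.
    """
    banks = [w % num_banks for w in word_indices]
    return max(Counter(banks).values())

# Unit stride: every thread hits a different bank -> conflict-free.
print(conflict_degree([t * 1 for t in range(32)]))   # 1
# Stride 2: pairs of threads collide -> 2-way conflict.
print(conflict_degree([t * 2 for t in range(32)]))   # 2
# Stride 32: all 32 threads hit bank 0 -> full serialization.
print(conflict_degree([t * 32 for t in range(32)]))  # 32
```

Each extra degree is one extra cycle in the M1 stage, which is exactly the bubble the elastic pipeline tries to absorb.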
Shared Memory Bank Conflicts
- Elastic pipeline: two modifications
  - Buses that allow forwarding
  - Turning the two NONMEMx stages into a FIFO
- Also modifies the warp scheduling policy
Shared Memory Bank Conflicts
- Avoid the issues caused by out-of-order instruction commit
- It is necessary to know whether future instructions will create a conflict
- Prevent the queues from saturating
- A mechanism modifies the warp scheduling policy
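One way to picture a bank-conflict-aware scheduler (an illustrative sketch only, not the paper's exact policy): scan the ready warps in round-robin order, prefer one whose next shared-memory access is known to be conflict-free, and fall back to plain round-robin when none is.

```python
def pick_warp(ready_warps, next_degree, rr_ptr):
    """Bank-conflict-aware warp selection (illustrative sketch).

    ready_warps: warp ids eligible to issue this cycle.
    next_degree: dict warp_id -> conflict degree of the warp's next
                 shared-memory access (1 = conflict-free).
    rr_ptr: round-robin pointer, kept for fairness.
    """
    n = len(ready_warps)
    # Scan in round-robin order, preferring a conflict-free warp.
    for i in range(n):
        warp = ready_warps[(rr_ptr + i) % n]
        if next_degree.get(warp, 1) == 1:
            return warp
    # Every candidate conflicts: fall back to plain round-robin.
    return ready_warps[rr_ptr % n]

print(pick_warp([0, 1, 2], {0: 2, 1: 1, 2: 4}, 0))  # 1 (conflict-free)
print(pick_warp([0, 1, 2], {0: 2, 1: 3, 2: 4}, 0))  # 0 (fallback)
```

The point of such a policy is to avoid feeding back-to-back conflicting accesses into the FIFO, which would otherwise saturate it.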
Shared Memory Bank Conflicts
- Experimentation platform: GPGPU-Sim, a cycle-accurate GPU simulator
- Three categories of pipeline stalls:
  - Warp scheduling failures
  - Shared memory bank conflicts
  - Other
Shared Memory Bank Conflicts
- Warp scheduling can fail when stalls are not hidden by parallel execution:
  - Barrier synchronization
  - Warp control flow
- Performance comparison:
  - Baseline GPU
  - GPU + elastic pipeline
  - GPU + elastic pipeline + the novel scheduler
  - Theoretical upper bound
Memory Opt. and Parallelism Pipeline
- Paper: A GPGPU Compiler for Memory Optimization and Parallelism Pipeline
- Two major challenges:
  - Effective utilization of the GPU memory hierarchy
  - Judicious management of parallelism
- Compiler-based approach:
  - Analyzes memory access patterns
  - Applies an optimization procedure
Memory Opt. and Parallelism Pipeline
- Considers an appropriate way to parallelize an application
- Why compiler techniques?
- Purposes of the optimization procedure:
  - Increase memory coalescing
  - Improve usage of shared memory
  - Balance the amount of parallelism against memory optimization
  - Distribute memory traffic among the off-chip memory partitions
Memory Opt. and Parallelism Pipeline
- Input: naïve kernel code; output: optimized kernel code
- Re-schedules threads depending on memory access behavior
- Turns uncoalesced accesses into coalesced ones through shared memory
- The compiler analyzes data reuse and assesses the benefit of using shared memory
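Whether a warp's global access coalesces can be approximated by counting the memory segments it touches (a sketch assuming 128-byte segments, 4-byte elements, and a 32-thread warp; the compiler's real analysis works symbolically on the kernel's index expressions):

```python
def segments_touched(addresses, seg_size=128):
    """Number of memory segments a warp's addresses fall into.

    One segment means fully coalesced: the hardware can serve the
    whole warp with a single wide transaction.
    """
    return len({addr // seg_size for addr in addresses})

N = 32                # hypothetical row length of a float matrix
warp = range(32)      # thread ids of one warp
# Row-major walk: adjacent threads read adjacent 4-byte words.
row_access = [t * 4 for t in warp]
# Column walk: each thread's access lands in its own segment.
col_access = [t * 4 * N for t in warp]
print(segments_touched(row_access))  # 1  (coalesced)
print(segments_touched(col_access))  # 32 (fully uncoalesced)
```

Staging the column walk through shared memory lets the kernel load the data with row-major (coalesced) global accesses first, which is exactly the uncoalesced-to-coalesced transformation the slide describes.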
Memory Opt. and Parallelism Pipeline
- When data sharing is detected:
  - Merge thread blocks
  - Merge threads
- When to merge thread blocks versus just threads?
- For the experiments:
  - Various GPU descriptions
  - Cetus: a source-to-source compiler framework
Memory Opt. and Parallelism Pipeline
- Main limitation: the compiler cannot change the structure of the algorithm
- Biggest performance increases come from:
  - Merging of threads
  - Merging of thread blocks
Use of Demand-Fetched Caches
- Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Caches are highly configurable
- Main contributions:
  - Characterization of application performance, with a taxonomy
  - An algorithm to identify an application's memory access patterns
- Does the presence of a cache help or hurt performance?
Use of Demand-Fetched Caches
- It is bandwidth, not latency, that matters
- Estimated traffic: what does it indicate?
- A thread block runs on a single SM
- Remember: L1 caches are not coherent
- Configurability of L1 caches:
  - Capacity
  - ON/OFF (cached or not?)
- Kernels are analyzed independently
Use of Demand-Fetched Caches
- Classification of kernels into three groups:
  - Texture and constant
  - Shared memory
  - Global memory
Use of Demand-Fetched Caches
- No correlation between hit rates and performance
- Analysis of the traffic generated from L2 to L1
- The impact of cache line size:
  - Only a fraction of the line may be needed from DRAM
  - But the whole line is brought into L1
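The line-size effect can be quantified: with the cache on, every miss pulls a whole L1 line even if the warp uses a single word of it. A sketch, assuming 128-byte L1 lines and 32-byte uncached transactions (typical Fermi-era numbers, assumed here for illustration):

```python
def l2_to_l1_traffic(addresses, cached, line=128, uncached_seg=32):
    """Bytes moved toward L1 for one warp access, cache on vs. off.

    With the cache on, each distinct line touched costs a full line.
    With it off, each distinct 32-byte segment is fetched instead.
    """
    gran = line if cached else uncached_seg
    return len({addr // gran for addr in addresses}) * gran

# Strided access: each thread reads one 4-byte word per 128-byte line.
strided = [t * 128 for t in range(32)]
print(l2_to_l1_traffic(strided, cached=True))   # 32 lines * 128 B = 4096
print(l2_to_l1_traffic(strided, cached=False))  # 32 segs  *  32 B = 1024
```

For this pattern caching quadruples the traffic, which is one way a cache can hurt a bandwidth-bound kernel despite a nonzero hit rate.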
Use of Demand-Fetched Caches
- Changes in L2-to-L1 memory traffic
- The algorithm can reveal the access patterns:
  - Maps memory addresses to thread IDs
  - Estimates the traffic
  - Determines which instructions to cache
Use of Demand-Fetched Caches
- Steps of the algorithm:
  - Analyze the access patterns, based on the thread ID
  - Estimate the memory traffic (number of bytes per thread)
  - Determine which instructions will use the cache
- How to decide whether or not to cache an instruction?
Use of Demand-Fetched Caches
- Four possible cases:
  - Cache-on traffic = cache-off traffic
  - Cache-on traffic < cache-off traffic
  - Cache-on traffic > cache-off traffic
  - Unknown access addresses
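The four cases map onto a simple per-instruction decision rule. A sketch of the idea; treating the unknown-address case as the difference between the conservative and aggressive strategies is an assumption of this sketch, not something the slides spell out:

```python
def cache_decision(traffic_on, traffic_off, aggressive=False):
    """Decide, per load instruction, whether to use the L1 cache.

    traffic_on / traffic_off: estimated bytes of traffic with the
    cache enabled or disabled; None means the access addresses are
    unknown at analysis time (assumption: the aggressive strategy
    caches these, the conservative one bypasses).
    """
    if traffic_on is None or traffic_off is None:
        return "cache" if aggressive else "bypass"
    if traffic_on > traffic_off:
        return "bypass"   # caching would only amplify traffic
    return "cache"        # equal or reduced traffic: keep the cache

print(cache_decision(4096, 1024))               # bypass
print(cache_decision(1024, 4096))               # cache
print(cache_decision(None, None))               # bypass (conservative)
print(cache_decision(None, None, aggressive=True))  # cache
```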
Use of Demand-Fetched Caches
- Keeping caches on for all kernels: 5.8% improvement on average
- Conservative strategy: 16.9%
- Aggressive strategy: 18.0%