

  1. Techniques for Caches in GPUs
     Günther Schindler, Seminar Talk 2015/16, Chair ASC

  2. Outline
     1. Introduction
        1.1 GPU vs. CPU
        1.2 GPU Architecture
        1.3 Caches in GPUs
     2. Methods
        2.1 Atomic Operations
        2.2 Software Controlled Cache-Bypassing
        2.3 Hardware Controlled Cache-Bypassing
     3. Conclusion
     4. Discussion

  3. GPU vs. CPU
     CPU ("latency-oriented"): high single-thread performance through out-of-order processing and large
     caches; roughly 16 KB of L1 cache per thread (Intel Haswell).
     GPU ("throughput-oriented"): low-overhead thread scheduling; memory latencies are hidden through
     multi-threading; as little as 24 B of L1 cache per thread in the worst case (8 blocks per SM, Nvidia Kepler).

     Unit            i7-4770 (Haswell) [1]  i7-6700 (Skylake) [1]  Tesla GT200 [2]  Fermi GF106 [2]  Kepler GK104 [2]  Maxwell GM107 [2]
     L1 D$ (cycles)  4-5                    4-5                    X                45               30                X
     L2 D$ (cycles)  12                     12                     X                310              175               194
     L3 D$ (cycles)  36                     42                     X                X                X                 X
     SMem (cycles)   X                      X                      38               50               33                28
     RAM (cycles)    36 + 57 ns             36 + 57 ns             440              685              300               350
     L1 D$ size      32 KB                  32 KB                  X                48 KB            48 KB             24 KB
     L2 size         256 KB                 256 KB                 X                768 KB           1536 KB           2048 KB
     L3 size         8 MB                   8 MB                   X                X                X                 X

     GPU chips spend more die space on ALUs and less on caches.

     (1) http://www.7-cpu.com
     (2) Michael Andersch, Jan Lucas, Mauricio Alvarez-Mesa, Ben Juurlink, "Analyzing GPGPU Pipeline Latency", poster, 2014.
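     Latency numbers like those in the table are typically obtained with dependent-load (pointer-chasing)
     microbenchmarks. The following is only a minimal CUDA sketch of that idea, not the methodology of
     Andersch et al.; the kernel name, sizes, and iteration count are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Pointer chase: every load depends on the previous one, so the elapsed cycles
// divided by the number of loads approximate the load-to-use latency of the
// level of the hierarchy the working set fits into.
__global__ void chase(const unsigned int *next, int iters,
                      long long *cycles, unsigned int *sink) {
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        idx = next[idx];            // serialized, latency-bound loads
    *cycles = clock64() - start;
    *sink = idx;                    // keep the chain from being optimized away
}

int main() {
    const int n = 4096;             // 16 KB working set: should stay cache-resident
    const int iters = 100000;
    unsigned int *next, *sink;
    long long *cycles;
    cudaMallocManaged(&next, n * sizeof(unsigned int));
    cudaMallocManaged(&sink, sizeof(unsigned int));
    cudaMallocManaged(&cycles, sizeof(long long));
    for (int i = 0; i < n; ++i) next[i] = (i + 1) % n;   // circular chain

    chase<<<1, 1>>>(next, iters, cycles, sink);          // one thread: pure latency
    cudaDeviceSynchronize();
    printf("~%.1f cycles per load (includes loop overhead)\n",
           (double)*cycles / iters);
    return 0;
}
```

     Growing the working set from a few KB to tens of MB moves the measured latency from the on-chip
     caches out to DRAM, which is how tables like the one above are usually filled in.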

  4. GPU Architecture: Memory Model
     [Block diagram: SM 1 ... SM 16, each with an L1 $ / Shared Memory; Interconnection Network;
     L2 cache banks; memory controllers 1 ... 6; off-chip DRAMs]
     ● L1 caches are not coherent.
     ● The ratio of L1 to Shared Memory is reconfigurable (see the sketch below).
     ● Shared Memory is a software-controlled cache.
     ● The L2 cache is partitioned into several banks and is coherent.
     ● DRAM (GDDR) is off-chip.

     Least Recently Used (LRU) replacement policy
     Example: four stores (a0, a1, a2, a3) are written into the shared cache. New lines enter in the
     most-recently-used (MRU) position, and once the cache is full the least-recently-used lines are
     evicted, so the oldest stores (a0, a1) are written back to DRAM (WR: a0, WR: a1).
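     The reconfigurable L1 / Shared Memory split is exposed through the CUDA runtime. A minimal sketch,
     assuming a hypothetical kernel `scale` that stages data in Shared Memory (the software-controlled
     cache mentioned above):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: stages data in the software-controlled cache (Shared Memory).
__global__ void scale(const float *in, float *out, float factor, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // explicit, software-managed staging
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * factor;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // Hint to the runtime to carve the on-chip storage in favor of Shared Memory;
    // cudaFuncCachePreferL1 requests the opposite, cudaFuncCachePreferEqual a balanced split.
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);

    scale<<<n / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    return out[0] == 2.0f ? 0 : 1;
}
```

     The call is only a preference hint; the driver may override it, but it is the portable way to trade
     L1 capacity against Shared Memory per SM on these architectures.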

  5. Caches in GPUs
     Motivation
     ● Caches improve the performance of atomic operations.
     ● A shared cache in CPU-GPU heterogeneous processors improves communication and saves die space.
     ● Caches improve inter-block communication.
     ● Caches avoid off-chip accesses, increase bandwidth, and save energy.

     Limitations of existing cache management techniques
     ● An improvement in cache performance does not directly translate into improved program performance (due to multi-threading).
     ● Unique GPU characteristics.
     ● Small cache size.
     ● Caches can have a negative effect on performance.

  6. Atomic Operations
     Motivation
     ● Slow atomic operations currently limit applicability.
     ● CPU atomic mechanisms require L1 coherence.
     ● A cost-effective adaptation is needed to improve atomics on GPUs.
     ● Franey et al. (0) restrict coherence to atomic data and implement a complexity-effective coherence mechanism.

     State of the art
     ● Atomics are executed like non-atomic instructions in the shader core.
     ● They traverse the interconnect to the appropriate L2 bank (see the sketch below).
     ● The operation is ordered, the data is acquired, and the operation is performed.
     ● A response containing the previous value of the data is sent back to the core.

     Goal: avoid the latency of traversing the interconnect; atomic operations must be performed locally.

     (0) S. Franey and M. Lipasti, "Accelerating atomic operations on GPGPUs," in Seventh IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2013, pp. 1-8.
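     For reference, the kind of code that exercises this path: a plain CUDA histogram in which many blocks
     update the same global counters with atomicAdd, so every update pays the round trip to the owning L2
     bank. Kernel and buffer names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative histogram: many thread blocks hammer a small set of global
// counters, so throughput is bounded by atomic latency.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Global atomic: the request travels over the interconnect and is
        // performed at the L2 bank owning this address, not in the SM.
        atomicAdd(&bins[data[i]], 1u);
    }
}

int main() {
    const int n = 1 << 20;
    unsigned char *data;
    unsigned int *bins;
    cudaMallocManaged(&data, n);
    cudaMallocManaged(&bins, 256 * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) data[i] = i % 256;
    cudaMemset(bins, 0, 256 * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(data, n, bins);
    cudaDeviceSynchronize();
    printf("bin[0] = %u\n", bins[0]);   // expect n / 256 = 4096

    cudaFree(data); cudaFree(bins);
    return 0;
}
```

     A common software mitigation is to accumulate partial histograms in Shared Memory and flush them with
     one global atomic per bin per block; the proposals on the next slides instead attack the latency of
     the global atomics themselves.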

  7. AtomNaive
     Approach: restrict coherence to atomic data with a mutex.
     ● A token rotates among the nodes (e.g. shader cores), derived from a modulo operation on the cycle count.
     ● Replicated mutex-status tables hold the state of every mutex ('0' or '1').
     ● A "busy wire" indicates to the nodes when a table update is in flight ('0' or '1').
     ● Acquiring a mutex: wait for the token, mark the mutex in the local table, then update the other nodes
       (see the sketch below).
     + Ensures acquisition correctness
     - Long latency to acquire the token
     - Additional latency for updates
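     A very rough, single-threaded software model of the acquisition rule; all names, sizes, and the cycle
     loop are illustrative and not from Franey and Lipasti, whose mechanism is implemented in hardware.

```cuda
#include <cstdio>

// Rough model of AtomNaive acquisition: a token rotates with the cycle count,
// and a node may only mark a mutex in the replicated status tables while it
// holds the token and no update is in flight.
const int NODES = 4;
const int MUTEXES = 8;

bool table[NODES][MUTEXES] = {};   // replicated mutex-status tables ('0' free, '1' taken)
bool busy_wire = false;            // an update is currently propagating to the other nodes

bool try_acquire(int node, int m, long cycle) {
    if (cycle % NODES != node) return false;     // wait for the rotating token
    if (busy_wire || table[node][m]) return false;
    for (int i = 0; i < NODES; ++i) table[i][m] = true;  // mark it and update the other nodes
    busy_wire = true;                            // in hardware this update costs extra cycles
    return true;
}

int main() {
    // Node 2 keeps retrying for mutex 5; it succeeds only once the token reaches it.
    for (long cycle = 0; cycle < 8; ++cycle) {
        busy_wire = false;                       // assume the previous update has settled
        if (try_acquire(2, 5, cycle)) {
            printf("node 2 acquired mutex 5 at cycle %ld\n", cycle);
            break;
        }
    }
    return 0;
}
```

     The waiting-for-the-token loop is exactly the long acquisition latency listed as the main drawback of
     this configuration.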

  8. AtomDir
     Approach: adapt techniques used in directory-based cache coherence.
     ● Token rotation is replaced with request communication.
     ● Acquisition requires a round trip to the owner (home node) of the mutex.
     ● Broadcast updates are removed because every mutex has a unique home node.
     + Ensures acquisition correctness
     - Round-trip latency
     - Minimal performance improvement

  9. Hybrid Topology
     Approach: find an effective middle point between the AtomNaive and AtomDir configurations.
     ● From AtomNaive: replicated mutex-status tables with "busy wire" and rotating token (update communication) inside each logical ring.
     ● From AtomDir: the mutex state is distributed across some number of logical rings (request communication between rings).
     Acquisition latency over a distance Δx within a ring and Δy across rings:
     ● AtomDir: Δx + Δy (round trip)
     ● AtomNaive: Δx/2 (one-way trip)
     ● Hybrid: Δx/2 + Δy
     - Mutex acquisition delays the data fetch
     + The fetch can be issued in parallel with the mutex acquisition.

  10. Evaluation
     Performance
     ● "AtomDir" shows the benefit of being able to cache atomic data.
     ● "Topology" shows the benefit of distributing ownership.
     ● "SpecFetch" shows the advantage of issuing speculative memory fetches along with the mutex acquisition.
     [Performance figure from Sean Franey, "Accelerating Atomic Operations on GPGPUs", talk, 2013.]

     Summary
     ● The proposed mechanisms show good performance improvements.
     ● High overhead for control logic and storage.
     ● They need resources (wires) from the underlying interconnection network.
     ● L2 cache latency has decreased since Fermi (Fermi: 310 cycles, Maxwell: 194 cycles).

  11. Communication Through Caches
     Motivation
     ● GPU applications suffer from the lack of an efficient inter-block synchronization mechanism.
     ● The usual workaround is to exit the current kernel and re-launch the successive kernel, so that the
       host performs a global synchronization between the two launches (see the sketch below).
     ● The L2 cache can be used as a buffer for inter-block communication.

     Example with the LRU policy: blocks store a0, a1, a2, a3 before the global synchronization and load
     them afterwards. The stores evict each other from the shared cache and are written back to DRAM, so
     all four loads after the synchronization miss and must be fetched from DRAM again.

     The amount of off-chip memory traffic is the same whether there is an L2 cache or not.
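     A minimal sketch of the kernel re-launch pattern described above; kernel and buffer names are
     illustrative. The boundary between the two launches is the global synchronization point, after which
     every block of the second kernel can read what any block of the first one wrote.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Producer: every block writes one partial result to global memory.
__global__ void produce(float *partial) {
    if (threadIdx.x == 0)
        partial[blockIdx.x] = (float)blockIdx.x;   // stand-in for a real per-block result
}

// Consumer: reads results written by *other* blocks of the previous kernel.
__global__ void consume(const float *partial, float *out, int nblocks) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; ++b)
        sum += partial[b];
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *out = sum;
}

int main() {
    const int nblocks = 64;
    float *partial, *out;
    cudaMalloc(&partial, nblocks * sizeof(float));
    cudaMalloc(&out, sizeof(float));

    produce<<<nblocks, 128>>>(partial);
    // Kernel boundary = global synchronization: all stores of `produce`
    // are now visible to every block of `consume`.
    consume<<<1, 1>>>(partial, out, nblocks);

    float h;
    cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f\n", h);   // expect 0 + 1 + ... + 63 = 2016
    cudaFree(partial); cudaFree(out);
    return 0;
}
```

     Between the two launches the partial results travel through the L2 cache; with plain LRU they may
     already have been evicted to DRAM, which is the off-chip traffic the write-buffering scheme on the
     next slide tries to avoid.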

  12. Write-Buffering (for inter-block communication)
     Approach
     ● With the LRU replacement policy the L2 cache effectively works as a FIFO, pushing the communicated data out to DRAM.
     ● Choi et al. (0) prevent this by modifying the cache management scheme: a 1-bit status flag (C) is added to every cache line.
     ● Write miss, a line with C = 0 available: the line is allocated and its C bit is set.
     ● Write miss, victim line has C = 1: the line is not selected for replacement.
     ● Every C bit is set: the write bypasses the L2 cache and goes to off-chip memory.

     In the example, the stores of a0 and a1 stay in the cache (their loads after the global synchronization
     hit), while a2 and a3 bypass to DRAM and miss. Two writes and two reads (for a0 and a1) are saved
     compared with the LRU policy. A rough software model of this allocation rule is sketched below.

     (0) H. Choi, J. Ahn, and W. Sung, "Reducing off-chip memory traffic by selective cache management scheme in GPGPUs," in 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5), ACM, 2012, pp. 110-119.
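     A rough software model of the selective allocation rule; the tiny two-line cache, the structure names,
     and the behaviour on hits are my own illustration, not Choi et al.'s hardware implementation.

```cuda
#include <cstdio>

// Rough model of the selective allocation rule: a write miss may only claim a
// line whose C bit is clear; once every C bit is set, further writes bypass
// the cache and go straight to DRAM.
struct Line { int tag = -1; bool valid = false; bool c = false; };

struct SmallCache {
    Line lines[2];                             // illustrative: 2 lines, fully associative

    // Returns true if the write is absorbed by the cache, false if it bypasses to DRAM.
    bool write(int tag) {
        for (Line &l : lines)
            if (l.valid && l.tag == tag) return true;          // write hit
        for (Line &l : lines)
            if (!l.valid || !l.c) {                            // allocate and set C
                l.tag = tag; l.valid = true; l.c = true;
                return true;
            }
        return false;                                          // every C set: bypass L2
    }

    bool read_hit(int tag) const {             // loads after the global synchronization
        for (const Line &l : lines)
            if (l.valid && l.tag == tag) return true;
        return false;                          // (when C is cleared is not covered on the slide)
    }
};

int main() {
    SmallCache l2;
    for (int a = 0; a < 4; ++a)   // stores a0..a3: a0 and a1 allocate, a2 and a3 bypass
        printf("store a%d -> %s\n", a, l2.write(a) ? "cached" : "bypass to DRAM");
    for (int a = 0; a < 4; ++a)   // loads after the sync: a0 and a1 hit, a2 and a3 miss
        printf("load  a%d -> %s\n", a, l2.read_hit(a) ? "hit" : "miss (DRAM)");
    return 0;
}
```

     Pinning the first writers in the cache and diverting later ones to DRAM is what removes the two
     write-backs and two re-loads relative to plain LRU in the slide's example.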
