
Memory Hierarchies in the Era of Specialization - Sarita Adve - PowerPoint PPT Presentation



  1. Coherence, Consistency, and Déjà vu: Memory Hierarchies in the Era of Specialization
  Sarita Adve, University of Illinois at Urbana-Champaign, sadve@Illinois.edu
  w/ Johnathan Alsop, Rakesh Komuravelli, Gio Salvador, Matt Sinclair, Hyojin Sung, and numerous colleagues and students over > 25 years
  This work was supported in part by the NSF and by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

  2. Silver Bullets for the End of Moore's Law?
  • Parallelism
  • Specialization
  • BUT both impact software, hardware, and the hardware-software interface
  This talk: the memory hierarchy for specialized, parallel systems
  – Global address space, coherence, consistency
  But first …

  3. My Story (1988 → 2017)
  • 1988 to 1989: What is a memory consistency model?
    – Simplest model: sequential consistency (SC) [Lamport79]
      • Memory operations execute one at a time in program order
      • Simple, but inefficient
    – Implementation/performance-centric view: the order in which memory operations execute
      • Different vendors w/ different models (orderings): Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
      • Many ambiguities, due to complexity, by design(?), …
  [Figure: reorderings of loads (LD) and stores (ST) constrained by a fence]
  Memory model = What value can a read return?
  HW/SW interface: affects performance, programmability, portability
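
To make "what value can a read return?" concrete, here is a minimal C++ sketch (my illustration, not from the talk) of the classic store-buffering litmus test that distinguishes these models. Under sequential consistency, the outcome r1 == 0 and r2 == 0 is impossible; under the relaxed orderings that many of the vendor models above permit, it is allowed.

    // Store-buffering litmus test: can both loads return 0?
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);   // ST x
        r1 = y.load(std::memory_order_seq_cst);  // LD y
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);   // ST y
        r2 = x.load(std::memory_order_seq_cst);  // LD x
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join(); t2.join();
        // With seq_cst, r1 == 0 && r2 == 0 cannot happen; changing both
        // orderings to memory_order_relaxed makes it a legal outcome.
        std::printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }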

  4. My Story (1988 → 2017)
  • 1988 to 1989: What is a memory model?
    – What value can a read return?
  • 1990s: Software-centric view: data-race-free (DRF) model [ISCA90, …]
    – Sequential consistency for data-race-free programs
  • 2000-08: Java, C++, … memory models [POPL05, PLDI08, CACM10]
    – DRF model + big mess (but after 20 years, convergence at last)
  • 2008-14: Software-centric view for coherence: DeNovo protocol
    – More performance-, energy-, and complexity-efficient than MESI [PACT12, ASPLOS14, ASPLOS15]
  • 2014-: Déjà vu: heterogeneous systems [ISCA15, Micro15, ISCA17]
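
A minimal C++ sketch (my illustration, not from the talk) of the DRF contract: ordinary accesses are kept race-free by explicitly distinguished synchronization accesses, and in return the program sees sequentially consistent behavior.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int data = 0;                   // ordinary (non-synchronization) data
    std::atomic<bool> flag{false};  // explicitly distinguished synch variable

    void producer() {
        data = 42;  // ordinary write, made race-free by the flag
        flag.store(true, std::memory_order_release);  // synch write
    }

    void consumer() {
        while (!flag.load(std::memory_order_acquire)) {}  // synch read
        std::printf("%d\n", data);  // race-free, so SC applies: prints 42
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
        return 0;
    }

Demoting flag to a plain bool would create a data race, and the DRF model then makes no SC guarantee.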

  5. Traditional Heterogeneous SoC Memory Hierarchies
  • Loosely coupled memory hierarchies
    – Local memories don't communicate with each other
    – Unnecessary data movement
  [Figure: SoC with CPUs (private L1s, shared L2), GPU, modem, vector units, A/V hardware accelerators, GPS, DSPs, and multimedia engines over an interconnect to main memory]
  A tightly coupled memory hierarchy is needed

  6. Tightly Coupled SoC Memory Hierarchies
  • Tightly coupled memory hierarchies: unified address space
    – Entering the mainstream, especially for CPU-GPU systems
    – An accelerator can access the CPU's data using the same address
  [Figure: CPU and accelerator, each with a private cache, sharing banked L2 over an interconnection network]
  But coherence and consistency are inefficient, and specialized private memories are still used for efficiency

  7. Memory Hierarchies for Heterogeneous SoC: Programmability + Efficiency
  • DeNovoA: efficient coherence, simple DRF consistency [MICRO '15, Top Picks '16 Honorable Mention]
  • DRFrlx: SC-centric semantics for relaxed atomics [ISCA '17]
  • Stash: integrate specialized memories in a global address space [ISCA '15, Top Picks '16 Honorable Mention]
  • Spandex: integrate accelerators/CPUs w/ different protocols
  • Dynamic load balancing and work stealing w/ accelerators

  8. Memory Hierarchies for Heterogeneous SoC (outline slide repeated)

  9. Memory Hierarchies for Heterogeneous SoC (outline slide repeated)
  Focus: CPU-GPU systems with caches and scratchpads

  10. CPU Coherence: MSI
  [Figure: two CPUs with private L1 caches over banked L2 cache + directory, connected by an interconnection network]
  • Single writer, multiple readers
    – On a write miss, get ownership + invalidate all sharers
    – On a read miss, add to the sharer list
  → Directory needed to store the sharer list
  → Many transient states: complex + inefficient
  → Excessive traffic and indirection
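
A minimal C++ sketch (my illustration, not the talk's code) of the directory bookkeeping the slide describes. Only the stable MSI states appear here; a real protocol also needs the many transient states and acknowledgment tracking that make MSI complex.

    #include <set>

    enum class State { Invalid, Shared, Modified };

    struct DirectoryEntry {
        State state = State::Invalid;
        std::set<int> sharers;  // directory storage: who holds the line
        int owner = -1;         // meaningful only in Modified
    };

    void sendInvalidation(int core) { /* network traffic + await ack */ }

    // Read miss from core c: add it to the sharer list.
    void readMiss(DirectoryEntry& e, int c) {
        if (e.state == State::Modified) {
            e.sharers.insert(e.owner);  // indirection: fetch from the owner
            e.owner = -1;
        }
        e.sharers.insert(c);
        e.state = State::Shared;
    }

    // Write miss from core c: get ownership + invalidate all sharers.
    void writeMiss(DirectoryEntry& e, int c) {
        for (int s : e.sharers) sendInvalidation(s);  // invalidation traffic
        e.sharers.clear();
        e.owner = c;
        e.state = State::Modified;
    }

    int main() {
        DirectoryEntry line;
        readMiss(line, 0);   // core 0 reads: Shared, sharers = {0}
        writeMiss(line, 1);  // core 1 writes: invalidate core 0, Modified
        return 0;
    }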

  11. GPU Coherence with DRF
  [Figure: GPU and CPU, each with a cache holding dirty/valid lines, over banked L2; at synch points the GPU flushes dirty data and invalidates all data]
  • With a data-race-free (DRF) memory model
    – No data races; synchs must be explicitly distinguished
    – At all synch points:
      • Flush all dirty data: unnecessary writethroughs
      • Invalidate all data: can't reuse data across synch points
    – Synchronization accesses must go to the last-level cache (LLC)
  Simple, but inefficient at synchronization
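
A minimal C++ sketch (my illustration) of the two synch-point actions above for a writethrough GPU L1: a release flushes every dirty line, and an acquire invalidates every line, so nothing is reused across synch points.

    #include <array>

    struct Line { bool valid = false; bool dirty = false; };

    struct GpuL1 {
        std::array<Line, 1024> lines{};

        void writeThroughToLLC(Line& l) { /* push data to the LLC */ l.dirty = false; }

        // Release (synch write / end of kernel): flush ALL dirty data.
        void releaseActions() {
            for (Line& l : lines)
                if (l.dirty) writeThroughToLLC(l);  // unnecessary writethroughs
        }
        // Acquire (synch read): invalidate ALL data.
        void acquireActions() {
            for (Line& l : lines) l.valid = false;  // no reuse across synchs
        }
    };

    int main() {
        GpuL1 l1;
        l1.lines[0] = {true, true};  // a valid, dirty line
        l1.releaseActions();         // flushed to the LLC
        l1.acquireActions();         // invalidated, even though up to date
        return 0;
    }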

  12. GPU Coherence with DRF (slide 11 repeated without the figure)

  13. GPU Coherence with HRF
  • With a heterogeneous-race-free (HRF) memory model [ASPLOS '14]
    – No data races; synchs and their scopes must be explicitly distinguished
    – At all globally scoped synch points:
      • Flush all dirty data: unnecessary writethroughs
      • Invalidate all data: can't reuse data across synch points
    – Globally scoped synchronization accesses must go to the last-level cache (LLC)
    – No overhead for locally scoped synchs
  • But higher programming complexity
  • Adopted for HSA, OpenCL, CUDA
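
A minimal C++ cost-model sketch of scoped synchronization; the Scope enum and helper functions are hypothetical illustrations (loosely modeled on the scoped synchronization HSA, OpenCL, and CUDA adopted), not a real API. The efficiency win and the programming burden both come from the same place: the programmer must pick the smallest correct scope for every synch access.

    #include <cstdio>

    enum class Scope { Workgroup, Device, System };  // hypothetical scopes

    // What a scoped release costs in a GPU+HRF design.
    void scopedRelease(Scope s) {
        if (s == Scope::Workgroup) return;      // consumers share this L1: free
        std::puts("flush all dirty L1 data");   // device/system scope
    }

    // What a scoped acquire costs.
    void scopedAcquire(Scope s) {
        if (s == Scope::Workgroup) return;      // nothing to invalidate
        std::puts("invalidate all L1 data");
    }

    int main() {
        scopedRelease(Scope::Workgroup);  // cheap, but WRONG if the consumer
                                          // is outside this workgroup
                                          // (a heterogeneous race)
        scopedRelease(Scope::System);     // always correct, always expensive
        return 0;
    }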

  14. Modern GPU Coherence & Consistency

                 Consistency                          Coherence
    De facto     Data-race-free (DRF): simple         High overhead on synchs: inefficient
    Recent       Heterogeneous-race-free (HRF),       No overhead for local synchs:
                 scoped synchronization: complex      efficient for local synch

  Do GPU models (HRF) need to be more complex than CPU models (DRF)?
  NO! Not if coherence is done right!
  DeNovoA+DRF: efficient AND a simpler memory model

  15. A Classification of Coherence Protocols
  • Read hit: don't return stale data
  • Read miss: find one up-to-date copy

    Track up-to-date copy    Invalidator: writer    Invalidator: reader
    Ownership                MESI                   DeNovo
    Writethrough             -                      GPU

  • Reader-initiated invalidations: no invalidation or ack traffic, no directories, no transient states
  • Obtaining ownership for written data: reuse owned data across synchs (not flushed at synch points)

  16. DeNovo Coherence with DRF
  [Figure: GPU and CPU, each with a cache holding dirty/valid/owned lines, over banked L2; the GPU obtains ownership for dirty data and invalidates only non-owned data]
  • With a data-race-free (DRF) memory model
    – No data races; synchs must be explicitly distinguished
    – At all synch points:
      • Obtain ownership for dirty data (instead of flushing it): owned data can be reused across synchs
      • Invalidate all non-owned data (instead of all data)
    – Synchronization accesses can be at L1 (instead of going to the last-level cache)
  • 3% state overhead vs. GPU coherence + HRF
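
A minimal C++ sketch (my illustration, not the paper's implementation) of the same two synch-point actions under DeNovo, for contrast with the GPU sketch above: writes obtain ownership instead of being flushed, and an acquire self-invalidates only non-owned lines, so owned data is reused across synch points.

    #include <array>

    struct Line { bool valid = false; bool owned = false; bool dirty = false; };

    struct DeNovoL1 {
        std::array<Line, 1024> lines{};

        void obtainOwnership(Line& l) { l.owned = true; /* register at LLC */ }

        // Store: obtain ownership; no writethrough needed at synch points,
        // because the LLC directs later readers to the owner.
        void onWrite(Line& l) {
            if (!l.owned) obtainOwnership(l);
            l.valid = true;
            l.dirty = true;
        }
        // Acquire: self-invalidate only non-owned data (reader-initiated
        // invalidation, so no sharer lists, acks, or transient states).
        void acquireActions() {
            for (Line& l : lines)
                if (!l.owned) l.valid = false;  // owned lines are reused
        }
        // Release: nothing to flush.
        void releaseActions() {}
    };

    int main() {
        DeNovoL1 l1;
        l1.onWrite(l1.lines[0]);  // register ownership, keep dirty data in L1
        l1.releaseActions();      // no flush
        l1.acquireActions();      // lines[0] stays valid: it is owned
        return 0;
    }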

  17. Evaluation Methodology
  • 1 CPU core + 15 GPU compute units (CUs)
    – Each node has a private L1, a scratchpad, and a tile of the shared L2
  • Simulation environment: GEMS, Simics, GPGPU-Sim, GPUWattch, McPAT
  • Workloads
    – Rodinia, Parboil: no fine-grained synch
    – UC Davis microbenchmarks + UTS from the HRF paper
    – Graph analytics workloads

  18. Key Evaluation Questions
  1. Streaming apps without fine-grained synch
     – How does DeNovo+DRF compare to GPU+DRF?
     → DeNovo does not hurt performance for streaming apps
  2. Apps with locally scoped fine-grained synch
     – How does DeNovo+DRF compare to GPU+HRF?
     → HRF's complexity is not needed
  3. Apps with globally scoped fine-grained synch
     – How does DeNovo+DRF compare to GPU+HRF?

  19. Global Synch – Execution Time
  [Figure: normalized execution time (0-100%) for FAM, SLM, SPM, SPMBO, and the average (AVG), comparing GPU (GH, G*) and DeNovo (DD, D*) configurations]
  DeNovo has 28% lower execution time than GPU with global synch

  20. Global Synch – Energy
  [Figure: normalized energy (0-100%) for FAM, SLM, SPM, SPMBO, and the average (AVG), broken down into network (N/W), L2 $, L1 D$, scratchpad, and GPU core+, comparing GPU (GH, G*) and DeNovo (DD, D*) configurations]
  DeNovo has 51% lower energy than GPU with global synch

  21. Key Evaluation Questions
  1. Streaming apps without fine-grained synch
     – How does DeNovo+DRF compare to GPU+DRF?
     → DeNovo does not hurt performance for streaming apps
  2. Apps with locally scoped fine-grained synch
     – How does DeNovo+DRF compare to GPU+HRF?
     → HRF's complexity is not needed for local scope
  3. Apps with globally scoped fine-grained synch
     – How does DeNovo+DRF compare to GPU+HRF?
     → DeNovo+DRF provides much better performance and energy
  DeNovo+DRF = efficient and simple
  Impact: the AMD/HRF authors agree and refined DeNovo [Micro16]
  But … déjà vu!

  22. Memory Hierarchies for Heterogeneous SoC (outline slide repeated)
