

  1. COHESION: A Hybrid Memory Model for Accelerators
     John H. Kelm, Daniel R. Johnson, William Tuohy, Steven S. Lumetta and Sanjay J. Patel
     University of Illinois at Urbana-Champaign

  2. Chip Multiprocessors Today
     • General-purpose + accelerators (e.g., GPUs)
     • General-purpose CMP challenges:
       1. Programmability
       2. Power/performance density of ILP-centric cores
       3. Scalability of HW coherence and strict memory models
     • Accelerator challenges:
       1. Inflexible programming/execution models
       2. Hard to scale irregular parallel apps
       3. Lack of a conventional memory model
     John H. Kelm 2

  3. Chip Multiprocessors Tomorrow
     • Industry trend: integration over time
     • Hybrids: accelerators + CPUs together on die
     • More core/compute heterogeneity, but more homogeneity in the memory model
     [Diagram: past, discrete CPU and GPU with separate memories; present, CPU + GPU sharing a die; future, CPU + accelerator sharing memory]
     • Our proposal: a hybrid memory model

  4. CMP Memory Model Choices
     Conventional multicore CPU (e.g., Intel i7, Sun Niagara):
     • Optimized for: minimal latency, tightly coupled sharing, fine-grained synchronization, minimal programmer effort
     • Provides: single address space, hardware caching, strong ordering, HW-managed coherence
     Contemporary accelerator (e.g., NVIDIA GPU, IBM Cell):
     • Optimized for: maximum throughput, loosely coupled sharing, coarse-grained synchronization, short silicon design cycle
     • Provides: multiple address spaces, scratchpad memories, relaxed ordering, SW-managed coherence

  5. Roadmap
     • Motivation and context
     • Problem statement
     • COHESION design
     • Use cases and programming examples
     Addressed in this talk:
     1. Opportunity: Is combining protocols worthwhile?
     2. Feasibility: How does one implement a hybrid memory model?
     3. Tradeoffs: What are the tradeoffs between HWcc and SWcc?
     4. Benefit: What does hybrid coherence get you?

  6. Problem: Scalable Coherence
     • Available architectures:
       – Accelerators: 100s of cores, TFLOPS, no coherence
       – CMPs: <10s of cores, GFLOPS, HW coherence
       – Multiple memory models on-die
     • What developers want in heterogeneous CMPs:
       – Hardware caches (locality)
       – Single address space (no data marshalling)
       – Minimal changes to current practices
     • Goal: accelerator scalability with a CMP memory model

  7. Baseline Architecture
     [Diagram: 1024-core processor organization. Eight DRAM banks (0-7) feed eight L3 cache banks, each with an L3 directory cache bank, connected through an unordered multistage interconnect to 128 clusters (0-127). Each Rigel cluster contains eight cores (C0-C7) sharing an L2 cache and a cluster controller.]
     • Variant of the Rigel architecture [Kelm et al. ISCA'09]
     • 1024-core CMP, HW caches, single address space, MIMD

  8. Opportunity: HWcc v. SWcc Shootout
     [Chart: runtime normalized to ideal SWcc for cg, dmm, gjk, sobel, kmeans, mri, march, heat, and stencil, comparing ideal SWcc, the best SWcc (of four policies), and a full directory (ideal on-die HWcc). Some benchmarks favor SWcc, some favor HWcc, and some are near parity.]
     • Note: lower bars are better
     • Question: Can we leverage both HW and SW protocols?
     * SWcc based on the Task-Centric Memory Model [Kelm et al. PACT'09][Kelm et al. IEEE Micro'10]

  9. Opportunity: Network Traffic Reduction
     [Chart: relative number of messages for SWcc (left bars, baseline architecture) and HWcc (right bars, DIR-FULL) on each benchmark, broken down into read requests, write requests, instruction requests, uncached/atomic operations, cache evictions, software flushes, probe responses, and read releases.]
     • SWcc: fewer L2 messages in the network, some flush overhead
     • HWcc: extraneous messages for unshared data (write requests, read releases)

  10. Opportunity: Reduce Directory Utilization
     [Chart: average number of directory entries allocated, split into stack, heap/global, and code, versus the 256K maximum possible (red line). For many benchmarks the maximum is never reached.]
     • Not all entries are used, so die area is wasted
     • Observations:
       1. Use SWcc when possible to reduce network traffic
       2. Build a smaller sparse directory for the common case
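The second observation can be made concrete with a toy model. The sketch below (an illustrative direct-mapped design, not the deck's actual hardware; sizes and names are invented) shows why a sparse directory with a fixed entry budget is attractive when entries go unused, and where its risk lies: once a set fills, each new allocation forces an eviction, which in a real system invalidates the victim's sharers.

```c
#include <assert.h>

/* Illustrative sparse directory with DIR_SETS sets of DIR_WAYS
   entries each. All constants are assumptions for illustration. */
#define DIR_SETS 4
#define DIR_WAYS 2

typedef struct {
    unsigned tag[DIR_SETS][DIR_WAYS];
    int      valid[DIR_SETS][DIR_WAYS];
} sparse_dir_t;

/* Allocate an entry for line_addr; returns 1 if the set was full
   and a victim had to be evicted (the source of performance
   cliffs when the directory is sized too small). */
int dir_allocate(sparse_dir_t *d, unsigned line_addr) {
    unsigned set = line_addr % DIR_SETS;
    unsigned tag = line_addr / DIR_SETS;
    for (int w = 0; w < DIR_WAYS; ++w)      /* already tracked? */
        if (d->valid[set][w] && d->tag[set][w] == tag)
            return 0;
    for (int w = 0; w < DIR_WAYS; ++w)      /* free entry? */
        if (!d->valid[set][w]) {
            d->valid[set][w] = 1;
            d->tag[set][w]   = tag;
            return 0;
        }
    d->tag[set][0] = tag;                   /* evict way 0 */
    return 1;
}
```

With SWcc handling private data, far fewer lines ever reach `dir_allocate`, so the small directory rarely hits the eviction path.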

  11. COHESION: Toward a Hybrid Memory Model
     • Support for coherence domain transitions:
       1. A protocol for safe migration SWcc ↔ HWcc
       2. Minor architecture extensions
     • Motivation:
       ↑ HWcc: supports arbitrary sharing, no SW overhead
       ↓ HWcc: area + message overhead
       ↑ SWcc: removes HW overheads + design complexity
       ↓ SWcc: flush overhead + coherence burden on SW

  12. Protocol Synthesis
     [State diagram: the software-managed states Immutable (SW IM), Clean (SW CL), Private Clean (SW PC, tracked per word), and Private Dirty (SW PD, tracked per line) are bridged to the hardware MSI states Shared (HW S), Invalid (HW I), and Modified (HW M). Loads, stores, invalidations, writebacks, and synchronization drive transitions; RdReq/RdRel and WrReq/WrRel messages implement the SW-to-HW transitions.]
     • Creates a bridge between SWcc and HWcc
     • Leverages existing HWcc and SWcc techniques
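The pairings drawn on this slide can be sketched as a state table. This is a minimal sketch of the SW-to-HW direction only: the IM→S, CL→I, and PD→M mappings follow the diagram, while the mapping chosen for SW PC is an assumption for illustration (a single clean private copy can safely be treated as Shared), not the paper's definition.

```c
#include <assert.h>

/* Line states named after the slide's diagram:
   HW_* are the hardware MSI states, SW_* the software-managed ones. */
typedef enum {
    HW_I,   /* Invalid                       */
    HW_S,   /* Shared                        */
    HW_M,   /* Modified                      */
    SW_IM,  /* Immutable                     */
    SW_CL,  /* Clean                         */
    SW_PC,  /* Private, clean (per word)     */
    SW_PD   /* Private, dirty (per line)     */
} line_state_t;

int is_sw_managed(line_state_t s) {
    return s == SW_IM || s == SW_CL || s == SW_PC || s == SW_PD;
}

/* SW-to-HW bridge, as paired in the diagram. */
line_state_t sw_to_hw(line_state_t s) {
    switch (s) {
    case SW_IM: return HW_S;  /* immutable data may stay cached shared   */
    case SW_CL: return HW_I;  /* clean lines are simply invalidated      */
    case SW_PC: return HW_S;  /* assumed mapping: lone clean copy shared */
    case SW_PD: return HW_M;  /* dirty private copy becomes the owner    */
    default:    return s;     /* already hardware-managed                */
    }
}
```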

  13. COHESION Architecture
     [Diagram: a coarse-grain region table (a global table of base_addr/size/valid entries covering segments such as code at 0x00000000, stack, and global data across the 4 GB memory), a fine-grain region table holding coherence bit vectors in a 16 MB table (1 bit per line in memory), and a sparse directory (one per L3 cache bank, strided across banks) with sharers, tag, and I/M/S state per set entry.]
     • Extension to the baseline directory protocol:
       – Addition 1: region table / bit vector in memory
       – Addition 2: one bit per line in the L2 cache (not shown)
     • SW writes the table → the COHESION controller executes the transition
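The fine-grain table is just a packed bit vector, one bit per cache line: 1 for software-managed, 0 for hardware-managed. A minimal sketch of the lookup and update, assuming a 32-byte line size and an invented `region_table_t` layout (the real table format is not specified on the slide):

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed line size, for illustration */

/* One coherence bit per line of the covered region. */
typedef struct {
    uint32_t base_addr;   /* region base, from the coarse-grain table */
    uint8_t  bits[1024];  /* covers 1024 * 8 lines                    */
} region_table_t;

/* 1 if the line holding addr is software-managed. */
int line_is_sw_coherent(const region_table_t *rt, uint32_t addr) {
    uint32_t line = (addr - rt->base_addr) / LINE_BYTES;
    return (rt->bits[line / 8u] >> (line % 8u)) & 1u;
}

/* Software writes the bit; in hardware, the COHESION controller
   observes the change and carries out the SWcc<->HWcc transition. */
void set_line_sw_coherent(region_table_t *rt, uint32_t addr, int sw) {
    uint32_t line = (addr - rt->base_addr) / LINE_BYTES;
    if (sw)
        rt->bits[line / 8u] |= (uint8_t)(1u << (line % 8u));
    else
        rt->bits[line / 8u] &= (uint8_t)~(1u << (line % 8u));
}
```

At 1 bit per 32-byte line, covering 4 GB of memory takes 4 GB / 32 / 8 = 16 MB, which matches the table size shown on the slide.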

  14. Example Software → Hardware Transitions
     [Table: the state of Cache 0, Cache 1, and Memory before the transition, and the resulting directory state. Case 1: both caches invalid, memory holds A B; the directory entry becomes I. Case 2: both caches hold clean copies of A B; the directory entry becomes S with both sharer bits set. Case 3: Cache 0 holds a dirty copy A' B; the directory entry becomes M with Cache 0 as owner.]
     • The application initiates transitions between SWcc and HWcc
     • The COHESION controller probes the L2s to reconstruct directory state
     • See the paper for other cases and the HWcc → SWcc direction
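The three cases above follow one rule, which can be sketched directly (names are illustrative; the real controller works on probe responses, not an in-memory array): a dirty copy anywhere yields M with that cache as owner, clean copies yield S with a sharer vector, and no copies yield I.

```c
#include <assert.h>

typedef enum { L2_INVALID, L2_CLEAN, L2_DIRTY } l2_state_t;
typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

typedef struct {
    dir_state_t state;
    unsigned    sharers;  /* bit i set -> cache i holds the line */
} dir_entry_t;

/* Rebuild a directory entry from the probed L2 states of one line. */
dir_entry_t reconstruct(const l2_state_t caches[], int ncaches) {
    dir_entry_t e = { DIR_I, 0u };
    for (int i = 0; i < ncaches; ++i) {
        if (caches[i] == L2_DIRTY) {
            e.state   = DIR_M;       /* dirty copy: single owner */
            e.sharers = 1u << i;
            return e;
        }
        if (caches[i] == L2_CLEAN) {
            e.state    = DIR_S;      /* clean copies: shared */
            e.sharers |= 1u << i;
        }
    }
    return e;                        /* no copies: invalid */
}
```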

  15. Static COHESION Example (1 of 3)
     [Diagram: data regions for two grid blocks from a 2D stencil computation. The large private block interiors are SWcc; the small regions shared between a reader and a writer at the block boundary are HWcc.]
     • COHESION provides static partitioning of data
     • (Large) read-only/private regions → SWcc
     • (Small) shared regions → HWcc
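For the stencil case, the static partitioning decision is simple to state: within a block, only the halo rows at each edge are read or written by a neighboring block, so only those need HWcc. A sketch of that predicate (the function name and parameters are illustrative, not the paper's API):

```c
#include <assert.h>

/* 1 if `row` lies in the shared halo of a block that starts at
   `block_start` and spans `block_rows` rows, with `halo` boundary
   rows at each edge; such rows would be marked HWcc, the interior
   rows SWcc. */
int row_needs_hwcc(int row, int block_start, int block_rows, int halo) {
    int off = row - block_start;
    if (off < 0 || off >= block_rows)
        return 0;                       /* not in this block */
    return off < halo || off >= block_rows - halo;
}
```

Because the halo is a thin band around a large interior, the bulk of the data avoids directory tracking entirely.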

  16. Dynamic COHESION Example (2 of 3)
     Parallel sort on four cores (P0-P3):
     1. Parallel quicksort (each core's partition is private SWcc data)
     2. Sequential selection sort (phase 0)
     3. Sequential selection sort (phases 1-N)
     4. Result visible to all (sorted)
     [Diagram: as sorting proceeds, partitions transition from SWcc data to HWcc data via SWcc → HWcc transitions, until the sorted result is HWcc and visible to all cores.]

  17. System SW COHESION Example (3 of 3)
     • Problem: supporting multitasking with SWcc
     • OS process creation workflow:
       1. Runtime allocates the process's memory as HWcc
       2. Start the new process
       3. The process runs, migrates, and performs SWcc ↔ HWcc transitions
       4. The process exits
       5. Runtime makes the allocated memory HWcc again
     • COHESION enables: migration, isolation, cleanup
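The five steps above can be sketched as a tiny state machine over a process's memory mode. All names here are illustrative (there is no such runtime API on the slide), and the rule that a process returns its regions to HWcc before migrating is an assumption drawn from the slide's migration/cleanup claim:

```c
#include <assert.h>

typedef enum { MODE_HWCC, MODE_SWCC } mem_mode_t;

typedef struct {
    mem_mode_t mode;
    int        running;
} proc_mem_t;

void proc_alloc(proc_mem_t *p)   { p->mode = MODE_HWCC; p->running = 0; } /* step 1 */
void proc_start(proc_mem_t *p)   { p->running = 1; }                      /* step 2 */

/* Step 3: while running, the process may flip its own regions to
   SWcc for locality; before migrating to another cluster it
   returns them to HWcc so the new home can safely take over
   (assumed policy, for illustration). */
void proc_use_swcc(proc_mem_t *p) { if (p->running) p->mode = MODE_SWCC; }
void proc_migrate(proc_mem_t *p)  { p->mode = MODE_HWCC; }

/* Steps 4-5: on exit the runtime reclaims everything as HWcc. */
void proc_exit(proc_mem_t *p)     { p->running = 0; p->mode = MODE_HWCC; }
```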

  18. Network Message Reductions
     [Chart: relative number of messages for SWcc, COHESION, HWcc Ideal, and HWcc Real on each benchmark (cg, dmm, gjk, heat, kmeans, mri, sobel, stencil), broken down into read requests, write requests, instruction requests, uncached/atomics, cache evictions, software flushes, read releases, and probe responses.]
     • HWcc Real: HWcc-only with the sparse directory used by COHESION
     • HWcc Ideal: full on-die directory
     • Benefit: lessens constraints on network design

  19. Directory Size Sensitivity
     [Chart: runtime normalized to DIR-FULL versus directory entries per L3 cache bank (16384 down to 256), without COHESION (left) and with COHESION (right), for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Without COHESION, runtime climbs sharply as entries shrink; with COHESION the curves stay nearly flat.]
     • Reduces performance cliffs in sparse directory designs
     • Benefit: smaller on-die coherence structures

  20. Runtime: COHESION, SWcc, HWcc
     [Chart: runtime normalized to COHESION for COHESION, SWcc, optimized HWcc, and HWcc on each benchmark; most bars are near 1.0x, with HWcc outliers of 3.88x, 7.09x, 9.19x, and 9.21x.]
     • Performance close to SWcc and full-directory HWcc
     • Reduces network/directory overhead without performance loss
     • Further HWcc → SWcc optimizations possible

  21. Conclusions
     • Why COHESION? CMPs with multiple memory models
     • Usage scenarios identified:
       – System software / migratory tasks with SWcc
       – Application uses: static, dynamic, and host + accelerator
       – Optimization path: piecemeal HWcc → SWcc
     • A hybrid memory model has potential:
       – Reduces strain on the HWcc implementation
       – Reduces network constraints
       – Competitive performance
