

  1. COHESION: A Hybrid Memory Model for Accelerators
     John H. Kelm, Daniel R. Johnson, William Tuohy, Steven S. Lumetta and Sanjay J. Patel
     University of Illinois at Urbana-Champaign

  2. Chip Multiprocessors Today
     • General-purpose + accelerators (e.g., GPUs)
     • General-purpose CMP challenges:
       1. Programmability
       2. Power/performance density of ILP-centric cores
       3. Scalability of HW coherence and strict memory models
     • Accelerator challenges:
       1. Inflexible programming/execution models
       2. Hard to scale irregular parallel apps
       3. Lack of a conventional memory model
     John H. Kelm 2

  3. Chip Multiprocessors Tomorrow
     • Industry trend: integration over time
     • Hybrids: accelerators + CPUs together on die
     • More core/compute heterogeneity, but more homogeneity in the memory model
     [Diagram: past, discrete CPU and GPU with separate memories; present, CPU + GPU sharing a die; future, CPU + accelerator sharing memory]
     • Our proposal: a hybrid memory model

  4. CMP Memory Model Choices
     Conventional multicore CPU (e.g., Intel i7, Sun Niagara):
     • Optimized for: minimal latency, tightly coupled sharing, fine-grained synchronization, minimal programmer effort
     • Provides: single address space, hardware caching, strong ordering, HW-managed coherence
     Contemporary accelerator (e.g., NVIDIA GPU, IBM Cell):
     • Optimized for: maximum throughput, loosely coupled sharing, coarse-grained synchronization, short silicon design cycle
     • Provides: multiple address spaces, scratchpad memories, relaxed ordering, SW-managed coherence

  5. Roadmap
     • Motivation and context
     • Problem statement
     • COHESION design
     • Use cases and programming examples
     Addressed in this talk:
     1. Opportunity: Is combining protocols worthwhile?
     2. Feasibility: How does one implement a hybrid memory model?
     3. Tradeoffs: What are the tradeoffs between HWcc and SWcc?
     4. Benefit: What does hybrid coherence get you?

  6. Problem: Scalable Coherence
     • Available architectures:
       – Accelerators: 100s of cores, TFLOPS, no coherence
       – CMPs: <10s of cores, GFLOPS, HW coherence
       – Multiple memory models on-die
     • What developers want in heterogeneous CMPs:
       – Hardware caches (locality)
       – Single address space (no data marshalling)
       – Minimal changes to current practices
     • Goal: accelerator scalability with a CMP memory model

  7. Baseline Architecture
     [Diagram: 1024-core processor organization. Eight DRAM banks (0-7) feed eight L3 cache banks, each with an L3 directory cache bank, connected through an unordered multistage interconnect to 128 clusters (0-127). Each Rigel cluster contains eight cores (C0-C7) sharing an L2 cache and a cluster controller.]
     • Variant of the Rigel architecture [Kelm et al. ISCA'09]
     • 1024-core CMP, HW caches, single address space, MIMD

  8. Opportunity: HWcc v. SWcc Shootout
     [Chart: runtime normalized to ideal SWcc for cg, dmm, gjk, sobel, kmeans, mri, march, heat, and stencil, comparing ideal SWcc, the best SWcc (of four policies), and a full directory (ideal on-die HWcc). Some benchmarks favor SWcc, some favor HWcc, and some are near parity.]
     • Note: lower bars are better
     • Question: Can we leverage both HW and SW protocols?
     * SWcc based on the Task-Centric Memory Model [Kelm et al. PACT'09][Kelm et al. IEEE Micro'10]

  9. Opportunity: Network Traffic Reduction
     [Chart: relative number of messages for SWcc (left bars, baseline architecture) and HWcc (right bars, DIR-FULL) on each benchmark, broken down into read requests, write requests, instruction requests, uncached/atomic operations, cache evictions, software flushes, probe responses, and read releases.]
     • SWcc: fewer L2 messages in the network, some flush overhead
     • HWcc: extraneous messages for unshared data (write requests, read releases)

  10. Opportunity: Reduce Directory Utilization
     [Chart: average number of directory entries allocated, split into stack, heap/global, and code, versus the 256K maximum possible (red line). For many benchmarks the maximum is never reached.]
     • Not all entries are used, so die area is wasted
     • Observations:
       1. Use SWcc when possible to reduce network traffic
       2. Build a smaller sparse directory for the common case
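The second observation can be made concrete with a toy model. The sketch below (an illustrative direct-mapped design, not the deck's actual hardware; sizes and names are invented) shows why a sparse directory with a fixed entry budget is attractive when entries go unused, and where its risk lies: once a set fills, each new allocation forces an eviction, which in a real system invalidates the victim's sharers.

```c
#include <assert.h>

/* Illustrative sparse directory with DIR_SETS sets of DIR_WAYS
   entries each. All constants are assumptions for illustration. */
#define DIR_SETS 4
#define DIR_WAYS 2

typedef struct {
    unsigned tag[DIR_SETS][DIR_WAYS];
    int      valid[DIR_SETS][DIR_WAYS];
} sparse_dir_t;

/* Allocate an entry for line_addr; returns 1 if the set was full
   and a victim had to be evicted (the source of performance
   cliffs when the directory is sized too small). */
int dir_allocate(sparse_dir_t *d, unsigned line_addr) {
    unsigned set = line_addr % DIR_SETS;
    unsigned tag = line_addr / DIR_SETS;
    for (int w = 0; w < DIR_WAYS; ++w)      /* already tracked? */
        if (d->valid[set][w] && d->tag[set][w] == tag)
            return 0;
    for (int w = 0; w < DIR_WAYS; ++w)      /* free entry? */
        if (!d->valid[set][w]) {
            d->valid[set][w] = 1;
            d->tag[set][w]   = tag;
            return 0;
        }
    d->tag[set][0] = tag;                   /* evict way 0 */
    return 1;
}
```

With SWcc handling private data, far fewer lines ever reach `dir_allocate`, so the small directory rarely hits the eviction path.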

  11. COHESION: Toward a Hybrid Memory Model
     • Support for coherence domain transitions:
       1. A protocol for safe migration SWcc ↔ HWcc
       2. Minor architecture extensions
     • Motivation:
       ↑ HWcc: supports arbitrary sharing, no SW overhead
       ↓ HWcc: area + message overhead
       ↑ SWcc: removes HW overheads + design complexity
       ↓ SWcc: flush overhead + coherence burden on SW

  12. Protocol Synthesis
     [State diagram: the software-managed states Immutable (SW IM), Clean (SW CL), Private Clean (SW PC, tracked per word), and Private Dirty (SW PD, tracked per line) are bridged to the hardware MSI states Shared (HW S), Invalid (HW I), and Modified (HW M). Loads, stores, invalidations, writebacks, and synchronization drive transitions; RdReq/RdRel and WrReq/WrRel messages implement the SW-to-HW transitions.]
     • Creates a bridge between SWcc and HWcc
     • Leverages existing HWcc and SWcc techniques
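The pairings drawn on this slide can be sketched as a state table. This is a minimal sketch of the SW-to-HW direction only: the IM→S, CL→I, and PD→M mappings follow the diagram, while the mapping chosen for SW PC is an assumption for illustration (a single clean private copy can safely be treated as Shared), not the paper's definition.

```c
#include <assert.h>

/* Line states named after the slide's diagram:
   HW_* are the hardware MSI states, SW_* the software-managed ones. */
typedef enum {
    HW_I,   /* Invalid                       */
    HW_S,   /* Shared                        */
    HW_M,   /* Modified                      */
    SW_IM,  /* Immutable                     */
    SW_CL,  /* Clean                         */
    SW_PC,  /* Private, clean (per word)     */
    SW_PD   /* Private, dirty (per line)     */
} line_state_t;

int is_sw_managed(line_state_t s) {
    return s == SW_IM || s == SW_CL || s == SW_PC || s == SW_PD;
}

/* SW-to-HW bridge, as paired in the diagram. */
line_state_t sw_to_hw(line_state_t s) {
    switch (s) {
    case SW_IM: return HW_S;  /* immutable data may stay cached shared   */
    case SW_CL: return HW_I;  /* clean lines are simply invalidated      */
    case SW_PC: return HW_S;  /* assumed mapping: lone clean copy shared */
    case SW_PD: return HW_M;  /* dirty private copy becomes the owner    */
    default:    return s;     /* already hardware-managed                */
    }
}
```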

  13. COHESION Architecture
     [Diagram: a coarse-grain region table (a global table of base_addr/size/valid entries covering segments such as code at 0x00000000, stack, and global data across the 4 GB memory), a fine-grain region table holding coherence bit vectors in a 16 MB table (1 bit per line in memory), and a sparse directory (one per L3 cache bank, strided across banks) with sharers, tag, and I/M/S state per set entry.]
     • Extension to the baseline directory protocol:
       – Addition 1: region table / bit vector in memory
       – Addition 2: one bit per line in the L2 cache (not shown)
     • SW writes the table → the COHESION controller executes the transition
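The fine-grain table is just a packed bit vector, one bit per cache line: 1 for software-managed, 0 for hardware-managed. A minimal sketch of the lookup and update, assuming a 32-byte line size and an invented `region_table_t` layout (the real table format is not specified on the slide):

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed line size, for illustration */

/* One coherence bit per line of the covered region. */
typedef struct {
    uint32_t base_addr;   /* region base, from the coarse-grain table */
    uint8_t  bits[1024];  /* covers 1024 * 8 lines                    */
} region_table_t;

/* 1 if the line holding addr is software-managed. */
int line_is_sw_coherent(const region_table_t *rt, uint32_t addr) {
    uint32_t line = (addr - rt->base_addr) / LINE_BYTES;
    return (rt->bits[line / 8u] >> (line % 8u)) & 1u;
}

/* Software writes the bit; in hardware, the COHESION controller
   observes the change and carries out the SWcc<->HWcc transition. */
void set_line_sw_coherent(region_table_t *rt, uint32_t addr, int sw) {
    uint32_t line = (addr - rt->base_addr) / LINE_BYTES;
    if (sw)
        rt->bits[line / 8u] |= (uint8_t)(1u << (line % 8u));
    else
        rt->bits[line / 8u] &= (uint8_t)~(1u << (line % 8u));
}
```

At 1 bit per 32-byte line, covering 4 GB of memory takes 4 GB / 32 / 8 = 16 MB, which matches the table size shown on the slide.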

  14. Example Software → Hardware Transitions
     [Table: the state of Cache 0, Cache 1, and Memory before the transition, and the resulting directory state. Case 1: both caches invalid, memory holds A B; the directory entry becomes I. Case 2: both caches hold clean copies of A B; the directory entry becomes S with both sharer bits set. Case 3: Cache 0 holds a dirty copy A' B; the directory entry becomes M with Cache 0 as owner.]
     • The application initiates transitions between SWcc and HWcc
     • The COHESION controller probes the L2s to reconstruct directory state
     • See the paper for other cases and the HWcc → SWcc direction
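The three cases above follow one rule, which can be sketched directly (names are illustrative; the real controller works on probe responses, not an in-memory array): a dirty copy anywhere yields M with that cache as owner, clean copies yield S with a sharer vector, and no copies yield I.

```c
#include <assert.h>

typedef enum { L2_INVALID, L2_CLEAN, L2_DIRTY } l2_state_t;
typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

typedef struct {
    dir_state_t state;
    unsigned    sharers;  /* bit i set -> cache i holds the line */
} dir_entry_t;

/* Rebuild a directory entry from the probed L2 states of one line. */
dir_entry_t reconstruct(const l2_state_t caches[], int ncaches) {
    dir_entry_t e = { DIR_I, 0u };
    for (int i = 0; i < ncaches; ++i) {
        if (caches[i] == L2_DIRTY) {
            e.state   = DIR_M;       /* dirty copy: single owner */
            e.sharers = 1u << i;
            return e;
        }
        if (caches[i] == L2_CLEAN) {
            e.state    = DIR_S;      /* clean copies: shared */
            e.sharers |= 1u << i;
        }
    }
    return e;                        /* no copies: invalid */
}
```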

  15. Static COHESION Example (1 of 3)
     [Diagram: data regions for two grid blocks from a 2D stencil computation. The large private block interiors are SWcc; the small regions shared between a reader and a writer at the block boundary are HWcc.]
     • COHESION provides static partitioning of data
     • (Large) read-only/private regions → SWcc
     • (Small) shared regions → HWcc
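For the stencil case, the static partitioning decision is simple to state: within a block, only the halo rows at each edge are read or written by a neighboring block, so only those need HWcc. A sketch of that predicate (the function name and parameters are illustrative, not the paper's API):

```c
#include <assert.h>

/* 1 if `row` lies in the shared halo of a block that starts at
   `block_start` and spans `block_rows` rows, with `halo` boundary
   rows at each edge; such rows would be marked HWcc, the interior
   rows SWcc. */
int row_needs_hwcc(int row, int block_start, int block_rows, int halo) {
    int off = row - block_start;
    if (off < 0 || off >= block_rows)
        return 0;                       /* not in this block */
    return off < halo || off >= block_rows - halo;
}
```

Because the halo is a thin band around a large interior, the bulk of the data avoids directory tracking entirely.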

  16. Dynamic COHESION Example (2 of 3)
     Parallel sort on four cores (P0-P3):
     1. Parallel quicksort (each core's partition is private SWcc data)
     2. Sequential selection sort (phase 0)
     3. Sequential selection sort (phases 1-N)
     4. Result visible to all (sorted)
     [Diagram: as sorting proceeds, partitions transition from SWcc data to HWcc data via SWcc → HWcc transitions, until the sorted result is HWcc and visible to all cores.]

  17. System SW COHESION Example (3 of 3)
     • Problem: supporting multitasking with SWcc
     • OS process creation workflow:
       1. Runtime allocates the process's memory as HWcc
       2. Start the new process
       3. The process runs, migrates, and performs SWcc ↔ HWcc transitions
       4. The process exits
       5. Runtime makes the allocated memory HWcc again
     • COHESION enables: migration, isolation, cleanup
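The five steps above can be sketched as a tiny state machine over a process's memory mode. All names here are illustrative (there is no such runtime API on the slide), and the rule that a process returns its regions to HWcc before migrating is an assumption drawn from the slide's migration/cleanup claim:

```c
#include <assert.h>

typedef enum { MODE_HWCC, MODE_SWCC } mem_mode_t;

typedef struct {
    mem_mode_t mode;
    int        running;
} proc_mem_t;

void proc_alloc(proc_mem_t *p)   { p->mode = MODE_HWCC; p->running = 0; } /* step 1 */
void proc_start(proc_mem_t *p)   { p->running = 1; }                      /* step 2 */

/* Step 3: while running, the process may flip its own regions to
   SWcc for locality; before migrating to another cluster it
   returns them to HWcc so the new home can safely take over
   (assumed policy, for illustration). */
void proc_use_swcc(proc_mem_t *p) { if (p->running) p->mode = MODE_SWCC; }
void proc_migrate(proc_mem_t *p)  { p->mode = MODE_HWCC; }

/* Steps 4-5: on exit the runtime reclaims everything as HWcc. */
void proc_exit(proc_mem_t *p)     { p->running = 0; p->mode = MODE_HWCC; }
```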

  18. Network Message Reductions
     [Chart: relative number of messages for SWcc, COHESION, HWcc Ideal, and HWcc Real on each benchmark (cg, dmm, gjk, heat, kmeans, mri, sobel, stencil), broken down into read requests, write requests, instruction requests, uncached/atomics, cache evictions, software flushes, read releases, and probe responses.]
     • HWcc Real: HWcc-only with the sparse directory used by COHESION
     • HWcc Ideal: full on-die directory
     • Benefit: lessens constraints on network design

  19. Directory Size Sensitivity
     [Chart: runtime normalized to DIR-FULL versus directory entries per L3 cache bank (16384 down to 256), without COHESION (left) and with COHESION (right), for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Without COHESION, runtime climbs sharply as entries shrink; with COHESION the curves stay nearly flat.]
     • Reduces performance cliffs in sparse directory designs
     • Benefit: smaller on-die coherence structures

  20. Runtime: COHESION, SWcc, HWcc
     [Chart: runtime normalized to COHESION for COHESION, SWcc, optimized HWcc, and HWcc on each benchmark; most bars are near 1.0x, with HWcc outliers of 3.88x, 7.09x, 9.19x, and 9.21x.]
     • Performance close to SWcc and full-directory HWcc
     • Reduces network/directory overhead without performance loss
     • Further HWcc → SWcc optimizations possible

  21. Conclusions
     • Why COHESION? CMPs with multiple memory models
     • Usage scenarios identified:
       – System software / migratory tasks with SWcc
       – Application uses: static, dynamic, and host + accelerator
       – Optimization path: piecemeal HWcc → SWcc
     • A hybrid memory model has potential:
       – Reduces strain on the HWcc implementation
       – Reduces network constraints
       – Competitive performance
