Johnathan Alsop*, Matthew D. Sinclair*†‡, Sarita V. Adve*
*Illinois, †AMD, ‡Wisconsin
Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)
Specialized architectures are increasingly important in all compute domains
Specialization Requires Better Memory Systems
Traditional heterogeneity (separate CPU and accelerator memory spaces; data copied in and out):
• No fine-grain synchronization
• No irregular access patterns
• Wasteful data movement
Shared coherent memory (CPU and accelerators share one coherent memory space):
✓ Fine-grain synchronization
✓ Irregular access
✓ Implicit data reuse
Existing solutions: complex and inflexible
Heterogeneous devices have diverse memory demands
[Diagram axes: fine-grain synch, latency sensitivity, temporal locality, spatial locality, throughput sensitivity]
• Typical CPU workloads: fine-grain synch, latency sensitive
• Typical GPU workloads: spatial locality, throughput sensitive
MESI protocol fits CPU workloads

| Properties | MESI | GPU coherence | DeNovo |
|---|---|---|---|
| Granularity | Line | Reads: Line; Writes: Word | Reads: Flexible; Writes: Word |
| Invalidation | Writer-invalidate | Self-invalidate | Self-invalidate |
| Updates | Ownership | Write-through | Ownership |
| Good for | CPU | GPU | CPU or GPU |

• Line granularity: ✓ spatial locality; cost: false sharing
• Writer-invalidation: ✓ temporal locality for reads; cost: overheads limit throughput
• Ownership: ✓ temporal locality for writes; cost: indirection if low locality

GPUs prefer simpler protocols
• Line reads, word writes: ✓ no false sharing; cost: synch limits spatial locality
• Self-invalidation: ✓ simple, scalable; cost: synch limits read reuse
• Write-through: ✓ simple, low overhead; cost: synch limits write reuse

DeNovo is a good fit for CPU and GPU
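
The comparison above can be encoded as data to show where DeNovo overlaps with each of the other two protocols. This is only an illustrative sketch: the names `ProtocolProperties`, `PROTOCOLS`, and `shared_properties` are hypothetical, and MESI's single "Line" entry is assumed to apply to both reads and writes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProtocolProperties:
    """One column of the protocol comparison (hypothetical encoding)."""
    read_granularity: str
    write_granularity: str
    invalidation: str
    updates: str


# Values copied from the slide table; MESI's "Line" is assumed to cover
# both reads and writes.
PROTOCOLS = {
    "MESI": ProtocolProperties("line", "line", "writer-invalidate", "ownership"),
    "GPU coherence": ProtocolProperties("line", "word", "self-invalidate", "write-through"),
    "DeNovo": ProtocolProperties("flexible", "word", "self-invalidate", "ownership"),
}


def shared_properties(a: str, b: str) -> set:
    """Return the names of the properties on which protocols a and b agree."""
    pa, pb = PROTOCOLS[a], PROTOCOLS[b]
    return {f.name for f in fields(pa) if getattr(pa, f.name) == getattr(pb, f.name)}
```

Under this encoding, DeNovo shares its update strategy (ownership) with MESI and its invalidation strategy and write granularity with GPU coherence, which is one way to read the "CPU or GPU" entry in the table.
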
Existing Solutions: Inflexible and Inefficient
[Diagram: CPU, GPU, Accel 1, and Accel 2 (unknown protocol), with MESI, GPU coh., and DeNovo L1s, attached to a MESI/GPU coh. hybrid L2 and a MESI LLC]
Examples: ARM ACE, IBM CAPI, AMD APU
If the glove doesn’t fit… There’s limited benefit!
Spandex: Flexible Heterogeneous Coherence Interface
[Diagram: CPU, GPU, Accel 1, and Accel 2 (unknown protocol), with MESI, GPU coh., and DeNovo L1s, all connected through Spandex]
• Adapts to exploit individual device’s workload attributes
• Better performance, lower complexity
⇒ Fits like a glove for any heterogeneous system!
Spandex Overview
Key Components
• Flexible device request interface
• DeNovo-based LLC
• External request interface
Device may need a translation unit (TU)
[Diagram: CPU, GPU, and Accel (unknown protocol) with MESI, GPU coh., and DeNovo L1s, each behind a TU, connected to the Spandex LLC through the device request interface and the external request interface]
Device Request Interface

| Action | Request | Indicates |
|---|---|---|
| Read | ReqV | Self-invalidation |
| Read | ReqS | Writer-invalidation |
| Write | ReqWT | Write-through |
| Write | ReqO | Ownership only |
| Read+Write | ReqWT+data | Atomic write-through |
| Read+Write | ReqO+data | Ownership + Data |
| Writeback | ReqWB | Owned data eviction |

Requests also specify granularity and (optionally) a bitmask
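
As a rough illustration of how a device L1 (or its translation unit) might pick a request from the table above, the sketch below keys the choice on the action plus the device's invalidation or update strategy. The names `Req`, `REQUEST_TABLE`, and `select_request` are hypothetical, and the sketch ignores the granularity and bitmask that real requests also carry.

```python
from enum import Enum
from typing import Optional


class Req(Enum):
    """Device request types listed on the slide."""
    REQ_V = "ReqV"
    REQ_S = "ReqS"
    REQ_WT = "ReqWT"
    REQ_O = "ReqO"
    REQ_WT_DATA = "ReqWT+data"
    REQ_O_DATA = "ReqO+data"
    REQ_WB = "ReqWB"


# (action, strategy) -> request, one entry per table row except writeback.
REQUEST_TABLE = {
    ("read", "self-invalidate"): Req.REQ_V,
    ("read", "writer-invalidate"): Req.REQ_S,
    ("write", "write-through"): Req.REQ_WT,
    ("write", "ownership"): Req.REQ_O,
    ("read+write", "write-through"): Req.REQ_WT_DATA,
    ("read+write", "ownership"): Req.REQ_O_DATA,
}


def select_request(action: str,
                   invalidation: Optional[str] = None,
                   updates: Optional[str] = None) -> Req:
    """Pick a request type (assumed policy: reads depend on the invalidation
    strategy, writes and read+writes depend on the update strategy)."""
    if action == "writeback":
        return Req.REQ_WB  # eviction of owned data
    strategy = invalidation if action == "read" else updates
    return REQUEST_TABLE[(action, strategy)]
```

For example, a self-invalidating cache would issue `ReqV` for a read, while an ownership-based cache would issue `ReqO+data` for a read+write under this assumed policy.
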
Spandex LLC
• States: I, V, O, S
• Allocation at line granularity
• Ownership at word granularity
• Data field tracks owner ID
• May generate requests to owner/sharer
✓ No false sharing
✓ Non-blocking ownership transfer
[Diagram: devices (Accel, CPU, GPU) with GPU coh., DeNovo, and MESI L1s; stores (ST) generating ReqWT/RspWT and ReqO/RspO exchanges with the Spandex LLC; LLC line fields: Tag, State, O Mask, Data (holding owner IDs)]
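
The line layout described above (line-granularity allocation, a per-word ownership mask, and owner IDs stored in the data field) can be sketched as a small data structure. This is a minimal model under stated assumptions, not the actual hardware design: `LLCLine`, `WORDS_PER_LINE = 16`, and the method names are hypothetical, and the line-level state transitions (I, V, O, S) are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WORDS_PER_LINE = 16  # assumed line size in words; not given on the slide


@dataclass
class LLCLine:
    """One LLC line: allocated as a whole, owned word by word."""
    tag: int
    owned_mask: int = 0  # bit i set means word i is owned by some device
    data: List[int] = field(default_factory=lambda: [0] * WORDS_PER_LINE)

    def owner_of(self, word: int) -> Optional[int]:
        """Return the owner ID of a word, or None if the LLC holds the data."""
        if (self.owned_mask >> word) & 1:
            return self.data[word]  # the data slot is reused for the owner ID
        return None

    def grant_ownership(self, word: int, owner_id: int) -> Optional[int]:
        """Record a new owner and return the previous owner, if any."""
        previous = self.owner_of(word)
        self.owned_mask |= 1 << word
        self.data[word] = owner_id
        return previous

    def write_back(self, word: int, value: int) -> None:
        """Clear ownership of a word and store the written-back value."""
        self.owned_mask &= ~(1 << word)
        self.data[word] = value
```

Because ownership is tracked per word, two devices can own different words of the same line without interfering, which corresponds to the "no false sharing" point on the slide.
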
External Request Interface

| External Request | Must handle if device supports state |
|---|---|
| ReqV | O |
| ReqO | O |
| ReqO+data | O |
| RvkO | O |
| Inv | S |
| ReqS | S and O |

Device L1 states:
• GPU coh. L1: I, V
• DeNovo L1: I, V, O
• MESI L1: I, S, O
• Translation Unit may implement O functionality if not supported by device
[Diagram: ReqV and RspV messages exchanged between device L1s and the Spandex LLC]
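
The table above can be read as a simple rule: a device only needs to handle an external request if it supports the listed state(s). The sketch below applies that rule to the three L1 state sets from the slide. The names `REQUIRED_STATES` and `external_requests_to_handle` are hypothetical, and "S and O" is interpreted as requiring both states.

```python
# External request -> device states that make handling it mandatory.
# "S and O" for ReqS is interpreted here as "supports both S and O".
REQUIRED_STATES = {
    "ReqV": {"O"},
    "ReqO": {"O"},
    "ReqO+data": {"O"},
    "RvkO": {"O"},
    "Inv": {"S"},
    "ReqS": {"S", "O"},
}


def external_requests_to_handle(device_states):
    """Return the external requests a device with these states must handle."""
    supported = set(device_states)
    return {req for req, needed in REQUIRED_STATES.items() if needed <= supported}


# State sets as listed on the slide.
GPU_COH_L1 = {"I", "V"}
DENOVO_L1 = {"I", "V", "O"}
MESI_L1 = {"I", "S", "O"}
```

Under this reading, a GPU coherence L1 (states I, V) never has to respond to external requests, a DeNovo L1 handles only the ownership-related ones, and a MESI L1 handles all six.
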
Evaluation: Configurations

| Configuration | LLC protocol | CPU protocol | GPU protocol |
|---|---|---|---|
| HMG | Hierarchical MESI | MESI | GPU coherence |
| HMD | Hierarchical MESI | MESI | DeNovo |
| SMG | Spandex | MESI | GPU coherence |
| SMD | Spandex | MESI | DeNovo |
| SDG | Spandex | DeNovo | GPU coherence |
| SDD | Spandex | DeNovo | DeNovo |

[Diagram: Hierarchical MESI: GPU L1s share a GPU L2, which sits with the CPU L1s under a MESI LLC. Spandex: CPU L1s and GPU L1s connect to the Spandex LLC.]
CPU-GPU workloads from Pannotia and Chai benchmark suites
Evaluation: CPU-GPU Applications
[Figure: execution time (cycles), plotted as percentages (0% to 120%), for configurations HMG, HMD, SMG, SMD, SDG, and SDD on workloads BC, PR, HSTI, TRNS, RSCT, and TQH, plus an Average group with bars labeled Hbest and Sbest]
• Different workloads prefer different protocols
• Spandex flexibility ⇒ consistently better execution time (avg 16% lower)
Evaluation: CPU-GPU Applications
[Figure: network traffic (flits), plotted as percentages (0% to 100%) and broken down into Probe, ReqWT+data, ReqWB/WT, ReqO[+data], and ReqV/S, for configurations HMG, HMD, SMG, SMD, SDG, and SDD on workloads BC, PR, HSTI, TRNS, RSCT, and TQH, plus an Average group with bars labeled Hbest and Sbest]
• Spandex flexibility ⇒ consistently better network traffic (avg 27% lower)
Conclusion and Future Work
[Diagram labels: Producer, Consumer, MESI LLC]
⇒ Simple, Flexible, Efficient
Future Work: exploit SW or HW hints about data access patterns
• Dynamic Spandex request selection
• Producer-consumer forwarding
• Extended granularity flexibility