SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding

Daniel Sanchez and Christos Kozyrakis, Stanford University. HPCA-18, February 27th, 2012.


  1. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. Daniel Sanchez and Christos Kozyrakis, Stanford University. HPCA-18, February 27th, 2012

  2. Executive Summary
     - Directories are hard to scale and degrade performance
     - SCD: a scalable directory with performance guarantees
     - Flexible sharer set encoding: lines with few sharers use one entry, widely shared lines use multiple entries → scalability
     - Use ZCache → efficient high associativity, analytical models → negligible invalidations with minimal overprovisioning (~10%)
     - At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory

  3. Outline
     - Introduction
     - SCD Design
     - Analytical Bounds on Overprovisioning
     - Evaluation

  4. Directory-Based Coherence
     - [Figure: eight cores, each with a private L2, sharing an L3 and a directory, backed by main memory]
     - Scalable coherence protocols use a directory
     - Tracks contents of private caches
     - Ordering point for conflicting requests

  5. Directory-Induced Invalidations
     - Limited associativity → to track A, the directory must invalidate B, C, D, or E
     - [Figure: Core 0 issues ld A, which sends GET A to the directory; the directory sends INV B to the private L2s that hold B; a later ld B then misses]
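
The scenario on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the `TinyDirectorySet` class, the LRU victim choice, and the specific cores that hold B are all assumptions made for the example.

```python
from collections import OrderedDict


class TinyDirectorySet:
    """One set of a set-associative directory (hypothetical model, LRU victim)."""

    def __init__(self, ways=4):
        self.ways = ways
        self.entries = OrderedDict()  # line -> set of sharer core ids, LRU first

    def get(self, line, core):
        """Handle a GET from `core`; return the (line, core) invalidations sent."""
        invalidations = []
        if line not in self.entries:
            if len(self.entries) == self.ways:
                # No free way: evict the LRU entry and invalidate its sharers.
                victim, sharers = self.entries.popitem(last=False)
                invalidations = [(victim, s) for s in sorted(sharers)]
            self.entries[line] = set()
        self.entries.move_to_end(line)  # mark as most recently used
        self.entries[line].add(core)
        return invalidations


dir_set = TinyDirectorySet(ways=4)
dir_set.get("B", 2)  # assumed: cores 2 and 5 share line B
dir_set.get("B", 5)
for line, core in [("C", 1), ("D", 3), ("E", 4)]:
    dir_set.get(line, core)

invs = dir_set.get("A", 0)  # Core 0 loads A; the set is full
print(invs)
```

The sharers of B lose a line they were still using, purely because the directory ran out of ways, which is the cost the rest of the talk tries to eliminate.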

  6. Desirable Directory Properties
     1. Scalability: latency, energy, area → constant or log(cores) growth
     2. Minimal complexity: no changes to coherence protocol
     3. Exact sharer information
     4. Negligible directory-induced invalidations, with minimal, bounded overprovisioning

  7. Sparse Full-Map Directories
     - Associative array indexed by address
     - Sharer sets encoded in a bit-vector
     - [Figure: 4-way array (Way 1 to Way 4). Directory entry format: line address, coherence state, sharer set. Example entry: 0xF00, Shared, 0 1 0 0 1 1 0 0]
     - Single lookup → low latency, energy-efficient
     - Bit-vectors grow with # cores → area scales poorly
     - Limited associativity → directory-induced invalidations, overprovisioning (~2x)
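
As a rough illustration of the full-map encoding, the sketch below decodes the example sharer set from the slide and shows how the bit-vector width tracks the core count. The helper names are hypothetical, and the bit order (core 0 is the leftmost bit) is an assumption.

```python
def encode_sharers(sharers, num_cores):
    """Return a full-map bit-vector string; assumes core 0 is the leftmost bit."""
    bits = ["0"] * num_cores
    for core in sharers:
        if not 0 <= core < num_cores:
            raise ValueError(f"core {core} out of range")
        bits[core] = "1"
    return "".join(bits)


def decode_sharers(bitvec):
    """Return the sorted list of core ids whose bit is set."""
    return [i for i, b in enumerate(bitvec) if b == "1"]


example = "01001100"  # sharer set from the slide, 8 cores
print(decode_sharers(example))

# One bit per core: the sharer field alone grows linearly with core count.
widths = {n: len(encode_sharers({0}, n)) for n in (128, 256, 512, 1024)}
print(widths)
```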

  8. Hierarchical Sparse Directories
     - Multi-level hierarchy of sparse directories
     - [Figure: one level-2 directory tracking L1 dirs 0-31; 32 level-1 directories tracking cores 0-31, cores 32-63, …, cores 992-1023]
     - Small bit-vectors → scalable area & energy
     - Multiple lookups in critical path → additional latency
     - Needs hierarchical coherence protocol → more complexity
     - Directory-induced invalidations more expensive

  9. Single-Level Dirs with Inexact Sharer Sets
     - Coarse-grain bit-vectors (e.g., 1 bit for every 4 cores)
     - Limited pointers: maintain a few sharer pointers, invalidate or broadcast on overflow
     - Tagless [MICRO 09]: encode sharers with Bloom filters
     - SPACE [PACT 10]: de-duplicate sharing patterns
     - Reduced area & energy overheads, but overheads still not scalable
     - Inexact sharers → broadcasts, invalidations or spurious lookups
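
To make "inexact sharers" concrete, here is a small sketch of a coarse-grain bit-vector with 1 bit per 4 cores, as in the slide's example. The function names, the 16-core system, and the chosen sharers are illustrative assumptions.

```python
def coarse_encode(sharers, num_cores, cores_per_bit=4):
    """Set one bit per group of `cores_per_bit` cores that contains a sharer."""
    bits = [0] * (num_cores // cores_per_bit)
    for core in sharers:
        bits[core // cores_per_bit] = 1
    return bits


def coarse_decode(bits, cores_per_bit=4):
    """Return every core covered by a set bit (a superset of the true sharers)."""
    cores = []
    for group, bit in enumerate(bits):
        if bit:
            cores.extend(range(group * cores_per_bit, (group + 1) * cores_per_bit))
    return cores


true_sharers = {5, 9}  # hypothetical sharers in a 16-core system
bits = coarse_encode(true_sharers, num_cores=16)
decoded = coarse_decode(bits)
spurious = sorted(set(decoded) - true_sharers)
print(bits, decoded, spurious)
```

Two real sharers decode to eight candidate cores, so six cores receive invalidations or lookups they did not need.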

  10. Efficient Highly-Associative Caches
      - ZCache [MICRO 10]: high-associativity cache with few ways
        - Draws from skew-associativity and Cuckoo hashing
        - Hits take a single lookup
        - On a miss, the replacement process provides many candidates
        - Provides cheap high associativity (e.g., 64-way associativity with 4 ways)
        - Described by simple & accurate analytical models
        - [Figure: the line address is hashed by H1, H2, H3 to index Way1, Way2, Way3]
      - Cuckoo Directory [Ferdman et al., HPCA 11]:
        - Apply Cuckoo hashing to sparse directories
        - Empirically show that smaller overprovisioning (~25%) eliminates most invalidations

  11. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  12. Scalable Coherence Directory: Insights
      - Use ZCache → cheap high associativity
      - Analytical models → bounds on overprovisioning
        - Negligible difference with ideal directory regardless of workload
        - Validated in simulation
      - Provision space per tracked sharer, not line
        - Flexible sharer set encoding: lines with few sharers use a single entry, widely shared lines use additional entries

  13. SCD Array
      - ZCache array indexed by (Line Address, Entry Number) → allows multiple entries per line
      - [Figure: (Line Address, Entry Number) is hashed by H1, H2, H3 to index Way1, Way2, Way3]
      - Insertions walk the array until an unused entry is found, or a limit of candidates (R) is reached, then invalidate one
        - Could use a replacement policy to decide victim
        - Evictions are negligible → no need for replacement policy

  14. SCD Entry Formats
      - Example: 1024 sharers. Every entry holds a line address (44b), a type (2b), and a 37b type-dependent payload:

      | Entry type | Type (2b) | Payload (37b) |
      |---|---|---|
      | INVALID | 00 | Unused (37b) |
      | LIMITED POINTERS | 01 | Coherence state (5b), #ptrs (2b), 3x 10-bit sharer pointers (30b) |
      | ROOT BIT-VECTOR | 10 | Coherence state (5b), root bit-vector (32b) |
      | LEAF BIT-VECTOR | 11 | Leaf number (5b), leaf bit-vector (32b) |

      - Lines with one or few sharers use a limited pointer entry
      - Lines with >3 sharers use root + leaves bit-vector entries
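
The field widths above can be checked with a small packing sketch for the limited-pointers format. The widths come from the slide; the order of fields inside the entry, the numeric state encoding, and the function names are assumptions made for illustration.

```python
ADDR_BITS, TYPE_BITS, PAYLOAD_BITS = 44, 2, 37   # widths from the slide
STATE_BITS, NPTR_BITS, PTR_BITS, MAX_PTRS = 5, 2, 10, 3
TYPE_LIMITED = 0b01
ENTRY_BITS = ADDR_BITS + TYPE_BITS + PAYLOAD_BITS


def pack_limited(line_addr, state, pointers):
    """Pack a limited-pointers entry (assumed field order: addr, type, payload)."""
    if line_addr >= 1 << ADDR_BITS or state >= 1 << STATE_BITS:
        raise ValueError("field does not fit")
    if len(pointers) > MAX_PTRS or any(p >= 1 << PTR_BITS for p in pointers):
        raise ValueError("too many pointers or pointer out of range")
    payload = (state << NPTR_BITS) | len(pointers)
    for i in range(MAX_PTRS):
        ptr = pointers[i] if i < len(pointers) else 0  # unused slots are zero
        payload = (payload << PTR_BITS) | ptr
    return (((line_addr << TYPE_BITS) | TYPE_LIMITED) << PAYLOAD_BITS) | payload


def unpack_limited(entry):
    """Inverse of pack_limited; returns (line_addr, type, state, pointers)."""
    payload = entry & ((1 << PAYLOAD_BITS) - 1)
    entry_type = (entry >> PAYLOAD_BITS) & ((1 << TYPE_BITS) - 1)
    line_addr = entry >> (PAYLOAD_BITS + TYPE_BITS)
    ptrs = []
    for _ in range(MAX_PTRS):
        ptrs.append(payload & ((1 << PTR_BITS) - 1))
        payload >>= PTR_BITS
    ptrs.reverse()  # pointers were extracted last-to-first
    nptrs = payload & ((1 << NPTR_BITS) - 1)
    state = payload >> NPTR_BITS
    return line_addr, entry_type, state, ptrs[:nptrs]


STATE_S = 1  # hypothetical numeric encoding of the Shared state
entry = pack_limited(0x5CA1AB1E, STATE_S, [37, 265, 267])
print(ENTRY_BITS, unpack_limited(entry))
```

All four formats share the same 83-bit width under these numbers, which is what lets them live in one array.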

  15. Example: Adding a Sharer
      - Initial entry: 0x5CA1AB1E, type 01 (LIMITED POINTERS), state S, 3 pointers: 37, 265, 267
      - Add sharer 64 to address 0x5CA1AB1E:
        1. Lookup (0x5CA1AB1E, 0); all pointers are used → switch to multi-entry format
        2. Allocate entries (0x5CA1AB1E, leafNum+1) with leafNum=1,2,8
        3. Write leaf bit-vectors
        4. Write (0x5CA1AB1E, 0) as a root bit-vector
      - Resulting entries:

      | Line address | Type | State / leaf number | Bit-vector | Format |
      |---|---|---|---|---|
      | 0x5CA1AB1E | 10 | S | 01100000 10000000 0…0 0…0 | ROOT |
      | 0x5CA1AB1E | 11 | 1 | 00000010 00000000 0…0 0…0 | LEAF |
      | 0x5CA1AB1E | 11 | 2 | 10000000 00000000 0…0 0…0 | LEAF |
      | 0x5CA1AB1E | 11 | 8 | 00000000 10100000 0…0 0…0 | LEAF |
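
The switch from limited pointers to the root + leaves format can be sketched as follows. This is a simplified model under stated assumptions: sharer `s` is assumed to map to leaf `s // 32` and bit `s % 32`, bit-vectors are held as Python integers with bit 0 as the least significant bit (the slide's drawing may order or number bits differently), and `add_sharer` is a hypothetical helper, not the authors' controller logic.

```python
LEAF_BITS = 32  # 1024 sharers = 32 leaves x 32 bits, per the entry formats


def to_root_and_leaves(sharers):
    """Split a sharer set into a root bit-vector and per-leaf bit-vectors."""
    root, leaves = 0, {}
    for s in sharers:
        leaf_num, bit = divmod(s, LEAF_BITS)  # assumed sharer-to-leaf mapping
        root |= 1 << leaf_num                 # mark the leaf as present
        leaves[leaf_num] = leaves.get(leaf_num, 0) | (1 << bit)
    return root, leaves


def add_sharer(line_addr, pointers, new_sharer, max_ptrs=3):
    """Return the entries, keyed by (line_addr, entry_number), after the add."""
    if new_sharer in pointers or len(pointers) < max_ptrs:
        ptrs = sorted(set(pointers) | {new_sharer})
        return {(line_addr, 0): ("LIMITED", ptrs)}
    # All pointers are used: switch to the multi-entry format.
    root, leaves = to_root_and_leaves(list(pointers) + [new_sharer])
    entries = {(line_addr, 0): ("ROOT", root)}
    for leaf_num, vec in leaves.items():
        entries[(line_addr, leaf_num + 1)] = ("LEAF", leaf_num, vec)
    return entries


addr = 0x5CA1AB1E
entries = add_sharer(addr, [37, 265, 267], 64)
leaf_numbers = sorted(e[1] for e in entries.values() if e[0] == "LEAF")
print(leaf_numbers, bin(entries[(addr, 0)][1]))
```

Under this mapping the four sharers land in leaves 1, 2, and 8, matching the leaf numbers and root bit-vector in the slide's example.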

  16. SCD & Desirable Properties
      1. Scalability
         - Flexible sharer set encoding → scalable energy and area
         - Coherence state stored in a single entry → most operations have 1 lookup on critical path → scalable latency
      2. Minimal complexity
         - All entries in the same array → no coherence protocol changes
      3. Exact sharer information
      4. Negligible directory-induced invalidations, with minimal, bounded overprovisioning?

  17. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  18. Analytical Models
      - Directories built with ZCache arrays can be characterized with simple, workload-independent analytical models
      - Parameters: $W$ = ways; $R$ = replacement candidates; $occ$ = occupancy (fraction of used entries)
      - Fraction of insertions that cause a directory invalidation: $P_{inv} = occ^{R}$. Determines performance impact, interference
      - Average lookups per replacement: $AvgLookups = \frac{1 - occ^{R}}{W \, (1 - occ)}$. Determines replacement latency and energy

  19. Bounding Invalidations
      - SCD bounds invalidations with minimal overprovisioning → bounded worst-case behavior independent of workload
      - For $P_{inv} = 10^{-3}$: W=4, R=64, 11% overprovisioning → max directory occupancy 90%
      - Overprovisioning is:
        - Smaller than previous empirical results (25%-2x)
        - Bounded → strict guarantees, no design-time uncertainty
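
The provisioning numbers on this slide can be reproduced with a short calculation. The sketch assumes the models read $P_{inv} = occ^{R}$ and $AvgLookups = (1 - occ^{R}) / (W (1 - occ))$, and that overprovisioning means $1/occ - 1$; the function names are hypothetical.

```python
def max_occupancy(p_inv, r):
    """Largest occupancy with occ**r <= p_inv (assumed model: P_inv = occ^R)."""
    return p_inv ** (1.0 / r)


def overprovisioning(p_inv, r):
    """Extra capacity needed so tracked lines fill at most max_occupancy."""
    return 1.0 / max_occupancy(p_inv, r) - 1.0


def avg_lookups(occ, w, r):
    """Assumed model: (1 - occ^R) / (W * (1 - occ)) lookups per replacement."""
    return (1.0 - occ ** r) / (w * (1.0 - occ))


occ = max_occupancy(1e-3, r=64)
extra = overprovisioning(1e-3, r=64)
lookups = avg_lookups(occ, w=4, r=64)
print(f"max occupancy {occ:.1%}, overprovisioning {extra:.1%}, "
      f"avg lookups {lookups:.2f}")
```

With R=64 candidates, the target invalidation rate of 1 in 1000 insertions is met at roughly 90% occupancy, which corresponds to about 11% extra entries.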

  20. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  21. Methodology
      - Simulated system: 1024-core tiled CMP
        - In-order cores with split L1s
        - Private inclusive L2s, 128KB/core
        - Shared non-inclusive L3, 256MB
        - MESI directory protocol
      - Directory implementations: Sparse, 2-level Hierarchical, SCD
        - Directories 100%-provisioned for L2s
        - All directories use ZCache arrays → negligible invalidations
      - 14 workloads from PARSEC, SPLASH2, SPECOMP/JBB, BioParallel suites
      - [Figure: 64-tile CMP (1024 cores), with a detail of one 16-core tile; labels: Core, L3 Bank, Dir Ctrl, Router, Mem]

  22. Area

      | Cores | Sparse | Hierarchical | SCD | Sparse/SCD | Hier/SCD |
      |---|---|---|---|---|---|
      | 128 | 34.2% | 21.1% | 10.9% | 3.12x | 1.93x |
      | 256 | 59.2% | 24.2% | 12.5% | 4.73x | 1.94x |
      | 512 | 109.2% | 27.0% | 13.9% | 7.87x | 1.95x |
      | 1024 | 209.2% | 30.9% | 15.8% | 13.22x | 1.95x |

      - Area given as a percentage of L2 caches
      - At 1024 cores, SCD is:
        - 13x smaller than Sparse
        - 2x smaller than Hierarchical
        - Takes ~3% of total die area

  23. Performance
      - [Figure: bar chart of slowdown over Ideal Directory (%), y-axis 0 to 12, for Hierarchical, Sparse, and SCD on bscholes, applu, jbb, ocean, svm, canneal]
      - Hierarchical up to 10% slower than Ideal
      - Sparse has Ideal-like performance, but too expensive
      - SCD as fast as Ideal & Sparse, cheapest

  24. Energy Efficiency
      - Directory energy = Accesses * Energy/access
      - [Figure: bar chart of SCD array accesses over Sparse (%), y-axis 0 to 25, with a 97% data label, on bscholes, applu, jbb, ocean, svm, canneal]
      - SCD performs slightly more accesses (lookups, writes) than Sparse
        - Some operations require multiple lookups
        - SCD has higher occupancy, replacements take longer
      - SCD energy/access is smaller (narrow entries)

  25. Analytical Models
      - Empirical results on invalidations match analytical models
        - Bounds worst-case invalidations with minimal overprovisioning
        - Can provision directory using simple formulas
      - Set-associative arrays do not meet analytical models
        - Need significant overprovisioning (~2x), no bounds
      - Similar results for Sparse & Hierarchical
