The Cost of Updates to Shared Data in Cache-Coherent Systems

Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems. Guowei Zhang, Webb Horn, Daniel Sanchez. MICRO 2015.


  1. Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems
     Guowei Zhang, Webb Horn, Daniel Sanchez
     MICRO 2015

  2. Executive summary
     - Updates to shared data limit parallelism in current systems
     - Insight: many updates are commutative
     - Coup extends cache coherence protocols to make commutative updates as cheap as reads
     - Maintains coherence and consistency
     - Accelerates update-heavy applications significantly

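  3. Updates are expensive
     [Figure: Core 0 and Core 1, each with a private cache, sit above a shared cache holding A: 20. The cores issue interleaved add(A, 1) and add(A, 2) updates, followed by a read(A). The line holding A moves between the private caches on every update (A: 20, then 21, then 23). Timeline axes: Core/$ 0, Core/$ 1, time.]
     - Traffic
     - Serialization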

  4. Updates are expensive, even with RMOs
     [Figure: same two-core setup, with an ALU at the shared cache. Each add(A, 1) and add(A, 2) is sent to the shared cache as +1 or +2 and applied there (A: 20, then 21, then 23), followed by a read(A). Timeline axes: Core/$ 0, Core/$ 1, time.]
     - Traffic
     - Serialization
     - Complicates consistency

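  5. Coup: exploiting commutativity
     [Figure: same two-core setup, with an ALU at the shared cache. Each private cache holds an update-only delta for A that starts at +0 and accumulates local add(A, 1) or add(A, 2) updates (A: +1, A: +2). On read(A), the deltas are reduced into the shared copy (A: 20, then 23, then 29). Timeline axes: Core/$ 0, Core/$ 1, time.]
     - Low traffic
     - Concurrent updates
     - Simple consistency
     - Less general than RMOs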
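The buffering idea on slide 5 can be sketched in software as a toy model. This is not the hardware protocol: the class `SharedLine` and its methods `comm_update` and `read` are hypothetical names chosen for illustration, and the model assumes a single word, an associative and commutative operator, and a reduction that runs only when the word is read.

```python
import operator


class SharedLine:
    """Toy model: one shared word plus one update-only delta per core."""

    def __init__(self, value, num_cores, op=operator.add, identity=0):
        self.value = value
        self.op = op
        self.identity = identity
        # Every private copy starts at the identity element ("A: +0").
        self.deltas = [identity] * num_cores

    def comm_update(self, core, operand):
        # Purely local: fold the operand into this core's delta.
        self.deltas[core] = self.op(self.deltas[core], operand)

    def read(self):
        # A read triggers a reduction: fold every delta into the shared
        # value and reset the deltas to the identity element.
        for core, delta in enumerate(self.deltas):
            self.value = self.op(self.value, delta)
            self.deltas[core] = self.identity
        return self.value


line = SharedLine(20, num_cores=2)
for _ in range(3):
    line.comm_update(0, 1)  # Core 0: add(A, 1)
    line.comm_update(1, 2)  # Core 1: add(A, 2)
result = line.read()
print(result)  # prints 29, matching the final value on the slide
```

Because addition is commutative and associative, the order in which the deltas are folded in does not change the result, which is what lets the updates proceed without communication until a read arrives.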

  6. Commutative updates are common
     - Operations
     - Applications: reduction variables, iterative algorithms, graph traversal, reference counting

  7. Software privatization vs. Coup
     [Figure: privatization splits X into multiple thread-private, update-only copies X.0, X.1, ..., X.N; a reduction combines them into one read-only copy X.]

     | Software privatization | Coup |
     | --- | --- |
     | Needs to amortize privatization/reduction costs | No overheads |
     | Wastes shared cache & memory capacity | No wasted capacity |
     | Must apply selectively | Apply to any update that might commute |
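For comparison, software privatization as described on slide 7 might look like the following minimal sketch. The function `privatized_histogram` and the modulo binning are assumptions made for illustration; they do not come from the slides.

```python
import threading


def privatized_histogram(data, num_bins, num_threads):
    # One private, update-only copy per thread (X.0 ... X.N on the slide).
    # Memory cost grows as num_threads * num_bins.
    private = [[0] * num_bins for _ in range(num_threads)]

    def worker(tid):
        # Each thread processes a strided slice and touches only its own copy.
        for x in data[tid::num_threads]:
            private[tid][x % num_bins] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reduction: combine the private copies into one result (X).
    return [sum(copy[b] for copy in private) for b in range(num_bins)]


hist = privatized_histogram(list(range(100)), num_bins=4, num_threads=4)
print(hist)  # prints [25, 25, 25, 25]
```

The sketch makes the two costs in the table visible: the private copies multiply the memory footprint by the thread count, and the final reduction loop is extra work that pays off only when there are enough updates to amortize it.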

  8. Outline
     - Introduction
     - Coup
     - Evaluation

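  9. Structural changes
     [Figure: Core 0 ... Core N-1, each with a private cache (Private Cache 0 ... Private Cache N-1), connected to a shared cache/directory.]
     - Reduction unit at the shared cache/directory
     - Coherence states: M, S, I, plus U
     - ISA: load(&x), store(&x, v), comm_add(&x, v), comm_or(&x, v), ...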

  10. Example: extending MSI
      [Figure: state-transition diagrams for MSI (states M, S, I) and MUSI (states M, U, S, I), with edges labeled by request type R, W, or C.]
      Legend:
      - Transitions: initiated by own core (gain permissions); initiated by others (lose permissions)
      - States: Modified, Shared (read-only), Invalid, Update-only
      - Requests: Read, Write, Commutative update
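The own-core half of the MUSI diagram can be approximated with a small lookup. This is a simplified sketch, not the protocol from the paper: it covers only requests from the line's own core, it ignores transitions caused by other cores and the reductions they trigger, and it assumes that M can satisfy commutative updates locally. The names `PERMS`, `TARGET`, and `next_state` are hypothetical.

```python
# Requests each stable state can satisfy locally.
# Assumption: M (exclusive) also satisfies commutative updates.
PERMS = {
    "M": {"R", "W", "C"},
    "S": {"R"},   # read-only
    "U": {"C"},   # update-only
    "I": set(),
}

# State needed to satisfy a request that the current state cannot.
TARGET = {"R": "S", "W": "M", "C": "U"}


def next_state(state, request):
    """Stable state of a line after its own core issues `request`."""
    if request in PERMS[state]:
        return state  # hit: no permission change needed
    return TARGET[request]  # miss: gain the required permission


print(next_state("I", "C"))  # prints U
print(next_state("U", "R"))  # prints S
```

In the real protocol, gaining read permission from U also requires reducing the partial updates held by all update-only copies; the sketch records only the resulting state of the requesting cache.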

  11. Coherence and consistency
      - Coherence is maintained
      - Consistency is not affected
      - See paper for proofs

  12. Implementation and verification
      [Figure: protocol state diagrams. Legend: states are stable, transient, split, or race; transitions are initiated by an own request (R, W, C, writeback), a response to an own request, or an invalidation/downgrade request.]
      - No extra stable states
      - Easy to verify

  13. Evaluation methodology
      [Figure: processor chip organization. Each processor chip has 16 cores (Core 0 to Core 15), each with L1I, L1D, and L2 caches, plus a shared L3 and chip directory that connects to the L4 chips. The system has 1-8 processor and L4 chips; each L4 chip holds an L4 cache and global directory.]
      - 8 sockets × 16 cores/socket = 128 cores

  14. Coup vs. atomic operations
      [Figure: speedup vs. number of cores (1 to 128) for MESI and COUP, one panel each for histogram, spmv, pagerank, bfs, and fluidanimate.]
      - Fraction of commutative instructions (one value per benchmark): 1.0%, 2.4%, 4.9%, 0.40%, 0.96%
      [Figure: normalized AMAT (0 to 1.2) for MESI and COUP on histogram, spmv, pagerank, bfs, and fluidanimate.]

  15. Modifying algorithms to exploit Coup
      Delayed deallocation reference counting

      | Scheme | Data structure |
      | --- | --- |
      | Refcache [1] | Hash table |
      | Coup implementation | Hierarchical bit vectors + comm add/or |

      [Figure: bar chart of performance (0 to 2.5) for Refcache and Coup.]
      [1] Clements et al., EuroSys 2013

  16. Conclusions
      - Coup allows concurrent commutative updates
        - Maintains coherence and consistency
      - Coup implementation accelerates single-word updates
        - Minor hardware overhead
        - Accelerates update-heavy applications by up to 2.4x
      - Coup opens exciting research avenues
        - Commutativity-aware hardware transactional memory
        - Support arbitrary update functions, semantic commutativity

  17. Thanks for your attention! Questions are welcome!
