Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems
Guowei Zhang, Webb Horn, Daniel Sanchez
MICRO 2015

Executive summary
Updates to shared data limit parallelism in current systems.
Insight: many updates are commutative.
Coup extends cache coherence protocols to make commutative updates as cheap as reads.
Maintains coherence and consistency.
Accelerates update-heavy applications significantly.

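Updates are expensive
[Figure: timeline of Core 0 repeatedly executing add(A, 1) and Core 1 repeatedly executing add(A, 2) on a shared line A (initially 20), followed by read(A). The line passes between the two private caches (A: 21, A: 23, ...) as the updates take turns.]
Costs: traffic, serialization.
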
Updates are expensive, even with RMOs
[Figure: the same update sequence with remote memory operations (RMOs). Each core sends its +1 or +2 to an ALU at the shared cache, which applies the updates to A one at a time (A: 20, A: 21, A: 23, ...).]
Costs: traffic, serialization, complicates consistency.

Coup: exploiting commutativity
[Figure: the same update sequence under Coup. Each private cache holds an update-only copy of A that starts at +0 and accumulates its own core's updates (+1 on Core 0, +2 on Core 1). A starts at 20 in the shared cache; once all six updates are combined, read(A) sees A: 29.]
Benefits: low traffic, concurrent updates, simple consistency.
Limitation: less general than RMOs.

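The example on this slide can be mimicked in software. The following is a toy model only, not the hardware design: the class name `CoupLine` and its methods are hypothetical, and it assumes that a read folds every private partial update into the shared copy.

```python
# Toy software model of the slide's example. All names here are
# hypothetical; real Coup does this in the coherence protocol.

class CoupLine:
    """One shared word plus one update-only partial sum per core."""

    def __init__(self, value, num_cores):
        self.shared = value              # copy held by the shared cache
        self.deltas = [0] * num_cores    # "+0": identity element of addition

    def comm_add(self, core, v):
        # A commutative update touches only the issuing core's delta,
        # so no other core is involved.
        self.deltas[core] += v

    def read(self):
        # Assumed behavior: a read triggers a reduction that folds all
        # private deltas into the shared copy and resets them to +0.
        self.shared += sum(self.deltas)
        self.deltas = [0] * len(self.deltas)
        return self.shared


line = CoupLine(value=20, num_cores=2)
for _ in range(3):
    line.comm_add(0, 1)   # Core 0: add(A, 1)
    line.comm_add(1, 2)   # Core 1: add(A, 2)
result = line.read()
print(result)  # 29 = 20 + 3*1 + 3*2
```

Because addition is commutative and associative, the order in which the deltas are folded in does not change the value that `read` returns.
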
Commutative updates are common
Operations: [not recoverable from the slide]
Applications: reduction variables, iterative algorithms, graph traversal, reference counting

Software privatization vs. Coup
[Figure: privatization splits X into multiple thread-private, update-only copies X.0, X.1, ..., X.N; reduction combines them back into one read-only copy X.]

Software privatization                             Coup
Needs to amortize privatization/reduction costs    No overheads
Wastes shared cache & memory capacity              No wasted capacity
Must apply selectively                             Apply to any update that might commute

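For comparison, software privatization can be sketched as follows. This is a minimal illustration on a histogram, assuming a strided split of the input; the function name `privatized_histogram` and its parameters are hypothetical and do not come from the talk.

```python
import threading

def privatized_histogram(data, num_bins, num_threads):
    # Privatization: one thread-private, update-only copy per thread.
    private = [[0] * num_bins for _ in range(num_threads)]

    def worker(tid):
        # Each thread takes a strided slice of the input and updates only
        # its own copy, so the update phase needs no synchronization.
        for x in data[tid::num_threads]:
            private[tid][x % num_bins] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reduction: fold the private copies into a single read-only copy.
    return [sum(copy[b] for copy in private) for b in range(num_bins)]

hist = privatized_histogram(list(range(100)), num_bins=4, num_threads=4)
print(hist)  # [25, 25, 25, 25]
```

The sketch makes the costs in the table visible: memory use grows as `num_threads * num_bins`, and the final reduction pass is pure overhead that must be amortized over enough updates.
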
Outline
Introduction
Coup
Evaluation

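Structural changes
[Figure: N cores (Core 0 ... Core N-1), each with a private cache (Cache 0 ... Cache N-1), below a shared cache/directory.]
Shared cache/directory: adds a reduction unit.
Private caches: coherence states M, S, I, plus U.
ISA: load(&x) and store(&x, v), plus commutative-update instructions comm_add(&x, v), comm_or(&x, v), ...
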
Example: extending MSI
[Figure: state-transition diagrams of MSI (states M, S, I) and MUSI (states M, U, S, I).]
Legend
States: Modified (M), Shared, read-only (S), Invalid (I), Update-only (U)
Requests: Read (R), Write (W), Commutative update (C)
Transitions: initiated by own core (gain permissions) or initiated by others (lose permissions)

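One way to read the MUSI diagram is as two lookup tables, one per transition type in the legend. The sketch below is an assumption: the individual arrows are inferred from the legend and edge labels, and the names `OWN`, `OTHER`, and `request` are hypothetical. The paper's diagram is the authoritative version.

```python
# Hypothetical encoding of per-line MUSI transitions (assumed arrows).
# Keys are (current state, request); values are the next state.

OWN = {  # request issued by this cache's own core: gain permissions
    ("I", "R"): "S", ("I", "W"): "M", ("I", "C"): "U",
    ("S", "R"): "S", ("S", "W"): "M", ("S", "C"): "U",
    ("U", "R"): "S", ("U", "W"): "M", ("U", "C"): "U",
    ("M", "R"): "M", ("M", "W"): "M", ("M", "C"): "M",
}

OTHER = {  # request issued by another core: lose permissions
    ("M", "R"): "S", ("M", "W"): "I", ("M", "C"): "U",
    ("S", "R"): "S", ("S", "W"): "I", ("S", "C"): "I",
    ("U", "R"): "I", ("U", "W"): "I", ("U", "C"): "U",
    ("I", "R"): "I", ("I", "W"): "I", ("I", "C"): "I",
}

def request(states, core, req):
    """Return the per-cache states after `core` issues request `req`."""
    return [OWN[(s, req)] if i == core else OTHER[(s, req)]
            for i, s in enumerate(states)]

states = ["I", "I"]
states = request(states, 0, "C")    # ['U', 'I']
states = request(states, 1, "C")    # ['U', 'U']: both cores may update
after_updates = states
states = request(states, 0, "R")    # ['S', 'I']: the read ends update-only mode
print(after_updates, states)
```

The point of the sketch is the `("U", "C"): "U"` entry in `OTHER`: another core's commutative update does not take permissions away from a cache in U, which is what lets several caches update the same line concurrently.
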
Coherence and consistency
Coherence is maintained.
Consistency is not affected.
See paper for proofs.

Implementation and verification
[Figure: protocol state diagram. Legend: states are stable or transient (split, race); transitions are initiated by an own request (R, W, C, wback), a response to an own request, or an inval/downgrade request.]
No extra stable states.
Easy to verify.

Evaluation methodology
[Figure: processor chip organization. Each chip has 16 cores (Core 0 ... Core 15), each with L1I and L1D caches and its own L2 (L2 0 ... L2 15), plus a shared L3 and chip directory that connects to the L4 chips.]
[Figure: system organization. 1-8 processor chips and L4 chips; each L4 chip holds an L4 cache and global directory.]
8 sockets × 16 cores/socket = 128 cores

Coup vs. atomic operations
[Figure: speedup vs. number of cores (1 to 128) for MESI and COUP, one panel per benchmark: histogram, pagerank, bfs, fluidanimate, spmv.]
Fraction of commutative instructions (in slide order): 1.0%, 2.4%, 4.9%, 0.40%, 0.96%
[Figure: normalized AMAT for MESI and COUP on histogram, spmv, pagerank, bfs, fluidanimate.]

Modifying algorithms to exploit Coup
Delayed deallocation reference counting

Scheme                 Data structure
Refcache [1]           Hash table
Coup implementation    Hierarchical bit vectors + comm add/or

[Figure: performance of Refcache vs. Coup.]
[1] Clements et al., EuroSys 2013

Conclusions
Coup allows concurrent commutative updates
  Maintains coherence and consistency
Coup implementation accelerates single-word updates
  Minor hardware overhead
  Accelerates update-heavy applications by up to 2.4x
Coup opens exciting research avenues
  Commutativity-aware hardware transactional memory
  Support arbitrary update functions, semantic commutativity

Thanks for your attention! Questions are welcome!