
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve, University of Illinois @ Urbana-Champaign, hetero@cs.illinois.edu


  1. Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
     Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
     University of Illinois @ Urbana-Champaign
     hetero@cs.illinois.edu
     Paper available at: http://rsim.cs.illinois.edu/pubs.html

  2. No Formal Specification for Relaxed Atomics
     C++17 "specification" for relaxed atomics:
     • Races that don't order other accesses
     • Implementations should ensure no "out-of-thin-air" values are computed that circularly depend on their own computation
     "C++ (relaxed) atomics were the worst idea ever. I just spent days (and days) trying to get something to work. … My example only has 2 addresses and 4 accesses, it shouldn't be this hard. Can you help?" - Email from employee at major research lab
     Formal specification for relaxed atomics is a longstanding problem
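The "out-of-thin-air" wording on this slide is usually illustrated with the load-buffering litmus test below. This is a minimal sketch, not taken from the slides: the helper name `run_load_buffering_once` is hypothetical. Each thread copies one variable into the other with relaxed accesses, so a result such as r1 == r2 == 42 would depend circularly on its own computation. The standard only recommends that implementations avoid such results, which is the gap in the specification the slide points at.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical helper: runs the load-buffering litmus test once and
// returns the values (r1, r2) observed by the two threads.
std::pair<int, int> run_load_buffering_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    // Thread 1 copies x into y using only relaxed accesses.
    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);
        y.store(r1, std::memory_order_relaxed);
    });
    // Thread 2 copies y into x using only relaxed accesses.
    std::thread t2([&] {
        r2 = y.load(std::memory_order_relaxed);
        x.store(r2, std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    // A nonzero result here would be a value that justifies itself.
    return {r1, r2};
}
```

On the compilers and hardware in common use today the observed values are both 0, but that is a property of those implementations rather than a formal guarantee.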

  3. Why Use Relaxed Atomics?
     [Figure: bar chart of speedup, axis from 0X to 20X, with bars labeled 27X, 28X, and 99X]
     • But generally use simple, SW-based coherence
       – Cost of staying away from relaxed atomics too high!

  4. Our Approach
     • Previous work
       – Goal: formal semantics for all possible relaxed atomics uses
       – No widely accepted formal semantics after ~15 years of effort
     • Insight: analyze how real codes use relaxed atomics
       – What are common uses of relaxed atomics?
       – Why do they work?
       – Can we formalize semantics for them?

  5. Contributions
     • Identified common uses of relaxed atomics
       – Work queues, event counters, ref counters, seqlocks, …
     • Data-race-free-relaxed (DRFrlx) memory model:
       – SC-centric semantics + efficiency
     • Evaluated benefits of using relaxed atomics
       – Up to 53% fewer cycles (33% avg), 40% less energy (20% avg)
     Everyone can safely use RAts

  6. Outline • Motivation • Background • Data-race-free-relaxed • Results • Conclusion

  7. Atomics Background
     • Default: Data-race-free-0 (DRF0) [ISCA '90]
       – Identify all races as synchronization accesses (C++: atomics)
           // each thread
           for i = 0:n
             …
             synch (atomic) ADD R4, A[i], R1
             synch (atomic) ADD R5, B[i], R1
             …
       – All atomics order data accesses
       – Atomics order other atomics
       → Ensures SC semantics if no data races
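One possible C++ reading of the per-thread loop on this slide, as a minimal sketch: every thread atomically adds a value to each element of two shared arrays. The function name `drf0_atomic_adds`, its parameters, and the choice to return the sum of all elements are assumptions made for illustration. The atomic operations use C++'s default ordering, `std::memory_order_seq_cst`, which corresponds to the DRF0 default described above.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each add r1 to every element of
// A and B, then the sum of all elements is returned.
long long drf0_atomic_adds(std::size_t n, int num_threads, int r1) {
    std::vector<std::atomic<int>> A(n), B(n);
    for (std::size_t i = 0; i < n; ++i) {
        A[i].store(0);
        B[i].store(0);
    }

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&A, &B, n, r1] {
            for (std::size_t i = 0; i < n; ++i) {
                A[i].fetch_add(r1);  // no order argument: memory_order_seq_cst
                B[i] += r1;          // operator+= is also sequentially consistent
            }
        });
    }
    for (auto& th : threads) th.join();

    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += A[i].load() + B[i].load();
    return total;
}
```

Because every racing access is an atomic, the program has no data races, so the total is always 2 * n * num_threads * r1.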

  8. Atomics Background (Cont.)
     • Default: Data-race-free-0 (DRF0) [ISCA '90]
       – All atomics order data accesses
       – Atomics order other atomics
       → Ensures SC semantics if no data races
     • Data-race-free-1 (DRF1): unpaired atomics [TPDS '93]
       + Unpaired atomics do not order data accesses
       – Atomics order other atomics
       → Ensures SC semantics if no data races
     • Relaxed atomics [PLDI '08]
       + Do not order data or other atomics
       → But can violate SC and no formal specification

  9. Outline • Motivation • Background • Data-race-free-relaxed • Results • Conclusion

  10. Identifying Relaxed Atomic Use Cases
      • Our Approach
        – What are common uses of relaxed atomics?
        – Why do they work?
        – Can we formalize semantics for them?
      • Contacted vendors, developers, and researchers
      How do relaxed atomics work in Event Counters?

  11. Event Counter
      [Figure: accelerator with four L1 caches, a shared L2 cache, and the counter values]
      • Threads concurrently update counters
        – Read part of a data array, updates its counter

  12. Event Counter (Cont.)
      [Figure: accelerator with four L1 caches, a shared L2 cache, and the counter values after some updates]
      • Threads concurrently update counters
        – Read part of a data array, updates its counter
        – Increments race, so have to use atomics

  13. Event Counter (Cont.)
      [Figure: accelerator with four L1 caches, a shared L2 cache, and the final counter values]
      • Threads concurrently update counters
        – Read part of a data array, updates its counter
        – Increments race, so have to use atomics
      Commutative increments: order does not affect final result
      How to formalize?
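The event counter pattern above can be sketched in C++ as follows. This is an illustrative sketch, not the code evaluated in the talk: `count_events`, `kNumCounters`, the strided split of the data array, and the modulo binning are all assumptions, and the data values are assumed non-negative. Threads increment shared counters with relaxed atomics, discard the value returned by each increment, and read the counters only after all threads have joined.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kNumCounters = 4;  // hypothetical number of counters

// Hypothetical helper: each thread reads its part of `data` and increments
// the counter selected by each element it reads.
std::array<long, kNumCounters> count_events(const std::vector<int>& data,
                                            std::size_t num_threads) {
    std::array<std::atomic<long>, kNumCounters> counters;
    for (auto& c : counters) c.store(0, std::memory_order_relaxed);

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counters, &data, num_threads, t] {
            // Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::size_t bin = static_cast<std::size_t>(data[i]) % kNumCounters;
                // Racing increment; the returned intermediate value is ignored.
                counters[bin].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();  // counters are read only after this point

    std::array<long, kNumCounters> result{};
    for (std::size_t j = 0; j < kNumCounters; ++j) {
        result[j] = counters[j].load(std::memory_order_relaxed);
    }
    return result;
}
```

The increments may be applied in any order, but since they commute and no thread looks at an intermediate count, the final counts match what any sequentially consistent execution would produce.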

  14. Incorporating Commutativity Into DRFrlx
      • New relaxed atomic category: commutative
      • Formalism (Intuition):
        – Accesses are commutative
        – Intermediate values must not be observed
        → Final result is always SC

  15. Commutative Definitions for an SC Execution
      • Commutativity
        – Two accesses to a memory location M are commutative if they:
          • Can be performed in any order and
          • Yield the same final result for M
      • X and Y form a commutative race iff:
        1. X and Y form a race,
        2. At least one of X and Y is distinguished as commutative, and
        3. Either:
           • X and Y are not commutative, or
           • The value loaded by either is used by another instruction in its thread
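The commutativity definition above can be made concrete with a small check. This is a sketch under a simplifying assumption that is not in the slides: each access is modeled as a function from the old value of M to its new value, and the hypothetical helper `commute_at` tests only one starting value, so it can refute commutativity but not prove it in general.

```cpp
#include <functional>

// Hypothetical model: an access maps the old value of M to its new value.
using Op = std::function<int(int)>;

// Returns true if applying x then y leaves M with the same final value as
// applying y then x, starting from `initial`.
bool commute_at(const Op& x, const Op& y, int initial) {
    return y(x(initial)) == x(y(initial));
}
```

For example, two increments commute (adding 1 then 5 gives the same result as adding 5 then 1), while an increment and a plain store of 7 do not: the final value is 7 in one order and 8 in the other.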

  16. Commutative Program and Model Definitions
      • DRFrlx Program
        – A program is DRFrlx iff for every SC execution of the program:
          • No data races or commutative races in the execution
      • DRFrlx Model
        – A system obeys DRFrlx iff:
          • Result of every execution of a DRFrlx program is the result of an SC execution of the program
      What about other use cases?

  17. Incorporating Other Use Cases Into DRFrlx
      Use Case        Category      Semantics
      Work Queues     Unpaired      SC
      Flags           Non-Ordering  Final result always SC
      Event Counters  Commutative
      Seqlocks        Speculative
      Ref Counters    Quantum       SC-centric: non-SC parts isolated
      Split Counters

  18. Outline • Motivation • Background • Data-race-free-relaxed • Results • Conclusion

  19. Evaluation Methodology
      • 1 CPU core + 15 GPU compute units (CU)
        – Each node has private L1, scratchpad, tile of shared L2
      • Simulation Environment
        – GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
      • Study DRF0, DRF1, DRFrlx w/ GPU & DeNovoA coherence
      • Workloads
        – Microbenchmarks for each use case
        – Benchmarks with biggest RAts speedups on discrete GPU
          • UTS, PageRank (PR), Betweenness Centrality (BC)

  20. Relaxed Atomics Applications – Execution Time
      [Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
      GD0 = GPU coherence + DRF0
      GD1 = GPU coherence + DRF1
      GDR = GPU coherence + DRFrlx
      DD0 = DeNovoA coherence + DRF0
      DD1 = DeNovoA coherence + DRF1
      DDR = DeNovoA coherence + DRFrlx

  21. Relaxed Atomics Applications – Execution Time
      [Figure: bar chart of execution time (axis 0% to 100%) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
      • Relaxed atomics reduce cycles up to ~50%
      • DRF1 increases data reuse (21% avg vs. GD0)
      • DRFrlx overlaps atomics (15% avg vs. GD1)

  22. Relaxed Atomics Applications – Execution Time
      [Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
      • Relaxed atomics reduce cycles up to ~50%
      • DeNovoA increases reuse over GPU: 10% avg. for DRFrlx

  23. Relaxed Atomics Applications – Energy
      [Figure: bar chart of energy (axis 0% to 100%, one bar labeled 104), broken down into N/W, L2 $, L1 D$, Scratch, and GPU Core+, for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
      • Energy similar to execution time trends
      • DeNovoA's reuse reduces energy over GPU: 29% avg. for DRFrlx

  24. Conclusion
      • Cost of avoiding relaxed atomics too high
      • Difficult to use correctly: no formal specification
      • Insight: Analyze how real codes use relaxed atomics
      DRFrlx: SC-centric semantics + efficiency
      Everyone can safely use RAts
