Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Paper Available at: http://rsim.cs.illinois.edu/pubs.html

No Formal Specification for Relaxed Atomics
C++17 "specification" for relaxed atomics:
• Races that don't order other accesses
• Implementations should ensure no “out-of-thin-air” values are computed that circularly depend on their own computation
“C++ (relaxed) atomics were the worst idea ever. I just spent days (and days) trying to get something to work. … My example only has 2 addresses and 4 accesses, it shouldn’t be this hard. Can you help?” - Email from employee at major research lab
Formal specification for relaxed atomics is a longstanding problem

Why Use Relaxed Atomics?
[Figure: bar chart of speedup; axis from 0X to 20X, with off-scale bars labeled 27X, 28X, and 99X]
• But generally use simple, SW-based coherence
– Cost of staying away from relaxed atomics too high!

Our Approach
• Previous work
– Goal: formal semantics for all possible relaxed atomics uses
– No widely accepted formal semantics after ~15 years of effort
• Insight: analyze how real codes use relaxed atomics
– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?

Contributions
• Identified common uses of relaxed atomics
– Work queues, event counters, ref counters, seqlocks, …
• Data-race-free-relaxed (DRFrlx) memory model:
– SC-centric semantics + efficiency
• Evaluated benefits of using relaxed atomics
– Up to 53% fewer cycles (33% avg), 40% less energy (20% avg)
Everyone can safely use RAts

Outline
• Motivation
• Background
• Data-race-free-relaxed
• Results
• Conclusion

Atomics Background
• Default: Data-race-free-0 (DRF0) [ISCA ’90]
– Identify all races as synchronization accesses (C++: atomics)
    // each thread
    for i = 0:n
      …
      synch (atomic) ADD R4, A[i], R1
      synch (atomic) ADD R5, B[i], R1
      …
– All atomics order data accesses
– Atomics order other atomics
⇒ Ensures SC semantics if no data races

Atomics Background (Cont.)
• Default: Data-race-free-0 (DRF0) [ISCA ’90]
– All atomics order data accesses
– Atomics order other atomics
⇒ Ensures SC semantics if no data races
• Data-race-free-1 (DRF1): unpaired atomics [TPDS ’93]
+ Unpaired atomics do not order data accesses
– Atomics order other atomics
⇒ Ensures SC semantics if no data races
• Relaxed atomics [PLDI ’08]
+ Do not order data or other atomics
⇒ But can violate SC and no formal specification
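The DRF0 default corresponds to what C++ atomics do when no memory order is given (`std::memory_order_seq_cst`). The following is a minimal sketch of that case, not code from the talk: the names `payload`, `ready`, and `handoff_seq_cst` are hypothetical. The atomic flag orders the ordinary accesses to `payload`, so the program has no data race and the consumer must read 42.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; not from the slides.
int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready{false};  // synchronization variable

// Producer/consumer handoff using the default memory order (seq_cst).
// The atomic store/load pair orders the accesses to `payload`.
int handoff_seq_cst() {
    payload = 0;
    ready.store(false);

    std::thread producer([] {
        payload = 42;
        ready.store(true);  // default: std::memory_order_seq_cst
    });

    int seen = -1;
    std::thread consumer([&seen] {
        // Spin until the flag is set (default seq_cst load).
        while (!ready.load()) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's write
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both accesses to `ready` were changed to `std::memory_order_relaxed`, the flag would no longer order the accesses to `payload`, and in C++ the read of `payload` would become a data race. That loss of ordering is the behavior the last bullet above describes.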

Identifying Relaxed Atomic Use Cases
• Our Approach
– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?
• Contacted vendors, developers, and researchers
How do relaxed atomics work in Event Counters?

Event Counter
[Figure: accelerator ("Accel") cores, each with an L1 cache, above a shared L2 cache that holds the counters]
• Threads concurrently update counters
– Read part of a data array, update its counter

Event Counter (Cont.)
[Figure: accelerator ("Accel") cores, each with an L1 cache, above a shared L2 cache that holds the counters; counter values grow as threads update them]
• Threads concurrently update counters
– Read part of a data array, update its counter
– Increments race, so have to use atomics

Event Counter (Cont.)
[Figure: accelerator ("Accel") cores, each with an L1 cache, above a shared L2 cache that holds the final counter values]
• Threads concurrently update counters
– Read part of a data array, update its counter
– Increments race, so have to use atomics
Commutative increments: order does not affect final result
How to formalize?
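The event-counter pattern on these slides can be sketched in C++ roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' benchmark code: the function name `count_events`, the strided work split, and the use of CPU threads in place of accelerator threads are all assumptions. Each thread reads part of a data array and increments the matching counter with a relaxed atomic, discarding the value the increment returns.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the slides): counts how often each bin value
// appears in `data`. Every element of `data` is assumed to be < num_bins.
std::vector<unsigned> count_events(const std::vector<unsigned>& data,
                                   std::size_t num_bins,
                                   unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;

    // Shared counters, zeroed explicitly before any worker starts.
    std::vector<std::atomic<unsigned>> counters(num_bins);
    for (auto& c : counters) c.store(0, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &counters, t, num_threads] {
            // Each thread reads its part of the data array (strided slice).
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                assert(data[i] < counters.size());  // stated precondition
                // Racy increment: atomic, but relaxed. The returned
                // intermediate value is deliberately discarded.
                counters[data[i]].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join orders the reads below

    std::vector<unsigned> result(num_bins);
    for (std::size_t b = 0; b < num_bins; ++b)
        result[b] = counters[b].load(std::memory_order_relaxed);
    return result;
}
```

Because the increments commute and nothing reads a counter until every worker has been joined, the final counts are the same for any interleaving of the increments.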

Incorporating Commutativity Into DRFrlx
• New relaxed atomic category: commutative
• Formalism (Intuition):
– Accesses are commutative
– Intermediate values must not be observed
⇒ Final result is always SC

Commutative Definitions for an SC Execution
• Commutativity
– Two accesses to a memory location M are commutative if they:
  • Can be performed in any order and
  • Yield the same final result for M
• X and Y form a commutative race iff:
  1. X and Y form a race,
  2. At least one of X and Y is distinguished as commutative, and
  3. Either:
     • X and Y are not commutative, or
     • The value loaded by either is used by another instruction in its thread
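As a contrast to a plain event counter, the sketch below shows increments whose loaded value is used by a later instruction in the same thread. On one reading of condition 3 above, marking these increments as commutative would create a commutative race, so the sketch keeps the default `seq_cst` ordering instead. The function `claim_slots` and its ticket-dispenser structure are hypothetical and do not come from the talk.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical ticket dispenser (not from the slides). Threads claim slots
// 0..n-1 by incrementing a shared index. Returns how many slots were claimed.
std::size_t claim_slots(std::size_t n, unsigned num_threads) {
    std::atomic<std::size_t> next{0};
    std::vector<int> owner(n, -1);  // -1 means "unclaimed"

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next, &owner, n, t] {
            for (;;) {
                // The loaded (old) value is used by later instructions in
                // this thread, so the intermediate value is observed.
                std::size_t idx = next.fetch_add(1);  // default seq_cst
                if (idx >= n) break;
                owner[idx] = static_cast<int>(t);  // each idx is unique
            }
        });
    }
    for (auto& w : workers) w.join();

    std::size_t claimed = 0;
    for (int o : owner) {
        if (o != -1) ++claimed;
    }
    return claimed;
}
```

With at least one worker thread, every slot is claimed exactly once, but which thread owns which slot depends on the interleaving. That dependence on intermediate values is what separates this use from a counter whose increments are only read at the end.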

Commutative Program and Model Definitions
• DRFrlx Program
– A program is DRFrlx iff for every SC execution of the program:
  • No data races or commutative races in the execution
• DRFrlx Model
– A system obeys DRFrlx iff:
  • Result of every execution of a DRFrlx program is the result of an SC execution of the program
What about other use cases?

Incorporating Other Use Cases Into DRFrlx
Use Case | Category | Semantics
Work Queues | Unpaired | SC
Flags | Non-Ordering | Final result always SC
Event Counters | Commutative | Final result always SC
Seqlocks | Speculative | Final result always SC
Ref Counters | Quantum | SC-centric: non-SC parts isolated
Split Counters | Quantum | SC-centric: non-SC parts isolated
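For the Flags row, one plausible shape of such a flag is a "something changed" indicator that several threads may set and that is only read after all of them finish. The sketch below assumes that interpretation. The function `clamp_step` and its workload are hypothetical, and mapping this pattern onto the Non-Ordering category is an assumption rather than something the slides spell out.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example (not from the slides): clamp every element to `cap`
// and report whether anything changed.
bool clamp_step(std::vector<int>& values, int cap, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::atomic<bool> changed{false};

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&values, &changed, cap, t, num_threads] {
            for (std::size_t i = t; i < values.size(); i += num_threads) {
                if (values[i] > cap) {
                    values[i] = cap;  // each index is touched by one thread
                    // Racy store of the same value from several threads. No
                    // other access relies on this store for ordering.
                    changed.store(true, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& w : workers) w.join();  // join, not the flag, orders the data

    return changed.load(std::memory_order_relaxed);
}
```

A caller could loop on `clamp_step` until it returns false. The flag never orders the accesses to `values`; the thread joins do.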

Evaluation Methodology
• 1 CPU core + 15 GPU compute units (CU)
– Each node has private L1, scratchpad, tile of shared L2
• Simulation Environment
– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
• Study DRF0, DRF1, DRFrlx w/ GPU & DeNovoA coherence
• Workloads
– Microbenchmarks for each use case
– Benchmarks with biggest RAts speedups on discrete GPU
  • UTS, PageRank (PR), Betweenness Centrality (BC)

Relaxed Atomics Applications – Execution Time
[Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
GD0 = GPU coherence + DRF0
GD1 = GPU coherence + DRF1
GDR = GPU coherence + DRFrlx
DD0 = DeNovoA coherence + DRF0
DD1 = DeNovoA coherence + DRF1
DDR = DeNovoA coherence + DRFrlx

Relaxed Atomics Applications – Execution Time
[Figure: bar chart of execution time (axis 0% to 100%) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
Relaxed atomics reduce cycles up to ~50%
DRF1 increases data reuse (21% avg vs. GD0)
DRFrlx overlaps atomics (15% avg vs. GD1)

Relaxed Atomics Applications – Execution Time
[Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for GD0, GD1, GDR, DD0, DD1, and DDR on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
Relaxed atomics reduce cycles up to ~50%
DeNovoA increases reuse over GPU: 10% avg. for DRFrlx

Relaxed Atomics Applications – Energy
[Figure: stacked bar chart of energy (axis 0% to 100%, one bar labeled 104), broken down into N/W, L2 $, L1 D$, Scratch, and GPU Core+, for the same configurations on UTS, PR-1 to PR-4, BC-1 to BC-4, and AVG]
Energy similar to execution time trends
DeNovoA’s reuse reduces energy over GPU: 29% avg. for DRFrlx

Conclusion
• Cost of avoiding relaxed atomics too high
• Difficult to use correctly: no formal specification
• Insight: Analyze how real codes use relaxed atomics
DRFrlx: SC-centric semantics + efficiency
Everyone can safely use RAts
