Relaxed Data Structures Dan Alistarh IST Austria & ETH Zurich
...but first, we're hiring!
• Young institute dedicated to basic research and graduate education
• Located near Vienna, Austria; fully English-speaking
• Graduate School: 1+3-year PhD program
• Full-time positions with competitive salary
• Internships (2018): email d.alistarh@gmail.com
• PhD & Postdoc positions
• Projects: Concurrent Data Structures, Distributed Machine Learning, Molecular Computation
Why Concurrent Data Structures?
[Figure: clock rate and #cores over the past 45 years.]
To get speedup on newer hardware. Scaling: more threads should imply more useful work.
The Problem with Concurrency
[Figure: throughput (events/second, up to ~6×10⁶) of a concurrent packet-processing queue as the number of threads grows from 0 to 70, comparing a > $10,000 machine against a < $1,000 machine.]
Is this problem inherent for some data structures?
Inherent Sequential Bottlenecks
Data structures with strong ordering semantics:
• Stacks, Queues, Priority Queues, Exact Counters
Theorem: Given n threads, any deterministic, strongly ordered data structure has executions in which a processor takes time linear in n to return. [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
This is important because of Amdahl's Law:
• Assume the single-threaded computation takes 7 days
• The inherently sequential component (e.g., a queue) takes 15% ≈ 1 day
• Then the maximum speedup is < 7x, even with an infinite number of threads
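The bound above is just Amdahl's Law; a quick check of the arithmetic (a sketch, using the 15% sequential fraction from the example — the function name is ours):

```python
def amdahl_speedup(s, p):
    """Amdahl's Law: speedup with p threads when a fraction s of the
    work is inherently sequential (e.g., funneled through a queue)."""
    return 1.0 / (s + (1.0 - s) / p)

# 15% inherently sequential, as in the 7-day example above.
s = 0.15
# Even with an essentially infinite number of threads, the speedup
# is capped at 1/s ~= 6.67x -- i.e., strictly less than 7x.
print(round(amdahl_speedup(s, 10**9), 2))  # 6.67
```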
Today's Class
Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return. [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
How can we circumvent this? Theory ↔ Software ↔ Hardware
New notions of progress/correctness! New data structure designs!
Lock-Free Data Structures
• Based on atomic instructions (CAS, Fetch&Inc, etc.)
• Blocking of one thread doesn't stop the whole system
• Implementations: HashTables, Lists, B-Trees, Queues, Stacks, SkipLists, etc.
• Known to scale well for many data structures
Example: lock-free counter. The preamble reads the location; the scan-&-validate loop then retries CAS(R, old, new) until it succeeds:

    Memory location R;
    unsigned fetch-and-inc() {
        unsigned val;
        do {
            val = Read(R);
        } while (!Bool_CAS(&R, val, val + 1));
        return val;
    }
The Lock-Free Paradox
Example: lock-free counter. Two threads can both read R = 0 and compute new_val = 1; only one CAS succeeds, and the loser retries against the new value of R.

    Memory location R;
    int fetch-and-increment() {
        int val;
        do {
            val = Read(R);
            new_val = val + 1;
        } while (!Compare&Swap(&R, val, new_val));
        return val;
    }

Theory: threads could starve in optimistic lock-free implementations. Use more complex wait-free algorithms.
Practice: this doesn't happen. Threads don't starve.
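The retry loop above can be exercised directly. A minimal Python sketch, where `AtomicRef.cas` stands in for the hardware CAS instruction (the class and its lock-based atomicity are our own simulation, not the slide's implementation):

```python
import threading

class AtomicRef:
    """A shared memory location R with an atomic compare-and-swap.
    Real hardware provides CAS as a single instruction; here a lock
    stands in for that atomicity, so this is a simulation rather
    than a genuinely lock-free implementation."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def read(self):
        return self.value

    def cas(self, old, new):
        with self._lock:
            if self.value == old:
                self.value = new
                return True
            return False

def fetch_and_inc(r):
    # Optimistic retry loop from the slide: read, then try to
    # install val + 1; retry if another thread got there first.
    while True:
        val = r.read()
        if r.cas(val, val + 1):
            return val

R = AtomicRef(0)
workers = [threading.Thread(target=lambda: [fetch_and_inc(R) for _ in range(1000)])
           for _ in range(8)]
for w in workers: w.start()
for w in workers: w.join()
print(R.read())  # 8000: no increment is ever lost, despite the races
```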
Starvation?
[Figure: lock-free stack, 16 threads — percentage of operations (0–40%) vs. number of iterations before an operation succeeds (1–36).]
[Figure: try distribution, SkipList inserts, 16 threads, 50% mutations — number of operations (up to ~1.5×10⁷) vs. number of tries (1–6), for Queue, SkipList, and Counter.]
Why?
Part 1: Understanding Lock-Free Progress
1. We focus on contended workloads
2. We focus on the scheduler
• Sequence of accesses to shared data
• Not adversarial, but relaxed
• Stochastic model
3. We focus on long-term behavior
• How long does an operation take to complete on average?
• Are there operations that never complete?
How does the "scheduler" behave in the long run?
A simplified view of "the scheduler"
• Complex combination of:
• Input (workload)
• Code
• Hardware
• Single-variable contention (Intel™)
The Scheduler
• Pick a random time t: what's the probability that p_i is scheduled?
• Scheduler:
• Either chooses a request from the pool in each "step," or leaves the variable with the current owner
• The Schedule:
• Under contention, a sequence of thread ids, e.g.: 2, 1, 4, 5, 2, 3, …
• Sequential access to the contended data item
• Stochastic Scheduler:
• Every thread can be scheduled in each step, with probability > 0.
Examples • Assume n processes • The uniform stochastic scheduler: • θ = 1 / n • Each process gets scheduled uniformly • A standard adversary : • Take any adversarial strategy • The distribution gives probability 1 to the process picked by the strategy, 0 to all others • Not stochastic • Quantum-based schedulers • Stochastic if quantum length not fixed, but random variable • E.g.: [1, 1, 1], [3], [4, 4, 4, 4], [2, 2], [1], [4, 4], … • Common for OS scheduling
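A quantum-based scheduler of the kind described above can be sketched as follows (the function `quantum_schedule` and its parameters are illustrative assumptions, not from the slides):

```python
import random

def quantum_schedule(n, total_steps, max_quantum=4, rng=None):
    """Quantum-based scheduler: repeatedly pick a thread uniformly and
    run it for a random-length quantum. Because quantum lengths are
    random rather than fixed, the resulting scheduler is stochastic:
    at any step, every thread has probability > 0 of running next."""
    rng = rng or random.Random()
    schedule = []
    while len(schedule) < total_steps:
        tid = rng.randrange(n)            # which thread runs next
        q = rng.randint(1, max_quantum)   # random quantum, e.g. [1, 1, 1] or [3]
        schedule.extend([tid] * q)
    return schedule[:total_steps]

sched = quantum_schedule(n=4, total_steps=10_000, rng=random.Random(0))
# In a long schedule, every thread appears.
print(sorted(set(sched)))
```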
Lock-Free Algorithms and Stochastic Schedulers • Lock-Free • There’s a time bound B for the system to complete some new operation • Wait-Free • There’s a (local) time bound for each operation to complete Theorem: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14/JACM16] Proof intuition: • Given any time t, if some thread p is scheduled for B consecutive time steps, it has to complete some new operation • There’s a non-zero probability that the scheduler might decide to schedule thread p B steps in a row. • By the “Infinite Monkey Theorem,” this will eventually occur. • Hence, with probability 1, every operation eventually succeeds
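The proof intuition can be checked empirically: under a uniform stochastic scheduler, a run of B consecutive steps by a single thread always shows up eventually. A sketch (the helper `steps_until_b_run` is our own name):

```python
import random

def steps_until_b_run(n, B, rng):
    """Count uniform-scheduler steps until *some* thread is scheduled
    B times in a row -- the event that, by lock-freedom, forces a
    new operation to complete."""
    run_tid, run_len, steps = None, 0, 0
    while True:
        tid = rng.randrange(n)   # each thread has probability 1/n per step
        steps += 1
        if tid == run_tid:
            run_len += 1
        else:
            run_tid, run_len = tid, 1
        if run_len == B:
            return steps

rng = random.Random(42)
trials = [steps_until_b_run(n=8, B=3, rng=rng) for _ in range(1000)]
# Every trial terminates: by the "Infinite Monkey" argument, a
# length-B run occurs with probability 1 (and typically quickly).
print(min(trials), max(trials))
```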
Comments
Theorem: Under any stochastic scheduler, any bounded lock-free algorithm is wait-free, with probability 1.
Progress taxonomy (blocking vs. non-blocking):
• Minimal progress: Deadlock-free (blocking), Lock-free (non-blocking)
• Maximal progress: Starvation-free (blocking), Wait-free (non-blocking)
• Practically, not that insightful:
• The probability that an operation succeeds could be as low as (1/n)^n
• Does not necessarily hold if the scheduler is not stochastic
• For instance, on NUMA systems, the scheduler can be non-stochastic
The Story So Far
• The Goal
• Lock-Free Algorithms in Practice
• The Stochastic Scheduler Model
• Lock-Free ≈ Wait-Free (in Theory)
• Performance Upper Bounds
• A general class of lock-free algorithms
• Uniform stochastic scheduler
Disclaimer: We do not claim that the scheduler is uniform in general. We only use this as a lower bound on its long-run behavior.
Single-CAS Universal
• Can implement any object lock-free (Herlihy's Universal Construction)
• Blueprint for many efficient implementations (Treiber stack, counters)
Step Complexity: what is the average number of steps a process takes until completing a method call?
System Latency (= Throughput⁻¹): what is the average number of steps the system takes until completing a method call?
Special Case: The Counter
Example: lock-free counter. Each scheduled thread executes Read(R), then CAS(R, old, old + 1), retrying until success:

    Memory location R;
    unsigned fetch-and-inc() {
        unsigned val;
        do {
            val = Read(R);
        } while (!Bool_CAS(&R, val, val + 1));
        return val;
    }

• Example schedule: 1, 2, 2, 1
Assuming a uniform stochastic scheduler and n threads, what is the average step complexity?
Part 2: Step Complexity Analysis
Example schedule: n, 2, 1, … Each scheduled thread executes Read(R), then CAS(R, old, old + 1); a CAS succeeds only if R is unchanged since that thread's read.
In each step, we pick an element from 1 to n uniformly at random. How many steps (in expectation) before some element is chosen twice?
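The question can also be answered empirically, with a small Monte Carlo simulation of the Read-then-CAS process under a uniform scheduler (a sketch; `avg_steps_per_op` and its state encoding are our own):

```python
import random

def avg_steps_per_op(n, total_steps, rng):
    """Simulate n threads running the Read-then-CAS loop of the
    lock-free counter under a uniform stochastic scheduler, and
    return the average number of scheduler steps per completed op."""
    counter = 0
    snapshot = [None] * n   # value each thread last read; None = must Read
    completed = 0
    for _ in range(total_steps):
        t = rng.randrange(n)               # uniform scheduler picks a thread
        if snapshot[t] is None:
            snapshot[t] = counter          # step 1: Read(R)
        else:
            if snapshot[t] == counter:     # step 2: CAS succeeds iff R unchanged
                counter += 1
                completed += 1
            snapshot[t] = None             # on failure: pay the step, re-Read
    return total_steps / completed

rng = random.Random(7)
# Average steps per operation grow roughly like sqrt(n), not like n:
for n in (16, 64, 256):
    print(n, round(avg_steps_per_op(n, 200_000, rng), 2))
```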
The Birthday Problem
• n = 365 days in a year
• k people in a room
• What is the probability that there are two with the same birthday?
• Pr[no birthday collision] = (1 − 1/n)(1 − 2/n) ⋯ (1 − (k−1)/n)
• Approximation: e^y ≈ 1 + y (for y close to 0)
• Pr[no birthday collision] ≈ e^{−k(k−1)/(2n)}
• This is constant for k = Θ(√n)
Moral of the story:
1. Two people in this room probably share a birthday
2. After ~√n steps are scheduled, some thread wins
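The birthday estimate can be verified by simulation (a sketch; `steps_until_repeat` is our name):

```python
import random

def steps_until_repeat(n, rng):
    """Draw uniformly from {1..n} until some value repeats; return
    the number of draws (the 'birthday' collision time)."""
    seen = set()
    draws = 0
    while True:
        x = rng.randrange(n)
        draws += 1
        if x in seen:
            return draws
        seen.add(x)

rng = random.Random(0)
n = 365
trials = [steps_until_repeat(n, rng) for _ in range(20_000)]
mean = sum(trials) / len(trials)
# Theory: the expected collision time is Theta(sqrt(n)) -- roughly
# sqrt(pi * n / 2), about 24 draws for n = 365, which is why ~23
# people in a room suffice for a likely shared birthday.
print(round(mean, 1))
```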
The Execution: A Sequential View
Example schedule: 2, 1, 4, …, 2, 4, 1, 3. The steps interleave (P4: Read, P1: CAS, P2: Read, P1: Read, P2: CAS, P4: CAS, P3: Read); once one CAS succeeds, the other threads' pending reads become useless work.
Moral of the story:
1. After ~√n steps are scheduled, some thread wins
2. That thread's CAS will cause ~√n other threads to fail
Average latency of the system is O(√n) (this is tight). By symmetry, the average step complexity of a counter operation is O(√n).
Warning: Not Formally Correct
1. We have assumed a uniform initial configuration
2. A process which fails a CAS will have to pay an extra step (the failed CAS, then a new Read before it can try again)
3. We have only given upper bounds on the number of steps
• But √n is indeed the tight bound here
4. Latency ↔ step complexity argued only by symmetry
• Formally, by Markov chain lifting