The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Author: Thomas E. Anderson
Presenter: Bin Lin (Department of Computer Science)
CS533 Fall 2013, Nov 25, 2013

Outline
◮ Introduction
◮ Multiprocessor architectures overview
◮ Simple approaches to spin-waiting
◮ Software alternatives
◮ Queueing approach
◮ Evaluation
◮ Hardware solutions
◮ Conclusions

Introduction
◮ Shared-memory multiprocessors
  • A variety of architectures
◮ Mutual exclusion
  • Software
  • Hardware
◮ This paper focuses on spin locks
◮ Goals
  • Scalable
  • Low-latency

Multiprocessor Architectures Overview
◮ Two dimensions
  • Interconnect type (bus or multistage network)
  • Cache coherence (CC) strategy

Multiprocessor Architectures Overview
◮ Interconnect type – multistage interconnection network

Multiprocessor Architectures Overview
◮ Interconnect type – single bus

Multiprocessor Architectures Overview
◮ CC strategy – with CC

Multiprocessor Architectures Overview
◮ CC strategy – without CC

Multiprocessor Architectures Overview
◮ Two dimensions
  • Interconnect type (bus or multistage network)
  • Cache coherence (CC) strategy
◮ Six architectures considered in this paper
  • Multistage interconnection network without CC
  • Multistage interconnection network with invalidation-based CC using remote directories
  • Bus without CC
  • Bus with snoopy write-through invalidation-based CC
  • Bus with snoopy write-back invalidation-based CC
  • Bus with snoopy distributed-write CC

Multiprocessor Architectures Overview
◮ Interconnection network with CC using directories

Multiprocessor Architectures Overview
◮ Write-back vs write-through vs distributed-write

Simple Approaches to Spin-waiting
Spin on Test-and-Set

    while (TestAndSet(lock) = BUSY);
    < critical section >
    lock := CLEAR;

◮ Problems
  • The lock holder must contend with the spinning processors for exclusive access to the lock location.
  • Each spinning processor consumes bus or network bandwidth.
◮ Tradeoff: the more frequently a processor polls, the sooner it acquires the lock once it is released, but the more it disrupts other processors.
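
The spin on test-and-set loop above can be sketched in C11 as follows. This is a minimal illustration, not the paper's code: the names `tas_lock`, `test_and_set`, `tas_acquire` and `tas_release` are assumptions, and `atomic_exchange` stands in for the hardware test-and-set instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names: the slide gives only pseudocode. */
typedef struct { atomic_bool busy; } tas_lock;

static inline void tas_init(tas_lock *l) { atomic_init(&l->busy, false); }

/* TestAndSet(lock): atomically set the lock to BUSY and return the old value. */
static inline bool test_and_set(tas_lock *l) {
    return atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* while (TestAndSet(lock) = BUSY); */
static inline void tas_acquire(tas_lock *l) {
    while (test_and_set(l)) {
        /* Every iteration is an atomic read-modify-write on the shared
           lock location, which is the source of the contention listed above. */
    }
}

/* lock := CLEAR; */
static inline void tas_release(tas_lock *l) {
    assert(atomic_load(&l->busy)); /* releasing a free lock is a caller bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

A thread would call `tas_acquire(&l)`, run its critical section, then call `tas_release(&l)`.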

Simple Approaches to Spin-waiting
Spin on Read

    while (lock = BUSY or TestAndSet(lock) = BUSY);
    < critical section >
    lock := CLEAR;

◮ It is a good idea, isn’t it?
◮ Problems
  • There is a gap between detecting that the lock has been released and attempting to acquire it.
  • Cached copies of the lock value are invalidated by a test-and-set instruction even if the value does not change.
  • Invalidation-based CC requires O(P) bus or network cycles to broadcast a value to P waiting processors.
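
A hedged C11 sketch of spin on read follows. The names are hypothetical, and the `tas_count` field is instrumentation added here only to show that a processor issues no test-and-set while it sees the lock as busy; it is not part of the algorithm on the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool busy;
    atomic_ulong tas_count; /* instrumentation only (assumption of this sketch) */
} sor_lock;

static inline void sor_init(sor_lock *l) {
    atomic_init(&l->busy, false);
    atomic_init(&l->tas_count, 0UL);
}

/* One pass of: lock = BUSY or TestAndSet(lock) = BUSY.
   Returns true if the lock was acquired. */
static inline bool sor_try_acquire(sor_lock *l) {
    /* Plain read first: with coherent caches this can be served locally
       while the lock is held. */
    if (atomic_load_explicit(&l->busy, memory_order_relaxed))
        return false;
    /* The lock looked free, so now attempt the test-and-set. */
    atomic_fetch_add_explicit(&l->tas_count, 1UL, memory_order_relaxed);
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

static inline void sor_acquire(sor_lock *l) {
    while (!sor_try_acquire(l)) { /* keep spinning on the read */ }
}

static inline void sor_release(sor_lock *l) {
    assert(atomic_load(&l->busy)); /* releasing a free lock is a caller bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

The gap between the read and the test-and-set in `sor_try_acquire` is the window the first problem above refers to: several processors can pass the read before any of them completes its test-and-set.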

Simple Approaches to Spin-waiting – Performance

Simple Approaches to Spin-waiting – Performance
◮ Quiescence time for spin on read

Simple Approaches to Spin-waiting – Performance
◮ Why quiescence is so slow for spin on read
  • When the lock is released, all cached copies are invalidated
  • The next read by each processor incurs a read miss
  • Many processors see the lock free
  • Each of them tries to execute a test-and-set
  • All but one attempt fails
  • Only after the last processor has done its test-and-set does every other processor take a final read miss and then quiesce

Software Alternatives
The author presents five software alternatives:
◮ Four based on CSMA network protocols, which differ by:
  • Where to wait
    – Delay after the lock has been released
    – Delay after every separate access to the lock
  • Whether the size of the delay is set statically or dynamically
◮ One novel approach: explicit queueing

Software Alternatives – Where to Delay
Delay after Spinner Notices Released Lock

    while (lock = BUSY or TestAndSet(lock) = BUSY)
    begin
        while (lock = BUSY);
        Delay();
    end;

Delay between Each Memory Reference

    while (lock = BUSY or TestAndSet(lock) = BUSY)
        Delay();
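
The two delay placements can be compared side by side in C11. This is a sketch under assumptions: `delay` is a hypothetical busy-wait measured in loop iterations, and both acquire functions return the number of delays taken purely so the behavior can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool busy; } spin_lock;

static inline void spin_init(spin_lock *l) { atomic_init(&l->busy, false); }

static inline bool is_busy(spin_lock *l) {
    return atomic_load_explicit(&l->busy, memory_order_relaxed);
}

static inline bool test_and_set(spin_lock *l) {
    return atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* Delay(): hypothetical busy-wait of `iters` loop iterations. */
static inline void delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* Delay after the spinner notices a released lock. */
static inline unsigned acquire_delay_after_release(spin_lock *l, unsigned iters) {
    unsigned delays = 0;
    while (is_busy(l) || test_and_set(l)) {
        while (is_busy(l)) { }  /* spin on the read until the lock looks free */
        delay(iters);           /* then wait before re-checking and trying */
        delays++;
    }
    return delays;
}

/* Delay between each memory reference. */
static inline unsigned acquire_delay_every_reference(spin_lock *l, unsigned iters) {
    unsigned delays = 0;
    while (is_busy(l) || test_and_set(l)) {
        delay(iters);           /* wait after every failed look at the lock */
        delays++;
    }
    return delays;
}

static inline void spin_release(spin_lock *l) {
    assert(is_busy(l)); /* releasing a free lock is a caller bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

With no contention both variants return `0`, since the first read and test-and-set succeed and no delay is taken.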

Software Alternatives – How to Set the Size of Delay
Static Delay
◮ Each spinning processor is statically assigned a fixed amount of time (a slot) to delay, different from the other processors’ slots.
◮ Good performance when the number of slots matches the number of spinners:
  • Few spinning processors with few slots
  • Many spinning processors with many slots
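
One way to picture static slots is the small helper below. The mapping from processor to slot (`proc_id mod num_slots`) and the parameter names are assumptions made for illustration; the slide does not say how slots are assigned.

```c
#include <assert.h>

/* Hypothetical mapping: processor `proc_id` owns slot (proc_id mod num_slots)
   and waits slot * slot_len time units before retrying the lock. */
static inline unsigned static_delay(unsigned proc_id, unsigned num_slots,
                                    unsigned slot_len) {
    assert(num_slots > 0);
    return (proc_id % num_slots) * slot_len;
}
```

Under this assumed mapping, with 4 slots processors 1 and 5 get the same delay and can still collide, while with 8 slots they are separated at the cost of a longer worst-case wait.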

Software Alternatives – How to Set the Size of Delay
Dynamic Delay
◮ Like Ethernet’s exponential backoff
◮ If a processor “collides” with another processor, it backs off for a longer delay
◮ Problems
  • When the critical section is long, a spinning processor keeps backing off for as long as the lock is held
  • How long should the initial delay be?
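
A minimal C11 sketch of backoff-based acquisition follows. The doubling rule, the `max_delay` cap and all names are assumptions of this sketch: capping the delay is one common way to limit the first problem above, and the sketch omits the randomization that Ethernet backoff uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool busy; } backoff_lock;

static inline void backoff_init(backoff_lock *l) { atomic_init(&l->busy, false); }

/* Hypothetical busy-wait of `iters` loop iterations. */
static inline void delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* Double the delay, but never exceed max_delay. */
static inline unsigned next_delay(unsigned cur, unsigned max_delay) {
    assert(cur > 0 && cur <= max_delay);
    return (cur > max_delay / 2) ? max_delay : cur * 2;
}

/* Acquire with backoff; returns the delay value in effect on success. */
static inline unsigned backoff_acquire(backoff_lock *l, unsigned initial,
                                       unsigned max_delay) {
    unsigned d = initial;
    while (atomic_load_explicit(&l->busy, memory_order_relaxed) ||
           atomic_exchange_explicit(&l->busy, true, memory_order_acquire)) {
        delay(d);                     /* failed attempt: wait ... */
        d = next_delay(d, max_delay); /* ... and wait longer next time */
    }
    return d;
}

static inline void backoff_release(backoff_lock *l) {
    assert(atomic_load(&l->busy)); /* releasing a free lock is a caller bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

The choice of `initial` is left to the caller, which mirrors the open question on the slide.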

Queueing
◮ Delay-based approaches separate contending accesses in time
◮ Queueing separates contending accesses in space
◮ A naive approach
  • Maintain an explicit queue of spinning processors
  • Each arriving processor enqueues itself and spins on a separate flag
  • Invalidations are reduced if each processor’s flag is kept in a separate cache block
  • But maintaining queues is expensive: the enqueue and dequeue operations must themselves be locked

Queueing – Implementation
◮ A more efficient approach

Init
    flags[0] := HAS_LOCK;
    flags[1..P-1] := MUST_WAIT;
    queueLast := 0;

Lock
    myPlace := ReadAndIncrement(queueLast);
    while (flags[myPlace mod P] = MUST_WAIT);
    < critical section >

Unlock
    flags[myPlace mod P] := MUST_WAIT;
    flags[(myPlace + 1) mod P] := HAS_LOCK;
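
The queueing pseudocode above maps fairly directly onto C11 atomics. The sketch below is illustrative only: `P = 8` is an assumed processor count, the function names are hypothetical, and `atomic_fetch_add` plays the role of ReadAndIncrement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define P 8u /* assumed number of processors; a power of two keeps
                (myPlace mod P) consistent if the counter wraps around */

typedef struct {
    atomic_bool has_lock[P]; /* true = HAS_LOCK, false = MUST_WAIT */
    atomic_uint queue_last;
} queue_lock;

/* Init: slot 0 holds the lock, all other slots must wait. */
static inline void queue_init(queue_lock *q) {
    for (unsigned i = 0; i < P; i++)
        atomic_init(&q->has_lock[i], i == 0);
    atomic_init(&q->queue_last, 0u);
}

/* Lock: take a place in line, then spin on that place's own flag. */
static inline unsigned queue_acquire(queue_lock *q) {
    unsigned my_place =
        atomic_fetch_add_explicit(&q->queue_last, 1u, memory_order_relaxed);
    while (!atomic_load_explicit(&q->has_lock[my_place % P],
                                 memory_order_acquire)) { }
    return my_place; /* caller passes this back to queue_release */
}

/* Unlock: reset our flag, then hand the lock to the next place in line. */
static inline void queue_release(queue_lock *q, unsigned my_place) {
    assert(atomic_load(&q->has_lock[my_place % P])); /* caller must hold it */
    atomic_store_explicit(&q->has_lock[my_place % P], false,
                          memory_order_relaxed);
    atomic_store_explicit(&q->has_lock[(my_place + 1u) % P], true,
                          memory_order_release);
}
```

Each waiter spins on a different array element, so a release changes only the flag that the next processor in line is watching.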

Queueing – Implementation
◮ Distributed-write coherence
  • All processors can spin on a single counter
◮ Invalidation-based coherence
  • Each processor should wait on a flag in a separate cache block
◮ Multistage network without CC
  • Each flag should be placed in a separate memory module
◮ Bus without CC
  • A delay is needed
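
For the invalidation-based case, placing each flag in its own cache block can be sketched with C11 alignment. The 64-byte block size, the type names and the `same_cache_block` helper are assumptions for illustration; the real block size is machine-dependent.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK 64 /* assumed cache block size in bytes */
#define P 8            /* assumed number of processors */

/* Aligning the member pads the struct so each flag fills a whole block. */
typedef struct {
    alignas(CACHE_BLOCK) atomic_bool has_lock;
} padded_flag;

static_assert(sizeof(padded_flag) == CACHE_BLOCK, "one flag per cache block");

typedef struct {
    padded_flag flags[P];
    atomic_uint queue_last;
} padded_queue_lock;

/* True if both addresses fall in the same CACHE_BLOCK-sized block. */
static inline bool same_cache_block(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_BLOCK == (uintptr_t)b / CACHE_BLOCK;
}
```

With this layout a write to one processor's flag does not invalidate the block another processor is spinning on, at the cost of `P * CACHE_BLOCK` bytes per lock.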

Queueing – Problems
◮ What if an architecture does not support an atomic read-and-increment instruction?
◮ Increases lock latency when there is no contention
◮ Makes the preemption problem more severe
◮ Makes it more difficult to wait for multiple events
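
On the first question, one possible fallback (an assumption of this sketch, not something stated on the slide) is to emulate read-and-increment by guarding a plain counter with a test-and-set lock. The names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool guard; /* test-and-set lock protecting `value` */
    unsigned value;    /* plain counter, touched only while guard is held */
} guarded_counter;

static inline void counter_init(guarded_counter *c) {
    atomic_init(&c->guard, false);
    c->value = 0;
}

/* Emulated ReadAndIncrement: returns the old value and adds one. */
static inline unsigned read_and_increment(guarded_counter *c) {
    /* Acquire the guard with test-and-set (emulated by atomic_exchange). */
    while (atomic_exchange_explicit(&c->guard, true, memory_order_acquire)) { }
    unsigned old = c->value++;
    assert(atomic_load(&c->guard)); /* the guard must still be held here */
    atomic_store_explicit(&c->guard, false, memory_order_release);
    return old;
}
```

This adds an extra lock acquisition to every queue entry, which is consistent with the latency concern in the second bullet.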

Evaluation – Principal Performance Comparison

Evaluation – Spin-waiting Overhead for a Burst

Hardware Solutions – Multistage Network
◮ Combining networks
  • Requests to the same location can be combined and forwarded as a single request
  • Better performance, but may increase latency
◮ Hardware queueing
  • Eliminates polling across the network
  • Speeds up passing control of the lock
◮ Goodman’s queue links
  • Stores the name of the next processor in the queue directly in each processor’s cache
  • The next processor can be notified without going through the original memory module
