The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

  1. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
     Author: Thomas E. Anderson
     Presenter: Bin Lin, Department of Computer Science
     CS533 Fall 2013, Nov 25, 2013

  2. Outline
     ◮ Introduction
     ◮ Multiprocessor architectures overview
     ◮ Simple approaches to spin-waiting
     ◮ Software alternatives
     ◮ Queueing approach
     ◮ Evaluation
     ◮ Hardware solutions
     ◮ Conclusions

  3. Introduction
     ◮ Shared-memory multiprocessors
       • Many different architectures
     ◮ Mutual exclusion
       • Software
       • Hardware
     ◮ This paper focuses on spin locks
     ◮ Goals
       • Scalable
       • Low-latency

  4. Multiprocessor Architectures Overview
     ◮ Two dimensions
       • Interconnect type (bus or multistage network)
       • Cache coherence (CC) strategy

  5. Multiprocessor Architectures Overview
     ◮ Interconnect type – multistage interconnection network

  6. Multiprocessor Architectures Overview
     ◮ Interconnect type – single bus

  7. Multiprocessor Architectures Overview
     ◮ CC strategy – with CC

  8. Multiprocessor Architectures Overview
     ◮ CC strategy – without CC

  9. Multiprocessor Architectures Overview
     ◮ Two dimensions
       • Interconnect type (bus or multistage network)
       • Cache coherence (CC) strategy
     ◮ Six architectures considered in this paper
       • Multistage interconnection network without CC
       • Multistage interconnection network with invalidation-based CC using remote directories
       • Bus without CC
       • Bus with snoopy write-through invalidation-based CC
       • Bus with snoopy write-back invalidation-based CC
       • Bus with snoopy distributed-write CC

  10. Multiprocessor Architectures Overview
      ◮ Interconnection network with CC using directories

  12. Multiprocessor Architectures Overview
      ◮ Write-back vs write-through vs distributed-write

  13. Simple Approaches to Spin-waiting – Spin on Test-and-Set

          while (TestAndSet(lock) = BUSY);
          <critical section>
          lock := CLEAR;

      ◮ Problems
        • The lock holder must contend with the spinning processors for exclusive access to the lock location.
        • Each spinning processor consumes bus or network bandwidth.
      ◮ Tradeoff: the more frequently a processor polls, the sooner it acquires the lock after a release, but the more it disrupts other processors.
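
The spin-on-test-and-set loop above can be sketched in Python. Python has no test-and-set instruction, so this sketch assumes a non-blocking `threading.Lock.acquire` is an acceptable stand-in: it atomically sets the flag and reports whether the flag was already set. The names `SpinOnTestAndSet` and `count_with_lock` are illustrative and do not come from the paper.

```python
import threading


class SpinOnTestAndSet:
    """Spin on test-and-set, emulated with a non-blocking Lock.acquire."""

    def __init__(self):
        # Assumption: a locked threading.Lock stands for BUSY, unlocked for CLEAR.
        self._flag = threading.Lock()

    def _test_and_set(self):
        # Atomically set the flag to BUSY; return True if it was already BUSY.
        return not self._flag.acquire(blocking=False)

    def acquire(self):
        # while (TestAndSet(lock) = BUSY);
        while self._test_and_set():
            pass

    def release(self):
        # lock := CLEAR;
        self._flag.release()


def count_with_lock(lock, workers=4, rounds=200):
    """Have several threads increment a shared counter under the lock."""
    total = 0

    def body():
        nonlocal total
        for _ in range(rounds):
            lock.acquire()
            total += 1          # critical section
            lock.release()

    threads = [threading.Thread(target=body) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Every pass through the loop in `acquire` is an atomic write attempt, which is the source of the contention listed under Problems. If the lock provides mutual exclusion, `count_with_lock(SpinOnTestAndSet())` should return `workers * rounds`.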

  14. Simple Approaches to Spin-waiting – Spin on Read

          while (lock = BUSY or TestAndSet(lock) = BUSY);
          <critical section>
          lock := CLEAR;

      ◮ It is a good idea, isn’t it?
      ◮ Problems
        • There is a window between detecting that the lock has been released and attempting to acquire it, so several processors can see the lock free and each issue a test-and-set.
        • Cached copies of the lock value are invalidated by a test-and-set instruction even if the value is not changed.
        • Invalidation-based CC requires O(P) bus or network cycles to broadcast a value to P waiting processors.
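
A minimal sketch of spin on read, under the same assumption that a `threading.Lock` can emulate the lock word (`locked()` as the plain read, non-blocking `acquire` as the test-and-set). The counters `reads` and `test_and_sets` and the helper `poll_once` are hypothetical additions that make the behaviour of the loop visible. They are not synchronized, so they are only meaningful in a single-threaded walk-through.

```python
import threading


class SpinOnRead:
    """Spin on read (test-and-test-and-set), emulated with threading.Lock."""

    def __init__(self):
        self._flag = threading.Lock()   # assumption: locked means BUSY
        self.reads = 0                  # plain reads of the lock value
        self.test_and_sets = 0          # atomic test-and-set attempts

    def _is_busy(self):
        self.reads += 1
        return self._flag.locked()

    def _test_and_set(self):
        self.test_and_sets += 1
        return not self._flag.acquire(blocking=False)

    def poll_once(self):
        """One evaluation of the loop condition; True if the lock was acquired."""
        # lock = BUSY or TestAndSet(lock) = BUSY
        return not (self._is_busy() or self._test_and_set())

    def acquire(self):
        while not self.poll_once():
            pass

    def release(self):
        self._flag.release()            # lock := CLEAR
```

After one uncontended `acquire()`, polling the held lock five more times raises `reads` to 6 while `test_and_sets` stays at 1, because `or` short-circuits and the test-and-set is attempted only once the lock has been seen free.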

  15. Simple Approaches to Spin-waiting – Performance

  16. Simple Approaches to Spin-waiting – Performance
      ◮ Quiescence time for spin on read

  17. Simple Approaches to Spin-waiting – Performance
      ◮ Why quiescence is so slow for spin on read
        • When the lock is released, all cached copies are invalidated
        • The next read by each spinning processor incurs a read miss
        • Many processors see the lock free
        • Each of them tries to execute a test-and-set
        • All but one attempt fails
        • After the last processor does its test-and-set, every other processor takes one more read miss and then quiesces

  18. Software Alternatives
      The author presents five software alternatives:
      ◮ Four based on CSMA network protocols, which differ by:
        • Where to delay
          - Delay after the lock has been released
          - Delay after every separate access to the lock
        • Whether the size of the delay is set statically or dynamically
      ◮ One novel approach: explicit queueing

  19. Software Alternatives – Where to Delay

      Delay after Spinner Notices Released Lock

          while (lock = BUSY or TestAndSet(lock) = BUSY)
          begin
              while (lock = BUSY);
              Delay();
          end;

      Delay between Each Memory Reference

          while (lock = BUSY or TestAndSet(lock) = BUSY)
              Delay();
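
The two placements of the delay can be sketched side by side. As an assumption of this sketch, a `threading.Lock` again emulates the lock word, and the delay is passed in as a callable so that its size can be chosen separately. The function names and `short_delay` are illustrative only.

```python
import threading
import time


def short_delay():
    """Hypothetical fixed delay; the value is arbitrary."""
    time.sleep(0.0005)


def acquire_delay_after_release(flag, delay):
    """Spin on read and pause only after noticing that the lock was released."""
    while flag.locked() or not flag.acquire(blocking=False):
        while flag.locked():    # while (lock = BUSY);
            pass
        delay()                 # Delay();


def acquire_delay_every_reference(flag, delay):
    """Pause between every access to the lock location."""
    while flag.locked() or not flag.acquire(blocking=False):
        delay()                 # Delay();
```

In the first variant a waiter that finds the lock free on arrival pays no delay at all. In the second, every failed poll costs one delay, which reduces traffic to the lock location but can postpone noticing a release.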

  20. Software Alternatives – How to Set the Size of Delay

      Static Delay
      ◮ Each spinning processor is statically assigned its own slot: a fixed delay that differs from every other processor's.
      ◮ Performs well when the number of slots matches the load:
        • Few slots when few processors are spinning
        • Many slots when many processors are spinning
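
A small sketch of the slot tradeoff. The slide does not say how slots are assigned, so the round-robin rule (`processor_id % num_slots`), the function names, and the time unit are all assumptions made for illustration.

```python
def static_delay(processor_id, num_slots, slot_time):
    """Delay for a processor whose slot is fixed in advance."""
    # Assumption: slots are handed out round-robin by processor id.
    return (processor_id % num_slots) * slot_time


def slot_summary(processor_ids, num_slots, slot_time):
    """Return (longest delay, number of processors whose slot is already taken)."""
    delays = [static_delay(p, num_slots, slot_time) for p in processor_ids]
    shared = len(delays) - len(set(delays))
    return max(delays), shared
```

With eight spinners and a slot time of 10, two slots give `(10, 6)`: delays stay short, but six processors share a slot with an earlier one and will collide. Eight slots give `(70, 0)`: no sharing, at the cost of a longer worst-case delay. With only processors 0 and 7 spinning, eight slots still make processor 7 wait 70 although nobody else is competing.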

  21. Software Alternatives – How to Set the Size of Delay

      Dynamic Delay
      ◮ Like Ethernet’s exponential backoff
      ◮ If a processor “collides” with another processor, it backs off for a longer delay
      ◮ Problems
        • When the critical section is long, a spinning processor keeps backing off for as long as the lock is held
        • How long the initial delay should be
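
A hedged sketch of Ethernet-style backoff applied to a spin lock. It combines the dynamic delay with the "delay between each memory reference" placement. The doubling factor, the randomized delay, the cap on the mean delay, and the default values are assumptions of this sketch, and the cap is one way to limit the first problem listed above. A `threading.Lock` again emulates the lock word.

```python
import random
import threading
import time


def next_mean_delay(mean, cap):
    """Double the mean delay after a failed attempt, bounded by cap."""
    return min(2 * mean, cap)


def backoff_schedule(initial, cap, attempts):
    """Mean delay used before each of the first `attempts` retries."""
    schedule, mean = [], initial
    for _ in range(attempts):
        schedule.append(mean)
        mean = next_mean_delay(mean, cap)
    return schedule


def acquire_with_backoff(flag, initial=1e-5, cap=1e-3):
    """Spin on read, backing off longer after every failed attempt."""
    mean = initial
    while flag.locked() or not flag.acquire(blocking=False):
        # "Collision": the lock was busy, or another waiter won the race.
        time.sleep(random.uniform(0, 2 * mean))
        mean = next_mean_delay(mean, cap)   # assumption: bounded backoff
```

For example, `backoff_schedule(1, 8, 6)` yields `[1, 2, 4, 8, 8, 8]`: the delay grows while attempts keep failing and then stays at the cap instead of growing without bound while the lock is held.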

  22. Queueing
      ◮ Delay-based approaches separate contending accesses in time
      ◮ Queueing separates contending accesses in space
      ◮ A naive approach
        • Maintain an explicit queue of spinning processors
        • Each arriving processor enqueues itself and spins on a separate flag
        • Invalidations are reduced if each processor’s flag is kept in a separate cache block
        • But maintaining queues is expensive: the enqueue and dequeue operations must themselves be locked

  23. Queueing – Implementation
      ◮ A more efficient approach

      Init
          flags[0] := HAS_LOCK;
          flags[1..P-1] := MUST_WAIT;
          queueLast := 0;

      Lock
          myPlace := ReadAndIncrement(queueLast);
          while (flags[myPlace mod P] = MUST_WAIT);

      <critical section>

      Unlock
          flags[myPlace mod P] := MUST_WAIT;
          flags[(myPlace + 1) mod P] := HAS_LOCK;
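
The queue lock above can be sketched in Python as follows. Python has no atomic read-and-increment, so as an assumption of this sketch a small mutex guards the counter in place of the hardware instruction. The `time.sleep(0)` in the spin loop is also an addition: it only yields the interpreter so the current holder can run. `ArrayQueueLock` and `entry_order` are illustrative names.

```python
import threading
import time

HAS_LOCK, MUST_WAIT = 1, 0


class ArrayQueueLock:
    """Array-based queue lock for at most num_procs concurrent threads."""

    def __init__(self, num_procs):
        self.P = num_procs
        self.flags = [MUST_WAIT] * num_procs
        self.flags[0] = HAS_LOCK
        self.queue_last = 0
        # Assumption: a mutex emulates the atomic ReadAndIncrement.
        self._counter_guard = threading.Lock()

    def _read_and_increment(self):
        with self._counter_guard:
            place = self.queue_last
            self.queue_last += 1
            return place

    def acquire(self):
        my_place = self._read_and_increment()
        while self.flags[my_place % self.P] == MUST_WAIT:
            time.sleep(0)       # yield so the current holder can run
        return my_place         # the caller hands this back to release()

    def release(self, my_place):
        self.flags[my_place % self.P] = MUST_WAIT
        self.flags[(my_place + 1) % self.P] = HAS_LOCK


def entry_order(num_procs=4, rounds=25):
    """Record each thread's place as it enters the critical section."""
    lock, order = ArrayQueueLock(num_procs), []

    def body():
        for _ in range(rounds):
            place = lock.acquire()
            order.append(place)     # critical section
            lock.release(place)

    threads = [threading.Thread(target=body) for _ in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Because each waiter spins on its own flag and the holder hands the lock to exactly one successor, the places recorded by `entry_order` come out in strictly increasing order. The sketch relies on no more than `num_procs` threads contending at once, matching the P flags in the pseudocode.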

  24. Queueing – Implementation
      ◮ Distributed-write coherence
        • All processors can spin on a single counter
      ◮ Invalidation-based coherence
        • Each processor should wait on a flag in a separate cache block
      ◮ Multistage network without CC
        • Each flag should be placed in a separate memory module
      ◮ Bus without CC
        • A delay is needed between polls
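
For the invalidation-based case, the placement rule amounts to index arithmetic. The sketch below assumes a 64-byte cache block (the slides give no size) and uses hypothetical helper names. It only computes where flags would land and does not control real memory layout.

```python
CACHE_BLOCK_BYTES = 64   # assumption: the block size is not given in the slides


def flag_offset(my_place, num_procs, block_bytes=CACHE_BLOCK_BYTES):
    """Byte offset of a waiter's flag when each flag gets its own block."""
    return (my_place % num_procs) * block_bytes


def blocks_touched(num_procs, stride_bytes, block_bytes=CACHE_BLOCK_BYTES):
    """Number of distinct cache blocks that hold the num_procs flags."""
    return len({(i * stride_bytes) // block_bytes for i in range(num_procs)})
```

With eight one-byte flags packed together, `blocks_touched(8, 1)` is 1, so every hand-off would invalidate the block all waiters are spinning on. Spacing the flags one block apart, `blocks_touched(8, 64)` is 8, and a hand-off disturbs only the next waiter's block.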

  25. Queueing – Problems
      ◮ What if the architecture does not support an atomic read-and-increment instruction?
      ◮ Increases lock latency when there is no contention
      ◮ Makes the preemption problem more severe
      ◮ Makes it more difficult to wait for multiple events

  26. Evaluation – Principal Performance Comparison

  27. Evaluation – Spin-waiting Overhead for a Burst

  28. Hardware Solutions – Multistage Network
      ◮ Combining networks
        • Requests to the same location can be combined and forwarded as a single request
        • Better performance, but may increase latency
      ◮ Hardware queueing
        • Eliminates polling across the network
        • Speeds up passing control of the lock
      ◮ Goodman’s queue links
        • Stores the name of the next processor in the queue directly in each processor’s cache
        • The next processor can be notified without going through the original memory module
