  1. Synchronization • Presented by Radu Teodorescu • CS533

  2. Why do we need it? • Parallel programs share data! • Consistency of shared data structures • Access serialization • Coordination between processors • Allows queueing, ordering

  3. For today • How synchronization works • Synchronization operations • Hardware primitives • Implementations

  4. How it works • Components of a synchronization event: • ACQUIRE method: acquire the right to the synchronization • WAITING algorithm: wait for the synchronization to become available • RELEASE method: allow other processes to proceed past the synchronization

  5. Waiting algorithms: busy-waiting • The process spins in a loop, repeatedly testing for a status change • PROS: low latency • CONS: ties up the processor, higher traffic (network, bus, cache)
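
A minimal C sketch of busy-waiting (not from the slides; the ready flag and function names are illustrative):

      #include <stdatomic.h>

      atomic_int ready = 0;            /* status word the waiter spins on */

      void busy_wait(void)
      {
          /* Spin: repeatedly test until another processor changes the status.
             Latency is low once the change happens, but the processor and the
             memory system stay busy for the whole wait. */
          while (atomic_load(&ready) == 0)
              ;
      }

      void signal_ready(void)
      {
          atomic_store(&ready, 1);     /* the status change the spinner tests for */
      }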

  6. Waiting algorithms: blocking • The process suspends, releases the processor, and waits to be awakened • PROS: frees the processor for other jobs • CONS: higher overhead; pays off only when the scheduling overhead is less than the expected wait time
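
For contrast, a hedged sketch of blocking with POSIX threads: the waiter suspends in pthread_cond_wait and the scheduler wakes it later (again, names are illustrative, not from the slides):

      #include <pthread.h>

      pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
      pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
      int ready = 0;

      void blocking_wait(void)
      {
          pthread_mutex_lock(&m);
          while (!ready)
              pthread_cond_wait(&cv, &m);   /* suspend and release the processor */
          pthread_mutex_unlock(&m);
      }

      void wake_waiters(void)
      {
          pthread_mutex_lock(&m);
          ready = 1;
          pthread_cond_broadcast(&cv);      /* scheduling overhead is paid here */
          pthread_mutex_unlock(&m);
      }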

  7. Synch operations • Locks: grant access to one process at a time • Barriers: no process advances beyond the barrier until all have arrived • Semaphores • Monitors • … • Built from some combination of hardware primitives and software

  8. Hardware primitives • Synchronization requires some ATOMIC operation • Some flavor of atomic read-modify-write: • Atomic exchange • Test & set • Fetch & increment • These are used to implement locks, barriers, etc.
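
In C11 these read-modify-write flavors are exposed through <stdatomic.h>; a small sketch (variable names are made up for illustration):

      #include <stdatomic.h>
      #include <stdbool.h>

      atomic_int  word    = 0;
      atomic_int  counter = 0;
      atomic_flag flag    = ATOMIC_FLAG_INIT;

      void rmw_examples(void)
      {
          int  old     = atomic_exchange(&word, 1);          /* atomic exchange */
          bool was_set = atomic_flag_test_and_set(&flag);    /* test & set */
          int  ticket  = atomic_fetch_add(&counter, 1);      /* fetch & increment */
          (void)old; (void)was_set; (void)ticket;            /* each returns the old value */
      }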

  9. Test & set • Tests a value and sets it if the test passes • Also used to implement locks:
        lock: ADD  R2, R0, #1
              T&S  R2, (R1)
              BNEZ R2, lock
     • In cache-coherent machines: lots of invalidations
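
A C11 equivalent of this test-and-set lock, sketched with atomic_flag; the busy-wait here causes the same invalidation traffic the slide points out:

      #include <stdatomic.h>

      atomic_flag lock = ATOMIC_FLAG_INIT;

      void tas_lock(void)
      {
          /* test_and_set writes 1 and returns the old value;
             keep retrying while the old value was already 1 (lock held). */
          while (atomic_flag_test_and_set(&lock))
              ;
      }

      void tas_unlock(void)
      {
          atomic_flag_clear(&lock);
      }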

  10. Test and test & set • Take advantage of locality • Test before write (to avoid an invalidation):
        lock: LD   R2, (R1)
              BNEZ R2, lock
              ADD  R2, R0, #1
              T&S  R2, (R1)
              BNEZ R2, lock
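
The same lock with the read-only test first, sketched with an atomic_bool so the flag can be loaded without writing it; here atomic_exchange(&lock, true) plays the role of test & set:

      #include <stdatomic.h>
      #include <stdbool.h>

      atomic_bool lock = false;

      void ttas_lock(void)
      {
          for (;;) {
              /* test: spin on a cached copy, generating no invalidations */
              while (atomic_load(&lock))
                  ;
              /* test & set: attempt the write only once the lock looks free */
              if (!atomic_exchange(&lock, true))
                  return;                    /* old value was false: lock acquired */
          }
      }

      void ttas_unlock(void)
      {
          atomic_store(&lock, false);
      }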

  11. Atomic exchange • Exchange a value in a register with memory: EXCH Reg, Mem • Can be used to implement a spin lock:
        lock: LD   R2, (R1)
              BNEZ R2, lock
              ADD  R2, R0, #1
              EXCH R2, (R1)
              BNEZ R2, lock
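
In C, the exchange-based spin lock would look just like the test-and-test-and-set sketch above, with atomic_exchange as the write. Exchange is also a handy way to take sole ownership of a shared value; a hedged sketch with an illustrative pending-work pointer:

      #include <stdatomic.h>
      #include <stddef.h>

      typedef struct work { int payload; } work_t;

      _Atomic(work_t *) pending = NULL;

      /* Atomically swaps the shared pointer with NULL; exactly one of any
         number of concurrent callers receives the item, the rest get NULL. */
      work_t *claim_pending(void)
      {
          return atomic_exchange(&pending, NULL);
      }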

  12. Fetch and increment • Reads a memory value and increments it atomically • Useful for barrier implementation • Can be used to count how many processes have reached the barrier
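
A sketch of a centralized sense-reversing barrier built on fetch-and-increment (P and all names are illustrative; each thread keeps its own local_sense, initialized to false):

      #include <stdatomic.h>
      #include <stdbool.h>

      #define P 4                           /* number of participating processes */

      atomic_int  count = 0;                /* how many have arrived so far */
      atomic_bool sense = false;            /* global sense, flipped each episode */

      void barrier_wait(bool *local_sense)
      {
          *local_sense = !*local_sense;                  /* this episode's sense */
          if (atomic_fetch_add(&count, 1) == P - 1) {    /* last arrival */
              atomic_store(&count, 0);                   /* reset for the next episode */
              atomic_store(&sense, *local_sense);        /* release the others */
          } else {
              while (atomic_load(&sense) != *local_sense)
                  ;                                      /* busy-wait for release */
          }
      }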

  13. Performance • How expensive is locking with test & set? • N processors waiting • No hold time
        lock: ADD  R2, R0, #1
              T&S  R2, (R1)
              BNEZ R2, lock
     • ∑_{i=1}^{N} (2i + 1) = N² + 2N
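
The closed form is just an arithmetic series (assuming the sum runs over the N waiting processors, i = 1..N):

      \sum_{i=1}^{N} (2i + 1)
        = 2 \sum_{i=1}^{N} i + \sum_{i=1}^{N} 1
        = 2 \cdot \frac{N(N+1)}{2} + N
        = N^2 + 2N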

  14. Hardware complexity • A single, atomic memory operation is hard to implement in hardware: • it complicates the coherence protocol • it must be uninterruptible • it must avoid deadlocks

  15. Reducing complexity • Maintain the atomicity requirement, but split it across two linked instructions • The second indicates whether the pair executed atomically • MIPS/SGI: Load-Linked (LL), Store-Conditional (SC)

  16. LL/SC • LL returns the value of a memory location • SC attempts to store a new value to the same location • If the pair did not execute atomically: SC returns 0 and does not update memory • If the pair executed atomically: SC returns 1 and updates memory

  17. LL/SC
        lock: LL   R2, (R1)
              BNEZ R2, lock
              ADD  R2, R0, #1
              SC   R2, (R1)
              BEQZ R2, lock
     • SC also fails if the processor context-switches between LL and SC
     • Can implement other primitives: exchange, fetch-and-add, …
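
As a sketch of building other primitives this way, here is fetch-and-add written with C11 atomic_compare_exchange_weak; on LL/SC machines compilers typically lower this to the same retry-on-failure loop shown above (the function name is illustrative):

      #include <stdatomic.h>

      int fetch_and_add(atomic_int *p, int inc)
      {
          int old = atomic_load(p);
          /* Retry until the update goes through atomically, mirroring the
             branch-back-to-LL loop; a failed attempt refreshes old from *p. */
          while (!atomic_compare_exchange_weak(p, &old, old + inc))
              ;
          return old;                       /* value before the increment */
      }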

  18. Implementations • HEP Multiprocessor • NYU Ultracomputer • IBM RP3 • Illinois Cedar

  19. HEP multiprocessor • Each word in memory has a Full/Empty bit • The bit is tested in hardware before a RD/WR • The RD/WR blocks until the test succeeds: • RD waits until full • WR waits until empty • When the test succeeds, the bit is set to the opposite value
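
A rough software analogue of one Full/Empty word, assuming a single producer and a single consumer per word (the real HEP enforces the test in hardware for any number of processes and queues the blocked process instead of spinning):

      #include <stdatomic.h>

      typedef struct {
          atomic_int full;                  /* the F/E bit: 0 = empty, 1 = full */
          int        data;                  /* the word itself */
      } fe_word_t;

      /* WR: wait until empty, deposit the value, flip the bit to full. */
      void fe_write(fe_word_t *w, int v)
      {
          while (atomic_load(&w->full))
              ;
          w->data = v;
          atomic_store(&w->full, 1);
      }

      /* RD: wait until full, take the value, flip the bit to empty. */
      int fe_read(fe_word_t *w)
      {
          while (!atomic_load(&w->full))
              ;
          int v = w->data;
          atomic_store(&w->full, 0);
          return v;
      }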

  20. HEP multiprocessor • PROS: • Very efficient for low-level dependences (compared to locks) • CONS: • Requires complex hardware: • F/E bits • Support to queue a process when the test fails • Logic to implement indivisible operations

  21. NYU ultracomputer • Implements fetch-and-add • PROS: • Can use message combining, scales well • Efficient barrier implementation • CONS: • Very complex network • Adders in each memory module

  22. Message combining • [Diagram: two requests, F&A(X, 1) and F&A(X, 3), meet at a switch and are combined into a single F&A(X, 4); with X initially 5, memory returns 5 and updates X to 9; the switch, which remembered the held increment, then hands 5 and 8 back to the two requesters, as if the operations had run one after the other]

  23. IBM RP3 • Implements fetch-and-phi, where phi can be: • Add, And, Or • Min, Max, Store • Store if zero • PRO: generality • CON: hardware complexity
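
The generality can be sketched in software with a compare-and-exchange loop; here phi = min (purely illustrative, not the RP3 hardware path; the other ops fit the same pattern):

      #include <stdatomic.h>

      /* fetch-and-phi with phi = min: atomically replaces *p with
         min(*p, v) and returns the value that was there before. */
      int fetch_and_min(atomic_int *p, int v)
      {
          int old = atomic_load(p);
          while (old > v && !atomic_compare_exchange_weak(p, &old, v))
              ;                             /* old is refreshed on failure */
          return old;
      }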

  24. Illinois Cedar • A general atomic instruction that operates on synchronization variables • A synch variable has two words: Key and Value • Synch instruction: {addr; (cond); op on key; op on value} • F/E-bit test for a read: {X; (X.key == 1)*; decrement; fetch} • Complex hardware: a special processor for each memory module
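
A hedged sketch of the example read above, packing Key and Value into one 64-bit atomic so the condition test, key op, and value fetch happen together (the packing, names, and spin-wait are assumptions for illustration, not the Cedar hardware):

      #include <stdatomic.h>
      #include <stdint.h>

      typedef _Atomic uint64_t synch_var_t;    /* high 32 bits: key, low 32 bits: value */

      #define KEY(sv)    ((uint32_t)((sv) >> 32))
      #define VAL(sv)    ((uint32_t)(sv))
      #define PACK(k, v) (((uint64_t)(k) << 32) | (uint32_t)(v))

      /* {X; (X.key == 1)*; decrement; fetch}: wait until the key is 1,
         then atomically decrement the key and fetch the value. */
      uint32_t synch_read(synch_var_t *x)
      {
          uint64_t old = atomic_load(x);
          for (;;) {
              if (KEY(old) != 1) {               /* condition fails: keep waiting */
                  old = atomic_load(x);
                  continue;
              }
              uint64_t next = PACK(KEY(old) - 1, VAL(old));
              if (atomic_compare_exchange_weak(x, &old, next))
                  return VAL(old);               /* fetch the value */
          }
      }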

  25. That’s it for today!
