synchronization 2: locks / memory ordering


synchronization 2: locks / memory ordering

last time:
    pthread create/join
    racing: where data is stored (in stacks/etc.)
    (not) making locks from atomic load/store
    making locks (started)
this time — implementing locks: single core intuition: context …


  1. C++: preventing reordering example (1)

     #include <atomic>
     void Alice() {
         note_from_alice = 1;
         do {
             std::atomic_thread_fence(std::memory_order_seq_cst);
         } while (note_from_bob);
         if (no_milk) { ++milk; }
     }

     Alice:
         movl $1, note_from_alice   // note_from_alice ← 1
     .L2:
         mfence                     // make sure store is visible to other cores before loading
         cmpl $0, note_from_bob
         jne .L2                    // if (note_from_bob != 0) repeat fence
         cmpl $0, no_milk
         ...

  2. C++ atomics: no reordering

     std::atomic<int> note_from_alice, note_from_bob;
     void Alice() {
         note_from_alice.store(1);
         do {
         } while (note_from_bob.load());
         if (no_milk) { ++milk; }
     }

     Alice:
         movl $1, note_from_alice
         mfence
     .L2:
         movl note_from_bob, %eax
         testl %eax, %eax
         jne .L2
         ...

  3. mfence

     x86 instruction mfence: make sure all loads/stores in progress finish
     …and make sure no loads/stores were started early
     fairly expensive — Intel 'Skylake': order 33 cycles + time waiting for pending stores/loads

  4. GCC: built-in atomic functions

     used to implement std::atomic, etc. (they predate std::atomic)
     builtin functions starting with __sync and __atomic
     these are what xv6 uses

  5. GCC: preventing reordering example (1)

     void Alice() {
         note_from_alice = 1;       // note_from_alice ← 1
         do {
             __atomic_thread_fence(__ATOMIC_SEQ_CST);
         } while (note_from_bob);
         if (no_milk) { ++milk; }
     }

     Alice:
         movl $1, note_from_alice
     .L3:
         mfence                     // make sure store is visible to other cores before loading
                                    // on x86: not needed on second+ iteration of loop
         cmpl $0, note_from_bob
         jne .L3                    // if (note_from_bob != 0) repeat fence
         cmpl $0, no_milk
         ...

  6. GCC: preventing reordering example (2)

     void Alice() {
         int one = 1;
         __atomic_store(&note_from_alice, &one, __ATOMIC_SEQ_CST);
         do {
         } while (__atomic_load_n(&note_from_bob, __ATOMIC_SEQ_CST));
         if (no_milk) { ++milk; }
     }

     Alice:
         movl $1, note_from_alice
         mfence
     .L2:
         movl note_from_bob, %eax
         testl %eax, %eax
         jne .L2
         ...

  7. connecting CPUs and memory

     multiple processors, common memory
     how do processors communicate with memory?

  8. shared bus

     CPU1, CPU2, CPU3, CPU4, MEM1, MEM2 — all attached to one bus
     tagged messages — everyone gets everything, filters
     contention if multiple communicators;
     some hardware enforces only one at a time

  9. shared buses and scaling

     shared buses perform poorly with "too many" CPUs
     so, there are other designs
     we'll gloss over these for now

  10. shared buses and caches

      remember caches? memory is pretty slow
      each CPU wants to keep local copies of memory
      what happens when multiple CPUs cache the same memory?

  11. the cache coherency problem

      CPU1's cache:             CPU2's cache:
      address  value            address  value
      0xA300   100              0x9300   172
      0xC400   200              0xA300   100
      0xE500   300              0xC500   200

      CPU1 writes 101 to 0xA300, updating its own cached copy (100 → 101).
      When does memory (MEM1) change? When does CPU2's cached copy change?

  13. "snooping" the bus

      every processor already receives every read/write to memory
      take advantage of this to update caches
      idea: use messages to clean up "bad" cache entries

  14. cache coherency states

      extra information stored in each cache, for each cache block
      overlaps with/replaces valid, dirty bits
      update states based on reads, writes, and messages heard on the bus
      different caches may have different states for the same block

      sample states:
      Modified: cache has updated value
      Shared: cache is only reading; has same value as memory/others
      Invalid

  16. scheme 1: MSI

      from state   write         read        hear write    hear read
      Invalid      to Modified   to Shared   —             —
      Shared       to Modified   —           to Invalid    —
      Modified     —             —           to Invalid    to Shared

      (highlighted in the original slide: transitions that require sending a message on the bus)

      example: write while Shared
          must send write — inform others with Shared state, then change to Modified
      example: hear write while Shared
          change to Invalid; can send read later to get value from writer
      example: write while Modified
          nothing to do — no other CPU can have a copy

  19. MSI example

      initially, both CPU1's and CPU2's caches hold 0xA300 = 100 in the Shared state.

      CPU1 writes 101 to 0xA300:
          "CPU1 is writing 0xA300" goes out on the bus
          CPU2's cache sees the write: invalidates its copy of 0xA300
          CPU1's copy becomes Modified (value 101)
          maybe update memory? no — nothing changed yet; memory still holds 100,
          will "fix" it later if there's a read (writeback)

      CPU1 writes 102 to 0xA300:
          Modified state — nothing communicated! CPU1 just updates its cache (102)

      CPU2 reads 0xA300:
          "What is 0xA300?"
          CPU1 is in Modified state — must update for CPU2!
          "Write 102 into 0xA300" — written back to memory early
          CPU1's copy becomes Shared (could also become Invalid at CPU1);
          CPU2 caches 0xA300 = 102, Shared

  25. MSI: update memory

      to write a value (enter Modified state), need to invalidate others
      can avoid sending the actual value (shorter message/faster):
      "I am writing address X" versus "I am writing Y to address X"

  26. MSI: on cache replacement/writeback

      still happens — e.g. want to store something else
      requires writeback if Modified (= dirty bit)
      changes state to Invalid

  27. MSI state summary

      Modified: value may be different than memory, and I am the only one who has it
      Shared: value is the same as memory
      Invalid: I don't have the value; I will need to ask for it

  28. MSI extensions

      extra states for unmodified values where no other cache has a copy
          avoid sending "I am writing" message later
      allow values to be sent directly between caches
          (MSI: value needs to go to memory first)
      support not sending invalidate/etc. messages to all cores
          requires some tracking of what cores have each address
          only makes sense with non-shared-bus design

  29. atomic read-modify-write

      really hard to build locks from atomic loads and stores —
      and normal loads/stores aren't even atomic…
      …so processors provide read/modify/write operations:
      one instruction that atomically reads and modifies and writes back a value

  30. x86 atomic exchange

      lock xchg (%ecx), %eax

      atomic exchange:
          temp ← M[ECX]
          M[ECX] ← EAX
          EAX ← temp
      …without being interrupted by other processors, etc.

  31. test-and-set: using atomic exchange

      one instruction that…
          writes a fixed new value and reads the old value
      write: mark the lock as TAKEN (no matter what)
      read: see if it was already TAKEN (if not, we are the ones who took it)

  33. implementing atomic exchange

      get cache block into Modified state
      recall: Modified state = "I am the only one with a copy"
      do read+modify+write operation while state doesn't change

  34. x86-64 spinlock with xchg

      lock variable in shared memory: the_lock
      if 1: someone has the lock; if 0: lock is free to take

      acquire:
          movl $1, %eax             // %eax ← 1
          lock xchg %eax, the_lock  // swap %eax and the_lock
                                    // sets the_lock to 1 (locked)
                                    // sets %eax to prior value of the_lock
          test %eax, %eax           // if the_lock wasn't 0 before:
          jne acquire               //     try again
          ret

      release:
          movl $0, the_lock         // release lock by setting it to 0 (unlocked)
          ret

      "spin" until lock is released elsewhere: retry if lock was already locked
      Intel's manual says: no reordering of loads/stores across a lock-prefixed
      or mfence instruction (for memory order reasons) — allows looping acquire to finish

  39. some common atomic operations (1)

      // x86: xchg REGISTER, (ADDRESS)
      exchange(register, address) {
          temp = memory[address];
          memory[address] = register;
          register = temp;
      }

      // x86: emulate with exchange
      test_and_set(address) {
          old_value = memory[address];
          memory[address] = 1;
          return old_value != 0;   // e.g. set ZF flag
      }

  40. some common atomic operations (2)

      // x86: lock xaddl REGISTER, (ADDRESS)
      fetch_and_add(address, register) {
          old_value = memory[address];
          memory[address] += register;
          register = old_value;
      }

      // x86: mov OLD_VALUE, %eax; lock cmpxchg NEW_VALUE, (ADDRESS)
      compare_and_swap(address, old_value, new_value) {
          if (memory[address] == old_value) {
              memory[address] = new_value;
              return true;    // x86: set ZF flag
          } else {
              return false;   // x86: clear ZF flag
          }
      }

  41. append to singly-linked list

      /* assumption 1: other threads may be appending to list,
         but nodes are not being removed, reordered, etc.
         assumption 2: the processor will not reorder the stores
         into *new_last_node to take place after the store
         for the compare_and_swap */
      void append_to_list(ListNode *head, ListNode *new_last_node) {
          ListNode *current_last_node = head;
          do {
              while (current_last_node->next) {
                  current_last_node = current_last_node->next;
              }
          } while (
              !compare_and_swap(&current_last_node->next,
                                NULL, new_last_node)
          );
      }

  42. common atomic operation pattern

      try to acquire lock, or update next pointer, or …
      detect if the try failed
      if so, repeat

  43. exercise: fetch-and-add with compare-and-swap

      exercise: implement fetch-and-add with compare-and-swap

      compare_and_swap(address, old_value, new_value) {
          if (memory[address] == old_value) {
              memory[address] = new_value;
              return true;    // x86: set ZF flag
          } else {
              return false;   // x86: clear ZF flag
          }
      }

  44. solution

      long my_fetch_and_add(long *p, long amount) {
          long old_value;
          do {
              old_value = *p;
          } while (!compare_and_swap(p, old_value, old_value + amount));
          return old_value;
      }

  45. xv6 spinlock: acquire

      void
      acquire(struct spinlock *lk)
      {
        pushcli(); // disable interrupts to avoid deadlock.
        ...
        // The xchg is atomic.
        while (xchg(&lk->locked, 1) != 0)
          ;

        // Tell the C compiler and the processor to not move loads or stores
        // past this point, to ensure that the critical section's memory
        // references happen after the lock is acquired.
        __sync_synchronize();
        ...
      }

      pushcli: don't want to be waiting for a lock held by a non-running thread
      xchg wraps the lock xchg instruction — same as the spinlock loop above
      __sync_synchronize: avoid load/store reordering (including by the compiler)
      on x86, xchg alone avoids the processor's reordering
      (but the compiler might need more hints)

  49. xv6 spinlock: release

      // Release the lock, equivalent to lk->locked = 0.
      // This code can't use a C assignment, since it might
      // not be atomic. A real OS would use C atomics here.
      void
      release(struct spinlock *lk)
      {
        ...
        // Tell the C compiler and the processor to not move loads or stores
        // past this point, to ensure that all the stores in the critical
        // section are visible to other cores before the lock is released.
        // Both the C compiler and the hardware may re-order loads and
        // stores; __sync_synchronize() tells them both not to.
        __sync_synchronize();

        asm volatile("movl $0, %0" : "+m" (lk->locked) : );

        popcli();
      }

      the asm volatile turns into a mov of 0 into lk->locked

  53. xv6 spinlock: debugging stuff

      void
      acquire(struct spinlock *lk)
      {
        ...
        if (holding(lk))
          panic("acquire");
        ...
        // Record info about lock acquisition for debugging.
        lk->cpu = mycpu();
        getcallerpcs(&lk, lk->pcs);
      }

      void
      release(struct spinlock *lk)
      {
        if (!holding(lk))
          panic("release");

        lk->pcs[0] = 0;
        lk->cpu = 0;
        ...
      }

  57. spinlock problems

      spinlocks can send a lot of messages on the shared bus,
      which makes every non-cached memory access slower…
      wasting CPU time waiting for another thread —
      could we do something useful instead?

  59. ping-ponging

      CPU1 holds the lock; CPU2 and CPU3 spin on it with atomic
      read-modify-writes. each read-modify-write needs the block in
      the Modified state:

      CPU2 read-modify-writes lock (to see it is still locked):
          "I want to modify lock" — block becomes Modified in CPU2's
          cache, Invalid everywhere else
      CPU3 read-modify-writes lock (to see it is still locked):
          "I want to modify lock" — block moves to CPU3's cache
      …and back and forth, until…
      CPU1 sets lock to unlocked:
          "I want to modify lock" — block moves to CPU1's cache
      some CPU (this example: CPU2) acquires the lock:
          "I want to modify lock?" — block moves to CPU2's cache

  66. ping-ponging test-and-set

      problem: cache block "ping-pongs" between caches
      each waiting processor reserves the block to modify it
      each transfer of the block sends messages on the bus
      …so the bus can't be used for real work
      like what the processor with the lock is doing

  67. test-and-test-and-set (pseudo-C)

      acquire(int *the_lock) {
          do {
              while (ATOMIC_READ(the_lock) != 0) { /* try again */ }
          } while (ATOMIC_TEST_AND_SET(the_lock) == ALREADY_SET);
      }

  68. test-and-test-and-set (assembly)

      acquire:
          cmp $0, the_lock          // test the lock non-atomically
                                    // unlike lock xchg --- keeps lock in Shared state!
          jne acquire               // try again (still locked)
          // lock possibly free
          // but another processor might lock
          // before we get a chance to
          // ... so try with atomic swap:
          movl $1, %eax             // %eax ← 1
          lock xchg %eax, the_lock  // swap %eax and the_lock
                                    // sets the_lock to 1
                                    // sets %eax to prior value of the_lock
          test %eax, %eax           // if the_lock wasn't 0 (someone else got it first):
          jne acquire               //     try again
          ret

  69. less ping-ponging

      CPU1 holds the lock (Modified state, value locked); CPU2 and CPU3
      spin, but now with plain reads:

      CPU2 reads lock (to see it is still locked):
          "I want to read lock" — CPU1 writes back the lock value,
          then CPU2 reads it; both copies become Shared
      CPU3 reads lock:
          "I want to read lock?" — CPU3's copy becomes Shared too
      CPU2, CPU3 continue to read lock from cache:
          no messages on the bus!
      CPU1 sets lock to unlocked:
          "I want to modify lock" — the Shared copies are invalidated
      some CPU (this example: CPU2) acquires the lock:
          "set lock to locked" — CPU1 writes back the value,
          then CPU2 reads + modifies it
