5 Chip Multiprocessors (II)
Chip Multiprocessors (ACS MPhil)


1. 5: Chip Multiprocessors (II)
   Chip Multiprocessors (ACS MPhil)
   Robert Mullins

2. Overview
   • Synchronization hardware primitives
   • Cache Coherency Issues
     – Coherence misses
     – Cache coherence and interconnects
   • Directory-based Coherency Protocols

3. Synchronization
   • The lock problem
     – The lock is supposed to provide atomicity for critical sections
     – Unfortunately, as implemented below, this lock is lacking atomicity in its own implementation
     – Multiple processors could read the lock as free and progress past the branch simultaneously

   lock:   ld  reg, lock-addr   ; read the lock
           cmp reg, #0          ; is it free?
           bnz lock             ; if taken, try again
           st  lock-addr, #1    ; claim the lock
           ret
   unlock: st  lock-addr, #0    ; release the lock
           ret

   Culler p.338
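The same broken lock rendered in C (a minimal sketch; the function names are illustrative) makes the race visible: two threads can both observe lock == 0 before either stores 1, so both enter the critical section.

   /* BROKEN: the read and the write are separate operations, so two
    * threads can both see lock == 0 and both "claim" the lock. */
   volatile int lock = 0;

   void acquire(void) {
       while (lock != 0)   /* ld + cmp + bnz */
           ;
       lock = 1;           /* st: another thread may already be here */
   }

   void release(void) {
       lock = 0;
   }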

4. Synchronization
   • Test and Set
     – Executes the following atomically:
       • reg = m[lock-addr]
       • m[lock-addr] = 1
     – The branch makes sure that if the lock was already taken we try again
     – A more general, but similar, instruction is swap:
       • reg1 = m[lock-addr]
       • m[lock-addr] = reg2

   lock:   t&s reg, lock-addr   ; atomically read the lock and set it to 1
           bnz reg, lock        ; if it was already taken, try again
           ret
   unlock: st  lock-addr, #0
           ret
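For comparison, the same spinlock can be sketched in C with the GCC/Clang __atomic builtins (the spinlock_t type and function names are assumptions, not from the slides); __atomic_test_and_set performs exactly the atomic read-and-set that the t&s instruction provides.

   #include <stdbool.h>

   typedef struct { bool locked; } spinlock_t;   /* illustrative type */

   void spin_lock(spinlock_t *l) {
       /* Atomically set the flag and return its previous value,
        * just like t&s: spin while the lock was already taken. */
       while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
           ;
   }

   void spin_unlock(spinlock_t *l) {
       __atomic_clear(&l->locked, __ATOMIC_RELEASE);
   }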

5. Synchronization
   • We could implement test&set with two bus transactions
     – A read transaction and a write transaction
     – We could lock down the bus for these two transactions to ensure the sequence is atomic
     – More difficult with a split-transaction bus
       • performance and deadlock issues

   Culler p.391

6. Synchronization
   • If we assume an invalidation-based CC protocol with a WB cache, a better approach is to:
     – Issue a read-exclusive (BusRdX) transaction, then perform the read and write (in the cache) without giving up ownership
     – Any incoming requests to the block are buffered until the data is written in the cache
       • Any other processors are forced to wait

7. Synchronization
   • Other common synchronization instructions (C examples below):
     – swap
     – fetch&op
       • fetch&inc
       • fetch&add
     – compare&swap
     – Many x86 instructions can be prefixed with the "lock" modifier to make them atomic
   • A simpler general-purpose solution?
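These read-modify-write primitives are exposed directly by C11's <stdatomic.h>; a minimal sketch follows (the variable and the values are illustrative). On x86 these calls typically compile to lock xadd, lock cmpxchg and xchg respectively.

   #include <stdatomic.h>
   #include <stdio.h>

   int main(void) {
       atomic_int x = 0;

       int old = atomic_fetch_add(&x, 1);       /* fetch&add: x = 1, old = 0 */

       int expected = 1;                        /* compare&swap: install 2   */
       atomic_compare_exchange_strong(&x, &expected, 2);  /* iff x is still 1 */

       int prev = atomic_exchange(&x, 0);       /* swap: prev = 2, x = 0     */

       printf("%d %d %d\n", old, expected, prev);  /* prints: 0 1 2 */
       return 0;
   }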

8. LL/SC
   • LL/SC
     – Load-Linked (LL)
       • Read memory
       • Set the lock flag and put the address in the lock register
       • Intervening writes to the address in the lock register will cause the lock flag to be reset
     – Store-Conditional (SC)
       • Check the lock flag to ensure an intervening conflicting write has not occurred
       • If the lock flag is not set, SC will fail

   sc rt, addr: if (atomic_update) then mem[addr] = rt, rt = 1
                                   else rt = 0

9. LL/SC

   lock:   ll   reg1, lock-addr  ; load-linked read of the lock
           bnz  reg1, lock       ; lock already taken? try again
           mov  reg2, #1         ; SC overwrites reg2, so set it on every attempt
           sc   lock-addr, reg2  ; reg2 = 1 on success, 0 on failure
           beqz reg2, lock       ; if SC failed goto lock
           ret
   unlock: st   lock-addr, #0
           ret
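The same retry pattern appears whenever a read-modify-write is built from LL/SC. A hedged C sketch of fetch&inc written as a compare-exchange loop (on ARM or RISC-V the compiler lowers this to an LL/SC pair such as ldrex/strex or lr.w/sc.w); the weak variant is allowed to fail spuriously, much like the restricted RLL/RSC described on slide 12.

   #include <stdatomic.h>

   /* Atomically increment *p and return the old value. The loop
    * retries whenever another write intervenes, exactly as a
    * failed SC forces the code above back to the LL. */
   int fetch_inc(atomic_int *p) {
       int old = atomic_load(p);
       while (!atomic_compare_exchange_weak(p, &old, old + 1))
           ;  /* on failure, old is refreshed with the current value */
       return old;
   }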

10. LL/SC
    [Figure: this SC will fail, as the lock flag will be reset by the store from P2. Culler p.391]

11. LL/SC
    • LL/SC can be implemented using the CC protocol:
      – LL loads the cache line with write permission (issues BusRdX, holds the line in state M)
      – SC only succeeds if the cache line is still in state M; otherwise it fails

12. LL/SC
    • Need to ensure forward progress
      – e.g. prevent LL from giving up M state for n cycles, or guarantee success after repeated failures (i.e. simply don't give up M state)
    • We normally implement a restricted form of LL/SC called RLL/RSC:
      – SC may experience spurious failures
        • e.g. due to context switches and TLB misses
      – We add restrictions to prevent the cache line (holding the lock variable) from being replaced:
        • Disallow memory-referencing instructions between LL and SC
        • Prohibit out-of-order execution between LL and SC

13. Coherence misses
    • Remember your 3 C's!
      – Compulsory
        • Cold-start or first-reference misses
      – Capacity
        • The cache is not large enough to store all the blocks needed during the execution of the program
      – Conflict (or collision)
        • Conflict misses occur due to direct-mapped or set-associative block placement strategies
      – Coherence
        • Misses that arise due to interprocessor communication

14. True sharing
    • A block typically contains many words (e.g. 4-8). Coherency is maintained at the granularity of cache blocks
      – True sharing miss
        • Misses that arise from the communication of data
        • e.g. the first write to a shared block (S) will cause an invalidation to establish ownership
        • Additionally, subsequent reads of the invalidated block by another processor will also cause a miss
        • Both these misses are classified as true sharing if data is communicated; they would occur irrespective of block size (see the C sketch below)
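A hedged C sketch of the pattern above (the producer/consumer framing is an assumption): P1's first store invalidates the block in P2's cache, and P2's next load misses; both misses communicate real data, so they are true sharing misses at any block size.

   #include <stdatomic.h>

   atomic_int flag = 0;
   int data = 0;

   void producer(void) {           /* runs on P1 */
       data = 42;                  /* first write: invalidates sharers (S -> M) */
       atomic_store(&flag, 1);
   }

   int consumer(void) {            /* runs on P2 */
       while (atomic_load(&flag) == 0)
           ;                       /* subsequent read re-fetches the block */
       return data;                /* the communicated value */
   }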

15. False sharing
    • False sharing miss
      – Different processors are writing and reading different words in a block, but no communication is taking place
        • e.g. a block may contain words X and Y
        • P1 repeatedly writes to X, P2 repeatedly writes to Y
        • The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place
      – These are false misses, due to the fact that the block contains multiple words
        • They would not occur if the block size were a single word (a C demonstration follows)

    For more details see "Coherence miss classification for performance debugging in multi-core processors", Venkataramani et al., INTERACT-2013
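A minimal C sketch of the X/Y scenario (the thread code, iteration count and 64-byte line size are assumptions): with the padding removed, both counters share one block and it ping-pongs between the two caches; with the padding in place, each counter sits in its own block and the false sharing disappears.

   #include <pthread.h>
   #include <stdint.h>

   #define LINE 64  /* assumed cache-block size in bytes */

   /* Pad each counter to a full block; delete 'pad' to put both
    * counters in one block and provoke false sharing. */
   struct counter { volatile uint64_t n; char pad[LINE - sizeof(uint64_t)]; };
   static struct counter c[2];

   static void *worker(void *arg) {
       struct counter *me = arg;
       for (long i = 0; i < 100000000L; i++)
           me->n++;  /* each thread writes only its own word */
       return NULL;
   }

   int main(void) {
       pthread_t t[2];
       for (int i = 0; i < 2; i++)
           pthread_create(&t[i], NULL, worker, &c[i]);
       for (int i = 0; i < 2; i++)
           pthread_join(t[i], NULL);
       return 0;
   }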

16. Cache coherence and interconnects
    • Broadcast-based snoopy protocols
      – These protocols rely on bus-based interconnects
        • Buses have limited scalability
        • Energy and bandwidth implications of broadcasting
      – They permit direct cache-to-cache transfers
        • Low-latency communication in 2 "hops":
          1. broadcast
          2. receive data from the remote cache
        • Very useful for applications with lots of fine-grain sharing

17. Cache coherence and interconnects
    • Totally-ordered interconnects
      – All messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch
      – e.g. a bus or a pipelined broadcast tree
      – Traditional snoopy protocols are built around the concept of a bus (or virtual bus):
        • (1) Broadcast: all transactions are visible to all components connected to the bus
        • (2) Total order: the interconnect provides a total order of messages

18. Cache coherence and interconnects
    A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols. The centralised switch guarantees a total ordering of messages, i.e. messages are sent to the root switch and then broadcast.
    [Figure reproduced from Milo Martin's PhD thesis (Wisconsin)]

19. Cache coherence and interconnects
    • Unordered interconnects
      – Networks (e.g. mesh, torus) typically can't provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order
    • Point-to-point ordering
      – Networks may be able to ensure that messages sent between a pair of nodes are never reordered
      – e.g. a mesh with a single VC and deterministic dimension-ordered (XY) routing

20. Directory-based cache coherence
    • In a snoopy protocol, the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus
    • We want to avoid the need to broadcast, so we maintain the state of each block explicitly
      – We store this information in the directory
      – Requests can be made to the appropriate directory entry to read or write a particular block
      – The directory orchestrates the actions necessary to satisfy the request

21. Directory-based cache coherence
    • The directory provides a per-block ordering point to resolve races
      – All requests for a particular block are made to the same directory. The directory decides the order in which the requests will be satisfied
      – Directory protocols can therefore operate over unordered interconnects

22. Broadcast-based directory protocols
    • A number of recent coherence protocols broadcast transactions over unordered interconnects:
      – Similar to snoopy coherence protocols
      – They provide a directory, or coherence hub, that serves as an ordering point. The directory simply broadcasts requests to all nodes (no sharer state is maintained)
      – The ordering point also buffers subsequent coherence requests to the same cache line to prevent races with a request in progress
      – An early example is AMD's Hammer protocol
      – High bandwidth requirements, but simple: no need to maintain or read sharer state

23. Directory-based cache coherence
    • The directory keeps track of who has a copy of the block and their states (a C sketch of one directory entry follows)
      – Broadcasting is replaced by cheaper point-to-point communication by maintaining a list of sharers
      – The number of invalidations on a write is typically small in real applications, giving a significant reduction in communication costs (especially in systems with a large number of processors)
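To make the bookkeeping concrete, here is a hedged C sketch of a full-map directory entry (the 64-node limit, the state names and the handler are illustrative assumptions, not from the slides): a state field plus a bit-vector of sharers, so a write invalidates exactly the caches whose bits are set rather than broadcasting to every node.

   #include <stdint.h>
   #include <stdio.h>

   typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

   typedef struct {
       dir_state_t state;
       uint64_t    sharers;  /* bit i set => cache i holds a copy */
       int         owner;    /* valid when state == MODIFIED */
   } dir_entry_t;

   /* Stand-in for sending a message over the interconnect. */
   static void send_invalidate(int node) {
       printf("invalidate -> node %d\n", node);
   }

   /* On a write request, invalidate only the current sharers
    * (point-to-point), then record the requester as owner. */
   static void handle_write(dir_entry_t *e, int requester) {
       for (int i = 0; i < 64; i++)
           if (((e->sharers >> i) & 1) && i != requester)
               send_invalidate(i);
       e->sharers = 1ULL << requester;
       e->owner   = requester;
       e->state   = MODIFIED;
   }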
