
DIRECTORY COHERENCE Mahdi Nazm Bojnordi Assistant Professor School - PowerPoint PPT Presentation



  1. Directory Coherence. Mahdi Nazm Bojnordi, Assistant Professor, School of Computing, University of Utah. CS/ECE 7810: Advanced Computer Architecture

  2. Overview
     - Upcoming deadline
       - Tonight: project proposal
     - This lecture
       - Snooping wrap-up
       - Directory coherence
       - Implementation challenges
       - Token-based coherence protocol

  3. Recall: Cache Coherence
     - Definition of coherence
       - Write propagation
         - Writes are visible to other processors
       - Write serialization
         - All writes to the same location are seen in the same order by all processors
     (Figure: processors P1 and P2 and a location A.)

  4. Implementation Challenges
     - MSI implementation
       - Stable states
     [Vantrease’11]

  5. Implementation Challenges
     - MSI implementation
       - Stable states
       - Busy states
     [Vantrease’11]

  6. Implementation Challenges
     - MSI implementation
       - Stable states
       - Busy states
       - Races: unexpected events from concurrent requests to the same block
     [Vantrease’11]

  7. Cache Coherence Complexity
     - A broadcast snooping bus (L2 MOETSI)
     [Lepak’03]

  8. Implementation Tradeoffs
     - Reduce unnecessary invalidates and transfers of blocks
       - Optimize the protocol with more states and prediction mechanisms
     - Adding more states and optimizations
       - Makes the protocol difficult to design and verify
         - Leads to more cases to take care of
         - Introduces race conditions
       - The gained benefit may be less than the cost (diminishing returns)

  9. Coherence Cache Miss
     - Recall: cache miss classification
       - Cold (compulsory): first access to a block
       - Capacity: due to limited cache capacity
       - Conflict: many blocks are mapped to the same set
     - New class: misses due to sharing
       - True vs. false sharing
     (Figure: a cache block containing words A and B.)
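The true vs. false sharing distinction above can be sketched in a few lines. This is an illustrative sketch only: the 64-byte block size and the names `BLOCK_SIZE`, `block_of`, and `sharing_kind` are assumptions, not taken from the slides, and each address is treated as identifying one word.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)


def block_of(addr: int) -> int:
    """Return the index of the cache block that contains addr."""
    return addr // BLOCK_SIZE


def sharing_kind(write_addr: int, read_addr: int) -> str:
    """Classify the miss a reader takes after another core writes.

    The write invalidates the whole block in the reader's cache, so the
    reader misses whenever both addresses fall in the same block.
    """
    if block_of(write_addr) != block_of(read_addr):
        return "no sharing"      # different blocks: no coherence miss
    if write_addr == read_addr:
        return "true sharing"    # the reader needs the written word
    return "false sharing"       # same block, different word


print(sharing_kind(0x100, 0x100))
print(sharing_kind(0x100, 0x104))
print(sharing_kind(0x100, 0x140))
```

In the false sharing case the reader never uses the written word, yet it still pays for the invalidation because coherence is tracked per block.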

  10. Summary of Snooping Protocols
     - Advantages
       - Short miss latency
       - Shared bus provides a global point of serialization
       - Simple implementation based on buses in uniprocessors
     - Disadvantages
       - Must broadcast messages to preserve the order
       - The global point of serialization is not scalable
         - It needs a virtual bus (or a totally ordered interconnect)

  11. Scalable Coherence Protocols
     - Problem: a shared interconnect is not scalable
     - Solution: make explicit requests for blocks
     - Directory-based coherence: every cache block has additional information
       - To keep track of copies of cached blocks and their states
       - To track ownership of each block
       - To coordinate invalidation appropriately

  12. Directory Information
     - P+1 additional bits for every cache block
       - One presence bit per cache to indicate whether the block is in that cache
       - One exclusive bit to indicate that a cache has the only copy (it can update without notifying others)
     - On a read, set the cache’s bit and arrange the supply of data
     - On a write, invalidate all caches that have the block and reset their bits
     (Figure: directory bits for a cache block with P=4, plus the exclusive bit E.)
     How to organize directory information?
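A minimal sketch of the P+1 bits described above, assuming a bitmask for the presence bits. The class and method names (`DirectoryEntry`, `on_read`, `on_write`, `sharers`) are hypothetical, and downgrading a previous exclusive holder on a read is an assumption about how "arrange the supply of data" might be handled; data movement itself is not modeled.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Directory state for one block: P presence bits plus one exclusive bit."""
    num_caches: int
    presence: int = 0        # bit i set => cache i holds the block
    exclusive: bool = False  # True => the single holder may write silently

    def sharers(self) -> list:
        """Return the ids of all caches whose presence bit is set."""
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]

    def on_read(self, cache_id: int) -> list:
        """Set the reader's bit; return caches that must drop exclusivity."""
        downgrade = []
        if self.exclusive and self.sharers() != [cache_id]:
            downgrade = self.sharers()  # the old exclusive holder (assumed)
            self.exclusive = False
        self.presence |= 1 << cache_id
        return downgrade

    def on_write(self, cache_id: int) -> list:
        """Return caches to invalidate; leave only the writer's bit set."""
        to_invalidate = [c for c in self.sharers() if c != cache_id]
        self.presence = 1 << cache_id
        self.exclusive = True
        return to_invalidate


entry = DirectoryEntry(num_caches=4)
entry.on_read(0)
entry.on_read(2)
print(entry.on_write(1))  # caches that must be invalidated
```

Only the caches returned by `on_write` receive an invalidation, which is how the directory avoids the broadcast that snooping needs.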

  13. Directory Organization
     - Example: central directory for P processors
       - For each cache block in memory
         - P presence bits, 1 dirty bit
       - For each cache block in a cache
         - 1 valid bit and 1 dirty (owner) bit
     (Figure: processors with caches, each block holding 1 valid and 1 dirty (exclusive) bit, connected through an interconnection network to memory and a directory holding presence bits and a dirty bit.)
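The storage cost of this organization follows directly from the bit counts: each memory block carries P presence bits plus one dirty bit. The sketch below works the arithmetic; the function name and the 64-byte block size are assumptions chosen for illustration.

```python
def directory_overhead(num_processors: int, block_bytes: int) -> float:
    """Fraction of extra storage a full bit-vector directory adds per block."""
    bits_per_entry = num_processors + 1   # P presence bits + 1 dirty bit
    bits_per_block = block_bytes * 8      # data bits in one memory block
    return bits_per_entry / bits_per_block


for p in (4, 64, 256):
    print(p, f"{directory_overhead(p, 64):.1%}")
```

With 64-byte blocks the overhead is about 1% for 4 processors, about 12.7% for 64, and just over 50% for 256, so the directory entry grows linearly with the processor count.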

  14. Directory Protocol
     - Three states (similar to the snoopy protocol)
       - Shared: one or more processors have the data; memory is up to date
       - Uncached: no processor has it; not valid in any cache
       - Exclusive: one processor has the data; memory is out of date
     - Basic terminology
       - Local node: where a request originates
       - Home node: where the memory location of an address resides
       - Remote node: has a copy of a cache block, whether exclusive or shared

  15. Read Request
     - P0 reads a cache location
     (Figure: nodes P0, Home, and P1; messages 1. Read, 2. DatEx (DatShr).)
     [Culler/Singh]

  16. ReadEx Request
     - Avoid a round trip to home by sending data directly from the owner
     (Figure: nodes P0, Home, and Owner; messages 1. RdEx, 2. Invl, 3a. Rev, 3b. DatEx.)
     [Culler/Singh]

  17. Write Contention
     - NACKing mechanism
     (Figure: nodes P0, Home, and P1; messages 1a. RdEx, 1b. RdEx, 2a. DatEx, 2b. NACK, 3. RdEx, 4. Invl, 5a. Rev, 5b. DatEx.)
     What are the challenges?
     [Culler/Singh]

  18. Design Challenges
     - Fairness: which requester is preferred on a conflict?
       - Consider distance and delivery order of the interconnect
     - Race condition: how to keep the proper sequence
       - NACK requests to busy blocks (pending invalidate)
         - The original requester retries
       - Queue requests and grant them in sequence
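The NACK-and-retry option above can be sketched as a toy model of a home node that refuses requests while a block is busy. Everything here is a simplification under stated assumptions: `HomeNode`, `request_exclusive`, `complete`, and `simulate` are hypothetical names, messages are delivered in list order, and each round lets exactly one transaction finish.

```python
class HomeNode:
    """Home node for a single block; NACKs requests while a grant is pending."""

    def __init__(self):
        self.owner = None
        self.pending = None
        self.busy = False

    def request_exclusive(self, requester: str) -> str:
        if self.busy:
            return "NACK"          # block is busy: the requester must retry
        self.busy = True
        self.pending = requester
        return "GRANT"

    def complete(self) -> None:
        """Finish the pending transaction and leave the busy state."""
        self.owner = self.pending
        self.pending = None
        self.busy = False


def simulate(requesters):
    """Deliver concurrent RdEx requests; NACKed requesters retry next round."""
    home, log, waiting = HomeNode(), [], list(requesters)
    while waiting:
        retry = []
        for r in waiting:
            reply = home.request_exclusive(r)
            log.append((r, reply))
            if reply == "NACK":
                retry.append(r)
        home.complete()            # the granted transaction finishes
        waiting = retry
    return log, home.owner


log, owner = simulate(["P0", "P1"])
print(log, owner)
```

In this model the winner is simply whichever request reaches the home node first, which is the fairness concern the slide raises: a requester can be NACKed repeatedly if others keep arriving ahead of it. Queuing requests at the home node trades that risk for extra buffering.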

  19. Summary of Directory Protocols
     - Advantages
       - Does not require broadcast to all caches
       - Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)
     - Disadvantages
       - Adds indirection to miss latency (critical path)
         - request → directory → memory
       - Requires extra storage space to track directory states
       - Protocols and race conditions are more complex

  20. Avoid Indirection
     - Can we get the best of both snooping and directory protocols?
       - Direct cache-to-cache misses (broadcast is OK)
       - What if an unordered interconnect (e.g., a mesh) was used?
     (Figure: numbered message hops among processors P and memory M, comparing a directory protocol with a hybrid protocol.)

  21. An Example Problem
     (Figure: P0 and P1 hold no copy; P2 holds a read/write copy. Labeled messages: request to write (delayed in interconnect), Ack, request to read.)
     - P0 issues a request to write (delayed to P2)
     - P1 issues a request to read

  22. An Example Problem
     (Figure: P1 and P2 now hold read-only copies; P0 still has no copy.)
     - P2 responds with data to P1

  23. An Example Problem
     (Figure: P1 and P2 hold read-only copies; P0 still has no copy.)
     - P0’s delayed request arrives at P2

  24. An Example Problem
     (Figure: P0 moves to read/write, P1 stays read-only, P2 drops to no copy.)
     - P2 responds to P0

  25. An Example Problem
     (Figure: P0 holds a read/write copy while P1 holds a read-only copy; P2 has no copy.)
     - Problem: P0 and P1 are in inconsistent states
     - Locally “correct” operation, globally inconsistent

  26. Token Coherence
     (Figure: P0 has T=0, P1 has T=0, P2 holds the maximum number of tokens, T=16 (R/W). Labeled messages: request to write (delayed), request to read.)
     - P0 issues a request to write (delayed to P2)
     - P1 issues a request to read
     [Martin’03]

  27. Token Coherence
     (Figure: P2 sends one token, T=1; P1 now has T=1 (R) and P2 has T=15 (R); P0 still has T=0.)
     - P2 responds with data to P1
     [Martin’03]

  28. Token Coherence
     (Figure: P0 has T=0, P1 has T=1 (R), P2 has T=15 (R).)
     - P0’s delayed request arrives at P2
     [Martin’03]

  29. Token Coherence
     (Figure: P2 sends T=15; P0 now has T=15 (R), P1 has T=1 (R), P2 has T=0.)
     - P2 responds to P0
     [Martin’03]

  30. Token Coherence
     (Figure: P0 has T=15 (R), P1 has T=1 (R), P2 has T=0.)
     Now what? (P0 wants all tokens)
     [Martin’03]

  31. Token Coherence
     (Figure: P0 has T=15 (R), P1 has T=1 (R), P2 has T=0. P0 times out; P1 sends T=1.)
     - P0 reissues its request
     - P1 responds with a token
     [Martin’03]

  32. Token Coherence
     (Figure: P0 has T=16 (R/W); P1 and P2 have T=0.)
     - P0’s request completed
     One final issue: What about starvation?
     [Martin’03]
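The token-counting walk-through on the preceding slides can be replayed in a short sketch. The rule it encodes is inferred from the slides: a node may read with at least one token and may write only with all 16. The names `Node`, `on_read_request`, `on_write_request`, and `replay_slide_example` are hypothetical, the timeout is modeled as a plain second call rather than a timer, and starvation avoidance is not modeled.

```python
TOTAL_TOKENS = 16  # maximum tokens per block, as in the slide example


class Node:
    def __init__(self, name: str, tokens: int = 0):
        self.name = name
        self.tokens = tokens

    def can_read(self) -> bool:
        return self.tokens >= 1

    def can_write(self) -> bool:
        return self.tokens == TOTAL_TOKENS


def transfer(src: Node, dst: Node, count: int) -> None:
    """Move tokens between nodes; tokens are never created or destroyed."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot send tokens the sender does not hold")
    src.tokens -= count
    dst.tokens += count


def on_read_request(holder: Node, requester: Node) -> None:
    """Holder gives one token but keeps one so it can still read (assumed)."""
    if holder.tokens > 1:
        transfer(holder, requester, 1)


def on_write_request(holder: Node, requester: Node) -> None:
    """Holder gives up every token it has."""
    transfer(holder, requester, holder.tokens)


def replay_slide_example():
    p0, p1, p2 = Node("P0"), Node("P1"), Node("P2", TOTAL_TOKENS)
    snapshots = []
    on_read_request(p2, p1)    # P1's read reaches P2 first
    snapshots.append((p0.tokens, p1.tokens, p2.tokens))
    on_write_request(p2, p0)   # P0's delayed write finally reaches P2
    snapshots.append((p0.tokens, p1.tokens, p2.tokens))
    # P0 holds 15 tokens: it can read but not write, so it reissues.
    on_write_request(p1, p0)   # P1 answers the reissued request
    snapshots.append((p0.tokens, p1.tokens, p2.tokens))
    return snapshots, p0.can_write()


print(replay_slide_example())
```

Because the token count per block is conserved, a writer and a reader can never coexist: holding all 16 tokens implies every other node holds none, regardless of message ordering in the interconnect.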
