parallel memory architecture

PARALLEL MEMORY ARCHITECTURE Mahdi Nazm Bojnordi Assistant - PowerPoint PPT Presentation



  1. PARALLEL MEMORY ARCHITECTURE
     Mahdi Nazm Bojnordi
     Assistant Professor, School of Computing, University of Utah
     CS/ECE 6810: Computer Architecture

  2. Overview
     ¨ Announcement
        ¤ Homework 6 is due tonight
           § The last one!
     ¨ This lecture
        ¤ Communication in multiprocessors
        ¤ Parallel memory architecture
        ¤ Cache coherence protocol

  3. Example Code I
     ¨ A sequential application runs as a single thread
     Kernel Function:
         void kern (int start, int end) {
             int i;
             for(i=start; i<=end; ++i) {
                 A[i] = A[i] * A[i] + 5;
             }
         }

         main() {
             …
             kern ( 1 , n );
             …
         }
     [Figure: one processor running a single thread over array A (elements 1 … n) in memory]

  4. Example Code I
     ¨ Two threads operating on separate partitions
     Kernel Function:
         void kern (int start, int end) {
             int i;
             for(i=start; i<=end; ++i) {
                 A[i] = A[i] * A[i] + 5;
             }
         }

         main() {
             …
             kern ( 1 , n/2 );
             …
             kern ( n/2+1 , n );
         }
     [Figure: two processors running Thread 0 and Thread 1 over array A (elements 1 … n) in memory]
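The two-thread partitioning on this slide could be sketched with POSIX threads as below. The slides do not say which threading API the course uses, so pthreads is an assumption, and the names `range_t`, `kern_thread`, and `square_plus_five_parallel` are hypothetical. The array is indexed 1..n as on the slide, so element 0 is unused.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper type: one partition of the shared array. */
typedef struct {
    int *a;     /* shared array, indexed 1..n as on the slide */
    int start;  /* first index of this partition (inclusive) */
    int end;    /* last index of this partition (inclusive) */
} range_t;

/* Thread body: the slide's kern() applied to one partition. */
static void *kern_thread(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->start; i <= r->end; ++i) {
        r->a[i] = r->a[i] * r->a[i] + 5;
    }
    return NULL;
}

/* Runs kern(1, n/2) and kern(n/2+1, n) on two threads.
   a must have room for indices 0..n.
   Returns 0 on success, -1 if a thread could not be created. */
int square_plus_five_parallel(int *a, int n) {
    range_t lo = { a, 1, n / 2 };
    range_t hi = { a, n / 2 + 1, n };
    pthread_t t0, t1;

    if (pthread_create(&t0, NULL, kern_thread, &lo) != 0) return -1;
    if (pthread_create(&t1, NULL, kern_thread, &hi) != 0) {
        pthread_join(t0, NULL);
        return -1;
    }
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```

Because the two partitions do not overlap, no element is written by both threads, so this kernel needs no locking.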

  5. Performance of Parallel Processing
     ¨ Recall: Amdahl’s law for theoretical speedup
        ¤ Overall speedup is limited by the fraction of the program that can be executed in parallel
        $\text{speedup} = \dfrac{1}{f + \dfrac{1 - f}{N}}$, where f is the sequential fraction and N is the number of processors
     [Figure: Speedup vs. Sequential Fraction; x-axis: Number of Processors (0 to 150); y-axis: Speedup (0 to 10); curves for sequential fractions of 10%, 20%, 40%, 60%, and 90%, annotated 10x, 5x, ~2x, ~2x, and ~1x]
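As a small worked form of Amdahl's law, speedup = 1 / (f + (1 - f) / N), the function below evaluates the formula for a given sequential fraction f and processor count N. The name `amdahl_speedup` and its parameter names are hypothetical.

```c
/* Amdahl's law.
   f       : sequential fraction of the program, between 0 and 1
   n_procs : number of processors, at least 1
   amdahl_speedup is an illustrative name, not from the slides. */
double amdahl_speedup(double f, double n_procs) {
    return 1.0 / (f + (1.0 - f) / n_procs);
}
```

As the processor count grows, the result approaches 1 / f, which is why a 10% sequential fraction caps the speedup near 10x and a 20% fraction caps it near 5x on the plot.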

  6. Example Code II
     ¨ A single location is updated every time
     Kernel Function:
         void kern (int start, int end) {
             int i;
             for(i=start; i<=end; ++i) {
                 sum = sum * A[i];
             }
         }

         main() {
             …
             kern ( 1 , n );
             …
         }
     [Figure: one processor running Thread 0 over array A (elements 1 … n) and the variable sum in memory]

  7. Example Code II
     ¨ Two threads operating on separate partitions
     Kernel Function:
         void kern (int start, int end) {
             int i;
             for(i=start; i<=end; ++i) {
                 sum = sum * A[i];
             }
         }

         main() {
             …
             kern ( 1 , n/2 );
             …
             kern ( n/2+1 , n );
         }
     [Figure: two processors running Thread 0 and Thread 1 over array A (elements 1 … n) and the shared variable sum in memory]
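In this version both threads read and write the same `sum`, so unsynchronized updates can be lost. One common remedy, sketched below, is to give each thread a private partial result and combine the two after both threads finish. This is not presented as the lecture's solution. It assumes POSIX threads, keeps the slide's `*` update, ignores overflow, and uses the hypothetical names `part_t`, `kern_partial`, and `accumulate_two_threads`.

```c
#include <pthread.h>

/* Hypothetical helper type: one partition plus its private result. */
typedef struct {
    const int *a;   /* shared array, indexed 1..n as on the slide */
    int start;      /* first index (inclusive) */
    int end;        /* last index (inclusive) */
    long partial;   /* written by exactly one thread, so no data race */
} part_t;

/* Thread body: accumulate into a local variable, publish it once. */
static void *kern_partial(void *arg) {
    part_t *p = (part_t *)arg;
    long local = 1;  /* identity element for the '*' update */
    for (int i = p->start; i <= p->end; ++i) {
        local = local * p->a[i];
    }
    p->partial = local;
    return NULL;
}

/* Combines the initial value of sum with both partial results.
   If a thread cannot be created, that partition runs on the caller. */
long accumulate_two_threads(const int *a, int n, long sum) {
    part_t lo = { a, 1, n / 2, 1 };
    part_t hi = { a, n / 2 + 1, n, 1 };
    pthread_t t0, t1;

    int ok0 = pthread_create(&t0, NULL, kern_partial, &lo) == 0;
    int ok1 = pthread_create(&t1, NULL, kern_partial, &hi) == 0;
    if (ok0) pthread_join(t0, NULL); else kern_partial(&lo);
    if (ok1) pthread_join(t1, NULL); else kern_partial(&hi);

    return sum * lo.partial * hi.partial;
}
```

Splitting the work this way is valid only because the update operator is associative and commutative. The threads communicate just once, when the caller reads the two partial results after joining.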

  8. Communication in Multiprocessors
     ¨ How do multiple processor cores communicate?
     Shared Memory
        § Multiple threads employ shared memory
        § Easy for programmers (loads and stores)
     Message Passing
        § Explicit communication through interconnection network
        § Simple hardware
     [Figure: cores 1 … N attached to one shared memory; cores 1 … N, each with its own memory, attached to an interconnection network]

  9. Shared Memory Architectures
     Uniform Memory Access
        ¨ Equal latency for all processors
        ¨ Simple software control
     Non-Uniform Memory Access
        ¨ Access latency is proportional to proximity
           ¤ Fast local accesses
     [Figure: Example UMA: cores 1 … 4 sharing one memory. Example NUMA: cores 1 … 4, each with a local memory and a router]

  10. Network Topologies
     Shared Network
        ¨ Low latency
        ¨ Low bandwidth
        ¨ Simple control
           ¤ e.g., bus
     Point to Point Network
        ¨ High latency
        ¨ High bandwidth
        ¨ Complex control
           ¤ e.g., mesh, ring
     [Figure: a shared network connecting cores and memories; a point-to-point network of core/memory nodes 1 to 4 connected through routers]

  11. Challenges in Shared Memories
     ¨ Correctness of an application is influenced by
        ¤ Memory consistency
           § All memory instructions appear to execute in the program order
           § Known to the programmer
        ¤ Cache coherence
           § All processors see the same data for a particular memory address, as they would if there were no caches in the system
           § Invisible to the programmer

  12. Cache Coherence Problem
     ¨ Multiple copies of each cache block
        ¤ In main memory and caches
     ¨ Multiple copies can become inconsistent when writes happen
        ¤ Solution: propagate writes from one core to others
     [Figure: cores 1 … N, each with a private cache (Cache 1 … Cache N), above main memory]

  13. Scenario 1: Loading From Memory
     ¨ Variable A initially has value 0
     ¨ P1 stores value 1 into A
     ¨ P2 loads A from memory and sees old value 0
     [Figure: P1 and P2, each with a private cache, connected by a bus to memory holding A:0]

  14. Scenario 2: Loading From Cache
     ¨ P1 and P2 both have variable A (value 0) in their caches
     ¨ P1 stores value 1 into A
     ¨ P2 loads A from its cache and sees old value
     [Figure: P1 and P2, each with a private cache, connected by a bus to memory holding A:0]
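Scenario 2 can be reproduced with a toy software model of two private caches that have no coherence mechanism. This is a sketch, not a hardware description. It assumes a single variable A, write-through stores, and the hypothetical names `cache_line_t`, `toy_system_t`, `toy_load`, and `toy_store`.

```c
/* Toy model: one memory word (A) and two private, non-coherent caches. */
typedef struct {
    int valid;   /* 1 if this cache holds a copy of A */
    int value;   /* the cached copy of A */
} cache_line_t;

typedef struct {
    int memory_a;           /* A in main memory */
    cache_line_t cache[2];  /* cache[0] belongs to P1, cache[1] to P2 */
} toy_system_t;

/* Load A on the given CPU: a hit returns the cached copy, which may be
   stale; a miss fills the cache from memory first. */
int toy_load(toy_system_t *s, int cpu) {
    cache_line_t *c = &s->cache[cpu];
    if (!c->valid) {
        c->value = s->memory_a;
        c->valid = 1;
    }
    return c->value;
}

/* Store to A on the given CPU: updates the local copy and memory
   (write-through assumption), but never notifies the other cache. */
void toy_store(toy_system_t *s, int cpu, int v) {
    s->cache[cpu].value = v;
    s->cache[cpu].valid = 1;
    s->memory_a = v;
}
```

If both CPUs load A (value 0) and CPU 0 then stores 1, a later `toy_load` on CPU 1 hits in its own cache and still returns 0, even though memory already holds 1.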

  15. Cache Coherence
     ¨ The key operation is update/invalidate sent to all or a subset of the cores
        ¤ Software-based management
           § Flush: write all of the dirty blocks to memory
           § Invalidate: make all of the cache blocks invalid
        ¤ Hardware-based management
           § Update or invalidate other copies on every write
           § Send data to everyone, or only the ones who have a copy
     ¨ Invalidation-based protocol is better. Why?

  16. Snoopy Protocol
     ¨ Relies on a broadcast infrastructure among caches
        ¤ For example, a shared bus
     ¨ Every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date
     [Figure: cores with private L1 caches above a last-level cache (LLC) and memory]

  17. Simple Snooping Protocol
     ¨ Relies on write-through, write no-allocate cache
     ¨ Multiple readers are allowed
        ¤ Writes invalidate replicas
     ¨ Employs a simple state machine for each cache unit
     [Figure: P1 and P2, each with a private cache, connected by a bus to memory holding A:0]

  18. Simple Snooping State Machine
     ¨ Every node updates its one-bit valid flag using a simple finite state machine (FSM)
     ¨ Processor actions
        ¤ Load, Store, Evict
     ¨ Bus traffic
        ¤ BusRd, BusWr
     [Figure: two-state FSM with states Valid and Invalid; edge labels (event/bus action): Load/--, Store/BusWr, Evict/--, BusWr/--, Load/BusRd, Store/BusWr; legend: transaction by local actions, transaction by bus traffic]
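The state machine could be written in C roughly as follows. The flattened diagram does not show which state each edge leaves from, so the assignment below is an assumption based on the usual valid/invalid protocol for a write-through, write no-allocate cache: Load/-- and Store/BusWr stay in Valid, Evict/-- and a snooped BusWr/-- move Valid to Invalid, Load/BusRd moves Invalid to Valid, and Store/BusWr stays in Invalid. All type and function names are hypothetical.

```c
typedef enum { INVALID = 0, VALID = 1 } line_state_t;     /* one-bit valid flag */
typedef enum { BUS_NONE, BUS_RD, BUS_WR } bus_msg_t;      /* bus traffic */
typedef enum { CPU_LOAD, CPU_STORE, CPU_EVICT } cpu_op_t; /* processor actions */

/* Local processor action on one cache line.
   Updates *state and returns the bus message this cache must issue. */
bus_msg_t on_cpu_op(line_state_t *state, cpu_op_t op) {
    switch (op) {
    case CPU_LOAD:
        if (*state == VALID) return BUS_NONE;  /* Valid:   Load/--    */
        *state = VALID;                        /* Invalid: Load/BusRd */
        return BUS_RD;
    case CPU_STORE:
        /* Store/BusWr in both states: write-through always sends the
           write, and write no-allocate leaves an Invalid line Invalid. */
        return BUS_WR;
    case CPU_EVICT:
        *state = INVALID;                      /* Valid:   Evict/--   */
        return BUS_NONE;
    }
    return BUS_NONE;
}

/* Bus traffic observed from another cache. */
void on_snoop(line_state_t *state, bus_msg_t msg) {
    if (msg == BUS_WR) {
        *state = INVALID;                      /* Valid:   BusWr/--   */
    }
}
```

With this sketch, a store by P1 returns `BUS_WR`; passing that message to `on_snoop` for P2's copy invalidates it, so P2's next load misses and issues `BUS_RD` instead of reading a stale value.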
