
NOW Handout Page 1 — CS258 S99 — Relationship between Perspectives (PDF document)



  1. Shared Memory Multiprocessors
CS 252, Spring 2005 — David E. Culler, Computer Science Division, U.C. Berkeley (3/1/05)

Meta-message today
• A powerful high-level abstraction boils down to specific, simple low-level mechanisms
  – each detail has significant implications
• Topic: THE MEMORY ABSTRACTION
  – a sequence of reads and writes
  – each read returns the "last value written" to the address
• ILP -> TLP

A take on Moore's Law
• Bit-level parallelism -> instruction-level parallelism -> thread-level parallelism (?)
• [Figure: transistor count per chip, 1970–2005, growing from ~1,000 to ~100,000,000 across the i4004, i8008, i8080, i8086, i80286, i80386, Pentium, and MIPS R2000/R3000/R10000]
• ILP has been extended with MPs for TLP since the 60s; now it is more critical and more attractive than ever with CMPs
• ... and it is all about memory systems!

Uniprocessor View
• Performance depends heavily on the memory hierarchy
• Managed by hardware
• Time spent by a program
  – Time_prog(1) = Busy(1) + Data Access(1)
  – divide by the instruction count and cycle time to get the CPI equation
• Data access time can be reduced by:
  – optimizing the machine: bigger caches, lower latency, ...
  – optimizing the program: temporal and spatial locality
• [Figure: execution-time bar split into Busy-useful and Data-local components]

Same Processor-Centric Perspective
• [Figure: time breakdown for (a) sequential execution and (b) parallel execution with four processors, split into Busy-useful, Busy-overhead, Data-local, Data-remote, and Synchronization components]

What is a Multiprocessor?
• A collection of communicating processors
  – goals: balance load, reduce inherent communication and extra work
• A multi-cache, multi-memory system
  – the role of these components is essential regardless of programming model
  – the programming model and communication abstraction affect the specific performance tradeoffs
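The uniprocessor time breakdown above leads directly to the classic CPI equation: busy cycles per instruction plus average memory-stall cycles per instruction. A minimal sketch, with made-up miss rate and penalty numbers purely for illustration:

```python
def cpi(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """CPI = busy CPI + average memory-stall cycles per instruction."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

# Hypothetical machine: CPI of 1.0 when every access hits,
# 1.4 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty.
print(cpi(1.0, 1.4, 0.02, 100))  # 1.0 + 1.4 * 0.02 * 100 = 3.8
```

Both levers from the slide show up here: a bigger, lower-latency cache reduces `miss_rate` and `miss_penalty`, while better program locality also reduces `miss_rate`.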

  2. Relationship between Perspectives
• How the parallelization steps map to performance issues and processor time components:

  Parallelization step(s)                    Performance issue                   Processor time component
  Decomposition/assignment/orchestration     Load imbalance and synchronization  Synch wait
  Decomposition/assignment                   Extra work                          Busy-overhead
  Decomposition/assignment                   Inherent communication volume       Data-remote
  Orchestration                              Artifactual communication and       Data-local
                                             data locality
  Orchestration/mapping                      Communication structure

• Speedup is bounded by the time components on p processors:

  Speedup(p) < (Busy(1) + Data(1)) /
               (Busy-useful(p) + Data-local(p) + Synch(p) + Data-remote(p) + Busy-overhead(p))

Back to Basics
• Parallel Architecture = Computer Architecture + Communication Architecture
• Small-scale shared memory
  – extend the memory system to support multiple processors
  – good for multiprogramming throughput and parallel computing
  – allows fine-grain sharing of resources
• Naming & synchronization
  – communication is implicit in loads/stores of shared physical addresses
  – synchronization is performed by operations on shared addresses
• Latency & bandwidth
  – utilize the normal data migration within the storage hierarchy to avoid long-latency operations and to reduce bandwidth demand
  – the bus is an economical medium with a fundamental bandwidth limit
  => focus on eliminating unnecessary traffic

Natural Extensions of Memory System
• [Figure: three organizations — Shared Cache (processors through a switch to an interleaved first-level cache over interleaved main memory); Centralized Memory "Dance Hall" UMA (per-processor caches over an interconnection network to shared memory modules); Distributed Memory NUMA (per-processor cache and local memory, connected by an interconnection network)]

Bus-Based Symmetric Shared Memory
• [Figure: processors P1..Pn, each with a private cache, on a shared bus with main memory and I/O devices]
• Dominate the server market
  – building blocks for larger systems; arriving to the desktop
• Attractive as throughput servers and for parallel programs
  – fine-grain resource sharing
  – uniform access via loads/stores
  – automatic data movement and coherent replication in caches
  – cheap and powerful extension
• Normal uniprocessor mechanisms to access data
  – the key is extending the memory hierarchy to support multiple processors
• Now: Chip Multiprocessors
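To make the speedup bound concrete, here is a small sketch that evaluates it for made-up time components (all numbers are hypothetical, chosen only to show how the per-processor terms eat into the ideal speedup):

```python
def speedup_bound(busy1, data1, busy_useful_p, data_local_p,
                  synch_p, data_remote_p, busy_overhead_p):
    """Upper bound on speedup: sequential time divided by the
    per-processor time components on p processors."""
    parallel_time = (busy_useful_p + data_local_p + synch_p
                     + data_remote_p + busy_overhead_p)
    return (busy1 + data1) / parallel_time

# Sequential run: 60 s busy + 40 s data access = 100 s.
# On 4 processors the slowest processor spends:
#   15 s useful, 8 s local data, 4 s synch wait, 5 s remote data, 3 s overhead.
print(speedup_bound(60, 40, 15, 8, 4, 5, 3))  # 100 / 35 ≈ 2.86, well below 4
```

Only 23 s of the 35 s is useful work or local data access; the remaining synch, remote, and overhead terms are exactly the components the table ties back to decomposition, assignment, and orchestration decisions.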
Caches are Critical for Performance
• Reduce average latency
  – automatic replication closer to the processor
• Reduce average bandwidth
• Data is logically transferred from producer to consumer to memory
  – store reg --> mem
  – load reg <-- mem
• Many processors can share data efficiently
• What happens when store & load are executed on different processors?

Example Cache Coherence Problem
• [Figure: P1, P2, P3 with private caches on a bus to memory holding u:5.
  Events: (1) P1 reads u, caching 5; (2) P3 reads u, caching 5;
  (3) P3 writes u = 7; (4) P1 reads u — gets ?; (5) P2 reads u — gets ?]
• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
  – processes accessing main memory may see a very stale value
• Unacceptable to programs, and frequent!
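The scenario above can be replayed with a toy model of private write-back caches that have no coherence mechanism at all; the class and the event sequence are illustrative, not from the slides:

```python
class WriteBackCache:
    """Toy private write-back cache with NO coherence support."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # addr -> cached value

    def load(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value             # dirty in cache; memory unchanged

memory = {"u": 5}
p1, p2, p3 = (WriteBackCache(memory) for _ in range(3))

p1.load("u")           # event 1: P1 reads u -> 5
p3.load("u")           # event 2: P3 reads u -> 5
p3.store("u", 7)       # event 3: P3 writes u = 7 (only in P3's cache)
print(p1.load("u"))    # event 4: P1 hits its own cache -> stale 5
print(p2.load("u"))    # event 5: P2 misses and reads memory -> stale 5
print(memory["u"])     # memory still holds 5 until P3 ever writes back
```

After event 3, P3 sees 7 while P1, P2, and main memory all still see 5 — exactly the divergence the slide calls unacceptable.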

  3. Caches and Cache Coherence
• Caches play a key role in all cases
  – reduce average data access time
  – reduce bandwidth demands placed on the shared interconnect
• Private processor caches create a problem
  – copies of a variable can be present in multiple caches
  – a write by one processor may not become visible to others
    » they'll keep accessing the stale value in their caches
  => the cache coherence problem
• What do we do about it?
  – organize the memory hierarchy to make it go away
  – detect and take actions to eliminate the problem

Shared Cache: Examples
• Alliant FX-8
  – early 80's
  – eight 68020s with a crossbar to a 512 KB interleaved cache
• Encore & Sequent
  – first 32-bit micros (NS32032)
  – two to a board with a shared cache
• Coming soon to microprocessors near you...
• [Figure: shared-cache organization — processors through a switch to an interleaved cache over interleaved main memory — alongside the transistor-count-per-chip chart, 1970–2005]

Advantages
• Cache placement identical to a single cache
  – only one copy of any cached block
• Fine-grain sharing
  – communication latency determined by the level in the storage hierarchy where the access paths meet
• Potential for positive interference
  – one processor prefetches data for another
• Smaller total storage
  – only one copy of code/data used by both processors
• Can share data within a line without "ping-pong"
  – long lines without false sharing
• CMP makes cache sharing attractive

Disadvantages
• Fundamental bandwidth limitation
• Increases latency of all accesses
  – crossbar
  – larger cache
  – L1 hit time determines the processor cycle time!!!
    » 2–10 cycles
    » Cray X-MP has shared registers!
• Potential for negative interference
  – one processor flushes data needed by another
• Many L2 caches are shared today

Intuitive Memory Model
• [Figure: one processor's hierarchy — L1 holds 100:67, L2 holds 100:35, memory and disk hold 100:34]
• Reading an address should return the last value written to that address
• Easy in uniprocessors
  – except for I/O
• The cache coherence problem in MPs is more pervasive and more performance critical

Snoopy Cache-Coherence Protocols
• [Figure: processors P1..Pn with caches (state, address, data) on a shared bus with memory and I/O devices; each cache controller snoops the bus]
• The bus is a broadcast medium & caches know what they have
• The cache controller "snoops" all transactions on the shared bus
  – a transaction is relevant if it is for a block the cache contains
  – take action to ensure coherence
    » invalidate, update, or supply the value
  – the action depends on the state of the block and the protocol
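The snooping idea can be sketched as a toy write-invalidate protocol: the bus broadcasts every write, and every other cache holding that block drops its copy, so the next load must refetch the fresh value. This is a minimal sketch (write-through for simplicity; class and method names are illustrative, not a real protocol specification):

```python
class Bus:
    """Toy broadcast bus: every cache observes every write."""
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)

class SnoopyCache:
    """Write-through cache with a write-invalidate snooping protocol."""
    def __init__(self, memory, bus):
        self.memory, self.bus = memory, bus
        self.lines = {}
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:               # miss: fetch current value
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value                # write-through to memory
        self.bus.broadcast_write(self, addr)     # others invalidate copies

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)               # relevant only if block cached

memory, bus = {"u": 5}, Bus()
p1, p2, p3 = (SnoopyCache(memory, bus) for _ in range(3))

p1.load("u"); p3.load("u")   # both cache u = 5
p3.store("u", 7)             # P3's write invalidates P1's copy, updates memory
print(p1.load("u"))          # miss, refetch -> 7 (no stale read)
print(p2.load("u"))          # 7
```

Replaying the earlier u:5 example under this protocol, every processor now sees 7 after the write; a real protocol would use write-back caches with per-block state (e.g. MSI/MESI) and the supply-value action the slide mentions, but the broadcast-and-snoop mechanism is the same.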
