Scalable Distributed Memory Multiprocessors
Outline
Scalability
• physical, bandwidth, latency and cost
• level of integration
Realizing Programming Models
• network transactions
• protocols
• safety
  – input buffer problem: N-1
  – fetch deadlock
Communication Architecture Design Space
• how much hardware interpretation of the network transaction?
Limited Scaling of a Bus
Characteristic              Bus
Physical Length             ~ 1 ft
Number of Connections       fixed
Maximum Bandwidth           fixed
Interface to Comm. medium   memory interface
Global Order                arbitration
Protection                  Virt -> physical
Trust                       total
OS                          single
comm. abstraction           HW
Bus: each level of the system design is grounded in the scaling limits at the layers below and in assumptions of close coupling between components.
Workstations in a LAN?
Characteristic              Bus                LAN
Physical Length             ~ 1 ft             km
Number of Connections       fixed              many
Maximum Bandwidth           fixed              ???
Interface to Comm. medium   memory interface   peripheral
Global Order                arbitration        ???
Protection                  Virt -> physical   OS
Trust                       total              none
OS                          single             independent
comm. abstraction           HW                 SW
No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve. Independent failure and restart.
Scalable Machines
What are the design trade-offs for the spectrum of machines between these two extremes (bus and LAN)?
• specialized or commodity nodes?
• capability of node-to-network interface
• supporting programming models?
What does scalability mean?
• avoids inherent design limits on resources
• bandwidth increases with P
• latency does not
• cost increases slowly with P
Bandwidth Scalability
[Figure: processor (P) and memory (M) modules connected through switches (S); typical switches: bus, crossbar, multiplexers]
What fundamentally limits bandwidth?
• single set of wires
Must have many independent wires
Connect modules through switches
Bus vs Network Switch?
Dancehall MP Organization
[Figure: memory modules (M) on one side of a scalable network of switches, processors (P) with caches ($) on the other]
Network bandwidth?
Bandwidth demand?
• independent processes?
• communicating processes?
Latency?
Generic Distributed Memory Org.
[Figure: nodes, each with processor (P), cache ($), memory (M) and communication assist (CA), attached to a scalable network of switches]
Network bandwidth?
Bandwidth demand?
• independent processes?
• communicating processes?
Latency?
Key Property
Large number of independent communication paths between nodes
=> allow a large number of concurrent transactions using different wires
initiated independently
no global arbitration
effect of a transaction only visible to the nodes involved
• effects propagated through additional transactions
Latency Scaling
T(n) = Overhead + Channel Time + Routing Delay
Overhead?
Channel Time(n) = n/B (B: bandwidth at the bottleneck)
RoutingDelay(h,n)
Typical example
max distance: log n
number of switches: proportional to n log n
overhead = 1 us, BW = 64 MB/s, 200 ns per hop
Pipelined
T_64(128) = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop = 4.2 us
T_1024(128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us
Store and Forward
T^sf_64(128) = 1.0 us + 6 hops * (2.0 + 0.2) us/hop = 14.2 us
T^sf_1024(128) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us
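The arithmetic in this example can be checked with a short sketch. It assumes, as the figures on the slide imply, a 128-byte message, a hop count of log2(n) for an n-node machine, and 1 MB = 10^6 bytes; the function names are illustrative only.

```python
import math

def channel_time_us(n_bytes, bw_mb_per_s):
    # bytes / (MB/s) gives microseconds when 1 MB = 10**6 bytes
    return n_bytes / bw_mb_per_s

def pipelined_us(n_bytes, nodes, overhead_us=1.0, bw_mb_per_s=64.0, hop_us=0.2):
    # Pipelined routing: channel time is paid once, routing delay once per hop
    hops = math.log2(nodes)
    return overhead_us + channel_time_us(n_bytes, bw_mb_per_s) + hops * hop_us

def store_and_forward_us(n_bytes, nodes, overhead_us=1.0, bw_mb_per_s=64.0, hop_us=0.2):
    # Store and forward: the whole message is received at every hop before moving on
    hops = math.log2(nodes)
    return overhead_us + hops * (channel_time_us(n_bytes, bw_mb_per_s) + hop_us)

for nodes in (64, 1024):
    print(nodes,
          round(pipelined_us(128, nodes), 1),
          round(store_and_forward_us(128, nodes), 1))
```

Under these assumptions the pipelined latency grows by only 0.8 us from 64 to 1024 nodes, while the store-and-forward latency grows by almost 9 us, because the 2.0 us channel time is multiplied by the hop count.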
Cost Scaling
cost(p,m) = fixed cost + incremental cost(p,m)
Bus Based SMP?
Ratio of processors : memory : network : I/O ?
Parallel efficiency(p) = Speedup(p) / p
Costup(p) = Cost(p) / Cost(1)
Cost-effective: speedup(p) > costup(p)
Is super-linear speedup
Cost Effective?
[Figure: Speedup = P/(1 + log P) and Costup = 1 + 0.1 P plotted against the number of processors, 0 to 2000]
2048 processors: 475 fold speedup at 206x cost
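The 2048-processor figures can be reproduced with the two formulas from the plot. The slide does not state the base of the logarithm; the sketch below assumes base 10, since that is the base that yields the quoted 475-fold speedup.

```python
import math

def speedup(p):
    # Speedup = P / (1 + log P); base 10 is an assumption, see the note above
    return p / (1 + math.log10(p))

def costup(p):
    # Costup = 1 + 0.1 P
    return 1 + 0.1 * p

def cost_effective(p):
    # Cost-effective when speedup(p) > costup(p)
    return speedup(p) > costup(p)

p = 2048
print(round(speedup(p)), round(costup(p)), cost_effective(p))
```

At 2048 processors the parallel efficiency, speedup(p) / p, is only about 23%, yet the machine is still cost-effective by the definition on the previous slide because the speedup exceeds the costup.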
Physical Scaling
Chip-level integration
Board-level
System level
nCUBE/2 Machine Organization
[Figure: basic module, a single-chip node with MMU, I-fetch & decode, operand $, execution unit (64-bit integer, IEEE floating point), router, DMA channels and DRAM interface; 1024 nodes in a hypercube network configuration]
Entire machine synchronous at 40 MHz
CM-5 Machine Organization
[Figure: diagnostics network, control network and data network connecting processing partitions, control processors and an I/O partition; each node has a SPARC processor with FPU, cache ($ ctrl, $ SRAM), a network interface (NI) to the data and control networks, and an MBUS to vector units with DRAM controllers and DRAM]
System Level Integration
[Figure: IBM SP-2 node: Power 2 CPU with L2 $ on a memory bus; memory controller with 4-way interleaved DRAM; MicroChannel bus with I/O, DMA and a NIC containing an i860, NI and DRAM; general interconnection network formed from 8-port switches]
Programming Models Realized by Protocols
[Figure: layers of the communication architecture, top to bottom]
Parallel applications: CAD, Database, Scientific modeling
Programming models: Multiprogramming, Shared address, Message passing, Data parallel
Compilation or library
Communication abstraction (User/system boundary)
Operating systems support
(Hardware/software boundary)
Communication hardware
Physical communication medium
Network Transactions
Network Transaction Primitive
[Figure: source node output buffer -> serialized msg over the communication network -> destination node input buffer]
one-way transfer of information from a source output buffer to a dest. input buffer
• causes some action at the destination
• occurrence is not directly visible at source
deposit data, state change, reply
Bus Transactions vs Net Transactions
Issue                             Bus          Network
protection check                  V->P         ??
format                            wires        flexible
output buffering                  reg, FIFO    ??
media arbitration                 global       local
destination naming and routing
input buffering                   limited      many source
action
completion detection
Shared Address Space Abstraction
[Figure: timeline of a remote load, r ← [Global address], between source and destination]
(1) Initiate memory access (source: Load)
(2) Address translation
(3) Local/remote check
(4) Request transaction (read request)
(5) Remote memory access (destination: memory access; source waits)
(6) Reply transaction (read response)
(7) Complete memory access
Fundamentally a two-way request/response protocol
• writes have an acknowledgement
Issues
• fixed or variable length (bulk) transfers
• remote virtual or physical address, where is action performed?
• deadlock avoidance and input buffer full
coherent? consistent?
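The seven steps of the remote load can be sketched as a request/response exchange. This is a minimal illustration, not a description of any real machine: the `Node` class, the message dictionaries, the block-partitioned address mapping in `translate`, and the direct method call standing in for the network are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    memory: dict = field(default_factory=dict)  # local offset -> value

    def handle(self, msg):
        # Destination side: service a read request and build the response
        if msg["type"] != "read_req":
            raise ValueError("unexpected transaction type")
        value = self.memory[msg["addr"]]                      # (5) remote memory access
        return {"type": "read_resp", "dst": msg["src"], "value": value}  # (6) reply

def translate(global_addr, words_per_node):
    # (2) hypothetical translation: global address -> (home node, local offset)
    return divmod(global_addr, words_per_node)

def load(src, nodes, global_addr, words_per_node=1024):
    # (1) initiate memory access
    home, offset = translate(global_addr, words_per_node)
    if home == src.node_id:                                   # (3) local/remote check
        return src.memory[offset]
    request = {"type": "read_req", "src": src.node_id, "addr": offset}  # (4) request
    response = nodes[home].handle(request)   # direct call stands in for the network
    return response["value"]                                  # (7) complete the access

nodes = [Node(0), Node(1)]
nodes[0].memory[3] = 7
nodes[1].memory[5] = 42
print(load(nodes[0], nodes, 3), load(nodes[0], nodes, 1024 + 5))
```

Note that the remote processor plays no part in `handle` beyond owning the memory: the request is serviced entirely from the source's specification of the address, which is the property the abstraction relies on.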
The Fetch Deadlock Problem
Even if a node cannot issue a request, it must sink network transactions.
Incoming transaction may be a request, which will generate a response.
Closed system (finite buffering)
Consistency
Writer: A=1; flag=1;
Reader: while (flag==0); print A;
[Figure (a): P1, P2, P3, each with its own memory, on an interconnection network; A:0 and flag:0->1 reside in different memories; 1: A=1 is delayed, 2: flag=1 arrives, 3: load A]
[Figure (b): the write of A from P1 takes a congested path]
write-atomicity violated without caching
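The scenario in the figure can be replayed with a small event simulation. The sketch assumes that A and flag live in different memories, that each write is a separate network transaction with its own delay, and that the reader issues its load of A as soon as it observes flag == 1; the delay values and the `run` function are illustrative.

```python
import heapq

def run(delay_write_a, delay_write_flag, delay_load=1):
    """Return the value of A that the spinning reader ends up printing."""
    memory = {"A": 0, "flag": 0}
    events = []
    # The writer issues both writes at time 0, in program order
    heapq.heappush(events, (delay_write_a, 0, "write", "A", 1))
    heapq.heappush(events, (delay_write_flag, 1, "write", "flag", 1))
    seq = 2          # unique sequence number breaks ties between equal times
    seen = None
    while events:
        t, _, kind, name, value = heapq.heappop(events)
        if kind == "write":
            memory[name] = value
            if name == "flag":
                # The reader leaves its spin loop and its load of A travels
                # to the memory that holds A
                heapq.heappush(events, (t + delay_load, seq, "load", "A", None))
                seq += 1
        else:
            seen = memory[name]
    return seen

print(run(delay_write_a=1, delay_write_flag=2))   # write of A arrives first
print(run(delay_write_a=10, delay_write_flag=1))  # write of A stuck on a congested path
```

With the congested path the reader sees flag == 1 and still loads the old value of A, even though no caches are involved, because nothing forces the first write to complete before the second is issued.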
Key Properties of SAS Abstraction
Source and destination data addresses are specified by the source of the request
• a degree of logical coupling and trust
No storage logically "outside the application address space(s)"
  – may employ temporary buffers for transport
Operations are fundamentally request/response
Remote operation can be performed on remote memory
• logically does not require intervention of the remote processor
Message passing
Bulk transfers
Complex synchronization semantics
• more complex protocols
• more complex action
Synchronous
• Send completes after matching recv and source data sent
• Receive completes after data transfer complete from matching send
Asynchronous
• Send completes after send buffer may be reused
Synchronous Message Passing
Source: Send P_dest, local VA, len
Destination: Recv P_src, local VA, len
(1) Initiate send
(2) Address translation on P_src
(3) Local/remote check
(4) Send-ready request (Send-rdy req)
(5) Remote check for posted receive (assume success); tag check, processor action?
(6) Reply transaction (Recv-rdy reply)
(7) Bulk data transfer, Source VA -> Dest VA or ID (Data-xfer req)
Constrained programming model. Deterministic! What happens when threads added?
Destination contention very limited.
User/System boundary?
Asynch. Message Passing: Optimistic
Source: Send (P_dest, local VA, len)
Destination: Recv P_src, local VA, len
(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send data (Data-xfer req)
(5) Remote check for posted receive (tag match); on fail, allocate data buffer
More powerful programming model
Wildcard receive => non-deterministic
Storage required within msg layer?
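Step (5) of the optimistic protocol, and the storage question it raises, can be sketched from the destination's point of view. The `Receiver` class below is hypothetical and simplified: one outstanding message per tag, no wildcard receives, and Python lists standing in for user and system buffers.

```python
class Receiver:
    def __init__(self):
        self.posted = {}      # tag -> user buffer supplied by a posted receive
        self.unexpected = {}  # tag -> data held in storage owned by the message layer

    def recv(self, tag, dest):
        """Post a receive; returns True if the data had already arrived."""
        if tag in self.unexpected:
            dest.extend(self.unexpected.pop(tag))  # copy out of the system buffer
            return True
        self.posted[tag] = dest
        return False

    def on_data_xfer(self, tag, data):
        """Handle an arriving data-transfer request (step 5)."""
        if tag in self.posted:
            self.posted.pop(tag).extend(data)      # tag match: deliver to user buffer
        else:
            self.unexpected[tag] = list(data)      # match failed: allocate a buffer

r = Receiver()
r.on_data_xfer(7, [1, 2, 3])   # data arrives before the receive is posted
buf = []
r.recv(7, buf)
print(buf)
```

Because the sender transmits without knowing whether a receive is posted, the `unexpected` table can grow with the number of senders and message sizes, which is the buffering cost the slide asks about.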
Asynch. Msg Passing: Conservative
Source: Send P_dest, local VA, len
Destination: Recv P_src, local VA, len
(1) Initiate send
(2) Address translation on P_dest
(3) Local/remote check
(4) Send-ready request (Send-rdy req); source returns and computes
(5) Remote check for posted receive (assume fail); tag check, record send-ready
(6) Receive-ready request (Recv-rdy req)
(7) Bulk data reply, Source VA -> Dest VA or ID (Data-xfer reply)
Where is the buffering?
Contention control?
Receiver initiated protocol?
Short message optimizations
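The conservative handshake can be sketched in the same style. The `Source` and `Destination` classes are hypothetical, direct method calls stand in for network transactions, and there is one outstanding message per tag; the point is only to show where the data sits while the receive is still unposted.

```python
class Source:
    def __init__(self):
        self.pending = {}  # tag -> data still held in the sender's own buffer

    def send(self, dest, tag, data):
        self.pending[tag] = data             # the data stays at the source
        dest.on_send_ready(self, tag)        # (4) send-ready request

    def on_recv_ready(self, dest, tag):
        dest.on_data(tag, self.pending.pop(tag))  # (7) bulk data reply

class Destination:
    def __init__(self):
        self.posted = {}      # tag -> user buffer from a posted receive
        self.send_ready = {}  # tag -> source that has announced a send

    def on_send_ready(self, src, tag):
        if tag in self.posted:
            src.on_recv_ready(self, tag)     # receive already posted: ask for data
        else:
            self.send_ready[tag] = src       # (5) record send-ready, buffer no data

    def recv(self, tag, buf):
        self.posted[tag] = buf
        if tag in self.send_ready:
            self.send_ready.pop(tag).on_recv_ready(self, tag)  # (6) recv-ready request

    def on_data(self, tag, data):
        self.posted.pop(tag).extend(data)    # data lands directly in the user buffer

s, d = Source(), Destination()
s.send(d, 1, [1, 2])
buf = []
d.recv(1, buf)
print(buf)
```

In this sketch the destination stores only a small send-ready record per message, and the bulk data remains in the sender's buffer until the receiver asks for it, which is one way to read the slide's "Where is the buffering?" and "Receiver initiated protocol?".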