SIMD Overview

- Early machines
  - Illiac IV (first SIMD)
  - Cray-1 (vector processor, not a SIMD)
- SIMDs in the 1980s and 1990s
  - Thinking Machines CM-2 (1980s)
- General characteristics
  - Host computer to interact with user and execute scalar instructions, control unit to send parallel instructions to PE array
  - 100s or 1000s of simple custom PEs, each with its own private memory
  - PEs connected by 2D torus, maybe also by row/column bus(es) or hypercube
  - Broadcast / reduction network

Illiac IV History

- First massively parallel (SIMD) computer
  - Sponsored by DARPA, built by various companies, assembled by Burroughs, under the direction of Daniel Slotnick at the University of Illinois
  - Plan was for 256 PEs, in 4 quadrants of 64 PEs, but only one quadrant was built
  - Used at NASA Ames Research Center in mid-1970s

Illiac IV Architectural Overview

- CU (control unit) + 64 PUs (processing units)
  - PU = 64-bit PE (processing element) + PEM (PE memory)
- CU operates on scalars, PEs operate on vector-aligned arrays (A[1] on PE 1, A[2] on PE 2, etc.)
- All PEs execute the instruction broadcast by the CU, if they are in active mode
- Each PE can perform various arithmetic and logical instructions on data in 64-bit, 32-bit, and 8-bit formats
- Each PEM contains 2048 64-bit words
- Data routed between PEs various ways
- I/O is handled by a separate Burroughs B6500 computer (stack architecture)

Illiac IV Routing and I/O

- Data routing
  - CU bus — instructions or data can be fetched from a PEM and sent to the CU
  - CDB (Common Data Bus) — broadcasts information from CU to all PEs
  - PE Routing network — 2D torus
- Laser memory
  - 1 Tb write-once read-only laser memory
  - Thin film of metal on a polyester sheet, on a rotating drum
- DFS (Disk File System)
  - 1 Gb, 128 heads (one per track)
- ARPA network link (50 Kbps)
  - Illiac IV was a network resource available to other members of the ARPA network
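The Illiac IV execution model described above (one instruction stream broadcast by the CU, executed in lockstep only by PEs whose mode bit is active, with array element A[i] held in PE i's private PEM) can be pictured with a small simulation. The C sketch below is only an illustration of that model, not Illiac IV code; NUM_PE, PEM_WORDS, active, pem, and broadcast_add are names invented for the example.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PE    64    /* one Illiac IV quadrant: 64 processing elements */
    #define PEM_WORDS 2048  /* each PEM holds 2048 64-bit words               */

    /* Illustrative model: each PE has a private memory (PEM) and a mode bit. */
    static int64_t pem[NUM_PE][PEM_WORDS]; /* pem[i] is PE i's private memory */
    static int     active[NUM_PE];         /* 1 = PE participates, 0 = masked */

    /* One broadcast instruction: every *active* PE adds the word at address
       'src' in its own PEM to the word at 'dst', in lockstep.  The real CU
       would send the operation to all PEs over the Common Data Bus. */
    static void broadcast_add(int dst, int src)
    {
        for (int pe = 0; pe < NUM_PE; pe++)      /* conceptually simultaneous */
            if (active[pe])
                pem[pe][dst] += pem[pe][src];
    }

    int main(void)
    {
        for (int pe = 0; pe < NUM_PE; pe++) {
            active[pe] = (pe % 2 == 0);          /* mask off odd-numbered PEs */
            pem[pe][0] = pe;                     /* A[pe] lives on PE pe      */
            pem[pe][1] = 100;                    /* B[pe] lives on PE pe      */
        }
        broadcast_add(0, 1);                     /* A = A + B on active PEs   */
        printf("PE0: %lld  PE1: %lld\n",
               (long long)pem[0][0], (long long)pem[1][0]);  /* 100 and 1 */
        return 0;
    }

After the broadcast add, the even-numbered (active) PEs hold updated values while the masked-off odd-numbered PEs are untouched, which mirrors the active-mode masking described in the architectural overview.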
Cray-1 History

- First famous vector (not SIMD) processor
  - In January 1978 there were only 12 non-Cray-1 vector processors worldwide: Illiac IV, TI ASC (7 installations), CDC STAR 100 (4 installations)
- Multiple functional units
  - 12 pipelined functional units in 4 groups: address, scalar, vector, and floating point
  - Scalar add = 3 cycles, vector add = 3 cycles, floating-point add = 6 cycles, floating-point multiply = 7 cycles, reciprocal approximation = 14 cycles

Cray-1 Vector Operations

- Vector arithmetic
  - 8 vector registers, each holding a 64-element vector (64 64-bit words)
  - Arithmetic and logical instructions operate on 3 vector registers
  - Vector C = vector A + vector B
  - Decode the instruction once, then pipeline the load, add, store operations
- Vector chaining
  - Use pipelining with data forwarding to bypass vector registers and send result of one functional unit to input of another
  - (A short code sketch of vector arithmetic and chaining follows the CM-2 overview below)

Cray-1 Physical Architecture

- Custom implementation
  - Register chips, memory chips, low-speed and high-speed gates
- Physical architecture
  - "Cylindrical tower (6.5′ tall, 4.5′ diameter) with 8.5′ diameter seat
  - Composed of 12 wedge-like columns in 270° arc, so a "reasonably trim individual" can get inside to work
  - World's most expensive love-seat"
  - "Love seat" hides power supplies and plumbing for Freon cooling system
- Freon cooling system
  - Vertical cooling bars line each wall, modules have a copper heat transfer plate that attaches to the cooling bars
  - Freon is pumped through a stainless steel tube inside an aluminum casing

Thinking Machines Corporation's Connection Machine CM-2

- Distributed-memory SIMD (bit-serial)
- Thinking Machines Corp. founded 1983
  - CM-1, 1986 (1000 MIPS, 4K processors)
  - CM-2, 1987 (2500 MFLOPS, 64K…)
- Programs run on one of 4 Front-End Processors, which issue instructions to the Parallel Processing Unit (PE array)
  - Control flow and scalar operations run on Front-End Processors, while parallel operations run on the PPU
- A 4x4 crossbar switch (Nexus) connects the 4 Front-Ends to 4 sections of the PPU
- Each PPU section is controlled by a Sequencer (control unit), which receives assembly language instructions and broadcasts micro-instructions to each processor in that PPU section
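Returning to the Cray-1 vector operations above, the following C sketch is a software analogy for a 64-element vector add and for chaining the adder into the multiplier. It is illustrative only: VLEN, vadd, and vadd_then_scale_chained are invented names, and real Cray-1 programs would use vector instructions rather than C loops.

    #include <stdio.h>

    #define VLEN 64   /* a Cray-1 vector register holds 64 64-bit words */

    /* "Vector C = vector A + vector B": the instruction is decoded once and
       the 64 element operations then stream through the pipelined adder. */
    static void vadd(double *c, const double *a, const double *b)
    {
        for (int i = 0; i < VLEN; i++)
            c[i] = a[i] + b[i];
    }

    /* Chaining: each sum is forwarded straight from the add unit into the
       multiply unit instead of waiting for the full result vector to be
       written back to a vector register. */
    static void vadd_then_scale_chained(double *d, const double *a,
                                        const double *b, double s)
    {
        for (int i = 0; i < VLEN; i++) {
            double sum = a[i] + b[i];   /* leaves the adder after its latency  */
            d[i] = sum * s;             /* and feeds the multiplier right away */
        }
    }

    int main(void)
    {
        double a[VLEN], b[VLEN], c[VLEN], d[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 1.0; }
        vadd(c, a, b);
        vadd_then_scale_chained(d, a, b, 2.0);
        printf("c[63]=%g  d[63]=%g\n", c[63], d[63]);   /* 64 and 128 */
        return 0;
    }

Loops longer than 64 iterations are handled in 64-element strips, one vector register's worth at a time.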
CM-2 Nodes / Processors

- CM-2 constructed of "nodes", each with:
  - 32 processors (implemented by 2 custom processor chips), 2 floating-point accelerator chips, and memory chips
- 2 processor chips (each 16 processors)
  - Contains ALU, flag registers, etc.
  - Contains NEWS interface, router interface, and I/O interface
  - 16 processors are connected in a 4x4 mesh to their N, E, W, and S neighbors
- 2 floating-point accelerator chips
  - First chip is interface, second is FP execution unit
- RAM memory
  - 64 Kbits, bit addressable

CM-2 Interconnect

- Broadcast and reduction network
  - Broadcast, Spread (scatter)
  - Reduction (e.g., bitwise OR, maximum, sum), Scan (e.g., collect cumulative results over sequence of processors, such as parallel prefix; see the scan sketch after the CM-5 overview below)
  - Sort elements
- NEWS grid can be used for nearest-neighbor communication
  - Communication in multiple dimensions: 256x256, 1024x64, 8x8192, 64x32x32, 16x16x16x16, 8x8x4x8x8x4
- The 16-processor chips are also linked by a 12-dimensional hypercube
  - Good for long-distance point-to-point communication

MIMD Overview

- MIMDs in the 1980s and 1990s
  - Distributed-memory multicomputers
    - Thinking Machines CM-5
    - IBM SP2
  - Distributed-memory multicomputers with hardware to look like shared-memory
    - nCUBE 3
  - NUMA shared-memory multiprocessors
    - Cray T3D
    - Silicon Graphics POWER & Origin
- General characteristics
  - 100s of powerful commercial RISC PEs
  - Wide variation in PE interconnect network
  - Broadcast / reduction / synch network

Thinking Machines CM-5 Overview

- Distributed-memory MIMD multicomputer
  - SIMD or MIMD operation
- Configurable with up to 16,384 processing nodes and 512 GB of memory
- Divided into partitions, each managed by a control processor
- Processing nodes use SPARC CPUs
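The scan (parallel prefix) operation listed for the CM-2 broadcast/reduction network leaves each processor holding the cumulative result over all processors up to and including itself. The C sketch below simulates an inclusive plus-scan over an array that stands in for the processor sequence, using a log-step schedule of the kind that maps naturally onto hypercube or NEWS communication; it is an illustration, not CM-2 (Paris or *Lisp) code, and P and plus_scan are invented names.

    #include <stdio.h>
    #include <string.h>

    #define P 16   /* illustrative number of processors in the scan */

    /* Inclusive plus-scan in log2(P) steps: at step d, position i adds in the
       value held by position i - d (d = 1, 2, 4, ...).  On a real SIMD machine
       each step is one round of neighbor communication; here it is simulated
       with a second array. */
    static void plus_scan(long val[P])
    {
        long next[P];
        for (int d = 1; d < P; d <<= 1) {
            for (int i = 0; i < P; i++)              /* conceptually parallel */
                next[i] = (i >= d) ? val[i] + val[i - d] : val[i];
            memcpy(val, next, sizeof next);
        }
    }

    int main(void)
    {
        long v[P];
        for (int i = 0; i < P; i++) v[i] = 1;        /* every processor holds 1 */
        plus_scan(v);
        for (int i = 0; i < P; i++) printf("%ld ", v[i]);  /* 1 2 3 ... 16 */
        printf("\n");
        return 0;
    }

With every input set to 1, the output is 1, 2, 3, ..., P: each position has counted how many processors precede or equal it, in log2(P) parallel steps.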
CM-5 Partitions / Control Processors

- Processing nodes may be divided into (communicating) partitions, and are supervised by a control processor
  - Control processor broadcasts blocks of instructions to the processing nodes
  - SIMD operation: control processor broadcasts instructions and nodes are closely synchronized
  - MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm
- Control processors in general
  - Schedule user tasks, allocate resources, service I/O requests, accounting, etc.
  - In a small system, one control processor may play a number of roles
  - In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

CM-5 Nodes and Interconnection

- Processing nodes
  - SPARC CPU (running at 22 MIPS)
  - 8-32 MB of memory
  - (Optional) 4 vector processing units
- Each control processor and processing node connects to two networks
  - Control Network — for operations that involve all nodes at once
    - Broadcast, reduction (including parallel prefix), barrier synchronization (see the MPI sketch after the SP2 slides below)
    - Optimized for fast response & low latency
  - Data Network — for bulk data transfers between specific source and destination
    - 4-ary hypertree
    - Provides point-to-point communication for tens of thousands of items simultaneously
    - Special cases for nearest neighbor
    - Optimized for high bandwidth

IBM SP2 Overview

- Distributed-memory MIMD multicomputer
  - Scalable POWERparallel 1 (SP1)
  - Scalable POWERparallel 2 (SP2)
- RS/6000 workstation plus 4–128 POWER2 processors
- POWER2 processors used in IBM's RS/6000 workstations, compatible with existing software

SP2 System Architecture

- RS/6000 as system console
- SP2 runs various combinations of serial, parallel, interactive, and batch jobs
  - Partition between types can be changed
  - High nodes — interactive nodes for code development and job submission
  - Thin nodes — compute nodes
  - Wide nodes — configured as servers, with extra memory, storage devices, etc.
- A system "frame" contains 16 thin processor nodes or 8 wide processor nodes
  - Includes redundant power supplies; nodes are hot swappable within a frame
  - Includes a high-performance switch for low-latency, high-bandwidth communication
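The Control Network operations listed for the CM-5 (broadcast, reduction including parallel prefix, and barrier synchronization), as well as the message passing used on machines like the SP2, correspond directly to the collective operations of present-day MPI. The sketch below is ordinary MPI C code rather than CM-5 (CMMD) or SP2-era code, and is shown only to make those operations concrete.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Broadcast: the "control processor" (rank 0) sends one value to all. */
        int block = 42;
        MPI_Bcast(&block, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Reduction: combine one value per node (here a sum) back at rank 0. */
        int local = rank, sum = 0;
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Parallel prefix: each node receives the sum over ranks 0..rank. */
        int prefix = 0;
        MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Barrier synchronization: no node proceeds until all have arrived. */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("nodes=%d  sum=%d\n", nprocs, sum);
        MPI_Finalize();
        return 0;
    }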