MIMD Overview

- MIMDs in the 1980s and 1990s
  - Distributed-memory multicomputers
    - Intel Paragon XP/S
    - Thinking Machines CM-5
    - IBM SP2
  - Distributed-memory multicomputers with hardware to make them look like shared-memory machines
    - nCUBE 3
    - Kendall Square Research KSR1
  - NUMA shared-memory multiprocessors
    - Cray T3D
    - Convex Exemplar SPP-1000
    - Silicon Graphics POWER & Origin
- General characteristics
  - 100s of powerful commercial RISC PEs
  - Wide variation in PE interconnect network
  - Broadcast / reduction / synchronization network

Intel Paragon XP/S Overview

- Distributed-memory MIMD multicomputer
- 2D array of nodes
- Main memory physically distributed among nodes (16–64 MB / node)
- Each node contains two Intel i860 XP processors: an application processor to run the user program, and a message processor for inter-node communication

XP/S Nodes and Interconnection

- Node composition
  - 16–64 MB of memory
  - Application processor: Intel i860 XP (42 MIPS, 50 MHz clock) to execute user programs
  - Message processor: Intel i860 XP; handles the details of sending / receiving a message between nodes, including protocols, packetization, etc.; supports broadcast, synchronization, and reduction (sum, min, and, or, etc.)
- 2D mesh interconnection between nodes (see the mesh-routing sketch after these slides)
  - Paragon Mesh Routing Chip (PMRC) / iMRC routes traffic in the mesh
  - 0.75 µm, triple-metal CMOS
  - Routes traffic in four directions and to and from the attached node at > 200 MB/s

XP/S Usage

- System OS is based on UNIX and provides distributed system services and full UNIX to every node
- System is divided into partitions: some for I/O, some for system services, and the rest for user applications
- Users have client/server access; they can submit jobs over a network or log in directly to any node
- System has a MIMD architecture, but supports various programming models: SPMD, SIMD, MIMD, shared memory, vector shared memory (see the SPMD sketch after these slides)
- Applications can run on an arbitrary number of nodes without change; run on more nodes for large data sets or to get higher performance
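The slides above only say that the PMRC routes traffic in four directions plus to and from the attached node; they do not state the routing policy. The minimal sketch below assumes dimension-order (X-then-Y) routing, a common choice for 2D meshes, purely to illustrate how a message would traverse the node array; route_xy and the example coordinates are illustrative, not Paragon specifics.

/*
 * Hedged sketch: hop-by-hop path of a message through a 2D mesh,
 * assuming dimension-order (X-then-Y) routing.  The slides only say
 * the PMRC routes in four directions plus to/from the attached node;
 * the routing policy shown here is an assumption for illustration.
 */
#include <stdio.h>

/* Print the sequence of (x, y) mesh coordinates visited by a message. */
static void route_xy(int sx, int sy, int dx, int dy)
{
    int x = sx, y = sy;
    printf("(%d,%d)", x, y);
    while (x != dx) {              /* travel along the X dimension first */
        x += (dx > x) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    while (y != dy) {              /* then travel along the Y dimension */
        y += (dy > y) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    printf("\n");
}

int main(void)
{
    route_xy(0, 0, 3, 2);          /* node (0,0) to node (3,2): 5 hops */
    return 0;
}

With a deterministic X-then-Y policy, the hop count between two nodes is simply the Manhattan distance between their mesh coordinates.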
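The slides list SPMD among the Paragon's programming models and say the message processor offloads broadcast, synchronization, and reduction. The Paragon's native NX message-passing library is not shown in these notes; as a stand-in, the sketch below uses standard MPI (an assumption, not something the slides mention) to show the same SPMD broadcast / partial-compute / reduce pattern.

/*
 * Hedged sketch: the Paragon's message processors offloaded broadcast,
 * synchronization, and reduction from the application processor.  This
 * uses standard MPI as a stand-in for the machine's native library,
 * purely to illustrate the SPMD pattern the slides describe.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Broadcast a problem size from node 0 to every node. */
    int n = (rank == 0) ? 1000000 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each node computes a partial sum over its share of the range. */
    double local = 0.0;
    for (int i = rank; i < n; i += nprocs)
        local += 1.0 / (double)(i + 1);

    /* Reduction (sum) back to node 0, then a barrier synchronization. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("partial harmonic sum H(%d) = %f on %d nodes\n", n, total, nprocs);

    MPI_Finalize();
    return 0;
}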
Thinking Machines CM-5 Overview

- Distributed-memory MIMD multicomputer
- SIMD or MIMD operation
- Configurable with up to 16,384 processing nodes and 512 GB of memory
- Divided into partitions, each managed by a control processor
- Processing nodes use SPARC CPUs

CM-5 Partitions / Control Processors

- Processing nodes may be divided into (communicating) partitions and are supervised by a control processor
- Control processor broadcasts blocks of instructions to the processing nodes
  - SIMD operation: the control processor broadcasts instructions and the nodes are closely synchronized
  - MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm
- Control processors in general
  - Schedule user tasks, allocate resources, service I/O requests, accounting, etc.
  - In a small system, one control processor may play a number of roles
  - In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

CM-5 Nodes and Interconnection

- Processing nodes
  - SPARC CPU (running at 22 MIPS)
  - 8–32 MB of memory
  - (Optional) 4 vector processing units
- Each control processor and processing node connects to two networks
  - Control Network — for operations that involve all nodes at once
    - Broadcast, reduction (including parallel prefix), barrier synchronization
    - Optimized for fast response & low latency
  - Data Network — for bulk data transfers between specific source and destination
    - 4-ary hypertree
    - Provides point-to-point communication for tens of thousands of items simultaneously
    - Special cases for nearest neighbor
    - Optimized for high bandwidth

Tree Networks (Reference Material)

- Binary tree
  - 2^k - 1 nodes arranged into a complete binary tree of depth k-1
  - Diameter is 2(k-1)
  - Bisection width is 1
- Hypertree
  - Low diameter of a binary tree plus improved bisection width
  - A hypertree of degree k and depth d
    - From the "front", looks like a k-ary tree of height d
    - From the "side", looks like an upside-down binary tree of height d
    - Join both views to get the complete network
- 4-ary hypertree of depth d (see the sketch after this slide)
  - 4^d leaves and 2^d(2^(d+1) - 1) nodes
  - Diameter is 2d
  - Bisection width is 2^(d+1)
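A minimal sketch that simply evaluates the figures quoted above: the complete binary tree with 2^k - 1 nodes, diameter 2(k-1), and bisection width 1, and the 4-ary hypertree of depth d with 4^d leaves, 2^d(2^(d+1) - 1) nodes, diameter 2d, and bisection width 2^(d+1). The parameters k = 5 and d = 3 are arbitrary example values, not CM-5 configuration numbers.

/*
 * Evaluate the tree-network metrics quoted on the slides for small
 * example parameters.
 */
#include <stdio.h>

static long pow2(int e) { long r = 1; while (e-- > 0) r *= 2; return r; }
static long pow4(int e) { long r = 1; while (e-- > 0) r *= 4; return r; }

int main(void)
{
    int k = 5;   /* binary tree parameter (example value) */
    int d = 3;   /* hypertree depth (example value)       */

    printf("binary tree, k = %d:\n", k);
    printf("  nodes           = %ld\n", pow2(k) - 1);      /* 2^k - 1        */
    printf("  diameter        = %d\n", 2 * (k - 1));       /* 2(k-1)         */
    printf("  bisection width = 1\n");

    printf("4-ary hypertree, depth d = %d:\n", d);
    printf("  leaves          = %ld\n", pow4(d));                      /* 4^d              */
    printf("  nodes           = %ld\n", pow2(d) * (pow2(d + 1) - 1));  /* 2^d (2^(d+1) - 1) */
    printf("  diameter        = %d\n", 2 * d);                         /* 2d               */
    printf("  bisection width = %ld\n", pow2(d + 1));                  /* 2^(d+1)          */

    return 0;
}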
IBM SP2 Overview

- Distributed-memory MIMD multicomputer
- Scalable POWERparallel 1 (SP1)
- Scalable POWERparallel 2 (SP2)
  - RS/6000 workstation plus 4–128 POWER2 processors
  - POWER2 processors are the same ones used in IBM's RS/6000 workstations, so the system is compatible with existing software

SP2 System Architecture

- RS/6000 as system console
- SP2 runs various combinations of serial, parallel, interactive, and batch jobs; the partition between job types can be changed
  - High nodes — interactive nodes for code development and job submission
  - Thin nodes — compute nodes
  - Wide nodes — configured as servers, with extra memory, storage devices, etc.
- A system "frame" contains 16 thin-processor nodes or 8 wide-processor nodes
  - Includes redundant power supplies; nodes are hot swappable within the frame
  - Includes a high-performance switch for low-latency, high-bandwidth communication

SP2 Processors and Interconnection

- POWER2 processor
  - RISC processor, load-store architecture, various versions from 20 to 62.5 MHz
  - Comprised of 8 semi-custom chips: Instruction Cache, 4 Data Cache, Fixed-Point Unit, Floating-Point Unit, and Storage Control Unit
- Interconnection network
  - Routing
    - Packet switched — each packet may take a different route
    - Cut-through — if the output is free, starts sending without buffering first
    - Wormhole routing — buffers on a subpacket basis if buffering is necessary
  - Multistage High Performance Switch (HPS) network, scalable via extra stages to keep the bandwidth to each processor constant
  - Guaranteed fairness of message delivery

nCUBE 3 Overview

- Distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)
  - If access is attempted to a virtual-memory page marked as "non-resident", the system generates messages to transfer that page to the local node
- nCUBE 3 could have 8–65,536 processors and up to 65 TB of memory
- Can be partitioned into "subcubes" (see the hypercube sketch after this slide)
- Multiple programming paradigms: SPMD, inter-subcube processing, client/server
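The nCUBE 3 slide says the machine can be partitioned into "subcubes" but gives no addressing details. The sketch below assumes the standard hypercube convention (node addresses are bit strings, neighbors differ in one bit, and a subcube is the set of nodes sharing some fixed address bits); the dimension, masks, and node numbers are illustrative only, not nCUBE 3 specifics.

/*
 * Hedged sketch: hypercube node numbering behind "subcube" partitioning.
 * The standard convention is assumed here for illustration; the slides
 * do not describe the nCUBE 3's actual addressing scheme.
 */
#include <stdio.h>

/* Each node in a d-dimensional hypercube has d neighbors, one per address bit. */
static void print_neighbors(int node, int d)
{
    printf("node %d neighbors:", node);
    for (int bit = 0; bit < d; bit++)
        printf(" %d", node ^ (1 << bit));   /* flip one address bit */
    printf("\n");
}

/* A subcube fixes some address bits; members agree on those bits. */
static int in_subcube(int node, int fixed_mask, int fixed_value)
{
    return (node & fixed_mask) == fixed_value;
}

int main(void)
{
    int d = 4;                        /* 16-node hypercube, example size */
    print_neighbors(5, d);            /* node 0101 -> 4, 7, 1, 13 */

    /* List the 4-node subcube whose two high address bits are fixed to "10". */
    printf("subcube 10xx:");
    for (int n = 0; n < (1 << d); n++)
        if (in_subcube(n, 0xC, 0x8))
            printf(" %d", n);         /* prints 8 9 10 11 */
    printf("\n");
    return 0;
}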