
SIMD+ Overview (Fall 2008)


SIMD+ Overview

- Early machines
  - Illiac IV (first SIMD)
  - Cray-1 (vector processor, not a SIMD)
- SIMDs in the 1980s and 1990s
  - Thinking Machines CM-2 (1980s)
  - CPP's DAP & Gamma II (1990s)
- General characteristics
  - Host computer to interact with user and execute scalar instructions, control unit to send parallel instructions to PE array
  - 100s or 1000s of simple custom PEs, each with its own private memory
  - PEs connected by 2D torus, maybe also by row/column bus(es) or hypercube
  - Broadcast / reduction network

Illiac IV History

- First massively parallel (SIMD) computer
  - Sponsored by DARPA, built by various companies, assembled by Burroughs, under the direction of Daniel Slotnick at the University of Illinois
  - Plan was for 256 PEs, in 4 quadrants of 64 PEs, but only one quadrant was built
  - Used at NASA Ames Research Center in mid-1970s

Illiac IV Architectural Overview

- CU (control unit) + 64 PUs (processing units)
  - PU = 64-bit PE (processing element) + PEM (PE memory)
- CU operates on scalars, PEs operate on vector-aligned arrays (A[1] on PE 1, A[2] on PE 2, etc.)
- All PEs execute the instruction broadcast by the CU, if they are in active mode (a short sketch of this execution model follows these four slides)
- Each PE can perform various arithmetic and logical instructions on data in 64-bit, 32-bit, and 8-bit formats
- Each PEM contains 2048 64-bit words
- Data routed between PEs various ways
- I/O is handled by a separate Burroughs B6500 computer (stack architecture)

Illiac IV Routing and I/O

- Data routing
  - CU bus: instructions or data can be fetched from a PEM and sent to the CU
  - CDB (Common Data Bus): broadcasts information from CU to all PEs
  - PE Routing network: 2D torus
- Laser memory
  - 1 Tb write-once read-only laser memory
  - Thin film of metal on a polyester sheet, on a rotating drum
- DFS (Disk File System)
  - 1 Gb, 128 heads (one per track)
- ARPA network link (50 Kbps)
  - Illiac IV was a network resource available to other members of the ARPA network
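The slides above describe the basic SIMD execution model: the control unit broadcasts one instruction, and every PE that is in active mode applies it to operands in its own PEM. The C program below is a minimal software sketch of that model under those assumptions, not Illiac IV code; the names pe_t and broadcast_add and the even/odd activity mask are invented for illustration, while the 64-PE quadrant size and 2048-word PEM come from the slides.

    #include <stdio.h>

    #define NUM_PE   64      /* one Illiac IV quadrant: 64 PEs          */
    #define PEM_SIZE 2048    /* each PEM held 2048 64-bit words         */

    /* Illustrative model of one processing element (PE):
       private memory (PEM) plus an active-mode flag.       */
    typedef struct {
        long mem[PEM_SIZE];
        int  active;
    } pe_t;

    static pe_t pe[NUM_PE];

    /* The control unit broadcasts one instruction, e.g.
       "mem[dst] = mem[a] + mem[b]"; every PE in active mode
       executes it on its own private memory.               */
    static void broadcast_add(int dst, int a, int b)
    {
        for (int i = 0; i < NUM_PE; i++)        /* conceptually simultaneous */
            if (pe[i].active)
                pe[i].mem[dst] = pe[i].mem[a] + pe[i].mem[b];
    }

    int main(void)
    {
        /* Vector-aligned layout: A[i] on PE i at address 0,
           B[i] on PE i at address 1, as in the slides.      */
        for (int i = 0; i < NUM_PE; i++) {
            pe[i].mem[0] = i;              /* A[i]                    */
            pe[i].mem[1] = 10 * i;         /* B[i]                    */
            pe[i].active = (i % 2 == 0);   /* mask out the odd PEs    */
        }

        broadcast_add(2, 0, 1);            /* C = A + B on active PEs */

        printf("PE 4: C = %ld (active), PE 5: C = %ld (masked off)\n",
               pe[4].mem[2], pe[5].mem[2]);
        return 0;
    }

Masking PEs in and out of active mode is how a SIMD machine handles data-dependent conditionals: both branches are broadcast, and each PE participates only in the branch its own data selects.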

Cray-1 History

- First famous vector (not SIMD) processor
- In January 1978 there were only 12 non-Cray-1 vector processors worldwide:
  - Illiac IV, TI ASC (7 installations), CDC STAR 100 (4 installations)
- Multiple functional units
  - 12 pipelined functional units in 4 groups: address, scalar, vector, and floating point
  - Scalar add = 3 cycles, vector add = 3 cycles, floating-point add = 6 cycles, floating-point multiply = 7 cycles, reciprocal approximation = 14 cycles

Cray-1 Vector Operations

- Vector arithmetic
  - 8 vector registers, each holding a 64-element vector (64 64-bit words)
  - Arithmetic and logical instructions operate on 3 vector registers
  - Vector C = vector A + vector B
  - Decode the instruction once, then pipeline the load, add, store operations
- Vector chaining
  - Use pipelining with data forwarding to bypass vector registers and send result of one functional unit to input of another (a short sketch follows these four slides)

Cray-1 Physical Architecture

- Custom implementation
  - Register chips, memory chips, low-speed and high-speed gates
- Physical architecture
  - "Cylindrical tower (6.5' tall, 4.5' diameter) with 8.5' diameter seat
  - Composed of 12 wedge-like columns in 270° arc, so a "reasonably trim individual" can get inside to work
  - World's most expensive love-seat"
  - "Love seat" hides power supplies and plumbing for Freon cooling system
- Freon cooling system
  - Vertical cooling bars line each wall, modules have a copper heat transfer plate that attaches to the cooling bars
  - Freon is pumped through a stainless steel tube inside an aluminum casing

Cray X-MP, Y-MP, and {CJT}90

- At Cray Research, Steve Chen continued to update the Cray-1, producing…
- X-MP
  - 8.5 ns clock (Cray-1 was 12.5 ns)
  - First multiprocessor supercomputer
  - 4 vector units with scatter / gather
- Y-MP
  - 32-bit addressing (X-MP is 24-bit)
  - 6 ns clock
  - 8 vector units
- C90, J90 (1994), T90
  - J90 built in CMOS, T90 from ECL (faster)
  - Up to 16 (C90) or 32 (J90/T90) processors, with one multiply and one add vector pipeline per CPU
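The vector-operation bullets above are the core Cray-1 idea: an instruction such as vector C = vector A + vector B is decoded once and its 64 elements stream through a pipelined functional unit, and chaining forwards the output of one unit (e.g., the multiplier) straight into another (e.g., the adder) instead of waiting for the full result vector. The C loops below are only a software analogy of that behavior; the function names are invented for illustration and this is not Cray assembly.

    #include <stdio.h>

    #define VLEN 64   /* each Cray-1 vector register held 64 64-bit words */

    /* One vector add: decoded once, then the elements stream through
       the pipelined adder (here, simply a loop).                       */
    static void vadd(double dst[VLEN], const double a[VLEN],
                     const double b[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            dst[i] = a[i] + b[i];
    }

    /* Chaining analogy: each product is forwarded straight into the
       adder instead of waiting for the whole multiply result vector,
       which in software looks like one fused loop rather than two
       separate passes over the data.                                   */
    static void vmul_add_chained(double dst[VLEN], const double a[VLEN],
                                 const double b[VLEN], const double c[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            dst[i] = a[i] * b[i] + c[i];   /* multiplier output -> adder input */
    }

    int main(void)
    {
        double a[VLEN], b[VLEN], c[VLEN], s[VLEN], r[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0; c[i] = 1.0; }

        vadd(s, a, b);                 /* vector C = vector A + vector B  */
        vmul_add_chained(r, a, b, c);  /* multiply chained into add       */

        printf("s[10] = %.1f, r[10] = %.1f\n", s[10], r[10]);  /* 12.0, 21.0 */
        return 0;
    }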

Cray-2 & Cray-3

- At Cray Research, Steve Chen continued to update the Cray-1 with improved technologies: X-MP, Y-MP, etc.
- Seymour Cray developed Cray-2 in 1985
  - 4-processor multiprocessor with vectors
  - DRAM memory (instead of SRAM), highly interleaved since DRAM is slower
  - Whole machine immersed in Fluorinert (artificial blood substitute)
  - 4.1 ns cycle time (3x faster than Cray-1)
- Spun off to Cray Computer in 1989
- Seymour Cray developed Cray-3 in 1993
  - Replace the "C" shape with a cube so all signals take same time to travel
  - Supposed to have 16 processors, had 1 with a 2 ns cycle time

Thinking Machines Corporation's Connection Machine CM-2

- Distributed-memory SIMD (bit-serial)
- Thinking Machines Corp. founded 1983
  - CM-1, 1986 (1000 MIPS, 4K processors)
  - CM-2, 1987 (2500 MFLOPS, 64K…)
- Programs run on one of 4 Front-End Processors, which issue instructions to the Parallel Processing Unit (PE array)
- Control flow and scalar operations run on Front-End Processors, while parallel operations run on the PPU
- A 4x4 crossbar switch (Nexus) connects the 4 Front-Ends to 4 sections of the PPU
- Each PPU section is controlled by a Sequencer (control unit), which receives assembly language instructions and broadcasts micro-instructions to each processor in that PPU section

CM-2 Nodes / Processors

- CM-2 constructed of "nodes", each with:
  - 32 processors (implemented by 2 custom processor chips), 2 floating-point accelerator chips, and memory chips
- 2 processor chips (each 16 processors)
  - Contains ALU, flag registers, etc.
  - Contains NEWS interface, router interface, and I/O interface
  - 16 processors are connected in a 4x4 mesh to their N, E, W, and S neighbors
- 2 floating-point accelerator chips
  - First chip is interface, second is FP execution unit
- RAM memory
  - 64 Kbits, bit addressable

CM-2 Interconnect

- Broadcast and reduction network
  - Broadcast, Spread (scatter)
  - Reduction (e.g., bitwise OR, maximum, sum), Scan (e.g., collect cumulative results over sequence of processors, such as parallel prefix); a short sketch follows these four slides
  - Sort elements
- NEWS grid can be used for nearest-neighbor communication
  - Communication in multiple dimensions: 256x256, 1024x64, 8x8192, 64x32x32, 16x16x16x16, 8x8x4x8x8x4
- The 16-processor chips are also linked by a 12-dimensional hypercube
  - Good for long-distance point-to-point communication
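The broadcast/reduction bullets above name global operations (reduction, scan / parallel prefix) that the CM-2 network computed across all processors. The sketch below shows, in plain sequential C, what a sum reduction and an inclusive plus-scan produce from per-processor values; it illustrates only the semantics, not actual CM-2 code, and the 16-processor count and function names are illustrative assumptions.

    #include <stdio.h>

    #define NPROC 16   /* illustrative processor count (the CM-2 scaled to 64K) */

    /* Reduction: combine one value from every processor into a single
       result (here a sum; the slides also list bitwise OR and maximum). */
    static long reduce_sum(const long v[NPROC])
    {
        long acc = 0;
        for (int p = 0; p < NPROC; p++)
            acc += v[p];
        return acc;
    }

    /* Inclusive scan (parallel prefix): processor p ends up with the
       sum of the values held by processors 0..p.                      */
    static void scan_sum(const long v[NPROC], long out[NPROC])
    {
        long acc = 0;
        for (int p = 0; p < NPROC; p++) {
            acc += v[p];
            out[p] = acc;
        }
    }

    int main(void)
    {
        long v[NPROC], prefix[NPROC];
        for (int p = 0; p < NPROC; p++)
            v[p] = p + 1;                          /* 1, 2, ..., 16 */

        printf("reduction (sum) = %ld\n", reduce_sum(v));   /* 136          */
        scan_sum(v, prefix);
        printf("scan[3]         = %ld\n", prefix[3]);       /* 1+2+3+4 = 10 */
        return 0;
    }

On the real machine these were single global operations carried out by the broadcast/reduction network while every processor contributed its own value in parallel, rather than a loop on one processor as shown here.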
