Architecture
CS 5234 Advanced Parallel Computing, Spring 2013
Yong Cao
Goals
Ø Sequential Machine and the Von Neumann Model
Ø Parallel Hardware
  Ø Distributed vs Shared Memory
Ø Architecture Classes
  Ø Multi-core
  Ø Many-core (massively parallel)
Ø NVIDIA GPU Architecture
Von Neumann Machine (VN)
Ø PC: Program counter
Ø MAR: Memory address register
Ø MDR: Memory data register
Ø IR: Instruction register
Ø ALU: Arithmetic Logic Unit
Ø Acc: Accumulator
[Diagram: datapath connecting PC, MAR, MDR, IR (OP | ADDRESS fields), Decoder, Acc, and ALU to MEMORY]
Sequential Execution and Instruction Cycle
Ø The six phases of the instruction cycle:
  Ø Fetch
  Ø Decode
  Ø Evaluate Address
  Ø Fetch Operands
  Ø Execute
  Ø Store Result
Sequential Execution and Instruction Cycle
Ø Fetch
  Ø MAR ← PC
  Ø MDR ← MEM[MAR]
  Ø IR ← MDR
Sequential Execution and Instruction Cycle
Ø Decode
  Ø DECODER ← IR.OP
Sequential Execution and Instruction Cycle
Ø Evaluate Address
  Ø MAR ← IR.ADDR
  Ø MDR ← MEM[MAR]
Sequential Execution and Instruction Cycle
Ø Execute
  Ø Acc ← Acc + MDR
Sequential Execution and Instruction Cycle
Ø Store Result
  Ø MDR ← Acc
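The register transfers on the last few slides can be read as one loop. Below is a minimal host-side sketch of that loop for the single-accumulator machine; the 16-bit instruction encoding (8-bit OP, 8-bit ADDR), the opcode names, and the final write of MDR back to MEM[MAR] during Store Result are illustrative assumptions, not taken from the slides.

// Minimal sketch of the instruction cycle on a single-accumulator
// Von Neumann machine. Encoding and opcodes are illustrative.
#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_ADD = 0, OP_STORE = 1, OP_HALT = 2 };

int main() {
    uint16_t MEM[256] = {0};
    // A tiny program: ADD [10]; STORE [11]; HALT
    MEM[0]  = (OP_ADD   << 8) | 10;
    MEM[1]  = (OP_STORE << 8) | 11;
    MEM[2]  = (OP_HALT  << 8);
    MEM[10] = 42;                           // data operand

    uint16_t PC = 0, MAR = 0, MDR = 0, IR = 0, Acc = 0;

    for (;;) {
        // Fetch
        MAR = PC;  MDR = MEM[MAR];  IR = MDR;  PC = PC + 1;
        // Decode
        uint8_t op   = IR >> 8;             // DECODER <- IR.OP
        uint8_t addr = IR & 0xFF;           // IR.ADDR
        if (op == OP_HALT) break;
        // Evaluate address and fetch operand
        MAR = addr;  MDR = MEM[MAR];
        // Execute / store result
        if (op == OP_ADD)   Acc = Acc + MDR;
        if (op == OP_STORE) { MDR = Acc; MEM[MAR] = MDR; }
    }
    printf("Acc = %u, MEM[11] = %u\n", (unsigned)Acc, (unsigned)MEM[11]);  // 42, 42
    return 0;
}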
Sequential Execution and Instruction Cycle
Ø Register File
[Diagram: the accumulator is replaced by a register file feeding the ALU]
Sequential Execution and Instruction Cycle
[Diagram: simplified datapath with PC, IR, Register File, ALU, and MEMORY]
Parallel Hardware
Ø Shared vs Distributed Memory
[Diagram: several cores, each with its own PC, IR, Register File, and ALU, all connected to one shared MEMORY]
Ø Multi-Core and Many-Core Architecture
Parallel Hardware
Ø Shared vs Distributed Memory
[Diagram: several cores, each with its own PC, IR, Register File, ALU, and private MEMORY]
Ø Cluster Computing, Grid Computing, Cloud Computing
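As a point of comparison, here is a minimal C++ sketch of the shared-memory picture: several "cores" (threads), each with its own instruction stream, operating on one common address space. The thread count, array size, and partial-sum scheme are illustrative; in a distributed-memory design each node would hold a private memory and communicate through explicit messages instead.

// Shared-memory sketch: threads share one address space.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_cores = 4, n = 1 << 20;
    std::vector<int> data(n, 1);                  // one shared memory
    std::vector<long long> partial(num_cores, 0); // per-core partial sums

    std::vector<std::thread> cores;
    for (int c = 0; c < num_cores; ++c) {
        cores.emplace_back([&, c] {
            // Each core works on its own slice of the shared array.
            for (int i = c; i < n; i += num_cores)
                partial[c] += data[i];
        });
    }
    for (auto& t : cores) t.join();

    long long sum = 0;
    for (long long p : partial) sum += p;
    printf("sum = %lld\n", sum);                  // expect 1048576
    return 0;
}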
Multi-Core vs Many-Core
Ø Definition of Core – Independent ALU
Ø How about a vector processor?
  Ø SIMD, e.g., Intel's SSE (see the sketch below)
Ø How many is "many"?
  Ø What if there are too "many" cores in the multi-core design? Shared control logic (PC, IR, scheduling)
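To make the vector-processor point concrete, the sketch below uses Intel SSE intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps), so a single instruction adds four floats at once. The function name and the assumption that n is a multiple of 4 are for illustration only.

// SIMD sketch with SSE: one instruction operates on a 4-wide vector register.
#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

void vec_add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {               // n assumed multiple of 4
        __m128 va = _mm_loadu_ps(a + i);           // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vec_add(a, b, c, 8);
    printf("%g %g ... %g\n", c[0], c[1], c[7]);    // 9 9 ... 9
    return 0;
}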
Multi-Core
Ø Each core has its own control (PC and IR)
Many-Core
Ø A group of cores shares the control logic (PC, IR, and thread scheduling)
NVIDIA Fermi Architecture
Ø 16 Streaming Multiprocessors (SMs)
Ø 32 cores per SM
Fermi SM
[Diagram: block diagram of one Fermi Streaming Multiprocessor]
Execution in an SM
Ø A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units.
[Figure: how instructions are issued to the execution blocks]
Data Parallel
Ø Data Parallel vs Task Parallel
  Ø What to partition? Data or Task?
Ø Massive Data Parallelism
  Ø Millions (or more) of threads
  Ø Same instruction, different data elements
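A minimal CUDA sketch of this idea: every thread executes the same kernel (one instruction stream) on a different array element, and the launch configuration below creates roughly a million threads. Kernel and variable names are illustrative, and error checking is omitted for brevity.

// Massive data parallelism: same instruction, different data elements.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n) out[i] = s * in[i];                   // same op, different data
}

int main() {
    const int n = 1 << 20;                           // ~1M elements -> ~1M threads
    size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %g, h[n-1] = %g\n", h[0], h[n - 1]);   // 2, 2

    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}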
Computing on GPUs
Ø Stream processing and Vectorization (SIMD)
[Diagram: instructions drive a SIMD unit that transforms an input stream into an output stream]
GPU Programming Model: Stream
Ø Stream Programming Model
  Ø Streams: an array of data units
  Ø Kernels:
    Ø Take streams as input, produce streams as output
    Ø Perform computation on streams
    Ø Kernels can be linked together (see the sketch below)
[Diagram: Stream → Kernel → Stream]
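In CUDA terms, a stream can be thought of as a device array and a kernel as a map from an input stream to an output stream; linking kernels means feeding one kernel's output stream to the next. The sketch below chains two illustrative kernels (square, then offset); all names and sizes are assumptions for illustration, and error checking is omitted.

// Stream model sketch: two kernels linked through an intermediate stream.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}
__global__ void offset(const float* in, float* out, float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + c;
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_a, *d_t, *d_b;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_t, bytes); cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    square<<<blocks, threads>>>(d_a, d_t, n);        // stream A -> stream T
    offset<<<blocks, threads>>>(d_t, d_b, 1.0f, n);  // stream T -> stream B (kernels linked)

    cudaMemcpy(h, d_b, bytes, cudaMemcpyDeviceToHost);
    printf("h[3] = %g\n", h[3]);                     // 3*3 + 1 = 10

    cudaFree(d_a); cudaFree(d_t); cudaFree(d_b);
    return 0;
}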
Why Streams?
Ø Ample computation by exposing parallelism
  Ø Streams expose data parallelism
    Ø Multiple stream elements can be processed in parallel
  Ø Pipeline (task) parallelism
    Ø Multiple tasks can be processed in parallel
Ø Efficient communication
  Ø Producer-consumer locality
  Ø Predictable memory access pattern
  Ø Optimize for throughput of all elements, not latency of one
    Ø Processing many elements at once allows latency hiding