Architecture
CS 5234 Advanced Parallel Computing, Spring 2013
Yong Cao
Goals
Ø Sequential Machine and the Von Neumann Model
Ø Parallel Hardware
  Ø Distributed vs Shared Memory
Ø Architecture Classes
  Ø Multi-core
  Ø Many-core (massively parallel)
Ø NVIDIA GPU Architecture
Von Neumann Machine (VN)
Ø PC: Program counter
Ø MAR: Memory address register
Ø MDR: Memory data register
Ø IR: Instruction register
Ø ALU: Arithmetic Logic Unit
Ø Acc: Accumulator
[Diagram: datapath connecting PC, MAR, MDR, IR (OP | ADDRESS fields), Decoder, Acc, and ALU to MEMORY]
Sequential Execution and Instruction Cycle
Ø The six phases of the instruction cycle:
  Ø Fetch
  Ø Decode
  Ø Evaluate Address
  Ø Fetch Operands
  Ø Execute
  Ø Store Result
Sequential Execution and Instruction Cycle
Ø Fetch
  Ø MAR ← PC
  Ø MDR ← MEM[MAR]
  Ø IR ← MDR
Sequential Execution and Instruction Cycle
Ø Decode
  Ø DECODER ← IR.OP
Sequential Execution and Instruction Cycle
Ø Evaluate Address
  Ø MAR ← IR.ADDR
  Ø MDR ← MEM[MAR]
Sequential Execution and Instruction Cycle
Ø Execute
  Ø Acc ← Acc + MDR
Sequential Execution and Instruction Cycle
Ø Store Result
  Ø MDR ← Acc
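The register transfers on the last few slides can be read as one loop. Below is a minimal host-side sketch of that loop for the single-accumulator machine; the 16-bit instruction encoding (8-bit OP, 8-bit ADDR), the opcode names, and the final write of MDR back to MEM[MAR] during Store Result are illustrative assumptions, not taken from the slides.

// Minimal sketch of the instruction cycle on a single-accumulator
// Von Neumann machine. Encoding and opcodes are illustrative.
#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_ADD = 0, OP_STORE = 1, OP_HALT = 2 };

int main() {
    uint16_t MEM[256] = {0};
    // A tiny program: ADD [10]; STORE [11]; HALT
    MEM[0]  = (OP_ADD   << 8) | 10;
    MEM[1]  = (OP_STORE << 8) | 11;
    MEM[2]  = (OP_HALT  << 8);
    MEM[10] = 42;                           // data operand

    uint16_t PC = 0, MAR = 0, MDR = 0, IR = 0, Acc = 0;

    for (;;) {
        // Fetch
        MAR = PC;  MDR = MEM[MAR];  IR = MDR;  PC = PC + 1;
        // Decode
        uint8_t op   = IR >> 8;             // DECODER <- IR.OP
        uint8_t addr = IR & 0xFF;           // IR.ADDR
        if (op == OP_HALT) break;
        // Evaluate address and fetch operand
        MAR = addr;  MDR = MEM[MAR];
        // Execute / store result
        if (op == OP_ADD)   Acc = Acc + MDR;
        if (op == OP_STORE) { MDR = Acc; MEM[MAR] = MDR; }
    }
    printf("Acc = %u, MEM[11] = %u\n", (unsigned)Acc, (unsigned)MEM[11]);  // 42, 42
    return 0;
}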
Sequential Execution and Instruction Cycle
Ø Register File
[Diagram: the accumulator is replaced by a register file feeding the ALU]
Sequential Execution and Instruction Cycle
[Diagram: simplified datapath with PC, IR, Register File, ALU, and MEMORY]
Parallel Hardware
Ø Shared vs Distributed Memory
[Diagram: several cores, each with its own PC, IR, Register File, and ALU, all connected to one shared MEMORY]
Ø Multi-Core and Many-Core Architecture
Parallel Hardware
Ø Shared vs Distributed Memory
[Diagram: several cores, each with its own PC, IR, Register File, ALU, and private MEMORY]
Ø Cluster Computing, Grid Computing, Cloud Computing
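As a point of comparison, here is a minimal C++ sketch of the shared-memory picture: several "cores" (threads), each with its own instruction stream, operating on one common address space. The thread count, array size, and partial-sum scheme are illustrative; in a distributed-memory design each node would hold a private memory and communicate through explicit messages instead.

// Shared-memory sketch: threads share one address space.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_cores = 4, n = 1 << 20;
    std::vector<int> data(n, 1);                  // one shared memory
    std::vector<long long> partial(num_cores, 0); // per-core partial sums

    std::vector<std::thread> cores;
    for (int c = 0; c < num_cores; ++c) {
        cores.emplace_back([&, c] {
            // Each core works on its own slice of the shared array.
            for (int i = c; i < n; i += num_cores)
                partial[c] += data[i];
        });
    }
    for (auto& t : cores) t.join();

    long long sum = 0;
    for (long long p : partial) sum += p;
    printf("sum = %lld\n", sum);                  // expect 1048576
    return 0;
}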
Multi-Core vs Many-Core
Ø Definition of Core – Independent ALU
Ø How about a vector processor?
  Ø SIMD, e.g., Intel's SSE (see the sketch below)
Ø How many is "many"?
  Ø What if there are too "many" cores in the multi-core design? Shared control logic (PC, IR, scheduling)
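To make the vector-processor point concrete, the sketch below uses Intel SSE intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps), so a single instruction adds four floats at once. The function name and the assumption that n is a multiple of 4 are for illustration only.

// SIMD sketch with SSE: one instruction operates on a 4-wide vector register.
#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

void vec_add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {               // n assumed multiple of 4
        __m128 va = _mm_loadu_ps(a + i);           // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vec_add(a, b, c, 8);
    printf("%g %g ... %g\n", c[0], c[1], c[7]);    // 9 9 ... 9
    return 0;
}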
Multi-Core
Ø Each core has its own control (PC and IR)
Many-Core
Ø A group of cores shares the control logic (PC, IR, and thread scheduling)
NVIDIA Fermi Architecture
Ø 16 Streaming Multiprocessors (SMs)
Ø 32 cores per SM
Fermi SM
[Diagram: block diagram of one Fermi Streaming Multiprocessor]
Execution in an SM
Ø A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units.
[Figure: how instructions are issued to the execution blocks]
Data Parallel
Ø Data Parallel vs Task Parallel
  Ø What to partition? Data or Task?
Ø Massive Data Parallelism
  Ø Millions (or more) of threads
  Ø Same instruction, different data elements
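A minimal CUDA sketch of this idea: every thread executes the same kernel (one instruction stream) on a different array element, and the launch configuration below creates roughly a million threads. Kernel and variable names are illustrative, and error checking is omitted for brevity.

// Massive data parallelism: same instruction, different data elements.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n) out[i] = s * in[i];                   // same op, different data
}

int main() {
    const int n = 1 << 20;                           // ~1M elements -> ~1M threads
    size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %g, h[n-1] = %g\n", h[0], h[n - 1]);   // 2, 2

    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}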
Computing on GPUs
Ø Stream processing and Vectorization (SIMD)
[Diagram: instructions drive a SIMD unit that transforms an input stream into an output stream]
GPU Programming Model: Stream
Ø Stream Programming Model
  Ø Streams: an array of data units
  Ø Kernels:
    Ø Take streams as input, produce streams as output
    Ø Perform computation on streams
    Ø Kernels can be linked together (see the sketch below)
[Diagram: Stream → Kernel → Stream]
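In CUDA terms, a stream can be thought of as a device array and a kernel as a map from an input stream to an output stream; linking kernels means feeding one kernel's output stream to the next. The sketch below chains two illustrative kernels (square, then offset); all names and sizes are assumptions for illustration, and error checking is omitted.

// Stream model sketch: two kernels linked through an intermediate stream.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}
__global__ void offset(const float* in, float* out, float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + c;
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_a, *d_t, *d_b;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_t, bytes); cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    square<<<blocks, threads>>>(d_a, d_t, n);        // stream A -> stream T
    offset<<<blocks, threads>>>(d_t, d_b, 1.0f, n);  // stream T -> stream B (kernels linked)

    cudaMemcpy(h, d_b, bytes, cudaMemcpyDeviceToHost);
    printf("h[3] = %g\n", h[3]);                     // 3*3 + 1 = 10

    cudaFree(d_a); cudaFree(d_t); cudaFree(d_b);
    return 0;
}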
Why Streams?
Ø Ample computation by exposing parallelism
  Ø Streams expose data parallelism
    Ø Multiple stream elements can be processed in parallel
  Ø Pipeline (task) parallelism
    Ø Multiple tasks can be processed in parallel
Ø Efficient communication
  Ø Producer-consumer locality
  Ø Predictable memory access pattern
  Ø Optimize for throughput of all elements, not latency of one
    Ø Processing many elements at once allows latency hiding