

1. Architecture
CS 5234, Spring 2013: Advanced Parallel Computing
Yong Cao

2. Architecture: Goals
Ø Sequential Machine and the Von Neumann Model
Ø Parallel Hardware
Ø Distributed vs. Shared Memory
Ø Architecture Classes
Ø Multi-core
Ø Many-core (massively parallel)
Ø NVIDIA GPU Architecture

3. Architecture: Von Neumann Machine (VN)
Ø PC: Program counter
Ø MAR: Memory address register
Ø MDR: Memory data register
Ø IR: Instruction register
Ø ALU: Arithmetic Logic Unit
Ø Acc: Accumulator
(Figure: datapath in which PC, MAR, and MDR connect to MEMORY; the IR holds OP and ADDRESS fields, with OP feeding the Decoder; the Acc and ALU perform arithmetic.)

4. Architecture: Sequential Execution and Instruction Cycle
Ø The six phases of the instruction cycle:
Ø Fetch
Ø Decode
Ø Evaluate Address
Ø Fetch Operands
Ø Execute
Ø Store Result

5. Architecture: Sequential Execution and Instruction Cycle
Ø Fetch
Ø MAR ← PC
Ø MDR ← MEM[MAR]
Ø IR ← MDR

6. Architecture: Sequential Execution and Instruction Cycle
Ø Decode
Ø DECODER ← IR.OP

7. Architecture: Sequential Execution and Instruction Cycle
Ø Evaluate Address
Ø MAR ← IR.ADDR
Ø MDR ← MEM[MAR]

8. Architecture: Sequential Execution and Instruction Cycle
Ø Execute
Ø Acc ← Acc + MDR

9. Architecture: Sequential Execution and Instruction Cycle
Ø Store Result
Ø MDR ← Acc
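The transfers on slides 5 through 9 can be stitched into one small runnable sketch. Below is a hypothetical single-accumulator machine in CUDA-style host C++ (every structure, opcode, and variable name is an illustrative assumption, not from the slides); each comment names the phase it mimics, and instruction fetch is simplified to read the code array directly rather than routing through the MDR.

```cuda
// Hypothetical single-accumulator machine mirroring slides 5-9.
#include <cstdio>

enum Op { OP_ADD, OP_HALT };
struct Instruction { Op op; int addr; };  // IR = {OP, ADDRESS}

int main() {
    int MEM_DATA[4] = {5, 7, 0, 0};  // data memory
    Instruction MEM_CODE[3] = {{OP_ADD, 0}, {OP_ADD, 1}, {OP_HALT, 0}};

    int PC = 0, MAR = 0, MDR = 0, Acc = 0;
    Instruction IR{};

    for (;;) {
        // Fetch:            MAR <- PC, IR <- MEM[MAR]; PC advances
        MAR = PC;
        IR  = MEM_CODE[MAR];
        PC  = PC + 1;

        // Decode:           DECODER <- IR.OP
        if (IR.op == OP_HALT) break;

        // Evaluate Address: MAR <- IR.ADDR, then MDR <- MEM[MAR]
        MAR = IR.addr;
        MDR = MEM_DATA[MAR];

        // Execute:          Acc <- Acc + MDR
        Acc = Acc + MDR;

        // Store Result:     MDR <- Acc (written back by a store instruction)
        MDR = Acc;
    }
    printf("Acc = %d\n", Acc);  // 5 + 7 = 12
    return 0;
}
```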

10. Architecture: Sequential Execution and Instruction Cycle
Ø Register File
(Figure: the same datapath with the MDR and Acc replaced by a register file feeding the ALU.)

11. Architecture: Sequential Execution and Instruction Cycle
(Figure: simplified datapath: PC, MEMORY, IR, Register File, and ALU.)

12. Architecture: Parallel Hardware
Ø Shared vs. Distributed Memory
(Figure: shared memory: several cores, each with its own PC, IR, Register File, and ALU, all attached to a single MEMORY.)
Ø Multi-Core and Many-Core Architecture

13. Architecture: Parallel Hardware
Ø Shared vs. Distributed Memory
(Figure: distributed memory: each core, with its own PC, IR, Register File, and ALU, owns a private MEMORY.)
Ø Cluster Computing, Grid Computing, Cloud Computing
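To make the contrast concrete, here is a minimal shared-memory sketch in host C++ (a hypothetical example, not from the slides): four threads, each with its own flow of control, all updating a single shared location. In the distributed-memory designs behind cluster, grid, and cloud computing, each worker would instead own a private memory and communicate by passing messages.

```cuda
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> sum{0};            // one memory location, visible to all threads
    std::vector<std::thread> workers;

    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&sum, t] {
            // Each "core" has its own control flow (its own PC and IR),
            // but every core reads and writes the same shared memory.
            for (int i = 0; i < 1000; ++i) sum += t + 1;
        });

    for (auto &w : workers) w.join();
    printf("sum = %ld\n", sum.load());   // 1000 * (1+2+3+4) = 10000
    return 0;
}
```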

14. Architecture: Multi-Core vs. Many-Core
Ø Definition of a core: an independent ALU
Ø What about a vector processor?
Ø SIMD, e.g., Intel's SSE (see the sketch after this list)
Ø How many is "many"?
Ø What if there are too "many" cores for the multi-core design? Shared control logic (PC, IR, scheduling)
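To see why a vector processor is not counted as several cores: one control path (one instruction) drives several ALU lanes. A minimal host-side sketch with Intel's SSE intrinsics (an illustration with arbitrary values, not course material): a single `_mm_add_ps` instruction performs four additions.

```cuda
// One SIMD instruction, four data lanes: a vector unit, not four cores.
#include <immintrin.h>
#include <cstdio>

int main() {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   // four packed floats (highest lane first)
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);  // one ADDPS instruction, four additions

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
    return 0;
}
```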

15. Architecture: Multi-Core
Ø Each core has its own control (PC and IR)

16. Architecture: Many-Core
Ø A group of cores shares the control logic (PC, IR, and thread scheduling); see the sketch below
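Taking CUDA as the concrete instance (an assumption; the slide itself is vendor-neutral), the shared control shows up as SIMT execution: the threads of a warp follow one instruction stream, so a data-dependent branch makes the warp run both sides in turn with part of its lanes masked off. A minimal kernel sketch with hypothetical names:

```cuda
// Threads in a warp share one instruction stream (one PC). When the
// branch below splits the warp, the two paths execute one after the
// other, each with some lanes masked: the price of shared control.
__global__ void divergent(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;   // even lanes take this path...
    else
        out[i] = in[i] + 1;   // ...while odd lanes wait, then take this one
}
```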

17. Architecture: NVIDIA Fermi Architecture
Ø 16 Streaming Multiprocessors (SMs)
Ø 32 cores per SM

18. Architecture: Fermi SM

19. Architecture: Execution in an SM
A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. (Figure: how instructions are issued to the execution blocks.)

20. Architecture: Data Parallel
Ø Data Parallel vs. Task Parallel
Ø What do you partition: data or tasks?
Ø Massive Data Parallelism
Ø Millions (or more) of threads
Ø Same instruction, different data elements (see the kernel sketch after this list)
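"Same instruction, different data elements" is what a CUDA kernel launch expresses directly. A minimal data-parallel sketch (a hypothetical SAXPY example; the names and sizes are illustrative): one thread per array element, every thread running the same code on its own element.

```cuda
#include <cuda_runtime.h>

// Every thread executes the same instruction stream on its own element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;  // a million elements; millions of threads scale the same way
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);  // one thread per element
    cudaDeviceSynchronize();                          // y[i] is now 5.0f

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```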

21. Architecture: Computing on GPUs
Ø Stream processing and vectorization (SIMD)
(Figure: a SIMD unit applies one set of instructions to an input stream and produces an output stream.)

22. Architecture: GPU Programming Model: Stream
Ø Stream Programming Model
Ø Streams: an array of data units
Ø Kernels:
Ø Take streams as input, produce streams as output
Ø Perform computation on streams
Ø Kernels can be linked together (see the sketch below)
(Figure: a pipeline of Stream → Kernel → Stream.)
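A sketch of the stream model in CUDA terms (kernel and variable names are illustrative assumptions): each array plays the role of a stream, each kernel consumes one stream and produces another, and kernels are linked by feeding the first one's output stream to the second.

```cuda
#include <cuda_runtime.h>

// Kernel 1: consumes the input stream, produces an intermediate stream.
__global__ void square(const float *in, float *mid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mid[i] = in[i] * in[i];
}

// Kernel 2: consumes the intermediate stream, produces the output stream.
__global__ void offset(const float *mid, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mid[i] + 1.0f;
}

int main() {
    const int n = 1024;
    float *in, *mid, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&mid, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // Linked kernels: the output stream of one is the input of the next.
    square<<<(n + 255) / 256, 256>>>(in, mid, n);
    offset<<<(n + 255) / 256, 256>>>(mid, out, n);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(mid); cudaFree(out);
    return 0;
}
```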

23. Architecture: Why Streams?
Ø Ample computation by exposing parallelism
Ø Streams expose data parallelism
Ø Multiple stream elements can be processed in parallel
Ø Pipeline (task) parallelism
Ø Multiple tasks can be processed in parallel
Ø Efficient communication
Ø Producer-consumer locality
Ø Predictable memory access patterns
Ø Optimize for the throughput of all elements, not the latency of one
Ø Processing many elements at once allows latency hiding
