
Evaluating the Productivity of a Multicore Architecture - PowerPoint PPT Presentation

Evaluating the Productivity of a Multicore Architecture. Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory. HPEC 2008. This work is sponsored by the Department of Defense under Air Force


  1. Evaluating the Productivity of a Multicore Architecture
     Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory. HPEC 2008.
     This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
     MIT Lincoln Laboratory Slide-1 Multicore Productivity

  2. Outline
     • Parallel Design
       – Architecture Buffet
       – Programming Buffet
       – Productivity Assessment
     • Programming Models
     • Architectures
     • Productivity Results
     • Summary

  3. Signal Processor Devices
     [Figure: signal processor device classes (Full-Custom, Standard-Cell ASIC, FPGA, DSP/RISC Core, plus MIT LL VLSI data points) for a 2005 90 nm CMOS process. One plot shows GOPS/cm² against GOPS/W; the other shows GOPS/W × GOPS/cm² against year, 2005 to 2015.]
     • Wide range of device technologies for signal processing systems
     • Each has its own tradeoffs. How do we choose?

  4. Multicore Processor Buffet
     [Chart: processors arranged along two axes, Homogeneous vs Heterogeneous and Short Vector vs Long Vector. Processors shown: Intel Duo/Duo, Broadcom, IBM Cell, AMD Opteron, Tilera, Intel Polaris, IBM PowerX, Sun Niagara, IBM Blue Gene, Cray XT, nVidia, Cray XMT, ATI, Clearspeed.]
     • Wide range of programmable multicore processors
     • Each has its own tradeoffs. How do we choose?

  5. Multicore Programming Buffet
     [Chart: programming environments arranged along two axes, Flat vs Hierarchical and word vs object. Environments shown: pThreads, CUDA, StreamIt, ALPH, Cilk, UPC, MCF, CAF, Sequoia, VSIPL++, PVTOL, GA++, pMatlabXVM, pMatlab, StarP.]
     • Wide range of multicore programming environments
     • Each has its own tradeoffs. How do we choose?

  6. Performance vs Effort

     | Style               | Example            | Granularity | Training | Effort | Performance per Watt |
     |---------------------|--------------------|-------------|----------|--------|----------------------|
     | Graphical           | Spreadsheet        | Module      | Low      | 1/30   | 1/100                |
     | Domain Language     | Matlab, Maple, IDL | Array       | Low      | 1/10   | 1/5                  |
     | Object Oriented     | Java, C++          | Object      | Medium   | 1/3    | 1/1.1                |
     | Procedural Library  | VSIPL, BLAS        | Structure   | Medium   | 2/3    | 1/1.05               |
     | Procedural Language | C, Fortran         | Word        | Medium   | 1      | 1                    |
     | Assembly            | x86, PowerPC       | Register    | High     | 3      | 2                    |
     | Gate Array          | VHDL               | Gate        | High     | 10     | 10                   |
     | Standard Cell       |                    | Cell        | High     | 30     | 100                  |
     | Custom VLSI         |                    | Transistor  | High     | 100    | 1000                 |

     Slide annotation beside the table: Programmable Multicore (this talk)

     • Applications can be implemented with a variety of interfaces
     • Clear tradeoff between effort (3000x) and performance (100,000x)
       – Translates into mission capability vs mission schedule
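
The 3000x and 100,000x figures quoted on this slide follow from the extremes of the Effort and Performance per Watt columns. A minimal Python sketch that recomputes them is given here; the values are copied from the table, while the `STYLES` dictionary and the `span` helper are illustrative names, not part of the talk.

```python
from fractions import Fraction as F

# (effort, performance per watt), both relative to a procedural language
# such as C or Fortran. Values copied from the slide's table.
STYLES = {
    "Graphical": (F(1, 30), F(1, 100)),
    "Domain Language": (F(1, 10), F(1, 5)),
    "Object Oriented": (F(1, 3), F(10, 11)),     # 1/1.1
    "Procedural Library": (F(2, 3), F(20, 21)),  # 1/1.05
    "Procedural Language": (F(1), F(1)),
    "Assembly": (F(3), F(2)),
    "Gate Array": (F(10), F(10)),
    "Standard Cell": (F(30), F(100)),
    "Custom VLSI": (F(100), F(1000)),
}


def span(values):
    """Ratio of the largest to the smallest value."""
    values = list(values)
    return max(values) / min(values)


effort_span = span(effort for effort, _ in STYLES.values())
perf_span = span(perf for _, perf in STYLES.values())

print(f"effort span: {int(effort_span)}x")
print(f"performance span: {int(perf_span)}x")
```

Exact fractions are used so that entries such as 1/30 do not pick up floating-point rounding before the ratio is taken.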

  7. Assessment Approach
     [Figure: Speedup vs Relative Code Size. Vertical axis: Relative Speedup (Performance/Watt), 10^-3 to 10^3. Horizontal axis: Relative Effort (Relative Code Size), 10^-1 to 10^1. Labeled regions: Ref, Goal, Traditional Parallel Programming, "Java, Matlab, Python, etc.", and "All too often".]
     • "Write" benchmarks in many programming environments on different multicore architectures
     • Compare performance/watt and relative effort to serial C
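
The comparison described on this slide amounts to dividing each measurement by the corresponding serial C value. A minimal Python sketch, assuming code size in lines is used as the proxy for effort; the `Measurement` class, its field names, and every number below are hypothetical and are not results from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    gflops_per_watt: float  # measured performance per watt
    lines_of_code: int      # code size, used here as a proxy for effort


def relative_to(baseline, m):
    """Return (relative speedup, relative code size) of m versus baseline."""
    return (
        m.gflops_per_watt / baseline.gflops_per_watt,
        m.lines_of_code / baseline.lines_of_code,
    )


# Hypothetical numbers, for illustration only.
serial_c = Measurement("serial C", gflops_per_watt=0.5, lines_of_code=1000)
candidate = Measurement("parallel variant", gflops_per_watt=4.0, lines_of_code=1500)

speedup, code_size = relative_to(serial_c, candidate)
print(f"{candidate.name}: {speedup:.1f}x speedup at {code_size:.1f}x code size")
```

Each benchmark then becomes one point on the speedup versus code size plot, with serial C at (1, 1).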

  8. Outline
     • Parallel Design
     • Programming Models
       – Environment features
       – Estimates
       – Performance Complexity
     • Architectures
     • Productivity Results
     • Summary

  9. Programming Environment Features

     | Technology | Organization | Type     | Base Lang | Data Parallel |
     |------------|--------------|----------|-----------|---------------|
     | UPC        | Std Body     | Lang Ext | C         | Y             |
     | F2008      | Std Body     | Lang Ext | Fortran   | Y             |
     | GA++       | DOE PNNL     | Library  | C++       | Y             |
     | PVL        | Lincoln      | Library  | C++       | Y             |
     | VSIPL      | Std Body     | Library  | C++       | Y             |
     | PVTOL      | Lincoln      | Library  | C++       | Y             |
     | Titanium   | UC Berkeley  | New Lang | Java      | Y             |
     | StarP      | ISC          | Library  | Matlab    | Y             |
     | pMatlab    | Lincoln      | Library  | Matlab    | Y             |
     | DCT        | Mathworks    | Library  | Matlab    | Y             |
     | Chapel     | Cray         | New Lang | ZPL       | Y             |
     | X10        | IBM          | New Lang | Java      | Y             |
     | Fortress   | Sun          | New Lang | HPF       | Y             |

     Remaining rows, in slide order (column alignment lost in extraction):
     Sponsor: DoD DOE DOE Navy DoD DOE, DoD DARPA DARPA DARPA DARPA SC HPCMP NSF
     Precursors: CAF STAPL, PVL, VSIPL++, pMatlab PVL, pMatlab, POOMA POOMA pMatlab StarP StarP
     Real Apps: 2001 2001 1998 2000 2004 ~2007 2002 2003 2005
     Block-cyclic: 1D ND blk 2D 2D Y ND 2D 4D 1D ND ND
     Atomic: Y Y Y
     Threads: Y Y Y Y Y
     Task Parallel: Y Y Y Y Y Y Y Y
     Pipelines: Y Y Y Y
     Hier. arrays: Y Y Y Y Y Y
     Automap: Y Y Y
     Sparse: ? Y Y Y Y ? ?
     FPGA IO: Y Y

     • Too many environments with too many features to assess individually
     • Decompose into general classes
       – Serial programming environment
       – Parallel programming model
     • Assess only relevant serial environment and parallel model pairs

  10. Dimensions of Programmability
     • Performance
       – The performance of the code on the architecture
       – Measured in: flops/sec, Bytes/sec, GUPS, …
     • Effort
       – Coding effort required to obtain a certain level of performance
       – Measured in: programmer-days, lines-of-code, function points, …
     • Expertise
       – Skill level of programmer required to obtain a certain level of performance
       – Measured in: degree, years of experience, multi-disciplinary knowledge required, …
     • Portability
       – Coding effort required to port code from one architecture to the next and achieve a certain level of performance
       – Measured in: programmer-days, lines-of-code, function points, …
     • Baseline
       – All quantities are relative to some baseline environment
       – Serial C on a single core x86 workstation, cluster, multi-core, …

  11. Serial Programming Environments

     | Programming Language   | Assembly | SIMD (C+AltiVec) | Procedural (ANSI C) | Objects (C++, Java) | High Level Languages (Matlab) |
     |------------------------|----------|------------------|---------------------|---------------------|-------------------------------|
     | Performance Efficiency | 0.8      | 0.5              | 0.2                 | 0.15                | 0.05                          |
     | Relative Code Size     | 10       | 3                | 1                   | 1/3                 | 1/10                          |
     | Effort/Line-of-Code    | 4 hour   | 2 hour           | 1 hour              | 20 min              | 10 min                        |
     | Portability            | Zero     | Low              | Very High           | High                | Low                           |
     | Granularity            | Word     | Multi-word       | Multi-word          | Object              | Array                         |

     • OO High Level Languages are the current desktop state-of-the-practice :-)
     • Assembly/SIMD are the current multi-core state-of-the-practice :-(
     • Single core programming environments span 10x performance and 100x relative code size
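
One way to read the Relative Code Size and Effort/Line-of-Code rows together is to multiply them, which gives a rough total effort relative to ANSI C. That product is an assumption made for illustration here, not a formula stated on the slide; `SERIAL_ENVS` and `total_effort` are hypothetical names, and only the table values come from the source.

```python
from fractions import Fraction as F

# (relative code size, minutes of effort per line of code),
# copied from the slide's table.
SERIAL_ENVS = {
    "Assembly": (F(10), 240),
    "SIMD (C+AltiVec)": (F(3), 120),
    "Procedural (ANSI C)": (F(1), 60),
    "Objects (C++, Java)": (F(1, 3), 20),
    "High Level Languages (Matlab)": (F(1, 10), 10),
}


def total_effort(env):
    """Assumed total effort: relative code size times minutes per line."""
    size, minutes_per_line = SERIAL_ENVS[env]
    return size * minutes_per_line


baseline = total_effort("Procedural (ANSI C)")
relative = {env: total_effort(env) / baseline for env in SERIAL_ENVS}

for env, ratio in relative.items():
    print(f"{env}: {float(ratio):.3g}x the effort of ANSI C")
```

Under this assumption the gap between environments widens considerably beyond the 100x span in code size alone, because effort per line grows in the same direction as code size.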

  12. Parallel Programming Environments

     | Approach               | Direct Memory Access (DMA) | Message Passing (MPI) | Threads (OpenMP) | Recursive Threads (Cilk) | PGAS (UPC, VSIPL++) | Hierarchical PGAS (PVTOL, HPCS) |
     |------------------------|----------------------------|-----------------------|------------------|--------------------------|---------------------|---------------------------------|
     | Performance Efficiency | 0.8                        | 0.5                   | 0.2              | 0.4                      | 0.5                 | 0.5                             |
     | Relative Code Size     | 10                         | 3                     | 1                | 1/3                      | 1/10                | 1/10                            |
     | Effort/Line-of-Code    | Very High                  | High                  | Medium           | High                     | Medium              | High                            |
     | Portability            | Zero                       | Very High             | High             | Medium                   | Medium              | TBD                             |
     | Granularity            | Word                       | Multi-word            | Word             | Array                    | Array               | Array                           |

     • Message passing and threads are the current desktop state-of-the-practice :-|
     • DMA is the current multi-core state-of-the-practice :-(
     • Parallel programming environments span 4x performance and 100x relative code size

  13. Canonical 100 CPU Cluster Estimates
     [Figure: relative speedup plotted against relative effort. Parallel points: Assembly/DMA, C++/DMA, C/DMA, C/Arrays, C++/Arrays, C/MPI, C++/MPI, C/threads, C++/threads, Matlab/Arrays, Matlab/MPI, Matlab/threads. Serial points: Assembly, C, C++, Matlab.]
     • Programming environments form regions around serial environment
