Evaluating the Productivity of a Multicore Architecture
Jeremy Kepner and Nadya Bliss
MIT Lincoln Laboratory
HPEC 2008
This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory Slide-1 Multicore Productivity

Outline
• Parallel Design
  – Architecture Buffet
  – Programming Buffet
  – Productivity Assessment
• Programming Models
• Architectures
• Productivity Results
• Summary

Signal Processor Devices
[Figure: two plots comparing device technologies (DSP/RISC Core, FPGA, Standard-Cell ASIC, Full-Custom), with MIT LL VLSI points marked. One plots GOPS/cm² against GOPS/W for a 2005 90 nm CMOS process; the other plots GOPS/W × GOPS/cm² against year, 2005 to 2015.]
• Wide range of device technologies for signal processing systems
• Each has their own tradeoffs. How do we choose?

Multicore Processor Buffet
Classification axes: Homogeneous vs Heterogeneous; Short Vector vs Long Vector
Processors shown: Intel Duo/Duo, Broadcom, IBM Cell, AMD Opteron, Tilera, Intel Polaris, IBM PowerX, Sun Niagara, IBM Blue Gene, Cray XT, nVidia, Cray XMT, ATI, Clearspeed
• Wide range of programmable multicore processors
• Each has their own tradeoffs. How do we choose?

Multicore Programming Buffet
Classification axes: Flat vs Hierarchical; word vs object
Environments shown: pThreads, CUDA, StreamIt, ALPH, Cilk, UPC, MCF, CAF, Sequoia, VSIPL++, PVTOL, GA++, pMatlabXVM, pMatlab, StarP
• Wide range of multicore programming environments
• Each has their own tradeoffs. How do we choose?

Performance vs Effort
Style                Example             Granularity  Training  Effort  Performance per Watt
Graphical            Spreadsheet         Module       Low       1/30    1/100
Domain Language      Matlab, Maple, IDL  Array        Low       1/10    1/5
Object Oriented      Java, C++           Object       Medium    1/3     1/1.1
Procedural Library   VSIPL, BLAS         Structure    Medium    2/3     1/1.05
Procedural Language  C, Fortran          Word         Medium    1       1
Assembly             x86, PowerPC        Register     High      3       2
Gate Array           VHDL                Gate         High      10      10
Standard Cell        -                   Cell         High      30      100
Custom VLSI          -                   Transistor   High      100     1000
Annotation on slide: Programmable Multicore (this talk)
• Applications can be implemented with a variety of interfaces
• Clear tradeoff between effort (3000x) and performance (100,000x)
  – Translates into mission capability vs mission schedule
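The 3000x and 100,000x figures quoted on this slide can be checked from the Effort and Performance per Watt columns. The following is a minimal sketch, assuming each span is the ratio of the largest to the smallest table entry; the names `STYLES` and `span` are illustrative and do not come from the talk.

```python
from fractions import Fraction

# (style, relative effort, relative performance per watt), copied from the
# slide's table. Exact fractions avoid floating-point rounding.
STYLES = [
    ("Graphical", Fraction(1, 30), Fraction(1, 100)),
    ("Domain Language", Fraction(1, 10), Fraction(1, 5)),
    ("Object Oriented", Fraction(1, 3), Fraction(10, 11)),     # 1/1.1
    ("Procedural Library", Fraction(2, 3), Fraction(20, 21)),  # 1/1.05
    ("Procedural Language", Fraction(1), Fraction(1)),
    ("Assembly", Fraction(3), Fraction(2)),
    ("Gate Array", Fraction(10), Fraction(10)),
    ("Standard Cell", Fraction(30), Fraction(100)),
    ("Custom VLSI", Fraction(100), Fraction(1000)),
]


def span(values):
    """Ratio of the largest to the smallest value (assumed meaning of 'span')."""
    values = list(values)
    return max(values) / min(values)


effort_span = span(effort for _, effort, _ in STYLES)
performance_span = span(perf for _, _, perf in STYLES)

print(f"effort span: {effort_span}x")            # 100 / (1/30)
print(f"performance span: {performance_span}x")  # 1000 / (1/100)
```

Under that assumption the effort span comes out to 3000 and the performance span to 100,000, matching the slide.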

Assessment Approach
[Figure: Speedup vs Relative Code Size. Log-log plot of relative speedup (performance/watt, 10^-3 to 10^3) against relative effort (relative code size, 10^-1 to 10^1). Labeled regions: Ref, Goal, Traditional Parallel Programming, “Java, Matlab, Python, etc.”, and “All too often”.]
• “Write” benchmarks in many programming environments on different multicore architectures
• Compare performance/watt and relative effort to serial C
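As an illustration of the comparison described on this slide, the sketch below normalizes one implementation against a serial C baseline and converts the result to log10 coordinates for a log-log plot. The `Measurement` class, the `relative_to` helper, and every number are hypothetical and chosen only for illustration; none of them are measurements from the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    perf_per_watt: float  # any consistent unit, e.g. GOPS/W
    code_size: float      # any consistent unit, e.g. lines of code


def relative_to(baseline, m):
    """Return (relative effort, relative speedup) of m against the baseline."""
    return (m.code_size / baseline.code_size,
            m.perf_per_watt / baseline.perf_per_watt)


# Hypothetical numbers for illustration only.
serial_c = Measurement("serial C", perf_per_watt=1.0, code_size=1000)
candidate = Measurement("hypothetical parallel version",
                        perf_per_watt=50.0, code_size=2000)

effort, speedup = relative_to(serial_c, candidate)

# Log10 coordinates, matching log-log axes.
point = (math.log10(effort), math.log10(speedup))

print(f"relative effort {effort:.2f}, relative speedup {speedup:.2f}")
```

By construction the baseline sits at (1, 1), which is the origin in log coordinates. Code size stands in for effort here, as on the slide's axis label.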

Outline
• Parallel Design
• Programming Models
  – Environment features
  – Estimates
  – Performance Complexity
• Architectures
• Productivity Results
• Summary

Programming Environment Features
Technology  Organization  Type      Base Lang
UPC         Std Body      Lang Ext  C
F2008       Std Body      Lang Ext  Fortran
GA++        DOE PNNL      Library   C++
PVL         Lincoln       Library   C++
VSIPL       Std Body      Library   C++
PVTOL       Lincoln       Library   C++
Titanium    UC Berkeley   New Lang  Java
StarP       ISC           Library   Matlab
pMatlab     Lincoln       Library   Matlab
DCT         Mathworks     Library   Matlab
Chapel      Cray          New Lang  ZPL
X10         IBM           New Lang  Java
Fortress    Sun           New Lang  HPF
Data Parallel: Y for all 13 technologies
Remaining rows, values in slide order:
Sponsor: DoD DOE DOE Navy DoD DOE, DoD DARPA DARPA DARPA DARPA SC HPCMP NSF
Precursors: CAF STAPL, PVL, VSIPL++, pMatlab PVL, pMatlab, POOMA POOMA pMatlab StarP StarP
Real Apps: 2001 2001 1998 2000 2004 ~2007 2002 2003 2005
Block-cyclic: 1D ND blk 2D 2D Y ND 2D 4D 1D ND ND
Atomic: Y Y Y
Threads: Y Y Y Y Y
Task Parallel: Y Y Y Y Y Y Y Y
Pipelines: Y Y Y Y
Hier. arrays: Y Y Y Y Y Y
Automap: Y Y Y
Sparse: ? Y Y Y Y ? ?
FPGA IO: Y Y
• Too many environments with too many features to assess individually
• Decompose into general classes
  – Serial programming environment
  – Parallel programming model
• Assess only relevant serial environment and parallel model pairs

Dimensions of Programmability
• Performance
  – The performance of the code on the architecture
  – Measured in: flops/sec, Bytes/sec, GUPS, …
• Effort
  – Coding effort required to obtain a certain level of performance
  – Measured in: programmer-days, lines-of-code, function points, …
• Expertise
  – Skill level of programmer required to obtain a certain level of performance
  – Measured in: degree, years of experience, multi-disciplinary knowledge required, …
• Portability
  – Coding effort required to port code from one architecture to the next and achieve a certain level of performance
  – Measured in: programmer-days, lines-of-code, function points, …
• Baseline
  – All quantities are relative to some baseline environment
  – Serial C on a single core x86 workstation, cluster, multi-core, …

Serial Programming Environments
Programming Language    Assembly  SIMD (C+AltiVec)  Procedural (ANSI C)  Objects (C++, Java)  High Level Languages (Matlab)
Performance Efficiency  0.8       0.5               0.2                  0.15                 0.05
Relative Code Size      10        3                 1                    1/3                  1/10
Effort/Line-of-Code     4 hour    2 hour            1 hour               20 min               10 min
Portability             Zero      Low               Very High            High                 Low
Granularity             Word      Multi-word        Multi-word           Object               Array
• OO High Level Languages are the current desktop state-of-the-practice :-)
• Assembly/SIMD are the current multi-core state-of-the-practice :-(
• Single core programming environments span 10x performance and 100x relative code size
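The rows of this table can be combined into a rough total-effort estimate. The sketch below assumes that total effort is relative code size multiplied by effort per line, normalized to ANSI C; that model is an assumption made for illustration and is not stated on the slide. It also computes the spans mentioned in the last bullet as the ratio of the largest to the smallest table entry.

```python
from fractions import Fraction

# name: (performance efficiency, relative code size, effort per line in minutes)
# Values are taken from the slide's table.
SERIAL_ENVIRONMENTS = {
    "Assembly": (Fraction("0.8"), Fraction(10), 240),
    "SIMD (C+AltiVec)": (Fraction("0.5"), Fraction(3), 120),
    "Procedural (ANSI C)": (Fraction("0.2"), Fraction(1), 60),
    "Objects (C++, Java)": (Fraction("0.15"), Fraction(1, 3), 20),
    "High Level Languages (Matlab)": (Fraction("0.05"), Fraction(1, 10), 10),
}
BASELINE = "Procedural (ANSI C)"


def total_effort(name):
    """Assumed model: total effort = relative code size x effort per line."""
    _, code_size, minutes_per_line = SERIAL_ENVIRONMENTS[name]
    return code_size * minutes_per_line


# Total effort of each environment relative to the ANSI C baseline.
relative_effort = {
    name: total_effort(name) / total_effort(BASELINE)
    for name in SERIAL_ENVIRONMENTS
}

efficiencies = [row[0] for row in SERIAL_ENVIRONMENTS.values()]
code_sizes = [row[1] for row in SERIAL_ENVIRONMENTS.values()]
performance_span = max(efficiencies) / min(efficiencies)
code_size_span = max(code_sizes) / min(code_sizes)

for name, effort in relative_effort.items():
    print(f"{name}: {float(effort):.3f}x the effort of {BASELINE}")
```

Under this assumed model, Assembly costs 40 times the effort of ANSI C and Matlab costs 1/60 of it. The code size span is exactly 100, and the performance efficiency span is 0.8/0.05 = 16, the same order of magnitude as the 10x quoted on the slide.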

Parallel Programming Environments
Approach                Direct Memory Access (DMA)  Message Passing (MPI)  Threads (OpenMP)  Recursive Threads (Cilk)  PGAS (UPC, VSIPL++)  Hierarchical PGAS (PVTOL, HPCS)
Performance Efficiency  0.8                         0.5                    0.2               0.4                       0.5                  0.5
Relative Code Size      10                          3                      1                 1/3                       1/10                 1/10
Effort/Line-of-Code     Very High                   High                   Medium            High                      Medium               High
Portability             Zero                        Very High              High              Medium                    Medium               TBD
Granularity             Word                        Multi-word             Word              Array                     Array                Array
• Message passing and threads are the current desktop state-of-the-practice :-|
• DMA is the current multi-core state-of-the-practice :-(
• Parallel programming environments span 4x performance and 100x relative code size

Canonical 100 CPU Cluster Estimates
[Figure: relative speedup vs relative effort. Serial environments: Assembly, C, C++, Matlab. Parallel environments: Assembly/DMA, C/DMA, C++/DMA, C/Arrays, C++/Arrays, C/MPI, C++/MPI, C/threads, C++/threads, Matlab/Arrays, Matlab/MPI, Matlab/threads.]
• Programming environments form regions around serial environment
