  1. ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING. Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory.

  2. Over the Last 20 Years: Performance Development. [Chart: TOP500 performance development, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s, with curves for the SUM of all 500 systems, N=1, and N=500.] June 2012: SUM = 123 Pflop/s, N=1 = 16.3 Pflop/s, N=500 = 60.8 Tflop/s; 1993: SUM = 1.17 Tflop/s, N=1 = 59.7 Gflop/s, N=500 = 400 Mflop/s. The N=500 curve trails N=1 by roughly 6-8 years. For scale: my laptop delivers about 70 Gflop/s; my iPad 2 and iPhone 4S deliver about 1.02 Gflop/s.

  3. June 2012: The TOP10. (Rank. Site; Computer; Country; Cores; Rmax [Pflop/s]; % of Peak; Power [MW]; Mflops/Watt)
      1. DOE/NNSA Lawrence Livermore National Lab; Sequoia, BlueGene/Q (16c) + custom; USA; 1,572,864; 16.3; 81; 8.6; 1895
      2. RIKEN Advanced Institute for Computational Science; K computer, Fujitsu SPARC64 VIIIfx (8c) + custom; Japan; 705,024; 10.5; 93; 12.7; 830
      3. DOE/OS Argonne National Lab; Mira, BlueGene/Q (16c) + custom; USA; 786,432; 8.16; 81; 3.95; 2069
      4. Leibniz Rechenzentrum; SuperMUC, Intel (8c) + IB; Germany; 147,456; 2.90; 90*; 3.52; 823
      5. National Supercomputer Center in Tianjin; Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom; China; 186,368; 2.57; 55; 4.04; 636
      6. DOE/OS Oak Ridge National Lab; Jaguar, Cray, AMD (16c) + custom; USA; 298,592; 1.94; 74; 5.14; 377
      7. CINECA; Fermi, BlueGene/Q (16c) + custom; Italy; 163,840; 1.73; 82; 0.821; 2099
      8. Forschungszentrum Juelich (FZJ); JuQUEEN, BlueGene/Q (16c) + custom; Germany; 131,072; 1.38; 82; 0.657; 2099
      9. Commissariat a l'Energie Atomique (CEA); Curie, Bull, Intel (8c) + IB; France; 77,184; 1.36; 82; 2.25; 604
      10. National Supercomputer Center in Shenzhen; Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB; China; 120,640; 1.27; 43; 2.58; 493
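
The last two columns are derived quantities. As a worked illustration (not part of the slides), here is a minimal C sketch recomputing them for the Sequoia entry above; the ~20.1 Pflop/s peak is an assumption back-computed from the quoted 81% efficiency.

```c
/* A minimal sketch (not from the slides) of how the two derived columns
 * in the TOP10 table are obtained: "% of Peak" is Rmax/Rpeak and
 * "Mflops/Watt" is Rmax divided by the measured power draw.
 * Values are the Sequoia entry above; the ~20.1 Pflop/s peak is an
 * assumption back-computed from the quoted 81% efficiency. */
#include <stdio.h>

int main(void) {
    double rmax_pflops  = 16.3;  /* achieved Linpack performance     */
    double rpeak_pflops = 20.1;  /* theoretical peak (assumed above) */
    double power_mw     = 8.6;   /* power during the Linpack run     */

    double efficiency = rmax_pflops / rpeak_pflops;          /* ~0.81 */
    /* Pflop/s -> Mflop/s is a factor of 1e9; MW -> W is 1e6. */
    double mflops_per_watt = (rmax_pflops * 1e9) / (power_mw * 1e6);

    printf("Linpack efficiency: %.0f%%\n", 100.0 * efficiency);
    printf("Power efficiency:   %.0f Mflops/Watt\n", mflops_per_watt);
    return 0;
}
```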

  4. Accelerators (58 systems). [Chart: number of TOP500 systems with accelerators or co-processors, 2006-2012.] June 2012 breakdown: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), IBM PowerXCell 8i (2), ATI GPU (2), Intel MIC (1), Clearspeed CSX600 (0).

  5. Countries' Share (absolute counts): US: 252, China: 68, Japan: 35, UK: 25, France: 22, Germany: 20, Switzerland: 5.

  6. Swiss Machines in the TOP500 (max: 12, min: 1). [Chart: Swiss systems on each TOP500 list, June 1993 to June 2012; data series: 4, 5, 7, 9, 12, 9, 8, 9, 6, 6, 5, 6, 6, 5, 8, 8, 6, 2, 1, 1, 3, 3, 2, 3, 3, 4, 4, 5, 5, 7, 6, 4, 4, 5, 5, 4, 4, 3, 1.] High point: 12 systems (6/95). Low points: 1 system (6/02, 11/02, 6/12).

  7. 28 Systems above 1 Pflop/s Peak (the "Pflop/s Club"). Aggregate peak performance by country: US 41 Pflop/s (9 systems), Japan 16.2 (4), China 11.1 (5), Germany 6.9 (4), France 2.92 (2), UK 2.73 (2), Italy 2.1 (1), Russia 1.7 (1).

  8. Linpack Efficiency (repeated with the same data on slides 9 and 10). [Chart: Linpack efficiency, i.e. achieved Rmax as a fraction of theoretical peak Rpeak, plotted for the systems ranked 1 through 500; y-axis from 0% to 100%.]

  11. Performance Development in the TOP500. [Chart: the N=1 and N=500 performance curves extrapolated from 1994 through 2020, on a log scale from 100 Mflop/s to 1 Eflop/s.]

  12. The High Cost of Data Movement. Approximate energy per operation, 2011 vs. projected 2018:
      DP fused multiply-add (flop): 100 pJ -> 10 pJ
      DP read from DRAM: 4800 pJ -> 1920 pJ
      Local interconnect traversal: 7500 pJ -> 2500 pJ
      Cross-system communication: 9000 pJ -> 3500 pJ
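
To make the imbalance concrete, here is a rough back-of-the-envelope sketch (mine, not from the slides) applying the 2011 figures to a memory-bound kernel such as a long dot product.

```c
/* A rough back-of-the-envelope sketch (not from the slides) using the
 * 2011 energy figures above: for a streaming kernel such as a dot
 * product, each fused multiply-add costs ~100 pJ, but the two operands
 * it reads from DRAM cost ~2 x 4800 pJ, so data movement dominates the
 * energy budget by roughly two orders of magnitude. */
#include <stdio.h>

int main(void) {
    const double pj_fmadd     = 100.0;   /* DP fused multiply-add, 2011 */
    const double pj_dram_read = 4800.0;  /* DP operand read from DRAM   */

    double n = 1e9;                          /* dot product of length 10^9 */
    double e_flops = n * pj_fmadd;           /* arithmetic energy          */
    double e_data  = n * 2.0 * pj_dram_read; /* two DRAM reads per FMA     */

    printf("arithmetic: %.2f J, data movement: %.2f J (ratio %.0fx)\n",
           e_flops * 1e-12, e_data * 1e-12, e_data / e_flops);
    return 0;
}
```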

  13. Energy Cost Challenge. At roughly $1M per MW per year, energy costs are substantial: a 10 Pflop/s system in 2011 uses about 10 MW, and a 1 Eflop/s system in 2018 would need more than 100 MW at current trends. DOE target: 1 Eflop/s in 2018 at 20 MW.
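
A small worked computation (mine, not from the slides) turning those power figures into annual electricity bills, using the ~$1M per MW per year rule of thumb above.

```c
/* A small worked sketch (not from the slides) converting the power
 * figures above into annual electricity bills at ~$1M per MW per year. */
#include <stdio.h>

int main(void) {
    const double musd_per_mw_year = 1.0;   /* ~$1M per MW per year */

    double petascale_mw      = 10.0;   /* 10 Pflop/s system, 2011     */
    double exascale_trend_mw = 100.0;  /* 1 Eflop/s at current trends */
    double exascale_goal_mw  = 20.0;   /* DOE exascale power target   */

    printf("10 Pflop/s system:         ~$%.0fM per year\n",
           petascale_mw * musd_per_mw_year);
    printf("1 Eflop/s, current trends: >$%.0fM per year\n",
           exascale_trend_mw * musd_per_mw_year);
    printf("1 Eflop/s at DOE target:   ~$%.0fM per year\n",
           exascale_goal_mw * musd_per_mw_year);
    return 0;
}
```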

  14. Potential System Architecture with a cap of $200M and 20 MW. Comparison of the 2012 BG/Q system with a projected 2019 exascale system (last value is the factor of difference between today and 2019):
      System peak: 20 Pflop/s -> 1 Eflop/s, O(100)
      Power: 8.6 MW -> ~20 MW
      System memory: 1.6 PB (16*96*1024) -> 32-64 PB, O(10)
      Node performance: 205 GF/s (16*1.6GHz*8) -> 1.2 or 15 TF/s, O(10)-O(100)
      Node memory BW: 42.6 GB/s -> 2-4 TB/s, O(1000)
      Node concurrency: 64 threads -> O(1k) or 10k, O(100)-O(1000)
      Total node interconnect BW: 20 GB/s -> 200-400 GB/s, O(10)
      System size: 98,304 nodes (96*1024) -> O(100,000) or O(1M) nodes, O(100)-O(1000)
      Total concurrency: 5.97 M -> O(billion), O(1000)
      MTTI: 4 days -> O(<1 day), -O(10)

  15. Potential System Architecture with a cap of $200M and 20 MW (revised). The same table as slide 14, with the projected exascale column dated 2022 instead of 2019; all other entries are unchanged.

  16. Critical Issues at Peta and Exascale for Algorithm and Software Design:
      - Synchronization-reducing algorithms: break the fork-join model.
      - Communication-reducing algorithms: use methods that attain lower bounds on communication.
      - Mixed precision methods: roughly 2x the speed of operations and 2x the speed of data movement (see the sketch after this list).
      - Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
      - Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
      - Reproducibility of results: today we cannot guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
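
As an illustration of the mixed-precision item above, here is a minimal toy sketch (not the slides' code; it assumes LAPACKE and CBLAS are available, e.g. link with -llapacke -lcblas) that factors in single precision and recovers double-precision accuracy by iterative refinement.

```c
/* A minimal sketch of the mixed-precision idea on slide 16: factor the
 * matrix in single precision (fast), then recover double-precision
 * accuracy by iterative refinement with double-precision residuals.
 * Illustrative toy only; return codes are ignored for brevity. */
#include <stdio.h>
#include <lapacke.h>
#include <cblas.h>

int main(void) {
    enum { N = 4 };
    /* Column-major storage of a small test matrix A and right-hand side b. */
    double A[N * N] = { 4, 1, 0, 0,
                        1, 4, 1, 0,
                        0, 1, 4, 1,
                        0, 0, 1, 4 };
    double b[N] = { 1, 2, 3, 4 };

    /* 1. Copy A and b to single precision; factor and solve there. */
    float As[N * N], xs[N];
    for (int i = 0; i < N * N; i++) As[i] = (float)A[i];
    for (int i = 0; i < N; i++)     xs[i] = (float)b[i];
    lapack_int ipiv[N];
    LAPACKE_sgesv(LAPACK_COL_MAJOR, N, 1, As, N, ipiv, xs, N);

    double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)xs[i];

    /* 2. Refine: r = b - A*x in double, solve A*dx = r with the
     *    single-precision factors, update x, repeat a few times. */
    for (int it = 0; it < 3; it++) {
        double r[N];
        for (int i = 0; i < N; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N,
                    -1.0, A, N, x, 1, 1.0, r, 1);       /* r = b - A x */
        float rs[N];
        for (int i = 0; i < N; i++) rs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, rs, N);
        for (int i = 0; i < N; i++) x[i] += (double)rs[i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
    return 0;
}
```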

  17. Major Changes to Algorithms/Software:
      - We must rethink the design of our algorithms and software.
      - Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
      - Rethink and rewrite the applications, algorithms, and software.
      - Data movement is expensive; flops are cheap.

  18. Dense Linear Algebra Software Evolution (a sketch contrasting the Level 1 and Level 3 styles follows this list):
      - LINPACK (1970s): Level 1 BLAS, vector operations.
      - LAPACK (1980s): Level 3 BLAS, block operations.
      - ScaLAPACK (1990s): PBLAS and BLACS, block-cyclic data distribution, message passing.
      - PLASMA (2000s): tile layout, tile operations, dataflow scheduling.
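
A small sketch (mine, assuming a linked CBLAS implementation) of the Level 1 versus Level 3 contrast above: the same update written as many vector operations versus one blocked matrix-matrix multiply.

```c
/* A sketch (not from the slides) of the Level 1 vs Level 3 BLAS
 * contrast: the same C := C + A*B update expressed as many daxpy calls
 * (LINPACK-era style, one vector at a time) versus a single dgemm call
 * (LAPACK-era blocked style, which keeps data in cache). */
#include <cblas.h>

/* All matrices are n x n, column-major, leading dimension n. */
void update_level1(int n, const double *A, const double *B, double *C) {
    /* C(:,j) += A(:,k) * B(k,j): n^2 daxpy calls, each O(n) work, so
     * every operand is streamed from memory over and over. */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            cblas_daxpy(n, B[k + j * n], &A[k * n], 1, &C[j * n], 1);
}

void update_level3(int n, const double *A, const double *B, double *C) {
    /* One matrix-matrix multiply: O(n^3) flops on O(n^2) data, which is
     * what lets Level 3 BLAS reuse blocks in cache and run near peak. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
}
```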

  19. PLASMA Principles (an illustrative sketch follows this list):
      - Tile algorithms: minimize capacity misses. [Diagram: LAPACK model, one CPU with one cache in front of memory.]
      - Tile matrix layout: minimize conflict misses. [Diagram: PLASMA model, multiple CPUs, each with its own cache, sharing memory.]
      - Dynamic DAG scheduling: minimizes idle time, more overlap, asynchronous operations.
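
An illustrative sketch of the tile and dataflow ideas above, written with OpenMP 4.x task dependences rather than PLASMA's own runtime (an assumption made for portability): each tile operation becomes a task, and the in/out dependences let the runtime schedule the resulting DAG without fork-join barriers.

```c
/* Illustrative sketch (not PLASMA source): the matrix is stored as
 * contiguous NB x NB tiles, each tile operation is a task, and data
 * dependences drive the scheduling. Shown for the C += A*B tile update. */
#include <cblas.h>

#define NB 256   /* tile size */

/* A, B, C are p x p arrays of pointers to NB x NB column-major tiles,
 * indexed column-major over the tile grid. */
void tiled_gemm(int p, double **A, double **B, double **C) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < p; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < p; k++) {
                double *a = A[i + k * p], *b = B[k + j * p],
                       *c = C[i + j * p];
                /* Each tile multiply is one task; tasks touching
                 * different C tiles run concurrently, tasks on the same
                 * C tile are serialized by the inout dependence. */
                #pragma omp task depend(in: a[0:NB*NB], b[0:NB*NB]) \
                                 depend(inout: c[0:NB*NB])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            NB, NB, NB, 1.0, a, NB, b, NB, 1.0, c, NB);
            }
    /* Implicit barrier at the end of the parallel region waits for all
     * outstanding tasks. */
}
```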

  20. Fork-Join Parallelization of LU and QR (a minimal sketch follows this list):
      - Parallelize the update step (dgemm): easy, and done in any reasonable software.
      - This is the (2/3)n^3 term in the flop count.
      - Can be done efficiently with LAPACK plus a multithreaded BLAS.
      [Diagram: cores vs. time for the fork-join execution.]
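
For contrast with the PLASMA sketch, here is a minimal sketch of the fork-join baseline described above: plain LAPACK LU in which the parallelism lives entirely inside the multithreaded BLAS (the test matrix and size are my own, not from the slides).

```c
/* A minimal sketch of the fork-join approach on slide 20: plain LAPACK
 * dgetrf (blocked LU with partial pivoting), where essentially all of
 * the parallelism comes from the multithreaded BLAS underneath the
 * O((2/3)n^3) dgemm trailing-matrix updates. The cores join at the end
 * of every panel/update step, which is the synchronization that
 * PLASMA's DAG scheduling avoids. Assumes LAPACKE plus a threaded BLAS
 * (e.g. OpenBLAS or MKL) at link time. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 2048;
    double *A = malloc((size_t)n * n * sizeof *A);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!A || !ipiv) return 1;

    /* Fill A with a simple diagonally dominant (nonsingular) matrix. */
    for (lapack_int j = 0; j < n; j++)
        for (lapack_int i = 0; i < n; i++)
            A[i + j * n] = (i == j) ? (double)n : 1.0 / (1.0 + i + j);

    /* One call: panel factorizations plus dgemm updates inside. */
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    printf("dgetrf info = %d\n", (int)info);

    free(A);
    free(ipiv);
    return 0;
}
```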
