Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
6/7/10
[Chart: Top500 performance development, 1993-2009 (TPP Linpack performance, rate vs. size). Current values: SUM = 32.4 PFlop/s, N=1 = 1.76 PFlop/s, N=500 = 24.7 TFlop/s; in 1993: SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s. The N=500 line trails N=1 by 6-8 years; "My Laptop" sits near 10 Gflop/s.]
[Chart: Processor family share of the Top500 — Intel 81%, AMD 10%, IBM 8%.]
Of the Top500, 499 are multicore:
Intel Xeon (8 cores), AMD Istanbul (6 cores), IBM Power 7 (8 cores), IBM BG/P (4 cores), IBM Cell (9 cores), Sun Niagara 2 (8 cores), Fujitsu Venus (8 cores), Intel Polaris [experimental] (80 cores)
[Chart: Performance of Countries — total Top500 performance [Tflop/s] from 2000 to 2010 on a log scale, with curves for the US, EU, Japan, and China added in successive build-up slides.]
Countries / System Share
Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE/OS, Oak Ridge Nat Lab | Jaguar / Cray XT5, six-core 2.6 GHz | USA | 224,162 | 1.76 | 75 | 7.0 | 251
2 | Nat. Supercomputer Center in Shenzhen | Nebulae / Dawning TC3600 Blade, Intel X5650 + Nvidia C2050 GPU | China | 120,640 | 1.27 | 43 | 2.58 | 493
3 | DOE/NNSA, Los Alamos Nat Lab | Roadrunner / IBM BladeCenter QS22/LS21 | USA | 122,400 | 1.04 | 76 | 2.48 | 446
4 | NSF/NICS, U of Tennessee | Kraken / Cray XT5, six-core 2.6 GHz | USA | 98,928 | 0.831 | 81 | 3.09 | 269
5 | Forschungszentrum Juelich (FZJ) | Jugene / IBM Blue Gene/P Solution | Germany | 294,912 | 0.825 | 82 | 2.26 | 365
6 | NASA / Ames Research Center/NAS | Pleiades / SGI Altix ICE 8200EX | USA | 56,320 | 0.544 | 82 | 3.1 | 175
7 | National SC Center in Tianjin / NUDT | Tianhe-1 / NUDT TH-1, Intel QC + AMD ATI Radeon 4870 | China | 71,680 | 0.563 | 46 | 1.48 | 380
8 | DOE/NNSA, Lawrence Livermore NL | BlueGene/L / IBM eServer Blue Gene Solution | USA | 212,992 | 0.478 | 80 | 2.32 | 206
9 | DOE/OS, Argonne Nat Lab | Intrepid / IBM Blue Gene/P Solution | USA | 163,840 | 0.458 | 82 | 1.26 | 363
10 | DOE/NNSA, Sandia Nat Lab | Red Sky / Sun SunBlade 6275 | USA | 42,440 | 0.433 | 87 | 2.4 | 180
Jaguar was recently upgraded to a 2 Pflop/s system with more than 224K cores using AMD's six-core chip.
• Peak performance: 2.332 PF
• System memory: 300 TB
• Disk space: 10 PB
• Disk bandwidth: 240+ GB/s
• Interconnect bandwidth: 374 TB/s
Nebulae
• Hybrid system: commodity CPUs + GPUs
• Theoretical peak: 2.98 Pflop/s
• Linpack Benchmark: 1.27 Pflop/s
• 4,640 nodes; each node: 2 Intel six-core Xeon 5650 + 1 Nvidia Fermi C2050 GPU (14 cores each), for 120,640 cores in all (see the arithmetic below)
• InfiniBand connected: 500 MB/s peak per link and 8 GB/s
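As a quick check, the quoted core count follows directly from the node breakdown above (counting the C2050 as 14 cores, as the slide does):

    4,640 nodes × (2 × 6 CPU cores + 14 GPU cores) = 4,640 × 26 = 120,640 cores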
Commodity plus accelerator (GPU):

              | Commodity: Intel Xeon | Accelerator (GPU): Nvidia C2050 "Fermi"
Cores         | 8 cores               | 448 "CUDA cores"
Clock         | 3 GHz                 | 1.15 GHz
Ops per cycle | 8 × 4                 | 448
Peak (DP)     | 96 Gflop/s            | 515 Gflop/s

Interconnect: PCI Express, 512 MB/s to 32 GB/s. (The peak numbers are multiplied out below.)
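A minimal sketch of where those peak figures come from — it simply multiplies out the numbers in the table above (the per-cycle op counts are taken from the slide, not from vendor datasheets):

    #include <stdio.h>

    /* Multiply out the peak double-precision rates quoted in the table:
       peak [Gflop/s] = (DP ops per cycle across the chip) x (clock in GHz). */
    int main(void)
    {
        double xeon  = 8 * 4 * 3.0;   /* 8 cores x 4 ops/cycle x 3.0 GHz ->  96 Gflop/s */
        double fermi = 448 * 1.15;    /* 448 ops/cycle x 1.15 GHz        -> 515 Gflop/s */
        printf("Xeon  peak: %.0f Gflop/s (DP)\n", xeon);
        printf("Fermi peak: %.0f Gflop/s (DP)\n", fermi);
        return 0;
    }

The roughly 5x gap in peak DP rate, reached at a much lower clock, is what makes the accelerator attractive despite the comparatively thin PCI Express link between the two.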
Roadrunner: a hybrid design (2 kinds of chips & 3 kinds of cores); programming is required at 3 levels.
• Based on the 100 Gflop/s (DP) Cell chip: ≈ 13,000 Cell HPC chips, ≈ 1.33 PetaFlop/s from Cell
• ≈ 7,000 dual-core Opterons, one Cell chip for each Opteron core; ≈ 122,000 cores in total
• "Connected Unit" cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
• 17 clusters, joined by a 2nd-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
Looking at the Gordon Bell Prize (which recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing):
• 1 GFlop/s, 1988: Cray Y-MP, 8 processors — static finite element analysis
• 1 TFlop/s, 1998: Cray T3E, 1,024 processors — modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
• 1 PFlop/s, 2008: Cray XT5, 1.5x10^5 processors — superconductive materials
• 1 EFlop/s, ~2018: ?, 1x10^7 processors (10^9 threads)
[Chart: Performance Development in Top500, extrapolated to 2020 — the SUM, N=1, and N=500 trend lines together with the Gordon Bell Prize winners; the extrapolation reaches 1 Eflop/s around 2020.]
Systems | 2009 | 2019 | Difference Today & 2019
System peak | 2 Pflop/s | 1 Eflop/s | O(1000)
Power | 6 MW | ~20 MW |
System memory | 0.3 PB | 32-64 PB [0.03 Bytes/Flop] | O(100)
Node performance | 125 GF | 1, 2 or 15 TF | O(10) – O(100)
Node memory BW | 25 GB/s | 2-4 TB/s [0.002 Bytes/Flop] | O(100)
Node concurrency | 12 | O(1k) or 10k | O(100) – O(1000)
Total node interconnect BW | 3.5 GB/s | 200-400 GB/s (1:4 or 1:8 from memory BW) | O(100)
System size (nodes) | 18,700 | O(100,000) or O(1M) | O(10) – O(100)
Total concurrency | 225,000 | O(billion) [O(10) to O(100) for latency hiding] | O(10,000)
Storage | 15 PB | 500-1000 PB (>10x system memory is min) | O(10) – O(100)
IO | 0.2 TB/s | 60 TB/s (how long to drain the machine) | O(100)
MTTI | days | O(1 day) | -O(10)
• Light weight processors (think BG/P): ~1 GHz processor (10^9), ~1 Kilo cores/socket (10^3), ~1 Mega sockets/system (10^6)
• Hybrid system (think GPU based): ~1 GHz processor (10^9), ~10 Kilo FPUs/socket (10^4), ~100 Kilo sockets/system (10^5) — both designs multiply out to an exaflop, as shown below
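Both designs reach the same target, assuming one flop per core (or FPU) per clock — an assumption of this sketch, not a statement on the slide:

    light weight: 10^9 cycles/s × 10^3 cores/socket × 10^6 sockets/system = 10^18 flop/s = 1 Eflop/s
    hybrid:       10^9 cycles/s × 10^4 FPUs/socket  × 10^5 sockets/system = 10^18 flop/s = 1 Eflop/s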
• Steepness of the ascent from terascale to petascale to exascale
• Extreme parallelism and hybrid design
• Preparing for million/billion-way parallelism
• Tightening memory/bandwidth bottleneck
• Limits on power/clock speed and their implications for multicore
• Reducing communication will become much more important
• Memory per core changes; the byte-to-flop ratio will change
• Necessary fault tolerance: MTTF will drop, and checkpoint/restart has limitations
The software infrastructure for this does not exist today.
[Chart: Average Number of Cores Per Supercomputer for the Top20 Systems, rising toward 100,000.]
• Number of cores per chip will double every two years
• Clock speed will not increase (and may even decrease) because of power:
  Power ∝ Voltage^2 × Frequency, and Voltage ∝ Frequency, hence Power ∝ Frequency^3 (worked out below)
• Need to deal with systems with millions of concurrent threads
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
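A worked consequence of those proportionalities (constants dropped; this is the standard dynamic-power argument, not an exact model):

    P ∝ V^2 · f  and  V ∝ f   ⇒   P ∝ f^3

Halving the clock therefore cuts dynamic power by roughly 8x, so two cores at f/2 deliver the same peak flop rate for about a quarter of the power — the case for more, slower cores rather than higher clocks.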
Different classes of chips (home, games/graphics, business, scientific) with many floating-point cores, plus 3D stacked memory.
• Must rethink the design of our software — another disruptive technology, similar to what happened with cluster computing and message passing
• Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this