Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory / University of Manchester, 9/14/09
[Chart: TOP500 performance development, 1993-2009 (Linpack/TPP performance rate). In 2009 the combined performance of all 500 systems (SUM) is 22.9 PFlop/s, the #1 system (N=1) is at 1.1 PFlop/s, and the #500 system is at 17.08 TFlop/s; in 1993 the corresponding values were 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s. The N=500 line trails the N=1 line by about 6-8 years; "My Laptop" is marked in the GFlop/s range.]
Looking at the Gordon Bell Prize
(recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing)
• 1 GFlop/s, 1988, Cray Y-MP, 8 processors: static finite element analysis
• 1 TFlop/s, 1998, Cray T3E, 1,024 processors: modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
• 1 PFlop/s, 2008, Cray XT5, 1.5x10^5 processors: superconductive materials
• 1 EFlop/s, ~2018, ?, 1x10^7 processors (10^9 threads)
[Chart: Performance Development in the TOP500, extrapolated from 1994 to 2020, with the SUM, N=1, and N=500 trend lines projected toward 1 EFlop/s and the Gordon Bell Prize winners overlaid.]
Rank  Site                               Computer                                  Country  Cores    Rmax [Tflops]  % of Peak  Power [MW]  Flops/Watt
 1    DOE / NNSA, Los Alamos Nat Lab     Roadrunner / IBM BladeCenter QS22/LS21    USA      129,600  1,105          76         2.48        446
 2    DOE / OS, Oak Ridge Nat Lab        Jaguar / Cray XT5 QC 2.3 GHz              USA      150,152  1,059          77         6.95        151
 3    Forschungszentrum Juelich (FZJ)    Jugene / IBM Blue Gene/P Solution         Germany  294,912  825            82         2.26        365
 4    NASA / Ames Research Center/NAS    Pleiades / SGI Altix ICE 8200EX           USA      51,200   480            79         2.09        230
 5    DOE / NNSA, Lawrence Livermore NL  BlueGene/L / IBM eServer Blue Gene        USA      212,992  478            80         2.32        206
 6    NSF, NICS / U of Tennessee         Kraken / Cray XT5 QC 2.3 GHz              USA      66,000   463            76         -           -
 7    DOE / OS, Argonne Nat Lab          Intrepid / IBM Blue Gene/P Solution       USA      163,840  458            82         1.26        363
 8    NSF, TACC / U. of Texas            Ranger / Sun SunBlade x6420               USA      62,976   433            75         2.0         217
 9    DOE / NNSA, Lawrence Livermore NL  Dawn / IBM Blue Gene/P Solution           USA      147,456  415            83         1.13        367
10    Forschungszentrum Juelich (FZJ)    JUROPA / Bull SA NovaScale / Sun Blade    Germany  26,304   274            89         1.54        178
From K. Olukotun, L. Hammond, H. Sutter, and B. Smith
• In the "old days" it was: each year processors would become faster
• Today the clock speed is fixed or getting slower
• Things are still doubling every 18-24 months
• Moore's Law reinterpreted: the number of cores doubles every 18-24 months
• A hardware issue just became a software problem
• Power ∝ Voltage^2 x Frequency (V^2 F)
• Frequency ∝ Voltage
• Power ∝ Frequency^3
(a small worked example follows)
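A small worked example of what these relations imply (a minimal sketch; the 75% clock and two-core figures are illustrative assumptions, not numbers from the talk): since Power ~ Frequency^3, cutting the clock to 0.75x cuts a core's power to about 0.42x, so two such cores deliver roughly 1.5x the peak throughput for about 0.84x the power of one full-speed core.

```c
#include <stdio.h>

int main(void)
{
    const double f     = 0.75;  /* relative clock frequency (illustrative)  */
    const int    cores = 2;     /* relative number of cores (illustrative)  */

    double power = cores * f * f * f;  /* Power ~ cores * F^3 (since V ~ F) */
    double perf  = cores * f;          /* ideal peak throughput ~ cores * F */

    printf("relative power: %.2f   relative peak performance: %.2f\n",
           power, perf);               /* prints 0.84 and 1.50              */
    return 0;
}
```

This is the arithmetic behind the shift to multicore: more, slower cores can raise peak performance while holding or lowering the power budget.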
282 systems use quad-core processors, 204 use dual-core, and 3 use nona-core chips.
Examples of multicore processors:
• Intel Clovertown (4 cores)
• Sun Niagara2 (8 cores)
• IBM Power 7 (8 cores)
• Intel Polaris [experimental] (80 cores)
• IBM Cell (9 cores)
• Fujitsu Venus (8 cores)
• AMD Istanbul (6 cores)
• IBM BG/P (4 cores)
• The number of cores per chip doubles every 2 years, while the clock speed remains fixed or decreases
• Need to deal with systems with millions of concurrent threads
• Future generations will have billions of threads!
• The number of threads of execution doubles every 2 years
• Must rethink the design of our software
  - Another disruptive technology, similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
  - Both LAPACK and ScaLAPACK will undergo major changes to accommodate this
• Effective Use of Many-Core and Hybrid Architectures
  - Dynamic data-driven execution
  - Block data layout
• Exploiting Mixed Precision in the Algorithms
  - Single precision is 2X faster than double precision; with GP-GPUs, 10X
  - (a sketch of mixed-precision iterative refinement follows this list)
• Self-Adapting / Auto-Tuning of Software
  - Too hard to do by hand
• Fault-Tolerant Algorithms
  - With 1,000,000s of cores, things will fail
• Communication-Avoiding Algorithms
  - For dense computations, from O(n log p) to O(log p) communications
  - s-step GMRES: compute (x, Ax, A^2 x, ..., A^s x)
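To make the mixed-precision point concrete, here is a minimal sketch (not code from the talk) of iterative refinement: the O(n^3) factorization is done in single precision, and double-precision accuracy is recovered with cheap refinement steps whose residuals are computed in double precision. LAPACK's dsgesv packages the same idea; the test matrix, size, and stopping tolerance below are assumptions made for the demo. Link against LAPACKE and CBLAS.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

int main(void)
{
    const int n = 500;
    double *A  = malloc(sizeof(double) * n * n);   /* column-major            */
    double *b  = malloc(sizeof(double) * n);
    double *x  = calloc(n, sizeof(double));        /* start from x = 0        */
    double *r  = malloc(sizeof(double) * n);
    float  *As = malloc(sizeof(float) * n * n);    /* single-precision copy   */
    float  *rs = malloc(sizeof(float) * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    /* Random, diagonally dominant test system (an assumption for the demo). */
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++)
            A[i + j * n] = (double)rand() / RAND_MAX;
        A[j + j * n] += n;
        b[j] = (double)rand() / RAND_MAX;
    }

    /* 1. Copy A to single precision and do the O(n^3) factorization there.  */
    for (int k = 0; k < n * n; k++) As[k] = (float)A[k];
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);

    /* 2. Refine: residuals in double precision, corrections solved with the
     *    single-precision LU factors.                                        */
    double bnorm = cblas_dnrm2(n, b, 1);
    for (int it = 0; it < 30; it++) {
        cblas_dcopy(n, b, 1, r, 1);                          /* r = b        */
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);            /* r = b - A*x  */

        double rnorm = cblas_dnrm2(n, r, 1);
        printf("iter %2d  ||r||/||b|| = %e\n", it, rnorm / bnorm);
        if (rnorm <= 1e-14 * bnorm) break;

        for (int i = 0; i < n; i++) rs[i] = (float)r[i];     /* demote        */
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, rs, n);
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];   /* x += dz       */
    }

    free(A); free(b); free(x); free(r); free(As); free(rs); free(ipiv);
    return 0;
}
```

The expensive work runs at single-precision speed, while a few cheap matrix-vector products and triangular solves restore double-precision accuracy.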
Software/algorithms follow hardware evolution in time:
• LINPACK (70's), vector operations: relies on Level-1 BLAS operations
• LAPACK (80's), blocking / cache friendly: relies on Level-3 BLAS operations
• ScaLAPACK (90's), distributed memory: relies on PBLAS and message passing
• PLASMA (00's), many-core friendly: new algorithms that rely on a DAG/scheduler, block data layout, and some extra kernels (a block-data-layout sketch follows)

These new algorithms:
• have very low granularity and scale very well (multicore, petascale computing, ...)
• remove many of the dependencies among tasks (multicore, distributed computing)
• avoid latency (distributed computing, out-of-core)
• rely on fast kernels
They need new kernels and rely on efficient scheduling algorithms.
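One ingredient of the PLASMA approach above, block data layout, can be sketched in a few lines: the matrix is copied from column-major (LAPACK-style) storage into contiguous nb x nb tiles, so each fine-grained kernel touches one compact block of memory rather than nb strided columns. The function name, the square-matrix and divisibility restrictions, and the tile ordering are illustrative assumptions, not PLASMA's actual API.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the n x n column-major matrix A (lda >= n) into tiles of size nb x nb.
 * Tile (i,j) is stored contiguously (column-major within the tile) at
 * tiles[i + j * (n/nb)]. For simplicity this sketch assumes nb divides n.    */
void lapack_to_tile(int n, int nb, const double *A, int lda, double **tiles)
{
    int nt = n / nb;                                   /* tiles per dimension */
    for (int j = 0; j < nt; j++)
        for (int i = 0; i < nt; i++) {
            double *T = tiles[i + j * nt];
            for (int jj = 0; jj < nb; jj++)            /* one tile column at a time */
                memcpy(&T[jj * nb],
                       &A[(i * nb) + ((size_t)j * nb + jj) * lda],
                       nb * sizeof(double));
        }
}

int main(void)
{
    int n = 1024, nb = 128, nt = n / nb;
    double *A = malloc((size_t)n * n * sizeof(double));
    double **tiles = malloc((size_t)nt * nt * sizeof(double *));
    for (int k = 0; k < nt * nt; k++)
        tiles[k] = malloc((size_t)nb * nb * sizeof(double));
    for (size_t k = 0; k < (size_t)n * n; k++)
        A[k] = (double)k;                              /* arbitrary test data */
    lapack_to_tile(n, nb, A, n, tiles);                /* cleanup omitted     */
    return 0;
}
```

Tile algorithms then run their kernels directly on these contiguous tiles, which is what makes fine-grained, task-based scheduling practical.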
Parallel software for multicores should have two characteristics:
• Fine granularity:
  - A high level of parallelism is needed.
  - Cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.
• Asynchronicity:
  - As the degree of thread-level parallelism grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.
A task-based sketch illustrating both properties follows.
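A minimal sketch of what such fine-grained, asynchronous code can look like: a tile Cholesky factorization expressed as a DAG of small tasks, in the spirit of PLASMA. OpenMP task dependencies are used here purely to illustrate a DAG scheduler (they postdate this talk, and PLASMA uses its own runtime); the tile count, tile size, and test matrix are assumptions for the demo. Each task is one Level-3 BLAS or LAPACK kernel on a tile, and a task fires as soon as the tiles it reads are ready, with no global synchronization points. Build with OpenMP and link against LAPACKE and CBLAS.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

#define NT 8                 /* tiles per dimension (assumption for the demo) */
#define NB 128               /* tile size                                     */

static double *T[NT][NT];    /* block data layout: each tile is contiguous    */

int main(void)
{
    /* Build a symmetric positive definite matrix, lower tiles only:
       random entries plus a heavily weighted diagonal.                       */
    for (int j = 0; j < NT; j++)
        for (int i = j; i < NT; i++) {
            T[i][j] = malloc(NB * NB * sizeof(double));
            for (int k = 0; k < NB * NB; k++)
                T[i][j][k] = (double)rand() / RAND_MAX;
            if (i == j)
                for (int k = 0; k < NB; k++)
                    T[i][j][k + k * NB] += NT * NB;
        }

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        /* Factor the diagonal tile: A(k,k) = L(k,k) * L(k,k)^T               */
        #pragma omp task depend(inout: T[k][k][0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, T[k][k], NB);

        for (int i = k + 1; i < NT; i++) {
            /* Panel tile: A(i,k) <- A(i,k) * L(k,k)^-T                       */
            #pragma omp task depend(in: T[k][k][0]) depend(inout: T[i][k][0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, NB, NB, 1.0, T[k][k], NB, T[i][k], NB);
        }
        for (int i = k + 1; i < NT; i++) {
            /* Diagonal update: A(i,i) <- A(i,i) - A(i,k) * A(i,k)^T          */
            #pragma omp task depend(in: T[i][k][0]) depend(inout: T[i][i][0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                        -1.0, T[i][k], NB, 1.0, T[i][i], NB);

            for (int j = k + 1; j < i; j++) {
                /* Off-diagonal update: A(i,j) <- A(i,j) - A(i,k) * A(j,k)^T   */
                #pragma omp task depend(in: T[i][k][0], T[j][k][0]) \
                                 depend(inout: T[i][j][0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                            -1.0, T[i][k], NB, T[j][k], NB, 1.0, T[i][j], NB);
            }
        }
    }   /* the only barrier is the implicit one at the end of the region     */

    printf("tile Cholesky done: %d x %d tiles of %d x %d\n", NT, NT, NB, NB);
    return 0;
}
```

The dependencies declared on the tiles are exactly the edges of the task DAG; the runtime is free to interleave tasks from different outer iterations instead of waiting at fork-join barriers.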
[Figure: the kernels in one step of a blocked LU factorization: factor a panel, forward and backward row swaps, triangular solve, and a matrix multiply that updates the trailing submatrix.]
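For reference, the steps named in the figure map directly onto LAPACK/BLAS kernels. The sketch below is a simplified right-looking blocked LU (the function name and sizes are illustrative; the real LAPACK routine is dgetrf): factor a panel, apply the panel's row swaps to the columns on either side, do a triangular solve for the block row of U, and finish with a matrix multiply that updates the trailing submatrix.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

/* Factor the n x n column-major matrix A (lda >= n) in place, panel width nb. */
void blocked_lu(int n, double *A, int lda, lapack_int *ipiv, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int jb = (n - k < nb) ? n - k : nb;   /* width of this panel           */
        int m  = n - k;                       /* rows in the panel             */
        int nt = n - k - jb;                  /* columns right of the panel    */

        /* 1. Factor a panel: LU with partial pivoting of A(k:n-1, k:k+jb-1).  */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, jb, &A[k + (size_t)k * lda], lda, &ipiv[k]);

        /* 2. Row swaps: apply the panel's pivots to the columns left and right
         *    of the panel (the forward/backward swaps in the figure).         */
        if (k > 0)
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, &A[k], lda, 1, jb, &ipiv[k], 1);
        if (nt > 0)
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, nt, &A[k + (size_t)(k + jb) * lda],
                           lda, 1, jb, &ipiv[k], 1);
        for (int i = k; i < k + jb; i++)      /* make pivot indices global      */
            ipiv[i] += k;

        if (nt > 0) {
            /* 3. Triangular solve: U12 = L11^-1 * A12.                         */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, jb, nt, 1.0,
                        &A[k + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + jb) * lda], lda);

            /* 4. Matrix multiply (trailing update): A22 -= L21 * U12.          */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m - jb, nt, jb, -1.0,
                        &A[(k + jb) + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + jb) * lda], lda, 1.0,
                        &A[(k + jb) + (size_t)(k + jb) * lda], lda);
        }
    }
}

int main(void)
{
    int n = 1000;
    double *A = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc(n * sizeof(lapack_int));
    for (size_t k = 0; k < (size_t)n * n; k++)
        A[k] = (double)rand() / RAND_MAX;     /* random test matrix            */
    blocked_lu(n, A, n, ipiv, 128);
    printf("blocked LU of a %d x %d matrix finished\n", n, n);
    free(A); free(ipiv);
    return 0;
}
```

In the fork-join (LAPACK) style each of these steps is a synchronization point; the tile algorithms of the previous slides break them into per-tile tasks instead.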