Future Directions in High Performance Computing
Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
2/20/2008

Outline
• Top500 Results
• Four Important Concepts that Will Affect Math Software
  - Effective Use of Many-Core
  - Exploiting Mixed Precision in Our Numerical Computations
  - Self Adapting / Auto Tuning of Software
  - Fault Tolerant Algorithms

H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem)
  [Figure: TPP performance, rate vs. size]
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org

Performance Development
[Figure: TOP500 performance, 1993 to 2007, log scale from 100 Mflop/s to 1 Pflop/s]
- SUM: 1.17 TF/s to 6.96 PF/s
- N=1: 59.7 GF/s to 478 TF/s
- N=500: 0.4 GF/s to 5.9 TF/s
- Labeled systems: Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, IBM BlueGene/L
- Annotations: "My Laptop"; 6-8 years

30th Edition: The TOP10
Rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Cores
1 | IBM | Blue Gene/L, eServer Blue Gene, Dual Core .7 GHz (Custom) | 478 | DOE, Lawrence Livermore Nat Lab | USA | 2007 | 212,992
2 | IBM | Blue Gene/P, Quad Core .85 GHz (Custom) | 167 | Forschungszentrum Jülich | Germany | 2007 | 65,536
3 | SGI | Altix ICE 8200, Xeon Quad Core 3 GHz (Hybrid) | 127 | SGI/New Mexico Computing Applications Center | USA | 2007 | 14,336
4 | HP | Cluster Platform, Xeon Dual Core 3 GHz (Commod) | 118 | Computational Research Laboratories, TATA SONS | India | 2007 | 14,240
5 | HP | Cluster Platform, Dual Core 2.66 GHz (Commod) | 102.8 | Government Agency | Sweden | 2007 | 13,728
6 | Cray | Opteron, Dual Core 2.4 GHz (Hybrid) | 102.2 | DOE, Sandia Nat Lab | USA | 2007 | 26,569
7 | Cray | Opteron, Dual Core 2.6 GHz (Hybrid) | 101.7 | DOE, Oak Ridge National Lab | USA | 2006 | 23,016
8 | IBM | eServer Blue Gene/L, Dual Core .7 GHz (Custom) | 91.2 | IBM Thomas J. Watson Research Center | USA | 2005 | 40,960
9 | Cray | Opteron, Dual Core 2.6 GHz (Hybrid) | 85.4 | DOE, Lawrence Berkeley Nat Lab | USA | 2006 | 19,320
10 | IBM | eServer Blue Gene/L, Dual Core .7 GHz (Custom) | 82.1 | Stony Brook/BNL, NY Center for Computational Sciences | USA | 2006 | 36,864

IBM BlueGene/L #1: 212,992 Cores
2.6 MWatts (2600 homes); 70,000 ops/s/person

[Diagram: BlueGene/L packaging hierarchy, from the BlueGene/L Compute ASIC up to the full system]
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB (cache)
- Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
- Node Board (32 chips, 4x4x2; 16 Compute Cards): 64 processors, 90/180 GF/s, 16 GB DDR
- Rack (32 Node boards, 8x8x16): 2048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
- System (104 racks, 104x32x32): 212,992 procs, 298/596 TF/s, 32 TB DDR
Full system total of 212,992 cores.

The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).

"Fastest Computer": BG/L, 700 MHz, 213K proc, 104 racks
- Peak: 596 Tflop/s
- Linpack: 498 Tflop/s (20.7K sec, about 5.7 hours; n=2.5M), 84% of peak

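The peak and LINPACK figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: the core count, the 2.8 GF/s per-processor peak, the run time, and the problem size are taken from the slide, while the operation count 2/3 n^3 + 2 n^2 is the usual LINPACK convention and is an assumption here (the slide does not state it). All names are hypothetical.

```python
# Rough cross-check of the BlueGene/L numbers quoted on the slide.
CORES = 212_992                 # from the slide
PEAK_GFLOPS_PER_CORE = 2.8      # from the slide (chip: 2.8/5.6 GF/s for 1/2 processors)
LINPACK_TFLOPS = 498.0          # from the slide
RUN_SECONDS = 20.7e3            # from the slide
N = 2.5e6                       # matrix dimension, rounded on the slide


def peak_tflops(cores, gflops_per_core):
    """Aggregate peak in Tflop/s."""
    return cores * gflops_per_core / 1000.0


def linpack_flops(n):
    """Assumed LINPACK operation count for a dense n x n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


peak = peak_tflops(CORES, PEAK_GFLOPS_PER_CORE)
efficiency = LINPACK_TFLOPS / peak
implied_tflops = linpack_flops(N) / RUN_SECONDS / 1e12
run_hours = RUN_SECONDS / 3600.0

if __name__ == "__main__":
    print(f"peak           : {peak:.1f} Tflop/s")
    print(f"efficiency     : {efficiency:.1%}")
    print(f"implied rate   : {implied_tflops:.0f} Tflop/s")
    print(f"run time       : {run_hours:.2f} hours")
```

With the rounded n = 2.5M the implied rate comes out near 500 Tflop/s, which is consistent with the quoted Linpack figure to within the rounding of n and the run time.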
Cores per System – November 2007
[Figure: histogram of number of systems by core count, bins 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k. Inset: TOP500 total cores by list release date, axis up to 1,800,000]

Top500 Systems November 2007
[Figure: Rmax (Tflop/s) by rank, 1 to 500]
- Rank 1: 478 Tflop/s
- 7 systems > 100 Tflop/s
- 21 systems > 50 Tflop/s
- 149 systems > 10 Tflop/s
- Rank 500: 5.9 Tflop/s

Chips Used in Each of the 500 Systems
72% Intel, 16% AMD, 12% IBM
[Pie chart]
- Intel EM64T: 65%
- AMD x86_64: 16%
- IBM Power: 12%
- Intel IA-64: 4%
- Intel IA-32: 3%
- Sun Sparc, NEC, Cray, HP Alpha, HP PA-RISC: 0% each

Interconnects / Systems
[Figure: number of systems by interconnect family, 1993 to 2007. Legend: Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband (121), Myrinet (18), Gigabit Ethernet (270), N/A]
GigE + Infiniband + Myrinet = 82%

Top500 by Usage
[Pie chart. Legend: Industry, Research, Academic, Government, Vendor, Classified. Slice labels (systems, share): 287, 57%; 101, 20%; 86, 17%; 15, 3%; 8, 2%; 3, 1%]

Countries / Performance (Nov 2007)
[Pie chart of performance share by country. Slice labels: 60%, 7.7%, 7.4%, 4.2%, 3.2%, 2.8%, 2.7%]

Power is an Industry Wide Problem
♦ Google facilities
  - leveraging hydroelectric power
  - old aluminum plants
  - >500,000 servers worldwide
"Hiding in Plain Sight, Google Seeks More Power", by John Markoff, June 14, 2006
[Photo: New Google plant in The Dalles, Oregon, from NYT, June 14, 2006]

Gflop/KWatt in the Top 20
[Figure: bar chart of Gflop/KWatt for the top 20 systems, y axis 0 to 350]

Green500

Performance Projection
[Figure: TOP500 performance extrapolated from 1993 to 2015, log scale from 100 Mflop/s to 1 Eflop/s. Curves: SUM, N=1, N=500. Annotations: 6-8 years, 8-10 years]
Source: 30th List / November 2007, www.top500.org

Los Alamos Roadrunner: A Petascale System in 2008
• ≈ 13,000 Cell HPC chips
  - ≈ 1.33 PetaFlop/s (from Cell)
• ≈ 7,000 dual-core Opterons
• "Connected Unit" cluster: 192 Opteron nodes (180 w/ 2 dual-Cell blades connected w/ 4 PCIe x8 links)
• ~18 clusters
• 2nd stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
• 2nd stage InfiniBand interconnect (8 switches)
• Based on the 100 Gflop/s (DP) Cell chip
• Approval by DOE 12/07
• First CU being built today
• Expect a May Pflop/s run
• Full system to LANL in December 2008

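The approximate chip counts on this slide follow from the Connected Unit layout. The sketch below is a rough check, not an official configuration: it takes 18 Connected Units, 192 Opteron nodes per unit, 180 of them with two dual-Cell blades, and 100 Gflop/s (DP) per Cell from the slide, and it assumes two Opteron sockets per node, which the slide does not state. All names are hypothetical.

```python
# Rough consistency check of the Roadrunner chip counts.
CONNECTED_UNITS = 18            # "~18 clusters" on the slide
OPTERON_NODES_PER_CU = 192      # from the slide
CELL_NODES_PER_CU = 180         # nodes with 2 dual-Cell blades, from the slide
BLADES_PER_NODE = 2             # from the slide
CELLS_PER_BLADE = 2             # "dual-Cell" blade
OPTERONS_PER_NODE = 2           # ASSUMPTION: two sockets per Opteron node
CELL_GFLOPS_DP = 100.0          # from the slide (rounded)

cell_chips = CONNECTED_UNITS * CELL_NODES_PER_CU * BLADES_PER_NODE * CELLS_PER_BLADE
opterons = CONNECTED_UNITS * OPTERON_NODES_PER_CU * OPTERONS_PER_NODE
cell_pflops = cell_chips * CELL_GFLOPS_DP / 1e6

if __name__ == "__main__":
    print(f"Cell chips    : {cell_chips}")
    print(f"Opteron chips : {opterons}")
    print(f"Cell peak     : {cell_pflops:.3f} Pflop/s")
```

Under these assumptions the counts land close to the slide's ≈ 13,000 Cell chips and ≈ 7,000 Opterons, and the Cell peak is close to the quoted ≈ 1.33 PetaFlop/s given the rounded per-chip rate.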
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor
[Diagram: "Increase Clock Rate & Transistor Density" and "Lower Voltage"; a single core with cache evolving into chips with two cores (C1, C2) and four cores (C1 to C4) sharing cache]
• We have seen increasing number of gates on a chip and increasing clock speed.
• Heat becoming an unmanageable problem, Intel Processors > 100 Watts
• We will not see the dramatic increases in clock speeds in the future.
• However, the number of gates on a chip will continue to increase.

Power Cost of Frequency
• Power ∝ Voltage² × Frequency (V²F)
• Frequency ∝ Voltage
• Power ∝ Frequency³

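The cubic relation is what makes several slower cores attractive compared with one faster core. The following is a minimal sketch under the slide's idealized model (Power ∝ Frequency³ per core); it additionally assumes perfect parallel scaling across cores, which real applications rarely achieve. The function names are hypothetical.

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at nominal frequency, assuming P ~ f^3 per core."""
    return cores * freq_scale**3


def relative_throughput(freq_scale, cores=1):
    """Throughput relative to one core at nominal frequency.

    ASSUMPTION: work scales perfectly across cores and linearly with frequency.
    """
    return cores * freq_scale


if __name__ == "__main__":
    # One core overclocked by 20% versus two cores underclocked by 20%.
    for label, scale, cores in [("1 core @ 1.2f", 1.2, 1), ("2 cores @ 0.8f", 0.8, 2)]:
        print(f"{label}: power x{relative_power(scale, cores):.3f}, "
              f"throughput x{relative_throughput(scale, cores):.2f}")
```

In this model a 20% frequency increase costs roughly 73% more power for 20% more throughput, while two cores at 80% frequency deliver 60% more throughput for about the same power as the original core.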
What's Next?
[Diagram of possible chip designs]
- All Large Core
- Mixed Large and Small Core
- Many Small Cores
- All Small Core
- Many Floating-Point Cores + 3D Stacked Memory (SRAM)
Different Classes of Chips: Home, Games / Graphics, Business, Scientific

80 Core
• Intel's 80 Core chip
  - 1 Tflop/s
  - 62 Watts
  - 1.2 TB/s internal BW

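These three numbers can be turned into the Gflop/KWatt metric used earlier in the talk and into a bytes-per-flop ratio. The sketch below uses only figures quoted on the slides (1 Tflop/s, 62 Watts, 1.2 TB/s for the 80-core chip; 498 Tflop/s and 2.6 MWatts for BlueGene/L). The two results are not directly comparable, since one is a single research chip and the other a full installed system; the helper names are hypothetical.

```python
def gflops_per_kwatt(gflops, watts):
    """Energy efficiency in Gflop/s per kilowatt."""
    return gflops / (watts / 1000.0)


def bytes_per_flop(bytes_per_second, flops_per_second):
    """Internal bandwidth available per floating-point operation."""
    return bytes_per_second / flops_per_second


# Intel 80-core chip, figures from the slide.
chip_efficiency = gflops_per_kwatt(gflops=1000.0, watts=62.0)
chip_balance = bytes_per_flop(1.2e12, 1.0e12)

# BlueGene/L full system, figures from the earlier slide.
bgl_efficiency = gflops_per_kwatt(gflops=498_000.0, watts=2.6e6)

if __name__ == "__main__":
    print(f"80-core chip : {chip_efficiency:.0f} Gflop/KWatt, {chip_balance:.1f} bytes/flop")
    print(f"BlueGene/L   : {bgl_efficiency:.0f} Gflop/KWatt")
```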
Major Changes to Software
• Must rethink the design of our software
  - Another disruptive technology
    • Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
• Numerical libraries for example will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this