OSC Statewide Users Group Distinguished Lecture Series and
Ralph Regula School of Computational Science Lecture Series

Supercomputers and Clusters and Grids, Oh My!

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
1/12/2007

Take a Journey Through the World of High Performance Computing

Apologies to Frank Baum, author of "The Wizard of Oz"...

Dorothy: "Do you suppose we'll meet any wild animals?"
Tinman: "We might."
Scarecrow: "Animals that ... that eat straw?"
Tinman: "Some. But mostly lions, and tigers, and bears."
All: "Lions and tigers and bears, oh my! Lions and tigers and bears, oh my!"

Supercomputers and clusters and grids, oh my!
A Growth Factor of a Billion in Performance in a Career

[Chart: peak performance (floating point operations per second, Flop/s) from 1950 to 2010, rising from Scalar machines (EDSAC 1, UNIVAC 1, IBM 7090) through Super Scalar (CDC 6600, CDC 7600, IBM 360/195) and Vector (Cray 1, Cray X-MP, Cray 2, TMC CM-2) to Parallel (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific) and today's Super Scalar/Vector/Parallel systems (IBM BG/L). Annotation: 2X transistors per chip every 1.5 years.]

  1941                    1 Flop/s
  1945                  100
  1949                1,000  (1 KiloFlop/s, KFlop/s)
  1951               10,000
  1961              100,000
  1964            1,000,000  (1 MegaFlop/s, MFlop/s)
  1968           10,000,000
  1975          100,000,000
  1987        1,000,000,000  (1 GigaFlop/s, GFlop/s)
  1992       10,000,000,000
  1993      100,000,000,000
  1997    1,000,000,000,000  (1 TeraFlop/s, TFlop/s)
  2000   10,000,000,000,000
  2005  280,000,000,000,000  (280 TFlop/s)

TOP500: H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem), TPP performance rate
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
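As a concrete illustration of what the Rmax yardstick measures, here is a minimal sketch of a LINPACK-style rate measurement. It uses NumPy's dense solver as a stand-in for the actual HPL benchmark code, and the problem size n is purely illustrative.

```python
import time
import numpy as np

def linpack_rate(n=2000, seed=0):
    """Time the solution of a dense n-by-n system Ax = b and report Flop/s.

    Dense LU factorization plus the triangular solves cost about
    2/3 * n^3 + 2 * n^2 floating-point operations, the operation count
    used by the LINPACK/HPL benchmark.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU factorization + solve
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return flops / elapsed             # achieved Flop/s

print(f"{linpack_rate() / 1e9:.2f} GFlop/s")
```

The real benchmark scales n until memory is nearly full, which is why top systems report runs lasting hours rather than seconds.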
Performance Development, TOP500

[Chart, 1993-2006: TOP500 performance over time. The summed performance (SUM) grows from 1.167 TF/s to 3.54 PF/s; the #1 system (N=1) grows from 59.7 GF/s in 1993, with Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, and the NEC Earth Simulator marking successive leaders, to 280.6 TF/s (IBM BlueGene/L); the #500 entry (N=500) grows from 0.4 GF/s to 2.74 TF/s, trailing N=1 by roughly 6-8 years. "My Laptop" is marked near the bottom of the range.]

Architecture/Systems Continuum

Tightly coupled to loosely coupled:

♦ Custom processor with custom interconnect ("Custom")
  - Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
  - Best processor performance for codes that are not "cache friendly"
  - Good communication performance
  - Simpler programming model
  - Most expensive

♦ Commodity processor with custom interconnect ("Hybrid")
  - SGI Altix (Intel Itanium 2), Cray XT3 (AMD Opteron)
  - Good communication performance
  - Good scalability

♦ Commodity processor with commodity interconnect ("Commod")
  - Clusters: Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, Quadrics
  - NEC TX7, IBM eServer, Dawning
  - Best price/performance (for codes that work well with caches and are latency tolerant)
  - More complex programming model
  - (a simple communication-cost sketch follows below)

[Chart, Jun-93 to Jun-04: share of TOP500 systems (0-100%) in the Custom, Hybrid, and Commodity categories over time.]
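To make the interconnect tradeoffs on the continuum slide concrete, here is a minimal latency-plus-bandwidth (alpha-beta) cost sketch. The latency and bandwidth figures are illustrative assumptions, not measurements of the machines named above.

```python
# Alpha-beta communication cost: time = startup latency + bytes / bandwidth.
# Numbers below are rough, illustrative values for this sketch only.

def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Time to deliver one point-to-point message."""
    return latency_s + nbytes / bandwidth_bytes_per_s

interconnects = {
    "Gigabit Ethernet (commodity)":   (50e-6, 0.125e9),  # ~50 us, ~1 Gbit/s
    "Myrinet/Infiniband (commodity)": (5e-6,  1.0e9),    # ~5 us
    "Custom interconnect":            (2e-6,  4.0e9),    # ~2 us, higher BW
}

for size in (1_000, 1_000_000):  # small vs large message
    print(f"message size {size} bytes:")
    for name, (lat, bw) in interconnects.items():
        print(f"  {name:32s} {message_time(size, lat, bw) * 1e6:8.1f} us")
```

Small messages are dominated by startup latency, where custom interconnects shine; very large messages are bandwidth-bound, where commodity networks close much of the gap, which is one reason latency-tolerant codes do well on clusters.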
Processors Used in Each of the 500 Systems

[Pie chart: 92% of systems use Intel (51%), AMD (22%), or IBM (19%) processors. Breakdown: Intel IA-32 22%, Intel EM64T 22%, Intel IA-64 7%, AMD x86_64 22%, IBM Power 19%, HP PA-RISC 4%, Sun Sparc 1%, NEC 1%, HP Alpha 1%, Cray 1%.]

Interconnects / Systems

[Stacked chart, 1993-2006, number of systems (0-500) by interconnect family: Gigabit Ethernet (211), Myrinet (79), Infiniband (78), Quadrics, Crossbar, SP Switch, Cray Interconnect, Others, N/A. GigE + Infiniband + Myrinet = 74% of systems.]
Processors per System, Nov 2006

[Histogram: number of systems by processor count, in bins 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k; vertical axis 0-200 systems.]

28th List: The TOP10

 #  Manufacturer  Computer                             Rmax [TF/s]  Installation Site                     Country  Year  #Proc    Arch
 1  IBM           BlueGene/L, eServer Blue Gene           280.6     DOE/NNSA/LLNL                         USA      2005  131,072  Custom
 2  Sandia/Cray   Red Storm, Cray XT3                     101.4     NNSA/Sandia                           USA      2006   26,544  Hybrid
 3  IBM           BGW, eServer Blue Gene                   91.29    IBM Thomas Watson                     USA      2005   40,960  Custom
 4  IBM           ASC Purple, eServer pSeries p575         75.76    DOE/NNSA/LLNL                         USA      2005   12,208  Custom
 5  IBM           MareNostrum, JS21 Cluster, Myrinet       62.63    Barcelona Supercomputer Center        Spain    2006   12,240  Commod
 6  Dell          Thunderbird, PowerEdge 1850, IB          53.00    NNSA/Sandia                           USA      2005    9,024  Commod
 7  Bull          Tera-10, NovaScale 5160, Quadrics        52.84    CEA                                   France   2006    9,968  Commod
 8  SGI           Columbia, Altix, Infiniband              51.87    NASA Ames                             USA      2004   10,160  Hybrid
 9  NEC/Sun       Tsubame, Fire x4600, ClearSpeed, IB      47.38    GSIC / Tokyo Institute of Technology  Japan    2006   11,088  Commod
10  Cray          Jaguar, Cray XT3                         43.48    ORNL                                  USA      2006   10,424  Hybrid
IBM BlueGene/L, #1
131,072 Processors

- A total of 18 BlueGene systems, all in the Top100
- 1.6 MWatts (the power of 1600 homes); roughly 43,000 ops/s/person
- "Fastest Computer": BG/L, 700 MHz, 131K processors, 64 racks
  Peak: 367 Tflop/s; Linpack: 281 Tflop/s, 77% of peak (n = 1.8M; 13K seconds, about 3.6 hours)

System packaging hierarchy:
- Chip: 2 processors, 2.8/5.6 GF/s, 4 MB (cache)
- Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
- Node Board (32 chips, 4x4x2; 16 compute cards): 64 processors, 90/180 GF/s, 16 GB DDR
- Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
- Full system (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR

The BlueGene/L compute node ASICs include all networking and processor functionality. Each compute ASIC contains two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).

Performance Projection

[Chart, 1993-2015: projection of the TOP500 SUM, N=1, and N=500 curves from 100 Mflop/s up toward 1 Eflop/s, with lags of roughly 6-8 and 8-10 years annotated between the curves.]
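The BlueGene/L Linpack figures quoted above can be sanity-checked with the conventional dense-LU operation count. This is a back-of-the-envelope sketch, not part of the original slides.

```python
# Back-of-the-envelope check of the BlueGene/L Linpack numbers quoted above,
# using the conventional dense LU operation count of roughly 2/3 * n^3 flops.

n = 1.8e6            # Linpack problem size quoted on the slide
rmax = 281e12        # achieved Linpack rate, flop/s
rpeak = 367e12       # theoretical peak, flop/s

flops = (2.0 / 3.0) * n ** 3        # ~3.9e18 floating-point operations
seconds = flops / rmax              # ~13,800 s, in the ballpark of the ~13K s quoted

print(f"run time   ~ {seconds:,.0f} s (~{seconds / 3600:.1f} hours)")
print(f"efficiency ~ {rmax / rpeak:.0%} of peak")
```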
A PetaFlop Computer by the End of the Decade

♦ Many efforts are working on building a Petaflop system by the end of the decade:
  - Cray, IBM, Sun: targets of 2+ Pflop/s Linpack, 6.5 PB/s data streaming bandwidth, 3.2 PB/s bisection bandwidth, 64,000 GUPS
  - Dawning, Galactic, Lenovo: Chinese companies
  - Hitachi, NEC, Fujitsu: Japanese vendors; "Life Simulator" (10 Pflop/s)
  - Bull

Increasing the number of gates into a tight knot and decreasing the cycle time of the processor

- Increase clock rate and transistor density; lower voltage.
- We have seen increasing numbers of gates on a chip and increasing clock speeds.
- Heat is becoming an unmanageable problem; Intel processors now dissipate more than 100 Watts.
- We will not see the dramatic increases in clock speed in the future.
- However, the number of gates on a chip will continue to increase.

[Diagram: a single core with cache evolving into multicore chips with 2, 4, and 8 cores (C1-C4) sharing cache on one die.]
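A rule-of-thumb sketch of why the clock-rate curve flattened: dynamic CMOS power scales roughly as C V^2 f, and raising frequency historically also meant raising voltage, so a faster clock costs far more power than an extra core. The constants below are illustrative, not data for any particular chip.

```python
# Rule-of-thumb sketch: dynamic switching power P ~ C * V^2 * f.
# All constants are illustrative assumptions for this sketch only.

def dynamic_power(cap, voltage, freq):
    """Approximate dynamic switching power: P ~ C * V^2 * f."""
    return cap * voltage ** 2 * freq

base = dynamic_power(cap=1.0, voltage=1.0, freq=3e9)      # a 3 GHz core

# Doubling the clock, assuming voltage must rise ~20% to sustain it:
faster = dynamic_power(cap=1.0, voltage=1.2, freq=6e9)
print(f"2x clock -> ~{faster / base:.1f}x power, ~2x serial speed")

# Doubling cores at the same clock and voltage instead:
dual = 2 * base
print(f"2x cores -> ~{dual / base:.1f}x power, ~2x throughput (if the code is parallel)")
```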
Free Lunch / No Free Lunch (from Craig Mundie, Microsoft)

♦ Free lunch for traditional software: a serial code simply ran about twice as fast every 18 months with no change to the code, as a single core climbed from 3 to 6 to 12 to 24 GHz on the chart.
♦ No free lunch for traditional software: without highly concurrent software it won't get any faster. The added operations per second now come from more cores at a roughly constant clock (2, 4, and 8 cores at 3 GHz), and only code that can exploit that concurrency benefits.

[Chart: operations per second for serial code (one core at increasing clock rates) versus the additional operations per second available only if the code can take advantage of concurrency (2, 4, 8 cores at 3 GHz).]

[Slide: 1.2 TB/s memory BW; http://www.pcper.com/article.php?aid=302]
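A tiny model of the Mundie chart above, contrasting what a purely serial code sees with what fully concurrent code can use as chips add cores at a roughly constant clock; the core counts and clock rate follow the chart's illustrative progression.

```python
# Serial vs concurrent throughput as core counts grow at a fixed clock.
# Assumes one operation per cycle per core, purely for illustration.

chip_generations = [(1, 3.0), (2, 3.0), (4, 3.0), (8, 3.0)]  # (cores, GHz)
ops_per_cycle = 1.0

for cores, ghz in chip_generations:
    serial = ghz * 1e9 * ops_per_cycle              # only one core helps serial code
    concurrent = cores * ghz * 1e9 * ops_per_cycle  # ideal fully parallel code
    print(f"{cores} core(s) @ {ghz} GHz: serial {serial / 1e9:.0f} Gop/s, "
          f"concurrent {concurrent / 1e9:.0f} Gop/s")
```

Serial throughput stays flat across generations while the concurrent ceiling grows with the core count, which is the "no free lunch" point of the slide.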