High-Performance Computing Today
Jack Dongarra
Innovative Computing Laboratory
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/~dongarra

Outline
- Look at trends in HPC
  - Top500 statistics
- Performance of super-scalar processors
  - ATLAS
- Performance monitoring
  - PAPI
- NetSolve
  - Example of grid middleware

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." -- Grace Hopper
Technology Trends: Microprocessor Capacity

Moore's Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months -- about 2x transistors per chip every 1.5 years. Microprocessors have become smaller, denser, and more powerful.

High Performance Computers & Numerical Libraries
- 20 years ago: 1x10^6 floating-point ops/sec (Mflop/s)
  - Scalar based
  - Loop unrolling (see the sketch after this list)
- 10 years ago: 1x10^9 floating-point ops/sec (Gflop/s)
  - Vector & shared-memory computing, bandwidth aware
  - Block partitioned, latency tolerant
- Today: 1x10^12 floating-point ops/sec (Tflop/s)
  - Highly parallel, distributed processing, message passing, network based
  - Data decomposition, communication/computation
- 10 years away: 1x10^15 floating-point ops/sec (Pflop/s)
  - Many more levels of memory hierarchy, combination of grids & HPC
  - More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
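To make the scalar-era "loop unrolling" bullet concrete, here is a minimal sketch (the function names are ours, not from the talk): unrolling a dot product by four cuts loop overhead and exposes independent multiply-adds for the compiler to schedule.

```c
/* Plain dot product: one multiply-add and one branch per iteration. */
double dot_plain(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 4: four independent partial sums reduce loop overhead
 * and let the hardware overlap the multiplies and adds. */
double dot_unrolled(const double *x, const double *y, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)   /* remainder loop for n not divisible by 4 */
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}
```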
TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK benchmark -- solving Ax = b for a dense problem (TPP performance rate); a minimal code sketch of such a solve follows the charts below
- Updated twice a year:
  - SC'xy conference in the States in November
  - Meeting in Mannheim, Germany in June
- All data available from www.top500.org

Fastest Computer Over Time
[Chart: fastest system by GFlop/s (0-50 scale), 1990-2000: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048).]
In 1980 a computation that took 1 full year to complete can now be done in 1 month!
Fastest Computer Over Time (continued)
[Chart: fastest system by GFlop/s (0-500 scale), 1990-2000, adding Hitachi CP-PACS (2040), Intel Paragon (6788), Fujitsu VPP-500 (140), TMC CM-5 (1024), and NEC SX-3 (4).]
In 1980 a computation that took 1 full year to complete can now be done in 4 days!

[Chart: fastest system by GFlop/s (0-5000 scale), 1990-2000, adding ASCI White Pacific (7424), Intel ASCI Red (9152/9632 Xeon), IBM ASCI Blue Pacific SST (5808), and SGI ASCI Blue Mountain (5040).]
In 1980 a computation that took 1 full year to complete can today be done in 1 hour!
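The Rmax yardstick behind these charts comes from solving the dense system Ax = b. As a hedged sketch (not the benchmark code itself, which uses its own tuned solver), the same kind of solve can be expressed with the standard LAPACK routine dgesv, which does an LU factorization with partial pivoting:

```c
#include <stdio.h>

/* Standard LAPACK Fortran entry point for a double-precision solve. */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void) {
    int n = 3, nrhs = 1, lda = 3, ldb = 3, info;
    int ipiv[3];
    /* A stored column-major, as LAPACK expects; values are illustrative. */
    double a[9] = { 4, 1, 2,   1, 5, 3,   2, 3, 6 };
    double b[3] = { 1, 2, 3 };   /* right-hand side; overwritten with x */

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
    if (info == 0)
        printf("x = [%g, %g, %g]\n", b[0], b[1], b[2]);
    else
        printf("dgesv failed: info = %d\n", info);
    return 0;
}
```

The benchmark scales this to very large n and reports the achieved rate in flop/s, which becomes the system's Rmax.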
Top 10 Machines (June 2000)

Rank  Company    Machine                            Procs  Gflop/s  Place                                                   Country  Year
1     Intel      ASCI Red                            9632     2380  Sandia National Labs, Albuquerque                       USA      1999
2     IBM        ASCI Blue-Pacific SST (IBM SP 604e) 5808     2144  Lawrence Livermore National Laboratory, Livermore       USA      1999
3     SGI        ASCI Blue Mountain                  6144     1608  Los Alamos National Laboratory, Los Alamos              USA      1998
4     Hitachi    SR8000-F1/112                        112     1035  Leibniz Rechenzentrum, Muenchen                         Germany  2000
5     Hitachi    SR8000-F1/100                        100      917  High Energy Accelerator Research Org./KEK, Tsukuba      Japan    2000
6     Cray Inc.  T3E1200                             1084      892  Government                                              USA      1998
7     Cray Inc.  T3E1200                             1084      892  US Army HPC Research Center at NCS, Minneapolis         USA      2000
8     Hitachi    SR8000/128                           128      874  University of Tokyo, Tokyo                              Japan    1999
9     Cray Inc.  T3E900                              1324      815  Government                                              USA      1997
10    IBM        SP Power3 375 MHz                   1336      723  Naval Oceanographic Office (NAVOCEANO), Poughkeepsie    USA      2000

Performance Development
[Chart: Top500 aggregate performance, Jun-93 to Jun-00, log scale. The sum of all 500 systems reached 64.3 TF/s; N=1 (Intel ASCI Red at Sandia, preceded by Fujitsu 'NWT' NAL, Intel XP/S140 Sandia, and Hitachi/Tsukuba CP-PACS/2048) reached 2.38 TF/s. N=500 entries over time included Cray Y-MP M94/4, Cray C94/364, SNI VP200EX, SGI Power Challenge, Sun HPC 10000/HPC 1000, and a 69-processor IBM 604e cluster, at sites such as KFA Jülich, Uni Dresden, Goodyear, 'EPA' USA, Merrill Lynch, News International, and Nabisco.]
Speaker notes: the list spanned roughly [60 Gflop/s - 400 Mflop/s] at the start and [2.4 Tflop/s - 40 Gflop/s] by June 2000; Schwab appears at #19; about 1/2 of the list turns over each year; 133 systems exceed 100 Gflop/s -- faster than Moore's law.
Performance Development (projected)
[Chart: Top500 performance extrapolated from Jun-93 through Nov-09 on a log scale (0.1 GFlop/s to 1 PFlop/s), showing the Sum, N=1, and N=500 trend lines, the ASCI and Earth Simulator targets, and "My Laptop" for reference.]
Extrapolation: entry at 1 Tflop/s around 2005, and 1 Pflop/s around 2010.

Architectures
[Chart: Top500 systems by architecture class, Jun-93 to Jun-00: single processor, SMP, MPP, SIMD, constellation, cluster.]
As of June 2000: 91 constellations, 14 clusters, 275 MPPs, and 120 SMPs.
Chip Technology
[Chart: Top500 systems by processor family, Jun-93 to Jun-00: Alpha, Intel, IBM, HP, MIPS, SUN, Inmos Transputer, and other.]

High-Performance Computing Directions: Beowulf-Class PC Clusters
Definition:
- COTS PC nodes (Pentium, Alpha, PowerPC; SMP)
- COTS LAN/SAN interconnect (Ethernet, Myrinet, Giganet, ATM)
- Open-source Unix (Linux, BSD)
- Message-passing computing (MPI, PVM, HPF) -- see the sketch after this list
Advantages:
- Best price/performance
- Low entry-level cost
- Just-in-place configuration
- Vendor invulnerable
- Scalable
- Rapid technology tracking
Enabled by PC hardware, networks, and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message-passing libraries.
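As a minimal illustration of the message-passing model that ties a Beowulf cluster together (our own example, not code from the talk), each node runs one copy of the program below and rank 0 collects a value from every other rank using only standard MPI-1 calls:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many nodes total? */

    if (rank != 0) {
        /* Worker nodes send their rank to node 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("running on %d nodes\n", size);
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("node 0 received %d from node %d\n", value, src);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Launched with, e.g., `mpirun -np 8 ./a.out`, the same binary runs on every COTS node; the only coordination is explicit messages over the commodity interconnect.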
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)
[Chart: relative performance, 1980-2000, log scale. Processor performance ("Moore's Law") grows ~60%/yr (2x every 1.5 years) while DRAM grows ~9%/yr (2x every 10 years), so the processor-memory performance gap grows about 50% per year.]

Optimizing Computation and Memory Use
- Computational optimizations
  - Theoretical peak: (# FPUs) x (flops/cycle) x MHz
    - PIII: (1 FPU) x (1 flop/cycle) x (650 MHz) = 650 Mflop/s
    - Athlon: (2 FPUs) x (1 flop/cycle) x (600 MHz) = 1200 Mflop/s
    - Power3: (2 FPUs) x (2 flops/cycle) x (375 MHz) = 1500 Mflop/s
- Memory optimizations
  - Theoretical peak: (bus width) x (bus speed)
    - PIII: (32 bits) x (133 MHz) = 532 MB/s = 66.5 MW/s
    - Athlon: (64 bits) x (200 MHz) = 1600 MB/s = 200 MW/s
    - Power3: (128 bits) x (100 MHz) = 1600 MB/s = 200 MW/s
- Memory is about an order of magnitude slower than computation
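To make the gap concrete, the peak numbers above can be combined into a machine balance: how many flops the processor can issue per 8-byte word the bus can deliver. This is our own illustrative calculation, using only the figures on the slide:

```c
#include <stdio.h>

int main(void) {
    /* Peak Mflop/s and peak Mword/s taken directly from the slide. */
    struct { const char *name; double mflops, mwords; } m[] = {
        { "PIII",   650.0,  66.5 },  /* 1 FPU x 1 flop/cyc x 650 MHz; 32-bit bus x 133 MHz */
        { "Athlon", 1200.0, 200.0 }, /* 2 FPUs x 1 flop/cyc x 600 MHz; 64-bit bus x 200 MHz */
        { "Power3", 1500.0, 200.0 }, /* 2 FPUs x 2 flops/cyc x 375 MHz; 128-bit bus x 100 MHz */
    };
    for (int i = 0; i < 3; i++)
        printf("%-7s balance = %4.1f flops per word loaded\n",
               m[i].name, m[i].mflops / m[i].mwords);
    return 0;
}
```

A code that does fewer flops per word than this balance (like a simple vector update) runs at memory speed, not processor speed; hence the roughly order-of-magnitude penalty noted above.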
Memory Hierarchy
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

[Diagram: registers (100s of bytes, ~1 ns) -> on-chip cache (SRAM, ~100 KB, ~10 ns) -> level-2/3 cache -> main memory (DRAM, MBs, ~100 ns) -> distributed memory / remote cluster memory -> secondary storage (disk, GBs, ~10 ms) -> tertiary storage (disk/tape, TBs, ~10 s).]

How To Get Performance From Commodity Processors?
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Hardware and software have a large design space with many parameters:
  - Blocking sizes, loop-nesting permutations, loop-unrolling depths, software-pipelining strategies, register allocations, and instruction schedules (a sketch of cache blocking follows this slide).
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- Until recently, there was no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines.
- ATLAS -- Automatically Tuned Linear Algebra Software
  - Related efforts: PHiPAC from Berkeley, FFTW from MIT (http://www.fftw.org)
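Cache blocking is the first tuning parameter named above, and it follows directly from the locality principle. A hedged sketch (BLOCK = 64 is an illustrative choice; ATLAS picks the block size empirically per machine):

```c
#define BLOCK 64

/* C += A * B for n x n row-major matrices; for simplicity this
 * sketch assumes n is a multiple of BLOCK. The blocked loops reuse
 * a BLOCK x BLOCK tile of each operand while it is still in cache,
 * instead of streaming whole rows through memory on every pass. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                /* Multiply one tile: all three tiles fit in cache. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

A production library layers the other listed parameters (unrolling depth, loop order, register blocking, instruction schedule) on top of this structure, which is exactly the search space ATLAS explores automatically.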
ATLAS
- An adaptive software architecture
  - High performance
  - Portability
  - Elegance
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with the machine-specific libraries provided by the vendor.

ATLAS Across Various Architectures (DGEMM, n = 500)
[Chart: Mflop/s for vendor BLAS vs. ATLAS BLAS vs. reference Fortran 77 BLAS on DEC ev6-500, AMD Athlon-600, DEC ev56-533, Pentium III-550, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200. ATLAS tracks the vendor libraries closely and far exceeds the reference Fortran 77 BLAS.]
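For reference, this is how the DGEMM measured above is invoked through the C BLAS interface that ATLAS provides (cblas_dgemm); the matrix contents and link flags here are illustrative assumptions, not details from the talk:

```c
#include <cblas.h>
#include <stdlib.h>

int main(void) {
    int n = 500;   /* same problem size as the chart */
    double *A = calloc(n * n, sizeof(double));
    double *B = calloc(n * n, sizeof(double));
    double *C = calloc(n * n, sizeof(double));

    for (int i = 0; i < n; i++)   /* fill A and B as identity matrices */
        A[i * n + i] = B[i * n + i] = 1.0;

    /* C = 1.0 * A * B + 0.0 * C, row-major storage, no transposes.
     * Link with, e.g., -lcblas -latlas. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```

The call is identical on every platform in the chart; only the tuned implementation behind it changes, which is the point of a portable, auto-tuned BLAS.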