
An Overview of High Performance Computing and Challenges for the Future - PowerPoint PPT Presentation



  1. An Overview of High Performance Computing and Challenges for the Future. Jack Dongarra, Innovative Computing Laboratory, University of Tennessee; Oak Ridge National Laboratory; University of Manchester. 7/7/2008

  2. Overview • Quick look at High Performance Computing - Top500 • Challenges for Math Software - Linear Algebra Software for Multicore and Beyond

  3. H. Meuer, H. Simon, E. Strohmaier, & JD - Listing of the 500 most powerful computers in the world - Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem; TPP performance, rate vs. size) - Updated twice a year: SC'xy in the States in November, meeting in Germany in June - All data available from www.top500.org
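
As a concrete illustration of the yardstick (not the actual HPL benchmark code), the sketch below solves a dense Ax=b with an LU-based LAPACK routine and reports a flop rate using the standard 2/3·n³ + 2·n² operation count. The problem size N, the use of LAPACKE, and the timing approach are assumptions for the example.

```c
/* Sketch of the Top500 yardstick: solve dense Ax = b and report Gflop/s.
 * Assumes a LAPACKE installation; N is illustrative (HPL uses much larger n). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

#define N 2000   /* problem size (assumed) */

int main(void)
{
    double *A = malloc(sizeof *A * N * N);
    double *b = malloc(sizeof *b * N);
    lapack_int *ipiv = malloc(sizeof *ipiv * N);
    if (!A || !b || !ipiv) return 1;

    srand(1);
    for (size_t i = 0; i < (size_t)N * N; i++) A[i] = rand() / (double)RAND_MAX;
    for (size_t i = 0; i < N; i++)             b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* LU factorization with partial pivoting plus triangular solves */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, N, 1, A, N, ipiv, b, 1);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 / 3.0 * (double)N * N * N + 2.0 * (double)N * N;

    if (info == 0)
        printf("n = %d  time = %.3f s  rate = %.2f Gflop/s\n",
               N, secs, flops / secs / 1e9);
    free(A); free(b); free(ipiv);
    return 0;
}
```

Compile with something like `cc -O2 hpl_sketch.c -llapacke -llapack -lblas` (library names vary by system). HPL itself runs much larger, distributed-memory problems and tunes block sizes, but the measured quantity is the same Rmax-style rate.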

  4. Performance Development, 1993-2008 (chart): log-scale plot of the Top500 SUM, #1, and #500 performance, annotated with #1 milestones (Fujitsu 'NWT' 59.7 GF/s, Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, IBM BlueGene/L, IBM Roadrunner 1.02 PF/s), a SUM of 11.7 PF/s in 2008, marks at 1.17 TF/s, 9.0 TF/s, and 0.4 GF/s, 'My Laptop' for scale, and a roughly 6-8 year lag between #1 and #500 performance.

  5. Performance Development & Projections (chart): the Top500 SUM, N=1, and N=500 trends extrapolated forward on a log scale running from 1 Mflop/s up to 10 Eflop/s.

  6. Performance Development & Projections (chart, annotated): the same extrapolation, marking the time to solve a fixed problem at each rate (~1000 years at 1 Gflop/s, ~1 year at 1 Tflop/s, ~8 hours at 1 Pflop/s, ~1 min at 1 Eflop/s) and the concurrency each milestone demands: Cray 2 at 1 Gflop/s with O(1) thread, ASCI Red at 1 Tflop/s with O(10^3) threads, Roadrunner at 1 Pflop/s with O(10^6) threads, and a future exaflop system at 1 Eflop/s with O(10^9) threads.
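
The four time annotations follow from simple arithmetic once a fixed workload is assumed. The sketch below takes a hypothetical workload of about 3×10^19 floating-point operations (illustrative; chosen so that the 1 Gflop/s case lands near the slide's ~1000 years) and divides it by each milestone rate.

```latex
% Time to finish a fixed workload W at rate R: t = W / R.
\[
  t = \frac{W}{R}, \qquad W \approx 3\times 10^{19}\ \text{flop (assumed)}
\]
\[
  \begin{aligned}
  R = 10^{9}\ \text{flop/s (1 Gflop/s)}:&\quad t \approx 3\times 10^{10}\ \text{s} \approx 950\ \text{years}\\
  R = 10^{12}\ \text{flop/s (1 Tflop/s)}:&\quad t \approx 3\times 10^{7}\ \text{s} \approx 1\ \text{year}\\
  R = 10^{15}\ \text{flop/s (1 Pflop/s)}:&\quad t \approx 3\times 10^{4}\ \text{s} \approx 8.3\ \text{hours}\\
  R = 10^{18}\ \text{flop/s (1 Eflop/s)}:&\quad t \approx 30\ \text{s} \approx 0.5\ \text{min}
  \end{aligned}
\]
```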

  7. LANL Roadrunner - A Petascale System in 2008: a "Connected Unit" cluster of 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links), with 17 such clusters joined by a second-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches). In total roughly 13,000 Cell HPC chips delivering about 1.33 PetaFlop/s (from the Cells), roughly 7,000 dual-core Opterons, and about 122,000 cores, with a Cell chip for each Opteron core. Based on the 100 Gflop/s (DP) Cell chip. A hybrid design (2 kinds of chips and 3 kinds of cores); programming is required at 3 levels.

  8. Top10 of the June 2008 List

     Rank | Computer                                 | Installation Site                         | Country | #Cores  | Rmax [TF/s] | Rmax/Rpeak | Power [MW] | MFlops/Watt
     1    | IBM / Roadrunner, BladeCenter QS22/LS21  | DOE/NNSA/LANL                             | USA     | 122,400 | 1,026       | 75%        | 2.35       | 437
     2    | IBM / BlueGene/L, eServer Blue Gene      | DOE/NNSA/LLNL                             | USA     | 212,992 | 478         | 80%        | 2.33       | 205
     3    | IBM / Intrepid, Blue Gene/P Solution     | DOE/OS/ANL                                | USA     | 163,840 | 450         | 81%        | 1.26       | 357
     4    | SUN / Ranger, SunBlade x6420             | NSF/TACC                                  | USA     | 62,976  | 326         | 65%        | 2.00       | 163
     5    | CRAY / Jaguar, Cray XT4 QuadCore         | DOE/OS/ORNL                               | USA     | 30,976  | 205         | 79%        | 1.58       | 130
     6    | IBM / JUGENE, Blue Gene/P Solution       | Forschungszentrum Juelich (FZJ)           | Germany | 65,536  | 180         | 81%        | 0.50       | 357
     7    | SGI / Encanto, SGI Altix ICE 8200        | New Mexico Computing Applications Center  | USA     | 14,336  | 133.2       | 77%        | 0.86       | 155
     8    | HP / EKA, Cluster Platform 3000 BL460c   | Computational Research Lab, TATA SONS     | India   | 14,384  | 132.8       | 77%        | 1.60       | 83
     9    | IBM / Blue Gene/P Solution               | IDRIS                                     | France  | 40,960  | 112         | 81%        | 0.32       | 357
     10   | SGI / Altix ICE 8200EX                   | Total Exploration Production              | France  | 10,240  | 106         | 86%        | 0.44       | 240

  9. Top10 of the June 2008 List (same table as the previous slide, except that the HP / EKA entry is listed with 0.79 MW and 169 MFlops/Watt rather than 1.60 MW and 83 MFlops/Watt).
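
The MFlops/Watt column is simply Rmax divided by the power draw. As a sanity check using the table's own numbers, the Roadrunner and BlueGene/L entries can be reproduced as follows.

```latex
% Energy efficiency = Rmax / power, with Rmax converted to Mflop/s.
\[
  \text{MFlops/Watt} = \frac{R_{\max}\ [\text{Mflop/s}]}{P\ [\text{W}]}
\]
\[
  \text{Roadrunner: } \frac{1{,}026 \times 10^{6}}{2.35 \times 10^{6}} \approx 437,
  \qquad
  \text{BlueGene/L: } \frac{478 \times 10^{6}}{2.33 \times 10^{6}} \approx 205
\]
```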

  10. ORNL/UTK Computer Power Cost Projections 2007-2012 • Over the next 5 years ORNL/UTK will deploy 2 large petascale systems • Using 4 MW today, going to 15 MW before year end • By 2012 could be using more than 50 MW!! • Cost-per-year estimates are based on $0.07 per kWh and include both DOE and NSF systems • Power becomes the architectural driver for future large systems
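
At the slide's quoted rate of $0.07 per kWh, the annual electricity bill scales directly with the load. The arithmetic below is an illustrative calculation (ignoring cooling overheads and rate changes) of what the 15 MW and 50 MW figures imply.

```latex
% Annual cost = load [kW] x 8760 h/year x price [$/kWh].
\[
  \text{15 MW: } 15{,}000\ \text{kW} \times 8760\ \text{h/yr} \times \$0.07/\text{kWh} \approx \$9.2\ \text{M/yr}
\]
\[
  \text{50 MW: } 50{,}000\ \text{kW} \times 8760\ \text{h/yr} \times \$0.07/\text{kWh} \approx \$30.7\ \text{M/yr}
\]
```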

  11. Something's Happening Here… (chart from K. Olukotun, L. Hammond, H. Sutter, and B. Smith) • In the "old days" each year processors would become faster • Today the clock speed is fixed or getting slower • Things are still doubling every 18-24 months • Moore's Law reinterpreted: the number of cores doubles every 18-24 months • A hardware issue just became a software problem

  12. Multicore • What is multicore? A multicore chip is a single chip (socket) that combines two or more independent processing units that provide independent threads of control • Why multicore? The race for ever higher clock speeds is over. In the old days new chips were faster and applications ran faster on them; today new chips are not faster, they just have more processors per chip, and applications and software must use those extra processors to become faster

  13. Power Cost of Frequency • Power ∝ Voltage² × Frequency (V²F) • Frequency ∝ Voltage • Therefore Power ∝ Frequency³

  14. Power Cost of Frequency (repeat of the previous slide's relations) • Power ∝ Voltage² × Frequency (V²F) • Frequency ∝ Voltage • Power ∝ Frequency³
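
Combining the two proportionalities gives the cubic law, and that law is why multicore wins. As an illustrative consequence (not stated on the slide): two cores run at 80% of the original frequency cost about the same power as one full-speed core while offering up to 1.6x the throughput.

```latex
% Dynamic power model: P \propto V^2 f, and attainable f \propto V, hence P \propto f^3.
\[
  P \propto V^{2} f, \qquad V \propto f \;\Rightarrow\; P \propto f^{3}
\]
% Illustrative trade-off: two cores at 0.8 f versus one core at f.
\[
  \frac{P_{2\,\text{cores}}}{P_{1\,\text{core}}} = 2 \times (0.8)^{3} \approx 1.02,
  \qquad
  \frac{\text{throughput}_{2\,\text{cores}}}{\text{throughput}_{1\,\text{core}}} \le 2 \times 0.8 = 1.6
\]
```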

  15. Today's Multicores: 98% of Top500 systems are based on multicore processors - 282 use quad-core, 204 use dual-core, and 3 use the nona-core IBM Cell (9 cores). Examples: IBM Cell (9 cores), Intel Clovertown (4 cores), Sun Niagara2 (8 cores), SiCortex (6 cores), Intel Polaris (80 cores), AMD Opteron (4 cores), IBM BG/P (4 cores).

  16. And then there's the GPUs: NVIDIA's Tesla T10P • T10P chip: 240 cores at 1.5 GHz; Tpeak 1 Tflop/s (32-bit floating point), 100 Gflop/s (64-bit floating point) • S1070 board: 4 T10P devices; 700 Watts • C1060 card: 1 T10P at 1.33 GHz; 160 Watts; Tpeak 887 Gflop/s (32-bit floating point), 88.7 Gflop/s (64-bit floating point)

  17. What's Next? Multicore to manycore. Different classes of chips for different markets (home, games/graphics, business, scientific): all large cores; mixed large and small cores; many small cores; all small cores; many floating-point cores with 3D-stacked memory (SRAM). The question is not whether this will happen but whether we are ready.

  18. Coding for an Abstract Multicore. Parallel software for multicores should have two characteristics: • Fine granularity: a high level of parallelism is needed, and cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality. • Asynchronicity: as the degree of thread-level parallelism grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of the algorithm.
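
A minimal sketch of what those two characteristics look like in code, assuming OpenMP tasks with dependences as the runtime (the slide does not prescribe one; PLASMA-style tile algorithms use their own schedulers): the matrix is split into small NB x NB tiles (fine granularity), and each tile operation becomes a task whose ordering is driven only by its data dependences, with no global barrier between steps (asynchronicity). The matrix size, tile size, and kernel are illustrative.

```c
/* Sketch (assumed example, not the author's library code): a tiled
 * matrix-matrix multiply expressed as fine-grained, asynchronous tasks. */
#include <stdio.h>
#include <stdlib.h>

enum { N = 1024, NB = 128 };            /* matrix and tile size (illustrative) */

/* C_tile += A_tile * B_tile on NB x NB tiles stored contiguously */
static void tile_gemm(const double *a, const double *b, double *c)
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i * NB + j] += a[i * NB + k] * b[k * NB + j];
}

int main(void)
{
    const int nt = N / NB;              /* tiles per dimension */
    /* tile (it,jt) of M starts at M + (it*nt + jt) * NB*NB */
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 1.0; }

    #pragma omp parallel
    #pragma omp single                  /* one thread creates the task graph */
    for (int it = 0; it < nt; it++)
        for (int jt = 0; jt < nt; jt++)
            for (int kt = 0; kt < nt; kt++) {
                double *a = A + ((size_t)it * nt + kt) * NB * NB;
                double *b = B + ((size_t)kt * nt + jt) * NB * NB;
                double *c = C + ((size_t)it * nt + jt) * NB * NB;
                /* tasks updating the same C tile serialize via 'inout';
                 * tasks on different C tiles run whenever a core is free */
                #pragma omp task firstprivate(a, b, c) \
                    depend(in: a[0:NB*NB], b[0:NB*NB]) depend(inout: c[0:NB*NB])
                tile_gemm(a, b, c);
            }
    /* implicit barrier at the end of 'single' waits for all tasks */

    printf("C[0] = %.1f (expect %d)\n", C[0], N);
    free(A); free(B); free(C);
    return 0;
}
```

Compiling with `cc -O2 -fopenmp tiles.c` runs the tasks concurrently; without -fopenmp the pragmas are ignored and the same code runs serially, which makes the dependence structure easy to check.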
