  1. ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING. Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory.

  2. Over the Last 20 Years: Performance Development. [Chart: TOP500 performance development, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s, with curves for the SUM of all 500 systems, N=1, and N=500.] June 2012: SUM = 123 Pflop/s, N=1 = 16.3 Pflop/s, N=500 = 60.8 Tflop/s; 1993: SUM = 1.17 Tflop/s, N=1 = 59.7 Gflop/s, N=500 = 400 Mflop/s. The N=500 curve trails N=1 by roughly 6-8 years. For scale: my laptop delivers about 70 Gflop/s; my iPad 2 and iPhone 4S deliver about 1.02 Gflop/s.

  3. June 2012: The TOP10. (Rank. Site; Computer; Country; Cores; Rmax [Pflop/s]; % of Peak; Power [MW]; Mflops/Watt)
      1. DOE/NNSA Lawrence Livermore National Lab; Sequoia, BlueGene/Q (16c) + custom; USA; 1,572,864; 16.3; 81; 8.6; 1895
      2. RIKEN Advanced Institute for Computational Science; K computer, Fujitsu SPARC64 VIIIfx (8c) + custom; Japan; 705,024; 10.5; 93; 12.7; 830
      3. DOE/OS Argonne National Lab; Mira, BlueGene/Q (16c) + custom; USA; 786,432; 8.16; 81; 3.95; 2069
      4. Leibniz Rechenzentrum; SuperMUC, Intel (8c) + IB; Germany; 147,456; 2.90; 90*; 3.52; 823
      5. National Supercomputer Center in Tianjin; Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom; China; 186,368; 2.57; 55; 4.04; 636
      6. DOE/OS Oak Ridge National Lab; Jaguar, Cray, AMD (16c) + custom; USA; 298,592; 1.94; 74; 5.14; 377
      7. CINECA; Fermi, BlueGene/Q (16c) + custom; Italy; 163,840; 1.73; 82; 0.821; 2099
      8. Forschungszentrum Juelich (FZJ); JuQUEEN, BlueGene/Q (16c) + custom; Germany; 131,072; 1.38; 82; 0.657; 2099
      9. Commissariat a l'Energie Atomique (CEA); Curie, Bull, Intel (8c) + IB; France; 77,184; 1.36; 82; 2.25; 604
      10. National Supercomputer Center in Shenzhen; Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB; China; 120,640; 1.27; 43; 2.58; 493
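
The last two columns are derived quantities. As a worked illustration (not part of the slides), here is a minimal C sketch recomputing them for the Sequoia entry above; the ~20.1 Pflop/s peak is an assumption back-computed from the quoted 81% efficiency.

```c
/* A minimal sketch (not from the slides) of how the two derived columns
 * in the TOP10 table are obtained: "% of Peak" is Rmax/Rpeak and
 * "Mflops/Watt" is Rmax divided by the measured power draw.
 * Values are the Sequoia entry above; the ~20.1 Pflop/s peak is an
 * assumption back-computed from the quoted 81% efficiency. */
#include <stdio.h>

int main(void) {
    double rmax_pflops  = 16.3;  /* achieved Linpack performance     */
    double rpeak_pflops = 20.1;  /* theoretical peak (assumed above) */
    double power_mw     = 8.6;   /* power during the Linpack run     */

    double efficiency = rmax_pflops / rpeak_pflops;          /* ~0.81 */
    /* Pflop/s -> Mflop/s is a factor of 1e9; MW -> W is 1e6. */
    double mflops_per_watt = (rmax_pflops * 1e9) / (power_mw * 1e6);

    printf("Linpack efficiency: %.0f%%\n", 100.0 * efficiency);
    printf("Power efficiency:   %.0f Mflops/Watt\n", mflops_per_watt);
    return 0;
}
```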

  4. Accelerators (58 systems). [Chart: number of TOP500 systems with accelerators or co-processors, 2006-2012.] June 2012 breakdown: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), IBM PowerXCell 8i (2), ATI GPU (2), Intel MIC (1), Clearspeed CSX600 (0).

  5. Countries' Share (absolute counts): US: 252, China: 68, Japan: 35, UK: 25, France: 22, Germany: 20, Switzerland: 5.

  6. Swiss Machines in the TOP500 (max: 12, min: 1). [Chart: Swiss systems on each TOP500 list, June 1993 to June 2012; data series: 4, 5, 7, 9, 12, 9, 8, 9, 6, 6, 5, 6, 6, 5, 8, 8, 6, 2, 1, 1, 3, 3, 2, 3, 3, 4, 4, 5, 5, 7, 6, 4, 4, 5, 5, 4, 4, 3, 1.] High point: 12 systems (6/95). Low points: 1 system (6/02, 11/02, 6/12).

  7. 28 Systems above 1 Pflop/s Peak (the "Pflop/s Club"). Aggregate peak performance by country: US 41 Pflop/s (9 systems), Japan 16.2 (4), China 11.1 (5), Germany 6.9 (4), France 2.92 (2), UK 2.73 (2), Italy 2.1 (1), Russia 1.7 (1).

  8. Linpack Efficiency (repeated with the same data on slides 9 and 10). [Chart: Linpack efficiency, i.e. achieved Rmax as a fraction of theoretical peak Rpeak, plotted for the systems ranked 1 through 500; y-axis from 0% to 100%.]

  11. Performance Development in the TOP500. [Chart: the N=1 and N=500 performance curves extrapolated from 1994 through 2020, on a log scale from 100 Mflop/s to 1 Eflop/s.]

  12. The High Cost of Data Movement. Approximate energy per operation, 2011 vs. projected 2018:
      DP fused multiply-add (flop): 100 pJ -> 10 pJ
      DP read from DRAM: 4800 pJ -> 1920 pJ
      Local interconnect traversal: 7500 pJ -> 2500 pJ
      Cross-system communication: 9000 pJ -> 3500 pJ
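
To make the imbalance concrete, here is a rough back-of-the-envelope sketch (mine, not from the slides) applying the 2011 figures to a memory-bound kernel such as a long dot product.

```c
/* A rough back-of-the-envelope sketch (not from the slides) using the
 * 2011 energy figures above: for a streaming kernel such as a dot
 * product, each fused multiply-add costs ~100 pJ, but the two operands
 * it reads from DRAM cost ~2 x 4800 pJ, so data movement dominates the
 * energy budget by roughly two orders of magnitude. */
#include <stdio.h>

int main(void) {
    const double pj_fmadd     = 100.0;   /* DP fused multiply-add, 2011 */
    const double pj_dram_read = 4800.0;  /* DP operand read from DRAM   */

    double n = 1e9;                          /* dot product of length 10^9 */
    double e_flops = n * pj_fmadd;           /* arithmetic energy          */
    double e_data  = n * 2.0 * pj_dram_read; /* two DRAM reads per FMA     */

    printf("arithmetic: %.2f J, data movement: %.2f J (ratio %.0fx)\n",
           e_flops * 1e-12, e_data * 1e-12, e_data / e_flops);
    return 0;
}
```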

  13. Energy Cost Challenge. At roughly $1M per MW per year, energy costs are substantial: a 10 Pflop/s system in 2011 uses about 10 MW, and a 1 Eflop/s system in 2018 would need more than 100 MW at current trends. DOE target: 1 Eflop/s in 2018 at 20 MW.
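
A small worked computation (mine, not from the slides) turning those power figures into annual electricity bills, using the ~$1M per MW per year rule of thumb above.

```c
/* A small worked sketch (not from the slides) converting the power
 * figures above into annual electricity bills at ~$1M per MW per year. */
#include <stdio.h>

int main(void) {
    const double musd_per_mw_year = 1.0;   /* ~$1M per MW per year */

    double petascale_mw      = 10.0;   /* 10 Pflop/s system, 2011     */
    double exascale_trend_mw = 100.0;  /* 1 Eflop/s at current trends */
    double exascale_goal_mw  = 20.0;   /* DOE exascale power target   */

    printf("10 Pflop/s system:         ~$%.0fM per year\n",
           petascale_mw * musd_per_mw_year);
    printf("1 Eflop/s, current trends: >$%.0fM per year\n",
           exascale_trend_mw * musd_per_mw_year);
    printf("1 Eflop/s at DOE target:   ~$%.0fM per year\n",
           exascale_goal_mw * musd_per_mw_year);
    return 0;
}
```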

  14. Potential System Architecture with a cap of $200M and 20 MW. Comparison of the 2012 BG/Q system with a projected 2019 exascale system (last value is the factor of difference between today and 2019):
      System peak: 20 Pflop/s -> 1 Eflop/s, O(100)
      Power: 8.6 MW -> ~20 MW
      System memory: 1.6 PB (16*96*1024) -> 32-64 PB, O(10)
      Node performance: 205 GF/s (16*1.6GHz*8) -> 1.2 or 15 TF/s, O(10)-O(100)
      Node memory BW: 42.6 GB/s -> 2-4 TB/s, O(1000)
      Node concurrency: 64 threads -> O(1k) or 10k, O(100)-O(1000)
      Total node interconnect BW: 20 GB/s -> 200-400 GB/s, O(10)
      System size: 98,304 nodes (96*1024) -> O(100,000) or O(1M) nodes, O(100)-O(1000)
      Total concurrency: 5.97 M -> O(billion), O(1000)
      MTTI: 4 days -> O(<1 day), -O(10)

  15. Potential System Architecture with a cap of $200M and 20 MW (revised). The same table as slide 14, with the projected exascale column dated 2022 instead of 2019; all other entries are unchanged.

  16. Critical Issues at Peta and Exascale for Algorithm and Software Design:
      - Synchronization-reducing algorithms: break the fork-join model.
      - Communication-reducing algorithms: use methods that attain lower bounds on communication.
      - Mixed precision methods: roughly 2x the speed of operations and 2x the speed of data movement (see the sketch after this list).
      - Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
      - Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
      - Reproducibility of results: today we cannot guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
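
As an illustration of the mixed-precision item above, here is a minimal toy sketch (not the slides' code; it assumes LAPACKE and CBLAS are available, e.g. link with -llapacke -lcblas) that factors in single precision and recovers double-precision accuracy by iterative refinement.

```c
/* A minimal sketch of the mixed-precision idea on slide 16: factor the
 * matrix in single precision (fast), then recover double-precision
 * accuracy by iterative refinement with double-precision residuals.
 * Illustrative toy only; return codes are ignored for brevity. */
#include <stdio.h>
#include <lapacke.h>
#include <cblas.h>

int main(void) {
    enum { N = 4 };
    /* Column-major storage of a small test matrix A and right-hand side b. */
    double A[N * N] = { 4, 1, 0, 0,
                        1, 4, 1, 0,
                        0, 1, 4, 1,
                        0, 0, 1, 4 };
    double b[N] = { 1, 2, 3, 4 };

    /* 1. Copy A and b to single precision; factor and solve there. */
    float As[N * N], xs[N];
    for (int i = 0; i < N * N; i++) As[i] = (float)A[i];
    for (int i = 0; i < N; i++)     xs[i] = (float)b[i];
    lapack_int ipiv[N];
    LAPACKE_sgesv(LAPACK_COL_MAJOR, N, 1, As, N, ipiv, xs, N);

    double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)xs[i];

    /* 2. Refine: r = b - A*x in double, solve A*dx = r with the
     *    single-precision factors, update x, repeat a few times. */
    for (int it = 0; it < 3; it++) {
        double r[N];
        for (int i = 0; i < N; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N,
                    -1.0, A, N, x, 1, 1.0, r, 1);       /* r = b - A x */
        float rs[N];
        for (int i = 0; i < N; i++) rs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, rs, N);
        for (int i = 0; i < N; i++) x[i] += (double)rs[i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
    return 0;
}
```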

  17. Major Changes to Algorithms/Software:
      - We must rethink the design of our algorithms and software.
      - Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
      - Rethink and rewrite the applications, algorithms, and software.
      - Data movement is expensive; flops are cheap.

  18. Dense Linear Algebra Software Evolution (a sketch contrasting the Level 1 and Level 3 styles follows this list):
      - LINPACK (1970s): Level 1 BLAS, vector operations.
      - LAPACK (1980s): Level 3 BLAS, block operations.
      - ScaLAPACK (1990s): PBLAS and BLACS, block-cyclic data distribution, message passing.
      - PLASMA (2000s): tile layout, tile operations, dataflow scheduling.
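
A small sketch (mine, assuming a linked CBLAS implementation) of the Level 1 versus Level 3 contrast above: the same update written as many vector operations versus one blocked matrix-matrix multiply.

```c
/* A sketch (not from the slides) of the Level 1 vs Level 3 BLAS
 * contrast: the same C := C + A*B update expressed as many daxpy calls
 * (LINPACK-era style, one vector at a time) versus a single dgemm call
 * (LAPACK-era blocked style, which keeps data in cache). */
#include <cblas.h>

/* All matrices are n x n, column-major, leading dimension n. */
void update_level1(int n, const double *A, const double *B, double *C) {
    /* C(:,j) += A(:,k) * B(k,j): n^2 daxpy calls, each O(n) work, so
     * every operand is streamed from memory over and over. */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            cblas_daxpy(n, B[k + j * n], &A[k * n], 1, &C[j * n], 1);
}

void update_level3(int n, const double *A, const double *B, double *C) {
    /* One matrix-matrix multiply: O(n^3) flops on O(n^2) data, which is
     * what lets Level 3 BLAS reuse blocks in cache and run near peak. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
}
```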

  19. PLASMA Principles (an illustrative sketch follows this list):
      - Tile algorithms: minimize capacity misses. [Diagram: LAPACK model, one CPU with one cache in front of memory.]
      - Tile matrix layout: minimize conflict misses. [Diagram: PLASMA model, multiple CPUs, each with its own cache, sharing memory.]
      - Dynamic DAG scheduling: minimizes idle time, more overlap, asynchronous operations.
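
An illustrative sketch of the tile and dataflow ideas above, written with OpenMP 4.x task dependences rather than PLASMA's own runtime (an assumption made for portability): each tile operation becomes a task, and the in/out dependences let the runtime schedule the resulting DAG without fork-join barriers.

```c
/* Illustrative sketch (not PLASMA source): the matrix is stored as
 * contiguous NB x NB tiles, each tile operation is a task, and data
 * dependences drive the scheduling. Shown for the C += A*B tile update. */
#include <cblas.h>

#define NB 256   /* tile size */

/* A, B, C are p x p arrays of pointers to NB x NB column-major tiles,
 * indexed column-major over the tile grid. */
void tiled_gemm(int p, double **A, double **B, double **C) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < p; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < p; k++) {
                double *a = A[i + k * p], *b = B[k + j * p],
                       *c = C[i + j * p];
                /* Each tile multiply is one task; tasks touching
                 * different C tiles run concurrently, tasks on the same
                 * C tile are serialized by the inout dependence. */
                #pragma omp task depend(in: a[0:NB*NB], b[0:NB*NB]) \
                                 depend(inout: c[0:NB*NB])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            NB, NB, NB, 1.0, a, NB, b, NB, 1.0, c, NB);
            }
    /* Implicit barrier at the end of the parallel region waits for all
     * outstanding tasks. */
}
```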

  20. Fork-Join Parallelization of LU and QR (a minimal sketch follows this list):
      - Parallelize the update step (dgemm): easy, and done in any reasonable software.
      - This is the (2/3)n^3 term in the flop count.
      - Can be done efficiently with LAPACK plus a multithreaded BLAS.
      [Diagram: cores vs. time for the fork-join execution.]
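
For contrast with the PLASMA sketch, here is a minimal sketch of the fork-join baseline described above: plain LAPACK LU in which the parallelism lives entirely inside the multithreaded BLAS (the test matrix and size are my own, not from the slides).

```c
/* A minimal sketch of the fork-join approach on slide 20: plain LAPACK
 * dgetrf (blocked LU with partial pivoting), where essentially all of
 * the parallelism comes from the multithreaded BLAS underneath the
 * O((2/3)n^3) dgemm trailing-matrix updates. The cores join at the end
 * of every panel/update step, which is the synchronization that
 * PLASMA's DAG scheduling avoids. Assumes LAPACKE plus a threaded BLAS
 * (e.g. OpenBLAS or MKL) at link time. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 2048;
    double *A = malloc((size_t)n * n * sizeof *A);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!A || !ipiv) return 1;

    /* Fill A with a simple diagonally dominant (nonsingular) matrix. */
    for (lapack_int j = 0; j < n; j++)
        for (lapack_int i = 0; i < n; i++)
            A[i + j * n] = (i == j) ? (double)n : 1.0 / (1.0 + i + j);

    /* One call: panel factorizations plus dgemm updates inside. */
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    printf("dgetrf info = %d\n", (int)info);

    free(A);
    free(ipiv);
    return 0;
}
```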
