Workshop on Edge Computing Using New Commodity Architectures (EDGE), May 23-24, 2006, Chapel Hill, North Carolina
The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results


  1. Workshop on Edge Computing Using New Commodity Architectures (EDGE), May 23-24, 2006, Chapel Hill, North Carolina. "The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results." Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, 5/27/2006.
     Overview:
     ♦ Look at the current state of high performance computing - Top500 data, past and present
     ♦ Some of the changes multicore brings - the impact on numerical libraries
     ♦ Potential gains from exploiting lower precision devices - GPUs, Cell, SSE2, AltiVec
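     The technique named in the title, mixed precision iterative refinement, does the O(n^3) factorization in fast 32-bit arithmetic and then recovers 64-bit accuracy with a cheap O(n^2) correction loop. The following is a minimal sketch in C using LAPACKE and CBLAS; the function name, iteration cap, and tolerance are illustrative assumptions, not values from the slides.

         /* Mixed precision iterative refinement sketch: factor once in
          * single precision, refine the solution in double precision.
          * Build (assumed): cc refine.c -llapacke -lcblas -lm */
         #include <lapacke.h>
         #include <cblas.h>
         #include <stdlib.h>
         #include <math.h>

         /* Solve Ax = b (column-major, n x n) to double precision accuracy
          * using only a single precision LU factorization of A. */
         int mixed_precision_solve(int n, const double *A, const double *b,
                                   double *x, int max_iter, double tol)
         {
             float *As = malloc(sizeof(float) * (size_t)n * n);
             float *ws = malloc(sizeof(float) * (size_t)n);
             double *r = malloc(sizeof(double) * (size_t)n);
             lapack_int *ipiv = malloc(sizeof(lapack_int) * (size_t)n);
             int i, it;

             for (i = 0; i < n * n; i++) As[i] = (float)A[i];  /* demote A */
             /* The one O(n^3) step, done entirely in single precision */
             int info = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);
             if (info == 0) {
                 for (i = 0; i < n; i++) ws[i] = (float)b[i];
                 LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
                 for (i = 0; i < n; i++) x[i] = (double)ws[i];  /* x_0 */

                 for (it = 0; it < max_iter; it++) {
                     /* r = b - A x in double precision: this step recovers
                      * the digits the single precision factorization lost */
                     for (i = 0; i < n; i++) r[i] = b[i];
                     cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                                 -1.0, A, n, x, 1, 1.0, r, 1);
                     double rmax = 0.0;
                     for (i = 0; i < n; i++) rmax = fmax(rmax, fabs(r[i]));
                     if (rmax < tol) break;
                     /* Correction solve reuses the single precision LU:
                      * only O(n^2) work per iteration */
                     for (i = 0; i < n; i++) ws[i] = (float)r[i];
                     LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
                     for (i = 0; i < n; i++) x[i] += (double)ws[i];
                 }
             }
             free(As); free(ws); free(r); free(ipiv);
             return info;
         }

     On hardware where 32-bit arithmetic runs 2x or more faster than 64-bit (SSE2, Cell, GPUs), this moves nearly all of the work into the fast precision while still delivering a double precision answer, provided A is not too ill-conditioned.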

  2. Top500 (H. Meuer, H. Simon, E. Strohmaier, and J. Dongarra):
     - Listing of the 500 most powerful computers in the world
     - Yardstick: Rmax from LINPACK, Ax=b for a dense problem (TPP performance rate)
     - Updated twice a year: at SC'xy in the States in November, and at the meeting in Germany in June
     - All data available from www.top500.org

     Current HPC Architecture/Systems. [Chart: Top500 share of each class, 0-100%, June 1993 to June 2004.] Three classes:
     ♦ Custom (tightly coupled): custom processor with custom interconnect - Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L. Best processor performance for codes that are not "cache friendly"; good communication performance; simpler programming model; most expensive.
     ♦ Hybrid: commodity processor with custom interconnect - SGI Altix (Intel Itanium 2), Cray XT3 and XD1 (AMD Opteron). Good communication performance; good scalability.
     ♦ Commodity (loosely coupled): commodity processor with commodity interconnect - clusters of Pentium, Itanium, Opteron, or Alpha over GigE, Infiniband, Myrinet, or Quadrics; NEC TX7; IBM eServer; Dawning. Best price/performance for codes that work well with caches and are latency tolerant; more complex programming model.

  3. Processor Type Used in the Top500 Systems. [Pie chart: Intel IA-32 41%, Intel EM64T 16%, Intel IA-64 9%, IBM Power 15%, AMD x86_64 11%, HP PA-RISC 3%, Cray 2%, HP Alpha 1%, NEC 1%, Sun Sparc 1%, Hitachi SR8000 0%.] Intel (66%) + IBM (15%) + AMD (11%) = 91%.

     Processor Types (Top500). [Chart: number of systems per processor family, 0-500, 1993-2005: SIMD, Sparc, Vector, MIPS, Alpha, HP, AMD, IBM Power, Intel.] Intel + IBM Power PC + AMD = 91%.

  4. Interconnects / Systems (Top500). [Chart: number of systems per interconnect family, 0-500, 1993-2005: Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband, Myrinet (101), Gigabit Ethernet (249), N/A.] GigE + Myrinet = 70%.

     Performance Development (Top500). [Chart: log scale, 100 Mflop/s to 1 Pflop/s, 1993-2005. SUM has reached 2.3 PF/s. N=1 is IBM BlueGene/L at 280.6 TF/s, preceded by the NEC Earth Simulator, IBM ASCI White at LLNL, Intel ASCI Red at Sandia, and the Fujitsu 'NWT' at NAL (59.7 GF/s in 1993). The N=500 line passes 1.167 TF/s and ends at 1.646 TF/s; 'My Laptop' is marked at 0.4 GF/s.]

  5. The traditional approach: pack an increasing number of gates into a tight knot and decrease the cycle time of the processor. [Diagram: increase clock rate and transistor density while lowering voltage; a single core with cache evolves into two cores, then into grids of four and more cores sharing cache.]
     ♦ We have seen an increasing number of gates on a chip and increasing clock speed.
     ♦ Heat is becoming an unmanageable problem; Intel processors now exceed 100 Watts.
     ♦ We will not see the dramatic increases in clock speeds continue in the future.
     ♦ However, the number of gates on a chip will continue to increase.

     CPU Desktop Trends: Change is Coming
     ♦ Relative processing power will continue to double every 18 months.
     ♦ 256 logical processors per chip in late 2010.
     [Chart: hardware threads per chip and cores per processor chip, 2004-2010, scale 0-300.]

  6. Commodity Processor Trends: Bandwidth/Latency is the Critical Issue, not FLOPS. Got bandwidth?

                                       Annual     Typical value      Typical value      Typical value
                                       increase   in 2006            in 2010            in 2020
     Single-chip floating-point        59%        4 GFLOP/s          32 GFLOP/s         3300 GFLOP/s
       performance
     Front-side bus bandwidth          23%        1 GWord/s          3.5 GWord/s        27 GWord/s
                                                  = 0.25 word/flop   = 0.11 word/flop   = 0.008 word/flop
     DRAM latency                      (-5.5%)    70 ns              50 ns              28 ns
                                                  = 280 FP ops       = 1600 FP ops      = 94,000 FP ops
                                                  = 70 loads         = 170 loads        = 780 loads

     (The derived rows follow directly from the raw ones: for 2006, 1 GWord/s / 4 GFLOP/s = 0.25 word/flop, and 70 ns x 4 GFLOP/s = 280 floating-point operations per DRAM access.)
     Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

     That Was the Good News
     ♦ Bad news: the effect of this hardware change on the existing software base.
     ♦ We must rethink the design of our software - another disruptive technology - and rewrite the applications, algorithms, and software.

  7. Parallelism in LAPACK / ScaLAPACK. [Diagram: two software stacks. Shared memory: LAPACK on top of a parallel BLAS (threaded, ATLAS, or vendor-specialized). Distributed memory: ScaLAPACK on top of the PBLAS, which sit on the BLAS and the BLACS over MPI.]

     Right-Looking LU Factorization (LAPACK): each step of the blocked algorithm calls four routines (a sketch of the loop follows below):
     - DGETF2: unblocked LU of the current panel
     - DLASWP: row swaps, applying the panel's pivots across the matrix
     - DTRSM: triangular solve with many right-hand sides
     - DGEMM: matrix-matrix multiply, updating the trailing submatrix
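     To make the division of labor concrete, here is a minimal C sketch of one blocked right-looking sweep in the style of DGETRF, using LAPACKE and CBLAS. The block size NB and the function name are illustrative, and the panel step calls LAPACKE_dgetrf where LAPACK's own code uses the unblocked DGETF2.

         /* Blocked right-looking LU: the structure behind LAPACK's DGETRF.
          * Column-major storage; build (assumed): cc lu.c -llapacke -lcblas */
         #include <lapacke.h>
         #include <cblas.h>

         #define NB 64   /* panel width; illustrative choice */

         int blocked_lu(int n, double *A, int lda, lapack_int *ipiv)
         {
             for (int j = 0; j < n; j += NB) {
                 int jb = (n - j < NB) ? n - j : NB;

                 /* Panel factorization of A[j:n, j:j+jb]
                  * (LAPACK's DGETRF calls the unblocked DGETF2 here) */
                 int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - j, jb,
                                           &A[j + (size_t)j * lda], lda, &ipiv[j]);
                 if (info != 0) return info;
                 /* Panel pivot indices are relative to row j; make them global */
                 for (int i = j; i < j + jb; i++) ipiv[i] += j;

                 /* DLASWP: apply the panel's row swaps to the columns on the
                  * left and right of the panel (k1, k2 are 1-based in LAPACK) */
                 LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, lda,
                                j + 1, j + jb, ipiv, 1);
                 if (j + jb < n) {
                     LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - j - jb,
                                    &A[(size_t)(j + jb) * lda], lda,
                                    j + 1, j + jb, ipiv, 1);

                     /* DTRSM: U12 = L11^{-1} A12, L11 unit lower triangular */
                     cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                                 CblasNoTrans, CblasUnit, jb, n - j - jb,
                                 1.0, &A[j + (size_t)j * lda], lda,
                                 &A[j + (size_t)(j + jb) * lda], lda);

                     /* DGEMM: trailing update A22 -= L21 * U12; this is where
                      * nearly all of the O(n^3) flops are spent */
                     cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                                 n - j - jb, n - j - jb, jb,
                                 -1.0, &A[(j + jb) + (size_t)j * lda], lda,
                                 &A[j + (size_t)(j + jb) * lda], lda,
                                 1.0, &A[(j + jb) + (size_t)(j + jb) * lda], lda);
                 }
             }
             return 0;
         }

     In serial this already concentrates the work in DGEMM; the timing profiles on the next slides show how the remaining components behave once threads enter the picture.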

  8. Steps in the LAPACK LU: DGETF2 (LAPACK), DLASWP (LAPACK), DLASWP (LAPACK), DTRSM (BLAS), DGEMM (BLAS).

     LU Timing Profile (4 processor system). LAPACK + BLAS threads; 1D decomposition, run on an SGI Origin. [Chart: time spent in each component - DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM.]

  9. LU Timing Profile (4 processor system), continued. LAPACK + BLAS threads versus a threaded version with no lookahead; 1D decomposition, run on an SGI Origin. [Chart: time spent in each component - DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM.] In this case the performance difference comes from parallelizing the row exchanges (DLASWP) and from threading the LU algorithm itself; a sketch of threaded row exchanges follows below.

     Right-Looking LU Factorization. [Animation of the factorization sweeping over the matrix.]
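     Parallelizing the row exchanges is straightforward because every swap touches a full row: the column range can be split across threads, each applying all of the pivots to its own block of columns. A minimal sketch under the assumption of an OpenMP build is below; the function name and the contiguous chunking are illustrative.

         /* The DLASWP step parallelized over columns: each thread applies
          * the full pivot sequence to its own slice of the matrix.
          * Build (assumed): cc -fopenmp laswp_par.c -llapacke */
         #include <lapacke.h>
         #include <omp.h>

         /* Apply swaps ipiv[k1-1 .. k2-1] (1-based, LAPACK convention)
          * to the n columns of column-major A. */
         void laswp_parallel(int n, double *A, int lda,
                             int k1, int k2, const lapack_int *ipiv)
         {
             #pragma omp parallel
             {
                 int nth = omp_get_num_threads();
                 int tid = omp_get_thread_num();
                 int chunk = (n + nth - 1) / nth;   /* columns per thread */
                 int c0 = tid * chunk;
                 int nc = (c0 >= n) ? 0 : (n - c0 < chunk ? n - c0 : chunk);
                 /* Row swaps are independent across columns, so each thread
                  * can run the serial kernel on its own column block */
                 if (nc > 0)
                     LAPACKE_dlaswp(LAPACK_COL_MAJOR, nc,
                                    &A[(size_t)c0 * lda], lda, k1, k2, ipiv, 1);
             }
         }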

  10. Right-Looking LU with a Lookahead. [Animation: the next panel is factored while the previous trailing-matrix update is still in progress.]

      Pivot Rearrangement and Lookahead. [Chart: 4 processor runs with lookahead = 0, 1, 2, 3, ∞.]

  11. Fixed vs. Adaptive Lookahead
      ♦ No lookahead, or a shallow lookahead: there is not enough work in the update of the trailing matrix, so the pipeline stalls ("bubbles") at the end of the factorization.
      ♦ Deep or unlimited lookahead: the algorithm attempts to factor the next panel before the necessary piece of the trailing matrix is available, so the pipeline stalls ("bubbles") at the beginning of the factorization.
      ♦ Solution - adaptive lookahead: basically implement the left-looking version of the algorithm; pursue the panels as fast as possible, but keep updating the trailing matrix until it is certain that factoring the next panel will not stall. (The fixed depth-1 reordering this builds on is sketched after this list.)

      Pivot Rearrangement and Adaptive Lookahead. [Chart: 16-way SMP runs.]
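      For reference, here is what a fixed depth-1 lookahead does to the loop ordering. The sketch below is sequential C with pivoting omitted so that only the scheduling is visible; in a real threaded code the deferred remainder of the trailing update runs concurrently with the next panel factorization. The block size and function names are illustrative.

          /* Depth-1 lookahead in a right-looking LU (no pivoting, to keep
           * the scheduling visible). After panel j is factored, only the
           * columns of panel j+1 are updated before panel j+1 is factored;
           * the rest of the trailing update is deferred.
           * Build (assumed): cc lookahead.c -lcblas */
          #include <cblas.h>

          #define NB 64

          /* Apply panel j's TRSM + GEMM update to columns [c0, c1) */
          static void update_cols(int n, double *A, int lda,
                                  int j, int jb, int c0, int c1)
          {
              if (c1 <= c0) return;
              cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                          CblasUnit, jb, c1 - c0, 1.0,
                          &A[j + (size_t)j * lda], lda,
                          &A[j + (size_t)c0 * lda], lda);
              cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                          n - j - jb, c1 - c0, jb, -1.0,
                          &A[(j + jb) + (size_t)j * lda], lda,
                          &A[j + (size_t)c0 * lda], lda, 1.0,
                          &A[(j + jb) + (size_t)c0 * lda], lda);
          }

          /* Unblocked, unpivoted LU of the panel A[j:n, j:j+jb] */
          static void factor_panel(int n, double *A, int lda, int j, int jb)
          {
              for (int k = j; k < j + jb; k++) {
                  for (int i = k + 1; i < n; i++)
                      A[i + (size_t)k * lda] /= A[k + (size_t)k * lda];
                  for (int c = k + 1; c < j + jb; c++)
                      for (int i = k + 1; i < n; i++)
                          A[i + (size_t)c * lda] -=
                              A[i + (size_t)k * lda] * A[k + (size_t)c * lda];
              }
          }

          void lu_lookahead1(int n, double *A, int lda)
          {
              factor_panel(n, A, lda, 0, n < NB ? n : NB);
              for (int j = 0; j < n; j += NB) {
                  int jb = (n - j < NB) ? n - j : NB;
                  int nxt = j + jb;
                  int nxtb = (n - nxt < NB) ? n - nxt : NB;
                  if (nxt >= n) break;
                  /* Critical path first: bring the NEXT panel up to date
                   * and factor it immediately ... */
                  update_cols(n, A, lda, j, jb, nxt, nxt + nxtb);
                  factor_panel(n, A, lda, nxt, nxtb);
                  /* ... then finish the deferred trailing update (the work
                   * that overlaps the next factorization when threaded) */
                  update_cols(n, A, lda, j, jb, nxt + nxtb, n);
              }
          }

      The adaptive variant on the slide replaces the fixed "next panel only" choice with a check of how much deferred update work remains, advancing panels only while the pipeline cannot stall.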

  12. GPU Performance (thanks: Jeremy Meredith, ORNL)

      GPU vendor            NVIDIA        NVIDIA       ATI
      Model                 6800 Ultra    7800 GTX     X1900 XTX
      Release year          2004          2005         2006
      32-bit performance    60 GFLOPS     200 GFLOPS   400 GFLOPS
      64-bit performance    must be emulated in software

      Things to Watch: PlayStation 3
      ♦ The PlayStation 3's CPU is based on a chip codenamed "Cell".
      ♦ Each Cell contains 8 APUs.
        - An APU is a self-contained vector processor which acts independently of the others.
        - 4 floating point units capable of a total of 32 Gflop/s (8 Gflop/s each).
        - 256 Gflop/s peak (8 APUs x 32 Gflop/s) in 32-bit floating point; 64-bit floating point runs at 25 Gflop/s.
        - IEEE format, but 32-bit arithmetic only rounds toward zero, and overflow is set to the largest representable value.
        - According to IBM, the SPE's double precision unit is fully IEEE 854 compliant.
        - Datapaths "lite".
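      Where the table says 64-bit must be emulated in software, the standard trick on 32-bit-only hardware is double-single ("double-float") arithmetic: a value is carried as an unevaluated sum of two floats, and error-free transformations recover the bits a single float would drop. A minimal sketch of the addition operation is below; the type and function names are illustrative, and the code assumes strict IEEE rounding.

          /* Double-single arithmetic: roughly 48 bits of significand built
           * from pairs of 24-bit floats - the kind of software emulation the
           * GPU table alludes to. Compile without -ffast-math. */
          #include <stdio.h>

          typedef struct { float hi, lo; } dsfloat;   /* value = hi + lo */

          /* Addition via Knuth's two-sum: s + e equals a.hi + b.hi exactly */
          static dsfloat ds_add(dsfloat a, dsfloat b)
          {
              float s = a.hi + b.hi;
              float v = s - a.hi;
              float e = (a.hi - (s - v)) + (b.hi - v);  /* rounding error of s */
              e += a.lo + b.lo;                         /* fold in the low parts */
              dsfloat r = { s + e, 0.0f };
              r.lo = e - (r.hi - s);                    /* renormalize hi/lo */
              return r;
          }

          int main(void)
          {
              /* 1 + 2^-30 is not representable in one float (24-bit
               * significand), but the hi/lo pair preserves it */
              dsfloat one  = { 1.0f, 0.0f };
              dsfloat tiny = { 0x1p-30f, 0.0f };
              dsfloat sum  = ds_add(one, tiny);
              printf("hi = %.9g, lo = %.9g\n", sum.hi, sum.lo);
              return 0;
          }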
