
The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results - PowerPoint PPT Presentation



  1. The Impact of Multicore on Math Software, and Exploiting Single Precision Computing to Obtain Double Precision Results. Jack Dongarra, Innovative Computing Laboratory, University of Tennessee and Oak Ridge National Laboratory. 12/5/2006

  2. Where in the World is Knoxville, Tennessee? [Map marking Oak Ridge National Lab and the University of Tennessee, Knoxville.]

  3. Outline
     ♦ Top500: a quick look
     ♦ Multicore: software changes
     ♦ IBM Cell processor: early experiments

  4. Performance Development, Top500 (1993-2006). [Chart of Linpack performance on a log scale from 100 Mflop/s to 1 Pflop/s: the SUM of all 500 systems has reached 3.54 PF/s, the #1 system (IBM BlueGene/L) 280.6 TF/s, and the #500 entry 2.74 TF/s, trailing the #1 curve by roughly 6-8 years. Earlier #1 systems marked on the curve include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, and the NEC Earth Simulator; 'My Laptop' at 0.4 GF/s is shown for scale.]

  5. Predicted Performance Levels. [Chart extrapolating Top500 Linpack levels from Jun-03 through Dec-09: measured and predicted Total, #1, #10, and #500. Predicted totals reach into the thousands of TFlop/s (3,447 / 4,648 / 6,267), predicted #1 into the hundreds (294 / 405 / 557), predicted #10 around 33-59, and predicted #500 around 2.86-5.46 TFlop/s.]

  6. Processor Count in Top500 Systems (1993-2006). [Stacked chart of the 500 systems grouped by processor count, in bins from 1-2 up to 64k-128k processors, with an annotated 'sweet spot' for parallel computing; a 75% label appears on the chart.]

  7. Increasing the Number of Gates into a Tight Knot and Decreasing the Cycle Time of the Processor. We have seen an increasing number of gates on a chip and increasing clock speeds. Heat is becoming an unmanageable problem: Intel processors now exceed 100 Watts. We will not see the dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase. [Diagram: the two levers are increasing clock rate and transistor density versus lowering voltage; a single cached core evolves into 2, 4, and then 8 cores sharing cache.]

  8. What is Multicore?
     ♦ Multiple, externally visible processors on a single die; the processors have independent control flow, separate internal state, and no critical resource sharing.
     ♦ It is not just SMP on a chip: between discrete chips, bandwidth is about 2 GB/s and latency about 60 ns, while between cores on one die bandwidth exceeds 20 GB/s and latency is under 3 ns; but the cores sit on the wrong side of the pins.
     ♦ Highly sensitive to temporal locality.
     ♦ The memory wall is getting worse.

  9. [Image] 1.2 TB/s memory bandwidth. Source: http://www.pcper.com/article.php?aid=302

  10. CPU Desktop Trends 2004-2011
     ♦ Relative processing power will continue to double every 18 months.
     ♦ 5 years from now: 128 cores per chip, with 512 logical processors (hardware threads) per chip.
     [Chart, 2004-2011: cores per processor chip and hardware threads per chip.]

  11. Challenges Resulting From Multicore
     ♦ Aggravated memory wall
       - Memory bandwidth: getting data out of the memory banks and into the multicore processors
       - Memory latency
       - Fragments the L3 cache
     ♦ Pins become the strangle point
       - Rate of pin growth projected to slow and flatten
       - Rate of bandwidth per pin projected to grow only slowly
     ♦ Relies on effective exploitation of multiple-thread parallelism
       - Need for a parallel computing model and a parallel programming model
     ♦ Requires mechanisms for efficient inter-processor coordination
       - Synchronization, mutual exclusion, context switching

  12. What will the chip look like? [Diagram: a processor made up of many cores, each with its own local cache, plus a shared cache.]

  13. What will the chip look like? [Same diagram as slide 12.]

  14. What will the chip look like? [Same diagram as slide 12.]

  15. Major Changes to Software
     ♦ Must rethink the design of our software
       - Another disruptive technology, similar to what happened with cluster computing and message passing
       - Rethink and rewrite the applications, algorithms, and software
     ♦ Numerical libraries, for example, will change
       - Both LAPACK and ScaLAPACK will undergo major changes to accommodate this

  16. Parallelism in LAPACK / ScaLAPACK. Two well-known open source software efforts for dense matrix problems. [Diagram of the software stacks: LAPACK (shared memory) gets its parallelism from threaded BLAS such as ATLAS or vendor-specialized BLAS; ScaLAPACK (distributed memory) is built on the PBLAS, which use the BLAS and the BLACS over MPI.]

  17. Steps in the LAPACK LU
     - DGETF2 (LAPACK): factor a panel
     - DLASWP (LAPACK): backward swap
     - DLASWP (LAPACK): forward swap
     - DTRSM (BLAS): triangular solve
     - DGEMM (BLAS): matrix multiply
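The sketch below is not from the slides; it mirrors the steps above in NumPy, with an unblocked panel factorization standing in for DGETF2, explicit row interchanges for DLASWP, a block-row solve for DTRSM, and a trailing-matrix multiply for DGEMM. The block size nb and the helper names are assumptions for illustration, and LAPACK's real routines handle many details (workspace, triangular solvers, error checks) omitted here.

    import numpy as np

    def getf2(P):
        # Unblocked LU with partial pivoting on the panel P, in place
        # (a stand-in for DGETF2).  Returns the pivot chosen at each step.
        m, n = P.shape
        piv = np.zeros(min(m, n), dtype=int)
        for j in range(min(m, n)):
            p = j + np.argmax(np.abs(P[j:, j]))
            piv[j] = p
            P[[j, p], :] = P[[p, j], :]          # swap rows within the panel
            P[j+1:, j] /= P[j, j]
            P[j+1:, j+1:] -= np.outer(P[j+1:, j], P[j, j+1:])
        return piv

    def blocked_lu(A, nb=64):
        # Right-looking blocked LU; L and U overwrite A, and the accumulated
        # row permutation is returned.  Each iteration performs the steps
        # named on the slide.
        n = A.shape[0]
        perm = np.arange(n)
        for k in range(0, n, nb):
            kb = min(nb, n - k)
            piv = getf2(A[k:, k:k+kb])                       # DGETF2: factor the panel
            for j, p in enumerate(piv):                      # DLASWP: apply the swaps
                r1, r2 = k + j, k + p
                if r1 != r2:
                    A[[r1, r2], :k] = A[[r2, r1], :k]        # columns to the left
                    A[[r1, r2], k+kb:] = A[[r2, r1], k+kb:]  # columns to the right
                    perm[[r1, r2]] = perm[[r2, r1]]
            if k + kb < n:
                L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
                A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])  # DTRSM
                A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]     # DGEMM
        return perm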

  18. LU Timing Profile (4-processor system). [Chart: time spent in each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM) for a threaded-BLAS run with no lookahead, using a 1-D decomposition on an SGI Origin. The steps execute as bulk-synchronous phases.]

  19. Adaptive Lookahead - Dynamic. Reorganizing algorithms to use event-driven multithreading.

  20. LU - Fixed Lookahead - 4 Processors. [Chart: execution traces over time for the original LAPACK code versus the data-flow code.]
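The sketch below is an illustration of the fixed, one-step lookahead idea, not the code behind the chart. It uses a NumPy blocked LU without pivoting: as soon as the next panel's columns have been updated, that panel's factorization is handed to a worker thread and overlaps with the bulk of the trailing-matrix update (NumPy releases the GIL inside large BLAS calls, so the overlap is real). Pivoting is omitted to keep the scheduling visible, and the helper names and block size are assumptions.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def panel_factor(P):
        # Unblocked, in-place LU without pivoting (stand-in for DGETF2).
        m, n = P.shape
        for j in range(min(m, n)):
            P[j+1:, j] /= P[j, j]
            P[j+1:, j+1:] -= np.outer(P[j+1:, j], P[j, j+1:])

    def update(A, k, kb, cols):
        # Apply step k's block-row triangular solve (DTRSM) and trailing
        # update (DGEMM) to the column range `cols` of A, in place.
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        A[k:k+kb, cols] = np.linalg.solve(L11, A[k:k+kb, cols])
        A[k+kb:, cols] -= A[k+kb:, k:k+kb] @ A[k:k+kb, cols]

    def lu_lookahead(A, nb=128):
        # Blocked right-looking LU (no pivoting) with one-step lookahead.
        n = A.shape[0]
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = None
            for k in range(0, n, nb):
                kb = min(nb, n - k)
                if pending is None:
                    panel_factor(A[k:, k:k+kb])   # very first panel
                else:
                    pending.result()              # lookahead panel already factored
                nxt = slice(k + kb, min(k + 2*kb, n))   # next panel's columns
                rest = slice(min(k + 2*kb, n), n)       # remaining trailing columns
                if nxt.start < n:
                    update(A, k, kb, nxt)                               # small update first
                    pending = pool.submit(panel_factor, A[k+kb:, nxt])  # factor ahead
                if rest.start < n:
                    update(A, k, kb, rest)        # bulk DGEMM, overlaps the panel
        return A    # unit-lower L and U stored in place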

  21. LU - BLAS Threads vs. Dynamic Lookahead. SGI Origin 3000, 16 MIPS R14000 processors at 500 MHz; problem size N = 4000. [Chart: time for LAPACK with threaded BLAS versus the dynamic-lookahead version.]

  22. Taking a Look at the PlayStation 3
     ♦ The PlayStation 3's CPU is based on the "Cell" processor.
     ♦ Each Cell contains 8 SPEs (APUs).
     ♦ An SPE is a self-contained vector processor that acts independently of the others.
       - 4 floating point units capable of a total of 25.6 Gflop/s (6.4 Gflop/s each @ 3.2 GHz)
       - 204.8 Gflop/s peak for the Cell in 32-bit floating point; 64-bit floating point runs at about 15 Gflop/s.
       - IEEE format, but 32-bit arithmetic only rounds toward zero and sets overflow to the largest representable number.
       - According to IBM, the SPE's double precision unit is fully IEEE 754 compliant.
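As a quick arithmetic check of those peak numbers (my aside, not on the slide, and assuming each unit issues one fused multiply-add, i.e. two flops, per cycle): 2 flops/cycle x 3.2 GHz = 6.4 Gflop/s per unit; 4 units per SPE give 4 x 6.4 = 25.6 Gflop/s; and 8 SPEs give 8 x 25.6 = 204.8 Gflop/s of 32-bit peak for the Cell.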

  23. 32- or 64-bit Floating Point Precision?
     ♦ A long time ago 32-bit floating point was the norm; it is still used in scientific applications, but its use is limited.
     ♦ Most applications use 64-bit floating point:
       - Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs more than 10^17 operations.
       - Ill-conditioned problems.
       - The IEEE single precision exponent has too few bits (8 bits, range roughly 10^+-38).
       - Critical sections need higher precision; sometimes extended precision (128-bit floating point) is needed.
       - However, some applications can get by with 32-bit floating point in some parts.
     ♦ Mixed precision is a possibility: approximate in lower precision and then refine or improve the solution to high precision.
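The operation count is just rate times time (my arithmetic, not from the slide): 10 TFlop/s is 10^13 operations per second and 4 hours is 14,400 seconds, so the run performs about 10^13 x 1.44 x 10^4 ~ 1.4 x 10^17 floating point operations, each contributing its own rounding error.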

  24. On the Way to Understanding How to Use the Cell, Something Else Happened...
     ♦ We realized we have a similar situation on our commodity processors: single precision is roughly 2x as fast as double precision on many systems.
     ♦ The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle in double precision, 4 flops/cycle in single precision.
     ♦ The IBM PowerPC has AltiVec: 8 flops/cycle in single precision, 4 flops/cycle in double precision (there is no double precision on AltiVec itself).

     Performance of single and double precision matrix multiply (SGEMM and DGEMM) with n = m = k = 1000:

     Processor (BLAS library)                       SGEMM (GFlop/s)   DGEMM (GFlop/s)   Speedup SP/DP
     Pentium III Katmai, 0.6 GHz (Goto BLAS)              0.98              0.46             2.13
     Pentium III CopperMine, 0.9 GHz (Goto BLAS)          1.59              0.79             2.01
     Pentium Xeon Northwood, 2.4 GHz (Goto BLAS)          7.68              3.88             1.98
     Pentium Xeon Prescott, 3.2 GHz (Goto BLAS)          10.54              5.15             2.05
     Pentium IV Prescott, 3.4 GHz (Goto BLAS)            11.09              5.61             1.98
     AMD Opteron 240, 1.4 GHz (Goto BLAS)                 4.89              2.48             1.97
     PowerPC G5, 2.7 GHz (AltiVec)                       18.28              9.98             1.83
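A rough way to reproduce this comparison on a current machine (my sketch, not part of the slide) is to time single and double precision matrix multiplies through NumPy, which calls whatever optimized BLAS it is linked against (Goto BLAS, OpenBLAS, MKL, ...); the absolute numbers depend entirely on the machine and BLAS, but the SP/DP ratio is the interesting part.

    import time
    import numpy as np

    def gemm_gflops(dtype, n=1000, reps=10):
        # Time C = A @ B and return the best Gflop/s over `reps` runs.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((n, n)).astype(dtype)
        B = rng.standard_normal((n, n)).astype(dtype)
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            A @ B
            best = min(best, time.perf_counter() - t0)
        return 2.0 * n**3 / best / 1e9    # a matrix multiply costs 2*n^3 flops

    sp = gemm_gflops(np.float32)   # SGEMM
    dp = gemm_gflops(np.float64)   # DGEMM
    print(f"SGEMM {sp:.2f} GFlop/s, DGEMM {dp:.2f} GFlop/s, ratio {sp/dp:.2f}")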

  25. Idea: Something Like This...
     ♦ Exploit 32-bit floating point as much as possible, especially for the bulk of the computation.
     ♦ Correct or update the solution with selective use of 64-bit floating point to provide a refined result.
     ♦ Intuitively:
       - compute a 32-bit result,
       - calculate a correction to the 32-bit result using selected higher precision, and
       - apply the correction to the 32-bit result using higher precision.

  26. Mixed-Precision Iterative Refinement. Solve Ax = b:

     [L U] = lu(A)          O(n^3)   SINGLE
     x = L\(U\b)            O(n^2)   SINGLE
     r = b - Ax             O(n^2)   DOUBLE
     WHILE ||r|| not small enough
         z = L\(U\r)        O(n^2)   SINGLE
         x = x + z          O(n^1)   DOUBLE
         r = b - Ax         O(n^2)   DOUBLE
     END
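A minimal NumPy/SciPy sketch of that loop (mine, not from the slides): the factorization and the triangular solves run in single precision, while residuals and solution updates are carried in double precision. The stopping tolerance and iteration cap are assumptions for illustration; the approach only pays off when the matrix is well enough conditioned for the single-precision factorization to be usable.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, tol=1e-12, maxiter=30):
        A64 = np.asarray(A, dtype=np.float64)
        b64 = np.asarray(b, dtype=np.float64)
        A32 = A64.astype(np.float32)

        lu, piv = lu_factor(A32)                                 # O(n^3) in SINGLE
        x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
        for _ in range(maxiter):
            r = b64 - A64 @ x                                    # residual in DOUBLE
            if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
                break
            z = lu_solve((lu, piv), r.astype(np.float32))        # correction in SINGLE
            x = x + z.astype(np.float64)                         # update in DOUBLE
        return x

    # Example on a well-conditioned random system.
    rng = np.random.default_rng(1)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual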
