  1. Power-Aware Performance of Mixed-Precision Linear Solvers for FPGAs and GPGPUs. Tennessee Advanced Computing Laboratory, University of Tennessee. July 14th, 2010. JunKyu Lee, Junqing Sun, Gregory D. Peterson, Robert J. Harrison, Robert J. Hinde. This work was partially supported by the National Science Foundation, grant NSF CHE-0625598.

  2. Overview of the Presentation
     - High performance computational science applications
     - Mixed precision linear system solvers
     - Accelerators (GPGPUs / FPGAs)
     - Power
     - Precision
     - Power-aware performance of mixed precision solvers for GPGPUs and FPGAs according to system characteristics (matrix size and condition number)

  3. Impact of Precision
     Lowering ALU precision makes each ALU smaller, which allows more ALUs in a fixed area (more transistors available per ALU), shorter wires, and shorter pipelines; more ALUs and a higher clock rate mean SPEED UP!!

  4. Mixed Precision Solvers
     A mixed precision solver: 1. employs multiple precisions; 2. uses lower precision (faster) for computationally intensive tasks and higher precision (slower) for refinement.
     Goals: high performance (lower precision computation) and numeric accuracy (higher precision refinement).
     [Diagram: digital computers perform static, finite precision computation and are error prone; iterative refinement (James Wilkinson, 1948) yields better numeric results; mixed precision solvers (J. Langou et al., 2006; J. Sun et al., 2008) yield better numeric results and high performance for computational science applications solving Ax = b for the solution x.]

  5. Solving a Linear System Equation
     To solve Ax = b:
     1. LU decomposition: A = LU       (2/3 n^3 ops)
     2. Forward substitution: Ly = b   (n^2 ops)
     3. Back substitution:    Ux = y   (n^2 ops)
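     The three steps map directly onto standard LAPACK routines; below is a minimal Python sketch using SciPy (the routines are SciPy's, the random test system is an illustrative assumption):

         import numpy as np
         from scipy.linalg import lu_factor, lu_solve

         n = 1000
         A = np.random.rand(n, n)
         b = np.random.rand(n)

         lu, piv = lu_factor(A)      # step 1: LU decomposition with partial pivoting, ~(2/3)n^3 ops
         x = lu_solve((lu, piv), b)  # steps 2-3: forward and back substitution, ~2n^2 ops
         print(np.linalg.norm(A @ x - b))  # residual norm should be near machine epsilon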

  6. Mixed Precision Algorithm
     To solve Ax = b (P is a permutation matrix, r is the residual vector):
     Approximation (computationally expensive, O(n^3)) - employ lower precision P_I for faster computation:
       Step 1: LUPP(A);                O(n^3), precision P_I
               Solve LUx(1) = Pb;      O(n^2), precision P_I
     Refinement (computationally less expensive, O(n^2)) - employ higher precision P_H for accuracy:
       for (i = 1 until x(i) is accurate enough)
         Step 2: r(i) = b - A x(i);    O(n^2), precision P_H
         Step 3: LUz(i) = P r(i);      O(n^2), precision P_I
         Step 4: x(i+1) = x(i) + z(i); O(n),   precision P_H
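     For concreteness, here is a minimal NumPy/SciPy sketch of the algorithm, with float32 standing in for the lower precision P_I and float64 for the higher precision P_H (the tolerance and iteration cap are illustrative assumptions):

         import numpy as np
         from scipy.linalg import lu_factor, lu_solve

         def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
             # Step 1: LUPP(A) in lower precision P_I, O(n^3)
             lu, piv = lu_factor(A.astype(np.float32))
             # Initial solve LUx(1) = Pb in P_I, O(n^2)
             x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
             for _ in range(max_iter):
                 r = b - A @ x  # Step 2: residual in P_H, O(n^2)
                 if np.linalg.norm(r, np.inf) < tol * np.linalg.norm(b, np.inf):
                     break      # x(i) is accurate enough
                 # Step 3: correction LUz(i) = Pr(i) in P_I, O(n^2)
                 z = lu_solve((lu, piv), r.astype(np.float32))
                 x = x + z      # Step 4: update in P_H, O(n)
             return x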

  7. How Does the Mixed Precision Algorithm Work?
     Successful convergence depends on the condition number of the matrix.
     [Diagram: from the iteration-1 solution x of Ax = b, the residual r = b - Ax is computed; the correction z = A^-1 r is obtained by solving Az = r; the updated iterate x at iteration 2 lies closer to the exact solution.]

  8. Mixed Precision Linear Solvers for GPGPUs and FPGAs
     GPGPUs: single precision for LUPP -> double precision refinement -> converged? If yes, DONE; if not, redo LUPP in double precision.
     FPGAs: arbitrary precision for LUPP -> arbitrary precision refinement -> converged? If yes, DONE; if not, redo LUPP in another (higher) arbitrary precision.
     A sketch of this convergence check and fallback follows below.
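     The following sketch shows the GPGPU-style flow, reusing the mixed_precision_solve sketch from slide 6 (the tolerance is an illustrative assumption):

         import numpy as np

         def solve_with_fallback(A, b, tol=1e-12):
             # lower-precision LUPP plus higher-precision refinement
             x = mixed_precision_solve(A, b, tol=tol)
             if np.linalg.norm(b - A @ x, np.inf) < tol * np.linalg.norm(b, np.inf):
                 return x  # converged: DONE
             # otherwise redo the factorization entirely in double precision
             return np.linalg.solve(A, b)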

  9. Benefits of Mixed Precision Linear Solvers for FPGAs
     1. FPGAs can employ arbitrary precision computation (selecting a precision based on the condition number).
     2. Lower precision -> smaller, faster ALUs -> more ALUs (quadratic growth). Table I shows the significant cost difference between lower precision and higher precision multiplication in FPGAs; a sketch of the resulting PE counts follows below.

     Table I. Number of DSP48Es for a multiplier on a Xilinx XC5VLX330T
     Exponent | Mantissa       | DSP48Es per multiplier
     8        | 16             | 1
     8        | 17-23 (single) | 2   (5x speedup)
     11       | 24-33          | 4
     11       | 34-40          | 6
     11       | 41-50          | 9
     11       | 51             | 12
     11       | 52 (double)    | 10  (1x)
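     The table translates directly into PE counts, since the XC5VLX330T provides 192 DSP48E slices in total. A minimal sketch (table values from the slide; the helper names are ours):

         DSP48E_TOTAL = 192  # DSP48E slices on a Xilinx XC5VLX330T

         def dsp48e_per_multiplier(mantissa_bits):
             # Table I: (max mantissa width, DSP48Es per multiplier)
             for upper, cost in [(16, 1), (23, 2), (33, 4), (40, 6), (50, 9), (51, 12), (52, 10)]:
                 if mantissa_bits <= upper:
                     return cost
             raise ValueError("mantissa width beyond double precision")

         def num_pes(mantissa_bits):
             return DSP48E_TOTAL // dsp48e_per_multiplier(mantissa_bits)

         print(num_pes(23), num_pes(52))  # 96 single-precision PEs vs 19 double-precision PEs

     These counts reproduce the PE column of Table II on slide 11.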

  10. Power-Aware Performance
     Consider dynamic power consumption: the incremental performance benefit of one additional Watt.
     Total power U = static S + dynamic D = S + C x Volt^2 x freq = S + alpha x freq,
     where alpha_MAX = (U_MAX - S)/freq = D_MAX/freq.
     Three kinds of performance metrics:
       F        := Flops/sec         (time-based performance)
       F_CLK    := Flops/clock-cycle (clock-based performance), F_CLK = F/freq
       F_WATT-D := Flops/Watt        (power-based performance)
     Relation between clock-based performance and power-based performance:
       MAX(F_WATT-D) = F/D_MAX = F/(alpha_MAX x freq) = F_CLK/alpha_MAX
     Design the logic to obtain maximum Flops/cycle to save power!!
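     A worked example of these relations (all numbers are illustrative assumptions, not measurements):

         freq = 100e6   # clock frequency in Hz (assumed)
         F = 40e9       # time-based performance in Flops/sec (assumed)
         D_max = 20.0   # maximum dynamic power in Watts (assumed)

         F_clk = F / freq              # clock-based performance: 400 Flops/cycle
         alpha_max = D_max / freq      # dynamic power per unit frequency: 2e-7 W/Hz
         F_watt_d = F_clk / alpha_max  # power-based performance: 2e9 Flops/W = 2 GFlops/W
         print(F_watt_d, F / D_max)    # the two forms of MAX(F_WATT-D) agree

     Doubling F_clk at a fixed alpha_max doubles the achievable Flops/Watt, which is why the logic should be designed for maximum Flops/cycle.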

  11. Methodology
     Performance estimation for GPGPUs: MAGMA v0.2 on a Tesla C1060 with an Intel Xeon 2.93 GHz multicore host.
     Performance estimation for FPGAs: performance modeling (Xilinx XC5VLX330T) based on our previous work.
     Precision choice for the FPGA performance estimation:
       Mantissa bit width M = log2(condition number) - 1
       Exponent bit width E = 8 (if M <= 23) or 11 (if 24 <= M <= 52)
     FPGA performance = 2 (Flops) x number of PEs x clock rate

     Table II. Number of PEs on a Xilinx XC5VLX330T
     Condition number | Mantissa bits | # of PEs
     1 - 2^17         | 1-16          | 192
     2^17 - 2^24      | 17-23         | 96
     2^24 - 2^34      | 24-33         | 48
     2^34 - 2^41      | 34-40         | 32
     2^41 - 2^51      | 41-50         | 21
     2^51 - 2^52      | 51            | 16
     2^52 - 2^53      | 52            | 19
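     A sketch of this precision choice and performance model (num_pes is the helper sketched after slide 9; the clock rate argument and the 52-bit cap are assumptions):

         import math

         def choose_precision(condition_number):
             m = max(1, int(math.log2(condition_number)) - 1)  # mantissa bits, per the slide
             m = min(m, 52)                                    # cap at double precision (assumption)
             e = 8 if m <= 23 else 11                          # exponent bits
             return e, m

         def fpga_gflops(condition_number, clock_hz):
             _, m = choose_precision(condition_number)
             return 2 * num_pes(m) * clock_hz / 1e9  # 2 Flops x number of PEs x clock rate

         print(fpga_gflops(2**20, 200e6))  # kappa = 2^20 -> 19-bit mantissa -> 96 PEs -> 38.4 GFlops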

  12. Tesla C1060 - MAGMA v0.2: {F_WATT-D, F_CLK, F}
     [3-D surface plot: mixed precision solver performance on the hybrid system (Tesla C1060 + Intel Xeon 2.93 GHz). Vertical axis: performance in (GFlops/Watt, Flops/cycle, GFlops/sec), ranging up to {1.9, 192, 250}; horizontal axes: matrix size (log2 base) and infinite norm condition number (log2 base).]

  13. FPGA (XC5VLX330T): {F_WATT-D, F_CLK, F}
     [3-D surface plot: mixed precision solver performance on the FPGA (XC5VLX330T). Vertical axis: performance in (GFlops/Watt, Flops/cycle, GFlops/sec), ranging up to {3.8, 417, 50}; horizontal axes: matrix size (log2 base) and infinite norm condition number (log2 base).]

  14. GFLOPs (Blue: FPGA / Green: GPU)
     [Plot: mixed precision solver performance for 8192x8192 matrices (x: GPU, o: FPGA); GFlops (0-250) versus infinite norm condition number (log2 base, 5-55).]

  15. FLOPs/Cycle (Blue: FPGA / Green: GPU)
     [Plot: mixed precision solver performance for 8192x8192 matrices (x: GPU, o: FPGA); Flops/clock-cycle (0-400) versus infinite norm condition number (log2 base, 5-55).]

  16. GFLOPs/Watt (Blue: FPGA / Green: GPU)
     [Plot: mixed precision solver performance for 8192x8192 matrices (x: GPU, o: FPGA); GFlops/Watt (0-4) versus infinite norm condition number (log2 base, 5-55).]

  17. Discussions and Conclusions
     - FPGAs can employ arbitrary precisions, while GPUs can employ only single or double precision.
     - To save power, it is important to design the logic for high clock-based performance (Flops/cycle).
     - In this case study of mixed precision linear solvers, the FPGA shows better power-based performance than the GPGPU, because the flexibility of design choices on the FPGA yields higher clock-based performance.

  18. Thank you. Any questions?
