
  1. Wanted: Floating-Point Add Round-off Error Instruction
Marat Dukhan, Richard Vuduc, Jason Riedy
School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology
June 23, 2016
M. Dukhan et al (Georgia Tech), "Wanted: FPADDRE Instruction", PMMA'16, 1 / 20

  2. Outline
1. Introduction
2. Error-Free Transformations
3. Performance Evaluation

  3. High-Precision Arithmetic in High Demand
Numerical reproducibility
◮ Dynamic work distribution across threads
◮ Variations in SIMD- and instruction-level parallelism
Mathematical functions
◮ IEEE 754-2008 recommends correct rounding for LibM functions
Growing number of scientific applications
◮ David Bailey's presentation (2005): 8 areas of science
◮ His recent presentation at an SC BoF (2014): 12 areas of science

  4. High-Precision Arithmetic Algorithms
Quadruple precision
◮ Software implementation using integer arithmetic
Double-double arithmetic
◮ Represent a number as an unevaluated sum of two doubles: x = x_hi + x_lo
Compensated algorithms
◮ High-precision summation, dot product, polynomial evaluation
[Chart: addition latency in cycles for the quad, double-double, and double-double-with-FPADDRE formats on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner)]
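To make the double-double idea above concrete, here is a minimal sketch (our own illustration, not code from the talk; the `dd_t` name and helper are hypothetical): the low word stores bits that a plain double addition would round away.

```c
#include <math.h>

/* Hypothetical type, for illustration only: a double-double value is an
   unevaluated sum hi + lo of two doubles, giving about 106 significand bits. */
typedef struct { double hi, lo; } dd_t;

/* Build the value 1 + 2^-60, which no single double can represent exactly:
   the low word keeps the bits that fall below the ulp of the high word. */
dd_t dd_one_plus_tiny(void) {
    dd_t v = { 1.0, ldexp(1.0, -60) };
    return v;
}
```

Evaluating `v.hi + v.lo` in one double collapses back to `1.0`, while the pair itself still carries the low part exactly.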

  5. Outline
1. Introduction
2. Error-Free Transformations
3. Performance Evaluation

  6. Error-Free Multiplication
p + e = a · b, where p = double(a · b)
Error-Free Multiplication with FMA:
p := FPMUL a * b
e := FMA a * b - p

  7. Error-Free Addition
s + e = a + b, where s = double(a + b)
Error-Free Addition (Knuth, 1997):
s := FPADD a + b
b_virtual := FPADD s - a
a_virtual := FPADD s - b_virtual
b_roundoff := FPADD b - b_virtual
a_roundoff := FPADD a - a_virtual
e := FPADD a_roundoff + b_roundoff
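In C, Knuth's six-operation sequence looks like the following (`two_sum` is our own naming). Note that each operation depends on the previous one, which is exactly the latency chain FPADDRE targets:

```c
/* Error-free addition (Knuth): on return, s + e == a + b exactly,
   where s is the rounded double-precision sum. */
void two_sum(double a, double b, double *s, double *e) {
    double sum        = a + b;
    double b_virtual  = sum - a;
    double a_virtual  = sum - b_virtual;
    double b_roundoff = b - b_virtual;
    double a_roundoff = a - a_virtual;
    *s = sum;
    *e = a_roundoff + b_roundoff;
}
```

This pattern relies on strict IEEE 754 semantics; value-unsafe optimizations such as `-ffast-math` would cancel the intermediate terms algebraically and destroy the result.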

  8. FPADD3 Instruction
Ogita et al (2005) suggested an FPADD3 instruction to accelerate error-free addition
FPADD3 adds 3 floating-point numbers without intermediate rounding
No general-purpose CPU or GPU ever implemented this instruction
Error-Free Addition with FPADD3 (Ogita et al, 2005):
s := FPADD a + b
e := FPADD3 a + b - s

  9. FPADDRE Instruction
We suggest an instruction, Floating-Point Add Round-off Error (FPADDRE), to compute the round-off error of floating-point addition
The instruction offers two benefits for error-free addition:
◮ Replace 5 FPADD instructions with 1 FPADDRE
◮ Break the dependency chain between the sum and the round-off error
Error-Free Addition with FPADDRE:
s := FPADD a + b
e := FPADDRE a + b
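Since no shipping ISA implements FPADDRE, its semantics can only be emulated in software. One sketch (our own helper name, not from the talk) defines it by the TwoSum error term:

```c
/* Emulation of the proposed FPADDRE semantics: return the round-off
   error of a + b.  In hardware this would be one instruction, computed
   independently of the sum; in software it costs five dependent FPADDs. */
double fpaddre(double a, double b) {
    double s         = a + b;
    double b_virtual = s - a;
    double a_virtual = s - b_virtual;
    return (a - a_virtual) + (b - b_virtual);
}
```

With a real FPADDRE instruction, `s = a + b` and `e = fpaddre(a, b)` could issue in the same cycle, since neither depends on the other's result.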

  10. Reusing FPADD Logic in FPADDRE
[Figure: bit-level schema of the FPADD and FPADDRE operations for the case of operands with the same sign and overlapping mantissas]
The operations differ only in two aspects: addition or subtraction of a sticky bit, and the bits copied to the resulting mantissa

  11. Outline
1. Introduction
2. Error-Free Transformations
3. Performance Evaluation

  12. Simulation
To estimate the performance effect of the FPADDRE instruction, we implemented several high-precision algorithms:
◮ Double-double scalar addition and multiplication
◮ Double-double matrix multiplication
◮ Compensated dot product
◮ Polynomial evaluation via the compensated Horner scheme
We then modeled FPADDRE as an instruction with the performance characteristics of addition and benchmarked the algorithms on four microarchitectures:
◮ Intel Haswell
◮ Intel Skylake
◮ AMD Steamroller
◮ Intel Knights Corner co-processor
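As an illustration of the kind of kernel being benchmarked, here is a minimal double-double addition in C (our own sketch, not the FPplus implementation); the `two_sum` call marks the spot where one FPADDRE would replace five dependent FPADDs:

```c
#include <math.h>

typedef struct { double hi, lo; } dd_t;

/* Knuth's error-free addition: s + e == a + b exactly. */
static void two_sum(double a, double b, double *s, double *e) {
    double sum = a + b;
    double bv  = sum - a;
    double av  = sum - bv;
    *s = sum;
    *e = (a - av) + (b - bv);
}

/* "Sloppy" double-double addition: the high words are added error-free,
   the low words and the error term are accumulated, then the result is
   renormalized with a fast two-sum (valid because |s| >= |e| here). */
dd_t dd_add(dd_t x, dd_t y) {
    double s, e;
    two_sum(x.hi, y.hi, &s, &e);  /* a hardware FPADDRE would yield e directly */
    e += x.lo + y.lo;
    dd_t r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}
```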

  13. Double-double Latency
[Chart: latency reduction from FPADDRE for double-double addition and multiplication on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); reductions range from 36% to 55% for addition and from 0% to 11% for multiplication]

  14. Double-double Throughput
[Chart: throughput improvement from FPADDRE for double-double addition and multiplication on the same four microarchitectures; improvements of up to 103%, with the remaining bars at 36%, 34%, 18%, 16%, 14%, 11%, and 0%]

  15. Double-double Matrix Multiplication
[Chart: double-double matrix multiplication speedup with the FPADDRE instruction: Intel Skylake 93%, Intel Haswell 90%, AMD Steamroller 84%, Intel Xeon Phi (Knights Corner) 28%]

  16. Compensated Dot Product
[Charts: cycles per element versus array size (1K to 16M elements) for the compensated dot product with and without FPADDRE, on Intel Skylake, AMD Steamroller, Intel Haswell, and Intel Knights Corner]

  17. Compensated Polynomial Evaluation
[Chart: latency in cycles for the Horner scheme, the compensated Horner scheme, and the compensated Horner scheme with FPADDRE on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); bar values include 323, 229, 191, 157, 157, 153, 136, 130, 75, 69, 64, and 57 cycles]

  18. Public Release
We open-sourced the software that was developed as part of this research
◮ The implementation, unit tests, and benchmarks are available at github.com/Maratyszcza/FPplus
◮ The paper preprint is at arxiv.org/abs/1603.00491

  19. Summary
We suggest a new instruction, Floating-Point Add Round-off Error, to compute the round-off error of floating-point addition
Performance simulations suggest that the proposed instruction could accelerate high-precision computations by up to 2x

  20. Funding
This research was supported in part by the National Science Foundation (NSF) under NSF CAREER award number 1339745; the U.S. Dept. of Energy (DOE), Office of Science, Advanced Scientific Computing Research, under award DE-FC02-10ER26006/DE-SC0004915; and the Defense Advanced Research Projects Agency (DARPA) under agreement #HR0011-13-2-0001.
Disclaimer: Any opinions, conclusions, or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.
