

  1. A new multiplication algorithm for extended precision using floating-point expansions
     Valentina Popescu, Jean-Michel Muller, Ping Tak Peter Tang
     ARITH 23, July 2016
     CAMPARY: CudA Multiple Precision ARithmetic librarY

  2. Target applications
     Need massively parallel computations → high-performance computing using graphics processors (GPUs);
     need more precision than standard formats provide (up to a few hundred bits) → extend precision using floating-point expansions.

  3. Target applications (cont.)
     Chaotic dynamical systems: bifurcation analysis; computing periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor);
     [figure: bifurcation plot, axes x1 vs. x2]
     celestial mechanics (e.g., long-term stability of the solar system).
     Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization, etc.

  4. Extended precision
     Existing libraries:
     GNU MPFR - not ported to GPUs;
     GARPREC & CUMP - tuned for big-array operations: data generated on the host, operations on the device;
     QD & GQD - limited to double-double and quad-double; no correctness proofs.

  5. Extended precision (cont.)
     What we need: support for arbitrary precision; runs on both CPU and GPU; easy to use.
     → CAMPARY (CudA Multiple Precision ARithmetic librarY)

  6. CAMPARY (CudA Multiple Precision ARithmetic librarY)
     Our approach: a multiple-term representation - floating-point expansions.

  7. CAMPARY (cont.)
     Pros:
     - uses the directly available, highly optimized native FP infrastructure;
     - straightforwardly portable to highly parallel architectures, such as GPUs;
     - sufficiently simple and regular algorithms for addition.

  8. CAMPARY (cont.)
     Cons:
     - more than one representation for a value;
     - existing multiplication algorithms do not generalize well to an arbitrary number of terms;
     - rigorous error analysis is difficult → lack of thorough error bounds.

  9. Non-overlapping expansions
     R = 1.11010011e-1 can be represented, using a p = 5 (radix-2) system, as:
     Less compact, R = y0 + y1 + y2 + y3 + y4 + y5:
       y0 = 1.0000e-1; y1 = 1.0000e-2; y2 = 1.0000e-3; y3 = 1.0000e-5; y4 = 1.0000e-8; y5 = 1.0000e-9.
     R = x0 + x1 + x2:
       x0 = 1.1000e-1; x1 = 1.0010e-3; x2 = 1.0110e-6.
     Most compact, R = z0 + z1:
       z0 = 1.1101e-1; z1 = 1.1000e-8.

  10. Non-overlapping expansions (cont.)
     Solution: the FP expansions are required to be non-overlapping.
     Definition (ulp-nonoverlapping): an expansion u0, u1, ..., u_{n-1} is ulp-nonoverlapping if, for all 0 < i < n, |u_i| <= ulp(u_{i-1}).
     Example, p = 5 (radix 2): x0 = 1.1010e-2; x1 = 1.1101e-7; x2 = 1.0000e-11; x3 = 1.1000e-17.

  11. Non-overlapping expansions (cont.)
     Restriction: n <= 12 for single precision and n <= 39 for double precision.

  12. Error-Free Transforms: Fast2Sum & 2MultFMA
     Algorithm 1 (Fast2Sum(a, b)). Requirement: e_a >= e_b.
       s <- RN(a + b)
       z <- RN(s - a)
       e <- RN(b - z)
       return (s, e)
     Uses 3 FP operations.
     Algorithm 2 (2MultFMA(a, b)). Requirement: e_a + e_b >= e_min + p - 1.
       p <- RN(a * b)
       e <- fma(a, b, -p)
       return (p, e)
     Uses 2 FP operations.

  13. Existing multiplication algorithms
     Priest's multiplication [Pri91]:
     - very complex and costly;
     - based on scalar products;
     - uses renormalization after each step;
     - computes the entire result and "truncates" a posteriori;
     - comes with an error bound and a correctness proof.

  14. Existing multiplication algorithms (cont.)
     Quad-double multiplication in the QD library [Hida et al. 07]:
     - does not generalize straightforwardly;
     - can lead to O(n^3) complexity;
     - the worst-case error bound is pessimistic;
     - no correctness proof is published.

  15. New multiplication algorithms
     Requires: ulp-nonoverlapping FP expansions x = (x0, x1, ..., x_{n-1}) and y = (y0, y1, ..., y_{m-1});
     ensures: ulp-nonoverlapping FP expansion π = (π0, π1, ..., π_{r-1}).
     Let me explain it with an example...

  16. Example: n = 4, m = 3 and r = 4

  17. Example: n = 4, m = 3 and r = 4
     paper-and-pencil intuition; on-the-fly "truncation";

  18. Example: n = 4, m = 3 and r = 4
     paper-and-pencil intuition; on-the-fly "truncation";
     term-times-expansion products x_i · y; error-correction term π_r.

  19. Example: n = 4, m = 3 and r = 4
     floor(r·p/b) + 2 containers of size b (s.t. 3b > 2p);
     b + c = p - 1, s.t. we can add 2^c numbers without error (binary64 → b = 45, binary32 → b = 18);
     starting exponent e = e_{x0} + e_{y0};
     each bin's LSB has a fixed weight;
     bins initialized with 1.5 · 2^{e-(i+1)b+p-1};

  20. Example: n = 4, m = 3 and r = 4 (cont.)
     the number of leading bits, ℓ;
     accumulation done using a Fast2Sum and an addition [Rump09].

  21. Example: n = 4, m = 3 and r = 4

  22. Example: n = 4, m = 3 and r = 4
     subtract the initial value;

  23. Example: n = 4, m = 3 and r = 4
     subtract the initial value;
     apply a renormalization step to B:
     - Fast2Sum and branches;
     - renders the result ulp-nonoverlapping.

  24. Error bound
     Exact result computed as:
       sum_{i=0}^{n-1} sum_{j=0}^{m-1} x_i y_j = sum_{k=0}^{m+n-2} sum_{i+j=k} x_i y_j.
     On-the-fly "truncation" → three error sources:

  25. Error bound (cont.)
     Discarded partial products:
     - we accumulate only the first r + 1 anti-diagonals, k = 0, ..., r;
     - sum_{k=r+1}^{m+n-2} sum_{i+j=k} |x_i y_j| <= |x_0 y_0| · 2^{-(p-1)(r+1)} · [ 2^{-(p-1)} / (1 - 2^{-(p-1)})^2 + (m + n - r - 2) / (1 - 2^{-(p-1)}) ].
