Accurate Complex Multiplication in Floating-Point Arithmetic
Vincent Lefèvre, Jean-Michel Muller
Université de Lyon, CNRS, Inria, France
Arith26, Kyoto, June 2019
Accurate complex multiplication in FP arithmetic
◮ ω · x, with emphasis on the case where ℜ(ω) and ℑ(ω) are double-word numbers, i.e., pairs (high-order, low-order) of FP numbers;
◮ applications: Fourier transforms, iterated products.
Assumptions:
◮ radix-2, precision-p, FP arithmetic;
◮ rounded-to-nearest (RN) FP operations;
◮ an FMA instruction is available;
◮ underflow/overflow do not occur.
Bound on the relative error of (real) operations:
  |RN(a + b) − (a + b)| ≤ (u / (1 + u)) · |a + b| < u · |a + b|,
where u (the rounding unit) equals 2^(−p).
Some variables: double-word (DW) numbers
◮ also called "double-double" in the literature;
◮ v ∈ ℝ is represented by a pair of FP numbers v_h and v_ℓ such that
  v = v_h + v_ℓ,  |v_ℓ| ≤ (1/2) · ulp(v) ≤ u · |v|;
◮ algorithms and libraries for manipulating DW numbers: QD (Hida, Li & Bailey), Campary (Joldes, Popescu & others);
◮ they use the 2Sum, Fast2Sum & Fast2Mult algorithms (see later).
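To make the representation concrete, here is a minimal C sketch (binary64); the dw_t type and the example value are illustrative only and are not taken from the QD or Campary libraries.

    /* A double-word value v is carried as the unevaluated sum hi + lo of two
       FP numbers, with |lo| <= (1/2) ulp(v). */
    typedef struct { double hi, lo; } dw_t;

    /* Example: 1 + 2^-60 needs 61 significand bits, so it is not a binary64
       number, but it is exactly representable as a DW number:
       |lo| = 2^-60 <= (1/2) ulp(1.0) = 2^-53. */
    static const dw_t one_plus_tiny = { 1.0, 0x1p-60 };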
Naive algorithms for complex FP multiplication
◮ straightforward transcription of the formula
  z = (x^R + i·x^I) · (y^R + i·y^I) = (x^R·y^R − x^I·y^I) + i·(x^I·y^R + x^R·y^I);
◮ bad solution if the componentwise relative error is to be minimized;
◮ adequate solution if the normwise relative error is at stake
  (ẑ approximates z with normwise error |(ẑ − z)/z|).
Algorithms:
◮ if no FMA instruction is available:
  ẑ^R = RN(RN(x^R·y^R) − RN(x^I·y^I)),
  ẑ^I = RN(RN(x^R·y^I) + RN(x^I·y^R));     (1)
◮ if an FMA instruction is available:
  ẑ^R = RN(x^R·y^R − RN(x^I·y^I)),
  ẑ^I = RN(x^R·y^I + RN(x^I·y^R)).         (2)
Naive algorithms for complex multiplication
◮ if no FMA instruction is available:
  ẑ^R = RN(RN(x^R·y^R) − RN(x^I·y^I)),
  ẑ^I = RN(RN(x^R·y^I) + RN(x^I·y^R));     (1)
◮ if an FMA instruction is available:
  ẑ^R = RN(x^R·y^R − RN(x^I·y^I)),
  ẑ^I = RN(x^R·y^I + RN(x^I·y^R)).         (2)
Asymptotically optimal bounds on the normwise relative error of (1) and (2) are known:
• Brent et al. (2007): bound √5 · u for (1),
• Jeannerod et al. (2017): bound 2 · u for (2).
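A minimal C sketch of these two naive formulas (the function names are ours); in the FMA version, the C99 fma fuses one product with the final addition or subtraction, exactly as in formula (2).

    #include <math.h>

    /* Formula (1): both products and the final +/- are each rounded once. */
    static void naive_cmul(double xR, double xI, double yR, double yI,
                           double *zR, double *zI)
    {
        *zR = (xR * yR) - (xI * yI);
        *zI = (xR * yI) + (xI * yR);
    }

    /* Formula (2): fma(a, b, c) computes RN(a*b + c), so only the
       right-hand product is rounded before the final operation. */
    static void naive_cmul_fma(double xR, double xI, double yR, double yI,
                               double *zR, double *zI)
    {
        *zR = fma(xR, yR, -(xI * yI));
        *zI = fma(xR, yI, xI * yR);
    }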
Accurate complex multiplication
Our goal:
• smaller normwise relative errors,
• closer to the best possible one (i.e., u, unless we output DW numbers),
• at the cost of more complex algorithms.
We consider the product ω · x, with ω = ω^R + i·ω^I and x = x^R + i·x^I, where:
◮ ω^R and ω^I are DW numbers (the special case where they are FP numbers is considered later),
◮ x^R and x^I are FP numbers.
Basic building blocks: Error-Free Transforms
Expressing a + b as a DW number
Algorithm 1: 2Sum(a, b). Returns s and t such that s = RN(a + b) and t = a + b − s.
  s ← RN(a + b)
  a′ ← RN(s − b)
  b′ ← RN(s − a′)
  δ_a ← RN(a − a′)
  δ_b ← RN(b − b′)
  t ← RN(δ_a + δ_b)
Expressing a · b as a DW number
Algorithm 2: Fast2Mult(a, b). Returns π and ρ such that π = RN(ab) and ρ = ab − π.
  π ← RN(ab)
  ρ ← RN(ab − π)
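A C transcription of these two building blocks could look as follows (a sketch assuming binary64, round-to-nearest and a correctly rounded fma; the function names are ours):

    #include <math.h>

    /* 2Sum: returns s = RN(a + b) and t such that s + t = a + b exactly. */
    static void two_sum(double a, double b, double *s, double *t)
    {
        double sum = a + b;
        double a1  = sum - b;
        double b1  = sum - a1;
        double da  = a - a1;
        double db  = b - b1;
        *s = sum;
        *t = da + db;
    }

    /* Fast2Mult: returns pi = RN(a*b) and rho = a*b - pi (exact, via the FMA). */
    static void fast_two_mult(double a, double b, double *pi, double *rho)
    {
        *pi  = a * b;
        *rho = fma(a, b, -(*pi));
    }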
The multiplication algorithm
◮ ω^R = ℜ(ω) and ω^I = ℑ(ω) are DW numbers, i.e.,
  ω = ω^R + i·ω^I = (ω^R_h + ω^R_ℓ) + i·(ω^I_h + ω^I_ℓ),
  where ω^R_h, ω^R_ℓ, ω^I_h, and ω^I_ℓ are FP numbers that satisfy:
  • |ω^R_ℓ| ≤ (1/2) · ulp(ω^R) ≤ u · |ω^R|;
  • |ω^I_ℓ| ≤ (1/2) · ulp(ω^I) ≤ u · |ω^I|.
◮ Real part z^R of the result (the imaginary part is similar):
  • compute the difference v^R_h of the high-order parts of ω^R_h·x^R and ω^I_h·x^I,
  • add an approximated sum γ^R_ℓ of all the error terms that may have a significant influence on the normwise relative error.
◮ rather straightforward algorithms: the tricky part is the error bounds.
Real part (ω^R_h + ω^R_ℓ) · x^R − (ω^I_h + ω^I_ℓ) · x^I
[Figure: data-flow graph of the real-part computation — ω^I_ℓ·x^I gives t^R, an FMA gives π^R_ℓ, two Fast2Mult calls give (P^R_h, P^R_ℓ) and (Q^R_h, Q^R_ℓ), a 2Sum of Q^R_h and −P^R_h gives (v^R_h, v^R_ℓ), the low-order terms are accumulated into r^R_ℓ, s^R_ℓ and γ^R_ℓ, and a final addition produces z^R.]
The multiplication algorithm
Algorithm 3: computes ω · x, where the real & imaginary parts of ω = (ω^R_h + ω^R_ℓ) + i·(ω^I_h + ω^I_ℓ) are DW, and the real & imaginary parts of x are FP.
 1: t^R ← RN(ω^I_ℓ · x^I)
 2: π^R_ℓ ← RN(ω^R_ℓ · x^R − t^R)
 3: (P^R_h, P^R_ℓ) ← Fast2Mult(ω^I_h, x^I)
 4: r^R_ℓ ← RN(π^R_ℓ − P^R_ℓ)
 5: (Q^R_h, Q^R_ℓ) ← Fast2Mult(ω^R_h, x^R)
 6: s^R_ℓ ← RN(Q^R_ℓ + r^R_ℓ)
 7: (v^R_h, v^R_ℓ) ← 2Sum(Q^R_h, −P^R_h)
 8: γ^R_ℓ ← RN(v^R_ℓ + s^R_ℓ)
 9: return z^R = RN(v^R_h + γ^R_ℓ)  (real part)
10: t^I ← RN(ω^I_ℓ · x^R)
11: π^I_ℓ ← RN(ω^R_ℓ · x^I + t^I)
12: (P^I_h, P^I_ℓ) ← Fast2Mult(ω^I_h, x^R)
13: r^I_ℓ ← RN(π^I_ℓ + P^I_ℓ)
14: (Q^I_h, Q^I_ℓ) ← Fast2Mult(ω^R_h, x^I)
15: s^I_ℓ ← RN(Q^I_ℓ + r^I_ℓ)
16: (v^I_h, v^I_ℓ) ← 2Sum(Q^I_h, P^I_h)
17: γ^I_ℓ ← RN(v^I_ℓ + s^I_ℓ)
18: return z^I = RN(v^I_h + γ^I_ℓ)  (imaginary part)
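A direct C transcription of Algorithm 3 might look as follows (a sketch only, reusing the two_sum and fast_two_mult helpers sketched earlier; the comments give the corresponding line numbers of Algorithm 3).

    #include <math.h>   /* fma; two_sum and fast_two_mult as sketched above */

    /* z = ((wRh + wRl) + i (wIh + wIl)) * (xR + i xI), result rounded to FP. */
    static void cmul_dw_fp(double wRh, double wRl, double wIh, double wIl,
                           double xR, double xI, double *zR, double *zI)
    {
        double Ph, Pl, Qh, Ql, vh, vl;

        /* real part */
        double tR  = wIl * xI;                 /* 1 */
        double piR = fma(wRl, xR, -tR);        /* 2 */
        fast_two_mult(wIh, xI, &Ph, &Pl);      /* 3 */
        double rR  = piR - Pl;                 /* 4 */
        fast_two_mult(wRh, xR, &Qh, &Ql);      /* 5 */
        double sR  = Ql + rR;                  /* 6 */
        two_sum(Qh, -Ph, &vh, &vl);            /* 7 */
        double gR  = vl + sR;                  /* 8 */
        *zR = vh + gR;                         /* 9 */

        /* imaginary part */
        double tI  = wIl * xR;                 /* 10 */
        double piI = fma(wRl, xI, tI);         /* 11 */
        fast_two_mult(wIh, xR, &Ph, &Pl);      /* 12 */
        double rI  = piI + Pl;                 /* 13 */
        fast_two_mult(wRh, xI, &Qh, &Ql);      /* 14 */
        double sI  = Ql + rI;                  /* 15 */
        two_sum(Qh, Ph, &vh, &vl);             /* 16 */
        double gI  = vl + sI;                  /* 17 */
        *zI = vh + gI;                         /* 18 */
    }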
The multiplication algorithm
Theorem 1. As soon as p ≥ 4, the normwise relative error η of Algorithm 3 satisfies
  η < u + 33u².
(remember: the best possible bound is u)
Remarks:
• the condition "p ≥ 4" always holds in practice;
• Algorithm 3 is easily transformed (see later) into an algorithm that returns the real and imaginary parts of z as DW numbers.
Sketch of the proof
◮ first, we show that
  |z^R − ℜ(ωx)| ≤ α·n^R + β·N^R,
  |z^I − ℑ(ωx)| ≤ α·n^I + β·N^I,
with
  n^R = |ω^R·x^R − ω^I·x^I|,   N^R = |ω^R·x^R| + |ω^I·x^I|,
  n^I = |ω^R·x^I + ω^I·x^R|,   N^I = |ω^R·x^I| + |ω^I·x^R|,
  α = u + 3u² + u³,
  β = 15u² + 38u³ + 39u⁴ + 22u⁵ + 7u⁶ + u⁷;
◮ then we deduce
  η² = [(z^R − ℜ(ωx))² + (z^I − ℑ(ωx))²] / [(ℜ(ωx))² + (ℑ(ωx))²]
     ≤ α² + (2αβ + β²) · [(N^R)² + (N^I)²] / [(n^R)² + (n^I)²];
◮ the theorem follows, by using (N^R)² + (N^I)² ≤ 2 · [(n^R)² + (n^I)²].
Obtaining the real and imaginary parts of z as DW numbers
◮ replace the FP addition z^R = RN(v^R_h + γ^R_ℓ) of line 9 of Algorithm 3 by a call to 2Sum(v^R_h, γ^R_ℓ),
◮ replace the FP addition z^I = RN(v^I_h + γ^I_ℓ) of line 18 by a call to 2Sum(v^I_h, γ^I_ℓ);
◮ resulting relative error:
  √241 · u² + O(u³) ≈ 15.53 u² + O(u³)   (instead of u + 33u²).
Interest:
• iterated product z_1 × z_2 × ⋯ × z_n: keep the real and imaginary parts of the partial products as DW numbers,
• Fourier transforms: when computing z_1 ± ω·z_2, keep ℜ(ω·z_2) and ℑ(ω·z_2) as DW numbers before the ±.
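In the C sketch given after Algorithm 3, this only changes the last operation of each half; for the real part, a possible (illustrative) variant is:

    /* Same operations as the real-part half of cmul_dw_fp above, except that
       the final addition of line 9 is replaced by a 2Sum, so that the pair
       (zRh, zRl) represents the real part as a DW number. */
    static void cmul_dw_fp_real_dw(double wRh, double wRl, double wIh, double wIl,
                                   double xR, double xI, double *zRh, double *zRl)
    {
        double Ph, Pl, Qh, Ql, vh, vl;
        double tR  = wIl * xI;
        double piR = fma(wRl, xR, -tR);
        fast_two_mult(wIh, xI, &Ph, &Pl);
        double rR  = piR - Pl;
        fast_two_mult(wRh, xR, &Qh, &Ql);
        double sR  = Ql + rR;
        two_sum(Qh, -Ph, &vh, &vl);
        double gR  = vl + sR;
        two_sum(vh, gR, zRh, zRl);   /* line 9: 2Sum instead of RN(vRh + gammaRl) */
    }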
If ω^I and ω^R are floating-point numbers
ω^I_ℓ = ω^R_ℓ = 0 ⇒ Algorithm 3 becomes simpler:
Algorithm 4: complex multiplication ω · x, where ℜ(ω) and ℑ(ω) are FP numbers.
 1: (P^R_h, P^R_ℓ) ← Fast2Mult(ω^I, x^I)
 2: (Q^R_h, Q^R_ℓ) ← Fast2Mult(ω^R, x^R)
 3: s^R_ℓ ← RN(Q^R_ℓ − P^R_ℓ)
 4: (v^R_h, v^R_ℓ) ← 2Sum(Q^R_h, −P^R_h)
 5: γ^R_ℓ ← RN(v^R_ℓ + s^R_ℓ)
 6: return z^R = RN(v^R_h + γ^R_ℓ)  (real part)
 7: (P^I_h, P^I_ℓ) ← Fast2Mult(ω^I, x^R)
 8: (Q^I_h, Q^I_ℓ) ← Fast2Mult(ω^R, x^I)
 9: s^I_ℓ ← RN(Q^I_ℓ + P^I_ℓ)
10: (v^I_h, v^I_ℓ) ← 2Sum(Q^I_h, P^I_h)
11: γ^I_ℓ ← RN(v^I_ℓ + s^I_ℓ)
12: return z^I = RN(v^I_h + γ^I_ℓ)  (imaginary part)
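A corresponding C sketch (again reusing the two_sum and fast_two_mult helpers sketched earlier; the names are ours, and the comments give the line numbers of Algorithm 4):

    /* z = (wR + i wI) * (xR + i xI), with FP (not DW) wR and wI. */
    static void cmul_fp_fp(double wR, double wI, double xR, double xI,
                           double *zR, double *zI)
    {
        double Ph, Pl, Qh, Ql, vh, vl, s, g;

        /* real part: wR*xR - wI*xI */
        fast_two_mult(wI, xI, &Ph, &Pl);   /* 1 */
        fast_two_mult(wR, xR, &Qh, &Ql);   /* 2 */
        s = Ql - Pl;                       /* 3 */
        two_sum(Qh, -Ph, &vh, &vl);        /* 4 */
        g = vl + s;                        /* 5 */
        *zR = vh + g;                      /* 6 */

        /* imaginary part: wR*xI + wI*xR */
        fast_two_mult(wI, xR, &Ph, &Pl);   /* 7 */
        fast_two_mult(wR, xI, &Qh, &Ql);   /* 8 */
        s = Ql + Pl;                       /* 9 */
        two_sum(Qh, Ph, &vh, &vl);         /* 10 */
        g = vl + s;                        /* 11 */
        *zI = vh + g;                      /* 12 */
    }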
Real part
[Figure, shown over three slides: the data-flow graph of the real-part computation from Algorithm 3, illustrating its simplification when ω^R_ℓ = ω^I_ℓ = 0 — the operations producing t^R, π^R_ℓ and r^R_ℓ are no longer needed; only the two Fast2Mult calls, the 2Sum and the low-order additions remain.]
If ω^I and ω^R are floating-point numbers
◮ The real and imaginary parts of Algorithm 4 are similar to:
  • Cornea, Harrison and Tang's algorithm for ab + cd (with a "+" replaced by a 2Sum),
  • Alg. 5.3 in Ogita, Rump and Oishi's Accurate sum & dot product (with a different order of summation of P^R_ℓ, Q^R_ℓ & v^R_ℓ).
◮ The error bound u + 33u² of Theorem 1 still applies, but it can be slightly improved:
Theorem 2. As soon as p ≥ 4, the normwise relative error η of Algorithm 4 satisfies
  η < u + 19u².
Implementation and experiments
◮ The main algorithm (Algorithm 3) was implemented in binary64 (a.k.a. double-precision) arithmetic and compared with other solutions:
  • naive formula in binary64 arithmetic;
  • naive formula in binary128 arithmetic;
  • GNU MPFR with precision ranging from 53 to 106 bits.
◮ loop over N random inputs, itself inside another loop doing K iterations;
◮ goal of the outer loop: get accurate timings without having to choose a large N, whose input data would not fit in the cache;
◮ for each test, we chose (N, K) = (1024, 65536), (2048, 32768) and (4096, 16384).
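The timing harness described above could be organized roughly as follows (a hypothetical sketch: the array names, sizes and the call to cmul_dw_fp are ours, not the authors' code):

    #define N 2048      /* number of random inputs, small enough to stay in cache */
    #define K 32768     /* outer-loop repetitions, to get stable timings */

    static double wRh[N], wRl[N], wIh[N], wIl[N], xR[N], xI[N], zR[N], zI[N];

    /* Time K passes over the same N precomputed random inputs. */
    static void benchmark(void)
    {
        for (long k = 0; k < K; k++)
            for (long i = 0; i < N; i++)
                cmul_dw_fp(wRh[i], wRl[i], wIh[i], wIl[i],
                           xR[i], xI[i], &zR[i], &zI[i]);
    }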