Fast Integer Multiplication with Sch¨ onhage-Strassen’s Algorithm Alexander Kruppa CACAO team at LORIA, Nancy s´ eminaire Algorithms INRIA Rocquencourt
Contents Contents Contents of this talk: 1. Basics of Integer Multiplication (a) by polynomial multiplication (b) Evaluation/Interpolation (c) Karatsuba’s method (d) Toom-Cook method (e) FFT and weighted transforms (f) Sch¨ onhage-Strassen’s Algorithm Alexander Kruppa 2 INRIA Rocquencourt
Contents 2. Motivation for Improving SSA 3. Sch¨ onhage-Strassen’s Algorithm 4. High-Level Improvements (a) Mersenne Transform (b) CRT Reconstruction √ 2 Trick (c) 5. Low-Level Improvements (a) Arithmetic modulo 2 n + 1 (b) Cache Locality (c) Fine-Grained Tuning 6. Timings, Comparisons and Untested Ideas Alexander Kruppa 3 INRIA Rocquencourt
Integer Multiplication Integer Multiplication • Problem: given two n -word (word base β ) integers n − 1 � a i β i , a = i =0 0 ≤ a i < β and likewise for b , compute 2 n − 1 � c i β i c = ab = i =0 n − 1 n − 1 � � a i b j β i + j , = i =0 j =0 with 0 ≤ c i < β . Alexander Kruppa 4 INRIA Rocquencourt
by Polynomial Multiplication by Polynomial Multiplication • We can rewrite the problem as polynomial arithmetic: n − 1 � a i x i A ( x ) = i =0 so that a = A ( β ) , likewise for B ( x ) , then C ( x ) = A ( x ) B ( x ) n − 1 n − 1 � � a i b j x i + j = i =0 j =0 so that c = ab = C ( β ) . • Double sum has complexity O ( n 2 ) (Grammar School Algorithm), we can do much better Alexander Kruppa 5 INRIA Rocquencourt
Evaluation/Interpolation Evaluation/Interpolation • Unisolvence Theorem: Polynomial of degree d − 1 is determined uniquely by values at d distinct points • Since C ( k ) = A ( k ) B ( k ) for all k ∈ R for ring R , reduce the polynomial multiplication to: 1. Evaluate A ( x ) , B ( x ) at 2 n − 1 points k 0 , . . . , k 2 n − 2 2. Pairwise multiply to get C ( k i ) = A ( k i ) B ( k i ) 3. Interpolate C ( x ) from its values C ( k i ) Alexander Kruppa 6 INRIA Rocquencourt
Karatsuba’s Method Karatsuba’s Method • First algorithm to use this principle (Karatsuba and Ofman, 1962) • Multiplies polynomials of degree 1 : A ( x ) = a 0 + a 1 x • Suggested points of evaluation: 0 , 1 , ∞ • A (0) = a 0 , A (1) = a 0 + a 1 , A ( ∞ ) = a 1 (same for B ( x ) ) • C (0) = a 0 b 0 , C (1) = ( a 0 + a 1 )( b 0 + b 1 ) , C ( ∞ ) = a 1 b 1 • c 0 = C (0) , c 2 = C ( ∞ ) , c 1 = C (1) − c 0 − c 2 • Product of 2 n words computed with 3 pointwise multiplications of n words each, applied recursively: O ( n log 2 (3) ) = O ( n 1 . 585 ) Alexander Kruppa 7 INRIA Rocquencourt
Toom-Cook Method Toom-Cook Method • Generalized to polynomials of larger degree (Toom, 1963, Cook, 1966) • Product of two n word integers with A ( x ) , B ( x ) of degree d : 2 d + 1 products of n/ ( d + 1) word integers • For fixed d : complexity O ( n log d +1 (2 d +1) ) , e.g. d = 2 : O ( n 1 . 465 ) • Interpolation/Evaluation costly ( O ( dn log( d )) ), cannot increase d arbitrarily for given n • Choosing d as function of n allows algorithm in O ( n 1+ ǫ ) , for any ǫ > 0 . Small exponents need very large n Alexander Kruppa 8 INRIA Rocquencourt
Evaluation/Interpolation with FFT Evaluation/Interpolation with FFT • FFT solves problem of costly evaluation/interpolation a j = A ( ω j • Length- ℓ DFT of a 0 , ..., a ℓ − 1 in R computes ˜ ℓ ) , 0 ≤ j < ℓ , with ω ℓ an ℓ -th principal root of unity in R : ℓ -point polynomial evaluation • Length- ℓ IDFT computes a i from given ˜ a j : ℓ -point polynomial interpolation • With FFT algorithm, algebraic complexity only O ( ℓ log( ℓ )) • Problem: R needs to support length- ℓ FFT (preferably ℓ a power of 2 ): needs ℓ -th principal root of unity, ℓ a unit Alexander Kruppa 9 INRIA Rocquencourt
Weighted Transform Weighted Transform ℓ ) ℓ = 1 for all j ∈ N , C 1 ( x ) x ℓ + C 0 ( x ) has same DFT • Since ( ω j coefficients as C 1 ( x ) + C 0 ( x ) : implicit modulus x ℓ − 1 in DFT • FFT convolution gives C ( x ) = ( A ( x ) B ( x )) mod ( x ℓ − 1) : cyclic convolution • Can change that modulus with weighed transform: compute C ( wx ) = ( A ( wx ) B ( wx )) mod ( x ℓ − 1) . Then A ( wx ) B ( wx ) = C 1 ( wx ) x ℓ w ℓ + C 0 ( wx ) C ( wx ) = C 1 ( wx ) x ℓ w ℓ + C 0 ( wx ) mod ( x ℓ − 1) = C 1 ( wx ) w ℓ + C 0 ( wx ) so that C ( x ) = ( A ( x ) B ( x )) mod ( x ℓ − w ℓ ) • With w ℓ = − 1 , we get modulus x ℓ +1 : negacyclic convolution, but need 2 ℓ -th root of unity in R Alexander Kruppa 10 INRIA Rocquencourt
Sch¨ onhage-Strassen’s Algorithm: Basic Idea Sch¨ onhage-Strassen’s Algorithm: Basic Idea • First algorithms to use FFT (Sch¨ onhage and Strassen 1971) • Uses ring R n = Z / (2 n + 1) Z for transform, with ℓ = 2 k | n • Then 2 n/ℓ ≡ − 1 (mod 2 n +1) : so 2 n/ℓ ∈ R n is 2 ℓ -th root of unity, multiplication by powers of 2 is fast! ( O ( n ) ) • Allows length ℓ weighted transform for negacyclic convolution • Write input a = A (2 M ) , b = B (2 M ) , compute C ( x ) = A ( x ) B ( x ) (mod x ℓ + 1) . Then c = C (2 M ) = ab (mod 2 Mℓ + 1) • Point-wise products modulo 2 n + 1 use SSA recursively: choose next level’s ℓ ′ , M ′ so that M ′ ℓ ′ = n Alexander Kruppa 11 INRIA Rocquencourt
Improvements to Sch¨ onhage- Strassen’s Algorithm Alexander Kruppa 12 INRIA Rocquencourt
Motivation for Improving SSA Motivation for Improving SSA • Integer multiplication is fundamental to arithmetic, used in PRP testing, ECPP , polynomial multiplication • Sch¨ onhage-Strassen’s algorithm [SSA]: good asymptotic complexity O ( n log n log log n ) , fast in practice for large operands, exact (only integer arithmetic) • Used in GMP , widely deployed • We improved algorithmic aspects of Sch¨ onhage-Strassen • Validated by implementation based on GMP 4.2.1 [GMP] Alexander Kruppa 13 INRIA Rocquencourt
Sch¨ onhage-Strassen’s Algorithm Sch¨ onhage-Strassen’s Algorithm • SSA reduces multiplication of two S -bit integers to ℓ multiplications of approx. 4 S/ℓ -bit integers • Example: multiply two numbers a, b of 2 20 bits each ⇒ product has at most 2 21 bits 1. Choose N = 2 21 and a good ℓ , for this example ℓ = 512 . We compute ab mod (2 N + 1) 2. Write a as polynomial of degree ℓ , coefficients a i < 2 M with M = N/ℓ , a = a (2 M ) . Same for b 3. ab = a (2 M ) b (2 M ) mod (2 N + 1) , compute c ( x ) = a ( x ) b ( x ) mod ( x ℓ + 1) 4. Convolution theorem: Fourier transform and pointwise multiplication Alexander Kruppa 14 INRIA Rocquencourt
Sch¨ onhage-Strassen’s Algorithm 5. FFT needs ℓ -th root of unity: map to Z / (2 n + 1) Z [ x ] with ℓ | n . Then 2 2 n/ℓ has order ℓ 6. We need 2 n + 1 > c i : choose n ≥ 2 M + log 2 ( ℓ ) + 1 7. Compute c ( x ) = a ( x ) b ( x ) mod ( x ℓ + 1) , evaluate ab = c (2 M ) - and we’re done! 8. Benefits: - Root of unity is power of 2 - Reduction mod(2 n + 1) is fast - Point-wise products can use SSA recursively without padding ⇒ Z [ x ] mod ( x ℓ + 1) = ⇒ Z 2 n +1 [ x ] mod ( x ℓ + 1) = Z = ⇒ Z 2 N +1 = ⇒ Z 2 n +1 ✻ ❄ ✟ ❍ ✟✟✟✟ ❍ n ❍ ❍ No, recurse ❍ ❍❍❍❍ ✟ small? ✟ ✟ ✟ ❍ ✟ ❄ Yes, multiply Alexander Kruppa 15 INRIA Rocquencourt
High-Level Optimizations High-Level Optimizations Alexander Kruppa 16 INRIA Rocquencourt
Mersenne Transform Mersenne Transform • Convolution theorem implies reduction mod( x ℓ − 1) • Convolution mod( x ℓ + 1) needs weights θ i with ord ( θ ) = 2 ℓ , needs ℓ | n to get 2 ℓ -th root of unity in R n • Computing ab mod (2 N + 1) to allows recursive use of SSA, but is not required at top level • Map a and b to Z / (2 N − 1) Z instead: compute c ( x ) = a ( x ) b ( x ) mod ( x ℓ − 1) • Condition relaxes to ℓ | 2 n . Twice the transform length, smaller n • No need to apply/unapply weights Alexander Kruppa 17 INRIA Rocquencourt
CRT Reconstruction CRT Reconstruction a, b a, b Z / (2 N + 1) Z Z / (2 rN + 1) Z Z / (2 sN − 1) Z Convolution Convolution Convolution ab CRT with 2 N + 1 > ab ab with (2 rN + 1)(2 sN − 1) > ab Alexander Kruppa 18 INRIA Rocquencourt
CRT Reconstruction • At least one of (2 rN − 1 , 2 sN +1) and (2 rN +1 , 2 sN − 1) is coprime • Our implementation uses (2 rN +1 , 2 N − 1) : always coprime, good speed • Smaller convolution, finer-grained parameter selection Alexander Kruppa 19 INRIA Rocquencourt
√ 2 Trick The √ The 2 Trick • If 4 | n , 2 is a quadratic residue in Z / (2 n + 1) Z √ √ 2 ≡ 2 3 n/ 4 − 2 n/ 4 : simple form, multiplication by • In that case, 2 takes only 2 shift, 1 subtraction modulo 2 n + 1 • Offers root of unity of order 4 n , allows ℓ | 2 n for Fermat transform, ℓ | 4 n for Mersenne transform • Sadly, higher roots of 2 usually not available in Z / (2 n + 1) Z , or have no simple form Alexander Kruppa 20 INRIA Rocquencourt
Low-Level Optimizations Low-Level Optimizations Alexander Kruppa 21 INRIA Rocquencourt
Recommend
More recommend