The tangent FFT

D. J. Bernstein
University of Illinois at Chicago
Advertisement

"SPEED: Software Performance Enhancement for Encryption and Decryption." A workshop on software speeds for secret-key cryptography and public-key cryptography. Amsterdam, June 11–12, 2007. http://www.hyperelliptic.org/SPEED
The convolution problem

How quickly can we multiply polynomials in the ring R[x]?

Answer depends on degrees, representation of polynomials, number of polynomials, etc. Answer also depends on definition of "quickly": many models of computation; many interesting cost measures.
Assume two inputs f, g ∈ R[x] with deg f < m and deg g ≤ n − m, so deg fg < n. Assume f, g, fg are represented as coefficient sequences. How quickly can we compute the n coeffs of fg, given f's m coeffs and g's n − m + 1 coeffs?

Inputs (f_0, f_1, …, f_{m−1}) ∈ R^m and (g_0, g_1, …, g_{n−m}) ∈ R^{n−m+1}.
Output (h_0, h_1, …, h_{n−1}) ∈ R^n with h_0 + h_1 x + ⋯ + h_{n−1} x^{n−1} = (f_0 + f_1 x + ⋯)(g_0 + g_1 x + ⋯).
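The map above is the schoolbook convolution; as a baseline, here is a minimal sketch (the name `mul_schoolbook` is mine, not from the talk):

```python
def mul_schoolbook(f, g):
    """Multiply polynomials given as coefficient lists.
    f has m coeffs, g has n-m+1 coeffs; the product fg has n coeffs.
    Uses m*(n-m+1) R-mults plus the R-adds for accumulation."""
    h = [0] * (len(f) + len(g) - 1)    # deg fg < n, so n output coeffs
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj        # accumulate coefficient of x^(i+j)
    return h
```

For example, (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3.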
Assume R-algebraic algorithms (without divisions, branches): chains of
u, v ↦ u + v, binary R-adds;
u, v ↦ u − v, binary R-subs;
u, v ↦ uv, binary R-mults;
starting from inputs and constants.

Example (1963 Karatsuba):
[Diagram: an arithmetic chain computing h_0, h_1, h_2 from f_0, f_1, g_0, g_1, the coefficients of (f_0 + f_1 x)(g_0 + g_1 x).]
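A sketch of the degree-1 Karatsuba chain from the diagram above, using three R-mults instead of four (the function name is illustrative):

```python
def karatsuba_2x2(f0, f1, g0, g1):
    """(f0 + f1 x)(g0 + g1 x) = h0 + h1 x + h2 x^2 with 3 mults:
    the middle coefficient is (f0+f1)(g0+g1) - f0*g0 - f1*g1."""
    p = f0 * g0                  # h0
    q = f1 * g1                  # h2
    r = (f0 + f1) * (g0 + g1)    # shared product for the middle term
    return p, r - p - q, q
```

Trading one R-mult for extra R-adds pays off once mults are no cheaper than adds in the cost measure, and the trick recurses to larger degrees.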
Cost measure for this talk: total R-algebraic complexity. Cost 1 for binary R-add; cost 1 for binary R-sub; cost 1 for binary R-mult; cost 0 for constant in R.

Many real-world computations use (e.g.) Pentium M's floating-point operations to approximate operations in R. Properly scheduled operations achieve Pentium M cycles ≈ total R-algebraic complexity.
Fast Fourier transforms

Define ζ_n ∈ C as exp(2πi/n). Define T: C[x]/(x^n − 1) → C^n as f ↦ (f(1), f(ζ_n), …, f(ζ_n^{n−1})).

Can very quickly compute T. First publication of fast algorithm: 1866 Gauss. Easy to see that Gauss's FFT uses O(n lg n) arithmetic operations if n ∈ {1, 2, 4, 8, …}. Several subsequent reinventions, ending with 1965 Cooley–Tukey.
Inverse map T^{−1} is also very fast. Multiplication in C^n is very fast.

1966 Sande, 1966 Stockham: can very quickly multiply in C[x]/(x^n − 1) or C[x] or R[x] by mapping C[x]/(x^n − 1) to C^n.

Given f, g ∈ C[x]/(x^n − 1): compute fg as T^{−1}(T(f)T(g)).
Given f, g ∈ C[x] with deg fg < n: compute fg from its image in C[x]/(x^n − 1).

R-algebraic complexity O(n lg n).
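A sketch of this multiplication strategy, with a textbook radix-2 FFT standing in for T (names `fft`, `ifft`, `mul_fft` are mine; the final rounding assumes integer coefficients small enough that floating-point error stays below 1/2):

```python
import cmath

def fft(a):
    """The map T: evaluate the coeff list a (len a power of 2)
    at all len(a)-th roots of unity, via the radix-2 recursion."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def ifft(a):
    """T^-1 via the conjugation trick plus division by n."""
    n = len(a)
    return [v.conjugate() / n for v in fft([v.conjugate() for v in a])]

def mul_fft(f, g, n):
    """Compute fg (deg fg < n, n a power of 2) as T^-1(T(f) T(g))."""
    F = fft(f + [0] * (n - len(f)))
    G = fft(g + [0] * (n - len(g)))
    return [round(v.real) for v in ifft([x * y for x, y in zip(F, G)])]
```

Three transforms plus n pointwise products: O(n lg n) operations in place of the schoolbook O(n^2).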
A closer look at costs

More precise analysis of the Gauss FFT (and Cooley–Tukey FFT) mapping C[x]/(x^n − 1) → C^n:
(1/2) n lg n binary C-adds,
(1/2) n lg n binary C-subs,
(1/2) n lg n binary C-mults,
if n ∈ {1, 2, 4, 8, …}.

(a, b) ∈ R^2 represents a + bi ∈ C. C-add, C-sub, C-mult cost 2, 2, 6:
(a, b) + (c, d) = (a + c, b + d);
(a, b) − (c, d) = (a − c, b − d);
(a, b)(c, d) = (ac − bd, ad + bc).
Total cost 5 n lg n.

Easily save some time: eliminate mults by 1, −1, i, −i; absorb mults by ±√i, ±√−i into subsequent operations; simplify those mults using, e.g., (a, b)(1/√2, 1/√2) = ((a − b)/√2, (a + b)/√2).

Cost 5 n lg n − 10 n + 16 to map C[x]/(x^n − 1) → C^n, if n ∈ {4, 8, 16, 32, …}.
What about cost of convolution?

5 n lg n + O(n) to compute T(f);
5 n lg n + O(n) to compute T(g);
O(n) to multiply in C^n;
similarly 5 n lg n + O(n) for T^{−1}.

Total cost 15 n lg n + O(n) to compute fg ∈ C[x]/(x^n − 1) given f, g ∈ C[x]/(x^n − 1).

Total cost (15/2) n lg n + O(n) to compute fg ∈ R[x]/(x^n − 1) given f, g ∈ R[x]/(x^n − 1): map R[x]/(x^n − 1) → R^2 × C^{n/2 − 1} (Gauss) to save half the time.
1968 R. Yavne: can do better!

Cost 4 n lg n − 6 n + 8 to map C[x]/(x^n − 1) → C^n, if n ∈ {2, 4, 8, 16, …}.

2004 James Van Buskirk: can do better!

Cost (34/9) n lg n + O(n).

Expositions of the new algorithm: Frigo, Johnson, in IEEE Trans. Signal Processing; Lundy, Van Buskirk, in Computing; Bernstein, this talk, expanding an old idea of Fiduccia.
Van Buskirk, comp.arch, January 2005: "Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?"

Bernstein, comp.arch: "What do you mean, 'better opcounts'? The algebraic complexity of a size-2^k complex DFT has stood at (3k − 3)2^k + 4 additions and (k − 3)2^k + 4 multiplications since 1968."

Van Buskirk, comp.arch: "Oh, you're so 20th century, Dan. Read the handwriting on the wall."
Understanding the FFT

The FFT trick: C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1) by the unique C[x]-algebra morphism. Cost 2n: n/2 C-adds, n/2 C-subs.

e.g. n = 4: C[x]/(x^4 − 1) → C[x]/(x^2 − 1) × C[x]/(x^2 + 1) by g_0 + g_1 x + g_2 x^2 + g_3 x^3 ↦ ((g_0 + g_2) + (g_1 + g_3)x, (g_0 − g_2) + (g_1 − g_3)x).

Representation: (g_0, g_1, g_2, g_3) ↦ ((g_0 + g_2), (g_1 + g_3)), ((g_0 − g_2), (g_1 − g_3)).
Recurse: C[x]/(x^{n/2} − 1) → C[x]/(x^{n/4} − 1) × C[x]/(x^{n/4} + 1); similarly C[x]/(x^{n/2} + 1) → C[x]/(x^{n/4} − i) × C[x]/(x^{n/4} + i); continue down to C[x]/(x − 1) × ⋯.

General case: C[x]/(x^n − ζ^2) → C[x]/(x^{n/2} − ζ) × C[x]/(x^{n/2} + ζ) by g_0 + ⋯ ↦ ((g_0 + ζ g_{n/2}) + (g_1 + ζ g_{n/2+1})x + ⋯, (g_0 − ζ g_{n/2}) + (g_1 − ζ g_{n/2+1})x + ⋯).

Cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs. Recurse, eliminate easy mults. Cost 5 n lg n − 10 n + 16.
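The general-case split and its recursion can be sketched directly (a sketch under my naming; tracking the modulus constant as an angle φ with ζ^2 = e^{iφ} avoids square-root branch-cut issues):

```python
import cmath

def fft_trick(g, zeta):
    """One step of the FFT trick:
    C[x]/(x^n - zeta^2) -> C[x]/(x^(n/2) - zeta) x C[x]/(x^(n/2) + zeta).
    Coefficient j of the two halves is g[j] +- zeta * g[n/2 + j]."""
    h = len(g) // 2
    return ([g[j] + zeta * g[h + j] for j in range(h)],
            [g[j] - zeta * g[h + j] for j in range(h)])

def fft_rec(g, phi=0.0):
    """Evaluate g at all len(g)-th roots of e^(i*phi) by recursing
    on fft_trick.  With phi = 0 this evaluates g at all roots of
    unity, in a bit-reversed-style order, not 1, zeta, zeta^2, ..."""
    if len(g) == 1:
        return g
    zeta = cmath.exp(1j * phi / 2)       # a square root of e^(i*phi)
    plus, minus = fft_trick(g, zeta)
    # plus lives mod x^(n/2) - zeta, i.e. roots of e^(i*phi/2);
    # minus lives mod x^(n/2) + zeta, i.e. roots of e^(i*(phi/2 + pi)).
    return fft_rec(plus, phi / 2) + fft_rec(minus, phi / 2 + cmath.pi)
```

Each level touches n coefficients with one C-mult, one C-add, one C-sub per pair, matching the cost-5n count before easy mults are eliminated.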
Alternative: the twisted FFT

After the previous map C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1), apply the unique C-algebra morphism C[x]/(x^{n/2} + 1) → C[y]/(y^{n/2} − 1) that maps x to ζ_n y.

g_0 + g_1 x + ⋯ ↦ ((g_0 + g_{n/2}) + (g_1 + g_{n/2+1})x + ⋯, (g_0 − g_{n/2}) + ζ_n(g_1 − g_{n/2+1})y + ⋯).

Again cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs. Eliminate easy mults, recurse. Cost 5 n lg n − 10 n + 16.
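The twist morphism itself is a one-liner (a sketch; `twist` is my name): sending x to ζy multiplies coefficient j by ζ^j, and any ζ with ζ^h = −1 carries C[x]/(x^h + 1) to C[y]/(y^h − 1).

```python
def twist(g, zeta):
    """C-algebra morphism C[x]/(x^h + 1) -> C[y]/(y^h - 1), x -> zeta*y,
    valid when zeta^h = -1: multiply coefficient j by zeta^j."""
    return [c * zeta**j for j, c in enumerate(g)]
```

For h = 2 and ζ = i (that is, ζ_4): twist([g0, g1], 1j) is [g0, i·g1], whose value at y = 1 is g0 + i·g1, the original polynomial's value at x = i.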
The split-radix FFT

FFT and twisted FFT end up with the same number of mults by ζ_n, the same number of mults by ζ_{n/2}, the same number of mults by ζ_{n/4}, etc.

Is this necessary? No! Split-radix FFT: more easy mults. "Don't twist until you see the whites of their i's."

(Can use the same idea to speed up the Schönhage–Strassen algorithm for integer multiplication.)
Cost 2n: C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1).
Cost n: C[x]/(x^{n/2} + 1) → C[x]/(x^{n/4} − i) × C[x]/(x^{n/4} + i).
Cost 6(n/4): C[x]/(x^{n/4} − i) → C[y]/(y^{n/4} − 1) by x ↦ ζ_n y.
Cost 6(n/4): C[x]/(x^{n/4} + i) → C[y]/(y^{n/4} − 1) by x ↦ ζ_n^{−1} y.

Overall cost 6n to split into sizes 1/2, 1/4, 1/4; entropy 1.5 bits.

Eliminate easy mults, recurse. Cost 4 n lg n − 6 n + 8, exactly as in 1968 Yavne.
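The four maps above combine into one split-radix step; here is a sketch under my naming (`split_radix_step`), returning the half-size piece plus the two twisted quarter-size pieces:

```python
import cmath

def split_radix_step(g):
    """One split-radix step on g in C[x]/(x^n - 1), n divisible by 4.
    Returns (g mod (x^(n/2) - 1),
             the (x^(n/4) - i) part twisted into C[y]/(y^(n/4) - 1),
             the (x^(n/4) + i) part twisted into C[y]/(y^(n/4) - 1))."""
    n = len(g)
    h, q = n // 2, n // 4
    zeta = cmath.exp(2j * cmath.pi / n)                        # zeta_n
    a = [g[j] + g[h + j] for j in range(h)]                    # mod x^(n/2) - 1
    b = [g[j] - g[h + j] for j in range(h)]                    # mod x^(n/2) + 1
    c = [(b[j] + 1j * b[q + j]) * zeta**j for j in range(q)]   # x -> zeta_n y
    d = [(b[j] - 1j * b[q + j]) * zeta**-j for j in range(q)]  # x -> zeta_n^-1 y
    return a, c, d
```

Only the q = n/4 coefficients of each quarter piece pay for a general C-mult; the i in the (x^{n/4} ∓ i) split is free, which is where the 6n-per-level figure comes from.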
The tangent FFT

Several ways to achieve cost 6 for mult by e^{iθ}.

One approach: factor e^{iθ} as (1 + i tan θ) cos θ. Cost 2 for mult by cos θ. Cost 4 for mult by 1 + i tan θ.

For stability and symmetry, use max{|cos θ|, |sin θ|} instead of cos θ.

Surprise (Van Buskirk): can merge some cost-2 mults!
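A quick numeric check of the factorization e^{iθ} = (1 + i tan θ) cos θ and its cost split (an illustrative sketch; the function name is mine):

```python
import math

def mul_by_unit(a, b, theta):
    """Multiply a + bi by e^(i*theta), factored as (1 + i*tan(theta))*cos(theta).
    The 1 + i*tan factor costs 4 (2 real mults + 2 real adds);
    the cos(theta) scaling costs 2 (2 real mults)."""
    t = math.tan(theta)
    c = math.cos(theta)
    re, im = a - b * t, b + a * t    # (a+bi)(1+it) = (a-bt) + (b+at)i
    return re * c, im * c            # scale both parts by cos(theta)
```

The point of the factorization is that the cost-2 real scaling can sometimes be merged with neighboring scalings, which is the opening Van Buskirk exploited.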