Speeding up characteristic 2: I. Linear maps. II. The M(n) game. III. Batching. IV. Normal bases

  1. Speeding up characteristic 2: I. Linear maps. II. The M(n) game. III. Batching. IV. Normal bases. D. J. Bernstein, University of Illinois at Chicago. NSF ITR–0716498

  2. Part I. Linear maps. Consider computing h0 = q0; h1 = q1; h2 = q2 ⊕ (p0 ⊕ q0 ⊕ r0); h3 = (p1 ⊕ q1 ⊕ r1); h4 = (p2 ⊕ q2 ⊕ r2) ⊕ r0; h5 = r1; h6 = r2. Easy: 8 additions. Can find these 8 additions in several papers. But 8 is not optimal!

  3. “Wasting brain power is bad for the environment.” Use existing algorithms to find addition chains. Apply, e.g., the greedy additive CSE algorithm from 1997 Paar: find the input pair i0, i1 with the most popular i0 ⊕ i1; compute i0 ⊕ i1; simplify using i0 ⊕ i1; repeat. This algorithm finds the repeated q2 ⊕ r0; uses 7 additions.
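
One way to schedule those 7 additions, as a minimal bitsliced C sketch (the function name and word type are illustrative, not from the slides; in characteristic 2 addition is xor, so each variable can hold a full machine word and the map is applied bitwise):

    #include <stdint.h>

    /* 7-xor schedule for the reconstruction, reusing the repeated q2 ^ r0. */
    void reconstruct7(const uint64_t p[3], const uint64_t q[3],
                      const uint64_t r[3], uint64_t h[7])
    {
        uint64_t t = q[2] ^ r[0];     /* shared subexpression:   1 xor  */
        h[0] = q[0];
        h[1] = q[1];
        h[2] = t ^ p[0] ^ q[0];       /* q2 ^ (p0 ^ q0 ^ r0):    2 xors */
        h[3] = p[1] ^ q[1] ^ r[1];    /*                         2 xors */
        h[4] = t ^ p[2] ^ r[2];       /* (p2 ^ q2 ^ r2) ^ r0:    2 xors */
        h[5] = r[1];
        h[6] = r[2];
    }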

  4. A new algorithm: “xor largest.” Start with the matrix mod 2 for the desired linear map. If two largest rows have same first bit, replace largest row by its xor with second-largest row. Otherwise change largest row by clearing first bit. In both cases, compute result recursively, and finish with one xor.
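
Here is a C sketch of the xor-largest planner, with the recursion unrolled into a loop (my own layout, not from the slides): each row is an m-bit integer with the leftmost column in the most significant bit, so comparing rows as integers puts the "first bit" first; reductions are recorded forward and replayed in reverse, each replayed reduction being the "one xor" that finishes its recursion level. Assumes GCC/Clang for __builtin_clz.

    #include <stdint.h>

    enum { MAXOPS = 4096 };
    static struct { int out, i, j; } ops[MAXOPS];  /* out ? h[i]^=h[j] : h[i]^=x[j] */
    static int nops;

    static void plan(uint32_t *rows, int n, int m)
    {
        nops = 0;
        for (;;) {
            int i = 0, j = -1;
            for (int k = 1; k < n; k++)                  /* largest row */
                if (rows[k] > rows[i]) i = k;
            if (rows[i] == 0) return;                    /* matrix is all zero */
            for (int k = 0; k < n; k++)                  /* second-largest row */
                if (k != i && (j < 0 || rows[k] > rows[j])) j = k;
            int top = 31 - __builtin_clz(rows[i]);       /* first bit of largest */
            if (j >= 0 && ((rows[j] >> top) & 1)) {      /* same first bit:       */
                rows[i] ^= rows[j];                      /* xor in second-largest */
                ops[nops].out = 1; ops[nops].i = i; ops[nops].j = j; nops++;
            } else {                                     /* else clear first bit  */
                rows[i] ^= (uint32_t)1 << top;
                ops[nops].out = 0; ops[nops].i = i; ops[nops].j = m - 1 - top; nops++;
            }
        }
    }

    /* Replay in reverse on data words; h[] starts zero, so the first write
       to an output register is a load or a copy. */
    static void apply(const uint32_t *x, uint32_t *h, int n)
    {
        for (int i = 0; i < n; i++) h[i] = 0;
        for (int t = nops - 1; t >= 0; t--)
            h[ops[t].i] ^= ops[t].out ? h[ops[t].j] : x[ops[t].j];
    }

Running plan() on rows {0b1011, 0b1111, 0b0110, 0b0101} with m = 4 reproduces the reduction steps on the next slides.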

  5. A small example: 1011 = x0 + x2 + x3; 1111 = x0 + x1 + x2 + x3; 0110 = x1 + x2; 0101 = x1 + x3. Replace the largest row by its xor with the second-largest row.

  6. Recursively compute 1011 = x0 + x2 + x3; 0100 = x1; 0110 = x1 + x2; 0101 = x1 + x3; plus 1 xor of first output into second output.

  7. Recursively compute 0011 0100 0110 0101 plus 1 input load, 2 xors.

  8. Recursively compute 0011 0100 0011 0101 plus 1 input load, 3 xors.

  9. Recursively compute 0011 0100 0011 0001 plus 1 input load, 4 xors.

  10. Recursively compute 0011 0000 0011 0001 plus 2 input loads, 4 xors. Note: this was just a copy.

  11. Recursively compute 0000 0000 0011 0001 plus 2 input loads, 4 xors.

  12. Recursively compute 0000 0000 0001 0001 plus 3 input loads, 5 xors.

  13. Recursively compute 0000 0000 0000 0001 plus 3 input loads, 5 xors.

  14. Recursively compute 0000 0000 0000 0000 plus 4 input loads, 5 xors.
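
Replaying the recorded reductions in reverse order yields the following straight-line program for this example (derived by hand from the steps above; out0..out3 are the outputs for rows 1011, 1111, 0110, 0101). The totals match the slide: 4 input loads, 5 xors, plus 2 register copies.

    out3 = x3;         /* load                               */
    out2 = out3;       /* copy                               */
    out2 ^= x2;        /* load-xor: out2 = x2 ^ x3           */
    out0 = out2;       /* copy                               */
    out1 = x1;         /* load                               */
    out3 ^= out1;      /* xor: out3 = x1 ^ x3        (0101)  */
    out2 ^= out3;      /* xor: out2 = x1 ^ x2        (0110)  */
    out0 ^= x0;        /* load-xor: out0 = x0^x2^x3  (1011)  */
    out1 ^= out0;      /* xor: out1 = x0^x1^x2^x3    (1111)  */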

  15. Memory friendliness: Algorithm writes only to the output registers. No temporary storage. n inputs, n outputs: 2n registers total with 0 loads, 0 stores. Or n + 1 registers with n loads, 0 stores: each input is read only once. Or n registers with n loads, 0 stores, if platform has a load-xor insn.

  16. Two-operand friendliness: A platform with a ← a ⊕ b but without a ← b ⊕ c: the algorithm uses only n extra copies. Naive column sweep also uses n + 1 registers, n loads, but usually many more xors. Input partitioning (e.g., 1956 Lupanov) uses somewhat more xors, copies; somewhat more registers. Greedy additive CSE uses somewhat fewer xors but many more copies, registers.

  17. For an n × m matrix (m inputs and n outputs): the xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs.

  18. For an n × m matrix (m inputs and n outputs): the xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs. Pippenger’s algorithm uses ≈ mn/lg mn three-operand xors but seems to need many regs. Pippenger proved that his algebraic complexity was near optimal for most matrices (at least without the mod-2 restriction), but didn’t consider regs, two-operand complexity, etc.

  19. Our original example: 000100000 000010000 100101100 010010010 001001101 000000010 000000001. Each row has the coefficients of p0, p1, p2, q0, q1, q2, r0, r1, r2.

  20. Our original example: 000100000 000010000 000101100 010010010 001001101 000000010 000000001 plus 1 xor, 1 input load.

  21. Our original example: 000100000 000010000 000101100 000010010 001001101 000000010 000000001 plus 2 xors, 2 input loads.

  22. Our original example: 000100000 000010000 000101100 000010010 000001101 000000010 000000001 plus 3 xors, 3 input loads.

  23. Our original example: 000100000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 3 input loads.

  24. Our original example: 000000000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 4 input loads.

  25. Our original example: 000000000 000010000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 4 input loads.

  26. Our original example: 000000000 000000000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 5 input loads.

  27. Our original example: 000000000 000000000 000001100 000000010 000000001 000000010 000000001 plus 6 xors, 5 input loads.

  28. Our original example: 000000000 000000000 000000100 000000010 000000001 000000010 000000001 plus 7 xors, 6 input loads.

  29. Our original example: 000000000 000000000 000000000 000000010 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  30. Our original example: 000000000 000000000 000000000 000000000 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  31. Our original example: 000000000 000000000 000000000 000000000 000000001 000000000 000000001 plus 7 xors, 8 input loads.

  32. Our original example: 000000000 000000000 000000000 000000000 000000000 000000000 000000001 plus 7 xors, 8 input loads.

  33. Our original example: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 7 xors, 9 input loads. Algorithm found the speedup.
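
In reverse, the recorded steps give this straight-line program (derived by hand from the slides above): 9 input loads, 7 xors, 2 copies. Note how h2 = q2 ^ r0 is formed once and folded into h4: the algorithm rediscovered the shared subexpression.

    h6 = r2;           /* load                                  */
    h4 = h6;           /* copy                                  */
    h5 = r1;           /* load                                  */
    h3 = h5;           /* copy                                  */
    h2 = r0;           /* load                                  */
    h2 ^= q2;          /* load-xor: h2 = q2 ^ r0, reused below  */
    h4 ^= h2;          /* xor                                   */
    h1 = q1;           /* load                                  */
    h3 ^= h1;          /* xor                                   */
    h0 = q0;           /* load                                  */
    h2 ^= h0;          /* xor                                   */
    h4 ^= p2;          /* load-xor: h4 = (p2 ^ q2 ^ r2) ^ r0    */
    h3 ^= p1;          /* load-xor: h3 = p1 ^ q1 ^ r1           */
    h2 ^= p0;          /* load-xor: h2 = q2 ^ (p0 ^ q0 ^ r0)    */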

  34. Part II. The M(n) game. Define M(n) as the minimum number of bit operations (ands, xors) needed to multiply n-bit polys f, g ∈ F2[x] (in standard representation). M(2) ≤ 5: e.g., to compute h0 + h1 x + h2 x^2 = (f0 + f1 x)(g0 + g1 x), can compute h0 = f0 g0, h1 = f0 g1 + f1 g0, h2 = f1 g1 with 4 ands, 1 xor.
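
As a bitsliced C sketch (each variable is one bit, or a word of independent bits; the function name is illustrative):

    #include <stdint.h>

    /* M(2) <= 5: (f0 + f1 x)(g0 + g1 x) with 4 ands, 1 xor. */
    void mul2(uint64_t f0, uint64_t f1, uint64_t g0, uint64_t g1,
              uint64_t *h0, uint64_t *h1, uint64_t *h2)
    {
        *h0 = f0 & g0;                 /* and 1           */
        *h1 = (f0 & g1) ^ (f1 & g0);   /* ands 2,3; 1 xor */
        *h2 = f1 & g1;                 /* and 4           */
    }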

  35. Schoolbook multiplication: M(n) ∈ O(n^2). 1963 Karatsuba: M(n) ∈ O(n^lg 3). 1963 Toom: M(n) ∈ n · 2^O(√lg n). 1971 Schönhage–Strassen: M(n) ∈ O(n lg n lg lg n). 2007 Fürer improves the lg lg n for integers but doesn’t help mod 2.

  36. What does this tell us about M(131) or M(251)? Absolutely nothing! Reanalyze algorithms to see exact complexity. Rethink algorithm design to find constant-factor (and sub-constant-factor) speedups that are not visible in the asymptotics.

  37. Schoolbook recursion: M(n + 1) ≤ M(n) + 4n. Hence M(n) ≤ 2n^2 − 2n + 1. Karatsuba recursion as commonly stated: M(2n) ≤ 3M(n) + 8n − 4. e.g., Karatsuba for n = 1: f = f0 + f1 x, g = g0 + g1 x, h0 = f0 g0, h2 = f1 g1, h1 = (f0 + f1)(g0 + g1) ⊕ h0 ⊕ h2; then f g = h0 + h1 x + h2 x^2.
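
The n = 1 case as a sketch, in the same style: 3 ands and 4 xors, i.e. 3·M(1) + 8·1 − 4 = 7 operations, trading one and for three xors.

    #include <stdint.h>

    /* Karatsuba for n = 1; in characteristic 2 the subtractions are xors. */
    void mul2_karatsuba(uint64_t f0, uint64_t f1, uint64_t g0, uint64_t g1,
                        uint64_t *h0, uint64_t *h1, uint64_t *h2)
    {
        *h0 = f0 & g0;                              /* and 1         */
        *h2 = f1 & g1;                              /* and 2         */
        *h1 = ((f0 ^ f1) & (g0 ^ g1)) ^ *h0 ^ *h2;  /* and 3; 4 xors */
    }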

  38. Karatsuba for n = 2: f = f0 + f1 x + f2 x^2 + f3 x^3, g = g0 + g1 x + g2 x^2 + g3 x^3, H0 = (f0 + f1 x)(g0 + g1 x), H2 = (f2 + f3 x)(g2 + g3 x), H1 = (f0 + f2 + (f1 + f3) x)(g0 + g2 + (g1 + g3) x) ⊕ H0 ⊕ H2; then f g = H0 + H1 x^2 + H2 x^4.

  39. Initial linear computation: f0 + f2, f1 + f3, g0 + g2, g1 + g3; cost 4. Three size-2 mults producing H0 = q0 + q1 x + q2 x^2; H2 = r0 + r1 x + r2 x^2; H0 + H1 + H2 = p0 + p1 x + p2 x^2. Final linear reconstruction: H1 = (p0 ⊕ q0 ⊕ r0) + (p1 ⊕ q1 ⊕ r1) x + (p2 ⊕ q2 ⊕ r2) x^2, cost 6; f g = H0 + H1 x^2 + H2 x^4, cost 2.

  40. Let’s look more closely at the reconstruction: f g = h0 + h1 x + ··· + h6 x^6 with h0 = q0; h1 = q1; h2 = q2 + (p0 ⊕ q0 ⊕ r0); h3 = (p1 ⊕ q1 ⊕ r1); h4 = (p2 ⊕ q2 ⊕ r2) + r0; h5 = r1; h6 = r2.

  41. Let’s look more closely at the reconstruction: f g = h0 + h1 x + ··· + h6 x^6 with h0 = q0; h1 = q1; h2 = q2 + (p0 ⊕ q0 ⊕ r0); h3 = (p1 ⊕ q1 ⊕ r1); h4 = (p2 ⊕ q2 ⊕ r2) + r0; h5 = r1; h6 = r2. We’ve seen this before! Reduce 6 + 2 = 8 ops to 7 ops by reusing q2 ⊕ r0.
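
Putting Parts I and II together, a sketch of the n = 2 step that calls the mul2 and reconstruct7 sketches from earlier, so the reconstruction costs 7 xors instead of 6 + 2 = 8:

    /* n = 2 Karatsuba: 4 xors of initial linear computation, three size-2
       mults, then the 7-xor reconstruction. Assumes mul2 and reconstruct7
       as sketched above. */
    void mul4_karatsuba(const uint64_t f[4], const uint64_t g[4], uint64_t h[7])
    {
        uint64_t q[3], r[3], p[3];
        mul2(f[0], f[1], g[0], g[1], &q[0], &q[1], &q[2]);   /* H0           */
        mul2(f[2], f[3], g[2], g[3], &r[0], &r[1], &r[2]);   /* H2           */
        mul2(f[0] ^ f[2], f[1] ^ f[3],                       /* H0 + H1 + H2 */
             g[0] ^ g[2], g[1] ^ g[3], &p[0], &p[1], &p[2]);
        reconstruct7(p, q, r, h);    /* h0..h6: coefficients of f*g */
    }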

  42. 2000 Bernstein: M(2n) ≤ 3M(n) + 7n − 3. 2009 Bernstein: new bounds on M(n) from further improvements to Karatsuba, Toom, etc.: binary.cr.yp.to/m.html. Typically 20% smaller than 2003 Rodríguez-Henríquez–Koç, 2005 Chang–Kim–Park–Lim, 2006 Weimerskirch–Paar, 2006 von zur Gathen–Shokrollahi, 2007 Peter–Langendörfer.

  43. So far have focused on M(n) for small n, but different techniques are better for large n. I’m now exploring impact of 2008 Gao–Mateer. For F2 ⊆ Fq ⊆ k: 1988 Wang–Zhu, 1989 Cantor diagonalize k[t]/(t^q + t) using ≈ 0.5 q lg q mults in k, ≈ 0.5 q (lg q)^lg 3 adds in k. 2008 Gao–Mateer use ≈ 0.5 q lg q mults, ≈ 0.25 q lg q lg lg q adds.

  44. “Who cares?” Conventional wisdom: detailed M(n) analysis has very little relevance to software speed. We multiply f by g by looking up 4 bits of f in a size-16 table of precomputed multiples of g; looking up the next 4 bits; etc. One table lookup replaces many bit operations! Might use Karatsuba etc., but only for large n.
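
A sketch of that table-lookup strategy for word-sized operands (the 32-bit size and the function name are illustrative):

    #include <stdint.h>

    /* Multiply f by g in F_2[x]: precompute the 16 multiples of g by all
       polynomials of degree < 4, then consume f four bits at a time. */
    uint64_t clmul32(uint32_t f, uint32_t g)
    {
        uint64_t table[16];
        table[0] = 0;
        for (int i = 1; i < 16; i++)     /* table[i] = (poly with bits i) * g */
            table[i] = (table[i >> 1] << 1) ^ ((i & 1) ? (uint64_t)g : 0);

        uint64_t h = 0;
        for (int shift = 28; shift >= 0; shift -= 4)   /* top 4 bits first */
            h = (h << 4) ^ table[(f >> shift) & 0xF];
        return h;
    }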
