Efficient arithmetic on elliptic curves in large characteristic D. J. Bernstein University of Illinois at Chicago
Fix a field and an elliptic curve.
e.g. NIST P-224: the elliptic curve y^2 = x^3 - 3x + a_6 over Z/p. Here p = 2^224 - 2^96 + 1 and a_6 = 18958286285566608000408668544493926415504680968679321075787234672564.
e.g. NIST P-256: the elliptic curve y^2 = x^3 - 3x + ... over Z/p where p = 2^256 - 2^224 + 2^192 + 2^96 - 1.
e.g. Curve25519: the elliptic curve y^2 = x^3 + 486662x^2 + x over Z/p where p = 2^255 - 19.
“Elliptic-curve scalar multiplication”: Given integer n ≥ 0, and given (x, y) on curve, compute nth multiple of (x, y) in the elliptic-curve group.
This is the bottleneck in elliptic-curve Diffie-Hellman.
The big question: How quickly can we do this?
Many variations of problem: e.g. m, n, P, Q ↦ mP + nQ, critical for elliptic-curve signatures.
Review of addition chains
Typical recursive formulas:
2P = P + P. 3P = 2P + P. 4P = 2P + 2P. 5P = 3P + 2P. 6P = 3P + 3P. 7P = 5P + 2P.
2nP = 7P + (2n - 7)P if 4 ≤ n < 8.
(2n + 1)P = 2nP + P if 4 ≤ n < 8.
(4n + 1)P = 4nP + P if 4 ≤ n < 8.
(4n + 3)P = 4nP + 3P if 4 ≤ n < 8.
2nP = nP + nP if 8 ≤ n.
(8n + 1)P = 8nP + P if 4 ≤ n.
(8n + 3)P = 8nP + 3P if 4 ≤ n.
(8n + 5)P = 8nP + 5P if 4 ≤ n.
(8n + 7)P = 8nP + 7P if 4 ≤ n.
This addition chain (“length-3 sliding windows”) uses ≈ lg n doublings and ≈ 0.25 lg n more additions to compute nP for average n.
e.g. ≈ 320 additions, counting doublings, for average n ∈ {0, 1, ..., 2^256 - 1}.
Some easy improvements from fast negation on elliptic curves: (16n - 7)P = 16nP - 7P, etc.
Also use endomorphisms for “Koblitz curves,” “GLV curves.”
More complicated methods replace 0.25 by ≈ 1/lg lg n.
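The recursion above can be checked mechanically. The following is a minimal sketch, assuming integers stand in for multiples of P; the names `split`, `chain` and `chain_length` are illustrative and not from the slides.

```python
import random

# Explicit small cases from the slide: n -> (a, b) with aP + bP = nP.
SMALL = {2: (1, 1), 3: (2, 1), 4: (2, 2), 5: (3, 2), 6: (3, 3), 7: (5, 2)}

def split(n):
    """Return (a, b) with a + b == n, following the slide's recursion (n >= 2)."""
    if n < 8:
        return SMALL[n]
    if n % 2 == 0:
        # 8, 10, 12, 14 use 7P + (odd multiple); larger even n is a doubling.
        return (7, n - 7) if n < 16 else (n // 2, n // 2)
    if n < 16:
        return (n - 1, 1)                 # (2n+1)P = 2nP + P
    window = 4 if n < 32 else 8           # (4n+k)P or (8n+k)P
    return (n - n % window, n % window)

def chain(n):
    """Set of all multiples computed on the way to nP, for n >= 1."""
    seen = {1}
    def visit(m):
        if m not in seen:
            a, b = split(m)
            visit(a)
            visit(b)
            seen.add(m)
    visit(n)
    return seen

def chain_length(n):
    """Number of group additions (doublings included) used to reach nP."""
    return len(chain(n)) - 1

if __name__ == "__main__":
    random.seed(1)
    samples = [chain_length(random.randrange(1, 2**256)) for _ in range(200)]
    print(sum(samples) / len(samples))    # average chain length for 256-bit n
```

Doublings and other additions are counted together here, which is how the ≈ 320 figure arises from ≈ 256 doublings plus ≈ 64 more additions.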
Explicit doubling formulas
On curve y^2 = x^3 - 3x + a_6: 2(x, y) = (x'', y'') where
λ = (3x^2 - 3)/2y,
x'' = λ^2 - 2x,
y'' = λ(x - x'') - y.
7 subs etc., 2 squarings, 1 more mult, 1 division.
How do we divide efficiently in a finite field?
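As a concrete instance of the doubling formulas, here is a minimal sketch over the P-224 prime. The coefficient `A6 = 7` and the point (2, 3) are hypothetical toy values chosen so that the point lies on the curve; they are not the NIST P-224 parameters.

```python
P = 2**224 - 2**96 + 1    # the P-224 prime from the text
A6 = 7                    # hypothetical toy coefficient, NOT the P-224 a_6

def on_curve(x, y, p=P, a6=A6):
    """Check y^2 = x^3 - 3x + a6 over Z/p."""
    return (y * y - (x**3 - 3 * x + a6)) % p == 0

def affine_double(x, y, p=P):
    """Double (x, y) on y^2 = x^3 - 3x + a6 over Z/p; requires y != 0."""
    lam = (3 * x * x - 3) * pow(2 * y, p - 2, p) % p   # the one division
    x2 = (lam * lam - 2 * x) % p
    y2 = (lam * (x - x2) - y) % p
    return x2, y2

if __name__ == "__main__":
    x2, y2 = affine_double(2, 3)
    print(on_curve(2, 3), on_curve(x2, y2))
```

Over the rationals the same formulas give 2(2, 3) = (-7/4, 21/8), which is a convenient hand check.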
Can compute f/g = f g^(p-2) in prime field Z/p with ≈ lg p squarings and ≈ (lg p)/lg lg p more mults.
e.g. p = 2^224 - 2^96 + 1: 223 squarings, 11 more mults.
More generally, f/g = f g^(q-2) in any field of size q.
There are faster division methods (e.g. “Euclid”; beware timing attacks!); smaller “I/M ratio.”
Special methods for some fields.
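The 223-squaring, 11-mult count for p = 2^224 - 2^96 + 1 can be reproduced with an explicit exponentiation chain. The chain below is one possibility that reaches those counts, built from the observation that p - 2 = (2^127 - 1) 2^97 + (2^96 - 1); it is an assumption for illustration and not necessarily the chain the author had in mind.

```python
P224 = 2**224 - 2**96 + 1

def invert_p224(g, p=P224):
    """Return (g^(p-2) mod p, operation counts) using a fixed chain."""
    ops = {"sq": 0, "mul": 0}

    def sqn(v, k):
        # k successive squarings
        for _ in range(k):
            v = v * v % p
        ops["sq"] += k
        return v

    def mul(a, b):
        ops["mul"] += 1
        return a * b % p

    # t<k> denotes g^(2^k - 1); t_{a+b} = (t_a)^(2^b) * t_b.
    t1 = g % p
    t2 = mul(sqn(t1, 1), t1)
    t3 = mul(sqn(t2, 1), t1)
    t6 = mul(sqn(t3, 3), t3)
    t12 = mul(sqn(t6, 6), t6)
    t24 = mul(sqn(t12, 12), t12)
    t48 = mul(sqn(t24, 24), t24)
    t96 = mul(sqn(t48, 48), t48)
    t120 = mul(sqn(t96, 24), t24)
    t126 = mul(sqn(t120, 6), t6)
    t127 = mul(sqn(t126, 1), t1)
    # p - 2 = (2^127 - 1) * 2^97 + (2^96 - 1)
    inv = mul(sqn(t127, 97), t96)
    return inv, ops

if __name__ == "__main__":
    inv, ops = invert_p224(12345)
    print(inv * 12345 % P224, ops)
```

A division f/g is then one more mult: f times the returned inverse.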
Speedup: delay divisions Division costs many mults even with fastest division methods. Save time by delaying divisions. Naive division-delay method: Store field elements as fractions until end of computation. Divide once before output. Mult fractions with 2 field mults. Divide fractions with 2 field mults. Add fractions with 3 field mults.
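The naive division-delay method can be sketched as follows, assuming a fraction is stored as a (numerator, denominator) pair of field elements. The function names and the choice of p = 2^255 - 19 are illustrative only.

```python
P = 2**255 - 19   # any prime field works; this one appears elsewhere in the talk

def frac_mul(a, b, p=P):
    """(n1/d1) * (n2/d2): 2 field mults."""
    (n1, d1), (n2, d2) = a, b
    return (n1 * n2 % p, d1 * d2 % p)

def frac_div(a, b, p=P):
    """(n1/d1) / (n2/d2): 2 field mults, no field division."""
    (n1, d1), (n2, d2) = a, b
    return (n1 * d2 % p, d1 * n2 % p)

def frac_add(a, b, p=P):
    """(n1/d1) + (n2/d2): 3 field mults."""
    (n1, d1), (n2, d2) = a, b
    return ((n1 * d2 + n2 * d1) % p, d1 * d2 % p)

def frac_value(a, p=P):
    """Divide once, at the end of the computation."""
    n, d = a
    return n * pow(d, p - 2, p) % p

if __name__ == "__main__":
    s = frac_add((1, 2), (1, 3))      # 1/2 + 1/3 = 5/6
    print(frac_value(s) * 6 % P)
```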
Speedup: unify denominators
For elliptic-curve doubling, we have
denominator 2y in λ = (3x^2 - 3)/2y;
denominator (2y)^2 in x'' = λ^2 - 2x;
denominator (2y)^3 in y'' = λ(x - x'') - y.
Subsequent computations will perform separate computations on the denominators (2y)^2, (2y)^3 of x'', y''.
Save time by manipulating denominators together.
“Jacobian coordinates”: Store (x, y, z) to represent elliptic-curve point (x/z^2, y/z^3).
2(x/z^2, y/z^3) = (x'', y'') where
λ = (3(x/z^2)^2 - 3)/2(y/z^3) = μ/2yz with μ = 3x^2 - 3z^4;
x'' = λ^2 - 2(x/z^2) = (μ^2 - 8xy^2)/(2yz)^2;
y'' = λ((x/z^2) - x'') - (y/z^3) = (12xy^2 μ - μ^3 - 8y^4)/(2yz)^3.
2(x/z^2, y/z^3) = (x_2/z_2^2, y_2/z_2^3) where
z_2 = 2yz,
μ = 3x^2 - 3z^4,
x_2 = μ^2 - 8xy^2,
y_2 = μ(4xy^2 - x_2) - 8y^4.
Easily compute with 6 squarings, x^2, z^2, z^4, y^2, y^4, μ^2; 3 more mults: yz, xy^2, μ(...).
Also some subs, doublings, etc.
Use fast field arithmetic: e.g., can delay carries and reductions in computing y_2.
Speedup: difference of squares
Can compute 3x^2 - 3z^4 as 3(x - z^2)(x + z^2).
Replace 3 squarings by 1 mult, 1 squaring.
Revised total: 4 squarings, 4 more mults.
Note: 3x^2 - 3z^4 came from 3x^2 - 3, derivative of x^3 - 3x + a_6.
Wouldn’t have same speedup for, e.g., x^3 - 5x + a_6.
Speedup: f^2, g^2, 2fg
After computing f^2 and g^2 can compute 2fg as (f + g)^2 - f^2 - g^2.
In particular: After computing y^2 and z^2 can compute 2yz as (y + z)^2 - y^2 - z^2.
Replace 1 mult with 1 squaring.
Revised total: 5 squarings, 3 more mults.
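Putting the Jacobian doubling formulas and both speedups together gives the following sketch with 5 squarings and 3 other mults. Multiplications by the small constants 3, 4, 8 are treated as additions and doublings. The variable name `mu` for 3x^2 - 3z^4 and the toy test point (2, 3) on the hypothetical curve y^2 = x^3 - 3x + 7 are assumptions for illustration.

```python
P = 2**224 - 2**96 + 1   # the P-224 prime; the formulas need only a_4 = -3

def jacobian_double(x, y, z, p=P):
    """Double the point (x/z^2, y/z^3) on y^2 = x^3 - 3x + a6."""
    zz = z * z % p                          # squaring 1: z^2
    yy = y * y % p                          # squaring 2: y^2
    z2 = ((y + z) ** 2 - yy - zz) % p       # squaring 3: 2yz = (y+z)^2 - y^2 - z^2
    mu = 3 * (x - zz) * (x + zz) % p        # mult 1: 3x^2 - 3z^4 as 3(x - z^2)(x + z^2)
    xyy = x * yy % p                        # mult 2: x y^2
    x2 = (mu * mu - 8 * xyy) % p            # squaring 4: mu^2
    y4 = yy * yy % p                        # squaring 5: y^4
    y2 = (mu * (4 * xyy - x2) - 8 * y4) % p # mult 3
    return x2, y2, z2

def to_affine(x, y, z, p=P):
    """Convert (x, y, z) to (x/z^2, y/z^3) with one field inversion."""
    zinv = pow(z, p - 2, p)
    zinv2 = zinv * zinv % p
    return x * zinv2 % p, y * zinv2 * zinv % p

if __name__ == "__main__":
    print(to_affine(*jacobian_double(2, 3, 1)))
```

Scaling the input to (x z^2, y z^3, z) for any nonzero z should leave the affine result unchanged, which is an easy consistency check.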
Explicit addition formulas Similar speedups in formulas for adding distinct points. 5 squarings, 11 more mults. Again some opportunities to delay carries, etc.
Speedup: cache results
In adding (x_1/z_1^2, y_1/z_1^3) to (x_2/z_2^2, y_2/z_2^3), compute many intermediates, including z_1^2, z_1^3.
Often add same point again to a different point; can reuse z_1^2, z_1^3.
“Chudnovsky coordinates.”
Speedup: delay fewer divisions?
Faster divisions sometimes justify delaying fewer divisions.
e.g. Do we really need fractions for P, 3P, 5P, 7P?
Can convert P, 3P, 5P, 7P out of Jacobian coordinates with one division, several mults.
Then save mults in every addition of P, 3P, 5P, 7P.
“Mixed coordinates.”
Sometimes worthwhile, depending on division speed.
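One way to convert several Jacobian points with a single division is simultaneous inversion (often called Montgomery's trick). The slides do not say which method is meant, so the following is a sketch under that assumption; `batch_to_affine` is an illustrative name.

```python
def batch_to_affine(points, p):
    """Convert Jacobian triples (x, y, z), z != 0, to affine with one inversion."""
    prefix = []            # prefix[i] = z_0 * z_1 * ... * z_i
    acc = 1
    for _, _, z in points:
        acc = acc * z % p
        prefix.append(acc)
    inv = pow(acc, p - 2, p)          # the single division
    out = [None] * len(points)
    for i in range(len(points) - 1, -1, -1):
        x, y, z = points[i]
        # inv currently holds 1 / (z_0 * ... * z_i)
        zinv = inv * prefix[i - 1] % p if i else inv
        inv = inv * z % p             # now 1 / (z_0 * ... * z_{i-1})
        zinv2 = zinv * zinv % p
        out[i] = (x * zinv2 % p, y * zinv2 * zinv % p)
    return out

if __name__ == "__main__":
    p = 2**224 - 2**96 + 1
    pts = [(2 * z**2 % p, 3 * z**3 % p, z) for z in (5, 7, 11, 13)]
    print(batch_to_affine(pts, p))    # four representations of the same point
```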
Montgomery coordinates
On elliptic curves with “Montgomery form” y^2 = x^3 + a_2 x^2 + x, preferably with small (a_2 - 2)/4:
n(x_1, ...) = (x_n/z_n, ...) where
z_1 = 1;
x_{2m} = (x_m^2 - z_m^2)^2;
z_{2m} = 4 x_m z_m (x_m^2 + a_2 x_m z_m + z_m^2);
x_{2m+1} = 4(x_m x_{m+1} - z_m z_{m+1})^2;
z_{2m+1} = 4(x_m z_{m+1} - z_m x_{m+1})^2 x_1.
Can also figure out y, or use cryptographic protocols that ignore y.
[Diagram: data flow of one ladder step, from inputs x_m, z_m, x_{m+1}, z_{m+1} through additions, subtractions and multiplications, using the constants (a_2 - 2)/4 and x_1, to outputs x_{2m}, z_{2m}, x_{2m+1}, z_{2m+1}.]
Assuming (a_2 - 2)/4 small, main operations are 4 squarings, 5 more mults for each bit of n.
Compare to Jacobian coordinates: each bit of n has 5 squarings, 3 more mults, and on occasion 5 more squarings, 11 more mults.
Montgomery form is better if n is not gigantic.
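The Montgomery recurrences can be turned into a scalar multiplication that tracks (x_m, z_m) and (x_{m+1}, z_{m+1}) and consumes the bits of n from the top. The sketch below applies the recurrences literally, so it does not reach the 4-squaring, 5-mult arrangement per bit, and it makes no attempt at constant-time behaviour. The function names are illustrative; the curve constants are those of Curve25519 from the text.

```python
P = 2**255 - 19
A2 = 486662

def double(x, z, p=P, a2=A2):
    """(x_m, z_m) -> (x_2m, z_2m)."""
    return ((x * x - z * z) ** 2 % p,
            4 * x * z * (x * x + a2 * x * z + z * z) % p)

def ladder(n, x1, p=P, a2=A2):
    """x-coordinate of n times a point with x-coordinate x1 (x1 != 0)."""
    xm, zm = 1, 0            # m = 0: the point at infinity
    xn, zn = x1 % p, 1       # m + 1 = 1, with z_1 = 1
    for i in reversed(range(n.bit_length())):
        # (x_{2m+1}, z_{2m+1}) from the pair (m, m+1)
        xa = 4 * (xm * xn - zm * zn) ** 2 % p
        za = 4 * (xm * zn - zm * xn) ** 2 * x1 % p
        if (n >> i) & 1:
            # new pair is (2m+1, 2m+2)
            xd, zd = double(xn, zn, p, a2)
            xm, zm, xn, zn = xa, za, xd, zd
        else:
            # new pair is (2m, 2m+1)
            xd, zd = double(xm, zm, p, a2)
            xm, zm, xn, zn = xd, zd, xa, za
    return xm * pow(zm, p - 2, p) % p    # one division at the end

if __name__ == "__main__":
    a, b = 123456789, 987654321
    print(ladder(a, ladder(b, 9)) == ladder(b, ladder(a, 9)))
```

The usage line checks the Diffie-Hellman property that the order of the two scalar multiplications does not matter, which only needs x-coordinates.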
What are today’s speed records?
Let’s focus on Pentium M.
Each Pentium M cycle does ≤ 1 floating-point operation: fp add or fp sub or fp mult.
Current scalar-multiplication software for y^2 = x^3 + 486662x^2 + x over Z/(2^255 - 19): 640838 Pentium M cycles.
589825 fp ops; ≈ 0.92 per cycle.
Understand cycle counts fairly well by simply counting fp ops.
Main loop: 545700 fp ops. 2140 times 255 iterations.
Reciprocal: 43821 fp ops.
41148 = 254 · 162 for 254 squares;
2673 = 11 · 243 for 11 more mults.
Additional work: 304 fp ops.
Inside one main-loop iteration:
80 = 8 · 10 for 8 adds/subs;
55 for mult by 121665;
648 = 4 · 162 for 4 squarings;
1215 = 5 · 243 for 5 more mults;
142 for b x[1] + (1 - b) x[0] etc.
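The tally above is simple arithmetic and can be replayed directly. The constant names below are illustrative; every number is taken from the slides.

```python
# fp-op costs of the building blocks, as stated on the slides
ADD_SUB = 10       # one field add/sub
MUL_121665 = 55    # mult by the constant 121665
SQUARE = 162       # one field squaring
MULT = 243         # one general field mult
SELECT = 142       # b*x[1] + (1-b)*x[0] etc.

per_iteration = 8 * ADD_SUB + MUL_121665 + 4 * SQUARE + 5 * MULT + SELECT
main_loop = 255 * per_iteration
reciprocal = 254 * SQUARE + 11 * MULT
total = main_loop + reciprocal + 304      # 304 fp ops of additional work

if __name__ == "__main__":
    print(per_iteration, main_loop, reciprocal, total)
    print(total / 640838)                  # fp ops per Pentium M cycle
```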
An integer mod 2^255 - 19 is represented in radix 2^25.5 as a sum of 10 fp numbers in specified ranges.
Add/sub: 10 fp adds/subs. Delay reductions and carries!
Mult: poly mult using 10^2 fp mults, 9^2 fp adds; reduce using 9 fp mults, 9 fp adds; carry 11 times, each 4 fp adds; overall 2 · 10^2 + 4 · 10 + 3 fp ops.
Squaring: first do 9 fp doublings; then eliminate 9^2 + 9 fp ops; overall 1 · 10^2 + 6 · 10 + 2 fp ops.
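The operation counts for a mult can be illustrated with a small model of the representation. This sketch uses Python integers instead of floating-point numbers, assumes limb i is a multiple of 2^ceil(25.5 i), and leaves out the carry step, so it counts only the polynomial mult and the reduction. The names `to_limbs` and `mult` are illustrative and do not describe the actual software.

```python
P = 2**255 - 19
EXP = [-(-51 * i // 2) for i in range(10)]   # ceil(25.5 * i): 0, 26, 51, 77, ...

def to_limbs(n):
    """Split 0 <= n < 2^255 into 10 limbs; limb i keeps bits EXP[i]..EXP[i+1]-1 in place."""
    limbs = []
    for i in range(10):
        hi = EXP[i + 1] if i < 9 else 255
        limbs.append(n % 2**hi - n % 2**EXP[i])
    return limbs

def mult(f, g):
    """Multiply two limb vectors mod 2^255 - 19, without carries; count ops."""
    ops = {"mul": 0, "add": 0}
    h = [None] * 19
    for i in range(10):
        for j in range(10):
            prod = f[i] * g[j]
            ops["mul"] += 1
            if h[i + j] is None:
                h[i + j] = prod
            else:
                h[i + j] += prod
                ops["add"] += 1
    for k in range(9):
        # 2^255 = 19 mod p; h[k+10] is an exact multiple of 2^255.
        # In floating point this is one mult by the constant 19 * 2^-255.
        h[k] += 19 * (h[k + 10] >> 255)
        ops["mul"] += 1
        ops["add"] += 1
    return h[:10], ops

if __name__ == "__main__":
    a, b = 3**150, 5**100
    h, ops = mult(to_limbs(a), to_limbs(b))
    print(sum(h) % P == a * b % P, ops)
    print(ops["mul"] + ops["add"] + 11 * 4)   # adding the 11 carries of 4 fp adds each
```

Adding the 11 carries at 4 fp adds each to the counted 109 mults and 90 adds gives the 243 fp ops stated for a mult.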
Course advertisement “High-speed cryptography” at the Fields Institute, 36 hours, starting 23 Oct, ending 17 Nov. What are the state-of-the-art cryptographic functions for sharing secrets, expanding keys, authenticating data, signing data? How fast are these functions in software for typical CPUs? What’s known about security? How were the functions chosen? cr.yp.to/highspeed.html