Pushing the Limits of High-Speed $GF(2^m)$ Elliptic Curve Scalar Multiplier on FPGAs


  1. Pushing the Limits of High-Speed $GF(2^m)$ Elliptic Curve Scalar Multiplier on FPGAs
     Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay
     Secured Embedded Architecture Lab, Indian Institute of Technology Kharagpur, India
     9/12/2012, CHES 2012, Leuven, Belgium

  2. Elliptic Curve Scalar Multiplication
     • An elliptic curve over $GF(2^m)$ is the set of points satisfying $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in GF(2^m)$ and $b \neq 0$. The points on the curve, together with the point at infinity, form an additive group.
     • In projective coordinates $(X, Y, Z)$ with $x = X/Z$ and $y = Y/Z^2$, the curve is $Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4$.
     • Scalar multiplication: given a base point $P = (X_P, Y_P, Z_P)$ on the curve and a scalar $s$, compute $Q = sP$, i.e. $Q = P + P + \cdots + P$ ($s$ times).
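The affine and projective curve equations can be checked against each other numerically. The sketch below is illustrative only. It assumes a toy field $GF(2^4)$ with reduction polynomial $x^4 + x + 1$, toy coefficients $a = b = 1$, and López-Dahab style projective coordinates ($x = X/Z$, $y = Y/Z^2$), under which the cubic term reads $X^3 Z$. All helper names (`gf_mul`, `on_affine_curve`, `to_projective`, and so on) are hypothetical and do not come from the presentation.

```python
# Toy parameters (assumptions for illustration, not from the slides).
M = 4
POLY = 0b10011          # x^4 + x + 1, irreducible over GF(2)
A, B = 1, 1             # curve coefficients a and b (b must be nonzero)


def gf_mul(u, v):
    """Multiply two elements of GF(2^M), stored as integer bit vectors."""
    result = 0
    while v:
        if v & 1:
            result ^= u          # addition in GF(2^m) is XOR
        v >>= 1
        u <<= 1
        if u >> M:               # degree reached M: reduce modulo POLY
            u ^= POLY
    return result


def gf_pow(u, e):
    """Raise u to a small non-negative integer power by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, u)
        u = gf_mul(u, u)
        e >>= 1
    return result


def on_affine_curve(x, y):
    """Check y^2 + xy = x^3 + a x^2 + b."""
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_pow(x, 3) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs


def on_projective_curve(X, Y, Z):
    """Check Y^2 + XYZ = X^3 Z + a X^2 Z^2 + b Z^4."""
    lhs = gf_mul(Y, Y) ^ gf_mul(gf_mul(X, Y), Z)
    rhs = (gf_mul(gf_pow(X, 3), Z)
           ^ gf_mul(A, gf_mul(gf_mul(X, X), gf_mul(Z, Z)))
           ^ gf_mul(B, gf_pow(Z, 4)))
    return lhs == rhs


def to_projective(x, y, Z):
    """Lift an affine point to projective form for a chosen nonzero Z."""
    return gf_mul(x, Z), gf_mul(y, gf_mul(Z, Z)), Z


# Enumerate every affine point of the toy curve by brute force.
affine_points = [(x, y) for x in range(1 << M) for y in range(1 << M)
                 if on_affine_curve(x, y)]
# Every lift of every affine point should satisfy the projective equation.
all_lifts_ok = all(on_projective_curve(*to_projective(x, y, Z))
                   for (x, y) in affine_points for Z in range(1, 1 << M))
print(len(affine_points), all_lifts_ok)
```

Brute-force enumeration is only feasible because the field is tiny. For a field such as $GF(2^{233})$ the same predicates apply, but points have to be supplied rather than searched for.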

  4. Montgomery Ladder for Scalar Multiplication
     Inputs: scalar $s = (s_{t-1} s_{t-2} \cdots s_1 s_0)_2$, base point $P$
     Output: scalar product $Q = sP$
     1. $P_1 = (X_1, Y_1, Z_1) \leftarrow P$ and $P_2 = (X_2, Y_2, Z_2) \leftarrow 2 \cdot P$
     2. For each bit $s_k$ (for $k = t-2, t-3, \cdots, 0$):
        • if $s_k = 1$ then $P_1 \leftarrow P_1 + P_2$; $P_2 \leftarrow 2 \cdot P_2$
        • if $s_k = 0$ then $P_2 \leftarrow P_1 + P_2$; $P_1 \leftarrow 2 \cdot P_1$
     3. $Q = \text{Projective2Affine}(P_1)$
     Performing $P_i \leftarrow P_i + P_j$ and $P_j \leftarrow 2 \cdot P_j$, where $x$ is the affine x-coordinate of $P$:
        $X_i \leftarrow X_i \cdot Z_j$; $Z_i \leftarrow X_j \cdot Z_i$; $T \leftarrow X_j$; $X_j \leftarrow X_j^4 + b \cdot Z_j^4$;
        $Z_j \leftarrow (T \cdot Z_j)^2$; $T \leftarrow X_i \cdot Z_i$; $Z_i \leftarrow (X_i + Z_i)^2$; $X_i \leftarrow x \cdot Z_i + T$
     ... all operations are in $GF(2^m)$.
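The ladder and its combined add-and-double step can be sketched in software as follows. This is a minimal, x-coordinate-only illustration and not the hardware design from the talk. It assumes the field $GF(2^{233})$ with the NIST reduction polynomial $x^{233} + x^{74} + 1$ and the curve coefficient $b = 1$ (as on the Koblitz curve K-233); both choices are assumptions made for the example. The function names are hypothetical, the final conversion returns only the affine x-coordinate of $sP$, and y-coordinate recovery is omitted.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed field polynomial)
B = 1                               # assumed curve coefficient b


def gf_mul(u, v):
    """Multiply two elements of GF(2^M), stored as integer bit vectors."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        v >>= 1
        u <<= 1
        if u >> M:
            u ^= POLY
    return result


def gf_sqr(u):
    return gf_mul(u, u)


def gf_inv(u):
    """Fermat inversion u^(2^M - 2) by plain square-and-multiply."""
    result, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            result = gf_mul(result, u)
        u = gf_sqr(u)
        e >>= 1
    return result


def ladder_step(Xi, Zi, Xj, Zj, x):
    """P_i <- P_i + P_j and P_j <- 2 * P_j, in the operation order of the slide."""
    Xi = gf_mul(Xi, Zj)
    Zi = gf_mul(Xj, Zi)
    T = Xj
    Xj = gf_sqr(gf_sqr(Xj)) ^ gf_mul(B, gf_sqr(gf_sqr(Zj)))   # X_j^4 + b Z_j^4
    Zj = gf_sqr(gf_mul(T, Zj))                                # (X_j Z_j)^2
    T = gf_mul(Xi, Zi)
    Zi = gf_sqr(Xi ^ Zi)
    Xi = gf_mul(x, Zi) ^ T
    return Xi, Zi, Xj, Zj


def montgomery_ladder_x(s, x):
    """Return the affine x-coordinate of s*P, given x = x(P) and s >= 1."""
    X1, Z1 = x, 1                                   # P1 = P
    X2, Z2 = gf_sqr(gf_sqr(x)) ^ B, gf_sqr(x)       # P2 = 2P
    for k in range(s.bit_length() - 2, -1, -1):
        if (s >> k) & 1:
            X1, Z1, X2, Z2 = ladder_step(X1, Z1, X2, Z2, x)
        else:
            X2, Z2, X1, Z1 = ladder_step(X2, Z2, X1, Z1, x)
    return gf_mul(X1, gf_inv(Z1))                   # x = X / Z
```

Both branches of the loop call the same `ladder_step`, only with the roles of the two points swapped, so every scalar bit costs the same sequence of field operations. A quick consistency check is `montgomery_ladder_x(15, x) == montgomery_ladder_x(3, montgomery_ladder_x(5, x))`, since $15P = 3 \cdot (5P)$.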

  6. Engineering the Montgomery Ladder for Scalar Multiplication
     [Figure: (a) The ECC pyramid: scalar multiplication, elliptic curve group operations, finite field operations. (b) Block diagram: arithmetic unit, register bank, ROM, and control unit, with input $s$ and output $sP$.]
     High-speed scalar multiplication on FPGAs:
     • Minimize area by maximizing utilization of available resources
     • Optimal pipelining
     • Efficient scheduling of operations

  8. Field Programmable Gate Arrays
     • Provide the speed of hardware and the reconfigurability of software
     • FPGA architecture
     [Figure: (a) FPGA island: logic blocks connected through programmable connection switches and programmable routing switches. (b) Lookup table: a LUT with inputs F1 to F4, carry and control logic, and a flip-flop.]

  10. LUT Utilization
      • A LUT has four (or six) inputs and one output
      • It can implement any four (or six) input truth table
      • $y_1 = x_1 \oplus x_2 \oplus x_3 \oplus x_4$ requires one LUT.
      • $y_2 = x_1 \oplus x_2$ still requires one LUT.
      • $y_2$ results in an under-utilized LUT.
      ... need to maximize LUT utilization to minimize area.
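The LUT counts above can be reproduced with a small counting argument. The sketch below is an assumption-laden illustration, not a formula taken from the slides: it models an $n$-input XOR as a tree of $k$-input LUTs, where each LUT reduces the number of outstanding signals by $k - 1$. The function names are hypothetical.

```python
def xor_lut_count(n_inputs, k=4):
    """LUTs needed for an n-input XOR built as a tree of k-input LUTs."""
    if n_inputs <= 1:
        return 0                              # a wire needs no LUT
    return -(-(n_inputs - 1) // (k - 1))      # ceil((n - 1) / (k - 1))


def lut_input_utilization(n_inputs, k=4):
    """Fraction of LUT input pins that carry a signal in such a tree."""
    luts = xor_lut_count(n_inputs, k)
    if luts == 0:
        return 0.0
    # Pins in use: the n primary inputs plus one pin per intermediate signal.
    used_pins = n_inputs + (luts - 1)
    return used_pins / (luts * k)


for n in (2, 4, 5, 7):
    print(n, xor_lut_count(n), lut_input_utilization(n))
```

Under this model the two-input XOR $y_2$ and the four-input XOR $y_1$ both cost one four-input LUT, but $y_2$ uses only half of the available input pins.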

  14. Finite Field Multiplier for Best LUT Utilization
      [Figure: (a) Karatsuba-Ofman multiplication: recursion tree of a 233-bit Karatsuba multiplier, with operand sizes 233, 117/116, 59/58, 30/29, 15/14, 8/7, ... (b) Hybrid Karatsuba multiplication: the same tree with classical multipliers at the lowest levels. (c) Finding the right threshold: LUTs (8800 to 9600) versus threshold (4 to 22). (d) Comparing multipliers: area × time versus number of bits (0 to 540) for Karatsuba-Ofman and hybrid Karatsuba.]
      [VLSID 2008]
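The recursion behind the figure can be sketched in software as a Karatsuba multiplier for polynomials over $GF(2)$ that switches to a classical (schoolbook) multiplier once the operands are at most `threshold` bits wide. This is a software analogue under stated assumptions, not the hardware multiplier of the cited work: the function names and the default threshold are hypothetical, and the result is the unreduced polynomial product (no modular reduction).

```python
def classical_mul(a, b):
    """Schoolbook carry-less multiplication of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def hybrid_karatsuba(a, b, nbits, threshold=8):
    """Multiply two nbits-wide GF(2) polynomials.

    Operands wider than `threshold` bits are split Karatsuba-style;
    narrower ones are handled by the classical multiplier.
    """
    if nbits <= threshold:
        return classical_mul(a, b)
    half = (nbits + 1) // 2              # e.g. 233 -> 117 low bits, 116 high bits
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = hybrid_karatsuba(a_lo, b_lo, half, threshold)
    hi = hybrid_karatsuba(a_hi, b_hi, nbits - half, threshold)
    mid = hybrid_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold)
    # Over GF(2), the cross term a_lo*b_hi + a_hi*b_lo equals mid + lo + hi.
    return lo ^ ((mid ^ lo ^ hi) << half) ^ (hi << (2 * half))
```

Calling `hybrid_karatsuba(a, b, 233, threshold=t)` for different values of `t` always yields the same product; in hardware the threshold instead trades Karatsuba's fewer sub-multiplications against the better LUT packing of small classical multipliers, which is what panel (c) explores.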

  16. Finite Field Inversion Using Itoh-Tsujii Algorithm
      • Given $a \in GF(2^m)$, find $a^{-1} \in GF(2^m)$ such that $a \cdot a^{-1} = 1$
      • Fermat's Little Theorem: $a^{-1} = a^{2^m - 2}$
      • Itoh-Tsujii Algorithm
        1. Define the addition chain for $m - 1$ (for example, $m = 233$: $(1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)$)
        2. Compute $a \rightarrow a^{2^2 - 1} \rightarrow a^{2^3 - 1} \rightarrow a^{2^6 - 1} \rightarrow a^{2^7 - 1} \rightarrow a^{2^{14} - 1} \cdots \rightarrow a^{2^{232} - 1}$
        3. Square to get $a^{2^{233} - 2}$
      • Exponentiation requires a series of cascaded squarers, called a powerblock, along with a finite field multiplier
      [Figure: powerblock: the input passes through cascaded squarer circuits (Square Circuit-1, Square Circuit-2, Square Circuit-3, ..., Square Circuit-11); a multiplexer controlled by qsel selects the output.]
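The three steps of the Itoh-Tsujii algorithm can be sketched for $m = 233$ as follows. The sketch assumes the NIST reduction polynomial $x^{233} + x^{74} + 1$ and uses hypothetical names throughout. It writes $\beta_k = a^{2^k - 1}$ and relies on the identity $\beta_{i+j} = \beta_i^{2^j} \cdot \beta_j$. The addition chain used here includes the intermediate term 29, so that every element is the sum of two earlier ones (58 = 29 + 29).

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed field polynomial)

# Addition chain for m - 1 = 232, written as (i, j) pairs that produce i + j:
# 1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232
STEPS = [(1, 1), (2, 1), (3, 3), (6, 1), (7, 7),
         (14, 14), (28, 1), (29, 29), (58, 58), (116, 116)]


def gf_mul(u, v):
    """Multiply two elements of GF(2^M), stored as integer bit vectors."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        v >>= 1
        u <<= 1
        if u >> M:
            u ^= POLY
    return result


def itoh_tsujii_inverse(a):
    """Return a^(-1) = a^(2^M - 2) using one multiplication per chain step."""
    beta = {1: a}                       # beta[k] holds a^(2^k - 1)
    for i, j in STEPS:
        t = beta[i]
        for _ in range(j):              # j cascaded squarings: t = beta_i^(2^j)
            t = gf_mul(t, t)
        beta[i + j] = gf_mul(t, beta[j])
    return gf_mul(beta[232], beta[232])  # final squaring: a^(2^233 - 2)
```

With this chain the inversion costs 10 field multiplications plus 232 squarings, compared with roughly 232 multiplications for plain square-and-multiply exponentiation. The inner loop over `j` is the part that the powerblock of cascaded squarers accelerates in hardware.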

  19. Using Higher Exponents in the Itoh-Tsujii Algorithm
      Consider using a quad circuit (raising to the fourth power) instead of a squarer.
      • This requires an addition chain for $\frac{m-1}{2}$ instead of $m - 1$, and thus finishes faster.
      • The frequency of operation is not affected, and the area used is less due to better LUT utilization: one quad circuit needs about 0.75 times the LUTs of two cascaded squarers.

      Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

      Field          Squarer Circuit            Quad Circuit               Size ratio
                     #LUT_s    Delay (ns)       #LUT_q    Delay (ns)       #LUT_q / (2 · #LUT_s)
      GF(2^193)      96        1.48             145       1.48             0.75
      GF(2^233)      153       1.48             230       1.48             0.75

      • Larger exponent circuits can similarly be used to obtain faster results.
      [IEEE TVLSI 2011, DATE 2011]
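The quad-based variant can be sketched in the same style. Here $\beta_k = a^{4^k - 1}$, so $\beta_1 = a^3$ and $\beta_{i+j} = \beta_i^{4^j} \cdot \beta_j$, and the addition chain only has to reach $(m-1)/2 = 116$ for $m = 233$. As before, the field polynomial $x^{233} + x^{74} + 1$, the particular chain, and all names are assumptions made for illustration; in software the quad is simply two squarings, whereas the slide's point is that a dedicated quad circuit has the same delay as a squarer.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed field polynomial)

# Addition chain for (m - 1) / 2 = 116, as (i, j) pairs that produce i + j:
# 1, 2, 3, 6, 7, 14, 28, 29, 58, 116
QUAD_STEPS = [(1, 1), (2, 1), (3, 3), (6, 1), (7, 7),
              (14, 14), (28, 1), (29, 29), (58, 58)]


def gf_mul(u, v):
    """Multiply two elements of GF(2^M), stored as integer bit vectors."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        v >>= 1
        u <<= 1
        if u >> M:
            u ^= POLY
    return result


def quad(u):
    """Raise u to the fourth power (stand-in for one quad circuit)."""
    s = gf_mul(u, u)
    return gf_mul(s, s)


def quad_itoh_tsujii_inverse(a):
    """Return a^(-1) = a^(2^M - 2) using quads instead of squarings."""
    beta = {1: gf_mul(gf_mul(a, a), a)}   # beta[1] = a^3 = a^(4^1 - 1)
    for i, j in QUAD_STEPS:
        t = beta[i]
        for _ in range(j):                # j cascaded quads: t = beta_i^(4^j)
            t = quad(t)
        beta[i + j] = gf_mul(t, beta[j])
    # beta[116] = a^(2^232 - 1); one squaring gives a^(2^233 - 2).
    return gf_mul(beta[116], beta[116])
```

With this chain the cascaded exponentiation stage is applied 115 times (`sum(j for _, j in QUAD_STEPS)`), against 231 squarings for the corresponding squarer-based chain to 232, which is where the speed-up comes from.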
