How Fast Can Higher-Order Masking Be in Software?
Dahmun Goudarzi and Matthieu Rivain
EUROCRYPT 2017, Paris
1. Introduction
2. Field Multiplications
3. Non-Linear Operations
4. Generic Polynomial Methods
5. Polynomial Methods for AES
6. The Bitslice Strategy
Higher-Order Masking
$x = x_1 + x_2 + \cdots + x_d$
• Linear operations: $O(d)$
• Non-linear operations: $O(d^2)$
• Challenge for block ciphers: S-boxes
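Ishai-Sahai-Wagner Multiplication

$$\sum_i c_i = \Big(\sum_i a_i\Big) \times \Big(\sum_i b_i\Big) = \sum_{i,j} a_i \times b_j$$

$$
\begin{pmatrix}
a_1 b_1 & a_1 b_2 & \cdots & a_1 b_d \\
0 & a_2 b_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & a_d b_d
\end{pmatrix}
+
\begin{pmatrix}
0 & 0 & \cdots & 0 \\
a_2 b_1 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
a_d b_1 & a_d b_2 & \cdots & 0
\end{pmatrix}
+
\begin{pmatrix}
0 & r_{1,2} & \cdots & r_{1,d} \\
r_{1,2} & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & r_{d,d-1} \\
r_{1,d} & \cdots & r_{d,d-1} & 0
\end{pmatrix}
$$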
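The ISW multiplication can be sketched in Python as follows. This is a minimal, non-constant-time illustration and not the authors' ARM implementation: the field GF(2^8) with modulus 0x11B, the helper names `gf_mul`, `share`, `unshare`, and `isw_mul`, and the use of `secrets` for randomness are all assumptions made for the example.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b, poly=0x11B, n=8):
    """Shift-and-add multiplication in GF(2^n); poly is an assumed modulus."""
    r = 0
    for _ in range(n):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:          # degree reached n: reduce
            a ^= poly
    return r

def share(x, d, n=8):
    """Split x into d shares whose XOR equals x."""
    s = [secrets.randbelow(1 << n) for _ in range(d - 1)]
    return s + [reduce(xor, s, x)]

def unshare(shares):
    """Recombine shares (only for checking the result)."""
    return reduce(xor, shares, 0)

def isw_mul(a, b, n=8):
    """ISW multiplication: returns shares c with XOR(c) = XOR(a) * XOR(b)."""
    d = len(a)
    c = [gf_mul(a[i], b[i]) for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r_ij = secrets.randbelow(1 << n)
            # The bracketing matters: the random value is added first.
            r_ji = (r_ij ^ gf_mul(a[i], b[j])) ^ gf_mul(a[j], b[i])
            c[i] ^= r_ij
            c[j] ^= r_ji
    return c
```

Each pair (i, j) consumes one random element, so the sketch uses d(d-1)/2 random values and d^2 field multiplications, matching the O(d^2) cost of non-linear operations.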
The Polynomial Methods
• S-box seen as a polynomial over GF($2^n$): $S(x) = \sum_{i=0}^{2^n-1} a_i x^i$
• Generic methods: $S(x) = \sum_i (p_i \star q_i)(x)$
  ◮ CRV decomposition, $\star = \times$ (CHES 2014)
  ◮ Algebraic decomposition, $\star = \circ$ (CRYPTO 2015)
• AES-specific methods: $S_{\mathrm{AES}}(x) = \mathrm{Aff}(x^{254})$
  ◮ RP multiplication chain (CHES 2010)
  ◮ KHL multiplication chain (CHES 2011)
Our results
• Optimized implementations of state-of-the-art higher-order masking techniques
• Bottom-up approach:
  ◮ base field multiplication
  ◮ ISW/CPRR
  ◮ polynomial methods
• Finely tuned ARM assembly (parallelization)
• Alternative strategy: bitslice method (new AES and PRESENT speed records)
ARM
• 32-bit architecture with 16 registers (13 user-accessible registers)
• Barrel shifter: shifts and rotates are virtually free
• Example: x-times and add on GF(2)[x] in 1 cycle:
    EOR $acc, $var, $acc, LSL #1
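The "x-times and add" step computes acc = var XOR (acc << 1), which is one Horner iteration of a polynomial multiplication over GF(2). A minimal Python sketch of a binary multiplication built on that step is given below. The function names, the choice of GF(16) with modulus x^4 + x + 1 (0x13), and the data-dependent operand selection are assumptions for illustration; they are not taken from the slides.

```python
def clmul(a, b, n):
    """Carry-less product of two n-bit polynomials over GF(2), Horner style."""
    acc = 0
    for i in reversed(range(n)):
        var = a if (b >> i) & 1 else 0   # operand selected by bit i of b
        acc = var ^ (acc << 1)           # the "x-times and add" step
    return acc

def reduce_mod(c, poly, n):
    """Reduce a polynomial of degree at most 2n-2 modulo poly (degree n)."""
    for i in reversed(range(n, 2 * n - 1)):
        if (c >> i) & 1:
            c ^= poly << (i - n)
    return c

def gf_mul(a, b, poly=0x13, n=4):
    """Multiplication in GF(2^n) as carry-less product plus reduction."""
    return reduce_mod(clmul(a, b, n), poly, n)
```

For instance, `gf_mul(2, 8)` multiplies x by x^3, giving x^4, which reduces to x + 1 under the assumed modulus.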
Field Multiplication
• Goal: efficient implementation of multiplication over GF($2^n$)
• Fastest method: precomputed look-up table
• Limitation: constrained memory on embedded systems

| n | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|
| Table size | 0.25 kiB | 1 kiB | 4 kiB | 16 kiB | 64 kiB | 512 kiB | 2048 kiB |
Field Multiplication

| Method | Clock cycles | Registers | Code size (bytes) | Code size, n = 4 | Code size, n = 8 |
|---|---|---|---|---|---|
| bin mult v1 | $10n + 3$ | 5 | 52 | 52 B | 52 B |
| bin mult v2 | $7n + 3$ | 5 | $2^{n-1} + 48$ | 56 B | 176 B |
| exp-log v1 | 18 | 5 | $2^{n+1} + 48$ | 80 B | 560 B |
| exp-log v2 | 16 | 5 | $3 \cdot 2^n + 40$ | 88 B | 808 B |
| kara. | 19 | 6 | $3 \cdot 2^n + 42$ | 90 B | 810 B |
| half-tab | 10 | 5 | $2^{3n/2+1} + 24$ | 152 B | 8216 B |
| full-tab | 4 | 5 | $2^{2n} + 12$ | 268 B | 64 kiB |

Karatsuba: $a \times b = (a_h x^{n/2} + a_\ell) \times (b_h x^{n/2} + b_\ell) = T_1[a_h \mid b_h] + T_2[a_\ell \mid b_\ell] + T_3[a_h + a_\ell \mid b_h + b_\ell]$

Half table: $a \times b = (a_h x^{n/2} + a_\ell) \times (b_h x^{n/2} + b_\ell) = T_1[a_h \mid a_\ell \mid b_h] + T_2[a_h \mid a_\ell \mid b_\ell]$

• For n = 4: full table
  ◮ Fastest multiplication: 4 clock cycles
  ◮ Low code size: 268 B
• For n = 8: exp-log or half-tab
  ◮ Trade-off between clock cycles and code size
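The half-table formula can be illustrated with a short Python sketch for n = 8. The modulus 0x11B, the helper `gf_mul`, and the convention that T1 already absorbs the factor x^(n/2) are assumptions for this example; the slides only give the table indexing.

```python
def gf_mul(a, b, poly=0x11B, n=8):
    """Reference shift-and-add multiplication in GF(2^n) (assumed modulus)."""
    r = 0
    for _ in range(n):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

N = 8                    # field size in bits
H = N // 2               # half size
MASK = (1 << H) - 1

# Index layout: (a << H) | half_of_b, i.e. [a_h | a_l | b_h] or [a_h | a_l | b_l].
# T1 holds a * (b_h * x^(n/2)), T2 holds a * b_l.
T1 = [gf_mul(idx >> H, (idx & MASK) << H) for idx in range(1 << (N + H))]
T2 = [gf_mul(idx >> H, idx & MASK) for idx in range(1 << (N + H))]

def half_table_mul(a, b):
    """a * b from two lookups and one XOR."""
    b_h, b_l = b >> H, b & MASK
    return T1[(a << H) | b_h] ^ T2[(a << H) | b_l]
```

With one byte per entry, the two tables together hold 2 · 2^12 = 2^13 bytes, which agrees with the $2^{3n/2+1}$ term of the half-tab code size.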
Quadratic Operations
• ISW
  ◮ Secure GF-mult of 2 operands
  ◮ Might need refreshing (see paper for details)
• CPRR
  ◮ Evaluation of quadratic functions in 1 operand
  ◮ Similar to ISW: GF-mult → lookup tables
  ◮ Requires twice as much randomness
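A minimal Python sketch of a CPRR-style evaluation of a quadratic function h on a shared input follows. It relies on the fact that, for h of algebraic degree 2, h(a+b) + h(a) + h(b) + h(0) is bilinear. The field GF(2^8) with modulus 0x11B, the example function h(x) = x^3, and all helper names are assumptions for illustration; a real implementation would replace the calls to `h` by table lookups.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b, poly=0x11B, n=8):
    """Shift-and-add multiplication in GF(2^n) (assumed modulus)."""
    r = 0
    for _ in range(n):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

def cube(x):
    """Example quadratic function: x^3 has algebraic degree 2 (3 = 2^0 + 2^1)."""
    return gf_mul(x, gf_mul(x, x))

def share(x, d, n=8):
    """Split x into d shares whose XOR equals x."""
    s = [secrets.randbelow(1 << n) for _ in range(d - 1)]
    return s + [reduce(xor, s, x)]

def unshare(shares):
    return reduce(xor, shares, 0)

def cprr_eval(x, h, n=8):
    """Return shares y with XOR(y) = h(XOR(x)) for a quadratic h."""
    d = len(x)
    y = [h(xi) for xi in x]
    for i in range(d):
        for j in range(i + 1, d):
            s = secrets.randbelow(1 << n)   # masks the lookup inputs
            r = secrets.randbelow(1 << n)   # masks the cross term
            t = r ^ h(x[i] ^ s) ^ h(x[j] ^ s) ^ h((x[i] ^ s) ^ x[j]) ^ h(s)
            y[i] ^= r
            y[j] ^= t
    if d % 2 == 0:                          # constant-term correction
        y[0] ^= h(0)
    return y
```

Each pair (i, j) draws two random elements instead of one, which is where the doubled randomness relative to ISW comes from.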
Performance Comparison
[Figure: clock cycles of ISW-FT, ISW-HT, ISW-EL, and CPRR for d = 3, 5, 10]
• ISW < CPRR when the table is too large
• Asymptotic comparison: 1 CPRR ≈ 1.16 ISW-FT, 0.88 ISW-HT, 0.75 ISW-EL
Parallelization
• 32-bit registers filled with only n-bit elements
• Perform several ISW/CPRR in parallel:
  ◮ n = 4 → 8 elements/register
  ◮ n = 8 → 4 elements/register
• Consequence:
  ◮ Parallel: load, store, xor, loops
  ◮ Sequential: GF mult, CPRR lookups
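The packing idea can be sketched in Python: several n-bit field elements share one 32-bit word, so a single XOR adds all lanes at once, while multiplications still have to handle each lane separately. The helper names `pack` and `unpack` are hypothetical and only illustrate the layout.

```python
def pack(elems, n):
    """Pack 32 // n elements of n bits each into one 32-bit word."""
    assert len(elems) == 32 // n
    mask = (1 << n) - 1
    word = 0
    for k, e in enumerate(elems):
        word |= (e & mask) << (k * n)
    return word

def unpack(word, n):
    """Split a 32-bit word back into its 32 // n lanes."""
    mask = (1 << n) - 1
    return [(word >> (k * n)) & mask for k in range(32 // n)]

def parallel_add(wa, wb):
    """Field addition of all lanes with one XOR."""
    return wa ^ wb
```

With n = 8 this handles 4 elements per word, and with n = 4 it handles 8, as in the slide.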
Performance Gain of Parallelization
[Figure: clock cycles, sequential vs. parallel, for d = 3, 5, 10. Panels: n = 8 (4 elements) and n = 4 (8 elements). Series: ISW-HT, ISW-FT, ISW-EL, CPRR.]
• Asymptotic ratio: CPRR 54%.
• Asymptotic ratio: ISW 42%.
Polynomial Decomposition
$S(x) = \sum_i q_i(x) \star p_i(x)$
• $q_i$: random linear combinations from a basis $B$
• Find $p_i$ by solving a linear system
• CRV vs AD:
  ◮ CRV [CRV14]: $\star$ = GF-multiplication → ISW multiplication
  ◮ AD [CPRR15]: $\star$ = composition → CPRR evaluation
CRV Improvement
• Use CPRR for the basis computation
• Example for n = 8:

| This paper (6 CPRR) | CRV (5 ISW) |
|---|---|
| $x^3 = x^3$ | $x^3 = x \cdot x^2$ |
| $x^9 = (x^3)^3$ | $x^7 = x \cdot (x^3)^2$ |
| $x^5 = x^5$ | $x^{29} = x \cdot (x^7)^4$ |
| $x^{25} = (x^5)^5$ | $x^{87} = x^3 \cdot x^{29}$ |
| $x^{125} = (x^{25})^5$ | $x^{251} = (x^6)^{16} \cdot (x^{87})^{128}$ |
| $x^{115} = (x^{125})^5$ | |
Implementation Results
[Figure: clock cycles for d = 3, 5, 10. Panels: n = 4 (8 S-boxes in parallel) and n = 8 (4 S-boxes in parallel). Series: algebraic decomposition, CRV-FT, CRV-HT, CRV-EL.]
Polynomial Methods for AES
• Based on the specific algebraic structure of the AES: $S(x) = \mathrm{Aff}(x^{254})$
• RP10 method: 4 ISW mult
  ◮ Security flaw due to refreshing
  ◮ Patch [CPRR13]: 1 CPRR + 3 ISW
  ◮ Improvement [GPS14]: 3 CPRR + 1 ISW
• KHL11 method: 5 ISW mult on GF(16)
  ◮ Patch [this paper]: 1 CPRR + 4 ISW
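The count of 4 multiplications for $x^{254}$ can be checked with an unmasked Python sketch. The chain below (x^2, x^3, x^12, x^15, x^240, x^252, x^254) is one known 4-multiplication chain and is assumed here to correspond to the RP10 approach; squarings are linear over GF(2) and therefore cheap to mask. The modulus 0x11B and the helper names are assumptions for the example.

```python
def gf_mul(a, b, poly=0x11B, n=8):
    """Shift-and-add multiplication in GF(2^8) (AES modulus assumed)."""
    r = 0
    for _ in range(n):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

def pow2k(x, k):
    """x^(2^k) by k squarings (a GF(2)-linear map)."""
    for _ in range(k):
        x = gf_mul(x, x)
    return x

def exp254(x):
    """x^254 using 4 non-linear multiplications."""
    x2 = pow2k(x, 1)            # linear
    x3 = gf_mul(x2, x)          # multiplication 1
    x12 = pow2k(x3, 2)          # linear
    x15 = gf_mul(x12, x3)       # multiplication 2
    x240 = pow2k(x15, 4)        # linear
    x252 = gf_mul(x240, x12)    # multiplication 3
    return gf_mul(x252, x2)     # multiplication 4
```

In a masked implementation, each `gf_mul` on secret data would become an ISW multiplication (or a CPRR evaluation where the step is a quadratic function of a single operand), and each squaring would be applied share by share.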
Implementation Results
[Figure: clock cycles (× 10^3) for d = 3, 5, 10, with 16 S-boxes in parallel. Series: KHL, RP-HT, RP-EL.]
• KHL < RP-*: smaller elements → higher parallelization degree
Bitslice for the AES
• S-box seen as a Boolean circuit
[Figure: input bits x_1, x_2, ..., x_n packed into words X_1, X_2, ..., X_n; circuit gates mapped to CPU XOR and CPU AND instructions]
• 16 S-boxes in parallel
Application for AES S-boxes
• Circuit for the AES S-box [BMP13]
  ◮ 83 XOR gates
  ◮ 32 AND gates
• Bitslice (16 S-boxes)
  ◮ 83 XOR instructions
  ◮ 32 AND instructions
• Masking at order d:
  ◮ 83 × d XOR instructions
  ◮ 32 ISW-AND
Improvement
• 2 16-bit ISW-AND → 1 32-bit ISW-AND
• Goal: group AND gates in pairs
• Validation on the BMP circuit
• 16 S-boxes = 16 ISW-AND → 1 ISW-AND per S-box
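The pairing of AND gates can be sketched in Python. A bitsliced ISW-AND is the ISW scheme with the field multiplication replaced by a bitwise AND on whole words, so two independent 16-bit ANDs fit in the two halves of one 32-bit word. The helper names and the use of `secrets` are assumptions for this illustration.

```python
import secrets
from functools import reduce
from operator import xor

def share(x, d, width):
    """Split a width-bit word into d shares whose XOR equals x."""
    s = [secrets.randbits(width) for _ in range(d - 1)]
    return s + [reduce(xor, s, x)]

def unshare(shares):
    return reduce(xor, shares, 0)

def isw_and(a, b, width):
    """ISW-AND on words: XOR(result) = XOR(a) & XOR(b), bit by bit."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(width)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c

def pack_pair(lo, hi):
    """Share-wise packing of two 16-bit sharings into one 32-bit sharing."""
    return [(h << 16) | l for l, h in zip(lo, hi)]

def paired_isw_and(a1, b1, a2, b2):
    """Compute a1 & b1 and a2 & b2 with a single 32-bit ISW-AND."""
    c = isw_and(pack_pair(a1, a2), pack_pair(b1, b2), 32)
    return [w & 0xFFFF for w in c], [w >> 16 for w in c]
```

Because AND and XOR act independently on every bit position, the low and high halves of the packed result are valid sharings of the two separate products.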
Performance Comparison of ISW
[Figure: clock cycles for d = 3, 5, 10. Series: ISW-AND (32 parallel ANDs), ISW-FT (8 parallel GF(16) multiplications), ISW-HT (4 parallel GF(256) multiplications).]