Gimli: A Cross-Platform Permutation
Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, Peter Schwabe, François-Xavier Standaert, Yosuke Todo, Benoît Viguier
Advances in permutation-based cryptography, Milan, October 10, 2018
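What is a Permutation?
Definition: A permutation is a keyless block cipher.
Figure: Even-Mansour construction. The message M is XORed with key k0, passed through the permutation f, and XORed with key k1 to give the ciphertext C.
Figure: Sponge construction. The state is split into r bits and c bits. Absorbing phase: message blocks m0, m1, m2 are XORed into the r-bit part, with f applied between blocks. Squeezing phase: output blocks z are read from the r-bit part, with f applied between blocks.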
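The Even-Mansour figure turns a keyless permutation into a keyed cipher as C = f(M ⊕ k0) ⊕ k1. The following is a minimal sketch of that idea only. The 32-bit permutation `toy_f`, its inverse `toy_f_inv`, and the names `em_encrypt` and `em_decrypt` are hypothetical, chosen because each step is easy to invert; this toy is not secure and is not Gimli.

```c
#include <stdint.h>

/* Hypothetical toy permutation on 32-bit words (illustration only, not secure).
   Every step is invertible, so the composition is a permutation. */
uint32_t toy_f(uint32_t x)
{
    x ^= x >> 16;               /* xorshift by 16: its own inverse on 32 bits */
    x = (x << 7) | (x >> 25);   /* rotate left by 7 */
    x += 0x9e3779b9u;           /* add a constant modulo 2^32 */
    return x;
}

/* Inverse of toy_f: undo the steps in reverse order. */
uint32_t toy_f_inv(uint32_t x)
{
    x -= 0x9e3779b9u;
    x = (x >> 7) | (x << 25);   /* rotate right by 7 */
    x ^= x >> 16;
    return x;
}

/* Even-Mansour: C = f(M xor k0) xor k1 */
uint32_t em_encrypt(uint32_t m, uint32_t k0, uint32_t k1)
{
    return toy_f(m ^ k0) ^ k1;
}

/* Decryption needs the inverse permutation: M = f^-1(C xor k1) xor k0 */
uint32_t em_decrypt(uint32_t c, uint32_t k0, uint32_t k1)
{
    return toy_f_inv(c ^ k1) ^ k0;
}
```

Decrypting with the same key pair returns the original message, for example `em_decrypt(em_encrypt(m, k0, k1), k0, k1) == m`.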
Why Gimli?
Currently we have:
Permutation | Width in bits | Benefits
AES | 128 | very fast if the instruction is available
Chaskey | 128 | lightning fast on Cortex-M0/M3/M4
Keccak-f | 200, 400, 800, 1600 | low-cost masking
Salsa20, ChaCha20 | 512 | very fast on CPUs with vector units
Why Gimli?
Currently we have:
Permutation | Hindrance
AES | not that fast without hardware support
Chaskey | low security margin, slow with side-channel protection
Keccak-f | huge state (800, 1600)
Salsa20, ChaCha20 | horrible in hardware
Can we have a permutation that is neither too big nor too small, and good in all these areas?
Yes!
[Image. Source: Wikipedia, Fair Use]
What is Gimli?
Gimli is:
◮ a 384-bit permutation (just the right size)
  • Sponge with c = 256, r = 128 ⇒ 128 bits of security
  • Cortex-M3/M4: full state in registers
  • AVR, Cortex-M0: 192 bits (half state) fit in registers
◮ with high cross-platform performance
◮ designed for:
  • energy-efficient hardware
  • side-channel-protected hardware
  • microcontrollers
  • compactness
  • vectorization
  • short messages
  • high security level
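To make the split r = 128, c = 256 concrete, here is a rough sketch of a sponge over the 384-bit state: the first 16 bytes are the rate, the remaining 32 bytes are the capacity and are never touched directly. The permutation follows the C code shown in this deck. Everything else is an assumption made for illustration: the names `gimli_permute`, `xor_byte`, and `toy_sponge_hash`, the little-endian byte order, and the simple 0x01 ... 0x80 padding. It should not be read as the specification of any Gimli-based hash.

```c
#include <stddef.h>
#include <stdint.h>

#define RATE_BYTES 16  /* r = 128 bits; the other 32 bytes are the capacity */

static uint32_t rotl32(uint32_t x, unsigned n)  /* n must be in 1..31 */
{
    return (x << n) | (x >> (32 - n));
}

/* The 24-round permutation, following the C code in this deck. */
void gimli_permute(uint32_t *state)
{
    uint32_t round, column, x, y, z;
    for (round = 24; round > 0; --round) {
        for (column = 0; column < 4; ++column) {
            x = rotl32(state[column], 24);
            y = rotl32(state[4 + column], 9);
            z = state[8 + column];
            state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
            state[4 + column] = y ^ x ^ ((x | z) << 1);
            state[column]     = z ^ y ^ ((x & y) << 3);
        }
        if ((round & 3) == 0) {  /* small swap */
            x = state[0]; state[0] = state[1]; state[1] = x;
            x = state[2]; state[2] = state[3]; state[3] = x;
        }
        if ((round & 3) == 2) {  /* big swap */
            x = state[0]; state[0] = state[2]; state[2] = x;
            x = state[1]; state[1] = state[3]; state[3] = x;
        }
        if ((round & 3) == 0)    /* round constant */
            state[0] ^= (0x9e377900u | round);
    }
}

/* XOR one byte into the state (assumed little-endian within each word). */
static void xor_byte(uint32_t *s, size_t pos, uint8_t b)
{
    s[pos / 4] ^= (uint32_t)b << (8 * (pos % 4));
}

/* Hypothetical sponge: absorb msg 16 bytes at a time, squeeze 32 bytes. */
void toy_sponge_hash(const uint8_t *msg, size_t len, uint8_t out[32])
{
    uint32_t s[12] = {0};
    size_t i, pos = 0;

    for (i = 0; i < len; ++i) {          /* absorbing phase */
        xor_byte(s, pos++, msg[i]);
        if (pos == RATE_BYTES) { gimli_permute(s); pos = 0; }
    }
    xor_byte(s, pos, 0x01);              /* assumed padding: 0x01 ... 0x80 */
    xor_byte(s, RATE_BYTES - 1, 0x80);
    gimli_permute(s);

    for (i = 0; i < 32; ++i) {           /* squeezing phase: two rate blocks */
        if (i == RATE_BYTES) gimli_permute(s);
        out[i] = (uint8_t)(s[(i % RATE_BYTES) / 4] >> (8 * (i % 4)));
    }
}
```

Only words 0 to 3 of the state are ever XORed with input or copied to output; words 4 to 11 form the 256-bit capacity that the slide's 128-bit security claim refers to.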
Specifications: State
Figure: State representation (row index i, column index j)
384 bits represented as:
◮ a parallelepiped with dimensions 3 × 4 × 32 (Keccak-like)
◮ or, as a 3 × 4 matrix of 32-bit words.
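The two views of the state can be sketched as index arithmetic. This assumes the row-by-row layout implied by the C code in this deck, where `state[column]`, `state[4 + column]`, and `state[8 + column]` are the three words of one column. The helper names `gimli_word` and `gimli_bit` are hypothetical.

```c
#include <stdint.h>

enum { GIMLI_ROWS = 3, GIMLI_COLS = 4, GIMLI_WORD_BITS = 32 };

/* 3 x 4 matrix view: pointer to the word in row i (0..2), column j (0..3). */
uint32_t *gimli_word(uint32_t state[12], int i, int j)
{
    return &state[GIMLI_COLS * i + j];
}

/* 3 x 4 x 32 view: bit k (0..31) of the word in row i, column j. */
int gimli_bit(const uint32_t state[12], int i, int j, int k)
{
    return (int)((state[GIMLI_COLS * i + j] >> k) & 1u);
}
```

With this layout, row 2 of column 3 is `state[11]`, the last word of the array.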
Specifications: Non-linear layer
In parallel (rotations):
  x ← x ⋘ 24
  y ← y ⋘ 9
In parallel (shifts, all using the rotated x and y):
  x ← x ⊕ (z ≪ 1) ⊕ ((y ∧ z) ≪ 2)
  y ← y ⊕ x ⊕ ((x ∨ z) ≪ 1)
  z ← z ⊕ y ⊕ ((x ∧ y) ≪ 3)
In parallel (swap):
  x ← z
  z ← x
Figure: The bit-sliced 9-to-3-bit SP-box applied to a column
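The three "in parallel" steps can be written out for a single column as follows. This is a sketch with hypothetical names (`gimli_spbox`, `rotl32`); the temporaries make the parallel semantics explicit, since every new value is computed from the same rotated inputs before anything is overwritten.

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31 (only 24 and 9 are used here). */
uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Apply the SP-box to one column (x = row 0, y = row 1, z = row 2). */
void gimli_spbox(uint32_t *x, uint32_t *y, uint32_t *z)
{
    /* Step 1: rotations */
    uint32_t a = rotl32(*x, 24);
    uint32_t b = rotl32(*y, 9);
    uint32_t c = *z;

    /* Step 2: three updates, all reading the same a, b, c */
    uint32_t nx = a ^ (c << 1) ^ ((b & c) << 2);
    uint32_t ny = b ^ a ^ ((a | c) << 1);
    uint32_t nz = c ^ b ^ ((a & b) << 3);

    /* Step 3: swap x and z */
    *x = nz;
    *y = ny;
    *z = nx;
}
```

For instance, the column (x, y, z) = (0, 0, 1) maps to (1, 2, 2): both rotations leave zero unchanged, so the new x is z ≪ 1 = 2, the new y is (x ∨ z) ≪ 1 = 2, the new z stays 1, and the final swap exchanges x and z.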
Specifications: Linear layer
Figure: The linear layer (Small Swap, Big Swap)
Figure: Constant addition 0x9e3779??
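As a sketch, the linear layer only touches the four words of the first row, and the constant only touches the first word. The function names below are hypothetical; the swap patterns and the constant follow the C code in this deck, where the `??` byte of 0x9e3779?? is the round number.

```c
#include <stdint.h>

/* Small swap: exchange words 0 <-> 1 and 2 <-> 3 of the first row. */
void small_swap(uint32_t s[12])
{
    uint32_t t;
    t = s[0]; s[0] = s[1]; s[1] = t;
    t = s[2]; s[2] = s[3]; s[3] = t;
}

/* Big swap: exchange words 0 <-> 2 and 1 <-> 3 of the first row. */
void big_swap(uint32_t s[12])
{
    uint32_t t;
    t = s[0]; s[0] = s[2]; s[2] = t;
    t = s[1]; s[1] = s[3]; s[3] = t;
}

/* Constant addition: XOR 0x9e3779?? into word 0, with ?? = round number. */
void add_round_constant(uint32_t s[12], uint32_t round)
{
    s[0] ^= (0x9e377900u | round);
}
```

Starting from a first row (0, 1, 2, 3), the small swap gives (1, 0, 3, 2) and the big swap gives (2, 3, 0, 1). In round 24 the constant is 0x9e377918.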
Gimli in C
    #include <stdint.h>

    /* Rotate left; bits must be in 1..31 (only 24 and 9 are used). */
    static uint32_t rotate(uint32_t x, int bits)
    {
        return (x << bits) | (x >> (32 - bits));
    }

    extern void Gimli(uint32_t *state)
    {
        uint32_t round, column, x, y, z;

        for (round = 24; round > 0; --round) {
            /* non-linear layer: SP-box on each of the 4 columns */
            for (column = 0; column < 4; ++column) {
                x = rotate(state[    column], 24);  // x <<< 24
                y = rotate(state[4 + column],  9);  // y <<< 9
                z =        state[8 + column];

                state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
                state[4 + column] = y ^ x        ^ ((x | z) << 1);
                state[column]     = z ^ y        ^ ((x & y) << 3);
            }

            if ((round & 3) == 0) {  // small swap: pattern s...s...s... etc.
                x = state[0]; state[0] = state[1]; state[1] = x;
                x = state[2]; state[2] = state[3]; state[3] = x;
            }
            if ((round & 3) == 2) {  // big swap: pattern ..S...S...S. etc.
                x = state[0]; state[0] = state[2]; state[2] = x;
                x = state[1]; state[1] = state[3]; state[3] = x;
            }

            if ((round & 3) == 0) {  // add constant: pattern c...c...c... etc.
                state[0] ^= (0x9e377900 | round);
            }
        }
    }
Specifications: Rounds
Round 24: non-linear layer; small swap & round constant addition
Round 23: non-linear layer
Round 22: non-linear layer; big swap
Round 21: non-linear layer
Round 20: non-linear layer; small swap & round constant addition
Round 19: non-linear layer
Round 18: non-linear layer; big swap
...
Figure: The first 7 rounds of Gimli
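The round pattern can be derived from the round number alone, using the same `round & 3` tests as the C code in this deck. The sketch below returns the extra operations of a round as a bitmask; the enum and the function name `gimli_round_ops` are hypothetical.

```c
#include <stdint.h>

enum {
    OP_SMALL_SWAP = 1,
    OP_BIG_SWAP   = 2,
    OP_CONSTANT   = 4
};

/* Operations that follow the non-linear layer in a given round (24 down to 1). */
unsigned gimli_round_ops(unsigned round)
{
    unsigned ops = 0;
    if ((round & 3) == 0)
        ops |= OP_SMALL_SWAP | OP_CONSTANT;  /* rounds 24, 20, 16, ... */
    if ((round & 3) == 2)
        ops |= OP_BIG_SWAP;                  /* rounds 22, 18, 14, ... */
    return ops;
}
```

Rounds with an odd number get no linear layer at all, so over 24 rounds there are 6 small swaps (each with a constant addition) and 6 big swaps.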
Unrolled AVR & Cortex-M0
1. SP-box col. 0
2. SP-box col. 1
   swap word s0,0 and s0,1
3. SP-box col. 1
4. SP-box col. 1
5. SP-box col. 0
6. SP-box col. 0
   store columns 0,1; load columns 2,3
7. SP-box col. 2
8. SP-box col. 3
   swap word s0,2 and s0,3
9. SP-box col. 3
10. SP-box col. 3
11. SP-box col. 2
12. SP-box col. 2
   push word s0,2, s0,3; load word s0,0, s0,1
13. SP-box col. 2
14. SP-box col. 2
15. SP-box col. 3
16. SP-box col. 3
   swap word s0,2 and s0,3
17. SP-box col. 3
18. SP-box col. 3
19. SP-box col. 2
20. SP-box col. 2
   store columns 2,3; load columns 0,1
...
Step at which each SP-box is computed:
Round | col. 0 | col. 1 | col. 2 | col. 3
24 | 1 | 2 | 7 | 8
23 | 5 | 3 | 11 | 9
22 | 6 | 4 | 12 | 10
21 | 21 | 23 | 13 | 15
20 | 22 | 24 | 14 | 16
19 | 27 | 25 | 19 | 17
18 | 28 | 26 | 20 | 18
...
Figure: Computation order on AVR & Cortex-M0
Implementation in Assembly
# Rotate
  x ← x ⋘ 24
  y ← y ⋘ 9
  u ← x
# Compute x
  v ← z ≪ 1
  x ← z ∧ y
  x ← x ≪ 2
  x ← u ⊕ x
  x ← x ⊕ v
# Compute y
  v ← y
  y ← u ∨ z
  y ← y ≪ 1
  y ← u ⊕ y
  y ← y ⊕ v
# Compute z
  u ← u ∧ v
  u ← u ≪ 3
  z ← z ⊕ v
  z ← z ⊕ u
The SP-box requires only 2 additional registers u and v.
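The instruction sequence can be checked by transcribing it into C, one statement per instruction, and comparing it with the SP-box formulas. The sketch below does that; the function names are hypothetical, and the final exchange of x and z (not listed on the slide) is done when writing the results back.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)  /* n in 1..31 */
{
    return (x << n) | (x >> (32 - n));
}

/* Reference SP-box, written directly from the formulas. */
void spbox_ref(uint32_t *px, uint32_t *py, uint32_t *pz)
{
    uint32_t x = rotl32(*px, 24), y = rotl32(*py, 9), z = *pz;
    *pz = x ^ (z << 1) ^ ((y & z) << 2);
    *py = y ^ x ^ ((x | z) << 1);
    *px = z ^ y ^ ((x & y) << 3);
}

/* Slide sequence: three state registers plus the temporaries u and v. */
void spbox_two_temps(uint32_t *px, uint32_t *py, uint32_t *pz)
{
    uint32_t x = *px, y = *py, z = *pz, u, v;

    /* Rotate */
    x = rotl32(x, 24);
    y = rotl32(y, 9);
    u = x;              /* keep the rotated x */
    /* Compute x */
    v = z << 1;
    x = z & y;
    x = x << 2;
    x = u ^ x;
    x = x ^ v;
    /* Compute y */
    v = y;              /* keep the rotated y */
    y = u | z;
    y = y << 1;
    y = u ^ y;
    y = y ^ v;
    /* Compute z */
    u = u & v;
    u = u << 3;
    z = z ^ v;
    z = z ^ u;

    /* final swap of x and z */
    *px = z;
    *py = y;
    *pz = x;
}
```

Both functions agree on any input; for example, (0, 0, 1) maps to (1, 2, 2) under either.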
Rotate for free on Cortex-M3/M4
# Rotate
  x ← x ⋘ 24
  u ← x
# Compute x
  v ← z ≪ 1
  x ← z ∧ (y ⋘ 9)
  x ← x ≪ 2
  x ← u ⊕ x
  x ← x ⊕ v
# Compute y
  v ← y
  y ← u ∨ z
  y ← y ≪ 1
  y ← u ⊕ y
  y ← y ⊕ (v ⋘ 9)
# Compute z
  u ← u ∧ (v ⋘ 9)
  u ← u ≪ 3
  z ← z ⊕ (v ⋘ 9)
  z ← z ⊕ u
Remove y ⋘ 9.
Shift for free on Cortex-M3/M4
# Rotate
  x ← x ⋘ 24
  u ← x
# Compute x
  x ← z ∧ (y ⋘ 9)
  x ← u ⊕ (x ≪ 2)
  x ← x ⊕ (z ≪ 1)
# Compute y
  v ← y
  y ← u ∨ z
  y ← u ⊕ (y ≪ 1)
  y ← y ⊕ (v ⋘ 9)
# Compute z
  u ← u ∧ (v ⋘ 9)
  z ← z ⊕ (v ⋘ 9)
  z ← z ⊕ (u ≪ 3)
Get rid of the other shifts.
Free mov on Cortex-M3/M4
# Rotate
  x ← x ⋘ 24
# Compute x
  u ← z ∧ (y ⋘ 9)
  u ← x ⊕ (u ≪ 2)
  u ← u ⊕ (z ≪ 1)
# Compute y
  v ← y
  y ← x ∨ z
  y ← x ⊕ (y ≪ 1)
  y ← y ⊕ (v ⋘ 9)
# Compute z
  x ← x ∧ (v ⋘ 9)
  z ← z ⊕ (v ⋘ 9)
  z ← z ⊕ (x ≪ 3)
Remove the last mov:
  u contains the new value of x
  y contains the new value of y
  z contains the new value of z
Free mov on Cortex-M3/M4
# Rotate
  x ← x ⋘ 24
# Compute x
  u ← z ∧ (y ⋘ 9)
  u ← x ⊕ (u ≪ 2)
  u ← u ⊕ (z ≪ 1)
# Compute y
  v ← x ∨ z
  v ← x ⊕ (v ≪ 1)
  v ← v ⊕ (y ⋘ 9)
# Compute z
  x ← x ∧ (y ⋘ 9)
  z ← z ⊕ (y ⋘ 9)
  z ← z ⊕ (x ≪ 3)
Remove the last mov:
  u contains the new value of x
  v contains the new value of y
  z contains the new value of z
Free swap on Cortex-M3/M4
# Rotate
  x ← x ⋘ 24
# Compute x
  u ← z ∧ (y ⋘ 9)
  u ← x ⊕ (u ≪ 2)
  u ← u ⊕ (z ≪ 1)
# Compute y
  v ← x ∨ z
  v ← x ⊕ (v ≪ 1)
  v ← v ⊕ (y ⋘ 9)
# Compute z
  x ← x ∧ (y ⋘ 9)
  z ← z ⊕ (y ⋘ 9)
  z ← z ⊕ (x ≪ 3)
Swap x and z:
  u contains the new value of z
  v contains the new value of y
  z contains the new value of x
SP-box requires a total of 10 instructions.
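The final 10-instruction sequence can be modelled in C to confirm that it still computes the SP-box, including the swap of x and z. In this sketch each numbered statement stands for one instruction whose second operand is shifted or rotated by the barrel shifter. The helper variable `y9` only models that rotated operand and is not an extra instruction; the function names are hypothetical.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)  /* n in 1..31 */
{
    return (x << n) | (x >> (32 - n));
}

/* Reference SP-box, written directly from the formulas. */
void spbox_ref(uint32_t *px, uint32_t *py, uint32_t *pz)
{
    uint32_t x = rotl32(*px, 24), y = rotl32(*py, 9), z = *pz;
    *pz = x ^ (z << 1) ^ ((y & z) << 2);
    *py = y ^ x ^ ((x | z) << 1);
    *px = z ^ y ^ ((x & y) << 3);
}

/* Model of the 10-instruction sequence; y itself is never rotated in place. */
void spbox_10(uint32_t *px, uint32_t *py, uint32_t *pz)
{
    uint32_t x = *px, y = *py, z = *pz, u, v;
    uint32_t y9 = rotl32(y, 9);  /* stands for the operand (y <<< 9) */

    x = rotl32(x, 24);           /*  1 */
    u = z & y9;                  /*  2 */
    u = x ^ (u << 2);            /*  3 */
    u = u ^ (z << 1);            /*  4: u holds the new z */
    v = x | z;                   /*  5 */
    v = x ^ (v << 1);            /*  6 */
    v = v ^ y9;                  /*  7: v holds the new y */
    x = x & y9;                  /*  8 */
    z = z ^ y9;                  /*  9 */
    z = z ^ (x << 3);            /* 10: z holds the new x */

    /* Renaming registers replaces the swap: no instruction is spent on it. */
    *px = z;
    *py = v;
    *pz = u;
}
```

The write-back at the end corresponds to register renaming in unrolled assembly: the next SP-box simply reads its x, y, z from z, v, u.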
How fast is Gimli? (Software)
Figure: bar charts of cycles/byte (lower is better) comparing Chaskey, Gimli, Salsa20, ChaCha20, AES-128, NORX-32-4-1, Keccak-f[400,12], and Keccak-f[800,12].
Bar values per platform, in chart order:
AVR ATmega: 413 (small), 216, 213 (fast), 171 (small), 151 (fast)
Cortex-M0: 49, 40, 9.8
Cortex-M3/M4: 63, 34, 21, 13, 7
Cortex-A8: 19.3 (x blocks), 16.9 (1 block), 8.73 (1 block), 6.25 (x blocks), 5.48 (x blocks)
Intel Haswell: 6.76 (1 block), 4.46 (1 block), 2.84 (1 block), 2.33 (2 blocks), 1.77 (4 blocks), 1.38 (8 blocks), 1.2 (8 blocks), 0.85 (x blocks)