SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange. Hwajeong Seo (Hansung University), Zhe Liu (Nanjing University of Aeronautics and Astronautics), Patrick Longa (Microsoft Research), Zhi Hu (Central South University)
Outline
• Short Overview
  • Post-quantum supersingular isogeny Diffie-Hellman (SIDH) key exchange
  • Supersingular isogeny key encapsulation (SIKE) protocol
• Our implementation
  • Optimized implementations for 32-bit ARMv7
  • Optimized implementations for 64-bit ARMv8
• Implementation results
• Conclusion
Post-Quantum Cryptography (Isogeny)
• RSA and ECC: integer factorization and ECDLP
  • These hard problems can be solved by Shor’s algorithm on a quantum computer.
• Quantum-Resistant Cryptography
  • NIST launched the post-quantum cryptography standardization project. “The goal of this process is to select a number of acceptable candidate cryptosystems for standardization.”
  • Code, Lattice, Hash, Multivariate, Isogeny …
• Isogeny-based cryptography: (conjectured to be) hard for quantum computers
  • Supersingular isogeny Diffie-Hellman (SIDH) key exchange was proposed by Jao and De Feo in 2011.
  • Among all the submitted post-quantum candidates, SIDH uses the smallest keys.
Mobile Platform (32-bit/64-bit ARM)
Platform             | ARM Cortex-A15 | ARM Cortex-A53 | ARM Cortex-A72
Architecture         | 32-bit ARMv7   | 64-bit ARMv8   | 64-bit ARMv8
Frequency            | 2.0 GHz        | 1.512 GHz      | 1.992 GHz
No. registers        | 15             | 31             | 31
No. registers (NEON) | 16             | 32             | 32
Application: wearable devices, smartphones
Previous Works
• Hardware implementation
  • FPGA: Koziel et al. [INDOCRYPT’16, TCAS’17]
• Software implementation
  • 64-bit Intel processor: Costello et al. [CRYPTO’16, EUROCRYPT’17], Faz-Hernández et al. [ToC’17], Zanon et al. [PQCrypto’18]
  • 64-bit ARM processor: Jalali et al. [SAC’17], this work [CHES’18]
  • 32-bit ARM processor: Koziel et al. [CANS’16], this work [CHES’18]
Motivation
Type         | Algorithm     | Advantage                              | Disadvantage
Code         | McEliece      | Fast computation                       | Long key size
Hash         | XMSS, SPHINCS | Security proof                         | Long signature size
Lattice      | (ring)-LWE    | Fast computation                       | Difficulty of parameter selection
Multivariate | UOV, Rainbow  | Short signature size, fast computation | Long key size
Isogeny      | SIDH, SIKE    | Short key size                         | Slow computation
• All PQC candidates have their own pros and cons.
• The disadvantage of SIDH/SIKE is slow computation.
• In this talk, we address this problem on 32-bit and 64-bit ARM processors.
Contribution
• Unified ARM/NEON multiplication: instruction-level parallelism
• New Montgomery reduction: “UMAAL” + “hybrid-scanning”
• Efficient implementation of SIDH:
  • p503 (88 msec) / p751 (292 msec) on 32-bit ARMv7-A @ 2.0 GHz
  • p503 (45 msec) on 64-bit ARMv8-A @ 1.992 GHz
Post-quantum key exchange algorithm
• Supersingular Isogeny Diffie-Hellman (SIDH)
  • Shared key generation between two parties over an insecure communication channel.
  • SIDH works with the set of supersingular elliptic curves over $\mathbb{F}_{p^2}$ and their isogenies.
$E_{AB} = \Phi'_A(\Phi_B(E_0)) \cong E_0 / \langle P_A + s_A Q_A,\ P_B + s_B Q_B \rangle \cong E_{BA} = \Phi'_B(\Phi_A(E_0))$
Supersingular Isogeny Key Encapsulation (SIKE)
• SIDH is not secure when keys are reused (Galbraith-Petit-Shani-Ti 2016)
• SIKE: (Costello – De Feo – Jao – Longa – Naehrig – Renes 2017)
  • IND-CCA secure key encapsulation based on SIDH.
  • Uses a variant of the Hofheinz – Hövelmanns – Kiltz (HHK) transform: IND-CPA PKE → IND-CCA KEM
• For a starting curve $E_0 / \mathbb{F}_{p^2} : y^2 = x^3 + x$, where $p = 2^{e_A} 3^{e_B} - 1$
Scheme (SIKEp + $\lceil \log_2 p \rceil$) | $(e_A, e_B)$ | Classical sec. | Quantum sec. | Security level
SIKEp503 | (250, 159) | 126 bits | 84 bits  | AES-128 (NIST level 1)
SIKEp751 | (372, 239) | 188 bits | 125 bits | AES-192 (NIST level 3)
SIKEp964 | (486, 301) | 241 bits | 161 bits | AES-256 (NIST level 5)
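As a quick check on the parameter table, the primes can be rebuilt from the exponent pairs and their bit lengths compared with the scheme names. This is a minimal sketch; the names `PARAMS`, `sike_prime`, and `bit_lengths` are illustrative and not taken from the talk or from any SIKE library.

```python
# Sketch: rebuild p = 2^eA * 3^eB - 1 from the exponents listed on the slide.
# All names here are hypothetical helpers for illustration only.
PARAMS = {
    "SIKEp503": (250, 159),
    "SIKEp751": (372, 239),
    "SIKEp964": (486, 301),
}


def sike_prime(e_a: int, e_b: int) -> int:
    """Return p = 2^e_a * 3^e_b - 1."""
    return (1 << e_a) * 3**e_b - 1


def bit_lengths() -> dict:
    """Map each scheme name to the bit length of its prime."""
    return {name: sike_prime(*exps).bit_length() for name, exps in PARAMS.items()}


if __name__ == "__main__":
    for name, bits in bit_lengths().items():
        print(name, bits)
```

The bit length of each prime matches the number in the scheme name, which is how the "SIKEp + bit length" naming on the slide reads.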
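Multiplication Instruction (32-bit ARMv7)
[Figure: ARM vs. NEON multiplication. ARM (UMAAL): 32-bit registers R0 = a0, R1 = b0, R2 = c0, R3 = d0; the 64-bit result a0b0 + c0 + d0 is written to (R3, R2). NEON (UMULL): 32-bit lanes V0 = (a3, a2, a1, a0) and V1 = (b3, b2, b1, b0); the 64-bit products a1b0 and a0b0 are written to V2.]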
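The arithmetic of the two ARM instructions can be modeled in a few lines. This is a behavioral sketch only, not ARM code; the function names and argument order are illustrative assumptions. It shows why UMAAL is convenient for multiprecision work: a 32x32-bit product plus two 32-bit addends always fits in 64 bits, so no carry flag is needed.

```python
MASK32 = 0xFFFFFFFF


def umull(a: int, b: int) -> tuple:
    """Model of UMULL: unsigned 32x32 -> 64-bit product, returned as (hi, lo) words."""
    p = (a & MASK32) * (b & MASK32)
    return p >> 32, p & MASK32


def umaal(lo: int, hi: int, a: int, b: int) -> tuple:
    """Model of UMAAL: (hi:lo) = a*b + lo + hi, returned as (hi, lo) words."""
    t = (a & MASK32) * (b & MASK32) + (lo & MASK32) + (hi & MASK32)
    # Worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so t never overflows 64 bits.
    assert t < 1 << 64
    return t >> 32, t & MASK32


if __name__ == "__main__":
    print([hex(x) for x in umaal(MASK32, MASK32, MASK32, MASK32)])
```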
Previous Multiprecision Multiplication (32-bit ARMv7)
[Figure: partial-product diagrams (A[0]B[0] to A[7]B[7], results C[0] to C[14]) for Consecutive Operand Caching (COC) on ARM and Cascade Operand Scanning (COS) on NEON.]
Bitlength | Method | Instruction  | Timings [cc]
256-bit   | COC    | ARM (UMAAL)  | 158 (best)
256-bit   | COS    | NEON (UMULL) | 188
512-bit   | COC    | ARM (UMAAL)  | 596 (best)
512-bit   | COS    | NEON (UMULL) | 632
Target processor: 32-bit ARM Cortex-A15
Proposed Multiprecision Multiplication (32-bit ARMv7)
• Instruction-level parallelism
  • ARM and NEON instructions are issued together
• Karatsuba multiplication: $m$-bit multiplication
  $(A_H \cdot B_H \cdot 2^{m} + (A_H \cdot B_H + A_L \cdot B_L - (A_H - A_L) \cdot (B_H - B_L)) \cdot 2^{m/2} + A_L \cdot B_L)$
  • Two $m/2$-bit multiplications in ARM
  • One $m/2$-bit multiplication in NEON
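The subtractive Karatsuba identity on this slide can be sketched with Python integers. This is a minimal illustration of the formula only, assuming a single recursion level; it does not model how the three half-size products are split between the ARM and NEON units, and the function name is hypothetical.

```python
def karatsuba_one_level(a: int, b: int, m: int) -> int:
    """One level of subtractive Karatsuba for m-bit operands (m even).

    Uses three (m/2)-bit products: A_H*B_H, A_L*B_L and |A_H - A_L|*|B_H - B_L|.
    """
    h = m // 2
    mask = (1 << h) - 1
    a_l, a_h = a & mask, a >> h
    b_l, b_h = b & mask, b >> h

    hi = a_h * b_h                      # A_H * B_H
    lo = a_l * b_l                      # A_L * B_L

    # The middle product is taken on absolute differences; the sign is restored after.
    da, db = a_h - a_l, b_h - b_l
    mid = abs(da) * abs(db)
    if (da < 0) != (db < 0):
        mid = -mid

    # hi + lo - mid equals A_H*B_L + A_L*B_H, the cross term of the schoolbook product.
    return (hi << m) + ((hi + lo - mid) << h) + lo


if __name__ == "__main__":
    x = 3**300 % (1 << 512)
    y = 7**200 % (1 << 512)
    print(karatsuba_one_level(x, y, 512) == x * y)
```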
[Figure (three animation steps): ARM and NEON execution for the proposed multiplication. Labeled stages: (1) operand subtraction and operand passing between ARM and NEON, (2)-(4) partial products between A[0]B[0] and A[7]B[7] producing C[0] to C[14], result passing, (5) result accumulation.]
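Proposed Multiprecision Multiplication (32-bit ARMv7)
Bitlength | Method    | Instruction | Timings [cc]
512-bit   | GMP-6.1.2 | ARM         | 1,138
512-bit   | COC       | ARM         | 596
512-bit   | COS       | NEON        | 632
512-bit   | This work | ARM/NEON    | 470
768-bit   | GMP-6.1.2 | ARM         | 2,408
768-bit   | This work | ARM/NEON    | 912
Speedups marked on the slide: 1.26x for 512-bit (596 vs. 470) and 2.64x for 768-bit (2,408 vs. 912).
Target processor: 32-bit ARM Cortex-A15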
Proposed Modular Reduction (32-bit ARMv7)
• $m$-bit modular reduction using Montgomery reduction
  • Two $m/2$-bit multiplications in ARM
  • Two $m/2$-bit multiplications in NEON
[Figure (three animation steps): ARM and NEON execution for the proposed Montgomery reduction. Labeled stages: operand passing, (1)-(4) partial products between Q[0]M[0] and Q[7]M[7] over T[0] to T[14], result passing, (5) result accumulation.]
Modular Reduction for SIDH
• Efficient Montgomery reduction: Montgomery-friendly modulus
  • The lowest word of the modulus is $2^w - 1$, so the Montgomery constant is equal to 1.
  • Multiplications with an all-ones word ($T \times \mathtt{0xFFFFFFFF} = T \times 2^{32} - T$): shifts and subtractions
  • (e.g., $p503 = 2^{250} \cdot 3^{159} - 1$)
    0x4066F541811E1E6045C6BDDA77A4D01B9BF6C87B7E7DAF13085BDA2211E7A0ABFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF (in hexadecimal)
• Working with $M + 1$ instead of the modulus $M$ turns the lower part of the modulus into all-zero words
  • (e.g., $p503 + 1 = 2^{250} \cdot 3^{159}$)
    0x4066F541811E1E6045C6BDDA77A4D01B9BF6C87B7E7DAF13085BDA2211E7A0AC00000000000000000000000000000000000000000000000000000000000000 (in hexadecimal)
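The two simplifications on this slide can be sketched as a word-serial Montgomery reduction in Python. This is an illustration under stated assumptions (32-bit words, a 512-bit container of 16 words, p503 as the modulus); the names are hypothetical, and it is not the hybrid-scanning ARM/NEON routine from the talk. Because the lowest word of p is all ones, the Montgomery constant is 1 and each quotient word is just the current low word; because p + 1 is a multiple of 2^250, adding q·p reduces to a shift plus a multiplication by the nonzero upper part of p + 1.

```python
W = 32                                   # word size in bits (assumed)
N_WORDS = 16                             # 512-bit container as 16 words (assumed)
P503 = (1 << 250) * 3**159 - 1           # modulus from the slide
P503_PLUS1_HI = (P503 + 1) >> W          # (p + 1) / 2^W, exact since 2^250 divides p + 1


def mul_all_ones(t: int) -> int:
    """T * 0xFFFFFFFF computed as a shift and a subtraction."""
    return (t << 32) - t


def mont_reduce(t: int) -> int:
    """Return t * 2^(-W*N_WORDS) mod p503 for 0 <= t < p503 * 2^(W*N_WORDS)."""
    mask = (1 << W) - 1
    for _ in range(N_WORDS):
        q = t & mask                     # q = t * m' mod 2^W with m' = 1
        # (t + q*p) / 2^W == (t - q) / 2^W + q * (p + 1) / 2^W,
        # and t - q has an all-zero low word, so the division is a plain shift.
        t = (t >> W) + q * P503_PLUS1_HI
    return t - P503 if t >= P503 else t  # single conditional subtraction


if __name__ == "__main__":
    R = 1 << (W * N_WORDS)
    a, b = 3**200 % P503, 5**150 % P503
    r = mont_reduce(a * b)
    print((r * R) % P503 == (a * b) % P503)
```

In a word-level implementation, the low words of p + 1 are zero, so the corresponding partial products can be skipped entirely, which is consistent with the halved multiplication counts given for the SIDH-specific reduction.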
Proposed Modular Reduction for SIDH (32-bit ARMv7)
• $m$-bit modular reduction using Montgomery reduction
  • One $m/2$-bit multiplication in ARM
  • One $m/2$-bit multiplication in NEON
[Figure (three animation steps): ARM and NEON execution for the SIDH-specific Montgomery reduction. Labeled stages: operand passing, (1)-(2) partial products between Q[0]M[3] and Q[7]M[7] over T[3] to T[14], (3) result accumulation.]
Multiplication Instruction (64-bit ARMv8)
[Figure: with 64-bit registers X0 = a0 and X1 = b0, MUL writes the lower 64 bits of a0b0 to X2 and UMULH writes the upper 64 bits of a0b0 to X3.]
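A behavioral sketch of the two ARMv8 instructions, again as plain Python rather than assembly. The function names are illustrative assumptions. It shows that a full 64x64-bit product takes one MUL and one UMULH, since ARMv8 has no single instruction that returns both halves.

```python
MASK64 = (1 << 64) - 1


def mul(a: int, b: int) -> int:
    """Model of MUL: lower 64 bits of the 64x64-bit product."""
    return ((a & MASK64) * (b & MASK64)) & MASK64


def umulh(a: int, b: int) -> int:
    """Model of UMULH: upper 64 bits of the unsigned 128-bit product."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def mul_64x64(a: int, b: int) -> tuple:
    """Full 128-bit product as (hi, lo), built from one UMULH and one MUL."""
    return umulh(a, b), mul(a, b)


if __name__ == "__main__":
    hi, lo = mul_64x64(MASK64, MASK64)
    print(hex(hi), hex(lo))
```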