SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange

SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange. Hwajeong Seo (Hansung University), Zhe Liu (Nanjing University of Aeronautics and Astronautics), Patrick Longa (Microsoft Research), Zhi Hu (Central South University)

Outline
• Short Overview
  • Post-quantum supersingular isogeny Diffie-Hellman (SIDH) key exchange
  • Supersingular isogeny key encapsulation (SIKE) protocol
• Our implementation
  • Optimized implementations for 32-bit ARMv7
  • Optimized implementations for 64-bit ARMv8
• Implementation results
• Conclusion

Post-Quantum Cryptography (Isogeny)
• RSA and ECC: integer factorization and ECDLP
  • Both hard problems can be solved by Shor’s algorithm on a quantum computer.
• Quantum-Resistant Cryptography
  • NIST launches the post-quantum cryptography standardization project: “The goal of this process is to select a number of acceptable candidate cryptosystems for standardization.”
  • Code, Lattice, Hash, Multivariate, Isogeny, …
• Isogeny-based cryptography: (conjectured to be) hard for quantum computers
  • Supersingular isogeny Diffie-Hellman (SIDH) key exchange was proposed by Jao and De Feo in 2011.
  • Among all the submitted post-quantum candidates, SIDH uses the smallest keys.

Mobile Platform (32-bit/64-bit ARM)
Platform             | ARM Cortex-A15 | ARM Cortex-A53 | ARM Cortex-A72
Architecture         | 32-bit ARMv7   | 64-bit ARMv8   | 64-bit ARMv8
Frequency            | 2.0 GHz        | 1.512 GHz      | 1.992 GHz
No. registers        | 15             | 31             | 31
No. registers (NEON) | 16             | 32             | 32
Application: Wearable devices, Smartphones

Previous Works
• Hardware Implementation
  • FPGA: Koziel et al. [INDOCRYPT’16, TCAS’17]
• Software Implementation
  • 64-bit Intel processor: Costello et al. [CRYPTO’16, EUROCRYPT’17], Faz-Hernández et al. [ToC’17], Zanon et al. [PQCrypto’18]
  • 64-bit ARM processor: Jalali et al. [SAC’17] → this work [CHES’18]
  • 32-bit ARM processor: Koziel et al. [CANS’16] → this work [CHES’18]

Motivation
Type         | Algorithm     | Advantage                             | Disadvantage
Code         | McEliece      | Fast computation                      | Long key size
Hash         | XMSS, SPHINCS | Security proof                        | Long signature size
Lattice      | (ring)-LWE    | Fast computation                      | Difficulty of parameter selection
Multivariate | UOV, Rainbow  | Short signature size, fast computation | Long key size
Isogeny      | SIDH, SIKE    | Short key size                        | Slow computation
• All PQC candidates have their own pros and cons.
• The disadvantage of SIDH/SIKE is slow computation.
• In this talk, we address this problem on 32-bit and 64-bit ARM processors.

Contribution
• Unified ARM/NEON multiplication: instruction-level parallelism
• New Montgomery reduction: “UMAAL” + “hybrid-scanning”
• Efficient implementation of SIDH:
  • p503 (88 msec) / p751 (292 msec) on 32-bit ARMv7-A @ 2.0 GHz
  • p503 (45 msec) on 64-bit ARMv8-A @ 1.992 GHz

Post-quantum key exchange algorithm
• Supersingular Isogeny Diffie-Hellman (SIDH)
  • Shared key generation between two parties over an insecure communication channel.
  • SIDH works with the set of supersingular elliptic curves over $\mathbb{F}_{p^2}$ and their isogenies.
$$E_{AB} = \phi'_A(\phi_B(E_0)) \cong E_0 / \langle P_A + s_A Q_A,\ P_B + s_B Q_B \rangle \cong E_{BA} = \phi'_B(\phi_A(E_0))$$

Supersingular Isogeny Key Encapsulation (SIKE)
• SIDH is not secure when keys are reused (Galbraith–Petit–Shani–Ti 2016)
• SIKE: (Costello–De Feo–Jao–Longa–Naehrig–Renes 2017)
  • IND-CCA secure key encapsulation based on SIDH.
  • Uses a variant of the Hofheinz–Hövelmanns–Kiltz (HHK) transform: IND-CPA PKE → IND-CCA KEM
• Starting curve $E_0/\mathbb{F}_{p^2}: y^2 = x^3 + x$, where $p = 2^{e_A} 3^{e_B} - 1$
Scheme (SIKEp + $\lceil \log_2 p \rceil$) | $(e_A, e_B)$ | Classical sec. | Quantum sec. | Security level
SIKEp503 | (250,159) | 126 bits | 84 bits  | AES-128 (NIST level 1)
SIKEp751 | (372,239) | 188 bits | 125 bits | AES-192 (NIST level 3)
SIKEp964 | (486,301) | 241 bits | 161 bits | AES-256 (NIST level 5)
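The scheme names can be checked against the exponents in the table. The following is a minimal sketch, not part of the slides: the names `SIKE_EXPONENTS` and `sike_prime` are hypothetical, and the only inputs are the exponent pairs listed above. It rebuilds each prime p = 2^eA · 3^eB − 1 and prints its bit length, which should match the number in the scheme name.

```python
# Hypothetical helper names; exponent pairs (e_A, e_B) are taken from the table.
SIKE_EXPONENTS = {
    "SIKEp503": (250, 159),
    "SIKEp751": (372, 239),
    "SIKEp964": (486, 301),
}


def sike_prime(e_a: int, e_b: int) -> int:
    """Return p = 2^e_a * 3^e_b - 1."""
    return (1 << e_a) * 3**e_b - 1


for name, (e_a, e_b) in SIKE_EXPONENTS.items():
    p = sike_prime(e_a, e_b)
    # The bit length of p is the number that appears in the scheme name.
    print(name, p.bit_length())
```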

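Multiplication Instruction (32-bit ARMv7)
[Figure: ARM vs. NEON multiplication on 32-bit operands. ARM: with R0 = a0, R1 = b0, R2 = c0, R3 = d0, UMULL writes the 64-bit product a0b0 to (R3, R2), and UMAAL writes the 64-bit value a0b0 + c0 + d0 to (R3, R2). NEON: vector registers V0 = (a3, a2, a1, a0) and V1 = (b3, b2, b1, b0) yield 64-bit products (a1b0, a0b0) in V2.]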
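The two ARM instructions on this slide can be modeled in a few lines. This is only an illustrative sketch of their arithmetic, with hypothetical function names `umull` and `umaal` that mirror the operand order `RdLo, RdHi, Rn, Rm`; it says nothing about timing. It also shows why UMAAL needs no carry handling: the largest possible result, (2^32 − 1)^2 + 2(2^32 − 1), equals 2^64 − 1 and still fits in 64 bits.

```python
MASK32 = 0xFFFFFFFF


def umull(rn: int, rm: int) -> tuple[int, int]:
    """Model of UMULL RdLo, RdHi, Rn, Rm: 32x32 -> 64-bit unsigned product."""
    prod = (rn & MASK32) * (rm & MASK32)
    return prod & MASK32, prod >> 32  # (RdLo, RdHi)


def umaal(rd_lo: int, rd_hi: int, rn: int, rm: int) -> tuple[int, int]:
    """Model of UMAAL RdLo, RdHi, Rn, Rm: RdHi:RdLo = Rn*Rm + RdLo + RdHi."""
    total = (rn & MASK32) * (rm & MASK32) + (rd_lo & MASK32) + (rd_hi & MASK32)
    assert total < (1 << 64)  # never overflows 64 bits
    return total & MASK32, total >> 32  # (RdLo, RdHi)


# Worst case: every input is the all-ones word, and the result is 2^64 - 1.
lo, hi = umaal(MASK32, MASK32, MASK32, MASK32)
print(hex(hi), hex(lo))
```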

Previous Multiprecision Multiplication (32-bit ARMv7)
[Figure: two partial-product diagrams over A[0..7] and B[0..7] with result words C[0] to C[14]: Consecutive Operand Caching (COC) for ARM and Cascade Operand Scanning (COS) for NEON.]
Bitlength | Method | Instruction  | Timings [cc]
256-bit   | COC    | ARM (UMAAL)  | 158 (best)
256-bit   | COS    | NEON (UMULL) | 188
512-bit   | COC    | ARM (UMAAL)  | 596 (best)
512-bit   | COS    | NEON (UMULL) | 632
Target processor: 32-bit ARM Cortex-A15

Proposed Multiprecision Multiplication (32-bit ARMv7)
• Instruction-level parallelism
  • ARM and NEON instructions are issued together
• Karatsuba multiplication: m-bit multiplication
  $A_H \cdot B_H \cdot 2^m + (A_H \cdot B_H + A_L \cdot B_L - (A_H - A_L) \cdot (B_H - B_L)) \cdot 2^{m/2} + A_L \cdot B_L$
  • Two m/2-bit multiplications in ARM
  • One m/2-bit multiplication in NEON
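The Karatsuba step on this slide replaces one full-size product with three half-size products. The sketch below is a plain one-level subtractive Karatsuba on Python integers, under the assumption that the operand width is even; the function name `karatsuba_one_level` is hypothetical. It does not model how the three half-size products are distributed between the ARM and NEON units, only the arithmetic identity.

```python
def karatsuba_one_level(a: int, b: int, width: int) -> int:
    """Multiply two width-bit integers using three half-width products.

    Assumes width is even and 0 <= a, b < 2**width.
    """
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    high = a_hi * b_hi                      # half-size product 1
    low = a_lo * b_lo                       # half-size product 2
    diff = (a_hi - a_lo) * (b_hi - b_lo)    # half-size product 3 (signed)

    # high + low - diff equals the cross terms a_hi*b_lo + a_lo*b_hi.
    middle = high + low - diff
    return (high << width) + (middle << half) + low


# Usage: a 512-bit example compared against the built-in multiplication.
x = (1 << 511) + 0x123456789ABCDEF
y = (1 << 512) - 0xFEDCBA987654321
print(karatsuba_one_level(x, y, 512) == x * y)
```

The difference term is signed in this sketch; a word-level implementation would typically track the signs of the two differences separately.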

[Figure, built up over three slides: data flow of the proposed multiplication between the ARM and NEON units, over partial products from A[0]B[0] to A[7]B[7] and result words C[0] to C[14]. Labeled steps: operand subtraction and operand passing (1), partial-product computation (2 to 4), result passing and result accumulation (5).]

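Proposed Multiprecision Multiplication (32-bit ARMv7)
Bitlength | Method    | Instruction | Timings [cc]
512-bit   | COC       | ARM         | 596
512-bit   | GMP-6.1.2 | ARM         | 1,138
512-bit   | COS       | NEON        | 632
512-bit   | This work | ARM/NEON    | 470
768-bit   | GMP-6.1.2 | ARM         | 2,408
768-bit   | This work | ARM/NEON    | 912
Speedup of this work: 1.26x at 512 bits (relative to COC), 2.64x at 768 bits (relative to GMP-6.1.2)
Target processor: 32-bit ARM Cortex-A15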

Proposed Modular Reduction (32-bit ARMv7)
• m-bit modular reduction using Montgomery reduction
  • Two m/2-bit multiplications in ARM
  • Two m/2-bit multiplications in NEON
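For reference, a generic word-by-word Montgomery reduction can be sketched as follows. This is the textbook algorithm on Python integers, not the ARM/NEON scheduling from the slides; the function name `montgomery_reduce` and its parameters are hypothetical, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
def montgomery_reduce(t: int, modulus: int, n_words: int, w: int = 32) -> int:
    """Return t * R^-1 mod modulus, where R = 2^(w * n_words).

    Assumes modulus is odd, modulus < R, and 0 <= t < modulus * R.
    """
    base = 1 << w
    # Montgomery constant M' = -modulus^-1 mod 2^w.
    m_prime = (-pow(modulus, -1, base)) % base

    for i in range(n_words):
        t_i = (t >> (w * i)) & (base - 1)   # current word of t
        q = (t_i * m_prime) & (base - 1)    # quotient word
        t += (q * modulus) << (w * i)       # makes word i of t zero

    t >>= w * n_words                       # exact division by R
    return t - modulus if t >= modulus else t


# Usage: reduce a 1015-bit value modulo p503 with sixteen 32-bit words.
p503 = (1 << 250) * 3**159 - 1
r = 1 << 512
value = (p503 - 12345) * (r - 6789)
print(montgomery_reduce(value, p503, 16) == value * pow(r, -1, p503) % p503)
```

Each loop iteration performs one word-by-modulus product, so the whole loop amounts to an m-bit by m-bit product Q·M, which is the product the slide splits into four half-size multiplications.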

[Figure, built up over three slides: data flow of the proposed Montgomery reduction between the ARM and NEON units, over partial products from Q[0]M[0] to Q[7]M[7] and intermediate words T[0] to T[14]. Labeled steps: operand passing, partial-product computation (1 to 4), result passing and result accumulation (5).]

Modular Reduction for SIDH
• Efficient Montgomery reduction: Montgomery-friendly modulus
  • The lower word of the modulus is $2^w - 1$ → the Montgomery constant is equal to 1.
  • Multiplications with an all-ones word ($T \times \texttt{0xFFFFFFFF} \rightarrow T \times 2^{32} - T$): shifts and subtractions
  • (e.g., $p503 = 2^{250} 3^{159} - 1$)
    0x4066F541811E1E6045C6BDDA77A4D01B9BF6C87B7E7DAF13085BDA2211E7A0AB FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF (in hexadecimal)
• A modulus M+1 turns the lower part of the modulus into all-zero words
  • (e.g., $p503 + 1 = 2^{250} 3^{159}$)
    0x4066F541811E1E6045C6BDDA77A4D01B9BF6C87B7E7DAF13085BDA2211E7A0AC 00000000000000000000000000000000000000000000000000000000000000 (in hexadecimal)
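The two properties claimed for a Montgomery-friendly modulus can be checked numerically. The sketch below uses p503 = 2^250 · 3^159 − 1 and a 32-bit word; the names `montgomery_constant` and `times_all_ones` are hypothetical, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
W = 32
BASE = 1 << W
P503 = (1 << 250) * 3**159 - 1


def montgomery_constant(modulus: int, w: int = W) -> int:
    """Return M' = -modulus^-1 mod 2^w."""
    return (-pow(modulus, -1, 1 << w)) % (1 << w)


def times_all_ones(t: int, w: int = W) -> int:
    """Compute t * (2^w - 1) with one shift and one subtraction."""
    return (t << w) - t


# The lowest word of p503 is all ones, so p503 = -1 (mod 2^32) and M' = 1.
print(hex(P503 & (BASE - 1)), montgomery_constant(P503))

# p503 + 1 has only zero bits below position 250.
print((P503 + 1) & ((1 << 250) - 1))
```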

Proposed Modular Reduction for SIDH (32-bit ARMv7)
• m-bit modular reduction using Montgomery reduction
  • One m/2-bit multiplication in ARM
  • One m/2-bit multiplication in NEON
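One way to see why fewer multiplications are needed for the SIDH modulus is the sketch below. It assumes a modulus p with p ≡ −1 (mod 2^w), so the Montgomery constant is 1 and each quotient word is simply the current word of T. Adding q·p is then rewritten as subtracting q (which clears the word) and adding q·(p+1), where the zero low words of p+1 need no multiplication. The function name `reduce_with_p_plus_1` is hypothetical, and the code works on Python integers rather than on ARM/NEON registers.

```python
def reduce_with_p_plus_1(t: int, p: int, n_words: int, w: int = 32) -> int:
    """Return t * R^-1 mod p for R = 2^(w * n_words).

    Assumes p = -1 (mod 2^w), p < R, and 0 <= t < p * R.
    """
    mask = (1 << w) - 1
    p1 = p + 1

    # Count the all-zero low words of p + 1 and keep only the nonzero part.
    zero_words = 0
    while (p1 >> (w * zero_words)) & mask == 0:
        zero_words += 1
    p1_high = p1 >> (w * zero_words)

    for i in range(n_words):
        q = (t >> (w * i)) & mask            # M' = 1, so q is word i of t
        t -= q << (w * i)                    # "- q": clears word i
        t += (q * p1_high) << (w * (i + zero_words))  # "+ q * (p + 1)"

    t >>= w * n_words
    return t - p if t >= p else t


# Usage: same input as a generic Montgomery reduction modulo p503.
p503 = (1 << 250) * 3**159 - 1
r = 1 << 512
value = (p503 - 12345) * (r - 6789)
print(reduce_with_p_plus_1(value, p503, 16) == value * pow(r, -1, p503) % p503)
```

With 32-bit words, p503 + 1 has seven all-zero low words, so only the upper words of the modulus take part in the products.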

[Figure, built up over three slides: data flow of the proposed reduction for SIDH between the ARM and NEON units. Only the partial products from Q[0]M[3] to Q[7]M[7] remain, with intermediate words T[3] to T[14]. Labeled steps: operand passing, partial-product computation (1 and 2), result accumulation (3).]

Multiplication Instruction (64-bit ARMv8)
[Figure: 64-bit multiplication with X0 = a0 and X1 = b0. The 128-bit product a0b0 is held in the register pair (X3, X2): MUL produces its lower 64 bits and UMULH its upper 64 bits.]
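On ARMv8 a full 64×64-bit product takes two instructions. The sketch below models only their arithmetic, with hypothetical function names `mul` and `umulh` named after the instructions: MUL returns the low 64 bits of the product and UMULH the high 64 bits.

```python
MASK64 = (1 << 64) - 1


def mul(xn: int, xm: int) -> int:
    """Model of MUL Xd, Xn, Xm: low 64 bits of the product."""
    return ((xn & MASK64) * (xm & MASK64)) & MASK64


def umulh(xn: int, xm: int) -> int:
    """Model of UMULH Xd, Xn, Xm: high 64 bits of the unsigned product."""
    return ((xn & MASK64) * (xm & MASK64)) >> 64


# Usage: the two halves recombine into the full 128-bit product.
a0, b0 = 0x0123456789ABCDEF, 0xFEDCBA9876543210
print((umulh(a0, b0) << 64 | mul(a0, b0)) == a0 * b0)
```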
