Optimizing multiplications with vector instructions
Chitchanok Chuengsatiansup
INRIA and ENS de Lyon
4 June 2018

Introduction
Current position: Postdoc (INRIA and ENS de Lyon)
  Supervisor: Damien Stehlé
Previous position: PhD student at TU/Eindhoven, The Netherlands
  Cryptographic Implementations group
  Thesis: “Optimizing Curve-Based Cryptography”
  Supervisors: Daniel J. Bernstein and Tanja Lange
Experience
  Software implementations
  Optimizing cryptographic software and algorithms

Vectorization speedups
Without vector: one instruction computes $a + b$.
With vector: one instruction computes $(a_0, a_1, a_2, a_3) + (b_0, b_1, b_2, b_3) = (a_0 + b_0,\ a_1 + b_1,\ a_2 + b_2,\ a_3 + b_3)$.
A single instruction performs n independent operations on aligned inputs.
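The lane-wise behaviour can be modelled in a few lines of Python. This is only a sketch of the semantics, not of the speed: the function name `vector_add` and the 32-bit lane width are assumptions chosen for illustration.

```python
def vector_add(a, b, lane_bits=32):
    """Model one vector add: each lane is added independently and wraps at lane_bits."""
    assert len(a) == len(b)
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
result = vector_add(a, b)

# Lanes do not interact: an overflow in lane 0 never carries into lane 1.
wrapped = vector_add([0xFFFFFFFF, 0, 0, 0], [1, 0, 0, 0])
```

The second call shows why multi-limb arithmetic on vectors needs explicit carry handling: the hardware discards the overflow of each lane.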

Side-channel attacks
Prevent software side-channel attacks with constant-time code:
  no input-dependent branch
  no input-dependent array index
Constant-time table lookup:
  read the entire table
  select via arithmetic: if c is 1, select tbl[i]; if c is 0, ignore tbl[i]
  $t = (t \cdot (1 - c)) + (\mathrm{tbl}[i] \cdot c)$
  $t = (t \wedge (c - 1)) \vee (\mathrm{tbl}[i] \wedge (-c))$
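The mask-based selection can be sketched as follows. Python integers are not constant-time, so this only illustrates the data flow; the helper names `ct_select`, `ct_equal`, and `ct_lookup` and the 32-bit word size are assumptions, not taken from the talk.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit machine words


def ct_select(t, entry, c):
    """Return entry if c == 1 and t if c == 0, using only arithmetic on c."""
    keep = (c - 1) & MASK32  # c = 0 -> all ones, c = 1 -> zero
    take = (-c) & MASK32     # c = 0 -> zero,     c = 1 -> all ones
    return (t & keep) | (entry & take)


def ct_equal(i, j):
    """Return 1 if i == j else 0, for 0 <= i, j < 2**31, without branching."""
    x = i ^ j                           # zero exactly when i == j
    return ((x - 1) & MASK32) >> 31     # top bit is set only when x == 0


def ct_lookup(tbl, secret_index):
    """Read every entry of tbl; keep only the one at secret_index."""
    t = 0
    for i, entry in enumerate(tbl):
        t = ct_select(t, entry, ct_equal(i, secret_index))
    return t


tbl = [7, 11, 13, 17]
looked_up = [ct_lookup(tbl, k) for k in range(len(tbl))]
```

Every call touches all table entries in the same order, so the memory access pattern does not depend on `secret_index`.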

Curve41417

Design of Curve41417
High-security elliptic curve (security level above $2^{200}$)
Defined over the prime field $\mathbb{F}_p$ where $p = 2^{414} - 17$
In Edwards curve form: $x^2 + y^2 = 1 + 3617 x^2 y^2$
Large prime-order subgroup (cofactor 8)
IEEE P1363 criteria (large embedding degree, etc.)
Twist secure, i.e., the twist of Curve41417 is also secure
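As a sanity check of the curve equation, the sketch below uses plain Python integers and the standard affine Edwards addition law. It is not the mixed-coordinate arithmetic of the talk. The helper names and the way a sample point is found (scanning small x values) are assumptions made for illustration, and the code needs Python 3.8+ for `pow(x, -1, p)`.

```python
p = 2**414 - 17
d = 3617


def on_curve(pt):
    """Check x^2 + y^2 = 1 + d x^2 y^2 modulo p."""
    x, y = pt
    return (x * x + y * y - 1 - d * x * x * y * y) % p == 0


def edwards_add(pt1, pt2):
    """Affine Edwards addition law for x^2 + y^2 = 1 + d x^2 y^2."""
    (x1, y1), (x2, y2) = pt1, pt2
    k = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow(1 + k, -1, p) % p
    y3 = (y1 * y2 - x1 * x2) * pow(1 - k, -1, p) % p
    return (x3, y3)


def find_point():
    """Scan x = 2, 3, ... until y^2 = (1 - x^2) / (1 - d x^2) is a square."""
    x = 2
    while True:
        y2 = (1 - x * x) * pow(1 - d * x * x, -1, p) % p
        y = pow(y2, (p + 1) // 4, p)  # square-root candidate, valid since p % 4 == 3
        if y * y % p == y2:
            return (x, y)
        x += 1


neutral = (0, 1)
pt = find_point()
doubled = edwards_add(pt, pt)
```

The sample point is not the curve's standard base point; it is only some point that satisfies the equation.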

ECC arithmetic
Mixed-coordinate systems:
  doubling: projective X, Y, Z
  addition: extended X, Y, Z, T
  (See https://hyperelliptic.org/EFD/)
Scalar multiplication:
  signed fixed windows of width w = 5
  precompute 0P, 1P, 2P, ..., 16P
  also multiply d = 3617 to T coordinate
  special first doubling
  compute T only before addition
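A signed fixed-window recoding with w = 5 can be sketched as below. The function name `recode_signed` and the choice of 84 digits for a scalar below 2^414 are assumptions for illustration. The Python `if` is a data-dependent branch, so a real implementation would replace it with arithmetic selection.

```python
W = 5


def recode_signed(k, n_digits):
    """Split k into n_digits signed base-2^W digits, each in [-15, 16]."""
    digits = []
    for _ in range(n_digits):
        dgt = k & ((1 << W) - 1)      # low W bits: 0..31
        if dgt > (1 << (W - 1)):      # 17..31 becomes -15..-1, borrowing from the next window
            dgt -= 1 << W
        digits.append(dgt)
        k = (k - dgt) >> W
    assert k == 0, "n_digits too small for this scalar"
    return digits


scalar = 2**413 + 12345678901234567890
digits = recode_signed(scalar, 84)
recombined = sum(dgt << (W * i) for i, dgt in enumerate(digits))
```

Since every digit has absolute value at most 16, a table holding 0P through 16P plus a conditional negation covers all cases, which matches the precomputation listed on the slide.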

Point operations
[Figure: data-flow diagrams of the two point operations. Point addition takes $x_1, y_1, z_1, t_1$ and $x_2, y_2, z_2, d \cdot t_2$ and produces $x_3, y_3, z_3, t_3$; point doubling takes $x_1, y_1, z_1$ and produces $x_3, y_3, z_3$.]

ARM Cortex-A8 vector unit
128-bit vector registers
The arithmetic unit and the load/store unit can operate in parallel
Operates in parallel on vectors of four 32-bit integers or two 64-bit integers
Each cycle produces:
  four 32-bit integer additions: $a_0 + b_0,\ a_1 + b_1,\ a_2 + b_2,\ a_3 + b_3$
  or two 64-bit integer additions: $c_0 + d_0,\ c_1 + d_1$
  or one multiply-add instruction: $a_0 b_0 + c_0$
where the $a_i, b_i$ are 32-bit integers and the $c_i, d_i$ are 64-bit integers
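The multiply-add operation can be modelled as a widening multiply-accumulate. The sketch below is only a model of the arithmetic, with `mul_add` as an assumed name. The worst-case loop assumes inputs of at most 26 bits, matching the limb size that the talk uses for field elements.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul_add(c, a, b):
    """Model a*b + c with 32-bit a, b and a 64-bit accumulator c (wraps at 2^64)."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32 and 0 <= c <= MASK64
    return (a * b + c) & MASK64


# Worst case for 26-bit inputs: accumulate 16 products of maximal values.
limb_max = (1 << 26) - 1
acc = 0
for _ in range(16):
    acc = mul_add(acc, limb_max, limb_max)
```

Sixteen such products stay below 2^56, so the 64-bit accumulator leaves headroom and carries can be delayed until after the accumulation.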

Redundant representation
Use non-integer radix $2^{414/16} = 2^{25.875}$
Decompose integer f modulo $2^{414} - 17$ into 16 integer pieces
Write f as
$f_0 + 2^{26} f_1 + 2^{52} f_2 + 2^{78} f_3 + 2^{104} f_4 + 2^{130} f_5 + 2^{156} f_6 + 2^{182} f_7 + 2^{207} f_8 + 2^{233} f_9 + 2^{259} f_{10} + 2^{285} f_{11} + 2^{311} f_{12} + 2^{337} f_{13} + 2^{363} f_{14} + 2^{389} f_{15}$
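The exponents above are the values ceil(25.875 · i), which gives limbs of 26 bits except for limbs 7 and 15, which have 25 bits. A minimal sketch of splitting and recombining, with `to_limbs` and `from_limbs` as assumed names:

```python
P = 2**414 - 17

# Limb i starts at bit ceil(25.875 * i) = ceil(207 * i / 8).
OFFSETS = [-(-207 * i // 8) for i in range(17)]
WIDTHS = [OFFSETS[i + 1] - OFFSETS[i] for i in range(16)]


def to_limbs(f):
    """Split 0 <= f < 2^414 into 16 limbs of 26 or 25 bits."""
    return [(f >> OFFSETS[i]) & ((1 << WIDTHS[i]) - 1) for i in range(16)]


def from_limbs(limbs):
    """Recombine limbs (not necessarily reduced) into an integer modulo P."""
    return sum(x << OFFSETS[i] for i, x in enumerate(limbs)) % P


f = P - 1
limbs = to_limbs(f)
```

The representation is redundant because `from_limbs` also accepts limbs larger than their nominal width, which is what allows carries to be postponed.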

Carries
Goal: Bring each limb down to 26 or 25 bits
Typical carry chain:
  $m_0 \to m_1 \to m_2 \to \cdots \to m_{14} \to m_{15} \to m_0 \to m_1$
Increase throughput:
  $m_0 \to m_1 \to m_2 \to m_3 \to m_4 \to m_5 \to m_6 \to m_7 \to m_8 \to m_9$
  $m_8 \to m_9 \to m_{10} \to m_{11} \to m_{12} \to m_{13} \to m_{14} \to m_{15} \to m_0 \to m_1$
Decrease latency:
  $m_0 \to m_1 \to m_2$
  $m_8 \to m_9 \to m_{10}$
  $m_4 \to m_5$
  $m_{12} \to m_{13}$
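The carry schedules can be compared with a small model. It assumes the limb offsets ceil(25.875 · i) from the redundant representation and uses the fact that a carry out of limb 15 re-enters limb 0 multiplied by 17, because 2^414 ≡ 17 modulo p. The names `carry_step` and `carry_chains` are hypothetical. The slide shows only the first steps of the four-chain schedule, so running four chains of five steps each is an assumed completion rather than something stated in the talk.

```python
P = 2**414 - 17
OFFSETS = [-(-207 * i // 8) for i in range(17)]            # 0, 26, 52, ..., 389, 414
WIDTHS = [OFFSETS[i + 1] - OFFSETS[i] for i in range(16)]  # 26 or 25 bits per limb


def value(m):
    """Integer represented by the limbs m, reduced modulo P."""
    return sum(x << OFFSETS[i] for i, x in enumerate(m)) % P


def carry_step(m, i):
    """Move the excess of limb i into limb i+1; limb 15 wraps to limb 0 times 17."""
    c = m[i] >> WIDTHS[i]
    m[i] -= c << WIDTHS[i]
    m[(i + 1) % 16] += 17 * c if i == 15 else c


def carry_chains(m, starts, steps):
    """Run one chain per start index in lockstep, each with the given number of steps."""
    m = list(m)
    for step in range(steps):
        for s in starts:  # carries within one step touch disjoint limbs
            carry_step(m, (s + step) % 16)
    return m


m = [(1 << 60) - 1 - i for i in range(16)]   # unreduced limbs, e.g. after a multiplication
single = carry_chains(m, [0], 17)            # m0 -> m1 -> ... -> m15 -> m0 -> m1
double = carry_chains(m, [0, 8], 9)          # two interleaved chains, as on the slide
quad = carry_chains(m, [0, 8, 4, 12], 5)     # assumed completion of the four-chain schedule
```

All three schedules represent the same field element and leave every limb within a few thousand of its nominal width. The difference lies in the length of the dependency chain: 17, 9, and 5 sequential steps respectively, where the carries of one step can be issued side by side.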
