Optimizing multiplications with vector instructions
Chitchanok Chuengsatiansup
INRIA and ENS de Lyon
4 June 2018

Introduction

Current position: Postdoc (INRIA and ENS de Lyon)
    Supervisor: Damien Stehlé

Previous position: PhD student at TU Eindhoven, The Netherlands
    Cryptographic Implementations group
    Thesis: "Optimizing Curve-Based Cryptography"
    Supervisors: Daniel J. Bernstein and Tanja Lange

Experience
    Software implementations
    Optimizing cryptographic software and algorithms

Vectorization speedups

Without vector:
    $a + b = a + b$

With vector:
    $(a_0, a_1, a_2, a_3) + (b_0, b_1, b_2, b_3) = (a_0 + b_0,\ a_1 + b_1,\ a_2 + b_2,\ a_3 + b_3)$

A single instruction performs n independent operations on aligned inputs.

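The lane-wise addition on this slide can be modelled in a few lines. This is only an illustrative sketch: the function name `vector_add` and the default 32-bit lane width are assumptions made here, and a Python loop does not run as a single instruction the way a real vector unit does.

```python
def vector_add(a, b, bits=32):
    """Add two equal-length vectors lane by lane, wrapping each lane at `bits` bits.

    Models what one vector instruction computes; each lane is independent,
    so no carry ever crosses from one lane into the next.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    mask = (1 << bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


# Four independent additions, as in (a0..a3) + (b0..b3).
result = vector_add([1, 2, 3, 4], [10, 20, 30, 40])
```

The wrap-around in each lane is the reason the later slides keep limbs small: a lane that overflows silently loses its carry.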
Side-channel attacks

Prevent software side-channel attacks: constant-time
    no input-dependent branch
    no input-dependent array index

Constant-time table lookup:
    read entire table
    select via arithmetic
        if c is 1, select tbl[i]
        if c is 0, ignore tbl[i]

    $t = (t \cdot (1 - c)) + (\mathrm{tbl}[i] \cdot c)$
    $t = (t \wedge (c - 1)) \vee (\mathrm{tbl}[i] \wedge (-c))$

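The masked selection formula above can be sketched as follows. The helper names `is_equal` and `ct_lookup`, the 32-bit word size, and the way c is derived from the index are all assumptions for illustration. Python integers give no timing guarantees, so this shows only the arithmetic, not a hardened implementation.

```python
MASK32 = 0xFFFFFFFF  # assumed word size for this sketch


def is_equal(a, b):
    """Return 1 if a == b, else 0, using arithmetic only (requires a, b < 2**31)."""
    x = (a ^ b) & MASK32
    # x - 1 wraps to 0xFFFFFFFF only when x == 0, which sets the top bit.
    return ((x - 1) & MASK32) >> 31


def ct_lookup(tbl, secret_index):
    """Read every entry of tbl and keep the one at secret_index via masking."""
    t = 0
    for i, entry in enumerate(tbl):
        c = is_equal(i, secret_index)
        # t = (t AND (c - 1)) OR (tbl[i] AND (-c)), as on the slide.
        t = (t & ((c - 1) & MASK32)) | (entry & (-c & MASK32))
    return t
```

Every table entry is touched on every call, so the memory access pattern does not depend on `secret_index`.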
Curve41417

Design of Curve41417

High-security elliptic curve (security level above $2^{200}$)
Defined over the prime field $\mathbb{F}_p$ where $p = 2^{414} - 17$
In Edwards curve form: $x^2 + y^2 = 1 + 3617\,x^2 y^2$

Large prime-order subgroup (cofactor 8)
IEEE P1363 criteria (large embedding degree, etc.)
Twist secure, i.e., the twist of Curve41417 is also secure

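As a small sanity check of the curve equation, the sketch below tests points against it and applies the standard affine Edwards addition law. The names `on_curve` and `edwards_add` are hypothetical, and affine formulas with field inversions are used only for readability; they are not the projective and extended formulas the talk uses later.

```python
P = 2**414 - 17   # field prime from the slide
D = 3617          # curve coefficient from the slide


def on_curve(x, y):
    """Check x^2 + y^2 = 1 + D x^2 y^2 modulo P."""
    return (x * x + y * y - 1 - D * x * x * y * y) % P == 0


def edwards_add(p1, p2):
    """Affine Edwards addition; inverses via Fermat's little theorem (P is prime)."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, P - 2, P) % P
    y3 = (y1 * y2 - x1 * x2) * pow(1 - t, P - 2, P) % P
    return x3, y3


neutral = (0, 1)      # neutral element of the group
order_four = (1, 0)   # doubling it gives (0, -1), which has order 2
```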
ECC arithmetic

Mixed-coordinate systems:
    doubling: projective $X, Y, Z$
    addition: extended $X, Y, Z, T$
    (See https://hyperelliptic.org/EFD/)

Scalar multiplication:
    signed fixed windows of width $w = 5$
    precompute $0P, 1P, 2P, \ldots, 16P$
        also multiply $d = 3617$ to the $T$ coordinate
    special first doubling
    compute $T$ only before addition

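One common way to obtain signed width-5 windows is sketched below. The talk does not spell out its recoding, so the function `signed_windows` and its digit range of -16 to 15 are assumptions. Every digit has absolute value at most 16, which is why a table of 0P through 16P plus a negation suffices. The loop length here depends on the scalar; a constant-time implementation would instead use a fixed number of windows.

```python
def signed_windows(k, w=5):
    """Recode a non-negative scalar k into signed base-2**w digits, least significant first.

    Each digit lies in [-2**(w-1), 2**(w-1) - 1], and sum(d * 2**(w*i)) == k.
    """
    digits = []
    while k > 0:
        d = k & ((1 << w) - 1)   # low w bits: 0 .. 2**w - 1
        borrow = d >> (w - 1)    # 1 when d >= 2**(w-1)
        d -= borrow << w         # map the upper half to negative digits
        digits.append(d)
        k = (k - d) >> w         # exact division; carries the borrow upward
    return digits


digits = signed_windows(31)      # 31 = -1 + 1 * 32
```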
Point operations

[Figure: data-flow diagrams of the field additions, subtractions, and multiplications in each point operation. Point addition takes $x_1, y_1, z_1, t_1$ and $x_2, y_2, z_2, d \cdot t_2$ and produces $x_3, y_3, z_3, t_3$. Point doubling takes $x_1, y_1, z_1$ and produces $x_3, y_3, z_3$.]

ARM Cortex-A8 vector unit

128-bit vector registers
The arithmetic unit and the load/store unit can operate in parallel
Operates in parallel on vectors of four 32-bit integers or two 64-bit integers

Each cycle produces:
    four 32-bit integer additions: $a_0 + b_0,\ a_1 + b_1,\ a_2 + b_2,\ a_3 + b_3$
    or two 64-bit integer additions: $c_0 + d_0,\ c_1 + d_1$
    or one multiply-add instruction: $a_0 b_0 + c_0$
where $a_i, b_i$ are 32-bit and $c_i, d_i$ are 64-bit integers

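The multiply-add on this slide takes two 32-bit inputs and accumulates their 64-bit product into a 64-bit lane. A minimal model, assuming unsigned lanes and using the hypothetical name `multiply_add`:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def multiply_add(a0, b0, c0):
    """Return a0 * b0 + c0 for 32-bit a0, b0 and a 64-bit accumulator c0.

    The product of two 32-bit values always fits in 64 bits; only the
    accumulation can wrap, which is modelled by the final mask.
    """
    product = (a0 & MASK32) * (b0 & MASK32)
    return (c0 + product) & MASK64
```

Because the accumulation can wrap, the inputs are kept well below 32 bits in practice so that several products can be summed safely.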
Redundant representation

Use non-integer radix $2^{414/16} = 2^{25.875}$
Decompose integer $f$ modulo $2^{414} - 17$ into 16 integer pieces
Write $f$ as

$$
\begin{aligned}
f ={} & f_0 + 2^{26} f_1 + 2^{52} f_2 + 2^{78} f_3 \\
 & + 2^{104} f_4 + 2^{130} f_5 + 2^{156} f_6 + 2^{182} f_7 \\
 & + 2^{207} f_8 + 2^{233} f_9 + 2^{259} f_{10} + 2^{285} f_{11} \\
 & + 2^{311} f_{12} + 2^{337} f_{13} + 2^{363} f_{14} + 2^{389} f_{15}
\end{aligned}
$$

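The exponents in this decomposition are the values of 25.875 i rounded up, which the sketch below reproduces with integer arithmetic. The names `OFFSETS`, `WIDTHS`, `to_limbs`, and `from_limbs` are assumptions for illustration; the talk does not show code for this step.

```python
# Bit offset of limb i: ceil(414 * i / 16), for i = 0..16.
OFFSETS = [(414 * i + 15) // 16 for i in range(17)]
# Width of limb i in bits: 26 for most limbs, 25 for limbs 7 and 15.
WIDTHS = [OFFSETS[i + 1] - OFFSETS[i] for i in range(16)]


def to_limbs(f):
    """Split 0 <= f < 2**414 into 16 limbs f_0..f_15 of 26 or 25 bits each."""
    return [(f >> OFFSETS[i]) & ((1 << WIDTHS[i]) - 1) for i in range(16)]


def from_limbs(limbs):
    """Recombine limbs; also valid for limbs that exceed their nominal width."""
    return sum(x << OFFSETS[i] for i, x in enumerate(limbs))
```

`from_limbs` accepts oversized limbs on purpose: the representation is redundant, so a limb may temporarily hold more than 26 bits until a carry step brings it back down.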
Carries

Goal: bring each limb down to 26 or 25 bits

Typical carry chain:
    $m_0 \to m_1 \to m_2 \to \cdots \to m_{14} \to m_{15} \to m_0 \to m_1$

Increase throughput:
    $m_0 \to m_1 \to m_2 \to m_3 \to m_4 \to m_5 \to m_6 \to m_7 \to m_8 \to m_9$
    $m_8 \to m_9 \to m_{10} \to m_{11} \to m_{12} \to m_{13} \to m_{14} \to m_{15} \to m_0 \to m_1$

Decrease latency:
    $m_0 \to m_1 \to m_2$
    $m_8 \to m_9 \to m_{10}$
    $m_4 \to m_5$
    $m_{12} \to m_{13}$

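The typical chain and the two-chain schedule can be compared with a small model. This is a sketch under assumptions: the limb widths follow the rounded-up exponents of the radix $2^{25.875}$ representation, the carry out of $m_{15}$ is multiplied by 17 because $2^{414} \equiv 17 \pmod{p}$, and the function names are hypothetical. The two chains run one after another here, whereas a vector unit would perform each pair of carries together.

```python
P = 2**414 - 17
OFFSETS = [(414 * i + 15) // 16 for i in range(17)]          # ceil(25.875 * i)
WIDTHS = [OFFSETS[i + 1] - OFFSETS[i] for i in range(16)]    # 26 or 25 bits


def value(m):
    """Integer represented by the limbs, reduced modulo P."""
    return sum(x << OFFSETS[i] for i, x in enumerate(m)) % P


def carry_step(m, i):
    """Carry the excess bits of limb i into the next limb (in place)."""
    c = m[i] >> WIDTHS[i]
    m[i] -= c << WIDTHS[i]
    if i == 15:
        m[0] += 17 * c      # wrap-around: 2**414 = 17 (mod P)
    else:
        m[i + 1] += c


def carry_sequential(m):
    """Typical chain: m0 -> m1 -> ... -> m15 -> m0 -> m1."""
    m = list(m)
    for i in list(range(16)) + [0]:
        carry_step(m, i)
    return m


def carry_two_chains(m):
    """Two interleaved chains starting at m0 and m8, nine carries each."""
    m = list(m)
    for k in range(9):
        carry_step(m, k)              # chain m0 -> ... -> m9
        carry_step(m, (k + 8) % 16)   # chain m8 -> ... -> m15 -> m0 -> m1
    return m
```

At every step the two chains touch disjoint limbs, which is what allows them to share one vector instruction; both schedules leave the represented value unchanged modulo p.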