Pollard Rho on the PlayStation 3

Joppe W. Bos (1), Marcelo E. Kaihara (1), Peter L. Montgomery (2)
(1) EPFL IC LACAL, CH-1015 Lausanne, Switzerland
(2) Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

RAIM’09, October 27th 2009, LIP, ENS Lyon

Motivation
- Elliptic Curve Cryptography (ECC) is widely standardized:
  - Standard for Efficient Cryptography 2, SEC2 (112-521 bit)
  - Wireless Transport Layer Security Specification (112-224 bit)
  - Digital Signature Standard, FIPS 186-3, NIST (192-521 bit)
- Security relies on the hardness of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP).
- Are the standardized key sizes secure?
- What is the practical cost of solving the ECDLP?

Objective
- Evaluate the cost of solving the ECDLP for small key sizes: the 112-bit ECC standard.
- Use a broadly available platform, the PlayStation 3:
  - Low price
  - Hybrid multi-core architecture
- Implement Pollard rho on the Cell architecture:
  - Design SIMD arithmetic algorithms
  - Optimize modular arithmetic for the 112-bit prime

ECDLP
Settings:
- $E$ is an elliptic curve over $\mathbf{F}_p$ with $p$ an odd prime
- $P \in E(\mathbf{F}_p)$ is a point of order $n$
- $Q = k \cdot P \in \langle P \rangle$
Problem: given $E$, $p$, $P$, $n$ and $Q$, what is $k$? ($k = \log_P Q$)
Largest solved instance: 109-bit prime field (2002). It took “$10^4$ computers (mostly PCs) running 24 hours a day for 549 days”.

Solving the ECDLP
Pollard rho: the most efficient algorithm in the literature (for generic curves).
The underlying idea of this method is to search for two distinct pairs $(c_i, d_i), (c_j, d_j) \in \mathbf{Z}/n\mathbf{Z} \times \mathbf{Z}/n\mathbf{Z}$ such that
$$c_i \cdot P + d_i \cdot Q = c_j \cdot P + d_j \cdot Q$$
$$(c_i - c_j) \cdot P = (d_j - d_i) \cdot Q = (d_j - d_i)\, k \cdot P$$
$$k \equiv (c_i - c_j) \cdot (d_j - d_i)^{-1} \bmod n$$
J.M. Pollard. Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32:918-924, 1978.

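The last congruence can be sketched in a few lines of Python. The function name `recover_k` and the toy numbers are illustrative assumptions, not values from the talk; `n` is assumed prime and `pow(x, -1, n)` needs Python 3.8 or later.

```python
def recover_k(ci, di, cj, dj, n):
    """Solve c_i*P + d_i*Q == c_j*P + d_j*Q for k, where Q = k*P and n is prime.

    Returns None when d_j == d_i (mod n), because such a collision
    carries no information about k.
    """
    denom = (dj - di) % n
    if denom == 0:
        return None
    return (ci - cj) * pow(denom, -1, n) % n

# Toy check on exponents only (hypothetical values): n = 101, secret k = 37.
n, k = 101, 37
ci, di, dj = 5, 7, 20
cj = (ci + di * k - dj * k) % n   # forces c_i + d_i*k == c_j + d_j*k (mod n)
recovered = recover_k(ci, di, cj, dj, n)
```

The check works purely on the scalars, since a collision of points in $\langle P \rangle$ is equivalent to a collision of $c + d \cdot k$ modulo $n$.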
Pollard Rho
- “Walk” through the set $\langle P \rangle$: $X_i = c_i \cdot P + d_i \cdot Q$
- Iteration function $f : \langle P \rangle \to \langle P \rangle$, $X_{i+1} = f(X_i)$, $i \ge 0$
- This sequence eventually collides.
- Expected number of iterations: $\sqrt{\pi \cdot |\langle P \rangle| / 2}$

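To get a feeling for the size of this estimate, the sketch below evaluates it for a group order of about 112 bits. Taking the order to be exactly $2^{112}$ is a simplifying assumption made here for illustration; the real order of the target group is not given on the slides.

```python
import math

def expected_iterations(group_order):
    """Expected number of rho iterations, sqrt(pi * n / 2), for a group of order n."""
    return math.sqrt(math.pi * group_order / 2)

n = 2**112                                   # assumption: order of about 112 bits
log2_iters = math.log2(expected_iterations(n))
print(f"about 2^{log2_iters:.2f} iterations")
```

Under that assumption the walk needs roughly $2^{56.3}$ iterations before a collision is expected.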
Optimization I
- Parallel version: distinguished points, sent to a central server. P.C. van Oorschot and M.J. Wiener [1999].
- Mark points with a certain property, e.g., for $X_i = (x_i, y_i)$ the distinguished-point property $2^{24} \mid x_i$.
- Communicate them to a central DB to check collisions.
- Leads to a linear speed-up in the number of processors.
(Figure: three independent walks, $X_{i-1}, X_i, X_{i+1}, X_{i+2}$ and their primed and double-primed counterparts, all reporting to the DB.)

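A minimal sketch of the two pieces named above: the distinguished-point test and a central store that detects when two walks report the same point. The class `CentralStore` and its `report` method are hypothetical names for illustration; the slides do not describe how the DB is implemented.

```python
DP_BITS = 24  # from the slide: a point is distinguished when 2^24 divides x

def is_distinguished(x, bits=DP_BITS):
    """True when the `bits` least significant bits of x are all zero."""
    return x & ((1 << bits) - 1) == 0

class CentralStore:
    """Hypothetical stand-in for the central DB: maps a point to the (c, d) of its walk."""

    def __init__(self):
        self.seen = {}

    def report(self, point, c, d):
        """Record a distinguished point; return the earlier (c, d) if it was seen before."""
        previous = self.seen.get(point)
        if previous is None:
            self.seen[point] = (c, d)
        return previous

store = CentralStore()
first = store.report((5 << 24, 9), 1, 2)     # new point, nothing returned
second = store.report((5 << 24, 9), 3, 4)    # same point from another walk
```

With a 24-bit condition, about one point in $2^{24}$ is reported, so each walk sends data to the server only rarely.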
Optimization II
- r-adding walks, E. Teske [2001].
- Divide $\langle P \rangle$ into $r$ different partitions: $h : \langle P \rangle \to [0, r-1]$
- For each partition: $R_j = c_j \cdot P + d_j \cdot Q$
- With $X_i = (x_i, y_i)$: $X_{i+1} = f(X_i) = X_i + R_{h(X_i)}$
- $r \ge 16$ partitions $\approx$ random mapping
- Use the least significant 4 bits to determine the next partition.
(Figure: from $X_{i-1}$ to $X_i$, then on to one of $X_i + R_0, X_i + R_1, \dots, X_i + R_{15}$.)

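The r-adding walk can be tried end to end on a toy curve. The sketch below assumes the small textbook curve $y^2 = x^3 + 2x + 2$ over $\mathbf{F}_{17}$ with $P = (5, 1)$ of prime order 19, uses only $r = 4$ partitions because the group is tiny, and detects collisions with a dictionary instead of distinguished points. None of these choices come from the talk.

```python
import random

# Toy parameters (assumption): y^2 = x^3 + 2x + 2 over F_17, P of prime order 19.
p, a = 17, 2
P, n = (5, 1), 19

def add(A, B):
    """Affine point addition; None stands for the point at infinity."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # A == -B
    if A == B:
        mu = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        mu = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (mu * mu - x1 - x2) % p
    return (x3, (mu * (x1 - x3) - y1) % p)

def mul(k, A):
    """Double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1:
            R = add(R, A)
        A = add(A, A)
        k >>= 1
    return R

def solve(Q, r=4, seed=0):
    """Find k with Q = k*P using an r-adding walk; retries after a useless collision."""
    rng = random.Random(seed)
    while True:
        # Precompute R_j = c_j*P + d_j*Q for each partition j.
        steps = []
        for _ in range(r):
            cj, dj = rng.randrange(n), rng.randrange(n)
            steps.append((cj, dj, add(mul(cj, P), mul(dj, Q))))
        c, d = rng.randrange(n), rng.randrange(n)
        X = add(mul(c, P), mul(d, Q))
        seen = {}
        while X not in seen:
            seen[X] = (c, d)
            j = 0 if X is None else X[0] % r          # partition function h
            cj, dj, Rj = steps[j]
            X, c, d = add(X, Rj), (c + cj) % n, (d + dj) % n
        c0, d0 = seen[X]
        if (d0 - d) % n:
            return (c - c0) * pow((d0 - d) % n, -1, n) % n

k_found = solve(mul(7, P))
```

Here `X[0] % r` plays the role of "least significant bits select the partition"; with 16 partitions one would use `X[0] & 15` instead.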
Optimization III
- Simultaneous inversion: trade inversions for multiplications. P.L. Montgomery [1987].
- Suitable for cryptanalytic purposes.
- Trade $M$ modular inversions for $3(M-1)$ modular multiplications and 1 modular inversion.
- Affine Weierstrass representation.
- Apply to independent walks.

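The trade-off can be sketched as follows. This is a plain big-integer version for illustration, not the SPU code from the talk; the function name `batch_inverse` and the toy modulus are assumptions, and every input is assumed nonzero modulo `p`.

```python
def batch_inverse(values, p):
    """Invert all values modulo the prime p using a single modular inversion."""
    m = len(values)
    prefix = [values[0] % p]
    for v in values[1:]:                      # M - 1 multiplications
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], -1, p)              # the only inversion
    out = [0] * m
    for i in range(m - 1, 0, -1):             # 2 * (M - 1) multiplications
        out[i] = inv * prefix[i - 1] % p      # inverse of values[i]
        inv = inv * values[i] % p             # now the inverse of prefix[i - 1]
    out[0] = inv
    return out

p = 2**127 - 1                                # a known prime, used only as a toy modulus
xs = [3, 5, 7, 11]
inverses = batch_inverse(xs, p)
```

The forward pass costs $M - 1$ multiplications and the backward pass $2(M - 1)$, which gives the $3(M-1)$ stated above.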
Optimization IV
- Negation map (not used). M.J. Wiener and R.J. Zuccherato [1998].
- Computation of the negative is cheap: $-P = (x, -y)$
- Given an equivalence relation $\sim$ on $\langle P \rangle$, iterate over the set of equivalence classes $\langle P \rangle / \sim$.
- Reduce search space by a factor of 2.

The PlayStation 3
The Cell contains:
- 1 “Power Processor Element” (PPE)
- 8 “Synergistic Processing Elements” (SPEs) (6 available to the user in the PS3 under Linux)
Characteristics of the SPEs:
- Synergistic Processing Unit (SPU)
- Access to 128 registers of 128 bits
- SIMD operations
- Dual pipeline (odd and even)
- In-order processor
- 256 KB of fast local memory (Local Store)

Programming Constraints
- Memory: the executable and all data should fit in the LS (256 KB).
- Branches: no “smart” dynamic branch prediction. Instead, “prepare-to-branch” instructions redirect instruction prefetch to branch targets.
- Instruction set limitations: 16 × 16 → 32-bit multipliers (4-SIMD).
- Dual pipeline: one odd and one even instruction can be dispatched per clock cycle.

Arithmetic
Using affine Weierstrass representation, with $P, Q \in E(\mathbf{F}_p) \setminus \{\mathcal{O}\}$, $P = (x_1, y_1)$ and $Q = (x_2, y_2)$.
If $P \ne -Q$ then $P + Q = (x_3, y_3)$ with
$$x_3 = \mu^2 - x_1 - x_2, \qquad y_3 = \mu\,(x_1 - x_3) - y_1,$$
$$\mu = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P \ne Q \\[2ex] \dfrac{3 x_1^2 + a}{2 y_1} & \text{if } P = Q \end{cases}$$
Cost per addition, using Montgomery’s simultaneous inversion and running $M$ curves in parallel:
- 6 modular multiplications
- 6 modular subtractions
- $1/M$ modular inversions

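The operation count can be checked by hand for the case $P \ne Q$. The split below is one reading of the formulas, not a breakdown given in the talk: it counts the squaring as a multiplication and charges each addition its share of the simultaneous inversion, which costs $3(M-1)$ multiplications and 1 inversion for $M$ inputs.

```latex
\begin{align*}
\mu &= (y_2 - y_1)\cdot(x_2 - x_1)^{-1}
    && \text{2 sub, 1 mul, 1 inversion}\\
x_3 &= \mu^2 - x_1 - x_2
    && \text{1 mul, 2 sub}\\
y_3 &= \mu\,(x_1 - x_3) - y_1
    && \text{1 mul, 2 sub}\\
1\ \text{inversion} &\;\longrightarrow\; \frac{3(M-1)\ \text{mul} + 1\ \text{inversion}}{M}
    \approx 3\ \text{mul} + \tfrac{1}{M}\ \text{inversion}
    && \text{(simultaneous inversion)}\\
\text{total} &\approx 6\ \text{mul} + 6\ \text{sub} + \tfrac{1}{M}\ \text{inversion}
\end{align*}
```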
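Integer Representation
Integers $A$, $B$, $C$, $D$ represented in radix $2^{16}$:
$$A = \sum_{i=0}^{m-1} a_i \cdot 2^{16 \cdot i}, \quad B = \sum_{i=0}^{m-1} b_i \cdot 2^{16 \cdot i}, \quad C = \sum_{i=0}^{m-1} c_i \cdot 2^{16 \cdot i}, \quad D = \sum_{i=0}^{m-1} d_i \cdot 2^{16 \cdot i}$$
4-SIMD layout, one limb of each integer per vector; each of the four words of a vector is split into a high and a low 16-bit half:
V[0]     = ( a_0      b_0      c_0      d_0     )
V[i]     = ( a_i      b_i      c_i      d_i     )
V[m - 1] = ( a_{m-1}  b_{m-1}  c_{m-1}  d_{m-1} )
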
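The interleaved layout can be modelled with plain Python lists. This sketch only illustrates the indexing; the helper names `to_limbs`, `pack4` and `unpack4` and the default $m = 8$ (enough for 128-bit values) are assumptions, and real SPU code would use 128-bit vector registers instead of tuples.

```python
LIMB_BITS = 16

def to_limbs(x, m):
    """Split x into m radix-2^16 limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & 0xFFFF for i in range(m)]

def pack4(a, b, c, d, m=8):
    """Interleave four integers so that V[i] holds limb i of a, b, c and d."""
    return list(zip(to_limbs(a, m), to_limbs(b, m), to_limbs(c, m), to_limbs(d, m)))

def unpack4(v):
    """Inverse of pack4: rebuild the four integers from the vector list."""
    return tuple(
        sum(limb << (LIMB_BITS * i) for i, limb in enumerate(lane))
        for lane in zip(*v)
    )

values = (2**127 + 5, 0x123456789ABC, 1, 2**128 - 1)
vectors = pack4(*values)
```

With this layout a single vector instruction acts on the same limb position of four independent integers at once.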
Modular Reduction
The 112-bit prime $p$ of the target curve $E(\mathbf{F}_p)$ is
$$p = \texttt{DB7C2ABF62E35E668076BEAD208B}_{16} = \frac{2^{128} - 3}{11 \cdot 6949}$$
Perform calculations using a redundant representation modulo
$$\tilde{p} = 11 \cdot 6949 \cdot p = 2^{128} - 3$$

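Fast reduction
Use the modulus $\tilde{p} = 2^{128} - 3 = 11 \cdot 6949 \cdot p$.
$$R : \mathbf{Z}/2^{256}\mathbf{Z} \to \mathbf{Z}/2^{256}\mathbf{Z}, \qquad x \mapsto (x \bmod 2^{128}) + 3 \cdot \lfloor x / 2^{128} \rfloor$$
$$x = x_H \cdot 2^{128} + x_L \equiv x_L + 3 \cdot x_H = R(x) \bmod \tilde{p}$$
(Figure: $x$ is split at $2^{128}$ into $x_h$ and $x_l$; $3 \cdot x_h$ is added to $x_l$, giving $x' = (x'_h, x'_l)$; $3 \cdot x'_h$ is added to $x'_l$, giving $v = (v_h, v_l)$ with $v_h \in \{0, 1\}$ and, with overwhelming probability, $v_h = 0$.)
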
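The map $R$ is short enough to state directly in code. The sketch below uses Python integers rather than 16-bit SIMD limbs, so it shows the arithmetic of the folding step only; the sample value `x` is arbitrary.

```python
P_TILDE = 2**128 - 3
MASK128 = (1 << 128) - 1

def R(x):
    """One folding step: low 128 bits plus three times the high part."""
    return (x & MASK128) + 3 * (x >> 128)

x = (0xDEADBEEF << 200) | 0x1234567890ABCDEF   # arbitrary value below 2^256
once, twice = R(x), R(R(x))
```

Both `once` and `twice` stay congruent to `x` modulo $\tilde{p}$ because $2^{128} \equiv 3 \pmod{\tilde{p}}$; only their size shrinks.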
Fast Modular Multiplication
Proposition: for independent random 128-bit non-negative integers $x$ and $y$ there is overwhelming probability that
$$0 \le R(R(x \cdot y)) < \tilde{p}$$
Counter-examples are easy to construct; in general only
$$0 \le R(R(x)) < 2^{128} + 6$$
is guaranteed. During the whole run not a single faulty reduction occurred.

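The bound $2^{128} + 6$ follows from applying the definition of $R$ twice. The derivation below is a reconstruction under the assumption $0 \le x < 2^{256}$, which holds for a product of two 128-bit integers; it is not spelled out in the talk.

```latex
\begin{align*}
x &= x_H \cdot 2^{128} + x_L, \qquad 0 \le x_H,\, x_L \le 2^{128} - 1\\
x' = R(x) &= x_L + 3\,x_H \;\le\; 4\,(2^{128} - 1) = 2^{130} - 4\\
&\Rightarrow\; x'_H \le 3, \quad\text{and } x'_L \le 2^{128} - 4 \text{ whenever } x'_H = 3\\
R(x') &= x'_L + 3\,x'_H \;\le\; \max\bigl\{(2^{128} - 4) + 9,\; (2^{128} - 1) + 6\bigr\}
       = 2^{128} + 5 \;<\; 2^{128} + 6
\end{align*}
```

A result outside $[0, \tilde{p})$ requires $x'_L + 3\,x'_H \ge 2^{128} - 3$, so $x'_L$ must fall among the top 12 values below $2^{128}$. Heuristically this is why the failure is negligible for random inputs yet easy to trigger on purpose.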
Distinguished Point Property
- Need to uniquely determine the partition number and the distinguished-point property during the r-adding walk.
- For $P = (x, y)$ the coordinate is held in redundant form $\tilde{x}$ with $0 \le \tilde{x} < \tilde{p}$.
- Partial Montgomery reduction in order to reduce modulo $p$: $x' = \tilde{x} \cdot 2^{-16} \bmod p$
- Check the least significant 24 bits of $x'$ in partial Montgomery representation.

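Why a reduction modulo $p$ is needed can be seen with a small experiment: every residue modulo $p$ has many representatives below $\tilde{p}$, and their low bits differ. The sketch below computes $x \cdot 2^{-16} \bmod p$ with big-integer arithmetic as a reference; it is not the limb-level Montgomery step used on the SPU, and taking the partition from the low 4 bits of the same value is an assumption made here.

```python
P_TILDE = 2**128 - 3
P = P_TILDE // (11 * 6949)            # the 112-bit prime from the slides
INV_2_16 = pow(2, -16, P)             # 2^-16 mod p (Python 3.8+)

def partial_montgomery(x_tilde):
    """Map any representative of a residue mod p to the unique value x * 2^-16 mod p."""
    return x_tilde * INV_2_16 % P

def partition_and_dp(x_tilde, part_bits=4, dp_bits=24):
    """Return (partition number, is-distinguished flag) from the reduced value."""
    xp = partial_montgomery(x_tilde)
    return xp & ((1 << part_bits) - 1), xp & ((1 << dp_bits) - 1) == 0

x = 123456789
representatives = [x + t * P for t in (0, 1, 1000, 76438)]   # all below P_TILDE
images = {partial_montgomery(r) for r in representatives}
```

All representatives collapse to one value, so the partition number and the distinguished-point test no longer depend on which redundant form a walk happens to hold.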
Modular Inversion
$z \equiv x^{-1} \bmod p$
- Based on the Extended Binary GCD algorithm; $r = 2^{32}$.
- Compute $\gcd(A_1, A_2)$ on the values $A_1, B_1, A_2, B_2$, with
  $A_1 \times 1 \equiv B_1 \times x \bmod p$
  $A_2 \times 1 \equiv B_2 \times x \bmod p$
- Obtain $z$ from the almost Montgomery inverse $x^{-1} \cdot 2^{k} \bmod p$.
- SIMD operations:
  $[A_1, B_1, A_2, B_2] \leftarrow [A_1 \gg t_1,\; B_1 \ll t_2,\; A_2 \gg t_2,\; B_2 \ll t_1]$
  $[A_1, B_1, A_2, B_2] \leftarrow [A_1 - A_2,\; B_1 - B_2,\; A_2,\; B_2]$
  $[A_1, B_1, A_2, B_2] \leftarrow [A_1,\; B_1,\; A_2 - A_1,\; B_2 - B_1]$
- Branches significantly reduced.

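For reference, the scalar textbook form of the almost Montgomery inverse (Kaliski's algorithm) is sketched below. It is not the branch-reduced SIMD variant from the talk; the variable names `u, v, r, s` follow the usual textbook presentation rather than the $A_1, B_1, A_2, B_2$ of the slide, and `p` is assumed to be an odd prime with $0 < x < p$.

```python
def almost_montgomery_inverse(x, p):
    """Return (r, k) with r == x^-1 * 2^k mod p, for an odd prime p and 0 < x < p."""
    u, v, r, s, k = p, x, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u >> 1, s << 1
        elif v % 2 == 0:
            v, r = v >> 1, r << 1
        elif u > v:
            u, r, s = (u - v) >> 1, r + s, s << 1
        else:
            v, s, r = (v - u) >> 1, s + r, r << 1
        k += 1
    if r >= p:
        r -= p
    return p - r, k

def inverse(x, p):
    """Remove the factor 2^k to obtain the ordinary inverse of x modulo p."""
    r, k = almost_montgomery_inverse(x, p)
    return r * pow(2, -k, p) % p

p = 2**127 - 1                        # a known prime, used only as a toy modulus
z = inverse(0x123456789ABCDEF, p)
```

Every loop iteration is a shift or a subtract-and-shift, which is what makes the method a good fit for hardware without a fast divider.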
Performance Results

Operation          #cycles per operation   #operations per iteration   #cycles per iteration
Mod Mul                     53                        6                         318
Mod Sub                      5                        6                          30
Partial Mon Red             24                        1                          24
Mod Inv                   4941                      1/400                        12
Misc.                       69                        1                          69
Total                                                                           453

[1 SPU, 4-SIMD @ 3.2 GHz]

Hence, our cluster of 214 PS3s computes $9.1 \cdot 10^9 \approx 2^{33}$ iterations per second.
It works on > 0.5M curves in parallel.

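The cluster figures can be reproduced from the table. The sketch assumes that 453 cycles is the amortized cost of one iteration on one SPU, that each PS3 contributes 6 SPUs, and that the 1/400 share of an inversion means 400 walks are batched per SPU; the last point is an inference from the table, not a statement from the talk.

```python
CYCLES_PER_ITERATION = 453      # total from the table
CLOCK_HZ = 3.2e9                # SPU clock
SPUS_PER_PS3 = 6                # SPEs available under Linux
CONSOLES = 214
BATCH = 400                     # assumption: walks sharing one inversion per SPU

rate = CONSOLES * SPUS_PER_PS3 * CLOCK_HZ / CYCLES_PER_ITERATION
walks = CONSOLES * SPUS_PER_PS3 * BATCH
print(f"{rate:.2e} iterations/s, {walks} parallel walks")
```

Under these assumptions the rate comes out at about $9.07 \cdot 10^9$ iterations per second and 513,600 parallel walks, in line with the rounded figures on the slide.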