Solving Discrete Logarithms in Smooth-Order Groups with CUDA
Ryan Henry and Ian Goldberg
Definition
Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.

Why do we care?
◮ Computing DLs is apparently difficult for classical computers
◮ The inverse problem (modular exponentiation) is easy
◮ Many cryptographic protocols exploit this asymmetry
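To make the asymmetry concrete, here is a minimal Python sketch (ours, not from the slides; the toy prime p = 101 and generator g = 2 are hypothetical parameters): the forward direction costs O(log x) multiplications, while the naive search costs Θ(n) group operations.

```python
p, g = 101, 2        # hypothetical toy parameters: G = Z_101^*, generator g = 2
x = 57               # the secret exponent

alpha = pow(g, x, p)             # easy direction: O(log x) multiplications

def brute_force_dl(g, alpha, p, n):
    """Exhaustive search for x with g^x = alpha: Θ(n) in the worst case."""
    acc = 1                      # acc = g^e as e counts up
    for e in range(n):
        if acc == alpha:
            return e
        acc = (acc * g) % p      # one group operation per candidate
    raise ValueError("alpha is not in the subgroup generated by g")

assert brute_force_dl(g, alpha, p, 100) == x
```

Generic algorithms like Pollard's rho improve the naive Θ(n) search to Θ(√n), which is the subject of Part I.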
Definition
An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some “suitably small” value of B.

Why do we care?
◮ If ϕ(N) is B-smooth, then Z*_N has smooth order
◮ Many DL-based cryptographic protocols work in Z*_N
◮ Pollard’s rho algorithm (plus Pohlig-Hellman) solves DLs in time proportional to the smoothness of the group order
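The definition is easy to operationalize: trial division by every integer up to B suffices, since a composite trial divisor never fires (its prime factors are divided out first). A small sketch of ours:

```python
def is_b_smooth(n, B):
    """Return True iff every prime factor of n is at most B."""
    for d in range(2, B + 1):    # composite d are harmless: their prime
        while n % d == 0:        # factors have already been divided out
            n //= d
    return n == 1                # B-smooth iff nothing > B remains

assert is_b_smooth(2**10 * 3**5 * 7**2, 7)   # largest prime factor is 7
assert not is_b_smooth(2**10 * 101, 7)       # 101 > 7
```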
Definition
The Compute Unified Device Architecture (CUDA) is Nvidia’s parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.

Why do we care?
◮ Nvidia GPUs are widely deployed, and offer a better price-to-GFLOPS ratio than CPUs
◮ Modern GPUs have many cores and support highly parallel computation
◮ Pollard’s rho algorithm is extremely parallelizable
In this presentation, we...
◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs
◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance
◮ point out a simple attack on Boudot’s zero-knowledge range proofs
◮ construct and analyze trapdoor discrete logarithm groups
Part I: Pollard’s rho
Pollard’s rho algorithm (1/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Key observation:
◮ Consider elements g^a h^b ∈ G and search for collisions
◮ Since g^{a_1}h^{b_1} = g^{a_2}h^{b_2} ⇒ g^{a_1−a_2} = h^{b_2−b_1}, we have a_1 − a_2 ≡ x(b_2 − b_1) mod n ⇒ x ≡ (a_1 − a_2)(b_2 − b_1)⁻¹ mod n
◮ Birthday paradox: about √(πn/2) selections should suffice ⇒ expected runtime and storage in Θ(√n)
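To see the collision algebra in action, a short Python sketch (ours; the toy parameters p = 101, g = 2 are hypothetical, and the colliding exponent pairs are engineered using x purely for illustration):

```python
p = 101                      # toy prime; work in G = Z_p^*
n, g = p - 1, 2              # 2 generates Z_101^*, so g has order n = 100
x = 57                       # the secret logarithm: h = g^x
h = pow(g, x, p)

# Two exponent pairs engineered so that g^a1 h^b1 == g^a2 h^b2.
a1, b1, b2 = 30, 4, 5
a2 = (a1 - x * (b2 - b1)) % n
assert (pow(g, a1, p) * pow(h, b1, p)) % p == (pow(g, a2, p) * pow(h, b2, p)) % p

# a1 - a2 ≡ x (b2 - b1) (mod n); solvable since gcd(b2 - b1, n) = 1.
x_recovered = (a1 - a2) * pow(b2 - b1, -1, n) % n
assert x_recovered == x
```

(When gcd(b_2 − b_1, n) > 1 the congruence has several solutions; one tests the candidates or waits for another collision.)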
Pollard’s rho algorithm (2/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Pollard’s idea:
◮ Walk through G using an iteration function f : G → G, f(g^{a_i}h^{b_i}) = g^{a_{i+1}}h^{b_{i+1}}
◮ Collisions ⇒ cycles, which are cheap to detect
◮ If the iteration function behaves “randomly enough”, then expected runtime is in Θ(√n) and storage is in Θ(1)
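A serial Python sketch of the idea (ours, with hypothetical toy parameters; the partition by y mod 3 is Pollard’s classic choice of a “random enough” iteration function, and Floyd’s tortoise-and-hare walk detects the cycle in constant storage):

```python
from math import gcd

def pollard_rho_dl(g, h, p, n):
    """Serial Pollard rho for x with g^x = h in Z_p^*, where g has order n."""
    def step(y, a, b):           # maintain the invariant y = g^a h^b
        if y % 3 == 0:
            return (y * g) % p, (a + 1) % n, b            # multiply by g
        elif y % 3 == 1:
            return (y * y) % p, (2 * a) % n, (2 * b) % n  # square
        return (y * h) % p, a, (b + 1) % n                # multiply by h

    # Start at g, not the identity (the identity is a fixed point of squaring).
    y1, a1, b1 = g, 1, 0
    y2, a2, b2 = g, 1, 0
    while True:
        y1, a1, b1 = step(y1, a1, b1)           # tortoise: one step
        y2, a2, b2 = step(*step(y2, a2, b2))    # hare: two steps
        if y1 == y2:                            # collision found
            d = (b1 - b2) % n
            if d == 0 or gcd(d, n) != 1:
                return None     # degenerate collision; retry in practice
            return (a2 - a1) * pow(d, -1, n) % n

# Toy run; on a degenerate collision the sketch simply reports failure.
p, n, g = 101, 100, 2
h = pow(g, 57, p)
x = pollard_rho_dl(g, h, p, n)
assert x is None or pow(g, x, p) == h
```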
Pollard’s rho algorithm (3/4)

[Figure: the walk g^{a_0}h^{b_0} → g^{a_1}h^{b_1} → g^{a_2}h^{b_2} → ⋯ eventually revisits a point: a tail leads into a cycle that is entered at g^{a_i}h^{b_i} and closed at g^{a_j}h^{b_j}, tracing the shape of the letter ρ.]
Pollard’s rho algorithm (4/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

van Oorschot and Wiener’s idea:
◮ Define a distinguished point (DP) as any point with some cheap-to-detect property (e.g., m trailing zeros)
◮ Run Ψ client threads in parallel, each reporting DPs to a central server that checks for collisions
◮ Expected runtime is in Θ(√n / Ψ)
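A Python sketch of the distinguished-point scheme, simulated serially (ours; in the real system each walk is a GPU thread and the `seen` table lives on the central server; all parameters here are illustrative):

```python
import random
from math import gcd

def parallel_rho_dp(g, h, p, n, walkers=8, dp_bits=3, max_steps=100_000):
    """van Oorschot-Wiener parallel rho: Ψ = walkers independent walks,
    reporting points with dp_bits trailing zero bits to a shared table."""
    def step(y, a, b):           # same iteration function as serial rho
        if y % 3 == 0:
            return (y * g) % p, (a + 1) % n, b
        elif y % 3 == 1:
            return (y * y) % p, (2 * a) % n, (2 * b) % n
        return (y * h) % p, a, (b + 1) % n

    def fresh_walk():            # random starting point g^a h^b
        a, b = random.randrange(n), random.randrange(n)
        return (pow(g, a, p) * pow(h, b, p)) % p, a, b

    seen = {}                    # the "server": DP value -> (a, b)
    walks = [fresh_walk() for _ in range(walkers)]
    for _ in range(max_steps):
        for i in range(walkers):
            y, a, b = step(*walks[i])
            walks[i] = (y, a, b)
            if y & ((1 << dp_bits) - 1) == 0:       # distinguished point
                if y in seen:
                    a2, b2 = seen[y]
                    d = (b2 - b) % n
                    if d != 0 and gcd(d, n) == 1:
                        return (a - a2) * pow(d, -1, n) % n
                    walks[i] = fresh_walk()         # degenerate: restart walk
                else:
                    seen[y] = (a, b)
    return None                  # step budget exhausted (toy sizes only)

p, n, g = 101, 100, 2            # hypothetical toy parameters
h = pow(g, 57, p)
x = parallel_rho_dp(g, h, p, n)
assert x is None or pow(g, x, p) == h
```

Only DPs cross the client-server boundary, so communication shrinks by a factor of about 2^dp_bits while any two colliding walks still meet at the next DP after their paths merge.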
Part II: GPUs and CUDA
SMPs and CUDA cores
◮ GPU has several streaming multiprocessors (SMPs)
◮ Our Tesla M2050 cards each have 14 SMPs
◮ SIMD architecture

[Figure: block diagram of a Fermi SMP — instruction cache, two warp schedulers with dispatch units, a register file of 2¹⁵ × 32-bit registers, 32 CUDA cores, load/store (LD/ST) units, special function units (SFUs), an interconnect network, 64 KB shared memory / L1 cache, and a uniform cache. Each CUDA core contains a dispatch port, an operand collector, an INT unit, an FPU, and a result queue.]
CUDA memory hierarchy
◮ Developer manages memory explicitly
◮ 1 clock pulse for shared memory and L1 cache
◮ ≈ 300 clock pulses for local RAM
◮ Many more clock pulses for system RAM

[Figure: a thread’s view of the memory hierarchy — shared memory / L1 cache, then L2 cache, then local RAM.]
Tesla M2050

Nvidia Tesla M2050 GPU cards (price: 1,299.00 USD):
◮ Based on the Fermi architecture
◮ 14 SMPs × 32 cores / SMP = 448 cores (each running at 1.55 GHz)
◮ 2¹⁵ × 32-bit registers / SMP
◮ Configurable: 64 KB shared memory / L1 cache
◮ 3 GB of GDDR5 local RAM

Our experiments used a host PC with:
◮ Intel Xeon E5620 quad core (2.4 GHz)
◮ 2 × 4 GB of DDR3-1333 RAM
◮ 2 × Tesla M2050 GPU cards
Part III: Implementation
CUDA modular multiplication (1/2)
◮ Iteration function for Pollard rho:

      f(x) = g·x   if 0 ≤ x < q/3
             x²    if q/3 ≤ x < 2q/3
             h·x   if 2q/3 ≤ x < q

◮ Need fast, multiprecision modular multiplication to solve DLs in Z*_N
◮ We used Koç et al.’s CIOS algorithm for Montgomery multiplication
◮ Low auxiliary storage ⇒ lots of threads
◮ We use one thread per multiplication
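For reference, a word-level Python transcription of the CIOS (Coarsely Integrated Operand Scanning) method (our sketch, with W = 32 to mirror 32-bit GPU limbs; the talk’s actual kernel is CUDA, not this code):

```python
W = 32                           # limb width, matching 32-bit GPU words
MASK = (1 << W) - 1

def to_words(x, s):
    """Little-endian W-bit limbs of x."""
    return [(x >> (W * i)) & MASK for i in range(s)]

def cios_montmul(a, b, n, n0_inv, s):
    """CIOS Montgomery multiplication: a*b*R^{-1} mod n with R = 2^(W*s).
    a, b, n are little-endian limb arrays (a, b < n); n0_inv = -n^{-1} mod 2^W."""
    t = [0] * (s + 2)
    for i in range(s):
        c = 0                                   # step 1: t += a * b[i]
        for j in range(s):
            cs = t[j] + a[j] * b[i] + c
            t[j], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s], t[s + 1] = cs & MASK, cs >> W
        m = (t[0] * n0_inv) & MASK              # step 2: add m*n so the low
        c = (t[0] + m * n[0]) >> W              # limb vanishes, then shift
        for j in range(1, s):
            cs = t[j] + m * n[j] + c
            t[j - 1], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s - 1], c = cs & MASK, cs >> W
        t[s] = t[s + 1] + c
    r = sum(t[i] << (W * i) for i in range(s + 1))
    n_full = sum(n[i] << (W * i) for i in range(s))
    return r - n_full if r >= n_full else r     # one conditional subtraction

# Toy check with a hypothetical odd 64-bit modulus (s = 2 limbs):
n_int, s = 0xFFFFFFFFFFFFFFC5, 2
R = 1 << (W * s)
n0_inv = (-pow(n_int & MASK, -1, 1 << W)) & MASK
a_int, b_int = 0x123456789ABCDEF0 % n_int, 0xFEDCBA9876543210 % n_int
a_bar = to_words(a_int * R % n_int, s)          # Montgomery forms aR, bR
b_bar = to_words(b_int * R % n_int, s)
prod = cios_montmul(a_bar, b_bar, to_words(n_int, s), n0_inv, s)
assert prod == (a_int * b_int % n_int) * R % n_int   # result is abR mod n
```

Because CIOS interleaves multiplication and reduction, the working set is only s + 2 limbs, which is exactly the low-auxiliary-storage property that lets many threads run concurrently.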
CUDA modular multiplication (2/2)

Table: k-bit modular multiplications per second and (amortized) time per k-bit modular multiplication on a single Tesla M2050.

  Bit length     Time per trial        Amortized time     Modmults
  of modulus     ± std dev             per modmult        per second
   192             30.538 s ±   4 ms     1.19 ns          ≈ 840,336,000
   256             50.916 s ±   5 ms     1.98 ns          ≈ 505,050,000
   512            186.969 s ±   4 ms     7.30 ns          ≈ 136,986,000
   768            492.6 s   ± 200 ms    19.24 ns          ≈  51,975,000
  1024           2304.5 s   ± 300 ms    90.02 ns          ≈  11,108,000

◮ Larger k ⇒ each multiplication takes longer
◮ Larger k ⇒ fewer multiplications can be computed in parallel
CUDA Pollard rho (1/2)

Goal
Compute discrete logarithms modulo k_N-bit RSA numbers N = pq with 2^{k_B}-smooth totient.

Our implementation:
◮ Optimized for k_N = 1536 and k_B ≈ 55
◮ Assumes that the factorization of p − 1 and q − 1 is known
◮ Uses the Pohlig-Hellman approach to decompose the problem into k_B-bit subproblems
◮ Distinguished points: at least 10 trailing zeros in binary (Montgomery) representation
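The Pohlig-Hellman decomposition itself is easy to sketch in Python (ours; brute force stands in for the rho solver on each small subproblem, and the toy parameters are hypothetical):

```python
def pohlig_hellman(g, h, p, factors):
    """Pohlig-Hellman in Z_p^* with smooth order n = p - 1. `factors`
    are pairwise-coprime prime-power divisors of n with product n."""
    n = p - 1
    x = 0
    for q in factors:
        g_q = pow(g, n // q, p)      # generator of the order-q subgroup
        h_q = pow(h, n // q, p)
        # solve g_q^e = h_q, i.e. recover x mod q (stand-in for rho)
        x_q = next(e for e in range(q) if pow(g_q, e, p) == h_q)
        m = n // q                   # CRT accumulation: x ≡ x_q (mod q)
        x = (x + x_q * m * pow(m, -1, q)) % n
    return x

# Toy run: p - 1 = 100 = 4 * 25 is 5-smooth.
p, g = 101, 2
h = pow(g, 57, p)
assert pohlig_hellman(g, h, p, [4, 25]) == 57
```

With a 2^{k_B}-smooth totient every subproblem is at most k_B bits, so each rho instance costs only about 2^{k_B/2} group operations rather than the 2^{k_N/2} a direct attack would need.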