Solving Discrete Logarithms in Smooth-Order Groups with CUDA
Ryan Henry and Ian Goldberg
Definition
Let G be a cyclic group of order q and let g ∈ G be a generator. Given α ∈ G, the discrete logarithm (DL) problem is to find x ∈ Z_q such that g^x = α.

Why do we care?
◮ Computing DLs is apparently difficult for classical computers
◮ The inverse problem (modular exponentiation) is easy
◮ Many cryptographic protocols exploit this asymmetry
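To make the asymmetry concrete, here is a minimal Python sketch (ours, not from the slides; the toy prime p = 101 and generator g = 2 are hypothetical parameters): the forward direction costs O(log x) multiplications, while the naive search costs Θ(n) group operations.

```python
p, g = 101, 2        # hypothetical toy parameters: G = Z_101^*, generator g = 2
x = 57               # the secret exponent

alpha = pow(g, x, p)             # easy direction: O(log x) multiplications

def brute_force_dl(g, alpha, p, n):
    """Exhaustive search for x with g^x = alpha: Θ(n) in the worst case."""
    acc = 1                      # acc = g^e as e counts up
    for e in range(n):
        if acc == alpha:
            return e
        acc = (acc * g) % p      # one group operation per candidate
    raise ValueError("alpha is not in the subgroup generated by g")

assert brute_force_dl(g, alpha, p, 100) == x
```

Generic algorithms like Pollard's rho improve the naive Θ(n) search to Θ(√n), which is the subject of Part I.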
Definition
An integer n is called B-smooth if each of its prime factors is bounded above by B. A smooth-order group is just a group whose order is B-smooth for some “suitably small” value of B.

Why do we care?
◮ If ϕ(N) is B-smooth, then Z*_N has smooth order
◮ Many DL-based cryptographic protocols work in Z*_N
◮ Pollard’s rho algorithm (plus Pohlig-Hellman) solves DLs in time proportional to the smoothness of the group order
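The definition is easy to operationalize: trial division by every integer up to B suffices, since a composite trial divisor never fires (its prime factors are divided out first). A small sketch of ours:

```python
def is_b_smooth(n, B):
    """Return True iff every prime factor of n is at most B."""
    for d in range(2, B + 1):    # composite d are harmless: their prime
        while n % d == 0:        # factors have already been divided out
            n //= d
    return n == 1                # B-smooth iff nothing > B remains

assert is_b_smooth(2**10 * 3**5 * 7**2, 7)   # largest prime factor is 7
assert not is_b_smooth(2**10 * 101, 7)       # 101 > 7
```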
Definition
The Compute Unified Device Architecture (CUDA) is Nvidia’s parallel computing architecture. It enables developers to use CUDA-enabled Nvidia GPUs for general-purpose computing.

Why do we care?
◮ Nvidia GPUs are widely deployed, and offer a better price-to-GFLOPS ratio than CPUs
◮ Modern GPUs have many cores and support highly parallel computation
◮ Pollard’s rho algorithm is extremely parallelizable
In this presentation, we...
◮ describe Pollard’s rho algorithm and its parallel variant
◮ discuss CUDA and GPGPU computing on Nvidia GPUs
◮ present our implementation of modular multiplication and parallel rho in CUDA and analyze its performance
◮ point out a simple attack on Boudot’s zero-knowledge range proofs
◮ construct and analyze trapdoor discrete logarithm groups
Part I: Pollard’s rho
Pollard’s rho algorithm (1/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Key observation:
◮ Consider elements g^a h^b ∈ G and search for collisions
◮ Since g^{a_1}h^{b_1} = g^{a_2}h^{b_2} ⇒ g^{a_1−a_2} = h^{b_2−b_1}, we have a_1 − a_2 ≡ x(b_2 − b_1) mod n ⇒ x ≡ (a_1 − a_2)(b_2 − b_1)⁻¹ mod n
◮ Birthday paradox: about √(πn/2) selections should suffice ⇒ expected runtime and storage in Θ(√n)
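To see the collision algebra in action, a short Python sketch (ours; the toy parameters p = 101, g = 2 are hypothetical, and the colliding exponent pairs are engineered using x purely for illustration):

```python
p = 101                      # toy prime; work in G = Z_p^*
n, g = p - 1, 2              # 2 generates Z_101^*, so g has order n = 100
x = 57                       # the secret logarithm: h = g^x
h = pow(g, x, p)

# Two exponent pairs engineered so that g^a1 h^b1 == g^a2 h^b2.
a1, b1, b2 = 30, 4, 5
a2 = (a1 - x * (b2 - b1)) % n
assert (pow(g, a1, p) * pow(h, b1, p)) % p == (pow(g, a2, p) * pow(h, b2, p)) % p

# a1 - a2 ≡ x (b2 - b1) (mod n); solvable since gcd(b2 - b1, n) = 1.
x_recovered = (a1 - a2) * pow(b2 - b1, -1, n) % n
assert x_recovered == x
```

(When gcd(b_2 − b_1, n) > 1 the congruence has several solutions; one tests the candidates or waits for another collision.)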
Pollard’s rho algorithm (2/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

Pollard’s idea:
◮ Walk through G using an iteration function f : G → G, f(g^{a_i}h^{b_i}) = g^{a_{i+1}}h^{b_{i+1}}
◮ Collisions ⇒ cycles, which are cheap to detect
◮ If the iteration function behaves “randomly enough”, then expected runtime is in Θ(√n) and storage is in Θ(1)
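A serial Python sketch of the idea (ours, with hypothetical toy parameters; the partition by y mod 3 is Pollard’s classic choice of a “random enough” iteration function, and Floyd’s tortoise-and-hare walk detects the cycle in constant storage):

```python
from math import gcd

def pollard_rho_dl(g, h, p, n):
    """Serial Pollard rho for x with g^x = h in Z_p^*, where g has order n."""
    def step(y, a, b):           # maintain the invariant y = g^a h^b
        if y % 3 == 0:
            return (y * g) % p, (a + 1) % n, b            # multiply by g
        elif y % 3 == 1:
            return (y * y) % p, (2 * a) % n, (2 * b) % n  # square
        return (y * h) % p, a, (b + 1) % n                # multiply by h

    # Start at g, not the identity (the identity is a fixed point of squaring).
    y1, a1, b1 = g, 1, 0
    y2, a2, b2 = g, 1, 0
    while True:
        y1, a1, b1 = step(y1, a1, b1)           # tortoise: one step
        y2, a2, b2 = step(*step(y2, a2, b2))    # hare: two steps
        if y1 == y2:                            # collision found
            d = (b1 - b2) % n
            if d == 0 or gcd(d, n) != 1:
                return None     # degenerate collision; retry in practice
            return (a2 - a1) * pow(d, -1, n) % n

# Toy run; on a degenerate collision the sketch simply reports failure.
p, n, g = 101, 100, 2
h = pow(g, 57, p)
x = pollard_rho_dl(g, h, p, n)
assert x is None or pow(g, x, p) == h
```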
Pollard’s rho algorithm (3/4)

[Figure: the walk g^{a_0}h^{b_0} → g^{a_1}h^{b_1} → g^{a_2}h^{b_2} → ⋯ eventually revisits a point: a tail leads into a cycle that is entered at g^{a_i}h^{b_i} and closed at g^{a_j}h^{b_j}, tracing the shape of the letter ρ.]
Pollard’s rho algorithm (4/4)

Problem
Given g, h ∈ G, compute the discrete logarithm x ∈ Z_n of h with respect to g.

van Oorschot and Wiener’s idea:
◮ Define a distinguished point (DP) as any point with some cheap-to-detect property (e.g., m trailing zeros)
◮ Run Ψ client threads in parallel, each reporting DPs to a central server that checks for collisions
◮ Expected runtime is in Θ(√n / Ψ)
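A Python sketch of the distinguished-point scheme, simulated serially (ours; in the real system each walk is a GPU thread and the `seen` table lives on the central server; all parameters here are illustrative):

```python
import random
from math import gcd

def parallel_rho_dp(g, h, p, n, walkers=8, dp_bits=3, max_steps=100_000):
    """van Oorschot-Wiener parallel rho: Ψ = walkers independent walks,
    reporting points with dp_bits trailing zero bits to a shared table."""
    def step(y, a, b):           # same iteration function as serial rho
        if y % 3 == 0:
            return (y * g) % p, (a + 1) % n, b
        elif y % 3 == 1:
            return (y * y) % p, (2 * a) % n, (2 * b) % n
        return (y * h) % p, a, (b + 1) % n

    def fresh_walk():            # random starting point g^a h^b
        a, b = random.randrange(n), random.randrange(n)
        return (pow(g, a, p) * pow(h, b, p)) % p, a, b

    seen = {}                    # the "server": DP value -> (a, b)
    walks = [fresh_walk() for _ in range(walkers)]
    for _ in range(max_steps):
        for i in range(walkers):
            y, a, b = step(*walks[i])
            walks[i] = (y, a, b)
            if y & ((1 << dp_bits) - 1) == 0:       # distinguished point
                if y in seen:
                    a2, b2 = seen[y]
                    d = (b2 - b) % n
                    if d != 0 and gcd(d, n) == 1:
                        return (a - a2) * pow(d, -1, n) % n
                    walks[i] = fresh_walk()         # degenerate: restart walk
                else:
                    seen[y] = (a, b)
    return None                  # step budget exhausted (toy sizes only)

p, n, g = 101, 100, 2            # hypothetical toy parameters
h = pow(g, 57, p)
x = parallel_rho_dp(g, h, p, n)
assert x is None or pow(g, x, p) == h
```

Only DPs cross the client-server boundary, so communication shrinks by a factor of about 2^dp_bits while any two colliding walks still meet at the next DP after their paths merge.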
Part II: GPUs and CUDA
SMPs and CUDA cores
◮ GPU has several streaming multiprocessors (SMPs)
◮ Our Tesla M2050 cards each have 14 SMPs
◮ SIMD architecture

[Figure: block diagram of a Fermi SMP — instruction cache, two warp schedulers with dispatch units, a register file of 2¹⁵ × 32-bit registers, 32 CUDA cores, load/store (LD/ST) units, special function units (SFUs), an interconnect network, 64 KB shared memory / L1 cache, and a uniform cache. Each CUDA core contains a dispatch port, an operand collector, an INT unit, an FPU, and a result queue.]
CUDA memory hierarchy
◮ Developer manages memory explicitly
◮ 1 clock pulse for shared memory and L1 cache
◮ ≈ 300 clock pulses for local RAM
◮ Many more clock pulses for system RAM

[Figure: a thread’s view of the memory hierarchy — shared memory / L1 cache, then L2 cache, then local RAM.]
Tesla M2050

Nvidia Tesla M2050 GPU cards (price: 1,299.00 USD):
◮ Based on the Fermi architecture
◮ 14 SMPs × 32 cores / SMP = 448 cores (each running at 1.55 GHz)
◮ 2¹⁵ × 32-bit registers / SMP
◮ Configurable: 64 KB shared memory / L1 cache
◮ 3 GB of GDDR5 local RAM

Our experiments used a host PC with:
◮ Intel Xeon E5620 quad core (2.4 GHz)
◮ 2 × 4 GB of DDR3-1333 RAM
◮ 2 × Tesla M2050 GPU cards
Part III: Implementation
CUDA modular multiplication (1/2)
◮ Iteration function for Pollard rho:

      f(x) = g·x   if 0 ≤ x < q/3
             x²    if q/3 ≤ x < 2q/3
             h·x   if 2q/3 ≤ x < q

◮ Need fast, multiprecision modular multiplication to solve DLs in Z*_N
◮ We used Koç et al.’s CIOS algorithm for Montgomery multiplication
◮ Low auxiliary storage ⇒ lots of threads
◮ We use one thread per multiplication
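For reference, a word-level Python transcription of the CIOS (Coarsely Integrated Operand Scanning) method (our sketch, with W = 32 to mirror 32-bit GPU limbs; the talk’s actual kernel is CUDA, not this code):

```python
W = 32                           # limb width, matching 32-bit GPU words
MASK = (1 << W) - 1

def to_words(x, s):
    """Little-endian W-bit limbs of x."""
    return [(x >> (W * i)) & MASK for i in range(s)]

def cios_montmul(a, b, n, n0_inv, s):
    """CIOS Montgomery multiplication: a*b*R^{-1} mod n with R = 2^(W*s).
    a, b, n are little-endian limb arrays (a, b < n); n0_inv = -n^{-1} mod 2^W."""
    t = [0] * (s + 2)
    for i in range(s):
        c = 0                                   # step 1: t += a * b[i]
        for j in range(s):
            cs = t[j] + a[j] * b[i] + c
            t[j], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s], t[s + 1] = cs & MASK, cs >> W
        m = (t[0] * n0_inv) & MASK              # step 2: add m*n so the low
        c = (t[0] + m * n[0]) >> W              # limb vanishes, then shift
        for j in range(1, s):
            cs = t[j] + m * n[j] + c
            t[j - 1], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s - 1], c = cs & MASK, cs >> W
        t[s] = t[s + 1] + c
    r = sum(t[i] << (W * i) for i in range(s + 1))
    n_full = sum(n[i] << (W * i) for i in range(s))
    return r - n_full if r >= n_full else r     # one conditional subtraction

# Toy check with a hypothetical odd 64-bit modulus (s = 2 limbs):
n_int, s = 0xFFFFFFFFFFFFFFC5, 2
R = 1 << (W * s)
n0_inv = (-pow(n_int & MASK, -1, 1 << W)) & MASK
a_int, b_int = 0x123456789ABCDEF0 % n_int, 0xFEDCBA9876543210 % n_int
a_bar = to_words(a_int * R % n_int, s)          # Montgomery forms aR, bR
b_bar = to_words(b_int * R % n_int, s)
prod = cios_montmul(a_bar, b_bar, to_words(n_int, s), n0_inv, s)
assert prod == (a_int * b_int % n_int) * R % n_int   # result is abR mod n
```

Because CIOS interleaves multiplication and reduction, the working set is only s + 2 limbs, which is exactly the low-auxiliary-storage property that lets many threads run concurrently.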
CUDA modular multiplication (2/2)

Table: k-bit modular multiplications per second and (amortized) time per k-bit modular multiplication on a single Tesla M2050.

  Bit length     Time per trial        Amortized time     Modmults
  of modulus     ± std dev             per modmult        per second
   192             30.538 s ±   4 ms     1.19 ns          ≈ 840,336,000
   256             50.916 s ±   5 ms     1.98 ns          ≈ 505,050,000
   512            186.969 s ±   4 ms     7.30 ns          ≈ 136,986,000
   768            492.6 s   ± 200 ms    19.24 ns          ≈  51,975,000
  1024           2304.5 s   ± 300 ms    90.02 ns          ≈  11,108,000

◮ Larger k ⇒ each multiplication takes longer
◮ Larger k ⇒ fewer multiplications can be computed in parallel
CUDA Pollard rho (1/2)

Goal
Compute discrete logarithms modulo k_N-bit RSA numbers N = pq with 2^{k_B}-smooth totient.

Our implementation:
◮ Optimized for k_N = 1536 and k_B ≈ 55
◮ Assumes that the factorization of p − 1 and q − 1 is known
◮ Uses the Pohlig-Hellman approach to decompose the problem into k_B-bit subproblems
◮ Distinguished points: at least 10 trailing zeros in binary (Montgomery) representation
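The Pohlig-Hellman decomposition itself is easy to sketch in Python (ours; brute force stands in for the rho solver on each small subproblem, and the toy parameters are hypothetical):

```python
def pohlig_hellman(g, h, p, factors):
    """Pohlig-Hellman in Z_p^* with smooth order n = p - 1. `factors`
    are pairwise-coprime prime-power divisors of n with product n."""
    n = p - 1
    x = 0
    for q in factors:
        g_q = pow(g, n // q, p)      # generator of the order-q subgroup
        h_q = pow(h, n // q, p)
        # solve g_q^e = h_q, i.e. recover x mod q (stand-in for rho)
        x_q = next(e for e in range(q) if pow(g_q, e, p) == h_q)
        m = n // q                   # CRT accumulation: x ≡ x_q (mod q)
        x = (x + x_q * m * pow(m, -1, q)) % n
    return x

# Toy run: p - 1 = 100 = 4 * 25 is 5-smooth.
p, g = 101, 2
h = pow(g, 57, p)
assert pohlig_hellman(g, h, p, [4, 25]) == 57
```

With a 2^{k_B}-smooth totient every subproblem is at most k_B bits, so each rho instance costs only about 2^{k_B/2} group operations rather than the 2^{k_N/2} a direct attack would need.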