Implementation of RSA 2048 on GPUs Marcelo E. Kaihara EPFL – LACAL Nov. 4, 2010
Motivation
NIST Recommendations for Key Management (SP 800-57).
NIST draft recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP 800-131): RSA 1024 deprecated from January 1, 2011.
RSA 2048: 8x the computational effort.
Objective
Use GPUs as cryptographic accelerators to offload work from the CPU.
Low latency: parallel implementation.
Generic implementation: OpenCL.
Server application: speed.
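RSA 2048 Decryption
Decryption: $m = p \cdot q$, $e \cdot d \equiv 1 \pmod{\varphi(m)}$, $z = c^{d} \bmod m$.
Precomputed values $(p, q, dP, dQ, qInv)$: $dP = e^{-1} \bmod (p - 1)$, $dQ = e^{-1} \bmod (q - 1)$, $qInv = q^{-1} \bmod p$.
Chinese Remainder Theorem: $z_1 = c^{dP} \bmod p$, $z_2 = c^{dQ} \bmod q$, $h = qInv \cdot (z_1 - z_2) \bmod p$, $z = z_2 + h \cdot q$.
Modular exponentiation with 1024-bit moduli (32 limbs of 32 bits): $s = 32$, $B = 2^{32}$.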
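The CRT decryption steps above can be sketched in a few lines of Python. This is a minimal illustration with toy primes, not the GPU implementation; the function names `crt_precompute` and `rsa_decrypt_crt` and the sample values are assumptions made for this example, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
def crt_precompute(p, q, e):
    """Return (dP, dQ, qInv) as defined on the slide."""
    dP = pow(e, -1, p - 1)   # dP = e^-1 mod (p - 1)
    dQ = pow(e, -1, q - 1)   # dQ = e^-1 mod (q - 1)
    qInv = pow(q, -1, p)     # qInv = q^-1 mod p
    return dP, dQ, qInv


def rsa_decrypt_crt(c, p, q, dP, dQ, qInv):
    """Recover z = c^d mod (p*q) from two half-size exponentiations."""
    z1 = pow(c, dP, p)             # z1 = c^dP mod p
    z2 = pow(c, dQ, q)             # z2 = c^dQ mod q
    h = (qInv * (z1 - z2)) % p     # h = qInv * (z1 - z2) mod p
    return z2 + h * q              # z = z2 + h * q


# Toy parameters (hypothetical, far too small for real use).
p, q, e = 61, 53, 17
dP, dQ, qInv = crt_precompute(p, q, e)
message = 65
ciphertext = pow(message, e, p * q)
assert rsa_decrypt_crt(ciphertext, p, q, dP, dQ, qInv) == message
```

For RSA 2048 the two calls to `pow` are the 1024-bit modular exponentiations that the rest of the talk maps onto the GPU.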
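Montgomery Multiplication
General overview: operands $u, v$ in ordinary representation are mapped to $\tilde{u}, \tilde{v}$ in Montgomery representation; the product $\tilde{z} = \tilde{u} * \tilde{v}$ is computed there and mapped back to $z$.
Sequential multiplications are performed in Montgomery representation.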
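Montgomery Multiplication
Montgomery radix: $R = B^{s} > m$, $\gcd(R, m) = 1$.
Ordinary representation $(\mathbb{Z}/m\mathbb{Z}, \cdot)$ is isomorphic to Montgomery representation $(\mathbb{Z}/m\mathbb{Z}, *)$:
$\tilde{u} = u R \bmod m$, $\tilde{v} = v R \bmod m$.
$u \cdot v = (u v) \bmod m$ corresponds to $\tilde{u} * \tilde{v} = \tilde{u} \tilde{v} R^{-1} \bmod m$.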
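Montgomery Multiplication
Definition: for a large odd integer $m$ with $\gcd(m, B) = 1$ and $\tilde{u}, \tilde{v} \in \mathbb{Z}/m\mathbb{Z}$,
$\tilde{u} * \tilde{v} = \tilde{u} \tilde{v} R^{-1} \bmod m$, where $R > m$ (usually $R = B^{s}$).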
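The mapping between the two representations can be checked numerically. The sketch below uses Python's built-in big integers and a modular inverse of $R$ as a reference definition only; the helper names (`to_mont`, `from_mont`, `mont_mul_ref`) and the sample modulus are assumptions for illustration.

```python
B = 2**32  # limb radix, as on the slides


def to_mont(u, m, R):
    """Ordinary -> Montgomery representation: u~ = u * R mod m."""
    return (u * R) % m


def from_mont(u_t, m, R):
    """Montgomery -> ordinary representation: u = u~ * R^-1 mod m."""
    return (u_t * pow(R, -1, m)) % m


def mont_mul_ref(u_t, v_t, m, R):
    """Reference Montgomery product: u~ * v~ = u~ v~ R^-1 mod m."""
    return (u_t * v_t * pow(R, -1, m)) % m


# Hypothetical toy values: an odd modulus below R = B^2.
s = 2
R = B**s
m = 2**61 - 1
u, v = 123456789, 987654321

z_t = mont_mul_ref(to_mont(u, m, R), to_mont(v, m, R), m, R)
# Multiplying in Montgomery form and converting back gives u*v mod m.
assert from_mont(z_t, m, R) == (u * v) % m
```

The point of the later slides is to compute the same product without ever forming $R^{-1}$ or dividing by $m$.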
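Sequential Computation on CPU
Computes $\tilde{z} = \tilde{u} * \tilde{v} = \tilde{u} \tilde{v} R^{-1} \bmod m$ from the inputs $\tilde{u}$, $\tilde{v}$ and $m$.
Algorithm:
    $\tilde{z} := 0$;
    for ($i = 0$; $i \le s - 1$; $i$++)
    {
        $\tilde{z} := \tilde{z} + \tilde{u} \tilde{v}_i$;
        $q_M := (-\tilde{z}_0 m_0^{-1}) \bmod B$;
        $\tilde{z} := (\tilde{z} + q_M m) \operatorname{div} B$;
    }
    if $\tilde{z} \ge m$ then $\tilde{z} := \tilde{z} - m$;
The choice of $q_M$ clears the lowest limb, so the division by $B$ is exact:
$(\tilde{z}_0 + (-\tilde{z}_0 m_0^{-1} \bmod B)\, m_0) \bmod B = (\tilde{z}_0 - (\tilde{z}_0 m_0^{-1} m_0 \bmod B)) \bmod B = (\tilde{z}_0 - \tilde{z}_0) \bmod B = 0$.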
Sequential Computation on CPU
Bound on $\tilde{z}$ in the first iteration ($i = 0$) of the same algorithm:
after $\tilde{z} := \tilde{z} + \tilde{u} \tilde{v}_i$: $\tilde{z} < m(B - 1)$;
after $\tilde{z} := (\tilde{z} + q_M m) \operatorname{div} B$: $\tilde{z} < \frac{m(B - 1) + m(B - 1)}{B} = \frac{2m(B - 1)}{B} < 2m$.
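Sequential Computation on CPU
Bound on $\tilde{z}$ in the later iterations ($i \ge 1$) of the same algorithm, with $0 \le \tilde{z} < 2m$ on entry:
after $\tilde{z} := \tilde{z} + \tilde{u} \tilde{v}_i$: $\tilde{z} < 2m + m(B - 1)$;
after $\tilde{z} := (\tilde{z} + q_M m) \operatorname{div} B$: $\tilde{z} < \frac{2m + m(B - 1) + m(B - 1)}{B} = \frac{m(2 + B - 1 + B - 1)}{B} = 2m$.
Hence $\tilde{z} < 2m$ after every iteration, and the single conditional subtraction of $m$ at the end suffices.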
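The sequential algorithm and its $\tilde{z} < 2m$ bound can be exercised with a short Python sketch. It relies on Python's arbitrary-precision integers rather than explicit limb arrays, so it illustrates the arithmetic, not a CPU implementation; the function name `mont_mul_seq` and the test values are assumptions for this example.

```python
def mont_mul_seq(u_t, v_t, m, s, B=2**32):
    """Digit-serial Montgomery product u~ v~ R^-1 mod m with R = B^s.

    Assumes m is odd and 0 <= u_t, v_t < m < B^s.
    """
    neg_m0_inv = (-pow(m, -1, B)) % B      # -m_0^-1 mod B, precomputed once
    z = 0
    for i in range(s):
        v_i = (v_t // B**i) % B            # i-th limb of v~
        z = z + u_t * v_i                  # z := z + u~ * v_i
        q_M = ((z % B) * neg_m0_inv) % B   # q_M := -z_0 * m_0^-1 mod B
        z = (z + q_M * m) // B             # exact: low limb is zero here
        assert z < 2 * m                   # bound derived on the slides
    if z >= m:                             # one conditional subtraction
        z -= m
    return z


# Hypothetical toy check against the definition u~ v~ R^-1 mod m.
B, s = 2**32, 4
m = 2**127 - 1                             # odd and below R = B^4
u_t, v_t = 3**70 % m, 5**50 % m
expected = (u_t * v_t * pow(B**s, -1, m)) % m
assert mont_mul_seq(u_t, v_t, m, s) == expected
```

Only the lowest limb of $\tilde{z}$ and of $m$ enters the computation of $q_M$, which is what makes the loop attractive for a limb-parallel layout.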
Fermi architecture
Specifications: 3 billion transistors; 16 Streaming Multiprocessors (SM); 6 x 64-bit memory partitions; up to 6 GB of GDDR5 in total, with ECC; GigaThread global scheduler; shared L2 cache (768 KB).
Source: NVIDIA’s Next Generation CUDA(TM) Compute Architecture: Fermi
Fermi architecture
Streaming Multiprocessor: 32 CUDA cores (16 x 32 = 512 in total); dual warp scheduler; 16 LD/ST units; 4 Special Function Units (SFU); 64 KB of configurable shared memory and L1 cache (48 KB/16 KB).
CUDA core: pipelined ALU and FPU; ALU supports 32-bit integers; FPU single precision (512 FMA ops/clock); 1K 32-bit registers per core.
Source: NVIDIA’s Next Generation CUDA(TM) Compute Architecture: Fermi
Representation of Integers
[Figure: layout of the limbs $x_0, x_1, x_2, \dots, x_{31}$ in the parallel version and in the sequential version]
Parallel version: low latency; cryptography.
Sequential version: high latency; cryptanalysis.
Representation of Integers
To avoid barriers (memory fences), try to fit the entire operand within a block of 32 threads (a warp). Data coherence is maintained within a warp.
Each thread operates on one limb in radix $B = 2^{32}$.
Possible representations: Avizienis representation (signed-digit); Residue Number System; carry-save.
[Figure: carry-save layout with carries $c_{31}, \dots, c_2, c_1, c_0$ alongside limbs $x_{31}, \dots, x_2, x_1, x_0$]
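To make the carry-save option concrete, the sketch below keeps a pending carry next to each limb, so that every limb position can be updated independently, as one thread per limb would. It is a sequential Python simulation under stated assumptions, not the author's kernel: the convention that `c[j]` carries weight $B^{j+1}$, the helper names, and the requirement that the running total stays below $B^n$ are all choices made for this example.

```python
B = 2**32  # limb radix


def to_limbs(x, n):
    """Split x into n radix-B limbs, least significant first."""
    return [(x // B**j) % B for j in range(n)]


def cs_add(x, c, y):
    """Add limb vector y to the carry-save pair (x, c).

    Each loop iteration models one thread: it reads only its own limb
    and its right neighbour's pending carry, so no carry ripples
    through the whole number. Assumes the total stays below B^n.
    """
    n = len(x)
    new_x, new_c = [0] * n, [0] * n
    for j in range(n):
        carry_in = c[j - 1] if j > 0 else 0
        t = x[j] + y[j] + carry_in
        new_x[j] = t % B       # sum word kept in place
        new_c[j] = t // B      # carry deferred to the next step
    return new_x, new_c


def cs_value(x, c):
    """Integer represented by (x, c); c[j] has weight B^(j+1)."""
    return sum(xj * B**j + cj * B**(j + 1) for j, (xj, cj) in enumerate(zip(x, c)))


# Hypothetical example: accumulate three numbers in four limbs.
n = 4
values = [2**96 - 1, 2**64 - 1, 2**95 + 12345]
x, c = [0] * n, [0] * n
for v in values:
    x, c = cs_add(x, c, to_limbs(v, n))
assert cs_value(x, c) == sum(values)
```

Carries are resolved only when the final integer is needed, which is what keeps the per-step work local to each thread.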
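Montgomery Multiplication
Computes $\tilde{u} * \tilde{v} = \tilde{u} \tilde{v} R^{-1} \bmod m$ with one limb per thread.
Operands: $M = (m_{31}, \dots, m_2, m_1, m_0)$, $A = (a_{31}, \dots, a_2, a_1, a_0)$, $B = (b_{31}, \dots, b_2, b_1, b_0)$; accumulator $T = (t_{31}, \dots, t_1, t_0)$ with carries $(c_{31}, c_{30}, \dots, c_1, c_0)$. (Here $B$ names the operand $\tilde{v}$; in "mod B" and "div B" it is still the radix $2^{32}$.)
Algorithm:
    $A := \tilde{u}$; $B := \tilde{v}$; $M := m$;
    $T := 0$;
    for ($i = 0$; $i \le s - 1$; $i$++)
    {
        $T := T + b_i A$;
        $q_M := t_0 (-m_0)^{-1} \bmod B$;
        $T := (T + q_M M) \operatorname{div} B$;
    }
    if $T \ge M$ then $Z := T - M$;
    else $Z := T$;
Step $T := T + b_i A$, shown for $i = 0$: the threads compute the high and low product words $H(a_j b_0)$ and $L(a_j b_0)$, $j = 0, \dots, 31$, and add them into the limbs $t_j$, producing the carries $c_j$.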
Montgomery Multiplication
Step $q_M := t_0 (-m_0)^{-1} \bmod B$ of the same algorithm: once the product words $H(a_j b_0)$, $L(a_j b_0)$ and the carries $c_j$ have been accumulated into $T = (t_{31}, \dots, t_1, t_0)$, $q_M$ is computed from the lowest limb $t_0$.
Montgomery Multiplication
Step $T := (T + q_M M) \operatorname{div} B$ of the same algorithm: the threads compute the high and low product words $H(m_j q_M)$ and $L(m_j q_M)$, $j = 0, \dots, 31$, and add them into $T$ together with the high words $H(a_j b_0)$ and the carries $c_j$.
Montgomery Multiplication
Continuation of the step $T := (T + q_M M) \operatorname{div} B$: the diagram now shows only the high words $H(a_j b_0)$ and $H(m_j q_M)$ and the carries $c_j$ still to be added into $T = (t_{31}, \dots, t_2, t_1, t_0)$.
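The limb-per-thread steps can be simulated sequentially in Python, with each loop index `j` standing in for one thread of the warp. This is a sketch under stated assumptions, not the author's kernel: the two zero padding limbs, the convention that `c[j]` has weight $B^{j+1}$, the way the division by $B$ absorbs pending carries, and all function names are choices made for this example.

```python
B = 2**32  # limb radix


def limbs(x, n):
    """Split x into n radix-B limbs, least significant first."""
    return [(x // B**j) % B for j in range(n)]


def mul_add_step(t, c, a, k):
    """T := T + k*A in carry-save form; index j models one thread."""
    n = len(t)
    lo = [(a[j] * k) % B for j in range(n)]    # L(a_j k)
    hi = [(a[j] * k) // B for j in range(n)]   # H(a_j k)
    new_t, new_c = [0] * n, [0] * n
    for j in range(n):
        s = t[j] + lo[j]
        if j > 0:
            s += hi[j - 1] + c[j - 1]          # neighbour's high word and carry
        new_t[j], new_c[j] = s % B, s // B
    return new_t, new_c


def shift_step(t, c):
    """T := T div B (t[0] is zero): thread j takes limb j+1 plus carry c[j]."""
    n = len(t)
    new_t, new_c = [0] * n, [0] * n
    for j in range(n):
        s = (t[j + 1] if j + 1 < n else 0) + c[j]
        new_t[j], new_c[j] = s % B, s // B
    return new_t, new_c


def mont_mul_limbs(u_t, v_t, m, s):
    """Montgomery product u~ v~ B^-s mod m; m odd, u_t, v_t < m < B^s."""
    n = s + 2                                  # two padding limbs (assumption)
    a, b, mm = limbs(u_t, n), limbs(v_t, n), limbs(m, n)
    neg_m0_inv = (-pow(m, -1, B)) % B          # (-m_0)^-1 mod B
    t, c = [0] * n, [0] * n
    for i in range(s):
        t, c = mul_add_step(t, c, a, b[i])     # T := T + b_i A
        q_M = (t[0] * neg_m0_inv) % B          # q_M := t_0 (-m_0)^-1 mod B
        t, c = mul_add_step(t, c, mm, q_M)     # T := T + q_M M
        t, c = shift_step(t, c)                # T := T div B
    z = sum(t[j] * B**j + c[j] * B**(j + 1) for j in range(n))
    return z - m if z >= m else z


# Hypothetical check with 32 limbs against the definition.
m = (1 << 1024) - 105                          # odd 1024-bit modulus
u_t, v_t = pow(3, 1000, m), pow(5, 1000, m)
assert mont_mul_limbs(u_t, v_t, m, 32) == (u_t * v_t * pow(B**32, -1, m)) % m
```

In this simulation every thread reads only its own limb and its right neighbour's high word and carry, which matches the slide's goal of staying inside one warp without barriers.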