
Implementation of RSA 2048 on GPUs, Marcelo E. Kaihara, EPFL LACAL



  1. Implementation of RSA 2048 on GPUs. Marcelo E. Kaihara, EPFL – LACAL, Nov. 4, 2010

  2. Motivation
     • NIST Recommendations for Key Management (SP 800-57)
     • NIST draft Recommendation for the Transitioning of Cryptographic Algorithms and Key Sizes (SP 800-131)
     • RSA 1024: deprecated from January 1, 2011
     • RSA 2048: 8x computational effort

  3. Objective: use GPUs as cryptographic accelerators to offload work from the CPU.
     • Low latency
     • Parallel implementation
     • Generic implementation
     • OpenCL
     • Server application
     • Speed

  4. RSA 2048 Decryption
     • Decryption: $m = p \cdot q$, $e \cdot d \equiv 1 \pmod{\varphi(m)}$, $z = c^{d} \bmod m$
     • Precomputed values $(p, q, dP, dQ, qInv)$: $dP = e^{-1} \bmod (p - 1)$, $dQ = e^{-1} \bmod (q - 1)$, $qInv = q^{-1} \bmod p$
     • Chinese Remainder Theorem: $z_1 = c^{dP} \bmod p$, $z_2 = c^{dQ} \bmod q$ (modular exponentiation with 1024-bit moduli: $s = 32$ limbs of 32 bits, $B = 2^{32}$)
     • Recombination: $h = qInv \cdot (z_1 - z_2) \bmod p$, $z = z_2 + h \cdot q$
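The CRT decryption on slide 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenter's GPU code: the function name `rsa_crt_decrypt` and the toy primes are hypothetical, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def rsa_crt_decrypt(c, p, q, dP, dQ, qInv):
    """RSA decryption via the Chinese Remainder Theorem (slide 4 notation)."""
    z1 = pow(c, dP, p)            # z1 = c^dP mod p
    z2 = pow(c, dQ, q)            # z2 = c^dQ mod q
    h = (qInv * (z1 - z2)) % p    # h = qInv * (z1 - z2) mod p
    return z2 + h * q             # z = z2 + h * q


# Toy parameters for illustration only (real RSA 2048 uses 1024-bit primes).
p, q, e = 61, 53, 17
m = p * q
dP = pow(e, -1, p - 1)   # e^-1 mod (p - 1)
dQ = pow(e, -1, q - 1)   # e^-1 mod (q - 1)
qInv = pow(q, -1, p)     # q^-1 mod p

message = 1234
ciphertext = pow(message, e, m)
recovered = rsa_crt_decrypt(ciphertext, p, q, dP, dQ, qInv)
```

The two half-size exponentiations are independent of each other, which is what makes the 1024-bit modular exponentiation the unit of work in the rest of the slides.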

  5. Montgomery Multiplication: general overview
     • Ordinary representation: $u, v \mapsto z = u \times v$
     • Montgomery representation: $\tilde u, \tilde v \mapsto \tilde z = \tilde u * \tilde v$
     • Sequential multiplications are performed in Montgomery representation.

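  6. Montgomery Multiplication
     • Montgomery radix: $R = B^{s} > m$, $\gcd(R, m) = 1$
     • The ordinary representation $(\mathbb{Z}/m\mathbb{Z}, \times)$ is isomorphic to the Montgomery representation $(\mathbb{Z}/m\mathbb{Z}, *)$
     • $\tilde u = u \cdot R \bmod m$, $\tilde v = v \cdot R \bmod m$
     • $u \times v = (u \cdot v) \bmod m$, $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$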

  7. Montgomery Multiplication
     • Definition: $m$ is a large odd integer with $\gcd(m, B) = 1$, and $\tilde u, \tilde v \in \mathbb{Z}/m\mathbb{Z}$
     • $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$
     • $R > m$ (usually $R = B^{s}$)
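As a small check of the definitions on slides 6 and 7, the following Python sketch converts operands into Montgomery representation, multiplies them with the `*` operation, and converts back. The helper names (`to_mont`, `mont_mul`, `from_mont`) and the sample modulus are assumptions made for illustration; the product is computed directly from the definition, not with the limb-wise algorithm.

```python
B = 2 ** 32  # radix used in the slides


def to_mont(u, R, m):
    """Map u to its Montgomery representation u~ = u * R mod m."""
    return (u * R) % m


def mont_mul(ut, vt, R_inv, m):
    """Montgomery product u~ * v~ = u~ * v~ * R^-1 mod m (by definition)."""
    return (ut * vt * R_inv) % m


def from_mont(ut, R_inv, m):
    """Map u~ back to the ordinary representation u = u~ * R^-1 mod m."""
    return (ut * R_inv) % m


# Hypothetical sample values: any odd m with R = B^s > m works.
s = 2
R = B ** s
m = 0xF123456789ABCDEF      # odd, so gcd(R, m) = 1
R_inv = pow(R, -1, m)       # requires Python 3.8+

u, v = 123456789, 987654321
zt = mont_mul(to_mont(u, R, m), to_mont(v, R, m), R_inv, m)
z = from_mont(zt, R_inv, m)  # equals (u * v) mod m
```

Because the map $u \mapsto u \cdot R \bmod m$ preserves products in this sense, a whole exponentiation can stay in Montgomery form and convert back only once at the end.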

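  8. Sequential Computation on CPU: $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$
     Algorithm:
       $\tilde z \leftarrow 0$
       for $i = 0$ to $s - 1$:
         $\tilde z \leftarrow \tilde z + \tilde u \cdot \tilde v_i$
         $q_M \leftarrow (-\tilde z_0 \cdot m_0^{-1}) \bmod B$
         $\tilde z \leftarrow (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$
       if $\tilde z \ge m$ then $\tilde z \leftarrow \tilde z - m$
     The division by $B$ is exact:
       $(\tilde z + (-\tilde z_0 \cdot m_0^{-1} \bmod B) \cdot m) \bmod B = (\tilde z_0 + (-\tilde z_0 \cdot m_0^{-1} \cdot m_0 \bmod B)) \bmod B = (\tilde z_0 - \tilde z_0) \bmod B = 0$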
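The word-serial algorithm on slide 8 maps almost line for line onto Python integers. The sketch below is illustrative only: the function name `mont_mul_serial` is an assumption, Python big integers stand in for the multi-limb accumulator, and the `assert` checks the exact-division property stated on the slide.

```python
def mont_mul_serial(u, v, m, s, B=2 ** 32):
    """Compute u * v * R^-1 mod m with R = B^s, one limb of v per iteration."""
    m0_inv = pow(m % B, -1, B)        # m_0^-1 mod B (m must be odd)
    z = 0
    for i in range(s):
        v_i = (v // B ** i) % B       # i-th limb of v
        z = z + u * v_i
        q_M = (-(z % B) * m0_inv) % B
        assert (z + q_M * m) % B == 0  # lowest limb cancels, division is exact
        z = (z + q_M * m) // B
    if z >= m:                        # single conditional subtraction
        z = z - m
    return z


# Hypothetical sample values for a quick comparison with the definition.
s = 4
R = (2 ** 32) ** s
m = (1 << 127) - 1                    # an odd modulus below R
u, v = 0x1234567890ABCDEF1234, 0xFEDCBA0987654321FEDC
expected = (u * v * pow(R, -1, m)) % m
result = mont_mul_serial(u, v, m, s)
```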

  9. Sequential Computation on CPU (algorithm as on slide 8): bound for $i = 0$
     • After $\tilde z \leftarrow \tilde z + \tilde u \cdot \tilde v_0$: $\tilde z < m \cdot (B - 1)$
     • After $\tilde z \leftarrow (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$: $\tilde z < \dfrac{m \cdot (B - 1) + m \cdot (B - 1)}{B} = \dfrac{2 \cdot m \cdot (B - 1)}{B} < 2 \cdot m$

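  10. Sequential Computation on CPU: $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$
      Algorithm:
        $\tilde z \leftarrow 0$
        for $i = 0$ to $s - 1$:
          $\tilde z \leftarrow \tilde z + \tilde u \cdot \tilde v_i$
          $q_M \leftarrow (-\tilde z_0 \cdot m_0^{-1}) \bmod B$
          $\tilde z \leftarrow (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$
        if $\tilde z \ge m$ then $\tilde z \leftarrow \tilde z - m$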

  11. Sequential Computation on CPU (algorithm as on slide 8): bound for $i \ge 1$
      • At the start of the iteration: $0 \le \tilde z < 2 \cdot m$
      • After $\tilde z \leftarrow \tilde z + \tilde u \cdot \tilde v_i$: $\tilde z < 2 \cdot m + m \cdot (B - 1)$
      • After $\tilde z \leftarrow (\tilde z + q_M \cdot m) \,\mathrm{div}\, B$: $\tilde z < \dfrac{2 \cdot m + m \cdot (B - 1) + m \cdot (B - 1)}{B} = \dfrac{m \cdot (2 + B - 1 + B - 1)}{B} = 2 \cdot m$
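The bound $0 \le \tilde z < 2 \cdot m$ derived on slides 9 and 11 can be checked exhaustively for a tiny radix. The sketch below assumes a hypothetical helper `check_bound` and deliberately small parameters (radix 4, two limbs) so that every odd modulus and every operand pair can be enumerated; it is a sanity check of the argument, not a proof.

```python
from itertools import product


def check_bound(B, s):
    """Run the word-serial loop for every odd m < B^s and all u, v < m.

    Asserts 0 <= z < 2m after each iteration and returns the number of
    (m, u, v) cases examined.
    """
    R = B ** s
    cases = 0
    for m in range(3, R, 2):              # odd moduli only, gcd(m, B) = 1
        m0_inv = pow(m % B, -1, B)
        for u, v in product(range(m), repeat=2):
            z = 0
            for i in range(s):
                v_i = (v // B ** i) % B
                z = z + u * v_i
                q_M = (-(z % B) * m0_inv) % B
                z = (z + q_M * m) // B
                assert 0 <= z < 2 * m     # invariant from slides 9 and 11
            cases += 1
    return cases


checked = check_bound(B=4, s=2)
```

Since the accumulator never reaches $2 \cdot m$, one conditional subtraction at the end is enough to bring the result into $[0, m)$.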

  12. Fermi architecture
      Specifications:
      • 3 billion transistors
      • 16 Streaming Multiprocessors (SM)
      • 6 x 64-bit memory partitions
      • Up to 6 GB of GDDR5 in total, with ECC
      • GigaThread global scheduler
      • Shared L2 cache (768 KB)
      Source: NVIDIA's Next Generation CUDA Compute Architecture: Fermi

  13. Fermi architecture
      Streaming Multiprocessor:
      • 32 CUDA cores (16 x 32 = 512)
      • Dual warp scheduler
      • 16 LD/ST units
      • 4 Special Function Units (SFU)
      • 64 KB of configurable shared memory and L1 cache (48 KB / 16 KB)
      CUDA core:
      • Pipelined ALU and FPU
      • ALU supports 32-bit integers
      • FPU: single precision (512 FMA ops / clock)
      • 1K 32-bit registers per core
      Source: NVIDIA's Next Generation CUDA Compute Architecture: Fermi

  14. Representation of Integers
      [Figure: layout of the limbs $x_0, x_1, x_2, \dots, x_{31}$ in the parallel version and in the sequential version]
      • Parallel version: low latency, cryptography
      • Sequential version: high latency, cryptanalysis
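To make the limb notation concrete, here is a minimal Python sketch that splits a 1024-bit integer into 32 limbs of 32 bits and joins them again. The helper names `to_limbs` and `from_limbs` are assumptions for illustration; on a GPU each list position would correspond to the limb held by one thread in the parallel version.

```python
def to_limbs(x, s=32, bits=32):
    """Split x into s limbs of `bits` bits, least significant limb first."""
    mask = (1 << bits) - 1
    return [(x >> (bits * j)) & mask for j in range(s)]


def from_limbs(limbs, bits=32):
    """Recombine limbs x_0, x_1, ... into the integer sum of x_j * B^j."""
    return sum(limb << (bits * j) for j, limb in enumerate(limbs))


# Hypothetical 1024-bit sample operand.
x = (1 << 1023) | 0xDEADBEEF
limbs = to_limbs(x)
```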

  15. Representation of Integers
      • To avoid barriers (memory fences), try to fit the entire operand within a block of 32 threads (a warp). Data coherence is maintained within a warp.
      • Each thread operates on one limb in radix $B = 2^{32}$.
      • Possible representations: Avizienis representation (signed-digit), Residue Number System, carry-save
      [Figure: carry-save representation with carries $c_{31}, \dots, c_2, c_1, c_0$ stored alongside limbs $x_{31}, \dots, x_2, x_1, x_0$]
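The carry-save idea named on slide 15 can be illustrated with a short Python sketch: every limb position adds independently and stores its carry-out instead of passing it to its neighbour, so no position waits on another. The function names and the convention that carry $c_j$ has weight $B^{j+1}$ are assumptions chosen for this example, not taken from the slides.

```python
BITS = 32
MASK = (1 << BITS) - 1


def carry_save_add(x, y):
    """Add two limb vectors position by position without propagating carries.

    Returns (sums, carries); carries[j] is the carry out of position j and
    therefore has the weight of position j + 1.
    """
    sums = [(a + b) & MASK for a, b in zip(x, y)]
    carries = [(a + b) >> BITS for a, b in zip(x, y)]
    return sums, carries


def resolve(sums, carries):
    """Collapse the redundant (sums, carries) pair into an ordinary integer."""
    value = sum(sj << (BITS * j) for j, sj in enumerate(sums))
    value += sum(cj << (BITS * (j + 1)) for j, cj in enumerate(carries))
    return value


# Worst case for ripple carry: every position overflows at once.
x = [MASK, MASK, MASK, MASK]   # 2^128 - 1
y = [1, 1, 1, 1]
sums, carries = carry_save_add(x, y)
```

In this example every position produces a carry, yet each one is computed from its own two limbs only, which is the property that lets 32 threads work without synchronising on a carry chain.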

  16. Montgomery Multiplication: $\tilde u * \tilde v = \tilde u \cdot \tilde v \cdot R^{-1} \bmod m$
      Algorithm:
        $A := \tilde u$; $B := \tilde v$; $M := m$; $T := 0$
        for $i = 0$ to $s - 1$:
          $T := T + b_i \cdot A$
          $q_M := t_0 \cdot (-m_0^{-1}) \bmod B$
          $T := (T + q_M \cdot M) \,\mathrm{div}\, B$
        if $T \ge M$ then $Z := T - M$ else $Z := T$
      [Figure: operand layout with one limb per column: $M = (m_{31}, \dots, m_2, m_1, m_0)$, $A = (a_{31}, \dots, a_2, a_1, a_0)$, $B = (b_{31}, \dots, b_2, b_1, b_0)$; high words $H(a_{31} b_0), \dots, H(a_0 b_0)$ and low words $L(a_{31} b_0), \dots, L(a_0 b_0)$ of the partial products; carries $c_{31}, \dots, c_0$; accumulator $T = (t_{31}, \dots, t_0)$]

  17. Montgomery Multiplication (algorithm as on slide 16)
      [Figure: same layout with the partial products $H(a_j b_0)$ and $L(a_j b_0)$ added into the carries $c_{31}, \dots, c_0$ and the accumulator $T = (t_{31}, \dots, t_0)$; $q_M$ is derived from $t_0$]

  18. Montgomery Multiplication (algorithm as on slide 16)
      [Figure: same layout for the step $T := (T + q_M \cdot M) \,\mathrm{div}\, B$, with high words $H(m_{31} q_M), \dots, H(m_0 q_M)$ and low words $L(m_{31} q_M), \dots, L(m_0 q_M)$, carries $c_{31}, \dots, c_0$, accumulator $T = (t_{31}, \dots, t_0)$ and $q_M$]

  19. Montgomery Multiplication (algorithm as on slide 16)
      [Figure: same layout, with $H(m_j q_M)$ and $L(m_j q_M)$ added into the carries $c_{31}, \dots, c_0$ and the accumulator $T = (t_{31}, \dots, t_0)$]

  20. Montgomery Multiplication (algorithm as on slide 16)
      [Figure: same layout, with the remaining high words $H(m_{31} q_M), \dots, H(m_0 q_M)$, carries $c_{31}, \dots, c_0$, accumulator $T = (t_{31}, \dots, t_0)$ and $q_M$]
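The limb-wise scheme pictured on slides 16 to 20 can be simulated in Python. This is a sketch under stated assumptions, not the presenter's kernel: the names (`mont_mul_limbs`, `add_at`, `absorb`), the two extra accumulator limbs, and the choice to move saved carries up by one position per iteration are all illustrative. The inner `for j` loops stand in for the per-thread index; a real GPU kernel would have to schedule the high-word additions so that neighbouring threads do not collide.

```python
BITS = 32
RADIX = 1 << BITS
MASK = RADIX - 1


def to_limbs(x, n):
    return [(x >> (BITS * j)) & MASK for j in range(n)]


def add_at(t, c, j, x):
    """Add x (< RADIX) into limb j; save the carry in c[j + 1], no ripple."""
    total = t[j] + x
    t[j] = total & MASK
    c[j + 1] += total >> BITS


def absorb(t, c):
    """Each limb adds its pending carry; new carries move up one position."""
    n = len(t)
    new_t, new_c = [0] * n, [0] * n
    for j in range(n):
        total = t[j] + c[j]
        new_t[j] = total & MASK
        if j + 1 < n:
            new_c[j + 1] = total >> BITS
        else:
            assert total >> BITS == 0   # accumulator is bounded, see slide 11
    return new_t, new_c


def mont_mul_limbs(a, b, m, s):
    """a * b * RADIX^-s mod m using H/L partial products and saved carries."""
    n = s + 2                            # headroom for T + b_i*A + q_M*M
    A, Bv, M = to_limbs(a, s), to_limbs(b, s), to_limbs(m, s)
    neg_m0_inv = (-pow(M[0], -1, RADIX)) % RADIX
    t, c = [0] * n, [0] * n
    for i in range(s):
        for j in range(s):               # "thread" j: T := T + b_i * A
            prod = A[j] * Bv[i]
            add_at(t, c, j, prod & MASK)         # L(a_j b_i) into limb j
            add_at(t, c, j + 1, prod >> BITS)    # H(a_j b_i) into limb j + 1
        q_M = (t[0] * neg_m0_inv) & MASK         # q_M := t_0 * (-m_0^-1) mod B
        for j in range(s):               # "thread" j: T := T + q_M * M
            prod = M[j] * q_M
            add_at(t, c, j, prod & MASK)         # L(m_j q_M)
            add_at(t, c, j + 1, prod >> BITS)    # H(m_j q_M)
        t, c = absorb(t, c)
        assert t[0] == 0 and c[0] == 0           # lowest limb cancelled
        t, c = t[1:] + [0], c[1:] + [0]          # div B: shift down one limb
    z = sum((tj + cj) << (BITS * j) for j, (tj, cj) in enumerate(zip(t, c)))
    return z - m if z >= m else z
```

A quick way to exercise the sketch is to compare `mont_mul_limbs(a, b, m, s)` against `a * b * pow(RADIX ** s, -1, m) % m` for random operands below an odd modulus `m`.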
