Montgomery Multiplication Using Vector Instructions
Joppe W. Bos, Peter L. Montgomery, Daniel Shumow, and Gregory M. Zaverucha
SAC 2013


  1. Montgomery Multiplication Using Vector Instructions Joppe W. Bos, Peter L. Montgomery, Daniel Shumow, and Gregory M. Zaverucha SAC 2013

  3. Motivation
     • Point arithmetic in $E(\mathbb{F}_p)$: e.g. ECDSA, ECDH
     • Arithmetic in $\mathbb{F}_p$ or $\mathbb{Z}/M\mathbb{Z}$: e.g. DH, DSA, RSA
     • Underlying operation: Montgomery multiplication
     • ECC often uses primes of a special form: NIST curves, curve25519
     • Useful for pairings

  5. Modular Multiplication
     Compute $C = A \times B \pmod{M}$:
     $R = A \times B$; write $R = q \times M + C$ such that $0 \le C < M$.
     Cost: one multiplication + one division with remainder.
     Montgomery (Math. Comp. 1985) observed that we can avoid the expensive division when $M$ is odd:
     $\frac{A}{2} \bmod M = \begin{cases} \frac{A}{2} & \text{if } A \text{ is even} \\ \frac{A+M}{2} & \text{if } A \text{ is odd} \end{cases}$
     $A + M \times \left(A \times (-M^{-1}) \bmod 2^{32}\right) \equiv 0 \pmod{2^{32}}$; precompute $\mu = -M^{-1} \bmod 2^{32}$.
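
Both identities on this slide can be checked numerically. The sketch below is an illustration only, not code from the talk: the function names `half_mod` and `mont_reduce_word` and the sample values are assumptions made for the example.

```python
# Illustrative check of the two identities; all names and values are assumptions.

def half_mod(a: int, m: int) -> int:
    """Return a/2 mod m for odd m without dividing by m."""
    assert m % 2 == 1
    return a // 2 if a % 2 == 0 else (a + m) // 2

def mont_reduce_word(a: int, m: int, w: int = 32) -> int:
    """Return a value congruent to a * 2^(-w) mod m using only a shift."""
    r = 1 << w
    mu = (-pow(m, -1, r)) % r      # precomputed once: mu = -m^(-1) mod 2^w
    q = (a * mu) % r               # q = a * mu mod 2^w
    t = a + m * q                  # the low w bits of t are zero by construction
    assert t % r == 0
    return t >> w                  # exact division by 2^w

m = 0xFFFFFFFB                     # hypothetical odd modulus
a = 123456789
assert (2 * half_mod(a, m)) % m == a % m
assert (mont_reduce_word(a, m) << 32) % m == a % m
```

The point of the second function is that the only "division" left is a shift by the word size.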

  6. Interleaved Montgomery Multiplication
     Input: $A = \sum_{i=0}^{n-1} a_i 2^{32i}$, $B$, $M$, $\mu = -M^{-1} \bmod 2^{32}$
     Output: $C = A B 2^{-32n} \bmod M$
     $C = 0$
     for $i = 0$ to $n-1$ do
         $C = C + a_i B$ : $(1 \times n)$ limbs
         $q = \mu C \bmod 2^{32}$ : $(1 \times 1)$ limb
         $C = (C + qM)/2^{32}$ : $(1 \times n)$ limbs
     if $C \ge M$ then $C = C - M$
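
The loop on this slide can be written out with Python integers standing in for the limb arrays. This is a minimal sketch under stated assumptions (32-bit limbs, odd modulus below $2^{32n}$, operands already reduced); the function name `montmul_interleaved` is hypothetical and this is not the authors' implementation.

```python
W = 32                      # limb size in bits
MASK = (1 << W) - 1

def montmul_interleaved(a: int, b: int, m: int, n: int) -> int:
    """Return a*b*2^(-32n) mod m, for odd m < 2^(32n) and 0 <= a, b < m."""
    mu = (-pow(m, -1, 1 << W)) & MASK      # mu = -m^(-1) mod 2^32
    c = 0
    for i in range(n):
        a_i = (a >> (W * i)) & MASK        # i-th 32-bit limb of a
        c += a_i * b                       # (1 x n)-limb multiplication
        q = (mu * (c & MASK)) & MASK       # (1 x 1)-limb multiplication
        c = (c + q * m) >> W               # (1 x n)-limb multiplication, exact shift
    return c - m if c >= m else c          # final conditional subtraction

# Usage with a hypothetical 61-bit odd modulus split into n = 2 limbs.
m = (1 << 61) - 1
a, b = 0x0123456789ABCDEF % m, 0x0FEDCBA987654321 % m
assert montmul_interleaved(a, b, m, 2) == (a * b * pow(1 << 64, -1, m)) % m
```

Note that `q` depends on the result of `c += a_i * b`, so the two long multiplications in each iteration cannot start at the same time.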

  7. Interleaved Montgomery Multiplication
     Input: $A = \sum_{i=0}^{n-1} a_i 2^{32i}$, $B$, $M$, $\mu = -M^{-1} \bmod 2^{32}$
     Output: $C = A B 2^{-32n} \bmod M$
     $C = 0$
     for $i = 0$ to $n-1$ do
         $q = (c_0 + a_i b_0)\mu \bmod 2^{32}$, replacing $q = \mu C \bmod 2^{32}$ : $2 \times (1 \times 1)$ limb
         $C = (C + a_i B + qM)/2^{32}$, replacing $C = C + a_i B$ and $C = (C + qM)/2^{32}$ : $2 \times (1 \times n)$ limbs
     if $C \ge M$ then $C = C - M$
     At the cost of one extra $(1 \times 1)$-limb multiplication, the two $(1 \times n)$-limb multiplications become independent.
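
The rearrangement can be sketched the same way. In the code below `q` is computed from the lowest limbs only, so the two long products no longer depend on each other. The name `montmul_independent` is hypothetical, and Python integers again stand in for limb arrays.

```python
W = 32                      # limb size in bits
MASK = (1 << W) - 1

def montmul_independent(a: int, b: int, m: int, n: int) -> int:
    """Return a*b*2^(-32n) mod m, for odd m < 2^(32n) and 0 <= a, b < m."""
    mu = (-pow(m, -1, 1 << W)) & MASK          # mu = -m^(-1) mod 2^32
    b0 = b & MASK                              # lowest limb of b
    c = 0
    for i in range(n):
        a_i = (a >> (W * i)) & MASK
        c0 = c & MASK                          # lowest limb of c
        q = ((c0 + a_i * b0) * mu) & MASK      # two (1 x 1)-limb multiplications
        p_ab = a_i * b                         # (1 x n) limbs, does not need q
        p_qm = q * m                           # (1 x n) limbs, does not need p_ab
        c = (c + p_ab + p_qm) >> W             # low limb cancels, shift is exact
    return c - m if c >= m else c

# Usage with a hypothetical 61-bit odd modulus split into n = 2 limbs.
m = (1 << 61) - 1
a, b = 0x0123456789ABCDEF % m, 0x0FEDCBA987654321 % m
assert montmul_independent(a, b, m, 2) == (a * b * pow(1 << 64, -1, m)) % m
```

Because `p_ab` and `p_qm` share no inputs other than the limb index, they are candidates for the two lanes of a 2-way vector multiplier.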

  8. Interleaved Montgomery Multiplication
     Idea: flip the sign of $\mu$: use $\mu = +M^{-1} \bmod 2^{32}$ instead of $\mu = -M^{-1} \bmod 2^{32}$.
     Input, output, and loop are as on slide 7, with $q = (c_0 + a_i b_0)\mu \bmod 2^{32}$ costing $2 \times (1 \times 1)$ limb and the two independent $(1 \times n)$-limb multiplications.

  9. 2-way SIMD Interleaved Montgomery Multiplication

  10. 2-way SIMD Interleaved Montgomery Multiplication
      $C = \sum_i d_i 2^{32i} - \sum_i e_i 2^{32i}$
      Non-SIMD part:
      $q = \left(\mu b_0 a_j + \mu (d_0 - e_0)\right) \bmod 2^{32} = \left(\mu b_0 a_j + \mu c_0\right) \bmod 2^{32} = (c_0 + a_j b_0)\mu \bmod 2^{32}$
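
A scalar emulation helps to see how the two lanes fit together. The sketch below keeps two accumulators, `d` for the products with $B$ and `e` for the products with $M$, uses $\mu = +M^{-1} \bmod 2^{32}$, and forms $C = D - E$ at the end. It is a plain-Python illustration under those assumptions, with hypothetical names; no actual SSE or NEON instructions are used.

```python
W = 32                      # limb size in bits
MASK = (1 << W) - 1

def to_limbs(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(v):
    return sum(x << (W * i) for i, x in enumerate(v))

def montmul_2way(a, b, m, n):
    """Return a*b*2^(-32n) mod m, for odd m < 2^(32n) and 0 <= a, b < m."""
    al, bl, ml = to_limbs(a, n), to_limbs(b, n), to_limbs(m, n)
    mu = pow(m, -1, 1 << W)                  # flipped sign: mu = +m^(-1) mod 2^32
    mu_b0 = (mu * bl[0]) & MASK              # precomputed once
    d = [0] * n                              # lane 0 accumulator
    e = [0] * n                              # lane 1 accumulator
    for j in range(n):
        # Non-SIMD part: q = (mu*b0*a_j + mu*(d0 - e0)) mod 2^32
        q = (mu_b0 * al[j] + mu * (d[0] - e[0])) & MASK
        # The low words of both lanes are equal mod 2^32, so both are dropped.
        t0 = (al[j] * bl[0] + d[0]) >> W     # lane 0
        t1 = (q * ml[0] + e[0]) >> W         # lane 1
        for i in range(1, n):
            p0 = al[j] * bl[i] + t0 + d[i]   # lane 0: 32x32->64 multiply-add
            p1 = q * ml[i] + t1 + e[i]       # lane 1: 32x32->64 multiply-add
            t0, d[i - 1] = p0 >> W, p0 & MASK
            t1, e[i - 1] = p1 >> W, p1 & MASK
        d[n - 1], e[n - 1] = t0, t1
    c = from_limbs(d) - from_limbs(e)        # C = D - E
    return c + m if c < 0 else c

# Usage with a hypothetical 61-bit odd modulus split into n = 2 limbs.
m = (1 << 61) - 1
a, b = 0x0123456789ABCDEF % m, 0x0FEDCBA987654321 % m
assert montmul_2way(a, b, m, 2) == (a * b * pow(1 << 64, -1, m)) % m
```

With the sign of $\mu$ flipped, the reduction is a subtraction, which is why the two lanes can accumulate separately and be combined only once at the end.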

  12. Expected Performance Speedup
      Sequential Montgomery multiplication: long muls $2n^2$, short muls $n$
      2-way SIMD Montgomery multiplication: long muls $n^2$, short muls $2n$
      Based on the number of multiplications only, we expect:
      • 32-bit 2-way SIMD to be at most 2x as fast as 32-bit sequential
      • 32-bit 2-way SIMD to be approximately 2x slower than 64-bit sequential
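
To make the estimate concrete, the counts can be evaluated for a 2048-bit modulus. The helper `mul_counts` is hypothetical, and weighting long and short multiplications equally is a simplifying assumption for this example, not a claim from the talk.

```python
def mul_counts(bits: int, limb_bits: int):
    """Return (n, sequential count, 2-way SIMD count) for an n-limb modulus."""
    n = bits // limb_bits
    sequential = 2 * n * n + n      # long muls 2n^2 + short muls n
    simd_2way = n * n + 2 * n       # long muls n^2 (two products each) + short muls 2n
    return n, sequential, simd_2way

n32, seq32, simd32 = mul_counts(2048, 32)   # n = 64 limbs of 32 bits
n64, seq64, _ = mul_counts(2048, 64)        # n = 32 limbs of 64 bits

print(seq32 / simd32)   # about 1.95: SIMD vs 32-bit sequential, just under 2x
print(simd32 / seq64)   # about 2.03: SIMD needs roughly 2x the instructions of 64-bit sequential
```

The first ratio approaches 2 as $n$ grows but never reaches it, which matches the "at most 2x" wording.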

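  13. Performance Results - x86

      | Platform | RSA | Classic | SIMD | Ratio |
      |---|---|---|---|---|
      | Intel Xeon E3-1230 (3.2 GHz), PC | enc 2048 | 181,412 | 414,787 | 0.44 |
      | Intel Xeon E3-1230 (3.2 GHz), PC | dec 2048 | 4,928,633 | 12,211,700 | 0.40 |
      | Intel Atom Z2760 (1.8 GHz), tablet | enc 2048 | 2,583,643 | 1,601,878 | 1.61 |
      | Intel Atom Z2760 (1.8 GHz), tablet | dec 2048 | 80,204,317 | 52,000,367 | 1.54 |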
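
The Ratio column appears to be Classic divided by SIMD, so a value above 1 means the SIMD version is faster. A quick check against the x86 numbers on this slide (the helper name `ratio` is an assumption for illustration):

```python
def ratio(classic: int, simd: int) -> float:
    """Classic / SIMD, rounded to two decimals as on the slide."""
    return round(classic / simd, 2)

x86_results = {
    ("Xeon E3-1230", "enc 2048"): (181_412, 414_787),
    ("Xeon E3-1230", "dec 2048"): (4_928_633, 12_211_700),
    ("Atom Z2760", "enc 2048"): (2_583_643, 1_601_878),
    ("Atom Z2760", "dec 2048"): (80_204_317, 52_000_367),
}

for (platform, op), (classic, simd) in x86_results.items():
    print(platform, op, ratio(classic, simd))
```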

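  14. Performance Results - ARM

      | Platform | RSA | Classic | SIMD | Ratio |
      |---|---|---|---|---|
      | Dell XPS 10 tablet, Snapdragon S4 (1.8 GHz) | enc 2048 | 1,087,318 | 710,910 | 1.53 |
      | Dell XPS 10 tablet, Snapdragon S4 (1.8 GHz) | dec 2048 | 34,769,147 | 21,478,047 | 1.62 |
      | NVIDIA Tegra 4 (1.9 GHz), dev board, Cortex-A15 | enc 2048 | 725,336 | 712,542 | 1.02 |
      | NVIDIA Tegra 4 (1.9 GHz), dev board, Cortex-A15 | dec 2048 | 23,177,617 | 22,812,040 | 1.02 |
      | NVIDIA Tegra 3 T30 (1.4 GHz), dev board, Cortex-A9 | enc 2048 | 872,468 | 1,358,955 | 0.64 |
      | NVIDIA Tegra 3 T30 (1.4 GHz), dev board, Cortex-A9 | dec 2048 | 27,547,434 | 47,205,919 | 0.58 |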

  15. Performance Results
      Compare to results from eBACS (ECRYPT Benchmarking of Cryptographic Systems) and OpenSSL.

      | Platform | RSA | Classic | OpenSSL |
      |---|---|---|---|
      | Snapdragon S4 (1.8 GHz) vs Snapdragon S3 (1.78 GHz) | enc 2048 | 1,087,318 | 609,593 |
      | Snapdragon S4 (1.8 GHz) vs Snapdragon S3 (1.78 GHz) | dec 2048 | 34,769,147 | 39,746,105 |
      | Intel Atom Z2760 (1.8 GHz), tablet | enc 2048 | 2,583,643 | 2,323,800 |
      | Intel Atom Z2760 (1.8 GHz), tablet | dec 2048 | 80,204,317 | 75,871,800 |

  16. Can we do (asymptotically) better?
      What about faster multiplication methods (Karatsuba)?
      • Incompatible with interleaved Montgomery multiplication
      • Possible gain ([A]) on 32-bit platforms for 1024-bit Montgomery multiplication
      Following the analysis from [A] (one-level Karatsuba) for 32-bit platforms:
      • Sequential Karatsuba montmul versus sequential interleaved montmul: sequential Karatsuba reduces muls by 1.14x and adds by 1.18x
      • Sequential Karatsuba montmul versus SIMD interleaved montmul: SIMD interleaved reduces muls by 1.70x and adds by 1.67x
      [A] J. Großschädl, R. M. Avanzi, E. Savas, and S. Tillich. Energy-efficient software implementation of long integer modular arithmetic. CHES 2005
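
For reference, one level of Karatsuba replaces four half-size multiplications with three. The sketch below is a generic textbook illustration, not the implementation analysed in [A]; the function name and the `half_bits` parameter are assumptions.

```python
def karatsuba_one_level(x: int, y: int, half_bits: int) -> int:
    """Multiply x and y (each below 2^(2*half_bits)) with three half-size products."""
    mask = (1 << half_bits) - 1
    x0, x1 = x & mask, x >> half_bits        # x = x1 * 2^half_bits + x0
    y0, y1 = y & mask, y >> half_bits        # y = y1 * 2^half_bits + y0
    lo = x0 * y0                             # product 1
    hi = x1 * y1                             # product 2
    mid = (x0 + x1) * (y0 + y1) - lo - hi    # product 3 yields x0*y1 + x1*y0
    return (hi << (2 * half_bits)) + (mid << half_bits) + lo

# Usage with two hypothetical 1024-bit operands split at 512 bits.
x = 3 ** 600 % (1 << 1024)
y = 7 ** 350 % (1 << 1024)
assert karatsuba_one_level(x, y, 512) == x * y
```

Karatsuba operates on the complete operands before any reduction takes place, whereas the interleaved form alternates multiplication and reduction limb by limb; this is presumably the incompatibility the slide refers to.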

  18. Can we do (asymptotically) better?
      What about SIMD Karatsuba montmul versus SIMD interleaved montmul?
      • SIMD Karatsuba, but how to calculate SIMD reduction?
      • This approach is used in GMP
      • GMP is not a crypto lib

      | Platform | RSA-2048 enc, GMP | RSA-2048 enc, SIMD | RSA-2048 dec, GMP | RSA-2048 dec, SIMD |
      |---|---|---|---|---|
      | Atom Z2760 | 2,184,436 | 1,601,878 | 37,070,875 | 52,000,367 |
      | Intel Xeon E3-1230 (32-bit mode) | 695,861 | 414,787 | 11,929,868 | 12,211,700 |

      Modular Squaring
      • Time(Montgomery squaring) ≈ 0.80 × Time(Montgomery multiplication) [A]
      • SIMD Montgomery squaring?
      • We didn't use this optimization
      [A] J. Großschädl, R. M. Avanzi, E. Savas, and S. Tillich. Energy-efficient software implementation of long integer modular arithmetic. CHES 2005

  19. Conclusions
      ✓ Current vector instructions can be used to enhance the performance of Montgomery multiplication on modern embedded devices. Examples: 32-bit x86 (SSE) and ARM (NEON) platforms
      ✓ Faster RSA-2048 on some tablets: performance on ARM differs significantly
      ✓ If future instruction set(s) support $64 \times 64 \to 128$-bit 2-way SIMD multipliers: enhance interleaved Montgomery multiplication performance
      Future work
      • Investigate SIMD Karatsuba + SIMD (?) Montgomery reduction
      • Investigate SIMD Montgomery squaring
