Multiprecision Multiplication on ARMv8

Multiprecision Multiplication on ARMv8 - Zhe Liu, Kimmo Järvinen - PowerPoint PPT Presentation



  1. Multiprecision Multiplication on ARMv8
      Zhe Liu¹, Kimmo Järvinen², Weiqiang Liu³, Hwajeong Seo⁴
      ¹ APSIA, Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg
      ² Department of Computer Science, University of Helsinki, Helsinki, Finland
      ³ College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
      ⁴ Department of IT, Hansung University

  2. Motivation • Cryptography degrades the performance of smartphones • In particular, public-key cryptography (PKC) imposes high overheads • A fast PKC implementation is important to achieve high availability

  3. Motivation • Multi-precision arithmetic operations (for PKC) • Compact big-number implementation is an open problem • Few works focus on ARMv8: GCM (CT-RSA’15) → Binary field multiplication (SCN’16) → Binary ECC (SG-CRC’17) • This work improves the performance of multiplication on ARMv8!

  4. Contribution • Compact implementations of multi-precision multiplication • Subtractive Karatsuba algorithm • Evaluation of multiple-level Karatsuba • Tested input sizes: 128, 256, 384, and 512 bits • Dedicated squaring routine

  5. Target Platform – ARMv8 • 95% of smartphones are based on the ARM architecture • Modern smartphones support 64-bit ARMv8

  6. Target Platform – ARMv8 • 32-bit mode (AArch32) & 64-bit mode (AArch64) • 64-bit ARM & 128-bit NEON registers and instruction sets • Crypto (AES and SHA) operation

  7. Multiplication on ARMv8 [Figure: 64-bit × 64-bit multiplication on A64. Operands a0 (in X0) and b0 (in X1) are 64 bits each; MUL writes the low 64 bits of a0·b0 to X2 and UMULH writes the high 64 bits to X3.]

  8. Multiplication on ARMv8 [Figure: SIMD (NEON) vs. SISD (A64). NEON: V0 holds the 32-bit words a3..a0 and V1 holds b3..b0; UMULL writes 64-bit products such as a1b0 and a0b0 to V2. A64: X0 = a0 and X1 = b0 are 64 bits each; MUL and UMULH write the low and high 64 bits of a0b0 to X2 and X3.]
      For 64-bit multiplication on ARMv8, NEON requires 4 UMULL operations, whereas A64 needs only 1 MUL and 1 UMULH. A64 is therefore more efficient than NEON for big-integer multiplication.
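
The instruction counts on slide 8 can be checked with a small Python model. This is only a sketch, not the authors' code: the helper names `mul`, `umulh`, and `mul64_from_32bit_parts` are made up for illustration. The first two mimic the A64 instructions of the same name; the third assumes the NEON path is modeled as four 32×32→64-bit products, which is what UMULL supplies per lane.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def mul(a, b):
    """Model of A64 MUL: low 64 bits of the 128-bit product."""
    return (a * b) & MASK64

def umulh(a, b):
    """Model of A64 UMULH: high 64 bits of the unsigned 128-bit product."""
    return (a * b) >> 64

def mul64_from_32bit_parts(a, b):
    """64x64 -> 128-bit product assembled from four 32x32 -> 64-bit products."""
    a0, a1 = a & MASK32, a >> 32
    b0, b1 = b & MASK32, b >> 32
    p00 = a0 * b0          # weight 2^0
    p01 = a0 * b1          # weight 2^32
    p10 = a1 * b0          # weight 2^32
    p11 = a1 * b1          # weight 2^64
    return p00 + ((p01 + p10) << 32) + (p11 << 64)

a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
full = (umulh(a, b) << 64) | mul(a, b)
assert full == a * b == mul64_from_32bit_parts(a, b)
```

Both paths give the same 128-bit result; the difference is two multiply instructions on A64 against four partial products plus the shifts and additions needed to combine them.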

  9. Multi-precision Multiplication
      256~2048-bit multiplication on a 64-bit architecture: divide the big integer (256~2048 bits) into small integers (64 bits)

      Method             Computation order        Requirement
      Operand-scanning   Row-wise                 Many registers
      Product-scanning   Column-wise              Efficient MAC routine
      Hybrid-scanning    Mixture of row/column    General processor

      ARMv8 supports 31 × 64-bit registers → operand-scanning (previous works)
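
As an illustration of the column-wise order in the table above, here is a minimal Python sketch of product-scanning on 64-bit limbs. The names `to_limbs`, `from_limbs`, and `product_scanning_mul` are hypothetical, and a Python integer stands in for the multi-word accumulator that a real multiply-accumulate (MAC) routine would keep in registers.

```python
W = 64
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split x into n little-endian 64-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def product_scanning_mul(a, b):
    """Column-wise multiplication of two equal-length limb arrays."""
    n = len(a)
    c = [0] * (2 * n)
    acc = 0  # column accumulator, wider than one word
    for k in range(2 * n - 1):
        # Accumulate every partial product a[i] * b[j] with i + j == k.
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]
        c[k] = acc & MASK   # the low word of the column is a final result word
        acc >>= W           # the rest carries into the next column
    c[2 * n - 1] = acc
    return c

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
y = 0xFFFFFFFFFFFFFFFF_0000000000000001_AAAAAAAAAAAAAAAA_5555555555555555
assert from_limbs(product_scanning_mul(to_limbs(x, 4), to_limbs(y, 4))) == x * y
```

Each result word is written exactly once, which is why this order depends on an efficient MAC routine rather than on a large register file.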

  10. Multi-precision Multiplication Operand-scanning method
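
The slide's figure is not preserved here, so the following is a generic textbook sketch of row-wise (operand-scanning) multiplication on 64-bit limbs, not a reconstruction of the authors' register schedule. The function and helper names are hypothetical.

```python
W = 64
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split x into n little-endian 64-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def operand_scanning_mul(a, b):
    """Row-wise multiplication: one pass over a for every limb of b."""
    n = len(a)
    c = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            # c[i+j] + a[j]*b[i] + carry always fits in 128 bits.
            t = c[i + j] + a[j] * b[i] + carry
            c[i + j] = t & MASK   # low word, as MUL plus additions would give
            carry = t >> W        # high word, as UMULH plus carries would give
        c[i + n] = carry          # this word has not been written by earlier rows
    return c

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
y = 0xFFFFFFFFFFFFFFFF_0000000000000001_AAAAAAAAAAAAAAAA_5555555555555555
assert from_limbs(operand_scanning_mul(to_limbs(x, 4), to_limbs(y, 4))) == x * y
```

Intermediate result words are read and rewritten on every row, so the method benefits when the whole row fits in registers, as it can with 31 general-purpose registers.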

  11. Multi-precision Squaring
      • A special case of multiplication where both operands are the same (i.e., $A = B$)
      • Certain partial products become identical and need to be computed only once (i.e., $A[0] \times B[1] + A[1] \times B[0]$ becomes $2 \times A[0] \times A[1]$ if $A = B$)
      • Two approaches:
        - Doubling the operand (i.e., $2 \times A[0] \rightarrow 2 \times A[0] \times A[1]$)
        - Doubling the result (i.e., $A[0] \times A[1] \rightarrow 2 \times A[0] \times A[1]$) → Sliding-block-doubling
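
The saving from "doubling the result" can be sketched in Python as follows. This is a generic illustration, not the sliding-block-doubling schedule itself (whose details the slide text does not give); `square_by_doubling` and `to_limbs` are hypothetical names, and Python integers are used for accumulation instead of explicit carry handling.

```python
W = 64
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split x into n little-endian 64-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def square_by_doubling(a):
    """Square a limb array; returns (result, number of word multiplications)."""
    n = len(a)
    mults = 0
    off_diagonal = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A[i]*A[j] and A[j]*A[i] are equal, so compute the product once.
            off_diagonal += (a[i] * a[j]) << (W * (i + j))
            mults += 1
    diagonal = 0
    for i in range(n):
        diagonal += (a[i] * a[i]) << (2 * W * i)
        mults += 1
    # Double the accumulated off-diagonal part, then add the diagonal squares.
    return 2 * off_diagonal + diagonal, mults

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
result, mults = square_by_doubling(to_limbs(x, 4))
assert result == x * x
assert mults == 10   # n(n+1)/2 for n = 4, versus n^2 = 16 for a general product
```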

  12. Multi-precision Squaring Sliding-block-doubling method

  13. Karatsuba-Ofman Algorithm
      Number of partial products: school-book $N^2$, Karatsuba-Ofman $N^{\log_2 3}$
      The product $C = A \cdot B$ of two $n$-bit integers $A = A_L + A_H 2^{n/2}$ and $B = B_L + B_H 2^{n/2}$:
      $C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$
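
A one-level version of this formula can be written directly in Python. This is a minimal sketch assuming `n` is even and both inputs fit in `n` bits; `karatsuba_one_level` is a hypothetical name, and Python's built-in `*` stands in for the three half-size multiplications.

```python
def karatsuba_one_level(a, b, n):
    """One Karatsuba level for n-bit operands (n even): three half-size products."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half

    low = a_l * b_l                  # A_L * B_L
    high = a_h * b_h                 # A_H * B_H
    # The two sums may need half + 1 bits each (additive variant).
    mid = (a_l + a_h) * (b_l + b_h) - low - high

    return (high << n) + (mid << half) + low

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
y = 0xFFFFFFFFFFFFFFFF_0000000000000001_AAAAAAAAAAAAAAAA_5555555555555555
assert karatsuba_one_level(x, y, 256) == x * y
```

The sums $A_L + A_H$ and $B_L + B_H$ can overflow $n/2$ bits by one carry bit, which is the operand-size irregularity the subtractive variant avoids.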

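  14. Subtractive Karatsuba Algorithm
      $C = A_H \cdot B_H 2^{n} + [(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H] 2^{n/2} + A_L \cdot B_L$
      $(A_L + A_H) \cdot (B_L + B_H) - A_L \cdot B_L - A_H \cdot B_H = A_L \cdot B_L + A_H \cdot B_H - (A_H - A_L) \cdot (B_H - B_L)$
      The last product is computed on absolute values, $|A_H - A_L| \cdot |B_H - B_L|$, and its sign is restored from the signs of the two differences.
      Advantage: constant operand size ($n/2$) → fast constant-time multiplication
      Requirement: absolute value in two’s complement representation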
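
The subtractive identity can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' routine: `subtractive_karatsuba` is a hypothetical name, `n` is assumed even, and the Python comparison and `abs` calls are not constant-time, so the sketch shows the arithmetic only.

```python
def subtractive_karatsuba(a, b, n):
    """One subtractive Karatsuba level for n-bit operands (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half

    low = a_l * b_l
    high = a_h * b_h

    da = a_h - a_l
    db = b_h - b_l
    # Both |da| and |db| fit in n/2 bits, unlike the sums in the additive form.
    prod = abs(da) * abs(db)
    # (A_H - A_L)(B_H - B_L) is negative exactly when the two signs differ.
    negative = (da < 0) != (db < 0)
    mid = low + high + prod if negative else low + high - prod

    return (high << n) + (mid << half) + low

x = 0xFEDCBA9876543210_0123456789ABCDEF
y = 0x00000000FFFFFFFF_FFFFFFFF00000000
assert subtractive_karatsuba(x, y, 128) == x * y
```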

  15. Multi-precision Multiplication on ARMv8 128-bit Karatsuba multiplication

  16. Multi-precision Squaring on ARMv8 • No need for sign handling of the middle product • $(A_H - A_L) \cdot (A_H - A_L)$ → never negative
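
Applied to squaring, the subtractive identity becomes $A^2 = A_H^2 2^{n} + [A_L^2 + A_H^2 - (A_H - A_L)^2] 2^{n/2} + A_L^2$. A minimal Python sketch, with the hypothetical name `karatsuba_square` and `n` assumed even:

```python
def karatsuba_square(a, n):
    """One Karatsuba level for squaring an n-bit operand (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half

    low = a_l * a_l
    high = a_h * a_h
    d = a_h - a_l
    # d * d is never negative whatever the sign of d, so no sign flag is tracked.
    mid = low + high - d * d

    return (high << n) + (mid << half) + low

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
assert karatsuba_square(x, 256) == x * x
```

All three sub-products are themselves squarings, so a dedicated squaring routine can be reused at every level.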

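  17. Optimizations of instruction set
      Generation of the carry register:
          MOV  X0, #0          // X0 = 0
          …
          ADDS X1, X1, X2      // add and set the carry flag
          ADCS X3, X0, X0      // X3 = 0 + 0 + carry
      Two’s complement:
          SBCS X2, X2, X2      // X2 = 0 if carry set, all ones if carry clear (borrow)
          EOR  X0, X0, X2      // invert X0 when a borrow occurred
          AND  X2, X2, #1      // X2 = 1 on borrow, else 0
          ADD  X0, X0, X2      // add 1 to complete the negation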
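
The two's complement sequence can be modeled on a single 64-bit word in Python. The slide does not show the subtraction that sets the carry flag beforehand, so the sketch assumes one (modeled by the hypothetical helper `sub_with_borrow`); `abs_diff_branch_free` is likewise a made-up name.

```python
MASK64 = (1 << 64) - 1

def sub_with_borrow(x, y):
    """Model of a flag-setting subtract: ((x - y) mod 2^64, borrow bit)."""
    d = x - y
    return d & MASK64, 1 if d < 0 else 0

def abs_diff_branch_free(x, y):
    """Return (|x - y|, borrow) using a mask instead of a branch."""
    diff, borrow = sub_with_borrow(x, y)
    mask = (-borrow) & MASK64             # like SBCS X2, X2, X2: all ones on borrow
    diff ^= mask                          # like EOR: invert every bit on borrow
    diff = (diff + (mask & 1)) & MASK64   # like AND #1 then ADD: add 1 on borrow
    return diff, borrow

assert abs_diff_branch_free(9, 5) == (4, 0)
assert abs_diff_branch_free(5, 9) == (4, 1)
```

The returned borrow bit is the sign information that a subtractive Karatsuba routine would keep in order to add or subtract the middle product afterwards.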

  18. Evaluation • IDE: Xcode • Target: 64-bit ARMv8-A architecture, Apple A7 (APL0698) @ 1.3 GHz • Programming language: assembly • Optimization level: -Ofast

  19. Evaluation

  20. Evaluation

  21. Conclusion • Achievements: efficient implementations of multi-precision multiplication / squaring on ARMv8 • Future work: cryptography implementations (ECC, RSA, SIDH)
