A Galois Field Arithmetic Library

Pakize ŞANAL, MSc Candidate
Supervisor: Asst. Prof. Hüseyin HIŞIL
Yasar University, Faculty of Engineering, Department of Computer Engineering
June 5, 2017

Outline
◮ Content of the bachelor thesis
◮ Studied assembly optimizations
◮ Test results

Content of the bachelor thesis
A Galois Field Arithmetic Library
◮ +, −, ∗.
◮ GF(2^w − c) where w = 127, 128, 255, 256 and GF(2^127 − 1).
◮ Constant time AMD64 Assembly.
◮ Extensive validation and performance tests.
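
To make the field operations listed above concrete, here is a minimal C sketch of addition and subtraction in GF(2^127 − 1). It is not the library's code, which is written in AMD64 assembly. The names `gf127_reduce`, `gf127_add`, `gf127_sub` and `P127` are hypothetical, and the sketch assumes the `unsigned __int128` extension of GCC and Clang. The arithmetic is branch-free at the source level only; whether the compiled code is constant time has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;            /* GCC/Clang extension (assumption) */

/* p = 2^127 - 1 (hypothetical constant name) */
static const u128 P127 = (((u128)1) << 127) - 1;

/* Map any x < 2^128 to its canonical representative in [0, p). */
static u128 gf127_reduce(u128 x) {
    x = (x & P127) + (x >> 127);           /* 2^127 = 1 (mod p); now x <= 2^127 */
    u128 t = (x + 1) >> 127;               /* 1 exactly when x >= p */
    return (x + t) & P127;                 /* subtracts p without a branch */
}

/* a + b mod p, for canonical inputs a, b < p. */
static u128 gf127_add(u128 a, u128 b) {
    assert(a < P127 && b < P127);          /* debug-only check, removed by -DNDEBUG */
    return gf127_reduce(a + b);            /* a + b < 2^128, so no overflow */
}

/* a - b mod p, computed as a + (p - b) so the value never goes negative. */
static u128 gf127_sub(u128 a, u128 b) {
    assert(a < P127 && b < P127);
    return gf127_reduce(a + (P127 - b));
}
```

For example, `gf127_add(P127 - 1, 1)` wraps around to 0 and `gf127_sub(0, 1)` gives p − 1.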

1. By scheduling of the operations
Four digits schoolbook vs. one level recursive schoolbook multiplication vs. . . .

                SCB    RSCB    OSCB
    2^256 − c   38     35      37

[Diagram: (a3, a2, a1, a0) × (b3, b2, b1, b0). The sixteen partial products ai · bj are summed to give a · b; the three variants schedule the partial products in different orders.]
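
As a reference point for the sixteen partial products in the diagram, the following C sketch computes a four-limb schoolbook product row by row. It illustrates the arithmetic only: it does not reproduce the SCB, RSCB or OSCB instruction schedules compared on the slide, and it omits the reduction modulo 2^256 − c. The function name `mul256_scb` is hypothetical, and `unsigned __int128` (a GCC/Clang extension) is assumed.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;            /* GCC/Clang extension (assumption) */

/* r = a * b, where a and b have 4 little-endian 64-bit limbs and r has 8.
 * r must not overlap a or b, because r is cleared before the inputs are read. */
static void mul256_scb(uint64_t r[8], const uint64_t a[4], const uint64_t b[4]) {
    assert(r != a && r != b);
    for (int k = 0; k < 8; k++) r[k] = 0;
    for (int j = 0; j < 4; j++) {          /* one row per limb of b */
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            /* a[i]*b[j] + r[i+j] + carry always fits in 128 bits */
            u128 t = (u128)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[j + 4] = carry;                  /* top limb of this row */
    }
}
```

A hand-scheduled assembly version performs the same sixteen 64 × 64-bit multiplications; what changes between variants is the order of the multiplications and of the carry propagation.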

1. By scheduling of the operations
One level Karatsuba multiplication vs. one level schoolbook multiplication

                Karatsuba   SCB
    2^127 − 1   12          6
    2^127 − c   17          13
    2^128 − c   12          10

[Diagram: (a1, a0) × (b1, b0). Karatsuba computes a1 · b1, a0 · b0 and (a1 + a0) · (b1 + b0), subtracts a1 · b1 and a0 · b0 from the third product, and sums the results to give a · b.]
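
The Karatsuba identity in the diagram can be sketched in C for two-limb operands as follows. This is an illustration under assumptions, not the thesis code: the names `mul128_karatsuba` and `mul128_scb` are hypothetical, `unsigned __int128` (GCC/Clang) is assumed, and the modular reduction is left out. The sums a1 + a0 and b1 + b0 can be 65 bits wide, so each is kept as a 64-bit word plus a carry bit.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;            /* GCC/Clang extension (assumption) */

/* Reference: two-limb schoolbook product with four multiplications. */
static void mul128_scb(uint64_t r[4], const uint64_t a[2], const uint64_t b[2]) {
    r[0] = r[1] = r[2] = r[3] = 0;
    for (int j = 0; j < 2; j++) {
        uint64_t carry = 0;
        for (int i = 0; i < 2; i++) {
            u128 t = (u128)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[j + 2] = carry;
    }
}

/* One-level Karatsuba: three multiplications instead of four. */
static void mul128_karatsuba(uint64_t r[4], const uint64_t a[2], const uint64_t b[2]) {
    u128 lo = (u128)a[0] * b[0];           /* a0 * b0 */
    u128 hi = (u128)a[1] * b[1];           /* a1 * b1 */

    /* 65-bit sums, each stored as a 64-bit word and a carry bit. */
    uint64_t sa = a[0] + a[1], ca = sa < a[0];
    uint64_t sb = b[0] + b[1], cb = sb < b[0];

    /* (a1 + a0) * (b1 + b0) = m_hi * 2^128 + m_lo */
    u128 m_lo = (u128)sa * sb;
    uint64_t m_hi = ca & cb;
    u128 cross = (u128)(((uint64_t)0 - ca) & sb) + (((uint64_t)0 - cb) & sa);
    u128 t = m_lo + (cross << 64);
    m_hi += (uint64_t)(cross >> 64) + (t < m_lo);
    m_lo = t;

    /* Subtract a0*b0 and a1*b1, leaving the middle term a1*b0 + a0*b1. */
    m_hi -= (m_lo < lo);  m_lo -= lo;
    m_hi -= (m_lo < hi);  m_lo -= hi;

    /* a * b = hi * 2^128 + middle * 2^64 + lo */
    r[0] = (uint64_t)lo;
    u128 acc = (lo >> 64) + (uint64_t)m_lo;
    r[1] = (uint64_t)acc;
    acc = (acc >> 64) + (m_lo >> 64) + (uint64_t)hi;
    r[2] = (uint64_t)acc;
    acc = (acc >> 64) + m_hi + (hi >> 64);
    r[3] = (uint64_t)acc;
    assert((acc >> 64) == 0);              /* the product fits in 256 bits */
}
```

The carry handling around the middle term is the extra work Karatsuba pays for saving one multiplication. That overhead is one plausible reason why the schoolbook version has the lower count in the table above for all three fields.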

2. By making optimizations
Register optimization

[Diagram: partial products ai · bj of (a3, a2, a1, a0) × (b3, b2, b1, b0), summed to give a · b.]

    // ...
    movq 8*0(%r8), %rax
    mulq 8*0(%r9)
    movq %rax, %rbx
    movq %rdx, %rsi
    movq 8*1(%r8), %rax
    mulq 8*1(%r9)
    movq %rax, %r10
    movq %rdx, %r11
    movq 8*1(%r8), %rax
    mulq 8*0(%r9)
    addq %rax, %rsi
    adcq %rdx, %r10
    adcq $0, %r11
    movq 8*0(%r8), %rax
    mulq 8*1(%r9)
    addq %rax, %rsi
    adcq %rdx, %r10
    adcq $0, %r11
    movq %rbx, 8*0(%rdi)
    movq %rsi, 8*1(%rdi)
    // ...

Listing 1: <GF(2^255 − c), ∗>

3. By using special instructions
The instruction cmovxx: Conditional Move

    if r13 = 0 then          if r12 = 0 then
        Return 0.                Return 0.
    else                     else
        Return r14.              Return r15.
    end                      end

[Diagram: (r13, r12) × (r15, r14) with partial products r12 · r14, r13 · r14 and r12 · r15, summed to give a · b.]

    // ...
    movq %r12, %rax
    mulq %r14
    movq $0, %rbp
    cmp $0, %r13
    cmovz %rbp, %r14
    cmp $0, %r15
    cmovz %rbp, %r12
    andq %r13, %r15
    addq %r12, %rdx
    adcq $0, %rbp
    addq %r14, %rdx
    adcq %r15, %rbp
    // ...

Listing 2: <GF(2^128 − c), ∗>
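
The effect of the conditional moves can be imitated in C with a mask, as in the sketch below. It rests on an assumption that the slide does not state: that the upper words (r13 and r15 in Listing 2) hold a single bit, 0 or 1, so that a product such as r13 · r14 is either 0 or r14. The names `select_or_zero`, `cross_terms` and `cross_t` are hypothetical. A compiler may or may not emit `cmov` for this code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns x when bit == 1 and 0 when bit == 0, using a mask, not a branch. */
static uint64_t select_or_zero(uint64_t bit, uint64_t x) {
    assert(bit <= 1);                      /* assumption: bit is 0 or 1 */
    return x & ((uint64_t)0 - bit);        /* mask is all zeros or all ones */
}

typedef struct {
    uint64_t lo;     /* low 64 bits of a_top*b_lo + a_lo*b_top */
    uint64_t carry;  /* carry out of that 64-bit sum */
    uint64_t top;    /* a_top * b_top, a single bit */
} cross_t;

/* Parameter roles follow Listing 2: a_lo ~ r12, a_top ~ r13,
 * b_lo ~ r14, b_top ~ r15 (a_top and b_top assumed to be bits). */
static cross_t cross_terms(uint64_t a_lo, uint64_t a_top,
                           uint64_t b_lo, uint64_t b_top) {
    cross_t c;
    uint64_t t0 = select_or_zero(a_top, b_lo);   /* a_top * b_lo */
    uint64_t t1 = select_or_zero(b_top, a_lo);   /* a_lo * b_top */
    c.lo = t0 + t1;
    c.carry = c.lo < t0;
    c.top = a_top & b_top;
    return c;
}
```

Replacing a multiplication by a bit with a select avoids a `mulq` and keeps the running time independent of the bit's value, which matters for constant-time code.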

3. By using special instructions
The instruction btxx: Bit Test and Reset

[Diagram: the four-word product (r11, r10, r9, r8) is split into (r11, r10) and (r9, r8); the two halves are added and the carry out of bit 127 is folded back into (r11, r10).]

    // ...
    /* r11, r10, r9, r8 */
    shlq $1, %r11
    btrq $63, %r10
    adcq $0, %r11
    shlq $1, %r10
    btrq $63, %r9
    adcq $0, %r10

    addq %r8, %r10
    adcq %r9, %r11

    btrq $63, %r11
    adcq $0, %r10
    adcq $0, %r11
    // ...

Listing 3: <GF(2^127 − 1), ∗>

Faster compact Diffie-Hellman: Endomorphisms on the x-line. C. Costello, H. Hisil, and B. Smith.
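
The folding idea behind Listing 3 rests on 2^127 ≡ 1 (mod 2^127 − 1): the bits above position 126 can be added back onto the low 127 bits. The C sketch below illustrates this under assumptions: the names `mul128_wide`, `gf127_mul` and `P127` are hypothetical, `unsigned __int128` (GCC/Clang) is assumed, and the last step brings the result into the canonical range [0, p), which Listing 3 does not show.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;            /* GCC/Clang extension (assumption) */

static const u128 P127 = (((u128)1) << 127) - 1;   /* p = 2^127 - 1 */

/* Full 128 x 128 -> 256-bit product, returned as hi * 2^128 + lo. */
static void mul128_wide(u128 a, u128 b, u128 *hi, u128 *lo) {
    uint64_t a0 = (uint64_t)a, a1 = (uint64_t)(a >> 64);
    uint64_t b0 = (uint64_t)b, b1 = (uint64_t)(b >> 64);
    u128 p00 = (u128)a0 * b0, p01 = (u128)a0 * b1;
    u128 p10 = (u128)a1 * b0, p11 = (u128)a1 * b1;
    u128 mid = (p00 >> 64) + (uint64_t)p01 + (uint64_t)p10;
    *lo = (u128)(uint64_t)p00 | (mid << 64);
    *hi = p11 + (p01 >> 64) + (p10 >> 64) + (mid >> 64);
}

/* a * b mod p for canonical inputs a, b < p. */
static u128 gf127_mul(u128 a, u128 b) {
    assert(a < P127 && b < P127);          /* debug-only check */
    u128 hi, lo;
    mul128_wide(a, b, &hi, &lo);           /* x = a * b < 2^254 */
    u128 top = (hi << 1) | (lo >> 127);    /* floor(x / 2^127), below 2^127 */
    u128 low = lo & P127;                  /* x mod 2^127 */
    u128 s = low + top;                    /* first fold, s < 2^128 */
    s = (s & P127) + (s >> 127);           /* second fold, s <= 2^127 */
    return (s + ((s + 1) >> 127)) & P127;  /* canonical result in [0, p) */
}
```

As a quick check, `gf127_mul(P127 - 1, P127 - 1)` returns 1, since p − 1 ≡ −1 (mod p).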

3. By using special instructions
Comparing with the MPFQ library: <GF(2^127 − 1), ∗>

This library: 33 instructions, 6 clock cycles. MPFQ: 45 instructions, 9 clock cycles.

    // ...
    /* r11, r10, r9, r8 */
    shlq $1, %r11
    btrq $63, %r10
    adcq $0, %r11
    shlq $1, %r10
    btrq $63, %r9
    adcq $0, %r10

    addq %r8, %r10
    adcq %r9, %r11

    btrq $63, %r11
    adcq $0, %r10
    adcq $0, %r11
    // ...

Listing 4: Reduction part of my schoolbook code

    // ...
    /* r11, r10, r9, r8 */
    movq $9223372036854775807, %rax
    movq %r9, %r12
    andq %rax, %r9
    shrq $63, %r12
    movq %r10, %rdx
    shlq $1, %r10
    orq %r10, %r12
    shlq $1, %r11
    shrq $63, %rdx
    orq %r11, %rdx
    addq %r12, %r8
    adcq %rdx, %r9
    movq %r9, %r12
    andq %rax, %r9
    shlq $1, %r12
    adcq $0, %r8
    adcq $0, %r9
    // ...

Listing 5: Reduction part of the MPFQ schoolbook code

https://www.imsc.res.in/~ecc14/slides/hisil.pdf

Test Results

Timing benchmarks were taken on an Intel Core i7-6500U processor running Ubuntu 14.04.5 LTS, with Turbo Boost disabled and all cores but one switched off (i.e., hyperthreading is disabled). The executables were built with GNU gcc version 4.8.4 with the -O2 flag set and GNU assembler version 2.24.

Multiplication timings in clock cycles:

                Karatsuba   Schoolbook (SCB)   Recursive SCB
    2^127 − 1   12          6                  -
    2^127 − c   17          13                 -
    2^128 − c   12          10                 -
    2^255 − c   -           46                 40
    2^256 − c   -           38                 34

    /* libraries */
    #define TRIAL 100000000000

    int main() {
        long long st, fn;
        st = cpucycles();
        unsigned long an[2], bn[2], cn[2];
        an[0] = (unsigned long)rand() * (unsigned long)rand();
        an[1] = (unsigned long)rand() * (unsigned long)rand();
        bn[0] = (unsigned long)rand() * (unsigned long)rand();
        bn[1] = (unsigned long)rand() * (unsigned long)rand();
        cn[0] = (unsigned long)rand() * (unsigned long)rand();
        cn[1] = (unsigned long)rand() * (unsigned long)rand();
        unsigned long int i;
        for (i = 0; i < TRIAL; i++) {
            mul127_scb_v01(an, bn, cn);
            an[0] = bn[1];
            an[1] = cn[0];
            bn[0] = an[1];
            bn[1] = cn[1];
            cn[0] = an[1];
            cn[1] = bn[0];
        }
        fn = cpucycles();
        double first = ((double)fn - st) / TRIAL;
        st = cpucycles();
        for (i = 0; i < TRIAL; i++) {
            mul127_scb_test(an, bn, cn);
            an[0] = bn[1];
            an[1] = cn[0];
            bn[0] = an[1];
            bn[1] = cn[1];
            cn[0] = an[1];
            cn[1] = bn[0];
        }
        fn = cpucycles();
        double second = ((double)fn - st) / TRIAL;
        printf("net clock cycle: %lf\n\n", first - second);
        return 1;
    }

Listing 6: A performance test