Efficient Implementation of Modular Division by Input Bit Splitting
Danila Gorodecky, National Academy of Science of Belarus, Minsk, Belarus, danila.gorodecky@gmail.com
Tiziano Villa, University of Verona, Verona, Italy, tiziano.villa@univr.it
Motivation
– for historical and algorithmic reasons, the residue number system (RNS) is not a common arithmetic approach; the topic has not been studied properly and is rarely implemented in hardware;
– there are a number of promising applications, such as parallel computing and cryptography;
– there is no "efficient" hardware approach to computing X mod P (it is not synthesizable in state-of-the-art EDA tools).
Implementation of RNS
– digital filtering with finite impulse response (FIR filtering);
– cryptosystem of the Federal Reserve System of the USA;
– space flight control (Russia);
– data transfer between space satellites and Earth (Russia);
– air defense systems (USA, Russia).
Common architecture of computation in RNS
Block diagram: the operands $A_1, A_2, \dots, A_n$ enter (1) a converter of positional numbers to modular representation; m parallel summator/multiplier channels, (mod $p_1$), (mod $p_2$), ..., (mod $p_m$), produce the residues $S_1, S_2, \dots, S_m$; (2) a converter of modular representation to a positional number outputs S.
Channel i computes
$A_1 + A_2 + \dots + A_n \equiv S_i \pmod{p_i}$ or $A_1 \cdot A_2 \cdots A_n \equiv R_i \pmod{p_i}$, $i = 1, \dots, m$,
and the result S (or R) is obtained modulo P, where
$P = \prod_{i=1}^{m} p_i$.
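The three stages of this architecture can be modelled in software. The following is a minimal Python sketch rather than the authors' hardware design: the moduli (5, 7, 9), the function names, and the use of the textbook Chinese remainder theorem for stage (2) are assumptions made for illustration.

```python
from math import prod

MODULI = (5, 7, 9)   # assumed pairwise coprime moduli p_1..p_m
P = prod(MODULI)     # results are unique modulo P = 315


def to_rns(x):
    """Stage (1): forward conversion, positional number -> residues."""
    return tuple(x % p for p in MODULI)


def rns_sum(operands):
    """One independent adder channel per modulus."""
    return tuple(sum(col) % p for col, p in zip(zip(*operands), MODULI))


def rns_product(operands):
    """One independent multiplier channel per modulus."""
    result = []
    for col, p in zip(zip(*operands), MODULI):
        r = 1
        for residue in col:
            r = (r * residue) % p
        result.append(r)
    return tuple(result)


def from_rns(residues):
    """Stage (2): reverse conversion via the Chinese remainder theorem."""
    total = 0
    for r, p in zip(residues, MODULI):
        partial = P // p
        # pow(x, -1, p) is the modular inverse (Python 3.8+)
        total += r * partial * pow(partial, -1, p)
    return total % P


values = [12, 7, 3]
residues = [to_rns(v) for v in values]
print(from_rns(rns_sum(residues)))      # sum of the values, modulo 315
print(from_rns(rns_product(residues)))  # product of the values, modulo 315
```

The channels never exchange carries, which is what makes the middle stage parallel; the cost is concentrated in the two converters.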
Example of computation in RNS
Operands are represented by their residues:
$A = \{A_1, A_2, A_3\}$, $A_i = A \pmod{p_i}$, $0 \le A \le 314$,
$B = \{B_1, B_2, B_3\}$, $B_i = B \pmod{p_i}$, $0 \le B \le 314$, $i = 1, 2, 3$.
The positional result is reconstructed as
$S = S_1 Y_1 + S_2 Y_2 + S_3 Y_3 - rP$, $r = 0, 1, 2, \dots$,
where $Y_i = \frac{P}{p_i} k_i$, $Y_i \equiv 1 \pmod{p_i}$, $k_i \in \{1, \dots, p_i\}$.
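A worked instance of the reconstruction formula may help. This is a hedged sketch: the slide does not list the moduli, so (5, 7, 9) is assumed here because the product P = 315 matches the stated range 0 to 314, and the helper names are illustrative.

```python
MODULI = (5, 7, 9)   # assumption: the moduli are not stated on the slide
P = 5 * 7 * 9        # 315, so operands lie in 0..314


def weight(p):
    """Find Y = (P / p) * k with Y = 1 (mod p) by trying k = 1..p."""
    base = P // p
    for k in range(1, p + 1):
        if (base * k) % p == 1:
            return base * k
    raise ValueError("moduli must be pairwise coprime")


WEIGHTS = tuple(weight(p) for p in MODULI)


def reconstruct(residues):
    """S = S_1*Y_1 + S_2*Y_2 + S_3*Y_3 - r*P, with r chosen so 0 <= S < P."""
    weighted = sum(s * y for s, y in zip(residues, WEIGHTS))
    r = weighted // P
    return weighted - r * P


A, B = 100, 57
A_res = tuple(A % p for p in MODULI)
B_res = tuple(B % p for p in MODULI)
# Channel-wise addition of the residues
S_res = tuple((a + b) % p for a, b, p in zip(A_res, B_res, MODULI))
print(WEIGHTS, reconstruct(S_res))
```

With these assumed moduli the weights come out as 126, 225, and 280, and the reconstructed sum equals A + B as long as the true result stays below P.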
Main hardware design problems
• Realization of X (mod P) for an arbitrary P (forward conversion): most EDA tools report a synthesis error.
• Transformation of the modular representation into positional numbers (reverse conversion): it results in either a long critical path or a large hardware cost.
Approaches to X (mod P) design
[1] P.V.A. Mohan, "Residue Number System. Theory and Applications", Springer International Publishing, 2016, 351 p.
[2] J.T. Butler and T. Sasao, "Fast hardware computation of x mod z", 25th IEEE International Parallel and Distributed Processing Symposium, Anchorage, AK, USA, May 16-17, 2011, pp. 289-292.
[3] Mark A. Will and Ryan K. L. Ko, "Computing Mod Without Mod".
Input bit splitting approach
• splitting the input into small tuples (up to 12 bits each);
• Boolean minimization (disjunctive normal form, Reed-Muller expansion, binary decision tree, majority graph).
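One way to picture the splitting step in software is sketched below, under stated assumptions. Each tuple indexes a small table holding the residue of its weighted value, so every table is a Boolean function of at most `tuple_bits` inputs that a minimizer could then reduce. The function names, the table representation, and the final `%` reduction are illustrative choices, not the authors' circuit.

```python
def build_tables(total_bits, tuple_bits, P):
    """One table per tuple: tables[j][t] = (t * 2**(j*tuple_bits)) mod P."""
    count = -(-total_bits // tuple_bits)  # ceiling division
    return [
        [(t << (j * tuple_bits)) % P for t in range(1 << tuple_bits)]
        for j in range(count)
    ]


def mod_by_splitting(x, tables, tuple_bits, P):
    """Sum the per-tuple residues; only this small sum is reduced at the end."""
    mask = (1 << tuple_bits) - 1
    acc = 0
    for j, table in enumerate(tables):
        acc += table[(x >> (j * tuple_bits)) & mask]
    return acc % P


tables = build_tables(100, 10, 997)
x = (1 << 100) - 1
print(mod_by_splitting(x, tables, 10, 997) == x % 997)
```

The accumulated sum is bounded by the number of tuples times P, so the final reduction works on a far narrower value than the original input.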
Fourier transformation multiplication (FTM)
FTM splits the binary vectors into m-bit tuples and multiplies the tuples by each other:
A = (1 … m | m+1 … 2m | 2m+1 … 3m | … | (k-1)m+1 … km)
B = (1 … m | m+1 … 2m | 2m+1 … 3m | … | (k-1)m+1 … km)
R = (1 … m | m+1 … 2m | … | (k-1)m+1 … km | … | 2k(m-1) … 2km)
There are $k^2$ operands of 2m bits in FTM:
$R = \sum_{i=1}^{k} \sum_{j=1}^{k} A_i B_j \, 2^{m(i+j-2)}$
This leads to an adder tree with $\log_2 k^m$ levels.
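The double sum can be checked against ordinary integer multiplication. The sketch below is a software model only; the function names are hypothetical, and tuples are numbered from the least significant end so that $A_1$ holds bits 1 to m.

```python
def split(x, k, m):
    """Return the k m-bit tuples of x, least significant first (A_1 .. A_k)."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]


def ftm_multiply(a, b, k, m):
    """R = sum over i, j of A_i * B_j * 2**(m*(i+j-2)), with 1-based i, j."""
    A, B = split(a, k, m), split(b, k, m)
    total = 0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            total += (A[i - 1] * B[j - 1]) << (m * (i + j - 2))
    return total


a, b = 0xBEEF, 0x1234
print(ftm_multiply(a, b, k=4, m=4) == a * b)
```

In hardware each of the $k^2$ small products is independent, so the multipliers run in parallel and the delay is dominated by the adder tree that accumulates them.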
Multiplication in Synopsys in 28 nm technology
[Figure: bar chart "Comparison of monolith and Synopsys multipliers (regular arithmetic)", frequency in GHz for multiplier sizes 2x2, 3x3, 4x4, 5x5, 6x6, 7x7, and 8x8; the plotted values span 1.75 to 6.66 GHz.]
FTM for both arithmetics
Example. $A \times B = R$, where A and B are 14-bit operands.
$A = (A_4, A_3, A_2, A_1)$, $B = (B_4, B_3, B_2, B_1)$
$A_1 = (a_4, a_3, a_2, a_1)$, $A_2 = (a_8, a_7, a_6, a_5)$, $A_3 = (a_{12}, a_{11}, a_{10}, a_9)$, $A_4 = (a_{14}, a_{13})$
$B_1 = (b_4, b_3, b_2, b_1)$, $B_2 = (b_8, b_7, b_6, b_5)$, $B_3 = (b_{12}, b_{11}, b_{10}, b_9)$, $B_4 = (b_{14}, b_{13})$
Multiplication in regular arithmetic, where R is a 28-bit vector:
$R = A_1 B_1 + A_1 B_2 \, 2^{4} + A_1 B_3 \, 2^{8} + A_1 B_4 \, 2^{12}$
$\;\; + A_2 B_1 \, 2^{4} + A_2 B_2 \, 2^{8} + A_2 B_3 \, 2^{12} + A_2 B_4 \, 2^{16}$
$\;\; + A_3 B_1 \, 2^{8} + A_3 B_2 \, 2^{12} + A_3 B_3 \, 2^{16} + A_3 B_4 \, 2^{20}$
$\;\; + A_4 B_1 \, 2^{12} + A_4 B_2 \, 2^{16} + A_4 B_3 \, 2^{20} + A_4 B_4 \, 2^{24}$.
Adder-tree levels reduction
FTM needs $\log_2 k^m$ levels in the adder tree, where k is the number of m-bit subvectors. We propose techniques to upgrade the architectures of monolith-based multipliers and, as a consequence, to increase the speed of calculation. The principle of adder-tree optimization consists in concatenating separate results of monolith multiplication. For instance, for k = 4 and m = 4, the following operands can be joined into one vector:
$R = \dots + A_1 B_1 + A_1 B_3 \, 2^{8} + A_2 B_4 \, 2^{16} + A_4 B_4 \, 2^{24} + \dots$
Here & denotes concatenation:
$(A_1 B_1)$: 8 bits
& $(A_1 B_3, 00000000)$: 8 bits followed by 8 zero bits
& $(A_2 B_4, 0000000000000000)$: 8 bits followed by 16 zero bits
& $(A_4 B_4, 000000000000000000000000)$: 8 bits followed by 24 zero bits
gives the single vector $(A_4 B_4, A_2 B_4, A_1 B_3, A_1 B_1)$.
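The reason concatenation can replace addition here is that the four shifted products occupy disjoint 8-bit fields. The following sketch checks this for k = 4 and m = 4; the operand values and helper names are arbitrary choices for illustration.

```python
def tuples(x, k=4, m=4):
    """Split x into k m-bit tuples, least significant first (A_1 .. A_k)."""
    mask = (1 << m) - 1
    return [(x >> (m * i)) & mask for i in range(k)]


def concat(fields, width):
    """Concatenate fields, most significant first, each `width` bits wide."""
    out = 0
    for f in fields:
        if not 0 <= f < (1 << width):
            raise ValueError("field does not fit in the given width")
        out = (out << width) | f
    return out


a, b = 0xC3A5, 0x9F17  # arbitrary 16-bit operands
A, B = tuples(a), tuples(b)
p11, p13, p24, p44 = A[0] * B[0], A[0] * B[2], A[1] * B[3], A[3] * B[3]

added = p11 + (p13 << 8) + (p24 << 16) + (p44 << 24)  # three additions
joined = concat([p44, p24, p13, p11], 8)              # wiring only, no adders
print(added == joined)
```

A product of two 4-bit tuples is at most 225, so it always fits in its 8-bit field and no carry can cross into the neighbouring field.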
Efficient architecture
Example. $A \times B = R$, where A and B are 14-bit operands and R is 28 bits.
$A = (A_4, A_3, A_2, A_1)$, $B = (B_4, B_3, B_2, B_1)$
Partial products:
$A_1 B_1 = R_1$, $A_2 B_1 = R_5$, $A_3 B_1 = R_9$, $A_4 B_1 = R_{13}$
$A_1 B_2 = R_2$, $A_2 B_2 = R_6$, $A_3 B_2 = R_{10}$, $A_4 B_2 = R_{14}$
$A_1 B_3 = R_3$, $A_2 B_3 = R_7$, $A_3 B_3 = R_{11}$, $A_4 B_3 = R_{15}$
$A_1 B_4 = R_4$, $A_2 B_4 = R_8$, $A_3 B_4 = R_{12}$, $A_4 B_4 = R_{16}$
Sum of concatenated vectors:
$R = (R_{16}, R_{11}, R_3, R_1) + (R_{12}, R_7, R_2, 0000)$
$\;\; + (R_{15}, R_{10}, R_5, 0000) + (R_8, R_6, 00000000)$
$\;\; + (R_{14}, R_9, 00000000) + (R_4 + R_{13}, 000000000000)$
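The grouping above can be verified with a short software model. This is a sketch with hypothetical names, assuming the tuple layout of the example (three 4-bit tuples and a 2-bit top tuple); it models the arithmetic only, not the timing of the circuit.

```python
OFFSETS = (0, 4, 8, 12)  # start bit of tuples 1..4
WIDTHS = (4, 4, 4, 2)    # the top tuple holds bits 13 and 14 only

# Indices n of the products R_n that are concatenated into one vector
CONCAT_GROUPS = [(16, 11, 3, 1), (12, 7, 2), (15, 10, 5), (8, 6), (14, 9)]


def split14(x):
    """Split a 14-bit operand into its four tuples, least significant first."""
    return [(x >> off) & ((1 << w) - 1) for off, w in zip(OFFSETS, WIDTHS)]


def multiply14(a, b):
    A, B = split14(a), split14(b)

    def field(n):
        """R_n = A_(i+1) * B_(j+1), already shifted to its bit offset."""
        i, j = divmod(n - 1, 4)
        return (A[i] * B[j]) << (OFFSETS[i] + OFFSETS[j])

    vectors = []
    for group in CONCAT_GROUPS:
        vector = 0
        for n in group:
            vector |= field(n)  # disjoint bit ranges, so OR acts as concatenation
        vectors.append(vector)
    vectors.append(field(4) + field(13))  # (R_4 + R_13, 000000000000)
    return sum(vectors)


print(multiply14(0x2ABC, 0x1357) == 0x2ABC * 0x1357)
```

The final sum has six operands instead of sixteen, which is where the reduction in adder-tree depth comes from.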
Multiplication in Synopsys in 28 nm technology
[Figure: bar chart "Comparison of monolith-based and Synopsys multipliers (regular arithmetic)", frequency in GHz for multiplier sizes 8x8, 10x10, 12x12, ..., 32x32; the plotted values span 1.22 to 2.08 GHz.]
Fourier transformation for X (mod P)
Two-level Boolean function minimization
Boolean function minimization
Espresso (Berkeley, USA) or ELS (Minsk, Belarus):
- DNF minimization (two-level minimization);
- Binary Decision Diagram (BDD) minimization (multi-level minimization);
- minimization in the class of Reed-Muller expansions;
- etc.
Verilog for X[100:1] (mod 997)

module x_100_mod_997(
    input  [100:1] X,
    output [10:1]  R
);
    // The weight of the j-th 10-bit tuple is 2^(10*j) mod 997.
    wire [22:1] R_temp_1;  // R_temp_1 < 3566179 (22-bit number)
    wire [15:1] R_temp_2;  // R_temp_2 < 30831 (15-bit number)
    wire [11:1] R_temp_3;  // R_temp_3 < 1833 (11-bit number)
    reg  [10:1] R_temp;

    // First folding step: 100 bits -> 22 bits
    assign R_temp_1 = X[10:1]
                    + X[20:11]  * 5'b11011
                    + X[30:21]  * 10'b1011011001
                    + X[40:31]  * 10'b1011100100
                    + X[50:41]  * 6'b101000
                    + X[60:51]  * 7'b1010011
                    + X[70:61]  * 8'b11110111
                    + X[80:71]  * 10'b1010101111
                    + X[90:81]  * 10'b1001011011
                    + X[100:91] * 9'b101001001;

    // Second folding step: 22 bits -> 15 bits
    assign R_temp_2 = R_temp_1[10:1]
                    + R_temp_1[20:11] * 5'b11011
                    + R_temp_1[22:21] * 10'b1011011001;

    // Third folding step: 15 bits -> 11 bits
    assign R_temp_3 = R_temp_2[10:1]
                    + R_temp_2[15:11] * 5'b11011;

    // Final correction: subtract 997 (10'b1111100101) at most once
    always @(*) begin
        if (R_temp_3 >= 10'b1111100101)
            R_temp = R_temp_3 - 10'b1111100101;
        else
            R_temp = R_temp_3;
    end

    assign R = R_temp;
endmodule
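A bit-accurate software model is a convenient way to check the constants and bounds used in the module. The sketch below mirrors the three folding steps and the final correction in Python; the helper `fold` is a hypothetical name, and the model assumes the weights are the residues of $2^{10j}$ modulo 997, which matches the binary constants in the listing.

```python
P = 997
# Residues of 2**(10*j) mod 997: 1, 27, 729, 740, 40, 83, 247, 687, 603, 329
WEIGHTS = [pow(2, 10 * j, P) for j in range(10)]


def fold(value, n_bits):
    """Weighted sum of the 10-bit tuples of an n_bits-wide value."""
    total = 0
    for j in range(-(-n_bits // 10)):  # ceiling division
        total += ((value >> (10 * j)) & 0x3FF) * WEIGHTS[j]
    return total


def x_100_mod_997(x):
    """Software model of the module: three folds, then one conditional subtract."""
    if not 0 <= x < (1 << 100):
        raise ValueError("x must be a 100-bit unsigned value")
    r1 = fold(x, 100)  # R_temp_1, fits in 22 bits
    r2 = fold(r1, 22)  # R_temp_2, fits in 15 bits
    r3 = fold(r2, 15)  # R_temp_3, fits in 11 bits
    return r3 - P if r3 >= P else r3


x = (1 << 100) - 1
print(x_100_mod_997(x) == x % P)
```

Each fold preserves the value modulo 997 while shrinking the bit width, so a single comparison and subtraction is enough once the intermediate result is below 2 x 997.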
Performance in Synopsys in 28 nm for X (mod P)
[Figure: three bar charts of frequency (MHz) versus modulus P = 19, 53, 113, 241, 461, 977, 2011, 4051, comparing the mod P approach with Synopsys for the input bit ranges [300:1], [400:1], and [500:1]. The plotted values span 43 to 775 MHz, 26 to 724 MHz, and 20 to 719 MHz, respectively.]