An Efficient Softcore Multiplier Architecture for Xilinx FPGAs
22nd IEEE Symposium on Computer Arithmetic
Martin Kumm, Shahid Abbas and Peter Zipf
University of Kassel, Germany
CONTENTS
1. State-of-the-art
2. Proposed multiplier
3. Results
WHY FPGA SOFTCORE MULTIPLIERS?
The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks.
FPGA softcore multipliers are still required for:
- Small word sizes (which map poorly onto embedded multipliers)
- Large word sizes ("fill gaps")
- Replacing embedded multipliers on small/low-cost FPGAs
WHY THEY ARE DIFFERENT?
- Research on efficient multipliers has been ongoing for more than 50 years
- Multipliers that are efficient in terms of gates may not be efficient on FPGAs
- FPGA-optimized structures are relatively rare
WHY THEY ARE DIFFERENT?
[Figure: Xilinx slice, 6/7 series]
PREVIOUS WORK
- A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]
- Two partial products are generated and added using the carry chain
- A compression tree is still necessary to add the already-reduced partial products (PPs)
[Figure: slice with four LUTs feeding the carry logic; one stage is labeled "full adder"]
PREVIOUS WORK
Another idea was discussed in [Brunie 2013]:
- Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
- Use a compression tree to add the partial results
$$
\begin{aligned}
p ={} & M_1 + 2^3 M_2 + 2^6 M_3 + \dots \\
      & \dots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \dots \\
      & \dots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
\end{aligned}
$$
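The decomposition above can be illustrated in software. The following is a minimal sketch, assuming unsigned operands and square 3x3 tiles; the function name `tiled_multiply` and its parameters are hypothetical and not taken from [Brunie 2013]. The plain integer sum stands in for the compression tree.

```python
def tiled_multiply(a: int, b: int, width: int, tile: int = 3) -> int:
    """Multiply two unsigned `width`-bit integers from tile x tile sub-products."""
    mask = (1 << tile) - 1
    product = 0
    for i in range(0, width, tile):        # bit offset of the chunk taken from a
        a_chunk = (a >> i) & mask
        for j in range(0, width, tile):    # bit offset of the chunk taken from b
            b_chunk = (b >> j) & mask
            # One small multiplier result M_k, weighted by 2^(i+j);
            # the running sum plays the role of the compression tree.
            product += (a_chunk * b_chunk) << (i + j)
    return product
```

For a 9x9-bit multiplication this produces nine sub-products with weights 2^0, 2^3, 2^6, 2^3, 2^6, 2^9, 2^6, 2^9 and 2^12, matching the weights in the sum above.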
BOOTH RECODING
$$a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot \mathrm{BE}_m \, 2^m$$
b_{m+1}  b_m  b_{m-1}  |  BE_m  |  z_m  c_m  s_m
   0      0      0     |    0   |   1    0    0
   0      0      1     |    1   |   0    0    0
   0      1      0     |    1   |   0    0    0
   0      1      1     |    2   |   0    0    1
   1      0      0     |   -2   |   0    1    1
   1      0      1     |   -1   |   0    1    0
   1      1      0     |   -1   |   0    1    0
   1      1      1     |    0   |   1    0    0
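The BE_m column of the table follows the rule BE_m = -2 b_{m+1} + b_m + b_{m-1}. The following is a minimal sketch of that recoding and of the resulting product sum, assuming an unsigned operand b whose bits outside positions 0 to M-1 are zero; the function names are hypothetical, and the control signals z_m, c_m and s_m are not modeled.

```python
def booth_digits(b: int, m_bits: int) -> list[int]:
    """Radix-4 Booth digits BE_m for m = 0, 2, 4, ... of an unsigned m_bits-wide b."""
    def bit(k: int) -> int:
        # Assumption: b_{-1} and every bit at or above position m_bits is 0.
        return (b >> k) & 1 if 0 <= k < m_bits else 0

    # The range ends at m_bits inclusive so that the digit covering the MSB is kept.
    return [-2 * bit(m + 1) + bit(m) + bit(m - 1) for m in range(0, m_bits + 1, 2)]


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """Sum a * BE_m * 2^m over all even m, as in the recoding formula."""
    return sum(a * be * (1 << (2 * i)) for i, be in enumerate(booth_digits(b, m_bits)))
```

Each digit lies in {-2, -1, 0, 1, 2}, so every partial product is a shifted and possibly negated copy of a, and only about half as many rows are needed as in a plain array multiplier.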
BOOTH MULTIPLIER
[Figure: partial-product array of the Booth multiplier from LSB to MSB, shown in two steps: first with rows extended by repeated c_0, c_2, c_4 and c_6 bits, then a compacted array using constant 1s and single c_m bits]
PROPOSED ARCHITECTURE
[Figure: proposed architecture mapped to a slice: four LUTs feeding the carry logic; one stage is labeled "full adder"]
RESULTS
- The number of slices can be precisely predicted:
$$\#\text{slices}(M, N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}$$
- The design was implemented as generic VHDL
- A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost
- Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]
- Xilinx Coregen was used as a commercial reference
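The slice prediction is easy to evaluate for a given word size. The following is a minimal sketch using integer arithmetic only; the function name `predicted_slices` is hypothetical, and the reading of M as the word size of the Booth-recoded operand and N as the word size of the other operand is inferred from the formula's annotations.

```python
def predicted_slices(m: int, n: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n // 4) + 1   # ceil(N/4) + 1, equal to ceil(N/4 + 1)
    rows = m // 2 + 1                 # floor(M/2) + 1, equal to floor(M/2 + 1)
    return slices_per_row * rows


if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(size, predicted_slices(size, size))
```

For a 16x16-bit multiplier this gives 5 slices per row and 9 rows, i.e. 45 slices.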
RESULTS VIRTEX 6 COMBINATORIAL, SLICES
[Plot: #Slices over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.
[Plot: slice reduction (%) over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 COMBINATORIAL, FREQ.
[Plot: frequency [MHz] over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICES
[Plot: #Slices over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICE RED.
[Plot: slice reduction (%) over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 PIPELINED, FREQ.
[Plot: frequency [MHz] over input word size (N); curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS
[Figure: Altera ALM]
MAYBE POSSIBLE NEXT?
CONCLUSION
- Compared to the best known design, up to
  - 50% of the slices can be saved for the combinatorial multiplier
  - 30% of the slices can be saved for the pipelined multiplier
- Portable to FPGAs providing a 5-input LUT at one full adder input
- "Free addition" supports multiply-accumulate (MAC) operation
THANK YOU!
LITERATURE
[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012
BOOTH RECODING
$$
\begin{aligned}
b &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + 2 b_1 2^1 + \underbrace{- b_1 2^1 + b_0}_{\mathrm{BE}_0 = -2 b_1 + b_0} \\
  &= b_{M-1} 2^{M-1} + \dots + 2 b_3 2^3 \underbrace{- b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{\mathrm{BE}_2 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + \mathrm{BE}_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} \mathrm{BE}_m 2^m
  \quad \text{with} \quad \mathrm{BE}_m = -2 b_{m+1} + b_m + b_{m-1}
\end{aligned}
$$
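As a quick check of the identity, take M = 4 and b = 11 = 1011 in binary, a value chosen here purely for illustration, with b_{-1} and all bits above b_3 taken as zero:

```latex
\begin{align*}
\mathrm{BE}_0 &= -2 b_1 + b_0 + b_{-1} = -2 + 1 + 0 = -1 \\
\mathrm{BE}_2 &= -2 b_3 + b_2 + b_1 = -2 + 0 + 1 = -1 \\
\mathrm{BE}_4 &= -2 b_5 + b_4 + b_3 = 0 + 0 + 1 = 1 \\
\sum_{m \in \{0, 2, 4\}} \mathrm{BE}_m 2^m &= -1 - 4 + 16 = 11 = b
\end{align*}
```

The four bits of b are covered by three signed digits, which is where the floor(M/2 + 1) row count comes from.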
WHY THEY ARE DIFFERENT?
[Figure: Altera ALM]
WHY THEY ARE DIFFERENT?
[Figure: schematic detail with flip-flop/latch (FF/LAT) elements; labels include INIT0/INIT1, SRHI/SRLO, CE, CK, SR, D, Q and D6:1]