Resource Optimal Design of Large Multipliers for FPGAs
Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*
* University of Kassel, Germany
† University Lyon, France
24th IEEE Symposium on Computer Arithmetic, 25.07.2017
Motivation
- Multiplication is a fundamental arithmetic operation
- Embedded multipliers available in the FPGA fabric are limited in size (and quantity)
- Larger multipliers can be decomposed into smaller multipliers realized by DSP blocks or logic resources
- Question of interest: How to do the decomposition in a (resource) optimal way?
Outline
1. How to formulate the problem as a tiling problem?
2. What do the tiles look like?
3. How to solve the problem?
Multiplier Decomposition
A large multiplier can be decomposed into several smaller multipliers:
$A \times B = (A_H 2^n + A_L)(B_H 2^m + B_L) = \underbrace{A_H B_H 2^{n+m}}_{M4} + \underbrace{A_H B_L 2^n}_{M3} + \underbrace{A_L B_H 2^m}_{M2} + \underbrace{A_L B_L}_{M1}$
Multiplier Tiling
- The multiplier can be graphically represented as an X × Y board which is tiled by smaller multipliers, represented as rectangles [de Dinechin 2009]
- The required left shift can be obtained from the sum of the tile coordinates (x, y)
[Figure: 32 × 32 board with n = m = 16 bit multipliers; x grows to the left, y grows upward; M4 top left, M2 top right, M3 bottom left, M1 bottom right]
$A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L) = \underbrace{A_H B_H 2^{32}}_{M4} + \underbrace{A_H B_L 2^{16}}_{M3} + \underbrace{A_L B_H 2^{16}}_{M2} + \underbrace{A_L B_L}_{M1}$
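The shift rule can be checked numerically. The following is a minimal sketch, assuming unsigned operands and rectangular tiles given as hypothetical tuples `(x, y, w, h)`; the function name `tiled_product` is illustrative and not taken from the talk.

```python
def tiled_product(a, b, tiles):
    """Sum the partial products of all tiles.

    Each tile (x, y, w, h) multiplies bits x..x+w-1 of `a` by
    bits y..y+h-1 of `b`; its result is shifted left by x + y.
    """
    total = 0
    for x, y, w, h in tiles:
        a_part = (a >> x) & ((1 << w) - 1)  # slice of operand A
        b_part = (b >> y) & ((1 << h) - 1)  # slice of operand B
        total += (a_part * b_part) << (x + y)
    return total


# 32 x 32 board split into four 16 x 16 tiles (M1, M2, M3, M4)
TILES_32 = [(0, 0, 16, 16), (0, 16, 16, 16), (16, 0, 16, 16), (16, 16, 16, 16)]

a, b = 0xDEADBEEF, 0x12345678
assert tiled_product(a, b, TILES_32) == a * b
```

The same function accepts any non-overlapping tiling that covers the board, which is what makes the tiling view independent of a fixed split point.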
Multiplier Tiling
A valid multiplier tiling is as follows:
- The board must be completely covered without overlaps of the tiles
- Overlaps with the border of the board are allowed
[Figure: tiling of a 53 × 53 multiplier [de Dinechin 2009]; x grows to the left, y grows upward; axis ticks at 0, 17, 24, 34, 41, 53, 58]
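These two rules translate directly into a coverage check. The sketch below assumes rectangular tiles written as hypothetical tuples `(x, y, w, h)`; the helper name `is_valid_tiling` is illustrative only.

```python
def is_valid_tiling(X, Y, tiles):
    """Return True if every cell of the X x Y board is covered exactly once.

    Parts of a tile that extend beyond the board are ignored,
    because overlapping the border is allowed.
    """
    cover = [[0] * Y for _ in range(X)]
    for x, y, w, h in tiles:
        for i in range(max(x, 0), min(x + w, X)):
            for j in range(max(y, 0), min(y + h, Y)):
                cover[i][j] += 1
    return all(count == 1 for column in cover for count in column)


# 5 x 5 board covered by four 3 x 3 tiles; three of them overhang the border.
overhanging = [(0, 0, 3, 3), (3, 0, 3, 3), (0, 3, 3, 3), (3, 3, 3, 3)]
print(is_valid_tiling(5, 5, overhanging))                        # valid
print(is_valid_tiling(5, 5, [(0, 0, 3, 5), (2, 0, 3, 5)]))       # column x = 2 covered twice
```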
Logic-based Tiles
Several LUT-based multipliers can be used:
- 3 × 3 multiplier, which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]
- 2 × 3 multiplier, which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]
- 1 × 2 multiplier, which uses a single LUT6 (realizing two LUT5)
In addition, LUT/carry-chain multipliers are used:
- Single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011]
Shapes of the Logic-based Tiles
[Figure: tile shapes (a) 3 × 3, (b) 3 × 2 / 2 × 3, (c) 2 × 1 / 1 × 2, (d) k × 2, (e) 2 × k]
LUT Requirements in the Compressor Tree
[Plot: #LUTs (0 to 1,000) over input bits (#bits, 0 to 1,600); legend: multi-input addition, x^3 operation, 0.65 × #bits]
Logic-based Multipliers
The cost of a tile is composed of the LUTs of the multiplier itself and its share of the compressor tree:
$\text{cost}_s = \#\text{LUT}_m + 0.65\, w_s$
To get the "quality" of a multiplier, an efficiency metric is defined as the benefit/cost ratio:
$E_s = \text{area}_s / \text{cost}_s$

Shape   Tile area   Word size (w_s)   #LUT_m   Total cost (cost_s)   Efficiency (E_s)
1 × 1   1           1                 1        1.65                  0.61
1 × 2   2           2                 1        2.3                   0.87
2 × 3   6           5                 3        6.25                  0.96
3 × 3   9           6                 6        9.9                   0.91
2 × k   2k          k + 2             k + 1    1.65k + 2.3           2k / (1.65k + 2.3) (= 1.21 for k → ∞)
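The cost and efficiency figures follow directly from the two formulas. A minimal sketch, assuming the tile area, word size and LUT count from the table; the function names are illustrative:

```python
COMPRESSOR_LUTS_PER_BIT = 0.65  # slope of the compressor-tree fit


def cost(luts, word_size):
    """cost_s = #LUT_m + 0.65 * w_s"""
    return luts + COMPRESSOR_LUTS_PER_BIT * word_size


def efficiency(area, luts, word_size):
    """E_s = area_s / cost_s"""
    return area / cost(luts, word_size)


def efficiency_2xk(k):
    """2 x k tile: area 2k, word size k + 2, k + 1 LUTs."""
    return efficiency(2 * k, k + 1, k + 2)


# (area, #LUT_m, w_s) per shape
for name, (area, luts, w) in {"1x2": (2, 1, 2), "2x3": (6, 3, 5), "3x3": (9, 6, 6)}.items():
    print(name, round(cost(luts, w), 2), round(efficiency(area, luts, w), 2))

print("2xk, k = 1000:", round(efficiency_2xk(1000), 2))
```

The 2 × k tile is the only shape whose efficiency exceeds 1 for large k, which is consistent with the limit of 2 / 1.65 ≈ 1.21 stated in the table.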
DSP-based Tiles
- Xilinx DSP blocks contain 18 × 25 bit (signed) / 17 × 24 bit (unsigned) multipliers
- They contain additional post-adders
- These can be used to add a multiplier result already obtained
- This reduces the size of the compressor tree
- Graphically, this can be represented as a so-called super-tile [Banescu 2010]
Super-Tiles of Xilinx FPGAs
[Figure: twelve super-tile shapes, panels (a) to (l)]
Formalizing the Problem
Constant/Variable: Meaning
$x, y \in \mathbb{N}_0$: Coordinates
$X, Y \in \mathbb{N}_0$: Outer bounds of the multiplier to be designed
$M_{x,y} \in \{0, 1\}$: Shape of the multiplier to be designed; true when $(x, y)$ is within the area of the multiplier
$\mathcal{S}$: Set of small multipliers with different shape
$S = |\mathcal{S}|$: Number of available smaller multipliers
$s \in \{0, 1, \ldots, S-1\}$: Shape index of the smaller multiplier
$m^s_{x,y} \in \{0, 1\}$: Boolean constant describing each small multiplier; true when $(x, y)$ is within the area of the multiplier of shape $s$
$\text{cost}_s \in \mathbb{R}$: Cost of a small multiplier of shape $s$
$d^s_{x,y} \in \{0, 1\}$: Decision variable, which is true when a multiplier of shape $s$ is placed at coordinate $(x, y)$
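Specification of a Tile
Setting $m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1$ with all other $m$'s zero would define the following tile:
[Figure: tile of shape 0; the column at x = 0 is three cells high (y = 0..2), the column at x = 1 is two cells high (y = 0..1); x grows to the left, y grows upward]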
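The tile specification can be written down as a small Boolean array. This is a sketch with hypothetical helper names (`tile_from_cells`, `render`); the drawing follows the slide's orientation, with x growing to the left and y growing upward.

```python
def tile_from_cells(cells):
    """Build m[x][y] from the (x, y) coordinates whose entry is 1."""
    width = max(x for x, _ in cells) + 1
    height = max(y for _, y in cells) + 1
    m = [[0] * height for _ in range(width)]
    for x, y in cells:
        m[x][y] = 1
    return m


def render(m):
    """Draw the tile: '#' for 1, '.' for 0, x to the left, y upward."""
    width, height = len(m), len(m[0])
    rows = []
    for y in reversed(range(height)):
        rows.append("".join("#" if m[x][y] else "." for x in reversed(range(width))))
    return "\n".join(rows)


M0 = tile_from_cells([(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)])
print(render(M0))
# .#
# ##
# ##
```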
ILP Formulation
The multiplier tiling problem can be formulated as an integer linear program (ILP) as follows:
minimize $\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \text{cost}_s \, d^s_{x,y}$
subject to $\sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'} \, d^s_{x',y'} = 1$ for $0 \le x \le X$, $0 \le y \le Y$ with $M_{x,y} = 1$
The ILP problem can be solved by using standard solvers.
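To see the objective and the cover-exactly-once constraint at work without a solver dependency, the sketch below swaps in a plain exhaustive branch-and-bound search for the ILP solver. It is only practical for tiny boards. Assumptions: the board is a full rectangle (all $M_{x,y} = 1$), tiles are rectangles, and the shape list with its costs is a hypothetical pick based on the logic-based cost model; `solve_tiling` is an illustrative name.

```python
import math

# (width, height, cost): hypothetical shape set, costs = #LUT_m + 0.65 * w_s
SHAPES = [
    (1, 1, 1.65),
    (1, 2, 2.3), (2, 1, 2.3),
    (2, 3, 6.25), (3, 2, 6.25),
    (3, 3, 9.9),
]


def solve_tiling(X, Y, shapes=SHAPES):
    """Return (minimal cost, tiles) where tiles are (x, y, width, height)."""
    covered = [[False] * Y for _ in range(X)]
    best = {"cost": math.inf, "tiles": None}
    placed = []

    def first_free():
        # Scan row by row; every earlier cell is already covered.
        for y in range(Y):
            for x in range(X):
                if not covered[x][y]:
                    return x, y
        return None

    def search(cost_so_far):
        if cost_so_far >= best["cost"]:
            return  # bound: cannot improve on the best known tiling
        cell = first_free()
        if cell is None:
            best["cost"], best["tiles"] = cost_so_far, list(placed)
            return
        x0, y0 = cell
        # A rectangle covering the first free cell without overlap
        # must have its origin exactly at that cell.
        for w, h, c in shapes:
            cells = [(x, y)
                     for x in range(x0, min(x0 + w, X))   # clip: border overhang allowed
                     for y in range(y0, min(y0 + h, Y))]
            if any(covered[x][y] for x, y in cells):
                continue
            for x, y in cells:
                covered[x][y] = True
            placed.append((x0, y0, w, h))
            search(cost_so_far + c)
            placed.pop()
            for x, y in cells:
                covered[x][y] = False

    search(0.0)
    return best["cost"], best["tiles"]


best_cost, best_tiles = solve_tiling(4, 4)
print(round(best_cost, 2), best_tiles)
```

For realistic sizes such as 53 × 53, the search space is far too large for this approach, which is why the formulation targets standard ILP solvers.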
ILP Formulation
Graphical representation of the left-hand side of the ILP constraint, for the tile of shape 0 placed at $(x', y') = (1, 2)$:
$m^0_{0,0}\, d^0_{1,2} = 1$
$m^0_{1,0}\, d^0_{1,2} = 1$
$m^0_{0,1}\, d^0_{1,2} = 1$
$m^0_{1,1}\, d^0_{1,2} = 1$
$m^0_{0,2}\, d^0_{1,2} = 1$
$m^0_{0,3}\, d^0_{1,2} = 0$
[Figure: board with both axes from 0 to 5; x grows to the left, y grows upward]
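The individual terms can be reproduced by evaluating the left-hand side of the constraint for one board cell at a time. A minimal sketch, assuming shapes are stored as sets of cells with $m = 1$ and placements as hypothetical triples `(s, x', y')` for every decision variable that is 1:

```python
# Cells (x, y) where m^0 = 1, as in the tile specification.
SHAPES = {0: {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}}


def constraint_lhs(shapes, placements, x, y):
    """Sum of m^s[x - x'][y - y'] * d^s[x'][y'] over all tiles with d = 1."""
    return sum(1 for s, px, py in placements if (x - px, y - py) in shapes[s])


PLACED = [(0, 1, 2)]  # d^0_{1,2} = 1

for x, y in [(1, 2), (2, 2), (1, 3), (2, 3), (1, 4), (1, 5)]:
    print((x, y), constraint_lhs(SHAPES, PLACED, x, y))
```

The first five cells yield 1 and the cell (1, 5) yields 0, because it maps to $m^0_{0,3}$, which lies outside the tile. A tiling is feasible when this sum equals 1 for every cell of the board.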
Additional DSP Constraint
- The cost of DSP blocks is hard to compare with the cost of LUTs
- It is better to constrain the DSP count for a given application
- A single additional constraint can be used to specify the number of DSPs (#DSP):
$\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s \, d^s_{x,y} = \#\text{DSP}$
where $D_s$ specifies the number of DSPs in multiplier shape $s$
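Evaluated on a candidate solution, the constraint is a weighted count of the placed tiles. In the sketch below the shape names and their DSP counts are hypothetical placeholders, and placements are triples `(s, x, y)` for every decision variable that is 1:

```python
# Hypothetical D_s values: number of DSP blocks used by each shape s.
DSPS_PER_SHAPE = {"lut_2x3": 0, "dsp_24x17": 1, "super_tile": 2}


def dsp_count(placements, dsps_per_shape=DSPS_PER_SHAPE):
    """Left-hand side of the constraint: sum of D_s over all placed tiles."""
    return sum(dsps_per_shape[s] for s, _x, _y in placements)


def meets_dsp_budget(placements, num_dsp, dsps_per_shape=DSPS_PER_SHAPE):
    """True when the tiling uses exactly num_dsp DSP blocks."""
    return dsp_count(placements, dsps_per_shape) == num_dsp


placements = [("dsp_24x17", 0, 0), ("super_tile", 24, 0), ("lut_2x3", 0, 17)]
print(dsp_count(placements))            # 1 + 2 + 0 = 3
print(meets_dsp_budget(placements, 3))
```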
Results
Four important cases were considered:
- 24 × 24 (single precision)
- 32 × 32
- 53 × 53 (double precision)
- 64 × 64
Each was evaluated for a varying DSP count, up to a DSP-only implementation.
Resulting Tilings 24/32 Bit
[Figure: resulting tilings for 24 × 24 with 0, 1 and 2 DSPs, and for 32 × 32 with 0, 1, 2, 3 and 4 DSPs]
Annotations in the figure:
- Baugh-Wooley multiplier [Parandeh-Afshar 2011]
- 2 × k and 1 × 2 tiles perform best for LUT-based multiplication
- Efficient solution utilizing two super-tiles
Resulting Tilings 53 Bit
[Figure: resulting tilings for 53 × 53 with 5, 6, 7, 8 and 9 DSPs]
Annotations in the figure:
- Pinwheel inside of a pinwheel
- The logic multiplier consumes 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]
Resulting Tilings 64 Bit
[Figure: resulting tilings for 64 × 64 with 7, 8, 9, 10 and 11 DSPs]
Optimization & Synthesis Results

Mult.     Method           #DSP   Area (geom.) logic mult.   Slices   Slice red.   f_clk [MHz]
24 × 24   [Brunie 2013]    1      216                        65       -            212.4
24 × 24   proposed         1      168                        58       10.8%        287.4
24 × 24   [Brunie 2013]    2      0                          0        -            418.9
24 × 24   proposed         2      0                          0        0.0%         418.9
32 × 32   [Banescu 2010]   0      1024                       339      -            275.8
32 × 32   proposed         0      1024                       276      18.6%        304.4
32 × 32   [Brunie 2013]    1      648                        205      -            192.8
32 × 32   [Banescu 2010]   1      616                        234      -            352.6
32 × 32   proposed         1      616                        180      12.2%        302.5
32 × 32   [Brunie 2013]    2      288                        94       -            270.1
32 × 32   proposed         2      256                        82       12.8%        338.0
32 × 32   [Brunie 2013]    3      135                        75       -            194.0
32 × 32   [Banescu 2010]   3      176                        75       -            426.6
32 × 32   proposed         3      64                         44       41.3%        314.5
32 × 32   [Brunie 2013]    4      0                          17       -            314.7
32 × 32   [Banescu 2010]   4      40                         38       -            379.4
32 × 32   proposed         4      0                          13       23.5%        181.7
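The "Slice red." column appears to be the relative slice saving of the proposed design against the best prior design with the same DSP count. This reading is an assumption inferred from the numbers, not something stated on the slide. A short sketch that reproduces the column under that assumption:

```python
def slice_reduction(prior_slices, proposed_slices):
    """Percent slice reduction relative to the smallest prior design."""
    best_prior = min(prior_slices)
    if best_prior == 0:
        return 0.0  # nothing to reduce (24 x 24 with 2 DSPs)
    return 100.0 * (best_prior - proposed_slices) / best_prior


# (multiplier, #DSP): (slices of prior designs, slices of proposed design)
RESULTS = {
    ("24x24", 1): ([65], 58),
    ("24x24", 2): ([0], 0),
    ("32x32", 0): ([339], 276),
    ("32x32", 1): ([205, 234], 180),
    ("32x32", 2): ([94], 82),
    ("32x32", 3): ([75, 75], 44),
    ("32x32", 4): ([17, 38], 13),
}

for key, (prior, proposed) in RESULTS.items():
    print(key, f"{slice_reduction(prior, proposed):.1f}%")
```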