Pipelined Compressor Tree Optimization using Integer Linear Programming
International Conference on Field Programmable Logic, 03.09.2014
Martin Kumm, Peter Zipf
University of Kassel, Germany
CONTENTS
1. Introduction to Compressor Trees
2. Compressor Trees on FPGAs
3. Optimal Compressor Tree Synthesis

COMPRESSOR TREES
A compressor tree realizes the addition of many (>2) bit-shifted numbers.
The applications are versatile:
- Multipliers (real, complex, squarer)
- Evaluation of polynomials (e.g., for function approximation)
- Linear transforms (e.g., FFT, DCT)
- Digital filters
- …
EXAMPLE 1: MULTI-INPUT ADDITION
Formula: $S = \sum_i X_i$
Dot representation of a 5-bit, 5-input addition (column weights $2^4, 2^3, 2^2, 2^1, 2^0$), with the input vectors $X_i$:
  1 0 1 0 1     21
  1 1 0 1 1    +27
  0 1 1 0 1    +13
  0 0 1 1 1    + 7
  1 0 1 1 0    +22
               = 90
Counting the bits per column gives the same value: $3 \cdot 2^4 + 2 \cdot 2^3 + 4 \cdot 2^2 + 3 \cdot 2^1 + 4 \cdot 2^0 = 90$
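The column counts of Example 1 can be checked with a few lines of Python. This is only an illustrative sketch: the operand values are taken from the slide, while the function names `column_heights` and `heap_value` are made up for this example.

```python
def column_heights(operands, width):
    """Count the set bits (dots) in each column, LSB first."""
    return [sum((x >> c) & 1 for x in operands) for c in range(width)]


def heap_value(heights):
    """Value represented by a dot diagram with the given column heights."""
    return sum(h << c for c, h in enumerate(heights))


operands = [21, 27, 13, 7, 22]          # input vectors from the slide
heights = column_heights(operands, 5)   # LSB first

print(heights[::-1])                    # MSB first: [3, 2, 4, 3, 4]
print(heap_value(heights), sum(operands))  # 90 90
```

Only the number of bits per column matters for the compressor tree, which is why the later slides work with column heights instead of individual operands.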
EXAMPLE 2: MULTIPLIER
5x5 multiplication: formula and dot representation (figure not reproduced)

EXAMPLE 3: ADVANCED ARITHMETIC
Sine/cosine computation: dot representation for $Z - Z^3/6$ [Dinechin HEART’13]
BASIC COMPRESSION
Full adder / (3;2) counter and ripple carry adder (figure: dot diagrams built from FA cells)

FLOW OF COMPRESSION
(Figure: dot diagram before and after compression)
TABULAR REPRESENTATION
Bits in stage 0: 5 5 5 5 5
Five (3;2) counters, each taking 3 bits from one column and producing 1 1 (one bit in the same column, one in the next higher column)
Bits in stage 1: 1 4 4 4 4 3
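The step from stage 0 to stage 1 can be reproduced with a small sketch. It assumes exactly one (3;2) counter per column that holds at least three bits, which matches the five counters of this slide; the function name `full_adder_stage` is chosen for illustration only.

```python
def full_adder_stage(heights):
    """Apply one (3;2) counter to every column holding at least 3 bits.

    heights lists the bits per column, LSB first.
    Returns the column heights of the next stage, LSB first.
    """
    nxt = list(heights) + [0]        # room for a carry out of the top column
    for c, h in enumerate(heights):
        if h >= 3:
            nxt[c] -= 3              # three input bits are consumed
            nxt[c] += 1              # the sum bit stays in column c
            nxt[c + 1] += 1          # the carry bit moves to column c+1
    if nxt[-1] == 0:
        nxt.pop()                    # drop an unused top column
    return nxt


stage0 = [5, 5, 5, 5, 5]
stage1 = full_adder_stage(stage0)
print(stage1[::-1])                  # MSB first: [1, 4, 4, 4, 4, 3]
```

Each counter removes one bit from the diagram in total (three in, two out), so five counters reduce the 25 bits of stage 0 to the 20 bits of stage 1.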
TABULAR REPRESENTATION
Bits in stage 1: 1 4 4 4 4 3
Five (3;2) counters, each taking 3 bits and producing 1 1
Bits in stage 2: 1 3 3 3 3 1

TABULAR REPRESENTATION
Bits in stage 2: 1 3 3 3 3 1
Four (3;2) counters, each taking 3 bits and producing 1 1
Bits in stage 3: 2 2 2 2 1 1

TABULAR REPRESENTATION
Bits in stage 3: 2 2 2 2 1 1
Ripple carry adder taking 2 2 2 2 and producing 1 1 1 1 1
Bits in final stage: 1 1 1 1 1 1 1
APPLICATION TO FPGAS
Compression using full adders is unsuitable for FPGAs:
- Mapping a full adder to FPGA LUTs is inefficient and slow (→ large routing delays)
- The fast carry chain is not exploited
Conventional solution: ripple-carry adder tree
Delay reduction is possible by using Generalized Parallel Counters (GPCs) [Parandeh-Afshar TRETS’11]
(1,5;3) GPC ON FPGA
Dot transform and realization with full adders (figure not reproduced)

(1,5;3) GPC ON FPGA
(1,5;3) GPC mapping [Parandeh-Afshar TRETS’11] (figure: slice with LUTs and carry logic)
Efficiency = bits reduced / #LUTs = (1+5−3)/3 = 1.0 [Dinechin FPL’13]
EFFICIENT GPCS ON FPGAS
- (1,4,1,5;5) GPC [Kumm MBMV’14]: efficiency = 1.5
- (1,4,0,6;5) GPC [Kumm MBMV’14]: efficiency = 1.5
- (1,3,2,5;5) GPC (proposed): efficiency = 1.5
- (6,0,6;5) GPC (proposed): efficiency = 1.75
(Figures: mapping of each GPC to the LUTs and carry logic of a slice)
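The efficiency metric (bits reduced per LUT) is easy to tabulate. In the sketch below, the LUT count of 3 for the (1,5;3) GPC comes from the formula on the slide; the count of 4 LUTs for the other GPCs is an assumption inferred from the stated efficiencies, not a number given on the slides. The function name is illustrative.

```python
def gpc_efficiency(inputs, n_outputs, n_luts):
    """Efficiency = (input bits - output bits) / number of LUTs."""
    return (sum(inputs) - n_outputs) / n_luts


# (input bits per column, number of output bits, number of LUTs)
gpcs = {
    "(1,5;3)":     ((1, 5), 3, 3),        # LUT count from the slide formula
    "(1,4,1,5;5)": ((1, 4, 1, 5), 5, 4),  # 4 LUTs assumed (inferred)
    "(1,4,0,6;5)": ((1, 4, 0, 6), 5, 4),  # 4 LUTs assumed (inferred)
    "(1,3,2,5;5)": ((1, 3, 2, 5), 5, 4),  # 4 LUTs assumed (inferred)
    "(6,0,6;5)":   ((6, 0, 6), 5, 4),     # 4 LUTs assumed (inferred)
}

efficiency = {name: gpc_efficiency(*spec) for name, spec in gpcs.items()}
for name, value in efficiency.items():
    print(f"{name:12s} efficiency = {value}")
```

Under these assumptions the computed values agree with the efficiencies quoted on the slides (1.0, 1.5, 1.5, 1.5 and 1.75).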
COMPRESSOR TREE OPTIMIZATION
Problem 1: The presented GPCs have irregular input patterns. How should they be selected to use the fewest LUT resources?
Problem 2: Pipelining is important on FPGAs to obtain a high throughput. How should the GPCs be selected to use the fewest LUT and FF resources (i.e., the fewest pipeline balancing FFs)?
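EXAMPLE FOR PROBLEM 1
Stage 0 to stage 1 (columns aligned, LSB on the right):
    5 5 5 5 5   bits in stage 0
−     1 4 1 5   inputs of a (1,4,1,5;5) GPC
+   1 1 1 1 1   outputs of this GPC
−   1 4 1 4     inputs of a (1,4,1,5;5) GPC
+ 1 1 1 1 1     outputs of this GPC
= 1 6 2 2 2 1   bits in stage 1
Stage 1 to stage 2:
    1 6 2 2 2 1   bits in stage 1
−     6           inputs of a (6;3) GPC
+ 1 1 1           outputs of this GPC
= 1 2 1 2 2 2 1   bits in stage 2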
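EXAMPLE FOR PROBLEM 2
Stage 0 to stage 1 (columns aligned, LSB on the right):
      5 5 5 5 5   bits in stage 0
−       2 0 4 5   inputs of a (2,0,4,5;5) GPC
+     1 1 1 1 1   outputs of this GPC
−     5 0 5       inputs of a (6,0,6;5) GPC
+ 1 1 1 1 1       outputs of this GPC
−       3   1     4 FF for pipeline balancing
+       3   1     outputs of these FFs
= 1 1 2 5 2 2 1   bits in stage 1
Stage 1 to stage 2:
    1 1 2 5 2 2 1   bits in stage 1
−   1 1 2 5         inputs of a (1,3,2,5;5) GPC
+ 1 1 1 1 1         outputs of this GPC
−           2 2 1   5 FF for pipeline balancing
+           2 2 1   outputs of these FFs
= 1 1 1 1 1 2 2 1   bits in stage 2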
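In a pipelined tree, every bit that is not consumed by a GPC in a stage needs a flip-flop to stay aligned with the GPC outputs. The sketch below counts these balancing FFs for the example; it is an assumption-laden illustration in which the function name `pipelined_stage`, the placement tuples and the inferred column positions are not taken from the slides.

```python
def pipelined_stage(heights, placements):
    """Return (next_heights, balancing_ffs) for one pipelined stage.

    heights:    bits per column, LSB first.
    placements: list of (taken_msb_first, n_outputs, lowest_column) per GPC.
    """
    n_cols = max([len(heights)] + [col + n_out for _, n_out, col in placements])
    uncovered = list(heights) + [0] * (n_cols - len(heights))
    produced = [0] * n_cols
    for taken, n_out, col in placements:
        for offset, bits in enumerate(reversed(taken)):
            uncovered[col + offset] -= bits
        for offset in range(n_out):
            produced[col + offset] += 1
    if min(uncovered) < 0:
        raise ValueError("a GPC takes more bits than its column holds")
    ffs = sum(uncovered)   # each uncovered bit is delayed by one FF
    return [u + p for u, p in zip(uncovered, produced)], ffs


stage0 = [5, 5, 5, 5, 5]
stage1, ff1 = pipelined_stage(stage0, [((2, 0, 4, 5), 5, 0),   # takes 2 0 4 5
                                       ((5, 0, 5), 5, 2)])     # (6,0,6;5) GPC, takes 5 0 5
stage2, ff2 = pipelined_stage(stage1, [((1, 1, 2, 5), 5, 3)])  # (1,3,2,5;5) GPC, takes 1 1 2 5

print(stage1[::-1], ff1)   # [1, 1, 2, 5, 2, 2, 1] 4
print(stage2[::-1], ff2)   # [1, 1, 1, 1, 1, 2, 2, 1] 5
```

The FF counts of 4 and 5 match the pipeline balancing rows of the slide, which is the quantity Problem 2 asks to minimize together with the LUT count.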
PROPOSED OPTIMIZATION
- A generic ILP optimizer was used.
- The main idea of the ILP formulation is to count the GPCs for each column [Matsunaga’13] and to 'cover' all bits in each stage by GPCs.
- For that, a 'pseudo compressor' with one input and one output (no compression) is introduced.
- To optimize a combinatorial compressor tree (problem 1), the cost of the pseudo compressor is set to zero (a wire).
- To optimize a pipelined compressor tree (problem 2), the cost of the pseudo compressor is set to the flip-flop cost.
ILP FORMULATION
ILP variables:
- Number of bits in stage $s$ and column $c$: $N_{s,c}$
- Number of GPCs in stage $s$, of type $e$ and column $c$: $k_{s,e,c}$
- Number of inputs and outputs of GPC type $e$ in column $c$: $M_{e,c}$ and $K_{e,c}$, respectively
- LUT cost of GPC $e$: $c_e$
- Binary variable to select the active stage:
$$D_s = \begin{cases} 1 & \text{if stage } s \text{ is used} \\ 0 & \text{otherwise} \end{cases}$$
ILP FORMULATION
$$\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{c=0}^{C-1} \sum_{e=0}^{E-1} c_e \, k_{s,e,c}$$
subject to
$$\text{C1:} \quad N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} \quad \text{if } D_s = 0, \quad s = 1 \ldots S-1, \; c = 0 \ldots C-1$$
$$\text{C2:} \quad N_{s,c} = \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} K_{e,c+c'} \, k_{s-1,e,c+c'}, \quad s = 1 \ldots S-1, \; c = 0 \ldots C-1$$
$$\text{C3:} \quad N_{s,c} \le \begin{cases} 2 & \text{for two-input VMA} \\ 3 & \text{for ternary VMA} \end{cases} \quad \text{if } D_s = 1$$
$$\text{C4:} \quad \sum_{s=1}^{S-1} D_s = 1$$
ILP FORMULATION
C1 and C3 have to be linearized:
$$\text{C1':} \quad N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} + I \, D_s$$
$$\text{C3':} \quad N_{s,c} \le \begin{cases} 2 + (1 - D_s) \, I & \text{for two-input VMA} \\ 3 + (1 - D_s) \, I & \text{for ternary VMA} \end{cases}$$
$I$ must be a sufficiently large integer.
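The linearization is a standard big-M construction: adding $I \cdot D_s$ or $(1 - D_s) \cdot I$ to the right-hand side switches a constraint off whenever its condition does not hold. The following sketch checks this equivalence by brute force for small values. The constant `BIG_I = 1000`, the tested value ranges and all function names are assumptions made for this illustration; `covered` stands for the double sum in C1.

```python
BIG_I = 1000  # assumed value for the "sufficiently large" integer I


def c1_conditional(n_bits, covered, d_s):
    """C1: all bits must be covered by GPC inputs, but only if D_s = 0."""
    return n_bits <= covered if d_s == 0 else True


def c1_linear(n_bits, covered, d_s, big=BIG_I):
    """C1': the same constraint without the condition."""
    return n_bits <= covered + big * d_s


def c3_conditional(n_bits, d_s, vma_inputs=2):
    """C3: at most vma_inputs bits per column, but only if D_s = 1."""
    return n_bits <= vma_inputs if d_s == 1 else True


def c3_linear(n_bits, d_s, vma_inputs=2, big=BIG_I):
    """C3': the same constraint without the condition."""
    return n_bits <= vma_inputs + (1 - d_s) * big


values = range(0, 51)
c1_ok = all(c1_conditional(n, m, d) == c1_linear(n, m, d)
            for n in values for m in values for d in (0, 1))
c3_ok = all(c3_conditional(n, d, v) == c3_linear(n, d, v)
            for n in values for d in (0, 1) for v in (2, 3))
print(c1_ok, c3_ok)

# With a constant that is too small, the linear form wrongly rejects
# a column height that the conditional constraint allows.
too_small = c3_linear(5, 0, big=1)
print(too_small, c3_conditional(5, 0))
```

The last two lines illustrate why $I$ has to exceed the largest column height that can occur: otherwise the relaxed constraint still cuts off feasible solutions.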
RESULTS
(Figure: #LUT over the number of compressed bits for Heuristic [8] and the proposed ILP; panels for Virtex 4 FPGA and Virtex 6 FPGA)
The required LUTs could be reduced by 23% (Virtex 4) and 30% (Virtex 6) compared to Dinechin (FPL’13) [8].
The slice reduction after synthesis was 12.5% (Virtex 4) and 19.5% (Virtex 6).

EXAMPLE COMPRESSION TREE WITH 16 INPUTS, 16 BIT EACH
(Figure: compressor trees generated by FloPoCo [Dinechin FPL’13] and by the proposed ILP)
CONCLUSION & OUTLOOK
- A novel ILP formulation for the optimization of pipelined compressor trees was presented.
- There is a notable gap between the former state-of-the-art heuristic and our optimal solution.
- Extensions are proposed for minimal stage count or variable column counters like 4:2 compressors.
- Good heuristics are still required for problem sizes >100 bit due to the runtime of the ILP solver.
- So far there is no heuristic considering pipelining.

THANK YOU!
LITERATURE
[Parandeh-Afshar TRETS’11]: H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor Tree Synthesis on Commercial High-Performance FPGAs,” ACM TRETS, 2011.
[Dinechin HEART’13]: F. de Dinechin, M. Istoan, and G. Sergent, “Fixed-Point Trigonometric Functions on FPGAs,” HEART 2013, Jun. 2013.
[Dinechin FPL’13]: N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic Core Generation Using Bit Heaps,” FPL 2013.
[Matsunaga’13]: T. Matsunaga, S. Kimura, and Y. Matsunaga, “An Exact Approach for GPC-Based Compressor Tree Synthesis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Dec. 2013.