UCLA
Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong
University of California, Los Angeles
cong@cs.ucla.edu
Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
Partially supported by NSF Grants CCR-0096383 and CCR-0306682, and by Altera under the California MICRO program
Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
• Low Power Synthesis
• Conclusions
Why? FPGA is Known to be Power Inefficient!
• FPGA consumes 50-100X more power than an ASIC
• Why do we care about power optimization for FPGAs?
Source: [Zuchowski, et al, ICCAD02]
FPGA Advantages
• Short TAT (total turnaround time)
• No or very low NRE
ASICs Become Increasingly Expensive
• Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below
[Figure: total cost for mask set ($M) and cost per mask ($K) at the 250nm, 180nm, 130nm, and 100nm nodes]
Table:
  Process (um): 2.0 … 0.8 0.6 0.35 0.25 0.18 0.13 0.10
  Single mask cost ($K): 1.5 1.5 2.5 4.5 7.5 12 40 60
  # of masks: 12 12 12 16 20 26 30 34
  Mask set cost ($K): 12 18 18 30 72 150 312 1,000 2,000
Source: EETimes
Our Research: Power Efficient FPGAs
• Circuit Design
• Fabric Design
• Synthesis Tools
• System Design
Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
• Low Power Synthesis
• Conclusions
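FPGA Architecture
[Figure: FPGA fabric made of programmable IO, programmable logic, and programmable routing. Each logic block is a cluster of N BLEs (BLE #1 to BLE #N) with I inputs, N outputs, and a clock. Each BLE has K inputs, a LUT, a D FF, a clock, and an output.]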
Evaluation Framework – fpgaEva-LP
fpgaEva-LP [Li, et al, FPGA’03]
Flow: BLIF/SLIF → Logic Optimization (SIS) → Tech-Mapping (RASP) → Timing-Driven Packing (T-VPack) → Placement & Routing (VPR) → BC-Netlist Generator → Power Simulator
• Input: Arch Spec
• Outputs: Area, Delay, Power
BC-Netlist Generator
• Inputs: Mapped Netlist, Layout
• Blocks: Buffer Extraction, Netlist Generation for Logic Clusters, Capacitance Extraction, Delay Calculation
• Back-annotation
• Output: BC-Netlist
Mixed-level Power Model – Overview
• Dynamic power
  - Switching power
  - Short-circuit power
  - Related to signal transitions: functional switch, glitch
• Static power
  - Sub-threshold leakage
  - Gate leakage
  - Reverse biased leakage
  - Depending on the input vector
Power components and power sources:
  Power component   Logic Block    Interconnect & clock
  Dynamic           Macro-model    Switch-level model
  Static            Macro-model    Macro-model
Cycle-Accurate Power Simulator
• Inputs: BC-Netlist (post-layout extracted delay & capacitance), Random Vector Generation, Mixed-level Power Model
• Cycle Accurate Power Simulation with Glitch Analysis, computing the energy of each cycle:
  $E_{cycle} = \sum_{i \in active} E_a(n) + \sum_{j \in idle} E_s(n)$
• All cycles finished? No: simulate the next cycle. Yes: output Power Values
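The per-cycle energy sum on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the fpgaEva-LP implementation: the `Component` class, the helper names, and the energy values are hypothetical, and each component is assumed to contribute a fixed switching energy when active in a cycle and a fixed static energy when idle.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    active_energy: float  # E_a: energy in a cycle where the component switches (hypothetical units)
    static_energy: float  # E_s: energy in a cycle where the component is idle


def cycle_energy(components, active_names):
    """Sum E_a over the active components and E_s over the idle ones for one cycle."""
    total = 0.0
    for c in components:
        total += c.active_energy if c.name in active_names else c.static_energy
    return total


def average_power(components, activity_per_cycle, clock_period):
    """Average power = energy summed over all simulated cycles / simulated time."""
    total = sum(cycle_energy(components, active) for active in activity_per_cycle)
    return total / (len(activity_per_cycle) * clock_period)


# Hypothetical two-component design simulated for three cycles.
components = [Component("lut0", 2.0, 0.5), Component("wire0", 3.0, 0.25)]
activity = [{"lut0"}, set(), {"lut0", "wire0"}]
print(cycle_energy(components, activity[0]))
print(average_power(components, activity, clock_period=1.0))
```

A real simulator would derive the active set of each cycle from the random input vectors and the back-annotated delays, which is also where glitch transitions would be counted.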
Power Breakdown
[Pie charts for Cluster Size = 12, LUT Size = 4 and Cluster Size = 12, LUT Size = 6]
• One chart: Logic Block Power 19%, Clock Power 22%, Interconnect Power 59%
• Other chart: Logic Block Power 15%, Clock Power 40%, Interconnect Power 45%
• Interconnect power is dominant
Power Breakdown (cont’d)
[Pie charts for Cluster Size = 12, LUT Size = 4 and Cluster Size = 12, LUT Size = 6]
• One chart: Leakage Power 42%, Dynamic Power 58%
• Other chart: Leakage Power 52%, Dynamic Power 48%
• Leakage power becomes increasingly important (100nm)
Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture
• Low Power Synthesis with Dual-Vdd
• Conclusion
Total Power along LUT and Cluster Size Changes
[Figure: total FPGA power (normalized geometric mean) vs. LUT size (3 to 7), one curve per cluster size (4, 6, 8, 10, 12)]
Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches
Routing Architecture Evaluation
Architecture of Low-power and High-performance Applications
Low-power (E^3 t)
  Best FPGA architecture: cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches
  Energy (E) 0.9653 | Delay (t) 0.9904 | E^3 t 0.8909 | E t^3 1.0080
High-performance (E t^3)
  Best FPGA architecture: cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches
  Energy (E) 1.0502 | Delay (t) 0.8865 | E^3 t 1.0268 | E t^3 0.7865
• Arch. parameter selection leads to 10% power/delay trade-off
• Uniform FPGA fabrics provide limited power-performance tradeoff
• Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
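The two selection metrics on this slide weight energy and delay differently: E^3 t favors low energy, E t^3 favors low delay. A minimal sketch of how a weighted energy-delay product picks an architecture is shown below. The function names and the candidate labels are hypothetical, and the (E, t) pairs are the normalized values from the slide.

```python
def energy_delay_metric(energy, delay, energy_exp=1, delay_exp=1):
    """Return energy**energy_exp * delay**delay_exp for normalized energy and delay."""
    return (energy ** energy_exp) * (delay ** delay_exp)


def best_architecture(candidates, energy_exp, delay_exp):
    """Pick the candidate name with the smallest metric; candidates maps name -> (E, t)."""
    return min(
        candidates,
        key=lambda name: energy_delay_metric(*candidates[name], energy_exp, delay_exp),
    )


# Normalized (energy, delay) pairs taken from the slide; the labels are hypothetical.
candidates = {
    "cluster10_lut4_25pct_buffered": (0.9653, 0.9904),
    "cluster12_lut4_100pct_buffered": (1.0502, 0.8865),
}

print(best_architecture(candidates, energy_exp=3, delay_exp=1))  # energy-weighted choice
print(best_architecture(candidates, energy_exp=1, delay_exp=3))  # delay-weighted choice
```

With these inputs the energy-weighted metric selects the 25% buffered architecture and the delay-weighted metric selects the 100% buffered one, in line with the slide.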
Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
• Low Power Synthesis with Dual-Vdd
• Conclusion
Dual-Vdd LUT Design
• Dual-Vdd technique makes use of the timing slack to reduce power
  - VddH devices on critical path (for performance)
  - VddL devices on non-critical paths (for power)
  - Assume uniform Vdd for one LUT
• Threshold voltage Vt should be adjusted carefully for different Vdd levels
  - To compensate delay increase
  - To avoid excessive leakage power increase
Vdd/Vt-Scaling for LUTs
• Three scaling schemes
  - Constant-Vt scaling
  - Fixed-Vdd/Vt-ratio scaling
  - Constant-leakage scaling
• Constant-leakage scaling obtains a good tradeoff
  - Useful for both single-Vdd scaling and dual-Vdd design
[Figure: delay (ns) and leakage power (uW) vs. Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed-Vdd/Vt-ratio, and constant leakage scaling]
Dual-Vt LUT Design
• LUT is divided into two parts
  - Part I: configuration cells, high Vt
  - Part II: MUX tree and input buffers, normal Vt (decided by constant-leakage Vdd-scaling)
• Configuration SRAM cells
  - Content remains unchanged after configuration
  - Read/write delay is not related to FPGA performance
  - Use high Vt ~40% of Vdd
  - Maintain signal integrity
  - Reduce SRAM leakage by 15X and LUT leakage by 2.4X
  - Increase configuration time by 13%
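The two leakage figures on this slide (15X for the SRAM cells, 2.4X for the whole LUT) can be related with a simple back-of-the-envelope calculation. The sketch below assumes that the whole LUT leakage saving comes from the SRAM cells, i.e. that the leakage of the MUX tree and input buffers is unchanged. That assumption and the function names are not from the slides.

```python
def overall_reduction(sram_fraction, sram_reduction):
    """LUT leakage reduction factor when only the SRAM share of leakage shrinks."""
    return 1.0 / ((1.0 - sram_fraction) + sram_fraction / sram_reduction)


def implied_sram_fraction(overall, sram_reduction):
    """Invert overall_reduction: the SRAM share of baseline LUT leakage implied by both factors."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / sram_reduction)


fraction = implied_sram_fraction(overall=2.4, sram_reduction=15.0)
print(round(fraction, 3))  # 0.625
print(round(overall_reduction(fraction, 15.0), 2))  # 2.4
```

Under this assumption, the configuration SRAM cells would account for about 62.5% of the single-Vt LUT leakage.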
Pre-Defined Dual-Vt Fabric
• Power saving
  - 11.6% for combinational circuits
  - 14.6% for sequential circuits

Table 1: Combinational circuits
  circuit    arch-SVST (Single Vt)    arch-SVDT (Dual Vt)
             power (watt)             power saving
  alu4       0.0798                   8.5%
  apex2      0.108                    9.3%
  apex4      0.0536                   12.3%
  des        0.234                    10.7%
  ex1010     0.179                    17.3%
  ex5p       0.059                    11.6%
  misex3     0.0753                   9.4%
  pdc        0.256                    14.7%
  seq        0.0927                   9.4%
  spla       0.180                    12.4%
  Avg.                                11.6%

Table 2: Sequential circuits
  circuit    arch-SVST (Single Vt)    arch-SVDT (Dual Vt)
             power (watt)             power saving
  bigkey     0.148                    12.3%
  clma       0.632                    14.8%
  diffeq     0.0391                   19.7%
  dsip       0.134                    14.5%
  elliptic   0.140                    16.3%
  frisc      0.190                    19.2%
  s298       0.0736                   13.4%
  s38417     0.307                    11.7%
  s38584     0.261                    10.2%
  tseng      0.0351                   14.0%
  Avg.                                14.6%
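As a quick consistency check, the per-circuit savings listed on this slide reproduce the two reported averages. The snippet below is a small sketch with a hypothetical helper name. The percentages are copied from the slide, combinational circuits first (alu4 through spla), then sequential circuits (bigkey through tseng).

```python
from statistics import mean

# Power savings (%) of the dual-Vt fabric, in the order the circuits are listed.
combinational = [8.5, 9.3, 12.3, 10.7, 17.3, 11.6, 9.4, 14.7, 9.4, 12.4]
sequential = [12.3, 14.8, 19.7, 14.5, 16.3, 19.2, 13.4, 11.7, 10.2, 14.0]


def average_saving(savings):
    """Arithmetic mean of the savings, rounded to one decimal place as on the slide."""
    return round(mean(savings), 1)


print(average_saving(combinational))  # 11.6
print(average_saving(sequential))     # 14.6
```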
Dual-Vdd FPGA Fabric
• Granularity: logic block (i.e., cluster of LUTs)
  - Smaller granularity => intuitively more power saving
  - But a larger implementation overhead
• Layout pattern: pre-defined dual-Vdd pattern
  - Row-based or interleaved pattern
  - Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
• Interconnect uses uniform VddH
[Figure legend: L-block: VddL; H-block: VddH]
Simple Design Flow for Dual-Vdd Fabric
• Based on traditional design flow, but with new steps
  Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
  Step II: Dual-Vdd assignment based on sensitivity
  Step III: Timing driven P & R considering pre-defined dual-Vdd pattern (modified VPR)
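Step II can be illustrated with a greedy sketch. This is not the authors' algorithm. The slides do not define the sensitivity, so it is assumed here to be the power saved per unit of added delay when a block moves from VddH to VddL. The function name, the data layout, and the numbers are hypothetical, and the timing model is just a sum of block delays along each path.

```python
def assign_dual_vdd(blocks, paths, clock_period):
    """Greedy VddL assignment.

    blocks: name -> (delay_at_vddh, delay_at_vddl, power_saving_at_vddl)
    paths:  list of paths, each a list of block names
    Returns the set of blocks moved to VddL without violating clock_period.
    """
    # Start with every block at VddH.
    delay = {name: d_h for name, (d_h, _, _) in blocks.items()}

    def sensitivity(name):
        d_h, d_l, saving = blocks[name]
        return saving / (d_l - d_h)  # assumes VddL is strictly slower than VddH

    low = set()
    # Try the most sensitive blocks first and keep a move only if timing still holds.
    for name in sorted(blocks, key=sensitivity, reverse=True):
        trial = dict(delay)
        trial[name] = blocks[name][1]
        if all(sum(trial[b] for b in path) <= clock_period for path in paths):
            delay = trial
            low.add(name)
    return low


# Hypothetical example: three blocks, two paths, clock period 2.6.
blocks = {"a": (1.0, 1.5, 4.0), "b": (1.0, 1.5, 2.0), "c": (1.0, 1.5, 3.0)}
paths = [["a", "b"], ["c"]]
print(sorted(assign_dual_vdd(blocks, paths, clock_period=2.6)))  # ['a', 'c']
```

Here block b stays at VddH because lowering it after a would stretch the path a-b to 3.0, beyond the clock period. The sketch also ignores the pre-defined layout pattern, which Step III handles during placement and routing.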
Comparison Between Vdd-Scaling and Dual-Vdd
• For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)
• For low clock frequency, single-Vdd scaling is better
• Still a large gap between ideal dual-Vdd and real case
  - Ideal dual-Vdd is the result without layout pattern constraint
[Figure: power (watt) vs. max. clock frequency (MHz) for circuit alu4, comparing arch-SVDT (Vdd scaling), arch-DVDT (ideal case), and arch-DVDT (pre-defined Vdd); points are labeled with supply voltages from 1.5v down to 0.9v/0.8v]
Vdd-Programmable Logic Block
• Power switches for Vdd selection and power gating
• One-bit control is needed for Vdd selection, but two-bit control is needed for power gating
Experimental Results with Vdd-Programmable Blocks
• Power vs. performance
[Figure: total power (watt) vs. clock frequency (MHz) for circuit alu4, comparing arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), and arch-DV (pre-defined Vdd); points are labeled with supply voltages from 1.5v down to 0.9v/0.8v]
Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
• Low Power Synthesis
• Conclusions