Architecture and Synthesis for Power-Efficient FPGAs

Jason Cong
University of California, Los Angeles (UCLA)
cong@cs.ucla.edu

Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
Partially supported by NSF Grants CCR-0096383 and CCR-0306682, and by Altera under the California MICRO program

Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
• Low Power Synthesis
• Conclusions

Why? FPGAs Are Known to Be Power Inefficient!
• An FPGA consumes 50-100X more power than an ASIC (Source: [Zuchowski, et al, ICCAD02])
• Why do we care about power optimization for FPGAs?

FPGA Advantages
• Short TAT (total turnaround time)
• No or very low NRE (non-recurring engineering) cost

ASICs Become Increasingly Expensive
• Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below

[Figure: total cost for mask set ($M) and cost per mask ($K) for the 250nm, 180nm, 130nm and 100nm nodes. Source: EETimes]

Data rows recovered from the accompanying table (column alignment was lost in extraction):
Process (um): 2.0, …, 0.8, 0.6, 0.35, 0.25, 0.18, 0.13, 0.10
Single mask cost ($K): 1.5, 1.5, 2.5, 4.5, 7.5, 12, 40, 60
Number of masks: 12, 12, 12, 16, 20, 26, 30, 34
Mask set cost ($K): 12, 18, 18, 30, 72, 150, 312, 1,000, 2,000

Our Research
[Diagram: "Power Efficient FPGAs" at the center, surrounded by four research areas: Fabric Design, Circuit Design, Synthesis Tools, System Design]

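FPGA Architecture
[Diagram: island-style FPGA with programmable IO, programmable logic and programmable routing]
• Basic logic element (BLE): a K-input LUT followed by a D flip-flop (inputs, clock, out)
• Logic block: BLE #1 to BLE #N, with I inputs, N outputs and a clock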

Evaluation Framework: fpgaEva-LP
fpgaEva-LP [Li, et al, FPGA'03]

[Flow diagram]
Inputs: BLIF / SLIF netlist, architecture specification
1. Logic optimization (SIS)
2. Technology mapping (RASP)
3. Timing-driven packing (T-VPack)
4. Placement & routing (VPR)
5. BC-netlist generator
6. Power simulator
Outputs: area, delay, power

BC-Netlist Generator
[Flow diagram]
Inputs: mapped netlist, layout
Steps: buffer extraction, netlist generation for logic clusters, capacitance extraction, delay calculation, back-annotation
Output: BC-netlist

Mixed-level Power Model: Overview
• Dynamic power
  - Switching power
  - Short-circuit power
  - Related to signal transitions
  - Components: functional switch and glitch
• Static power
  - Sub-threshold leakage
  - Gate leakage
  - Reverse biased leakage
  - Depending on the input vector

Model used for each power source:
Power source            Dynamic               Static
Logic block             Macro-model           Macro-model
Interconnect & clock    Switch-level model    Macro-model

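Cycle-Accurate Power Simulator
[Flow diagram]
Inputs: BC-netlist (post-layout extracted delay & capacitance), random vector generation
Core: cycle-accurate power simulation with glitch analysis, using the mixed-level power model
Loop: repeat until all cycles are finished, then report power values

Energy per cycle, where the first sum runs over the active nodes and the second over the idle nodes:
$$E_{cycle} = \sum_{i \in \mathrm{active}} E_a(n_i) + \sum_{j \in \mathrm{idle}} E_s(n_j)$$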
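The per-cycle energy sum on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the fpgaEva-LP implementation: the function names, the per-node energy dictionaries and the idea of passing one set of active nodes per cycle are all hypothetical. It assumes each node contributes its active energy E_a in a cycle where it switches and its static energy E_s in a cycle where it is idle.

```python
from typing import Dict, List, Set


def cycle_energy(active: Set[str],
                 e_active: Dict[str, float],
                 e_static: Dict[str, float]) -> float:
    """Energy of one cycle: E_a for active nodes plus E_s for idle nodes."""
    total = 0.0
    for node in e_static:            # every node appears in e_static
        if node in active:
            total += e_active[node]  # node switched in this cycle
        else:
            total += e_static[node]  # node stayed idle, only leaks
    return total


def average_power(activity: List[Set[str]],
                  e_active: Dict[str, float],
                  e_static: Dict[str, float],
                  clock_period_s: float) -> float:
    """Average power over all simulated cycles (energy / elapsed time)."""
    energy = sum(cycle_energy(a, e_active, e_static) for a in activity)
    return energy / (len(activity) * clock_period_s)
```

A simulator in this style would call `average_power` with one set of active node names per simulated cycle, produced by the random input vectors.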

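Power Breakdown
[Figure: two pie charts, one for cluster size = 12, LUT size = 4 and one for cluster size = 12, LUT size = 6, each split into logic block power, clock power and interconnect power. Extracted slice values: 19%, 15%, 22% and 40% for logic block and clock power; 45% and 59% for interconnect power. The assignment of values to the two charts was lost in extraction.]
• Interconnect power is dominant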

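Power Breakdown (cont'd)
[Figure: two pie charts, one for cluster size = 12, LUT size = 4 and one for cluster size = 12, LUT size = 6, each split into leakage power and dynamic power. Extracted slice values: leakage 42% and 52%, dynamic 48% and 58%. The assignment of values to the two charts was lost in extraction.]
• Leakage power becomes increasingly important (100nm)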

Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture
• Low Power Synthesis with Dual-Vdd
• Conclusion

Total Power along LUT and Cluster Size Changes
[Figure: total FPGA power (normalized geometric mean, axis range 1 to 2) versus LUT size (3 to 7), one curve per cluster size: 4, 6, 8, 10, 12]
Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches

Routing Architecture Evaluation

Architecture of Low-power and High-performance Applications

Low-power applications (metric: E^3 t)
  Best FPGA architecture: cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches
  Energy (E) = 0.9653, Delay (t) = 0.9904, E^3 t = 0.8909, E t^3 = 1.0080

High-performance applications (metric: E t^3)
  Best FPGA architecture: cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches
  Energy (E) = 1.0502, Delay (t) = 0.8865, E^3 t = 1.0268, E t^3 = 0.7865

• Architecture parameter selection leads to a 10% power/delay trade-off
• Uniform FPGA fabrics provide a limited power-performance tradeoff
• Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
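How the two figures of merit steer the choice of architecture can be sketched as follows. This is a hedged illustration: the function names and the two candidate architectures with their normalized energy and delay values are made up for the example and are not taken from the table on this slide. E^3 t weights energy more heavily, so it favors low-power designs, while E t^3 weights delay more heavily, so it favors high-performance designs.

```python
from typing import Dict, Tuple


def energy_delay_metrics(energy: float, delay: float) -> Dict[str, float]:
    """Return the two figures of merit named on the slide."""
    return {
        "E3t": energy ** 3 * delay,   # energy-weighted metric (low power)
        "Et3": energy * delay ** 3,   # delay-weighted metric (high performance)
    }


def best_architecture(candidates: Dict[str, Tuple[float, float]],
                      metric: str) -> str:
    """Pick the candidate with the smallest value of the chosen metric.

    candidates maps a (hypothetical) architecture name to
    (normalized energy, normalized delay).
    """
    return min(candidates,
               key=lambda name: energy_delay_metrics(*candidates[name])[metric])


# Hypothetical candidates: one frugal but slow, one fast but power-hungry.
candidates = {"frugal": (0.9, 1.1), "fast": (1.1, 0.9)}
low_power_choice = best_architecture(candidates, "E3t")
high_perf_choice = best_architecture(candidates, "Et3")
```

With these made-up numbers the energy-weighted metric selects the frugal candidate and the delay-weighted metric selects the fast one, which mirrors how a single architecture sweep can yield different winners for different application classes.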

Outline
• Introduction
• Understanding Power Consumption in FPGAs
• Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA'04]
• Low Power Synthesis with Dual-Vdd
• Conclusion

Dual-Vdd LUT Design
• Dual-Vdd technique makes use of the timing slack to reduce power
  - VddH devices on the critical path, for performance
  - VddL devices on non-critical paths, for power
  - Assume uniform Vdd for one LUT
• Threshold voltage Vt should be adjusted carefully for different Vdd levels
  - To compensate for the delay increase
  - To avoid an excessive leakage power increase
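Why a lower supply on non-critical paths saves power can be illustrated with the standard CMOS switching-power model, P = alpha * C * Vdd^2 * f. The slides do not state this formula, so the sketch below is an assumption layered on top of them: the function names and all parameter values are hypothetical, alpha is taken as an effective switching activity, and leakage and level-converter overhead are ignored.

```python
def switching_power(alpha: float, c_farads: float,
                    vdd: float, freq_hz: float) -> float:
    """Standard CMOS switching power: activity * capacitance * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * freq_hz


def dual_vdd_saving(vdd_high: float, vdd_low: float,
                    low_fraction: float) -> float:
    """Fraction of switching power saved when `low_fraction` of the
    switched capacitance is moved from VddH to VddL (everything else
    unchanged). Leakage and level converters are not modeled."""
    return low_fraction * (1.0 - (vdd_low / vdd_high) ** 2)


# Example: move half of the switched capacitance from 1.3 V to 0.8 V.
saving = dual_vdd_saving(vdd_high=1.3, vdd_low=0.8, low_fraction=0.5)
```

Because of the quadratic dependence on Vdd, even a modest share of blocks on VddL can give a visible reduction, which is why the available timing slack matters so much.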

Vdd/Vt-Scaling for LUTs
• Three scaling schemes
  - Constant-Vt scaling
  - Fixed-Vdd/Vt-ratio scaling
  - Constant-leakage scaling
• Constant-leakage scaling obtains a good tradeoff
  - Useful for both single-Vdd scaling and dual-Vdd design

[Figure: two plots comparing constant Vt, fixed-Vdd/Vt-ratio and constant leakage scaling at Vdd = 1.3 V, 1.0 V, 0.9 V and 0.8 V. One plot shows delay (ns), the other shows leakage power (uW).]

Dual-Vt LUT Design
• LUT is divided into two parts
  - Part I: configuration cells use high Vt
  - Part II: MUX tree and input buffers use normal Vt (decided by constant-leakage Vdd-scaling)
• Configuration SRAM cells
  - Content remains unchanged after configuration
  - Read/write delay is not related to FPGA performance
• Use a high Vt of about 40% of Vdd
  - Maintains signal integrity
  - Reduces SRAM leakage by 15X and LUT leakage by 2.4X
  - Increases configuration time by 13%

Pre-Defined Dual-Vt Fabric
• Power saving
  - 11.6% for combinational circuits
  - 14.6% for sequential circuits

Table 1: Combinational circuits
Circuit    arch-SVST (single Vt) power (W)    arch-SVDT (dual Vt) power saving
alu4       0.0798                             8.5%
apex2      0.108                              9.3%
apex4      0.0536                             12.3%
des        0.234                              10.7%
ex1010     0.179                              17.3%
ex5p       0.059                              11.6%
misex3     0.0753                             9.4%
pdc        0.256                              14.7%
seq        0.0927                             9.4%
spla       0.180                              12.4%
Avg.                                          11.6%

Table 2: Sequential circuits
Circuit    arch-SVST (single Vt) power (W)    arch-SVDT (dual Vt) power saving
bigkey     0.148                              12.3%
clma       0.632                              14.8%
diffeq     0.0391                             19.7%
dsip       0.134                              14.5%
elliptic   0.140                              16.3%
frisc      0.190                              19.2%
s298       0.0736                             13.4%
s38417     0.307                              11.7%
s38484     0.261                              10.2%
tseng      0.0351                             14.0%
Avg.                                          14.6%
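The two averages quoted on this slide can be checked directly from the per-circuit savings. The short sketch below only re-does that arithmetic; the dictionary and function names are illustrative, and the percentages are copied from the slide.

```python
# Per-circuit power savings (percent) of the dual-Vt fabric, from the slide.
combinational = {
    "alu4": 8.5, "apex2": 9.3, "apex4": 12.3, "des": 10.7, "ex1010": 17.3,
    "ex5p": 11.6, "misex3": 9.4, "pdc": 14.7, "seq": 9.4, "spla": 12.4,
}
sequential = {
    "bigkey": 12.3, "clma": 14.8, "diffeq": 19.7, "dsip": 14.5,
    "elliptic": 16.3, "frisc": 19.2, "s298": 13.4, "s38417": 11.7,
    "s38484": 10.2, "tseng": 14.0,
}


def mean_saving(table: dict) -> float:
    """Arithmetic mean of the per-circuit savings, in percent."""
    return sum(table.values()) / len(table)


avg_combinational = round(mean_saving(combinational), 1)
avg_sequential = round(mean_saving(sequential), 1)
```

Rounded to one decimal place, the means agree with the 11.6% and 14.6% figures stated in the bullets.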

Dual-Vdd FPGA Fabric
• Granularity: logic block (i.e., cluster of LUTs)
  - Smaller granularity => intuitively more power saving
  - But a larger implementation overhead
• Layout pattern: pre-defined dual-Vdd pattern
  - Row-based or interleaved pattern
  - Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
• Interconnect uses uniform VddH

[Figure legend: L-block = VddL, H-block = VddH]

Simple Design Flow for Dual-Vdd Fabric
• Based on the traditional design flow, but with new steps
  Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
  Step II: Dual-Vdd assignment based on sensitivity
  Step III: Timing-driven P & R considering the pre-defined dual-Vdd pattern (modified VPR)
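Step II can be pictured with a simple greedy sketch. The slides do not describe the actual assignment algorithm, so everything below is an assumption made for illustration: blocks are visited in order of decreasing power saving (used here as the "sensitivity"), a block is moved to VddL only if the critical-path delay still meets the target, and level converters and the pre-defined layout pattern are ignored. All names and data structures are hypothetical.

```python
from typing import Dict, List


def critical_delay(order: List[str], fanin: Dict[str, List[str]],
                   delay: Dict[str, float]) -> float:
    """Longest path delay through a DAG; `order` must be topological."""
    arrival: Dict[str, float] = {}
    for block in order:
        latest_input = max((arrival[p] for p in fanin.get(block, [])),
                           default=0.0)
        arrival[block] = latest_input + delay[block]
    return max(arrival.values(), default=0.0)


def assign_dual_vdd(order: List[str], fanin: Dict[str, List[str]],
                    d_high: Dict[str, float], d_low: Dict[str, float],
                    p_high: Dict[str, float], p_low: Dict[str, float],
                    target: float) -> Dict[str, str]:
    """Greedy VddL assignment: largest power saving first, keep timing."""
    level = {b: "H" for b in order}          # start from uniform VddH
    delay = dict(d_high)
    by_saving = sorted(order, key=lambda b: p_high[b] - p_low[b],
                       reverse=True)
    for block in by_saving:
        delay[block] = d_low[block]          # tentatively slow the block
        if critical_delay(order, fanin, delay) <= target:
            level[block] = "L"               # timing still met: keep VddL
        else:
            delay[block] = d_high[block]     # revert to VddH
    return level
```

In this sketch a block that sits off the critical path absorbs the VddL slowdown in its slack, while blocks on the critical path stay at VddH.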

Comparison Between Vdd-Scaling and Dual-Vdd
• For high clock frequency, dual-Vdd achieves ~6% total power saving (~18% logic power saving)
• For low clock frequency, single-Vdd scaling is better
• Still a large gap between ideal dual-Vdd and the real case
  - Ideal dual-Vdd is the result without the layout pattern constraint

[Figure: power (W, 0.03 to 0.09) versus max. clock frequency (MHz, 65 to 125) for circuit alu4. Series: arch-SVDT (Vdd scaling), arch-DVDT (ideal case), arch-DVDT (pre-defined Vdd). Data points are labeled with supply levels from 0.9 V to 1.5 V for single-Vdd and from 0.9v/0.8v to 1.5v/1.0v for dual-Vdd.]

Vdd-Programmable Logic Block
• Power switches for Vdd selection and power gating
• One-bit control is needed for Vdd selection, but two-bit control is needed for power gating

Experimental Results with Vdd-Programmable Blocks
• Power vs. performance

[Figure: total power (W, 0.03 to 0.09) versus clock frequency (MHz, 65 to 125) for circuit alu4. Series: arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), arch-DV (pre-defined Vdd). Data points are labeled with supply levels from 1.0 V to 1.5 V for single-Vdd and from 0.9v/0.8v to 1.5v/1.0v for dual-Vdd.]
