Towards Layout-Friendly High-Level Synthesis Jason Cong UCLA Bin - PowerPoint PPT Presentation

Towards Layout-Friendly High-Level Synthesis
Jason Cong (UCLA), Bin Liu (UCLA), Guojie Luo (Peking University), Raghu Prabhakar (UCLA)

Outline
• High-level synthesis and layout-friendly architecture
• Evaluation of the impact of high-level decisions
• Evaluation of metrics for scheduling/binding
• Conclusion

High-Level Synthesis
• Synthesis as a model refinement process
• Mature RTL-to-layout flow today
• Behavior model: one level above RTL
  – C/C++/SystemC/Matlab, etc.
• High-level synthesis
  – Untimed behavioral model to cycle-accurate RTL
  – Typically: C to Verilog
[Figure: refinement chain: Behavioral Model -> RTL Model -> Gate-Level Netlist -> Layout]

A Typical Synthesis Flow from Behavior Level
Example behavior:
    t1 = a + b;
    t2 = c * d;
    t3 = e + f;
    t4 = t1 * t2;
    z = t4 - t3;
Flow:
• Compiler transformation: Program -> CDFG
• Scheduling: CDFG -> FSMD
• Binding: FSMD -> RTL
• RTL Synthesis, P&R …
[Figure: the example's data-flow graph scheduled into control states S0, S1, S2 (3 cycles)]
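The scheduling step for the five-statement example can be sketched as an as-soon-as-possible (ASAP) pass over the data-flow graph. This is a minimal illustration, not xPilot's scheduler: it assumes every operation takes exactly one control state, that there are no resource limits and no operation chaining, and that the operations are listed in topological order. The `Op` type and the names `example_ops` and `asap_schedule` are hypothetical.

```c
#include <assert.h>

enum { NUM_OPS = 5 };

/* One operation; pred[k] is the index of the operation producing operand k,
 * or -1 when the operand is a primary input (a, b, c, d, e, f). */
typedef struct {
    const char *name;
    int pred[2];
} Op;

/* t1 = a + b; t2 = c * d; t3 = e + f; t4 = t1 * t2; z = t4 - t3;
 * listed in topological order (assumption of this sketch). */
const Op example_ops[NUM_OPS] = {
    {"t1", {-1, -1}},
    {"t2", {-1, -1}},
    {"t3", {-1, -1}},
    {"t4", {0, 1}},
    {"z",  {3, 2}},
};

/* ASAP: each operation goes into the earliest control state after all of
 * its producers. Writes the state index of each operation into state[]
 * and returns the total number of control states. */
int asap_schedule(const Op *ops, int n, int *state)
{
    int num_states = 0;
    for (int i = 0; i < n; i++) {
        state[i] = 0;
        for (int k = 0; k < 2; k++) {
            int p = ops[i].pred[k];
            if (p >= 0) {
                assert(p < i); /* producers must appear earlier in the list */
                if (state[p] + 1 > state[i])
                    state[i] = state[p] + 1;
            }
        }
        if (state[i] + 1 > num_states)
            num_states = state[i] + 1;
    }
    return num_states;
}
```

Under these assumptions, calling `asap_schedule(example_ops, NUM_OPS, state)` places t1, t2 and t3 in S0, t4 in S1 and z in S2, which matches the 3 cycles shown on the slide.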

A Short History of High-Level Synthesis
• 1980s to early 1990s: research and prototype
• Late 1990s: early commercialization
  – Synopsys Behavioral Compiler, etc.
  – Mostly from behavioral VHDL/Verilog
• 2000 to present: another wave of commercialization
  – C-based languages (C/C++/SystemC) as input
  – AutoESL (Xilinx), Cadence, Forte, Mentor (Calypto), NEC, Synfora (Synopsys), Synopsys
  – Growing interest driven by design complexity and time-to-market pressure

xPilot: Behavioral-to-RTL Synthesis Flow [SOCC'2006]
• Advanced transformations/optimizations
  – Loop unrolling/shifting/pipelining
  – Strength reduction / Tree height reduction
  – Bitwidth analysis
  – Memory analysis …
• Core behavior synthesis optimizations
  – Scheduling
  – Resource binding, e.g., functional unit binding, register/port binding
• Arch-generation & RTL/constraints generation
  – Verilog/VHDL/SystemC
  – FPGAs: Altera, Xilinx
  – ASICs: Magma, Synopsys, …
[Figure: flow: Behavioral spec. in C/C++/SystemC + Platform description -> Frontend compiler -> SSDM -> RTL + constraints -> FPGAs/ASICs]

AutoPilot Compilation Tool (based on UCLA xPilot system)
• Platform-based C to FPGA synthesis
• Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
• Full support of IEEE-754 floating point data types & operations
• Efficiently handle bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality-of-results
Developed by AutoESL, acquired by Xilinx in Jan. 2011
[Figure: AutoPilot flow: Design Specification (C/C++/SystemC), User Constraints and Common Testbench -> Compilation & Elaboration -> Presynthesis Optimizations -> Behavioral & Communication Synthesis and Optimizations (with Platform Characterization Library) -> RTL HDLs & RTL SystemC plus Timing/Power/Layout Constraints -> FPGA Co-Processor; Simulation, Verification, and Prototyping alongside]

AutoPilot Results: Sphere Decoder (from Xilinx)
• Wireless MIMO Sphere Decoder
  – ~4000 lines of C code
  – Xilinx Virtex-5 at 225 MHz
• Compared to optimized IP
  – 11-31% better resource usage

    Metric      RTL Expert   AutoPilot Expert   Diff (%)
    LUTs        32,708       29,060             -11%
    Registers   44,885       31,000             -31%
    DSP48s      225          201                -11%
    BRAMs       128          99                 -26%

TCAD April 2011 (keynote paper) "High-Level Synthesis for FPGAs: From Prototyping to Deployment"
[Figure: top-level block diagram: 8x8 RVD QRD; 4x4, 3x3 and 2x2 chains of Matrix multiply, QRD, Matrix Inverse, Back Subst. and Norm Search/Reorder; Tree Search Sphere Detector (Stage 1 … Stage 8); Min Search]

AutoPilot Results: DQPSK Receiver (from BDTI)
• Application
  – DQPSK receiver
  – 18.75 Msamples @ 75 MHz clock speed
• Area better than hand-coded

    Xilinx XC3SD3400A chip utilization ratio (lower the better)
    Hand-coded RTL   5.9%
    AutoPilot        5.6%

BDTi evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pdf

AutoPilot Results: Optical Flow (from BDTI)
• Application
  – Optical flow, 1280x720 progressive scan
  – Design too complex for an RTL team
• Compared to high-end DSP:
  – 30X higher throughput, 40X better cost/fps

    Chip                                                  Unit cost   Highest frame rate @ 720p (fps)   Cost/performance ($/frame/second)
    Xilinx Spartan3ADSP XC3SD3400A chip                   $27         183                               $0.14
    Texas Instruments TMS320DM6437 DSP processor          $21         5.1                               $4.20

BDTi evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pdf
[Figure: input video frames and output video]

Impact on Quality of Result
• Big impact on QoR due to drastically different architectures
  – Parallel/sequential/pipelined
  – Different ways to map operations to control states
  – Different ways to share functional units/registers/interconnects
• Opportunity to select from multiple possible implementations
  – Instead of struggling with a sub-optimal RTL
• Need metrics/models to decide which implementation is superior
  – Performance/throughput/area can be estimated reasonably well in HLS
  – Frequency/congestion is quite hard
  – Some RTL structures lead to long interconnect delay after layout

Interconnect Estimation: the Challenge
• Estimation of interconnect timing and congestion is hard at a high level
  – Long wires/congestion occur during layout
• Incorporate layout in synthesis?
  – Reasonable, but time consuming
  – May not be necessary if we only want to estimate whether one solution is better than another
  – Try to get the more layout-friendly solution
• In this work
  – Experimentally evaluate the impact of HLS decisions on congestion
  – Evaluate some possible metrics without doing layout

Experiment Setup
• Varying strategies in HLS
• Impacts of compiler transformation (loop unrolling, memory partitioning, etc.) and synthesis engine (scheduling & binding) evaluated separately
• 5 DSP benchmarks (lots of multiplication/addition, simple or no control flow) for synthesis engine

Flow stages:
• Compiler transformation: Program -> CDFG
• Scheduling: CDFG -> FSMD
• Binding: FSMD -> RTL

Scheduling strategies:
    #   Scheduling objective        Resource constraint
    1   ASAP (as soon as possible)  None
    2   ALAP (as late as possible)  None
    3   MINREG (reduce registers)   None
    4   ALAP                        #M = ceil(0.25 * m)
    5   ALAP                        #M = ceil(0.25 * m), #A = ceil(0.4 * a)
    6   MINREG                      #M = ceil(0.1 * m), #A = ceil(0.2 * a)

Binding strategies:
    #    Binding objective                  Constraint
    1    Total area                         None
    2    Total area                         Mux_input <= 4
    3    #R (total number of registers)     Mux_input <= 4
    4    #M                                 None
    5    #M                                 Mux_input <= 4
    6    #M and #R                          None
    7    #M and #R                          Mux_input <= 4
    8    #M and #A                          None
    9    #M and #A                          Mux_input <= 4
    10   #M and #A and #R                   Mux_input <= 4

#M: number of multipliers; m: number of multiplications
#A: number of adders; a: number of additions/subtractions

Benchmarks:
    Design   Number of lines in C   Number of nodes in CDFG
    Test1    96                     78
    Test2    20                     90
    Test3    97                     160
    Test4    16                     50
    Test5    87                     390
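The resource constraints in scheduling strategies 4 to 6 are ceilings of a fraction of the operation count. A small sketch of that arithmetic follows, written with integer fractions (0.25 = 1/4, 0.4 = 2/5, 0.1 = 1/10, 0.2 = 1/5) to avoid floating-point rounding. The function name `resource_limit` and the operation counts used in the usage note are hypothetical; they do not come from the benchmarks.

```c
#include <assert.h>

/* Returns ceil((num / den) * count) using integer arithmetic only.
 * Example: the limit #M = ceil(0.25 * m) is resource_limit(1, 4, m). */
int resource_limit(int num, int den, int count)
{
    assert(num >= 0 && den > 0 && count >= 0);
    return (num * count + den - 1) / den;
}
```

For an illustrative design with m = 10 multiplications and a = 7 additions/subtractions, strategies 4 and 5 would allow `resource_limit(1, 4, 10)` = 3 multipliers, strategy 5 would also allow `resource_limit(2, 5, 7)` = 3 adders, and strategy 6 would allow 1 multiplier and 2 adders.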

The RTL Implementation Flow for Routability Evaluation
• C program
• High-level synthesis by xPilot (with different strategies)
• Verilog code
• RTL elaboration by Quartus
• Logic synthesis by ABC
• Pack & place by VPACK+VPR
• Routing by VPR
• Evaluation

Implementation Flow Setup
• Target platform: island-style FPGA
  – 10 4-LUTs per CLB, with routing channels between CLBs (span = 1 CLB)
  – The number of routing tracks per channel (channel width) is constant
• Configurations of the toolchain
  – Logic synthesis by ABC with default settings
  – Packing by T-VPACK with default settings
  – Wirelength-driven placement by VPR using simulated annealing
  – Routing by VPR using negotiation-based routing and directed search
    • The channel width is variable and determined by binary search
• Post-layout characteristics
  – Maximum channel width (CW_max)
  – Average wirelength (WL_avg) = average #tracks per net
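The binary search over channel width mentioned above can be sketched as follows. This is not VPR's code: the router is abstracted as a callback that reports whether routing succeeds at a given width, and the sketch assumes routability is monotonic (if a width routes, every larger width routes too). `RoutableFn`, `min_channel_width` and `toy_router` are hypothetical names, and `toy_router` is only a stand-in for a real routing run.

```c
#include <assert.h>

/* Hypothetical callback: returns nonzero if the design routes successfully
 * with the given number of tracks per channel. */
typedef int (*RoutableFn)(int channel_width, void *ctx);

/* Returns the smallest width in [lo, hi] at which routing succeeds,
 * or -1 if routing fails even at hi. Assumes monotonic routability. */
int min_channel_width(RoutableFn routable, void *ctx, int lo, int hi)
{
    assert(routable != 0 && lo >= 1 && lo <= hi);
    if (!routable(hi, ctx))
        return -1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (routable(mid, ctx))
            hi = mid;      /* mid works: the answer is mid or smaller */
        else
            lo = mid + 1;  /* mid fails: the answer is larger than mid */
    }
    return lo;
}

/* Stand-in for a real router: succeeds once the width reaches the
 * threshold stored in ctx. */
int toy_router(int channel_width, void *ctx)
{
    const int *needed = (const int *)ctx;
    return channel_width >= *needed;
}
```

With a threshold of 37 and a search range of 1 to 128, `min_channel_width(toy_router, &needed, 1, 128)` returns 37 after about seven routing attempts, which is why binary search is attractive when each attempt is a full routing run.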

Impact of the Synthesis Engine
• 60 RTLs generated for each design
  – 6 scheduling strategies, 10 binding strategies
  – Some of the resulting RTLs are equivalent
• Results: min/max for each metric
  – The RTLs differ markedly in these metrics although they are behaviorally equivalent

Impact of the Synthesis Engine (min vs max)
[Figure: four bar charts (CW_max, CW_avg, WL_tot, WL_avg) comparing the minimum and maximum values over the generated RTLs for test1 to test5]

Impact of Compiler Transformations
• A matrix multiplication example

    outer_loop: for (i = 0; i < 8; i++) {
      middle_loop: for (j = 0; j < 8; j++) {
        Result[i][j] = 0;
        inner_loop: for (k = 0; k < 8; k++)
          Result[i][j] += X[i][k] * Y[k][j];
      }
    }

• Different ways to transform/pipeline the code, partition memory

    #   Loop                                                Memory
    1   Keep all loops, pipeline inner loop                 As is
    2   Unroll inner loop, pipeline middle loop             Partition X into columns and Y into rows to allow simultaneous accesses
    3   Unroll inner and middle loop, pipeline outer loop   Partition X and Y into scalars, partition Result into columns
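Strategy 2 in the table above can be illustrated in plain C by unrolling the inner loop by hand and storing X column by column, so that the eight operands of one Result element come from eight different arrays (memory banks). This is only a software model of the transformation: the `int` element type and the function names are assumptions, and the pipelining itself is done by the HLS tool and is not expressed here.

```c
enum { N = 8 };

/* Strategy 1 shape: all three loops kept, as on the slide. */
void matmul_rolled(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Result[i][j] = 0;
            for (int k = 0; k < N; k++)
                Result[i][j] += X[i][k] * Y[k][j];
        }
    }
}

/* Models partitioning X into columns: X_col[k] is one bank that holds
 * column k of X, so X_col[k][i] == X[i][k]. */
void partition_columns(int X[N][N], int X_col[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            X_col[k][i] = X[i][k];
}

/* Strategy 2 shape: inner loop fully unrolled. Each term reads a different
 * column bank of X and a different row of Y (Y[k] is row k), so hardware
 * could fetch all sixteen operands in the same cycle. */
void matmul_inner_unrolled(int X_col[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Result[i][j] =
                X_col[0][i] * Y[0][j] + X_col[1][i] * Y[1][j] +
                X_col[2][i] * Y[2][j] + X_col[3][i] * Y[3][j] +
                X_col[4][i] * Y[4][j] + X_col[5][i] * Y[5][j] +
                X_col[6][i] * Y[6][j] + X_col[7][i] * Y[7][j];
        }
    }
}
```

Both functions compute the same Result; the difference lies in how many multipliers and memory ports the generated hardware needs, which is what changes the layout characteristics.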

Impact of Compiler Transformations
