Towards Layout-Friendly High-Level Synthesis
Jason Cong (UCLA), Bin Liu (UCLA), Guojie Luo (Peking University), Raghu Prabhakar (UCLA)
Outline
• High-level synthesis and layout-friendly architecture
• Evaluation of the impact of high-level decisions
• Evaluation of metrics for scheduling/binding
• Conclusion
High-Level Synthesis
• Synthesis as a model refinement process
  – Mature RTL-to-layout flow today
• Behavior model: one level above RTL
  – C/C++/SystemC/Matlab, etc.
• High-level synthesis
  – Untimed behavioral model to cycle-accurate RTL
  – Typically: C to Verilog
• Refinement levels (diagram): Behavioral Model -> RTL Model -> Gate-Level Netlist -> Layout
A Typical Synthesis Flow from Behavior Level
Example behavior:
    t1 = a + b;
    t2 = c * d;
    t3 = e + f;
    t4 = t1 * t2;
    z = t4 - t3;
• Compiler transformation: Program -> CDFG
• Scheduling: CDFG -> FSMD (the example is scheduled into control states S0, S1, S2, i.e., 3 cycles)
• Binding: FSMD -> RTL
• RTL synthesis, P&R …
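The scheduling step above can be illustrated with a minimal sketch of unconstrained ASAP scheduling on the five-operation example. It assumes every operation takes one cycle and that there are no resource limits; the `Op` struct, `example_ops`, and `asap_schedule` are names made up for this illustration and are not taken from xPilot. The schedule drawn on the slide may place individual operations differently, but it also needs 3 cycles.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PREDS 2

/* One operation in the data-flow graph.
 * preds[] holds indices of earlier operations; -1 marks a primary input. */
typedef struct {
    const char *name;
    int preds[MAX_PREDS];
} Op;

/* The five operations of the example, listed in topological order. */
const Op example_ops[] = {
    {"t1 = a + b",   {-1, -1}},
    {"t2 = c * d",   {-1, -1}},
    {"t3 = e + f",   {-1, -1}},
    {"t4 = t1 * t2", { 0,  1}},
    {"z = t4 - t3",  { 3,  2}},
};

/* ASAP: each operation starts one step after its latest predecessor.
 * Writes the control step of each operation into step[] and
 * returns the total number of cycles. */
int asap_schedule(const Op *ops, size_t n, int *step)
{
    int cycles = 0;
    for (size_t i = 0; i < n; i++) {
        step[i] = 0;
        for (int p = 0; p < MAX_PREDS; p++) {
            int pred = ops[i].preds[p];
            if (pred < 0)
                continue;
            assert((size_t)pred < i); /* requires topological order */
            if (step[pred] + 1 > step[i])
                step[i] = step[pred] + 1;
        }
        if (step[i] + 1 > cycles)
            cycles = step[i] + 1;
    }
    return cycles;
}
```

Calling `asap_schedule(example_ops, 5, step)` places t1, t2, and t3 in step 0, t4 in step 1, and z in step 2.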
A Short History of High-Level Synthesis
• 1980s to early 1990s: research and prototype
• Late 1990s: early commercialization
  – Synopsys Behavioral Compiler, etc.
  – Mostly from behavioral VHDL/Verilog
• 2000 to present: another wave of commercialization
  – C-based languages (C/C++/SystemC) as input
  – AutoESL (Xilinx), Cadence, Forte, Mentor (Calypto), NEC, Synfora (Synopsys), Synopsys
• Growing interest driven by design complexity and time-to-market pressure
xPilot: Behavioral-to-RTL Synthesis Flow [SOCC’2006]
• Flow (diagram): behavioral spec. in C/C++/SystemC + platform description -> frontend compiler -> SSDM -> RTL + constraints -> FPGAs/ASICs
• Advanced transformations/optimizations
  – Loop unrolling/shifting/pipelining
  – Strength reduction / tree height reduction
  – Bitwidth analysis
  – Memory analysis …
• Core behavior synthesis optimizations
  – Scheduling
  – Resource binding, e.g., functional unit binding, register/port binding
• Arch-generation & RTL/constraints generation
  – Verilog/VHDL/SystemC
  – FPGAs: Altera, Xilinx
  – ASICs: Magma, Synopsys, …
AutoPilot Compilation Tool (based on the UCLA xPilot system)
• Platform-based C to FPGA synthesis
• Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
• Full support of IEEE-754 floating point data types & operations
• Efficiently handle bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality-of-results
• Developed by AutoESL, acquired by Xilinx in Jan. 2011
Flow diagram labels: Design Specification (C/C++/SystemC), User Constraints, Common Testbench; Simulation, Verification, and Prototyping; AutoPilot™ ESL Synthesis: Compilation & Elaboration, Presynthesis Optimizations, Behavioral & Communication Synthesis and Optimizations; Platform Characterization Library; RTL HDLs & RTL SystemC; Timing/Power/Layout Constraints; FPGA Co-Processor
AutoPilot Results: Sphere Decoder (from Xilinx)
• Wireless MIMO Sphere Decoder
  – ~4000 lines of C code
  – Xilinx Virtex-5 at 225 MHz
• Compared to optimized IP
  – 11-31% better resource usage

Metric      RTL Expert   AutoPilot Expert   Diff (%)
LUTs        32,708       29,060             -11%
Registers   44,885       31,000             -31%
DSP48s      225          201                -11%
BRAMs       128          99                 -26%

Toplevel block diagram: 4x4, 3x3, and 2x2 chains of Matrix Multiply, QRD, Back Subst., Matrix Inverse, and Norm Search/Reorder; an 8x8 RVD QRD; and a Tree Search Sphere Detector (Stage 1 … Stage 8) with Min Search
TCAD April 2011 (keynote paper): “High-Level Synthesis for FPGAs: From Prototyping to Deployment”
AutoPilot Results: DQPSK Receiver (from BDTI)
• Application: DQPSK receiver, 18.75 Msamples @ 75 MHz clock speed
• Area better than hand-coded

Xilinx XC3SD3400A chip utilization ratio (lower is better)
Hand-coded RTL   AutoPilot
5.9%             5.6%

BDTi evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pd
AutoPilot Results: Optical Flow (from BDTI)
• Application: optical flow, 1280x720 progressive scan
• Design too complex for an RTL team
• Compared to high-end DSP: 30X higher throughput, 40X better cost/fps

Chip                                             Unit cost   Highest frame rate @ 720p (fps)   Cost/performance ($/frame/second)
Xilinx Spartan3ADSP XC3SD3400A chip              $27         183                               $0.14
Texas Instruments TMS320DM6437 DSP processor     $21         5.1                               $4.20

[Figure: input video and output video frames]
BDTi evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pdf
Impact on Quality of Result
• Big impact on QoR due to drastically different architectures
  – Parallel/sequential/pipelined
  – Different ways to map operations to control states
  – Different ways to share functional units/registers/interconnects
• Opportunity to select from multiple possible implementations
  – Instead of struggling with a sub-optimal RTL
• Need metrics/models to decide which implementation is superior
  – Performance/throughput/area can be estimated reasonably well in HLS
  – Frequency/congestion is quite hard
  – Some RTL structures lead to long interconnect delay after layout
Interconnect Estimation: the Challenge
• Estimation of interconnect timing and congestion is hard at a high level
  – Long wires/congestion occur during layout
• Incorporate layout in synthesis?
  – Reasonable, but time consuming
  – May not be necessary if we just want to estimate whether one solution is better than the other
  – Try to get the more layout-friendly solution
• In this work
  – Experimentally evaluate the impact of HLS decisions on congestion
  – Evaluate some possible metrics without doing layout
Outline
• High-level synthesis and layout-friendly architecture
• Evaluation of the impact of high-level decisions
• Evaluation of metrics for scheduling/binding
• Conclusion
Experiment Setup
• Varying strategies in HLS
  – Compiler transformation (Program -> CDFG), scheduling (CDFG -> FSMD), binding (FSMD -> RTL)
• Impacts of compiler transformation (loop unrolling, memory partitioning, etc.) and synthesis engine (scheduling & binding) evaluated separately
• 5 DSP benchmarks (lots of multiplication/addition, simple or no control flow) for synthesis engine

Benchmark   Number of lines in C   Number of nodes in CDFG
Test1       96                     78
Test2       20                     90
Test3       97                     160
Test4       16                     50
Test5       87                     390

#   Scheduling objective              Resource constraint
1   ASAP (as soon as possible)        None
2   ALAP (as late as possible)        None
3   MINREG (reduce registers)         None
4   ALAP                              #M = ceil(0.25 * m)
5   ALAP                              #M = ceil(0.25 * m), #A = ceil(0.4 * a)
6   MINREG                            #M = ceil(0.1 * m), #A = ceil(0.2 * a)

#    Binding objective                 Constraint
1    Total area                        None
2    Total area                        Mux_input <= 4
3    #R (total number of registers)    Mux_input <= 4
4    #M                                None
5    #M                                Mux_input <= 4
6    #M and #R                         None
7    #M and #R                         Mux_input <= 4
8    #M and #A                         None
9    #M and #A                         Mux_input <= 4
10   #M and #A and #R                  Mux_input <= 4

#M: number of multipliers; m: number of multiplications
#A: number of adders; a: number of additions/subtractions
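As a small worked illustration of the resource constraints in the scheduling table, the sketch below computes limits of the form ceil(fraction * count) in integer arithmetic. The helper names `ceil_fraction`, `ResourceLimit`, and `strategy5_limits` are hypothetical and only serve this example; the operation counts used in the usage note are made up, not taken from the benchmarks.

```c
#include <assert.h>

/* ceil((num / den) * count) using integers only;
 * e.g. num = 1, den = 4 gives ceil(0.25 * count). */
int ceil_fraction(int num, int den, int count)
{
    assert(num >= 0 && den > 0 && count >= 0);
    return (num * count + den - 1) / den;
}

typedef struct {
    int multipliers; /* #M */
    int adders;      /* #A */
} ResourceLimit;

/* Scheduling strategy 5: #M = ceil(0.25 * m), #A = ceil(0.4 * a),
 * where m and a are the multiplication and addition/subtraction counts. */
ResourceLimit strategy5_limits(int m, int a)
{
    ResourceLimit r = { ceil_fraction(1, 4, m), ceil_fraction(2, 5, a) };
    return r;
}
```

For a hypothetical design with m = 10 multiplications and a = 10 additions, `strategy5_limits(10, 10)` allows 3 multipliers and 4 adders.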
The RTL Implementation Flow for Routability Evaluation
C program
-> High-level synthesis by xPilot (with different strategies)
-> Verilog code
-> RTL elaboration by Quartus
-> Logic synthesis by ABC
-> Pack & place by VPACK+VPR
-> Routing by VPR
-> Evaluation
Implementation Flow Setup
• Target platform: island-style FPGA
  – 10 4-LUTs per CLB, with routing channels between CLBs (span = 1 CLB)
  – The number of routing tracks per channel (channel width) is constant
• Configurations of the toolchain
  – Logic synthesis by ABC with default settings
  – Packing by T-VPACK with default settings
  – Wirelength-driven placement by VPR using simulated annealing
  – Routing by VPR using negotiation-based routing and directed search
  – The channel width is variable and determined by binary search
• Post-layout characteristics
  – Maximum channel width (CW_max)
  – Average wirelength (WL_avg) = average #tracks per net
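The binary search over channel width mentioned above can be sketched as follows. This is not VPR's implementation: the `RouteFn` callback, `min_channel_width`, and `toy_router` are hypothetical names, and the sketch assumes routability is monotonic (if a width routes, every larger width routes too).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: returns true if the design routes successfully
 * with `width` tracks per channel. */
typedef bool (*RouteFn)(int width, void *ctx);

/* Smallest width in [lo, hi] for which route() succeeds,
 * or -1 if even hi fails. Assumes routability is monotonic in width. */
int min_channel_width(RouteFn route, void *ctx, int lo, int hi)
{
    assert(lo >= 1 && lo <= hi);
    if (!route(hi, ctx))
        return -1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (route(mid, ctx))
            hi = mid;      /* mid routes: the answer is mid or smaller */
        else
            lo = mid + 1;  /* mid fails: the answer is larger than mid */
    }
    return lo;
}

/* Stand-in router for illustration only: succeeds once the width
 * reaches the threshold stored in ctx. */
bool toy_router(int width, void *ctx)
{
    return width >= *(const int *)ctx;
}
```

In a real flow, each call to the callback would be a full routing attempt, so the logarithmic number of attempts is what makes the search affordable.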
Impact of the Synthesis Engine
• 60 RTLs generated for each design
  – 6 scheduling strategies × 10 binding strategies
  – Some are equivalent
• Results: min/max for each metric
  – Clearly, very different although behaviorally equivalent
Impact of the Synthesis Engine (min vs max)
[Figure: four bar charts comparing the minimum and maximum values of CW_max, CW_avg, WL_tot, and WL_avg across the generated RTLs for test1 through test5]
Impact of Compiler Transformations
A matrix multiplication example:
    outer_loop: for (i = 0; i < 8; i++) {
        middle_loop: for (j = 0; j < 8; j++) {
            Result[i][j] = 0;
            inner_loop: for (k = 0; k < 8; k++)
                Result[i][j] += X[i][k] * Y[k][j];
        }
    }
Different ways to transform/pipeline the code and partition memory:
1. Loop: keep all loops, pipeline inner loop. Memory: as is.
2. Loop: unroll inner loop, pipeline middle loop. Memory: partition X into columns and Y into rows to allow simultaneous accesses.
3. Loop: unroll inner and middle loop, pipeline outer loop. Memory: partition X and Y into scalars, partition Result into columns.
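To make the difference between strategies 1 and 2 concrete, the sketch below writes out the inner-loop unrolling by hand in plain C. It assumes `int` elements (the slide does not give the element type) and shows only the loop transformation; pipelining and memory partitioning are tool directives and are not modeled here. The function names are made up for this illustration.

```c
#include <assert.h>

#define N 8

/* Strategy 1: the loop nest as written, all three loops kept. */
void matmul_baseline(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            Result[i][j] = 0;
            for (int k = 0; k < N; k++)
                Result[i][j] += X[i][k] * Y[k][j];
        }
}

/* Strategy 2: inner loop fully unrolled. Each middle-loop iteration reads
 * a whole row of X and a whole column of Y, which is why the slide
 * partitions X into columns and Y into rows for simultaneous access. */
void matmul_unrolled_inner(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Result[i][j] = X[i][0] * Y[0][j] + X[i][1] * Y[1][j]
                         + X[i][2] * Y[2][j] + X[i][3] * Y[3][j]
                         + X[i][4] * Y[4][j] + X[i][5] * Y[5][j]
                         + X[i][6] * Y[6][j] + X[i][7] * Y[7][j];
}

/* Checks that two result matrices match element by element. */
void assert_same(int A[N][N], int B[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(A[i][j] == B[i][j]);
}
```

Both functions compute the same result; what changes is the hardware an HLS tool would derive from them: eight multiplications become available in parallel, at the cost of eight simultaneous reads from each input array.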
Impact of Compiler Transformations
Outline
• High-level synthesis and layout-friendly architecture
• Evaluation of the impact of high-level decisions
• Evaluation of metrics for scheduling/binding
• Conclusion