Towards Layout-Friendly High-Level Synthesis


  1. Towards Layout-Friendly High-Level Synthesis
     - Jason Cong (UCLA), Bin Liu (UCLA), Guojie Luo (Peking University), Raghu Prabhakar (UCLA)

  2. Outline
     - High-level synthesis and layout-friendly architecture
     - Evaluation of the impact of high-level decisions
     - Evaluation of metrics for scheduling/binding
     - Conclusion

  3. High-Level Synthesis
     - Synthesis as a model refinement process
     - Mature RTL-to-layout flow today
     - Behavior model: one level above RTL
     - C/C++/SystemC/Matlab, etc.
     - High-level synthesis: untimed behavioral model to cycle-accurate RTL
     - Typically: C to Verilog
     - Diagram (refinement levels): Behavioral Model -> RTL Model -> Gate-Level Netlist -> Layout

  4. A Typical Synthesis Flow from Behavior Level
     - Example behavior: t1 = a + b; t2 = c * d; t3 = e + f; t4 = t1 * t2; z = t4 - t3;
     - Compiler transformation: Program -> CDFG
     - Scheduling: CDFG -> FSMD
     - Binding: FSMD -> RTL
     - RTL synthesis, P&R, …
     - Diagram: the example is scheduled into control states S0, S1, S2 (3 cycles)
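The scheduling step on this slide can be illustrated with a minimal sketch. It is not xPilot's scheduler: it assumes every operation takes exactly one cycle and that resources are unlimited, and it places each operation as soon as its operands are ready (ASAP). The `Op` struct, `asap_schedule`, and `slide_ops` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PREDS 2

/* One operation of the data-flow graph. preds holds indices of the
 * operations that produce its operands; -1 marks a primary input. */
typedef struct {
    const char *name;
    int preds[MAX_PREDS];
} Op;

/* ASAP scheduling under the assumptions above (one cycle per operation,
 * no resource limit). ops must be listed in topological order.
 * Writes the control state of each op into state[] and returns the
 * total number of control states. */
int asap_schedule(const Op *ops, size_t n, int *state)
{
    int num_states = 0;
    for (size_t i = 0; i < n; i++) {
        state[i] = 0;
        for (int p = 0; p < MAX_PREDS; p++) {
            int pred = ops[i].preds[p];
            if (pred < 0)
                continue;                 /* operand is a primary input */
            assert(pred < (int)i);        /* topological order required */
            if (state[pred] + 1 > state[i])
                state[i] = state[pred] + 1;
        }
        if (state[i] + 1 > num_states)
            num_states = state[i] + 1;
    }
    return num_states;
}

/* The five operations from the slide. */
static const Op slide_ops[5] = {
    {"t1 = a + b",   {-1, -1}},
    {"t2 = c * d",   {-1, -1}},
    {"t3 = e + f",   {-1, -1}},
    {"t4 = t1 * t2", { 0,  1}},
    {"z = t4 - t3",  { 3,  2}},
};
```

Calling `asap_schedule(slide_ops, 5, state)` places t1, t2, and t3 in S0, t4 in S1, and z in S2, which gives the 3 cycles shown on the slide.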

  5. A Short History of High-Level Synthesis
     - 1980s to early 1990s: research and prototype
     - Late 1990s: early commercialization
       - Synopsys Behavioral Compiler, etc.
       - Mostly from behavioral VHDL/Verilog
     - 2000 to present: another wave of commercialization
       - C-based languages (C/C++/SystemC) as input
       - AutoESL (Xilinx), Cadence, Forte, Mentor (Calypto), NEC, Synfora (Synopsys), Synopsys
     - Growing interest driven by design complexity and time-to-market pressure

  6. xPilot: Behavioral-to-RTL Synthesis Flow [SOCC’2006]
     - Advanced transformation/optimizations
       - Loop unrolling/shifting/pipelining
       - Strength reduction / tree height reduction
       - Bitwidth analysis
       - Memory analysis …
     - Core behavior synthesis optimizations
       - Scheduling
       - Resource binding, e.g., functional unit binding, register/port binding
     - Arch-generation & RTL/constraints generation
       - Verilog/VHDL/SystemC
       - FPGAs: Altera, Xilinx
       - ASICs: Magma, Synopsys, …
     - Diagram: Behavioral spec. in C/C++/SystemC and platform description -> Frontend compiler -> SSDM -> RTL + constraints -> FPGAs/ASICs

  7. AutoPilot Compilation Tool (based on UCLA xPilot system)
     - Platform-based C to FPGA synthesis
     - Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
     - Full support of IEEE-754 floating point data types & operations
     - Efficiently handle bit-accurate fixed-point arithmetic
     - More than 10X design productivity gain
     - High quality-of-results
     - Developed by AutoESL, acquired by Xilinx in Jan. 2011
     - Diagram labels: Design Specification (C/C++/SystemC); User Constraints; Common Testbench; Simulation, Verification, and Prototyping; Compilation & Elaboration; AutoPilot™ ESL Synthesis; Presynthesis Optimizations; Behavioral & Communication Synthesis and Optimizations; Platform Characterization Library; RTL HDLs & RTL SystemC; Timing/Power/Layout Constraints; FPGA Co-Processor

  8. AutoPilot Results: Sphere Decoder (from Xilinx)
     - Wireless MIMO Sphere Decoder
       - ~4000 lines of C code
       - Xilinx Virtex-5 at 225 MHz
     - Compared to optimized IP
       - 11-31% better resource usage

     | Metric    | RTL Expert | AutoPilot Expert | Diff (%) |
     |-----------|------------|------------------|----------|
     | LUTs      | 32,708     | 29,060           | -11%     |
     | Registers | 44,885     | 31,000           | -31%     |
     | DSP48s    | 225        | 201              | -11%     |
     | BRAMs     | 128        | 99               | -26%     |

     - Source: TCAD April 2011 (keynote paper), “High-Level Synthesis for FPGAs: From Prototyping to Deployment”
     - Top-level block diagram labels: 4x4, 3x3, and 2x2 Matrix Inverse blocks (QRD, Back Subst., Matrix multiply), Norm Search/Reorder, 8x8 RVD QRD, Tree Search Sphere Detector (Stage 1 … Stage 8), Min Search

  9. AutoPilot Results: DQPSK Receiver (from BDTI)
     - Application
       - DQPSK receiver
       - 18.75 Msamples @ 75 MHz clock speed
     - Area better than hand-coded

     | Implementation | Xilinx XC3SD3400A chip utilization ratio (lower is better) |
     |----------------|-------------------------------------------------------------|
     | Hand-coded RTL | 5.9%                                                        |
     | AutoPilot      | 5.6%                                                        |

     - Source: BDTi evaluation of AutoPilot, http://www.bdti.com/articles/AutoPilot.pd

  10. AutoPilot Results: Optical Flow (from BDTI)
     - Application
       - Optical flow, 1280x720 progressive scan
       - Design too complex for an RTL team
     - Compared to high-end DSP
       - 30X higher throughput, 40X better cost/fps

     | Chip                                            | Unit cost | Highest frame rate @ 720p (fps) | Cost/performance ($/frame/second) |
     |-------------------------------------------------|-----------|---------------------------------|-----------------------------------|
     | Xilinx Spartan3ADSP XC3SD3400A chip             | $27       | 183                             | $0.14                             |
     | Texas Instruments TMS320DM6437 DSP processor    | $21       | 5.1                             | $4.20                             |

     - Source: BDTi evaluation of AutoPilot, http://www.bdti.com/articles/AutoPilot.pdf
     - Figure labels: Input Video, Output Video

  11. Impact on Quality of Result
     - Big impact on QoR due to drastically different architectures
       - Parallel/sequential/pipelined
       - Different ways to map operations to control states
       - Different ways to share functional units/registers/interconnects
     - Opportunity to select from multiple possible implementations
       - Instead of struggling with a sub-optimal RTL
     - Need metrics/models to decide which implementation is superior
       - Performance/throughput/area can be estimated reasonably well in HLS
       - Frequency/congestion is quite hard
       - Some RTL structures lead to long interconnect delay after layout

  12. Interconnect Estimation: the Challenge
     - Estimation of interconnect timing and congestion is hard at a high level
       - Long wires/congestion occur during layout
     - Incorporate layout in synthesis?
       - Reasonable, but time consuming
       - May not be necessary if we just want to estimate whether one solution is better than the other
       - Try to get the more layout-friendly solution
     - In this work
       - Experimentally evaluate the impact of HLS decisions on congestion
       - Evaluate some possible metrics without doing layout

  13. Outline
     - High-level synthesis and layout-friendly architecture
     - Evaluation of the impact of high-level decisions
     - Evaluation of metrics for scheduling/binding
     - Conclusion

  14. Experiment Setup
     - Varying strategies in HLS
     - Impacts of compiler transformation (loop unrolling, memory partitioning, etc.) and synthesis engine (scheduling & binding) evaluated separately
     - 5 DSP benchmarks (lots of multiplication/addition, simple or no control flow) for synthesis engine
     - Flow: Compiler transformation (Program -> CDFG), Scheduling (CDFG -> FSMD), Binding (FSMD -> RTL)

     Scheduling strategies:

     | # | Scheduling objective       | Resource constraint                        |
     |---|----------------------------|--------------------------------------------|
     | 1 | ASAP (as soon as possible) | None                                       |
     | 2 | ALAP (as late as possible) | None                                       |
     | 3 | MINREG (reduce registers)  | None                                       |
     | 4 | ALAP                       | #M = ceil(0.25 * m)                        |
     | 5 | ALAP                       | #M = ceil(0.25 * m), #A = ceil(0.4 * a)    |
     | 6 | MINREG                     | #M = ceil(0.1 * m), #A = ceil(0.2 * a)     |

     Binding strategies:

     | #  | Binding objective                | Constraint     |
     |----|----------------------------------|----------------|
     | 1  | Total area                       | None           |
     | 2  | Total area                       | Mux_input <= 4 |
     | 3  | #R (total number of registers)   | Mux_input <= 4 |
     | 4  | #M                               | None           |
     | 5  | #M                               | Mux_input <= 4 |
     | 6  | #M and #R                        | None           |
     | 7  | #M and #R                        | Mux_input <= 4 |
     | 8  | #M and #A                        | None           |
     | 9  | #M and #A                        | Mux_input <= 4 |
     | 10 | #M and #A and #R                 | Mux_input <= 4 |

     Legend: #M = number of multipliers, #A = number of adders, m = number of multiplications, a = number of additions/subtractions.

     Benchmarks:

     | Benchmark | Number of lines in C | Number of nodes in CDFG |
     |-----------|----------------------|-------------------------|
     | Test1     | 96                   | 78                      |
     | Test2     | 20                   | 90                      |
     | Test3     | 97                   | 160                     |
     | Test4     | 16                   | 50                      |
     | Test5     | 87                   | 390                     |
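The resource constraints in the scheduling table are fractions of the operation counts, rounded up. A minimal sketch of that arithmetic follows, using integer math so no floating-point rounding is involved. The names `ceil_fraction`, `ResourceLimit`, and `strategy5_limits` are hypothetical, and the operation counts passed in are made-up inputs, since the slide does not list m and a per benchmark.

```c
/* Number of functional units allowed by a resource constraint. */
typedef struct {
    unsigned multipliers;   /* #M */
    unsigned adders;        /* #A */
} ResourceLimit;

/* ceil(count * num / den), computed with integer arithmetic only. */
unsigned ceil_fraction(unsigned count, unsigned num, unsigned den)
{
    return (count * num + den - 1) / den;
}

/* Scheduling strategy 5 from the table:
 * #M = ceil(0.25 * m), #A = ceil(0.4 * a),
 * where m and a are the multiplication and addition/subtraction counts. */
ResourceLimit strategy5_limits(unsigned m, unsigned a)
{
    ResourceLimit limit;
    limit.multipliers = ceil_fraction(m, 1, 4);   /* 0.25 = 1/4 */
    limit.adders      = ceil_fraction(a, 2, 5);   /* 0.4  = 2/5 */
    return limit;
}
```

For example, a design with 10 multiplications and 7 additions would be limited to 3 multipliers and 3 adders under this strategy.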

  15. The RTL Implementation Flow for Routability Evaluation
     - C program -> High-level synthesis by xPilot (with different strategies) -> Verilog code
     - Verilog code -> RTL elaboration by Quartus -> Logic synthesis by ABC -> Pack & place by VPACK+VPR -> Routing by VPR -> Evaluation

  16. Implementation Flow Setup
     - Target platform: island-style FPGA
       - 10 4-LUTs per CLB, with routing channels between CLBs (span = 1 CLB)
       - The number of routing tracks per channel (channel width) is constant
     - Configurations of the toolchain
       - Logic synthesis by ABC with default settings
       - Packing by T-VPACK with default settings
       - Wirelength-driven placement by VPR using simulated annealing
       - Routing by VPR using negotiation-based routing and directed search
         - The channel width is variable and determined by binary search
     - Post-layout characteristics
       - Maximum channel width (CW_max)
       - Average wirelength (WL_avg) = average #tracks per net
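The binary search over channel width mentioned above can be sketched as follows. This is not VPR's code or API: `RouteFn` is a hypothetical callback standing in for one complete routing attempt at a given width, and the sketch assumes that a design which routes at some width also routes at every larger width.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true if the design routes successfully
 * with the given number of tracks per channel. */
typedef bool (*RouteFn)(int width, void *ctx);

/* Returns the smallest width in [1, hi] at which routing succeeds,
 * or -1 if routing fails even at hi. Assumes routability is monotonic
 * in the channel width. */
int min_channel_width(RouteFn routes, void *ctx, int hi)
{
    if (hi < 1 || !routes(hi, ctx))
        return -1;

    int lo = 1;
    /* Invariant: routing succeeds at hi; every width below lo fails. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (routes(mid, ctx))
            hi = mid;          /* mid works, try narrower channels */
        else
            lo = mid + 1;      /* mid fails, need wider channels */
    }
    return lo;
}

/* Stand-in for a real router, for illustration only: routing succeeds
 * once the width reaches the threshold stored in ctx. */
bool threshold_router(int width, void *ctx)
{
    return width >= *(const int *)ctx;
}
```

With `threshold_router` and a threshold of 37, `min_channel_width(threshold_router, &threshold, 128)` returns 37 after a logarithmic number of routing attempts, which matters because each attempt in a real flow is a full routing run.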

  17. Impact of the Synthesis Engine
     - 60 RTLs generated for each design
       - 6 scheduling strategies, 10 binding strategies
       - Some are equivalent
     - Results: min/max for each metric
       - Clearly, very different although behaviorally equivalent

  18. Impact of the Synthesis Engine (min vs max)
     - [Figure: four bar charts comparing the minimum and maximum values across test1 to test5; panels: CW_max, CW_avg, WL_tot, WL_avg]

  19. Impact of Compiler Transformations
     - A matrix multiplication example

         outer_loop: for (i = 0; i < 8; i++) {
           middle_loop: for (j = 0; j < 8; j++) {
             Result[i][j] = 0;
             inner_loop: for (k = 0; k < 8; k++)
               Result[i][j] += X[i][k] * Y[k][j];
           }
         }

     - Different ways to transform/pipeline the code, partition memory

     | # | Loop                                                 | Memory                                                                   |
     |---|------------------------------------------------------|--------------------------------------------------------------------------|
     | 1 | Keep all loops, pipeline inner loop                  | As is                                                                    |
     | 2 | Unroll inner loop, pipeline middle loop              | Partition X into columns and Y into rows to allow simultaneous accesses  |
     | 3 | Unroll inner and middle loop, pipeline outer loop    | Partition X and Y into scalars, partition Result into columns            |
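To make the second transformation concrete, here is a rough sketch of what unrolling the inner loop means at the source level, next to the original loop nest. In an HLS tool the unrolling, pipelining, and memory partitioning are normally requested through tool-specific directives, which are omitted here; the element type `int` and the function names are assumptions, since the slide does not give them.

```c
#define N 8

/* Strategy 1 shape: all three loops kept, as on the slide. */
void matmul_rolled(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Result[i][j] = 0;
            for (int k = 0; k < N; k++)
                Result[i][j] += X[i][k] * Y[k][j];
        }
    }
}

/* Strategy 2 shape: inner loop fully unrolled by hand. Each iteration of
 * the middle loop reads one whole row of X and one whole column of Y,
 * i.e. one element from every column of X and from every row of Y, which
 * is why the slide partitions X into columns and Y into rows so that the
 * eight reads of each matrix can happen at once. */
void matmul_inner_unrolled(int X[N][N], int Y[N][N], int Result[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Result[i][j] =
                X[i][0] * Y[0][j] + X[i][1] * Y[1][j] +
                X[i][2] * Y[2][j] + X[i][3] * Y[3][j] +
                X[i][4] * Y[4][j] + X[i][5] * Y[5][j] +
                X[i][6] * Y[6][j] + X[i][7] * Y[7][j];
        }
    }
}
```

Both functions compute the same result; the unrolled form exposes eight multiplications per output element to the scheduler, which trades more functional units and memory ports for fewer control states.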

  20. Impact of Compiler Transformations

  21. Outline
     - High-level synthesis and layout-friendly architecture
     - Evaluation of the impact of high-level decisions
     - Evaluation of metrics for scheduling/binding
     - Conclusion
