Advanced Synthesis Techniques
Ramine Roane

Reminder From Last Year
Use the UltraFast Design Methodology for Vivado: www.xilinx.com/ultrafast
Recommendations for rapid closure:
– HDL: use HDL Language Templates & DRC
– Constraints: Timing Constraint Wizard, DRC (Tools -> Report -> Report DRC)
– Iterate in synthesis (converge within 300 ps)
  • Real problems are seen post synthesis (long paths…)
  • Faster iterations & higher impact
  • Improve area, timing, power
– Only then, iterate in the next steps: opt, place, phys_opt, route, phys_opt
Example reports:
– Worst path post synthesis: 4.3 ns, 13 levels of logic!
– Worst path post route: 4.1 ns, 4 levels of logic

Overview
– Advanced synthesis techniques for design closure
– Case study: design closure at synthesis level

Vivado Synthesis Flow
Inputs: VHDL, Verilog, VHDL-2008, SystemVerilog, XDC
– VHDL-2008 and SystemVerilog are more compact (advanced types…) and verification friendly (UVM, SVA…)
Steps:
– Analyze: syntax check, build file hierarchy
– Elaborate: design hierarchy, cross-probing, unroll loops, build logic (arithmetic, RAM, FSM, Boolean logic)
– RTL optimizations: module generators
– Optimize & map: Boolean optimization, technology mapping (LUT6)
Output: P&R or DCP

• Architecture-Aware Coding
• Priority Encoders
• Loops
• Clocks & Resets
• Directives & Strategies
• Case Study

Architecture-Aware DSP
HDL code needs to match the DSP hardware (e.g. DSP48E2)
– Signedness, width of nets, optimal pipelining…
[Figure: DSP48E2 block diagram with signed 27-bit pre-adder, 27x18 multiplier (inputs A, B; 45-bit product), 48-bit accumulator (ACC, input C), XOR and pattern-detect (EQ) blocks]
Use templates & coding style examples:
– Signed arithmetic with pipelining
– Multiply-accumulate
– Complex multiplier
– Large accumulator
– Dynamic pre-adder
– Rounding (2015.3)
– Squarer (UG901)
– FIR (UG579)
– XOR (2016.1)
– …
Verify that DSPs are inferred efficiently
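
As an illustration of code that matches the DSP block, here is a minimal multiply-accumulate sketch. The module name, port names and the synchronous `clear` input are assumptions made for this example, not taken from the slides; the widths are chosen to fit one signed 27x18 multiplier and a 48-bit accumulator, and every stage is registered so the tool can pack the registers into the DSP block.

```systemverilog
// Hypothetical example: signed, fully pipelined multiply-accumulate.
// Names are illustrative only; widths follow the 27x18 / 48-bit DSP datapath.
module mac_sketch (
    input  logic               clk,
    input  logic               clear,  // synchronous accumulator clear (assumption)
    input  logic signed [26:0] a,
    input  logic signed [17:0] b,
    output logic signed [47:0] acc
);
    logic signed [26:0] a_q;   // input register for a
    logic signed [17:0] b_q;   // input register for b
    logic signed [44:0] m_q;   // multiplier output register (27 + 18 bits)

    always_ff @(posedge clk) begin
        a_q <= a;
        b_q <= b;
        m_q <= a_q * b_q;              // signed multiply: both operands are signed
        if (clear) acc <= '0;
        else       acc <= acc + m_q;   // m_q is sign-extended to 48 bits
    end
endmodule
```

After synthesis, the utilization report should show a single DSP block and no fabric registers for this module; if it does not, check signedness and register placement first.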

DSP Block Inference Improvements
Squarer: 1 DSP
– (a - b)^2 and (a + b)^2
Complex multiplier: 3 DSP
– (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i, with S = (a-b)*d
Wider arithmetic requires more pipelining
– e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
– Pipelined MULT 44x35 in HDL -> synthesis -> mapped to 4 DSP blocks (27x18 MULT)
Verify proper inference for full DSP block performance!
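
The three-multiplier identity can be checked by expanding it: (c-d)*a + (a-b)*d = ac - bd, and (c+d)*b + (a-b)*d = bc + ad. A minimal pipelined sketch of that structure follows; the module name, parameter `W` and the three-stage register arrangement are assumptions for illustration, and whether each multiplier lands in exactly one DSP block depends on `W` and on the tool version.

```systemverilog
// Hypothetical example: complex multiply (a+bi)*(c+di) with three multipliers.
module cmult3_sketch #(parameter int W = 16) (
    input  logic                  clk,
    input  logic signed [W-1:0]   a, b,    // first operand:  a + bi
    input  logic signed [W-1:0]   c, d,    // second operand: c + di
    output logic signed [2*W+1:0] re, im   // result: re + im*i
);
    // Pre-adder results need one extra bit
    logic signed [W:0]   a_minus_b, c_minus_d, c_plus_d;
    logic signed [W-1:0] a_q, b_q, d_q;
    logic signed [2*W:0] s, p_re, p_im;

    always_ff @(posedge clk) begin
        // Stage 1: pre-adders, plus delay registers to keep operands aligned
        a_minus_b <= a - b;
        c_minus_d <= c - d;
        c_plus_d  <= c + d;
        a_q       <= a;
        b_q       <= b;
        d_q       <= d;
        // Stage 2: the three multiplications
        s    <= a_minus_b * d_q;   // S = (a-b)*d, shared by both outputs
        p_re <= c_minus_d * a_q;   // (c-d)*a
        p_im <= c_plus_d  * b_q;   // (c+d)*b
        // Stage 3: post-adders
        re <= p_re + s;            // ac - bd
        im <= p_im + s;            // ad + bc
    end
endmodule
```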

Architecture-Aware RAM & ROM
HDL code needs to match the BRAM architecture (RAMB36)
– Registered address (sync read), optional output register
– 32K configurations:
  • Width=1 x Depth=2^15 (32K) = 32Kx1
  • Width=2 x Depth=2^14 (16K) = 16Kx2
  • …
  • Width=32 x Depth=2^10 (1K) = 1Kx32
– 36K configuration:
  • Width=36 x Depth=2^10 (1K) = 1Kx36
Wider & deeper memories
– Automatically inferred by synthesis
Example: single port RAM
Verify that BRAMs are inferred efficiently!
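
A minimal single-port RAM sketch in the style described above: synchronous read plus an optional output register. The module name, parameter names and the 1Kx36 default size are assumptions chosen for illustration; the read behavior shown is "old data on a simultaneous write".

```systemverilog
// Hypothetical example: single-port RAM with registered read and output register.
module spram_sketch #(
    parameter int AW = 10,   // address width: depth = 2^10 = 1K
    parameter int DW = 36    // data width
) (
    input  logic          clk,
    input  logic          we,
    input  logic [AW-1:0] addr,
    input  logic [DW-1:0] din,
    output logic [DW-1:0] dout
);
    logic [DW-1:0] mem [0:(1<<AW)-1];
    logic [DW-1:0] rd_q;

    always_ff @(posedge clk) begin
        if (we) mem[addr] <= din;
        rd_q <= mem[addr];   // synchronous read: required for BRAM inference
        dout <= rd_q;        // optional output register, no logic in between
    end
endmodule
```

Neither register has a reset, and nothing but the output register reads `rd_q`, which keeps both candidates for absorption into the BRAM.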

RAM Decomposition: Example
32Kx32 RAM, three possible decompositions:
– High performance & power (default w/ timing constraints)
  • 32x 32Kx1 BRAM side by side (W=1, D=15)
  • 1 level, 32 BRAM active
– Low power & performance: UltraScale cascade-MUX
  • 32x 1Kx32 BRAM cascaded (W=32, D=10)
  • 32 levels, 1 BRAM active
  • (* cascade_height = 32 *) …
– Performance/power trade-off: hybrid LUT & UltraScale cascade
  • 8x cascades of 4x 1Kx32 BRAM (W=32, D=10), combined by an 8-1 MUX in LUTs
  • 4 levels, 4 BRAM active
  • (* cascade_height = 4 *) …
Verify that BRAMs are decomposed efficiently!
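
A sketch of where the `cascade_height` attribute from this slide would typically be placed: on the declaration of the memory array. The module and signal names are hypothetical, and the exact decomposition the tool produces for a given value should be checked in the synthesis log rather than assumed.

```systemverilog
// Hypothetical example: 32Kx32 RAM with a requested cascade height of 4.
module ram32kx32_sketch (
    input  logic        clk,
    input  logic        we,
    input  logic [14:0] addr,   // 2^15 = 32K entries
    input  logic [31:0] din,
    output logic [31:0] dout
);
    // Attribute attached to the memory declaration (value taken from the slide)
    (* cascade_height = 4 *)
    logic [31:0] mem [0:32767];

    always_ff @(posedge clk) begin
        if (we) mem[addr] <= din;
        dout <= mem[addr];      // synchronous read
    end
endmodule
```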

RAM & ROM Recommendations
Use pipeline registers around the BRAM for performance:
– No logic in between
– No fanout
– In the same hierarchy!
Add extra pipeline registers for best performance:
– Run phys_opt to move registers in & out of the BRAM based on timing (BRAM slack<0 vs. BRAM slack>0)
Verify that BRAMs are pipelined efficiently!

Beware of Priority Logic
Version 1 (if … if … if):
    if (c0) q = a0;
    if (c1) q = a1;
    if (c2) q = a2;
    if (c3) q = a3;
    if (c4) q = a4;
    if (c5) q = a5; …
Version 2 (if … else if …):
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …
Removing the else's won't help!! Both versions build a priority chain; only the priority order differs (last match wins in version 1, first match wins in version 2).
Priority encoded logic -> long paths
Priority logic will hurt timing closure!

Priority Logic with "case" Statement
BAD:
    case (c)
      v0: q = a0;
      v1: q = a1;
      v2: q = a2;
      v3: q = a3;
      v4: q = a4; …
CASE won't help either! (note: the values v0, v1, … are variables, so the tool must compare c==v0, c==v1, … in order)
GOOD:
– In Verilog:
    case (c) //synthesis parallel_case
  (watch for simulation mismatch!)
– In SystemVerilog:
    unique case (c) // works with "if" too
If conditions are mutually exclusive, make it clear!
Note: please use complete conditions
– .v: full_case (simulation may not match), or default & assign don't_care
– .sv: priority (for case & if)
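
A minimal sketch of the SystemVerilog form recommended above, with variable case values and a don't-care default so the condition is complete. The module name, port names and widths are hypothetical; the example assumes the designer knows that v0, v1 and v2 never hold the same value, which is exactly what `unique` asserts.

```systemverilog
// Hypothetical example: case with variable match values, declared mutually exclusive.
module unique_case_sketch #(parameter int W = 8) (
    input  logic [3:0]   c,
    input  logic [3:0]   v0, v1, v2,   // assumed pairwise different at all times
    input  logic [W-1:0] a0, a1, a2,
    output logic [W-1:0] q
);
    always_comb begin
        unique case (c)                // tells tools the branches cannot overlap
            v0:      q = a0;
            v1:      q = a1;
            v2:      q = a2;
            default: q = 'x;           // complete condition, don't-care otherwise
        endcase
    end
endmodule
```

In simulation, `unique` reports a violation if two values ever match at once, which is safer than the Verilog `parallel_case` comment because the assumption is checked rather than silently trusted.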

Priority Logic Which Should Not Be!
BAD: 1-hot conditions (here: binary encoded) written as a priority chain
    c0 = (S == 0);
    c1 = (S == 1);
    c2 = (S == 2);
    c3 = (S == 3);
    c4 = (S == 4); …
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …
GOOD: case on the selector
    case (S)
      0: q = a0
      1: q = a1
      2: q = a2
      …
GOOD: or index directly
    q = A[S]
– Automated in most cases… even with registered conditions!
– Otherwise: unique if (c0) … in SystemVerilog
If conditions are mutually exclusive, do not use priority logic
Use "unique if" in SystemVerilog

Parallelizing Priority Logic
When you can't avoid O(n), you still can!
– BAD: one chain over c0…c63: N deep (64 deep)
– GOOD: N/2 + 1 deep
  • One chain over c0…c31 (32 deep) and one chain over c32…c63 (32 deep), evaluated in parallel
  • A final 2-to-1 MUX selects the first chain when any of c0…c31 is set; that OR is only 2 levels deep (log6(32))
– Or N/4 + 2…, or log(N) recursively
Improve timing even when conditions are not mutually exclusive!
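
The split described above can be sketched as follows. Names and the all-zero default are assumptions for illustration; index 0 has the highest priority, as in an if/else-if chain starting at c0. Whether the tool keeps the two chains separate or restructures them is tool dependent.

```systemverilog
// Hypothetical example: 64-entry priority select split into two 32-entry chains.
module prio_split_sketch #(parameter int W = 8) (
    input  logic [63:0]  c,
    input  logic [W-1:0] a [64],
    output logic [W-1:0] q
);
    logic [W-1:0] q_lo, q_hi;

    always_comb begin
        // Chain over c[0..31]: iterate downward so the lowest index is assigned last and wins
        q_lo = '0;
        for (int i = 31; i >= 0; i--)
            if (c[i]) q_lo = a[i];

        // Chain over c[32..63], built in parallel with the first one
        q_hi = '0;
        for (int i = 63; i >= 32; i--)
            if (c[i]) q_hi = a[i];

        // Any hit in the lower half takes priority over the upper half
        q = (|c[31:0]) ? q_lo : q_hi;
    end
endmodule
```

The result is identical to a single 64-deep chain with a zero default, because a set bit in c[31:0] always outranks every bit in c[63:32].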

Priority Logic with "for" Loops
Version 1:
    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i])
        flag = 1;
Same as if…if…if…
Version 2:
    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i]) begin
        flag = 1;
        break; // SystemVerilog, or exit in VHDL
      end
Same as if…else if…else if…
Break/exit won't help!! "break" does not reduce logic!
Best code in this case:
    flag = |c
Think simple!

Beware of Loop Unrolling – Avoid "if"
BAD: area & depth O(N)
    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      if (a[i])
        c = c+1;
Get rid of the "if":
    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      c = c+a[i];
GOOD: area & depth log3(N)
    c = a[0] + a[1] + a[2] +
        a[3] + a[4] + a[5] +
        a[6] + a[7];
"if" in loops can seriously hurt timing!

Beware of Loop Unrolling – Arithmetic
BAD: up to 36 N-bit adders
    Q = 0
    for i = 0 to 3
      for j = 0 to 3
        Q = Q+A+i+j
which unrolls to
    Q = 0 + A+0+0 + A+0+1 + A+0+2 + A+0+3
          + A+1+0 + A+1+1 + A+1+2 + A+1+3
          + A+2+0 + A+2+1 + A+2+2 + A+2+3
          + A+3+0 + A+3+1 + A+3+2 + A+3+3
The 16 terms contain A sixteen times, and the i and j values each sum to 4*(0+1+2+3) = 24, so:
    Q = … = 16*A + 48
GOOD: 1 (N-3)-bit adder
    Q = (A<<4) + 48
BETTER: 1 (N-4)-bit adder, A[N-5:0] + 3 -> Q[N-1:4]
    Q = (A + 3) << 4
Note the parentheses: in Verilog, + binds tighter than <<.
Loops (in general) can hurt timing! Here: symbolic arithmetic optimization may not happen

Avoid Gated Clock Transformation
– Very common in ASIC design (low power)
– On FPGA: consolidate the clocks to minimize clock skew (low-skew network, BUFG)
[Figure: ASIC style: the flip-flop clock is clk gated by c (CE latched on ~c). FPGA style: a single clk drives all flip-flops, and c, through an edge detector, drives the flip-flop CE pin]
– BAD: 2 clocks, 1 gated
– GOOD: 1 clock
Avoid gated clocks: they will hurt timing closure (will cause clock skew)
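
A minimal sketch of the single-clock alternative: the condition that would have gated the clock is used as a clock enable instead. Module and signal names are hypothetical.

```systemverilog
// Hypothetical example: clock enable instead of a gated clock.
module ce_sketch (
    input  logic clk,
    input  logic c,    // condition that would gate the clock in an ASIC-style design
    input  logic d,
    output logic q
);
    // One clock for every flip-flop; c only controls whether q updates
    always_ff @(posedge clk)
        if (c) q <= d;
endmodule
```

Written this way, `c` is an ordinary data-path signal covered by normal setup and hold analysis, and `clk` can stay on the low-skew clock network.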

Avoid [Async] Resets
What we recommended:
– Reduce the number of "control sets" {clk, rst, ce}
– Avoid reset / avoid async reset
[Figure: flip-flops with CE and CLR pins driven from rst, captioned "does this really remove reset?"]
BAD: the attempt to remove the reset created an enable, and the reset is still async…
Verify that removing Reset did not add Enables
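
One common way an unwanted enable appears is shown below. This is an assumed illustration, not necessarily the code behind the slide's figure: a register with no reset value shares an if/else with a reset register, so it has to hold its value while the reset is active. All names are hypothetical.

```systemverilog
// Hypothetical example: a "reset-free" register that silently gains an enable.
module reset_sketch (
    input  logic clk,
    input  logic rst,          // synchronous reset
    input  logic d_ctrl,
    input  logic d_data,
    output logic q_ctrl,
    output logic q_data_bad,
    output logic q_data_good
);
    // Pitfall: q_data_bad is not assigned in the reset branch, so it must keep
    // its old value while rst is high. That is a clock enable of ~rst.
    always_ff @(posedge clk) begin
        if (rst) begin
            q_ctrl <= 1'b0;
        end else begin
            q_ctrl     <= d_ctrl;
            q_data_bad <= d_data;
        end
    end

    // Register without reset in its own block: no reset and no enable
    always_ff @(posedge clk)
        q_data_good <= d_data;
endmodule
```

Checking the control-set count in the post-synthesis utilization report is a quick way to confirm that removing a reset did not introduce a new enable.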

RTL Synthesis: New Strategies
Vivado RTL Synthesis now has 8 strategies
– Each strategy is a combination of options & directives
– Directives have a specific purpose
For quick pipe-cleaning iterations:
– Flow_RuntimeOptimized
For best area:
– Flow_AreaMultThresholdDSP
– Flow_AreaOptimized_medium
– Flow_AreaOptimized_high
For performance:
– Vivado_Synthesis_Default
– Flow_PerfOptimized_high
– Flow_PerfThresholdCarry
For congested designs:
– Flow_AlternateRoutability
[Figure: Strategies in Vivado (synthesis options)]
Taking the best of all strategies can give you 10% better QoR

Case Study
Problem
– Area explosion & bad timing in a design
Locating the cause of the issue
– Find the offending module & synthesize it Out Of Context
– Look for suspicious operators on the Elaborated view (how??)
– Cross-probe to source files
Resolution
– Fix the source code and/or use synthesis options

Case Study: Locating the Cause of the Issue
Look for suspicious operators
– Ctrl-F in the Elaborated Schematic
– Select suspicious operators (here: MULT, MOD…)
– Press F4 to view the schematic
– Press F7 to cross-probe