Advanced Synthesis Techniques
Reminder From Last Year
Use the UltraFast Design Methodology for Vivado – www.xilinx.com/ultrafast
Recommendations for Rapid Closure
– HDL: use HDL Language Templates & DRC
– Constraints: Timing Constraint Wizard, DRC (Tools > Report > Report DRC)
– Iterate in Synthesis (converge within 300ps)
  • Real problems seen post synthesis (long path…)
  • Faster iterations & higher impact
  • Improve area, timing, power
  • Worst path post Synthesis: 4.3ns, 13 levels of logic!
– Only then, iterate in next steps: opt, place, phys_opt, route, phys_opt
  • Worst path post Route: 4.1ns, 4 levels of logic
Advanced Synthesis Techniques Overview
• Advantages of C Synthesis over RTL Synthesis
• Advanced Synthesis Techniques for Design Closure
• Case Study: design closure at Synthesis level
HLS & IP Integrator (IPI) vs. RTL Synthesis
Actual example:
Traditional Flow
• RTL size: 100k lines (VHDL, Verilog) or 50k lines (VHDL-2008, SV)
• 240 people*mo
  o 10 people (driver level)
  o 2 years
• Stages shown: Test Bench (System C), RTL Sim., Synthesis, P&R, HW System Debug, Design closure
• Test labels shown: Exhaustive functional tests, Minimal test
HLS Based Flow
• C++ code (5k lines), from which the RTL (VHDL) is generated
• 16 people*mo
  o 2 people
  o 8 months
• 15x faster
• Stages shown: Test Bench (System C), HLS, IPI, Synth, P&R, HW System-level Debug, Design closure
• Exhaustive functional tests
Verification advantage, e.g. video processing
• RTL: 1 frame per ~5 hours
• C++: 1 frame per second
Faster for derivative designs
• C++ reuse
• Scales with parameters (application)
• Device independent
HLS Automates Micro-architecture Exploration
Project specification
    while (i++)
        for j = 0 .. N
            y(i) = y(i) + b(j) * x(i-j)
    while (i++)
        for j = 0 .. N
            y(i) = y(i-j) + b(j) * x(i)
Architecture choices
[Figure: two filter structures built from x(n), coefficients b0 … bm-1, multipliers, adders and Z^-1 delays]
Micro-architecture choices
– Fully parallel: N DSP, no cascade
– Fully parallel: N DSP + cascade
– Fully folded: 1 DSP (default)
[Figure legend: algorithmic delay, pipeline register]
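The first loop of the project specification can be written as ordinary C++. The following is a minimal sketch, not the code behind the slide: the tap count kTaps, the function name fir_step and the int data type are all assumptions made for illustration. The multiply-accumulate loop is left rolled, which corresponds to the folded form; how a given HLS tool unrolls or pipelines it is controlled by tool directives that are not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tap count, chosen only for illustration.
constexpr std::size_t kTaps = 4;

// One output sample of a direct-form FIR: y(i) = sum over j of b(j) * x(i-j).
// `delay` holds x(i), x(i-1), ... and is updated in place on every call.
int fir_step(std::array<int, kTaps>& delay,
             const std::array<int, kTaps>& b, int x) {
    // Shift the delay line (the Z^-1 elements of the diagram).
    for (std::size_t j = kTaps - 1; j > 0; --j) {
        delay[j] = delay[j - 1];
    }
    delay[0] = x;

    // Multiply-accumulate over all taps.
    int acc = 0;
    for (std::size_t j = 0; j < kTaps; ++j) {
        acc += b[j] * delay[j];
    }
    return acc;
}
```

Feeding a single 1 followed by zeros into fir_step returns the coefficients b(0), b(1), … in order, which is a quick functional check that runs in software long before any RTL exists.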
HLS Micro-Architecture Exploration
C++:
    while (1)
        M3(i) = M1(i-2) * C2
        A1(i) = M3(i) + x(i)
        M2(i) = M1(i-1) * C1
        A2(i) = A1(i) + M2(i)
        M1(i) = A2(i) * C0
        i++
[Figure: dataflow graph and RTL for input x[i], with coefficients c0, c1, c2 and delays Z^-1, Z^-2]
Schedule 1: 14 cycles
– Sequential process (CPU model)
– Minimal HW Resources: 1 MULT, 1 ADD
Schedule 2: 10 cycles
– Parallelism within each iteration
– Better performance (~29%)
– 2 MULT, 1 ADD
Schedule 3: 9 cycles
– Loop pipelining
– Best performance (~36%)
– 2 MULT, 1 ADD
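The three cycle counts can be reproduced with simple arithmetic. The sketch below assumes a 4-cycle multiplier and a 1-cycle adder; these latencies are not stated on the slide and are chosen only because they happen to yield 14, 10 and 9 cycles. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed operator latencies (not given in the source).
constexpr int kMulCycles = 4;
constexpr int kAddCycles = 1;

// Schedule 1: M3, A1, M2, A2, M1 run back to back on one MULT and one ADD.
constexpr int kSchedule1 = 3 * kMulCycles + 2 * kAddCycles;

// Schedule 2: M3 and M2 run in parallel, followed by A1, A2 and M1.
constexpr int kSchedule2 = kMulCycles + 2 * kAddCycles + kMulCycles;

// Schedule 3: iterations overlap, so the rate is limited by the longest
// recurrence divided by its iteration distance.
// M1(i-1) -> M2 -> A2 -> M1(i), distance 1.
constexpr int kLoopDistance1 = kMulCycles + kAddCycles + kMulCycles;
// M1(i-2) -> M3 -> A1 -> A2 -> M1(i), distance 2 (rounded up).
constexpr int kLoopDistance2 =
    (kMulCycles + 2 * kAddCycles + kMulCycles + 1) / 2;
constexpr int kSchedule3 =
    kLoopDistance1 > kLoopDistance2 ? kLoopDistance1 : kLoopDistance2;

static_assert(kSchedule1 == 14, "sequential schedule");
static_assert(kSchedule2 == 10, "parallel schedule");
static_assert(kSchedule3 == 9, "pipelined schedule");

// Reduction in cycles per iteration relative to a baseline, in percent.
double improvement_percent(int baseline, int cycles) {
    return 100.0 * (baseline - cycles) / baseline;
}
```

With these assumptions, improvement_percent(14, 10) is about 28.6 and improvement_percent(14, 9) is about 35.7, matching the ~29% and ~36% quoted for Schedules 2 and 3.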
Vivado Synthesis Flow
Inputs: VHDL, Verilog, VHDL-2008, SystemVerilog
– VHDL-2008 & SystemVerilog are more compact (advanced types…) and verification friendly (UVM, SVA…)
Analyze
– Syntax check
– Build file hierarchy
Elaborate
– Design hierarchy
– Cross-probing
– Unroll loops
– Build Logic:
  • Arithmetic
  • RAM
  • FSM
  • Boolean logic
RTL Optimizations (with XDC)
– Module generators
Optimize & Map
– Boolean optimization
– Technology mapping (LUT6)
Output: P&R or DCP
• Architecture-Aware Coding
• Priority Encoders
• Loops
• Clocks & Resets
• Directives & Strategies
• Case Study
Architecture Aware DSP
HDL code needs to match DSP hardware (e.g. DSP48E2)
– Signedness, width of nets, optimal pipelining…
[Figure: DSP block diagram (labels: Signed, 27 bit, 18, 45, 48, ACC, XOR, EQ; inputs A, B, C)]
Verify that DSP are inferred efficiently
Use templates & coding style examples:
– Multiply-accumulate
– Large accumulator
– Signed arithmetic with pipelining
– Complex multiplier
– Dynamic pre-adder
– Rounding (2015.3)
– Squarer (UG901)
– FIR (UG579)
– XOR (2016.1)
– …
DSP Block Inference Improvements
Squarer: 1 DSP
– (a – b)^2
– (a + b)^2
Complex multiplier: 3 DSP
– (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i, with S = (a-b)*d
Wider arithmetic requires more pipelining
– e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
– Pipelined MULT 44x35 in HDL, after Synthesis: mapped to 4 DSP Blocks (27x18 MULT)
Verify proper inference for full DSP block performance!
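The 3-multiplier identity is easy to check numerically. The sketch below compares it with the textbook 4-multiplier form; the struct Complex and the function names mul4 and mul3 are hypothetical, and 64-bit integers are used only so that small test values cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical integer complex type for illustration.
struct Complex {
    std::int64_t re;
    std::int64_t im;
};

// Reference form: 4 multiplications.
Complex mul4(Complex x, Complex y) {
    return {x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re};
}

// Form from the slide: 3 multiplications, sharing the product S = (a-b)*d.
Complex mul3(Complex x, Complex y) {
    const std::int64_t a = x.re, b = x.im, c = y.re, d = y.im;
    const std::int64_t s = (a - b) * d;         // shared product S
    return {(c - d) * a + s, (c + d) * b + s};  // real part, imaginary part
}
```

For example, (3+4i)*(5-6i) gives S = 6, a real part of 11*3 + 6 = 39 and an imaginary part of -1*4 + 6 = 2, the same result as the 4-multiplier form.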
Architecture-Aware RAM & ROM
HDL code needs to match BRAM Architecture (RAMB36)
– Registered address (sync read), optional output register
– 32K configurations
    Width=1 x Depth=2^15 (32K) = 32Kx1
    Width=2 x Depth=2^14 (16K) = 16Kx2
    …
    Width=32 x Depth=2^10 (1K) = 1Kx32
– 36K configuration
    Width=36 x Depth=2^10 (1K) = 1Kx36
Wider & Deeper Memories
– Automatically inferred by Synthesis
Example: single port RAM
[Figure: RAM with addr input and registered output Q]
Verify that BRAM are inferred efficiently!
RAM Decomposition: Example
32Kx32 RAM, three possible decompositions:
High Performance & Power
– 32x 32Kx1 (W=1, D=15)
– 1 level, 32 BRAM active
Low Power & Performance
– 32x 1Kx32 (W=32, D=10), UltraScale cascade-MUX
– 32 levels, 1 BRAM active
– (* cascade_height = 32 *) …
Performance/Power Trade-off (default w/ timing constraints)
– 8x 4x 1Kx32 (W=32, D=10), hybrid LUT & UltraScale Cascade with an 8-1 MUX in LUTs
– 4 levels, 4 BRAM active
– (* cascade_height = 4 *) …
Verify that BRAM are decomposed efficiently!
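The block counts of the first two decompositions follow from the 32K aspect ratios. The sketch below is only a back-of-the-envelope model under stated assumptions: it covers the 32K configurations (not the 36-bit one), ignores the hybrid option, and the names Decomposition and decompose are hypothetical.

```cpp
#include <cassert>

// Data bits per BRAM in the 32K configurations (width x depth = 32K).
constexpr int kBramBits = 32 * 1024;

// Hypothetical result type: total blocks and how many blocks are stacked
// in depth (which sets the number of levels to multiplex or cascade).
struct Decomposition {
    int brams;
    int depth_slices;
};

// Split a depth x width memory into BRAMs configured with the given data
// width. width_per_bram is assumed to be a power of two between 1 and 32.
Decomposition decompose(int depth, int width, int width_per_bram) {
    const int depth_per_bram = kBramBits / width_per_bram;
    const int width_slices = (width + width_per_bram - 1) / width_per_bram;
    const int depth_slices = (depth + depth_per_bram - 1) / depth_per_bram;
    return {width_slices * depth_slices, depth_slices};
}
```

For the 32Kx32 example, decompose(32768, 32, 1) gives 32 blocks with 1 depth slice, and decompose(32768, 32, 32) gives 32 blocks with 32 depth slices. The block count is the same; what changes is how many levels a read passes through and how many blocks are active per access.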
RAM & ROM Recommendations
Use pipeline Reg for performance
– No logic in-between
– No fanout
– In same hierarchy!
Run phys_opt to move Reg in & out based on timing
[Figure: Reg placement around the BRAM for slack<0 and for slack>0]
Add extra pipeline for best performance!
Verify that BRAM are pipelined efficiently!
Beware of Priority Logic
Without “else”:
    if (c0) q = a0;
    if (c1) q = a1;
    if (c2) q = a2;
    if (c3) q = a3;
    if (c4) q = a4;
    if (c5) q = a5; …
With “else”:
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …
Both are priority encoded logic with long paths. Removing else’s won’t help!!
[Figure: both versions map to a chain of multiplexers; only the order of a0…a5 and c0…c5 along the chain differs]
Priority logic will hurt Timing Closure!
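A software model shows why dropping the “else” keywords changes nothing structurally: both forms still pick exactly one winner by position, only the direction of the priority flips. This is an illustrative sketch; the condition count kN and the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical number of conditions, matching c0..c5 above.
constexpr std::size_t kN = 6;

// if / if / if: every true condition overwrites q, so the LAST true one wins.
int chain_of_ifs(const std::array<bool, kN>& c,
                 const std::array<int, kN>& a, int q) {
    for (std::size_t i = 0; i < kN; ++i) {
        if (c[i]) q = a[i];
    }
    return q;
}

// if / else if: the FIRST true condition wins and the rest are skipped.
int chain_of_else_ifs(const std::array<bool, kN>& c,
                      const std::array<int, kN>& a, int q) {
    for (std::size_t i = 0; i < kN; ++i) {
        if (c[i]) return a[i];
    }
    return q;
}
```

With c0 and c5 both true, chain_of_ifs returns a5 and chain_of_else_ifs returns a0. Either way the result depends on every condition in sequence, which is what produces the long path.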
Priority Logic with “for” loops
Without “break” (same as if…if…if…):
    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
        if (c[i])
            flag = 1;
With “break” (same as if…else if…else if…):
    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
        if (c[i]) begin
            flag = 1;
            break; //SystemVerilog
        end
break won’t help!!
[Figure: both loops unroll into a chain of multiplexers over c[0] … c[31]]
“break” does not reduce logic!
Best code in this case: flag = |c
Think Simple!
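The equivalence of the three forms can be checked in software. The sketch below models c as a 32-bit word; the function names are hypothetical. Note that in C++ the break does save run time, whereas in synthesized hardware the loop is unrolled and, as stated above, the break does not reduce logic.

```cpp
#include <cassert>
#include <cstdint>

// Loop without break: every set bit forces flag to true.
bool flag_loop(std::uint32_t c) {
    bool flag = false;
    for (int i = 0; i < 32; ++i) {
        if ((c >> i) & 1u) flag = true;
    }
    return flag;
}

// Loop with break: stops at the first set bit.
bool flag_loop_break(std::uint32_t c) {
    bool flag = false;
    for (int i = 0; i < 32; ++i) {
        if ((c >> i) & 1u) {
            flag = true;
            break;
        }
    }
    return flag;
}

// OR reduction, the counterpart of flag = |c.
bool flag_or(std::uint32_t c) {
    return c != 0;
}
```

All three functions return the same value for any input, so the simplest description is also the one that states the intent most directly.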
Priority Logic with “case” Statement
    case (c)
        v0: q=a0;
        v1: q=a1;
        v2: q=a3;
        v3: q=a4;
        v4: q=a5;…
CASE won’t help either! (note: values are variables)
If conditions are mutually exclusive, make it clear!
– In Verilog: CASE (c) //synthesis parallel_case (watch for simulation mismatch!)
– In SystemVerilog: unique case (c) // works with “if” too
[Figure: BAD: chain of multiplexers driven by c==v0, c==v1, …; GOOD: parallel selection between a0 … a5]
Note: please use complete conditions
– .v: full_case (simulation may not match), or default & assign don’t_care
– .sv: priority (for case & if)
Priority Logic Which Should Not Be!
BAD: priority chain on 1-hot conditions (here: binary encoded)
    c0 = ( S == 0);
    c1 = ( S == 1);
    c2 = ( S == 2);
    c3 = ( S == 3);
    c4 = ( S == 4);
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …
GOOD:
    case ( S )
        0: q = a0
        1: q = a1
        2: q = a2
        …
GOOD, or:
    q = A[S]
Automated in most cases … Even with registered conditions!
unique if (c0) … in SystemVerilog
[Figure: BAD: chain of multiplexers, each stage comparing S; GOOD: multiplexer tree selected directly by S0, S1, S2]
If conditions are mutually exclusive, do not use a priority logic
Use “unique if” in SystemVerilog
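Because the conditions S == 0, S == 1, … can never be true together, the priority chain and a direct index compute the same function. The sketch below checks this for a 3-bit selector; the function names and the 8-entry table are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Table = std::array<int, 8>;

// Priority form: compare S against each value in turn (else-if chain).
int select_priority(const Table& a, unsigned s) {
    for (unsigned k = 0; k < 8; ++k) {
        if (s == k) return a[k];
    }
    return a[7];  // unreachable for a 3-bit selector, kept for completeness
}

// Direct form, the counterpart of q = A[S]; the mask keeps S to 3 bits.
int select_indexed(const Table& a, unsigned s) {
    return a[s & 7u];
}

// Exhaustive comparison over all 3-bit selector values.
bool selects_match(const Table& a) {
    for (unsigned s = 0; s < 8; ++s) {
        if (select_priority(a, s) != select_indexed(a, s)) return false;
    }
    return true;
}
```

Since the two forms are functionally identical, writing the indexed or case form simply tells the tool up front what it would otherwise have to prove.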
Parallelizing Priority Logic
When you can’t avoid O(n), you can still reduce the depth!
BAD: N deep
– if c0…c63: 64 deep
GOOD: N/2 + 1 deep…
– if c0…c31: 32 deep
– if c32…c63: 32 deep, evaluated in parallel
– Selection between the two halves from c0 … c31: 2 deep (log6(32))
… or N/4 + 2… or log(N) recursively
[Figure: one 64-deep multiplexer chain over a0 … a63 versus two 32-deep chains feeding a final multiplexer]
Improve timing even when conditions are not mutually exclusive!
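The split can be modeled in software to confirm that it preserves the priority order. The sketch below uses 8 conditions instead of 64 so that every input pattern can be checked exhaustively; all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 8;  // small stand-in for the 64 conditions
using Conds = std::array<bool, kN>;
using Data = std::array<int, kN>;

// Priority chain over [begin, end): the first true condition wins.
int priority_range(const Conds& c, const Data& a, std::size_t begin,
                   std::size_t end, int fallback) {
    for (std::size_t i = begin; i < end; ++i) {
        if (c[i]) return a[i];
    }
    return fallback;
}

// OR of the conditions in [begin, end).
bool any_in_range(const Conds& c, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        if (c[i]) return true;
    }
    return false;
}

// Two half-length chains evaluated independently, then one final select.
int priority_split(const Conds& c, const Data& a, int fallback) {
    const std::size_t half = kN / 2;
    const int low = priority_range(c, a, 0, half, fallback);
    const int high = priority_range(c, a, half, kN, fallback);
    return any_in_range(c, 0, half) ? low : high;
}

// Compare the split form with the full chain for all 2^kN patterns.
bool split_matches_full_chain(const Data& a, int fallback) {
    for (unsigned bits = 0; bits < (1u << kN); ++bits) {
        Conds c{};
        for (std::size_t i = 0; i < kN; ++i) c[i] = ((bits >> i) & 1u) != 0;
        if (priority_split(c, a, fallback) !=
            priority_range(c, a, 0, kN, fallback)) {
            return false;
        }
    }
    return true;
}
```

The two half chains do not depend on each other, so in hardware they can settle in parallel; only the final select is added on top, which is where the N/2 + 1 figure comes from. Applying the same split inside each half gives the recursive variants.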
Beware of Loop Unrolling – Avoid “if”
BAD: area & depth O(N)
    c = 0;
    for (i=0 ; i<8 ; i=i+1)
        if (a[i]) c = c+1;
Get rid of “if”:
    c = 0;
    for (i=0 ; i<8 ; i=i+1)
        c = c+a[i];
GOOD: area & depth log3(N)
    c = a[0] + a[1] + a[2] +
        a[3] + a[4] + a[5] +
        a[6] + a[7];
[Figure: BAD: chain of +1 incrementers and multiplexers selected by a[0] … a[7]; GOOD: adder tree over a[0] … a[7]]
“if” in loops can seriously hurt timing!
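The three descriptions compute the same count when each a[i] is a single bit. The sketch below checks that and groups the tree version in threes to mirror the log3(N) figure; whether a synthesis tool builds exactly this structure is an assumption, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBits = 8;
using Bits = std::array<unsigned, kBits>;  // each element is 0 or 1

// Conditional increment: one "if" per bit.
unsigned count_with_if(const Bits& a) {
    unsigned c = 0;
    for (std::size_t i = 0; i < kBits; ++i) {
        if (a[i]) c = c + 1;
    }
    return c;
}

// Plain sum: the bit itself is the increment, no "if" needed.
unsigned count_with_sum(const Bits& a) {
    unsigned c = 0;
    for (std::size_t i = 0; i < kBits; ++i) {
        c = c + a[i];
    }
    return c;
}

// Same sum written as two levels of three-input additions.
unsigned count_with_tree(const Bits& a) {
    const unsigned s0 = a[0] + a[1] + a[2];
    const unsigned s1 = a[3] + a[4] + a[5];
    const unsigned s2 = a[6] + a[7];
    return s0 + s1 + s2;
}
```

For the bit pattern 1,0,1,1,0,0,1,1 all three functions return 5. Addition is associative, so once the “if” is gone the additions can be regrouped freely, which the conditional form hides.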