Advanced Synthesis Techniques Reminder From Last Year Use UltraFast - PowerPoint PPT Presentation

Advanced Synthesis Techniques

Reminder From Last Year

• Use the UltraFast Design Methodology for Vivado
  - www.xilinx.com/ultrafast
• Recommendations for rapid closure
  - HDL: use HDL Language Templates & DRC
  - Constraints: Timing Constraint Wizard, DRC (Tools > Report > Report DRC)
  - Iterate in synthesis (converge within 300 ps)
    • Real problems are seen post synthesis (long paths…)
    • Faster iterations & higher impact
    • Improve area, timing, power
  - Only then, iterate in the next steps
    • opt, place, phys_opt, route, phys_opt

Example from the slide:
• Worst path post synthesis: 4.3 ns, 13 levels of logic!
• Worst path post route: 4.1 ns, 4 levels of logic

Advanced Synthesis Techniques Overview

• Advantages of C Synthesis over RTL Synthesis
• Advanced Synthesis Techniques for Design Closure
• Case Study: design closure at Synthesis level

HLS & IP Integrator (IPI) vs. RTL Synthesis

Actual example:
• VHDL, Verilog: 100k lines
• VHDL-2008, SV: 50k lines
• C++ code: 5k lines

Traditional Flow
• 240 people*mo
  - 10 people (driver level)
  - 2 years

HLS Based Flow
• 16 people*mo
  - 2 people
  - 8 months
• 15x faster

Verification advantage, e.g. video processing:
• RTL: 1 frame per ~5 hours
• C++: 1 frame per second

Faster for derivative designs:
• C++ reuse
• (application) Scales with parameters
• Device independent

[Flow diagrams. Traditional flow labels: RTL (VHDL), Test Bench (SystemC), Sim., Synthesis, P&R, HW, System Debug, Design closure, "Exhaustive functional tests", "Minimal test". HLS-based flow labels: C++ code, HLS, IPI, RTL (VHDL), Test Bench (SystemC), Synth, P&R, HW, Debug, System-level Debug, Design closure, "Exhaustive functional tests".]

HLS Automates Micro-architecture Exploration

Project specification (two forms of the same filter loop):

    while (i++)
      for j = 0 .. N
        y(i) = y(i) + b(j) * x(i-j)

    while (i++)
      for j = 0 .. N
        y(i) = y(i-j) + b(j) * x(i)

Architecture choices:
[Diagram: two FIR structures built from Z^-1 delay elements, multipliers by coefficients b0, b1, b2 … bm-1, and adders, with the algorithmic delay and pipeline registers marked.]

Micro-architecture choices:
• Fully parallel: N DSP, no cascade
• Fully parallel: N DSP + cascade
• Fully folded: 1 DSP (default)
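To make the first loop of the project specification concrete, here is a minimal C++ sketch of a direct-form FIR, y(i) = sum of b(j) * x(i-j). The names `Fir`, `kTaps` and `sample_t`, the tap count of 4 and the `int` sample type are assumptions chosen for illustration; they do not come from the slides, and the sketch says nothing about how a particular HLS tool would schedule it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tap count and sample type, chosen only for illustration.
constexpr std::size_t kTaps = 4;
using sample_t = int;

// Direct-form FIR: y(i) = sum over j of b(j) * x(i - j).
struct Fir {
    // Delay line holding x(i), x(i-1), ..., x(i-kTaps+1); starts at zero.
    std::array<sample_t, kTaps> delay{};

    sample_t step(sample_t x, const std::array<sample_t, kTaps>& b) {
        // Shift the delay line by one sample (the Z^-1 elements).
        for (std::size_t j = kTaps - 1; j > 0; --j) {
            delay[j] = delay[j - 1];
        }
        delay[0] = x;

        // Multiply-accumulate loop over the taps. This is the loop that
        // can be folded onto one multiplier or unrolled into kTaps of them.
        sample_t acc = 0;
        for (std::size_t j = 0; j < kTaps; ++j) {
            acc += b[j] * delay[j];
        }
        return acc;
    }
};
```

Feeding a single 1 followed by zeros into `step` returns the coefficients one after another, which is a quick sanity check for any of the micro-architectures listed above.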

HLS Micro-Architecture Exploration

C++ loop and its dataflow graph (constants c0, c1, c2; input x[i]):

    while (1)
      M3(i) = M1(i-2) * C2
      A1(i) = M3(i) + x(i)
      M2(i) = M1(i-1) * C1
      A2(i) = A1(i) + M2(i)
      M1(i) = A2(i) * C0
      i++

[Diagram: dataflow graph with nodes M1, M2, M3, A1, A2 and delays Z^-1, Z^-2, followed by three schedule timelines for iterations i, i+1, i+2.]

Schedule 1: 14 cycles
• Sequential process (CPU model)
• Minimal HW resources: 1 MULT, 1 ADD

Schedule 2: 10 cycles
• Parallelism within each iteration
• Better performance (~29%)
• 2 MULT, 1 ADD

Schedule 3: 9 cycles
• Loop pipelining
• Best performance (~36%)
• 2 MULT, 1 ADD
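The quoted gains are consistent with a reduction in cycles per iteration relative to the 14-cycle sequential schedule. This is a reading of the slide's numbers, not a formula stated in the deck:

```latex
\[
\frac{14 - 10}{14} \approx 28.6\% \;(\approx 29\%),
\qquad
\frac{14 - 9}{14} \approx 35.7\% \;(\approx 36\%)
\]
```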

Vivado Synthesis Flow

Inputs:
• VHDL, Verilog
• VHDL-2008, SystemVerilog
  - more compact: advanced types…
  - verification friendly: UVM, SVA…
• XDC

Steps:
• Analyze
  - Syntax check
  - Build file hierarchy
• Elaborate
  - Design hierarchy
  - Cross-probing
  - Unroll loops
  - Build logic: arithmetic, RAM, FSM, Boolean logic
• RTL Optimizations
  - Module generators
• Optimize & Map
  - Boolean optimization
  - Technology mapping (LUT6)

Output: P&R or DCP

• Architecture-Aware Coding
• Priority Encoders
• Loops
• Clocks & Resets
• Directives & Strategies
• Case Study

Architecture Aware DSP

• HDL code needs to match the DSP hardware (e.g. DSP48E2)
  - Signedness, width of nets, optimal pipelining…
• Signed arithmetic with pipelining
• Use templates & coding style examples:
  - Multiply-accumulate
  - Large accumulator
  - Complex multiplier
  - Dynamic pre-adder
  - Squarer (UG901)
  - FIR (UG579)
  - Rounding (2015.3)
  - XOR (2016.1)
  - …

[Diagram: DSP block with signed inputs A (27 bit), B (18 bit) and C, a 27-bit pre-adder stage, 45-bit product, 48-bit accumulator (ACC), XOR and EQ outputs.]

Verify that DSPs are inferred efficiently!

DSP Block Inference Improvements

Squarer: 1 DSP
• (a - b)^2
• (a + b)^2

Complex multiplier: 3 DSP
• (a+bi)*(c+di) = ((c-d)*a + S) + ((c+d)*b + S)i
  with S = (a-b)*d

[Diagram: three pre-adder/multiplier stages feeding the Re and Im outputs.]

Wider arithmetic requires more pipelining
• e.g. MULT 44x35 requires 4 MULT 27x18 & ADD
• A pipelined MULT 44x35 in HDL is mapped by synthesis to 4 DSP blocks (27x18 MULT)

Verify proper inference for full DSP block performance!
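The three-multiplier identity for the complex product can be checked numerically. The sketch below compares it against the textbook four-multiplier form; the `Complex` struct and the function names `mul4` and `mul3` are illustrative assumptions, and 64-bit integers are used only to keep the example free of overflow concerns for small inputs.

```cpp
#include <cstdint>

// Illustrative integer complex number: re + im * i.
struct Complex {
    std::int64_t re;
    std::int64_t im;
};

// Reference form with 4 multiplications:
// (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
Complex mul4(Complex x, Complex y) {
    return {x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re};
}

// Form from the slide with 3 multiplications, sharing S = (a - b) * d:
// Re = (c - d) * a + S, Im = (c + d) * b + S
Complex mul3(Complex x, Complex y) {
    const std::int64_t a = x.re, b = x.im, c = y.re, d = y.im;
    const std::int64_t s = (a - b) * d;  // shared product
    return {(c - d) * a + s, (c + d) * b + s};
}
```

For example, (3 + 4i) * (5 + 6i) gives -9 + 38i with both functions. Each of the three products has the shape (sum or difference) * operand, which is presumably why it fits a pre-adder followed by a multiplier.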

Architecture-Aware RAM & ROM

• HDL code needs to match the BRAM architecture (RAMB36)
  - Registered address (sync read), optional output register
  - 32K configurations
    • Width=1 x Depth=2^15 (32K) = 32Kx1
    • Width=2 x Depth=2^14 (16K) = 16Kx2
    • …
    • Width=32 x Depth=2^10 (1K) = 1Kx32
  - 36K configuration
    • Width=36 x Depth=2^10 (1K) = 1Kx36
• Wider & deeper memories
  - Automatically inferred by synthesis

[Diagram: example single-port RAM, 32x1K, with registered addr input and registered output Q.]

Verify that BRAMs are inferred efficiently!

RAM Decomposition: Example

32Kx32 RAM, three decompositions:

    Primitive             Structure                           Levels / active BRAM          Trade-off
    32Kx1  (W=1,  D=15)   default with timing constraints     1 level, 32 BRAM active       High performance & power
    1Kx32  (W=32, D=10)   UltraScale cascade-MUX              32 levels, 1 BRAM active      Low power & performance
    1Kx32  (W=32, D=10)   Hybrid LUT & UltraScale cascade     4 levels, 4 BRAM active       Performance/power trade-off

Attributes shown on the slide:
• (* cascade_height = 32 *) … for the UltraScale cascade-MUX decomposition
• (* cascade_height = 4 *) … for the hybrid decomposition

[Diagram: 32 x 32Kx1 side by side; 32 x 1Kx32 in one cascade; hybrid arrangement of 1Kx32 cascades with an 8-1 MUX in LUTs; multiplicity labels 4x, 8x, 32x.]

Verify that BRAMs are decomposed efficiently!
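The "levels" and "active BRAM" figures for the first two decompositions follow from simple tiling arithmetic, sketched below. The `Tiling` struct and the `tile` function are hypothetical helpers, not a tool API, and the model deliberately ignores the hybrid LUT and cascade arrangement, whose figures depend on how the cascades and LUT multiplexer are combined.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for positive integers.
std::uint32_t ceil_div(std::uint32_t n, std::uint32_t d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Result of tiling a depth x width memory with identical primitives.
struct Tiling {
    std::uint32_t levels;  // primitives stacked along the depth
    std::uint32_t active;  // primitives side by side along the width,
                           // i.e. the ones read together on one access
    std::uint32_t total;   // levels * active
};

// Tile a (depth x width) memory with (prim_depth x prim_width) primitives.
Tiling tile(std::uint32_t depth, std::uint32_t width,
            std::uint32_t prim_depth, std::uint32_t prim_width) {
    const std::uint32_t levels = ceil_div(depth, prim_depth);
    const std::uint32_t active = ceil_div(width, prim_width);
    return {levels, active, levels * active};
}
```

With a 32Kx32 memory, `tile(32768, 32, 32768, 1)` yields 1 level with 32 primitives active, and `tile(32768, 32, 1024, 32)` yields 32 levels with 1 primitive active. Both use 32 primitives in total, so the choice is about power and depth rather than block count.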

RAM & ROM Recommendations

• Use pipeline registers around the BRAM for performance
  - No logic in-between
  - No fanout
  - In the same hierarchy!
• Run phys_opt to move registers in & out of the BRAM based on timing (BRAM slack < 0 vs. BRAM slack > 0)
• Add extra pipeline registers for best performance!

[Diagram: BRAM blocks followed by chains of pipeline registers (Reg).]

Verify that BRAMs are pipelined efficiently!

Beware of Priority Logic

With "else if":

    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …

Without "else":

    if (c0) q = a0;
    if (c1) q = a1;
    if (c2) q = a2;
    if (c3) q = a3;
    if (c4) q = a4;
    if (c5) q = a5; …

• Both forms produce priority encoded logic, hence long paths
• Removing the else's won't help!! It only reverses the priority order

[Diagram: two chains of 2:1 multiplexers, one selecting a0 … a5 with c0 … c5 and one in the reverse order.]

Priority logic will hurt timing closure!
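A small behavioural model shows why dropping the else's does not remove the priority. It is written in C++ rather than HDL so it can be run directly; the function names `chain_if` and `chain_else_if` and the fixed size of 6 conditions are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kConds = 6;  // c0 .. c5, as on the slide
using Conds = std::array<bool, kConds>;
using Data = std::array<int, kConds>;

// Models "if (c0) q = a0; if (c1) q = a1; ...":
// every later assignment overrides earlier ones, so the HIGHEST
// asserted index wins.
int chain_if(const Conds& c, const Data& a, int q_default) {
    int q = q_default;
    for (std::size_t i = 0; i < kConds; ++i) {
        if (c[i]) q = a[i];
    }
    return q;
}

// Models "if (c0) q = a0; else if (c1) q = a1; ...":
// the first true condition ends the chain, so the LOWEST
// asserted index wins.
int chain_else_if(const Conds& c, const Data& a, int q_default) {
    for (std::size_t i = 0; i < kConds; ++i) {
        if (c[i]) return a[i];
    }
    return q_default;
}
```

With c0 and c2 both asserted, `chain_if` returns a2 and `chain_else_if` returns a0. Either way the result depends on an ordered scan of all conditions, which is the serial dependency behind the long path.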

Priority Logic with "for" loops

Without break (same as if … if … if …):

    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i])
        flag = 1;

With break (same as if … else if … else if …):

    flag = 0;
    for (i=0 ; i<32 ; i=i+1)
      if (c[i]) begin
        flag = 1;
        break; //SystemVerilog
      end

• break won't help!! "break" does not reduce logic!
• Best code in this case: flag = |c

[Diagram: two chains of 2:1 multiplexers over c[0] … c[31], one per loop form.]

Think Simple!
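The equivalence of the two loops with a plain OR-reduction can be checked with a short behavioural model. This is a C++ sketch under the assumption that c is 32 bits wide; `c != 0` plays the role of the Verilog reduction `|c`, and the function names are illustrative.

```cpp
#include <cstdint>

// Loop without break: scans all 32 bits.
bool flag_loop(std::uint32_t c) {
    bool flag = false;
    for (unsigned i = 0; i < 32; ++i) {
        if ((c >> i) & 1u) flag = true;
    }
    return flag;
}

// Loop with break: stops at the first set bit, same result.
bool flag_loop_break(std::uint32_t c) {
    bool flag = false;
    for (unsigned i = 0; i < 32; ++i) {
        if ((c >> i) & 1u) {
            flag = true;
            break;
        }
    }
    return flag;
}

// OR-reduction: true when any bit of c is set.
bool flag_reduce(std::uint32_t c) {
    return c != 0;
}
```

All three functions return the same value for every input, so the loop forms add description complexity without adding information. In software the break saves iterations, but the slide's point is that it does not reduce the synthesized logic.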

Priority Logic with "case" Statement

    case (c)
      v0: q=a0;
      v1: q=a1;
      v2: q=a3;
      v3: q=a4;
      v4: q=a5; …

• CASE won't help either! (note: the values v0, v1, … are variables)
• If conditions are mutually exclusive, make it clear!
  - In Verilog: case (c) //synthesis parallel_case (watch for simulation mismatch!)
  - In SystemVerilog: unique case (c) // works with "if" too

[Diagram. BAD: chain of 2:1 multiplexers selecting a0 … a5 on c==v0 … c==v5 in order. GOOD: a0 … a5 selected in parallel by the comparisons c==v0 … c==v5.]

Note: please use complete conditions
• .v: full_case (simulation may not match), or default & assign don't_care
• .sv: priority (for case & if)

Priority Logic Which Should Not Be!

BAD: 1-hot conditions (here: binary encoded) fed into a priority chain

    c0 = ( S == 0);
    c1 = ( S == 1);
    c2 = ( S == 2);
    c3 = ( S == 3);
    c4 = ( S == 4);
    …
    if (c0) q = a0;
    else if (c1) q = a1;
    else if (c2) q = a2;
    else if (c3) q = a3;
    else if (c4) q = a4;
    else if (c5) q = a5; …

GOOD: unique if (c0) … in SystemVerilog

GOOD: a case statement

    case ( S )
      0: q = a0
      1: q = a1
      2: q = a2
      …

or: q = A[S]

• Automated in most cases … even with registered conditions!

[Diagram. BAD: chain of 2:1 multiplexers, each comparing the full select S. GOOD: multiplexer tree over a0..7 selected by the individual bits S0, S1, S2.]

If conditions are mutually exclusive, do not use priority logic.
Use "unique if" in SystemVerilog.
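When the conditions are decoded from one select value, the priority chain and the indexed form compute the same function. The behavioural C++ sketch below checks this; the 3-bit select (8 inputs) and the names `select_priority` and `select_indexed` are assumptions for illustration.

```cpp
#include <array>
#include <cassert>

constexpr unsigned kInputs = 8;  // assumes a 3-bit select S
using Inputs = std::array<int, kInputs>;

// Priority form: c_i = (S == i), then an if / else-if chain.
int select_priority(unsigned s, const Inputs& a) {
    assert(s < kInputs);
    for (unsigned i = 0; i < kInputs; ++i) {
        const bool ci = (s == i);  // decoded, mutually exclusive condition
        if (ci) return a[i];       // first true condition wins
    }
    return 0;  // not reached when s < kInputs
}

// Indexed form, the analogue of "q = A[S]".
int select_indexed(unsigned s, const Inputs& a) {
    assert(s < kInputs);
    return a[s];
}
```

Because exactly one decoded condition is true for any S, the order of the chain carries no information. Stating the exclusivity explicitly lets the description match the multiplexer that is actually wanted.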

Parallelizing Priority Logic

• When you can't avoid O(n), you still can!

BAD: N deep
• One chain: if c0 … c63, 64 deep

GOOD: N/2 + 1 deep
• Two chains in parallel: if c0 … c31 (32 deep) and if c32 … c63 (32 deep)
• A final 2:1 selection driven by c0 … c31, 2 deep (log6(32))
• … or N/4 + 2 … or log(N) recursively

[Diagram: a single 64-deep multiplexer chain over a0 … a63 versus two 32-deep chains merged by one multiplexer.]

Improve timing even when conditions are not mutually exclusive!
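The split into two half-length chains preserves the original priority order, which a behavioural model can confirm. In this C++ sketch the helper `first_match`, the function names and the use of a default value when no condition is set are illustrative assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 64;  // number of conditions, as on the slide
using Conds = std::array<bool, kN>;
using Data = std::array<int, kN>;

// Index of the first asserted condition in [lo, hi), or -1 if none.
int first_match(const Conds& c, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= kN);
    for (std::size_t i = lo; i < hi; ++i) {
        if (c[i]) return static_cast<int>(i);
    }
    return -1;
}

// One chain over all conditions: N deep.
int select_serial(const Conds& c, const Data& a, int dflt) {
    const int i = first_match(c, 0, kN);
    return i >= 0 ? a[static_cast<std::size_t>(i)] : dflt;
}

// Two independent half chains plus one final choice: N/2 + 1 deep.
int select_split(const Conds& c, const Data& a, int dflt) {
    const int lo = first_match(c, 0, kN / 2);   // c0 .. c31
    const int hi = first_match(c, kN / 2, kN);  // c32 .. c63
    if (lo >= 0) return a[static_cast<std::size_t>(lo)];  // low half keeps priority
    return hi >= 0 ? a[static_cast<std::size_t>(hi)] : dflt;
}
```

The two half chains do not depend on each other, so in hardware they can evaluate side by side. Only the final choice needs to know whether any of c0 … c31 is set, which is an OR-reduction rather than another chain.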

Beware of Loop Unrolling – Avoid "if"

BAD: area & depth O(N)

    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      if (a[i]) c = c+1;

Get rid of "if":

    c = 0;
    for (i=0 ; i<8 ; i=i+1)
      c = c+a[i];

GOOD: area & depth log3(N)

    c = a[0] + a[1] + a[2] +
        a[3] + a[4] + a[5] +
        a[6] + a[7];

[Diagram. BAD: chain of "+1" incrementers and multiplexers controlled by a[0] … a[7]. GOOD: adder tree over a[0] … a[7].]

"if" in loops can seriously hurt timing!
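The three descriptions compute the same bit count, which the following behavioural C++ sketch checks for an 8-bit input. The function names are illustrative, and grouping the tree into three-input sums is an assumption made to mirror the log3(N) depth quoted on the slide.

```cpp
#include <cstdint>

// Bit i of a, as 0 or 1.
unsigned bit(std::uint8_t a, unsigned i) {
    return (a >> i) & 1u;
}

// Conditional increment: one "+1" and one choice per bit, in series.
unsigned count_with_if(std::uint8_t a) {
    unsigned c = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if (bit(a, i)) c = c + 1;
    }
    return c;
}

// Unconditional add of each bit: the same sum without the "if".
unsigned count_with_add(std::uint8_t a) {
    unsigned c = 0;
    for (unsigned i = 0; i < 8; ++i) {
        c = c + bit(a, i);
    }
    return c;
}

// Explicit tree of three-input sums: two levels for eight bits.
unsigned count_tree(std::uint8_t a) {
    const unsigned t0 = bit(a, 0) + bit(a, 1) + bit(a, 2);
    const unsigned t1 = bit(a, 3) + bit(a, 4) + bit(a, 5);
    const unsigned t2 = bit(a, 6) + bit(a, 7);
    return t0 + t1 + t2;
}
```

Once the "if" is gone the loop is a plain sum, and a sum can be regrouped freely because addition is associative. The conditional form hides that freedom behind a chain of choices.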


Advanced Synthesis Techniques (Ramine Roane)
