Architecture and Synthesis for Multi-Cycle On-Chip Communication
Jason Cong
VLSI CAD Lab, Computer Science Department
University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu
Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang
Outline
• Needs for Multi-Cycle On-Chip Communication
• Regular Distributed Register (RDR) Architecture
• MCAS: Multi-Cycle Communication Architectural Synthesis System
  - Scheduling-driven placement
  - Placement-driven rescheduling & rebinding
• Experimental Results
• Application in Pilot System: A Platform-Based HW/SW Synthesis System
• Conclusions & Future Work
Interconnect Bottleneck in Nanometer Designs
• 1st challenge: interconnect delay exceeds gate delay
  - Source of the "timing closure" problem
  - Happened in the mid-1990s; addressed by new physical synthesis/prototyping tools
Interconnect Bottleneck in Nanometer Designs
• 2nd challenge: single-cycle full-chip synchronization is no longer possible
  - Not supported by the current CAD toolset
  - About to happen soon
• Example: ITRS'01 0.07 um technology
  - 5.63 GHz across-chip clock
  - 800 mm^2 die (28.3 mm x 28.3 mm)
  - IPEM BIWS estimations
    - Buffer size: 100x
    - Driver/receiver size: 100x
  - On the semi-global layer (tier 3):
    - A signal can travel up to 11.4 mm in one cycle
    - 5 clock cycles are needed from corner to corner
[Figure: die map showing the regions reachable in 1, 2, 3, 4, and 5 clock cycles; axis ticks at 0, 11.4, 22.8, and 28.3 mm]
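The corner-to-corner cycle count on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes Manhattan (horizontal plus vertical) routing between opposite corners, which the slide does not state explicitly, and the function name `cycles_to_cross` is a hypothetical helper.

```python
import math

def cycles_to_cross(distance_mm, reach_mm_per_cycle):
    """Clock cycles needed to cover distance_mm when a signal
    travels at most reach_mm_per_cycle in one cycle."""
    return math.ceil(distance_mm / reach_mm_per_cycle)

die_side_mm = 28.3   # die is 28.3 mm x 28.3 mm (from the slide)
reach_mm = 11.4      # per-cycle reach on the tier-3 layer (from the slide)

# Assumption: corner-to-corner wires are routed in Manhattan fashion,
# so the wire length is twice the die side.
corner_to_corner_mm = 2 * die_side_mm

print(cycles_to_cross(corner_to_corner_mm, reach_mm))  # 5
```

Under that routing assumption, 56.6 mm / 11.4 mm is just under 5, which matches the 5 cycles quoted on the slide.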
Single-Cycle Full Chip Synchronization No Longer Possible: FPGA Example
• Altera Stratix: EP1S80B-C6
• Large size: 79,040 LEs, 22 DSP blocks, …
• Corner-to-corner interconnect delay: 7.154 ns
• With clock frequency: 300 MHz
• Corner-to-corner communication: 3 clock cycles!
[Figure: Stratix floorplan with MegaRAM Blocks (9), Logic Array Blocks (79,040 LEs), DSP Blocks (22), M512 RAM Blocks (767), and M4K RAM Blocks (364)]
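As a rough check of the FPGA numbers, the sketch below converts the quoted corner-to-corner delay into clock cycles at 300 MHz and into the highest clock frequency at which the same path would still fit in one cycle. It ignores register setup time and clock skew, which is a simplifying assumption, and the helper names are hypothetical.

```python
import math

def clock_period_ns(clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz

def cycles_for_delay(delay_ns, clock_mhz):
    """Whole clock cycles needed to cover a path delay."""
    return math.ceil(delay_ns / clock_period_ns(clock_mhz))

def max_single_cycle_mhz(delay_ns):
    """Highest frequency at which the path fits in one cycle
    (assumes no setup time or skew)."""
    return 1000.0 / delay_ns

corner_delay_ns = 7.154  # corner-to-corner delay from the slide

print(cycles_for_delay(corner_delay_ns, 300.0))         # 3
print(round(max_single_cycle_mhz(corner_delay_ns), 1))  # frequency in MHz
```

At 300 MHz the period is about 3.33 ns, so a 7.154 ns path spans three cycles, in agreement with the slide.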
Possible Solutions
• Asynchronous designs
  - Triggered by events instead of clocks
    - Bridging capabilities: provide interfaces between systems of different speeds
    - Greater flexibility: circuits in a system do not have to share a common timing
  - Delay-insensitive
  - Reduced power consumption?
  - Improved performance?
• Synchronous designs with multi-cycle communications
  - Much better understood
  - Can leverage existing tools/flows
  - Our current focus
Multi-Cycle Interconnect Communication at Logic / Physical Level
• Simultaneous retiming + placement / floorplanning
  - Retiming + multilevel partitioning [Cong et al., ICCAD'00] and coarse placement [Cong et al., DAC'03]
  - Retiming + floorplanning [Chong & Brayton, IWLS'01]
  - Retiming + placement for FPGAs [Singh & Brown, FPGA'02]
Need of Considering Retiming during Placement
• Retiming/pipelining on global interconnects
  - Multiple clock cycles are needed to cross the chip
  - Proper placement allows retiming to hide global interconnect delays
[Figure: Placement 1 and Placement 2 of nodes a, b, c, d. Both have d(v) = 1, WL = 6, d(e) ∝ WL. Before retiming, the two placements give φ = 4.0 and φ = 5.0, and the lower one is marked "Better initial placement". After retiming, the labels read φ = 3.0 and φ = 4.0.]
Simultaneous Coarse Placement with Retiming on Interconnects
• Difficulties
  - How to consider retiming/pipelining over global interconnects
    - Flip-flop boundaries are not fixed during placement, which makes static timing analysis difficult
  - How to handle the high complexity of the combined problem
• Our solution
  - Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  - Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework
Sequential Arrival Time (SAT)
• Definition [Pan et al., TCAD'98]
  - l(v) = max delay from PIs to v after optimal retiming under a given clock period φ
  - l(v) = max over fanin edges (u,v) of { l(u) - φ·w(u,v) + d(u,v) + d(v) }
  - Example (figure): u and w drive v, with l(u) = 7, w(u,v) = 1, l(w) = 3, w(w,v) = 0, d(v) = 1, d(e) = 2, φ = 5, so l(v) = max{7 - 5·1 + 2 + 1, 3 + 2 + 1} = 6
• Relation to retiming: r(v) = ⌈l(v) / φ⌉ - 1
• Theorem: P can be retimed to φ + max{d(e)} iff l(POs) ≤ φ
• SAT can be computed iteratively in O(VE) time (linear time in practice)
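The iterative SAT computation can be sketched as a Bellman-Ford-style longest-path relaxation of the recurrence above. This is a minimal illustration, not the authors' implementation: the function name, the edge tuple layout `(u, v, w, d_e)`, and the use of a `seeds` dictionary for fixed starting labels are all assumptions made for the example, and the clock period is called `phi`.

```python
def sequential_arrival_times(nodes, edges, node_delay, phi, seeds):
    """Relax l(v) = max{ l(u) - phi*w(u,v) + d(u,v) + d(v) } to a fixpoint.

    nodes:      iterable of node names
    edges:      list of (u, v, w, d_e): register count w and delay d_e on (u, v)
    node_delay: dict mapping node name to d(v)
    phi:        target clock period
    seeds:      dict of fixed starting labels (e.g. primary inputs)
    Returns the label dict, or None if labels keep growing, which
    signals a positive cycle (phi is infeasible).
    """
    nodes = list(nodes)
    neg_inf = float("-inf")
    label = {v: neg_inf for v in nodes}
    label.update(seeds)

    # At most |V| passes over all edges, hence O(VE) in the worst case.
    for _ in range(len(nodes)):
        changed = False
        for u, v, w, d_e in edges:
            if label[u] == neg_inf:
                continue  # u not reached yet
            candidate = label[u] - phi * w + d_e + node_delay[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label  # converged
    return None  # still changing after |V| passes: positive cycle

# The example from the slide: l(u) = 7, l(w) = 3, one register on (u, v).
example = sequential_arrival_times(
    nodes=["u", "w", "v"],
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    node_delay={"v": 1},
    phi=5,
    seeds={"u": 7, "w": 3},
)
print(example["v"])  # 6
```

The early exit when no label changes is what tends to make the procedure fast in practice, while the pass limit gives the O(VE) worst case and doubles as the infeasibility test.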
Limitation of Exploring Multi-Cycle Interconnect Communication during Logic/Physical Synthesis
• The minimum clock period that logic optimization can achieve is bounded by the maximum delay-to-register (DR) ratio over the loops in the circuit
  - Example: a loop with 4 logic cells and 2 registers
  - Cell delay = 1 ns
  - Interconnect delay = 1 ns
  - DR ratio = (D_logic + D_int) / #registers = (4 + 4) / 2 = 4 ns
  - Clock cycle >= 4 ns
• This requires consideration of multi-cycle communication during architecture and behavior synthesis
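The DR-ratio bound can be written out as a short calculation. The sketch below is illustrative: the function names and the list-based description of a loop are hypothetical, and it assumes the loop has four interconnect segments of 1 ns each, which is how the slide's (4 + 4) term is read here.

```python
def dr_ratio(logic_delays_ns, interconnect_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay / register count."""
    if num_registers <= 0:
        raise ValueError("a sequential loop must contain at least one register")
    total_delay = sum(logic_delays_ns) + sum(interconnect_delays_ns)
    return total_delay / num_registers

def min_clock_period_bound(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops.
    Each loop is a (logic_delays, interconnect_delays, num_registers) tuple."""
    return max(dr_ratio(*loop) for loop in loops)

# Loop from the slide: 4 cells at 1 ns, 2 registers.
# Assumption: 4 interconnect segments at 1 ns each.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)

print(dr_ratio(*slide_loop))                 # 4.0
print(min_clock_period_bound([slide_loop]))  # 4.0
```

No amount of retiming lowers this bound, because retiming moves registers around a loop without changing how many the loop contains.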