Architecture and Synthesis for Multi-Cycle On-Chip Communication

Jason Cong
VLSI CAD Lab, Computer Science Department
University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu
Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang

Outline
• Needs for Multi-Cycle On-Chip Communication
• Regular Distributed Register (RDR) Architecture
• MCAS: Multi-Cycle Communication Architectural Synthesis System
  - Scheduling-driven placement
  - Placement-driven rescheduling & rebinding
• Experimental Results
• Application in Pilot System: A Platform Based HW/SW Synthesis System
• Conclusions & Future Work

Interconnect Bottleneck in Nanometer Designs
• 1st challenge: Interconnect delay exceeds gate delay
  - Source of the "timing closure" problem
  - Happened in the mid 1990s. Addressed by new physical synthesis/prototyping tools

Interconnect Bottleneck in Nanometer Designs
• 2nd challenge: Single-cycle full chip synchronization is no longer possible
  - Not supported by the current CAD toolset
  - About to happen soon
• ITRS'01 0.07 um technology
  - 5.63 GHz across-chip clock
  - 800 mm^2 die (28.3 mm x 28.3 mm)
• IPEM BIWS estimations
  - Buffer size: 100x
  - Driver/receiver size: 100x
• On the semi-global layer (tier 3):
  - A signal can travel up to 11.4 mm in one cycle
  - Need 5 clock cycles from corner to corner
[Figure: die map marking the regions reachable in 1 to 5 clock cycles from one corner; axis ticks at 0, 11.4, 22.8, and 28.3 mm]
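As a rough check on the corner-to-corner figure, the sketch below divides the Manhattan distance across the 28.3 mm x 28.3 mm die by the 11.4 mm single-cycle reach. Manhattan (horizontal plus vertical) routing is an assumption made here, not something the slide states, and the function name is illustrative only.

```python
import math

def cycles_corner_to_corner(die_edge_mm, reach_mm_per_cycle):
    """Clock cycles needed to cross a square die corner to corner.

    Assumes Manhattan routing, so the wire length is twice the die edge.
    """
    manhattan_mm = 2 * die_edge_mm
    return math.ceil(manhattan_mm / reach_mm_per_cycle)

# Slide numbers: 28.3 mm die edge, 11.4 mm reach per cycle on tier 3.
print(cycles_corner_to_corner(28.3, 11.4))  # 5
```

Under this assumption the result matches the 5 cycles quoted on the slide; a straight diagonal (about 40 mm) would give 4 instead.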

Single-Cycle Full Chip Synchronization No Longer Possible: FPGA Example
• Altera Stratix: EP1S80B-C6
• Large size: 79,040 LEs, 22 DSP blocks, ...
• Corner-to-corner interconnect delay: 7.154 ns
• With clock frequency: 300 MHz
• Corner-to-corner communication: 3 clock cycles!
[Figure: Stratix floorplan showing Logic Array Blocks (79,040 LEs), DSP Blocks (22), MegaRAM Blocks (9), M512 RAM Blocks (767), and M4K RAM Blocks (364)]
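The cycle count follows from the two numbers on the slide, assuming the interconnect delay is simply rounded up to a whole number of clock periods:

```latex
T_{\text{clk}} = \frac{1}{300\ \text{MHz}} \approx 3.33\ \text{ns},
\qquad
\left\lceil \frac{7.154\ \text{ns}}{3.33\ \text{ns}} \right\rceil
  = \lceil 2.15 \rceil = 3\ \text{cycles}
```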

Possible Solutions
• Asynchronous designs
  - Triggered by events instead of clocks
    - Bridging capabilities: provides interfaces for systems of different speeds
    - Greater flexibility: circuits in a system do not have to share common timing
  - Delay-insensitive
  - Reduced power consumption?
  - Improved performance?
• Synchronous designs, with multi-cycle communications
  - Much better understood
  - Can leverage existing tools/flows
  - Our current focus

Multi-Cycle Interconnect Communication at Logic / Physical Level
• Simultaneous retiming + placement / floorplanning
  - Retiming + multilevel partitioning [Cong et al., ICCAD'00] and coarse placement [Cong et al., DAC'03]
  - Retiming + floorplanning [Chong & Brayton, IWLS'01]
  - Retiming + placement for FPGAs [Singh & Brown, FPGA'02]

Need of Considering Retiming during Placement
• Retiming/pipelining on global interconnects
  - Multiple clock cycles are needed to cross the chip
  - Proper placement allows retiming to hide global interconnect delays
[Figure: two placements (Placement 1, Placement 2) of the same four-node circuit a, b, c, d, each with d(v) = 1, WL = 6, d(e) ∝ WL. Clock period φ before retiming: 4.0 and 5.0, with the former labeled "Better initial placement!"; after retiming: 3.0 and 4.0.]

Simultaneous Coarse Placement with Retiming on Interconnects
• Difficulties
  - How to consider retiming/pipelining over global interconnects
    - Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
  - How to handle the high complexity of the combined problem
• Our solution
  - Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  - Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework

Sequential Arrival Time (SAT)
• Definition [Pan et al., TCAD'98]
  - l(v) = max delay from PIs to v after optimal retiming under a given clock period φ
  - l(v) = max over fanin edges (u,v) of { l(u) - φ·w(u,v) + d(u,v) + d(v) }
  - Example: v has fanins u with l(u) = 7 (one register on edge (u,v)) and w with l(w) = 3 (no register); d(v) = 1, d(e) = 2, φ = 5:
    l(v) = max{7 - 5·1 + 2 + 1, 3 + 2 + 1} = 6
• Relation to retiming: r(v) = ⌈l(v) / φ⌉ - 1
• Theorem: P can be retimed to φ + max{d(e)} iff l(POs) ≤ φ
• SAT can be computed iteratively in O(VE) time (linear time in practice)
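The iterative label computation can be sketched as a Bellman-Ford style longest-path relaxation. This is a minimal illustration and not the authors' implementation: the function names, the edge tuple layout (u, v, registers, edge delay), and the `source_labels` argument (labels fixed in advance, normally 0 for primary inputs) are all assumptions made for the example.

```python
import math

def sequential_arrival_times(nodes, edges, node_delay, period, source_labels):
    """Return {node: l(node)} or None if the labels do not converge.

    edges: list of (u, v, registers_on_edge, edge_delay).
    Non-convergence after |V| rounds means some loop gains delay on
    every pass, so the target period is infeasible.
    """
    label = {v: float("-inf") for v in nodes}
    label.update(source_labels)
    for _ in range(len(nodes)):
        changed = False
        for u, v, regs, edge_delay in edges:
            if label[u] == float("-inf"):
                continue  # u not reached yet
            # l(u) - period * w(u,v) + d(u,v) + d(v)
            candidate = label[u] - period * regs + edge_delay + node_delay[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None

def retiming_value(label_v, period):
    """r(v) = ceil(l(v) / period) - 1, as on the slide."""
    return math.ceil(label_v / period) - 1

# Slide example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, period = 5.
slide_labels = sequential_arrival_times(
    nodes=["u", "w", "v"],
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    node_delay={"v": 1},
    period=5,
    source_labels={"u": 7, "w": 3},
)
print(slide_labels["v"])                     # 6
print(retiming_value(slide_labels["v"], 5))  # 1
```

Each round visits every edge once and at most |V| rounds are run, which gives the O(VE) bound quoted above; the early exit when nothing changes is what makes it fast in practice.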

Limitation of Exploring Multi-Cycle Interconnect Communication during Logic/Physical Synthesis
• The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
  - Example: a loop with 4 logic cells and 2 registers
  - Cell delay = 1 ns
  - Interconnect delay = 1 ns
  - DR ratio = (D_logic + D_int) / #Registers = (4 + 4) / 2 = 4 ns
  - Clock cycle >= 4 ns
• Requires consideration of multi-cycle communication during architecture & behavior synthesis
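The DR bound from the example can be reproduced with a few lines. This is only a sketch: the function names and the representation of a loop as (cell delays, interconnect delays, register count) are assumptions for illustration, and enumerating the loops of a real netlist is not shown.

```python
def delay_to_register_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """(D_logic + D_int) / #Registers for one loop, in ns."""
    if num_registers <= 0:
        raise ValueError("a synchronous loop needs at least one register")
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers

def min_clock_period_ns(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops."""
    return max(delay_to_register_ratio(*loop) for loop in loops)

# Slide example: 4 cells at 1 ns, 4 interconnect segments at 1 ns, 2 registers.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(min_clock_period_ns([slide_loop]))  # 4.0
```

Retiming can move the two registers around the loop but cannot change their count, which is why no logic-level optimization gets below this bound.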
