Grid-to-Ports Clock Routing for High Performance Microprocessor Designs
Haitong Tian#, Wai-Chung Tang#, Evangeline F.Y. Young# and C.N. Sze*
# Department of Computer Science and Engineering, The Chinese University of Hong Kong
* IBM Austin Research Laboratory
ISPD ’11, Santa Barbara, USA, March 28, 2011
Outline
- Introduction
- Problem Formulation
- Routing Algorithm
- Experimental Results
- Conclusion
Clock Distribution Categories
- Clock distribution is a very important issue
- Buffered and unbuffered trees
  - Used in various ASICs
  - Supported by many physical design tools
  - See Tsay TCAD’93, Xi DAC’95
- Non-tree structures with crosslinks
  - Intended for reducing clock skew
  - See Rajaram DAC’04, TCAD’06
- Grid and buffered trees
  - High performance processors
  - Clock structures are sometimes designed manually
  - See Shelar ISPD’09, TCAD’10, Guru VLSI Circuits’10
High Performance Clock Distribution
- Clock network in high performance microprocessors
  - Distributed as a global grid followed by buffered trees
  - See Shelar ISPD’09, TCAD’10, Guru VLSI Circuits’10
- This paper focuses on the post-grid clock distribution area
[Figure: clock distribution diagram with labels External Clock, PLL, Grid Buffer, Clock Grid, Regional Clock buffers, Local Clock buffers, and Post-grid Clock Distribution Network]
Post-grid Clock Distribution
- In our modeling
  - Entire chip divided into several layout areas
  - Each layout area contains many blocks
  - Each block contains standard cells and/or macros
  - Each layout area contains 100s-1000s of clock ports
- Grid wires reserved for clock routing
  - Reserved multilayer tracks
  - Typically upper metal layers
[Figure labels: Global Grid, Grid Buffer, Blocks, Reserved Tracks, Port, Sequential, Local Clock Buffer, Layout Region]
Motivations
- Clock distribution in microprocessors
  - Crucial importance
  - Major source of power dissipation
  - High capacitance usage
    - 18.1% of total clock capacitance [1]
    - See Pham Solid State Circuits’06
- Manual design in practice
  - Hard to satisfy delay/slew constraints
  - Time to market
  - See Shelar ISPD’09, TCAD’10
[1] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. Harvey, H. Hofstee, C. Johns, et al. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, Jan. 2006.
Problem Formulation
- Input
  - A set of reserved tracks
  - Locations and capacitances of the ports P
  - Different types of wires on each metal layer
  - Delay limit D, slew limit S
- Output
  - A clock network (may be a non-tree structure)
- Objective
  - Connect every port to the source
  - Satisfy the delay and slew constraints
  - Minimize capacitance usage
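To make the objective concrete, the sketch below evaluates one candidate tree: it computes the Elmore delay at every node (the delay model named later in the talk) and the total capacitance, then compares the port delays against a limit D. The function name `elmore_delays`, the pi-model split of wire capacitance, and all numeric values are assumptions for illustration; the slew check is omitted.

```python
from collections import defaultdict


def elmore_delays(parent, wire_r, wire_c, node_cap, root):
    """Return (delay per node, total capacitance) for an RC tree.

    parent[v]  : parent of node v (the root has no entry)
    wire_r[v]  : resistance of the wire from parent[v] to v
    wire_c[v]  : capacitance of that wire (pi-model: half at each end)
    node_cap[v]: load capacitance at v (port capacitance, 0 if absent)
    """
    children = defaultdict(list)
    for v, p in parent.items():
        children[p].append(v)

    # Root-first order; its reverse visits children before parents.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])

    # Downstream capacitance: own load plus everything in the subtree.
    down = {u: node_cap.get(u, 0.0) for u in order}
    for u in reversed(order):
        for v in children[u]:
            down[u] += wire_c[v] + down[v]

    # Accumulate delay from the root towards the leaves.
    delay = {root: 0.0}
    for u in order:
        for v in children[u]:
            delay[v] = delay[u] + wire_r[v] * (wire_c[v] / 2.0 + down[v])
    return delay, down[root]


# Hypothetical toy tree: S -> C, then C -> P1 and C -> P2 (invented values).
parent = {"C": "S", "P1": "C", "P2": "C"}
wire_r = {"C": 1.0, "P1": 2.0, "P2": 1.0}
wire_c = {"C": 2.0, "P1": 2.0, "P2": 4.0}
node_cap = {"P1": 3.0, "P2": 1.0}

delay, total_cap = elmore_delays(parent, wire_r, wire_c, node_cap, "S")
D = 20.0  # delay limit in the same arbitrary units
meets_delay = all(delay[p] <= D for p in node_cap)
print(delay, total_cap, meets_delay)
```

A router under this formulation would prefer, among the trees for which `meets_delay` holds, the one with the smallest `total_cap`.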
Post-grid Clock Routing
[Figure: 3-D routing plot with a Layer axis (ticks 3 to 7) and two unlabeled horizontal axes]
Overall Algorithm
- Critical ports
  - Ports with large capacitance or far away from the source
- Path expansion algorithm
  - Elmore-delay driven
  - Expands in selected directions
- Post-processing
  - Wire replacement
  - Topology refinement
- Iterations
  - The overall algorithm is invoked repeatedly
  - May fail when the number of iterations exceeds K (user specified)
Delay-driven Path Expansion Algorithm
- Basic steps
  - Simultaneously expand from all ports
  - Select the path with the minimum Elmore delay to expand further
  - Connect the ports to the source once the path reaches the source grid
  - Check delay/slew constraints
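The basic steps can be sketched with a binary heap keyed on Elmore delay. This is a simplified illustration, not the authors' implementation: every port is routed independently, paths are not merged into chain paths, the pruning of already-visited nodes makes the search greedy, not exact, and the delay/slew check is left out. The names `expand_paths` and `undirected`, the graph encoding, and the numbers are assumptions.

```python
import heapq


def expand_paths(graph, port_cap, sources):
    """Greedily grow a low-Elmore-delay path from every port to a source.

    graph[u]   : list of (v, r, c) track segments incident to u
    port_cap[p]: load capacitance of port p
    sources    : set of nodes on the source grid
    Returns {port: (delay, path)} with the path listed from port to source.
    """
    heap, counter = [], 0  # counter breaks ties so paths are never compared
    for port, cap in port_cap.items():
        heap.append((0.0, cap, counter, port, (port,)))
        counter += 1
    heapq.heapify(heap)

    result, settled = {}, set()
    while heap and len(result) < len(port_cap):
        delay, cap, _, port, path = heapq.heappop(heap)
        head = path[-1]
        if port in result or (port, head) in settled:
            continue
        settled.add((port, head))
        if head in sources:  # the path reached the source grid
            result[port] = (delay, path)
            continue
        for nxt, r, c in graph.get(head, ()):
            if (port, nxt) in settled:
                continue
            # The new segment drives its own half capacitance plus
            # everything already on the path (pi-model).
            new_delay = delay + r * (c / 2.0 + cap)
            heapq.heappush(
                heap, (new_delay, cap + c, counter, port, path + (nxt,)))
            counter += 1
    return result


def undirected(edges):
    """Build the adjacency lists from (u, v, r, c) tuples."""
    graph = {}
    for u, v, r, c in edges:
        graph.setdefault(u, []).append((v, r, c))
        graph.setdefault(v, []).append((u, r, c))
    return graph


# Hypothetical track graph with one source node S (invented values).
graph = undirected([
    ("P1", "A", 1.0, 2.0), ("A", "S", 1.0, 2.0),
    ("P1", "S", 5.0, 2.0), ("P2", "A", 2.0, 2.0),
])
routes = expand_paths(graph, {"P1": 2.0, "P2": 1.0}, {"S"})
print(routes)
```

Keying the heap on delay mirrors the "select the path with the minimum Elmore delay" step; a fuller version would also detect when paths from different ports meet and share wire.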
A Routing Example
- Initially, the heap is empty
- First iteration (simultaneously expand from all ports)
  Heap = {(P1,P2); (P1,C1); (P2,P3); (P2,P1); (P3,P2); (P3,C2)}
- Second iteration (P1,P2)
  Heap = {(P3,C2); (P1,P2,P3); (P1,C1); (P2,P3); (P2,P1); (P3,P2)}
- Third iteration (P3,C2)
  Heap = {(P3,C2,S2); (P1,P2,P3); (P1,C1); (P2,P3); (P2,P1); (P3,P2)}
A Routing Example
- Fourth iteration (identify chain paths)
  Heap = {(P1,P2,P3); (P1,C1); (P2,P3); (P2,P1)}
  Chain path = {(P1,P2,P3); (P2,P3)}
- Fifth iteration (P2,P3)
  Heap = {(P1,P2,P3); (P1,C1)}
  Chain path = {(P1,P2,P3); (P1,P2)}
- Sixth iteration (P1,P2)
  Heap = {}, chain path = {}
- Final result
Post-processing Techniques: Wire replacement
- Two types of wires
  - Capacitance/resistance tradeoff
- Procedure
  - Identify the port P_l with the largest Elmore delay
  - Replace wires in a bottom-up style
  - Check delay/slew constraints
- Example
  - Port with largest delay: P5
  - Replace edge P1C1
  - Replace edge P4C2
  - Replace edges P2P3, P3C1
  - Replace P5C3, C3C2, C2C1, C1S1
[Figure: five snapshots of a tree with source S1, nodes C1-C3 and ports P1-P5]
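A minimal sketch of bottom-up wire replacement on a single source-to-port chain follows. The slide does not state the direction of the swap, so this assumes the path starts on the lower-resistance ("wide") wire and that each segment is switched to the lower-capacitance ("narrow") wire only if the delay limit still holds. The wire parameters, `replace_wires`, and `chain_delay` are hypothetical, and the slew check is omitted.

```python
def chain_delay(segments, load_cap):
    """Elmore delay of a chain; segments are (r, c) from source to port."""
    delay, downstream = 0.0, load_cap
    for r, c in reversed(segments):  # walk from the port back to the source
        delay += r * (c / 2.0 + downstream)
        downstream += c
    return delay


def replace_wires(lengths, wide, narrow, load_cap, delay_limit):
    """Swap wide wires for narrow ones bottom-up while the limit holds.

    lengths     : segment lengths listed from source to port
    wide, narrow: (resistance per unit, capacitance per unit) of each type
    Returns (narrow flags per segment, final delay, final wire capacitance).
    """
    use_narrow = [False] * len(lengths)

    def segments():
        out = []
        for narrow_here, length in zip(use_narrow, lengths):
            r_unit, c_unit = narrow if narrow_here else wide
            out.append((r_unit * length, c_unit * length))
        return out

    for i in reversed(range(len(lengths))):  # bottom-up: port end first
        use_narrow[i] = True
        if chain_delay(segments(), load_cap) > delay_limit:
            use_narrow[i] = False  # revert a swap that breaks the limit
    final = segments()
    return use_narrow, chain_delay(final, load_cap), sum(c for _, c in final)


# Hypothetical two-segment path: wide = (1, 2), narrow = (2, 1) per unit.
flags, delay, wire_cap = replace_wires(
    [1.0, 1.0], (1.0, 2.0), (2.0, 1.0), load_cap=1.0, delay_limit=7.0)
print(flags, delay, wire_cap)
```

In this toy case only the segment nearest the port is swapped: the swap near the source would push the delay past the limit, which is one reason a bottom-up order can be attractive.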
Post-processing Techniques: Topology refinement
- Procedure
  - Disconnect a port P
  - Expand P towards all directions
  - Select paths with smaller capacitance
  - Check delay/slew constraints
- Example
  - Elmore delay: P5 > P4 > P6 > P2 > P1 > P3 > P7
  - Sequentially process all the ports
[Figure: four snapshots of a network with sources S1, S2, nodes C1-C5 and ports P1-P7]
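The reconnection step can be illustrated with a shortest-path search that uses wire capacitance as the edge weight. Dijkstra's algorithm is swapped in here for the slide's "expand towards all directions", and the delay/slew check on the returned path is left to the caller. The function name, graph encoding, and values are assumptions.

```python
import heapq


def cheapest_reconnection(graph, port, tree_nodes):
    """Find the lowest-capacitance path from a disconnected port to the tree.

    graph[u]  : list of (v, cap) track segments incident to u
    tree_nodes: nodes still in the clock network after disconnecting port
    Returns (added capacitance, path from port to tree) or None.
    """
    dist, prev = {port: 0.0}, {}
    heap = [(0.0, port)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u in tree_nodes:
            path = [u]
            while path[-1] != port:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, cap in graph.get(u, ()):
            nd = d + cap
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None


# Hypothetical neighbourhood of port P; T1 and T2 are already in the tree.
edges = [("P", "A", 3.0), ("P", "B", 1.0), ("B", "T1", 1.0), ("A", "T2", 1.0)]
graph = {}
for u, v, cap in edges:
    graph.setdefault(u, []).append((v, cap))
    graph.setdefault(v, []).append((u, cap))

best = cheapest_reconnection(graph, "P", {"T1", "T2"})
print(best)
```

If the returned path violates the delay or slew limit, a refinement loop would keep the port's original connection.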
Non-tree Extensions
- A small number of ports have exceptionally large capacitances
  - The delay of its shortest path exceeds the delay limit D
- Procedure
  - Establish a shortest path for p
  - Find a second shortest path
  - Add all useful crosslinks
  - Target delay not met? Do the same thing for the parent node of p
- Example
  - Connect p to S1
  - Target delay not met? Find a second source S2 and add crosslinks
  - Target delay not met? Find a third source S3 and add crosslinks
Experiment Setup
- Environment
  - Implemented in C++
  - Run on a Linux server: Intel Pentium 4 3.2 GHz, 2 GB RAM
  - Delay setup: 5 ps
  - Slew setup: input 10 ps; output 15 ps
- Benchmarks
  - 3 test cases provided by industry
  - 11 test cases from the ISPD 2010 Clock Network Synthesis Contest
- Comparisons
  - Compared with TG, which was proposed by Shelar in ISPD’09, TCAD’10
Tree Growing Algorithm
- Proposed in R. Shelar ISPD’09, TCAD’10
- Expand from the source
- Delay/slew constraints
- Greedy expansion from the source
  - Edges with the smallest capacitance are added to the network
- Example
  - Add S1C1, S2C2
  - Add C2P3
  - Add C1P1
  - Add P3P2
[Figure: six snapshots of a network with sources S1, S2, nodes C1, C2 and ports P1-P3]
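The greedy expansion can be pictured as a Prim-style growth from the source nodes, always adding the frontier edge with the smallest capacitance. This is only a rough illustration of the idea on the slide, not Shelar's algorithm as published: delay/slew checks and the pruning of branches that reach no port are omitted, and the edge capacitances are invented so that the edges come out in the order of the slide's example.

```python
import heapq


def tree_growing(graph, sources, ports):
    """Grow a network from the sources, cheapest-capacitance edge first.

    graph[u]: list of (v, cap) track segments incident to u
    Returns the edges (u, v) in the order they were added.
    """
    in_tree = set(sources)
    heap = [(cap, u, v) for u in sources for v, cap in graph.get(u, ())]
    heapq.heapify(heap)
    remaining = set(ports) - in_tree
    added = []
    while heap and remaining:
        cap, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # this edge would close a loop
        in_tree.add(v)
        added.append((u, v))
        remaining.discard(v)
        for w, c in graph.get(v, ()):
            if w not in in_tree:
                heapq.heappush(heap, (c, v, w))
    return added


# Hypothetical capacitances for the S1/S2, C1/C2, P1-P3 example network.
edges = [("S1", "C1", 1.0), ("S2", "C2", 2.0), ("C2", "P3", 3.0),
         ("C1", "P1", 4.0), ("P3", "P2", 5.0), ("P1", "P2", 9.0)]
graph = {}
for u, v, cap in edges:
    graph.setdefault(u, []).append((v, cap))
    graph.setdefault(v, []).append((u, cap))

order = tree_growing(graph, ["S1", "S2"], ["P1", "P2", "P3"])
print(order)
```

The contrast with the path expansion approach is the direction of growth: this sketch starts at the sources and is driven by capacitance, while the expansion described earlier starts at the ports and is driven by Elmore delay.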
Comparisons: capacitance
- Without post-processing: 18.3% improvement
- With topology refinement: 24.6% improvement
[Figure: two bar charts, "Capacitance (without post-processing techniques)" and "Capacitance (with topology refinement)"; x-axis: test cases 1-14; y-axis: capacitance (fF), 0 to 25000; series: TG, Ours]