Impact of Local Interconnects on Timing and Power in a High Performance Microprocessor
Marek Patyra, Enterprise Microprocessor Group, Intel Corporation, Hillsboro, OR
Rupesh S. Shelar, Low Power IA Group, Intel Corporation, Austin, TX
ISPD 2010, San Francisco, CA
Objective
• To convey the severity of the delay/power impact of local interconnects and the challenges it presents to physical design
Agenda
• Introduction
• Impact on Timing
• Impact on Power
• Conclusions
Why Look at Interconnects Closely
• Unlike transistors, they do not perform computation; they just transfer information from one place to another
• Paying a power/timing cost for interconnects yields nothing, unlike paying it for transistors
• Secondary effects: they cause area growth, delay penalties, and yield issues indirectly through routing congestion
Motivation I: Interconnect Delay
• Interconnects are known to contribute significantly to path delays
• For intra-block paths, exact numbers are probably not known, as they vary with block size and design style
• Many academic studies (Keutzer, Horowitz, Cong, Saraswat, Saxena) exist (and thousands of papers open their introduction with "interconnect delay scaling...")
• Most are based on a combination of limited design data and simplistic scaling assumptions, and do not focus solely on data from a real design, e.g., a high performance microprocessor core
Motivation II: Power in Local Interconnects
• More than 70% of power is in datapath and control logic blocks
• 60% of the total power is dynamic/glitch power
– 66% of the total dynamic power is in local, i.e., intra-block, interconnects (Source: SLIP'04 paper, based on a microprocessor study)
• Still, relatively little attention is paid to power dissipation in interconnects
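The cell/wire power split above follows directly from how much of the switched capacitance sits on the wires. A minimal sketch of the standard dynamic-power model (P = alpha * C * V^2 * f); all capacitance, activity, voltage, and frequency values below are illustrative assumptions, not numbers from the study:

```python
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """Classic switching-power model: P = alpha * C * V^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz

# Illustrative operating point only (not from the paper): 1 GHz, 1.0 V.
VDD, FREQ = 1.0, 1e9
cell_cap = 10e-12   # total switched cell capacitance, 10 pF (assumed)
wire_cap = 20e-12   # total switched local-wire capacitance, 20 pF (assumed)

p_cell = dynamic_power(0.1, cell_cap, VDD, FREQ)
p_wire = dynamic_power(0.1, wire_cap, VDD, FREQ)
wire_share = p_wire / (p_cell + p_wire)
print(f"wire share of dynamic power: {wire_share:.0%}")  # ~67% with these numbers
```

With the assumed 2:1 wire-to-cell capacitance ratio, the wire share lands near the 66% figure quoted from the SLIP'04 study; the point is only that the split is set by the capacitance ratio, since activity, V, and f cancel.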
About the Data
• Delay/power data from blocks in a high performance microprocessor core [Kumar et al., JSSC 2008] in 45 nm technology
• Blocks implemented using different design styles
– RTL-to-Layout Synthesis (RLS), aka random logic synthesis
• Mostly automatic (using vendor/in-house tools): write RTL, partition, and run tools/flows
• Design quality determined by algorithms, tools, flows, and parameters; supposedly poor utilization, i.e., sparse layouts
– Structured Data Paths (SDP)
• Mostly manual: extract regularity using hierarchies, draw schematics, do hierarchical placement and routing
• Routing can be done flat; supposedly high utilization, i.e., dense layouts
• RLS (SDP): 86 (133) blocks; cell count more than 600 K (700 K)
• Local interconnects:
– RLS uses mostly M2 to M5, mostly minimum width, flat routing
– SDP uses M2 to M7, different widths, hierarchical routing
• Delay/power impact due to interconnects inside standard cells is counted as cell delay/power contribution in this study
Utilization in RLS
[Scatter plot: Placement Utilization (%) vs. # of Cells; Utilization = std. cell area / block area]
• Avg. utilization: 51.69%; varies from 7% to 78%
– Utilization varies significantly for blocks with < 5000 cells, possibly because of floorplan; for blocks with > 15000 cells, it varies between 40% and 70%
– Blocks with higher than 70% utilization are fairly difficult to converge
• Avg. block size: 7817 cells; varies from 323 to 43298
• Reasons for low utilization:
– At higher utilization, blocks are difficult to route and to converge on timing due to congestion
– Synthesis/placement not doing a good job?
– Space for ECOs: even assuming a generous 10% white space, 60% utilization may still be considered low
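The utilization figure on this and the following slide is simply total standard-cell area over block area, expressed as a percentage. A minimal sketch (the cell areas below are toy values, not design data):

```python
def placement_utilization(cell_areas, block_area):
    """Utilization (%) = sum of std. cell areas / block area * 100."""
    return 100.0 * sum(cell_areas) / block_area

# Toy example: four cells placed in a 100-unit block.
cells = [12.0, 8.0, 20.0, 11.7]
print(placement_utilization(cells, 100.0))  # 51.7, near the 51.69% RLS average
```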
Utilization in SDP
[Scatter plot: Placement Utilization (%) vs. Cell Count; Utilization = std. cell area / block area]
• Utilization varies from 0.07% to 74%; avg. utilization: 40.40%
• Avg. block size: 7542 cells
• The SDP layouts are not denser than RLS; reasons:
– Routing congestion caused "artificially" by the hierarchies
– Even with flat routing, it is not clear why, and by how much, the congestion/utilization may improve (net ordering problem)
– Matching bit-widths?
– ???
Agenda
• Introduction
• Impact on Timing
• Impact on Power
• Conclusions
Impact of Interconnects on Timing
• For max timing, interconnects contribute in terms of:
– Wire delay
– Slope degradation (slows down receivers)
– Cell-delay degradation (extra capacitance to drive)
– The cumulative effect of the above three on path delays
– Delays due to repeaters (inserted for timing/slope/noise)
• Three metrics chosen on the worst internal paths:
– Wire delay
– Interconnect impact (obtained by setting R = C = 0)
– Repeater delay
• Why internal paths: excludes the effect of timing constraints on primary I/Os in synthesis flows (RLS) and manual design (SDP)
• Why worst paths: they determine the operating frequency
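The three metrics above can all be expressed as percentages of cycle time. A hedged sketch of how the slides that follow report them, assuming the per-path delays and slacks have already been pulled from a timing report (the function and its inputs are illustrative, not the study's tooling):

```python
def timing_metrics(cycle_time, wire_delay, repeater_delay,
                   slack_actual, slack_ideal_rc0):
    """Express the three interconnect metrics as % of cycle time.

    slack_ideal_rc0 is the slack re-computed with R = C = 0 on local
    wires; the difference vs. the actual slack captures wire delay
    plus slope and cell-delay degradation combined.
    """
    return {
        "wire_delay_pct": 100.0 * wire_delay / cycle_time,
        "interconnect_impact_pct":
            100.0 * (slack_ideal_rc0 - slack_actual) / cycle_time,
        "repeater_delay_pct": 100.0 * repeater_delay / cycle_time,
    }

# Illustrative numbers in ns (not from the study), chosen to land on
# the RLS averages reported later: 6%, 13%, and 19%.
m = timing_metrics(cycle_time=1.0, wire_delay=0.06, repeater_delay=0.19,
                   slack_actual=0.02, slack_ideal_rc0=0.15)
print(m)
```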
Wire Delay on Worst Paths in RLS Blocks
[Scatter plot: Wire delay % vs. cell count]
• Varies from 0 to 26% of cycle time
• Average wire delay: 6%
• Excludes repeater delay and cell-delay/slope degradation
Wire Delay on Worst Paths in SDP Blocks
[Scatter plot: Wire delay % vs. cell count]
• Varies from 0 to 30%
• Average wire delay: 5%
• Several blocks with zero wire delay on the internal critical path imply careful design
• Excludes repeater delay and cell-delay/slope degradation
Wire Delay vs. Slack for RLS Blocks
[Scatter plot: Wire delay % vs. slack]
• The wire delay component increases as slack decreases
• Critical paths are the interconnect-dominant ones
Wire Delay vs. Slack for SDP Blocks
[Scatter plot: Wire delay % vs. slack]
• The wire delay component increases as slack decreases
• Critical paths are the interconnect-dominant ones
Interconnect Delay Contribution on Internal Paths in RLS Blocks
[Scatter plot: Slack difference % vs. slack]
• How much would timing improve if R = C = 0 for local interconnects?
• Measured as the slack difference on the worst internal paths obtained by setting R = C = 0
– Includes the cumulative effect of wire delay, slope, and cell-delay degradation
• Varies from 0 to 27%; average 13%
• Average impact is slightly more than twice the average wire delay
• Excludes repeater delay
Interconnect Delay Contribution on Internal Paths in SDP Blocks
[Scatter plot: Slack difference % vs. slack]
• How much would timing improve if R = C = 0 for local interconnects?
• Slack difference varies from 0 to 40%; average 9%
– The smaller average implies that for many blocks the worst internal paths were cell-delay dominated (consistent with the wire delay slide for SDP)
• Average impact is close to twice the average wire delay
• Excludes repeater delay
Repeater Count in RLS Blocks
[Scatter plot: # of repeaters vs. # of cells]
• Varies almost linearly with block size; repeater count varies from 183 to 21315
• Out of 641002 cells, 176205 (27.48%) are inverters and 106346 (16.59%) are buffers
• Inverters/buffers contribute ~44% of the cell count
• Synthesis possibly did not do a great job
Repeater Count in SDP Blocks
[Scatter plot: # of inv./buf. vs. # of cells]
• Increases with cell count, but the spread is larger than in RLS
– Depends on how different DEs do schematics and buffer insertion
– # of buffers does not necessarily increase as linearly with cell count as in RLS; DEs used them sparingly compared to tools
• Buffer count varies from 0 to 14089
• Out of 770306 cells, 177037 (22.98%) are inverters and 68069 (8.83%) are buffers
• Inverters/buffers contribute ~31% of the cell count; 13% better than RLS
Repeater Delay in RLS Blocks
[Scatter plot: Repeater delay % vs. cell count]
• Varies from 0 to 45%
• Average repeater delay: 19%
• Includes both inverter and buffer delay
Repeater Delay in SDP Blocks
[Scatter plot: Repeater delay % vs. cell count]
• Varies from 0 to 38%
• Average repeater delay: 11%
• Includes both inverter and buffer delay
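Why repeater delay ends up rivaling wire delay: distributed RC wire delay grows quadratically with length, and inserting k repeaters trades that quadratic term for k fixed repeater delays. A sketch using the textbook Elmore approximation (the per-unit r and c, repeater delay, and wire length below are assumed 45 nm-flavored values, not numbers from the study):

```python
def wire_delay_ns(r, c, length_mm, repeaters=0):
    """Distributed RC delay, 0.38 * r * c * L^2, split into equal
    segments by ideally spaced repeaters. r in ohm/mm, c in fF/mm;
    the 1e-6 factor converts fF*ohm to ns."""
    segments = repeaters + 1
    seg_len = length_mm / segments
    return segments * 0.38 * r * c * seg_len ** 2 * 1e-6

T_REP = 0.020  # assumed per-repeater delay, ns

def repeated_delay_ns(r, c, length_mm, repeaters):
    """Total delay: segmented wire delay plus the repeaters themselves."""
    return wire_delay_ns(r, c, length_mm, repeaters) + repeaters * T_REP

# Illustrative wire: 500 ohm/mm, 200 fF/mm, 2 mm long.
r, c, L = 500.0, 200.0, 2.0
unrepeated = repeated_delay_ns(r, c, L, 0)
best_k = min(range(20), key=lambda k: repeated_delay_ns(r, c, L, k))
print(f"unrepeated: {unrepeated:.3f} ns, "
      f"best with {best_k} repeaters: {repeated_delay_ns(r, c, L, best_k):.3f} ns")
```

With these assumed values the repeatered wire is substantially faster, but a large share of the remaining path delay is the repeaters themselves, which is consistent with the 19% (RLS) and 11% (SDP) averages above.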
Summary of Observations So Far
• Interconnect delay dominates regardless of design style
• Secondary effects, slope and cell-delay degradation, are as big as wire delay
• Repeater count is more than 40% of cells and linear in block size
• Repeater delay contributes as much as the wires themselves
• SDP design, with more manual control, is better than synthesis