1. Impact of Local Interconnects on Timing and Power in a High Performance Microprocessor
Marek Patyra, Enterprise Microprocessor Group, Intel Corporation, Hillsboro, OR
Rupesh S. Shelar, Low Power IA Group, Intel Corporation, Austin, TX
ISPD 2010, San Francisco, CA

2. Objective
• To convey the severity of the delay/power impact and the challenges it presents to physical design

3. Agenda
• Introduction
• Impact on Timing
• Impact on Power
• Conclusions

4. Why Look at Interconnects Closely
• Unlike transistors, they do not perform computation; they just transfer information from one place to another
• Paying power/timing cost for interconnects yields nothing, unlike that for transistors
• Secondary effects: they cause area growth, delay penalty, and yield issues indirectly due to routing congestion

5. Motivation I: Interconnect Delay
• Interconnects are known to contribute significantly to path delays
• For intra-block paths, exact numbers are probably not known, as these vary with block size and design style
• Many academic studies (Keutzer, Horowitz, Cong, Saraswat, Saxena) exist (and thousands of papers open their introduction with "interconnect delay scaling...")
• Most are based on a combination of some (small) design data and simplistic scaling assumptions, and do not focus solely on data from a real design, for example, a high performance microprocessor core

6. Motivation II: Power in Local Interconnects
• More than 70% of power is in datapath and control logic blocks
• 60% of the total power is dynamic/glitch power
  – 66% of the total dynamic power is in local, i.e., intra-block, interconnects (source: SLIP'04 paper, based on a microprocessor study)
• Still, relatively little attention is paid to power dissipation in interconnects

7. About Data
• Delay/power data from blocks in a high performance microprocessor core [Kumar et al., JSSC 2008] in 45 nm technology
• Blocks implemented using different design styles:
  – RTL-to-Layout Synthesis (RLS), aka random logic synthesis
    • Mostly automatic (using vendor/in-house tools); write RTL, partition, and run tools/flows
    • Design quality determined by algorithms, tools, flows, parameters; supposedly poor utilization, or sparse layouts
  – Structured Data Paths (SDP)
    • Mostly manual; extract regularity using hierarchies, draw schematics, hierarchical placement and routing
    • Routing can be done flat; supposedly high utilization, or dense layouts
• RLS (SDP): 86 (133) blocks; cell count more than 600 (700) K
• Local interconnects:
  – RLS uses mostly M2 to M5, mostly minimum width, flat routing
  – SDP uses M2 to M7, different widths, hierarchical routing
• Delay/power impact due to interconnects inside standard cells is counted as cell delay/power contribution in this study

8. Utilization in RLS
• Avg. utilization: 51.69%; varies from 7% to 78%
  – Utilization varies significantly for blocks with < 5000 cells, possibly because of the floorplan; for blocks with > 15000 cells, it varies between 40 and 70%
  – Blocks with utilization higher than 70% are fairly difficult to converge
• Avg. block size: 7817 cells; varies from 323 to 43298
• Reasons for low utilization:
  – Difficult to route and converge timing due to congestion, if the utilization is higher
  – Synthesis/placement not doing a good job?
  – Space for ECOs: even if we assume a generous 10% white space, 60% utilization may still be considered low
[Scatter plot: Placement Utilization (%) vs. # of Cells; Utilization = std. cell area / block area]
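The utilization metric defined on this slide (standard-cell area divided by block area) can be sketched as a small helper; the cell areas and block area below are made-up numbers for illustration, not data from the study.

```python
def placement_utilization(cell_areas, block_area):
    """Placement utilization (%) as defined on the slide:
    total standard-cell area divided by block area."""
    return 100.0 * sum(cell_areas) / block_area

# Hypothetical block: four cells placed in a 100-unit block area
util = placement_utilization([12.0, 8.0, 20.0, 11.69], 100.0)  # ~51.69%
```

A block reporting 51.69% here would sit right at the RLS average quoted on the slide.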

9. Utilization in SDP
• Utilization varies from 0.07% to 74%; avg. utilization: 40.40%
• Avg. block size: 7542 cells
• The SDP layouts are not denser than RLS; reasons:
  – Routing congestion caused "artificially" by the hierarchies
  – Even with flat routing, it is not clear why, and by how much, the congestion/utilization may improve (net ordering problem)
  – Matching bit-widths?
  – ???
[Scatter plot: Placement Utilization (%) vs. Cell Count; Utilization = std. cell area / block area]

10. Agenda
• Introduction
• Impact on Timing
• Impact on Power
• Conclusions

11. Impact of Interconnects on Timing
• For max timing, interconnects contribute in terms of:
  – Wire delay
  – Slope degradation (slows down receivers)
  – Cell-delay degradation (extra capacitance to drive)
  – Cumulative effect of the above three on path delays
  – Delays due to repeaters (inserted for timing/slope/noise)
• Three metrics chosen on the worst internal paths:
  – Wire delay
  – Interconnect impact (obtained by setting R = C = 0)
  – Repeater delay
• Why internal paths: to exclude the effect of timing constraints on primary I/Os in the synthesis flows (RLS)/manual design (SDP)
• Why worst paths: they determine the operating frequency
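The wire-delay component that the following slides report as a fraction of cycle time comes from a production timing flow; as a first-order illustration only, a distributed RC wire's delay can be sketched with the classic Elmore model. This is not the study's actual model, and all parameter values below are hypothetical.

```python
def elmore_wire_delay(r_per_um, c_per_um, length_um, c_load):
    """First-order Elmore delay (seconds) of a distributed RC wire
    driving a lumped load: R*(C/2) for the wire itself plus R*C_load."""
    R = r_per_um * length_um   # total wire resistance (ohms)
    C = c_per_um * length_um   # total wire capacitance (farads)
    return R * (C / 2.0 + c_load)

# Hypothetical local wire: 1 ohm/um, 0.2 fF/um, 100 um long, 1 fF load
d = elmore_wire_delay(1.0, 0.2e-15, 100.0, 1e-15)  # on the order of 1 ps
```

The quadratic dependence on length (both R and C grow with it) is why local wires left unrepeated can eat a double-digit share of the cycle time.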

12. Wire Delay on Worst Paths in RLS Blocks
• Varies from 0 to 26% of cycle time
• Average wire delay: 6%
• Excludes repeater delay and cell-delay/slope degradation
[Scatter plot: Wire delay % vs. Cell count]

13. Wire Delay on Worst Paths in SDP Blocks
• Varies from 0 to 30%
• Average wire delay: 5%
• Several blocks with 0 wire delay on the internal critical path imply careful design
• Excludes repeater delay and cell-delay/slope degradation
[Scatter plot: Wire delay % vs. Cell count]

14. Wire Delay vs. Slack for RLS Blocks
• Wire-delay component increases as slack decreases
• Critical paths are the interconnect-dominant ones
[Scatter plot: Wire delay % vs. Slack]

15. Wire Delay vs. Slack for SDP Blocks
• Wire-delay component increases as slack decreases
• Critical paths are the interconnect-dominant ones
[Scatter plot: Wire delay % vs. Slack]

16. Interconnect Delay Contribution on Internal Paths in RLS Blocks
• How much would the timing improve if R = C = 0 for local interconnects?
• Measured as the slack difference on the worst internal paths by setting R = C = 0
  – Includes the cumulative effect of wire delay, slope, and cell-delay degradation
• Varies from 0 to 27%; average 13%
• Average impact is slightly more than twice the average wire delay
• Excludes repeater delay
[Scatter plot: Slack difference % vs. Slack]

17. Interconnect Delay Contribution on Internal Paths in SDP Blocks
• How much would the timing improve if R = C = 0 for local interconnects?
• Slack difference varies from 0 to 40%
• Average slack difference: 9%
  – The smaller average implies that for many blocks the worst internal paths were cell-delay dominated (consistent with the wire-delay slide for SDP)
• Average impact is close to twice the average wire delay
• Excludes repeater delay
[Scatter plot: Slack difference % vs. Slack]
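The "interconnect impact" metric on slides 16 and 17 is the improvement in worst-path slack when local interconnect is idealized to R = C = 0, reported as a percentage of cycle time. A minimal sketch, with hypothetical slack and cycle-time numbers:

```python
def interconnect_impact_pct(slack_nominal, slack_rc_zero, cycle_time):
    """Slack difference, as % of cycle time, between the nominal design
    and the same design with local interconnect R = C = 0."""
    return 100.0 * (slack_rc_zero - slack_nominal) / cycle_time

# Hypothetical block: 0.5 ns cycle time, worst internal path at -0.02 ns
# slack nominally, improving to +0.05 ns with ideal local wires
impact = interconnect_impact_pct(-0.02, 0.05, 0.5)  # ~14% of cycle time
```

A 14% figure would sit near the 13% RLS average quoted on slide 16; note the metric folds in slope and cell-delay degradation, which is why it runs about twice the bare wire delay.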

18. Repeater Count in RLS Blocks
• Varies almost linearly with block size; repeater count varies from 183 to 21315
• Out of 641002 cells, 176205 (27.48%) are inverters and 106346 (16.59%) are buffers
• Inverters/buffers contribute ~44% of the cell count
• Synthesis possibly did not do a great job
[Scatter plot: # of Repeaters vs. # of Cells]

19. Repeater Count in SDP Blocks
• Increases with cell count, but the spread is larger than in RLS
  – Depends on how different DEs do schematics and buffer insertion
  – # of buffers not necessarily increasing as linearly with cell count as in RLS; DEs used them sparingly compared to tools
• Buffer count varies from 0 to 14089
• Out of 770306 cells, 177037 (22.98%) are inverters and 68069 (8.83%) are buffers
• Inverters/buffers contribute ~31% of the cell count; 13% better than RLS
[Scatter plot: # of Inv./Buf. vs. # of Cells]
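The inverter/buffer fractions on slides 18 and 19 follow directly from the cell counts quoted there; a quick arithmetic check:

```python
def repeater_share_pct(inverters, buffers, total_cells):
    """% of a design's cells that are repeaters (inverters + buffers)."""
    return 100.0 * (inverters + buffers) / total_cells

rls = repeater_share_pct(176205, 106346, 641002)  # counts from slide 18
sdp = repeater_share_pct(177037, 68069, 770306)   # counts from slide 19
# rls is ~44.1%, sdp is ~31.8%: the roughly 13-point gap slide 19 cites
```

The gap reflects designers inserting buffers sparingly by hand, versus synthesis tools inserting them liberally.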

20. Repeater Delay in RLS Blocks
• Varies from 0 to 45%
• Average repeater delay: 19%
• Includes both inverter and buffer delay
[Scatter plot: Repeater delay % vs. Cell count]

21. Repeater Delay in SDP Blocks
• Varies from 0 to 38%
• Average repeater delay: 11%
• Includes both inverter and buffer delay
[Scatter plot: Repeater delay % vs. Cell count]

22. Summary of Observations So Far
• Interconnect delay dominates regardless of design style
• Secondary effects (slope and cell-delay degradation) are as big as wire delay
• Repeater count is more than 40% of cells and linear in block size
• Repeater delay contributes as much as the wires themselves
• SDP design, with more manual control, does better than synthesis
