
Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing Balancing

Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing Balancing. Iris Hui-Ru Jiang, Ya-Chu Chang, Tung-Wei Lin, Gi-Joon Nam. Outline: Introduction; Preliminaries and problem formulation; Effective mean shift


  1. Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing Balancing. Iris Hui-Ru Jiang, Ya-Chu Chang, Tung-Wei Lin, Gi-Joon Nam

  2. Outline: Introduction; Preliminaries and problem formulation; Effective mean shift; Experimental results; Conclusion

  3. Why Register Clustering?
     ⚫ Dynamic power: clock power dominates.
     ⚫ Reduce the switching capacitance in a clock network.
       – Clock sinks. Clock power saving: #leaves (register capacitance). Other benefits: shared clocking circuitry; smaller area.
       – Clock network. Clock power saving: #leaves and depth (wirelength, clock buffers). Other benefits: simpler topology and easier skew control.
     [Figure: two clock trees, each drawn from its clock root, labeled 3 C_FF and 8 C_FF]

  4. Two Register Cluster Designs
     ⚫ Rigid cell
       – Discrete bits: 1, 2, 4, 8, 16, 32, 64
     ⚫ Flexible template
       – Structured latch template: 1, 2, 3~4, 5~8, 9~16, 17~32, 33~64
     [Figure: a single-bit flip-flop and a dual-bit flip-flop, each built from master and slave latches]

  5. Prior Work (1/3)
     ⚫ In-placement or post-placement
     Design flow: Logic synthesis → Timing-driven placement → Register clustering → Legalization → Clock tree synthesis → Routing → Tape out
     Source: IBM

  6. Prior Work (2/3)
     ⚫ Clique partitioning
       – Constructs a clustering compatibility graph based on timing-feasible regions
       – Extracts maximal cliques to form multi-bit registers without timing degradation
     ⚫ Up-to-date: [Seitanidis+, DAC-17]
       – Clique enumeration + ILP
       – High complexity
       – Scalability issues for large-scale designs
     [Figure: compatibility graph with nodes 1 to 8]

  7. Prior Work (3/3)
     ⚫ K-means
       – Relaxes timing constraints to maximum displacement constraints
       – Starts with a prespecified number of clusters and initial cluster centers
       – Assigns registers to the nearest clusters iteratively until convergence
     ⚫ State-of-the-art: Weighted K-means [Wu+, DAC-16]
       – Is sensitive to initialization and to outliers (registers distant from the others)
       – Tends to form large clusters (nearly the maximum allowable bits)
       – May move outliers far away
       – Needs extra processing to fix over-displacement and size overflow
     ⚫ Up-to-date: Capacitated K-means + ILP [Kahng+, ICCAD-16]

  8. Investigations
     ⚫ Creating large clusters or dragging outliers far away severely disrupts the placement and thus incurs significant timing degradation
       – The more timing degradation, the more ECO effort.
     ⚫ We can save power even if only a few registers are clustered
     [Figure: placement with Macro1 to Macro8; legend: outlier, I/O pin, clusterable registers]

  9. What’s a Good Register Clustering Algorithm?
     ⚫ 1) Requires no prespecified number of clusters
     ⚫ 2) Is insensitive to initialization
     ⚫ 3) Is robust to outliers
     ⚫ 4) Is tolerant of various register distributions
     ⚫ 5) Is efficient and scalable
     ⚫ 6) Balances power and timing

  10. Our Contributions
      ⚫ Propose effective mean shift to perform graceful register clustering, reducing clock power while minimizing timing degradation
      ⚫ Augment classic mean shift with special treatments for register clustering to attain these goals
      ⚫ Key idea: Conceptually, clusters are expected to reside in dense regions of registers. Our idea is to direct registers towards their nearest densest spots so that clusters form naturally.

  11. Outline: Introduction; Preliminaries and problem formulation; Effective mean shift; Experimental results; Conclusion

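  12. Classic Mean Shift
      ⚫ Generate a density surface
        $$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$$
      ⚫ Iteratively shift each point uphill
        $$m(x) = \frac{\sum_{i=1}^{n} x_i \, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x$$
      ⚫ Time complexity is $O(Tn^2)$: $T$ iterations, $n$ points
      [Figure: X-Y plot marking data points, an outlier, and cluster peaks]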
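
The update on this slide can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: 2-D points, a Gaussian profile with g(u) proportional to exp(-u/2), and hypothetical names (`mean_shift_step`, `classic_mean_shift`, `tol`, `max_iter`) that do not come from the slides.

```python
import math

def mean_shift_step(x, points, h):
    """One classic mean shift update: returns x + m(x) for a 2-D point x."""
    num_x = num_y = den = 0.0
    for px, py in points:
        # u = ||(x - x_i) / h||^2; Gaussian profile assumed: g(u) ~ exp(-u / 2)
        u = ((x[0] - px) ** 2 + (x[1] - py) ** 2) / (h * h)
        w = math.exp(-0.5 * u)
        num_x += w * px
        num_y += w * py
        den += w
    return (num_x / den, num_y / den)

def classic_mean_shift(points, h, tol=1e-6, max_iter=500):
    """Shift every point uphill until it moves less than tol.

    Every update visits all n points, so T iterations cost O(T n^2).
    """
    modes = []
    for p in points:
        y = p
        for _ in range(max_iter):
            y_next = mean_shift_step(y, points, h)
            moved = math.dist(y, y_next)
            y = y_next
            if moved < tol:
                break
        modes.append(y)
    return modes
```

Points that end at (nearly) the same coordinates share a density peak and can be treated as one cluster.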

  13. Problem Formulation
      Inputs: initial placement, tech file, register library
      Design flow: Logic synthesis → Timing-driven placement → Register clustering → Legalization → Clock tree synthesis → Routing → Tape out
      Register clustering:
        – Min. #clusters
        – Min. displacement (Manhattan)
        – s.t. the cluster size constraint and the max. displacement constraints
      Outputs: timing report, clock tree report

  14. Outline: Introduction; Preliminaries and problem formulation; Effective mean shift; Experimental results; Conclusion

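  15. Classic vs. Adaptive vs. Effective
      The register distribution is mapped to a density surface. Dense regions form hills.
      [Pipeline: Set K-NN → Set Bandwidth → Shift → Local max → Cluster]
      Density estimator
        – Classic mean shift:
          $$\frac{1}{n h^d} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$$
        – Adaptive mean shift (variable bandwidth):
          $$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^d} \, k\left(\frac{x - x_i}{h_i}\right)$$
        – Effective mean shift:
          $$\frac{1}{n} \sum_{i \in \mathrm{KNN}'(x)} \frac{1}{h_i^d} \, k\left(\frac{x - x_i}{h_i}\right)$$
      Shift point
        – Classic mean shift:
          $$\frac{\sum_{i=1}^{n} x_i \, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}$$
        – Adaptive mean shift (variable bandwidth):
          $$\frac{\sum_{i=1}^{n} \frac{x_i}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}{\sum_{i=1}^{n} \frac{1}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}$$
        – Effective mean shift:
          $$\frac{\sum_{i \in \mathrm{KNN}'(x)} \frac{x_i}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}{\sum_{i \in \mathrm{KNN}'(x)} \frac{1}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}$$
      Notes
        1. $k(x) = \kappa(\|x\|^2)$, Gaussian kernel
        2. $g(x) = -\kappa'(x)$
        3. $d = 2$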

  16. Overview
      Inputs: initial placement, tech file, register library
      Design flow: Logic synthesis → Timing-driven placement → Register clustering → Legalization → Clock tree synthesis → Routing → Tape out
      Effective Mean Shift (the register clustering stage):
        – For each register:
            – Identifying effective neighbors
            – Setting timing-aware bandwidth
            – Constructing density surface
            – Shifting to local maximum
        – Clustering by local maxima
        – Relocating clusters and registers
      Outputs: timing report, clock tree report

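  17. Variable Bandwidth Selection
      [Pipeline: Set K-NN → Set Bandwidth → Shift → Local max → Cluster]
      Bandwidth:
        $$h_i = \min\left(h_{\max},\ \alpha \left\| x_i - x_{i,M} \right\|\right)$$
      Density estimator:
        $$\frac{1}{n} \sum_{i \in \mathrm{KNN}'(x)} \frac{1}{h_i^d} \, k\left(\frac{x - x_i}{h_i}\right)$$
      [Figure: registers with bandwidths $h_i$ and $h_j$; example with M=1]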
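
A rough sketch of the bandwidth rule on this slide, assuming that $x_{i,M}$ denotes the M-th nearest neighbor of register $i$ and that the norm is Euclidean (the slide does not state either explicitly). The function name `variable_bandwidth` and the brute-force neighbor search are illustrative choices, not the authors' implementation.

```python
import math

def variable_bandwidth(points, h_max, alpha, M=1):
    """h_i = min(h_max, alpha * distance from x_i to its M-th nearest neighbor).

    Assumes len(points) > M so that an M-th neighbor exists.
    """
    bandwidths = []
    for i, p in enumerate(points):
        # Brute-force distances to all other points, sorted ascending
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        bandwidths.append(min(h_max, alpha * dists[M - 1]))
    return bandwidths
```

Under these assumptions, a register in a dense area gets a small bandwidth, while an isolated register is capped at `h_max`.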

  18. Identifying Effective Neighbors
      ⚫ Points that correspond to the tails of the underlying density function receive small weights, so they are almost automatically discarded.
      ⚫ Consider only effective neighbors
      ⚫ Iteratively updating the effective neighbors may still be computationally intensive
      ⚫ Computing KNN only once
        – Neighbors barely change, so effective neighbors can be identified only once (at the beginning)
        – Analysis of distinct neighbors (K=140)

          Circuit       # of Iterations   # of Total Distinct Neighbors   # of Distinct Neighbors per Iteration
          Superblue16   213               158.25                          0.74
          Superblue18   315               158.09                          0.50
          Superblue10   533               156.13                          0.29

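  19. Setting K-Nearest Neighbors
      [Pipeline: Set K-NN → Set Bandwidth → Shift → Local max → Cluster]
      Shift point:
        $$\frac{\sum_{i \in \mathrm{KNN}'(x)} \frac{x_i}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}{\sum_{i \in \mathrm{KNN}'(x)} \frac{1}{h_i^{d+2}} \, g\left(\left\|\frac{x - x_i}{h_i}\right\|^2\right)}$$
        – Constraint: maximum displacement
      [Figure: K=12 example showing the $h_{\max}$ range, an excluded neighbor, and an ignored register]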
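
One way to picture the neighbor selection on slides 18 and 19 is the sketch below, which computes each register's neighbor list once, up front. It assumes Manhattan distance for both ranking and the displacement cap, and a brute-force O(n²) search for brevity; the slides specify neither. The names `manhattan`, `effective_neighbors`, and `max_disp` are hypothetical.

```python
def manhattan(p, q):
    """Manhattan distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def effective_neighbors(points, K, max_disp):
    """For each point, indices of up to K nearest neighbors within max_disp.

    The lists are built once and can be reused in every shift iteration.
    """
    neighbors = []
    for i, p in enumerate(points):
        cand = sorted(
            (manhattan(p, q), j)
            for j, q in enumerate(points)
            if j != i and manhattan(p, q) <= max_disp
        )
        neighbors.append([j for _, j in cand[:K]])
    return neighbors
```

A register with no other register inside the displacement range gets an empty list here, which loosely corresponds to an isolated register being left out of clustering.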

  20. Shifting to Local Density Maxima
      ⚫ Each register undergoes the following steps to seek the local density maximum
        1. Set the initial coordinates, $y_j^0 = x_j$, $j = 1..n$
        2. Identify effective neighbors, $\mathrm{KNN}'(y_j^0)$; set bandwidth $h_j$
           ◼ Then, the density surface is formed
        3. Compute the mean shift vector $m(y_j^t)$
        4. Shift each register,
           $$y_j^{t+1} = y_j^t + m(y_j^t) = \frac{\sum_{i \in \mathrm{KNN}'(y_j^0)} \frac{x_i}{h_i^{d+2}} \, g\left(\left\|\frac{y_j^t - x_i}{h_i}\right\|^2\right)}{\sum_{i \in \mathrm{KNN}'(y_j^0)} \frac{1}{h_i^{d+2}} \, g\left(\left\|\frac{y_j^t - x_i}{h_i}\right\|^2\right)}$$
        5. Iterate steps 3 and 4 until convergence, $\left\| y_j^{t+1} - y_j^t \right\| < \delta$
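
The five steps above might look like this for a single register. It is a sketch only: it assumes a Gaussian profile with g(u) proportional to exp(-u/2), d = 2, precomputed inputs, and that `neighbors[j]` contains `j` itself so the denominator stays positive. The name `shift_to_local_maximum` and its parameters are hypothetical.

```python
import math

def shift_to_local_maximum(j, points, neighbors, h, delta=1e-4, max_iter=1000):
    """Move register j uphill on the density surface built from its neighbors."""
    d = 2
    y = points[j]        # step 1: y_j^0 = x_j
    idx = neighbors[j]   # step 2: KNN'(y_j^0), identified once and then reused
    for _ in range(max_iter):
        num_x = num_y = den = 0.0
        for i in idx:
            xi = points[i]
            u = ((y[0] - xi[0]) ** 2 + (y[1] - xi[1]) ** 2) / (h[i] ** 2)
            w = math.exp(-0.5 * u) / h[i] ** (d + 2)   # g(u) / h_i^(d+2)
            num_x += w * xi[0]
            num_y += w * xi[1]
            den += w
        y_next = (num_x / den, num_y / den)   # steps 3-4: y^(t+1) = y^t + m(y^t)
        if math.dist(y, y_next) < delta:      # step 5: convergence test
            return y_next
        y = y_next
    return y
```

Because the neighbor list is fixed, each iteration costs time proportional to the list length rather than to the total number of registers.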

  21. Clustering by Local Density Maxima
      [Pipeline: Set K-NN → Set Bandwidth → Shift → Local max → Cluster]
      ⚫ Compensate for the approximation error of KNN
      [Figure: clustering results with (a) small threshold, (b) medium threshold, (c) large threshold]
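
The slide does not spell out how the threshold is applied. One plausible reading is that registers whose converged positions fall within a distance threshold of each other are grouped into one cluster. The greedy grouping below is an assumption made for illustration, and `cluster_by_modes` is a hypothetical name.

```python
import math

def cluster_by_modes(modes, threshold):
    """Greedy grouping: a register joins the first cluster whose representative
    (the first converged position assigned to it) lies within threshold."""
    reps = []     # one representative position per cluster
    labels = []   # cluster index for each register
    for m in modes:
        for c, r in enumerate(reps):
            if math.dist(m, r) <= threshold:
                labels.append(c)
                break
        else:
            reps.append(m)
            labels.append(len(reps) - 1)
    return labels
```

In this sketch a larger threshold merges nearby peaks into fewer, larger clusters, and a smaller one keeps them apart.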

  22. Relocation for Timing and Displacement
      ⚫ The previous steps in effective mean shift can be viewed as seeking the locations of clusters
      ⚫ Reassign registers and relocate clusters to improve timing and displacement
        – Manhattan distance
      ⚫ Relocate each cluster to the median coordinate of its register members to minimize displacement and reduce timing degradation
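
The median relocation step can be sketched as follows. The per-axis median minimizes the total Manhattan distance to the members, which is why it suits a Manhattan displacement objective. The name `relocate_clusters` and the input layout (a list of points plus a parallel list of cluster labels) are assumptions for illustration.

```python
from statistics import median

def relocate_clusters(points, labels):
    """Place each cluster at the per-axis median of its member registers."""
    members = {}
    for p, c in zip(points, labels):
        members.setdefault(c, []).append(p)
    return {
        c: (median(x for x, _ in ps), median(y for _, y in ps))
        for c, ps in members.items()
    }
```

Unlike the mean, the median is not pulled toward a distant member, so one far-away register does not drag the whole cluster.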

  23. Complexity Analysis
      ⚫ Classic mean shift: $O(Tn^2)$, $T$ iterations, $n$ registers
      ⚫ Effective mean shift: $O(TKn + Cn)$, $K$ effective neighbors, $C$ clusters
        – Shifting to local density maxima: $O(TKn)$ time, $K \ll n$
        – Register reassignment and cluster relocation: $O(Cn)$ time, $C \ll n$
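
To get a feel for the gap, here is a worked example with illustrative numbers. $K = 140$ matches the setting on slide 18 and $T = 300$ is in the range of the iteration counts listed there, while $n = 10^5$ and $C = 10^4$ are assumed values, not figures from the slides.

```latex
\begin{align*}
\text{Classic: } & T n^2 = 300 \times (10^5)^2 = 3 \times 10^{12} \\
\text{Effective: } & T K n + C n = 300 \times 140 \times 10^5 + 10^4 \times 10^5
  = 4.2 \times 10^{9} + 1.0 \times 10^{9} = 5.2 \times 10^{9} \\
\text{Ratio: } & \frac{3 \times 10^{12}}{5.2 \times 10^{9}} \approx 577
\end{align*}
```

Under these assumed values the operation count drops by roughly two to three orders of magnitude, before any constant factors are considered.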

  24. Parallelization
      For each register: identifying effective neighbors, setting timing-aware bandwidth, constructing density surface, shifting to local maximum
      Stages after Start: Set KNN, Set Bandwidth, Shift to Local Maximum
      Thread assignment in each stage:
        Thread0: Reg. 8m
        Thread1: Reg. 8m+1
        Thread2: Reg. 8m+2
        Thread3: Reg. 8m+3
        Thread4: Reg. 8m+4
        Thread5: Reg. 8m+5
        Thread6: Reg. 8m+6
        Thread7: Reg. 8m+7
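
The interleaved assignment on this slide (thread t handles registers 8m + t) can be sketched with Python's standard `concurrent.futures` module. The function `process_in_parallel` and the `work` callback are hypothetical. Note that CPython threads do not speed up CPU-bound pure-Python work because of the global interpreter lock, so this only shows the partitioning pattern.

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_parallel(num_registers, work, num_threads=8):
    """Thread t handles registers t, t + 8, t + 16, ... (that is, register 8m + t)."""
    results = [None] * num_registers

    def run(t):
        # Each thread writes only to its own indices, so no locking is needed
        for k in range(t, num_registers, num_threads):
            results[k] = work(k)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Consume the iterator so any exception raised in a worker is surfaced
        list(pool.map(run, range(num_threads)))
    return results
```

Per-register work is independent once the neighbor lists and bandwidths are fixed, which is what makes this kind of static partitioning possible.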

  25. Outline: Introduction; Preliminaries and problem formulation; Effective mean shift; Experimental results; Conclusion
