Analytical Clustering Score with Application to Post-Placement Multi-Bit Flip-Flop Merging
Chang Xu 1, Peixin Li 1, Guojie Luo 1, Yiyu Shi 2, and Iris Hui-Ru Jiang 3
{changxu, gluo}@pku.edu.cn
Outline
  Background
    • Multi-bit flip-flop
    • Previous works and limitation
  Our method
    • Analytical score
    • Discrete refinement
    • Efficient implementation
  Experimental Results
  Conclusion
Clock Power Optimization
  Clock power predominates dynamic power
    $P_{clk} = \alpha C_{clk} V_{dd}^{2} f_{clk}$
  Clock power optimization
    Reduce $\alpha$
      • Clock gating technique
    Reduce $V_{dd}$
      • Sub-threshold voltage
      • Multi-supply-voltage
    Reduce $C_{clk}$
      • Multi-bit flip-flop
      • Resonance clock
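The clock power relation above can be checked numerically. The sketch below only evaluates the formula; the function name and all numeric values are hypothetical and not taken from the slides.

```python
def clock_power(alpha: float, c_clk: float, v_dd: float, f_clk: float) -> float:
    """Dynamic clock power: P_clk = alpha * C_clk * V_dd^2 * f_clk."""
    return alpha * c_clk * v_dd ** 2 * f_clk


# Hypothetical operating point (illustrative values only).
base = clock_power(alpha=0.5, c_clk=2e-12, v_dd=1.0, f_clk=1e9)

# Power is linear in C_clk, which is the term MBFF merging targets,
# and quadratic in V_dd.
half_cap = clock_power(alpha=0.5, c_clk=1e-12, v_dd=1.0, f_clk=1e9)
half_vdd = clock_power(alpha=0.5, c_clk=2e-12, v_dd=0.5, f_clk=1e9)
```

Halving the clock capacitance halves the power, while halving the supply voltage quarters it.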
Multi-Bit Flip-Flop (MBFF)
  What is an MBFF?
    Several single-bit flip-flops (SBFFs) share common inverters in one MBFF cell
  [Figure: 2-bit flip-flop. Source: ICCAD'10, Chang et al.]
Multi-Bit Flip-Flop (MBFF)
  Power saving comes from
    • MBFF library (UMC 55nm process, Faraday cell library)
    • Simplified clock tree
  [Figure: (a) Common clock tree (b) Simplified clock tree with MBFF]
Using MBFF at Different Stages
  Design flow: Logic Synthesis, Placement, Timing Analysis, Post-placement Optimization, CTS, Routing
  Pre-placement MBFF clustering
    • SNUG'10 Chen et al.
  In-placement MBFF clustering
    • ISPD'13 Tsai et al., ICCAD'13 Hsu et al.
  Post-placement MBFF clustering
    • ICGCS'10 Yan and Chen
    • ICCAD'10 Chang et al.
    • ISPD'11 Jiang et al. (INTEGRA)
Post-Placement MBFF Clustering
  Input
    • Placement of FFs and other gates
    • Timing slacks
    • MBFF library
  Output
    • FF clusters (MBFF)
  Constraint
    • Timing constraint
  [Figure: an FF with its input pin, output pin, and TVFR]
Post-Placement MBFF Clustering
  Timing violation free region (TVFR)
  [Figure: the TVFR of an FF with its input pin and output pin; TVFR1, TVFR2, and a 2-bit FF]
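A minimal sketch of a TVFR overlap test follows. The slides do not state the TVFR geometry, so axis-aligned rectangles are an assumption made purely for illustration, and the `Rect` type and `intersect` helper are hypothetical names. The idea it illustrates is that two FFs can share a 2-bit cell only if some location lies in both of their regions.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned region (assumed shape of a TVFR for this sketch)."""
    xlo: float
    ylo: float
    xhi: float
    yhi: float


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the common region of two rectangles, or None if they are disjoint."""
    xlo, ylo = max(a.xlo, b.xlo), max(a.ylo, b.ylo)
    xhi, yhi = min(a.xhi, b.xhi), min(a.yhi, b.yhi)
    if xlo > xhi or ylo > yhi:
        return None
    return Rect(xlo, ylo, xhi, yhi)


tvfr1 = Rect(0.0, 0.0, 4.0, 4.0)
tvfr2 = Rect(2.0, 1.0, 6.0, 3.0)
merge_region = intersect(tvfr1, tvfr2)               # where a 2-bit FF could be placed
no_region = intersect(tvfr1, Rect(5.0, 5.0, 6.0, 6.0))
```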
Previous Works and Limitation
  Intersection graph-based searching [ICCAD'10]
  [Figure: TVFRs, the corresponding intersection graph, and a complete graph]
    • Time consuming: $O(N^3)$
    • Window-based acceleration affects power reduction
Previous Works and Limitation
  Interval graph-based searching [ISPD'11]
  [Figure: illustration of the interval graph, annotated "Random Choice!". Source: ISPD'11, Jiang et al.]
    • Efficient: sub-quadratic time complexity
    • Effective: best power reduction
    • Simple: signal wirelength degradation
Benchmarks: C1-C6 vs. IWLS 2005
  Difference
    • TVFD/AFFD: roughly estimates the number of FFs that can be covered within a TVFR
    • IWLS benchmarks have many more MBFF candidates!
  [Figure: FF ratio vs. TVFD/AFFD for C1-C6 and for Vga (IWLS 2005)]
  Signal wirelength degradation (for Integra)
    • C1-C6: Avg. 3%
    • IWLS: Avg. 932%
Our Contribution
  Efficient and highly scalable
    • Sub-quadratic time complexity
  Robust performance
    • Power reduction: comparable to Integra
    • Signal wirelength: much better than Integra, especially for real designs
  Analytical fashion
    • Potential integration in analytical global placement
    • Potential usage for clustering algorithms
Optimization Flow
  [Figure: optimization flow chart]
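Analytical Step: Basic Idea
  Optimization problem
    $\min \; \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
    $\text{s.t.} \; t(\mathbf{x}, \mathbf{y}) \le T$
    • $f_l(\mathbf{x}, \mathbf{y})$: signal wirelength, weighted-average WL [DAC'11]
    • $f_c(\mathbf{x}, \mathbf{y})$: #FF groups, nontrivial to formulate
  [Figure: TVFRs with the timing-constraint feasible region, a 2-bit group, and a 3-bit group]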
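For the wirelength term, a minimal one-dimensional sketch of the weighted-average (WA) model is given below. It follows the commonly stated form of the model, a smooth maximum minus a smooth minimum of the pin coordinates; the function name and the smoothing parameter `gamma` are assumptions, and the slides do not show which exact variant the authors use.

```python
import math
from typing import Sequence


def wa_wirelength_1d(coords: Sequence[float], gamma: float = 1.0) -> float:
    """Weighted-average approximation of max(coords) - min(coords) for one net.

    Exponents are shifted by the max/min coordinate for numerical stability;
    the shift cancels in each ratio, so the value is unchanged.
    """
    hi, lo = max(coords), min(coords)
    w_pos = [math.exp((c - hi) / gamma) for c in coords]
    w_neg = [math.exp(-(c - lo) / gamma) for c in coords]
    smooth_max = sum(c * w for c, w in zip(coords, w_pos)) / sum(w_pos)
    smooth_min = sum(c * w for c, w in zip(coords, w_neg)) / sum(w_neg)
    return smooth_max - smooth_min


# A two-pin net spanning 10 units in x: the smooth value approaches the
# exact span from below as gamma shrinks.
approx_span = wa_wirelength_1d([0.0, 10.0], gamma=0.5)
```

The full two-dimensional wirelength of a net would add the same expression over the y coordinates.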
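Analytical Step: Def. of Clustering Score
  Dirac delta function
    $\delta(w, z) = \begin{cases} 1 & (w = z) \\ 0 & (w \ne z) \end{cases}$
    $\delta(\|(x_i, y_i) - (x_j, y_j)\|, 0) = \begin{cases} 1 & \|(x_i, y_i) - (x_j, y_j)\| = 0 \\ 0 & \text{otherwise} \end{cases}$
  Cluster size
    $N_i(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{N} \delta(\|(x_i, y_i) - (x_j, y_j)\|, 0)$
  [Figure: TVFRs with $FF_j$, $N_j = 2$ and $FF_i$, $N_i = 3$]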
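The cluster-size definition can be evaluated directly on a small example. In this sketch, which uses a hypothetical helper name, the delta function reduces to an exact equality test on locations, and the sum over j includes the FF itself.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cluster_sizes(points: Sequence[Point]) -> List[int]:
    """N_i = number of FFs (including FF_i itself) located exactly at FF_i's position."""
    return [sum(1 for q in points if q == p) for p in points]


# Three FFs stacked at (0, 0), two at (1, 1), one alone at (2, 2).
locations = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
sizes = cluster_sizes(locations)
```

This mirrors the figure on the slide: an FF in the stack of three has cluster size 3, and an FF in the stack of two has cluster size 2.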
Analytical Step: Def. of Clustering Score
  Objective function: $f_c$ term
    • 4-bit group is most efficient
    $\min -f_c = -\max f_c = -\max \sum_{i=1}^{N} \delta(N_i(\mathbf{x}, \mathbf{y}), 4)$
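With the exact delta function, the clustering score simply counts the FFs that sit in a stack of exactly four. The sketch below is one way to evaluate it; the function name and the `target` parameter are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Point = Tuple[float, float]


def clustering_score(points: Sequence[Point], target: int = 4) -> int:
    """f_c = sum over FFs of delta(N_i, target), using exact location equality."""
    occupancy = Counter(points)  # location -> number of FFs placed there
    return sum(1 for p in points if occupancy[p] == target)


four_stack = [(0.0, 0.0)] * 4
two_stack = [(5.0, 5.0)] * 2
five_stack = [(9.0, 9.0)] * 5
score = clustering_score(four_stack + two_stack + five_stack)
```

Only the four FFs in the 4-bit stack contribute; the 2-bit and 5-bit stacks add nothing, which is why the exact score gives no gradient to follow and needs smoothing.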
Analytical Step: Smoothing
  Gaussian function
    $\delta(w, z) \approx D(w, z) = \exp\left( \frac{(w - z)^2 \ln \epsilon}{d_0^2} \right)$
    $D(w - z) = 1$ when $w = z$
    $D(w - z) < \epsilon$ when $|w - z| > d_0$
  [Figure: Gaussian function vs. Dirac delta function]
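A minimal sketch of the Gaussian smoothing, assuming the form D(w, z) = exp((w - z)^2 ln(eps) / d0^2). The function names are hypothetical, and using a separate width `d0_size` for the outer delta on the cluster size is an assumption; the slides do not say which widths are used where.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian_delta(w: float, z: float, d0: float = 1.0, eps: float = 0.01) -> float:
    """Smooth stand-in for delta(w, z): 1 at w == z, eps at |w - z| == d0."""
    return math.exp((w - z) ** 2 * math.log(eps) / d0 ** 2)


def smooth_cluster_sizes(points: Sequence[Point], d0: float, eps: float = 0.01) -> List[float]:
    """Smoothed N_i: each FF_j contributes D(dist(FF_i, FF_j), 0) instead of 0 or 1."""
    return [sum(gaussian_delta(math.dist(p, q), 0.0, d0, eps) for q in points)
            for p in points]


def smooth_score(points: Sequence[Point], d0: float, d0_size: float = 1.0,
                 eps: float = 0.01, target: float = 4.0) -> float:
    """Smoothed f_c: sum over FFs of D(N_i, target)."""
    return sum(gaussian_delta(n, target, d0_size, eps)
               for n in smooth_cluster_sizes(points, d0, eps))


stacked = [(0.0, 0.0)] * 4
spread = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
```

Four perfectly stacked FFs score 4, the same as with the exact delta, while four nearby FFs score somewhat less, so the score now varies continuously with the locations.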
Analytical Step: Effectiveness
  Attractive force & repelling force
  [Figure: PULL and PUSH forces acting around $FF_i$]
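One reading of the pull and push forces follows from differentiating the Gaussian-smoothed term D(N_i, 4) with respect to the cluster size N_i: the derivative is positive below 4 (gathering more FFs raises the score) and negative above 4 (shedding FFs raises the score). The sketch below checks those signs; it is an interpretation under the assumed Gaussian form, not a statement of the authors' derivation, and the function names are hypothetical.

```python
import math


def smoothed_delta(n: float, target: float = 4.0, d0: float = 1.0, eps: float = 0.01) -> float:
    """Assumed smoothed term D(n, target) = exp((n - target)^2 * ln(eps) / d0^2)."""
    return math.exp((n - target) ** 2 * math.log(eps) / d0 ** 2)


def d_smoothed_delta_dn(n: float, target: float = 4.0, d0: float = 1.0,
                        eps: float = 0.01) -> float:
    """Analytic derivative of D(n, target) with respect to the cluster size n."""
    return (2.0 * (n - target) * math.log(eps) / d0 ** 2
            * smoothed_delta(n, target, d0, eps))


pull = d_smoothed_delta_dn(3.0)   # cluster too small: score grows with n
rest = d_smoothed_delta_dn(4.0)   # at the target size: stationary point
push = d_smoothed_delta_dn(5.0)   # cluster too large: score grows as n shrinks
```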
Analytical Step: Preliminary Clusters
  [Figure: (a) Initial FFs' distribution (Init. Loc.) (b) FFs' distribution after analytical clustering (NLP Loc.); both axes range from 0 to 3500]
    • $f_c$: maximizes MBFF group numbers
    • $f_l$: pulls FFs towards their "optimal locations" in terms of WL
Discrete Step: Basic Idea
  Two-pass best-choice clustering
    • First-pass: discretization
    • Second-pass: refinement
  [Figure: FFs A to I. (a) Proximity relation after analytical step (b) Discrete clustering (first pass) (c) Discrete refinement (second pass) (d) Final MBFF groups]
Discrete Step: Two-Pass Best-Choice Clustering
  First-pass: extract proximity relation
    Bottom-up merging
    Priority queue
      • Tuple: $(FF_i, FF_j, d)$, $d = \mathrm{dist}(FF_i, FF_j)$
    Capacity constraint: 4-bit
  Second-pass: further refinement
    Improve the ratio of 4-bit groups
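The first pass can be sketched as bottom-up merging driven by a priority queue of (d, i, j) tuples with a 4-bit capacity. This is a simplified illustration, not the authors' implementation: it enumerates all pairs (a bin structure would restrict them), measures distance between the original FF locations rather than updated cluster positions, omits timing checks and the second pass, and uses hypothetical names throughout.

```python
import heapq
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def first_pass_clustering(points: Sequence[Point], capacity: int = 4,
                          max_dist: float = math.inf) -> List[List[int]]:
    """Merge the closest FF pairs first, never exceeding `capacity` bits per group."""
    parent = list(range(len(points)))
    size = [1] * len(points)

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Priority queue of (distance, i, j); the closest pair is popped first.
    heap = [(math.dist(points[i], points[j]), i, j)
            for i, j in combinations(range(len(points)), 2)]
    heap = [entry for entry in heap if entry[0] <= max_dist]
    heapq.heapify(heap)

    while heap:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri != rj and size[ri] + size[rj] <= capacity:
            parent[rj] = ri
            size[ri] += size[rj]

    groups: dict = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


ffs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]
clusters = first_pass_clustering(ffs, capacity=4, max_dist=2.0)
```

On this input the four FFs in the unit square form one 4-bit group and the remaining two form a 2-bit group.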
Discrete Step: Two-Pass Best-Choice Clustering
  [Figure: FFs A to I. (a) Proximity relation after analytical step (b) First-pass clustering (c) Second-pass clustering (d) Final MBFF groups. Priority-queue entries shown: S(C,D), S(I,E), S(G,F), S(H,F), S(E,G), S(I,H), S(A,B), S(A,C), S(I,E)]
MBFF Clusters
  [Figure: scatter plots of FF locations (Init. Loc., NLP Loc., Final Loc.); both axes range from 0 to 3500]
Efficient Implementation
  Sub-quadratic time complexity
  Analytical step
    $\min \; \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
    $\text{s.t.} \; t(\mathbf{x}, \mathbf{y}) \le T$
    • Gradient calculation: fast Gauss transform (FGT), $O(N^2) \Rightarrow O(N)$
    • Nonlinear programming solver: Nesterov method, placement-like problem, $O(N^{1.18})$
  Discrete refinement
    • FF-pair distance: bin-structure searching, $O(N^2) \Rightarrow O(N)$
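Bin-structure searching can be sketched with a uniform grid: each FF is hashed into a square bin whose side equals the search radius, so candidate partners only need to be looked up in the 3x3 block of bins around it. The names and the choice of bin size are assumptions for illustration; the slides do not describe the bin layout.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def close_pairs(points: Sequence[Point], radius: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose Euclidean distance is at most `radius`."""
    bins: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(math.floor(x / radius), math.floor(y / radius))].append(idx)

    pairs = set()
    for (bx, by), members in bins.items():
        # Any point within `radius` of a member lies in this bin or an adjacent one.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) <= radius:
                            pairs.add((i, j))
    return sorted(pairs)


ffs = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.4, 5.3), (0.9, 0.1)]
near = close_pairs(ffs, radius=1.0)
```

When the FFs are spread so that each bin holds a bounded number of them, the total work is linear in the number of FFs instead of quadratic.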
Experimental Results: Setup
  G++ 4.5.1 -O3
  Intel Xeon CPU @ 2.4GHz with 16 logical threads
  Benchmarks: C1-C6, IWLS-2005 suite
  Synthesis flow for real designs
    • Synopsys DC
    • Cadence Encounter SOC
Experimental Results: C1-C6
  Comparable power reduction
  33% WL reduction
  (PWR and WLR are the power and wirelength ratios; RT is the runtime in seconds.)

           Integra                  Ours
  Circuit  PWR    WLR    RT (s)    PWR    WLR    RT (s)
  C1       82.8   96     0.01      83.5   77.4   0.42
  C2       80.9   102    0.01      82.3   76.4   0.97
  C3       80.8   104    0.01      82.3   74.9   3.14
  C4       81.0   104    0.02      82.4   75.6   10.59
  C5       80.7   105    0.05      82.1   76.4   16.66
  C6       80.7   105    1.11      82.3   82     217.4
  Avg.     1      1.33   1         1.02   1      252
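The Avg. row appears to be the mean of per-circuit ratios, each column normalized to the better of the two methods. This is an inference from the numbers, not something the slides state. The short check below recomputes the three non-unit entries from the table values; the variable and function names are hypothetical.

```python
# (PWR, WLR, RT in seconds) per circuit, copied from the C1-C6 table.
integra = {
    "C1": (82.8, 96, 0.01), "C2": (80.9, 102, 0.01), "C3": (80.8, 104, 0.01),
    "C4": (81.0, 104, 0.02), "C5": (80.7, 105, 0.05), "C6": (80.7, 105, 1.11),
}
ours = {
    "C1": (83.5, 77.4, 0.42), "C2": (82.3, 76.4, 0.97), "C3": (82.3, 74.9, 3.14),
    "C4": (82.4, 75.6, 10.59), "C5": (82.1, 76.4, 16.66), "C6": (82.3, 82, 217.4),
}


def mean_ratio(numer: dict, denom: dict, column: int) -> float:
    """Average over circuits of numer[circuit][column] / denom[circuit][column]."""
    return sum(numer[c][column] / denom[c][column] for c in numer) / len(numer)


power_ours_vs_integra = mean_ratio(ours, integra, 0)
wl_integra_vs_ours = mean_ratio(integra, ours, 1)
runtime_ours_vs_integra = mean_ratio(ours, integra, 2)
```

Rounded, these give 1.02, 1.33, and 252, matching the Avg. row. Note that 1.33 here means Integra's wirelength ratio is on average 1.33 times ours.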
Experimental Results: Real Designs
  Bound-Integra
  [Figure: Effect of Different Bound Factors on Power Ratio and WL Ratio]
Experimental Results: Real Designs
  Comparable power reduction
  43% WL reduction compared with Bound-Integra

            Bound-Integra            Ours
  Circuit   PWR    WLR    RT (s)    PWR    WLR    RT (s)
  Tv80      78.11  109.2  0.01      78.10  95.7   0.94
  Wbconmax  78.26  128    0.03      78.02  105    2.3
  Pairing   78.00  132    0.03      78.00  109    6.61
  Dma       78.04  124    0.05      78.02  96     5.43
  Ac97      78.02  120    0.02      78.02  96     4.88
  Ethernet  78.00  217    0.63      78.00  88     24.5
  Avg.      1      1.43   1         0.99   1      84
Conclusion
  We propose an analytical clustering score for MBFF merging
    • The time complexity is sub-quadratic
    • The power reduction is comparable to Integra
    • Wirelength is reduced by about 25% compared with the original placement
  Potential usage:
    • Integration in global placement
    • Clustering algorithms
Q&A
  Thanks
  {changxu, gluo}@pku.edu.cn