
Analytical Clustering Score with Application to Post-Placement Multi-Bit Flip-Flop Merging



  1. Analytical Clustering Score with Application to Post-Placement Multi-Bit Flip-Flop Merging
     Chang Xu (1), Peixin Li (1), Guojie Luo (1), Yiyu Shi (2), and Iris Hui-Ru Jiang (3)
     {changxu, gluo}@pku.edu.cn

  2. Outline
     - Background
       - Multi-bit flip-flop
       - Previous works and limitation
     - Our method
       - Analytical score
       - Discrete refinement
       - Efficient implementation
     - Experimental Results
     - Conclusion

  3. Clock Power Optimization
     - Clock power predominates dynamic power: $P_{clk} = \alpha C_{clk} V_{dd}^{2} f_{clk}$
     - Clock power optimization
       - Reduce $\alpha$
         • Clock gating technique
       - Reduce $V_{dd}$
         • Sub-threshold voltage
         • Multi-supply-voltage
       - Reduce $C_{clk}$
         • Multi-bit flip-flop
         • Resonance clock
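A small numeric sketch of the clock power formula on this slide. All numbers are hypothetical and chosen only to show that, with the other factors fixed, clock power scales linearly with the clock capacitance that MBFF merging aims to reduce.

```python
def clock_power(alpha, c_clk, v_dd, f_clk):
    """Dynamic clock power P_clk = alpha * C_clk * V_dd^2 * f_clk (watts)."""
    return alpha * c_clk * v_dd ** 2 * f_clk


# Hypothetical values for illustration only (not taken from the slides).
base = clock_power(alpha=1.0, c_clk=100e-12, v_dd=1.0, f_clk=1e9)
merged = clock_power(alpha=1.0, c_clk=80e-12, v_dd=1.0, f_clk=1e9)

# Relative saving when only C_clk shrinks by 20%.
saving = 1 - merged / base
print(f"base={base:.3f} W, merged={merged:.3f} W, saving={saving:.0%}")
```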

  4. Multi-Bit Flip-Flop (MBFF)
     - What is an MBFF?
       - Several single-bit flip-flops (SBFFs) share common inverters in one MBFF cell
     - [Figure: 2-bit flip-flop. Source: ICCAD'10, Chang et al.]

  5. Multi-Bit Flip-Flop (MBFF)
     - Power saving comes from
       - MBFF library (UMC 55nm process, Faraday cell library)
       - Simplified clock tree
     - [Figure: (a) common clock tree; (b) simplified clock tree with MBFF]

  6. Using MBFF at Different Stages
     - Pre-placement MBFF
       - SNUG'10 Chen et al.
     - In-placement MBFF
       - ISPD'13 Tsai et al.
       - ICCAD'13 Hsu et al.
     - Post-placement MBFF
       - ICGCS'10 Yan and Chen
       - ICCAD'10 Chang et al.
       - ISPD'11 Jiang et al., INTEGRA
     - [Figure: design flow with stages Logic Synthesis, Placement, Timing Analysis, Post-placement Optimization, CTS, and Routing; MBFF clustering can be applied before, during, or after placement]

  7. Post-Placement MBFF Clustering
     - Input
       - Placement of FFs and other gates
       - Timing slacks
       - MBFF library
     - Output
       - FF clusters (MBFF)
     - Constraint
       - Timing constraint
     - [Figure: FFs with their input pin, output pin, and TVFR]

  8. Post-Placement MBFF Clustering
     - Timing violation free region (TVFR)
     - [Figure: an FF with its input pin and output pin, the regions TVFR1 and TVFR2, the resulting TVFR, and a 2-bit FF]

  9. Previous Works and Limitation
     - Intersection graph-based searching [ICCAD'10]
       - Time consuming: $O(N^3)$
       - Window-based acceleration affects power reduction
     - [Figure: TVFRs, their intersection graph, and the complete graph]

  10. Previous Works and Limitation
     - Interval graph-based searching [ISPD'11]
       - Efficient: sub-quadratic time complexity
       - Effective: best power reduction
       - Simple: signal wirelength degradation
     - [Figure: illustration of the interval graph, annotated "Random Choice!". Source: ISPD'11, Jiang et al.]

  11. Benchmarks: C1-C6 vs. IWLS 2005
     - Difference
       - TVFD/AFFD: a rough estimate of the number of FFs that can be covered within a TVFR
       - IWLS benchmarks have many more MBFF candidates
     - [Figure: FF ratio vs. TVFD/AFFD for C1-C6 and for Vga (IWLS 2005)]
     - Signal wirelength degradation (for Integra)
       - C1-C6: avg. 3%
       - IWLS: avg. 932%

  12. Our Contribution
     - Efficient and highly scalable
       - Sub-quadratic time complexity
     - Robust performance
       - Power reduction: comparable to Integra
       - Signal wirelength: much better than Integra, especially for real designs
     - Analytical formulation
       - Potential integration into analytical global placement
       - Potential use in clustering algorithms

  13. Optimization Flow

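  14. Analytical Step: Basic Idea
     - Optimization problem
       $\min\ \alpha f_l(\boldsymbol{x}, \boldsymbol{y}) - f_c(\boldsymbol{x}, \boldsymbol{y})$
       $\text{s.t.}\ t(\boldsymbol{x}, \boldsymbol{y}) \le T$
     - $f_l(\boldsymbol{x}, \boldsymbol{y})$: signal wirelength
       - Weighted-average WL [DAC'11]
     - $f_c(\boldsymbol{x}, \boldsymbol{y})$: #FF groups
       - Nontrivial to formulate
     - Timing constraint
       - Feasible region
     - [Figure: TVFRs containing a 2-bit group and a 3-bit group]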
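The slide names the weighted-average (WA) wirelength model for $f_l$ but does not spell it out. The sketch below uses the commonly cited form of that model for a single net, as an assumption about what the slide refers to: a softmax-weighted maximum minus a softmin-weighted minimum per axis. The function names and the smoothing parameter `gamma` are illustrative, not from the slides.

```python
import numpy as np


def wa_wirelength_1d(coords, gamma):
    """Smooth approximation of max(coords) - min(coords) along one axis."""
    x = np.asarray(coords, dtype=float)
    # Shifting by max/min keeps the exponentials bounded; the ratios are unchanged.
    w_max = np.exp((x - x.max()) / gamma)
    w_min = np.exp(-(x - x.min()) / gamma)
    return (x * w_max).sum() / w_max.sum() - (x * w_min).sum() / w_min.sum()


def wa_wirelength(xs, ys, gamma=1.0):
    """WA wirelength of one net: sum of the smooth x-span and y-span."""
    return wa_wirelength_1d(xs, gamma) + wa_wirelength_1d(ys, gamma)


# One net with three pins; its half-perimeter wirelength is 10 + 3 = 13.
xs, ys = [0.0, 10.0, 4.0], [0.0, 0.0, 3.0]
print(wa_wirelength(xs, ys, gamma=1.0))   # smooth value, never above 13
print(wa_wirelength(xs, ys, gamma=0.1))   # approaches 13 as gamma shrinks
```

Because each term is a weighted average of the pin coordinates, the result can never exceed the true span, and it is differentiable, which is what a gradient-based solver needs.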

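  15. Analytical Step: Def. of Clustering Score
     - Dirac delta function
       $\delta(w, z) = \begin{cases} 1 & (w = z) \\ 0 & (w \neq z) \end{cases}$
       $\delta(\|(x_i, y_i) - (x_j, y_j)\|, 0) = \begin{cases} 1 & \|(x_i, y_i) - (x_j, y_j)\| = 0 \\ 0 & \text{otherwise} \end{cases}$
     - Cluster size
       $N_i(\boldsymbol{x}, \boldsymbol{y}) = \sum_{j=1}^{N} \delta(\|(x_i, y_i) - (x_j, y_j)\|, 0)$
     - [Figure: TVFRs with two clusters, $FF_j$ with $N_j = 2$ and $FF_i$ with $N_i = 3$]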
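A minimal sketch of the cluster-size definition above: for each FF, count how many FFs (itself included) sit at exactly the same location. The function name `cluster_sizes` is illustrative, and the dense all-pairs distance matrix is used only for clarity, not efficiency.

```python
import numpy as np


def cluster_sizes(xs, ys):
    """N_i = number of FFs j whose location coincides with FF i (including i)."""
    pts = np.column_stack([xs, ys]).astype(float)
    # Pairwise distances; entry (i, j) is ||(x_i, y_i) - (x_j, y_j)||.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # delta(dist, 0) is 1 exactly where the distance is zero.
    return (dist == 0).sum(axis=1)


# Three FFs stacked at (0, 1) and two stacked at (5, 2), as in the slide's figure.
sizes = cluster_sizes([0, 0, 0, 5, 5], [1, 1, 1, 2, 2])
print(sizes.tolist())
```

The exact-equality test is what makes this score non-differentiable, which motivates the smoothing step on a later slide.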

  16. Analytical Step: Def. of Clustering Score
     - Objective function: $f_c$ term
       - The 4-bit group is the most efficient
       $\min\, -f_c = -\max f_c = -\max \sum_{i=1}^{N} \delta(N_i(\boldsymbol{x}, \boldsymbol{y}), 4)$

  17. Analytical Step: Smoothing
     - Gaussian function
       $\delta(w, z) \approx D(w, z) = \exp\left(\frac{(w - z)^2 \ln \epsilon}{d_0^2}\right)$
       $D(w, z) = 1$ when $w = z$
       $D(w, z) < \epsilon$ when $|w - z| > d_0$
     - [Figure: Gaussian function vs. Dirac delta function]
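A sketch of the Gaussian smoothing above, applied first to the cluster size and then to the 4-bit score. The slides give a single $D(w, z)$ with parameters $d_0$ and $\epsilon$; using a separate width `size_width` when comparing the cluster size against 4 is an assumption made here for illustration, as are all function names.

```python
import numpy as np


def gaussian_delta(w, z, d0, eps):
    """D(w, z) = exp((w - z)^2 * ln(eps) / d0^2); equals 1 at w == z."""
    return np.exp((w - z) ** 2 * np.log(eps) / d0 ** 2)


def smoothed_cluster_sizes(xs, ys, d0, eps):
    """Smooth N_i: sum over j of D(distance between FF i and FF j, 0)."""
    pts = np.column_stack([xs, ys]).astype(float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    return gaussian_delta(dist, 0.0, d0, eps).sum(axis=1)


def smoothed_score(xs, ys, d0, eps, size_width=1.0):
    """Smooth f_c: sum over i of D(N_i, 4). size_width is a hypothetical knob."""
    sizes = smoothed_cluster_sizes(xs, ys, d0, eps)
    return gaussian_delta(sizes, 4.0, size_width, eps).sum()


print(gaussian_delta(2.0, 0.0, d0=2.0, eps=0.01))          # equals eps at distance d0
print(smoothed_score([0, 0, 0, 0], [0, 0, 0, 0], 1.0, 0.01))  # four stacked FFs
```

Unlike the exact delta, nearby FFs now contribute a fractional amount to each other's cluster size, so the score has a usable gradient that pulls FFs together.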

  18. Analytical Step: Effectiveness
     - Attractive force & repelling force
     - [Figure: PULL (attractive) and PUSH (repelling) forces acting on $FF_i$]

  19. Analytical Step: Preliminary Clusters
     - [Figure: scatter plots of FF locations, both axes from 0 to 3500: (a) initial FF distribution (Init. Loc.); (b) FF distribution after analytical clustering (NLP Loc.)]
     - $f_c$: maximizes the number of MBFF groups
     - $f_l$: pulls FFs towards their "optimal locations" in terms of WL

  20. Discrete Step: Basic Idea
     - Two-pass best-choice clustering
       - First pass: discretization
       - Second pass: refinement
     - [Figure: FFs A to I shown in four panels: (a) proximity relation; (b) discrete clustering after the analytical step (first pass); (c) discrete refinement (second pass); (d) final MBFF groups]

  21. Discrete Step: Two-Pass Best-Choice Clustering
     - First pass: extract proximity relation
       - Bottom-up merging
       - Priority queue
         • Tuple: $(FF_i, FF_j, d)$, where $d = \mathrm{dist}(FF_i, FF_j)$
       - Capacity constraint: 4-bit
     - Second pass: further refinement
       - Improve the ratio of 4-bit groups
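A simplified sketch of the first pass described above: bottom-up merging driven by a priority queue of $(d, FF_i, FF_j)$ tuples with a 4-bit capacity. Several things here are assumptions rather than the authors' method: Manhattan distance as `dist`, a union-find structure to track groups, the optional `max_dist` cutoff, and the omission of timing (TVFR) checks and of the second refinement pass.

```python
import heapq
from itertools import combinations


def first_pass_clusters(points, capacity=4, max_dist=float("inf")):
    """Greedily merge the closest FF pairs while each group holds <= capacity FFs."""
    parent = list(range(len(points)))
    size = [1] * len(points)

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Priority queue of (distance, i, j); the smallest distance is popped first.
    heap = []
    for i, j in combinations(range(len(points)), 2):
        d = abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1])
        if d <= max_dist:
            heap.append((d, i, j))
    heapq.heapify(heap)

    while heap:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        # Merge only distinct groups that still fit the capacity constraint.
        if ri != rj and size[ri] + size[rj] <= capacity:
            parent[rj] = ri
            size[ri] += size[rj]

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


ffs = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10)]
print(first_pass_clusters(ffs, capacity=4, max_dist=3))
```

After the analytical step has already pulled FFs of a prospective group close together, a distance-ordered queue like this tends to recover those groups; the capacity check is what leaves 2- and 3-bit leftovers for a refinement pass to improve.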

  22. Discrete Step: Two-Pass Best-Choice Clustering
     - [Figure: FFs A to I shown in four panels: (a) proximity relation; (b) first-pass clustering after the analytical step; (c) second-pass clustering; (d) final MBFF groups]
     - Scores shown in the figure: S(C,D), S(I,E), S(G,F), S(H,F), S(E,G), S(I,H), S(A,B), S(A,C), S(I,E)

  23. MBFF Clusters
     - [Figure: scatter plots of FF locations, both axes from 0 to 3500: initial locations (Init. Loc.), locations after the analytical step (NLP Loc.), final locations (Final Loc.), and an overlay of all three]

  24. Efficient Implementation
     - Sub-quadratic time complexity
     - Analytical step: $\min\ \alpha f_l(\boldsymbol{x}, \boldsymbol{y}) - f_c(\boldsymbol{x}, \boldsymbol{y})$, $\text{s.t.}\ t(\boldsymbol{x}, \boldsymbol{y}) \le T$
       • Gradient calculation: fast Gauss transform (FGT), $O(N^2) \Rightarrow O(N)$
       • Nonlinear programming solver: Nesterov method, placement-like problem, $O(N^{1.18})$
     - Discrete refinement
       • FF-pair distance: bin-structure searching, $O(N^2) \Rightarrow O(N)$
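A sketch of how a bin structure can replace the all-pairs distance check in the discrete step. It hashes FFs into square bins whose side equals the search radius, so each FF only has to be compared against FFs in its own bin and the eight neighboring bins. The function name, the Euclidean distance test, and the choice of bin size are assumptions for illustration; the near-linear behavior holds only when the number of FFs per bin stays bounded.

```python
from collections import defaultdict


def close_pairs(points, radius):
    """Return all index pairs (i, j), i < j, whose distance is <= radius."""
    # Hash each point into a square bin of side `radius`.
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(int(x // radius), int(y // radius))].append(idx)

    pairs = []
    for (bx, by), members in bins.items():
        # Any point within `radius` must lie in this bin or one of its 8 neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j:  # report each unordered pair exactly once
                            (xi, yi), (xj, yj) = points[i], points[j]
                            if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                                pairs.append((i, j))
    return sorted(pairs)


ffs = [(0, 0), (1, 0), (5, 5), (5.5, 5), (100, 100)]
print(close_pairs(ffs, radius=2.0))
```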

  25. Experimental Results: Setup
     - g++ 4.5.1 with -O3
     - Intel Xeon CPU @ 2.4GHz with 16 logical threads
     - Benchmarks: C1-C6, IWLS-2005 suite
     - Synthesis flow for real designs
       • Synopsys DC
       • Cadence Encounter SOC

  26. Experimental Results: C1-C6
     - Comparable power reduction
     - 33% WL reduction

                     Integra                    Ours
     Circuit    PWR     WLR    RT (s)     PWR     WLR    RT (s)
     C1         82.8     96    0.01       83.5    77.4     0.42
     C2         80.9    102    0.01       82.3    76.4     0.97
     C3         80.8    104    0.01       82.3    74.9     3.14
     C4         81.0    104    0.02       82.4    75.6    10.59
     C5         80.7    105    0.05       82.1    76.4    16.66
     C6         80.7    105    1.11       82.3    82     217.4
     Avg.       1       1.33   1          1.02    1      252
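The Avg. row of the C1-C6 table is normalized rather than a plain mean. A quick check, assuming it is the mean of per-circuit ratios between the two methods (an interpretation, not something the slide states), reproduces the three non-unit entries from the table's own numbers.

```python
# Per circuit: (Integra PWR, Integra WLR, Integra RT, Ours PWR, Ours WLR, Ours RT)
rows = {
    "C1": (82.8, 96, 0.01, 83.5, 77.4, 0.42),
    "C2": (80.9, 102, 0.01, 82.3, 76.4, 0.97),
    "C3": (80.8, 104, 0.01, 82.3, 74.9, 3.14),
    "C4": (81.0, 104, 0.02, 82.4, 75.6, 10.59),
    "C5": (80.7, 105, 0.05, 82.1, 76.4, 16.66),
    "C6": (80.7, 105, 1.11, 82.3, 82, 217.4),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Integra's wirelength relative to ours, and our power/runtime relative to Integra.
wlr_integra_vs_ours = mean(r[1] / r[4] for r in rows.values())
pwr_ours_vs_integra = mean(r[3] / r[0] for r in rows.values())
rt_ours_vs_integra = mean(r[5] / r[2] for r in rows.values())

print(round(wlr_integra_vs_ours, 2), round(pwr_ours_vs_integra, 2), round(rt_ours_vs_integra))
```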

  27. Experimental Results: Real Designs
     - Bound-Integra
     - [Figure: effect of different bound factors on power ratio and WL ratio]

  28. Experimental Results: Real Designs
     - Comparable power reduction
     - 43% WL reduction compared with Bound-Integra

                     Bound-Integra               Ours
     Circuit    PWR      WLR     RT (s)     PWR      WLR     RT (s)
     Tv80       78.11    109.2   0.01       78.10    95.7     0.94
     Wbconmax   78.26    128     0.03       78.02    105      2.3
     Pairing    78.00    132     0.03       78.00    109      6.61
     Dma        78.04    124     0.05       78.02    96       5.43
     Ac97       78.02    120     0.02       78.02    96       4.88
     Ethernet   78.00    217     0.63       78.00    88      24.5
     Avg.       1        1.43    1          0.99     1       84

  29. Conclusion
     - We propose an analytical clustering score for MBFF merging
     - The time complexity is sub-quadratic
     - We achieve power reduction comparable to Integra
     - We reduce wirelength by about 25% compared with the original placement
     - Potential usage:
       - Integration into global placement
       - Clustering algorithms

  30. Q&A
     - Thanks
     - {changxu, gluo}@pku.edu.cn
