A Learning Bridge from Architectural Synthesis to Physical Design

  1. A Learning Bridge from Architectural Synthesis to Physical Design for Exploring Power Efficient High-Performance Adders
     Subhendu Roy (1), Yuzhe Ma (2), Jin Miao (1), Bei Yu (2)
     (1) Cadence Design Systems; (2) The Chinese University of Hong Kong
     ISLPED’17

  2. Optimality across EDA stages
     Stages: Architectural Synthesis, Logic Synthesis, Physical Design
     No 1-1 mapping between metrics across the various EDA stages.
     ◮ Optimality at one stage doesn’t guarantee optimality at another stage
     ◮ A data-driven methodology, such as machine learning, becomes imperative

  3. Binary Adder Design
     ◮ Primary building blocks in the datapath logic of a microprocessor
     ◮ A fundamental problem in the VLSI industry for the last several decades
     What is still unsolved? Closing the gap across adder design stages

  5. Parallel Prefix Adders
     Parallel prefix adders → flexible delay-power trade-off
     ◮ Regular adders → sub-optimal
     ◮ Custom adders → high turnaround time (TAT)
     This work: automatic custom adders

  6. Architectural Level: Mapped to Prefix Structures
     [Figure: 8-bit adder in three stages. Pre-processing turns the inputs (a_i, b_i), i = 0..7, into (g_i, p_i); the parallel prefix structure computes the carries c_0..c_7, with C_out = c_7; post-processing produces the sum bits s_0..s_7.]
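
The three stages in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prefix_add` is hypothetical, the carry-in is assumed to be 0, and the prefix stage is a plain serial scan (the simplest valid prefix structure), not one of the parallel structures the deck explores.

```python
def prefix_add(a, b, n=8):
    """Add two n-bit integers via generate/propagate signals and a serial prefix scan."""
    # Pre-processing: per-bit generate g_i = a_i AND b_i, propagate p_i = a_i XOR b_i.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]

    # Prefix computation: c_i is the group generate of bits i..0 (carry-in assumed 0).
    # Operator: (g_hi, p_hi) o (G_lo, P_lo) = (g_hi | (p_hi & G_lo), p_hi & P_lo).
    carries = []
    G, P = 0, 1  # identity element of the operator
    for i in range(n):
        G, P = g[i] | (p[i] & G), p[i] & P
        carries.append(G)

    # Post-processing: s_i = p_i XOR c_{i-1}; there is no carry into bit 0.
    s = [p[0]] + [p[i] ^ carries[i - 1] for i in range(1, n)]
    total = sum(bit << i for i, bit in enumerate(s))
    return total, carries[n - 1]  # (n-bit sum, C_out)
```

For example, `prefix_add(200, 100)` returns `(44, 1)`, that is 300 = 256 + 44. A parallel prefix adder changes only the order in which the associative operator is applied, not the result.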

  7. Prefix Graph Problem
     Carry computation can be mapped to the prefix graph problem:
     y_i = x_i ∘ x_{i-1} ∘ x_{i-2} ∘ ... ∘ x_1 ∘ x_0
     [Figure: example prefix graph on inputs x_5..x_0, with output y_5 marked]
     ◮ Size (s) = number of prefix nodes = 7
     ◮ Level (L) = maximum logic level = 3
     ◮ Max-fanout (mfo) = 2
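
A minimal sketch of how the three metrics could be computed. The dictionary encoding and the helper names are assumptions made for illustration, and the graph used is a small 4-input one (node names borrowed from the deck's later spfo example), not the 6-input graph drawn on this slide. Fan-out is assumed to count input nodes as well as prefix nodes.

```python
# Each prefix node maps to (upper fan-in, lower fan-in); inputs x0..x3 have no entry.
GRAPH = {
    "y1": ("x1", "x0"),
    "i1": ("x3", "x2"),
    "y2": ("x2", "y1"),
    "y3": ("i1", "y1"),
}

def level(node, graph):
    """Logic level of a node: inputs are level 0, a prefix node is 1 + max of its fan-ins."""
    if node not in graph:
        return 0
    return 1 + max(level(f, graph) for f in graph[node])

def fanouts(graph):
    """Count how many prefix nodes each node feeds."""
    fo = {}
    for fanins in graph.values():
        for f in fanins:
            fo[f] = fo.get(f, 0) + 1
    return fo

def metrics(graph):
    """Return (size, level, max-fanout) of a prefix graph."""
    size = len(graph)
    max_level = max(level(node, graph) for node in graph)
    mfo = max(fanouts(graph).values())
    return size, max_level, mfo
```

Here `metrics(GRAPH)` returns `(4, 2, 2)`: four prefix nodes, two logic levels, and a maximum fan-out of 2 (both `x2` and `y1` feed two nodes).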

  8. Classifying Prefix Graph Synthesis
     Approaches can be classified by the number of solutions they produce.
     Category 1: Limited number of solutions
     ◮ Examples: [Matsunaga+,GLSVLSI’07], [Liu+,ICCAD’03], [Zhu+,ASPDAC’05], [Roy+,ASPDAC’15]
       - Not suitable for exploring data-driven methodologies
       - No analytical model to the physical design stage
     Category 2: Innumerable solutions
     ◮ Example: [Roy+,TCAD’14]
       - Not scalable for bounded fan-out
       - Computationally expensive to run all solutions through the full physical design flow

  10. Gap between Prefix Structure and Physical Design
      [Figure: (a) Architectural solution space, node size vs. max fanout; (b) Physical design space, area (µm²) vs. critical delay (ns). Solution groups G1 and G2 are marked in both panels.]
      ◮ G1: lower fan-out and larger size; G2: higher fan-out and smaller size
      ◮ When mapped to the physical solution space:
        - Size correlates with area
        - The correlation is not completely reliable: G1 and G2 get mixed up in the physical solution space
      What we want to search for: all Pareto-frontier points with low area, low power, and low critical delay.

  11. Task 1: Prefix Adder Solution Exploration
      [Figure: power (µW) vs. critical delay (ps) for the [Roy+,TCAD’14] solutions]

  12. [Roy+,TCAD’14]: Summary
      [Figure: bottom-up generation of the graph sets G_2, G_3, G_4, ..., G_n, G_{n+1}]
      ◮ G_n = set of prefix graphs of bit-width n
      ◮ Prefix graphs of higher order are generated in a bottom-up fashion
      ◮ Several pruning strategies are applied during G_n → G_{n+1} for scaling
        - For bounded fan-out, these strategies compromise size-optimality

  13. Enhancement 1: Imposing Semi-regularity
      ◮ The concept is derived from regular adders such as Brent-Kung and Sklansky.
      ◮ x_i and x_{i+1} are combined to form prefix nodes, where i is even.
      ◮ This regularity is imposed only at level L = 1.
      ◮ For L > 1, regularity compromises size-optimality, so it is forbidden.
      ◮ Observation: this semi-regularity doesn’t degrade size-optimality.
      [Figure: example prefix graph on inputs x_7..x_0]
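
As a small illustration of the level-1 pairing rule, the sketch below lists the (MSB, LSB) spans of the forced level-1 nodes. The function name and the span encoding are assumptions for illustration; what happens to an unpaired top bit when n is odd is not stated in the deck, so the sketch simply leaves it unpaired.

```python
def level_one_nodes(n):
    """Return (msb, lsb) spans of the level-1 prefix nodes for an n-bit prefix graph.

    Semi-regularity pairs x_i with x_{i+1} for every even i; levels above 1 stay free.
    """
    return [(i + 1, i) for i in range(0, n - 1, 2)]
```

For the 8-input graph in the figure, `level_one_nodes(8)` gives `[(1, 0), (3, 2), (5, 4), (7, 6)]`.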

  14. Enhancement 2: Level restriction in Non-trivial Fan-in
      ◮ The trivial fan-in of a node is the fan-in that has the same MSB as the node.
      ◮ Example: x_4 and i_1 are the trivial and non-trivial fan-ins of i_2.
      ◮ Restriction: level(non-trivial fan-in) ≥ level(trivial fan-in)
      ◮ Reduces the search space without degrading size-optimality
      [Figure: example prefix graph on inputs x_5..x_0 with internal nodes i_1, i_2 and output y_5]
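
The restriction can be read as a simple predicate over a candidate graph. The sketch below assumes a hypothetical encoding in which each prefix node maps to (trivial fan-in, non-trivial fan-in); both example graphs are made up for illustration and are not taken from the slide.

```python
def level(node, graph):
    """Logic level: inputs are level 0, a prefix node is 1 + max of its fan-ins."""
    if node not in graph:
        return 0
    return 1 + max(level(f, graph) for f in graph[node])

def obeys_level_restriction(graph):
    """True if level(non-trivial fan-in) >= level(trivial fan-in) for every node."""
    return all(
        level(non_trivial, graph) >= level(trivial, graph)
        for trivial, non_trivial in graph.values()
    )

# Balanced 4-input graph: y3 combines i1 = (3:2) with y1 = (1:0), both at level 1.
ALLOWED = {"y1": ("x1", "x0"), "i1": ("x3", "x2"), "y3": ("i1", "y1")}

# Skewed graph: y3 combines b = (3:1) at level 2 with the raw input x0 at level 0.
REJECTED = {"a": ("x2", "x1"), "b": ("x3", "a"), "y3": ("b", "x0")}
```

`obeys_level_restriction(ALLOWED)` is `True` and `obeys_level_restriction(REJECTED)` is `False`, so a search that applies this filter would discard the skewed graph.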

  15. Comparison at Prefix Graph Stage

             Our approach            [Roy+,TCAD’14]
      mfo    size   run-time (s)     size   run-time (s)
      4      244    302              252    241
      6      233    264              238    212
      8      222    423              -      -
      12     201    193              -      -
      16     191    73               192    149
      32     185    0.04             185    0.04

      ◮ Table is for 64-bit adders
      ◮ [Roy+,TCAD’14] cannot get solutions for all fan-outs (no result for mfo = 8 and 12).
      ◮ Our solutions are never larger in size; the two approaches tie only at mfo = 32.
      ◮ Run-times are comparable, and adder synthesis is a one-time cost.

  17. Physical Solution Space Comparison
      [Figure: power (µW) vs. critical delay (ps), [Roy+,TCAD’14] solutions vs. ours]
      Our solutions cover a wider space in the physical domain.
      ◮ 7000 random samples from [Roy+,TCAD’14] vs. 3000 samples from our approach
      ◮ Reason: [Roy+,TCAD’14] misses solutions for bounded fan-out in a few cases

  18. Task 2: Pareto Frontier Driven Learning
      [Figure: power (µW) vs. critical delay (ps), showing the real Pareto frontier (Real PF) and the representative adders (Rep. Adder)]

  19. Quasi-Random Data Sampling
      ◮ Hundreds of thousands of solutions
      ◮ How to choose training data?
        - Cannot run too many architectures, as the physical design flow is costly.
        - Too few will degrade model accuracy.
      Quasi-random sampling: create architectural bins based on mfo and s.
      ◮ Capture all architectural bins
      ◮ Select solutions from each bin randomly
      [Figure: solutions grouped into bins, e.g. the bin of solutions with s = 246 and mfo = 4. Bin labels shown: s = 244, 245, 246 and s = 233, 234, 235; mfo = 4 and mfo = 6.]
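
The binning step could look like the sketch below. It assumes each solution is a dictionary carrying its size under `"s"` and its max-fanout under `"mfo"`; the function name, the `per_bin` parameter, and the fixed seed are illustrative choices, not details given in the deck.

```python
import random
from collections import defaultdict

def quasi_random_sample(solutions, per_bin, seed=0):
    """Pick up to `per_bin` random solutions from every (s, mfo) architectural bin."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bins = defaultdict(list)
    for sol in solutions:
        bins[(sol["s"], sol["mfo"])].append(sol)

    picked = []
    for key in sorted(bins):  # visit every bin so none is missed
        members = bins[key]
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked

# Hypothetical pool: five solutions in each of three bins.
POOL = [
    {"s": s, "mfo": mfo, "id": k}
    for s, mfo in [(244, 4), (245, 4), (233, 6)]
    for k in range(5)
]
SAMPLE = quasi_random_sample(POOL, per_bin=2)
```

Every bin contributes to `SAMPLE`, which is the point of the scheme: a purely random draw over hundreds of thousands of solutions could leave sparse bins unrepresented.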

  20. Feature Selection and Learning Model
      ◮ Architectural attributes: s, mfo, sum-path-fanout (spfo)
      ◮ Tool settings: target delay
      ◮ Best model fit: support vector regression (SVR) with an RBF kernel
      ◮ Including spfo improves the MSE score for delay from 0.232 to 0.164
      ◮ Note: linear models are not sufficient for modeling delay
      Example (prefix graph on inputs x_3..x_0 with nodes i_1, y_1, y_3):
        spfo(y_1) = spfo(x_0) + spfo(x_1) + fo(x_0) + fo(x_1) = 0 + 0 + 1 + 1 = 2
        spfo(i_1) = spfo(x_3) + spfo(x_2) + fo(x_3) + fo(x_2) = 0 + 0 + 1 + 2 = 3
        spfo(y_3) = spfo(i_1) + spfo(y_1) + fo(i_1) + fo(y_1) = 3 + 2 + 1 + 2 = 8
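
The spfo recurrence in the example can be written as a short recursive function. The graph encoding below is an assumption: the node `y2 = x2 ∘ y1` is not named on the slide but is inferred from the stated fan-outs fo(x_2) = 2 and fo(y_1) = 2.

```python
# Each prefix node maps to its two fan-ins; inputs x0..x3 have no entry.
GRAPH = {
    "y1": ("x1", "x0"),
    "i1": ("x3", "x2"),
    "y2": ("x2", "y1"),  # inferred node, see the note above
    "y3": ("i1", "y1"),
}

def fanouts(graph):
    """Count how many prefix nodes each node feeds."""
    fo = {}
    for fanins in graph.values():
        for f in fanins:
            fo[f] = fo.get(f, 0) + 1
    return fo

def spfo(node, graph, fo):
    """Sum-path-fanout: 0 for inputs, else spfo and fan-out of both fan-ins added up."""
    if node not in graph:
        return 0
    a, b = graph[node]
    return spfo(a, graph, fo) + spfo(b, graph, fo) + fo[a] + fo[b]

FO = fanouts(GRAPH)
```

With this graph, `spfo("y1", GRAPH, FO)`, `spfo("i1", GRAPH, FO)` and `spfo("y3", GRAPH, FO)` evaluate to 2, 3 and 8, matching the worked example.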

  21. Pareto Frontier Driven Learning
      ◮ Conventional learning focuses on prediction accuracy
        - Improving model accuracy doesn’t guarantee a better Pareto frontier
        - Need for learning-integrated Pareto-frontier exploration
      ◮ Scalarization, or α-sweep
        - The learning output is a linear sum of delay and power (α × Power + Delay)
        - Model fitting is done with different values of α
        - α is swept from 0 to a large positive number
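
To show how sweeping α traces out different trade-off points, the sketch below minimizes α × Power + Delay directly over known (delay, power) pairs. This swaps the deck's learned model for ground-truth values purely for illustration: the numbers are the delay (ps) and power (mW) figures from the deck's final comparison table, and the function name and the particular α values are assumptions.

```python
# (delay in ps, power in mW), taken from the deck's final comparison table.
POINTS = [
    (347.9, 8.78),  # Kogge-Stone
    (340.0, 7.72),  # Ours (P1)
    (356.1, 6.1),   # Sklansky
    (353.0, 5.9),   # Ours (P2)
    (348.7, 6.98),  # [Roy+,ASPDAC'15]
    (346.0, 6.67),  # Ours (P3)
]

def alpha_sweep(points, alphas):
    """For each alpha, keep the point minimizing alpha * power + delay."""
    winners = set()
    for alpha in alphas:
        winners.add(min(points, key=lambda dp: alpha * dp[1] + dp[0]))
    return sorted(winners)

FRONT = alpha_sweep(POINTS, alphas=[0, 1, 6, 8, 100])
```

`FRONT` is `[(340.0, 7.72), (346.0, 6.67), (353.0, 5.9)]`: α = 0 favors pure delay (P1), a large α favors power (P2), and intermediate values pick P3. A linear scalarization of this kind can only reach points on the convex part of a frontier, which is one reason the sweep needs a reasonably dense set of α values.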

  22. Experimental Setup
      Synthesis and placement/routing of adders
      ◮ Tools: Design Compiler / IC Compiler
      ◮ Library: non-linear delay model (NLDM) in the 32nm SAED cell library
      ◮ Tool settings: target delay = 0.1 ns, 0.2 ns, 0.3 ns
      Programming languages
      ◮ C++ for prefix adder synthesis
      ◮ Python-based machine learning package scikit-learn
      Machine configuration
      ◮ 72GB RAM UNIX machine
      ◮ 2.8GHz CPU

  24. Pareto-frontier Comparison
      [Figure: two panels, power (µW) vs. critical delay (ps) and area (µm²) vs. critical delay (ps), each showing the real Pareto frontier (Real PF), the predicted Pareto frontier (Predicted PF), and the representative adders (Rep. Adder)]
      The predicted Pareto frontier almost matches the actual Pareto frontier.
      ◮ Training set is randomly selected from 300 samples.
      ◮ Rep. adders are quasi-random sampled from the other 3000 samples
      ◮ Predicted frontier is from the best 150 solutions (predicted)

  25. Comparison with Other Adders
      Pareto points derived from our approach beat the other solutions in all metrics (delay, area, power).

      Method              Delay (ps)   Area (µm²)   Power (mW)
      Kogge-Stone         347.9        2563.7       8.78
      Ours (P1)           340.0        2203.3       7.72
      Sklansky            356.1        1792.5       6.1
      Ours (P2)           353.0        1753.0       5.9
      [Roy+,ASPDAC’15]    348.7        1971.4       6.98
      Ours (P3)           346.0        1848.6       6.67
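
The "beats in all metrics" claim is a Pareto-dominance statement, and it can be checked directly on the table. The sketch below is a generic dominance filter, not the authors' tooling; the function names are assumptions, and all three metrics are treated as lower-is-better.

```python
# (delay in ps, area in µm², power in mW), copied from the table above.
ADDERS = {
    "Kogge-Stone": (347.9, 2563.7, 8.78),
    "Ours (P1)": (340.0, 2203.3, 7.72),
    "Sklansky": (356.1, 1792.5, 6.1),
    "Ours (P2)": (353.0, 1753.0, 5.9),
    "[Roy+,ASPDAC'15]": (348.7, 1971.4, 6.98),
    "Ours (P3)": (346.0, 1848.6, 6.67),
}

def dominates(a, b):
    """True if a is no worse than b in every metric and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Names of the points that no other point dominates."""
    return sorted(
        name
        for name, metrics in points.items()
        if not any(dominates(other, metrics) for other in points.values())
    )
```

`pareto_front(ADDERS)` returns `["Ours (P1)", "Ours (P2)", "Ours (P3)"]`: P1 dominates Kogge-Stone, P2 dominates Sklansky, and P3 dominates [Roy+,ASPDAC’15], while P1, P2 and P3 do not dominate one another.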
