
Heterogeneous Latch-based Asynchronous Pipelines



  1. Heterogeneous Latch-based Asynchronous Pipelines
     Girish Venkataramani, Tiberiu Chelcea, Seth C. Goldstein
     Presenter: Tobias Bjerregaard
     April 10th, 2008, ASYNC 2008

  2. Outline
     • Introduction (current section)
     • Latch Selection Algorithm
     • Experimental Results
     • Conclusions

  3. Motivation
     • Normally open latches are attractive for bundled-data designs, e.g., Mousetrap [Singh, ICCD 01]
       + High-performance: short critical path to open the latch
       – Power hungry: data glitches spill over to downstream stages
     [Figure: bundled-data pipeline stage; labels: Latch, Data, C, H/S, Delay, Req, Ack]

  4. Motivation
     • Self-Resetting (SR) latches address the glitching problem [Chelcea, DAC 07]
       – D-latch is closed during active computation (filters glitches)
       – D-latch is opened just before the stage computation stabilizes
       + 2x improvement in energy-delay*
       – 10% performance slowdown*
     * Mediabench suite [Lee, Micro 97]
     [Figure: bundled-data pipeline stage with and without an SR controller; labels: Latch, Data, C, H/S, Delay, SR, Req, Ack]

  5. Contributions
     • Build heterogeneous pipelines
       – Use D-latches for timing-critical stages
       – Use SR-latches for the rest
     • Module selection problem
       – What is timing critical?
       – When is an SR-latch warranted?
     • Automatic latch selection algorithm
       – Experimental results: heterogeneous pipelines match the performance of D-latch pipelines and are more energy-efficient than either homogeneous D-latch or homogeneous SR-latch pipelines

  6. Outline
     • Introduction
     • Latch Selection Algorithm (current section)
     • Experimental Results
     • Conclusions

  7. Latch Selection Algorithm
     • Objectives: get the best of both worlds
       – Performance of D-latches
       – Energy efficiency of SR-latches
     • Approach: balance the use of SR-latches
       – Too many → bigger and slower designs
       – Too few → high energy consumption
     • Algorithm properties: three heuristics are used to track timing criticality and to estimate the effect of datapath glitches

  8. Power Heuristics
     • Data glitches are proportional to the datapath fanout
       – Use SR-latches if fanout >= 2
     • Protect computation-intensive stages
       – Assign SR-latches to their inputs
       – Bit-operations on the datapath are used to estimate computation intensiveness
     [Figure: stage graph with a legend for "glitch" and "stage"; several stages are marked "sr"]

  9. Timing Criticality
     • SR-latch controllers introduce delay when opening latches
       – Use a D-latch if the stage is timing critical
     • Determine the system's critical stages using the Global Critical Path (GCP) [Venkataramani, DAC 07]

  10. Timing Analysis
     • Analysis produces steady-state event firing times
       – Events are handshake signal transitions
       – Behaviors are dependence relations between events
     • Cycle time: time difference between successive occurrences of the same event
     • Alternative representation of cycle time
       – Set of slack values: time difference between the input events of a behavior
       – Global Critical Path (GCP): longest zero-slack path
       – Global slack: timing budget for GCP tolerance, i.e., how much a stage can be slowed without changing the GCP

  11. Event Slack
     • Use the concept of Time Separation of Events (TSEs) to compute slack, GCP and global slack
     • A 0-slack input is locally critical [Fields, ISCA 01]
     [Figure: behavior b with input events e1 and e2 and output event e3; e1 is the last-arrival input; time axis marked at 5 and 9; Slack(e2, b) = 4, Slack(e1, b) = 0]
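The slack values in the figure can be reproduced with a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `event_slacks` is hypothetical, and the firing times (e2 at time 5, e1 at time 9) are an assumed reading of the figure's time axis.

```python
def event_slacks(arrivals):
    """Return the slack of each input event of a single behavior.

    `arrivals` maps an input event name to its steady-state firing time.
    The last-arriving input gets slack 0 (it is locally critical); every
    other input's slack is the time it waits for that last arrival.
    """
    last = max(arrivals.values())
    return {event: last - time for event, time in arrivals.items()}


# Assumed reading of the figure: e2 fires at time 5, e1 fires at time 9.
slacks = event_slacks({"e1": 9, "e2": 5})
print(slacks)  # {'e1': 0, 'e2': 4}
```

Under these assumed times the result matches the slide: Slack(e1, b) = 0 and Slack(e2, b) = 4.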

  12. Global Critical Path (GCP)
     • GCP is the longest path of zero-slack input events [Venkataramani, DAC 07]
       – Equivalent to the critical cycle
     • Bottom-up computation of cycle time:
       – Length of GCP cycle = cycle time
     • GCP is the sequential critical path of the system. It represents the primary performance bottleneck

  13. Global Slack
     • Minimum cumulative slack to the GCP
       – If the event is on the GCP:
         • GSlack(b4) = GSlack(b5) = 0
       – Otherwise:
         • GSlack(b2) = 1
         • GSlack(b3) = 4
         • GSlack(b1) = Min(0+4, 2+1) = 3
     • Global slack is a measure of how much a behavior can be delayed without affecting global performance
     [Figure: graph of behaviors b1 to b5 with slack values on the edges; the GCP passes through b4 and b5]
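The numbers on this slide are consistent with a simple recurrence: global slack is 0 on the GCP, and otherwise the minimum over outgoing edges of the edge slack plus the global slack of the behavior it leads to. The Python sketch below works through that reading. The edge list is an assumption reconstructed from the printed values (the figure itself is not recoverable), the function name `global_slack` is hypothetical, and the graph is assumed to be acyclic off the GCP.

```python
def global_slack(edges, on_gcp):
    """Compute global slack for every behavior.

    `edges[b]` maps each successor of behavior `b` to the local slack on
    that edge; `on_gcp` is the set of behaviors on the Global Critical Path.
    """
    memo = {}

    def visit(b):
        if b in on_gcp:
            return 0  # behaviors on the GCP have no slack to spare
        if b not in memo:
            # Cheapest cumulative slack over all ways of reaching the GCP.
            memo[b] = min(s + visit(nxt) for nxt, s in edges[b].items())
        return memo[b]

    return {b: visit(b) for b in set(edges) | set(on_gcp)}


# Assumed topology, chosen to agree with the values printed on the slide.
example_edges = {
    "b1": {"b3": 0, "b2": 2},
    "b2": {"b4": 1},
    "b3": {"b5": 4},
}
gslack = global_slack(example_edges, on_gcp={"b4", "b5"})
print(gslack["b1"])  # 3, i.e. min(0 + 4, 2 + 1)
```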

  14. Timing Criticality Heuristic
     • Let ∆sr be the delay overhead introduced by SR-latches
     • Iterative algorithm: assign an SR-latch when global slack is larger than ∆sr
       – Update timing
       – Repeat; look for more opportunities

  15. Latch Selection Algorithm Overview
     Input: G = (V, E). Output: Vsr, the set of stages that receive SR-latches.
     a) Add all v in V to Vsr, if Fanout(v) >= 2
     b) Add all v in V to Vsr, if (v, u) is in E and BitOps(u) >= BOmax
     c) v_best = stage in (V – Vsr) with most GSlack
     d) Is GSlack(v_best) > ∆sr?
        – Yes: add v_best to Vsr, update GSlack, and go back to c)
        – No: stop and output Vsr
     Complexity: O(|V||E| + |V|^2)
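The flowchart can be sketched in Python as follows. This is a hedged illustration rather than the CASH implementation: `select_sr_latches`, its parameters, and the `gslack_of` callback are hypothetical names, and the toy timing model (each SR-latch on a predecessor costs a stage ∆sr of slack) is an assumption that merely stands in for the GCP-based timing analysis.

```python
def select_sr_latches(stages, edges, bit_ops, gslack_of, delta_sr, bo_max):
    """Return the set of stages that should receive SR-latches.

    stages    -- list of stage names (V)
    edges     -- list of (v, u) pairs (E)
    bit_ops   -- dict: stage -> estimated bit-operations
    gslack_of -- callable: current SR set -> dict of stage -> global slack
                 (placeholder for the real timing update)
    """
    fanout = {v: 0 for v in stages}
    for v, _ in edges:
        fanout[v] += 1

    # Power heuristics: high fanout, and inputs of compute-intensive stages.
    v_sr = {v for v in stages if fanout[v] >= 2}
    v_sr |= {v for v, u in edges if bit_ops[u] >= bo_max}

    # Timing heuristic: greedily use up global slack larger than delta_sr.
    while True:
        candidates = [v for v in stages if v not in v_sr]
        if not candidates:
            break
        gslack = gslack_of(v_sr)  # "Update GSlack" step
        v_best = max(candidates, key=lambda v: gslack[v])
        if gslack[v_best] <= delta_sr:
            break
        v_sr.add(v_best)
    return v_sr


# Toy example; all numbers below are made up for illustration.
STAGES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
BIT_OPS = {"a": 1, "b": 2, "c": 2, "d": 64, "e": 4}
BASE_SLACK = {"a": 3, "b": 1, "c": 1, "d": 0, "e": 5}
DELTA_SR = 2


def toy_gslack(v_sr):
    # Assumed model: every SR-latched predecessor removes DELTA_SR of slack.
    return {
        v: BASE_SLACK[v]
        - DELTA_SR * sum(1 for p, s in EDGES if s == v and p in v_sr)
        for v in STAGES
    }


chosen = select_sr_latches(STAGES, EDGES, BIT_OPS, toy_gslack, DELTA_SR, bo_max=32)
print(sorted(chosen))  # ['a', 'b', 'c', 'e']
```

In this toy run, stage a is selected for its fanout, b and c because they feed the compute-intensive stage d, and e because its slack exceeds ∆sr; d keeps a D-latch because it has no slack to spare.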

  16. Outline
     • Introduction
     • Latch Selection Algorithm
     • Experimental Results (current section)
     • Conclusions

  17. Experimental Setup
     • Implemented the latch selection algorithm within CASH, a compiler synthesizing 4-phase bundled circuits from C [IWLS 04]
     • Applied to 15 Mediabench kernels [Lee, Micro 97]
     • Circuits mapped to a 180 nm / 2 V STMicro standard-cell library
     • Synopsys DC used to estimate energy; Modelsim used for gate-level timing estimation

  18. Impact of Heuristics
     • Combined effect of heuristics contributes to energy efficiency
     [Figure: stacked bar chart of % contribution (0% to 100%) per kernel, K1.adpcm_d through K15.pgp_e; series: D-Latch Ops, and SR-Latch Stages split by heuristic (Gslack, Compute-intensive, Fanout)]

  19. Energy-Delay
     [Figure: bar chart of energy-delay as a ratio to D-latch (axis 0 to 3) per kernel, K1.adpcm_d through K15.pgp_e, plus GM; series: Heterogeneous, SR-Latch]

  20. End-to-End Execution Time
     [Figure: bar chart of end-to-end execution time as a ratio to D-latch (axis 0.6 to 1.1) per kernel, K1.adpcm_d through K15.pgp_e, plus GM; series: Heterogeneous, SR-Latch]

  21. Outline
     • Introduction
     • Latch Selection Algorithm
     • Experimental Results
     • Conclusions (current section)

  22. Conclusions
     • D-latches are power-hungry and SR-latches are slow for bundled-data pipelines
     • Heterogeneous latch selection algorithm
       – Global slack to guide timing-critical selection
       – Simple heuristics to guide power-critical selection
     • Heterogeneous latch pipelines are more energy-efficient than either homogeneous D-latch or homogeneous SR-latch pipelines

  23. Thank You! Questions?

  24. Self-Resetting (SR) Latches [Chelcea, DAC 07]
     • Area and control-path power overhead
     • Timing overhead
     • Benefit: datapath power savings
     [Figure: SR-latch circuit; labels: trigger, En, SR cntrl, D-latch, Din, Dout, Done, C]

  25. SR-latch behavior
     • Eliminate glitches:
       – open only after data is ready
       – close as soon as data is latched
     • Eliminate overheads:
       – open before the handshake starts
     [Figure: STG specification [Chelcea, DAC 07] over the transitions En+, EnSR+, Done+, En-, EnSR-, Done-; annotations: "Data ready", "Open latches to pass data", "When data latched, close the latches"]
