Heterogeneous Latch-based Asynchronous Pipelines
Girish Venkataramani, Tiberiu Chelcea, Seth C. Goldstein
Presenter: Tobias Bjerregaard
April 10th, 2008, ASYNC 2008
Outline
• Introduction (current)
• Latch Selection Algorithm
• Experimental Results
• Conclusions
Motivation
• Normally open latches are attractive for bundled-data designs, e.g., Mousetrap [Singh, ICCD 01]
+ High-performance: short critical path to open latch
– Power hungry: data glitches spill over to downstream stages
[Figure: bundled-data pipeline stage; labels: Latch, Data, H/S, C, Delay, Req, Ack]
Motivation
• Self-Resetting (SR) latches address the glitching problem [Chelcea, DAC 07]
– D-latch closed during active computation (filters glitches)
– D-latch is opened just before stage computation stabilizes
+ 2x improvement in energy-delay*
– 10% performance slowdown*
* Mediabench suite [Lee, Micro 97]
[Figure: pipeline stage with a plain latch and pipeline stage with an SR controller; labels: Latch, Data, H/S, C, Delay, SR, Req, Ack]
Contributions
• Build heterogeneous pipelines
– Use D-latches for timing-critical stages
– Use SR-latches for the rest
• Module selection problem
– What is timing critical?
– When is an SR-latch warranted?
• Automatic latch selection algorithm
– Experimental results: heterogeneous pipelines have equivalent performance to D-latches and are more energy-efficient than either homogeneous D-latch or SR-latch pipelines
Outline
• Introduction
• Latch Selection Algorithm (current)
• Experimental Results
• Conclusions
Latch Selection Algorithm
• Objectives: get the best of both worlds
– Performance of D-latches
– Energy efficiency of SR-latches
• Approach: balance the use of SR-latches
– Too many → bigger and slower designs
– Too few → high energy consumption
• Algorithm properties: three heuristics are used to track timing criticality and estimate the effect of datapath glitches
Power Heuristics
• Data glitches are proportional to the datapath fanout
– Use SR-latches if fanout >= 2
• Protect computation-intensive stages
– Assign SR-latches to inputs
– Bit-operations on datapath used to estimate computation intensiveness
[Figure: stage graph with glitch propagation; legend: glitch, stage; SR-latch positions marked "sr"]
Timing Criticality
• SR-latch controllers introduce delay when opening latches
– Use D-latch if stage is timing critical
• Determine the system's critical stages using the Global Critical Path (GCP) [Venkataramani, DAC 07]
Timing Analysis
• Analysis produces steady-state event firing times
– Events are handshake signal transitions
– Behaviors are dependence relations between events
• Cycle time: time between successive occurrences of the same event
• Alternative representation of cycle time
– Set of slack values: time difference between input events
– Global Critical Path (GCP): longest zero-slack path
– Global slack: timing budget for GCP tolerance
  • How much can a stage be slowed without changing the GCP?
Event Slack
• Use concept of Time Separation of Events (TSEs) to compute slack, GCP and global slack
[Figure: behavior b with input events e1 and e2 and output event e3; e1 is the last-arrival input; time axis marked at 5 and 9; input edges labeled with slack 0 and 4]
Slack(e1, b) = 0
Slack(e2, b) = 4
A 0-slack input is locally critical [Fields, ISCA 01]
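A minimal sketch of the slack computation on this slide, assuming the two marks on the time axis (5 and 9) are the firing times of e2 and e1 respectively. The function name `input_slacks` is hypothetical, and the sketch ignores per-edge delays, which a full TSE-based analysis would include.

```python
def input_slacks(arrival):
    """Slack of each input event of one behavior.

    `arrival` maps an input event name to its firing time.
    The slack of an input is how long it waits for the last-arriving
    input; the last-arriving input itself has slack 0 (locally critical).
    """
    last = max(arrival.values())
    return {event: last - time for event, time in arrival.items()}


# Assumed firing times read off the slide's time axis.
slacks = input_slacks({"e1": 9, "e2": 5})
print(slacks)  # {'e1': 0, 'e2': 4}
```

With these assumed times the result matches the slide: Slack(e1, b) = 0 and Slack(e2, b) = 4.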
Global Critical Path (GCP)
• GCP is the longest path of zero-slack input events [Venkataramani, DAC 07]
– Equivalent to the critical cycle
• Bottom-up computation of cycle time:
– Length of GCP cycle = cycle time
GCP is the sequential critical path of the system. It represents the primary performance bottleneck.
Global Slack
• Minimum cumulative slack to the GCP
– If event is on the GCP:
  • GSlack(b4) = GSlack(b5) = 0
– Otherwise:
  • GSlack(b2) = 1
  • GSlack(b3) = 4
  • GSlack(b1) = Min(0+4, 2+1) = 3
[Figure: graph of behaviors b1 to b5 with slack-labeled edges; the GCP is highlighted]
Global slack is a measure of how much a behavior can be delayed without affecting global performance.
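The worked values on this slide can be reproduced with a short sketch. The edge list below is an assumed reconstruction of the figure, chosen only so that it agrees with the stated results (for example, b1 reaches b3 with slack 0 and b2 with slack 2). The function name `global_slack` is hypothetical, and the sketch assumes the off-GCP part of the graph is acyclic.

```python
def global_slack(edges, gcp):
    """Minimum cumulative slack from each behavior to the GCP.

    `edges` maps (source, target) to the local slack on that edge.
    `gcp` is the set of behaviors on the Global Critical Path.
    Assumes no cycles among behaviors that are off the GCP.
    """
    successors = {}
    for (src, dst), slack in edges.items():
        successors.setdefault(src, []).append((dst, slack))

    memo = {}

    def visit(node):
        if node in gcp:
            return 0  # behaviors on the GCP have no slack
        if node not in memo:
            outs = successors.get(node, [])
            # Cheapest path to the GCP; infinity if the GCP is unreachable.
            memo[node] = min(
                (slack + visit(dst) for dst, slack in outs),
                default=float("inf"),
            )
        return memo[node]

    nodes = set(gcp) | {n for edge in edges for n in edge}
    return {n: visit(n) for n in sorted(nodes)}


# Assumed reconstruction of the slide's example graph.
EDGES = {("b1", "b3"): 0, ("b1", "b2"): 2, ("b2", "b4"): 1, ("b3", "b5"): 4}
print(global_slack(EDGES, gcp={"b4", "b5"}))
# {'b1': 3, 'b2': 1, 'b3': 4, 'b4': 0, 'b5': 0}
```

For b1 the two candidate paths cost 0+4 and 2+1, so the minimum is 3, as on the slide.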
Timing Criticality Heuristic
• Let Δ_sr be the delay overhead introduced by SR-latches
• Iterative algorithm: assign an SR-latch when global slack is larger than Δ_sr
– Update timing
– Repeat; look for more opportunities
Latch Selection Algorithm Overview
Input: G = (V, E); output: V_sr
1. Add all v in V to V_sr, if Fanout(v) >= 2
2. Add all v in V to V_sr, if (v,u) is in E and BitOps(u) >= BO_max
3. v_best = stage in (V – V_sr) with most GSlack
4. Is GSlack(v_best) > Δ_sr?
   Yes: add v_best to V_sr, update GSlack, go to step 3
   No: output V_sr
Complexity: O(|V||E| + |V|^2)
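The flowchart on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `select_latches`, `gslack_fn`, `bit_ops`, `delta_sr` and `bo_max` are hypothetical, and the timing update is delegated to a caller-supplied function because the slides do not spell out how GSlack is recomputed.

```python
def select_latches(stages, edges, bit_ops, gslack_fn, delta_sr, bo_max):
    """Return the set of stages that get an SR-latch (V_sr).

    stages    : list of stage names (V)
    edges     : list of (producer, consumer) pairs (E)
    bit_ops   : stage -> estimated bit-operations on its datapath
    gslack_fn : hypothetical callback; given the current V_sr it returns
                a dict stage -> global slack (stands in for timing analysis)
    delta_sr  : delay overhead of an SR-latch controller
    bo_max    : bit-operation threshold for "computation-intensive"
    """
    fanout = {v: 0 for v in stages}
    for producer, _ in edges:
        fanout[producer] += 1

    # Power heuristic 1: high-fanout stages spread glitches widely.
    v_sr = {v for v in stages if fanout[v] >= 2}
    # Power heuristic 2: protect the inputs of computation-intensive stages.
    v_sr |= {v for v, u in edges if bit_ops[u] >= bo_max}

    # Timing heuristic: spend global slack on further SR-latches.
    while True:
        gslack = gslack_fn(v_sr)  # "Update GSlack"
        candidates = [v for v in stages if v not in v_sr]
        if not candidates:
            break
        v_best = max(candidates, key=lambda v: gslack[v])
        if gslack[v_best] <= delta_sr:
            break
        v_sr.add(v_best)
    return v_sr


# Toy example; the static slack table is a stand-in for real timing analysis.
stages = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
bit_ops = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 40}
slack = {"a": 0, "b": 5, "c": 1, "d": 0, "e": 3}
chosen = select_latches(stages, edges, bit_ops, lambda v_sr: slack,
                        delta_sr=2, bo_max=32)
print(sorted(chosen))  # ['a', 'b', 'd', 'e']
```

In the toy run, stage a is picked by the fanout rule, d by the bit-operation rule (it feeds e), and b and e by the slack rule; c keeps its D-latch because its slack does not exceed the assumed overhead. A real flow would rerun timing analysis inside `gslack_fn`, since each added SR-latch consumes slack elsewhere.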
Outline
• Introduction
• Latch Selection Algorithm
• Experimental Results (current)
• Conclusions
Experimental Setup
• Implemented latch selection algorithm within CASH, a compiler synthesizing 4-phase bundled circuits from C [IWLS 04]
• Applied on 15 Mediabench kernels [Lee, Micro 97]
• Circuits mapped to [180nm/2V] STMicro standard-cell library
• Synopsys DC used to estimate energy, ModelSim used for gate-level timing estimation
Impact of Heuristics
[Figure: stacked bar chart, % contribution (0% to 100%) per kernel, K1.adpcm_d through K15.pgp_e; legend: D-Latch Ops, GSlack, Compute-intensive, Fanout, SR-Latch Stages]
Combined effect of heuristics contributes to energy efficiency
Energy-Delay
[Figure: bar chart of energy-delay as a ratio to D-Latch (axis 0 to 3) per kernel, K1.adpcm_d through K15.pgp_e, plus GM; series: Heterogeneous, SR-Latch]
End-to-End Execution Time
[Figure: bar chart of end-to-end execution time as a ratio to D-Latch (axis 0.6 to 1.1) per kernel, K1.adpcm_d through K15.pgp_e, plus GM; series: Heterogeneous, SR-Latch]
Outline
• Introduction
• Latch Selection Algorithm
• Experimental Results
• Conclusions (current)
Conclusions
• D-latches are power-hungry and SR-latches are slow for bundled-data pipelines
• Heterogeneous latch selection algorithm
– Global slack to guide timing-critical selection
– Simple heuristics to guide power-critical selection
• Heterogeneous latch pipelines are more energy-efficient than either homogeneous D-latch or homogeneous SR-latch pipelines
Thank You! Questions?
Self-Resetting (SR) Latches [Chelcea, DAC 07]
[Figure: D-latch with SR controller; labels: trigger, En, SR cntrl, Din, Dout, Done, C]
• Area and control-path power overhead
• Timing overhead
• Benefit: datapath power savings
SR-latch behavior
[Figure: STG specification [Chelcea, DAC 07] with transitions En+, EnSR+, Done+, En-, EnSR-, Done-; annotations: data ready; open latches to pass data; when data latched, close the latches]
• Eliminate glitches:
– open only after data is ready
– close as soon as data latched
• Eliminate overheads:
– open before handshake starts