
Bubble Razor: An Architecture-Independent Approach to Timing-Error Detection and Correction



  1. Bubble Razor: An Architecture-Independent Approach to Timing-Error Detection and Correction. Matthew Fojtik, David Fick, Yejoong Kim, Nathaniel Pinckney, David Harris, David Blaauw, Dennis Sylvester (mfojtik@umich.edu). Electrical Engineering & Computer Science Department, The University of Michigan, Ann Arbor

  2. Outline • Issues with Prior Razor • Bubble Razor Algorithm • Circuitry and Implementation • Area Overhead Tradeoffs • Test Chip Results

  3. Timing Margins. Margins for uncertainty: • Process Variation • Temperature Variation • Voltage Variation • Aging Effects. Associated Costs: • Lost performance • Lost energy • Tester time (tradeoff) [Figure: clock period compared with actual circuit delay; margins labeled Data, Process, Aging, Temperature, Voltage; the remainder is labeled lost performance/energy]

  4. Eliminating Margins • Always Correct: Tables, Canaries • Detect and Correct Error: Razor Style [Figure: Razor flip-flop with main DFF (D, Q, CLK) and shadow latch (DCLK), from S. Das et al. [VLSI 2005]] [Table: variation sources handled by each technique. Columns: Process (Global, Local), Ambient (Global Slow, Global Fast, Local Slow, Local Fast), Data. Marks per row: Table Lookup X X; Table & Sensors X X X; Canary Circuit X X; Razor Designs X X X X X X X (all seven columns)]

  5. Speculation Window and Hold Time • Speculation window linked to minimum delay constraint (hold time) [Figure: DFF A and DFF B with clocks CLK A and CLK B; speculation window marked]

  6. Architectural Invasiveness • Razor I Style: all flops reload previous values (pipeline IF, ID, EX, MEM, WB; S. Das et al. [VLSI 2005]) • Razor II Style: check stage and architectural replay (pipeline IF, ID, EX, MEM, CHK, WB; D. Blaauw et al. [ISSCC 2008], K. Bowman et al. [ISSCC 2008]) • Requires Designer Effort • RTL written with Razor in mind

  7. Fundamentals of Bubble Razor • Two-Phase Latch Timing: automatically convert flip-flop based design • Time Borrowing as Correction Mechanism: does not modify design architecture; does not require reloading / replaying instructions • Local Correction (Bubbles): break requirement of stalling entire chip at once

  8. Two Phase Latch Razor Timing • Larger speculation window • Minimum delay constraint the same as conventional design [Figure: latches LD A and LD B with clocks CLK A and CLK B]

  9. Time Borrowing as Error Correction • Bubble Razor: switch to latches, borrow time • Push-button approach • No Hold Time Issues • No metastability on datapath • Architecture Agnostic [Figure: DFF stages replaced by LD latch stages with TD blocks; waveform rows G (closed / open), D, and Error]

  10. Stalling Locally with Bubbles. Stalling the clock locally: • With flops, all registers hold data • With latches, half registers hold bubbles • Every latch stalls exactly once • Communication only between neighbors [Animation with a Time axis and labels 1 to 8; captions: Yellow stalls; Yellow tells Red to stall; Yellow tells downstream no new data exists; Red tells Purple to stall; Purple tells Blue to stall; Blue tells Green to stall; Not immediately overwritten; Yellow takes off again; Eventually it all resolves]

  11. Timing of Clock Waveforms [Figure: clock waveforms 1 to 4 over cycles 1 to 10; annotations: Timing violation; Should Arrive; Prevent Losing inst3; Prevent Losing inst2; Give time to Recover; Prevent Double Sampling inst1]

  12. Timing of Clock Waveforms [Figure: clock waveforms 1 to 4 over cycles 1 to 10; annotations: Timing violation; Should Arrive; Prevent Losing inst3; Prevent Losing inst2; Give time to Recover; Prevent Double Sampling inst1]

  13. Timing of Clock Waveforms [Figure: two sets of clock waveforms over cycles 1 to 10]

  14. Timing of Clock Waveforms [Figure: clock waveforms over cycles 1 to 10; annotations: Timing violation; Stall Neighbors; Stall]

  15. The Required Circuitry [Figure: four latch stages, each with a TD, CG, and B block]

  16. Error Detection And OR Circuitry [Figure: TD blocks and OR circuitry]

  17. Clock Gate Control Logic (CG, B) • A cluster stalls and sends bubbles to all neighbors if: told by a neighboring cluster; did not stall in the previous cycle • Equivalent to sending bubbles to “other” neighbors
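The stall rule on this slide can be sketched as a small simulation. This is an illustrative model, not the chip's control logic: the function name `propagate_bubbles`, the ring topology, and the step-based timing are assumptions. In particular, it assumes neighboring clusters are on opposite clock phases, so "the previous cycle" of a cluster is read as two phase steps earlier.

```python
def propagate_bubbles(neighbors, origin, max_steps=64):
    """Simulate bubble propagation between clusters (hypothetical model).

    neighbors: dict mapping each cluster to the clusters it talks to.
               Assumed bipartite, since neighbors are on opposite phases.
    origin:    the cluster that detects a timing error and stalls at step 0.
    Returns a dict mapping each cluster to the phase steps at which it stalled.
    """
    stalls = {c: [] for c in neighbors}
    stalls[origin].append(0)
    stalled_last = {origin}
    for t in range(1, max_steps):
        # Every cluster that stalled in the last step sends bubbles to all neighbors.
        told = {n for c in stalled_last for n in neighbors[c]}
        # A told cluster stalls unless it stalled in its own previous cycle (t - 2).
        stalled_now = {c for c in told if (t - 2) not in stalls[c]}
        if not stalled_now:
            break  # no new bubbles are sent, so the stall has resolved
        for c in stalled_now:
            stalls[c].append(t)
        stalled_last = stalled_now
    return stalls


# Eight clusters in a ring; cluster 0 detects the error.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
stalls = propagate_bubbles(ring, origin=0)
for cluster in sorted(stalls):
    print(cluster, stalls[cluster])
```

In this model the two bubble fronts travel around the ring in opposite directions and cancel where they meet, so each cluster stalls once, which is consistent with the "every latch stalls exactly once" property stated on slide 10.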

  18. Clustering with hMETIS • Widely used hypergraph partitioning program, hMETIS • Clusters must only contain members with the same phase • Create two graphs, and partition independently • Connected in hMETIS graph, if transitively connected in circuit • Edge weight = number of latches that form transitive connection [Figure: example circuit with latches 1 to 6 and the two weighted graphs derived from it]
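One way to read the graph construction on this slide is sketched below. The function `same_phase_graph`, the latch names, and the treatment of connections as undirected are all assumptions for illustration: two latches of the same phase are linked when they share an opposite-phase neighbor, and the edge weight counts how many such intermediate latches exist. The sketch only builds the weighted graph; it does not call hMETIS.

```python
from collections import defaultdict
from itertools import combinations


def same_phase_graph(phase, edges, target_phase):
    """Build the weighted graph for one clock phase (hypothetical reading).

    phase:  dict mapping latch name to "pos" or "neg".
    edges:  iterable of (a, b) circuit connections between latches.
    Returns {(u, v): weight} for latches u < v of target_phase, where weight
    is the number of opposite-phase latches adjacent to both u and v.
    """
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)

    weights = defaultdict(int)
    for middle, nbrs in adjacent.items():
        if phase[middle] == target_phase:
            continue  # only opposite-phase latches form the transitive link
        same = sorted(n for n in nbrs if phase[n] == target_phase)
        for u, v in combinations(same, 2):
            weights[(u, v)] += 1
    return dict(weights)


# Made-up example circuit: three positive and two negative latches.
phase = {"p1": "pos", "p2": "pos", "p3": "pos", "n1": "neg", "n2": "neg"}
edges = [("p1", "n1"), ("n1", "p2"), ("p1", "n2"), ("n2", "p2"), ("n2", "p3")]

pos_graph = same_phase_graph(phase, edges, "pos")
neg_graph = same_phase_graph(phase, edges, "neg")
print(pos_graph)
print(neg_graph)
```

The two resulting graphs would then be partitioned independently, as the slide describes, so that each cluster contains latches of a single phase.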

  19. Clustering Results • Tradeoff between sizes of OR gates: combining errors; combining bubbles • 100 negative clusters • 70 positive clusters

  20. Two Port Memory Boundary Approach • Must fit edge triggered memory into stalling algorithm

  21. “Managing” the Synthesis/APR Tools • Want balanced pipelines, no time borrowing: model razor latches as flip flops • Dynamic OR always followed by latch: model dynamic OR as static; model latch as flip flop (captures when latch closes) • Use regular ICG cells: can use conventional clock tree synthesis • Final design appears to be relatively “normal”: flip-flop based design with clock gating; everything is timing constrained • “Razorization” process is entirely automated: synthesis and netlist transformation scripts

  22. Retiming And Number of Latches • Retiming can increase the number of latches • Results in area overhead

  23. Area Overhead of Latch Transformation

  24. Speculation Window Size • Full Clock Phase (100%) Minus Delay of Error Propagation Circuits: maximum allowed by technique • Number / Location of Latches with Error Checking: maximum slowdown that does not result in unchecked error [Figure: speculation window]

  25. Where Error Checking is Needed • If circuit delay suddenly becomes 130% of its nominal value, all timing errors will be detected before the circuit fails [Figure: latches A, B, C, D with a 50% speculation window. Delay at PoFF per stage: 50%, 50%, 20%. Delay at worst per stage: 65%, 65%, 26%. Other labels: 15%, 30%, 156%, 91%, Leave B, Arrive C, Arrive D, and a “>50?” check at each of three latches]
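The numbers on this slide can be reproduced with a short calculation. This is a sketch under assumptions: the function name `accumulated_lateness` is hypothetical, each latch is assumed to open one 50% phase after the previous one, and no latch on the path is assumed to stall, so late data simply borrows time from the next stage.

```python
def accumulated_lateness(nominal_delays_pct, slowdown_pct, phase_pct=50):
    """Lateness (in % of a clock cycle) at each latch along a path.

    nominal_delays_pct: per-stage delays at the point of first failure.
    slowdown_pct:       delay scaling in percent, e.g. 130 for a 1.3x slowdown.
    phase_pct:          length of one clock phase, assumed equal to the window.
    """
    time = 0
    lateness = []
    for k, delay in enumerate(nominal_delays_pct, start=1):
        arrival = time + delay * slowdown_pct // 100
        opens_at = k * phase_pct              # when latch k becomes transparent
        lateness.append(max(0, arrival - opens_at))
        time = max(arrival, opens_at)         # early data waits, late data borrows
    return lateness


late = accumulated_lateness([50, 50, 20], slowdown_pct=130)
print(late)                        # lateness at B, C, D
print(all(x <= 50 for x in late))  # every value stays inside the 50% window
```

Under these assumptions the stage delays become 65%, 65%, and 26%, the data arrives at D at 156% of a cycle, and the lateness at B and C is 15% and 30%, matching the labels in the figure. Because no value exceeds the 50% window, a 130% slowdown never produces an undetected error in this example.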

  26. Path Distribution for Cortex-M3 [Figure panels: All Latches; Flip Flops; Positive Latches; Negative Latches]

  27. Area Increase from Error Checking [Figure annotations: 20% Area Overhead; 30% Timing Speculation]

  28. Implementation on ARM Cortex-M3

  29. Characterizing Throughput / Energy • Operating Point Set for Worst Case Operation: 85°C; 10% Supply Droop; 2σ Process; 5% Safety Margin • 200 MHz at 1.0 V

  30. Gains from Bubble Razor

  31. Gains from Bubble Razor

  32. Bubble Razor Results [Figure panels: Slow; Average; Fast]

  33. Bubble Razor Results • Throughput: Worst Case 200 MHz, 8.5 FFT/ms; First Failure 333 MHz, 14.2 FFT/ms; Optimum 425 MHz, 17.3 FFT/ms • Energy: Worst Case 1.0 V, 3.08 μJ/FFT; First Failure 0.775 V, 1.42 μJ/FFT; Optimum 0.725 V, 1.18 μJ/FFT
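The relative gains implied by these measurements can be checked with a few lines of arithmetic. The dictionary layout and helper names below are illustrative only; the numeric values are the ones listed on the slide.

```python
# Measured operating points from the slide (throughput and energy per FFT).
points = {
    "worst_case":    {"fft_per_ms": 8.5,  "uj_per_fft": 3.08},
    "first_failure": {"fft_per_ms": 14.2, "uj_per_fft": 1.42},
    "optimum":       {"fft_per_ms": 17.3, "uj_per_fft": 1.18},
}


def throughput_gain_pct(base, new):
    """Percent increase in throughput relative to the baseline."""
    return (new / base - 1) * 100


def energy_saving_pct(base, new):
    """Percent reduction in energy per FFT relative to the baseline."""
    return (1 - new / base) * 100


base = points["worst_case"]
for name in ("first_failure", "optimum"):
    p = points[name]
    gain = throughput_gain_pct(base["fft_per_ms"], p["fft_per_ms"])
    saving = energy_saving_pct(base["uj_per_fft"], p["uj_per_fft"])
    print(f"{name}: throughput +{gain:.1f}%, energy per FFT -{saving:.1f}%")
```

At the optimum point this gives roughly a 104% throughput increase and a 62% reduction in energy per FFT relative to the worst-case operating point, in line with the approximate 100% and 60% figures quoted in the conclusion.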

  34. Conclusion • First Razor style implementation on a complete, commercial processor (ARM Cortex-M3) • Proposed two-phase latch based Razor technique • Novel local replay algorithm • Demonstrated automated nature of technique • Successfully implemented and fabricated in 45nm • 60% energy efficiency or 100% throughput increase over worst case margining
