Near-Threshold Computing: Reclaiming Moore’s Law
Dr. Ronald G. Dreslinski, Research Fellow, University of Michigan – Ann Arbor
University of Michigan EnA-HPC -- September 7, 2011

Motivation
[Figure: log-scale trends, 1985 to 2020, for Transistors (100,000's), Power (W), Performance (GOPS), and Efficiency (GOPS/W)]
• Limits on heat extraction
• Stagnates performance growth
• Limits on energy-efficiency of operations

Motivation
[Figure: log-scale trends, 1985 to 2020, for Transistors (100,000's), Power (W), Performance (GOPS), and Efficiency (GOPS/W); the timeline is split c. 2000 into the Era of High Performance Computing and the Era of Energy-Efficient Computing]
• Goal: To increase energy-efficiency of operations
• With the help of some better thermal management…
• Result: Continue scaling trends that fueled the computing revolution

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

Power Density Limitations
• Circuit supply voltages are no longer scaling…
• Power does not decrease at the same rate that transistor count increases
• Environmental Concerns
• Form factor vs. Battery Life
• A = gate area, scaling 1/s^2
• C = capacitance, scaling < 1/s
• Dynamic dominates
[Figure labels: Stagnant, Shrinking]
• Dark Silicon, the emerging dilemma: more and more gates can fit on a die, but not all can be turned on at the same time
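
The scaling bullets above can be turned into a rough power-density estimate. This is a minimal sketch, assuming the standard dynamic-power relation P = C * V^2 * f per gate (not stated on the slide), with supply voltage and frequency held fixed. The function name and the cap_exponent parameter are illustrative; the default of 1.0 is the 1/s boundary case named on the slide.

```python
def power_density_growth(s: float, cap_exponent: float = 1.0) -> float:
    """Relative dynamic power density after one scaling step by factor s.

    Assumptions (illustrative, not from the slide):
      - per-gate dynamic power is C * V^2 * f, with V and f unchanged
      - gate area scales as 1/s^2 (from the slide)
      - capacitance scales as 1/s^cap_exponent (1.0 = the 1/s boundary case)
    """
    power_per_gate = s ** (-cap_exponent)  # only C changes
    area_per_gate = s ** (-2)              # A scales as 1/s^2
    return power_per_gate / area_per_gate


# With s = 2 and capacitance scaling as 1/s, density doubles.
print(power_density_growth(2.0))  # 2.0
```

Under these assumptions power density grows by a factor of s per step, which is one way to read the dark-silicon dilemma: the gates fit, but the power budget per unit area does not follow.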

Today: Super-Vth, High Performance, Power Constrained
[Figure: Normalized Power, Energy, & Performance vs. Supply Voltage (0, Vth, Vnom), with curves for Energy / Operation and Log (Delay); Super-Vth region annotated with Core i7, 3+ GHz, 0.5 mW/MHz]
• Energy per operation is the key metric for efficiency.
• Goal: same performance, low energy per operation

Subthreshold Design
[Figure: Energy / Operation and Log (Delay) vs. Supply Voltage (0, Vth, Vnom), with Super-Vth and Sub-Vth regions; Energy / Operation: 12-16X; Delay: 500 – 1000X]
• Operating in the sub-threshold region gives us huge power gains at the expense of performance
• OK for sensors!

Evolution of Subthreshold Designs
Subliminal 1 Design (2006)
- 0.13 µm CMOS
- Used to investigate existence of Vmin
- 2.60 µW/MHz
Subliminal 2 Design (2007)
- 0.13 µm CMOS
- Used to investigate process variation
- 3.5 µW/MHz
Phoenix 1 Design (2008)
- 0.18 µm CMOS
- Used to investigate sleep current
- 2.8 µW/MHz / 30pW sleep power
Phoenix 2 Design (2010)
- 0.18 µm CMOS
- Commercial ARM M3 Core
- Used to investigate:
  • Energy harvesting
  • Power management
- 37.4 µW/MHz

Near-Threshold Computing (NTC)
[Figure: Energy / Operation and Log (Delay) vs. Supply Voltage (0, Vth, Vnom), with Super-Vth and Sub-Vth regions; Energy / Operation labels: ~6-8X, ~2X; Delay labels: ~50-100X, ~10X]
Near-Threshold Computing (NTC):
• >60X power reduction
• 6-8X energy reduction
• Invest portion of extra transistors from scaling to overcome barriers
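
The figure labels on this slide appear to chain together with the Subthreshold Design slide. The sketch below assumes ~6-8X energy and ~10X delay apply from Super-Vth to NTC, and ~2X energy and ~50-100X delay from NTC to Sub-Vth. That assignment is a reading of the flattened figure, not something the text states. The constant and function names are illustrative.

```python
# Factors as (low, high) ranges, read from the labels on the NTC slide.
# Which label belongs to which voltage step is an assumption.
ENERGY_SUPER_TO_NTC = (6, 8)
ENERGY_NTC_TO_SUB = (2, 2)
DELAY_SUPER_TO_NTC = (10, 10)
DELAY_NTC_TO_SUB = (50, 100)


def compose(first, second):
    """Multiply two (low, high) factor ranges end to end."""
    return (first[0] * second[0], first[1] * second[1])


energy_super_to_sub = compose(ENERGY_SUPER_TO_NTC, ENERGY_NTC_TO_SUB)
delay_super_to_sub = compose(DELAY_SUPER_TO_NTC, DELAY_NTC_TO_SUB)
# Power = energy / time, so power reduction = energy gain * slowdown.
power_super_to_ntc = compose(ENERGY_SUPER_TO_NTC, DELAY_SUPER_TO_NTC)

print(energy_super_to_sub)  # (12, 16)
print(delay_super_to_sub)   # (500, 1000)
print(power_super_to_ntc)   # (60, 80)
```

Under that reading the products reproduce the 12-16X energy and 500 – 1000X delay quoted for sub-threshold operation, and the 60-80X power range is consistent with the ">60X power reduction" bullet.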

Silicon Verification of Trends
Phoenix 2 Processor: Phoenix 2 Design [Seok’11]
• 180nm Design
• 1.8V -> 700mV
• ~10x NTC Performance Loss
• ~7x NTC Energy Reduction
Seok ISSCC 2011
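
As a quick sanity check on the measured energy reduction, the sketch below assumes dynamic energy per operation scales as C * V^2 with capacitance unchanged and leakage ignored. Those simplifications are assumptions for illustration and are not taken from the cited measurement.

```python
def dynamic_energy_ratio(v_high: float, v_low: float) -> float:
    """Ratio of dynamic energy per operation at v_high versus v_low.

    Assumes E_dyn = C * V^2 with the same switched capacitance at both
    voltages, and ignores leakage energy entirely.
    """
    return (v_high / v_low) ** 2


ratio = dynamic_energy_ratio(1.8, 0.7)
print(f"{ratio:.1f}x")  # 6.6x
```

The V^2 estimate of roughly 6.6x is close to the ~7x reduction reported for the 1.8V -> 700mV operating points.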

NTC – Opportunities and Challenges
Opportunities:
• New architectures
• Optimized Processes
• 3D Integration – less thermal restrictions
Challenges:
• Low Voltage Memory
  - New SRAM designs
  - Robustness analysis at near-threshold
• Variation
  - Razor [Ernst’03] and other in-situ delay monitoring
  - Adaptive body biasing
• Performance Loss
  - Many-core designs to improve parallelism
  - Core boosting to improve single thread performance

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

Minimum Energy SRAM
[Figure: energy curves for Total, Dynamic, and Leakage]
• SRAM has a lower activity rate than logic
• As a result, its VDD for minimum energy operation (VMIN) is higher
• Running logic at the SRAM's VMIN has a small energy penalty with increased performance
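
To illustrate why a lower activity rate pushes VMIN up, here is a toy energy model. It is a sketch under stated assumptions, not the model used in the talk: every constant below is hypothetical, the drive current uses a generic smooth (EKV-style) interpolation, leakage current is taken as the drive current at zero gate bias, and capacitance is normalized to 1.

```python
import math

# All constants are hypothetical, chosen only to make the trend visible.
N_VT = 1.5 * 0.026   # slope factor times thermal voltage, in volts
V_TH = 0.30          # threshold voltage, in volts
LOGIC_DEPTH = 20     # gate delays per clock cycle


def on_current(v: float) -> float:
    """Smooth drive-current interpolation across sub- and super-threshold."""
    return math.log1p(math.exp((v - V_TH) / (2 * N_VT))) ** 2


def energy_per_op(v: float, activity: float) -> float:
    """Total energy per operation: dynamic plus leakage (C normalized to 1)."""
    dynamic = activity * v ** 2                    # a * C * V^2
    cycle_time = LOGIC_DEPTH * v / on_current(v)   # depth * C * V / I_on
    leakage = on_current(0.0) * v * cycle_time     # I_off * V * T_cycle
    return dynamic + leakage


def v_min(activity: float) -> float:
    """Grid-search the supply voltage that minimizes energy per operation."""
    grid = [mv / 1000 for mv in range(100, 1201, 5)]
    return min(grid, key=lambda v: energy_per_op(v, activity))


logic_vmin = v_min(activity=0.10)  # hypothetical logic activity rate
sram_vmin = v_min(activity=0.01)   # hypothetical, lower SRAM activity rate
print(logic_vmin, sram_vmin)       # the SRAM value comes out higher
```

In this model, lowering the activity rate shrinks only the dynamic term, so leakage dominates sooner as the voltage drops and the minimum moves to a higher voltage.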

New NTC Architectures
[Figure: two block diagrams, each with Next Level Memory above a BUS / Switched Network; one connects individual Core + L1 pairs, the other connects Clusters in which several Cores share an L1]
Key Insight:
• SRAM is run at a higher VDD than cores with little energy penalty, allowing caches to operate faster than the core
Design Levers:
• Operating Voltage
• L1 Size
• Number of Cores per Cluster
• Number of Clusters

L1 Cache Size Tradeoff
[Figure: two Core / L1 / L2 stacks with different L1 sizes]
• Decreased Miss Rate
• Higher Energy/Access

Results – Energy Optimal L1 Size (Single Core)
• Energy dependency on L1 size
• Trade-off between L1 and L2 access
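
The L1 versus L2 trade-off can be sketched with a toy model. Everything here is hypothetical and not taken from the talk's results: L1 energy per access is assumed to grow with the square root of capacity, the miss rate is assumed to fall with the square root of capacity (a common rule of thumb), and the constants are picked only to produce a visible optimum.

```python
import math

# Hypothetical constants, in arbitrary energy units.
E_L1_AT_8KB = 1.0    # L1 energy per access at 8 kB
MISS_AT_8KB = 0.10   # L1 miss rate at 8 kB
E_L2 = 80.0          # energy of one L2 access


def energy_per_access(size_kb: int) -> float:
    """Average memory energy per access: L1 hit energy plus L2 energy on misses."""
    scale = size_kb / 8
    e_l1 = E_L1_AT_8KB * math.sqrt(scale)     # bigger L1 costs more per access
    miss_rate = MISS_AT_8KB / math.sqrt(scale)  # bigger L1 misses less often
    return e_l1 + miss_rate * E_L2


sizes_kb = [8, 16, 32, 64, 128, 256]
best_size = min(sizes_kb, key=energy_per_access)
print(best_size)  # 64
```

With these made-up numbers the minimum falls at 64 kB, where the rising L1 access energy balances the falling L2 traffic. The real optimum depends on the workload and the operating voltage.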

Clustering Tradeoffs
[Figure: CPUs with private L1s compared with CPUs clustered around shared L1s, both backed by L2]
Tradeoffs:
+ Clustered Sharing
- Cluster Conflict
- New Bus
- L1 Speed

Energy Optimal Cluster-based CMP (Fixed Die Size)

Full Space Analysis

Various Scaling Methods
[Figure: Normalized Energy/Operation (0 to 1), split into Core, L1, and L2 components, for Uniprocessor, CMP w/ DVFS, and NTC; annotated reductions: 38%, 71%, 53%]
• Baseline: Single CPU @ 233MHz
• Simple CMP: 4 Cores, 4 L1’s, one core per L1, Vdd scaling
• Proposed cluster-based CMP: 2 Cores/Cluster, 3 Clusters, multiple cores per L1, Vdd scaling
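
The three percentages on this chart appear to be mutually consistent. This check assumes that 38% is the simple CMP's saving over the baseline and 71% is the cluster-based design's saving over the baseline. That reading of the flattened figure is an assumption; the variable names are illustrative.

```python
# Assumed reading of the chart annotations (not confirmed by the slide text).
cmp_saving_vs_baseline = 0.38
cluster_saving_vs_baseline = 0.71

# Remaining energy per operation, normalized to the baseline uniprocessor.
cmp_energy = 1.0 - cmp_saving_vs_baseline
cluster_energy = 1.0 - cluster_saving_vs_baseline

# Saving of the cluster-based design relative to the simple CMP.
cluster_saving_vs_cmp = 1.0 - cluster_energy / cmp_energy
print(f"{cluster_saving_vs_cmp:.0%}")  # 53%
```

Under that assumption the derived value matches the third annotation, 53%.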

Energy Optima for SPLASH2
• Cluster based architecture with Vdd and Vth scaling
• Optimal cluster size is 2 for most of the apps
• rad chooses a non-clustered CMP
• Average energy savings: 74% over baseline, 55% over simple CMP

App   n_c (clusters)   k (cores/cluster)   L1 size/kB   Energy savings over baseline   Energy savings over simple CMP
cho   3                2                   64           70.8%                          52.8%
fft   2                2                   32           72.6%                          68.5%
fmm   8                2                   128          79.7%                          41.6%
luc   3                2                   32           77.8%                          64.4%
lun   2                2                   64           69.2%                          58.0%
rad   16               1                   128          84.2%                          35.1%
ray   3                2                   128          65.1%                          54.9%
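
The stated averages can be checked against the seven rows listed on this slide. The sketch below is plain arithmetic over the table values; the tuple layout is illustrative.

```python
# (app, savings over baseline %, savings over simple CMP %), from the table.
rows = [
    ("cho", 70.8, 52.8),
    ("fft", 72.6, 68.5),
    ("fmm", 79.7, 41.6),
    ("luc", 77.8, 64.4),
    ("lun", 69.2, 58.0),
    ("rad", 84.2, 35.1),
    ("ray", 65.1, 54.9),
]

avg_vs_baseline = sum(r[1] for r in rows) / len(rows)
avg_vs_simple_cmp = sum(r[2] for r in rows) / len(rows)

print(round(avg_vs_baseline, 1))    # 74.2
print(round(avg_vs_simple_cmp, 1))  # 53.6
```

The baseline figure agrees with the stated 74%. The listed rows average to about 53.6% over the simple CMP, slightly below the stated 55%, so that figure may have been averaged differently or may include data not shown in the table.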

Energy Optima w/ Performance Requirements
[Figure annotations: 53%, 20%, 32%]
• Cluster based approach provides best savings
• Traditional approach only saves energy at high end

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

A Closer Look at Wafer-Level Stacking
[Figure: wafer cross-section; illustration from Bob Patti, Tezzaron]
Legend:
• Oxide
• Silicon
• Dielectric (SiO2/SiN)
• “Super-Contact”
• Gate Poly
• STI (Shallow Trench Isolation)
• W (Tungsten contact & via)
• Al (M1 – M5)
• Cu (M6, Top Metal)

Next, Stack a Second Wafer & Thin:

Then, Stack a Third Wafer:
[Figure layers, top to bottom: 3rd wafer, 2nd wafer, 1st wafer: controller]

Centip3De – 3D NTC Prototype
[Figure: layer stack, as listed: Logic - A, Logic - B, F2F Bond, Logic - B, Logic - A, DRAM Sense/Logic – Bond Routing, DRAM, F2F Bond, DRAM]
Centip3De Design
• 130nm, 7-Layer 3D-Stacked Chip
• 128 ARM M3 Cores
• 150 mm^2