A Systems Approach to Computing in Beyond CMOS Fabrics
A. D. Patil, N. R. Shanbhag, L. R. Varshney, E. Pop, H.-S. P. Wong, S. Mitra, J. Rabaey, J. Weldon, L. Pileggi, S. Manipatruni, D. Nikonov, and I. A. Young
A SONIC Center presentation (www.sonic-center.org)
Machines are Beating Humans at Complex Inference Tasks … [The Guardian, May 2017]
Extracting information from data: recognition, prediction, classification, decision-making
DATA → INFORMATION
… but the energy cost is HUGE
Existing Solutions are Reaching their Limits
[figure: memory array (row decoder, WL drivers, precharge, column muxes, sense amplifiers) feeding a digital processor over a K-bit bus; device data from [Pop-NanoResearch-2010]]
device:
• diminishing energy-delay benefits from CMOS scaling
• variations dominate
architecture:
• von Neumann architecture mismatched to inference workload requirements
• data movement problem (memory wall)
Fundamental question: how do we design intelligent machines that operate at the limits of energy efficiency?
Research Inspiration
Shannon-inspired (the stochastic channel) and brain-inspired (the stochastic neural fabric): reliable systems operating at the limits of energy efficiency employ stochastic components & statistical models of computation.
Key: embrace the stochasticity of nanoscale fabrics via Shannon/brain-inspired models of computation
Statistical Information Processing
A principled approach to non-von Neumann computing: a journey from systems to devices
• Information-based metrics, fundamental limits, design principles: fundamental limits on energy efficiency; application-derived metric ŷ = arg max_y P(y | y₁, …, y_N)
• Principles of Shannon/brain-inspired architectures: e.g., the deep in-memory architecture
• Stochastic nanofunctions, models, prototypes: in CMOS & beyond CMOS
[figures: estimator/detector recovering a corrected output from noisy observations; deep in-memory architecture with bitline processors (BLP), cross-bitline processor (CBLP), and ADC]
Statistical Information Processing (the system stack)
• algorithms ← application requirements
• architectures ← Shannon & brain-inspired models of computing
• device & circuit ← nanofunctions, beyond-CMOS devices
• prototypes in & beyond CMOS
Systems in Beyond CMOS using Shannon & Brain-inspired Models of Computing
• hyperdimensional (HD) computing via 3D VRRAM (brain-inspired)
• Shannon-inspired spintronics
Hyperdimensional (HD) Computing [Rabaey, Olshausen, Mitra, Wong]
An HD mapper expands narrow inputs (e.g., 4-bit symbols) into ≈ 10,000-bit hypervectors, which an HD output stage manipulates with the MAP algebra:
• S = A + B + C: S is similar to A, B, and C
• S = A * B * C: S is dissimilar to all of A, B, and C
• almost all hypervectors are nearly orthogonal
• requires random mapping; related to Shannon's random codes
MAP operations:
• orthogonalization via local Multiply (XOR)
• summarization via local Add (OR)
• ordered summarization via Permutation
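To make the MAP algebra concrete, here is a minimal NumPy sketch of the three operations (our illustration, not SONIC code; the addition uses the common majority-rule variant alongside the local-OR option named above):

```python
import numpy as np

D = 10_000  # hypervector dimensionality (the slides use ~10,000-bit vectors)
rng = np.random.default_rng(0)

def random_hv():
    """Dense random binary hypervector; random high-D vectors are nearly orthogonal."""
    return rng.integers(0, 2, D, dtype=np.uint8)

def multiply(*hvs):
    """MAP 'M': component-wise XOR binds vectors; the result is dissimilar to its inputs."""
    out = np.zeros(D, dtype=np.uint8)
    for hv in hvs:
        out ^= hv
    return out

def add(*hvs):
    """MAP 'A': component-wise majority summarizes vectors; the result stays
    similar to each input (the slide's local OR is another variant)."""
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)

def permute(hv, k=1):
    """MAP 'P': cyclic shift encodes order (e.g., position within a sequence)."""
    return np.roll(hv, k)

def hamming(a, b):
    """Normalized Hamming distance: ~0.5 between unrelated vectors, <0.5 for similar ones."""
    return float(np.mean(a != b))

A, B, C = random_hv(), random_hv(), random_hv()
print(hamming(add(A, B, C), A))       # ~0.25: the sum is similar to its inputs
print(hamming(multiply(A, B, C), A))  # ~0.50: the product is dissimilar to its inputs
```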
RRAM Stochasticity for Random Mapping
Binary bits are statistically produced and stored in RRAMs: high-resistance state = 0, low-resistance state = 1. A SET pulse below the deterministic switching voltage (V_TE < V_SET) switches a cell from '0' to '1' with probability P_SET < 1 via stochastic oxygen-vacancy (V_O) formation. Experimental data show P_SET sweeping from ~0% to 100% as pulse amplitude (0.7 V to 1.1 V) and pulse width (10² ns to 10⁴ ns) vary; biasing at P_SET ≈ 50% produces random hypervector bits (e.g., 1001100111……0100111101). [H. Li, …, H.-S. P. Wong, IEDM, 2016]
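A toy software model of this bit source (the sigmoid P_SET(V) below is a hypothetical stand-in for the measured amplitude/width dependence):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_set(v_pulse, v50=0.9, slope=20.0):
    """Hypothetical sigmoid P_SET(V): ~50% at v50 volts, rising with amplitude.
    A stand-in for the measured pulse-amplitude/width surface in the IEDM data."""
    return 1.0 / (1.0 + np.exp(-slope * (v_pulse - v50)))

def rram_hypervector(d=10_000, v_pulse=0.9):
    """Each cell SETs ('0' -> '1') independently with probability P_SET; biasing
    the array at P_SET ~= 50% yields an unbiased random hypervector."""
    return (rng.random(d) < p_set(v_pulse)).astype(np.uint8)

print(rram_hypervector().mean())  # ~0.5 at the 50% bias point
```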
3D VRRAM as In-Memory Computing Kernels
VRRAM: Vertical Resistive Random Access Memory. RRAM cells form at the crossings of vertical pillar electrodes and stacked plane electrodes (word lines); select lines, bit lines, and a FinFET select transistor per pillar give a 1T-4R stack. Device cross-section: TiN/Ti top electrode over four TiN-electrode RRAM layers (L1 to L4), with 20-50 nm feature sizes. Fabricated by NDL, Taiwan.
In-Memory MAP Kernels using 3D RRAM
Multiplication (XOR), Addition, and Permutation are executed directly inside the 3D VRRAM array: hypervector bits are stored as resistance states, layer inputs (V_DD/gnd on L1 to L4) and the pillar address select the operands, and the result is evaluated from the read current using 200-ns pulses. The measured resistance window is wide (LRS ≈ 10 kΩ; HRS 400 kΩ to 1 MΩ) and remains stable across many logic-evaluation and addition cycles. [H. Li, …, H.-S. P. Wong, IEDM, 2016]
3D VRRAM In-Memory HD Architecture
• Device-architecture co-design for 3D VRRAM-centric HD computing
• MAP operations of HD computing mapped onto 3D VRRAM arrays (4 kb × 36 layers): random projection of letters onto HD vectors via P_SET ≈ 50% sampling, trigram computation with XOR and permutation, binding via addition, storage of language/text maps (one per text), and inference by measuring Hamming distance (HamD) to identify the 'nearest' map; one-shot learning
• > 400× area reduction (28-nm node) compared with a digital design [1]
[figure: area breakdown (cell array, SA, MUX, XOR/ADD routing) and total area vs. HD vector size (1 kb to 10 kb) for the 28-nm LP digital and 3D VRRAM designs]
[1] A. Rahimi et al., ISLPED, p. 64, 2016
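The algorithm column is essentially the n-gram language-recognition flow of Rahimi et al. [1]. A condensed sketch, continuing the NumPy helpers (random_hv, multiply, add, permute, hamming) from the earlier MAP block, with toy training texts of our own:

```python
def encode_text(text, item_memory):
    """Bind each trigram as permute(a, 2) XOR permute(b, 1) XOR c, then bundle
    all trigram vectors into a single text map with majority addition."""
    trigrams = [multiply(permute(item_memory[a], 2), permute(item_memory[b], 1),
                         item_memory[c])
                for a, b, c in zip(text, text[1:], text[2:])]
    return add(*trigrams)

# Item memory: one random hypervector per letter (the stochastic RRAM bit
# source above could supply these). Training: one map per language.
item_memory = {ch: random_hv() for ch in "abcdefghijklmnopqrstuvwxyz "}
lang_maps = {
    "en": encode_text("the quick brown fox jumps over the lazy dog", item_memory),
    "xx": encode_text("zzz qqq xxx zzz qqq xxx zzz qqq xxx zzz qqq", item_memory),
}
# Inference: measure Hamming distance and pick the 'nearest' map.
query = encode_text("the lazy dog sleeps", item_memory)
print(min(lang_maps, key=lambda k: hamming(query, lang_maps[k])))  # -> en
```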
Shannon-inspired Spintronics [Patil, Shanbhag, Manipatruni, Nikonov, Young, MMM-Intermag'16, arXiv'17]
• Can we make all-spin logic competitive with CMOS?
• deterministic nanomagnet switching costs much energy & delay; SPIN trails CMOS in the deterministic regime [Nikonov-JXCDC-2015]
• we need to design spintronic systems in the stochastic regime
• But first… we need system-friendly abstract models of spin devices
Systems-friendly Nanomagnetic Model
40× reduction in switching energy as the switching error rate ε is relaxed from 10⁻¹⁰ to 10⁻¹.
Following [Butler-TMAG-2012], the model relates the switching error rate ε and the switching energy E of a spin-torque-switched nanomagnet to the current overdrive factor i (relative to the critical supply current I_crit) and the pulse duration.
ε: switching error rate; E: switching energy; i: current overdrive factor; I_crit: critical supply current
[figure: switching energy vs. error rate for several overdrive factors, with annotated energy reductions of 6× to 9.4×]
Energy numbers from [Manipatruni, et al., Physical Review Applied, 2016]
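A back-of-the-envelope sketch of this energy-vs-error trade-off (not the slide's exact model): we assume the common macrospin approximation ε ≈ (π²Δ/4)·exp(−2(i−1)T/τ_D) for thermal stability Δ and pulse width T, pulse energy E = (i·I_crit)²·R·T, and purely illustrative parameter values:

```python
import numpy as np

# Illustrative parameters (NOT from the slides): thermal stability factor,
# relaxation time, critical current, resistance, and pulse width.
DELTA, TAU_D = 40.0, 1e-9         # thermal stability, relaxation time (s)
I_CRIT, R, T = 100e-6, 1e3, 1e-9  # critical current (A), resistance (ohm), pulse (s)

def overdrive(eps):
    """Overdrive factor i needed to hit error rate eps within pulse T, inverting
    the macrospin approximation eps ~ (pi^2*Delta/4) * exp(-2*(i-1)*T/tau_D)."""
    return 1.0 + (TAU_D / (2 * T)) * np.log(np.pi**2 * DELTA / (4 * eps))

def energy(eps):
    """Joule heating of the switching pulse: E = (i * I_crit)^2 * R * T."""
    return (overdrive(eps) * I_CRIT) ** 2 * R * T

for eps in (1e-10, 1e-4, 1e-1):
    print(f"eps = {eps:.0e}: i = {overdrive(eps):5.2f}, E = {energy(eps) * 1e15:7.1f} fJ")
print(f"relaxing eps from 1e-10 to 1e-1 saves {energy(1e-10) / energy(1e-1):.1f}x energy")
```

Relaxing the tolerated ε lowers the required overdrive, and energy falls quadratically with it; the exact savings depend on Δ, T, and the device parameters, which is why the slide's 6× to 40× figures differ from this toy's output.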
The ε-Noisy Spin Gate Model
ε-noisy non-volatile AND gate: θ ∈ {0,1} is a Bernoulli random variable with Pr(θ = 1) = ε. When θ = 0 the error-free gate switches normally; when θ = 1 a virtual gate emulating the ε-noise blocks switching, so the non-volatile output retains its previous state.
• includes input dependence
• enables evaluation of complex networks
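A Monte Carlo sketch of this abstraction (our reading of the model: with probability ε the non-volatile gate fails to switch and holds its previous state; the cited papers define the exact semantics):

```python
import numpy as np

rng = np.random.default_rng(2)
EPS = 1e-2  # per-gate switching error rate epsilon

class NoisyGate:
    """epsilon-noisy non-volatile gate: computes its Boolean function with
    probability 1 - eps (theta = 0); with probability eps (theta = 1) it fails
    to switch and the output retains its previously stored state."""
    def __init__(self, fn):
        self.fn, self.state = fn, 0
    def __call__(self, *x):
        if rng.random() >= EPS:   # theta = 0: error-free switching
            self.state = self.fn(*x)
        return self.state         # theta = 1: stale non-volatile state persists

and1, and2, or1 = (NoisyGate(f) for f in
                   (lambda a, b: a & b, lambda a, b: a & b, lambda a, b: a | b))

# Monte Carlo evaluation of a small network computing f = (a & b) | (c & d);
# whether a stale state causes an output error depends on the inputs.
trials = 100_000
errors = 0
for _ in range(trials):
    a, b, c, d = rng.integers(0, 2, 4)
    errors += or1(and1(a, b), and2(c, d)) != ((a & b) | (c & d))
print(f"network error rate ~ {errors / trials:.4f} at per-gate eps = {EPS}")
```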
Shannon-inspired Statistical Error Compensation (SEC)
A fusion block combines the outputs of the error-prone main block and a low-complexity estimator to form ŷ = the MAP estimate of the error-free output.
Desired property: disparity between the two error probability mass functions. The main block's error η has a sparse PMF (rare, large-magnitude errors), while the estimator's error e has a dense PMF (frequent, small-magnitude errors). Given this disparity, the error compensator is robust (it drives the system error rate far below the component error rates) and efficient (the compensator is much smaller than the main block).
[Zhang, Shanbhag, IEEE Trans. Signal Processing, 2016] [Gonugondla, Shim, Shanbhag, ICASSP, 2016] [Abdallah, Shanbhag, IEEE J. Solid-State Circuits, 2013]
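An algorithmic-noise-tolerance-style sketch of SEC (our simplification, with an assumed threshold fusion rule rather than the papers' exact MAP fusion):

```python
import numpy as np

rng = np.random.default_rng(3)
N, THRESH = 100_000, 64  # samples; fusion decision threshold (assumed)

y_o = rng.integers(0, 1024, N)                    # error-free output
eta = np.where(rng.random(N) < 1e-3,              # main-block errors: sparse
               rng.integers(256, 1024, N), 0)     # (rare) but large-magnitude
y_main = y_o + eta
y_est = (y_o >> 4) << 4                           # estimator errors: dense but
                                                  # small (reduced precision),
                                                  # modeled by quantizing y_o

# Fusion: trust the main block unless it disagrees strongly with the estimate.
y_hat = np.where(np.abs(y_main - y_est) < THRESH, y_main, y_est)

for name, y in (("main", y_main), ("estimator", y_est), ("fused", y_hat)):
    print(f"{name:9s}: mean |error| = {np.mean(np.abs(y - y_o)):8.4f}")
```

Because the two error PMFs barely overlap, thresholding separates them: the fused output inherits the main block's accuracy when it is correct and the estimator's small error when it is not.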
Enhancing Disparity in Error Probability Mass Functions
Baseline: error distribution at the output of a 15-bit RCA with all path delays equal.
• Path Delay Balancing (PDB): a "maximally" slow network becomes a "minimally" error-prone network, without energy increase
• Path Delay Redistribution (PDR): generates a sparse error distribution
[figure: error PMFs (error probability vs. error magnitude) for the baseline, after PDB, and after PDB & PDR]
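A toy timing simulation (our construction, not the papers' methodology) of how carry-chain timing violations in a 15-bit RCA produce the baseline error PMF that PDB/PDR then reshape; long propagate runs are rare, so large-weight errors appear with low probability:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
BITS, T_GATE, T_CLK, N = 15, 1.0, 8.0, 50_000  # 15-bit RCA, tight clock budget

def timed_rca_error(a, b):
    """Toy timing model: sum bit i settles roughly T_GATE * (chain + 1) after
    the clock edge, where chain is the run of carry-propagate positions just
    below bit i; bits that miss T_CLK are counted as erroneous (weight 2^i)."""
    err, chain = 0, 0
    for i in range(BITS):
        if T_GATE * (chain + 1) > T_CLK:
            err += 1 << i                  # this sum bit misses timing
        p = ((a >> i) & 1) ^ ((b >> i) & 1)
        chain = chain + 1 if p else 0      # propagate extends the carry chain
    return err

pmf = Counter(timed_rca_error(*rng.integers(0, 1 << BITS, 2)) for _ in range(N))
for mag in sorted(pmf)[:8]:
    print(f"error weight {mag:6d}: prob {pmf[mag] / N:.4f}")
```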
Seizure Detection using a Support Vector Machine (SVM)
CHB-MIT EEG dataset → feature extraction → SVM classifier [Verma-JSSC-2010]
Decision rule: xᵀy + c ≷ 0 selects between z = 1 and z = 0
y: feature vector extracted from EEG signals
x: trained weight vector
c: trained scalar bias
z: decision; z = 1 ⇒ seizure
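A self-contained sketch of this classifier (synthetic features stand in for the CHB-MIT data, and a least-squares fit stands in for SVM training; only the decision rule matches the slide):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for EEG-derived feature vectors (the slide uses the
# CHB-MIT dataset): 8 features per window, seizure vs. non-seizure classes.
n, d = 400, 8
y_feat = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),    # z = 0: non-seizure
                    rng.normal(1.5, 1.0, (n // 2, d))])   # z = 1: seizure
labels = np.r_[np.zeros(n // 2), np.ones(n // 2)]

# Least-squares stand-in for the trained weight vector x and bias c
# (a deployed detector would use a proper SVM solver).
A = np.c_[y_feat, np.ones(n)]
w = np.linalg.lstsq(A, 2 * labels - 1, rcond=None)[0]
x, c = w[:-1], w[-1]

# Decision rule from the slide: z = 1 (seizure) iff x^T y + c > 0.
z = (y_feat @ x + c > 0).astype(int)
print(f"training accuracy: {(z == labels).mean():.3f}")
```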
Simulation Results
Gate count: main block 52.8k; error compensator 5.608k
RPE-EST: reduced-precision embedded estimator
Fundamental Limits on Energy, Delay and Accuracy
(what's possible, what's impossible, and the grey area in between) [Shannon, 1948]
• use abstract models of nanoscale fabrics to
  • obtain fundamental limits on information-processing capacity, and
  • extract design principles to approach those limits