PROMISE: An End-To-End Design of a PROgrammable MIxed-Signal AccElerator for Machine Learning Algorithms
Prakalp Srivastava*, Mingu Kang*, Sujan K. Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam S. Kim, Naresh Shanbhag
(psrivas2@illinois.edu, mingu.kang@ibm.com)
* Equal contribution
Supported by NSF, C-FAR, and SONIC
PROMISE | Srivastava, Kang et al. | University of Illinois at Urbana-Champaign
Machine Learning under Resource Constraints
• Embedded statistical inference: IoT, sensor-rich platforms
• Decision making under resource constraints
Energy Trend of Memory vs. Processing

Component-level energy trend in a modern processor [Horowitz, ISSCC14]

Computation energy (45nm)
Integer    ADD        Mult
8 bits     0.03 pJ    0.2 pJ
32 bits    0.1 pJ     3 pJ

Memory access energy (45nm), 64 bits
Cache 8 KB     10 pJ
Cache 32 KB    20 pJ
Cache 1 MB     100 pJ
DRAM           1.2 – 2.6 nJ

[Figure: accuracy vs. amount of operations and number of parameters [Canziani, Arxiv16]. Axes: top-1 accuracy [%] vs. operations [G-Ops]; legend for # of parameters: 35M, 155M. Networks shown: Inception-v4, Inception-v3, ResNet-152, ENet, VGG-16, VGG-19, BN-NIN, BN-AlexNet, AlexNet.]
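To make the gap between memory and computation concrete, the energy figures quoted on this slide can be compared directly. The sketch below only does arithmetic on those numbers; the dictionary names and the `ratio` helper are illustrative and not part of the talk.

```python
# Energy per operation at 45nm, in picojoules, as listed on the slide
# [Horowitz, ISSCC14]. Names below are illustrative, not from the talk.
COMPUTE_PJ = {
    ("add", 8): 0.03,
    ("add", 32): 0.1,
    ("mult", 8): 0.2,
    ("mult", 32): 3.0,
}

# Energy per 64-bit memory access, in picojoules.
MEMORY_PJ = {
    "cache_8KB": 10.0,
    "cache_32KB": 20.0,
    "cache_1MB": 100.0,
    "dram_low": 1200.0,   # 1.2 nJ
    "dram_high": 2600.0,  # 2.6 nJ
}


def ratio(memory_key, op, bits):
    """Number of compute operations that cost as much as one memory access."""
    return MEMORY_PJ[memory_key] / COMPUTE_PJ[(op, bits)]


if __name__ == "__main__":
    for mem in MEMORY_PJ:
        print(
            f"{mem}: one access costs about "
            f"{ratio(mem, 'mult', 8):.0f} 8-bit multiplies or "
            f"{ratio(mem, 'add', 32):.0f} 32-bit adds"
        )
```

Even the smallest cache access costs as much as dozens of 8-bit multiplies, which is the motivation for moving computation into the memory array.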
Deep In-memory Architecture (DIMA)
[Figure: DIMA block diagram: precharge/Y-decoder, X-decoders, bitline processors (BLP), cross bitline processor, and RDL producing the decision.]
• Deeply embeds analog computing at the periphery of the bitcell array
• Low-swing / low-SNR operations for aggressive energy efficiency
[M. Kang, JSSC18; J. Zhang, VLSI16; S. Gonugondla, ISSCC18; A. Biswas, ISSCC18]
DIMA Prototypes
• Multi-functional inference processor (65nm CMOS): 53× EDP ↓
• Random forest processor (65nm CMOS): 7× EDP ↓
• On-chip training processor (65nm CMOS): 100× EDP ↓
[Sujan Gonugondla, ISSCC18; Mingu Kang, JSSC18; Mingu Kang, ESSCIRC17]
Lack of Programmability
Goals & Challenges of PROMISE
1. Analog programmable hardware and ISA design
   − Analog noise management
   − Intrinsic sequentiality of operations
   − High variations in delay across different analog operations
2. End-to-End application mapping to PROMISE
   − High-level language (e.g. fully-connected layer, $Z = X \cdot Y$) mapped to an analog circuit (DIMA)
   − Analog circuit parameters: voltage swing, ADC precision, analog noise, leakage
3. Optimal energy with accuracy guarantee
   − Energy vs. accuracy trade-off in analog circuit
   − Maximize energy savings
   − Accuracy guarantees across long chain of analog processing
Our Contributions
[Figure: PROMISE tool flow. A high-level program (DNN, Matched Filter, SVM, PCA, …) goes through the PROMISE compiler to produce PROMISE ISA; energy optimization turns it into optimized PROMISE ISA, which runs on the PROMISE hardware (precharge/Y-decoder, X-decoders, BLPs, cross bit-line processor, RDL). The figure marks three points along this flow as Programmability Challenge – 1, – 2, and – 3.]
Prior Art
PRIME [P. Chi, ISCA16]
• ReRAM in-memory processor
RedEye [R. LiKamWa, ISCA16]
• Processor in image sensor
PuDianNao [D. Liu, ASPLOS15]
• Instruction set architecture
• Various ML algorithms
• Digital implementation
Limitations
• Limited programmability
• Limited error management
Processing Stages in DIMA
[Figure: DIMA block diagram with precharge/Y-decoder, X-decoders, bitline processors (BLP), and cross bitline processor.]
1. Analog READ (aREAD)
2. Bitline processing (BLP)
3. Cross BLP (CBLP)
4. ADC & Residual digital logic (RDL)
Energy vs. Accuracy Trade-off
[Figure: probability of detection* [%] vs. bitline voltage swing [mV], swing axis from 0 to 30 mV. Arrows mark the two directions of the trade-off: energy efficiency ↑ toward lower swing, accuracy ↑ toward higher swing.]
PROMISE instruction: SWING = 000 (min) to SWING = 111 (max)
* Silicon measured results of template matching from [Kang, JSSC18]
Energy vs. Accuracy Trade-off
[Figure: a chain of layers (…), each with its own undetermined setting (SWING = ???), evaluated against an accuracy goal.]
• > 4096 possible combinations for 4 layers
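The size of this search space follows from the 3-bit SWING field (000 to 111) shown earlier: eight settings per layer give 8^4 = 4096 assignments for four layers. The sketch below enumerates them and runs a brute-force search. The energy and accuracy functions are made-up placeholders for illustration only; they are not the models used by PROMISE, and the function names are hypothetical.

```python
from itertools import product

SWING_BITS = 3   # SWING = 000 (min) ... 111 (max), per the slides
NUM_LAYERS = 4   # the 4-layer example on the slide


def all_assignments(num_layers=NUM_LAYERS, bits=SWING_BITS):
    """Every per-layer choice of SWING code, as tuples of ints."""
    return product(range(2 ** bits), repeat=num_layers)


def toy_energy(assignment):
    # Placeholder assumption: energy grows with each layer's swing code.
    return sum(s + 1 for s in assignment)


def toy_accuracy(assignment):
    # Placeholder assumption: a layer loses accuracy as its swing shrinks,
    # and the losses compound along the chain of layers.
    acc = 1.0
    for s in assignment:
        acc *= 1.0 - 0.02 * (7 - s)
    return acc


def cheapest_meeting_goal(goal):
    """Lowest-energy assignment whose toy accuracy reaches the goal."""
    feasible = [a for a in all_assignments() if toy_accuracy(a) >= goal]
    return min(feasible, key=toy_energy) if feasible else None


if __name__ == "__main__":
    print(len(list(all_assignments())))  # 4096
    print(cheapest_meeting_goal(0.95))
```

Exhaustive search is feasible at this size but grows as 8^L with the number of layers L, which is why choosing the per-layer swing is treated as an optimization problem.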
End-to-End Application to Architecture Mapping
[Figure: PROMISE tool flow. A Julia program (DNN, Matched Filter, SVM, PCA, …) goes through the PROMISE compiler to produce PROMISE ISA; energy optimization turns it into optimized PROMISE ISA, which runs on the PROMISE hardware (precharge/Y-decoder, X-decoders, BLPs, cross bit-line processor, RDL). The figure marks three points along this flow as Programmability Challenge – 1, – 2, and – 3.]
Machine Learning Algorithms
General form: $g(\mathrm{dist}(X, Y))$

Algorithm          Distance metric             $g(\,)$
SVM                $\sum_j x_j y_j$            sign
Template Match 1   $\sum_j |x_j - y_j|$        min
Template Match 2   $\sum_j (x_j - y_j)^2$      min
DNN                $\sum_j x_j y_j$            tanh, ReLU
PCA                $\sum_j x_j y_j$            -
K-NN 1             $\sum_j |x_j - y_j|$        majority vote
K-NN 2             $\sum_j (x_j - y_j)^2$      majority vote
Matched Filter     $\sum_j x_j y_j$            min
…                  …                           …

Scalar distance (SD) → Aggregation: Vector distance (VD) → Threshold (TH, $g(\,)$)
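The decomposition on this slide (scalar distance, then aggregation into a vector distance, then a threshold function) can be sketched as plain Python. This is a digital reference sketch under stated assumptions, not the analog implementation: the function and key names are hypothetical, the SVM has no bias term because the slide shows none, and treating a zero dot product as the positive class is an arbitrary choice.

```python
import math

# Scalar distance (SD): applied to one pair of elements at a time.
SD = {
    "multiply": lambda x, y: x * y,
    "abs_diff": lambda x, y: abs(x - y),
    "sq_diff": lambda x, y: (x - y) ** 2,
}


def vector_distance(sd_name, xs, ys):
    """Vector distance (VD): sum the scalar distances over all elements."""
    if len(xs) != len(ys):
        raise ValueError("operands must have the same length")
    sd = SD[sd_name]
    return sum(sd(x, y) for x, y in zip(xs, ys))


def svm_decision(weights, sample):
    """SVM row: dot product followed by sign (zero mapped to +1 here)."""
    return 1 if vector_distance("multiply", weights, sample) >= 0 else -1


def template_match(query, templates, sd_name="abs_diff"):
    """Template Match rows: index of the template with minimum distance."""
    distances = [vector_distance(sd_name, query, t) for t in templates]
    return min(range(len(distances)), key=distances.__getitem__)


def dnn_neuron(weights, inputs, activation=math.tanh):
    """DNN row: dot product followed by an activation such as tanh."""
    return activation(vector_distance("multiply", weights, inputs))


if __name__ == "__main__":
    print(template_match([1, 2, 3], [[0, 0, 0], [1, 2, 4], [5, 5, 5]]))
    print(svm_decision([1, -1], [2, 5]))
```

Only the scalar distance and the final function change between rows of the table; the summation step is shared, which is what lets a small set of operations cover all of the listed algorithms.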
PROMISE ISA
[Figure: DIMA block diagram annotated with instruction classes. Analog: Class 1 (aREAD) at the bitcell array, Class 2 (aSD, aVD) at the BLPs and cross bitline processor. Class 3: ADC. Digital: Class 4 (TH) at the RDL, producing the decision.]

Class 1: aREAD, aADD, aSUBT
Class 2: signed multiply, unsigned multiply, sum-abs, sum-abs 2, compare
Class 3: ADC, bit precision
Class 4: max/min, mean, sum, sigmoid/ReLU/tanh, threshold
PROMISE ISA: Task
Task: PROMISE macro instruction (51 bits)

Field       Role              Example: SAD-based template matching
Rep Count   Loop iterations   # of candidates, $N$
Class 0     Set parameters    SWING; $Y$, $X$ address
Class 1     aREAD             aSUBT, $Y - Z$
Class 2     aSD, aVD          absolute, $\sum_j |e_j|$
Class 3     ADC               ADC, 6 bit
Class 4     TH                min
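One way to read the Task format is as a record with one slot per class, executed as a loop over the repeat count. The sketch below models the SAD-based template matching example in software. It is a digital reference under stated assumptions: the `Task` dataclass, its field names, and `run_sad_task` are hypothetical, the 51-bit encoding is not modeled because the slide gives no field widths, and the SWING value and repeat count are arbitrary illustrative choices. Analog noise, swing, and ADC quantization are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical software view of one Task; not the 51-bit encoding."""
    rep_count: int   # Rep Count: loop iterations (number of candidates)
    swing: int       # Class 0 parameter: SWING code, 0 (min) to 7 (max)
    class1: str      # Class 1 operation, e.g. "aSUBT"
    class2: str      # Class 2 operation, e.g. "absolute" then summation
    adc_bits: int    # Class 3: ADC precision in bits
    class4: str      # Class 4 operation, e.g. "min"


# Illustrative values: 3 candidates and maximum swing are assumptions.
SAD_TEMPLATE_MATCH = Task(
    rep_count=3, swing=7, class1="aSUBT", class2="absolute",
    adc_bits=6, class4="min",
)


def run_sad_task(task, query, candidates):
    """Return (index, score) of the best candidate under SAD."""
    if (task.class1, task.class2, task.class4) != ("aSUBT", "absolute", "min"):
        raise NotImplementedError("only the SAD template-matching task is modeled")
    scores = []
    for candidate in candidates[: task.rep_count]:
        diffs = [q - c for q, c in zip(query, candidate)]  # Class 1: subtract
        scores.append(sum(abs(d) for d in diffs))          # Class 2: abs, sum
    best = min(range(len(scores)), key=scores.__getitem__)  # Class 4: min
    return best, scores[best]


if __name__ == "__main__":
    print(run_sad_task(SAD_TEMPLATE_MATCH, [1, 2, 3],
                       [[3, 2, 1], [1, 2, 2], [0, 0, 0]]))
```

Keeping one operation per class in a single record mirrors the fixed order of the processing stages, so one Task describes a full pass from read to decision.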