
Deep In-memory Architectures for NAND Flash Memory Sujan K - PowerPoint PPT Presentation



  1. Deep In-memory Architectures for NAND Flash Memory
   Sujan K. Gonugondla, Mingu Kang, Yongjune Kim, Naresh Shanbhag (University of Illinois at Urbana-Champaign); Mark Helm, Sean Eilert (Micron Technology, Inc.)

  2. Machines are Beating Humans at Complex Inference Tasks
   • The game of Go is complex: a huge search space of ~250^150 (Go) vs. ~35^80 (Chess)
   • AlphaGo machine: 1202 CPUs + 176 GPUs
   • HUGE energy (and latency) cost: ~10,000× more than the human brain
   • Critical issue at the edge: IoT, wearables, autonomous systems
   [The Economist, March 2016]

  3. Energy Cost: Memory Access vs. Computation
   E_MEM / E_COMP ≈ 50× to 500× (in SRAMs) [Horowitz, ISSCC'14]
   Computation energy (45nm, integer, 8-b operands):
   • 8-b: ADD = 0.03 pJ, MULT = 0.2 pJ
   • 32-b: ADD = 0.1 pJ, MULT = 3 pJ
   Memory energy (45nm, 64-b access):
   • 8 KB cache: 10 pJ
   • 32 KB cache: 20 pJ
   • 1 MB cache: 100 pJ
   • DRAM: 1.2 – 2.6 nJ
   [Chart: ML kernel-level energy breakdown for a template-matching (TM) dot product, 8-b operands; post-layout simulations with SRAM + synthesized logic in 65nm CMOS [Kang, Shanbhag]]
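As a sanity check on the memory-vs-compute gap, one can form the ratio directly from the table's 45 nm numbers (values transcribed from the slide; the comparison against the 8-b multiply is my own choice of baseline):

```python
# Energy figures from the slide (45 nm, [Horowitz, ISSCC'14]); all in pJ.
add_8b, mult_8b = 0.03, 0.2          # 8-bit integer add / multiply
add_32b, mult_32b = 0.1, 3.0         # 32-bit integer add / multiply
cache_8kb, cache_1mb = 10.0, 100.0   # 64-bit access energy

lo = cache_8kb / mult_8b   # smallest cache vs. 8-b multiply: ~50x
hi = cache_1mb / mult_8b   # largest cache vs. 8-b multiply: ~500x
print(f"E_MEM/E_COMP ranges from {lo:.0f}x to {hi:.0f}x")
```

Even the smallest on-chip cache access costs tens of multiplies, which is the motivation for processing data where it is stored.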

  4. Energy Costs in the Memory Hierarchy
   Bandwidth and latency down the hierarchy:
   • CPU ↔ DDR DRAM: 25.6 Gb/s; 20-60 clock cycles
   • PCIe NAND Flash: 16 Gb/s
   • SATA SSD: 8-12 Gb/s (via the PCH); SATA HDD slower still
   Relative energy costs [Sze, ISSCC'16]:
   • SRAM vs. compute: ≈ 10×–100×
   • DRAM vs. compute: ≈ 100×
   • Flash vs. compute: ≈ 1000×
   Key question: how can memory access costs be reduced?
   [Chart: cost per bit ($), access time (ns), density (MB/mm²), and energy per bit (pJ) across SRAM, DRAM, Flash, and HDD; Yang, J. Joshua, Nature Nano, 2013]

  5. Proposed Solution: the Deep In-memory Architecture (DIMA), "breaching the memory wall"
   https://spectrum.ieee.org/computing/hardware/to-speed-up-ai-mix-memory-and-processing

  6. The Deep In-memory Architecture (DIMA) [Kang et al., ICASSP 2014] [Kang et al., JSSC 2018]
   • Multi-row functional READ: reads multiple bits/column/precharge (via the precharge/column mux/Y-DEC and X-DEC)
   • Bitline processing (BLP): an analog, mixed-signal, low-SNR fabric for SIMD analog processing
   • Cross bitline processing (CBLP): analog averaging enhances SNR
   • Residual digital unit: low-complexity, low-rate digital output for inference/decisions

  7. Functional Read (FR), Voltage Mode (for SRAM)
   • Words are stored column-major; a single FR computes a per-column dot product (vector inner product)
   • Multi-row access per precharge: wordlines WL_0–WL_3 driven with access pulses T_0–T_3 (bits d_0–d_3)
   • PWM, PAM, or PWAM access pulses
   • Per-column read-out, valid for T_i ≪ R_BL·C_BL:
         ΔV_BL ∝ Σ_{i=0}^{B−1} 2^i · d_i
     i.e., multiple bits per column are read in a single precharge cycle
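The multi-row functional read above can be sketched behaviorally: each wordline pulse width is proportional to the bit weight 2^i, so the bit-line swing encodes the stored word's value. This is an illustrative model only (ideal cells, made-up unit swing), not the authors' circuit:

```python
# Behavioral sketch of a voltage-mode functional read: wordline WL_i is
# pulsed for a time T_i proportional to 2^i, so a cell storing d_i = 1
# contributes a bit-line discharge proportional to 2^i.
def functional_read(bits, dv_lsb=1.0):
    """bits[i] is d_i (LSB first); returns the total bit-line swing dV_BL
    in units of the LSB swing dv_lsb (a hypothetical constant)."""
    return dv_lsb * sum((2 ** i) * d for i, d in enumerate(bits))

# A 4-bit word d = 0b1011 (d0=1, d1=1, d2=0, d3=1) reads out as 11 * dv_lsb.
print(functional_read([1, 1, 0, 1]))  # 11.0
```

The point of the technique is that one precharge cycle delivers the whole word as a single analog quantity instead of B separate digital reads.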

  8. Bit-line Analog Processors
   • Column pitch-matched computation: 6T bit cell (0.915 µm pitch), bit-line processor (BLP, 15.4 µm), cross-bitline processor (CBLP, 2.11 µm)
   • Dominating machine-learning kernels map onto a per-column BLP operation followed by CBLP aggregation:
         Kernel               BLP                 CBLP
         Manhattan distance   subtract-compare    aggregation
         Euclidean distance   subtract-square     aggregation
         Dot product          multiply            weighted aggregation
         Hamming distance     XOR                 aggregation
         MAX                  (charge-redistribution based)
   • Multiplication uses a replica cell driven by RWL_3~0: ΔV_BL, ΔV_BLB ∝ D+P
   • Example: MAD(x, w) = Σ_i |x_i − w_i|, aggregated across BL_0–BL_255 by two-phase (φ1/φ2) charge redistribution in the CBLP
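The kernel table above has a simple functional structure: every kernel is an element-wise BLP operation followed by a CBLP sum. A minimal sketch of that decomposition (the function names and test vectors are illustrative, not part of the design):

```python
import numpy as np

# Each DIMA kernel factors into a per-column BLP operation followed by a
# CBLP aggregation (realized in hardware by charge redistribution).
BLP_OPS = {
    "manhattan": lambda x, w: np.abs(x - w),   # subtract-compare
    "euclidean": lambda x, w: (x - w) ** 2,    # subtract-square
    "dot":       lambda x, w: x * w,           # multiply
    "hamming":   lambda x, w: x != w,          # XOR (binary vectors)
}

def dima_kernel(name, x, w):
    blp_out = BLP_OPS[name](x, w)   # BLP: one value per bit-line column
    return float(np.sum(blp_out))   # CBLP: aggregate across bit lines

x = np.array([3.0, 1.0, 4.0])
w = np.array([1.0, 1.0, 2.0])
print(dima_kernel("manhattan", x, w))  # |3-1|+|1-1|+|4-2| = 4.0
print(dima_kernel("dot", x, w))        # 3*1 + 1*1 + 4*2 = 12.0
```

Because the aggregation happens in charge domain across many columns at once, the SIMD width equals the number of bit lines.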

  9. SRAM DIMA IC Prototypes, with 16kB standard 6T SRAM in 65nm CMOS
   Die organization: SRAM bitcell array, BLP-CBLP, ADC, TRAINER, CTRL, test block, 64b bus interface, normal R/W path.
   Iso-accuracy comparisons w.r.t. a post-layout 8-b digital processor and measured SRAM read energy:
   • Multi-functional IC: SVM, TM, k-NN, MF; MIT-CBCL, MNIST, ...; energy savings = 10×; EDP reduction = 50× [JSSC January 2018]
   • FIRST random forest IC: RF with 64 trees; KUL traffic sign; energy savings = 3×; EDP reduction = 7× [ESSCIRC 2017, JSSC (special issue) May 2018]
   • On-chip learning TRAINER IC: 8-b, 128-dim SVM; MIT-CBCL dataset; SGD-based learning; energy savings = 21×; EDP reduction = 100× [ISSCC 2018, JSSC special issue (Invited)]

  10. ISCAS 2018: Migrating DIMA into Flash
   Challenges in Flash (relative to SRAM) → DIMA solution:
   • NAND flash bit cell is far smaller than a 6T SRAM cell → multi-column processing
   • Much larger BL capacitances → use current sensing
   • High V_T variations → use high-dimensional vector processing
   • Slower logic devices → use mixed-signal analog circuits

  11. NAND Flash-based DIMA
   Organization: a 128k × 192k SLC NAND flash array with X-decoders, pulse generators, and WL drivers; a 16kB input/weights register feeding WBL[0]..WBL[255]; 16k multipliers (BL processors); a cross-BL processor (CBLP); and an analog processor with A/D at the output.
   • Multi-column functional read: converts a stored word into an output voltage (D/A)
   • Multi-BL processor (MBLP): performs scalar mixed-signal multiplication
   • Cross-BL processing (CBLP): aggregation via charge sharing

  12. Multi-Column Functional Read, Current Mode
   A precharge phase (PCH) is followed by an evaluation phase: time-modulated SEL_0..SEL_3 signals gate the bit-line cell currents of BL_0..BL_3 onto the output capacitor C_OUT, with the selected wordline (e.g., WL<62>) at V_read and the unselected wordlines at V_pass (BSL/SSL select the string):
         ΔV_OUT = (1/C_OUT) · Σ_i I_BL,i · T_i ∝ Σ_i 2^i · d_i
   • Current sensing: the BL is not discharged
   • A binary word is stored horizontally, one bit per BL
   • Time-modulated SEL signals (T_i ∝ 2^i) implement the binary weighting
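The current-mode read admits the same kind of behavioral sketch: each selected bit line sources a cell current onto the shared output capacitor for a time proportional to its bit weight. The constants below (cell current, LSB pulse width, output capacitance) are made-up illustrative values, not device parameters from the talk:

```python
# Behavioral sketch of a current-mode functional read: a B-bit word is
# stored horizontally, one bit per bit line; SEL_i gates BL_i's cell
# current onto a shared output capacitor for T_i proportional to 2^i,
# so the capacitor voltage change is proportional to the word's value.
def current_mode_read(bits, i_cell=1e-6, t_lsb=1e-9, c_out=1e-12):
    """bits[i] is d_i (LSB first); returns dV_OUT in volts, assuming a
    hypothetical 1 uA cell current, 1 ns LSB pulse, and 1 pF capacitor."""
    charge = sum(i_cell * (2 ** i) * t_lsb * d for i, d in enumerate(bits))
    return charge / c_out  # dV_OUT = Q / C_OUT

# d = 0b0101 (d0=1, d1=0, d2=1, d3=0): (1+4) * 1uA * 1ns / 1pF = 5 mV
print(current_mode_read([1, 0, 1, 0]))
```

Since the conversion integrates current rather than discharging the bit line, the large NAND BL capacitance stays out of the signal path, which is exactly why current sensing is the right choice here.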

  13. Simulation Methodology
   Challenge: need to reflect device/process non-idealities at the system level.
   Solution: use device models + array parameters to estimate energy, delay, and behavior.
   • Behavioral models: V_T variations, inter-cell interference (ICI), pattern dependency, read/program disturbance → system-level performance; model verified by comparison against BSIM Level 49 SPICE simulations
   • Energy and delay models: estimated from circuit simulations and analysis (BLP energy, NAND flash energy, delay) → system-level energy and throughput estimation
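A behavioral model of the kind described can be very small. The Monte-Carlo sketch below is purely illustrative (it is not the authors' simulator): threshold-voltage spread is modeled as a relative Gaussian gain error on each cell's read current, and we measure how much the analog dot product deviates from ideal:

```python
import numpy as np

# Illustrative Monte-Carlo behavioral model: V_T spread makes each cell's
# read current deviate from nominal, perturbing the analog dot product.
rng = np.random.default_rng(0)

def noisy_dot(x, w, sigma_rel, trials=1000):
    """Mean |error| of an analog dot product whose per-cell currents carry
    a relative Gaussian spread sigma_rel (a stand-in for sigma_VT)."""
    ideal = float(x @ w)
    gains = 1.0 + sigma_rel * rng.standard_normal((trials, x.size))
    noisy = (x * gains) @ w          # one perturbed dot product per trial
    return float(np.mean(np.abs(noisy - ideal)))

x = rng.uniform(0.0, 1.0, 128)       # hypothetical 128-dim operands
w = rng.uniform(-1.0, 1.0, 128)
for s in (0.01, 0.05, 0.20):
    print(f"sigma_rel={s:.2f}  mean|error|={noisy_dot(x, w, s):.4f}")
```

Running this shows the error growing with the spread but staying small relative to the dot-product magnitude at high dimension, which is the averaging effect the high-dimensional vector processing relies on.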

  14. Architectural Set-up
   Four NAND flash planes, each with a 16kB page and a per-plane dot-product unit, sharing the IC's I/O.
   • 32nm node; 16kB/page; 64 pages/block; 3000 blocks/plane; 4 planes/IC
   • 200×320 8-b images; one image is stored across 4 pages in 4 planes
   • I/O limited to 800 MB/s (ONFI 4)
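A quick check that the stated image size actually fits the 4-page striping; this is pure arithmetic from the bullet points above:

```python
# One 200x320 8-bit image striped across 4 planes, one 16 kB page per plane.
image_bytes = 200 * 320 * 1                   # 64,000 B per image
page_bytes = 16 * 1024                        # 16 kB page
pages_needed = -(-image_bytes // page_bytes)  # ceiling division
print(pages_needed)                           # 4
print(4 * page_bytes - image_bytes)           # 1536 B of slack
```

So one image occupies exactly the four pages (one per plane), which lets all four planes compute their partial dot products in parallel.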

  15. Machine Learning Applications
   Face detection via linear SVM:
   • Decision rule: w^T x + b ≥ 0 → face; else → no face
   • Caltech 101 dataset; the input buffer stores the weights
   Face recognition via k-NN:
   • Face database (w) stored in NAND flash: Person 1 ... Person 28
   • 200×320 query image (x); extended Yale B dataset (2336 test images; 28 classes); the input buffer stores the reference image
   • Decision: "based on the 3 closest images, pick person 1"
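The two decision rules are straightforward to state in code. The sketch below uses tiny made-up 2-D vectors in place of the Caltech 101 / Extended Yale B data, and Manhattan distance for k-NN since that is one of the kernels the BLP/CBLP supports:

```python
import numpy as np
from collections import Counter

def svm_detect(x, w, b):
    """Linear SVM face detector: face iff w^T x + b >= 0."""
    return float(w @ x + b) >= 0.0

def knn_top3(query, database, labels):
    """k-NN (k=3) recognition with Manhattan distance: majority label
    among the 3 closest database entries."""
    d = np.abs(database - query).sum(axis=1)      # BLP/CBLP-style distances
    nearest = np.argsort(d)[:3]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical 2-D stand-ins for stored face images and a query.
db = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = ["person1", "person1", "person2", "person2"]
print(knn_top3(np.array([0.2, 0.1]), db, labels))                 # person1
print(svm_detect(np.array([1.0, 2.0]), np.array([0.5, 0.5]), -1.0))  # True
```

In the DIMA mapping, the distance (or inner-product) computation runs inside the flash array; only the small sort/vote step remains digital.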

  16. Simulation Results: Accuracy
   [Plots: detection accuracy P_det and PSNR vs. σ_VTH/ΔV_TH for SVM and k-NN, with the typical σ_VTH/ΔV_TH range marked]
   • Detection accuracy is robust to V_T variations in the typical range
   • SVM accuracy: 92%; k-NN accuracy (top-3): 95%

  17. Energy & Throughput Benefits
   Evaluated for a single NAND IC (4 planes/IC, per-plane BLP-CBLP with MBLP/CBLP) and for an SSD (16 ICs/package, with host I/O and NAND controller).
   [Bar charts: normalized energy breakdown (word-line, bit-line, string current, I/O, other compute), conventional vs. DIMA]
   • Energy reduction: 8.3× for a single NAND IC, 23× at the SSD level
   • Throughput enhancement (up to 15×) in addition to the energy savings

  18. Summary & Future Work
   • Deep In-memory Architecture: energy (23×) and throughput (15×) enhancements for SSDs
   • DIMA in other technologies: FD-SOI, MRAM, eDRAM, DRAM, emerging devices (e.g., RRAM)
   • Scaled-up DIMA: multi-bank architectures for DNNs
   • Robustifying DIMA: Shannon-inspired statistical error compensation, on-chip learning [ISSCC 2018]
   • Programmable DIMA: programming models, compilers (with Adve, Kim (UIUC)) [ISCA 2018]
   • Inference algorithms for DIMA: analog data flow
   • DIMA physical compilers: automatic synthesis of DIMA cores

  19. Acknowledgements
   This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.
   http://shanbhag.ece.illinois.edu
