What can in-memory computing deliver, and what are the barriers?
Naveen Verma (nverma@princeton.edu), L.-Y. Chen, H. Jia, M. Ozatay, Y. Tang, H. Valavi, B. Zhang, J. Zhang
March 20, 2019
The memory wall
• Separating memory from compute fundamentally raises a communication cost
• For comparison, multiply energies: MULT (FP32): 5 pJ; MULT (INT32): 3 pJ; MULT (INT8): 0.3 pJ; MULT (INT4): 0.1 pJ
• More data → bigger array → larger comm. distance → more comm. energy
(Figure: energy per access of a 64b word (pJ) vs. memory size)
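To make the gap concrete, here is a minimal back-of-envelope sketch in Python. The multiply energies are the slide's numbers; the 64b-word access model (a base energy scaled by the square root of array capacity, standing in for communication distance) is an illustrative assumption, not measured data.

    # Compute energy vs. (assumed) data-access energy.
    MULT_PJ = {"FP32": 5.0, "INT32": 3.0, "INT8": 0.3, "INT4": 0.1}  # from the slide

    def access_energy_pj(mem_kb, base_pj=1.0):
        # Assumed model: access energy grows with comm. distance ~ sqrt(capacity).
        return base_pj * (mem_kb ** 0.5)

    for kb in (8, 128, 2048):
        e_acc = access_energy_pj(kb)
        print(f"{kb:5d} kB array: {e_acc:5.1f} pJ per 64b access "
              f"(~{e_acc / MULT_PJ['INT8']:.0f}x an INT8 multiply)")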
So, we should amortize data movement
• Reuse accessed data for compute
• Specialized (memory-compute-integrated) operations/architectures

$$\vec{y} = A\vec{x} \;\Rightarrow\; \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & \cdots & a_{1,N} \\ \vdots & \ddots & \vdots \\ a_{M,1} & \cdots & a_{M,N} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$$

(Figure: roofline view: a processing element (PE) is memory bound at low compute intensity and compute bound at high compute intensity, trading E_MEM against E_COMP)
In-memory computing (IMC)

$$\vec{y} = A\vec{x} \;\Rightarrow\; \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & \cdots & a_{1,N} \\ \vdots & \ddots & \vdots \\ a_{M,1} & \cdots & a_{M,N} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$$

• In SRAM mode, matrix A is stored in the bit cells row-by-row
• In IMC mode, many WLs are driven simultaneously → amortize comm. cost inside the array
• Can apply to diff. memory technologies → enhanced scalability → embedded non-volatility
(Figure: the same array operated in SRAM mode vs. IMC mode)
[J. Zhang, VLSI '16][J. Zhang, JSSC '17]
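A minimal numpy sketch of the two modes, under idealized assumptions: binary operands, ideal analog accumulation, and the matrix oriented so each bitline yields one output; sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 256, 64                         # rows (inputs) x columns (dot products)
    A = rng.integers(0, 2, size=(N, M))    # matrix stored in the bit cells, one output per column
    x = rng.integers(0, 2, size=N)         # inputs applied on the wordlines (WLs)

    # SRAM mode: one WL asserted per cycle, N row accesses, MAC done outside the array.
    y_sram = np.zeros(M, dtype=np.int64)
    for i in range(N):
        y_sram += x[i] * A[i, :]           # digital multiply-accumulate on the read-out row

    # IMC mode: all N WLs driven at once; each bitline accumulates its column's
    # products in a single array access (modeled here as an ideal analog sum).
    y_imc = x @ A

    assert np.array_equal(y_sram, y_imc)   # same result, ~N x fewer array accesses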
The basic tradeoffs
CONSIDER: accessing D bits of data, associated with computation, from an array with D^(1/2) columns ⨉ D^(1/2) rows.

| Metric    | Traditional (memory only) | In-memory (memory & computation) |
|-----------|---------------------------|----------------------------------|
| Bandwidth | ~1/D^(1/2)                | ~1                               |
| Latency   | ~D^(1/2)                  | ~1                               |
| Energy    | ~D^(3/2)                  | ~D                               |
| SNR       | ~1                        | ~1/D^(1/2)                       |

• IMC benefits energy/delay at the cost of SNR
• SNR-focused systems design is critical (circuits, architectures, algorithms)
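The table's scaling laws can be evaluated directly; a small sketch (constants omitted, D values illustrative) shows that the energy and latency gains, and the SNR cost, all scale as the square root of D:

    # The table's scaling laws for a few problem sizes D.
    for D in (1e3, 1e6, 1e9):
        trad = {"bandwidth": D**-0.5, "latency": D**0.5, "energy": D**1.5, "snr": 1.0}
        imc  = {"bandwidth": 1.0,     "latency": 1.0,    "energy": D,      "snr": D**-0.5}
        print(f"D={D:.0e}: energy gain ~{trad['energy']/imc['energy']:.0f}x, "
              f"latency gain ~{trad['latency']/imc['latency']:.0f}x, "
              f"SNR cost ~{trad['snr']/imc['snr']:.0f}x")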
IMC as a spatial architecture
$\vec{y} = A\vec{x}$
Data movement:
1. x_j's are broadcast over a minimal distance
2. (Many) a_{i,j}'s remain stationary in high-density bit cells
3. High-dynamic-range analog y_i's are computed in a distributed manner
IMC as a spatial architecture (cont.)
Assume: 1k dimensionality, 4-b multiplies, 45nm CMOS
(Figure: bit cells holding a11/a12 bits [3:0], driven by b11/b21, computing the partial products c11(2^3), c11(2^2), c11(2^1), c11(2^0) on precharged bitlines)

| Operation      | Digital-PE Energy (fJ) | Bit-cell Energy (fJ)  |
|----------------|------------------------|-----------------------|
| Storage        | 250                    | (folded into array)   |
| Multiplication | 100                    | 50                    |
| Accumulation   | 200                    | (folded into array)   |
| Communication  | 40                     | 5                     |
| Total          | 590                    | 55                    |

The totals give roughly a 10x energy advantage for the bit cell (see the arithmetic below).
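A one-line check of the totals; note that folding the storage and accumulation rows into the array operation is my reading of the flattened original table:

    # Reproducing the slide's energy totals (45nm, 1k dimensionality, 4-b multiplies).
    digital_pe = {"storage": 250, "multiplication": 100, "accumulation": 200, "communication": 40}
    bit_cell   = {"multiplication": 50, "communication": 5}  # storage/accum. folded into the array

    e_pe, e_imc = sum(digital_pe.values()), sum(bit_cell.values())
    print(f"digital PE: {e_pe} fJ/op, IMC bit cell: {e_imc} fJ/op -> ~{e_pe/e_imc:.0f}x")
    # digital PE: 590 fJ/op, IMC bit cell: 55 fJ/op -> ~11x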
Where does IMC stand today?
• Potential for 10× higher efficiency & throughput
• Limited scale, robustness, configurability
(Figures: surveyed designs, IMC vs. not IMC, plotted as normalized throughput (GOPS/mm²) and on-chip memory size (kB) vs. energy efficiency (TOPS/W); designs include Zhang VLSI'16 (130nm); Chen ISSCC'16, Yin VLSI'17, Ando VLSI'17, Shin ISSCC'17, Valavi VLSI'18, Khwa ISSCC'18, Lee ISSCC'18, Jiang VLSI'18, Biswas ISSCC'18, Gonugondla ISSCC'18, Yuan VLSI'18 (65nm); Moons ISSCC'17, Moons ISSCC'18, Bankman ISSCC'18 (28nm))
Challenge 1: analog computation
• Use analog circuits to 'fit' compute in bit cells
⟶ SNR limited by analog-circuit non-idealities
⟶ Must be feasible/competitive @ 16/12/7nm
(Figure: measured ΔV_BL (V) vs. WLDAC code, showing offset and nonlinearity against the ideal transfer curve; bit-cell and bit-cell replica circuit with I-DAC biasing, V_BIAS, WL_RESET, CLASS_EN, BL/BLB)
[J. Zhang, VLSI '16][J. Zhang, JSSC '17]
Algorithmic co-design(?)
• Chip-specific weight tuning: train on the measured behavior of each chip, e.g., weak classifiers combined by a weighted voter [Z. Wang, TVLSI '15][Z. Wang, TCAS-I '15]
• Chip-generalized weight tuning: train against a model of device variation, i.e., loss L = |y − ŷ(x, W, σ)|² instead of L = |y − ŷ(x, W)|² [B. Zhang, ICASSP 2019][S. Gonugondla, ISSCC '18]
(Figure: accuracy vs. normalized MRAM-cell standard deviation for a BNN model applied to CIFAR-10, comparing the two losses)
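A minimal sketch of the chip-generalized idea: inject weight noise in the forward pass so the optimizer sees L = |y − ŷ(x, W, σ)|². The toy logistic-regression model, the data, σ, and the straight-through gradient are illustrative choices, not the cited papers' methods.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((512, 16))                       # toy inputs
    y = (X @ rng.standard_normal(16) > 0).astype(float)      # toy labels

    W = np.zeros(16)
    sigma, lr = 0.3, 0.1                                     # sigma ~ normalized cell std. dev.
    for step in range(200):
        W_noisy = W + sigma * rng.standard_normal(W.shape)   # emulate device variation
        pred = 1 / (1 + np.exp(-(X @ W_noisy)))              # forward pass with noisy weights
        grad = X.T @ (pred - y) / len(y)                     # straight-through gradient w.r.t. W
        W -= lr * grad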
Challenge 2: programmability
• Matrix-vector multiply is only 70-90% of operations (see the Amdahl's-law sketch below)
⟶ IMC must integrate in programmable, heterogeneous architectures [B. Fleischer, VLSI '18]
(Figure: general matrix multiply (~256 ⨉ 2300 = 590k elements) vs. single/few-word operands (traditional, near-memory acceleration))
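Amdahl's law makes the point quantitatively. Assuming, for illustration, that IMC speeds up only the MVMs and does so by the 10x suggested earlier:

    # Why programmability matters: the non-MVM fraction bounds the overall gain.
    for mvm_fraction in (0.7, 0.9):
        speedup = 1 / ((1 - mvm_fraction) + mvm_fraction / 10)
        print(f"MVM = {mvm_fraction:.0%} of ops -> overall speedup ~{speedup:.1f}x")
    # MVM = 70% of ops -> ~2.7x; MVM = 90% of ops -> ~5.3x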
Challenge 3: efficient application mappings
• IMC engines must be 'virtualized'
⟶ IMC amortizes MVM costs, not weight loading. But…
⟶ Need new mapping algorithms (physical tradeoffs very diff. than digital engines)
Consider a conv layer with N I⨉J⨉K filters applied to X⨉Y⨉Z input activations (output activations OA_{x,y,n}):
• Activation accessing: E_DRAM→IMC/4-bit: 40 pJ; reuse: I⨉J⨉N (typical nets: 10-20 layers)
• Weight accessing: E_DRAM→IMC/4-bit: 40 pJ; reuse: X⨉Y
• E_MAC,4-b: 50 fJ ⟶ crossing from memory bound to compute bound requires reuse ≈ 1k (see the arithmetic below)
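The reuse target follows directly from the slide's energy numbers:

    # Reuse needed to amortize DRAM-to-IMC transfer against the in-array MAC.
    E_DRAM_TO_IMC = 40e-12   # J per 4-bit word moved from DRAM to the IMC array
    E_MAC_4B = 50e-15        # J per 4-b MAC in the array

    breakeven_reuse = E_DRAM_TO_IMC / E_MAC_4B
    print(f"each loaded word must be reused ~{breakeven_reuse:.0f}x")  # ~800, i.e., ~1k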
Path forward: charge-domain analog computing
• ~1.2fF metal capacitor (on top of each bit cell)
1. Digital multiplication
2. Analog accumulation
[H. Valavi, VLSI '18]
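A minimal sketch of the charge-domain operation, assuming ideal, perfectly matched capacitors and ±1 (binary) operands: each cell does a 1-b digital multiply (XNOR), drives its cap to VDD or 0, and accumulation is capacitive charge sharing along the column.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1024
    w = rng.choice([-1, +1], size=N)          # stored binary weights
    x = rng.choice([-1, +1], size=N)          # broadcast binary activations

    bit = (w * x + 1) // 2                    # XNOR result per cell: 1 or 0
    v_cap = bit * 1.0                         # each ~1.2fF cap charged to VDD (=1.0) or 0
    v_out = v_cap.mean()                      # charge sharing across N equal caps

    dot = int(round(N * (2 * v_out - 1)))     # map the shared voltage back to the dot product
    assert dot == int(w @ x)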
2.4Mb, 64-tile IMC

|                   | Moons, ISSCC'17 | Bang, ISSCC'17 | Ando, VLSI'17 | Bankman, ISSCC'18 | Valavi, VLSI'18 |
|-------------------|-----------------|----------------|---------------|-------------------|-----------------|
| Technology        | 28nm            | 40nm           | 65nm          | 28nm              | 65nm            |
| Area (mm²)        | 1.87            | 7.1            | 12            | 6                 | 17.6            |
| Operating VDD     | 1               | 0.63-0.9       | 0.55-1        | 0.8/0.8 (0.6/0.5) | 0.94/0.68/1.2   |
| Bit precision     | 4-16b           | 6-32b          | 1b            | 1b                | 1b              |
| On-chip memory    | 128kB           | 270kB          | 100kB         | 328kB             | 295kB           |
| Throughput (GOPS) | 400             | 108            | 1264          | 400 (60)          | 18,876          |
| TOPS/W            | 10              | 0.384          | 6             | 532 (772)         | 866             |

• 10-layer CNN demos for MNIST/CIFAR-10/SVHN at energies of 0.8/3.55/3.55 μJ/image
• Equivalent performance to software implementation
[H. Valavi, VLSI '18]
Programmable IMC
• Compute-In-Memory Unit (CIMU): 590 kb, 16-bank Compute-In-Memory Array (CIMA) computing f(y = A x); differential inputs x_0/xb_0 … x_2303/xb_2303; 8b ADCs and analog batch-norm (ABN) on the column outputs
• Array interface: sparsity/AND-logic controller, w2b reshaping buffer, row decoder/WL drivers
• Near-memory data paths on the CIMA outputs
• CPU (RISC-V) with 128 kB program memory and 128 kB data memory; DMA, timers, GPIO, UART (Tx/Rx); config. registers on an APB bus; 32b AXI bus interconnect
• Memory read/write I/F with external boot-loader; interfaces to E2PROM and DRAM controller
(Figure: chip block diagram)
[H. Jia, arXiv:1811.04047]
Bit-scalable mixed-signal compute
• SQNR different from standard integer compute
(Figure: measured SQNR (dB) vs. ADC precision B_A, for activation precisions B_x = 8, 4, 2, each at dimensionalities N = 2304, 2000, 1500, 1000, 500, 255)
[H. Jia, arXiv:1811.04047]
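One way to see why the SQNR behaves differently from integer compute: the analog pre-activation spans the full N(2^B_x − 1) range but is read out through a B_A-bit ADC. A rough Monte-Carlo sketch of that effect, assuming ideal compute and uniform random operands (my model, not the paper's analysis):

    import numpy as np

    rng = np.random.default_rng(0)

    def sqnr_db(N=2304, B_x=4, B_A=8, trials=2000):
        x = rng.integers(0, 2**B_x, size=(trials, N))      # B_x-bit activations
        w = rng.choice([-1, 1], size=(trials, N))          # binary weights
        y = np.sum(w * x, axis=1).astype(float)            # ideal pre-activation
        full_scale = N * (2**B_x - 1)
        lsb = 2 * full_scale / 2**B_A                      # ADC step over [-FS, +FS]
        y_q = np.round(y / lsb) * lsb                      # B_A-bit quantization
        return 10 * np.log10(np.mean(y**2) / np.mean((y - y_q)**2))

    print(f"{sqnr_db():.1f} dB")   # B_x=4, B_A=8, N=2304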
Development board
(Photo: development board, with connection to host processor)
Design flow

1. Deep-learning training libraries (Keras)

Standard Keras libs:
    Dense(units, ...)
    Conv2D(filters, kernel_size, ...)
    ...

Custom libs (INT/CHIP quant.):
    QuantizedDense(units, nb_input=4, nb_weight=4, chip_quant=False, ...)
    QuantizedConv2D(filters, kernel_size, nb_input=4, nb_weight=4, chip_quant=False, ...)
    QuantizedDense(units, nb_input=4, nb_weight=4, chip_quant=True, ...)
    QuantizedConv2D(filters, kernel_size, nb_input=4, nb_weight=4, chip_quant=True, ...)
    ...

2. Deep-learning inference libraries (Python, MATLAB, C)

High-level network build (Python):
    chip_mode = True
    outputs = QuantizedConv2D(inputs, weights, biases, layer_params)
    outputs = BatchNormalization(inputs, layer_params)
    ...

Function calls to chip (Python):
    chip.load_config(num_tiles, nb_input=4, nb_weight=4)
    chip.load_weights(weights2load)
    chip.load_image(image2load)
    outputs = chip.image_filter()

Embedded C:
    chip_command = get_uart_word();
    chip_config();
    load_weights();
    load_image();
    image_filter(chip_command);
    read_dotprod_result(image_filter_command);
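For context, a hypothetical end-to-end host-side session stitching the calls above together. The chip.load_config, chip.load_weights, chip.load_image, and chip.image_filter calls are from the slide; the chip_lib module, the Chip constructor and its port argument, and the file names are assumptions about the host-side library, not a documented API.

    import numpy as np
    from chip_lib import Chip                 # assumed host-side driver module

    chip = Chip(port="/dev/ttyUSB0")          # assumed: UART link to the dev board
    chip.load_config(num_tiles=4, nb_input=4, nb_weight=4)

    weights2load = np.load("conv1_weights_int4.npy")   # assumed 4-b quantized weights
    image2load = np.load("test_image.npy")             # assumed input image

    chip.load_weights(weights2load)
    chip.load_image(image2load)
    outputs = chip.image_filter()             # runs the mapped layer on the CIMU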