CNVLUTIN: Ineffectual-Neuron-Free DNN Computing
J. Albericio, P. Judd, T. Hetherington*, T. Aamodt*, N. Enright Jerger, A. Moshovos
Please cite the original source.
DNNs = SIMD Heaven
[Figure: DNN layers are hundreds to thousands of identical multiply (x) and add (+) operations, a natural fit for wide SIMD hardware.]
CNVLUTIN: Smarter SIMD
+52% performance and 2x better ED²P, on out-of-the-box networks.
Outline
1. What's a CNN?
2. A wide SIMD design
3. CNVLUTIN: skipping neurons in a wide SIMD design
4. Evaluation
5. Our approach
What's a CNN?
[Figure: an input image flows through tens of layers and yields a classification, e.g., "Korean mask!".]
Each layer combines input neurons with synapses (filters) to produce output neurons.
A typical layer applies three stages: Convolution (inner products of neurons and synapses), ReLU (negatives clamped to 0), and Pooling (data size reduction).
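As a rough illustration of these three stages, here is a toy 1-D sketch; real layers are 3-D and use many filters, and all values below are made up:

```python
# Toy 1-D version of a typical CNN layer's three stages
# (illustrative only; values and filter are invented).

def convolve(neurons, synapses):
    """Slide the filter over the input; each output is an inner product."""
    k = len(synapses)
    return [sum(neurons[i + j] * synapses[j] for j in range(k))
            for i in range(len(neurons) - k + 1)]

def relu(neurons):
    """Clamp negatives to 0: the source of the runtime zeroes."""
    return [max(0, n) for n in neurons]

def max_pool(neurons, width=2):
    """Reduce data size by keeping the max of each window."""
    return [max(neurons[i:i + width]) for i in range(0, len(neurons), width)]

layer_input = [1, -2, 3, 0, -1, 2, -3, 1]
filt = [1, 0, -1]
out = max_pool(relu(convolve(layer_input, filt)))
print(out)  # [0, 4, 2]
```

Note how ReLU already produced several zeroes in this tiny example; that observation drives the rest of the talk.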
~90% of execution time is spent in convolutions.
Lots of Runtime Zeroes
[Chart: fraction of zero neurons in multiplications for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, and the average.]
These multiplications are a waste of time and energy, and the zeroes are dynamically generated, so they are not statically predictable.
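The source of those zeroes is easy to see: ReLU clamps every negative activation to 0. A minimal sketch with invented activation values:

```python
# Count how often a multiplication operand is zero after ReLU
# (the activation values below are invented for illustration).

def relu(neurons):
    return [max(0, n) for n in neurons]

activations = relu([3, -1, 0, -5, 2, -2, 0, 4, -3, 1])
zero_fraction = activations.count(0) / len(activations)
print(zero_fraction)  # 0.6; every multiply by one of these zeroes is wasted
```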
How to compute DNNs: DaDianNao*
[Figure: each unit reads 16 neuron lanes from an input buffer (NBin) and synapses from an eDRAM synapse buffer (SB); 16 inner-product units (IP0..IP15), one per filter, multiply neurons by synapses, accumulate, apply the activation function f, and write to the output buffer (NBout).]
*Chen et al., MICRO 2014
Processing in DaDianNao
[Figure: 16 neuron lanes feed the 16 synapse lanes of each of 16 filters.]
All lanes advance in lock-step: each cycle, every neuron lane multiplies its neuron by the corresponding synapse element of every filter.
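The lock-step scheme can be sketched as follows, scaled down from 16 lanes and 16 filters to 4 lanes and 2 filters so the numbers are easy to follow (all values invented):

```python
# One DaDianNao-style lock-step step: every lane multiplies its neuron by
# the corresponding synapse of every filter, even when the neuron is 0.

neurons = [0, 1, 1, 2]    # one neuron per lane; lane 0 holds a zero
filters = [
    [1, 2, 3, 4],         # synapses of filter 0
    [4, 3, 2, 1],         # synapses of filter 1
]

partial_sums = [sum(n * s for n, s in zip(neurons, syn)) for syn in filters]
print(partial_sums)  # [13, 7]; the zero in lane 0 still cost a multiply per filter
```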
Zero-skipping in DaDianNao?
[Figure: removing the zeroes from each neuron lane's stream makes the lanes drift apart.]
After zero removal, each lane needs a different synapse each cycle, so the lanes can no longer operate in lock-step!
CNVLUTIN: Decoupling Lanes
[Figure: where DaDianNao has one block of 16 lock-step neuron lanes sharing the synapse lanes of filters 0 through 15, CNVLUTIN has 16 subunits; subunit i pairs neuron lane i with its own synapse lanes for all 16 filters.]
Each neuron lane carries only nonzero values plus their offsets; the offset selects the matching synapse in each filter, so every lane advances independently and zeroes are never multiplied.
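A sketch of the decoupled scheme, reusing the same invented 4-lane example from before: the lane keeps only nonzero (value, offset) pairs, and the offset indexes the synapses of each filter:

```python
# CNVLUTIN-style subunit sketch: zero neurons never reach the multipliers.

neurons = [0, 1, 1, 2]                                     # raw lane contents
pairs = [(v, i) for i, v in enumerate(neurons) if v != 0]  # (value, offset)
filters = [
    [1, 2, 3, 4],
    [4, 3, 2, 1],
]

# The offset picks the synapse that lines up with each surviving neuron.
partial_sums = [sum(v * syn[off] for v, off in pairs) for syn in filters]
print(partial_sums)  # [13, 7]: same result, fewer multiplications
```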
CNVLUTIN: Ineffectual-neuron Filtering
[Figure: between layer i and layer i+1, an encoder packs output neurons on the fly into the Zero-Free (ZF) format; the packed neurons and their offsets are stored in eDRAM, and a dispatcher feeds them to the unit buffers.]
Example: the neuron bricks (7,6,5,0), (0,0,0,0), (0,2,1,0) are stored as value/offset pairs (7,3)(6,2)(5,1), a single (0,0) for the all-zero brick, and (2,2)(1,1).
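The brick packing above can be sketched as an encode/decode pair (brick size 4, as in the example; `zf_encode` and `zf_decode` are illustrative names, not names from the paper):

```python
# Zero-Free (ZF) brick encoding sketch: each 4-neuron brick keeps only its
# nonzero values plus their offsets within the brick; an all-zero brick is
# kept as a single (0, 0) placeholder so decoding stays unambiguous.

BRICK = 4

def zf_encode(brick):
    pairs = [(v, off) for off, v in enumerate(brick) if v != 0]
    return pairs if pairs else [(0, 0)]

def zf_decode(pairs):
    brick = [0] * BRICK
    for v, off in pairs:
        brick[off] = v
    return brick

bricks = [[0, 5, 6, 7], [0, 0, 0, 0], [0, 1, 2, 0]]
encoded = [zf_encode(b) for b in bricks]
# Round-tripping recovers the original bricks exactly.
assert all(zf_decode(e) == b for e, b in zip(encoded, bricks))
```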
CNVLUTIN: Computation Slicing
[Figure: neuron lanes 0 through 15 each process their own packed stream.]
Methodology
• In-house timing simulator: baseline + CNVLUTIN
• Logic + SRAM: synthesis with a 65nm TSMC library
• eDRAM model: Destiny
• DNNs: trained models from the Caffe Model Zoo
Area
Only +4.5% area overhead.
Speedup: ineffectual = 0
[Chart: speedup over the baseline for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, and the geometric mean; higher is better.]
1.37x performance on average.
Loosening the Ineffectual Neuron Criterion
"If all you have is a hammer, everything looks like a nail" (Maslow's hammer): so far only exact zeroes count as ineffectual, yet near-zero neurons contribute almost nothing to the output.
[Example neuron values: 37 0 13 10 / 15 1 123 0 / 0 7 1 3 / 0 1 20 0 / 18 31 0 33.]
Example: consider a neuron ineffectual if its value < 2.
Speedup: ineffectual >= 0
[Chart: speedup per network with two bars each: skipping only 0's vs. 0's and more; higher is better.]
1.52x performance on average, with no accuracy lost.
Loosening the Ineffectual Neuron Criterion (cont.)
With the same example values, raising the threshold, e.g., ineffectual if value < 8, marks more neurons ineffectual and skips more work.
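The loosened criterion amounts to comparing against a threshold instead of testing for exact zero. A sketch using the example values from the slide (the skip fractions below describe only these 20 values, not a measured network):

```python
# Thresholded ineffectual-neuron criterion: a neuron is skipped when its
# magnitude falls below the threshold, not only when it is exactly zero.

values = [37, 0, 13, 10, 15, 1, 123, 0, 0, 7, 1, 3, 0, 1, 20, 0, 18, 31, 0, 33]

def ineffectual_fraction(vals, threshold):
    """Fraction of neurons skipped at this threshold."""
    return sum(1 for v in vals if abs(v) < threshold) / len(vals)

print(ineffectual_fraction(values, 1))  # 0.3  : zeroes only
print(ineffectual_fraction(values, 2))  # 0.45 : zeroes and ones
print(ineffectual_fraction(values, 8))  # 0.55 : everything below 8
```

Each step up in the threshold exposes more skippable work; the talk's result is that modest thresholds buy extra speedup without losing accuracy.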