SCALEDEEP
Bryce Paputa and Royce Hwang
Motivation: DNN Applications
• Google image search, Apple Siri
• Self-driving cars, education, healthcare
Source: https://deepmind.com/
Source: http://fortune.com/2017/03/27/waymo-self-driving-minivans-snow/
Source: https://www.verizonwireless.com/od/smartphones/apple-iphone-x/
Simple Neural Network
Source: https://www.dtreg.com/solution/view/22
3 Stages of Training
- Forward propagation: evaluates the network on an input.
- Back propagation: calculates the error and propagates it from the output layer back toward the input layer.
- Weight gradient and update: calculates the gradient of the error with respect to the weights and updates the weights to reduce the error (see the sketch below).
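To make these three stages concrete, here is a minimal NumPy sketch of one training step for a single fully connected layer with a squared-error loss; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def train_step(x, target, W, lr=0.01):
    # 1. Forward propagation: evaluate the network.
    y = W @ x

    # 2. Back propagation: compute the error at the output and
    #    propagate it back toward the input.
    error = y - target       # dLoss/dy for a squared-error loss (up to a constant)
    grad_x = W.T @ error     # error passed on to the previous layer

    # 3. Weight gradient and update: compute dLoss/dW and step the
    #    weights in the direction that reduces the error.
    grad_W = np.outer(error, x)
    W -= lr * grad_W
    return W, grad_x

W = np.random.randn(3, 5) * 0.1
W, _ = train_step(np.random.randn(5), np.random.randn(3), W)
```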
From Simple to Deep NN
Source: https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e
Convolutional Neural Network
Source: http://cs231n.github.io/convolutional-networks/
Implementation Challenges
• Training and inference are extremely compute- and data-intensive
• Example: the OverFeat DNN has 820K neurons and 145M parameters, and needs ~3.3 × 10^9 operations for a single 231 × 231 image
• Processing the ImageNet dataset (1.2 million images) requires ~15 × 10^15 operations for a single training iteration
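A rough back-of-the-envelope check of the numbers above; the 3x factor for training (forward, backward, and weight-gradient passes) is an assumption on our part, not a figure from the slide.

```python
# Rough check of the slide's numbers. The 3x training factor
# (forward + backward + weight-gradient passes) is assumed.
ops_per_image_forward = 3.3e9   # OverFeat, one 231x231 image
images_per_iteration = 1.2e6    # ImageNet
training_factor = 3             # assumed cost of a training step vs. a forward pass

ops_per_iteration = ops_per_image_forward * training_factor * images_per_iteration
print(f"{ops_per_iteration:.1e} operations per training iteration")  # ~1.2e16
```

This lands in the same order of magnitude as the ~15 × 10^15 quoted above.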
Escalating Computational Requirements
Unless otherwise noted, all figures are from SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks, Venkataramani et al., ISCA 2017.
Ways to Speed This Up
System Architecture
Convolutional DNN
Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
3 Main Layers: Convolution
• Convolution (CONV) layer
– Takes in inputs and applies a convolution operation with the weights (see the sketch below)
– Outputs values (features) to the next layers
– Computationally intensive
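A minimal sketch of what a CONV layer computes: slide each kernel over the input and take a dot product at every position (stride 1, no padding or channels, for simplicity).

```python
import numpy as np

def conv2d(x, kernels):
    # x: (H, W) input, kernels: (num_kernels, kh, kw) weights
    num_k, kh, kw = kernels.shape
    H, W = x.shape
    out = np.zeros((num_k, H - kh + 1, W - kw + 1))
    for k in range(num_k):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                # dot product of the kernel with one input window
                out[k, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[k])
    return out

features = conv2d(np.random.randn(8, 8), np.random.randn(4, 3, 3))
print(features.shape)  # (4, 6, 6): one feature map per kernel
```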
3 Main Layers: Sampling
• Sampling (SAMP) layer
– Also known as a pooling layer
– Performs up/down sampling on features
– Example: decreasing image resolution
– Data intensive
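For example, a minimal sketch of 2x2 max pooling, one common form of down-sampling: keep only the largest value in each non-overlapping 2x2 window.

```python
import numpy as np

def max_pool2x2(x):
    # Down-sample a feature map by a factor of 2 in each dimension.
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

print(max_pool2x2(np.arange(16.0).reshape(4, 4)))  # 4x4 feature map -> 2x2
```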
3 Main Layers: Fully Connected
• Fully Connected (FC) layer
– Composes the features from the CONV layers into the output (classification, etc.)
– Data intensive
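A minimal sketch of an FC layer: every output is a weighted sum of every input feature, i.e. a matrix-vector product plus a bias. The layer sizes below are made up for illustration.

```python
import numpy as np

def fully_connected(features, W, b):
    # Flatten the incoming feature maps and mix them with a dense weight matrix.
    return W @ features.ravel() + b

scores = fully_connected(np.random.randn(4, 6, 6),       # flattened CONV features
                         np.random.randn(10, 4 * 6 * 6),  # one row of weights per output
                         np.zeros(10))
print(scores.shape)  # (10,)
```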
Computation-Heavy Layers
Initial CONV layers
- Fewer, but larger features
- 16% of FLOPs
- Very high reuse of weights
Middle CONV layers
- Smaller, but more numerous features
- 80% of FLOPs
Memory-Heavy Layers
Fully connected layers
- Fewer FLOPs (4%)
- No weight reuse
Sampling layers
- Even fewer FLOPs (0.1%)
- No training step/weights
- Very high bytes/FLOP (see the sketch below)
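To see why FC layers end up memory-bound while CONV layers do not, here is an illustrative bytes/FLOP estimate that counts only weight traffic; the 4-byte weights, batch size of 1, and example layer sizes are assumptions, not figures from the paper.

```python
# Illustrative bytes/FLOP comparison (4-byte weights, batch size 1 assumed).
def conv_bytes_per_flop(out_h, out_w, num_k, kh, kw, in_ch):
    flops = 2 * out_h * out_w * num_k * kh * kw * in_ch   # multiply-accumulates
    weight_bytes = 4 * num_k * kh * kw * in_ch            # reused at every output position
    return weight_bytes / flops

def fc_bytes_per_flop(in_features, out_features):
    flops = 2 * in_features * out_features
    weight_bytes = 4 * in_features * out_features         # each weight used exactly once
    return weight_bytes / flops

print(conv_bytes_per_flop(56, 56, 64, 3, 3, 64))  # ~0.0006 bytes/FLOP
print(fc_bytes_per_flop(4096, 4096))              # 2.0 bytes/FLOP
```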
Summary of Characteristics
CompHeavy Tile
- Used for low bytes/FLOP stages
- The 2D-PE computes the dot product of an input and a kernel
- Computes many kernels convolved with the same input
- Statically controlled
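A simplified software model of the idea, not the actual SCALEDEEP microarchitecture: think of each PE row as holding one flattened kernel while every row is fed the same stream of input patches, so many kernels are evaluated against the same input in parallel.

```python
import numpy as np

def comp_heavy_tile(patches, kernels):
    # patches: (num_patches, patch_len) flattened input windows
    # kernels: (num_kernels, patch_len) flattened kernels, one per PE row
    # Each output element is one dot product: kernel k applied to patch p.
    return kernels @ patches.T  # (num_kernels, num_patches)

outputs = comp_heavy_tile(np.random.randn(36, 9),  # 36 input positions, 3x3 windows
                          np.random.randn(8, 9))   # 8 kernels share the same input
print(outputs.shape)  # (8, 36)
```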
MemHeavy Tile
- Stores features, weights, errors, and error gradients in scratchpad memory
- Special Function Units (SFUs) implement activation functions such as ReLU, tanh, and sigmoid
SCALEDEEP Chip
Heterogeneous Chips
- CONV Layer Chip
- FC Layer Chip
Node Architecture
- All memory is on-chip or directly connected
- The wheel configuration allows for high memory bandwidth and for layers to be split between chips
- The ring configuration allows for high model parallelism
Intra-layer Parallelism
Inter-Layer Parallelism
- Pipeline depth is equal to twice the number of layers used during training (see the sketch below)
- Depth is equal to the number of layers during evaluation
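A small sketch of why the depth doubles during training: each layer contributes one forward stage and, during training, one additional backward stage.

```python
def pipeline_depth(num_layers, training=True):
    # Inference uses only forward stages; training adds a backward
    # stage per layer, doubling the pipeline depth.
    forward_stages = num_layers
    backward_stages = num_layers if training else 0
    return forward_stages + backward_stages

print(pipeline_depth(10, training=True))   # 20
print(pipeline_depth(10, training=False))  # 10
```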
Experimental Results
- System tested using 7,032 processing elements
- Single precision: 680 TFLOPS
- Half precision: 1.35 PFLOPS
- 6-28x speedup compared to a Titan X GPU
Power Usage
Hardware (PE) Utilization
1. The granularity at which PEs can be allocated is coarser than ideal:
   a. Layer distribution to columns
   b. Feature distribution to MemHeavy tiles
   c. Feature sizes that are not a multiple of the 2D-PE rows (see the sketch below)
2. Control logic and data transfer also lower utilization
Total utilization is 35%
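A small illustration of item 1c: when a feature dimension is not a multiple of the number of 2D-PE rows, the last allocation is only partly filled. The feature and row sizes below are made up for illustration.

```python
import math

def row_utilization(feature_size, pe_rows):
    # Rows are allocated in whole groups of pe_rows, so the final
    # group is partly idle whenever feature_size is not a multiple.
    allocated_rows = math.ceil(feature_size / pe_rows) * pe_rows
    return feature_size / allocated_rows

print(row_utilization(56, 16))   # 56 of 64 allocated rows busy -> 0.875
print(row_utilization(227, 32))  # 227 of 256 allocated rows busy -> ~0.89
```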
Key Features of SCALEDEEP
• Heterogeneous processing tiles and compute chips
• System design matches the memory-access structure of DNNs
• Nested pipelining to minimize data movement and improve core utilization
Discussion
• Since DNN design is still more of an art than a science at this point, does it make sense to build an ASIC, given the high cost of developing hardware?
• How does SCALEDEEP compare to other systems like Google's TPU and TABLA? In what situations is it better or worse?
• What are some pitfalls of this design?