
  1. SCALEDEEP
     Bryce Paputa and Royce Hwang

  2. Motivation: DNN Applications
     • Google image search, Apple Siri
     • Self-driving cars, Education, Healthcare
     Sources: https://deepmind.com/, http://fortune.com/2017/03/27/waymo-self-driving-minivans-snow/, https://www.verizonwireless.com/od/smartphones/apple-iphone-x/

  3. Simple Neural Network
     Source: https://www.dtreg.com/solution/view/22

  4. 3 Stages of Training
     - Forward propagation: evaluates the network on its inputs
     - Back propagation: computes the error and propagates it from the output stage back to the input stage
     - Weight gradient and update: computes the gradient of the error with respect to the weights and updates the weights to reduce the error
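
For concreteness, here is a minimal NumPy sketch of the three stages on a tiny two-layer network with a squared-error loss; the layer sizes and the sigmoid activation are illustrative choices, not anything taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((1, 16))
x, target = rng.standard_normal(8), np.array([1.0])
lr = 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1) Forward propagation: evaluate the network
h = sigmoid(W1 @ x)
y = W2 @ h

# 2) Back propagation: push the output error back toward the input
err_y = y - target                      # d(loss)/d(y) for squared error
err_h = (W2.T @ err_y) * h * (1 - h)    # error arriving at the hidden layer

# 3) Weight gradient and update: gradients of the loss w.r.t. the weights
W2 -= lr * np.outer(err_y, h)
W1 -= lr * np.outer(err_h, x)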

  5. From Simple to Deep NN
     Source: https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e

  6. Convolutional Neural Network
     Source: http://cs231n.github.io/convolutional-networks/

  7. Implementation Challenges
     • Training and inference steps are extremely computation- and data-intensive
     • Example: OverFeat DNN – 820K neurons, 145M parameters, ~3.3 x 10^9 operations for a single 231 x 231 image
     • Processing the ImageNet dataset (1.2 million images) takes ~15 x 10^15 operations for a single training iteration
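
A quick back-of-the-envelope check, assuming a training step costs roughly 3x a forward pass (forward + backward + weight gradient; the factor is an assumption, not a figure from the slides), lands in the same ballpark as the ~15 x 10^15 number:

# Rough estimate of operations for one training pass over ImageNet.
ops_per_image_fwd = 3.3e9   # OverFeat forward pass on a 231x231 image (from the slide)
images = 1.2e6              # ImageNet training-set size
train_multiplier = 3        # assumed cost of forward + backward + weight gradient

print(f"{ops_per_image_fwd * train_multiplier * images:.1e} ops")  # ~1.2e16, on the order of 10^16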

  8. Escalating Computational Requirements
     Unless otherwise noted, all figures are from "SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks," Venkataramani et al., ISCA 2017.

  9. Ways to Speed This Up

  10. System Architecture

  11. Convolutional DNN
      Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

  12. 3 Main Layers: Convolution
      • Convolution (CONV) layer
        – Takes in inputs and applies a convolution operation with the layer's weights (kernels)
        – Outputs values (features) to the next layers
        – Computationally intensive
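
A minimal single-channel 2D convolution in NumPy makes the CONV layer's arithmetic concrete; real frameworks use im2col/GEMM, Winograd, or FFT formulations and handle many input and output channels.

import numpy as np

def conv2d(x, k):
    """x: (H, W) input feature map, k: (kh, kw) kernel; valid padding, stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)  # dot product of one window with the kernel
    return out

x = np.random.rand(8, 8)
k = np.random.rand(3, 3)
print(conv2d(x, k).shape)  # (6, 6)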

  13. 3 Main Layers: Sampling
      • Sampling (SAMP) layer
        – Also known as a pooling layer
        – Performs up/down-sampling on features
        – Example: decreasing image resolution
        – Data intensive
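
A 2x2 max-pooling sketch shows why SAMP layers are data-intensive: each output needs only a few comparisons per input element, so the bytes moved dominate the few FLOPs performed.

import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2x2(x))  # [[ 5.  7.] [13. 15.]]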

  14. 3 Main Layers: Fully Connected
      • Fully connected (FC) layer
        – Composes features from the CONV layers into the output (classification, etc.)
        – Data intensive
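
An FC layer is essentially a matrix-vector product, as in the sketch below (the layer sizes are hypothetical); every weight is touched exactly once per input, which is why FC layers get no weight reuse.

import numpy as np

def fc(x, W, b):
    """x: (n_in,) input features, W: (n_out, n_in) weights, b: (n_out,) bias."""
    return W @ x + b

x = np.random.rand(4096)        # e.g. flattened CONV features
W = np.random.rand(1000, 4096)  # hypothetical 1000-class classifier weights
b = np.zeros(1000)
print(fc(x, W, b).shape)        # (1000,)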

  15. Computation-Heavy Layers
      Initial CONV layers
      - Fewer, but larger, features
      - 16% of FLOPs
      - Very high reuse of weights
      Middle CONV layers
      - Smaller features, but more numerous
      - 80% of FLOPs

  16. Memory-Heavy Layers
      Fully connected layers
      - Fewer FLOPs (4%)
      - No weight reuse
      Sampling layers
      - Even fewer FLOPs (0.1%)
      - No training step / no weights
      - Very high bytes/FLOP
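
The bytes/FLOP contrast can be made concrete with rough arithmetic on hypothetical (loosely OverFeat-like) layer sizes; the specific dimensions below are illustrative, not taken from the paper.

# Rough bytes/FLOP for a middle CONV layer vs. an FC layer, 4 bytes per value.
B = 4

# CONV: 256 output maps of 13x13, 3x3x256 kernels
conv_flops = 2 * 256 * 13 * 13 * 3 * 3 * 256              # 2 ops per multiply-accumulate
conv_bytes = B * (3 * 3 * 256 * 256 + 2 * 256 * 13 * 13)  # weights + input/output features

# FC: 4096 -> 4096, every weight read exactly once
fc_flops = 2 * 4096 * 4096
fc_bytes = B * (4096 * 4096 + 2 * 4096)

print(f"CONV ~{conv_bytes / conv_flops:.3f} bytes/FLOP")  # ~0.01: compute-bound
print(f"FC   ~{fc_bytes / fc_flops:.3f} bytes/FLOP")      # ~2.0: bandwidth-bound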

  17. Summary of Characteristics

  18. CompHeavy Tile
      - Used for low-bytes/FLOP stages
      - Each 2D-PE computes the dot product of an input and a kernel
      - Computes many kernels convolved with the same input
      - Statically controlled
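
The sketch below models that behavior: one input window is broadcast to many kernels while each PE row multiply-accumulates, so a single input fetch is reused across kernels. The array shape and dataflow here are assumptions for illustration, not SCALEDEEP's exact 2D-PE design.

import numpy as np

def pe_array_dot(window, kernels):
    """window: flattened input patch (n,); kernels: (num_kernels, n).
    Each PE row holds one kernel and accumulates against the broadcast input."""
    acc = np.zeros(kernels.shape[0])
    for col, x in enumerate(window):    # input values streamed in one per step
        acc += kernels[:, col] * x      # every PE row MACs the same input value
    return acc

window = np.random.rand(9)       # 3x3 input patch, flattened
kernels = np.random.rand(64, 9)  # 64 kernels applied to the same patch
print(np.allclose(pe_array_dot(window, kernels), kernels @ window))  # True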

  19. MemHeavy Tile
      - Stores features, weights, errors, and error gradients in scratchpad memory
      - Special Function Units (SFUs) implement activation functions such as ReLU, tanh, and sigmoid
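
For reference, the activation functions mentioned above; hardware SFUs typically use piecewise-linear or lookup-table approximations rather than these exact math-library forms.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid(x), tanh(x), sep="\n")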

  20. SCALEDEEP Chip

  21. Heterogeneous Chips
      - CONV Layer Chip
      - FC Layer Chip

  22. Node Architecture
      - All memory is on-chip or directly connected
      - Wheel configuration allows for high memory bandwidth and for layers to be split between chips
      - Ring configuration allows for high model parallelism

  23. Intra-Layer Parallelism

  24. Inter-Layer Parallelism
      - Pipeline depth is twice the number of layers during training
      - Depth equals the number of layers during evaluation
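
A small sketch of the depth claim, assuming each layer occupies one forward pipeline stage plus one backward stage during training:

def pipeline_depth(num_layers, training=True):
    """Training pipelines both the forward and backward pass of every layer."""
    return 2 * num_layers if training else num_layers

print(pipeline_depth(22, training=True))   # e.g. a 22-layer network -> 44 stages
print(pipeline_depth(22, training=False))  # 22 stages for evaluation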

  25. Experimental Results
      - System evaluated with 7,032 processing elements
      - Single precision: 680 TFLOPS
      - Half precision: 1.35 PFLOPS
      - 6-28x speedup compared to an NVIDIA Titan X GPU

  26. Power Usage

  27. Hardware (PE) Utilization
      1. The granularity at which PEs can be allocated is coarser than ideal:
         a. Layer distribution to columns
         b. Feature distribution to MemHeavy tiles
         c. Feature sizes that are not a multiple of the 2D-PE rows
      2. Control logic and data transfer also lower utilization
      Total utilization is 35%
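
Item 1c can be illustrated with a small utilization calculation; the PE row count and feature size below are made-up numbers, not SCALEDEEP's.

import math

def row_utilization(feature_size, pe_rows):
    """Fraction of allocated PE rows doing useful work when the feature
    dimension is not a multiple of the row count."""
    groups = math.ceil(feature_size / pe_rows)  # row groups actually allocated
    return feature_size / (groups * pe_rows)

print(f"{row_utilization(13, 8):.0%}")  # a 13-wide feature on 8-row PEs -> 81% busy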

  28. Key Features of SCALEDEEP
      • Heterogeneous processing tiles and chips
      • System design that matches the memory-access structure of DNNs
      • Nested pipelining to minimize data movement and improve core utilization

  29. Discussion
      • Since DNN design is still more of an art than a science, does it make sense to build an ASIC, given the high cost of developing hardware?
      • How does SCALEDEEP compare to other systems such as Google's TPU and TABLA? In what situations is it better or worse?
      • What are some pitfalls of this design?
