

  1. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
 by Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars and L. Tang
 Stefanos Laskaridis (sl829@cam.ac.uk)
 R244: Large-Scale Data Processing and Optimisation

  2. Summary
 a. Status quo approach
 b. Mobile-only approach
 c. Neurosurgeon approach
 Image taken from [1]

  3. Status Quo

  4. Status Quo
 • Deep Neural Networks power “intelligent” applications: Apple Siri, Google Now, Microsoft Cortana
 • DNN applications are mostly offloaded to powerful private or public clouds for computation: Computer Vision, Natural Language Processing, Speech Recognition
 • Large volumes of data transfer cause latency and energy consumption.
 • However, SoC advancements urged the authors to revisit the problem.

  5. The Mobile Edge

  6. Experiment Setup
 Power consumption: Watts Up? meter
 Software:
 • Deep Learning: Caffe
 • mCPU: OpenBLAS
 • GPU: cuDNN
 Server Platform:
 • 4U Intel Dual CPU Chassis, 8x PCIe 3.0 x16 slots
 • 2x Intel Xeon E5-2620, 2.1 GHz
 • 16x 16GB DDR3 1866MHz ECC
 • 1TB HDD
 • NVIDIA Tesla K40 M-class, 12GB, PCIe
 Mobile Platform:
 • Tegra K1 SoC
 • 4+1 quad-core ARM Cortex-A15 CPU
 • 2GB DDR3L 933MHz
 • NVIDIA Kepler GPU with 192 CUDA cores

  7. Testing the Mobile Edge
 • Experiment: running a 152KB image through AlexNet [3]
 • Measuring:
   • Communication latency: 3G, LTE, WiFi
   • Computation latency: mCPU, mGPU, cloud GPU
   • End-to-end latency
   • Energy consumption
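
 To make the communication/computation trade-off concrete, here is a back-of-the-envelope Python sketch of cloud-only latency for this experiment. The uplink rates and server compute time below are illustrative assumptions, not the paper's measurements.

    # Back-of-the-envelope cloud-only latency: upload the input image, then
    # compute remotely. All bandwidth/compute numbers below are assumed for
    # illustration, not taken from the paper.

    INPUT_KB = 152  # the AlexNet input image used in the experiment

    uplink_mbps = {"3G": 1.0, "LTE": 6.0, "WiFi": 19.0}  # assumed uplink rates
    server_gpu_ms = 10.0                                 # assumed forward-pass time

    for net, mbps in uplink_mbps.items():
        upload_ms = INPUT_KB * 8 / (mbps * 1000) * 1000  # KB -> kilobits -> ms
        print(f"{net:4s}: upload {upload_ms:6.0f} ms + compute {server_gpu_ms:.0f} ms"
              f" = {upload_ms + server_gpu_ms:6.0f} ms")

 Even with these rough numbers, the upload dominates on 3G, matching the slide's finding that transmission is the dominating cost.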

  8. Testing the Mobile Edge
 • Transmission has the dominating cost.
 • More power, but in shorter bursts.
 Images taken from [1]

  9. Neurosurgeon: Partitioning between Cloud and Mobile

  10. CNN building blocks: Convolution and Pooling
 Images taken from [2]

  11. DNN Layer types
 • Fully Connected Layer (fc): All neurons are connected to all the neurons of the previous layer.
 • Convolutional & Local Layer (conv, local): Convolves an image with one or more filters to produce a set of feature maps. Depth is the number of filters; stride is how far the filter slides each time. [2]
 • Pooling Layer (pool): Downsamples an image to simplify representation. Can be average, max, or L2. [2]
 • Activation Layer (sig, relu, htanh): Applies a non-linear function to its input (sigmoid, Rectified Linear Unit, tanh).
 • Normalisation Layer (norm): Normalises features across the feature map.
 • Softmax Layer (softmax): Produces a probability distribution over possible classes.
 • Argmax Layer (argmax): Chooses the class with the highest probability.
 • Dropout Layer (dropout): Randomly ignores neurons as regularisation, to prevent overfitting.
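
 For intuition, a minimal NumPy sketch of a few of these layer types (shapes and many details simplified; this is not the paper's code):

    import numpy as np

    def relu(x):
        # Activation: element-wise max(0, x)
        return np.maximum(0, x)

    def max_pool_2x2(x):
        # Pooling: downsample an (H, W) map by taking the max of each 2x2 block
        h, w = x.shape
        x = x[:h - h % 2, :w - w % 2]
        return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

    def fully_connected(x, W, b):
        # Fully connected: every output neuron sees every input neuron
        return W @ x + b

    def softmax(z):
        # Probability distribution over possible classes (shifted for stability)
        e = np.exp(z - z.max())
        return e / e.sum()

    def argmax(p):
        # Choose the class with the highest probability
        return int(np.argmax(p))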

  12. AlexNet
 • Inference-only (forward propagation)
 • 2x speedup over cloud-only
 • 18% more energy-efficient
 Images taken from [1] and [3]

  13. AlexNet
 • Convolutional layers produce a lot of data.
 • Pooling layers greatly reduce the data.
 • Fully connected layers operate on little data but have high latency.
 Images taken from [3]

  14. Partitioning
 • First layers hold most of the data (convolutions and pooling)
 • Later layers account for most of the latency (fully connected layers)
 • Key idea: Compute locally up to the point where it makes sense, then offload to the cloud (see the sketch below).
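
 A minimal sketch of this key idea, assuming per-layer latencies and output sizes are already known (e.g. from profiling); it illustrates the selection logic only, not the authors' implementation:

    # Evaluate every candidate partition point and pick the one that
    # minimises predicted end-to-end latency.

    def best_partition(mobile_ms, server_ms, sizes_bytes, uplink_bps):
        """Pick the layer index at which to cut the DNN.

        mobile_ms[i]   -- latency of layer i on the mobile device (ms)
        server_ms[i]   -- latency of layer i on the server (ms)
        sizes_bytes[k] -- bytes uploaded if we cut before layer k
                          (sizes_bytes[0] is the raw input; sizes_bytes[n]
                          is the final output, tiny for classification)
        """
        n = len(mobile_ms)
        best_k, best_total = 0, float("inf")
        for k in range(n + 1):  # run layers [0, k) locally, [k, n) remotely
            transfer_ms = sizes_bytes[k] * 8 / uplink_bps * 1000
            total = sum(mobile_ms[:k]) + transfer_ms + sum(server_ms[k:])
            if total < best_total:
                best_k, best_total = k, total
        return best_k, best_total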

  15. More Applications
 Application | Abbreviation | Network | Input | Layers
 Image Classification | IMC | AlexNet | Image | 24
 Image Classification | VGG | VGG | Image | 46
 Facial Recognition | FACE | DeepFace | Image | 10
 Digit Recognition | DIG | MNIST | Image | 9
 Speech Recognition | ASR | Kaldi | Speech | 13
 Part-of-speech Tagging | POS | SENNA | Word vectors | 3
 Named Entity Recognition | NER | SENNA | Word vectors | 3
 Word Chunking | CHK | SENNA | Word vectors | 3

  16. VGG
 Figures taken from [1], panel (a) VGG: latency (s) and energy (J) at each candidate partition point, split into server processing, data communication and mobile processing; plus per-layer latency (ms) vs. output data size (MB) from input through the conv/relu/pool stacks to fc6-fc8, softmax and argmax.

  17. FACE
 Figures taken from [1], panel (b) FACE: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input, conv1, pool2, conv3, local4-local6, fc7, fc8, softmax, argmax.

  18. DIG
 Figures taken from [1], panel (c) DIG: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input, conv1, pool1, conv2, pool2, fc1, relu1, fc2, softmax, argmax.

  19. ASR
 Figures taken from [1], panel (d) ASR: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input and fc1-fc7 with sigmoid activations in between.

  20. POS
 Figures taken from [1], panel (e) POS: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input, fc1, htanh, fc3.

  21. NER
 Figures taken from [1], panel (f) NER: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input, fc1, htanh, fc3.

  22. CHK
 Figures taken from [1], panel (g) CHK: latency and energy at each candidate partition point (server processing, data communication, mobile processing); plus per-layer latency vs. output data size across input, fc1, htanh, fc3.

  23. Neurosurgeon

  24. Neurosurgeon
 • Partitions the DNN based on:
   • DNN topology
   • Computation latency
   • Output data size
 • Dynamic factors:
   • Wireless network
   • Datacenter workload

  25. Neurosurgeon
 • Profiles the device and the cloud server to generate prediction models
   • One time, in advance
   • Results stored on the device for decision-making
 • Two distinct goals:
   • Latency minimisation
   • Energy optimisation

  26. Neurosurgeon
 Deployment phase: 1) Generate prediction models per layer type (CONV, ACT, POOL, FC, …) for both mobile and server.
 Runtime phase: 1) Extract layer configurations from the target application, 2) Predict layer performance, 3) Evaluate partition points, 4) Partitioned execution.
 Image taken from [1]
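
 The runtime phase could be sketched as follows, reusing best_partition() from the partitioning slide. predict_mobile_ms, predict_server_ms and the output_bytes field are hypothetical stand-ins for the stored prediction models and layer metadata:

    # Sketch of the runtime phase under the assumptions above; all names
    # here are hypothetical, not the authors' API.

    def plan_execution(layer_configs, input_bytes, uplink_bps,
                       predict_mobile_ms, predict_server_ms):
        # 1) Extract layer configurations (type, filter sizes, neuron counts, ...)
        # 2) Predict each layer's latency on both mobile and server
        mobile_ms = [predict_mobile_ms(cfg) for cfg in layer_configs]
        server_ms = [predict_server_ms(cfg) for cfg in layer_configs]
        # Upload size at each candidate cut: raw input, then each layer's output
        sizes = [input_bytes] + [cfg["output_bytes"] for cfg in layer_configs]
        # 3) Evaluate all candidate partition points under the current bandwidth
        k, total_ms = best_partition(mobile_ms, server_ms, sizes, uplink_bps)
        # 4) Partitioned execution: run layers [0, k) locally, offload the rest
        return k, total_ms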

  27. Regression model per DNN Layer
 Linear or logarithmic regression model; GFLOPS for performance.
 Layer | Regression Variables
 Convolution | (filter_size / stride)^2 * (# filters)
 Local, Pooling | input, output feature maps
 Fully Connected | # input/output neurons
 Softmax, Argmax | # input/output neurons
 Activation, Normalisation | # neurons
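
 As an illustration, one such model for convolution layers can be fitted with NumPy; the profiled (configuration, latency) pairs below are made-up placeholders standing in for the one-time profiling measurements:

    import numpy as np

    # Linear regression of latency on the slide's convolution variable:
    # (filter_size / stride)^2 * (# filters).

    def conv_feature(filter_size, stride, num_filters):
        return (filter_size / stride) ** 2 * num_filters

    profiled = [  # ((filter_size, stride, num_filters), measured latency in ms)
        ((11, 4, 96), 12.0),
        ((5, 1, 256), 95.0),
        ((3, 1, 384), 70.0),
    ]

    x = np.array([conv_feature(*cfg) for cfg, _ in profiled])
    y = np.array([ms for _, ms in profiled])
    slope, intercept = np.polyfit(x, y, 1)  # latency ~ slope * feature + intercept

    def predict_conv_ms(filter_size, stride, num_filters):
        return slope * conv_feature(filter_size, stride, num_filters) + intercept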
