Deep Learning in Cloud-Edge AI Systems


  1. Deep Learning in Cloud-Edge AI Systems. Wei Wen, COMPUTATIONAL EVOLUTIONARY INTELLIGENCE LAB, Electrical and Computer Engineering Department, Duke University. www.pittnuts.com

  2. AI Systems Landing into the Cloud: Facebook Big Basin, Google TPU Cloud.

  3. AI Systems Landing upon the Edge: iPhone X Face ID, autonomous driving, DJI drones. What are the challenges in landing?

  4. The Rocket Analogy of Deep Learning AI (an analogy by Andrew Ng). Rocket: big neural networks. Engine: big computing. Fuel: big data. Three BIGs.

  5. Deep Learning in the Cloud. Parallelism in the Google TPU Cloud means tons of cables, and communication becomes the bottleneck!

  6. Deep Learning on the Edge. Real-time requirements: model size and inference speed matter!

  7. Research Highlights
     • Distributed Training in the Cloud
       – TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral)
       – Ongoing work on large-batch training
     • Efficient Inference on the Edge
       – Structurally Sparse DNNs (NIPS 2016 & ICLR 2018)
       – Lower-Rank DNNs (ICCV 2017)
       – A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation (CVPR 2017)
       – Direct Sparse Convolution (ICLR 2017)

  8. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, NIPS 2017 (oral). Wei Wen¹, Cong Xu², Feng Yan³, Chunpeng Wu¹, Yandan Wang⁴, Yiran Chen¹, Hai (Helen) Li¹. ¹Duke University, ²Hewlett Packard Labs, ³University of Nevada, Reno, ⁴University of Pittsburgh. https://github.com/wenwei202/terngrad

  9. Background: Stochastic Gradient Descent.
     Minimization target: $C(\boldsymbol{x}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \boldsymbol{x})$.
     Batch gradient descent: $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}$, where $g_t^{(i)} = \nabla Q(z_i, \boldsymbol{x}_t)$; computation is expensive.
     Mini-batch stochastic gradient descent (SGD): $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \cdot \frac{1}{b}\sum_{j=1}^{b} g_t^{(j)}$, where $b \ll n$ samples are randomly drawn from the training dataset; computation is cheap.
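A minimal sketch of the mini-batch SGD update above, assuming a hypothetical per-example gradient function grad_fn (not from the slides or the TernGrad code):

```python
import numpy as np

def sgd_step(x, data, grad_fn, lr, batch_size):
    """One mini-batch SGD step: x_{t+1} = x_t - lr * (1/b) * sum_j grad_fn(z_j, x_t)."""
    idx = np.random.choice(len(data), size=batch_size, replace=False)  # draw b << n samples
    g = np.mean([grad_fn(data[j], x) for j in idx], axis=0)            # averaged mini-batch gradient
    return x - lr * g
```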

  10. Background: Distributed Deep Learning. Synchronized data parallelism with parameter server(s) running SGD: $\boldsymbol{x}_{t+1} \leftarrow \boldsymbol{x}_t - \eta_t \boldsymbol{h}_t$, with aggregated gradient $\boldsymbol{h}_t = \frac{1}{Nb}\sum_{i=1}^{Nb} g_t^{(i)}$ and total batch size $Nb$.
     1. The training data is split into N subsets.
     2. Each worker holds a model replica (copy) $\boldsymbol{x}_t$.
     3. Each replica is trained on its own data subset, producing a local gradient $\boldsymbol{h}_t^{(i)}$.
     4. The local gradients are synchronized (aggregated) in the parameter server(s).
     Scalability:
     1. Computing time decreases with N.
     2. Communication can become the bottleneck.
     3. This work: quantize gradients to three (i.e., ternary) levels {-1, 0, 1}, i.e., fewer than 2 bits per element.
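A toy, single-process simulation of this synchronized data parallelism, reusing the same hypothetical grad_fn and assuming the data has already been sharded across workers (illustration only, not the actual parameter-server implementation):

```python
import numpy as np

def parallel_sgd_step(x, worker_shards, grad_fn, lr, batch_size):
    """Simulate one synchronized data-parallel step: each 'worker' computes a
    mini-batch gradient on its own data shard, the 'parameter server' averages
    the N worker gradients, and every replica applies the identical update."""
    worker_grads = []
    for shard in worker_shards:                                   # one shard per worker
        idx = np.random.choice(len(shard), size=batch_size, replace=False)
        worker_grads.append(np.mean([grad_fn(shard[j], x) for j in idx], axis=0))
    h = np.mean(worker_grads, axis=0)                             # aggregation on the server
    return x - lr * h                                             # same x_{t+1} on all replicas
```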

  11. Communication Bottleneck. [Figure: computation, communication, and total time (s) vs. number of workers (1 to 512) for AlexNet; per-worker computation time shrinks as workers are added while communication time grows, so communication dominates at scale. Credit: Alexander Ulanov]

  12. Background: Distributed Deep Learning. Distributed training can shorten training from weeks or months to hours, but communication becomes the bottleneck. [Same figure: computation, communication, and total time vs. number of workers.]

  13. An Alternative Setting. Only gradients are exchanged: the parameter server aggregates $\boldsymbol{h}_t = \sum_i \boldsymbol{h}_t^{(i)}$ and broadcasts $\boldsymbol{h}_t$, and every worker applies $\boldsymbol{x}_{t+1} \leftarrow \boldsymbol{x}_t - \eta_t \boldsymbol{h}_t$ locally. Gradient quantization can then reduce communication in both directions.

  14. Gradient Quantization for Communication Reduction. Baseline: workers exchange full-precision (32-bit) gradients with the parameter server, which aggregates $\boldsymbol{h}_t = \sum_i \boldsymbol{h}_t^{(i)}$; each worker then updates $\boldsymbol{x}_{t+1} \leftarrow \boldsymbol{x}_t - \eta_t \boldsymbol{h}_t$.

  15. Gradient Quantization for Communication Reduction. TernGrad: workers exchange quantized gradients using fewer than 2 bits per element; the parameter server aggregates them and each worker updates $\boldsymbol{x}_{t+1} \leftarrow \boldsymbol{x}_t - \eta_t \boldsymbol{h}_t$.

  16. Stochastic Gradients without Bias.
     Batch gradient descent: $C(\boldsymbol{x}) \triangleq \frac{1}{n}\sum_{i=1}^{n} Q(z_i, \boldsymbol{x})$, $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}$.
     SGD: $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t\, g_t^{(j)}$, where $j$ is randomly drawn from $[1, n]$; $\mathbb{E}\{g_t^{(j)}\} = \nabla C(\boldsymbol{x})$, so there is no bias.
     TernGrad: $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta_t\, \mathrm{ternarize}(g_t^{(j)})$; $\mathbb{E}\{\mathrm{ternarize}(g_t^{(j)})\} = \nabla C(\boldsymbol{x})$, so there is no bias either.

  17. TernGrad is Simple. Ternarization: $\tilde{\boldsymbol{g}}_t = \mathrm{ternarize}(\boldsymbol{g}_t) = s_t \cdot \mathrm{sign}(\boldsymbol{g}_t) \circ \boldsymbol{b}_t$, where $s_t \triangleq \|\boldsymbol{g}_t\|_\infty = \max(\mathrm{abs}(\boldsymbol{g}_t))$ and $\boldsymbol{b}_t$ is a random binary vector with $P(b_{tk} = 1 \mid \boldsymbol{g}_t) = |g_{tk}|/s_t$ and $P(b_{tk} = 0 \mid \boldsymbol{g}_t) = 1 - |g_{tk}|/s_t$.
     Example: $\boldsymbol{g}_t^{(i)} = [0.30, -1.20, \ldots, 0.90]$, so $s_t = 1.20$, signs $= [1, -1, \ldots, 1]$, $P(b_{tk} = 1 \mid \boldsymbol{g}_t) = [\tfrac{0.30}{1.20}, \tfrac{1.20}{1.20}, \ldots, \tfrac{0.90}{1.20}]$, one sample $\boldsymbol{b}_t = [0, 1, \ldots, 1]$, and $\tilde{\boldsymbol{g}}_t^{(i)} = [0, -1, \ldots, 1] \times 1.20$.
     No bias: $\mathbb{E}_{z,b}\{\tilde{\boldsymbol{g}}_t\} = \mathbb{E}_{z,b}\{s_t \cdot \mathrm{sign}(\boldsymbol{g}_t) \circ \boldsymbol{b}_t\} = \mathbb{E}_z\{s_t \cdot \mathrm{sign}(\boldsymbol{g}_t) \circ \mathbb{E}_b\{\boldsymbol{b}_t \mid \boldsymbol{z}_t\}\} = \mathbb{E}_z\{\boldsymbol{g}_t\} = \nabla_{\boldsymbol{w}} C(\boldsymbol{w}_t)$.
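A minimal NumPy sketch of this ternarization (the function name and layout are mine, not the API of the linked repository):

```python
import numpy as np

def ternarize(g):
    """TernGrad-style stochastic ternarization: g_tilde = s * sign(g) * b, with
    s = max|g| and P(b_k = 1 | g) = |g_k| / s, so E[g_tilde | g] = g (unbiased)."""
    s = np.max(np.abs(g))
    if s == 0.0:
        return np.zeros_like(g)
    b = (np.random.random_sample(g.shape) < np.abs(g) / s).astype(g.dtype)  # Bernoulli mask
    return s * np.sign(g) * b

# Example matching the slide: for g = [0.30, -1.20, 0.90], s = 1.20, and one
# possible sample is [0, -1, 1] * 1.20; on average the output equals g.
```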

  18. TernGrad is Simple

  19. Convergence. Standard SGD converges almost surely under standard assumptions (Fisk 1965; Metivier 1981 & 1983; Bottou 1998):
     Assumption 1: $C(\boldsymbol{w})$ has a single minimum $\boldsymbol{w}^*$ and $\forall \epsilon > 0,\ \inf_{\|\boldsymbol{w}-\boldsymbol{w}^*\|^2 > \epsilon} (\boldsymbol{w}-\boldsymbol{w}^*)^{T}\nabla_{\boldsymbol{w}} C(\boldsymbol{w}) > 0$.
     Assumption 2: the learning rate $\eta_t$ decreases neither too fast nor too slowly: $\sum_{t=0}^{+\infty}\eta_t^2 < +\infty$ and $\sum_{t=0}^{+\infty}\eta_t = +\infty$.
     Assumption 3 (gradient bound): standard SGD requires $\mathbb{E}\{\|\boldsymbol{g}\|_2^2\} \le A + B\|\boldsymbol{w}-\boldsymbol{w}^*\|^2$, under which it converges almost surely. TernGrad converges almost surely under the stronger gradient bound $\mathbb{E}\{\|\boldsymbol{g}\|_\infty \cdot \|\boldsymbol{g}\|_1\} \le A + B\|\boldsymbol{w}-\boldsymbol{w}^*\|^2$.

  20. Closing the Bound Gap: Layer-wise Ternarizing. Two methods push the gradient bound of TernGrad closer to the bound of standard SGD. Approach: 1. split the gradients into buckets (e.g., layer 1, layer 2, …, layer n); 2. collect and ternarize bucket by bucket. When the bucket size is 1, TernGrad reduces to floating-point SGD.
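A sketch of bucket-wise (here, layer-wise) ternarizing that reuses the ternarize function sketched above; the per-layer dictionary layout is an assumption:

```python
def ternarize_per_bucket(grads_by_layer):
    """Apply TernGrad bucket by bucket (one bucket per layer here), so every
    bucket gets its own scaler s = max|g|, tightening the gradient bound."""
    return {name: ternarize(g) for name, g in grads_by_layer.items()}
```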

  21. Closing the Bound Gap: Gradient Clipping. $f(g_i) = g_i$ if $|g_i| \le c\sigma$, and $f(g_i) = \mathrm{sign}(g_i)\cdot c\sigma$ if $|g_i| > c\sigma$. $c = 2.5$ works well for all tested datasets, DNNs, and optimizers. Assuming a Gaussian gradient distribution, clipping 1. changes the gradient length by only 1.0%-1.5%, 2. changes its direction by only 2°-3°, and 3. introduces a small bias while reducing variance. [Figure: gradient histograms over iterations for a conv layer and an fc layer, (a) original vs. (b) clipped.]
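A sketch of this clipping rule, assuming σ is estimated as the standard deviation of the gradient elements themselves:

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Clip every element to [-c*sigma, c*sigma] before ternarizing, where sigma
    is the standard deviation of the gradient elements (c = 2.5 per the slide)."""
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)
```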

  22. Gradient Histograms. [Figure: gradient histograms over iterations for a conv layer and an fc layer: (a) original, (b) clipped, (c) ternary, (d) final.] The averaged gradients on the server are not ternary, but quantized with $2N+1$ levels ($N$: worker number), so server-to-worker communication is still reduced: $\frac{32}{\log_2(2N+1)} > 1$ unless $N$ is impractically large.
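A rough worked example of these reduction ratios, with illustrative numbers of my own (not from the slides):

```latex
% Worker-to-server: ternary values need ~2 bits per element vs. 32-bit floats:
%   32 / 2 = 16x reduction.
% Server-to-worker, N = 8 workers: the averaged gradient takes 2N + 1 = 17 levels:
\frac{32}{\log_2(2N+1)}\bigg|_{N=8} = \frac{32}{\log_2 17} \approx \frac{32}{4.09} \approx 7.8\times
```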

  23. Integration with Manifold Optimizers. (All experiments: all hyper-parameters are tuned for standard SGD and then fixed for TernGrad.) LeNet, total mini-batch size 64: TernGrad reaches accuracy close to the baseline, and its randomness results in only small variance. [Figure: accuracy (98.00%-100.00%) vs. number of workers N ∈ {2, 4, 8, 16, 32, 64} for (a) momentum SGD and (b) vanilla SGD, baseline vs. TernGrad.]

  24. Integration with Manifold Optimizers. CIFAR-10, mini-batch size 64 per worker (Adam: D. P. Kingma, 2014). Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap. TernGrad works with manifold optimizers: vanilla SGD, momentum SGD, and Adam.

  25. Scaling to Large-scale Deep Learning (AlexNet). Because of the randomness TernGrad introduces, (1) decrease the randomness in dropout or other regularization and (2) use a smaller weight decay. No new hyper-parameters are added. (Large-batch training: N. S. Keskar et al., ICLR 2017.)

  26. Convergence Curves. [Figure, AlexNet trained on 4 workers with total mini-batch size 512: (a) top-1 accuracy vs. iteration and (b) training loss vs. iteration, baseline vs. TernGrad; (c) gradient sparsity of TernGrad in fc6.] TernGrad converges within the same number of epochs under the same base learning rate.

  27. Scaling to Large-scale Deep Learning. GoogLeNet accuracy loss is less than 2% on average. Tuning hyper-parameters specifically for TernGrad may reduce the accuracy gap.

  28. Performance Model. [Figure: training throughput (images/sec) on a GPU cluster with Ethernet and a PCI switch for AlexNet, GoogLeNet, and VggNet-A, FP32 vs. TernGrad, from 1 to 512 GPUs.] TernGrad gives a higher speedup when 1. using more workers, 2. using smaller communication bandwidth (Ethernet vs. InfiniBand), and 3. training DNNs with more fully-connected layers (VggNet vs. GoogLeNet).
