
Crossbow: Scaling Deep Learning on Multi-GPU Servers - Peter Pietzuch



1. Large-Scale Data & Systems Group
Crossbow: Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch, with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa
Imperial College London, http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>
CASTOR Software Days, Stockholm, Sweden, October 2019

2. Machine Learning with Deep Neural Networks (DNNs)
• Revolutionised solutions in vision, speech recognition, ...
• DNN models are trained by giving examples (instead of programming)
• When the DNN output is wrong, tweak its parameters
[Figure: DNNs mapping text to topics, audio ("hello audience") to words, and images to labels]

3. Training DNNs
• Obtain a DNN model that minimises classification error
• Use Stochastic Gradient Descent (SGD) for training:
  1. Begin with a random model
  2. Consider a mini-batch of training data
  3. Iteratively calculate gradients & update model parameters w
[Figure: error as a function of the model parameters w, converging from a random model to the optimal model with the lowest error]
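A minimal sketch of the SGD loop on this slide; `grad_fn`, the learning rate, and the batch size are illustrative placeholders rather than anything from Crossbow:

```python
import numpy as np

def sgd(w, data, labels, grad_fn, lr=0.1, batch_size=32, steps=1000):
    """Start from (random) weights w, then repeatedly sample a mini-batch,
    compute the gradient of the loss, and step the parameters against it."""
    n = len(data)
    for _ in range(steps):
        idx = np.random.choice(n, size=batch_size, replace=False)  # mini-batch
        g = grad_fn(w, data[idx], labels[idx])  # average gradient over the batch
        w = w - lr * g                          # update the model parameters
    return w
```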

4. Training DNNs on GPUs
• GPUs are good at parallelising gradient computation

5. Training DNNs in Parallel with GPUs
• With large datasets, speed up training by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)
• But model replicas would diverge over time...
[Figure: a mini-batch of training data is split into shards 1-3, one per GPU; each GPU computes a gradient for its shard against its own model replica]
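A small illustration of splitting one mini-batch into per-GPU shards; the helper name is made up for this sketch:

```python
import numpy as np

def shard_batch(batch, num_gpus):
    """Split one mini-batch into num_gpus shards; each GPU then computes
    gradients for its shard against its own model replica."""
    return np.array_split(batch, num_gpus)

# Example: a mini-batch of 12 examples split across 3 GPUs -> 3 shards of 4
shards = shard_batch(np.arange(12), num_gpus=3)
```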

6. Model Synchronisation among GPUs
• Parameter server: maintains the global model
• GPUs:
  1. Send gradients to update the global model
  2. Synchronise local model replicas with the global model
[Figure: GPUs 1-3 each hold a local model, push gradients to the parameter server's global model, and pull updated weights back]
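A minimal single-process sketch of the parameter-server pattern on this slide; the class and method names are assumptions, and a real deployment would run workers and the server as separate processes or machines:

```python
import numpy as np

class ParameterServer:
    """Holds the global model; GPU workers push gradients and pull fresh weights."""

    def __init__(self, init_weights, lr=0.1):
        self.weights = np.array(init_weights, dtype=float)
        self.lr = lr

    def push_gradient(self, grad):
        # (1) A worker's gradient updates the global model
        self.weights -= self.lr * np.asarray(grad)

    def pull_weights(self):
        # (2) Workers synchronise their local replicas with the global model
        return self.weights.copy()
```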

7. The Problem with Large Batch Sizes
Yann LeCun (@ylecun): "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32." (26 Apr 2018)

8. Why Use Large Batch Sizes?
[Figure: the gradients of a batch, a bigger batch, and an even bigger batch of the dataset (e.g. ~32 to 256 labelled images) are averaged into a single update of the weights]
• Keep work per GPU constant to scale
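Why larger batches map naturally onto more GPUs: the gradient of a bigger batch is just the average of the gradients of its shards. A toy numerical check (numbers are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 10))            # toy per-example gradients

full = grads.mean(axis=0)                     # gradient of one 256-example batch
shards = np.array_split(grads, 8)             # eight 32-example shards (one per GPU)
averaged = np.mean([s.mean(axis=0) for s in shards], axis=0)

assert np.allclose(full, averaged)            # averaging shard gradients == big-batch gradient
```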

9. What is the Best Batch Size on a GPU?
• ResNet-32 on an NVIDIA Titan X GPU
[Bar chart: time to accuracy (sec) with TensorFlow for batch size b = 32, 64, 128, 256, 512, 1024: roughly 1134, 445, 379, 361, 354, and 302 seconds respectively]

10. Training DNNs Favours Small Batch Sizes
• We want frequent, less "noisy" updates
[Figure: with small batches, each GPU applies many gradient updates to the weights w; with a large batch, gradients are averaged into a single update]
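To make "frequent updates" concrete: the number of parameter updates per epoch is the dataset size divided by the batch size. The numbers below are illustrative, assuming a 50,000-image training set such as CIFAR-10:

```python
dataset_size = 50_000               # e.g. the CIFAR-10 training set
for batch_size in (32, 1024):
    updates_per_epoch = dataset_size // batch_size
    print(f"batch size {batch_size:4d}: {updates_per_epoch} parameter updates per epoch")
```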

11. Statistical Efficiency Needs Small Batch Sizes
[Plot: test accuracy (%) over 140 epochs for batch sizes b = 64 to 4096; small batch sizes reach higher test accuracy than large ones]

12. Hardware Efficiency Needs Large Batch Sizes
[Figure: GPUs 1 and 2 each compute a gradient on a batch of size ½b, average the gradients, update their replicas, and repeat]
• Keep work per GPU constant → increase batch size with #GPUs
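The weak-scaling rule from this slide in arithmetic form, assuming (for illustration) a fixed per-GPU batch of 32 images:

```python
per_gpu_batch = 32                  # work per GPU kept constant
for num_gpus in (1, 2, 4, 8, 16):
    global_batch = per_gpu_batch * num_gpus
    print(f"{num_gpus:2d} GPUs -> global batch size {global_batch}")
```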

13. Tension between Hardware & Statistical Efficiency
Yann LeCun (@ylecun): "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32."
• Practitioners increase the batch size due to hardware efficiency
• But the best batch size depends on both hardware & statistical efficiency

14. Current Practice: Hyper-Parameter Tuning
• Adjust hyper-parameters (e.g. learning rate, momentum) to avoid a reduction in statistical efficiency
• Linear scaling rule: "When the mini-batch size is multiplied by k, multiply the learning rate by k" (Goyal et al., 2017)
• Drawbacks:
  – Manual, labour-intensive process
  – Highly model-specific: not portable and does not work for some models
  – Less effective for very large batch sizes...
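The linear scaling rule as a one-line helper; the base values (learning rate 0.1 at batch size 256, as in Goyal et al.'s ResNet-50 recipe) are used only for illustration:

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k."""
    k = batch / base_batch
    return base_lr * k

# Example: scaling a recipe tuned at batch size 256 with lr 0.1 up to batch size 8192
print(scaled_learning_rate(base_lr=0.1, base_batch=256, batch=8192))  # 3.2
```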

15. Limits of Hyper-Parameter Tuning
"When the mini-batch size is multiplied by k, multiply the learning rate by k"
[Plot: top-1 validation error for ResNet-50 (32 images/GPU) vs. mini-batch size from 256 to 65,536; error stays flat up to about 8,192 and then rises sharply]

16. Fundamental Challenge of GPU Scaling
• "If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration." – J. Dean, D. Patterson et al., "A New Golden Age in Computer Architecture", IEEE Micro, 2018
• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?

17. (1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?
[Figure: model replicas and tasks on GPU-1 and GPU-2, with reusable data buffers (buffer-1 to buffer-4, some currently in use)]

18. Problem: Small Batch Sizes Underutilise GPUs
[Plot: CDF (%) of GPU occupancy (%), showing under-used resources]

19. How to Process Small Batches Efficiently?
• One batch per GPU → not enough data and instruction parallelism for every operator
[Plot: GPU utilisation (%) across the operations of one batch per GPU]

20. Idea: Train Multiple Model Replicas per GPU
• One learning process (or learner) per GPU stream
[Figure: a scheduler dispatches learners onto separate GPU streams of one GPU; each learner computes gradients on its own batch and weights]
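A rough sketch of the idea of one learner per GPU stream. This uses PyTorch CUDA streams purely for illustration; it is not Crossbow's task scheduler, the loss is a toy, and it needs a CUDA-capable GPU to run:

```python
import torch

def train_step(model, batch, lr=0.1):
    """One learner step: compute the gradient on a small batch and update
    this learner's own model replica."""
    loss = model(batch).pow(2).mean()        # toy loss, for illustration only
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                 # SGD update of the replica
            p.grad = None

# Three learners (model replicas) share one GPU, each issuing work on its own stream
device = torch.device("cuda")
replicas = [torch.nn.Linear(64, 1).to(device) for _ in range(3)]
streams = [torch.cuda.Stream() for _ in replicas]
batches = [torch.randn(32, 64, device=device) for _ in replicas]

for model, stream, batch in zip(replicas, streams, batches):
    with torch.cuda.stream(stream):          # kernels from different learners may overlap
        train_step(model, batch)
torch.cuda.synchronize()
```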

21. Effect of Training Multiple Model Replicas per GPU
[Plot: throughput increase (%) over batch sizes b = 32 to 1024, showing regained resources]
• But now we must synchronise a large number of learners/model replicas...

22. (1) How to increase efficiency with small batches? → Train multiple model replicas per GPU
(2) How to synchronise model replicas?
[Figure: multiple model replicas and tasks per GPU]

23. Problem: Why not Synchronous Parallel SGD?
• All learners always start from the same point
• Limited exploration of the parameter space

24. Idea: Maintain Independent Model Replicas
[Figure: from the initial weights, replica X's and replica Y's trajectories diverge around the average model trajectory]
• Benefits:
  – Increased exploration of the space through parallelism
  – Each model replica uses a small batch size

25. Crossbow: Synchronous Model Averaging
• Allow learners to diverge, but correct their trajectories based on the average model
• Accelerate the average model trajectory with momentum to find minima faster
[Figure: replica trajectories receive momentum-accelerated corrections towards the average model]
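A simplified, NumPy-only sketch in the spirit of this slide: each learner takes its own SGD step on its small batch, replicas are corrected towards the average (central) model, and the central model's trajectory is accelerated with momentum. Function and hyper-parameter names are illustrative assumptions, not Crossbow's actual implementation:

```python
import numpy as np

def sma_step(replicas, central, velocity, grads, lr=0.1, alpha=0.1, mu=0.9):
    """One synchronous model-averaging step (simplified):
    1. each replica takes its own SGD step on its small batch,
    2. each replica is corrected towards the central (average) model,
    3. the central model moves by the accumulated corrections, with momentum."""
    corrections = np.zeros_like(central)
    for i, g in enumerate(grads):
        replicas[i] = replicas[i] - lr * g   # independent SGD step per learner
        diff = replicas[i] - central         # how far this replica has diverged
        replicas[i] -= alpha * diff          # pull the replica towards the centre
        corrections += alpha * diff          # ...and the centre towards the replicas
    velocity = mu * velocity + corrections   # momentum-accelerated central update
    central = central + velocity
    return replicas, central, velocity
```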

26. GPUs with Synchronous Model Averaging
• Synchronously apply corrections to model replicas
[Figure: two learners, each with its own replica, on each of GPUs 1-3]

27. GPUs with Synchronous Model Averaging
• Synchronously apply corrections to model replicas
[Figure: as before, but each GPU adds a dedicated learner: one maintains the average model, the others maintain per-GPU reference models]

28. GPUs with Synchronous Model Averaging
• Ensures a consistent view of the average model
• Takes GPU bandwidth into account during synchronisation
[Figure: synchronous model averaging between the average model and the per-GPU reference models, which in turn correct the learners' replicas on each GPU]
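A sketch of the two-level averaging suggested by slides 26-28: learners on a GPU are first combined into that GPU's reference model, and the reference models are then combined across GPUs, so inter-GPU bandwidth carries one model per GPU rather than one per learner. The helper below is illustrative only and not taken from Crossbow's code:

```python
import numpy as np

def hierarchical_average(replicas_per_gpu):
    """replicas_per_gpu: one list of replica weight arrays per GPU."""
    # Stage 1: average the learners on each GPU into that GPU's reference model
    reference_models = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    # Stage 2: average the reference models across GPUs into the global average model
    return np.mean(reference_models, axis=0)

# Example: 3 GPUs, 2 learners each, 4-dimensional toy weights
replicas = [[np.random.rand(4) for _ in range(2)] for _ in range(3)]
average_model = hierarchical_average(replicas)
```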
