
Crossbow: Scaling Deep Learning on Multi-GPU Servers - Peter Pietzuch



1. Large-Scale Data & Systems Group
Crossbow: Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch, with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa
Imperial College London, http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>
CASTOR Software Days, Stockholm, Sweden, October 2019

2. Machine Learning with Deep Neural Networks (DNNs)
• Revolutionised solutions in vision, speech recognition, ...
• DNN models are trained by giving examples (instead of programming)
• When the DNN output is wrong, tweak its parameters
[Figure: DNNs mapping text to topics, audio ("hello audience") to words, and images to labels]

3. Training DNNs
• Obtain a DNN model that minimises classification error
• Use Stochastic Gradient Descent (SGD) for training:
  1. Begin with a random model
  2. Consider a mini-batch of training data
  3. Iteratively calculate gradients & update model parameters w
[Figure: error as a function of the model parameters w, converging from a random model to the optimal model with the lowest error]
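A minimal sketch of the SGD loop on this slide; `grad_fn`, the learning rate, and the batch size are illustrative placeholders rather than anything from Crossbow:

```python
import numpy as np

def sgd(w, data, labels, grad_fn, lr=0.1, batch_size=32, steps=1000):
    """Start from (random) weights w, then repeatedly sample a mini-batch,
    compute the gradient of the loss, and step the parameters against it."""
    n = len(data)
    for _ in range(steps):
        idx = np.random.choice(n, size=batch_size, replace=False)  # mini-batch
        g = grad_fn(w, data[idx], labels[idx])  # average gradient over the batch
        w = w - lr * g                          # update the model parameters
    return w
```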

4. Training DNNs on GPUs
• GPUs are good at parallelising gradient computation

5. Training DNNs in Parallel with GPUs
• With large datasets, speed up training by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)
• But model replicas would diverge over time...
[Figure: a mini-batch of training data is split into shards 1-3, one per GPU; each GPU computes a gradient for its shard against its own model replica]
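A small illustration of splitting one mini-batch into per-GPU shards; the helper name is made up for this sketch:

```python
import numpy as np

def shard_batch(batch, num_gpus):
    """Split one mini-batch into num_gpus shards; each GPU then computes
    gradients for its shard against its own model replica."""
    return np.array_split(batch, num_gpus)

# Example: a mini-batch of 12 examples split across 3 GPUs -> 3 shards of 4
shards = shard_batch(np.arange(12), num_gpus=3)
```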

6. Model Synchronisation among GPUs
• Parameter server: maintains the global model
• GPUs:
  1. Send gradients to update the global model
  2. Synchronise local model replicas with the global model
[Figure: GPUs 1-3 each hold a local model, push gradients to the parameter server's global model, and pull updated weights back]
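A minimal single-process sketch of the parameter-server pattern on this slide; the class and method names are assumptions, and a real deployment would run workers and the server as separate processes or machines:

```python
import numpy as np

class ParameterServer:
    """Holds the global model; GPU workers push gradients and pull fresh weights."""

    def __init__(self, init_weights, lr=0.1):
        self.weights = np.array(init_weights, dtype=float)
        self.lr = lr

    def push_gradient(self, grad):
        # (1) A worker's gradient updates the global model
        self.weights -= self.lr * np.asarray(grad)

    def pull_weights(self):
        # (2) Workers synchronise their local replicas with the global model
        return self.weights.copy()
```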

7. The Problem with Large Batch Sizes
Yann LeCun (@ylecun): "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32." (26 Apr 2018)

8. Why Use Large Batch Sizes?
[Figure: the gradients of a batch, a bigger batch, and an even bigger batch of the dataset (e.g. ~32 to 256 labelled images) are averaged into a single update of the weights]
• Keep work per GPU constant to scale
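Why larger batches map naturally onto more GPUs: the gradient of a bigger batch is just the average of the gradients of its shards. A toy numerical check (numbers are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 10))            # toy per-example gradients

full = grads.mean(axis=0)                     # gradient of one 256-example batch
shards = np.array_split(grads, 8)             # eight 32-example shards (one per GPU)
averaged = np.mean([s.mean(axis=0) for s in shards], axis=0)

assert np.allclose(full, averaged)            # averaging shard gradients == big-batch gradient
```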

9. What is the Best Batch Size on a GPU?
• ResNet-32 on an NVIDIA Titan X GPU
[Bar chart: time to accuracy (sec) with TensorFlow for batch size b = 32, 64, 128, 256, 512, 1024: roughly 1134, 445, 379, 361, 354, and 302 seconds respectively]

10. Training DNNs Favours Small Batch Sizes
• We want frequent, less "noisy" updates
[Figure: with small batches, each GPU applies many gradient updates to the weights w; with a large batch, gradients are averaged into a single update]
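To make "frequent updates" concrete: the number of parameter updates per epoch is the dataset size divided by the batch size. The numbers below are illustrative, assuming a 50,000-image training set such as CIFAR-10:

```python
dataset_size = 50_000               # e.g. the CIFAR-10 training set
for batch_size in (32, 1024):
    updates_per_epoch = dataset_size // batch_size
    print(f"batch size {batch_size:4d}: {updates_per_epoch} parameter updates per epoch")
```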

11. Statistical Efficiency Needs Small Batch Sizes
[Plot: test accuracy (%) over 140 epochs for batch sizes b = 64 to 4096; small batch sizes reach higher test accuracy than large ones]

12. Hardware Efficiency Needs Large Batch Sizes
[Figure: GPUs 1 and 2 each compute a gradient on a batch of size ½b, average the gradients, update their replicas, and repeat]
• Keep work per GPU constant → increase batch size with #GPUs
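The weak-scaling rule from this slide in arithmetic form, assuming (for illustration) a fixed per-GPU batch of 32 images:

```python
per_gpu_batch = 32                  # work per GPU kept constant
for num_gpus in (1, 2, 4, 8, 16):
    global_batch = per_gpu_batch * num_gpus
    print(f"{num_gpus:2d} GPUs -> global batch size {global_batch}")
```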

13. Tension between Hardware & Statistical Efficiency
Yann LeCun (@ylecun): "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32."
• Practitioners increase the batch size due to hardware efficiency
• But the best batch size depends on both hardware & statistical efficiency

14. Current Practice: Hyper-Parameter Tuning
• Adjust hyper-parameters (e.g. learning rate, momentum) to avoid a reduction in statistical efficiency
• Linear scaling rule: "When the mini-batch size is multiplied by k, multiply the learning rate by k" (Goyal et al., 2017)
• Drawbacks:
  – Manual, labour-intensive process
  – Highly model-specific: not portable and does not work for some models
  – Less effective for very large batch sizes...
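The linear scaling rule as a one-line helper; the base values (learning rate 0.1 at batch size 256, as in Goyal et al.'s ResNet-50 recipe) are used only for illustration:

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k."""
    k = batch / base_batch
    return base_lr * k

# Example: scaling a recipe tuned at batch size 256 with lr 0.1 up to batch size 8192
print(scaled_learning_rate(base_lr=0.1, base_batch=256, batch=8192))  # 3.2
```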

15. Limits of Hyper-Parameter Tuning
"When the mini-batch size is multiplied by k, multiply the learning rate by k"
[Plot: top-1 validation error for ResNet-50 (32 images/GPU) vs. mini-batch size from 256 to 65,536; error stays flat up to about 8,192 and then rises sharply]

16. Fundamental Challenge of GPU Scaling
• "If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration." – J. Dean, D. Patterson et al., "A New Golden Age in Computer Architecture", IEEE Micro, 2018
• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?

17. (1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?
[Figure: model replicas and tasks on GPU-1 and GPU-2, with reusable data buffers (buffer-1 to buffer-4, some currently in use)]

18. Problem: Small Batch Sizes Underutilise GPUs
[Plot: CDF (%) of GPU occupancy (%), showing under-used resources]

19. How to Process Small Batches Efficiently?
• One batch per GPU → not enough data and instruction parallelism for every operator
[Plot: GPU utilisation (%) across the operations of one batch per GPU]

20. Idea: Train Multiple Model Replicas per GPU
• One learning process (or learner) per GPU stream
[Figure: a scheduler dispatches learners onto separate GPU streams of one GPU; each learner computes gradients on its own batch and weights]
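A rough sketch of the idea of one learner per GPU stream. This uses PyTorch CUDA streams purely for illustration; it is not Crossbow's task scheduler, the loss is a toy, and it needs a CUDA-capable GPU to run:

```python
import torch

def train_step(model, batch, lr=0.1):
    """One learner step: compute the gradient on a small batch and update
    this learner's own model replica."""
    loss = model(batch).pow(2).mean()        # toy loss, for illustration only
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                 # SGD update of the replica
            p.grad = None

# Three learners (model replicas) share one GPU, each issuing work on its own stream
device = torch.device("cuda")
replicas = [torch.nn.Linear(64, 1).to(device) for _ in range(3)]
streams = [torch.cuda.Stream() for _ in replicas]
batches = [torch.randn(32, 64, device=device) for _ in replicas]

for model, stream, batch in zip(replicas, streams, batches):
    with torch.cuda.stream(stream):          # kernels from different learners may overlap
        train_step(model, batch)
torch.cuda.synchronize()
```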

21. Effect of Training Multiple Model Replicas per GPU
[Plot: throughput increase (%) over batch sizes b = 32 to 1024, showing regained resources]
• But now we must synchronise a large number of learners/model replicas...

22. (1) How to increase efficiency with small batches? → Train multiple model replicas per GPU
(2) How to synchronise model replicas?
[Figure: multiple model replicas and tasks per GPU]

23. Problem: Why not Synchronous Parallel SGD?
• All learners always start from the same point
• Limited exploration of the parameter space

24. Idea: Maintain Independent Model Replicas
[Figure: from the initial weights, replica X's and replica Y's trajectories diverge around the average model trajectory]
• Benefits:
  – Increased exploration of the space through parallelism
  – Each model replica uses a small batch size

25. Crossbow: Synchronous Model Averaging
• Allow learners to diverge, but correct their trajectories based on the average model
• Accelerate the average model trajectory with momentum to find minima faster
[Figure: replica trajectories receive momentum-accelerated corrections towards the average model]
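A simplified, NumPy-only sketch in the spirit of this slide: each learner takes its own SGD step on its small batch, replicas are corrected towards the average (central) model, and the central model's trajectory is accelerated with momentum. Function and hyper-parameter names are illustrative assumptions, not Crossbow's actual implementation:

```python
import numpy as np

def sma_step(replicas, central, velocity, grads, lr=0.1, alpha=0.1, mu=0.9):
    """One synchronous model-averaging step (simplified):
    1. each replica takes its own SGD step on its small batch,
    2. each replica is corrected towards the central (average) model,
    3. the central model moves by the accumulated corrections, with momentum."""
    corrections = np.zeros_like(central)
    for i, g in enumerate(grads):
        replicas[i] = replicas[i] - lr * g   # independent SGD step per learner
        diff = replicas[i] - central         # how far this replica has diverged
        replicas[i] -= alpha * diff          # pull the replica towards the centre
        corrections += alpha * diff          # ...and the centre towards the replicas
    velocity = mu * velocity + corrections   # momentum-accelerated central update
    central = central + velocity
    return replicas, central, velocity
```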

26. GPUs with Synchronous Model Averaging
• Synchronously apply corrections to model replicas
[Figure: two learners, each with its own replica, on each of GPUs 1-3]

27. GPUs with Synchronous Model Averaging
• Synchronously apply corrections to model replicas
[Figure: as before, but each GPU adds a dedicated learner: one maintains the average model, the others maintain per-GPU reference models]

28. GPUs with Synchronous Model Averaging
• Ensures a consistent view of the average model
• Takes GPU bandwidth into account during synchronisation
[Figure: synchronous model averaging between the average model and the per-GPU reference models, which in turn correct the learners' replicas on each GPU]
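A sketch of the two-level averaging suggested by slides 26-28: learners on a GPU are first combined into that GPU's reference model, and the reference models are then combined across GPUs, so inter-GPU bandwidth carries one model per GPU rather than one per learner. The helper below is illustrative only and not taken from Crossbow's code:

```python
import numpy as np

def hierarchical_average(replicas_per_gpu):
    """replicas_per_gpu: one list of replica weight arrays per GPU."""
    # Stage 1: average the learners on each GPU into that GPU's reference model
    reference_models = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    # Stage 2: average the reference models across GPUs into the global average model
    return np.mean(reference_models, axis=0)

# Example: 3 GPUs, 2 learners each, 4-dimensional toy weights
replicas = [[np.random.rand(4) for _ in range(2)] for _ in range(3)]
average_model = hierarchical_average(replicas)
```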
