  1. Resource Elasticity in Distributed Deep Learning. Andrew Or, Haoyu Zhang*, Michael J. Freedman. Princeton University, *Google AI. MLSys 2020.

  2. Resource allocation today. [Figure: throughput vs. cluster size for a given dataset, model, and hardware: 16 GPUs: 2200 images/sec; 32 GPUs: 4000 images/sec; 64 GPUs: 5000 images/sec.] Users rely on a manual trial-and-error process to find a resource-efficient cluster size.
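The diminishing returns in these numbers can be made explicit with a quick calculation (a minimal sketch in Python; the throughput figures come from the slide above, everything else is illustrative):

    # Throughput figures from the slide above.
    throughputs = {16: 2200, 32: 4000, 64: 5000}  # GPUs -> total images/sec

    for gpus, rate in throughputs.items():
        print(f"{gpus} GPUs: {rate} img/s total, {rate / gpus:.0f} img/s per GPU")
    # Per-GPU throughput drops from ~138 img/s at 16 GPUs to ~78 img/s at 64 GPUs:
    # doubling the cluster from 32 to 64 GPUs yields only 25% more throughput.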

  3. Manual trial-and-error resource allocation. Cumbersome: it is difficult to estimate scaling behavior across diverse hardware topologies, communication algorithms, etc. Time-consuming: each trial restarts the entire program, which must reload libraries, rebuild the model, prepare the input pipeline, etc., and can take minutes of device idle time. Static allocation: vulnerable to stragglers.

  4. Because manual allocation is cumbersome, time-consuming, and static, users today often under- or over-allocate resources.

  5. Resource Elasticity in Distributed Deep Learning: autoscaling to dynamically search for a resource-efficient cluster. This leads to shorter job completion times (up to 45% reduction) and lower costs (up to 85.1% reduction in GPU time).

  6. Resource elasticity is not a new idea: it is already common in distributed computing, cloud services, and cluster management, but not yet in distributed deep learning.

  7. Why is resource elasticity not adopted yet? Hurdle #1: lack of applicable scaling heuristics. Hurdle #2: existing frameworks assume static allocation. Hurdle #3: how to scale the batch size?

  8. Hurdle #1: Lack of applicable scaling heuristics. Existing heuristics are based on dynamic resource demands, e.g. request more containers if CPU utilization exceeds X%, or kill a worker if it has been idle for X seconds. In deep learning workloads, however, resource utilization is typically consistent across batches (which are short), and workers are rarely idle.
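For contrast, the kind of demand-driven rule the slide describes might look like the sketch below (thresholds and names are illustrative, not taken from any particular system); in deep learning training these triggers rarely fire because utilization is steady and workers stay busy:

    # Classic elasticity heuristic driven by dynamic resource demand.
    def classic_scaling_decision(cpu_utilization, idle_seconds,
                                 util_threshold=0.80, idle_threshold=60):
        if cpu_utilization > util_threshold:
            return "request more containers"   # demand is high
        if idle_seconds > idle_threshold:
            return "kill idle worker"          # capacity is wasted
        return "no change"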

  9. Hurdle #2: Existing frameworks assume static allocation. Models are structured as static graphs [Abadi et al., 2015], and communication operations are hard-coded into these graphs (PyTorch has "dynamic" graphs, but they are dynamic only in their inputs). Synchronization primitives assume a fixed number of devices, e.g. TensorFlow's SyncReplicasOptimizer and MultiWorkerMirroredStrategy.

  10. Hurdle #3: How to scale the batch size? Option 1: fix the per-device batch size and vary the global batch size (e.g. per-device 32, global 128 → 256). This preserves per-device efficiency, but large batch sizes may compromise convergence behavior [Keskar et al., 2016; Goyal et al., 2017; Hoffer et al., 2017]. Option 2: fix the global batch size and vary the per-device batch size (e.g. global 128, per-device 32 → 16). This preserves convergence behavior but sacrifices per-device efficiency and overall performance.
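The two options can be summarized in a short sketch (using the slide's numbers; the helper functions are purely illustrative):

    # Option 1: keep the per-device batch fixed; the global batch grows with workers.
    def fix_per_device(per_device, num_workers):
        return per_device, per_device * num_workers       # e.g. (32, 256) for 8 workers

    # Option 2: keep the global batch fixed; the per-device batch shrinks with workers.
    def fix_global(global_batch, num_workers):
        return global_batch // num_workers, global_batch  # e.g. (16, 128) for 8 workers

    print(fix_per_device(32, 8))  # (32, 256): efficient per device, may hurt convergence
    print(fix_global(128, 8))     # (16, 128): convergence preserved, per-device efficiency drops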

  11. Why is resource elasticity not adopted yet? Hurdle #1: lack of applicable scaling heuristics. Hurdle #2: existing frameworks assume static allocation. Hurdle #3: how to scale the batch size?

  12. Autoscaling System: scaling heuristics, integration, straggler mitigation.

  13. Autoscaling engine for distributed deep learning. [Figure: the engine monitors per-worker throughput (e.g. worker 1: 434 images/sec, worker 2: 608 images/sec, worker 3: 592 images/sec) and issues decisions such as "add 2 workers" or "replace worker 1".]
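A minimal sketch of the control loop this diagram suggests is shown below; measure_throughputs, scaling_decision, and apply_decision are hypothetical placeholders, not the system's actual API:

    # Hypothetical autoscaling loop: measure per-worker throughput, then let a
    # pluggable heuristic decide whether to add or replace workers.
    def autoscaling_step(measure_throughputs, scaling_decision, apply_decision):
        rates = measure_throughputs()        # e.g. {1: 434, 2: 608, 3: 592} images/sec
        decision = scaling_decision(rates)   # pluggable heuristic (see following slides)
        if decision is not None:
            apply_decision(decision)         # e.g. ("add", 2) or ("replace", 1)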

  14. Hurdle #1: Lack of applicable scaling heuristics. Design custom scaling heuristics based on: (1) throughput scaling efficiency, (2) utility vs. cost, ... The autoscaling engine can run with custom, pluggable heuristics.

  15. Scaling heuristics: throughput scaling efficiency. Intuition: measure the extra per-worker throughput relative to the existing per-worker throughput. Example: going from 4 workers at 400 img/s to 5 workers at 480 img/s gives a throughput scaling efficiency of s_{k,d} = (480 - 400) / (400 / 4) = 0.8.

  16. Scaling heuristics: throughput scaling efficiency. s_{k,d} = 1 means perfect scaling, s_{k,d} = 0 means no improvement, and s_{k,d} < 0 means negative scaling; in the example above (400 → 480 img/s), s_{k,d} = (480 - 400) / (400 / 4) = 0.8. Scaling condition #1: the throughput scaling efficiency s_{k,d} must stay above a user-configurable threshold.
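The efficiency computation from the slide can be written out directly; the general formula below is inferred from the slide's example, and the threshold is a user-set parameter rather than a value from the paper:

    # Throughput scaling efficiency when growing from k to k + d workers:
    #   s_{k,d} = (R_{k+d} - R_k) / (d * R_k / k),  R_k = aggregate throughput at k workers.
    def scaling_efficiency(r_k, r_k_plus_d, k, d):
        return (r_k_plus_d - r_k) / (d * r_k / k)

    s = scaling_efficiency(400, 480, k=4, d=1)  # (480 - 400) / (400 / 4) = 0.8
    # Scaling condition #1 (assumed form): keep scaling while s_{k,d} exceeds
    # a user-chosen threshold; 0.5 below is only an example value.
    add_more_workers = s > 0.5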

  17. Scaling heuristics: utility vs. cost. Intuition: compare a user-provided utility function (dollars as a function of job completion time) to the dollar cost of the job, where cost = total compute time × price per device per time unit. Scaling condition #2: scaling up is worthwhile only if the increase in utility outweighs the increase in cost.
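A sketch of how the second condition could be evaluated, assuming it compares the change in user-provided utility against the change in dollar cost (the precise formulation is in the paper; all names here are illustrative):

    # Cost = total compute time x price per device per time unit (from the slide).
    def dollar_cost(num_workers, completion_time_hours, price_per_device_hour):
        return num_workers * completion_time_hours * price_per_device_hour

    # Assumed form of scaling condition #2: a larger allocation is worthwhile only
    # if the gain in utility (a user-provided function of completion time, in
    # dollars) outweighs the extra dollar cost.
    def condition_2(utility, time_k, time_k_plus_d, k, d, price):
        delta_utility = utility(time_k_plus_d) - utility(time_k)
        delta_cost = dollar_cost(k + d, time_k_plus_d, price) - dollar_cost(k, time_k, price)
        return delta_utility >= delta_cost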

  18. Scaling in action: e.g. find the latest point at which the scaling condition passes. [Figure: throughput vs. number of workers.]
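One way to read "the latest point at which the scaling condition passes" is sketched below: scan increasing cluster sizes and keep the largest size for which the condition still held (candidate_sizes, throughput_at, and condition are placeholders):

    # Scan candidate cluster sizes in increasing order and remember the largest
    # size reached while the scaling condition still passes.
    def find_target_size(candidate_sizes, throughput_at, condition):
        target = candidate_sizes[0]
        for k, k_next in zip(candidate_sizes, candidate_sizes[1:]):
            if condition(throughput_at(k), throughput_at(k_next), k, k_next - k):
                target = k_next   # the condition passed at this step; keep the larger size
        return target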

  19. Hurdle #2: Existing frameworks assume static allocation. Solution: give each worker the illusion of local training. Workers independently apply a black-box function f that synchronizes gradients (e.g. a Horovod allreduce), i.e. grads = f(grads); when switching to a new allocation, the function is simply replaced, grads = f2(grads). This approach is portable across different frameworks.
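A minimal sketch of this design: training code always calls the same synchronization object, and resizing only swaps the function behind it (the class below is illustrative, not the paper's actual implementation; in practice f might wrap, for example, a Horovod allreduce):

    # Workers keep the illusion of local training; gradients pass through a
    # replaceable black-box function f, which is swapped on reallocation.
    class ElasticSync:
        def __init__(self, sync_fn):
            self.sync_fn = sync_fn            # f: gradients -> synchronized gradients

        def replace(self, new_sync_fn):
            self.sync_fn = new_sync_fn        # switch to f2 when the allocation changes

        def __call__(self, grads):
            return self.sync_fn(grads)        # grads = f(grads)

    # Hypothetical use inside a training step:
    #   grads = compute_gradients(batch)
    #   grads = sync(grads)                   # sync is an ElasticSync instance
    #   apply_gradients(grads)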

  20. Hurdle #3: How to scale the batch size? The user provides an upper batch size limit; the global batch size is increased with a fixed per-device batch size until the limit is reached, after which the per-device batch size shrinks (e.g. per-device 32 at global 128 → per-device 32 at global 256 → per-device 22 at global 256). Finding an optimal batch size for arbitrary workloads remains an open problem [Hoffer et al., 2018; Shallue et al., 2018; Smith et al., 2018].
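The rule described on this slide might be expressed as follows (a sketch; the exact rounding behavior at the limit is a guess):

    # Keep the per-device batch fixed and grow the global batch with the worker
    # count, until a user-provided upper limit; past the limit, cap the global
    # batch and shrink the per-device batch instead.
    def batch_sizes(per_device, num_workers, global_limit):
        global_batch = per_device * num_workers
        if global_batch <= global_limit:
            return per_device, global_batch
        return global_limit // num_workers, global_limit

    print(batch_sizes(32, 8, global_limit=256))   # (32, 256): below the limit
    print(batch_sizes(32, 16, global_limit=256))  # (16, 256): limit hit, per-device shrinks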

  21. Straggler mitigation comes almost for free: once a straggler is detected, it is replaced using the same mechanisms. Refer to the paper for the details of straggler detection.
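The detection criterion itself is in the paper; purely as an illustration of how detection feeds the same replacement mechanism, one plausible (assumed) rule flags any worker whose throughput falls well below the median:

    # Illustrative straggler detection: flag workers far below the median throughput.
    # The slack factor and the rule itself are assumptions, not the paper's method.
    def detect_stragglers(rates, slack=0.8):
        median = sorted(rates.values())[len(rates) // 2]
        return [worker for worker, rate in rates.items() if rate < slack * median]

    # e.g. detect_stragglers({1: 434, 2: 608, 3: 592}) returns [1];
    # worker 1 is then replaced through the same autoscaling mechanisms.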

  22. Evaluation: job completion time, GPU time, idle time.

  23. Experiment setup. GPU cluster: 8 machines, each with 8 NVIDIA V100 GPUs (64 total), 64 Intel Xeon CPUs @ 2.2 GHz, 250 GB memory, 16 Gbps network. CPU cluster: 60 machines, each with 16 Intel Xeon CPUs @ 2.6 GHz (960 total), 64 GB memory, 1 Gbps network.

  24. Autoscaling reduces job completion time. [Figure: ResNet-50 on CIFAR-10: average reduction 19.4%, max 45.0%; ResNet-50 on ImageNet: average reduction 8.23%, max 16.0%.]

  25. Autoscaling reduces GPU time. [Figure: ResNet-50 on CIFAR-10: average increase 7.39%, max 14.7%; ResNet-50 on ImageNet: average reduction 58.6%, max 85.1%.]

  26. Autoscaling finds the target configuration quickly. [Figure: ResNet-50 on CIFAR-10 and ResNet-50 on ImageNet; time to reach the target configuration: avg 264s, max 583s for one workload and avg 61.0s, max 78.4s for the other.]

  27. Relative to training until convergence, the time spent finding the target configuration is small: under 2% of total time in one case and under 6% in the other (avg 264s, max 583s; avg 61.0s, max 78.4s).

  28. Autoscaling has short idle times. [Figure: average idle time during a transition, in seconds.]

  29. Resource Elasticity in Distributed Deep Learning: autoscaling to dynamically search for a resource-efficient cluster leads to shorter job completion times (up to 45% reduction) and lower costs (up to 85.1% reduction in GPU time).
