Distributed Training Khoa Le & Somin Wadhwa
Background
The Problem?
Go-to Solution: Distribute Machine Learning Applications Across Multiple Processors/Nodes
Traditional Machine Learning Task
Distributed Machine Learning Task
Obvious solution:
- Utilize TensorFlow's native support for distributed training.
Issues:
- New concepts... but not a lot of documentation!
- Simply didn't scale well. :(
Data Parallel Approach:
Updating gradients: Parameter Server Approach
- What is the right ratio of workers to parameter servers?
- Increased complexity means more and more information has to be passed around between workers and parameter servers...
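A minimal single-process NumPy sketch of the synchronous parameter-server pattern on a toy linear-regression task (the gradient function and the data shards are illustrative, not taken from any of the papers):

```python
import numpy as np

def grad_linreg(w, X, y):
    """Gradient of the mean-squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parameter_server_step(w, shards, lr=0.01):
    """One synchronous step: each worker computes a gradient on its own data
    shard and 'pushes' it; the server averages the gradients and updates the
    weights, which every worker then 'pulls' before the next step."""
    grads = [grad_linreg(w, X, y) for X, y in shards]   # worker side
    return w - lr * np.mean(grads, axis=0)              # server side

# Toy usage: 4 workers, each holding its own shard of the data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 3))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))

w = np.zeros(3)
for _ in range(500):
    w = parameter_server_step(w, shards)
print(w)   # close to true_w
```

All gradient and weight traffic funnels through the parameter server(s), which is why the worker-to-server ratio and the sheer communication volume quickly become the bottleneck.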
Better way to update gradients: Ring-AllReduce
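To make the idea concrete, below is a hedged single-process NumPy simulation of ring-allreduce (our sketch, not Baidu's or NCCL's actual code): each gradient is split into N chunks, a scatter-reduce pass sums chunks as they travel around the ring, and an all-gather pass distributes the finished sums, so each link carries only ~1/N of the data per step.

```python
import numpy as np

def ring_allreduce(tensors):
    """Single-process simulation of ring all-reduce over N 'workers'."""
    n = len(tensors)
    chunks = [list(np.array_split(np.asarray(t, dtype=float), n)) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                       # chunk worker i sends
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. Fully reduced chunks travel once more around the
    # ring, overwriting the stale copies on every worker.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                   # fully reduced chunk
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

# Every worker ends up with the same summed gradient.
grads = [np.arange(8.0) * (w + 1) for w in range(4)]
print(ring_allreduce(grads)[0])   # == sum of all four gradients
```

Bandwidth use is balanced: every worker sends and receives the same amount in every step, so no single node becomes a communication hotspot the way a parameter server can.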
Horovod:
- Created a standalone package based on Baidu's ring-allreduce implementation, fully integrated with TensorFlow.
- Replaced the original ring-allreduce implementation with NVIDIA's NCCL library (NCCL 2 supports ring-allreduce across multiple machines).
- Added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU. (How?)
- API improvements.
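For reference, a hedged sketch of the typical Horovod usage pattern (shown here with the modern Keras binding; the model and data are toy placeholders, and training would be launched with something like `horovodrun -np 4 python train.py`):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                        # one process per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                          # pin this process to its GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count
model.compile(
    optimizer=hvd.DistributedOptimizer(opt),      # gradients averaged via allreduce
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Toy data; in real use each worker would read its own shard of the dataset.
x = np.random.rand(1024, 32).astype('float32')
y = np.random.randint(0, 10, size=1024)
model.fit(x, y, batch_size=32, epochs=1,
          # rank 0 broadcasts initial weights so all replicas start identically
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
```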
Benchmarking:
Motivation for CROSSBOW
● To reduce training time, systems exploit data parallelism across many GPUs.
● How: parallel synchronous SGD.

Motivation for CROSSBOW
● To utilise many GPUs effectively, a large batch size is needed.
● Why: the communication overhead of moving data to and from each GPU dominates if the batch size is too small.

Motivation for CROSSBOW
● However, large batch sizes reduce statistical efficiency.
● Why: small batches make training converge faster (in terms of epochs) and are more likely to find solutions with better accuracy.

Motivation for CROSSBOW
● Typical solutions: dynamically adjusting the batch size, or other hyper-parameters such as learning rate and momentum.
● These do not always work well, because the methodology is time-consuming and model-specific.
● Need a DL system that trains effectively with small batch sizes (2-32) while still scaling to many GPUs.
=> CROSSBOW: a single-server, multi-GPU DL system that improves statistical efficiency as the number of GPUs increases, irrespective of the batch size.
Key contributions of CROSSBOW ● Synchronous model averaging ● Auto-tuning the number of learners ● Concurrent task engine
SMA with Learners (Important Concepts)
● Learner: an entity that trains a single model replica independently with a given batch size.
● In contrast to parallel S-SGD, the model replicas of learners evolve independently because they are not reset to a single global model after each batch.
SMA with Learners (Important Concepts)
Parallel synchronous SGD vs. SMA with learners
SMA with Learners (Important Concepts)
● Synchronize the learners' local models by maintaining a central average model.
● Prevent learners from diverging by applying a correction to each learner's model.
● Use momentum to make the central model converge faster than the learners.
SMA with Learners (Algorithm)
● Input/output
● Initialization
● Iterative learning process
● Update to the central average model
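A minimal NumPy sketch of the SMA iteration as we read it from the paper, on a toy quadratic loss; the hyper-parameter names (lr, alpha, mu) and the toy objective are ours, not CROSSBOW's exact notation:

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([3.0, -2.0, 1.0])

def grad(x):
    """Gradient of the toy loss f(x) = 0.5 * ||x - target||^2."""
    return x - target

k = 4                                          # number of learners
learners = [rng.normal(size=3) for _ in range(k)]
z = np.mean(learners, axis=0)                  # central average model
v = np.zeros_like(z)                           # momentum on the central model
lr, alpha, mu = 0.1, 0.1, 0.9

for _ in range(200):
    corrections = []
    for i in range(k):
        c = alpha * (learners[i] - z)          # pull learner towards the centre
        learners[i] -= lr * grad(learners[i]) + c
        corrections.append(c)
    v = mu * v + np.sum(corrections, axis=0)   # momentum accelerates z
    z += v

print(z)   # ~= target: learners stay independent yet converge together
```

Note the contrast with parallel S-SGD: the learners are never overwritten with the central model, they are only nudged towards it by the correction term.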
SMA with Learners (Expansion)
● To achieve high hardware utilisation, we can execute multiple learners per GPU.
● Local reference model vs. central average model.
CROSSBOW System Design
● Must share each GPU efficiently
● Decide on the number of learners per GPU at runtime
● Global SMA synchronization needs to be efficient
SMA with Learners (Crossbow Implementation)
● Data pre-processors: prepare the training dataset as batches
● Task manager: controls the pools of model replicas, input batches, and learner streams
● Task scheduler: assigns learning tasks to GPUs based on the available resources
● Auto-tuner: adjusts the number of learners per GPU at runtime
SMA with Learners (Crossbow Implementation)
SMA with Learners (Crossbow Implementation)
● Learner streams: each holds a learning task and a corresponding local synchronisation task
● Synchronisation streams: each holds a global synchronisation task
● Overlap the synchronisation tasks of one iteration with the learning tasks of the next
SMA with Learners (Crossbow Implementation)
● Concurrency!
● Use all-reduce (hello Horovod!) for inter-GPU operations: it evenly distributes the computation of the average-model update among the GPUs.
● Schedule new learning tasks on a first-come-first-served basis
Choosing the number of learners
● Too few, and a GPU is under-utilised, wasting resources
● Too many, and the execution of otherwise independent learners is partially sequentialised on a GPU, leading to a slow-down
=> Tune the number of learners per GPU based on the training throughput at runtime
Tuning Learners (CROSSBOW Implementation)
● Auto-tuner: measures the training throughput as the rate at which learning tasks complete, as recorded by the task manager
● On a server with homogeneous GPUs, only a single GPU's throughput needs to be measured to adapt the number of learners for all GPUs
Tuning Learners (CROSSBOW Implementation)
● Adding a learner to a GPU requires allocating a new model replica and a corresponding learner stream
● The auto-tuner temporarily places a global execution barrier between two successive iterations
● It also locks the resource pools, preventing access by the task scheduler or task manager during resizing
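A hedged sketch of the auto-tuner's control loop; `measure_throughput`, `add_learner`, and `remove_learner` are hypothetical stand-ins for the task manager's bookkeeping and the resizing steps described above:

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      max_learners=16, min_gain=0.05):
    """Keep adding one learner per GPU while measured throughput improves."""
    learners = 1
    best = measure_throughput()          # e.g. images/sec with one learner/GPU
    while learners < max_learners:
        add_learner()                    # allocate replica + learner stream,
        learners += 1                    # behind a temporary execution barrier
        current = measure_throughput()
        if current < best * (1 + min_gain):
            remove_learner()             # no meaningful gain: roll back and stop
            learners -= 1
            break
        best = current
    return learners
```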
Memory Management (CROSSBOW Implementation)
● CROSSBOW uses double buffering to create a pipeline between the data pre-processors and the task scheduler
● An offline memory plan reuses the output buffers of operators via reference counting, reducing a learner's memory footprint by up to 50%
● With multiple learners per GPU, an online memory plan lets learners on the same GPU share some output buffers, avoiding over-allocation of memory
Scalability Results
Statistical Efficiency VS Hardware Efficiency
Selecting number of learners
SMA vs other
Synchronization efficiency
● Pros:
- Introduces an alternative synchronization strategy (SMA) that allows training with small batch sizes while still achieving good hardware efficiency
- The system design provides efficient concurrent execution of learning and synchronisation tasks on GPUs
● Cons:
- Lacks automatic differentiation and other more advanced user primitives compared to TensorFlow
- Only evaluated on a single multi-GPU server; distributing CROSSBOW across a cluster would face additional challenges, such as heterogeneous resources
Imperative Programming:
- NumPy
- MATLAB
Declarative Programming:
- Caffe
- TensorFlow (sort of... a mixture of both)
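The contrast in one toy example, using NumPy for the imperative half and MXNet's classic Symbol/bind API (a hedged sketch) for the declarative half:

```python
import numpy as np
import mxnet as mx

# Imperative: every statement executes immediately.
a = np.ones(10)
b = np.ones(10) * 2
c = b * a + 1          # computed right here, value available now

# Declarative: statements only build a graph; nothing runs until we bind and
# evaluate it, which lets the backend optimize the whole expression first.
A = mx.sym.Variable('A')
B = mx.sym.Variable('B')
C = B * A + 1          # just a symbolic expression so far
executor = C.bind(mx.cpu(), {'A': mx.nd.ones(10), 'B': mx.nd.ones(10) * 2})
print(executor.forward()[0].asnumpy())   # evaluation happens here
```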
MXNet (mix-net) Programming Interface:
- Symbol: used to build the compute graph (compositions of symbols range from simple operators to complete layers such as convolutions).
- Supports auto-diff, in addition to load, save, visualize, etc.
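A hedged sketch of composing a small network with the classic MXNet 1.x Symbol API:

```python
import mxnet as mx

data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
act1 = mx.sym.Activation(data=fc1, name='relu1', act_type='relu')
fc2  = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden=10)
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

print(net.list_arguments())   # learnable weights plus 'data' and 'softmax_label'
net.save('net.json')          # symbols serialize to/from JSON
mx.viz.plot_network(net)      # compute-graph visualization (needs graphviz)
```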
MXNet (mix-net) Programming Interface:
- NDArray: imperative tensor computations that work seamlessly with Symbol.
- Fills the gap between declarative symbolic expressions and the host language.
- Complex symbolic expressions can still be evaluated efficiently because NDArray operations are also executed lazily, so the engine can schedule and overlap both kinds of work.
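A small example of this lazy (asynchronous) execution: NDArray operations return immediately and are queued to MXNet's dependency engine, which only blocks when a result is actually read back:

```python
import mxnet as mx

a = mx.nd.ones((1000, 1000))
b = mx.nd.ones((1000, 1000))
c = mx.nd.dot(a, b)        # returns immediately; work is queued in the engine
d = c * 2 + 1              # also queued; the engine tracks the dependency on c

print(d[0, 0].asscalar())  # asscalar()/asnumpy() block until the value is ready
mx.nd.waitall()            # or explicitly wait for all pending operations
```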
MXNet (mix-net) Programming Interface:
- KVStore: a distributed key-value store for data synchronization over multiple nodes.
- A weight-updating function is registered with the KVStore.
- Each worker pushes its locally computed gradients and repeatedly pulls the newest weights from the store.
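A hedged sketch of the KVStore pattern with a 'local' store (switching to 'dist_sync' or 'dist_async' distributes the same code across nodes); the key and learning rate are arbitrary:

```python
import mxnet as mx

kv = mx.kv.create('local')                 # 'dist_sync'/'dist_async' for clusters
kv.init(3, mx.nd.zeros((2, 3)))            # key 3 -> shared weight tensor

def sgd_update(key, grad, weight):         # runs where the weights are stored
    weight -= 0.01 * grad
kv.set_updater(sgd_update)

kv.push(3, mx.nd.ones((2, 3)))             # worker: send locally computed gradient
out = mx.nd.zeros((2, 3))
kv.pull(3, out=out)                        # worker: fetch the newest weights
print(out.asnumpy())                       # -0.01 everywhere after one update
```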
MXNet Implementation:
- Graph computation: several optimizations are straightforward, e.g. at inference time only the forward pass is needed, feature extraction can simply skip the last layers, and multiple operators can be grouped (fused) into one.
- Memory allocation: the simple idea is to reuse memory for variables whose lifetimes do not intersect; to reduce the complexity of finding such an allocation, a heuristic is used (see the sketch below).
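A toy Python sketch of that reuse idea (our illustration, not MXNet's actual allocator): walk the graph in execution order, reference-count each intermediate output, and recycle its buffer as soon as its last consumer has run:

```python
def plan_memory(graph):
    """graph: list of (node, inputs) in execution order.
    Returns node -> buffer id, recycling the buffer of any intermediate
    result once all of its consumers have executed."""
    refcount = {node: 0 for node, _ in graph}
    for _, inputs in graph:
        for i in inputs:
            refcount[i] += 1

    free, plan, fresh = [], {}, 0
    for node, inputs in graph:
        if free:                         # reuse a buffer whose data is dead
            plan[node] = free.pop()
        else:                            # otherwise allocate a new one
            plan[node] = fresh
            fresh += 1
        for i in inputs:
            refcount[i] -= 1
            if refcount[i] == 0:         # last consumer of i just ran
                free.append(plan[i])
    return plan

# A chain a -> b -> c -> d needs only two buffers instead of four.
print(plan_memory([('a', []), ('b', ['a']), ('c', ['b']), ('d', ['c'])]))
```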
Discussions