Distributed Training Khoa Le & Somin Wadhwa
Background
The Problem?
Go-to Solution: Distribute Machine Learning Applications Across Multiple Processors/Nodes
Traditional Machine Learning Task
Distributed Machine Learning Task
Obvious solution:
- Utilize TensorFlow's native support for distributed training.
Issues:
- New concepts... but not a lot of documentation!
- Simply didn't scale well. :(
Data Parallel Approach:
Updating gradients: Parameter Server Approach
- What is the right ratio of workers to parameter servers?
- Increased complexity means more and more information has to be passed around between workers and parameter servers...
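A minimal single-process NumPy sketch of the synchronous parameter-server pattern on a toy linear-regression task (the gradient function and the data shards are illustrative, not taken from any of the papers):

```python
import numpy as np

def grad_linreg(w, X, y):
    """Gradient of the mean-squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parameter_server_step(w, shards, lr=0.01):
    """One synchronous step: each worker computes a gradient on its own data
    shard and 'pushes' it; the server averages the gradients and updates the
    weights, which every worker then 'pulls' before the next step."""
    grads = [grad_linreg(w, X, y) for X, y in shards]   # worker side
    return w - lr * np.mean(grads, axis=0)              # server side

# Toy usage: 4 workers, each holding its own shard of the data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 3))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))

w = np.zeros(3)
for _ in range(500):
    w = parameter_server_step(w, shards)
print(w)   # close to true_w
```

All gradient and weight traffic funnels through the parameter server(s), which is why the worker-to-server ratio and the sheer communication volume quickly become the bottleneck.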
Better way to update gradients: Ring-AllReduce
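To make the idea concrete, below is a hedged single-process NumPy simulation of ring-allreduce (our sketch, not Baidu's or NCCL's actual code): each gradient is split into N chunks, a scatter-reduce pass sums chunks as they travel around the ring, and an all-gather pass distributes the finished sums, so each link carries only ~1/N of the data per step.

```python
import numpy as np

def ring_allreduce(tensors):
    """Single-process simulation of ring all-reduce over N 'workers'."""
    n = len(tensors)
    chunks = [list(np.array_split(np.asarray(t, dtype=float), n)) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                       # chunk worker i sends
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. Fully reduced chunks travel once more around the
    # ring, overwriting the stale copies on every worker.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                   # fully reduced chunk
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

# Every worker ends up with the same summed gradient.
grads = [np.arange(8.0) * (w + 1) for w in range(4)]
print(ring_allreduce(grads)[0])   # == sum of all four gradients
```

Bandwidth use is balanced: every worker sends and receives the same amount in every step, so no single node becomes a communication hotspot the way a parameter server can.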
Horovod:
- Created a standalone package based on Baidu's ring-allreduce implementation, fully integrated with TensorFlow.
- Replaced the original ring-allreduce implementation with NVIDIA's NCCL library (NCCL 2 supports ring-allreduce across multiple machines).
- Added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU. (How?)
- API improvements.
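For reference, a hedged sketch of the typical Horovod usage pattern (shown here with the modern Keras binding; the model and data are toy placeholders, and training would be launched with something like `horovodrun -np 4 python train.py`):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                        # one process per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                          # pin this process to its GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count
model.compile(
    optimizer=hvd.DistributedOptimizer(opt),      # gradients averaged via allreduce
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Toy data; in real use each worker would read its own shard of the dataset.
x = np.random.rand(1024, 32).astype('float32')
y = np.random.randint(0, 10, size=1024)
model.fit(x, y, batch_size=32, epochs=1,
          # rank 0 broadcasts initial weights so all replicas start identically
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
```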
Benchmarking:
Motivation for CROSSBOW
● To reduce training time, systems exploit data parallelism across many GPUs.
● How: parallel synchronous SGD.

Motivation for CROSSBOW
● To utilise many GPUs effectively, a large batch size is needed.
● Why: the communication overhead of moving data to and from each GPU dominates if the batch size is too small.

Motivation for CROSSBOW
● However, large batch sizes reduce statistical efficiency.
● Why: small batches make training converge faster (in terms of epochs) and are more likely to find solutions with better accuracy.

Motivation for CROSSBOW
● Typical solutions: dynamically adjusting the batch size, or other hyper-parameters such as learning rate and momentum.
● These do not always work well, because the methodology is time-consuming and model-specific.
● Need a DL system that trains effectively with small batch sizes (2-32) while still scaling to many GPUs.
=> CROSSBOW: a single-server, multi-GPU DL system that improves statistical efficiency as the number of GPUs increases, irrespective of the batch size.
Key contributions of CROSSBOW ● Synchronous model averaging ● Auto-tuning the number of learners ● Concurrent task engine
SMA with Learners (Important Concepts)
● Learner: an entity that trains a single model replica independently with a given batch size.
● In contrast to parallel S-SGD, the model replicas of learners evolve independently because they are not reset to a single global model after each batch.
SMA with Learners (Important Concepts)
Parallel synchronous SGD vs. SMA with learners
SMA with Learners (Important Concepts)
● Synchronize the learners' local models by maintaining a central average model.
● Prevent learners from diverging by applying a correction to each learner's model.
● Use momentum to make the central model converge faster than the learners.
SMA with Learners (Algorithm)
● Input/output
● Initialization
● Iterative learning process
● Update to the central average model
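A minimal NumPy sketch of the SMA iteration as we read it from the paper, on a toy quadratic loss; the hyper-parameter names (lr, alpha, mu) and the toy objective are ours, not CROSSBOW's exact notation:

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([3.0, -2.0, 1.0])

def grad(x):
    """Gradient of the toy loss f(x) = 0.5 * ||x - target||^2."""
    return x - target

k = 4                                          # number of learners
learners = [rng.normal(size=3) for _ in range(k)]
z = np.mean(learners, axis=0)                  # central average model
v = np.zeros_like(z)                           # momentum on the central model
lr, alpha, mu = 0.1, 0.1, 0.9

for _ in range(200):
    corrections = []
    for i in range(k):
        c = alpha * (learners[i] - z)          # pull learner towards the centre
        learners[i] -= lr * grad(learners[i]) + c
        corrections.append(c)
    v = mu * v + np.sum(corrections, axis=0)   # momentum accelerates z
    z += v

print(z)   # ~= target: learners stay independent yet converge together
```

Note the contrast with parallel S-SGD: the learners are never overwritten with the central model, they are only nudged towards it by the correction term.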
SMA with Learners (Expansion)
● To achieve high hardware utilisation, we can execute multiple learners per GPU.
● Local reference model vs. central average model.
CROSSBOW System Design
● Must share each GPU efficiently
● Decide on the number of learners per GPU at runtime
● Global SMA synchronization needs to be efficient
SMA with Learners (Crossbow Implementation)
● Data pre-processors: prepare the training dataset as batches
● Task manager: controls the pools of model replicas, input batches, and learner streams
● Task scheduler: assigns learning tasks to GPUs based on the available resources
● Auto-tuner: adjusts the number of learners per GPU at runtime
SMA with Learners (Crossbow Implementation)
SMA with Learners (Crossbow Implementation)
● Learner streams: each holds a learning task and a corresponding local synchronisation task
● Synchronisation streams: each holds a global synchronisation task
● Overlap the synchronisation tasks of one iteration with the learning tasks of the next
SMA with Learners (Crossbow Implementation)
● Concurrency!
● Use all-reduce (hello Horovod!) for inter-GPU operations: it evenly distributes the computation of the average-model update among the GPUs.
● Schedule new learning tasks on a first-come-first-served basis
Choosing the number of learners
● Too few, and a GPU is under-utilised, wasting resources
● Too many, and the execution of otherwise independent learners is partially sequentialised on a GPU, leading to a slow-down
=> Tune the number of learners per GPU based on the training throughput at runtime
Tuning Learners (CROSSBOW Implementation)
● Auto-tuner: measures the training throughput as the rate at which learning tasks complete, as recorded by the task manager
● On a server with homogeneous GPUs, only a single GPU's throughput needs to be measured to adapt the number of learners for all GPUs
Tuning Learners (CROSSBOW Implementation)
● Adding a learner to a GPU requires allocating a new model replica and a corresponding learner stream
● The auto-tuner temporarily places a global execution barrier between two successive iterations
● It also locks the resource pools, preventing access by the task scheduler or task manager during resizing
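A hedged sketch of the auto-tuner's control loop; `measure_throughput`, `add_learner`, and `remove_learner` are hypothetical stand-ins for the task manager's bookkeeping and the resizing steps described above:

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      max_learners=16, min_gain=0.05):
    """Keep adding one learner per GPU while measured throughput improves."""
    learners = 1
    best = measure_throughput()          # e.g. images/sec with one learner/GPU
    while learners < max_learners:
        add_learner()                    # allocate replica + learner stream,
        learners += 1                    # behind a temporary execution barrier
        current = measure_throughput()
        if current < best * (1 + min_gain):
            remove_learner()             # no meaningful gain: roll back and stop
            learners -= 1
            break
        best = current
    return learners
```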
Memory Management (CROSSBOW Implementation)
● CROSSBOW uses double buffering to create a pipeline between the data pre-processors and the task scheduler
● An offline memory plan reuses the output buffers of operators via reference counting, reducing a learner's memory footprint by up to 50%
● With multiple learners per GPU, an online memory plan lets learners on the same GPU share some output buffers, avoiding over-allocation of memory
Scalability Results
Statistical Efficiency VS Hardware Efficiency
Selecting number of learners
SMA vs other
Synchronization efficiency
● Pros:
- Introduces an alternative synchronization strategy (SMA) that allows training with small batch sizes while still achieving good hardware efficiency
- The system design provides efficient concurrent execution of learning and synchronisation tasks on GPUs
● Cons:
- Lacks automatic differentiation and other more advanced user primitives compared to TensorFlow
- Only evaluated on a single multi-GPU server; distributing CROSSBOW across a cluster would face additional challenges, such as heterogeneous resources
Imperative Programming:
- NumPy
- MATLAB
Declarative Programming:
- Caffe
- TensorFlow (sort of... a mixture of both)
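The contrast in one toy example, using NumPy for the imperative half and MXNet's classic Symbol/bind API (a hedged sketch) for the declarative half:

```python
import numpy as np
import mxnet as mx

# Imperative: every statement executes immediately.
a = np.ones(10)
b = np.ones(10) * 2
c = b * a + 1          # computed right here, value available now

# Declarative: statements only build a graph; nothing runs until we bind and
# evaluate it, which lets the backend optimize the whole expression first.
A = mx.sym.Variable('A')
B = mx.sym.Variable('B')
C = B * A + 1          # just a symbolic expression so far
executor = C.bind(mx.cpu(), {'A': mx.nd.ones(10), 'B': mx.nd.ones(10) * 2})
print(executor.forward()[0].asnumpy())   # evaluation happens here
```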
MXNet (mix-net) Programming Interface:
- Symbol: used to build the compute graph (compositions of symbols range from simple operators to complete layers such as convolutions).
- Supports auto-diff, in addition to load, save, visualize, etc.
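A hedged sketch of composing a small network with the classic MXNet 1.x Symbol API:

```python
import mxnet as mx

data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
act1 = mx.sym.Activation(data=fc1, name='relu1', act_type='relu')
fc2  = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden=10)
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

print(net.list_arguments())   # learnable weights plus 'data' and 'softmax_label'
net.save('net.json')          # symbols serialize to/from JSON
mx.viz.plot_network(net)      # compute-graph visualization (needs graphviz)
```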
MXNet (mix-net) Programming Interface:
- NDArray: imperative tensor computations that work seamlessly with Symbol.
- Fills the gap between declarative symbolic expressions and the host language.
- Complex symbolic expressions can still be evaluated efficiently because NDArray operations are also executed lazily, so the engine can schedule and overlap both kinds of work.
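A small example of this lazy (asynchronous) execution: NDArray operations return immediately and are queued to MXNet's dependency engine, which only blocks when a result is actually read back:

```python
import mxnet as mx

a = mx.nd.ones((1000, 1000))
b = mx.nd.ones((1000, 1000))
c = mx.nd.dot(a, b)        # returns immediately; work is queued in the engine
d = c * 2 + 1              # also queued; the engine tracks the dependency on c

print(d[0, 0].asscalar())  # asscalar()/asnumpy() block until the value is ready
mx.nd.waitall()            # or explicitly wait for all pending operations
```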
MXNet (mix-net) Programming Interface:
- KVStore: a distributed key-value store for data synchronization over multiple nodes.
- A weight-updating function is registered with the KVStore.
- Each worker pushes its locally computed gradients and repeatedly pulls the newest weights from the store.
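A hedged sketch of the KVStore pattern with a 'local' store (switching to 'dist_sync' or 'dist_async' distributes the same code across nodes); the key and learning rate are arbitrary:

```python
import mxnet as mx

kv = mx.kv.create('local')                 # 'dist_sync'/'dist_async' for clusters
kv.init(3, mx.nd.zeros((2, 3)))            # key 3 -> shared weight tensor

def sgd_update(key, grad, weight):         # runs where the weights are stored
    weight -= 0.01 * grad
kv.set_updater(sgd_update)

kv.push(3, mx.nd.ones((2, 3)))             # worker: send locally computed gradient
out = mx.nd.zeros((2, 3))
kv.pull(3, out=out)                        # worker: fetch the newest weights
print(out.asnumpy())                       # -0.01 everywhere after one update
```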
MXNet Implementation:
- Graph computation: several optimizations are straightforward, e.g. at inference time only the forward pass is needed, feature extraction can simply skip the last layers, and multiple operators can be grouped (fused) into one.
- Memory allocation: the simple idea is to reuse memory for variables whose lifetimes do not intersect; to reduce the complexity of finding such an allocation, a heuristic is used (see the sketch below).
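A toy Python sketch of that reuse idea (our illustration, not MXNet's actual allocator): walk the graph in execution order, reference-count each intermediate output, and recycle its buffer as soon as its last consumer has run:

```python
def plan_memory(graph):
    """graph: list of (node, inputs) in execution order.
    Returns node -> buffer id, recycling the buffer of any intermediate
    result once all of its consumers have executed."""
    refcount = {node: 0 for node, _ in graph}
    for _, inputs in graph:
        for i in inputs:
            refcount[i] += 1

    free, plan, fresh = [], {}, 0
    for node, inputs in graph:
        if free:                         # reuse a buffer whose data is dead
            plan[node] = free.pop()
        else:                            # otherwise allocate a new one
            plan[node] = fresh
            fresh += 1
        for i in inputs:
            refcount[i] -= 1
            if refcount[i] == 0:         # last consumer of i just ran
                free.append(plan[i])
    return plan

# A chain a -> b -> c -> d needs only two buffers instead of four.
print(plan_memory([('a', []), ('b', ['a']), ('c', ['b']), ('d', ['c'])]))
```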
Discussions