Distributed Training II
Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria
Context: Distributed Training
● Models are getting too big to fit on just one GPU
  ○ Turing-NLG (Microsoft) has 17 billion parameters
● Because model training is iterative, communication between nodes becomes a bottleneck
● Distributed training can be split broadly into two types:
  ○ Data parallel
  ○ Model parallel
● Even these approaches result in far-from-optimal parallelization performance
Model Size Through the Years (Source: Microsoft)
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian et al.
University of Rochester, ETH Zurich, UC Davis, IBM, Tencent
Context
● Decentralized algorithms are treated as a compromise: decentralized communication is something we resort to and pay a price for
● Current analysis: decentralized PSGD offers no performance advantage over centralized PSGD, assuming a decentralized network topology
● Popular ML systems (TensorFlow, PyTorch, etc.) are built to support centralized execution
Lee, G. M. "A survey on trust computation in the internet of things." Information and Communications Magazine 33.2 (2016): 10-27.
Parallel Stochastic Gradient Descent
● Centralized network topology, e.g. the parameter-server model
● Communication bottleneck at the central node(s); performance degrades as network latency increases
● O(1/√(nT)) convergence rate with n workers and T iterations
● O(n) communication overhead at the central node
Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.
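To make the centralized communication pattern concrete, here is a minimal NumPy sketch of parameter-server-style PSGD on a toy least-squares problem; the worker count, problem, and step size are illustrative assumptions, not the setup from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: each "worker" holds its own shard of (A, b).
n_workers, dim, shard = 8, 10, 64
A = [rng.normal(size=(shard, dim)) for _ in range(n_workers)]
b = [rng.normal(size=shard) for _ in range(n_workers)]

def grad(i, x):
    """Stochastic gradient of 0.5*||A_i x - b_i||^2 on one random local sample."""
    j = rng.integers(shard)
    a, y = A[i][j], b[i][j]
    return (a @ x - y) * a

x = np.zeros(dim)          # parameter vector held by the central server
lr = 0.05
for t in range(500):
    # Every worker sends its gradient to the server: O(n) traffic at the hub.
    g = np.mean([grad(i, x) for i in range(n_workers)], axis=0)
    x -= lr * g            # server updates and broadcasts the new parameters
```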
Parallel Stochastic Gradient Descent
Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in Neural Information Processing Systems. 2010.
Decentralized PSGD
● Requires either that all nodes can access a shared database (identical local data distributions) or that the variance between the nodes' local datasets is bounded (data-parallel approach)
● Implies the same asymptotic convergence rate as C-PSGD (linear speedup)
● Communication topology represented by an undirected graph with a doubly stochastic weight matrix W
Decentralized PSGD
Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.
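A minimal sketch of the D-PSGD update on a ring, using a toy least-squares setup like the one above: each node averages its neighbors' parameters with a doubly stochastic W and then takes a local stochastic gradient step. The topology, weights, and step size are illustrative choices, not the paper's exact experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 8, 10                       # nodes in a ring, parameter dimension

# Doubly stochastic mixing matrix for a ring: average self + two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0

# Each node holds its own local data shard (toy least-squares).
A = rng.normal(size=(n, 64, dim))
b = rng.normal(size=(n, 64))

X = np.zeros((n, dim))               # one parameter copy per node
lr = 0.05
for t in range(500):
    # Local stochastic gradients: one random sample per node.
    j = rng.integers(64, size=n)
    G = np.stack([(A[i, j[i]] @ X[i] - b[i, j[i]]) * A[i, j[i]] for i in range(n)])
    # D-PSGD step: neighborhood averaging, then a local gradient step.
    X = W @ X - lr * G

x_avg = X.mean(axis=0)               # local models move toward a common solution
```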
Analysis: D-PSGD
Convergence rate for Decentralized PSGD (up to constants):
  (1/T) Σ_{t=1..T} E‖∇f(x̄_t)‖² ≲ 1/T + σ/√(nT)
* Assuming learning rate γ ≈ 1/(2L + σ√(T/n)) and a sufficiently large number of iterations T
Further assumptions:
● Lipschitz continuous gradients (bounded curvature / Hessian spectral radius)
● Weight matrix W has a bounded spectral gap
● Bounded variance w.r.t. the local data sample
● Globally bounded variance across nodes
Consequences:
● O(1/√(nT)) convergence rate, the same as C-PSGD
● The cost of reaching an approximate solution is shared between the n nodes ⇒ linear speedup
● O(deg(G)) communication overhead per node, which is low for sparse, i.e. very decentralized, networks
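For reference, a LaTeX rendering of the D-PSGD bound as reconstructed from Lian et al.'s theorem, stated only up to constant factors (the exact constants in the paper are omitted here).

```latex
% D-PSGD convergence (up to constants), reconstructed from Lian et al. 2017.
% n = number of nodes, T = iterations, sigma^2 = local sampling variance,
% L = gradient Lipschitz constant, f^* = optimal objective value.
\[
  \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\,\bigl\|\nabla f(\bar{x}_t)\bigr\|^2
  \;\lesssim\; \frac{L\bigl(f(x_0)-f^{\ast}\bigr)}{T} \;+\; \frac{\sigma}{\sqrt{nT}},
  \qquad \text{with } \gamma \approx \frac{1}{2L + \sigma\sqrt{T/n}}
  \text{ and } T \text{ sufficiently large.}
\]
```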
Analysis: Ring network
● Shared database (ζ = 0): same convergence rate as C-PSGD, provided the number of nodes n grows slowly enough with T
● Partitioned data (bounded ζ): same convergence rate, under a stricter bound on n relative to T
Proof:
Intuition
● The graph G has a Laplacian with spectrum 0 = λ₁ ≤ λ₂ ≤ … ≤ λ_n
● The weight matrix W is doubly stochastic, with a spectrum determined by the graph's connectivity
● By the Perron-Frobenius Theorem, the largest eigenvalue of W is 1 and, for a connected network with positive self-weights, all other eigenvalues are strictly smaller in magnitude
● The spectral gap of W therefore controls how quickly the nodes reach consensus
Some facts from spectral graph theory
Intuition
Recall:
● The second-smallest Laplacian eigenvalue ~ "What's the worst possible bottleneck between two clusters?"
● The spread of the spectrum ~ "How uniformly are edges distributed between nodes?"
● The spectral gap ~ "How random is the network?"
Intuition
The number of iterations T needed for D-PSGD to match C-PSGD depends on the spectrum of W:
● Degenerate spectrum ⇒ densely connected network ⇒ small T suffices
● Broad spectrum ⇒ weakly connected network ⇒ large T required
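This spectral intuition can be checked numerically. The sketch below builds doubly stochastic W matrices for a ring and for a complete graph and compares their spectral gaps; the graph size and weights are illustrative.

```python
import numpy as np

def spectral_gap(W):
    """1 minus the second-largest eigenvalue magnitude of a doubly stochastic W."""
    eig = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eig[1]

n = 16

# Ring: each node averages itself and its two neighbors (weakly connected).
ring = np.zeros((n, n))
for i in range(n):
    ring[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0

# Complete graph: uniform averaging over all nodes (densely connected).
complete = np.full((n, n), 1.0 / n)

print("ring gap:    ", spectral_gap(ring))      # small gap -> slow consensus
print("complete gap:", spectral_gap(complete))  # gap = 1   -> one-step consensus
```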
Evaluation: Image processing
Evaluation: EA(M)-SGD
Evaluation: NLP
Beyond Data and Model Parallelism for Deep Neural Networks
Zhihao Jia, Matei Zaharia, Alex Aiken
Stanford University
Motivation
● Data and model parallelization have become the go-to choices for distributed training
● These limited options result in suboptimal parallelization performance
● A more comprehensive parallelization search space may lead to better parallelization strategies
Proposed Solution: SOAP
● Sample: the standard data-parallel dimension from previous work (partition the training samples)
● Operation: the different operations performed in a DNN (e.g., MatMul, Convolution)
● Attribute: attributes of a particular variable (e.g., length of a sequence, number of channels)
● Parameter: the standard model-parallel dimension from previous work (partition the model parameters)
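As a rough illustration (not FlexFlow's actual API), a parallelization strategy can be thought of as a per-operation choice of how many ways to split each SOAP dimension and which devices to use. The representation below is hypothetical, for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParallelConfig:
    """How one operation is parallelized along the SOAP dimensions.

    Hypothetical representation for illustration; FlexFlow's real data
    structures differ.
    """
    sample_degree: int = 1      # split the batch (data parallelism)
    operation: str = "MatMul"   # which operator this config applies to
    attribute_degree: int = 1   # split an attribute, e.g. sequence length
    parameter_degree: int = 1   # split the weights (model parallelism)
    devices: List[int] = field(default_factory=list)

    def num_tasks(self) -> int:
        # Total parallel tasks implied by this configuration.
        return self.sample_degree * self.attribute_degree * self.parameter_degree

# Example: split a MatMul 4 ways over samples and 2 ways over parameters.
cfg = ParallelConfig(sample_degree=4, parameter_degree=2, devices=list(range(8)))
assert cfg.num_tasks() == len(cfg.devices)
```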
Proposed Solution: Execution Simulator
● Allows FlexFlow to search a much broader space without actually executing candidate parallelization strategies
● Assumes an operation O on a device D takes a constant, predictable amount of time
● Takes a device topology D, an operator graph G, and a parallelization strategy S, and predicts the runtime
● Uses MCMC sampling to iteratively propose new strategies S* within the allocated time budget
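A hedged sketch of the MCMC-style search loop, with a stand-in cost function in place of FlexFlow's execution simulator; the proposal move, cost model, and temperature are illustrative assumptions, not FlexFlow's actual implementation.

```python
import math
import random

random.seed(0)

N_OPS, N_DEVICES = 12, 4

def simulated_cost(strategy):
    """Stand-in for the execution simulator: predicts runtime of a strategy.

    Here a strategy is just a device assignment per operation; the real
    simulator uses measured per-op times plus communication costs.
    """
    load = [0.0] * N_DEVICES
    for op, dev in enumerate(strategy):
        load[dev] += 1.0 + 0.1 * op          # fake per-op compute time
    comm = sum(strategy[i] != strategy[i - 1] for i in range(1, N_OPS))
    return max(load) + 0.2 * comm            # bottleneck device + transfers

def mcmc_search(budget=5000, temperature=0.5):
    current = [random.randrange(N_DEVICES) for _ in range(N_OPS)]
    best, best_cost = current[:], simulated_cost(current)
    cur_cost = best_cost
    for _ in range(budget):
        proposal = current[:]                # move one op to a different device
        proposal[random.randrange(N_OPS)] = random.randrange(N_DEVICES)
        cost = simulated_cost(proposal)
        # Metropolis acceptance: always take improvements, sometimes regressions.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / temperature):
            current, cur_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = proposal[:], cost
    return best, best_cost

strategy, cost = mcmc_search()
print("predicted runtime:", round(cost, 2))
```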
Results - Per-iteration Throughput
Results - Communication Overhead
Results - Novelty
PipeDream: Generalized Pipeline Parallelism for DNN Training
Deepak Narayanan et al.
Microsoft, CMU, Stanford
Intra-Batch Parallelism
● Data parallelism
  ○ Communication between workers is a bottleneck
Intra-Batch Parallelism
● Model parallelism
  ○ Under-utilized resources (only one stage computes at a time)
  ○ Partitioning the model is left to the programmer
Inter-Batch Parallelism: GPipe (Huang et al.)
● Uses pipelining in the context of model-parallel training for very large models
● Does not specify a partitioning algorithm
● Splits a minibatch into m microbatches
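A trivial sketch of the microbatch split that GPipe-style pipelining relies on; the batch size and m are arbitrary example values.

```python
import numpy as np

minibatch = np.arange(32 * 4).reshape(32, 4)   # 32 samples, 4 features
m = 8                                          # number of microbatches
microbatches = np.array_split(minibatch, m)    # 8 chunks of 4 samples each
assert sum(len(mb) for mb in microbatches) == len(minibatch)
```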
Introducing Pipeline Parallelism
● Combination of inter-batch and intra-batch parallelism
● Model layers are mapped to stages; each stage consists of consecutive layers and is mapped to a separate GPU
Introducing Pipeline Parallelism
● Multiple minibatches are injected into the pipeline concurrently to take advantage of all machines
Three Challenges
● Work partitioning
  ○ How to partition the DNN model into stages?
● Work scheduling
  ○ How does scheduling work in this bi-directional pipeline?
● Effective learning
  ○ How to use correct and up-to-date weights for faster learning?
Challenge 1: Work Partitioning
● Goals:
  ○ Each stage performs roughly the same amount of computation
  ○ Inter-stage communication is minimized
● Profiling:
  ○ Computation time (forward/backward) per layer
  ○ Size of layer outputs
● Partitioning algorithm outputs (see the simplified sketch below):
  ○ A partitioning of layers into stages
  ○ A replication factor (number of workers) for each stage
  ○ The optimal number of in-flight minibatches to keep workers busy
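A simplified sketch of the partitioning idea, assuming only profiled per-layer compute times and ignoring replication and communication costs; PipeDream's actual partitioner is a dynamic program that also accounts for both.

```python
from functools import lru_cache
from typing import List, Tuple

def partition_layers(times: List[float], n_stages: int) -> Tuple[float, List[int]]:
    """Split consecutive layers into n_stages, minimizing the slowest stage.

    Simplified stand-in for PipeDream's partitioner: balances profiled
    per-layer compute times only.
    """
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> Tuple[float, Tuple[int, ...]]:
        # Best partition of layers[i:] into k stages: (bottleneck, cut points).
        if k == 1:
            return prefix[len(times)] - prefix[i], ()
        answer = (float("inf"), ())
        for j in range(i + 1, len(times) - k + 2):   # first stage = layers[i:j]
            stage = prefix[j] - prefix[i]
            rest, cuts = best(j, k - 1)
            answer = min(answer, (max(stage, rest), (j,) + cuts))
        return answer

    bottleneck, cuts = best(0, n_stages)
    return bottleneck, list(cuts)

# Example: 8 profiled layer times split across 3 GPUs/stages.
times = [2.0, 1.0, 3.0, 2.0, 2.0, 4.0, 1.0, 1.0]
print(partition_layers(times, 3))   # bottleneck time and stage boundaries
```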
Challenge 2: Work Scheduling
● Bidirectional pipeline
  ○ Each active minibatch in the pipeline may be in a different stage
● Alternate between one forward and one backward pass (1F1B); see the sketch below
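A toy sketch of the per-stage work order under 1F1B, assuming equal-length, non-replicated stages; it shows the warm-up forwards followed by strict alternation, and does not model cross-stage timing or weight versioning.

```python
def one_f_one_b(n_stages: int, n_minibatches: int):
    """Print the order of forward (F) and backward (B) work at each stage.

    Toy illustration of 1F1B scheduling: stage s runs (n_stages - s) warm-up
    forward passes, then strictly alternates one forward and one backward.
    """
    for stage in range(n_stages):
        warmup = n_stages - stage
        order, fwd, bwd = [], 0, 0
        while bwd < n_minibatches:
            if fwd < min(bwd + warmup, n_minibatches):
                order.append(f"F{fwd}")
                fwd += 1
            else:
                order.append(f"B{bwd}")
                bwd += 1
        print(f"stage {stage}: " + " ".join(order))

one_f_one_b(n_stages=4, n_minibatches=6)
```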