  1. Distributed Training II Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria

  2. Context: Distributed Training ● Models are getting too big to fit on just one GPU ○ Turing-NLG (Microsoft) has 17 billion parameters ● As model training is iterative, communication between different nodes becomes a bottleneck ● Distributed Training can be split broadly into two different types: ○ Data Parallel ○ Model Parallel ● Even these approaches result in far from optimal parallelization performance

  3. Model Size through the years Source: Microsoft

  4. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent Xiangru Lian et al. University of Rochester, ETH Zurich, UC Davis, IBM, Tencent

  5. Context ● Decentralized algorithms are treated as a compromise: decentralized communication is something we resort to and pay a price for ● Existing analyses suggest decentralized PSGD offers no performance advantage over centralized PSGD; a decentralized network topology is assumed as a constraint ● Popular ML systems (TensorFlow, PyTorch, etc.) are built to support centralized execution Lee, G. M. "A survey on trust computation in the internet of things." Information and Communications Magazine 33.2 (2016): 10-27.

  6. Parallel Stochastic Gradient Descent ● Centralized network topology, e.g. the parameter-server model ● Communication bottleneck at the central node(s); performance decreases with increasing network latency ● Convergence rate $O(1/\sqrt{nK})$ for $n$ workers after $K$ iterations ● Communication overhead $O(n)$ on the central node Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.
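
A minimal NumPy sketch of the centralized (parameter-server) update, assuming a generic stochastic-gradient callable `grad(w, sample)` and one batch stream per worker (both hypothetical): every iteration, each of the n workers sends its gradient to the central server, which averages and applies them, so all traffic funnels through one node.

```python
import numpy as np

def centralized_psgd(grad, worker_batches, w0, lr=0.1, iterations=100):
    """Parameter-server PSGD sketch: workers compute stochastic gradients,
    the central server averages them and updates the shared model."""
    w = w0.copy()
    n = len(worker_batches)  # number of workers
    for k in range(iterations):
        # each worker i computes a stochastic gradient on its own sample
        grads = [grad(w, worker_batches[i][k % len(worker_batches[i])])
                 for i in range(n)]
        # server aggregation: O(n) messages in and out of the central node
        w -= lr * np.mean(grads, axis=0)
    return w
```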

  7. Parallel Stochastic Gradient Descent Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in Neural Information Processing Systems. 2010.

  8. Decentralized PSGD ● Requires either $\varsigma = 0$ (all nodes access a shared database) or a bounded outer variance $\varsigma > 0$ across nodes, e.g. partitioned data (data-parallel approach) ● Implies the same convergence rate as C-PSGD asymptotically ● Communication topology represented by an undirected graph with doubly stochastic weight matrix $W$

  9. Decentralized PSGD Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.
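
A sketch of the D-PSGD update analyzed in Lian et al., assuming each worker i keeps a local copy of the model, a stochastic-gradient callable `grad(x, sample)`, and a doubly stochastic weight matrix `W` over the communication graph (the callable and data layout are illustrative): every step, each node averages its neighbours' iterates and takes a local gradient step.

```python
import numpy as np

def d_psgd(grad, local_data, W, x0, lr=0.1, iterations=100):
    """Decentralized PSGD sketch: x_i <- sum_j W_ij x_j - lr * grad_i(x_i)."""
    n = W.shape[0]
    X = np.tile(x0, (n, 1))  # one local model per worker (x0 is a 1-D vector)
    for k in range(iterations):
        G = np.stack([grad(X[i], local_data[i][k % len(local_data[i])])
                      for i in range(n)])
        # gossip averaging with neighbours (rows of W), then a local SGD step
        X = W @ X - lr * G
    return X.mean(axis=0)  # average of the local models
```

Only the `W @ X` step involves communication, and node i only needs the nonzero entries in row i of `W`, so its traffic scales with its degree rather than with n.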

  10. Analysis: D-PSGD Convergence rate for Decentralized PSGD: $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\|\nabla f(\bar{x}_k)\|^2 = O\!\left(\frac{1}{K} + \frac{\sigma}{\sqrt{nK}}\right)$ * Assuming learning rate $\gamma = O(\sqrt{n/K})$, number of iterations $K$ sufficiently large

  11. Analysis: D-PSGD Convergence rate for Decentralized PSGD: $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\|\nabla f(\bar{x}_k)\|^2 = O\!\left(\frac{1}{K} + \frac{\sigma}{\sqrt{nK}}\right)$ * Assuming learning rate $\gamma = O(\sqrt{n/K})$, number of iterations $K$ sufficiently large Further assumptions: ● Lipschitz continuous gradients (bounded curvature / Hessian spectral radius) ● Weight matrix $W$ has a bounded spectral gap ● Bounded variance $\sigma^2$ w.r.t. the local data sample ● Globally bounded variance $\varsigma^2$ across nodes

  12. Analysis: D-PSGD Convergence rate for Decentralized PSGD: $O(1/\sqrt{nK})$ convergence rate, same as C-PSGD

  13. Analysis: D-PSGD Convergence rate for Decentralized PSGD: $O(1/\sqrt{nK})$ convergence rate, same as C-PSGD; an $\epsilon$-approximate solution has complexity $O(1/\epsilon^2)$, shared between the $n$ nodes ⇒ linear speedup
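
A worked version of the speedup arithmetic, under the simplified rate above (constants and the low-order $1/K$ term dropped for small $\epsilon$):

```latex
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\|\nabla f(\bar{x}_k)\|^2
\;\lesssim\; \frac{\sigma}{\sqrt{nK}} \;\le\; \epsilon
\quad\Longrightarrow\quad
K \;=\; O\!\left(\frac{\sigma^2}{n\,\epsilon^2}\right),
\qquad
nK \;=\; O\!\left(\frac{\sigma^2}{\epsilon^2}\right).
```

The total number of stochastic gradients $nK$ is independent of $n$, so doubling the number of workers roughly halves the iterations each worker runs: linear speedup.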

  14. Analysis: D-PSGD Convergence rate for Decentralized PSGD: $O(1/\sqrt{nK})$ convergence rate, same as C-PSGD; an $\epsilon$-approximate solution has complexity $O(1/\epsilon^2)$, shared between the $n$ nodes ⇒ linear speedup Communication overhead on the busiest node is $O(\deg(\text{network}))$ rather than $O(n)$, i.e. $O(1)$ for sparse, very decentralized networks

  15. Analysis: Ring network

  16. Analysis: Ring network $\varsigma = 0$ (i.e. shared database): same convergence rate if sharing across $n = O(K^{1/3})$ nodes; $\varsigma > 0$ (partitioned data): same convergence rate if sharing across $n = O(K^{1/9})$ nodes
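
To see why the ring is the hard case, here is a small NumPy check (a sketch, assuming the common uniform-averaging choice $W_{ii} = W_{i,i\pm 1} = 1/3$ on a ring of n nodes): the second-largest eigenvalue magnitude of W approaches 1 as $n$ grows, and the spectral gap shrinks roughly like $1/n^2$, which is what pushes up the iteration counts above.

```python
import numpy as np

def ring_weight_matrix(n):
    """Doubly stochastic ring weights: each node averages itself and its two neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

for n in (8, 16, 32, 64):
    eigs = np.sort(np.abs(np.linalg.eigvalsh(ring_weight_matrix(n))))[::-1]
    rho = eigs[1]  # second-largest eigenvalue magnitude
    print(f"n={n:3d}  spectral gap 1 - rho = {1 - rho:.5f}")
```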

  17. Proof:

  18. Intuition ● Graph has Laplacian $L = D - A$ with spectrum $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$

  19. Intuition ● Graph has Laplacian $L = D - A$ with spectrum $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ ● Weight matrix $W$ has spectrum $1 = \mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge -1$

  20. Intuition ● Graph has Laplacian $L = D - A$ with spectrum $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ ● Weight matrix $W$ has spectrum $1 = \mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge -1$ ● By the Perron-Frobenius Theorem, $\mu_1 = 1$ is simple (eigenvector $\mathbf{1}$) and $|\mu_i| < 1$ for $i \ge 2$ on a connected, aperiodic graph

  21. Intuition ● Graph has Laplacian $L = D - A$ with spectrum $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ ● Weight matrix $W$ has spectrum $1 = \mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge -1$ ● By the Perron-Frobenius Theorem, $\mu_1 = 1$ is simple (eigenvector $\mathbf{1}$) and $|\mu_i| < 1$ for $i \ge 2$ on a connected, aperiodic graph ● The spectral gap $1 - \max(|\mu_2|, |\mu_n|)$ controls how fast gossip averaging mixes
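
For reference, the fact being invoked (stated here for a symmetric, doubly stochastic $W$ with positive diagonal entries on a connected graph):

```latex
W\mathbf{1} = \mathbf{1}, \qquad
1 = \mu_1 > \max_{i \ge 2} |\mu_i| \;=:\; \rho, \qquad
W^{k} \;\xrightarrow{\;k \to \infty\;}\; \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top},
```

so repeated gossip averaging pulls all local models toward their mean, at a rate governed by the spectral gap $1 - \rho$.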

  22. Some facts from spectral graph theory

  30. Intuition Recall:

  31. Intuition Recall: ~ “What’s the worst possible bottleneck between two clusters?”

  32. Intuition Recall: ~ “What’s the worst possible bottleneck between two clusters?” ~ “How uniformly are edges distributed between nodes?”

  33. Intuition Recall: ~ “What’s the worst possible bottleneck between two clusters?” ~ “How uniformly are edges distributed between nodes?” “How random is the network?”

  34. Intuition Number of iterations needed before linear speedup depends on the spectrum of $W$: degenerate spectrum ⇒ densely connected, fewer iterations needed; broad spectrum ⇒ weakly connected, more iterations needed

  35. Evaluation: Image processing

  36. Evaluation: Image processing

  37. Evaluation: EA(M)-SGD

  38. Evaluation: NLP

  39. Beyond Data and Model Parallelism for Deep Neural Networks Zhihao Jia, Matei Zaharia, Alex Aiken Stanford University

  40. Motivation ● Data and model parallelism have become the go-to choices for distributed training ● These limited options result in suboptimal parallelization performance ● A more comprehensive parallelization search space may yield better parallelization strategies

  42. Proposed Solution SOAP Sample Operation Attribute Parameter

  43. Proposed Solution SOAP Sample Operation Attribute Parameter Standard data (sample) and model (parameter) approaches from previous work

  44. Proposed Solution SOAP Sample Operation Attribute Parameter Different operations performed in a DNN (e.g., MatMul, Convolution, etc.)

  45. Proposed Solution SOAP Sample Operation Attribute Parameter Attributes of a particular variable (e.g., length of a sequence, number of channels, etc.)

  46. Proposed Solution Sample Operation Attribute Parameter
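
As an illustration only (not FlexFlow's actual data structures; all names below are hypothetical), a SOAP parallelization strategy can be thought of as a per-operation choice of how many ways to split each parallelizable dimension and where to place the resulting tasks:

```python
# Hypothetical sketch of a SOAP strategy: for every operation in the operator
# graph, record the partitioning degree along each parallelizable dimension
# (Sample, Attribute, Parameter) and the devices the pieces are placed on.
strategy = {
    "conv1":     {"sample": 4, "attribute_height": 1, "parameter_out_channels": 1,
                  "devices": ["gpu0", "gpu1", "gpu2", "gpu3"]},
    "fc_matmul": {"sample": 1, "parameter_cols": 4,
                  "devices": ["gpu0", "gpu1", "gpu2", "gpu3"]},
}
# Pure data parallelism splits only the sample dimension for every operation;
# pure model parallelism splits only parameter dimensions.
```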

  47. Proposed Solution Execution Simulator ● Allows FlexFlow to search a much broader space without actually executing parallelization strategies ● Assumes each operation O on device D takes a constant, measurable amount of time ● Takes a device topology D, operator graph G, and parallelization strategy S and predicts the runtime ● Uses MCMC sampling to iteratively propose strategies S* within an allocated time budget
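
A sketch of the simulator-guided search loop, assuming a `simulate(strategy)` cost model standing in for the execution simulator and a `random_neighbor(strategy)` proposal that reassigns one operation's parallelization (both hypothetical names): a Metropolis-style rule accepts a slower strategy with probability that decays with the predicted slowdown, so the search can escape local minima.

```python
import math
import random

def mcmc_search(initial_strategy, simulate, random_neighbor,
                budget=10_000, beta=1.0):
    """Metropolis-style search over parallelization strategies, using the
    simulated per-iteration runtime as the energy to minimize."""
    current, cost = initial_strategy, simulate(initial_strategy)
    best, best_cost = current, cost
    for _ in range(budget):
        proposal = random_neighbor(current)   # tweak one operation's config
        new_cost = simulate(proposal)         # predicted, never executed
        # accept if faster, or with probability exp(-beta * slowdown)
        if new_cost <= cost or random.random() < math.exp(-beta * (new_cost - cost)):
            current, cost = proposal, new_cost
            if cost < best_cost:
                best, best_cost = current, cost
    return best, best_cost
```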

  48. Results - Per-iteration Throughput

  49. Results - Communication Overhead

  50. Results - Novelty

  51. PipeDream: Generalized Pipeline Parallelism for DNN Training Deepak Narayanan et al. Microsoft, CMU, Stanford

  52. Intra-Batch Parallelism ● Data Parallelism ○ Communication between workers is a bottleneck.

  53. Intra-Batch Parallelism ● Model Parallelism ○ Unused resources ○ Partitioning the model is left to the programmer

  54. Inter-Batch Parallelism: GPipe (Huang et al.) ● Uses pipelining in the context of model-parallel training for very large models ● Does not specify a partitioning algorithm ● Splits a minibatch into m microbatches
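
A minimal, framework-agnostic sketch of the microbatching idea (the helper names and the assumption that `forward_backward` returns a per-microbatch average gradient are illustrative): a minibatch is split into m microbatches that are pushed through the pipeline one after another, and their gradients are accumulated before a single weight update.

```python
import numpy as np

def split_into_microbatches(minibatch, m):
    """Split a minibatch (array of samples) into m roughly equal microbatches."""
    return np.array_split(minibatch, m)

def gpipe_style_step(minibatch, m, forward_backward, apply_update):
    """Run m microbatches through the pipeline, accumulate gradients, update once."""
    accumulated = None
    for micro in split_into_microbatches(minibatch, m):
        grads = forward_backward(micro)           # one pass through all stages
        accumulated = grads if accumulated is None else accumulated + grads
    apply_update(accumulated / m)                 # single synchronous update
```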

  55. Introducing Pipeline Parallelism ● A combination of inter-batch and intra-batch parallelism ● Model layers are mapped to stages; each stage consists of consecutive layers and is mapped to a separate GPU.

  56. Introducing Pipeline Parallelism ● Multiple minibatches are injected into the pipeline together to take advantage of all machines.

  57. Three Challenges ● Work Partitioning ○ How to partition the DNN model into stages? ● Work Scheduling ○ How does scheduling work in this bi-directional pipeline? ● Effective Learning ○ How to use correct and updated weights for faster learning?

  58. Challenge 1: Work Partitioning ● Goals: ○ Each stage performs roughly the same amount of computation ○ Inter-stage communication is minimized ● Profiling: ○ Computation time (forward/backward) ○ Size of layer outputs ● Partitioning algorithm outputs: ○ Partitioning of layers into stages ○ Replication factor (number of workers for each stage) ○ Optimal number of minibatches to keep workers busy
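
A simplified sketch of the partitioning step (not PipeDream's full algorithm, which also models communication costs and stage replication): given profiled per-layer compute times, split consecutive layers into a fixed number of stages so that the slowest stage is as fast as possible, using a classic linear-partition dynamic program.

```python
def partition_layers(layer_times, num_stages):
    """Split consecutive layers into num_stages stages, minimizing the maximum
    per-stage compute time (a stand-in for PipeDream's partitioner)."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)
    INF = float("inf")
    # dp[s][i]: best achievable bottleneck covering layers [0, i) with s stages
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):             # last stage = layers [j, i)
                bottleneck = max(dp[s - 1][j], prefix[i] - prefix[j])
                if bottleneck < dp[s][i]:
                    dp[s][i], cut[s][i] = bottleneck, j
    stages, i = [], n                             # recover stage boundaries
    for s in range(num_stages, 0, -1):
        stages.append(list(range(cut[s][i], i)))
        i = cut[s][i]
    return dp[num_stages][n], stages[::-1]

# e.g. partition_layers([4.0, 2.0, 3.0, 7.0, 1.0, 5.0], num_stages=3)
```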

  59. Challenge 2: Work Scheduling ● Bidirectional pipeline ○ Each active minibatch in the pipeline may be in a different stage ● Alternate between forward and backward passes (1F1B)
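
A toy illustration of the 1F1B ordering (a hypothetical helper, not PipeDream's scheduler, and it ignores the idle time a stage spends waiting on downstream gradients): after a warm-up that fills the pipeline with forward passes, each stage strictly alternates one forward pass with one backward pass, then drains the remaining backward passes.

```python
def one_forward_one_backward(stage, num_stages, num_microbatches):
    """Return the order of forward (F) and backward (B) work items for one
    stage under 1F1B scheduling (0-indexed stages, stage 0 is the first)."""
    warmup = num_stages - stage - 1         # extra forwards before steady state
    schedule, fwd, bwd = [], 0, 0
    while fwd < min(warmup, num_microbatches):   # warm-up: fill the pipeline
        schedule.append(("F", fwd)); fwd += 1
    while fwd < num_microbatches:                # steady state: 1 forward, 1 backward
        schedule.append(("F", fwd)); fwd += 1
        schedule.append(("B", bwd)); bwd += 1
    while bwd < num_microbatches:                # drain remaining backwards
        schedule.append(("B", bwd)); bwd += 1
    return schedule

# one_forward_one_backward(stage=0, num_stages=4, num_microbatches=6)
# -> F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
```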
