Distributed Training II
Benjamin Glickenhaus, Brendan Shanahan, Yash Bidasaria
Context: Distributed Training
● Models are getting too big to fit on just one GPU
  ○ Turing-NLG (Microsoft) has 17 billion parameters
● Because model training is iterative, communication between nodes becomes a bottleneck
● Distributed training can be split broadly into two types:
  ○ Data parallel
  ○ Model parallel
● Even these approaches result in far-from-optimal parallelization performance
Model Size Through the Years (Source: Microsoft)
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian et al.
University of Rochester, ETH Zurich, UC Davis, IBM, Tencent
Context
● Decentralized algorithms are treated as a compromise: decentralized communication is something we resort to and pay a price for
● Current analysis: decentralized PSGD offers no performance advantage over centralized PSGD, assuming a decentralized network topology
● Popular ML systems (TensorFlow, PyTorch, etc.) are built to support centralized execution
Lee, G. M. "A survey on trust computation in the internet of things." Information and Communications Magazine 33.2 (2016): 10-27.
Parallel Stochastic Gradient Descent
● Centralized network topology, e.g. the parameter-server model
● Communication bottleneck at the central node(s); performance degrades as network latency increases
● O(1/√(nT)) convergence rate with n workers and T iterations
● O(n) communication overhead at the central node
Li, Mu, et al. "Scaling distributed machine learning with the parameter server." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.
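To make the centralized communication pattern concrete, here is a minimal NumPy sketch of parameter-server-style PSGD on a toy least-squares problem; the worker count, problem, and step size are illustrative assumptions, not the setup from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: each "worker" holds its own shard of (A, b).
n_workers, dim, shard = 8, 10, 64
A = [rng.normal(size=(shard, dim)) for _ in range(n_workers)]
b = [rng.normal(size=shard) for _ in range(n_workers)]

def grad(i, x):
    """Stochastic gradient of 0.5*||A_i x - b_i||^2 on one random local sample."""
    j = rng.integers(shard)
    a, y = A[i][j], b[i][j]
    return (a @ x - y) * a

x = np.zeros(dim)          # parameter vector held by the central server
lr = 0.05
for t in range(500):
    # Every worker sends its gradient to the server: O(n) traffic at the hub.
    g = np.mean([grad(i, x) for i in range(n_workers)], axis=0)
    x -= lr * g            # server updates and broadcasts the new parameters
```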
Parallel Stochastic Gradient Descent
Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in Neural Information Processing Systems. 2010.
Decentralized PSGD
● Requires either that all nodes can access a shared database (identical local data distributions) or that the variance between the nodes' local datasets is bounded (data-parallel approach)
● Implies the same asymptotic convergence rate as C-PSGD (linear speedup)
● Communication topology represented by an undirected graph with a doubly stochastic weight matrix W
Decentralized PSGD
Lian, Xiangru, et al. "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent." Advances in Neural Information Processing Systems. 2017.
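A minimal sketch of the D-PSGD update on a ring, using a toy least-squares setup like the one above: each node averages its neighbors' parameters with a doubly stochastic W and then takes a local stochastic gradient step. The topology, weights, and step size are illustrative choices, not the paper's exact experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 8, 10                       # nodes in a ring, parameter dimension

# Doubly stochastic mixing matrix for a ring: average self + two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0

# Each node holds its own local data shard (toy least-squares).
A = rng.normal(size=(n, 64, dim))
b = rng.normal(size=(n, 64))

X = np.zeros((n, dim))               # one parameter copy per node
lr = 0.05
for t in range(500):
    # Local stochastic gradients: one random sample per node.
    j = rng.integers(64, size=n)
    G = np.stack([(A[i, j[i]] @ X[i] - b[i, j[i]]) * A[i, j[i]] for i in range(n)])
    # D-PSGD step: neighborhood averaging, then a local gradient step.
    X = W @ X - lr * G

x_avg = X.mean(axis=0)               # local models move toward a common solution
```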
Analysis: D-PSGD
Convergence rate for Decentralized PSGD (up to constants):
  (1/T) Σ_{t=1..T} E‖∇f(x̄_t)‖² ≲ 1/T + σ/√(nT)
* Assuming learning rate γ ≈ 1/(2L + σ√(T/n)) and a sufficiently large number of iterations T
Further assumptions:
● Lipschitz continuous gradients (bounded curvature / Hessian spectral radius)
● Weight matrix W has a bounded spectral gap
● Bounded variance w.r.t. the local data sample
● Globally bounded variance across nodes
Consequences:
● O(1/√(nT)) convergence rate, the same as C-PSGD
● The cost of reaching an approximate solution is shared between the n nodes ⇒ linear speedup
● O(deg(G)) communication overhead per node, which is low for sparse, i.e. very decentralized, networks
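For reference, a LaTeX rendering of the D-PSGD bound as reconstructed from Lian et al.'s theorem, stated only up to constant factors (the exact constants in the paper are omitted here).

```latex
% D-PSGD convergence (up to constants), reconstructed from Lian et al. 2017.
% n = number of nodes, T = iterations, sigma^2 = local sampling variance,
% L = gradient Lipschitz constant, f^* = optimal objective value.
\[
  \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\,\bigl\|\nabla f(\bar{x}_t)\bigr\|^2
  \;\lesssim\; \frac{L\bigl(f(x_0)-f^{\ast}\bigr)}{T} \;+\; \frac{\sigma}{\sqrt{nT}},
  \qquad \text{with } \gamma \approx \frac{1}{2L + \sigma\sqrt{T/n}}
  \text{ and } T \text{ sufficiently large.}
\]
```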
Analysis: Ring network
● Shared database (ζ = 0): same convergence rate as C-PSGD, provided the number of nodes n grows slowly enough with T
● Partitioned data (bounded ζ): same convergence rate, under a stricter bound on n relative to T
Proof:
Intuition
● The graph G has a Laplacian with spectrum 0 = λ₁ ≤ λ₂ ≤ … ≤ λ_n
● The weight matrix W is doubly stochastic, with a spectrum determined by the graph's connectivity
● By the Perron-Frobenius Theorem, the largest eigenvalue of W is 1 and, for a connected network with positive self-weights, all other eigenvalues are strictly smaller in magnitude
● The spectral gap of W therefore controls how quickly the nodes reach consensus
Some facts from spectral graph theory
Intuition
Recall:
● The second-smallest Laplacian eigenvalue ~ "What's the worst possible bottleneck between two clusters?"
● The spread of the spectrum ~ "How uniformly are edges distributed between nodes?"
● The spectral gap ~ "How random is the network?"
Intuition
The number of iterations T needed for D-PSGD to match C-PSGD depends on the spectrum of W:
● Degenerate spectrum ⇒ densely connected network ⇒ small T suffices
● Broad spectrum ⇒ weakly connected network ⇒ large T required
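This spectral intuition can be checked numerically. The sketch below builds doubly stochastic W matrices for a ring and for a complete graph and compares their spectral gaps; the graph size and weights are illustrative.

```python
import numpy as np

def spectral_gap(W):
    """1 minus the second-largest eigenvalue magnitude of a doubly stochastic W."""
    eig = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eig[1]

n = 16

# Ring: each node averages itself and its two neighbors (weakly connected).
ring = np.zeros((n, n))
for i in range(n):
    ring[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0

# Complete graph: uniform averaging over all nodes (densely connected).
complete = np.full((n, n), 1.0 / n)

print("ring gap:    ", spectral_gap(ring))      # small gap -> slow consensus
print("complete gap:", spectral_gap(complete))  # gap = 1   -> one-step consensus
```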
Evaluation: Image processing
Evaluation: EA(M)-SGD
Evaluation: NLP
Beyond Data and Model Parallelism for Deep Neural Networks
Zhihao Jia, Matei Zaharia, Alex Aiken
Stanford University
Motivation
● Data and model parallelization have become the go-to choices for distributed training
● These limited options result in suboptimal parallelization performance
● A more comprehensive parallelization search space may lead to better parallelization strategies
Proposed Solution: SOAP
● Sample: the standard data-parallel dimension from previous work (partition the training samples)
● Operation: the different operations performed in a DNN (e.g., MatMul, Convolution)
● Attribute: attributes of a particular variable (e.g., length of a sequence, number of channels)
● Parameter: the standard model-parallel dimension from previous work (partition the model parameters)
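As a rough illustration (not FlexFlow's actual API), a parallelization strategy can be thought of as a per-operation choice of how many ways to split each SOAP dimension and which devices to use. The representation below is hypothetical, for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParallelConfig:
    """How one operation is parallelized along the SOAP dimensions.

    Hypothetical representation for illustration; FlexFlow's real data
    structures differ.
    """
    sample_degree: int = 1      # split the batch (data parallelism)
    operation: str = "MatMul"   # which operator this config applies to
    attribute_degree: int = 1   # split an attribute, e.g. sequence length
    parameter_degree: int = 1   # split the weights (model parallelism)
    devices: List[int] = field(default_factory=list)

    def num_tasks(self) -> int:
        # Total parallel tasks implied by this configuration.
        return self.sample_degree * self.attribute_degree * self.parameter_degree

# Example: split a MatMul 4 ways over samples and 2 ways over parameters.
cfg = ParallelConfig(sample_degree=4, parameter_degree=2, devices=list(range(8)))
assert cfg.num_tasks() == len(cfg.devices)
```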
Proposed Solution: Execution Simulator
● Allows FlexFlow to search a much broader space without actually executing candidate parallelization strategies
● Assumes an operation O on a device D takes a constant, predictable amount of time
● Takes a device topology D, an operator graph G, and a parallelization strategy S, and predicts the runtime
● Uses MCMC sampling to iteratively propose new strategies S* within the allocated time budget
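A hedged sketch of the MCMC-style search loop, with a stand-in cost function in place of FlexFlow's execution simulator; the proposal move, cost model, and temperature are illustrative assumptions, not FlexFlow's actual implementation.

```python
import math
import random

random.seed(0)

N_OPS, N_DEVICES = 12, 4

def simulated_cost(strategy):
    """Stand-in for the execution simulator: predicts runtime of a strategy.

    Here a strategy is just a device assignment per operation; the real
    simulator uses measured per-op times plus communication costs.
    """
    load = [0.0] * N_DEVICES
    for op, dev in enumerate(strategy):
        load[dev] += 1.0 + 0.1 * op          # fake per-op compute time
    comm = sum(strategy[i] != strategy[i - 1] for i in range(1, N_OPS))
    return max(load) + 0.2 * comm            # bottleneck device + transfers

def mcmc_search(budget=5000, temperature=0.5):
    current = [random.randrange(N_DEVICES) for _ in range(N_OPS)]
    best, best_cost = current[:], simulated_cost(current)
    cur_cost = best_cost
    for _ in range(budget):
        proposal = current[:]                # move one op to a different device
        proposal[random.randrange(N_OPS)] = random.randrange(N_DEVICES)
        cost = simulated_cost(proposal)
        # Metropolis acceptance: always take improvements, sometimes regressions.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / temperature):
            current, cur_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = proposal[:], cost
    return best, best_cost

strategy, cost = mcmc_search()
print("predicted runtime:", round(cost, 2))
```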
Results - Per-iteration Throughput
Results - Communication Overhead
Results - Novelty
PipeDream: Generalized Pipeline Parallelism for DNN Training
Deepak Narayanan et al.
Microsoft, CMU, Stanford
Intra-Batch Parallelism
● Data parallelism
  ○ Communication between workers is a bottleneck
Intra-Batch Parallelism
● Model parallelism
  ○ Under-utilized resources (only one stage computes at a time)
  ○ Partitioning the model is left to the programmer
Inter-Batch Parallelism: GPipe (Huang et al.)
● Uses pipelining in the context of model-parallel training for very large models
● Does not specify a partitioning algorithm
● Splits a minibatch into m microbatches
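A trivial sketch of the microbatch split that GPipe-style pipelining relies on; the batch size and m are arbitrary example values.

```python
import numpy as np

minibatch = np.arange(32 * 4).reshape(32, 4)   # 32 samples, 4 features
m = 8                                          # number of microbatches
microbatches = np.array_split(minibatch, m)    # 8 chunks of 4 samples each
assert sum(len(mb) for mb in microbatches) == len(minibatch)
```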
Introducing Pipeline Parallelism
● Combination of inter-batch and intra-batch parallelism
● Model layers are mapped to stages; each stage consists of consecutive layers and is mapped to a separate GPU
Introducing Pipeline Parallelism
● Multiple minibatches are injected into the pipeline concurrently to take advantage of all machines
Three Challenges
● Work partitioning
  ○ How to partition the DNN model into stages?
● Work scheduling
  ○ How does scheduling work in this bi-directional pipeline?
● Effective learning
  ○ How to use correct and up-to-date weights for faster learning?
Challenge 1: Work Partitioning
● Goals:
  ○ Each stage performs roughly the same amount of computation
  ○ Inter-stage communication is minimized
● Profiling:
  ○ Computation time (forward/backward) per layer
  ○ Size of layer outputs
● Partitioning algorithm outputs (see the simplified sketch below):
  ○ A partitioning of layers into stages
  ○ A replication factor (number of workers) for each stage
  ○ The optimal number of in-flight minibatches to keep workers busy
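A simplified sketch of the partitioning idea, assuming only profiled per-layer compute times and ignoring replication and communication costs; PipeDream's actual partitioner is a dynamic program that also accounts for both.

```python
from functools import lru_cache
from typing import List, Tuple

def partition_layers(times: List[float], n_stages: int) -> Tuple[float, List[int]]:
    """Split consecutive layers into n_stages, minimizing the slowest stage.

    Simplified stand-in for PipeDream's partitioner: balances profiled
    per-layer compute times only.
    """
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> Tuple[float, Tuple[int, ...]]:
        # Best partition of layers[i:] into k stages: (bottleneck, cut points).
        if k == 1:
            return prefix[len(times)] - prefix[i], ()
        answer = (float("inf"), ())
        for j in range(i + 1, len(times) - k + 2):   # first stage = layers[i:j]
            stage = prefix[j] - prefix[i]
            rest, cuts = best(j, k - 1)
            answer = min(answer, (max(stage, rest), (j,) + cuts))
        return answer

    bottleneck, cuts = best(0, n_stages)
    return bottleneck, list(cuts)

# Example: 8 profiled layer times split across 3 GPUs/stages.
times = [2.0, 1.0, 3.0, 2.0, 2.0, 4.0, 1.0, 1.0]
print(partition_layers(times, 3))   # bottleneck time and stage boundaries
```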
Challenge 2: Work Scheduling
● Bidirectional pipeline
  ○ Each active minibatch in the pipeline may be in a different stage
● Alternate between one forward and one backward pass (1F1B); see the sketch below
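A toy sketch of the per-stage work order under 1F1B, assuming equal-length, non-replicated stages; it shows the warm-up forwards followed by strict alternation, and does not model cross-stage timing or weight versioning.

```python
def one_f_one_b(n_stages: int, n_minibatches: int):
    """Print the order of forward (F) and backward (B) work at each stage.

    Toy illustration of 1F1B scheduling: stage s runs (n_stages - s) warm-up
    forward passes, then strictly alternates one forward and one backward.
    """
    for stage in range(n_stages):
        warmup = n_stages - stage
        order, fwd, bwd = [], 0, 0
        while bwd < n_minibatches:
            if fwd < min(bwd + warmup, n_minibatches):
                order.append(f"F{fwd}")
                fwd += 1
            else:
                order.append(f"B{bwd}")
                bwd += 1
        print(f"stage {stage}: " + " ".join(order))

one_f_one_b(n_stages=4, n_minibatches=6)
```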