Beyond Data and Model Parallelism for Deep Neural Networks (PowerPoint PPT Presentation)


  1. Beyond Data and Model Parallelism for Deep Neural Networks ZHIHAO JIA, MATEI ZAHARIA, ALEX AIKEN SYSML 2019 PRESENTED BY JULIUS LISCHEID

  2. Existing Parallelisation Approaches (1/2)
     DATA PARALLELISM
     • Replica of neural network on each device
     • Each device processes subset of training data
     • After each iteration, parameters are synchronised
     • Works well for compute-heavy operations with few parameters (e.g. convolutions)
     MODEL PARALLELISM
     • Disjoint subsets of neural network assigned to devices
     • No parameter synchronisation, but requires data transfers between operations
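
To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) for a single linear layer y = x @ W: data parallelism splits the batch across full replicas of W, while model parallelism splits W itself across devices. Device names and shapes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # batch of 8 samples, 4 features
W = rng.standard_normal((4, 6))   # one 4 -> 6 linear layer

# Data parallelism: each "device" holds a full replica of W and a slice of the
# batch; gradients would be averaged (all-reduced) after each iteration.
x_dev0, x_dev1 = np.split(x, 2, axis=0)
y_data = np.concatenate([x_dev0 @ W, x_dev1 @ W], axis=0)

# Model parallelism: each "device" holds a disjoint slice of W and sees the full
# batch; no parameter synchronisation, but activations must be gathered.
W_dev0, W_dev1 = np.split(W, 2, axis=1)
y_model = np.concatenate([x @ W_dev0, x @ W_dev1], axis=1)

# Both partitionings reproduce the unpartitioned computation.
assert np.allclose(y_data, x @ W)
assert np.allclose(y_model, x @ W)
```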

  3. Existing Parallelisation Approaches (2/2)
     EXPERT-DESIGNED STRATEGIES
     • A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR 2014.
       - Data parallelism for convolutional layers, model parallelism for fully-connected layers
     • Y. Wu et al. Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR 2016.
       - Data parallelism for compute nodes, model parallelism for intra-node computation
     AUTOMATED FRAMEWORKS
     • A. Mirhoseini et al. Device Placement Optimization with Reinforcement Learning. ICML 2017.
       - Reinforcement learning for model parallelism
     • Z. Jia et al. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR 2018.
       - Dynamic programming for parallelisation of DNNs with linear computation graphs
     • D. Narayanan et al. PipeDream: generalized pipeline parallelism for DNN training. SOSP 2019.
     • …

  4. The SOAP Search Space
     • Samples (data parallelism)
     • Operators (model parallelism)
     • Attributes (e.g. pixels)
     • Parameters (≈ model parallelism)
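
One way to picture a SOAP strategy is as a partition degree along each dimension of an operator's output. The sketch below is a hypothetical illustration only (the Strategy class and its fields are my names, not FlexFlow's API): the degrees multiply to give the number of parallel tasks, so pure data parallelism and a hybrid sample/parameter split can occupy the same number of devices.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Strategy:
    samples: int     # S: split the training batch (data parallelism)
    operators: int   # O: split across operators (model parallelism / placement)
    attributes: int  # A: split an attribute such as the pixel dimensions
    parameters: int  # P: split the parameter/channel dimension (≈ model parallelism)

    def num_parallel_tasks(self) -> int:
        # Partition degrees multiply: one task per combination of partitions.
        return prod((self.samples, self.operators, self.attributes, self.parameters))

# Pure data parallelism vs. a hybrid strategy, both filling 4 devices.
data_parallel = Strategy(samples=4, operators=1, attributes=1, parameters=1)
hybrid        = Strategy(samples=2, operators=1, attributes=1, parameters=2)
assert data_parallel.num_parallel_tasks() == hybrid.num_parallel_tasks() == 4
```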

  5. Hybrid Parallelism in SOAP: example parallelization strategies for 1D convolution
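
Since the figure itself is not reproduced here, the following toy NumPy sketch (my own; the conv1d helper and the split sizes are assumptions) shows three such strategies for a 1D convolution and checks that each reproduces the unpartitioned result: splitting samples (batch), parameters (output channels), or the attribute (spatial) dimension, where the two spatial halves' input regions overlap by k-1 elements (a halo).

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution: x is (batch, length, c_in), w is (k, c_in, c_out)."""
    b, n, _ = x.shape
    k, _, c_out = w.shape
    out = np.zeros((b, n - k + 1, c_out))
    for i in range(n - k + 1):
        # Contract each (k, c_in) input window with the (k, c_in, c_out) kernel.
        out[:, i, :] = np.tensordot(x[:, i:i + k, :], w, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16, 3))   # batch=4, length=16, 3 input channels
w = rng.standard_normal((5, 3, 8))    # kernel k=5, 8 output channels
ref = conv1d(x, w)
k = w.shape[0]

# Sample (S) partition: split the batch, concatenate outputs along the batch dim.
y_s = np.concatenate([conv1d(part, w) for part in np.split(x, 2, axis=0)], axis=0)

# Parameter (P) partition: split the output channels, concatenate along channels.
y_p = np.concatenate([conv1d(x, part) for part in np.split(w, 2, axis=2)], axis=2)

# Attribute (A) partition: split the spatial dim; input regions overlap by k-1.
y_a = np.concatenate([conv1d(x[:, :8 + k - 1, :], w), conv1d(x[:, 8:, :], w)], axis=1)

for y in (y_s, y_p, y_a):
    assert np.allclose(y, ref)
```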

  6. FlexFlow
     • Trying out strategies on hardware is expensive due to long iteration times
     • Execution optimizer uses simulator instead
       - Measures operator runtime on hardware
       - Estimates runtime of parallelisation strategies
       - Delta simulation algorithm uses incremental updates for acceleration
     • Execution optimizer explores search space with Markov Chain Monte Carlo algorithm
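
The slide does not show the search loop itself, so here is a hedged sketch of a Metropolis-style MCMC search over per-operator configurations, with a made-up cost function standing in for FlexFlow's simulator (operator names and candidate configurations are invented). Note that FlexFlow's delta simulation only re-simulates the part of the strategy a proposal changes, whereas this sketch recomputes the full cost each step.

```python
import math
import random

OPERATORS = ["conv1", "conv2", "fc1"]          # hypothetical operators
CANDIDATE_CONFIGS = [(4, 1), (2, 2), (1, 4)]   # (sample degree, parameter degree)

def simulated_runtime(strategy):
    """Stand-in for the simulator: estimated per-iteration runtime of a strategy.
    The real simulator measures each operator once on hardware and estimates
    compute and transfer costs; here we just use an arbitrary smooth function."""
    return sum((s - 2) ** 2 + 0.5 * (p - 2) ** 2 for s, p in strategy.values())

def mcmc_search(steps=1000, beta=1.0, seed=0):
    rng = random.Random(seed)
    current = {op: rng.choice(CANDIDATE_CONFIGS) for op in OPERATORS}
    current_cost = simulated_runtime(current)
    best, best_cost = dict(current), current_cost
    for _ in range(steps):
        # Propose: re-parallelise one randomly chosen operator.
        proposal = dict(current)
        proposal[rng.choice(OPERATORS)] = rng.choice(CANDIDATE_CONFIGS)
        cost = simulated_runtime(proposal)
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if cost <= current_cost or rng.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = dict(proposal), cost
    return best, best_cost

print(mcmc_search())   # best per-operator configuration found and its estimated cost
```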

  7. Evaluation (1/2)

  8. Evaluation (2/2)

  9. Review (1/2)
     STRENGTHS/AGREEMENTS
     • Expands search space for parallelisation strategies
     • Proposes a way to efficiently explore that search space
     • Leads to an actual speed-up
     WEAKNESSES/DISAGREEMENTS
     • Unclear how much SOAP and execution optimiser contribute to training acceleration
     • Usefulness of Attribute dimension is questionable
     • More end-to-end performance benchmarks would have been useful

  10. Review (2/2)
     KEY TAKEAWAYS
     • Training performance of parallelisation strategies can be efficiently and accurately predicted
     • The resulting speed-up allows for the exploration of a wider search space
     POTENTIAL IMPACT
     • Usage of other search algorithms to explore parallelisation search space in simulation
     • Combination of parallelisation search space with computation graph substitutions (compare Tim’s presentation next week)

  11. Questions?

  12. Image Citations
     Images with beige background retrieved from Zhihao Jia’s SysML 2019 talk: https://www.youtube.com/watch?v=81l6kkV-OkE
     All other images extracted from Z. Jia, M. Zaharia, and A. Aiken: Beyond Data and Model Parallelism for Deep Neural Networks, SysML 2019.
