ML ↔ HPC: Optimizing Optimizers for Optimization


  1. ML ↔ HPC: Optimizing Optimizers for Optimization. Tal Ben-Nun, SPCL, ETH Zurich (spcl.inf.ethz.ch, @spcl_eth). Workshop on the Convergence of ML & HPC @ ASPLOS 2020 (via Zoom). With contributions from Dan Alistarh, Nikoli Dryden, Yosuke Oyama, Cedric Renggli, and others at SPCL.

  2. [Figure] 20 TB/night (source: OpenAI)

  3. A brief intro to supervised deep learning. [Figure: network prediction g(y) next to the true label m(y), e.g. Cat 0.54 vs. 1.00, Dog 0.28 vs. 0.00, Airplane 0.07 vs. 0.00, Horse 0.33 vs. 0.00, Banana 0.02 vs. 0.00, Truck 0.02 vs. 0.00; layer-wise parameter update.] Labeled samples y ∈ Y ⊂ 𝒟, label domain Z, true label m(y); the network g(y): Y → Z has a fixed structure f and learned weights x. Squared loss: ℓ_sq(x, y) = (g(y) − m(y))². Cross-entropy loss: ℓ_ce(x, y) = −Σ_j m(y)_j · log( e^{g(y)_j} / Σ_l e^{g(y)_l} ). Training objective: x* = argmin_{x ∈ ℝ^d} 𝔼_{y∼𝒟}[ℓ(x, y)].

  4. A brief intro to supervised deep learning (same setup as the previous slide, annotated with scale): the learned weights x range from about 30k to billions of parameters, occupying 100 MiB to 32 GiB and beyond, while the labeled training data requires TBs of random access.
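To make the notation above concrete, here is a minimal NumPy sketch (not from the deck) of the squared and cross-entropy losses and the empirical training objective; the single linear layer standing in for the network g, and all variable names, are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def squared_loss(g_y, m_y):
    # l_sq(x, y) = (g(y) - m(y))^2, summed over the label dimension
    return np.sum((g_y - m_y) ** 2, axis=-1)

def cross_entropy_loss(g_y, m_y):
    # l_ce(x, y) = -sum_j m(y)_j * log( e^{g(y)_j} / sum_l e^{g(y)_l} )
    p = softmax(g_y)
    return -np.sum(m_y * np.log(p + 1e-12), axis=-1)

# Tiny illustrative "network" g with learned weights x (a single linear layer).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6)) * 0.1                 # weights, learned
samples = rng.normal(size=(32, 8))                # a minibatch of samples y ~ D
labels = np.eye(6)[rng.integers(0, 6, size=32)]   # one-hot true labels m(y)

g_y = samples @ x                                 # network output g(y)
print("mean squared loss:", squared_loss(g_y, labels).mean())
print("mean cross-entropy:", cross_entropy_loss(g_y, labels).mean())
# Training seeks x* = argmin_x E_{y~D}[ l(x, y) ], approximated over minibatches.
```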

  5. Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 252 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory.] Deep learning is largely on distributed memory today! (T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019)

  6. Trends in distributed deep learning: node count and communication. Same survey of 252 papers on parallel deep learning. [Chart: communication mode.] Deep learning research is converging to MPI! (T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019)

  7. Computational Principles: Data

  8. Computational Principles: Data

  9. Example: options for computing convolutional layers. Direct: slide an s × s filter over the n × n input. Indirect (im2col): unfold input patches into a matrix with im2col, multiply with the reshaped filters as a single GEMM, then fold the result back with col2im. FFT: transform input and filters with ℱ, take the element-wise product in the frequency domain, sum channel-wise, and apply ℱ⁻¹; padded size n′ = n + s − 1. Winograd: minimal filtering algorithms (transform G(n, s)) that trade extra additions for fewer multiplications. [Figure: worked numeric examples of each method.] (K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006; M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14; A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16)
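As a rough illustration of the indirect (im2col + GEMM) option above, here is a minimal NumPy sketch for a single image with valid padding; the function names and the loop-based im2col are my own simplifications, not the slide's (production libraries use far more optimized layouts).

```python
import numpy as np

def im2col(img, s):
    """Unfold s x s patches of a (C, H, W) image into columns of shape (C*s*s, L)."""
    C, H, W = img.shape
    out_h, out_w = H - s + 1, W - s + 1
    cols = np.empty((C * s * s, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = img[:, i:i+s, j:j+s].ravel()
    return cols, out_h, out_w

def conv2d_gemm(img, filters):
    """filters: (K, C, s, s) -> output (K, out_h, out_w) via one matrix multiply."""
    K, C, s, _ = filters.shape
    cols, out_h, out_w = im2col(img, s)
    W = filters.reshape(K, C * s * s)       # each filter becomes one GEMM row
    return (W @ cols).reshape(K, out_h, out_w)

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 8, 8))            # 3-channel 8x8 input
filters = rng.normal(size=(4, 3, 3, 3))     # 4 output channels, 3x3 kernels
print(conv2d_gemm(img, filters).shape)      # (4, 6, 6)
```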

  10. Operator design: separable convolution, i.e., factoring a full convolution into a sequence of smaller ones (e.g., depthwise followed by pointwise). [Figure: a convolution expressed as a product of separable filters.] (A. G. Howard et al.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv 2017; F. N. Iandola et al.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, ICLR 2017)
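To show why separable convolutions help, here is a small back-of-the-envelope sketch comparing parameter counts of a standard convolution against a depthwise-separable one (the MobileNets factorization); the channel and kernel sizes are arbitrary illustrative choices.

```python
# Parameter count: standard vs. depthwise-separable convolution (as in MobileNets).
# C = input channels, K = output channels, s = kernel size; numbers are illustrative.
C, K, s = 128, 256, 3

standard = K * C * s * s          # one s x s filter per (input, output) channel pair
separable = C * s * s + K * C     # depthwise s x s per channel + 1x1 pointwise mixing

print(f"standard:  {standard:,} parameters")
print(f"separable: {separable:,} parameters")
print(f"reduction: {standard / separable:.1f}x")
```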

  11. Transformers: multi-head attention. (A. Vaswani et al.: Attention Is All You Need, NeurIPS 2017)
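Below is a minimal NumPy sketch of multi-head scaled dot-product attention as defined in the cited paper; the sequence length, model width, head count, and the omission of masking and batching are my simplifying assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h heads of width d_model // h."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project and split into h heads: (h, n, d_k)
    Q = (X @ Wq).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention per head: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                                 # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (5, 16)
```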

  12. DNN Compilers. ▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping. Examples: TensorFlow XLA, Facebook Glow / TorchScript JIT.

  13. DNN Compilers (continued). ▪ Same pipeline, further examples: Intel nGraph, TVM Stack, TensorFlow XLA, Facebook Glow / TorchScript JIT.
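To give a feel for the DNN → graph → IR pipeline, here is a small sketch using one of the stacks named above, the TorchScript JIT: tracing captures the operator graph of a toy model (my own, not from the talk) and exposes its IR for inspection.

```python
import torch
import torch.nn as nn

# A toy model: the compiler sees it as a graph of operators, lowers it to an IR,
# applies transformations (e.g., fusion, dead-code elimination), and maps it to hardware.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.fc = nn.Linear(8 * 6 * 6, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(x.flatten(1))

model = TinyNet().eval()
example = torch.randn(1, 3, 8, 8)
scripted = torch.jit.trace(model, example)   # capture the operator graph
print(scripted.graph)                        # inspect the TorchScript IR
print(scripted(example).shape)               # run through the compiled graph
```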

  14. Partitioning computation? [Figure: data parallelism, replicating the network and splitting the minibatch across workers.]

  15. Minibatch stochastic gradient descent (SGD). [Figure: the forward pass produces predictions (e.g., Cat 0.54 vs. true 1.00, Dog 0.28 vs. 0.00, Airplane 0.07, Horse 0.04, Bicycle 0.03, Truck 0.02), the loss is computed against the labels, and the gradient averaged over the minibatch updates the weights.] (T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019)
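The following is a minimal sketch of the minibatch SGD loop on a toy least-squares problem; the dataset, learning rate, and batch size are arbitrary assumptions meant only to show the sample-gradient-update structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: fit weights x so that Y @ x approximates targets t (least squares).
Y = rng.normal(size=(1024, 10))                        # dataset of samples
t = Y @ rng.normal(size=10) + 0.1 * rng.normal(size=1024)

x = np.zeros(10)                                       # weights, learned
eta, B = 0.05, 32                                      # learning rate and minibatch size

for step in range(200):
    idx = rng.choice(len(Y), size=B, replace=False)    # draw a minibatch
    yb, tb = Y[idx], t[idx]
    grad = 2.0 / B * yb.T @ (yb @ x - tb)              # average gradient of (y@x - t)^2
    x -= eta * grad                                    # SGD update: x <- x - eta * grad

print("final full-dataset loss:", np.mean((Y @ x - t) ** 2))
```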

  16. Partitioning computation? Data parallelism; model parallelism (channel/filter, spatial, layer); pipeline parallelism. [Figure: pipeline schedule across Proc 1-3, processing stages 1-3 with idle periods on each processor.]

  17. Partitioning computation?
      Data parallelism (see the MPI sketch after this list):
      ▪ Simple and efficient solution, easy to implement
      ▪ Duplicate parameters at all processors
      ▪ Affects generalization
      Model parallelism (channel/filter, spatial, layer):
      ▪ Parameters/domain can be distributed across processors
      ▪ Good for: large inputs, wide networks
      ▪ Complex communication per layer
      ▪ Performance hinges on implementation
      Pipeline parallelism:
      ▪ Parameters can be distributed across processors
      ▪ Good for: deep models, few activations
      ▪ Sparse communication pattern (only pipeline stages)
      ▪ Consistent model introduces idle-time "bubble"
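Since the survey slide above notes that distributed deep learning is converging to MPI, here is a minimal mpi4py sketch of the data-parallel variant: every rank keeps a full weight copy, computes gradients on its own data shard, and the gradients are averaged with an allreduce. The toy model and hyperparameters are assumptions, not part of the talk.

```python
# Minimal data-parallel SGD sketch with MPI (run e.g. `mpiexec -n 4 python dp_sgd.py`).
# Each rank holds a full copy of the weights and a shard of the data; gradients are
# averaged with an allreduce so all replicas stay consistent.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)              # each rank draws its own data shard
Y = rng.normal(size=(256, 10))
t = Y @ np.arange(10, dtype=float)

x = np.zeros(10)                               # replicated weights
eta = 0.01

for step in range(100):
    grad = 2.0 / len(Y) * Y.T @ (Y @ x - t)    # local gradient on this shard
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)      # sum gradients across all ranks
    x -= eta * (avg / size)                    # apply the averaged update

if rank == 0:
    print("weights after data-parallel SGD:", np.round(x, 2))
```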

  18. Hybrid parallelism: combining data, model, and layer (pipeline) parallelism.
      ▪ Layers/parameters can be distributed across processors
      ▪ Can distribute the minibatch
      ▪ Often specific to layer types (e.g., distribute fc layers but handle conv layers data-parallel; see the sketch below)
      ▪ Enables arbitrary combinations of data, model, and pipeline parallelism, which is very powerful!
      (A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014; J. Dean et al.: Large scale distributed deep networks, NIPS'12; T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019)
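As a companion sketch for the hybrid scheme (fc layers model-parallel, conv layers data-parallel, in the spirit of the "one weird trick" reference), here is a hypothetical model-parallel fully-connected layer: each rank owns a column slice of the weight matrix and an Allgather assembles the full activation. Shapes, names, and the column-wise split are my assumptions.

```python
# Model-parallel fully-connected layer (column split), the kind of hybrid building block
# where conv layers run data-parallel and fc layers run model-parallel.
# Run e.g. `mpiexec -n 4 python mp_fc.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

d_in, d_out, batch = 512, 256, 8
assert d_out % size == 0
cols = d_out // size                        # each rank owns a column slice of W

rng = np.random.default_rng(42)             # same seed: identical activations on all ranks
activations = rng.normal(size=(batch, d_in))
W_local = np.random.default_rng(rank).normal(size=(d_in, cols)) * 0.01

out_local = activations @ W_local           # (batch, cols): this rank's output slice
out_full = np.empty((size, batch, cols))
comm.Allgather(out_local, out_full)         # gather all slices on every rank
out = np.concatenate(out_full, axis=1)      # reassemble to (batch, d_out)

if rank == 0:
    print("assembled fc output shape:", out.shape)
```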

  19. Training is not just training: data/compute redistribution, imbalanced workload over time, nontrivial gradient aggregation. (K. Osawa et al.: Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs, arXiv 2018; T. Karras et al.: Progressive Growing of GANs for Improved Quality, Stability, and Variation, arXiv 2017)

  20. Hyperparameter and architecture search
      ▪ Meta-optimization of hyper-parameters (e.g., momentum) and DNN architecture
      ▪ Using reinforcement learning [1] (explore/exploit different configurations)
      ▪ Genetic algorithms with modified (specialized) mutations [2]
      ▪ Particle swarm optimization [3] and other meta-heuristics
      ▪ Multi-level optimization
      [Figures: Reinforcement Learning [1], Evolutionary Algorithms [4], Model-Based Optimization [5,6]]
      [1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
      [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
      [3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
      [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
      [5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18
      [6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19
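The cited methods (RL, evolution, PSO, differentiable search) are considerably more sophisticated, but the outer meta-optimization loop they all share can be sketched with plain random search over learning rate and momentum; everything in this example, including the toy inner training loop, is an illustrative assumption.

```python
import numpy as np

def train_and_evaluate(lr, momentum, steps=200, seed=0):
    """Inner loop: SGD with momentum on a toy least-squares problem; returns final loss."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(512, 10))
    t = Y @ rng.normal(size=10)
    x, v = np.zeros(10), np.zeros(10)
    for _ in range(steps):
        idx = rng.choice(len(Y), size=32, replace=False)
        grad = 2.0 / 32 * Y[idx].T @ (Y[idx] @ x - t[idx])
        v = momentum * v - lr * grad               # heavy-ball momentum update
        x += v
    return float(np.mean((Y @ x - t) ** 2))

# Outer loop: sample hyper-parameter configurations and keep the best one.
rng = np.random.default_rng(1)
best = None
for trial in range(20):
    lr = 10 ** rng.uniform(-4, -1.5)               # log-uniform learning rate
    momentum = rng.uniform(0.0, 0.9)
    loss = train_and_evaluate(lr, momentum)
    if best is None or loss < best[0]:
        best = (loss, lr, momentum)

print(f"best loss {best[0]:.4f} at lr={best[1]:.4g}, momentum={best[2]:.2f}")
```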

