spcl.inf.ethz.ch / @spcl_eth
Stochastic Performance
ML & HPC: Optimizing Optimizers for Optimization
Tal Ben-Nun
Workshop on the Convergence of ML & HPC @ ASPLOS 2020 (Zoom)
With contributions from Dan Alistarh, Nikoli Dryden, Yosuke Oyama, Cedric Renggli, and others at SPCL
[Figure: 20 TB/night. Source: OpenAI]
A brief intro to supervised deep learning
[Figure: network output $f(y)$ vs. true label $l(y)$ as class probabilities (Cat, Dog, Airplane, Horse, Banana, Truck)]
▪ Labeled samples $y \in Y \subset \mathcal{D}$, label domain $X$, true label $l(y)$
▪ Network $f(y)\colon Y \to X$ with a fixed structure and learned weights $x$; training proceeds by layer-wise parameter updates
▪ Squared loss: $\ell_{sq}(x, y) = \lVert f(y) - l(y) \rVert^2$
▪ Cross-entropy loss: $\ell_{ce}(x, y) = -\sum_i l(y)_i \cdot \log\!\left( e^{f(y)_i} \big/ \sum_k e^{f(y)_k} \right)$
▪ Training objective: $x^\ast = \operatorname{argmin}_{x \in \mathbb{R}^d} \, \mathbb{E}_{y \sim \mathcal{D}}\!\left[ \ell(x, y) \right]$
▪ Typical scale: network structures with 30k to billions of weights, occupying 100 MiB-32 GiB and beyond; the training set $\mathcal{D}$ requires ≥ TBs of random access
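As an illustration of the cross-entropy loss above, here is a minimal NumPy sketch (my own illustration, not code from the talk; the class names and probabilities are made up) that evaluates $\ell_{ce}$ for a single sample using a numerically stable softmax.

```python
import numpy as np

def cross_entropy(logits, true_label):
    """Cross-entropy loss l_ce for a single sample.

    logits: raw network outputs f(y), shape (num_classes,)
    true_label: one-hot true label l(y), shape (num_classes,)
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    shifted = logits - np.max(logits)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # l_ce(x, y) = -sum_i l(y)_i * log(softmax(f(y))_i)
    return -np.sum(true_label * np.log(probs))

# Hypothetical 6-class example (Cat, Dog, Airplane, Horse, Banana, Truck).
logits = np.array([2.1, 1.4, 0.2, 1.6, -0.5, -0.3])
true_label = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # true class: Cat
print(cross_entropy(logits, true_label))
```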
Trends in deep learning: hardware and multi-node
The field is moving fast, trying everything imaginable; survey results from 252 papers in the area of parallel deep learning.
[Charts: hardware used; shared vs. distributed memory]
Deep learning is largely on distributed memory today!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
Trends in distributed deep learning: node count and communication
The field is moving fast, trying everything imaginable; survey results from 252 papers in the area of parallel deep learning.
[Chart: communication mode]
Deep learning research is converging to MPI!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
Computational Principles
[Figure: Data]
Example: Options for computing convolutional layers
[Figure: a 2D convolution computed directly and via im2col, FFT, and Winograd]
▪ Direct: nested loops over output pixels, channels, and filter elements
▪ im2col: lower input patches into a matrix (im2col), compute the convolution as a GEMM, reshape back (col2im)
▪ FFT: transform inputs and kernels to the frequency domain (padded to $n' = n + k - 1$), multiply element-wise, sum over channels, apply the inverse transform
▪ Winograd: minimal filtering algorithms $F(m, r)$ that compute $m$ outputs of an $r$-tap filter with fewer multiplications
K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006
M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14
A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
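To make the im2col route concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk) of a single-channel, stride-1, valid convolution computed as im2col followed by a GEMM; the shapes and names are assumptions.

```python
import numpy as np

def im2col(x, k):
    """Lower all k-by-k patches of a 2D input into the columns of a matrix."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols, (out_h, out_w)

def conv2d_im2col(x, w):
    """Valid 2D convolution (really cross-correlation, as in most DL frameworks)."""
    k = w.shape[0]
    cols, out_shape = im2col(x, k)
    # The convolution becomes a single matrix product (GEMM).
    return (w.ravel() @ cols).reshape(out_shape)

x = np.random.rand(5, 5)
w = np.random.rand(3, 3)
ref = np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(3)] for i in range(3)])
assert np.allclose(conv2d_im2col(x, w), ref)  # matches the direct method
```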
Operator Design
▪ Separable convolution: factor a full convolution into a sequence of smaller convolutions (e.g., a depthwise convolution followed by a 1x1 pointwise convolution), greatly reducing parameters and work
[Figure: a full convolution expressed as a product of separable filters]
A.G. Howard et al.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv 2017
F.N. Iandola et al.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, ICLR 2017
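As a rough illustration of why separable convolutions help (my own sketch, not from the slides), the following compares the parameter counts of a standard convolution and a MobileNets-style depthwise-separable one; the layer sizes are arbitrary.

```python
# Parameter counts for a K x K convolution with C_in input and C_out output channels.
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one K x K filter per input channel
    pointwise = c_in * c_out          # 1 x 1 convolution mixing channels
    return depthwise + pointwise

k, c_in, c_out = 3, 128, 256          # arbitrary example layer
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
# Separable cost is roughly (1/c_out + 1/k^2) of the standard cost,
# i.e., about an 8-9x reduction for 3x3 kernels.
```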
Transformers: Multi-Head Attention
[Figure: scaled dot-product attention and multi-head attention blocks]
A. Vaswani et al.: Attention Is All You Need, NeurIPS 2017
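For reference, a minimal NumPy sketch of the scaled dot-product attention at the core of multi-head attention, following the formula in Vaswani et al. (this is my own illustration; the shapes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_q, seq_k) similarity matrix
    return softmax(scores, axis=-1) @ V   # weighted sum of values

# One attention head over a toy sequence of 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```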
DNN Compilers
▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping
▪ Examples: TensorFlow XLA, Facebook Glow / TorchScript JIT, Intel nGraph, TVM stack
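As one concrete entry point into this stack (a generic usage sketch assuming a PyTorch installation, not code from the talk), the TorchScript JIT mentioned above captures a Python-level model as a graph IR that can then be transformed and deployed:

```python
import torch
import torch.nn as nn

# A small model defined in ordinary PyTorch.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# torch.jit.trace records the operator graph for an example input and
# produces a TorchScript module (a graph IR the JIT can optimize).
example = torch.randn(1, 3, 32, 32)
scripted = torch.jit.trace(model, example)

print(scripted.graph)            # inspect the intermediate representation
scripted.save("model_ts.pt")     # serialized, deployable without Python
```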
Partitioning Computation?
▪ Data Parallelism
[Figure: data parallelism (minibatch split across replicas of the network)]
Minibatch Stochastic Gradient Descent (SGD)
[Figure: forward pass producing class probabilities, loss against the true labels, backpropagation, and a weight update averaged over the minibatch]
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
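A minimal sketch of the minibatch SGD loop itself (my own NumPy illustration on a linear least-squares model; the learning rate, batch size, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, batch_size, lr = 1024, 16, 32, 0.1

# Synthetic regression data: labels generated by a hidden weight vector.
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(N)

w = np.zeros(d)                            # weights x in the slides' notation
for epoch in range(10):
    perm = rng.permutation(N)              # reshuffle samples every epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the squared loss, averaged over the minibatch.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad                     # SGD update

print(np.linalg.norm(w - w_true))          # should be close to 0
```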
Partitioning Computation?
▪ Data Parallelism
▪ Model Parallelism (channel/filter, spatial, layer)
▪ Pipeline Parallelism
[Figure: pipeline schedule across three processors; the consistent-model schedule leaves idle slots]
Partitioning Computation?
▪ Data Parallelism
  ▪ Simple and efficient solution, easy to implement
  ▪ Duplicates parameters at all processors
  ▪ Affects generalization
▪ Model Parallelism (channel/filter, spatial, layer)
  ▪ Parameters/domain can be distributed across processors
  ▪ Good for: large inputs, wide networks
  ▪ Complex communication per layer; performance hinges on the implementation
▪ Pipeline Parallelism
  ▪ Parameters can be distributed across processors
  ▪ Good for: deep models, few activations
  ▪ Sparse communication pattern (only between pipeline stages)
  ▪ A consistent model introduces an idle-time "bubble"
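A minimal sketch of how data parallelism is typically realized over MPI (my own illustration assuming mpi4py and a toy NumPy model, not the talk's code): every rank computes a gradient on its local shard of the minibatch, and an allreduce averages the gradients before the identical update is applied everywhere.

```python
# Run with, e.g., `mpiexec -n 4 python data_parallel_sgd.py` (script name is hypothetical).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

d, local_batch, lr = 16, 32, 0.1
rng = np.random.default_rng(rank)          # each rank draws its own data shard
w = np.zeros(d)                            # replicated weights, kept in sync

for step in range(100):
    # Local shard of the global minibatch (synthetic data for illustration).
    Xb = rng.standard_normal((local_batch, d))
    yb = Xb @ np.ones(d)
    local_grad = 2.0 * Xb.T @ (Xb @ w - yb) / local_batch

    # Sum gradients across all ranks, then average: one allreduce per step.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w -= lr * global_grad / size           # identical update on every rank

if rank == 0:
    print("error:", np.linalg.norm(w - np.ones(d)))
```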
Hybrid Parallelism
[Figure: combining data, model, and layer (pipeline) parallelism]
▪ Layers/parameters can be distributed across processors
▪ Can also distribute the minibatch
▪ Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism; very powerful!
A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
J. Dean et al.: Large scale distributed deep networks, NIPS'12
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
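To quantify the pipeline "bubble" mentioned two slides earlier, here is a small sketch (my own, under a GPipe-style assumption of a synchronous schedule with p stages and m equal-cost microbatches) that computes the fraction of idle time per stage:

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a synchronous pipeline (GPipe-style fill/drain schedule).

    With p stages and m equal-cost microbatches, each stage is busy for m slots
    out of (m + p - 1) total slots, so the bubble is (p - 1) / (m + p - 1).
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:3d} microbatches: "
          f"{pipeline_bubble_fraction(4, m):.0%} idle")
# More microbatches amortize the fill/drain phases and shrink the bubble,
# one reason pipelining is usually combined with splitting the minibatch.
```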
Training is not just Training
▪ Data/compute redistribution
▪ Imbalanced workload over time
▪ Nontrivial gradient aggregation
K. Osawa et al.: Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs, arXiv 2018
T. Karras et al.: Progressive Growing of GANs for Improved Quality, Stability, and Variation, arXiv 2017
Hyperparameter and Architecture Search
▪ Meta-optimization of hyperparameters (e.g., momentum) and the DNN architecture
▪ Reinforcement learning [1] (explore/exploit different configurations)
▪ Genetic algorithms with modified (specialized) mutations [2]
▪ Particle swarm optimization [3] and other meta-heuristics
▪ Multi-level optimization
[Figures: reinforcement learning [1]; evolutionary algorithms [4]; model-based optimization [5,6]]
[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
[5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18
[6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19
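The simplest baseline in this space is random search over a configuration space; the sketch below (my own illustration, with a made-up objective standing in for a full training run) shows the structure that the RL, evolutionary, and model-based methods above refine:

```python
import math
import random

def validation_error(config):
    """Stand-in for training a model and returning its validation error.
    In a real search this is a full (possibly distributed) training run."""
    lr, momentum, width = config["lr"], config["momentum"], config["width"]
    return (math.log10(lr) + 2) ** 2 + (momentum - 0.9) ** 2 + 100.0 / width

def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-4, 0),        # log-uniform learning rate
        "momentum": rng.uniform(0.0, 0.99),
        "width": rng.choice([64, 128, 256, 512]),
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(64)]          # embarrassingly parallel
results = [(validation_error(c), c) for c in trials]      # one training run each
best_err, best_cfg = min(results, key=lambda t: t[0])
print(best_err, best_cfg)
```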