Reconciling Traditional Optimization Analyses and Modern Deep Learning: the Intrinsic Learning Rate
Kaifeng Lyu* (Tsinghua University), Sanjeev Arora (Princeton University & IAS), Zhiyuan Li* (Princeton University)
NeurIPS, December 2020
* These authors contributed equally.
Settings
• Normalization (e.g. BatchNorm) makes the loss scale-invariant:
  L(w_t; B_t) = L(c·w_t; B_t),  ∇L(w_t; B_t) = c·∇L(c·w_t; B_t),  ∇²L(w_t; B_t) = c²·∇²L(c·w_t; B_t),  ∀c > 0.
• Stochastic Gradient Descent with Weight Decay: w_{t+1} = (1 - λη_t)·w_t - η_t·∇L(w_t; B_t), where
  B_t: batch of training data at iteration t;  w_t: weights of the neural net;  η_t: learning rate (LR) at iteration t;  λ: weight decay (WD) factor, a.k.a. ℓ₂ regularization.
Our contributions:
1. Several surprising incompatibilities between normalized nets and traditional analyses.
2. A new theory suggesting that the LR does not play the role assumed in most discussions, via a new analysis of the SDEs arising from SGD on normalized nets. Our analysis shows that λη, i.e. WD × LR, is a better measure of the speed of learning, and we therefore call λη the intrinsic learning rate.
(A numerical sanity check of the scale-invariance identities and the update rule is sketched below.)
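The following is a minimal PyTorch sketch (ours, not the paper's code) of that sanity check. The tiny network, the helper loss_and_grad, and all hyperparameter values are illustrative assumptions; BatchNorm is used with affine=False so that every trainable parameter is scale-invariant, and the equalities hold only up to BatchNorm's eps and floating-point error.

```python
# Minimal sketch (illustrative, not the paper's code): numerically check
#   L(w; B) = L(c*w; B)  and  grad L(w; B) = c * grad L(c*w; B)  for c > 0
# on a small BatchNorm network, then take one SGD + weight-decay step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallBNNet(nn.Module):
    """All trainable parameters are scale-invariant: every linear layer is
    bias-free and followed by a non-affine BatchNorm."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(10, 32, bias=False), nn.BatchNorm1d(32, affine=False), nn.ReLU(),
            nn.Linear(32, 2, bias=False), nn.BatchNorm1d(2, affine=False))

    def forward(self, x):
        return self.body(x)

def loss_and_grad(net, w_vec, x, y):
    # Load the parameter vector into the net, return L(w; B) and its flattened gradient.
    torch.nn.utils.vector_to_parameters(w_vec, net.parameters())
    loss = F.cross_entropy(net(x), y)
    grads = torch.autograd.grad(loss, list(net.parameters()))
    return loss.detach(), torch.cat([g.reshape(-1) for g in grads])

torch.manual_seed(0)
net = SmallBNNet().train()                 # train mode: BN uses batch statistics
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
w = torch.nn.utils.parameters_to_vector(net.parameters()).detach().clone()

c = 3.0
loss_w, grad_w = loss_and_grad(net, w, x, y)
loss_cw, grad_cw = loss_and_grad(net, c * w, x, y)
print("L(w;B) == L(cw;B):            ", torch.allclose(loss_w, loss_cw, atol=1e-4))
print("grad L(w;B) == c*grad L(cw;B):", torch.allclose(grad_w, c * grad_cw, atol=1e-4))

# One SGD + weight-decay step; the intrinsic LR is lambda * eta.
eta, lam = 0.1, 5e-4
w_next = (1 - lam * eta) * w - eta * grad_w
print("intrinsic LR lambda*eta =", lam * eta)
```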
Conventional Wisdoms in Deep Learning (that we suspect!)
Wisdom 1: As LR → 0, the optimization dynamics converge to a deterministic path (Gradient Flow) along which the training loss strictly decreases.
[Figure: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR10]
Wisdom 2: To achieve the best generalization, the LR must be large initially for quite a few epochs.
[Figure: ResNet with BN trained by SGD with different LR schedules on CIFAR10]
Wisdom 3: SGD can be modeled via an SDE with fixed Gaussian noise, i.e., as a diffusion process that mixes to some Gibbs-like distribution.
[Figure from a blog post by Holden Lee and Andrej Risteski]
Conventional Wisdom challenged
(Against Wisdom 1): Full-batch gradient descent ≉ gradient flow: the loss goes up as the Hessian grows when w → 0.
[Figure: ResNet with BN trained by full-batch GD + WD on subsampled CIFAR10 (same setting, longer training!)]
(Against Wisdom 2): A small LR can generalize equally well as a large LR.
[Figure: ResNet with BN trained by SGD with different LR schedules on CIFAR10 (same setting, longer training!)]
(Against Wisdom 3): The random-walk/SDE view of SGD is way off: there is no evidence of mixing as traditionally understood, at least within normal training times. Stochastic Weight Averaging improves accuracy (Izmailov et al., 2018); SWA rules out that SGD, as a diffusion process, is mixing to a unique global equilibrium.
So what's going on?
SDE-based framework for modeling SGD on normalized networks
• Standard SDE approximation of SGD with weight decay.
• New parametrization: the intrinsic learning rate λη and the effective learning rate η_e = η/‖W_t‖².
Main Theorem: the SDE decomposes into the dynamics of the direction W_t/‖W_t‖ and the dynamics of the norm ‖W_t‖.
Three implications of the main theorem:
1. The effective learning rate depends on the norm of W: η_e = η/‖W_t‖².
2. λη alone determines the evolution of the system.
3. The norm converges in O(1/(λη)) time (steps).
(A toy simulation of the norm dynamics is sketched below.)
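Because the loss is scale-invariant, ⟨w_t, ∇L(w_t; B_t)⟩ = 0 and ‖∇L(w_t; B_t)‖ = ‖∇L(w_t/‖w_t‖; B_t)‖ / ‖w_t‖, so under SGD + WD the squared norm obeys the exact recursion ‖w_{t+1}‖² = (1 - λη)²‖w_t‖² + η²·G_t/‖w_t‖², where G_t is the squared gradient norm on the unit sphere. The toy simulation below is ours, not the paper's; it additionally assumes G_t is a constant G, and all numerical values are illustrative. It exhibits implications 1-3: the stationary effective LR η/‖w‖² comes out as roughly √(2λη/G), and the number of steps to equilibrate the norm scales like 1/(λη).

```python
# Toy simulation (illustrative assumption: constant spherical gradient norm G)
# of the squared-norm recursion forced by scale invariance:
#   ||w_{t+1}||^2 = (1 - lambda*eta)^2 * ||w_t||^2 + eta^2 * G / ||w_t||^2
import math

def simulate_norm_sq(eta, lam, G=1.0, w_sq0=50.0, steps=200_000):
    """Return the trajectory of ||w_t||^2 under the recursion above."""
    w_sq, history = w_sq0, []
    for _ in range(steps):
        w_sq = (1 - lam * eta) ** 2 * w_sq + eta ** 2 * G / w_sq
        history.append(w_sq)
    return history

eta, lam, G = 0.1, 5e-4, 1.0
hist = simulate_norm_sq(eta, lam, G)
print("stationary effective LR eta/||w||^2 :", eta / hist[-1])
print("predicted value sqrt(2*lam*eta/G)   :", math.sqrt(2 * lam * eta / G))

def steps_to_equilibrate(eta, lam, G=1.0, tol=0.01):
    """First step at which ||w_t||^2 is within 1% of its stationary value."""
    hist = simulate_norm_sq(eta, lam, G)
    final = hist[-1]
    return next(t for t, v in enumerate(hist) if abs(v - final) / final < tol)

# Halving the intrinsic LR (here by halving eta) roughly doubles the time to equilibrate.
print("steps to equilibrate, eta = 0.10:", steps_to_equilibrate(0.10, lam))
print("steps to equilibrate, eta = 0.05:", steps_to_equilibrate(0.05, lam))
```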
A conjecture about mixing time in function space
• The SDE (for SGD) mixes slowly in parameter space, but in experiments it mixes fast in function space, i.e., shortly after norm convergence, t = O(1/(λη)).
• Fast Equilibrium Conjecture: a network started from any non-pathological initialization converges to the function-space equilibrium in O(1/(λη)) steps, i.e.,
• one cannot distinguish two networks (from different initializations) at equilibrium via (further training +) evaluation on any input data, but it is possible to distinguish them by looking at their parameters (e.g., via Stochastic Weight Averaging). (A sketch of these two kinds of checks is given below.)
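Below is a sketch of the two kinds of checks the conjecture refers to. It is purely illustrative and not the paper's protocol: net_a and net_b are freshly initialized stand-ins, whereas the actual experiment would train each of them from a different initialization to equilibrium at the same intrinsic LR. The conjecture then predicts near-zero disagreement in the function-space check, while a parameter-space check (cosine similarity here is a crude stand-in for the SWA-based comparison) still tells the two runs apart.

```python
# Illustrative sketch of "indistinguishable in function space, distinguishable in
# parameter space". net_a / net_b are untrained stand-ins for two equilibrated runs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():
    return nn.Sequential(nn.Linear(10, 32, bias=False), nn.BatchNorm1d(32), nn.ReLU(),
                         nn.Linear(32, 2, bias=False))

net_a, net_b = make_net().eval(), make_net().eval()
x = torch.randn(256, 10)          # stand-in for held-out input data

# Function-space check: fraction of inputs on which the two nets' predictions disagree.
with torch.no_grad():
    disagreement = (net_a(x).argmax(dim=1) != net_b(x).argmax(dim=1)).float().mean()
print("prediction disagreement rate:", disagreement.item())

# Parameter-space check: the raw weight vectors still identify which run is which.
w_a = torch.nn.utils.parameters_to_vector(net_a.parameters())
w_b = torch.nn.utils.parameters_to_vector(net_b.parameters())
print("parameter cosine similarity :", F.cosine_similarity(w_a, w_b, dim=0).item())
```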
What happens in real-life training -- an interpretation
Our theory explains …
• Why does the test/train error suddenly increase after every LR decay?
  [Figure from the "Wide ResNet" paper (Zagoruyko & Komodakis, 2017)]
  Reason: the effective LR is first divided by 10, but its stationary value only decreases by roughly √10. (The arithmetic is sketched below.)
• How many LR decays are necessary to reach the best performance?
  Only the last one is necessary, if you don't mind training your network longer.
• What is the benefit of an early large LR then?
  Faster convergence to the equilibrium (of the small intrinsic LR).
See more interpretations in our paper!
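The arithmetic behind the "√10" remark, under the same constant-G approximation used in the norm sketch above (an assumption for illustration, not the paper's exact statement):

```latex
% At equilibrium of the norm recursion, 2*lambda*eta*||W||^2 ~ eta^2*G/||W||^2, so
\[
  \|W\|_{\mathrm{eq}}^{4} \approx \frac{\eta\, G}{2\lambda}
  \quad\Longrightarrow\quad
  \eta_e^{\mathrm{eq}} \;=\; \frac{\eta}{\|W\|_{\mathrm{eq}}^{2}}
  \;\approx\; \sqrt{\frac{2\lambda\eta}{G}} .
\]
% Divide eta by 10 (lambda fixed): right after the decay, eta_e drops by 10x, since
% the norm has not moved yet, but the new stationary eta_e is only sqrt(10) ~ 3.16x
% smaller, so part of the initial drop is undone as the norm re-equilibrates.
\[
  \eta \;\to\; \frac{\eta}{10}
  \quad\Longrightarrow\quad
  \eta_e \;\to\; \frac{\eta_e}{10}\ \text{(instantly)},
  \qquad
  \eta_e^{\mathrm{eq}} \;\to\; \frac{\eta_e^{\mathrm{eq}}}{\sqrt{10}}\ \text{(at the new equilibrium)}.
\]
```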