An Exponential LR Schedule for Deep Learning (Strange Behavior of Normalization Layers)
Zhiyuan Li (Princeton), Sanjeev Arora (Princeton & IAS)
ICLR, April 2019
Learning rate in traditional optimization
x_{t+1} ← x_t − η · ∇f(x_t)
Traditional: start with some LR; decay it over time. (Extensive literature in optimization justifying this.)
Confusingly, exotic LR schedules are also reported to work: triangular [Smith, 2015], cosine [Loshchilov & Hutter, 2016], etc.; no justification in theory.
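For concreteness, a minimal Python sketch (not from the talk) of the schedule shapes mentioned above; all constants are illustrative.

```python
import math

def step_decay(t, lr0=0.1, drop_every=30):               # traditional: decay over time
    return lr0 * (0.1 ** (t // drop_every))

def triangular(t, lr_min=0.001, lr_max=0.1, period=20):  # Smith, 2015
    x = abs((t % period) / (period / 2) - 1)              # goes 1 -> 0 -> 1 each period
    return lr_min + (lr_max - lr_min) * (1 - x)           # so LR goes min -> max -> min

def cosine(t, lr_min=0.0, lr_max=0.1, T=100):             # Loshchilov & Hutter, 2016
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))
```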
This work: Exponential LR schedules
[Figure: PreResNet32 on CIFAR10 — fixed LR schedule converted to Exp LR with η_t = 0.1 × 1.481^t]
Result 1 (empirical): It is possible to train today's deep architectures while growing the LR exponentially (i.e., at each iteration multiply it by (1 + α) for some α > 0).
Result 2 (theory): Mathematical proof that an exponential LR schedule can yield (in function space*) every net produced by existing training schedules.
Raises questions for theory; highlights the importance of trajectory analysis (vs. landscape analysis).
(* In all nets that use batch norm [Ioffe & Szegedy, 2015] or any other normalization scheme.)
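As a concrete illustration of Result 1 — a hypothetical sketch, not the authors' training script — here is a minimal PyTorch loop that grows the LR exponentially. The toy architecture (BN without affine params, frozen final layer, dummy data in place of CIFAR10) is our assumption, chosen so the loss is scale invariant; only the growth factor 1.481 comes from the slide's caption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy scale-invariant net: the trainable conv sits behind BN (affine=False),
# and the final linear layer is frozen.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16, affine=False), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
for p in net[-1].parameters():
    p.requires_grad_(False)                       # fixed final layer

params = [p for p in net.parameters() if p.requires_grad]
opt = torch.optim.SGD(params, lr=0.1)             # no weight decay, no momentum
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=1.481)  # lr_t = 0.1 * 1.481^t

for t in range(5):                                # dummy data standing in for CIFAR10
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step()                                  # LR: 0.1, 0.1481, 0.219, ...
```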
Main Thm (original schedule has fixed LR)
"General training algorithm; fixed LR":
w_{t+1} ← (1 − ηλ) w_t − η ∇L_t(w_t) + γ (w_t − w_{t−1})
with learning rate η, ℓ₂ regularizer λ (aka weight decay), momentum γ, and stochastic loss L_t at round t.
Thm: For nets with batch norm or layer norm, the following is equivalent to the above: weight decay 0, momentum γ, and LR schedule η_t = η β^{−2t−1}, where β (β < 1) is the nonzero root of x² − (1 + γ − λη) x + γ = 0.
(Proof shows nets in the new trajectory are equivalent in function space to nets in the original trajectory.)
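A numerical sanity check of the theorem's algebra — a sketch, not the paper's code. The toy scale-invariant loss L_t(w) = −⟨a_t, w⟩/‖w‖ is our choice, and the first iterate of the new trajectory is matched by hand (the paper treats initialization precisely). The check confirms the new iterates equal β^{−t} times the original ones, i.e., the same nets in function space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 30
A = rng.standard_normal((T, d))              # per-round "data" -> stochastic losses

def grad(w, a):
    # gradient of the scale-invariant loss L(w) = -<a, w> / ||w||
    n = np.linalg.norm(w)
    return -a / n + (a @ w) * w / n**3

eta, lam, gamma = 0.1, 5e-4, 0.9
b = 1 + gamma - lam * eta
beta = (b - np.sqrt(b * b - 4 * gamma)) / 2  # root < 1 of x^2 - (1+gamma-lam*eta)x + gamma

w0 = rng.standard_normal(d)
w_prev, w = w0, (1 - eta * lam) * w0 - eta * grad(w0, A[0])  # original: first step
v_prev, v = w0, w / beta                                     # new trajectory: matched start
for t in range(1, T):
    # original: WD lam, momentum gamma, constant LR eta (two-step heavy-ball form)
    w_prev, w = w, (1 + gamma - eta * lam) * w - gamma * w_prev - eta * grad(w, A[t])
    # new: WD 0, momentum gamma, LR eta_t = eta * beta^(-2t-1)
    # (momentum gamma with a geometric LR has effective two-step coefficient gamma/beta^2)
    m = gamma * beta ** -2
    v_prev, v = v, (1 + m) * v - m * v_prev - eta * beta ** (-2 * t - 1) * grad(v, A[t])

# same direction (same function); norms differ by exactly beta^{-T}
print(np.allclose(v, w * beta ** -T))        # True
```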
More general (original schedule has varying LR)
Original schedule (Step Decay): K phases; in phase i, for iterations [T_i, T_{i+1} − 1], use LR η_i.
Thm: Step Decay can be realized with a tapered exponential LR schedule.
Tapered Exponential LR schedule (TEXP): exponentially growing LR within each phase; when entering a new phase:
• switch to a slower exponential growth rate;
• divide the current LR by some constant.
(See details in the paper; a toy sketch of the shape follows below.)
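The exact TEXP constants are derived in the paper; the sketch below only illustrates the shape. The phase boundaries, per-phase growth factors, and drop constant are hypothetical placeholders, not the paper's formulas.

```python
def texp_lr(step, phase_starts=(0, 100, 200), base_lr=0.1,
            growth=(1.05, 1.02, 1.01), drop=10.0):
    """Illustrative 3-phase tapered exponential LR: grow within each phase,
    taper the growth factor and divide the LR at each phase boundary."""
    phase = sum(step >= s for s in phase_starts) - 1    # current phase index
    lr = base_lr
    for i in range(phase):                              # replay completed phases
        n = phase_starts[i + 1] - phase_starts[i]
        lr *= growth[i] ** n / drop                     # grow for n steps, then divide
    return lr * growth[phase] ** (step - phase_starts[phase])
```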
Key concept in proof: scale-invariant training loss
L(c · θ) = L(θ) for every vector θ of net parameters and every c > 0.
Observation: batch norm + fixed final layer ⇒ loss is scale invariant. (True for feed-forward nets, ResNets, DenseNets, etc.; see our appendix.)
A scale-invariant loss fn is sufficient for state-of-the-art deep learning! [Hoffer et al., 2018]
Lemma: the gradient of such losses satisfies ∇L(c θ) = c^{−1} ∇L(θ).
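A quick empirical check of the observation and the lemma on a toy BN net with a frozen final layer (our construction, not the paper's): scaling the trainable parameters by c leaves the loss unchanged and divides the gradient by c, up to BN's eps and float32 error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy scale-invariant net: trainable layer behind BN (affine=False), frozen final layer.
net = nn.Sequential(
    nn.Linear(4, 8), nn.BatchNorm1d(8, affine=False), nn.ReLU(),
    nn.Linear(8, 3),
)
for p in net[-1].parameters():
    p.requires_grad_(False)                 # fixed final layer

x, y = torch.randn(16, 4), torch.randint(0, 3, (16,))

def loss_and_grad(scale):
    with torch.no_grad():                   # theta -> c * theta (all trainable params)
        net[0].weight.mul_(scale); net[0].bias.mul_(scale)
    loss = nn.functional.cross_entropy(net(x), y)
    g, = torch.autograd.grad(loss, net[0].weight)
    with torch.no_grad():                   # undo the scaling
        net[0].weight.div_(scale); net[0].bias.div_(scale)
    return loss.item(), g

l1, g1 = loss_and_grad(1.0)
lc, gc = loss_and_grad(10.0)
print(abs(l1 - lc) < 1e-4)                              # L(c*theta) == L(theta)
print(torch.allclose(gc, g1 / 10.0, rtol=1e-3, atol=1e-6))  # grad(c*theta) == grad(theta)/c
```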
Proof sketch when momentum = 0
From scale invariance: ∇L(c w) = c^{−1} ∇L(w).
Consequence: a GD step commutes with rescaling, provided the LR is rescaled too:
c (w_t − η ∇L(w_t)) = c w_t − c² η ∇L(c w_t).
So the states (w_t, η) and (c w_t, c² η) are equivalent: GD trajectories started from either coincide up to the scaling c, and hence compute the same functions.
Warm-up: equivalence of the momentum-free case
State = (w_t, η); one GD step: w_{t+1} = w_t − η ∇L(w_t).
Equivalent scaling map: Π_c : (w, η) ↦ (c w, c² η). GD steps commute with Π_c: a GD step from Π_c(w, η) lands at Π_c applied to the GD step from (w, η).
Now take c = 1 − λη. One GD + WD step sends (w, η) to (c w − η ∇L_B(w), η), and c w − η ∇L_B(w) = c (w − c^{−1} η ∇L_B(w)); so a GD + WD step is a plain GD step with LR c^{−1} η followed by the scaling Π_c.
Theorem: GD + WD + constant LR = GD + Exp LR. With c = 1 − λη: GD with weight decay λ and constant LR η reaches, in function space, the same nets as plain GD (no WD) with the growing LR schedule η_t = η c^{−2t−1}.
Proof: push each scaling Π_c leftward through the later GD steps (each crossing multiplies an LR by c^{−2}); after T steps all scalings collect into a final Π_{c^T}, which changes no function values. Equivalently, w̃_t := c^{−t} w_t satisfies
w̃_{t+1} = w̃_t − η c^{−2t−1} ∇L_t(w̃_t),
i.e., w̃ follows plain GD with exponentially growing LR and computes the same functions as w. ∎
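A quick numerical check of this momentum-free identity, again a sketch on the same toy scale-invariant loss L_t(w) = −⟨a_t, w⟩/‖w‖ used earlier (our choice of loss):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, eta, lam = 10, 50, 0.1, 5e-4
A = rng.standard_normal((T, d))
c = 1 - lam * eta                            # scaling absorbed per GD + WD step

def grad(w, a):                              # gradient of L(w) = -<a, w> / ||w||
    n = np.linalg.norm(w)
    return -a / n + (a @ w) * w / n**3

w = v = rng.standard_normal(d)
for t in range(T):
    w = c * w - eta * grad(w, A[t])                   # GD + WD, constant LR eta
    v = v - eta * c ** (-2 * t - 1) * grad(v, A[t])   # plain GD, LR eta * c^(-2t-1)

print(np.allclose(v, w * c ** -T))           # True: v_t = c^{-t} w_t, same function
```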
Conclusions
• Scale invariance (provided by BN) makes the training procedure remarkably robust to LR schedules, even exponentially growing ones.
• The space of good LR schedules for current architectures is vast (hopeless to search for the best schedule?).
• Current ways of thinking about training/optimization should be rethought: focus on the trajectory, not the landscape.