An Exponential LR Schedule for Deep Learning (Strange Behavior of Normalization Layers) - PowerPoint PPT Presentation


  1. An Exponential LR Schedule for Deep Learning (Strange Behavior of Normalization Layers). Zhiyuan Li (Princeton), Sanjeev Arora (Princeton & IAS). ICLR, April 2019.

  2. Learning rate in traditional optimization: x_{t+1} ← x_t − η · ∇L(x_t). Traditional approach: start with some LR and decay it over time (there is an extensive literature in optimization justifying this). Confusingly, exotic LR schedules are also reported to work: triangular [Smith, 2015], cosine [Loshchilov & Hutter, 2016], etc., with no theoretical justification.
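
A minimal sketch (not from the slides) of the LR schedules mentioned above, with illustrative constants: classic step decay, cosine annealing [Loshchilov & Hutter, 2016], and triangular/cyclical LR [Smith, 2015], each written as a function of the epoch index t.

    import math

    def step_decay(t, lr0=0.1, drop=0.1, every=30):
        # "start with some LR; decay over time": divide the LR by 10 every `every` epochs
        return lr0 * drop ** (t // every)

    def cosine(t, lr0=0.1, T=90):
        # cosine annealing from lr0 down to 0 over T epochs (SGDR without restarts)
        return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

    def triangular(t, lr_min=0.001, lr_max=0.1, half_cycle=10):
        # cyclical LR: linear ramp up for half a cycle, then back down, repeated
        phase = t % (2 * half_cycle)
        frac = phase / half_cycle if phase < half_cycle else 2 - phase / half_cycle
        return lr_min + (lr_max - lr_min) * frac

    for t in (0, 10, 30, 60):
        print(t, step_decay(t), cosine(t), triangular(t))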

  3. This work: Exponential LR schedules. PreResNet32 on CIFAR10 with fixed LR converted to an Exp LR schedule η_t = 0.1 × 1.481^t. Result 1 (empirical): it is possible to train today's deep architectures while growing the LR exponentially (i.e., multiplying it by (1 + c) for some c > 0 at each iteration). Result 2 (theory): mathematical proof that an exponential LR schedule can yield (in function space*) every net produced by existing training schedules. This raises questions for theory and highlights the importance of trajectory analysis (vs. landscape analysis). (*In all nets that use batch norm [Ioffe & Szegedy, 2015] or any other normalization scheme.)
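
A small sketch of the exponentially growing schedule described here. The growth factor 1.481 is the one quoted on this slide (reading t as an epoch index is my assumption); the per-iteration constant c in the second helper is purely illustrative.

    def exp_lr(epoch, lr0=0.1, growth=1.481):
        # eta_t = 0.1 * 1.481**t, with t read as the epoch index
        return lr0 * growth ** epoch

    def exp_lr_per_step(step, lr0=0.1, c=1e-3):
        # per-iteration view: multiply the LR by (1 + c), c > 0, at every step
        # (c = 1e-3 is an illustrative value, not taken from the slides)
        return lr0 * (1.0 + c) ** step

    print([round(exp_lr(t), 3) for t in range(5)])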

  4. Main Thm (original schedule has fixed LR). "General training algorithm; fixed LR": x_{t+1} ← x_t + γ(x_t − x_{t−1}) − η ∇(L_t(x_t) + (λ/2)‖x_t‖²), where η is the learning rate, (λ/2)‖x‖² is the ℓ₂ regularizer (aka weight decay), γ is the momentum, and L_t is the stochastic loss at round t. Thm: for nets with batch norm or layer norm, the following is equivalent to the above: weight decay 0, momentum γ, and LR schedule η_t = η β^{−2t−1}, where β (β < 1) is a nonzero root of x² − (1 + γ − λη)x + γ = 0. (The proof shows that nets on the new trajectory are equivalent in function space to nets on the original trajectory.)
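
A sketch of how the theorem's equivalent schedule could be computed from (η, γ, λ). The hyperparameter values below (γ = 0.9, λ = 5e-4) and the choice of the root closer to 1 (which yields the kind of mild growth seen on the previous slide) are my assumptions; the slide only says that β < 1 is a nonzero root of the quadratic.

    import math

    def explr_schedule(eta=0.1, gamma=0.9, lam=5e-4, steps=5):
        # beta is a root of x**2 - (1 + gamma - lam*eta)*x + gamma = 0;
        # we take the larger root (closer to 1), an assumption not stated on the slide.
        b = 1.0 + gamma - lam * eta
        beta = (b + math.sqrt(b * b - 4.0 * gamma)) / 2.0
        # beta < 1 here (both roots are < 1 when gamma < 1 and lam*eta > 0)
        # equivalent Exp LR schedule from the theorem: eta_t = eta * beta**(-2t - 1)
        return beta, [eta * beta ** (-2 * t - 1) for t in range(steps)]

    beta, etas = explr_schedule()
    print(beta, [round(e, 5) for e in etas])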

  5. More general (original schedule has varying LR). Original schedule (step decay): K phases; in phase I, iterations [T_I, T_{I+1} − 1] use LR η_I. Thm: step decay can be realized with a tapered exponential LR schedule. Tapered Exponential LR schedule (TEXP): exponentially growing LR within each phase; when entering a new phase: (i) switch to a slower exponential growth rate, and (ii) divide the current LR by some constant. (See details in the paper.)
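
The exact TEXP construction is in the paper; the sketch below only mimics the recipe described on this slide (grow the LR exponentially within each phase; on entering a new phase, divide the LR by a constant and switch to a slower growth rate). All phase lengths, growth factors, and divisors are made up.

    def texp_schedule(phase_ends=(100, 200, 300),   # iteration at which each phase ends (illustrative)
                      growth=(1.01, 1.005, 1.001),  # slower exponential growth in later phases (illustrative)
                      divide=(1.0, 10.0, 10.0),     # LR divisor applied on entering each phase (illustrative)
                      lr0=0.1):
        lrs, lr, phase = [], lr0, 0
        for t in range(phase_ends[-1]):
            if phase + 1 < len(phase_ends) and t == phase_ends[phase]:
                phase += 1
                lr /= divide[phase]    # entering a new phase: divide the current LR by a constant ...
            lr *= growth[phase]        # ... and keep growing exponentially, but at a slower rate
            lrs.append(lr)
        return lrs

    lrs = texp_schedule()
    print(lrs[0], lrs[99], lrs[100], lrs[-1])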

  6. Key concept in the proof: scale-invariant training loss. L(c · θ) = L(θ) for every vector θ of net parameters and every c > 0. Observation: batch norm + fixed final layer ⇒ the loss is scale invariant (true for feed-forward nets, ResNets, DenseNets, etc.; see our appendix). A scale-invariant loss function suffices for state-of-the-art deep learning! [Hoffer et al., 2018]. Lemma: the gradient of such a loss satisfies ∇L(c θ_t) = c^{−1} ∇L(θ_t). [Figure: the vectors θ_t and c θ_t with their negative gradients −∇L(θ_t) and −∇L(c θ_t).]
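
A numerical check of the scale-invariance property and the gradient lemma on a toy loss of my own construction (a logistic loss applied to a weight vector that is normalized before use, mimicking batch norm + a fixed final layer); none of this is the authors' code.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)                    # one toy input (illustrative)

    def loss(w):
        # toy scale-invariant loss: only the direction of w matters
        z = np.dot(w / np.linalg.norm(w), x)
        return np.log1p(np.exp(-z))           # logistic loss with label +1

    def num_grad(f, w, eps=1e-6):
        # central finite differences, accurate enough for the checks below
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            g[i] = (f(w + e) - f(w - e)) / (2 * eps)
        return g

    w, c = rng.normal(size=5), 3.0
    print(np.isclose(loss(c * w), loss(w)))                           # L(c w) = L(w)
    print(np.allclose(num_grad(loss, c * w), num_grad(loss, w) / c))  # grad L(c w) = grad L(w) / c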

  7. Proof sketch when momentum = 0. Since the loss is scale invariant, ∇L(c θ_t) = c^{−1} ∇L(θ_t). Hence scaling the weights by c and the LR by c² commutes with a GD step: if θ'_t = c θ_t and η' = c² η, then θ'_{t+1} = θ'_t − η' ∇L(θ'_t) = c θ_t − c² η · c^{−1} ∇L(θ_t) = c (θ_t − η ∇L(θ_t)) = c θ_{t+1}. [Figure: the GD step θ_{t+1} = θ_t − η ∇L(θ_t) and the GD step from c θ_t with LR c² η, related by the map (θ_t, η) ↦ (c θ_t, c² η).]
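
A numerical version of this step of the argument, using the same kind of toy scale-invariant loss as above (my construction, not the authors'): a GD step taken at (c·θ, c²·η) lands exactly at c times the GD step taken at (θ, η).

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=5)

    def grad(w, eps=1e-6):
        # numeric gradient of the toy scale-invariant loss L(w) = log(1 + exp(-<w/||w||, x>))
        def L(v): return np.log1p(np.exp(-np.dot(v / np.linalg.norm(v), x)))
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            g[i] = (L(w + e) - L(w - e)) / (2 * eps)
        return g

    w, eta, c = rng.normal(size=5), 0.1, 2.0
    step        = w - eta * grad(w)                     # GD step at (w, eta)
    step_scaled = c * w - c**2 * eta * grad(c * w)      # GD step at (c w, c^2 eta)
    print(np.allclose(step_scaled, c * step))           # the two steps agree up to the scaling c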

  8. Warm-up: equivalence of the momentum-free case. State after round t = (θ_{t+1}, η), i.e., the current weights together with the current LR. The map (θ, η) ↦ (c θ, c² η) carries the GD trajectory θ_{t+1} = θ_t − η ∇L(θ_t) with LR η to the GD trajectory with LR η' = c² η started from c θ_0. [Figure: commutative diagram with states (θ_t, η) → (θ_{t+1}, η) on one row and (c θ_t, c² η) → (c θ_{t+1}, c² η) on the other.]

  9. Warm-up: equivalence of the momentum-free case. Write the diagram in terms of maps on the state (θ, η): GD_t : (θ, η) ↦ (θ − η ∇L_t(θ), η) (one GD step on the round-t loss); Π₁^c : (θ, η) ↦ (c θ, η) (scale the weights by c); Π₂^c : (θ, η) ↦ (θ, c η) (scale the LR by c). Equivalent scaling: GD_t ∘ Π₁^c ∘ Π₂^{c²} = Π₁^c ∘ Π₂^{c²} ∘ GD_t. [Figure: the same commutative diagram, with horizontal edges GD_t and vertical edges Π₁^c ∘ Π₂^{c²}.]

  10. Warm-up: equivalence of the momentum-free case (α = 1 − λη). One step of GD + WD at fixed LR η is the composition Π₁^α ∘ Π₂^α ∘ GD_t ∘ Π₂^{α^{−1}}: (θ, η) → (θ, α^{−1} η) → (θ − α^{−1} η ∇L_t(θ), α^{−1} η) → (α θ − η ∇L_t(θ), η) = ((1 − λη) θ − η ∇L_t(θ), η). [Figure: the commutative diagram from the previous slide instantiated with c = α.]
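
A numerical check of this decomposition as reconstructed above, with the maps GD_t, Π₁ and Π₂ written out explicitly on a toy scale-invariant loss (again my own construction): composing Π₂^{α^{-1}}, then a GD step, then Π₂^α and Π₁^α reproduces one GD + weight-decay step at fixed LR.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=5)

    def grad(w, eps=1e-6):
        # numeric gradient of the toy scale-invariant loss used in the earlier sketches
        def L(v): return np.log1p(np.exp(-np.dot(v / np.linalg.norm(v), x)))
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            g[i] = (L(w + e) - L(w - e)) / (2 * eps)
        return g

    def GD(state):              # (theta, eta) -> (theta - eta * grad(theta), eta)
        th, eta = state
        return th - eta * grad(th), eta

    def Pi1(state, c):          # scale the weights by c
        return c * state[0], state[1]

    def Pi2(state, c):          # scale the LR by c
        return state[0], c * state[1]

    w, eta, lam = rng.normal(size=5), 0.1, 5e-4
    alpha = 1 - lam * eta

    wd_step  = (1 - lam * eta) * w - eta * grad(w)                   # one GD + WD step
    composed = Pi1(Pi2(GD(Pi2((w, eta), 1 / alpha)), alpha), alpha)  # Pi1^a . Pi2^a . GD . Pi2^(1/a)
    print(np.allclose(composed[0], wd_step), np.isclose(composed[1], eta))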

  11. (α = 1 − λη). Theorem: GD + WD + constant LR = GD + Exp LR (in function space). Proof: write each GD + WD step as Π₁^α ∘ Π₂^α ∘ GD_t ∘ Π₂^{α^{−1}} and use the equivalent-scaling identity GD_t ∘ Π₁^c ∘ Π₂^{c²} = Π₁^c ∘ Π₂^{c²} ∘ GD_t to push all scaling maps to the front: (GD + WD)_{t−1} ∘ ⋯ ∘ (GD + WD)_0 = Π₁^{α^t} ∘ Π₂^{α^{2t−1}} ∘ GD'_{t−1} ∘ ⋯ ∘ GD'_0, where GD'_i is a plain GD step (no weight decay) taken with LR η α^{−2i−1}. The leading scaling maps change only the norm of the weights (and the bookkeeping LR), not the function computed, so the two trajectories are equivalent in function space; the Exp LR iterate equals α^{−t} times the GD + WD iterate.
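
A numerical end-to-end check of the claimed equivalence on a toy scale-invariant loss (my construction, with illustrative η and λ): T steps of GD + weight decay at constant LR produce, up to an overall scaling α^{-T}, the same weights as T steps of plain GD with the exponentially growing LR η α^{-2t-1}; after normalization (all a scale-invariant net sees) the two trajectories coincide.

    import numpy as np

    rng = np.random.default_rng(3)
    xs = rng.normal(size=(8, 5))              # one toy "batch" per round (illustrative)

    def grad(w, t, eps=1e-6):
        # numeric gradient of the round-t toy scale-invariant loss
        def L(v): return np.log1p(np.exp(-np.dot(v / np.linalg.norm(v), xs[t])))
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            g[i] = (L(w + e) - L(w - e)) / (2 * eps)
        return g

    eta, lam, T = 0.1, 5e-2, 8                # illustrative values, not from the slides
    alpha = 1 - lam * eta
    w_wd = w_exp = rng.normal(size=5)

    for t in range(T):
        w_wd  = (1 - lam * eta) * w_wd - eta * grad(w_wd, t)          # GD + WD, constant LR
        w_exp = w_exp - eta * alpha ** (-2 * t - 1) * grad(w_exp, t)  # plain GD, Exp LR

    print(np.allclose(w_exp, alpha ** (-T) * w_wd))                                 # same up to scaling
    print(np.allclose(w_exp / np.linalg.norm(w_exp), w_wd / np.linalg.norm(w_wd)))  # same function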

  12. Conclusions: • Scale invariance (provided by BN) makes the training procedure remarkably robust to the LR schedule, even to exponentially growing schedules. • The space of good LR schedules for current architectures is vast (hopeless to search for the best schedule?). • Current ways of thinking about training/optimization should be rethought; the focus should be on the trajectory, not the landscape.
