

  1. Optimizer Benchmarking Needs to Account for Hyperparameter Tuning. Prabhu Teja S* (1,2), Florian Mai* (1,2), Thijs Vogels (2), Martin Jaggi (2), François Fleuret (1,2). (1) Idiap Research Institute, (2) EPFL, Switzerland. *Equal contribution. Contact: {prabhu.teja, florian.mai}@idiap.ch

  2. The problem of optimizer evaluation. Figure: two optimizers A and B with a hyperparameter θ, plotting expected loss L(θ) against θ; their optima lie at θ⋆_A and θ⋆_B. Which one do we prefer in practice?

  3. The problem of optimizer evaluation. Figure: the same two optimizers A and B with hyperparameter θ and expected loss L(θ). Which one do we prefer in practice? Two criteria matter: 1. the absolute performance of the optimizer, i.e., L(θ⋆_A) and L(θ⋆_B); 2. the difficulty of finding a good hyperparameter configuration, i.e., how hard it is to locate θ⋆_A and θ⋆_B.

  4. The Problem of Optimizer Evaluation: SGD vs Adam. 1. In previous literature, SGD often achieves better peak performance than Adam. 2. Once we account for the cost of automatic hyperparameter optimization (HPO), the picture changes. Figure: probability of being the best optimizer as a function of the HPO budget (number of models trained, 10 to 60), with Adam (only l.r. tuned) at 58%, Adam (all params. tuned) at 17%, SGD (tuned l.r., fixed mom. and w.d.) at 13%, and SGD (l.r. schedule tuned, fixed mom. and w.d.) at 12%. Our method eliminates human biases arising from manual hyperparameter tuning.

  5. Revisiting the notion of an optimizer. Definition: an optimizer is a pair M = (U_Θ, p_Θ) that applies its update rule U(S_t; Θ) at each step t, depending on its current state S_t. Its hyperparameters Θ = (θ_1, ..., θ_N) come with a prior probability distribution p_Θ : Θ → ℝ. The prior p_Θ should be specified by the optimizer designer; e.g., Adam's ε must be positive and close to 0, so ε ∼ Log-uniform(−8, 0).
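As an illustration of this definition (not code from the paper), an optimizer can be represented as an update rule paired with a sampler for its hyperparameter prior. The Python sketch below is an assumption about how one might encode it; the Log-uniform(−8, 0) prior on ε from the slide is read as 10 raised to a uniform draw.

    import random
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Optimizer:
        # An optimizer in the sense of the definition above:
        # an update rule U plus a prior p_Theta over its hyperparameters.
        update_rule: Callable             # U(state, Theta) -> new state
        sample_prior: Callable[[], Dict]  # draws one configuration from p_Theta

    def sample_adam_eps_prior() -> Dict:
        # eps > 0 and close to 0: log10(eps) ~ Uniform(-8, 0), as on the slide.
        return {"eps": 10 ** random.uniform(-8, 0)}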

  6. HPO-aware optimizer benchmarking. Algorithm 1 (benchmark with 'expected quality at budget'). Input: optimizer O, cross-task hyperparameter prior p_Θ, task T, tuning budget B. Initialize list ← []. For R repetitions: perform random search with budget B, i.e., S ← sample B configurations from p_Θ and evaluate them on T, then prepend best(S) to list. Return mean(list), var(list), or other statistics.
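A minimal Python sketch of Algorithm 1, assuming two hypothetical helpers: sample_prior() draws one configuration from p_Θ, and train_and_eval(config) trains on task T and returns a validation score (higher is better).

    import statistics

    def expected_quality_at_budget(train_and_eval, sample_prior, budget, repetitions=100):
        # Estimate the expected quality of the best model found by random
        # search with `budget` sampled hyperparameter configurations.
        best_scores = []
        for _ in range(repetitions):
            # One random-search run with the given budget.
            scores = [train_and_eval(sample_prior()) for _ in range(budget)]
            best_scores.append(max(scores))
        return statistics.mean(best_scores), statistics.variance(best_scores)

In practice one would reuse a pool of already-evaluated configurations rather than retraining for every repetition; the loop above simply mirrors the algorithm as stated on the slide.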

  7. Calibrated task-independent priors p_Θ. Table (optimizer: tunable parameters → cross-task prior): SGD: learning rate, momentum, weight decay, poly decay (p) → ??; Adagrad: learning rate → ??; Adam: learning rate, β_1, β_2, ε → ??. The cross-task prior column is still open at this point; it is calibrated on the next slides.

  8. Calibrated task-independent priors p_Θ. Same table as before (SGD: learning rate, momentum, weight decay, poly decay (p); Adagrad: learning rate; Adam: learning rate, β_1, β_2, ε), with the calibration procedure: sample a large number of hyperparameter configurations from a wide range of admissible values and record their performance; then compute a maximum likelihood estimate (MLE) of the prior's parameters using the top 20% best-performing values from the previous step.
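A sketch of this calibration step under stated assumptions: logs are taken base 10 to match the Log-uniform ranges quoted on these slides, and values/scores are the hyperparameter values and performances collected during the broad sampling step. Only NumPy is needed.

    import numpy as np

    def calibrate_lognormal_prior(values, scores, top_frac=0.2):
        # Keep the top 20% of sampled configurations by performance and fit a
        # log-normal prior to their hyperparameter values by maximum likelihood.
        values = np.asarray(values, dtype=float)
        scores = np.asarray(scores, dtype=float)
        k = max(1, int(top_frac * len(values)))
        top = values[np.argsort(scores)[-k:]]   # best-performing values
        log_top = np.log10(top)
        # MLE of a (base-10) log-normal: mean and std of the log-values.
        return log_top.mean(), log_top.std()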

  9. Calibrated task-independent priors p_Θ. Table (optimizer: tunable parameter → cross-task prior): SGD: learning rate → Log-normal(−2.09, 1.312); momentum → U[0, 1]; weight decay → Log-uniform(−5, −1); poly decay (p) → U[0.5, 5]. Adagrad: learning rate → Log-normal(−2.004, 1.20). Adam: learning rate → Log-normal(−2.69, 1.42); β_1, β_2 → 1 − Log-uniform(−5, −1); ε → Log-uniform(−8, 0). Procedure: sample a large number of points and their performance from a large range of admissible values, then take the maximum likelihood estimate (MLE) of the prior's parameters using the top 20% best-performing values from the previous step.
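The table above can be turned into a sampler directly. The sketch below draws one Adam configuration, reading Log-normal(m, s) as 10**Normal(m, s) and Log-uniform(a, b) as 10**Uniform(a, b); that reading is an assumption about the table's notation rather than something stated on the slide.

    import numpy as np

    rng = np.random.default_rng()

    def sample_adam_config():
        # One draw from the calibrated cross-task Adam priors listed above.
        return {
            "lr":    10 ** rng.normal(-2.69, 1.42),
            "beta1": 1.0 - 10 ** rng.uniform(-5, -1),
            "beta2": 1.0 - 10 ** rng.uniform(-5, -1),
            "eps":   10 ** rng.uniform(-8, 0),
        }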

  10. The importance of recipes. Table (optimizer label → tunable parameters): SGD-M^C W^C → SGD(γ, μ = 0.9, λ = 10^−5); SGD-M^C D → SGD(γ, μ = 0.9, λ = 10^−5) + Poly Decay(p); SGD-MW → SGD(γ, μ, λ); Adam-LR → Adam(γ, β_1 = 0.9, β_2 = 0.999, ε = 10^−8); Adam → Adam(γ, β_1, β_2, ε). Notation: SGD(γ, μ, λ) is SGD with learning rate γ, momentum μ, and weight decay coefficient λ; Adagrad(γ) is Adagrad with learning rate γ; Adam(γ, β_1, β_2, ε) is Adam with learning rate γ, momentum parameters β_1, β_2, and normalization parameter ε.
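As a hedged illustration of what these recipes correspond to in a framework such as PyTorch (the label strings and the cfg dictionary are assumptions, and the polynomial decay schedule for SGD-M^C D is omitted for brevity):

    import torch

    def make_optimizer(label, params, cfg):
        # cfg holds the sampled hyperparameters: gamma (l.r.), mu, lam, betas, eps.
        if label == "SGD-McWc":   # SGD(gamma, mu=0.9, lambda=1e-5)
            return torch.optim.SGD(params, lr=cfg["gamma"], momentum=0.9,
                                   weight_decay=1e-5)
        if label == "SGD-MW":     # SGD(gamma, mu, lambda), everything tuned
            return torch.optim.SGD(params, lr=cfg["gamma"], momentum=cfg["mu"],
                                   weight_decay=cfg["lam"])
        if label == "Adam-LR":    # Adam(gamma, beta1=0.9, beta2=0.999, eps=1e-8)
            return torch.optim.Adam(params, lr=cfg["gamma"])  # PyTorch defaults match
        if label == "Adam":       # Adam(gamma, beta1, beta2, eps), everything tuned
            return torch.optim.Adam(params, lr=cfg["gamma"],
                                    betas=(cfg["beta1"], cfg["beta2"]),
                                    eps=cfg["eps"])
        raise ValueError(f"unknown recipe: {label}")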

  11. Performance at a budget. Figure: test accuracy on CIFAR-10 (left) and IMDb LSTM (right) at hyperparameter search budgets of 1, 4, 16, and 64 for Adam-LR, Adam, SGD-M^C W^C, SGD-MW, and SGD-M^C D.

  12. Summarizing our findings. Figure: aggregated relative performance (0.75 to 1.00) versus the number of hyperparameter configurations (budget, 0 to 100) for Adam, Adam-LR, SGD-M^C W^C, and SGD-Decay. Summary statistic: S(o, k) = (1 / |P|) · Σ_{p ∈ P} [ o(k, p) / max_{o′ ∈ O} o′(k, p) ], where o(k, p) denotes the expected performance of optimizer o ∈ O on test problem p ∈ P after k iterations of hyperparameter search.
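A short sketch of how S(o, k) could be computed, assuming a nested dictionary perf[o][p] that maps an optimizer o and a test problem p to a list of expected performances indexed by budget k:

    import numpy as np

    def aggregated_relative_performance(perf, k):
        # S(o, k): normalize each optimizer's expected performance on problem p
        # after k configurations by the best optimizer on p, then average over p.
        optimizers = list(perf)
        problems = list(next(iter(perf.values())))
        return {
            o: float(np.mean([perf[o][p][k] / max(perf[o2][p][k] for o2 in optimizers)
                              for p in problems]))
            for o in optimizers
        }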

  13. Our findings. 1. They support the hypothesis that adaptive gradient methods are easier to tune than non-adaptive methods: the substantial value of adaptive gradient methods, specifically Adam, is their amenability to hyperparameter search.

  14. Our findings. 1. They support the hypothesis that adaptive gradient methods are easier to tune than non-adaptive methods: the substantial value of adaptive gradient methods, specifically Adam, is their amenability to hyperparameter search. 2. Tuning optimizer hyperparameters other than the learning rate becomes more useful as the available tuning budget grows. Even with a relatively large tuning budget, however, tuning only the learning rate of Adam is the safer choice, as it achieves good results with high probability.

  15. THANK YOU. Contact: {prabhu.teja, florian.mai}@idiap.ch
