
Towards efficient automatic end-to-end learning, Frank Hutter



  1. Towards efficient automatic end-to-end learning
     Frank Hutter, University of Freiburg, Germany
     Based on joint work with great students and collaborators (named throughout)
     Frank Hutter: Towards efficient automatic learning

  2. What will this partial learning curve converge to?
     [Plot: validation set accuracy vs. epoch, showing a partial learning curve]

  3. One Problem of Deep Learning
     Performance is very sensitive to many hyperparameters:
     – Architectural choices: units per layer, kernel size, # convolutional layers, # fully connected layers
     – Optimization algorithm, learning rates, momentum, batch normalization, batch sizes, dropout rates, weight decay, …
     – Data augmentation & preprocessing
     [Figure: network diagram with output labels dog, cat, …]

  4. Towards True End-to-end Learning
     Current deep learning practice: an expert chooses the architecture & hyperparameters, then deep learning runs “end-to-end”.
     AutoML, true end-to-end learning: meta-level learning & optimization wraps the learning box, so the whole pipeline is learned end-to-end.

  5. The standard approach: blackbox optimization
     The blackbox optimizer proposes a DNN hyperparameter setting λ; we train the DNN, validate it, and return the validation performance f(λ). Goal: max_λ f(λ).
     Blackbox optimizers: grid search, random search, population-based & evolutionary methods, ..., Bayesian optimization
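The loop on this slide can be sketched with random search, the simplest of the listed blackbox optimizers. This is a minimal illustration, not code from the talk: `toy_validation_accuracy` is a hypothetical stand-in for "train DNN and validate it", and the search space names and bounds are assumptions.

```python
import random


def random_search(f, space, n_trials, seed=0):
    """Maximise a blackbox f over a box-shaped search space by uniform sampling."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one hyperparameter setting; the optimizer only ever sees f(setting).
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = f(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_accuracy(config):
    # Hypothetical stand-in for training and validating a DNN:
    # a smooth surface that peaks at log10_lr = -3, dropout = 0.5.
    return (1.0
            - 0.05 * (config["log10_lr"] + 3.0) ** 2
            - 0.5 * (config["dropout"] - 0.5) ** 2)


# Assumed search space: (lower bound, upper bound) per hyperparameter.
SPACE = {"log10_lr": (-6.0, 0.0), "dropout": (0.0, 0.9)}

best_config, best_score = random_search(toy_validation_accuracy, SPACE, n_trials=200)
print(best_config, round(best_score, 3))
```

Every call to `f` here stands for a full training run, which is why this loop becomes expensive for real DNNs.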

  6. The standard approach: blackbox optimization
     The blackbox optimizer proposes a DNN hyperparameter setting λ; we train the DNN, validate it, and return the validation performance f(λ). Goal: max_λ f(λ).
     Too slow for tuning DNNs, so we need more fine-grained actions

  7. Ways in which we can go beyond blackbox optimization

  8. 1. We can use transfer learning
     Transfer learning from other datasets D: model f(λ, D); this needs a scalable model
     Using Gaussian process models
     • Bardenet et al, ICML 2013: Collaborative hyperparameter tuning
     • Swersky et al, NIPS 2013: Multi-task Bayesian optimization
     • Yogatama & Mann, AISTATS 2014: Efficient transfer learning method for automatic hyperparameter tuning
     Using other models
     • Feurer et al, AAAI 2015: Initializing Bayesian hyperparameter optimization via meta-learning
     • Springenberg et al, NIPS 2016: Bayesian optimization with robust Bayesian neural networks

  9. Example: BO with robust Bayesian neural nets
     [Springenberg, Klein, Falkner, Hutter; NIPS 2016]
     https://github.com/automl/RoBO
     • Well-calibrated uncertainty estimates
     • Scalable for multitask problems

  10. 2. We can reason over cheaper subsets of data
      Large datasets: start from small subsets of size s, with s_min ≤ s ≤ s_max; model f(λ, s); this needs a model that extrapolates well
      Example: SVM error surface, trained on data subsets of size s
      [Figure: four panels for s = s_max/128, s_max/16, s_max/4, s_max; axes log(C) and log(γ)]

  11. Example: Fast Bayesian optimization on large datasets
      [Klein, Falkner, Bartels, Hennig, Hutter, arXiv 2016]
      • Automatically choose dataset size for each evaluation
        – Trading off information gain about global optimum vs. cost
      • Entropy Search
        – Based on a probability distribution of where the minimum lies, p_min
      • Strategy: pick configuration & data size pair (x, s) to maximally decrease entropy of p_min per time spent
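To make "entropy of p_min" concrete, here is a rough Monte Carlo sketch. It is not the method from the cited paper, which uses a Gaussian process surrogate and its own approximations. The independent Gaussian beliefs, the three candidate configurations, and all function names below are assumptions for illustration.

```python
import math
import random


def p_min_from_samples(draws):
    """Estimate p_min: the fraction of posterior draws in which each candidate is the minimiser."""
    n_candidates = len(draws[0])
    counts = [0] * n_candidates
    for draw in draws:
        best = min(range(n_candidates), key=lambda i: draw[i])
        counts[best] += 1
    return [c / len(draws) for c in counts]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def sample_draws(means, std, n_draws, rng):
    # Assumption: independent Gaussian beliefs per candidate.
    # A real surrogate model would give correlated predictions.
    return [[rng.gauss(m, std) for m in means] for _ in range(n_draws)]


rng = random.Random(0)
means = [0.30, 0.25, 0.40]  # hypothetical predicted validation errors

# Wide beliefs: unclear which candidate is best, so p_min is spread out.
h_before = entropy(p_min_from_samples(sample_draws(means, 0.20, 2000, rng)))
# Narrow beliefs, as after an informative evaluation: p_min concentrates.
h_after = entropy(p_min_from_samples(sample_draws(means, 0.01, 2000, rng)))
print(f"entropy of p_min: {h_before:.3f} nats -> {h_after:.3f} nats")
```

Under the strategy on this slide, the entropy decrease expected from evaluating a pair (x, s) would be divided by the predicted cost of that evaluation, so cheap evaluations on small subsets are preferred when they are nearly as informative.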

  12. Example: Fast Bayesian optimization on large datasets
      [Klein, Falkner, Bartels, Hennig, Hutter, under review at AISTATS 2016]
      10x-1000x speedup for SVMs, 5x-10x for DNNs
      https://github.com/automl/RoBO
      [Plot: error vs. budget of optimizer [s]]

  13. 3. We can model the online performance of DNNs
      Graybox optimization: model f(λ, t), where t is time
      Example: DNN learning curves with different hyperparameter settings [plot over time t]
      • Swersky et al, arXiv 2014: Freeze-Thaw Bayesian optimization
      • Domhan et al, AutoML 2014 & IJCAI 2015: Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves

  14. Example: extrapolating learning curves
      [Domhan, Springenberg, Hutter; AutoML 2014 & IJCAI 2015]
      K = 11 parametric models $f_k(t \mid \theta_k)$
      Convex combination of these models: $y_t = \sum_{k=1}^{K} w_k f_k(t \mid \theta_k) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$
      Maximum likelihood fit
      MCMC: to quantify model uncertainty
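The convex combination and its Gaussian likelihood can be written down in a few lines. The sketch below is illustrative only: it uses two simple saturating curve families chosen for this example rather than the 11 models from the cited work, and every parameter value is made up. Fitting (maximum likelihood or MCMC) is left out; the code only evaluates the model and the log-likelihood that such a fit would maximise or sample from.

```python
import math


def pow3(t, c, a, alpha):
    """Saturating power law: approaches c as t grows (illustrative family)."""
    return c - a * t ** (-alpha)


def exp_sat(t, c, a, rate):
    """Saturating exponential: approaches c as t grows (illustrative family)."""
    return c - a * math.exp(-rate * t)


def combined_curve(t, weights, models, thetas):
    """Weighted sum of parametric curves; weights are assumed non-negative and summing to 1."""
    return sum(w * f(t, *theta) for w, f, theta in zip(weights, models, thetas))


def gaussian_log_likelihood(ys, weights, models, thetas, sigma):
    """Log-likelihood of observed accuracies ys (epochs 1..n) under additive N(0, sigma^2) noise."""
    ll = 0.0
    for t, y in enumerate(ys, start=1):
        resid = y - combined_curve(t, weights, models, thetas)
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)
    return ll


MODELS = [pow3, exp_sat]
WEIGHTS = [0.5, 0.5]
THETAS = [(0.8, 0.5, 0.7), (0.9, 0.6, 0.3)]  # made-up parameters

# Noise-free observations for epochs 1..10 generated from the model itself.
observed = [combined_curve(t, WEIGHTS, MODELS, THETAS) for t in range(1, 11)]
ll_true = gaussian_log_likelihood(observed, WEIGHTS, MODELS, THETAS, sigma=0.02)
ll_off = gaussian_log_likelihood(observed, [0.9, 0.1], MODELS, THETAS, sigma=0.02)
print(ll_true > ll_off)
```

Extrapolation then amounts to calling `combined_curve` at a future epoch m for each posterior sample of the weights and parameters.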

  15. Example: extrapolating learning curves
      $P(y_m > y_{best} \mid y_{1:n}) \geq 5\%$: continue training…
      [Plot: validation set accuracy vs. epoch, showing the observed $y_{1:n}$, the predicted $y_m$, and $y_{best}$]

  16. Example: extrapolating learning curves
      $P(y_m > y_{best} \mid y_{1:n}) < 5\%$: Terminate!
      [Plot: validation set accuracy vs. epoch, showing the observed $y_{1:n}$, the predicted $y_m$, and $y_{best}$]
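The continue/terminate rule from the last two slides reduces to a one-line probability estimate once samples of the predicted final accuracy are available. In this sketch the samples are a plain list of numbers; in the cited work they would come from MCMC over the learning-curve model. The function name and the example values are hypothetical.

```python
def should_terminate(predicted_final_accuracies, y_best, threshold=0.05):
    """Return (terminate?, P(y_m > y_best)) estimated from posterior samples of y_m."""
    n = len(predicted_final_accuracies)
    p_improve = sum(y > y_best for y in predicted_final_accuracies) / n
    # Stop the run when beating the best accuracy seen so far is too unlikely.
    return p_improve < threshold, p_improve


samples = [0.70, 0.72, 0.74, 0.76]  # made-up posterior samples of y_m

print(should_terminate(samples, y_best=0.75))  # some chance of improving: keep training
print(should_terminate(samples, y_best=0.80))  # no sample beats y_best: terminate
```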

  17. Example: extrapolating learning curves

  18. Easy to include in a Bayesian neural network
      [Klein, Falkner, Springenberg, Hutter; Bayesian Deep Learning Workshop 2016]

  19. 4. We can change hyperparameters on the fly
      • Common practice: change SGD learning rates over time
      • Can automate & improve with reinforcement learning
        – Daniel et al, AAAI 2016: Learning step size controllers for robust neural network training
        – Hansen, arXiv 2016: Using deep Q-Learning to control optimization hyperparameters
        – Andrychowicz et al, arXiv 2016: Learning to learn by gradient descent by gradient descent
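For reference, the "common practice" mentioned on this slide is typically a hand-designed schedule such as step decay. The sketch below shows one such schedule; the drop factor and interval are arbitrary assumptions, and this is the kind of fixed rule that the cited reinforcement learning approaches aim to replace with a learned controller.

```python
def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Hand-designed schedule: multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


# Learning rate at a few epochs for an assumed initial rate of 0.1.
for epoch in (0, 29, 30, 65):
    print(epoch, step_decay(0.1, epoch))
```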

  20. 5. We can automatically gain scientific insights
      [Hutter, Hoos, Leyton-Brown; ICML 2014]
      One way to inspect the model: functional ANOVA explains performance variation due to each subset of hyperparameters
      [Plots: marginal loss as a function of Hyperparameter 1, Hyperparameter 2, Hyperparameter 3]
      Possible future insights:
      1. How stable are good hyperparameter settings across datasets?
      2. Which hyperparameters need to change as the dataset grows?
      3. Which factors affect empirical convergence rates of SGD?
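The idea behind functional ANOVA can be illustrated by brute force on a small grid: average the loss over all other hyperparameters to get each marginal, then compare the variance of that marginal with the total variance. This is a simplified sketch covering main effects only, with uniform weights over a full grid; the cited paper instead computes marginals efficiently from a fitted model. The toy loss and grid values are assumptions.

```python
from itertools import product
from statistics import mean


def main_effect_fractions(f, grids):
    """Fraction of the variance of f over the full grid explained by each single hyperparameter."""
    names = list(grids)
    configs = [dict(zip(names, values)) for values in product(*grids.values())]
    scores = [f(c) for c in configs]
    f0 = mean(scores)
    total_var = mean((s - f0) ** 2 for s in scores)
    fractions = {}
    for name in names:
        # Marginal: mean score with this hyperparameter fixed, averaged over all others.
        marginals = [
            mean(s for c, s in zip(configs, scores) if c[name] == value)
            for value in grids[name]
        ]
        fractions[name] = mean((m - f0) ** 2 for m in marginals) / total_var
    return fractions


def toy_loss(config):
    # Hypothetical loss: strongly dependent on learning rate, weakly on depth.
    return (config["log10_lr"] + 3) ** 2 + 0.1 * config["depth"]


GRIDS = {"log10_lr": [-5, -4, -3, -2, -1], "depth": [1, 2, 3]}
fractions = main_effect_fractions(toy_loss, GRIDS)
print(fractions)
```

Because the toy loss is additive, the two main effects account for all of the variance; with interacting hyperparameters the remainder would be attributed to higher-order subsets.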

  21. Learning to optimize, to plan, etc.
      • Algorithm configuration often speeds up solvers a lot
        – 500x for software verification [Hutter, Babic, Hoos, Hu, FMCAD 2007]
        – 50x for MIP [Hutter, Hoos, Leyton-Brown, CPAIOR 2011]
        – 100x for finding better domain encoding in AI planning [Vallati, Hutter, Chrpa, McCluskey, IJCAI 2015]
      • Algorithm portfolios won many competitions
        – E.g., SATzilla won SAT competitions 2007, 2009, 2012 (every time we entered) [Xu, Hutter, Hoos, Leyton-Brown, JAIR 2008]
        – E.g., Cedalion won IPC 2014 Planning & Learning Track [Seipp, Sievers, Helmert, Hutter, AAAI 2015]

  22. Conclusion: moving beyond hand-designed learners
      Some ways of making this efficient:
      • Transfer learning: exploit strong priors
      • Exploit cheaper, approximate blackboxes
      • Graybox: partial feedback during evaluation
      • Whitebox: hyperparameter control (RL)
      https://github.com/automl/RoBO
