NeurIPS 2018 Tutorial on Automatic Machine Learning: Learning to Learn
Slides: automl.org/events -> AutoML Tutorial -> Slides
Frank Hutter, University of Freiburg, fh@cs.uni-freiburg.de
Joaquin Vanschoren, Eindhoven University of Technology, j.vanschoren@tue.nl, @joavanschoren
Learning is a never-ending process
• Tasks come and go, but learning is forever
• Learn more effectively: less trial-and-error, less data
[Diagram: a sequence of learning episodes (Task 1, Task 2, Task 3, …); meta-learning carries knowledge between episodes, each of which produces models and performance measurements.]
Learning to learn
• Inductive bias: all assumptions added to the training data to learn effectively
• If prior tasks are similar, we can transfer prior knowledge to new tasks (if not, it may actually harm learning)
[Diagram: prior tasks feed inductive bias into learning on a new task, in the form of prior beliefs, constraints, representations, and model parameters.]
Meta-learning
• Collect meta-data about learning episodes and learn from them
• The meta-learner learns a (base-)learning algorithm, end-to-end
[Diagram: meta-data from prior tasks feeds a meta-learner, which configures the base-learner on the new task to optimize performance.]
Three approaches for increasingly similar tasks
1. Transfer prior knowledge about what generally works well
2. Reason about model performance across tasks
3. Start from models trained earlier on similar tasks
[Diagram: learners and models from Task 1 … Task j feed a meta-learner applied to the new task.]
1. Learning from prior evaluations
• Configurations: settings that uniquely define the model (algorithm, pipeline, neural architecture, hyperparameters, …)
• Similar tasks suit similar configurations
[Diagram: the meta-learner receives configurations (hyperparameters) λ_i and their performances P_{i,j} on prior tasks.]
Top-K recommendation (Leite et al. 2012; Abdulrahman et al. 2018)
• Build a global (multi-objective) ranking, recommend the top-K
• Requires a fixed selection of candidate configurations (portfolio)
• Can be used as a warm start for optimization techniques
[Diagram: evaluations P_{i,j} across tasks yield a global, task-independent, discrete ranking (1. λ_a, 2. λ_b, 3. λ_c, …); the top-K configurations λ_{a..k} warm-start learning on the new task.]
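As a concrete illustration, the global task-independent ranking can be built by aggregating per-task ranks. A minimal sketch with a hypothetical performance matrix P (rows: configurations, columns: prior tasks); mean rank is one simple aggregation choice among many:

```python
# Sketch of top-K recommendation by mean-rank aggregation. P[i][j] is the
# (hypothetical) performance of configuration i on prior task j; mean rank
# across tasks is one simple way to build the global, task-independent ranking.

def average_ranks(P):
    """Order configurations by mean rank across tasks (best first)."""
    n_configs, n_tasks = len(P), len(P[0])
    rank_sums = [0.0] * n_configs
    for j in range(n_tasks):
        order = sorted(range(n_configs), key=lambda i: -P[i][j])
        for rank, i in enumerate(order, start=1):   # rank 1 = best
            rank_sums[i] += rank
    return sorted(range(n_configs), key=lambda i: rank_sums[i])

def top_k(P, k):
    return average_ranks(P)[:k]

P = [[0.9, 0.8],   # configuration 0 on tasks 0 and 1
     [0.5, 0.6],   # configuration 1
     [0.7, 0.9]]   # configuration 2
recommended = top_k(P, k=2)   # configurations 0 and 2 lead the ranking
```

The recommended configurations can then seed any optimizer as a warm start.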
Warm-starting with plugin estimators (Wistuba et al. 2015)
• What if prior configurations are not optimal?
• Per task, fit a differentiable plugin estimator on all evaluated configurations
• Do gradient descent to find optimized configurations, recommend those
[Diagram: per prior task, evaluations (λ_i, P_{i,j}) yield an optimized configuration (task 1: λ*_1, task 2: λ*_2, task 3: λ*_3, …), which warm-starts the new task.]
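A minimal 1-D sketch of the idea with hypothetical data: a smooth Nadaraya-Watson kernel regressor stands in for the differentiable plugin estimator, and gradient ascent on it (with a numeric gradient) produces the optimized configuration to recommend:

```python
import math

# Minimal 1-D sketch: fit a smooth plugin estimator on one prior task's
# evaluated (configuration, performance) pairs, then run gradient ascent on
# it to propose an optimized configuration. All data are hypothetical.

def plugin_estimator(lams, perfs, bandwidth=0.5):
    """Smooth regressor over evaluated configurations, so gradients exist."""
    def f(x):
        w = [math.exp(-((x - l) / bandwidth) ** 2) for l in lams]
        return sum(wi * p for wi, p in zip(w, perfs)) / sum(w)
    return f

def gradient_ascent(f, x0, lr=0.1, steps=200, eps=1e-4):
    x = x0
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)   # numeric gradient
        x += lr * grad
    return x

# Evaluations from one prior task: performance peaks near lambda = 1.0.
lams  = [0.0, 0.5, 1.0, 1.5, 2.0]
perfs = [0.60, 0.75, 0.90, 0.74, 0.55]
f = plugin_estimator(lams, perfs)
lam_star = gradient_ascent(f, x0=0.4)   # climbs toward the peak near 1.0
```

Repeating this per prior task yields the set λ*_1, λ*_2, … used as the warm start.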
Configuration space design (1: van Rijn & Hutter 2018; 2: Probst et al. 2018; 3: Wistuba et al. 2015)
• Functional ANOVA: select hyperparameters that cause variance in the evaluations [1]
• Tunability: improvement from tuning a hyperparameter vs. using a good default [2]
• Search space pruning: exclude regions yielding bad performance on similar tasks [3]
[Diagram: the meta-learner turns prior evaluations P_{i,j} into hyperparameter importances (HP1…HP4), constraints, and priors over the configuration space (λ_1, λ_2).]
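The functional-ANOVA idea can be illustrated with a toy variance decomposition: the fraction of total performance variance explained by each hyperparameter's main effect, marginalizing over the others on a full grid. The grid and performance function below are hypothetical, and real fANOVA also captures interaction effects and uses tree-based surrogates:

```python
from itertools import product
from statistics import mean, pvariance

# Toy main-effect variance decomposition in the spirit of functional ANOVA.
# Hyperparameters whose marginal performance varies a lot are "important"
# and worth keeping in the tuned configuration space.

def main_effect_importance(grid, perf):
    """grid: dict name -> values; perf(config) -> performance."""
    names = list(grid)
    configs = [dict(zip(names, vals)) for vals in product(*grid.values())]
    scores = [perf(c) for c in configs]
    total_var = pvariance(scores)
    importance = {}
    for name in names:
        # Marginal mean performance for each value of this hyperparameter.
        marginals = [mean(s for c, s in zip(configs, scores) if c[name] == v)
                     for v in grid[name]]
        importance[name] = pvariance(marginals) / total_var
    return importance

grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 6]}
# Hypothetical performance: the learning rate matters a lot, depth barely.
perf = lambda c: {0.01: 0.6, 0.1: 0.9, 1.0: 0.5}[c["lr"]] + 0.01 * c["depth"]
imp = main_effect_importance(grid, perf)   # lr explains almost all variance
```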
Active testing (Leite et al. 2012)
• Tasks are similar if the observed relative performance of configurations is similar
• Tournament-style selection; warm-start with the overall best configuration λ_best
• Next candidate λ_c: the one that beats the current λ_best on similar tasks (from a discrete portfolio)
• Relative landmark RL_{a,b,j} compares λ_a and λ_b on task t_j; task similarity is Sim(t_j, t_new) = Corr([RL_{a,b,j}], [RL_{a,b,new}])
• Select the λ_c with the highest RL over λ_best on similar tasks, evaluate it, and update the similarities
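The relative-landmark similarity can be sketched as follows; here a relative landmark is taken to be the performance difference between two configurations on a task (one common choice), and all evaluations are hypothetical:

```python
# Sketch of relative landmarks and the task-similarity measure from active
# testing. RL_{a,b,j} is taken as the performance difference between
# configurations a and b on task j; tasks are similar when these relative
# performances correlate. All evaluations are hypothetical.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def relative_landmarks(P_task):
    """P_task: dict config -> performance. One landmark per config pair."""
    cfgs = sorted(P_task)
    return [P_task[a] - P_task[b]
            for i, a in enumerate(cfgs) for b in cfgs[i + 1:]]

def task_similarity(P_j, P_new):
    return pearson(relative_landmarks(P_j), relative_landmarks(P_new))

P_j   = {"a": 0.9, "b": 0.7, "c": 0.5}   # prior task: a > b > c
P_new = {"a": 0.8, "b": 0.6, "c": 0.3}   # new task ranks configs the same way
sim = task_similarity(P_j, P_new)        # close to 1
```

As more configurations are evaluated on the new task, the landmark vectors grow and the similarity estimates sharpen.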
Bayesian optimization (refresher) (Rasmussen 2014)
• Learns how to learn within a single task (short-term memory)
• Surrogate model: probabilistic regression model of configuration performance
• An acquisition function selects the next configuration λ ∈ Λ to evaluate
• Can we transfer what we learned to new tasks (long-term memory)?
[Diagram: within one task, a surrogate model of P(λ) and an acquisition function over λ ∈ Λ drive the evaluation loop.]
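A schematic of the loop, with a toy surrogate (kernel regression for the mean, distance to the nearest observation as a crude uncertainty) standing in for the usual Gaussian process, and UCB as the acquisition function; the objective and all constants are hypothetical:

```python
import math

# Schematic Bayesian-optimization loop (hypothetical objective). A toy
# surrogate stands in for the usual Gaussian process: kernel regression for
# the mean, distance to the nearest observation as a crude uncertainty.
# The acquisition function is UCB (mean + beta * uncertainty).

def surrogate(X, y, x, bandwidth=0.3):
    """Predict (mean, uncertainty) at x from observations (X, y)."""
    w = [math.exp(-((x - xi) / bandwidth) ** 2) for xi in X]
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    sigma = min(abs(x - xi) for xi in X)   # crude: grows away from the data
    return mean, sigma

def propose(X, y, candidates, beta=2.0):
    def ucb(x):
        mean, sigma = surrogate(X, y, x)
        return mean + beta * sigma
    return max(candidates, key=ucb)        # maximize the acquisition function

def bayes_opt(objective, candidates, n_iter=10):
    X = [candidates[0], candidates[-1]]    # two initial evaluations
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        x = propose(X, y, candidates)
        X.append(x)
        y.append(objective(x))
    return X[y.index(max(y))]              # best configuration found

objective = lambda x: -(x - 0.6) ** 2      # hypothetical; optimum at 0.6
cands = [i / 50 for i in range(51)]
best = bayes_opt(objective, cands)         # lands near 0.6
```

The surrogate and acquisition function are exactly the pieces the following slides transfer across tasks.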
Surrogate model transfer (1: Wistuba et al. 2018; 2: Feurer et al. 2018)
• If task t_j is similar to the new task, its surrogate model S_j will do well on the new task
• Sum up the predictions of all S_j, weighted by task similarity (relative landmarks): S = Σ_j w_j S_j [1]
• Alternatively, build a combined Gaussian process, weighted by each surrogate's current performance on the new task [2]
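The weighted sum S = Σ_j w_j S_j is straightforward to sketch; the per-task surrogates and similarity weights below are hypothetical stand-ins (in practice each S_j is a model fit on task j's evaluations, and w_j comes from relative landmarks):

```python
# Sketch of the weighted surrogate sum S = sum_j w_j * S_j with hypothetical
# per-task surrogates and similarity weights.

def combine_surrogates(surrogates, weights):
    total = sum(weights)
    def S(x):
        return sum(w * s(x) for w, s in zip(weights, surrogates)) / total
    return S

S1 = lambda x: 1 - (x - 0.3) ** 2   # task 1's surrogate: optimum near 0.3
S2 = lambda x: 1 - (x - 0.7) ** 2   # task 2's surrogate: optimum near 0.7
S = combine_surrogates([S1, S2], weights=[0.8, 0.2])  # task 1 more similar

xs = [i / 100 for i in range(101)]
x_best = max(xs, key=S)             # pulled toward the more similar task
```

With weights 0.8/0.2, the combined optimum sits at 0.38, much closer to the similar task's optimum than to the dissimilar one's.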
Warm-started multi-task learning (Perrone et al. 2018)
• Bayesian linear regression (BLR) surrogate model on every task
• Learn a suitable basis expansion φ_z(λ): a joint representation for all tasks
• Scales linearly in the number of observations; transfers information about the configuration space
[Diagram: prior evaluations (λ_i, P_{i,j}) pre-train φ_z(λ) and the BLR hyperparameters, which warm-start a BLR surrogate for Bayesian optimization on the new task.]
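A sketch of the BLR surrogate, with a fixed polynomial basis standing in for the learned φ_z(λ) and the posterior mean weights computed from the regularized normal equations; all evaluations are hypothetical:

```python
# Sketch of a BLR surrogate: a fixed polynomial basis phi stands in for the
# learned joint representation phi_z(lambda); the posterior mean weights
# solve the regularized normal equations (Phi^T Phi + alpha I) w = Phi^T y.
# All evaluations are hypothetical.

def phi(lam):
    return [1.0, lam, lam * lam]   # stand-in for the learned phi_z

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def blr_fit(lams, perfs, alpha=1e-6):
    Phi = [phi(l) for l in lams]
    d = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * p for row, p in zip(Phi, perfs)) for i in range(d)]
    return solve(A, b)

def blr_predict(w, lam):
    return sum(wi * pi for wi, pi in zip(w, phi(lam)))

lams  = [0.0, 0.5, 1.0, 1.5, 2.0]
perfs = [0.5, 0.8, 0.9, 0.8, 0.5]   # exactly 0.5 + 0.8*lam - 0.4*lam^2
w = blr_fit(lams, perfs)            # recovers [0.5, 0.8, -0.4] (approx.)
```

Because only the linear weights are task-specific, fitting is cheap and the shared basis φ_z carries the transferred knowledge.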
Multi-task Bayesian optimization (1: Swersky et al. 2013; 2: Springenberg et al. 2016; 3: Golovin et al. 2017)
• Multi-task Gaussian processes: train the surrogate model on t tasks simultaneously [1]
  - Transfers useful information if the tasks are similar
  - Not very scalable
• Bayesian neural networks as the surrogate model [2]
  - Multi-task and more scalable
• Stacking Gaussian process regressors (Google Vizier) [3]
  - Sequential tasks, each similar to the previous one
  - Transfers a prior based on the residuals of the previous GP
[Figure: independent GP predictions vs. multi-task GP predictions]
Other techniques (1: Ramachandran et al. 2018; 2: Leite et al. 2005; 3: van Rijn et al. 2015)
• Transfer learning with multi-armed bandits [1]
  - View every task as an arm; learn to 'pull' observations from the most similar tasks
  - Reward: accuracy of the configurations recommended based on these observations
• Transfer of learning curves [2,3]
  - Learn a partial learning curve on the new task, then find the best-matching earlier curves
  - Predict the most promising configurations based on those earlier curves
2. Reason about model performance across tasks
• Meta-features m_j: measurable properties of the tasks (number of instances and features, class imbalance, feature skewness, …)
• For a new task, find prior tasks with similar meta-features and recommend the configurations λ_i that performed well there (performances P_{i,j})
Meta-features (1: Vanschoren 2018; 2: Kim et al. 2017)
• Hand-crafted (interpretable) meta-features [1]:
  - Simple: number of instances, features, classes, missing values, outliers, …
  - Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance, …
  - Information-theoretic: class entropy, mutual information, noise-signal ratio, …
  - Model-based: properties of simple models trained on the task
  - Landmarkers: performance of fast algorithms trained on the task
  - Domain-specific task properties
• Learning a joint task representation:
  - Deep metric learning: learn a representation h_mf using a ground-truth distance [2]
  - With a Siamese network: similar tasks get similar representations
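A few of the simple meta-features above, together with a nearest-task lookup, can be sketched as follows (hypothetical datasets with binary 0/1 labels; real systems use many more meta-features and normalize them before computing distances):

```python
from statistics import mean, pstdev

# Sketch of hand-crafted meta-features and a nearest-task lookup by
# (unnormalized) Euclidean distance. Datasets are hypothetical lists of
# (feature_vector, label) pairs with binary 0/1 labels.

def meta_features(dataset):
    X, y = zip(*dataset)
    n, p = len(X), len(X[0])
    class_ratio = sum(y) / n                         # class balance
    mean_std = mean(pstdev(col) for col in zip(*X))  # average feature spread
    return [n, p, class_ratio, mean_std]

def most_similar_task(new_ds, prior_dss):
    m_new = meta_features(new_ds)
    def dist(ds):
        return sum((a - b) ** 2 for a, b in zip(meta_features(ds), m_new))
    return min(range(len(prior_dss)), key=lambda j: dist(prior_dss[j]))

ds_a = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.5, 0.5], 0), ([0.2, 0.8], 1)]
ds_b = [([float(i), float(i % 3)], i % 2) for i in range(60)]
new  = [([0.1, 0.9], 1), ([0.9, 0.1], 0), ([0.4, 0.6], 1),
        ([0.3, 0.7], 0), ([0.6, 0.4], 1)]
j = most_similar_task(new, [ds_a, ds_b])   # ds_a: similar size and scale
```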
Warm-starting from similar tasks (1: Gomes et al. 2012, Reif et al. 2012; 2: Feurer et al. 2015)
• Find the k most similar tasks (via meta-features m_j), warm-start the search with their best configurations λ_i
• Genetic hyperparameter search [1]
• Auto-sklearn: Bayesian optimization with SMAC [2]; scales well to high-dimensional configuration spaces
[Diagram: the best configurations λ_{1..k} from similar tasks initialize genetic optimization or Bayesian optimization on the new task.]
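The warm-start itself reduces to picking the best configuration of each of the k most similar tasks. A sketch with hypothetical evaluations and precomputed similarities (in auto-sklearn these come from meta-feature distances):

```python
# Sketch of warm-starting from the k most similar tasks. P[j] maps each
# candidate configuration to its performance on prior task j; similarities
# are given directly here (hypothetical values).

def warm_start_configs(P, similarities, k):
    """Return the best configuration of each of the k most similar tasks."""
    nearest = sorted(range(len(P)), key=lambda j: -similarities[j])[:k]
    return [max(P[j], key=P[j].get) for j in nearest]

P = [{"cfg_a": 0.9, "cfg_b": 0.6},   # prior task 0
     {"cfg_a": 0.4, "cfg_b": 0.8},   # prior task 1
     {"cfg_a": 0.7, "cfg_b": 0.7}]   # prior task 2
sims = [0.9, 0.8, 0.1]               # tasks 0 and 1 resemble the new task
seeds = warm_start_configs(P, sims, k=2)   # ['cfg_a', 'cfg_b']
```

These seeds then become the initial population for a genetic search or the initial design for Bayesian optimization.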