

  1. Bayesian Optimization and Automated Machine Learning
  Jungtaek Kim (jtkim@postech.ac.kr)
  Machine Learning Group, Department of Computer Science and Engineering, POSTECH, 77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea
  June 12, 2018

  2. Table of Contents
  Bayesian Optimization
    Global Optimization
    Bayesian Optimization
    Background: Gaussian Process Regression
    Acquisition Function
    Synthetic Examples
    bayeso
  Automated Machine Learning
    Automated Machine Learning
    Previous Works
    AutoML Challenge 2018
    Automated Machine Learning for Soft Voting in an Ensemble of Tree-based Classifiers
    AutoML Challenge 2018 Result
  References

  3. Bayesian Optimization

  4. Global Optimization
  From Wikipedia (https://en.wikipedia.org/wiki/Local_optimum)
  ◮ A method to find the global minimum or maximum of a given target function:
    \mathbf{x}^* = \arg\min_{\mathbf{x}} L(\mathbf{x}) \quad \text{or} \quad \mathbf{x}^* = \arg\max_{\mathbf{x}} L(\mathbf{x}).

  5. Target Functions in Bayesian Optimization
  ◮ Usually an expensive black-box function.
  ◮ Unknown functional form and local geometric features such as saddle points, global optima, and local optima.
  ◮ Uncertain function continuity.
  ◮ High-dimensional and mixed-variable domain space.

  6. Bayesian Approach
  ◮ In Bayesian inference, given prior knowledge of the parameters, p(θ | λ), and a likelihood over the dataset conditioned on the parameters, p(D | θ, λ), the posterior distribution is
    p(\theta \mid \mathcal{D}, \lambda) = \frac{p(\mathcal{D} \mid \theta, \lambda) \, p(\theta \mid \lambda)}{p(\mathcal{D} \mid \lambda)} = \frac{p(\mathcal{D} \mid \theta, \lambda) \, p(\theta \mid \lambda)}{\int p(\mathcal{D} \mid \theta, \lambda) \, p(\theta \mid \lambda) \, d\theta},
  where θ is a vector of parameters, D is an observed dataset, and λ is a vector of hyperparameters.
  ◮ Produces an uncertainty estimate as well as a prediction.
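To make the last point concrete, here is a minimal sketch (not from the slides) using a Beta-Bernoulli model, whose posterior is available in closed form; the prior Beta(a, b) plays the role of p(θ | λ), with the hyperparameters a and b standing in for λ.

import numpy as np
from scipy import stats

a, b = 2.0, 2.0                       # hyperparameters lambda of the Beta prior
data = np.array([1, 0, 1, 1, 0, 1])   # observed dataset D (coin flips)

# Conjugacy: the posterior is Beta(a + #heads, b + #tails).
posterior = stats.beta(a + data.sum(), b + (len(data) - data.sum()))

# The posterior yields a prediction *and* an uncertainty estimate.
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))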

  7. Bayesian Optimization
  ◮ A powerful strategy for finding the extrema of objective functions that are expensive to evaluate,
  ◮ where one does not have a closed-form expression for the objective function,
  ◮ but where one can obtain observations at sampled values.
  ◮ Since we do not know the target function, optimize an acquisition function instead of the target function itself.
  ◮ Compute the acquisition function using the outputs of a Bayesian regression model.

  8. Bayesian Optimization
  Algorithm 1: Bayesian Optimization
  Input: Initial data D_{1:I} = {(x_i, y_i)}_{i=1}^{I}.
  1: for t = 1, 2, ... do
  2:   Predict a function f^*(x | D_{1:I+t-1}) considered as the objective function.
  3:   Find x_{I+t} that maximizes an acquisition function, x_{I+t} = arg max_x a(x | D_{1:I+t-1}).
  4:   Sample the true objective function, y_{I+t} = f(x_{I+t}) + ε_{I+t}.
  5:   Update D_{1:I+t} = {D_{1:I+t-1}, (x_{I+t}, y_{I+t})}.
  6: end for
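A minimal Python sketch of this loop, assuming a one-dimensional domain, a grid search over the acquisition function, scikit-learn's Gaussian process as the surrogate, and EI for minimization; the toy objective f is invented for illustration (the slides' own implementation is the bayeso package, not this code).

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                                    # expensive black-box objective (toy stand-in)
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))          # initial data D_{1:I}
y = f(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)

for t in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # surrogate f^*
    mu, sigma = gp.predict(grid, return_std=True)
    sigma = np.maximum(sigma, 1e-12)                            # guard against zero predictive std
    y_best = y.min()                                            # best observation (minimization)
    z = (y_best - mu) / sigma
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)      # expected improvement
    x_next = grid[np.argmax(ei)]                                # arg max of the acquisition
    X = np.vstack([X, x_next])                                  # update D_{1:I+t}
    y = np.append(y, f(x_next))

print("best x:", X[np.argmin(y)], "best y:", y.min())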

  9. Background: Gaussian Process
  ◮ A collection of random variables, any finite number of which have a joint Gaussian distribution. [Rasmussen and Williams, 2006]
  ◮ Generally, a Gaussian process (GP) is written as
    f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')),
  where
    m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})],
    k(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))].
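The "joint Gaussian" definition can be illustrated directly: assuming a zero mean function and a squared-exponential kernel (both choices are mine, not stated on this slide), drawing GP sample paths at a finite set of inputs is just sampling from a multivariate normal.

import numpy as np

def k(x1, x2, sigma_f=1.0, length=1.0):
    # squared-exponential covariance k(x, x') in one dimension
    return sigma_f ** 2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * length ** 2))

x = np.linspace(-3, 3, 50)
m = np.zeros_like(x)                      # m(x) = E[f(x)] = 0
K = k(x, x) + 1e-8 * np.eye(len(x))       # small jitter for numerical stability

samples = np.random.default_rng(0).multivariate_normal(m, K, size=3)
print(samples.shape)                      # three sample functions evaluated at 50 inputs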

  10. Background: Gaussian Process Regression
  [Figure: a one-dimensional Gaussian process regression example, plotting y against x on the interval [-3, 3].]

  11. Background: Gaussian Process Regression
  ◮ One of the basic covariance functions, the squared-exponential covariance function in one dimension:
    k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right) + \sigma_n^2 \delta_{xx'},
  where σ_f is the signal standard deviation, l is the length scale, and σ_n is the noise standard deviation. [Rasmussen and Williams, 2006]
  ◮ Posterior mean function and covariance function:
    \mu_* = K(X_*, X) (K(X, X) + \sigma_n^2 I)^{-1} \mathbf{y},
    \Sigma_* = K(X_*, X_*) - K(X_*, X) (K(X, X) + \sigma_n^2 I)^{-1} K(X, X_*).
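A direct numpy transcription of these two posterior equations (a sketch: the kernel hyperparameters σ_f, l, σ_n and the toy training data are fixed by hand here rather than learned by marginal-likelihood maximization).

import numpy as np

def k(A, B, sigma_f=1.0, length=1.0):
    # squared-exponential covariance matrix K(A, B)
    return sigma_f ** 2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * length ** 2))

sigma_n = 0.1
X = np.array([-2.0, -1.0, 0.5, 2.0])           # training inputs
y = np.sin(X)                                   # observations (toy, noise-free here)
X_star = np.linspace(-3, 3, 100)                # test inputs

K = k(X, X) + sigma_n ** 2 * np.eye(len(X))     # K(X, X) + sigma_n^2 I
K_s = k(X_star, X)                              # K(X_*, X)

mu_star = K_s @ np.linalg.solve(K, y)                              # posterior mean
Sigma_star = k(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)   # posterior covariance
std_star = np.sqrt(np.clip(np.diag(Sigma_star), 0.0, None))        # pointwise uncertainty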

  12. Background: Gaussian Process Regression
  ◮ If a non-zero mean prior is given, the posterior mean and covariance functions are
    \mu_* = \mu(X_*) + K(X_*, X) (K(X, X) + \sigma_n^2 I)^{-1} (\mathbf{y} - \mu(X)),
    \Sigma_* = K(X_*, X_*) - K(X_*, X) (K(X, X) + \sigma_n^2 I)^{-1} K(X, X_*).

  13. Acquisition Functions
  ◮ A function that acquires the next point to evaluate for an expensive black-box function.
  ◮ Traditionally, the probability of improvement (PI) [Kushner, 1964], the expected improvement (EI) [Mockus et al., 1978], and the GP upper confidence bound (GP-UCB) [Srinivas et al., 2010] are used.
  ◮ Several functions, such as entropy search [Hennig and Schuler, 2012] and a combination of existing functions [Kim and Choi, 2018b], have recently been proposed.

  14. Traditional Acquisition Functions (Minimization Case)
  ◮ PI [Kushner, 1964]:
    a_{\text{PI}}(\mathbf{x} \mid \mathcal{D}, \lambda) = \Phi(Z),
  ◮ EI [Mockus et al., 1978]:
    a_{\text{EI}}(\mathbf{x} \mid \mathcal{D}, \lambda) = \begin{cases} (f(\mathbf{x}^+) - \mu(\mathbf{x})) \Phi(Z) + \sigma(\mathbf{x}) \phi(Z) & \text{if } \sigma(\mathbf{x}) > 0 \\ 0 & \text{if } \sigma(\mathbf{x}) = 0, \end{cases}
  ◮ GP-UCB [Srinivas et al., 2010]:
    a_{\text{UCB}}(\mathbf{x} \mid \mathcal{D}, \lambda) = -\mu(\mathbf{x}) + \beta \sigma(\mathbf{x}),
  where
    Z = \begin{cases} \frac{f(\mathbf{x}^+) - \mu(\mathbf{x})}{\sigma(\mathbf{x})} & \text{if } \sigma(\mathbf{x}) > 0 \\ 0 & \text{if } \sigma(\mathbf{x}) = 0, \end{cases}
    \mu(\mathbf{x}) := \mu(\mathbf{x} \mid \mathcal{D}, \lambda), \quad \sigma(\mathbf{x}) := \sigma(\mathbf{x} \mid \mathcal{D}, \lambda).
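The three acquisition functions above, written as plain numpy functions for the minimization case; mu and sigma are arrays of posterior means and standard deviations at candidate points, f_best stands for f(x^+), and the value of beta for GP-UCB is a free choice (the default here is only illustrative).

import numpy as np
from scipy.stats import norm

def _z(mu, sigma, f_best):
    # Z = (f(x+) - mu(x)) / sigma(x) where sigma(x) > 0, else 0
    return np.divide(f_best - mu, sigma, out=np.zeros_like(mu, dtype=float), where=sigma > 0)

def acq_pi(mu, sigma, f_best):
    # probability of improvement: Phi(Z)
    return norm.cdf(_z(mu, sigma, f_best))

def acq_ei(mu, sigma, f_best):
    # expected improvement: (f(x+) - mu) Phi(Z) + sigma phi(Z), and 0 where sigma = 0
    z = _z(mu, sigma, f_best)
    return np.where(sigma > 0, (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z), 0.0)

def acq_ucb(mu, sigma, beta=2.0):
    # GP-UCB for minimization, written so that larger acquisition values are better
    return -mu + beta * sigma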

  15. Synthetic Examples
  [Figure 1: six panels, Iterations 1-6, each showing the objective y and the corresponding acquisition function over x ∈ [-5, 5] for y = 4.0 cos(x) + 0.1 x + 2.0 sin(x) + 0.4 (x - 0.5)^2. EI is used to optimize.]

  16. bayeso
  ◮ Simple, but essential Bayesian optimization package.
  ◮ Written in Python.
  ◮ Licensed under the MIT license.
  ◮ https://github.com/jungtaekkim/bayeso

  17. Automated Machine Learning

  18. Automated Machine Learning
  ◮ Attempts to find the optimal machine learning model automatically, without human intervention.
  ◮ Usually includes feature transformation, algorithm selection, and hyperparameter optimization.
  ◮ Given a training dataset D_train and a validation dataset D_val, the optimal hyperparameter vector λ^* for an automated machine learning system is
    \lambda^* = \text{AutoML}(\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{val}}, \Lambda),
  where AutoML is an automated machine learning system and λ ∈ Λ; a minimal sketch of this formulation follows.
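In the sketch below, the AutoML system is simply an optimizer over a hyperparameter space Λ scored on held-out validation data. Random search over a random forest stands in for the Bayesian optimization and model family used in the slides, purely to keep the sketch short; all names and values here are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

def validation_score(lam):
    # score a single hyperparameter vector lambda on D_val
    model = RandomForestClassifier(n_estimators=lam["n_estimators"],
                                   max_depth=lam["max_depth"],
                                   random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# a small candidate space Lambda, sampled at random
rng = np.random.default_rng(0)
Lambda = [{"n_estimators": int(n), "max_depth": int(d)}
          for n, d in zip(rng.integers(10, 200, 20), rng.integers(2, 10, 20))]

lam_star = max(Lambda, key=validation_score)   # lambda^* = AutoML(D_train, D_val, Lambda)
print(lam_star)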

  19. Previous Works
  ◮ Bayesian optimization and hyperparameter optimization
    ◮ GPyOpt [The GPyOpt authors, 2016]
    ◮ SMAC [Hutter et al., 2011]
    ◮ BayesOpt [Martinez-Cantin, 2014]
    ◮ bayeso
    ◮ SigOpt API [Martinez-Cantin et al., 2018]
  ◮ Automated machine learning frameworks
    ◮ auto-sklearn [Feurer et al., 2015]
    ◮ Auto-WEKA [Thornton et al., 2013]
    ◮ Our previous work [Kim et al., 2016]

  20. AutoML Challenge 2018
  ◮ Two phases: a feedback phase and an AutoML challenge phase.
  ◮ In the feedback phase, five datasets for binary classification are provided.
  ◮ Given training/validation/test datasets, after submitting a code or prediction file, the validation measure is posted on the leaderboard.
  ◮ In the AutoML challenge phase, challenge winners are determined by comparing a normalized area under the ROC curve (AUC) metric on blind datasets:
    \text{Normalized AUC} = 2 \cdot \text{AUC} - 1.
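The challenge metric in code, as a thin wrapper around scikit-learn's AUC (a sketch; the function name is mine).

from sklearn.metrics import roc_auc_score

def normalized_auc(y_true, y_score):
    # rescales AUC from [0, 1] to [-1, 1], so random guessing scores near 0
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

# e.g., a ranking that places all positives above all negatives gives 1.0
print(normalized_auc([0, 1, 1, 0, 1], [0.1, 0.9, 0.8, 0.3, 0.7]))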

  21. AutoML Challenge 2018
  Figure 2: Datasets of the feedback phase in AutoML Challenge 2018. Train. #, Valid. #, Test #, Feature #, Chrono., and Budget stand for training dataset size, validation dataset size, test dataset size, the number of features, chronological order, and time budget, respectively. Time budgets are given in seconds.

  22. Background: Soft Majority Voting
  ◮ An ensemble method to construct a classifier using a majority vote of k base classifiers.
  ◮ Class assignment of the soft majority voting classifier:
    c_i = \arg\max \left( \sum_{j=1}^{k} w_j \, \mathbf{p}_i^{(j)} \right) \quad \text{for } 1 \le i \le n,
  where n is the number of instances, arg max returns the index of the maximum value in the given vector, w_j ∈ ℝ_{≥0} is a weight of base classifier j, and p_i^{(j)} is the class probability vector of base classifier j for instance i. A numpy sketch of this rule follows.
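The class-assignment rule in numpy: stack the per-classifier class probability vectors p_i^{(j)}, weight them, and take the arg max per instance (a sketch; the array layout (k, n, n_classes) and the function name are my own choices).

import numpy as np

def soft_vote(probas, weights):
    # probas: (k, n, n_classes) class probabilities from k base classifiers
    # weights: (k,) non-negative classifier weights w_j
    probas = np.asarray(probas)
    weights = np.asarray(weights, dtype=float)
    weighted = np.tensordot(weights, probas, axes=1)   # sum_j w_j p^(j), shape (n, n_classes)
    return np.argmax(weighted, axis=1)                 # c_i for each instance i

# e.g., two base classifiers, three instances, two classes
probas = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
          [[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]]
print(soft_vote(probas, weights=[0.3, 0.7]))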

  23. Our AutoML System [Kim and Choi, 2018a]
  [Figure 3: Our automated machine learning system. A voting classifier constructed from three tree-based classifiers (gradient boosting, extra-trees, and random forests classifiers) produces predictions, where the voting classifier and the tree-based classifiers are iteratively optimized by Bayesian optimization within the given time budget.]

  24. Our AutoML System [Kim and Choi, 2018a]
  ◮ Written in Python.
  ◮ Use scikit-learn and our own Bayesian optimization package.
  ◮ Split the training dataset into training (0.6) and validation (0.4) sets for Bayesian optimization.
  ◮ Optimize six hyperparameters (a sketch of the resulting model appears below):
    1. extra-trees classifier weight / gradient boosting classifier weight for the voting classifier,
    2. random forests classifier weight / gradient boosting classifier weight for the voting classifier,
    3. the number of estimators for the gradient boosting classifier,
    4. the number of estimators for the extra-trees classifier,
    5. the number of estimators for the random forests classifier,
    6. maximum depth of the gradient boosting classifier.
  ◮ Use GP-UCB.
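A sketch of the model being tuned, not the authors' exact code: scikit-learn's soft VotingClassifier over the three tree-based classifiers, parameterized by the six hyperparameters above, with the two weight ratios expressed relative to a gradient boosting weight fixed at 1.0 (that normalization and the example values are assumptions).

from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

def build_voting_classifier(w_et, w_rf, n_gb, n_et, n_rf, gb_max_depth):
    # the six hyperparameters map onto the ensemble as follows
    return VotingClassifier(
        estimators=[
            ("gb", GradientBoostingClassifier(n_estimators=n_gb, max_depth=gb_max_depth)),
            ("et", ExtraTreesClassifier(n_estimators=n_et)),
            ("rf", RandomForestClassifier(n_estimators=n_rf)),
        ],
        voting="soft",
        weights=[1.0, w_et, w_rf],   # hyperparameters 1 and 2 as ratios w.r.t. gradient boosting
    )

# Bayesian optimization (GP-UCB in the slides) would propose the six values;
# a fixed configuration is shown here only to make the interface concrete.
model = build_voting_classifier(w_et=0.8, w_rf=1.2, n_gb=100, n_et=200, n_rf=200, gb_max_depth=3)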
