Gradient Boosted Regression Trees
Peter Prettenhofer (@pprett), DataRobot
Gilles Louppe (@glouppe), Université de Liège, Belgium
Motivation
About us

Peter
• @pprett
• Python & ML ∼ 6 years
• sklearn dev since 2010

Gilles
• @glouppe
• PhD student (Liège, Belgium)
• sklearn dev since 2011
• Chief tree hugger
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in scikit-learn 4 Case Study: California housing
Machine Learning 101

• Data comes as...
  • A set of examples {(x_i, y_i) | 0 ≤ i < n_samples}, with
  • Feature vector x ∈ R^n_features, and
  • Response y ∈ R (regression) or y ∈ {−1, 1} (classification)
• Goal is to...
  • Find a function ŷ = f(x)
  • Such that the error L(y, ŷ) on new (unseen) x is minimal
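A minimal sketch of this setup in scikit-learn (the toy data, held-out split, and choice of estimator are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))                   # feature vectors x
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)   # responses y

X_train, X_test = X[:80], X[80:]                        # keep some examples unseen
y_train, y_test = y[:80], y[80:]

f = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)  # find y_hat = f(x)
y_hat = f.predict(X_test)
print(np.mean((y_test - y_hat) ** 2))                   # squared error L(y, y_hat) on unseen x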
Classification and Regression Trees [Breiman et al., 1984]

Figure: a depth-3 regression tree on the California housing data; internal nodes split on MedInc, AveRooms, and AveOccup (the root splits on MedInc <= 5.04), and the eight leaves predict values between 1.16 and 4.57.

sklearn.tree.DecisionTreeClassifier|Regressor
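A hedged sketch of how a tree like the one in the figure could be grown (assuming sklearn.datasets.fetch_california_housing is available; the exact splits may differ slightly from the slide):

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

data = fetch_california_housing()
X, y = data.data, data.target

# depth-3 tree, comparable to the 8-leaf tree shown on the slide
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# inspect the root split; expected to be on MedInc near 5, as in the figure
print(data.feature_names[tree.tree_.feature[0]], tree.tree_.threshold[0])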
Function approximation with Regression Trees

Figure: regression trees of increasing depth (max_depth = 1, 3, 20) fit to noisy 1-d data, plotted against the ground truth.
Function approximation with Regression Trees

Deprecated
• Nowadays seldom used alone
• Ensembles: Random Forest, Bagging, or Boosting (see sklearn.ensemble; a short sketch follows below)

Figure: same plot as before — single regression trees of max_depth 1, 3, and 20 against the ground truth.
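A minimal sketch of the ensemble alternatives named above (the dataset and parameters are illustrative choices):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, random_state=0)

# compare the three ensemble flavours on a held-out slice of the data
for est in (RandomForestRegressor(n_estimators=100, random_state=0),
            BaggingRegressor(n_estimators=100, random_state=0),
            GradientBoostingRegressor(n_estimators=100, random_state=0)):
    est.fit(X[:800], y[:800])
    print(est.__class__.__name__, est.score(X[800:], y[800:]))  # R^2 on held-out data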
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in scikit-learn 4 Case Study: California housing
Gradient Boosted Regression Trees

Advantages
• Heterogeneous data (features measured on different scales)
• Supports different loss functions (e.g. Huber)
• Automatically detects (non-linear) feature interactions

Disadvantages
• Requires careful tuning
• Slow to train (but fast to predict)
• Cannot extrapolate
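As a hedged illustration of the loss-function flexibility, a robust Huber loss can be requested directly on the estimator (the toy data with injected outliers is illustrative; older scikit-learn releases named the other losses 'ls'/'lad' instead of 'squared_error'/'absolute_error', but 'huber' is valid throughout):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
y[::20] += 5                                    # a few gross outliers

# Huber loss is less sensitive to the outliers than squared error
est = GradientBoostingRegressor(loss='huber', n_estimators=100).fit(X, y)
print(est.train_score_[-1])                     # training loss of the last boosting stage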
Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]
• Ensemble: each member is an expert on the errors of its predecessor
• Iteratively re-weights training examples based on errors

Figure: decision boundary of an AdaBoost ensemble over successive boosting iterations on a 2-d toy dataset (features x0, x1).

sklearn.ensemble.AdaBoostClassifier|Regressor
Boosting

AdaBoost [Y. Freund & R. Schapire, 1995] — huge success
• Viola-Jones Face Detector (2001)
• Freund & Schapire won the Gödel Prize 2003
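A minimal usage sketch (the dataset and parameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# each additional weak learner focuses on the examples its predecessors got wrong
ada = AdaBoostClassifier(n_estimators=100).fit(X[:800], y[:800])
print(ada.score(X[800:], y[800:]))   # accuracy on held-out data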
Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting
• ⇒ Generalization of boosting to arbitrary loss functions

Residual fitting

Figure: the ground truth on a 1-d toy problem is approximated as a sum of regression trees — tree 1 + tree 2 + tree 3, each fit to the residuals left by its predecessors (see the sketch below).

sklearn.ensemble.GradientBoostingClassifier|Regressor
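A hedged, from-scratch sketch of the residual-fitting idea (squared loss, no shrinkage; a didactic toy, not scikit-learn's actual implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(2, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

prediction = np.full_like(y, y.mean())    # start from a constant model
trees = []
for _ in range(3):
    residual = y - prediction             # negative gradient of squared loss (up to a factor)
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    prediction += tree.predict(X)         # each tree corrects its predecessors

print(np.mean((y - prediction) ** 2))     # training error shrinks as trees are added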
Functional Gradient Descent

Least Squares Regression
• Squared loss: L(y_i, f(x_i)) = (y_i − f(x_i))^2
• The residual ∼ the (negative) gradient ∂L(y_i, f(x_i)) / ∂f(x_i)

Steepest Descent
• Regression trees approximate the (negative) gradient
• Each tree is a successive gradient descent step

Figure: regression losses (squared error, absolute error, Huber) plotted against y − f(x), and classification losses (zero-one, log loss, exponential) plotted against y · f(x).
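Spelling out the residual–gradient correspondence for the squared loss (a one-line derivation, not on the original slide):

-\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}
  = -\frac{\partial}{\partial f(x_i)} \bigl(y_i - f(x_i)\bigr)^2
  = 2\,\bigl(y_i - f(x_i)\bigr)

So the negative gradient is, up to the factor 2, exactly the residual y_i − f(x_i) that each new tree is fit to.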
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in scikit-learn 4 Case Study: California housing
GBRT in scikit-learn

How to use it

>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0]  # class probabilities
array([ 0.67,  0.33])

Implementation
• Written in pure Python/Numpy (easy to extend).
• Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
• Custom node splitter that uses pre-sorting (better for shallow trees).
Example

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)

Figure: staged GBRT predictions (max_depth=1) against single regression trees (max_depth 1 and 3) and the ground truth on the 1-d toy data — a single shallow tree has high bias / low variance, while the accumulated GBRT stages move towards low bias / high variance.
Model complexity & Overfitting

test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

Figure: train and test error versus n_estimators — training error keeps decreasing while test error flattens out; the lowest test error and the growing train-test gap are marked.
Regularization

GBRT provides a number of knobs to control overfitting
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting
Regularization: Tree structure
• The max_depth of the trees controls the degree of feature interactions
• Use min_samples_leaf to ensure a sufficient number of samples per leaf
Regularization: Shrinkage
• Slow learning by shrinking tree predictions with 0 < learning_rate <= 1
• A lower learning_rate requires a higher n_estimators

Figure: train and test error versus n_estimators for the default setting and for learning_rate=0.1 — the shrunken model requires more trees but reaches a lower test error.
Regularization: Stochastic Gradient Boosting
• Samples: random subset of the training set (subsample)
• Features: random subset of features (max_features)
• Improved accuracy – reduced runtime

Figure: train and test error versus n_estimators — subsample=0.5 alone does poorly, but combined with learning_rate=0.1 it reaches an even lower test error.
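A combined sketch of the three kinds of regularization knobs from the last three slides (the dataset and values are illustrative, not tuned):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, random_state=0)

est = GradientBoostingRegressor(
    n_estimators=1000,
    max_depth=4,            # tree structure: degree of feature interactions
    min_samples_leaf=9,     # tree structure: enough samples per leaf
    learning_rate=0.1,      # shrinkage: slower learning, needs more trees
    subsample=0.5,          # stochastic GB: random subset of samples per tree
    max_features=0.3,       # stochastic GB: random subset of features per split
    random_state=0,
).fit(X[:1600], y[:1600])

print(est.score(X[1600:], y[1600:]))   # R^2 on held-out data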
Hyperparameter tuning

1. Set n_estimators as high as possible (e.g. 3000)
2. Tune the other hyperparameters via grid search:

from sklearn.grid_search import GridSearchCV
param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}
est = GradientBoostingRegressor(n_estimators=3000)
gs_cv = GridSearchCV(est, param_grid).fit(X, y)
# best hyperparameter setting
gs_cv.best_params_

3. Finally, set n_estimators even higher and re-tune learning_rate.
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in scikit-learn 4 Case Study: California housing
Case Study

California Housing dataset
• Predict log(medianHouseValue)
• Block groups in the 1990 census
• 20,640 groups with 8 features (median income, median age, lat, lon, ...)
• Evaluation: mean absolute error on an 80/20 split

Challenges
• Heterogeneous features
• Non-linear interactions
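A hedged sketch of this setup (assuming sklearn.datasets.fetch_california_housing and a recent scikit-learn where train_test_split lives in sklearn.model_selection; the hyperparameters are illustrative, not the values used in the talk):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X, y = data.data, np.log(data.target)          # predict log(medianHouseValue)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)       # 80/20 split

est = GradientBoostingRegressor(n_estimators=500, max_depth=4,
                                learning_rate=0.1, min_samples_leaf=9,
                                random_state=0).fit(X_train, y_train)
print(mean_absolute_error(y_test, est.predict(X_test)))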
Predictive accuracy & runtime

         Train time [s]   Test time [ms]   MAE
Mean                  -                -   0.4635
Ridge             0.006             0.11   0.2756
SVR                28.0          2000.00   0.1888
RF                 26.3           605.00   0.1620
GBRT              192.0           439.00   0.1438

Figure: train and test error of the GBRT model versus n_estimators (up to 3000).
Model interpretation

Which features are important?

>>> est.feature_importances_
array([ 0.01,  0.38, ...])

Figure: bar chart of relative feature importances for MedInc, AveRooms, Longitude, AveOccup, Latitude, AveBedrms, Population, and HouseAge.
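A short sketch of how such a chart could be produced (the quick model fit here is illustrative; in practice you would plot the importances of the tuned case-study model):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor

data = fetch_california_housing()
est = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(
    data.data, np.log(data.target))

# sort features by the importance the fitted model assigned to them
order = np.argsort(est.feature_importances_)
plt.barh(np.arange(len(order)), est.feature_importances_[order])
plt.yticks(np.arange(len(order)), np.array(data.feature_names)[order])
plt.xlabel('Relative importance')
plt.show()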