Bootstrap: example

Recall: a linear model with nonlinear Gaussian bases, $\phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)$, fit to N = 100 points using B = 500 bootstrap samples. The red lines are the 5% and 95% quantiles: the bootstrap also gives a measure of uncertainty of the predictions (for each test point we take these quantiles across the bootstrap model predictions).

# Phi: N x D
# Phi_test: Nt x D
# y: N
# ws: B x D, from previous code
y_hats = np.zeros((B, Nt))
for b in range(B):
    wb = ws[b, :]
    y_hats[b, :] = np.dot(Phi_test, wb)

# get 5% and 95% quantiles
y_5 = np.quantile(y_hats, .05, axis=0)
y_95 = np.quantile(y_hats, .95, axis=0)
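The snippet above assumes `ws` (the B x D matrix of bootstrapped weight vectors) comes from earlier code. A minimal sketch of how it could be produced; the synthetic data, Gaussian-basis construction, and the least-squares solve via np.linalg.lstsq are assumptions for illustration, not necessarily the lecture's exact code.

import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for the slide's data and Gaussian-basis design matrix
N, D, B = 100, 10, 500
x = rng.uniform(-3, 3, size=N)
y = np.sin(x) + 0.3 * rng.standard_normal(N)
mus, s = np.linspace(-3, 3, D), 1.0
Phi = np.exp(-(x[:, None] - mus[None, :]) ** 2 / s ** 2)   # N x D

ws = np.zeros((B, D))
for b in range(B):
    idx = rng.choice(N, size=N, replace=True)              # bootstrap sample (with replacement)
    ws[b] = np.linalg.lstsq(Phi[idx], y[idx], rcond=None)[0]  # least-squares fit on that sample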
Bagging

Use the bootstrap for more accurate prediction (not just uncertainty).

Variance of a sum of random variables:
$$\mathrm{Var}(z_1 + z_2) = E[(z_1 + z_2)^2] - E[z_1 + z_2]^2$$
$$= E[z_1^2 + z_2^2 + 2 z_1 z_2] - (E[z_1] + E[z_2])^2$$
$$= E[z_1^2] + E[z_2^2] + E[2 z_1 z_2] - E[z_1]^2 - E[z_2]^2 - 2 E[z_1] E[z_2]$$
$$= \mathrm{Var}(z_1) + \mathrm{Var}(z_2) + 2\,\mathrm{Cov}(z_1, z_2)$$
For uncorrelated variables the covariance term is zero.
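A quick numeric check of this identity (an illustrative sketch, not from the slides): for a correlated pair the empirical variance of the sum matches Var + Var + 2 Cov, and for an independent pair the covariance term is approximately zero.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# correlated pair: z2 shares part of its signal with z1
z1 = rng.standard_normal(n)
z2 = 0.5 * z1 + rng.standard_normal(n)
lhs = np.var(z1 + z2)
rhs = np.var(z1) + np.var(z2) + 2 * np.cov(z1, z2)[0, 1]
print(lhs, rhs)                                   # both close to 3.25

# independent pair: covariance term vanishes
z3 = rng.standard_normal(n)
print(np.var(z1 + z3), np.var(z1) + np.var(z3))   # both close to 2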
Bagging (continued)

The average of uncorrelated random variables has lower variance: if $z_1, \ldots, z_B$ are uncorrelated random variables with mean $\mu$ and variance $\sigma^2$, the average $\bar{z} = \frac{1}{B}\sum_b z_b$ has mean $\mu$ and variance
$$\mathrm{Var}(\bar{z}) = \frac{1}{B^2} \sum_b \mathrm{Var}(z_b) = \frac{1}{B^2} B \sigma^2 = \frac{1}{B}\sigma^2.$$
Use this to reduce the variance of our models (the bias remains the same).
Regression: average the model predictions, $\hat{f}(x) = \frac{1}{B}\sum_b \hat{f}_b(x)$; see the sketch below.
Issue: the model predictions are not uncorrelated (they are trained using the same data).
Bagging (bootstrap aggregation): use bootstrap samples to reduce the correlation.
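A minimal sketch of bagged regression in the spirit of the slide, continuing the bootstrap example above (so `ws` and `Phi_test` are assumed to already exist): the bagged prediction is simply the average of the per-bootstrap predictions.

import numpy as np

# ws: B x D bootstrap weight vectors, Phi_test: Nt x D (from the earlier example)
y_hats = ws @ Phi_test.T            # B x Nt, one row of predictions per bootstrap model
y_bagged = y_hats.mean(axis=0)      # bagged prediction: average over the B models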
Bagging for classification

Averaging makes sense for regression; how about classification?

Wisdom of crowds: if $z_1, \ldots, z_B \in \{0, 1\}$ are IID Bernoulli random variables with mean $\mu = .5 + \epsilon$ for $\epsilon > 0$, then $p(\bar{z} > .5)$, where $\bar{z} = \frac{1}{B}\sum_b z_b$, goes to 1 as B grows (see the simulation below).
The mode of IID classifiers that are each better than chance is a better classifier: use voting.
Crowds are wiser when individuals are better than random and the votes are uncorrelated.
Bagging (bootstrap aggregation): use bootstrap samples to reduce the correlation.
Bagging decision trees

Example setup: a synthetic dataset with 5 correlated features; the 1st feature is a noisy predictor of the label.
Bootstrap samples create different decision trees (due to the high variance of trees); compared to a single decision tree, the bagged ensemble is no longer interpretable.
To combine the B trees for classification we can either vote for the most probable class or average the predicted probabilities, as sketched below.
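A sketch of bagging decision trees by hand, with scikit-learn trees as the base learners (the synthetic dataset here only loosely mimics the slide's setup); it shows both aggregation rules: majority vote over predicted classes and averaging the predicted probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=2, flip_y=0.2, random_state=0)

B, N = 50, X.shape[0]
probs = np.zeros((B, N, 2))
preds = np.zeros((B, N), dtype=int)
for b in range(B):
    idx = rng.choice(N, size=N, replace=True)            # bootstrap sample
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    probs[b] = tree.predict_proba(X)                     # per-tree class probabilities
    preds[b] = tree.predict(X)                           # per-tree hard predictions

y_vote = (preds.mean(axis=0) > 0.5).astype(int)          # majority vote over trees
y_avgp = probs.mean(axis=0).argmax(axis=1)               # average the probabilities, then argmax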
Random forests

Further reduce the correlation between the decision trees.
Feature sub-sampling: only a random subset of features is available for the split at each step, further reducing the dependence between the trees.
How large should the subset be? A common heuristic is around $\sqrt{D}$ features; this is a hyper-parameter and can be optimized using CV.
Out-Of-Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation, giving simultaneous validation of the decision trees in a forest with no need to set aside data for cross-validation (see the sketch below).
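A sketch of the two random-forest ingredients in scikit-learn: feature sub-sampling at each split via max_features and OOB validation via oob_score=True (the dataset is a synthetic placeholder).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# max_features="sqrt": roughly sqrt(D) features are considered at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print("OOB accuracy:", rf.oob_score_)   # validation "for free", no held-out set needed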
Example: spam detection

Dataset: N = 4601 emails; binary classification task: spam vs. not spam; D = 57 features:
- 48 words: percentage of words in the email that match a given word, e.g., business, address, internet, free, george (customized per user)
- 6 characters: percentage of characters that match a given character: ch; ch( ch[ ch! ch$ ch#
- average, max, and sum of the lengths of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT
This is an example of feature engineering. (The table on the slide compares the average value of these features in spam and non-spam emails.)
Example: spam detection (continued)

Decision tree after pruning: the number of leaves (17) in the optimal pruning is decided based on the cross-validation error. (The figure shows CV error and test error, i.e. misclassification rate on test data, as a function of tree size.)
Example: spam detection (continued)

Bagging and random forests do much better than a single decision tree.
The Out-Of-Bag (OOB) error can be used for parameter tuning (e.g., the size of the forest), as sketched below.
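A sketch of using the OOB error to pick the forest size, as suggested above (synthetic data; warm_start=True lets the same forest grow incrementally instead of refitting from scratch).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n_trees in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)                          # adds new trees, keeps the existing ones
    print(n_trees, 1 - rf.oob_score_)     # OOB error as the forest grows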
Summary so far...

- Bootstrap is a powerful technique to get uncertainty estimates.
- Bootstrap aggregation (bagging) can reduce the variance of unstable models.
- Random forests: bagging + further de-correlation by sub-sampling features at each split.
  - OOB validation instead of CV.
  - They destroy the interpretability of decision trees.
  - They perform well in practice.
  - They can fail if only a few relevant features exist (due to feature sub-sampling).
Adaptive bases

Several methods can be classified as learning these bases adaptively: $f(x) = \sum_d w_d\, \phi_d(x; v_d)$
- decision trees
- generalized additive models
- boosting
- neural networks
In boosting each basis is a classifier or regression function (a weak learner, or base learner); we create a strong learner by sequentially combining weak learners.
Forward stagewise additive modelling

Model: $f(x) = \sum_{t=1}^{T} w^{\{t\}}\, \phi(x; v^{\{t\}})$, where $\phi$ is a simple model, such as a decision stump (a decision tree with a single split).
Cost: $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L(y^{(n)}, f(x^{(n)}))$; so far we have seen the L2 loss, log loss and hinge loss. Optimizing this cost jointly is difficult given the form of f.
Optimization idea: add one weak learner at each stage t to reduce the error of the previous stage.
1. Find the best weak learner: $v^{\{t\}}, w^{\{t\}} = \arg\min_{v, w} \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w\,\phi(x^{(n)}; v)\big)$
2. Add it to the current model: $f^{\{t\}}(x) = f^{\{t-1\}}(x) + w^{\{t\}}\,\phi(x; v^{\{t\}})$
L2 loss & forward stagewise linear model

Model: consider weak learners that are individual features, $\phi^{\{t\}}(x) = w^{\{t\}} x_{d^{\{t\}}}$.
Cost: using the L2 loss for regression, at stage t we solve
$$\arg\min_{d,\, w_d} \frac{1}{2}\sum_{n=1}^{N}\Big(y^{(n)} - \big(f^{\{t-1\}}(x^{(n)}) + w_d x_d^{(n)}\big)\Big)^2,$$
where $r^{(n)} = y^{(n)} - f^{\{t-1\}}(x^{(n)})$ is the residual.
Optimization: recall that the optimal weight for each feature d is $w_d = \frac{\sum_n x_d^{(n)} r^{(n)}}{\sum_n (x_d^{(n)})^2}$; pick the feature that most significantly reduces the residual.
The model at time-step t: $f^{\{t\}}(x) = \sum_{\tau=1}^{t} \alpha\, w^{\{\tau\}} x_{d^{\{\tau\}}}$; using a small $\alpha$ helps with test error.
Is this related to L1-regularized linear regression?
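A minimal sketch of forward stagewise linear regression with the L2 loss (L2 boosting over individual features), following the update above: at each stage compute the residual, pick the single feature with the best least-squares fit to it, and take a small step α in that coordinate. The helper name and synthetic data are illustrative choices, not the lecture's code.

import numpy as np

def l2_boost(X, y, n_steps=1000, alpha=0.01):
    """Forward stagewise linear regression (L2 boosting) over the columns of X."""
    N, D = X.shape
    w = np.zeros(D)
    f = np.zeros(N)                        # current model predictions f^{t}
    for t in range(n_steps):
        r = y - f                          # residual
        # optimal per-feature weight: w_d = sum_n x_d r / sum_n x_d^2
        w_opt = X.T @ r / (X ** 2).sum(axis=0)
        # reduction in squared residual from each feature: (sum_n x_d r)^2 / sum_n x_d^2
        gains = w_opt * (X.T @ r)
        d = np.argmax(gains)               # feature that most reduces the residual
        w[d] += alpha * w_opt[d]           # small step in that coordinate
        f += alpha * w_opt[d] * X[:, d]
    return w

# tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(200)
print(np.round(l2_boost(X, y), 2))         # weights concentrate on features 0 and 3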
L2 loss & forward stagewise linear model: example

Using a small learning rate $\alpha = .01$, L2 boosting has a similar regularization path to lasso: at each time-step only one feature's weight is updated or added. (Figure: the lasso path plots $w_d$ against $\sum_d |w_d|$, the boosting path plots $w_d^{\{t\}}$ against t.)
We can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K).
The number of steps t plays a similar role to (the inverse of) the regularization hyper-parameter.
Exponential loss & AdaBoost

Loss functions for binary classification, $y \in \{-1, +1\}$; the predicted label is $\hat{y} = \mathrm{sign}(f(x))$.
- misclassification loss (0-1 loss): $L(y, f(x)) = \mathbb{I}(y f(x) \le 0)$
- log-loss (aka cross entropy loss or binomial deviance): $L(y, f(x)) = \log\big(1 + e^{-y f(x)}\big)$
- hinge loss (support vector loss): $L(y, f(x)) = \max(0,\, 1 - y f(x))$
- yet another loss function is the exponential loss: $L(y, f(x)) = e^{-y f(x)}$; note that this loss grows faster than the other surrogate losses (more sensitive to outliers) — see the comparison below.
Useful property when working with additive models:
$$L\big(y, f^{\{t-1\}}(x) + w^{\{t\}}\phi(x, v^{\{t\}})\big) = L\big(y, f^{\{t-1\}}(x)\big) \cdot L\big(y, w^{\{t\}}\phi(x, v^{\{t\}})\big)$$
Treat the first factor as a weight q for the instance: instances that were not properly classified before receive a higher weight.
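A small sketch comparing the losses above as functions of the margin y·f(x); it illustrates that the exponential loss grows much faster than the log and hinge losses for badly misclassified points (large negative margin).

import numpy as np

margin = np.linspace(-3, 3, 7)                  # y * f(x)

zero_one = (margin <= 0).astype(float)          # misclassification (0-1) loss
logloss  = np.log1p(np.exp(-margin))            # log-loss / binomial deviance
hinge    = np.maximum(0.0, 1.0 - margin)        # hinge loss
expo     = np.exp(-margin)                      # exponential loss

for row in zip(margin, zero_one, logloss, hinge, expo):
    print(" ".join(f"{v:7.3f}" for v in row))   # exp loss dominates for margin << 0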
Exponential loss & AdaBoost (continued)

Cost: using the exponential loss,
$$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w^{\{t\}}\phi(x^{(n)}, v^{\{t\}})\big) = \sum_n q^{(n)}\, L\big(y^{(n)}, w^{\{t\}}\phi(x^{(n)}, v^{\{t\}})\big),$$
where $q^{(n)} = L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)})\big)$ is the loss for this instance at the previous stage.
Discrete AdaBoost: assume the weak learner is a simple classifier whose output is $\pm 1$.
Optimization: the objective is to find the weak learner minimizing the cost above,
$$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_n q^{(n)} e^{-y^{(n)} w^{\{t\}} \phi(x^{(n)}, v^{\{t\}})}$$
$$= e^{-w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} = \phi(x^{(n)}, v^{\{t\}})\big) + e^{w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}, v^{\{t\}})\big)$$
$$= e^{-w^{\{t\}}} \sum_n q^{(n)} + \big(e^{w^{\{t\}}} - e^{-w^{\{t\}}}\big) \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}, v^{\{t\}})\big)$$
The first term does not depend on the weak learner; assuming $w^{\{t\}} \ge 0$, the weak learner should minimize the second sum — this is classification with weighted instances.
Exponential loss & AdaBoost (continued)

Cost: $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_n q^{(n)} L\big(y^{(n)}, w^{\{t\}}\phi(x^{(n)}, v^{\{t\}})\big) = e^{-w^{\{t\}}} \sum_n q^{(n)} + \big(e^{w^{\{t\}}} - e^{-w^{\{t\}}}\big) \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}, v^{\{t\}})\big)$
The first term does not depend on the weak learner; assuming $w^{\{t\}} \ge 0$, the weak learner should minimize the second sum (classification with weighted instances). This gives $v^{\{t\}}$.
We still need to find the optimal $w^{\{t\}}$: setting $\frac{\partial J}{\partial w^{\{t\}}} = 0$ gives
$$w^{\{t\}} = \frac{1}{2}\log\frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}, \qquad \ell^{\{t\}} = \frac{\sum_n q^{(n)}\, \mathbb{I}\big(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)}\big)}{\sum_n q^{(n)}},$$
where $\ell^{\{t\}}$ is the weight-normalized misclassification error. Since the weak learner is better than chance, $\ell^{\{t\}} < .5$ and so $w^{\{t\}} \ge 0$.
We can now update the instance weights for the next iteration (multiply by the new loss): $q^{(n),\{t+1\}} = q^{(n),\{t\}}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$.
Since $w^{\{t\}} > 0$, the weights q of misclassified points increase and the rest decrease.
Exponential loss & AdaBoost: overall algorithm for discrete AdaBoost

The final model combines the weak learners: $f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}}\phi(x; v^{\{t\}})\big)$.

initialize $q^{(n)} = \frac{1}{N}$ for all n
for t = 1:T
  fit the simple classifier $\phi(x, v^{\{t\}})$ to the weighted dataset
  $\ell^{\{t\}} = \frac{\sum_n q^{(n)}\, \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$
  $w^{\{t\}} = \frac{1}{2}\log\frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$
  $q^{(n)} \leftarrow q^{(n)}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$ for all n
return $f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}}\phi(x; v^{\{t\}})\big)$
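A sketch of the discrete AdaBoost loop above, using scikit-learn decision stumps as the weak learners and the sample_weight interface to handle the weighted instances; the synthetic dataset and the small epsilon guard against zero error are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1                                  # labels in {-1, +1}

T, N = 100, X.shape[0]
q = np.ones(N) / N                               # instance weights q^{(n)}
stumps, ws = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)  # weak learner: decision stump
    stump.fit(X, y, sample_weight=q)
    phi = stump.predict(X)                       # outputs in {-1, +1}
    ell = q[phi != y].sum() / q.sum()            # weight-normalized misclassification error
    w = 0.5 * np.log((1 - ell) / max(ell, 1e-12))  # w >= 0 since ell < .5
    q = q * np.exp(-w * y * phi)                 # reweight: mistakes get heavier
    stumps.append(stump)
    ws.append(w)

# final strong classifier: sign of the weighted sum of weak learners
F = sum(w * s.predict(X) for w, s in zip(ws, stumps))
print("training accuracy:", (np.sign(F) == y).mean())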
AdaBoost example

Each weak learner is a decision stump (dashed line), $\hat{y} = \mathrm{sign}\big(\sum_t w^{\{t\}}\phi(x; v^{\{t\}})\big)$; the circle size is proportional to $q^{(n),\{t\}}$ and the green curve is the decision boundary of $f^{\{t\}}$. (Figure panels: t = 1, 2, 3, 6, 10, 150.)
AdaBoost example

Features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian; the label is $y^{(n)} = \mathbb{I}\big(\sum_d (x_d^{(n)})^2 > 9.34\big)$; N = 2000 training examples.
Notice that the test error does not increase: AdaBoost is very slow to overfit.
Application: Viola-Jones face detection

Haar features are computationally efficient; each feature is a weak learner, and AdaBoost picks one feature at a time (label: face / no-face).
This can still be inefficient: exploit the fact that faces are rare (about .01% of subwindows are faces) and use a cascade of classifiers, each stage tuned for a (near) 100% detection rate and a modest false-positive rate, so that the cumulative false-positive rate drops quickly.
The cascade is applied over all image subwindows and is fast enough for real-time (object) detection.
(Image source: David Lowe's slides.)
Gradient boosting

Idea: fit the weak learner to the gradient of the cost.
Let $f^{\{t\}} = \big[f^{\{t\}}(x^{(1)}), \ldots, f^{\{t\}}(x^{(N)})\big]^\top$ and the true labels $y = \big[y^{(1)}, \ldots, y^{(N)}\big]^\top$.
Ignoring the structure of f, if we use gradient descent to minimize the loss, $\hat{f} = \arg\min_f L(f, y)$, we can write $\hat{f} = f^{\{T\}}$ as a sum of steps:
$$f^{\{T\}} = f^{\{0\}} - \sum_{t=1}^{T} w^{\{t\}} g^{\{t\}}, \qquad g^{\{t\}} = \frac{\partial}{\partial f} L(f, y)\Big|_{f = f^{\{t-1\}}}$$
The gradient vector plays a role similar to the residual.
We can also look for the optimal step size: $w^{\{t\}} = \arg\min_w L\big(f^{\{t-1\}} - w\, g^{\{t\}},\, y\big)$.
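A minimal sketch of gradient boosting for regression with the L2 loss, where the negative gradient is simply the residual, so each stage fits a small regression tree to the current residuals; the use of scikit-learn trees, the synthetic data, and the constant step size are illustrative assumptions rather than the lecture's exact algorithm.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(300)

T, alpha = 200, 0.1
f = np.zeros_like(y)                        # f^{0} = 0
trees = []

for t in range(T):
    g = y - f                               # for the L2 loss, -dL/df = y - f (the residual)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, g)   # weak learner fit to the gradient
    trees.append(tree)
    f += alpha * tree.predict(X)            # take a small step along the negative gradient

print("training MSE:", np.mean((y - f) ** 2))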