Ensemble method for supervised learning using an explicit loss function
Ricco Rakotomalala, Université Lumière Lyon 2
Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/
Outline
1. Preamble
2. Gradient boosting for regression
3. Gradient boosting for classification
4. Regularization (shrinkage, stochastic gradient boosting)
5. Tools and software
6. Conclusion – Pros and cons
7. References
Boosting and Gradient Descent
BOOSTING is an ensemble method which aggregates classifiers learned sequentially on samples whose individual weights are adjusted at each step. The classifiers are weighted according to their performance [RAK, page 28].

Input: B the number of models, ALGO the learning algorithm, Ω the training set of size n, y the target attribute, X the matrix of p predictive attributes.

MODELES = { }
All the instances have the same weight: $\omega_i^1 = 1/n$
For b = 1 to B Do
  Fit the model $M_b$ from $\Omega(\omega^b)$ using ALGO ($\omega^b$ is the weighting system at step b)
  Add $M_b$ into MODELES
  Compute the weighted error rate of $M_b$: $\varepsilon_b = \sum_{i=1}^{n} \omega_i^b \, I(y_i \neq \hat{y}_i)$
  If $\varepsilon_b > 0.5$ or $\varepsilon_b = 0$, STOP the process
  Else
    Compute $\alpha_b = \ln\dfrac{1 - \varepsilon_b}{\varepsilon_b}$
    Update the weights: $\omega_i^{b+1} = \omega_i^b \exp\big(\alpha_b \, I(y_i \neq \hat{y}_i)\big)$
    Normalize the weights so that their sum equals 1
End For
A weighted ($\alpha_b$) vote is used for prediction: $f(x) = \mathrm{sign}\big(\sum_{b=1}^{B} \alpha_b M_b(x)\big)$ (this is an additive model)
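The pseudo-code above can be turned into a few lines of Python. This is a minimal sketch, not the author's Tanagra implementation: it assumes a binary target coded in {-1, +1} and uses a scikit-learn decision stump as the base learner ALGO; the function names are illustrative.

```python
# Minimal AdaBoost (AdaBoost.M1) following the pseudo-code of the preamble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # omega^1_i = 1/n
    models, alphas = [], []
    for b in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # fit M_b on the weighted sample
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))        # weighted error rate epsilon_b
        if err >= 0.5 or err == 0:           # stopping rule of the pseudo-code
            break
        alpha = np.log((1 - err) / err)      # alpha_b
        w = w * np.exp(alpha * (pred != y))  # increase the weights of misclassified points
        w = w / w.sum()                      # normalize so that the weights sum to 1
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # weighted (alpha_b) vote: sign of the additive score
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```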
Gradient descent is an iterative technique for approaching the solution of an optimization problem. In supervised learning, building the model often amounts to determining the parameters that optimize (maximize or minimize) an objective function (e.g. Perceptron – least squares criterion, pages 11 and 12).

f() is a classifier with some parameters.
j() is a cost function comparing the observed value of the target and the prediction of the model for one observation.
J() is an overall loss function, computed additively over all the observations: $J(y, f) = \sum_{i=1}^{n} j\big(y_i, f(x_i)\big)$
The aim is to minimize J() with regard to f(), i.e. the parameters of f().
$f_b()$ is the version of the classifier at step "b"; $\eta$ is the learning rate which drives the process:
$$f_b(x_i) = f_{b-1}(x_i) - \eta \, \nabla j\big(y_i, f(x_i)\big)$$
$\nabla j(y_i, f(x_i)) = \dfrac{\partial j(y_i, f(x_i))}{\partial f(x_i)}$ is the gradient, i.e. the first-order partial derivative of the cost function with respect to the classifier.
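To make the generic update concrete, here is a minimal sketch (not part of the slides) where f() is a linear model and j() is the squared-error cost; the function name and the toy data are illustrative assumptions.

```python
# Generic gradient descent illustrated on the least squares criterion:
# f(x) = X @ w and j(y, f(x)) = 1/2 (y - f(x))^2.
import numpy as np

def gradient_descent_ls(X, y, eta=0.1, n_iter=200):
    w = np.zeros(X.shape[1])              # initial parameters of f()
    for _ in range(n_iter):
        residuals = y - X @ w             # y_i - f(x_i)
        grad = -X.T @ residuals / len(y)  # gradient of the mean squared-error cost
        w = w - eta * grad                # update with learning rate eta
    return w

# Usage: w_hat approaches the least squares solution on a toy dataset
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
w_hat = gradient_descent_ls(X, y)
```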
We can show that ADABOOST amounts to optimizing an exponential loss function, i.e. each classifier $M_b$ learned from the weighted sample produced by $M_{b-1}$ reduces an overall loss function [BIS, page 659; HAS, page 343].

With $y \in \{-1, +1\}$, the overall loss function is
$$J(f) = \sum_{i=1}^{n} \exp\big(-y_i f(x_i)\big)$$
where f() is the aggregate classifier, a linear combination of the base classifiers $M_b$.

The aggregate classifier at step "b" is corrected by the individual classifier $M_b$ learned from the reweighted sample:
$$f_b = f_{b-1} + \frac{\alpha_b}{2} M_b$$
$M_b$ plays the role of the gradient here, i.e. each intermediate model reduces the loss of the global model. The "gradient" classifier comes from a sample where the weights of the individuals depend on the performance of the previous model (idea of iterative corrections):
$$\omega_i^{b} = \omega_i^{b-1} \exp\big(\alpha_{b-1} \, I(y_i \neq M_{b-1}(x_i))\big)$$

GRADIENT BOOSTING: generalize the approach to other loss functions.
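For the record, the key step of the argument in [BIS, HAS] can be spelled out as follows; the unnormalized weights are $\omega_i^{b} = e^{-y_i f_{b-1}(x_i)}$ (this derivation is a standard complement, not reproduced from the slide):
$$
J\!\left(f_{b-1} + \tfrac{\alpha_b}{2} M_b\right)
= \sum_{i=1}^{n} \omega_i^{b} \, e^{-\frac{\alpha_b}{2} y_i M_b(x_i)}
= e^{-\alpha_b/2} \sum_{i:\, y_i = M_b(x_i)} \omega_i^{b} \;+\; e^{\alpha_b/2} \sum_{i:\, y_i \neq M_b(x_i)} \omega_i^{b}
$$
Setting the derivative with respect to $\alpha_b$ to zero gives $\alpha_b = \ln\frac{1 - \varepsilon_b}{\varepsilon_b}$, with $\varepsilon_b$ the weighted error rate: the AdaBoost coefficient of the preamble is recovered.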
Gradient Boosting = Gradient Descent + Boosting
Regression is a supervised learning process which estimates the relationship between a quantitative dependent variable and a set of independent variables.

$y_i = M_1(x_i) + \varepsilon_i$ — $\varepsilon$ is the error term; it represents the inadequacy of the model. $M_1$ can be any kind of model; we use a regression tree.
$e_{1i} = y_i - M_1(x_i)$ — e is the residual, the estimated value of the error. A high value (in absolute value) reflects a bad prediction.
The aim is to model this residual with a second model $M_2$ and combine it with the previous one for a better prediction:
$e_{1i} = M_2(x_i) + e_{2i}$ — we can proceed in the same way with the residual $e_2$, etc.
$\hat{y}_i = M_1(x_i) + M_2(x_i)$ — the role of $M_2$ is to (additively) compensate the inadequacy of $M_1$; thereafter we can learn $M_3$, etc.
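This residual-correction idea can be written directly in Python. The sketch below is illustrative only: it assumes scikit-learn regression trees of depth 2 as the models $M_1, M_2, \ldots$, and the function names and parameter values are not taken from the slides.

```python
# Gradient boosting for regression seen as sequential fitting of residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, B=100, learning_rate=1.0, max_depth=2):
    f0 = y.mean()                          # trivial model: predict the mean of y
    residuals = y - f0
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # M_b models the current residuals
        residuals = residuals - learning_rate * tree.predict(X)   # new residuals e_b
        trees.append(tree)
    return f0, trees

def boost_predict(f0, trees, X, learning_rate=1.0):
    # additive model: f(x) = f0 + sum_b nu * M_b(x)
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```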
The sum of squared errors is a well-known overall indicator of quality in regression:
$$j(y_i, f(x_i)) = \frac{1}{2}\big(y_i - f(x_i)\big)^2 \qquad J(y, f) = \sum_{i=1}^{n} j\big(y_i, f(x_i)\big)$$
Calculation of the gradient. It is actually equal to the residual, but with the opposite sign, i.e. residual = negative gradient:
$$\frac{\partial j(y_i, f(x_i))}{\partial f(x_i)} = \frac{\partial}{\partial f(x_i)}\left[\frac{1}{2}\big(y_i - f(x_i)\big)^2\right] = f(x_i) - y_i$$
Thus we have an iterative process for the construction of the additive model:
$$f_b(x_i) = f_{b-1}(x_i) + M_b(x_i), \qquad M_b(x_i) \approx y_i - f_{b-1}(x_i) = -\frac{\partial j(y_i, f_{b-1}(x_i))}{\partial f_{b-1}(x_i)}$$
Modeling the residuals at step "b" (regression tree $M_b$) therefore corresponds to a gradient step, $f_b(x_i) = f_{b-1}(x_i) - 1 \cdot \nabla j(y_i, f_{b-1}(x_i))$. Ultimately, we minimize the overall cost function J(). The learning rate is equal to 1 here.
We have an iterative process where, at each step, we use the negative value of the gradient, $-\nabla j(y, f)$ [WIK]:

Fit the trivial tree $f_0()$ — the trivial tree has only the root; its prediction is the mean of the target attribute y, computed for all the individuals of the training sample (i = 1, …, n)
REPEAT UNTIL CONVERGENCE — or, more simply, FOR b = 1, …, B (B: parameter of the algorithm):
  Compute the negative gradient $-\nabla j(y, f)$ (with j() = squared error, negative gradient = residual)
  Fit a regression tree $M_b$ on $-\nabla j(y, f)$ (the depth of the trees is a possible parameter)
  Update $f_b = f_{b-1} + \gamma_b \, M_b$, where the step size $\gamma_b$ is chosen at each step so as to minimize
  $$\gamma_b = \arg\min_{\gamma} \sum_{i=1}^{n} j\big(y_i, f_{b-1}(x_i) + \gamma \, M_b(x_i)\big)$$
  (using a numerical optimization approach)
END FOR

The models are combined in an additive fashion. The advantage of this generic formulation is that one can use other loss functions and the associated gradients (see the sketch below).
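Here is a sketch of this generic formulation, with the loss j() and its negative gradient passed in as functions; the line search for the step size uses a standard 1-D numerical optimizer from scipy. All names and default values are illustrative assumptions, not the slides' own code.

```python
# Generic gradient boosting loop with a pluggable loss function.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, neg_gradient, B=100, max_depth=2):
    f = np.full(len(y), y.mean())            # trivial model f_0: the mean of y
    trees, gammas = [], []
    for _ in range(B):
        g = neg_gradient(y, f)               # pseudo-residuals: -grad j(y, f)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, g)                       # M_b approximates the negative gradient
        h = tree.predict(X)
        # line search for gamma_b: minimize sum_i j(y_i, f(x_i) + gamma * M_b(x_i))
        gamma = minimize_scalar(lambda gamma: loss(y, f + gamma * h).sum()).x
        f = f + gamma * h                    # f_b = f_{b-1} + gamma_b * M_b
        trees.append(tree)
        gammas.append(gamma)
    return y.mean(), trees, gammas

# Example: squared-error loss, whose negative gradient is the residual
def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def neg_grad_squared(y, f):
    return y - f

# f0, trees, gammas = gradient_boost(X, y, squared_loss, neg_grad_squared)
```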
Other loss functions ⇒ other gradient formulations ⇒ other behavior and performance of the aggregate model.

Loss function $j(y_i, f(x_i))$ | Negative gradient $-\nabla j(y_i, f(x_i))$ | Pros / Cons

Squared error $\frac{1}{2}\big(y_i - f(x_i)\big)^2$ | $y_i - f(x_i)$ | Sensitive to small differences, but not robust against outliers
Absolute error $|y_i - f(x_i)|$ | $\mathrm{sign}\big[y_i - f(x_i)\big]$ | Less sensitive to small differences, but robust against outliers
Huber | $y_i - f(x_i)$ if $|y_i - f(x_i)| \leq \delta$, and $\delta \cdot \mathrm{sign}\big[y_i - f(x_i)\big]$ if $|y_i - f(x_i)| > \delta$, where $\delta$ is a quantile of $\{|y_i - f(x_i)|\}$ | Combines the benefits of the squared error (more sensitive to small values of the gradient) and of the absolute error (more robust against outliers)
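As an illustration, the negative gradient of the Huber loss from the table can be coded as follows; the 0.9 quantile used for $\delta$ is an assumed choice, not taken from the slides.

```python
# Negative gradient (pseudo-residuals) of the Huber loss, as described in the table.
import numpy as np

def huber_negative_gradient(y, f, quantile=0.9):
    r = y - f                                    # residuals y_i - f(x_i)
    delta = np.quantile(np.abs(r), quantile)     # delta = quantile of {|y_i - f(x_i)|}
    # quadratic zone: gradient = residual; linear zone: delta * sign(residual)
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
```

This function could be plugged in as neg_gradient in the generic loop sketched above, paired with the corresponding Huber loss in the line search.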
Working with the indicator variables (dummy variables)