STK-IN4300 Statistical Learning Methods in Data Science
Lecture 7
Riccardo De Bin
debin@math.uio.no
Outline of the lecture

- Generalized Additive Models
  - Definition
  - Fitting algorithm
- Tree-based Methods
  - Background
  - How to grow a regression tree
- Bagging
  - Bootstrap aggregation
  - Bootstrap trees
Generalized Additive Models: introduction

From the previous lecture:
- linear regression models are simple and effective;
- often the effect of a predictor on the response is not linear
  → local polynomials and splines.

Generalized Additive Models:
- flexible statistical methods to identify and characterize nonlinear regression effects;
- a larger class than the generalized linear models.
Generalized Additive Models: additive models

Consider the usual framework:
- X_1, ..., X_p are the predictors;
- Y is the response variable;
- f_1(·), ..., f_p(·) are unspecified smooth functions.

Then, an additive model has the form

  E[Y | X_1, ..., X_p] = α + f_1(X_1) + ··· + f_p(X_p).
Generalized Additive Models: more generally

As linear models are extended to generalized linear models, additive models can be generalized to generalized additive models,

  g(µ(X_1, ..., X_p)) = α + f_1(X_1) + ··· + f_p(X_p),

where:
- µ(X_1, ..., X_p) = E[Y | X_1, ..., X_p];
- g(·) is the link function;
- classical examples:
  - g(µ) = µ: identity link → Gaussian models;
  - g(µ) = log(µ / (1 − µ)): logit link → binomial models;
  - g(µ) = Φ^{-1}(µ): probit link → binomial models;
  - g(µ) = log(µ): logarithmic link → Poisson models;
  - ...
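A minimal illustrative sketch (not part of the slides) of fitting such a model in Python, assuming the third-party pygam package is available; the smooth terms s(0), s(1) and the simulated data are purely illustrative:

# Logistic GAM (logit link) with two smooth terms; data simulated for illustration.
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
# assumed true nonlinear effects: f_1 is a sine, f_2 is a quadratic
eta = 2 * np.sin(3 * X[:, 0]) + X[:, 1] ** 2 - 0.5
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# g(mu) = log(mu / (1 - mu)) = alpha + f_1(X_1) + f_2(X_2)
gam = LogisticGAM(s(0) + s(1)).fit(X, y)
gam.summary()                    # per-term smoothing and effective degrees of freedom
p_hat = gam.predict_proba(X)     # fitted probabilities mu(X)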
Generalized Additive Models: semiparametric models

Generalized additive models are very flexible:
- not all functions f_j(·) must be nonlinear,

    g(µ) = X^T β + f(Z),

  in which case we talk about semiparametric models;
- nonlinear effects can be combined with qualitative inputs,

    g(µ) = f(X) + g_k(Z) = f(X) + g(V, Z),

  where k indexes the levels of a qualitative variable V.
Fitting algorithm: difference with splines

When implementing splines:
- each function is modelled by a basis expansion;
- the resulting model can be fitted with least squares.

Here the approach is different:
- each function is modelled with a smoother (smoothing splines, kernel smoothers, ...);
- all p functions are fitted simultaneously via an algorithm.
Fitting algorithm: ingredients

Consider an additive model

  Y = α + Σ_{j=1}^p f_j(X_j) + ε.

We can define a penalized loss function,

  Σ_{i=1}^N ( y_i − α − Σ_{j=1}^p f_j(x_ij) )^2 + Σ_{j=1}^p λ_j ∫ { f_j''(t_j) }^2 dt_j,

where:
- the λ_j are tuning parameters;
- the minimizer is an additive cubic spline model:
  - each f_j(X_j) is a cubic spline with knots at the (unique) x_ij's.
Fitting algorithm: constraints

The parameter α is in general not identifiable:
- we get the same result by adding a constant to each f_j(X_j) and subtracting it from α;
- by convention, Σ_{i=1}^N f_j(x_ij) = 0 for every j:
  - the functions average 0 over the data;
  - α is therefore identifiable;
  - in particular, α̂ = ȳ.

If this holds and the matrix of inputs X has full rank:
- the loss function is convex;
- the minimizer is unique;
- a simple procedure finds the solution → the backfitting algorithm.
Fitting algorithm: backfitting algorithm

The backfitting algorithm:

1. Initialization: α̂ = (1/N) Σ_{i=1}^N y_i and f̂_j ≡ 0 for all j.
2. Cycle over j = 1, ..., p, 1, ..., p, ...

     f̂_j ← S_j[ { y_i − α̂ − Σ_{k≠j} f̂_k(x_ik) }_{i=1}^N ]
     f̂_j ← f̂_j − (1/N) Σ_{i=1}^N f̂_j(x_ij)

   until the f̂_j change less than a pre-specified threshold.

S_j is usually a cubic smoothing spline, but other smoothing operators can be used.
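A from-scratch sketch of the algorithm (an illustration, not the slides' code): S_j is taken here to be a simple Gaussian-kernel smoother instead of a cubic smoothing spline, and the bandwidth and simulated data are assumptions.

import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother: smooth the partial residuals r against x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_iter=50, tol=1e-6):
    n, p = X.shape
    alpha = y.mean()                       # alpha_hat = mean(y)
    f = np.zeros((n, p))                   # f_hat_j evaluated at the data points
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residuals: y_i - alpha_hat - sum_{k != j} f_hat_k(x_ik)
            partial = y - alpha - f[:, [k for k in range(p) if k != j]].sum(axis=1)
            f[:, j] = kernel_smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # re-centre so that f_j averages to zero
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# toy example with two additive nonlinear effects
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1 + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
alpha_hat, f_hat = backfit(X, y)

The re-centring step inside the loop mirrors the convention that each f_j averages to zero over the data.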
Fitting algorithm: remarks

Note:
- the smoother S_j can be (when applied only at the training points) represented by the N × N smoothing matrix S_j:
  - the degrees of freedom of the j-th term are trace(S_j);
- for the generalized additive model, the loss function is the penalized negative log-likelihood;
- the backfitting algorithm fits all the predictors:
  - not feasible when p >> N.
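A small illustrative sketch (an assumption, not from the slides): for a linear smoother such as the Gaussian-kernel smoother used above, the smoother matrix at the training points is explicit, so its trace gives the effective degrees of freedom of that term.

import numpy as np

def smoother_matrix(x, bandwidth=0.3):
    # row i contains the weights used to produce the fitted value at x_i
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return w / w.sum(axis=1, keepdims=True)

x = np.sort(np.random.default_rng(0).uniform(-2, 2, size=100))
S_j = smoother_matrix(x)
df_j = np.trace(S_j)        # effective degrees of freedom of the j-th term
print(round(df_j, 2))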
Example: logistic regression for email spam data

Consider the spam data (as in Exercise 3.17):
- binary response (email / spam), modelled by an additive logistic regression,

    log[ Pr(Y = 1 | X) / Pr(Y = 0 | X) ] = α + f_1(X_1) + ··· + f_p(X_p);

- 48 percentages of words in the email (e.g. you, free, ...);
- 6 percentages of specific characters (e.g. ch;, ch$, ...);
- average length of uninterrupted sequences of capital letters (CAPAVE);
- length of the longest uninterrupted sequence of capital letters (CAPMAX);
- sum of the lengths of uninterrupted sequences of capital letters (CAPTOT).

Sample size: 3065 training observations, 1536 test observations.
Choice of f_j(·): cubic smoothing splines with df = 4.
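A hedged sketch of an analysis in this spirit (not the slides' actual code), again assuming the pygam package and assuming that X_train, y_train, X_test, y_test have already been loaded, e.g. from the UCI spambase data; the slides' choice of df = 4 per term is only approximated here through the default penalty:

import numpy as np
from pygam import LogisticGAM, s

# one smooth term per predictor column
terms = s(0)
for j in range(1, X_train.shape[1]):
    terms = terms + s(j)

gam = LogisticGAM(terms).fit(X_train, y_train)
test_error = np.mean(gam.predict(X_test) != y_test)   # misclassification rate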
Example: logistic regression for email spam data

[Three slides of figures/tables illustrating the results of the additive logistic regression on the spam data; graphical content not reproduced here.]
Tree-based Methods: introduction

Consider a regression problem, with Y the response and X the input matrix.

A tree is a recursive binary partition of the feature space:
- at each step, a region is divided into two regions,
  - until a stopping criterion applies;
- at the end, the input space is split into M regions R_m;
- a constant c_m is fitted in each R_m.

The final prediction is

  f̂(X) = Σ_{m=1}^M ĉ_m 1(X ∈ R_m),

where ĉ_m is an estimate for the region R_m (e.g., ave(y_i | x_i ∈ R_m)).
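A tiny illustrative sketch (an assumption, not from the slides) of how such a piecewise-constant prediction works; the regions, their boundaries and the fitted constants below are made up for the example.

import numpy as np

# each region is (lower_bounds, upper_bounds) on two inputs; c_m = fitted constant in R_m
regions = [((-np.inf, -np.inf), (0.5, np.inf)),    # R_1: X_1 <= 0.5
           ((0.5, -np.inf), (np.inf, 1.0)),        # R_2: X_1 > 0.5, X_2 <= 1.0
           ((0.5, 1.0), (np.inf, np.inf))]         # R_3: X_1 > 0.5, X_2 > 1.0
c = [1.2, 0.4, 2.1]                                # assumed fitted constants

def predict(x):
    """f_hat(x) = sum_m c_m * 1(x in R_m): return the constant of the region containing x."""
    for (lo, hi), c_m in zip(regions, c):
        if all(l < xj <= h for xj, l, h in zip(x, lo, hi)):
            return c_m
    raise ValueError("x falls outside the partition")

print(predict(np.array([0.3, 2.0])))   # -> 1.2 (falls in region R_1)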
Tree-based Methods: introduction

[Figure slide accompanying the introduction to tree-based methods; graphical content not reproduced here.]
Tree-based Methods: introduction

Note:
- each split can be represented as a junction of a tree;
- this representation works for p > 2;
- each observation is assigned to a branch at each junction;
- the model is easy to interpret.
Tree-based Methods: introduction

[Figure slide accompanying the introduction to tree-based methods; graphical content not reproduced here.]
How to grow a regression tree: split

How to grow a regression tree:
- we need to automatically decide the splitting variables ...
- ... and the splitting points;
- we need to decide the shape (topology) of the tree.

Using a sum-of-squares criterion, Σ_{i=1}^N (y_i − f(x_i))^2:
- the best ĉ_m is ave(y_i | x_i ∈ R_m);
- finding the best partition in terms of minimum sum of squares is generally computationally infeasible
  → go greedy.
How to grow a regression tree: greedy algorithm

Starting with all the data:
- for each X_j, find the best split point s:
  - define the two half-planes,
    - R_1(j, s) = { X | X_j ≤ s };
    - R_2(j, s) = { X | X_j > s };
  - the choice of s can be done really quickly;
- for each j and s, solve

    min_{j, s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)^2 + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)^2 ];

- the inner minimization is solved by
  - ĉ_1 = ave(y_i | x_i ∈ R_1(j, s));
  - ĉ_2 = ave(y_i | x_i ∈ R_2(j, s));
- the identification of the best pair (j, s) is feasible.
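An illustrative sketch of one greedy split (an assumption, not the slides' code): scan every predictor j and every candidate split point s, and pick the pair (j, s) that minimizes the total within-region sum of squares; the simulated data below are only for demonstration.

import numpy as np

def best_split(X, y):
    n, p = X.shape
    best = (None, None, np.inf)            # (j, s, total sum of squares)
    for j in range(p):
        # candidate split points: midpoints between consecutive unique values of X_j
        xs = np.unique(X[:, j])
        for s in (xs[:-1] + xs[1:]) / 2:
            left = X[:, j] <= s            # R_1(j, s)
            right = ~left                  # R_2(j, s)
            c1, c2 = y[left].mean(), y[right].mean()   # inner minimizers
            rss = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 1] > 0.6, 2.0, -1.0) + rng.normal(scale=0.2, size=200)
j, s, rss = best_split(X, y)
print(j, round(s, 2))                      # expected: j = 1, s near 0.6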
How to grow a regression tree: when to stop

The tree size:
- is a tuning parameter;
- controls the model complexity;
- its optimal value should be chosen from the data.

Naive approach:
- split a tree node only if the split gives a sufficient decrease in the sum of squares (e.g., larger than a pre-specified threshold):
  - intuitive;
  - but short-sighted (a split can be preparatory for a better split below it).

Preferred strategy:
- grow a large (pre-specified number of nodes) or complete tree T_0;
- prune it (remove branches) to find the best subtree.
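A hedged sketch of this grow-then-prune strategy in practice (not the slides' code), using scikit-learn's cost-complexity pruning with the penalty ccp_alpha chosen by cross-validation; the data and tuning settings are assumptions for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))
y = np.sin(4 * X[:, 0]) + (X[:, 1] > 0.5) + rng.normal(scale=0.3, size=400)

# 1) grow a large tree T_0 (no depth limit)
big_tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)

# 2) candidate pruning levels from the cost-complexity path, then pick one by CV
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas
cv = GridSearchCV(DecisionTreeRegressor(min_samples_leaf=5),
                  {"ccp_alpha": alphas}, cv=5).fit(X, y)
pruned_tree = cv.best_estimator_
print(big_tree.tree_.node_count, "->", pruned_tree.tree_.node_count)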