

  1. STK-IN4300 Statistical Learning Methods in Data Science
     Riccardo De Bin (debin@math.uio.no)
     Lecture 8

  2. Outline of the lecture
     Generalized Additive Models: definition; fitting algorithm.
     Tree-based Methods: background; how to grow a regression tree.
     Bagging: bootstrap aggregation; bootstrap trees.

  3. Generalized Additive Models: introduction
     From the previous lecture:
     - linear regression models are simple and effective;
     - often the effect of a predictor on the response is not linear
       → local polynomials and splines.
     Generalized Additive Models:
     - flexible statistical methods to identify and characterize nonlinear regression effects;
     - a larger class than the generalized linear models.

  4. Generalized Additive Models: additive models
     Consider the usual framework:
     - X_1, ..., X_p are the predictors;
     - Y is the response variable;
     - f_1(·), ..., f_p(·) are unspecified smooth functions.
     Then, an additive model has the form
         E[Y | X_1, ..., X_p] = α + f_1(X_1) + ··· + f_p(X_p).
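As a concrete illustration of the model form (not part of the lecture), the following sketch simulates data from an additive model with two smooth components; the choices of f_1, f_2, α, and the noise level are arbitrary.

```python
# Illustrative simulation from an additive model
# E[Y | X1, X2] = alpha + f1(X1) + f2(X2); all choices below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N = 200
alpha = 1.0
X1 = rng.uniform(-2, 2, N)
X2 = rng.uniform(-2, 2, N)

f1 = lambda x: np.sin(2 * x)          # first smooth component
f2 = lambda x: 0.5 * x**2 - 0.6       # second smooth component (roughly centred)

y = alpha + f1(X1) + f2(X2) + rng.normal(scale=0.3, size=N)
```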

  5. Generalized Additive Models: more generally
     As linear models are extended to generalized linear models, we can generalize the
     additive model to the generalized additive model,
         g(μ(X_1, ..., X_p)) = α + f_1(X_1) + ··· + f_p(X_p),
     where:
     - μ(X_1, ..., X_p) = E[Y | X_1, ..., X_p] is the conditional mean of the response;
     - g(·) is the link function;
     - classical examples:
       - g(μ) = μ, the identity link → Gaussian models;
       - g(μ) = log(μ / (1 − μ)), the logit link → Binomial models;
       - g(μ) = Φ^{−1}(μ), the probit link → Binomial models;
       - g(μ) = log(μ), the logarithmic link → Poisson models;
       - ...
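For instance, with the logit link the additive predictor η = α + Σ_j f_j(X_j) is mapped to a probability μ = g^{−1}(η). A minimal numeric check (illustrative, not from the slides):

```python
# The logit link g(mu) = log(mu / (1 - mu)) and its inverse.
import numpy as np

def logit(mu):
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

eta = 0.7                              # an arbitrary value of the additive predictor
mu = inv_logit(eta)                    # conditional mean E[Y | X], a probability in (0, 1)
assert np.isclose(logit(mu), eta)      # link and inverse link are mutually consistent
```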

  6. Generalized Additive Models: semiparametric models
     Generalized additive models are very flexible:
     - not all functions f_j(·) must be nonlinear,
           g(μ) = X^T β + f(Z),
       in which case we talk about semiparametric models;
     - nonlinear effects can be combined with qualitative inputs,
           g(μ) = f(X) + g_k(Z) = f(X) + g(V, Z),
       where k indexes the levels of a qualitative variable V (so g_k(Z) encodes an
       interaction between V and Z).

  7. Fitting algorithm: difference with splines
     When implementing splines:
     - each function is modelled by a basis expansion;
     - the resulting model can be fitted with least squares.
     Here the approach is different:
     - each function is modelled with a smoother (smoothing splines, kernel smoothers, ...);
     - all p functions are fitted simultaneously via an algorithm.

  8. Fitting algorithm: ingredients
     Consider an additive model
         Y = α + Σ_{j=1}^p f_j(X_j) + ε.
     We can define a penalized loss function,
         Σ_{i=1}^N ( y_i − α − Σ_{j=1}^p f_j(x_ij) )^2 + Σ_{j=1}^p λ_j ∫ {f_j''(t_j)}^2 dt_j,
     where:
     - the λ_j are tuning parameters;
     - the minimizer is an additive cubic spline model:
       - each f_j(X_j) is a cubic spline with knots at the (unique) x_ij's.

  9. Fitting algorithm: constraints
     The parameter α is in general not identifiable:
     - we get the same fit by adding a constant to each f_j(X_j) and subtracting it from α;
     - by convention, Σ_{i=1}^N f_j(x_ij) = 0 for all j:
       - the functions average 0 over the data;
       - α is therefore identifiable;
       - in particular, α̂ = ȳ.
     If this holds and the matrix of inputs X has full column rank:
     - the loss function is convex;
     - the minimizer is unique.

  10. Fitting algorithm: backfitting algorithm
      The backfitting algorithm:
      1. Initialization: α̂ = (1/N) Σ_{i=1}^N y_i and f̂_j ≡ 0 for all j.
      2. Cycle over j = 1, ..., p, 1, ..., p, ...:
             f̂_j ← S_j[ { y_i − α̂ − Σ_{k≠j} f̂_k(x_ik) }_{i=1}^N ],
             f̂_j ← f̂_j − (1/N) Σ_{i=1}^N f̂_j(x_ij),
         until the f̂_j change less than a pre-specified threshold.
      S_j is usually a cubic smoothing spline, but other smoothing operators can be used.
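A minimal backfitting sketch, assuming numpy and a simple running-mean smoother as a stand-in for the cubic smoothing spline S_j (and a fixed number of cycles instead of a convergence threshold); the data-generating choices are illustrative.

```python
import numpy as np

def running_mean_smoother(x, r, window=21):
    """Smooth the partial residuals r against x by averaging the `window`
    nearest neighbours in x; a crude stand-in for a smoothing spline S_j."""
    order = np.argsort(x)
    r_sorted = r[order]
    half = window // 2
    smoothed = np.array([r_sorted[max(0, i - half): i + half + 1].mean()
                         for i in range(len(x))])
    fitted = np.empty_like(smoothed)
    fitted[order] = smoothed           # map back to the original ordering
    return fitted

def backfit(X, y, n_cycles=20):
    """Backfitting for the additive model y = alpha + sum_j f_j(x_j) + eps."""
    N, p = X.shape
    alpha = y.mean()                   # step 1: alpha_hat = mean(y), f_j = 0
    f = np.zeros((N, p))
    for _ in range(n_cycles):          # step 2: cycle over j = 1, ..., p, 1, ..., p, ...
        for j in range(p):
            partial_resid = y - alpha - f[:, np.arange(p) != j].sum(axis=1)
            f[:, j] = running_mean_smoother(X[:, j], partial_resid)
            f[:, j] -= f[:, j].mean()  # recentre so that f_j averages zero over the data
    return alpha, f

# Example on synthetic data:
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.0 + np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
alpha_hat, f_hat = backfit(X, y)       # f_hat[:, j] contains f_j evaluated at the x_ij
```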

  11. Fitting algorithm: remarks
      Note:
      - the smoother S_j can be represented (when applied only at the training points) by an
        N × N smoothing matrix S_j:
        - the degrees of freedom for the j-th term are trace(S_j);
      - for the generalized additive model, the loss function is the penalized log-likelihood;
      - the backfitting algorithm fits all predictors:
        - not feasible when p >> N.

  12. Tree-based Methods: introduction
      Consider a regression problem, with Y the response and X the input matrix.
      A tree is a recursive binary partition of the feature space:
      - at each step a region is divided into two (or more) regions,
        - until a stopping criterion applies;
      - at the end, the input space is split into M regions R_m;
      - a constant c_m is fitted in each region R_m.
      The final prediction is
          f̂(X) = Σ_{m=1}^M ĉ_m 1(X ∈ R_m),
      where ĉ_m is an estimate for the region R_m (e.g., ave(y_i | x_i ∈ R_m)).
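As a hedged illustration (assuming scikit-learn is available; the lecture describes CART-style trees in general, not this particular library), a regression tree with M = 4 regions can be fitted and used for prediction as follows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0] <= 0, 1.0, 3.0) + rng.normal(scale=0.2, size=300)

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)   # M = 4 terminal regions R_m

# Each prediction is the constant c_m_hat (the average of the y_i) of the
# region R_m into which the query point falls.
print(tree.predict([[-1.0, 0.0], [1.0, 1.5]]))
```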

  13. Tree-based Methods: introduction [figure]

  14. Tree-based Methods: introduction
      Note:
      - each split can be represented as a junction of a tree;
      - this representation also works for p > 2;
      - each observation is assigned to a branch at each junction;
      - the model is easy to interpret.

  15. Tree-based Methods: introduction [figure]

  16. How to grow a regression tree: split
      How to grow a regression tree:
      - we need to automatically decide the splitting variables ...
      - ... and the splitting points;
      - we need to decide the shape (topology) of the tree.
      Using a sum-of-squares criterion, Σ_{i=1}^N (y_i − f(x_i))^2:
      - the best ĉ_m is ave(y_i | x_i ∈ R_m);
      - finding the best partition in terms of minimum sum of squares is generally
        computationally infeasible
        → go greedy.

  17. How to grow a regression tree: greedy algorithm
      Starting with all data:
      - for each splitting variable X_j and split point s, define the two half-planes
        - R_1(j, s) = {X | X_j ≤ s};
        - R_2(j, s) = {X | X_j > s};
      - for each j and s, solve
            min_{j, s} [ min_{c_1} Σ_{x_i ∈ R_1(j, s)} (y_i − c_1)^2
                       + min_{c_2} Σ_{x_i ∈ R_2(j, s)} (y_i − c_2)^2 ];
      - the inner minimization is solved by
        - ĉ_1 = ave(y_i | x_i ∈ R_1(j, s));
        - ĉ_2 = ave(y_i | x_i ∈ R_2(j, s));
      - for each j, the choice of s can be done very quickly, so the identification of the
        best pair (j, s) is feasible (a minimal implementation is sketched below).
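A minimal sketch of the split search (numpy only, with a naive exhaustive scan over candidate split points for clarity; real implementations sort each variable once so that the best s is found much faster):

```python
import numpy as np

def best_split(X, y):
    """Return the pair (j, s) minimising RSS(R1) + RSS(R2) over all axis-aligned splits."""
    N, p = X.shape
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:         # candidate split points (keep R2 non-empty)
            left = X[:, j] <= s                   # R1(j, s) = {X | X_j <= s}
            right = ~left                         # R2(j, s) = {X | X_j >  s}
            c1, c2 = y[left].mean(), y[right].mean()   # inner minimisation: region means
            rss = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s

# Example: recover the first split of a simple synthetic data set.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] <= 0.4, 0.0, 2.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))                           # expected: j = 0, s close to 0.4
```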

  18. How to grow a regression tree: when to stop
      The tree size:
      - is a tuning parameter;
      - controls the model complexity;
      - its optimal value should be chosen from the data.
      Naive approach:
      - split a tree node only if the split yields a sufficient decrease in the sum of squares
        (e.g., larger than a pre-specified threshold);
        - intuitive;
        - short-sighted (a seemingly useless split may be preparatory for a good split below).
      Preferred strategy:
      - grow a large (pre-specified number of nodes) or complete tree T_0;
      - prune it (remove branches) to find the best tree.
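In scikit-learn terms (an assumption about tooling, not the lecture's notation), the two strategies roughly correspond to the following parameters:

```python
from sklearn.tree import DecisionTreeRegressor

# Naive approach: stop splitting a node when the (weighted) decrease in impurity
# falls below a pre-specified threshold.
naive_tree = DecisionTreeRegressor(min_impurity_decrease=0.01)

# Preferred strategy: grow a large tree and prune it back with cost-complexity
# pruning (penalty alpha; see the next slide).
pruned_tree = DecisionTreeRegressor(ccp_alpha=0.01)
```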

  19. How to grow a regression tree: cost-complexity pruning
      Consider a tree T ⊂ T_0, obtained by pruning T_0, and define:
      - R_m, the region defined by terminal node m;
      - |T|, the number of terminal nodes in T;
      - N_m, the number of observations in R_m, N_m = #{x_i ∈ R_m};
      - ĉ_m, the estimate in R_m, ĉ_m = N_m^{−1} Σ_{x_i ∈ R_m} y_i;
      - Q_m(T), the loss in R_m, Q_m(T) = N_m^{−1} Σ_{x_i ∈ R_m} (y_i − ĉ_m)^2.
      Then, the cost-complexity criterion is
          C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α |T|.

  20. How to grow a regression tree: cost-complexity pruning
      The idea is to find the subtree T_α̂ ⊂ T_0 which minimizes C_α(T):
      - for each α, find the unique subtree T_α which minimizes C_α(T);
      - through weakest link pruning:
        - successively collapse the internal node that produces the smallest increase in
          Σ_{m=1}^{|T|} N_m Q_m(T);
        - continue until the single-node tree;
        - T_α lies within this sequence of subtrees;
      - find α̂ via cross-validation.
      Here the tuning parameter α:
      - governs the trade-off between tree size and goodness of fit;
      - larger values of α correspond to smaller trees;
      - α = 0 → full tree T_0.
      A sketch of this procedure follows below.
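A hedged sketch of the whole procedure with scikit-learn (assuming its ccp_alpha parameter and cost_complexity_pruning_path method, which implement the weakest-link pruning described above); the data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# The pruning path: the sequence of alphas at which weakest-link pruning
# collapses an internal node of the full tree T_0.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Choose alpha_hat by cross-validation over the candidate values.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
alpha_hat = path.ccp_alphas[int(np.argmax(cv_scores))]

final_tree = DecisionTreeRegressor(ccp_alpha=alpha_hat, random_state=0).fit(X, y)
```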
