STK-IN4300 Statistical Learning Methods in Data Science
Riccardo De Bin (debin@math.uio.no)
Lecture 4
Outline of the lecture:
• Shrinkage Methods
• Lasso
• Comparison of Shrinkage Methods
• More on Lasso and Related Path Algorithms
Shrinkage Methods: ridge regression and PCR [figure]
Shrinkage Methods: bias and variance [figure]
Lasso: Least Absolute Shrinkage and Selection Operator

The lasso is similar to ridge regression, but with an $L_1$ penalty instead of the $L_2$ one:

$$\hat{\beta}^{\text{lasso}} = \operatorname{argmin}_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2, \quad \text{subject to } \sum_{j=1}^{p} |\beta_j| \le t,$$

or, in the equivalent Lagrangian form,

$$\hat{\beta}^{\text{lasso}}(\lambda) = \operatorname{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

• $X$ must be standardized;
• $\beta_0$ is again not included in the penalty term.
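As an illustration, a minimal sketch of fitting the lasso with scikit-learn on hypothetical simulated data (the data and the penalty value 0.1 are made up for the example; note that scikit-learn's `Lasso` divides the residual sum of squares by $2N$, so its `alpha` is not on exactly the same scale as $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# hypothetical data: N = 100 observations, p = 10 predictors, 2 truly relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# standardize X; the intercept beta_0 is fitted but not penalized
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)                 # alpha plays the role of lambda
lasso.fit(X_std, y)
print(lasso.intercept_, lasso.coef_)     # several coefficients are exactly 0
```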
Lasso: constrained estimation [figure]
Lasso: remarks

Due to the structure of the $L_1$ norm:
• some estimates are forced to be exactly 0 (variable selection);
• there is no closed form for the estimator.

From a Bayesian perspective:
• with a Laplace$(0, \tau^2)$ prior on $\beta$, $\hat{\beta}^{\text{lasso}}(\lambda)$ is the posterior mode estimate;
• for more details, see Park & Casella (2008).

Extreme situations:
• $\lambda \to 0$: $\hat{\beta}^{\text{lasso}}(\lambda) \to \hat{\beta}^{\text{OLS}}$;
• $\lambda \to \infty$: $\hat{\beta}^{\text{lasso}}(\lambda) \to 0$.
Lasso: shrinkage [figure]
Lasso: generalized linear models

The lasso (and ridge regression) penalty can be used with any generalized linear regression model, e.g., logistic regression.

In logistic regression, the lasso solution is

$$(\hat{\beta}_0, \hat{\beta}) = \operatorname{argmax}_{\beta_0, \beta} \Bigg\{ \sum_{i=1}^{N} \Big[ y_i (\beta_0 + \beta^T x_i) - \log\big(1 + e^{\beta_0 + \beta^T x_i}\big) \Big] - \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

Note:
• penalized logistic regression can be applied to problems with high-dimensional data (see Section 18.4).
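A minimal sketch of $L_1$-penalized logistic regression with scikit-learn (hypothetical data; in `LogisticRegression` the parameter `C` is the inverse of the penalty strength, roughly $C = 1/\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# hypothetical binary outcome depending on 2 of 20 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
prob = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))
y = rng.binomial(1, prob)

X_std = StandardScaler().fit_transform(X)

# the 'liblinear' (and 'saga') solvers support the L1 penalty
fit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X_std, y)
print(fit.coef_)   # a sparse coefficient vector: many entries exactly 0
```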
Comparison of Shrinkage Methods: coefficient profiles [figures]
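Coefficient profiles like those in the figures can be reproduced along these lines (a sketch on hypothetical simulated data; `lasso_path` and `Ridge` are scikit-learn functions, and the penalty grid for the ridge panel is an arbitrary choice for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(100, 8)))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# lasso: the whole coefficient path as a function of the penalty
alphas, lasso_coefs, _ = lasso_path(X, y)

# ridge: coefficients along a grid of penalties (they shrink, but never hit 0 exactly)
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in 100 * alphas])

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].plot(-np.log10(alphas), lasso_coefs.T)
axes[0].set_title("lasso")
axes[1].plot(-np.log10(alphas), ridge_coefs)
axes[1].set_title("ridge")
plt.show()
```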
More on Lasso and Related Path Algorithms: generalization

A generalization including both the lasso and ridge regression is bridge regression:

$$\hat{\beta}(\lambda) = \operatorname{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Bigg\}, \quad q \ge 0,$$

where:
• $q = 0$ → best subset selection;
• $q = 1$ → lasso;
• $q = 2$ → ridge regression.
More on Lasso and Related Path Algorithms: generalization

Note that:
• $0 < q \le 1$ → the penalty is not differentiable at 0;
• $1 < q < 2$ → compromise between lasso and ridge (but the penalty is differentiable ⇒ no variable selection property);
• $q$ defines the shape of the constraint region;
• $q$ could in principle be estimated from the data (as a tuning parameter), but in practice this does not work well (too much variance).

A naive numerical sketch of the bridge criterion is given below.
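A minimal, naive sketch of minimizing the bridge criterion numerically for a given $q$ (hypothetical function name and data; a gradient-free optimizer is used because the penalty is not differentiable for $q \le 1$, and it will not produce exact zeros — in practice dedicated algorithms such as coordinate descent or LARS are used for the lasso case):

```python
import numpy as np
from scipy.optimize import minimize

def bridge_fit(X, y, lam, q):
    """Naive minimization of sum_i (y_i - b0 - x_i' beta)^2 + lam * sum_j |beta_j|^q."""
    _, p = X.shape

    def objective(par):
        b0, beta = par[0], par[1:]
        resid = y - b0 - X @ beta
        return resid @ resid + lam * np.sum(np.abs(beta) ** q)

    res = minimize(objective, x0=np.zeros(p + 1), method="Nelder-Mead",
                   options={"maxiter": 50000, "fatol": 1e-10, "xatol": 1e-10})
    return res.x[0], res.x[1:]      # intercept, coefficients

# usage: q = 2 approximates ridge regression, q = 1 approximates the lasso
# b0, beta = bridge_fit(X, y, lam=1.0, q=1.5)
```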
More on Lasso and Related Path Algorithms: elastic net

A different compromise between lasso and ridge regression is the elastic net:

$$\hat{\beta}(\lambda) = \operatorname{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha |\beta_j| + (1 - \alpha) \beta_j^2 \big) \Bigg\}.$$

Idea:
• the $L_1$ penalty takes care of variable selection;
• the $L_2$ penalty helps in correctly handling correlated predictors;
• $\alpha$ defines how much $L_1$ and $L_2$ penalty should be used:
  – it is a tuning parameter, to be chosen in addition to $\lambda$;
  – a grid search is discouraged;
  – in real experiments it is often very close to 0 or 1.
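A minimal sketch with scikit-learn's `ElasticNet` on hypothetical data; there `l1_ratio` plays the role of $\alpha$ and `alpha` that of $\lambda$ (up to scaling constants, since scikit-learn divides the residual sum of squares by $2N$ and halves the quadratic penalty):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# l1_ratio corresponds to alpha on the slide: 1.0 is the lasso, 0.0 is ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print(enet.coef_)   # sparse, but correlated predictors tend to be kept together
```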
More on Lasso and Related Path Algorithms: elastic net

Comparing bridge regression (with $1 < q < 2$) and the elastic net:
• their constraint regions look very similar;
• there is nevertheless a huge practical difference, due to differentiability: the elastic net penalty is not differentiable at 0 and therefore retains the variable selection property, while the bridge penalty with $q > 1$ does not.
More on Lasso and Related Path Algorithms: Least Angle Regression

Least Angle Regression (LAR):
• can be viewed as a "democratic" version of forward selection;
• adds new predictors to the model sequentially, each only "as much as it deserves";
• eventually reaches the least-squares estimate;
• is strongly connected with the lasso:
  – the lasso can be seen as a special case of LAR;
  – LAR is often used to fit lasso models.
More on Lasso and Related Path Algorithms: LAR

Least Angle Regression:
1. standardize the predictors (mean zero, unit norm); initialize the residual $r = y - \bar{y}$ and the coefficient estimates $\hat{\beta}_1 = \dots = \hat{\beta}_p = 0$;
2. find the predictor $x_j$ most correlated with $r$;
3. move $\hat{\beta}_j$ from 0 towards its least-squares coefficient $\langle x_j, r \rangle$, until some other predictor $x_k$ ($k \ne j$) has as much correlation with the current residual, i.e. $\text{corr}(x_k, r) = \text{corr}(x_j, r)$;
4. add $x_k$ to the active set and move $\hat{\beta}_j$ and $\hat{\beta}_k$ towards their joint least-squares coefficient, until some other predictor $x_l$ has as much correlation with the current residual;
5. continue until all $p$ predictors have been entered.
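The LAR path can be computed with scikit-learn's `lars_path` (a sketch on hypothetical data; `method="lasso"` gives the lasso path via the LAR connection mentioned above):

```python
import numpy as np
from sklearn.linear_model import lars_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(100, 6)))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=100)

# method="lar": least angle regression; method="lasso": the lasso path
alphas, active, coefs = lars_path(X, y, method="lar")

print(active)        # order in which the predictors enter the active set
print(coefs[:, -1])  # final step: the full least-squares solution
```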
More on Lasso and Related Path Algorithms: comparison [figure]
More on Lasso and Related Path Algorithms: overfit [figure]
More on Lasso and Related Path Algorithms: other shrinkage methods

Group Lasso

Sometimes predictors belong to the same group:
• genes that belong to the same molecular pathway;
• dummy variables derived from the same categorical variable, . . .

Suppose the $p$ predictors are partitioned into $L$ groups; the group lasso minimizes

$$\min_{\beta} \Bigg\{ \Big\| y - \beta_0 \mathbf{1} - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \| \beta_\ell \|_2 \Bigg\},$$

where:
• $\sqrt{p_\ell}$ accounts for the group sizes;
• $\| \cdot \|_2$ denotes the (not squared) Euclidean norm, which is 0 if and only if all its components are 0;
• sparsity is encouraged at the group level (a whole group is either in or out); a computational sketch follows below.
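The group lasso is not available in scikit-learn; below is a minimal sketch of a proximal gradient (ISTA) solver, assuming $y$ is centred and $X$ standardized so the intercept can be dropped. The function names, the step-size choice and the use of the $\tfrac{1}{2}\|\cdot\|_2^2$ loss convention (which only rescales $\lambda$) are choices made for this example; dedicated packages (e.g. grpreg in R) are used in practice:

```python
import numpy as np

def group_soft_threshold(b, t):
    """Block soft-thresholding: the proximal operator of b -> t * ||b||_2."""
    norm = np.linalg.norm(b)
    return np.zeros_like(b) if norm <= t else (1.0 - t / norm) * b

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Proximal gradient (ISTA) sketch for the group lasso.
    `groups` is a list of index lists partitioning the columns of X."""
    _, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta)         # gradient of 0.5 * ||y - X beta||^2
        z = beta - step * grad
        for g in groups:                     # group-wise shrinkage with sqrt(p_l) weight
            beta[g] = group_soft_threshold(z[g], step * lam * np.sqrt(len(g)))
    return beta

# usage with hypothetical groups: beta = group_lasso(X, y, [[0, 1, 2], [3, 4], [5]], lam=5.0)
```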
More on Lasso and Related Path Algorithms: other shrinkage methods

Non-negative garrote

The idea of the lasso originates from the non-negative garrote,

$$\hat{\beta}^{\text{garrote}} = \operatorname{argmin}_{c} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} c_j \hat{\beta}_j x_{ij} \Big)^2, \quad \text{subject to } c_j \ge 0 \text{ and } \sum_{j} c_j \le t.$$

The non-negative garrote starts with the OLS estimates $\hat{\beta}_j$ and shrinks them:
• by non-negative factors $c_j$;
• the sum of the non-negative factors is constrained;
• for more information, see Breiman (1995).
More on Lasso and Related Path Algorithms: other shrinkage methods

In the case of an orthonormal design ($X^T X = I$),

$$c_j(\lambda) = \Bigg( 1 - \frac{\lambda}{\big(\hat{\beta}_j^{\text{OLS}}\big)^2} \Bigg)_{+},$$

where $\lambda$ is a tuning parameter (related to $t$).

Note that the solution depends on $\hat{\beta}^{\text{OLS}}$:
• it cannot be applied in $p \gg N$ problems;
• it may be a problem when $\hat{\beta}^{\text{OLS}}$ behaves poorly;
• it has the oracle properties (Yuan & Lin, 2006) ← see the oracle property below.
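A small numeric sketch of the garrote shrinkage in the orthonormal case, following the formula above (the function name and data are hypothetical; `X.T @ y` equals the OLS solution only because $X^T X = I$):

```python
import numpy as np

def garrote_orthonormal(X, y, lam):
    """Non-negative garrote in the orthonormal case (X'X = I):
    c_j = (1 - lam / beta_ols_j^2)_+ ; the garrote estimate is c_j * beta_ols_j."""
    beta_ols = X.T @ y                              # OLS solution when X'X = I
    c = np.clip(1.0 - lam / beta_ols ** 2, 0.0, None)
    return c, c * beta_ols
```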
More on Lasso and Related Path Algorithms: other shrinkage methods

[figure: comparison between lasso (left) and non-negative garrote (right); picture from Tibshirani, 1996]
More on Lasso and Related Path Algorithms: the oracle property

Let:
• $\mathcal{A} := \{ j : \beta_j \ne 0 \}$ be the set of the truly relevant coefficients;
• $\delta$ be a fitting procedure (lasso, non-negative garrote, . . . );
• $\hat{\beta}(\delta)$ be the coefficient estimator produced by $\delta$.

We would like $\delta$ to:
(a) identify the right subset model, $\{ j : \hat{\beta}_j(\delta) \ne 0 \} = \mathcal{A}$;
(b) have the optimal estimation rate, $\sqrt{n}\, \big( \hat{\beta}(\delta)_{\mathcal{A}} - \beta_{\mathcal{A}} \big) \xrightarrow{d} N(0, \Sigma)$, where $\Sigma$ is the covariance matrix of the true subset model.

If $\delta$ asymptotically satisfies (a) and (b), it is called an oracle procedure.