  1. Lecture 10: Regularized/penalized regression (cont’d) Felix Held, Mathematical Sciences MSA220/MVE440 Statistical Learning for Big Data 2nd May 2019

  2. A short recap

  3. Goals of modelling
1. Predictive strength: How well can we reconstruct the observed data? Has been most important so far.
2. Model/variable selection: Which variables are part of the true model? This is about uncovering structure to allow for mechanistic understanding.

  4. Feature selection
Feature selection can be addressed in multiple ways
▶ Filtering: Remove variables before the actual model for the data is built
  ▶ Often crude but fast
  ▶ Typically only pays attention to one or two features at a time (e.g. F-score, MIC) or does not take the outcome variable into consideration (e.g. PCA)
▶ Wrapping: Consider the selected features as an additional hyper-parameter
  ▶ Computationally very heavy
  ▶ Most approximations are greedy algorithms
▶ Embedding: Include feature selection into parameter estimation through penalisation of the model coefficients
  ▶ The naive form is equally computationally heavy as wrapping
  ▶ Soft constraints create biased but useful approximations

  5. Penalised regression
The optimization problem
β̂ = argmin_β ½‖y − Xβ‖²₂ + λ‖β‖_q^q for q > 0
is equivalent to
argmin_β ½‖y − Xβ‖²₂ subject to ‖β‖_q^q ≤ t
▶ For q = 2 known as ridge regression, for q = 1 known as the lasso
▶ Constraints are convex for all q ≥ 1 but not differentiable in β = 0 for q = 1
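
As a concrete illustration of the penalised least-squares problem above, the following minimal Python/numpy sketch (my own example on synthetic data, not taken from the lecture) fits ridge regression via its closed form β̂ = (XᵀX + λI_p)⁻¹Xᵀy for a few λ values; the lasso (q = 1) has no closed form in general and is treated later via coordinate descent.

# Minimal sketch (not from the lecture): ridge regression via its closed form
# beta_ridge = (X^T X + lambda * I)^(-1) X^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # only three active coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^(-1) X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    b = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||beta||_2={np.linalg.norm(b):.3f}")
# Increasing lambda shrinks the coefficients toward zero but (unlike the
# lasso) does not set them exactly to zero.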

  6. Intuition for the penalties (I)
Assume the OLS solution β_OLS exists and set r = y − Xβ_OLS. For the residual sum of squares (RSS) it then follows that
‖y − Xβ‖²₂ = ‖(Xβ_OLS + r) − Xβ‖²₂ = ‖X(β − β_OLS) − r‖²₂
           = (β − β_OLS)ᵀXᵀX(β − β_OLS) − 2rᵀX(β − β_OLS) + rᵀr
which is an ellipse (at least in 2D) centred on β_OLS.
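
The identity above is easy to check numerically. The following sketch (my own illustration on synthetic data) verifies it and also shows that the cross term vanishes because the OLS residual r is orthogonal to the columns of X, which is why the RSS contours are ellipses centred at β_OLS.

# Quick numerical check (my own illustration) of the RSS identity from the
# slide: with r = y - X beta_OLS,
#   ||y - X beta||^2 = (beta - beta_OLS)^T X^T X (beta - beta_OLS)
#                      - 2 r^T X (beta - beta_OLS) + r^T r.
# Since r is orthogonal to the columns of X, the middle term is zero.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

beta = rng.normal(size=p)                      # an arbitrary coefficient vector
lhs = np.sum((y - X @ beta) ** 2)
d = beta - beta_ols
rhs = d @ X.T @ X @ d - 2 * r @ X @ d + r @ r

print(np.isclose(lhs, rhs))                    # True
print(np.allclose(X.T @ r, 0))                 # True: the cross term vanishes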

  7. Intuition for the penalties (II)
The least squares RSS is minimized for β_OLS. If a constraint ‖β‖_q^q ≤ t is added, then the RSS is minimized by the closest β that fulfills the constraint.
[Figure: Lasso and Ridge constraint regions in the (β_1, β_2) plane with β_OLS, β_lasso and β_ridge marked. The blue lines are the contour lines of the RSS.]

  8. Intuition for the penalties (III)
Depending on q the different constraints lead to different solutions. If β_OLS is in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.
▶ Convexity only for q ≥ 1
▶ Sparsity only for q ≤ 1
[Figure: constraint regions in the (β_1, β_2) plane for q = 0.7, 1, 2 and ∞.]

  9. Shrinkage and effective degrees of freedom
When λ is fixed, the shrinkage of the lasso estimate β_lasso(λ) compared to the OLS estimate β_OLS is defined as
s(λ) = ‖β_lasso(λ)‖₁ / ‖β_OLS‖₁
Note: s(λ) ∈ [0, 1] with s(λ) → 0 for increasing λ and s(λ) = 1 for λ = 0.
For ridge regression define
H(λ) := X(XᵀX + λI_p)⁻¹Xᵀ and df(λ) := tr(H(λ)) = ∑_{j=1}^{p} d_j² / (d_j² + λ),
the effective degrees of freedom, where the d_j are the singular values of X.
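
A small numerical check of the two expressions for df(λ) (a sketch on synthetic data, not from the lecture): the trace of H(λ) and the singular-value formula agree, and df(λ) shrinks from p towards 0 as λ grows.

# Sketch (synthetic data): the ridge effective degrees of freedom
# df(lambda) = tr(H(lambda)) with H(lambda) = X (X^T X + lambda I)^(-1) X^T
# equals sum_j d_j^2 / (d_j^2 + lambda), where d_j are the singular values of X.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 8
X = rng.normal(size=(n, p))
d = np.linalg.svd(X, compute_uv=False)        # singular values of X

for lam in [0.0, 1.0, 10.0, 100.0]:
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    df_trace = np.trace(H)
    df_svd = np.sum(d**2 / (d**2 + lam))
    print(f"lambda={lam:6.1f}  tr(H)={df_trace:.3f}  SVD formula={df_svd:.3f}")
# df(0) = p and df(lambda) decreases toward 0 as lambda grows.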

  10. A regularisation path
Prostate cancer dataset (n = 67, p = 8). Red dashed lines indicate the λ selected by cross-validation.
[Figure: Ridge and Lasso coefficient paths, plotted against the effective degrees of freedom / the shrinkage s(λ) and against log(λ).]
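
The prostate data themselves are not part of this transcript, but a regularisation path of the kind shown in the figure is easy to trace for ridge regression. The sketch below (synthetic data with n = 67 and p = 8 to mimic the setting; the coefficient values and the λ grid are my own choices) computes the coefficients and effective degrees of freedom over a λ grid; plotting the coefficients against df or log(λ) gives the ridge panels of the figure, while a lasso path would be obtained with coordinate descent (see slide 20).

# Sketch of a ridge regularisation path on synthetic data (the slide uses the
# prostate cancer data, which are not reproduced here): coefficients and
# effective degrees of freedom over a grid of lambda values.
import numpy as np

rng = np.random.default_rng(3)
n, p = 67, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardise predictors
y = X @ np.array([0.7, 0.3, 0.0, 0.0, -0.25, 0.0, 0.1, 0.0]) + rng.normal(size=n)

lambdas = np.logspace(-5, 10, 50, base=np.e)   # grid on the log(lambda) scale
d = np.linalg.svd(X, compute_uv=False)

path = []
for lam in lambdas:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    df = np.sum(d**2 / (d**2 + lam))
    path.append((np.log(lam), df, beta))

for log_lam, df, beta in path[::10]:
    print(f"log(lambda)={log_lam:6.2f}  df={df:5.2f}  max|beta|={np.abs(beta).max():.3f}")
# Plotting beta against df (or log(lambda)) reproduces the shape of the ridge
# panels in the figure.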

  11. Connection to classification

  12. Recall: Regularised Discriminant Analysis (RDA)
Given training samples (g_i, x_i), quadratic DA models
p(x|g) = N(x|μ_g, Σ_g) and p(g) = π_g
Estimates μ̂_g, Σ̂_g and π̂_g are straightforward to find, …
…but evaluating the normal density requires inversion of Σ̂_g. If it is (near-)singular, this can lead to numerical instability.
Penalisation can help here:
▶ Use Σ̂_g = Σ̂_g^QDA + λΔ̂ for λ > 0 and a diagonal matrix Δ̂
▶ Use LDA (i.e. Σ_g = Σ) and Σ̂ = Σ̂^LDA + λΔ̂ for λ > 0
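
A minimal sketch of the penalisation idea in the second bullet (my own illustration; in particular, taking Δ̂ to be the diagonal of the pooled covariance is just one possible choice of diagonal matrix): with p > n the pooled within-class covariance Σ̂^LDA is singular, while Σ̂^LDA + λΔ̂ is invertible and can be plugged into the discriminant functions.

# Minimal sketch of the penalised-LDA idea from the slide: when p > n the
# pooled within-class covariance is singular, but Sigma_hat + lambda * Delta
# (Delta diagonal; here the diagonal of Sigma_hat, one possible choice) is
# invertible and can be used in the discriminant functions.
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 30, 50, 3                            # p > n: singular pooled covariance
g = np.repeat(np.arange(K), n // K)            # class labels, 10 per class
X = rng.normal(size=(n, p))

means = np.stack([X[g == k].mean(axis=0) for k in range(K)])
centred = X - means[g]                         # subtract the class centroid
Sigma_lda = centred.T @ centred / (n - K)      # pooled within-class covariance

print(np.linalg.matrix_rank(Sigma_lda))        # at most n - K < p: singular

lam = 1.0
Delta = np.diag(np.diag(Sigma_lda))            # an assumed choice of diagonal matrix
Sigma_reg = Sigma_lda + lam * Delta
direction = np.linalg.solve(Sigma_reg, means[0] - means[1])   # now well defined
print(direction.shape)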

  13. Recall: Naive Bayes LDA
Naive Bayes LDA means that we assume Σ̂ = Δ̂ for a diagonal matrix Δ̂. The diagonal elements are estimated as
Δ̂²_kk = 1/(n − K) ∑_{g=1}^{K} ∑_{i: g_i = g} (x_{ik} − μ̂_{g,k})²,
which is the pooled within-class variance. Classification is performed by evaluating the discriminant functions
δ_g(x) = −½ (x − μ̂_g)ᵀ Δ̂⁻¹ (x − μ̂_g) + log(π̂_g)
and by choosing ĝ(x) = argmax_g δ_g(x) as the predicted class.
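
A short sketch of naive Bayes LDA as defined above (my own example on synthetic Gaussian data): estimate centroids, priors and the pooled within-class variances, then classify with the discriminant functions δ_g(x).

# Sketch of naive Bayes LDA as on the slide: diagonal pooled within-class
# variances and the discriminant functions
#   delta_g(x) = -1/2 (x - mu_g)^T Delta^(-1) (x - mu_g) + log(pi_g).
import numpy as np

rng = np.random.default_rng(5)
n_per, p, K = 40, 20, 3
mus = rng.normal(scale=2.0, size=(K, p))
X = np.vstack([mus[k] + rng.normal(size=(n_per, p)) for k in range(K)])
g = np.repeat(np.arange(K), n_per)
n = K * n_per

centroids = np.stack([X[g == k].mean(axis=0) for k in range(K)])
pi_hat = np.array([np.mean(g == k) for k in range(K)])
# Pooled within-class variances (diagonal of Delta_hat)
var_pooled = np.sum((X - centroids[g]) ** 2, axis=0) / (n - K)

def predict(x):
    """Pick the class with the largest discriminant value."""
    scores = [-0.5 * np.sum((x - centroids[k]) ** 2 / var_pooled) + np.log(pi_hat[k])
              for k in range(K)]
    return int(np.argmax(scores))

preds = np.array([predict(x) for x in X])
print("training accuracy:", np.mean(preds == g))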

  14. Shrunken centroids (I)
In high-dimensional problems, centroids will
▶ contain noise
▶ be hard to interpret when all variables are active
As in regression, we would like to perform variable selection and reduce noise.
Note: The class centroids solve
μ̂_g = argmin_w ½ ∑_{i: g_i = g} ‖x_i − w‖²₂
Nearest shrunken centroids performs variable selection and stabilises centroid estimates by solving
μ̂ˢ_g = argmin_w ½ ∑_{i: g_i = g} ‖(Δ̂ + s₀I_p)^(−1/2)(x_i − w)‖²₂ + λ √((n − n_g)n_g / n) ‖w − μ̂_O‖₁
where μ̂_O is the overall centroid.

  15. Shrunken centroids (II)
Nearest shrunken centroids:
μ̂ˢ_g = argmin_w ½ ∑_{i: g_i = g} ‖(Δ̂ + s₀I_p)^(−1/2)(x_i − w)‖²₂ + λ √((n − n_g)n_g / n) ‖w − μ̂_O‖₁
▶ Penalises the distance of the class centroid to the overall centroid μ̂_O
▶ Δ̂ + s₀I_p is the diagonal regularised within-class covariance matrix. Leads to greater weights for variables that are less variable across samples (interpretability)
▶ √((n − n_g)n_g / n) is only there for technical reasons
▶ If the predictors are centred (μ̂_O = 0) this is a scaled lasso problem

  16. Shrunken centroids (III)
The solution for component k can be derived using subdifferentials as
μ̂ˢ_{g,k} = μ̂_{O,k} + m_g (Δ̂_kk + s₀) ST(d_{g,k}, λ)
where d_{g,k} = (μ̂_{g,k} − μ̂_{O,k}) / (m_g (Δ̂_kk + s₀)) and m_g = √(1/n_g − 1/n).
Note: λ is a tuning parameter and has to be determined through e.g. cross-validation.
▶ Typically, the misclassification rate improves at first with increasing λ but deteriorates again when λ becomes too large.
▶ The larger λ, the more components will be equal to the respective component of the overall centroid.
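
The component-wise shrinkage formula translates directly into code. The sketch below (my own example on synthetic data; taking Δ̂_kk to be the pooled within-class standard deviation and picking s₀ and λ ad hoc are assumptions, not from the slides) shrinks each class centroid towards the overall centroid and reports how many features still differ from it.

# Sketch of the nearest-shrunken-centroids update from the slide: each
# centroid component is pulled toward the overall centroid by soft
# thresholding. Delta_kk is taken as the pooled within-class standard
# deviation and s_0 as a small ad hoc constant (both assumptions here).
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(6)
n_per, p, K = 25, 100, 3
mus = np.zeros((K, p))
mus[0, :5] = 2.0                                 # few informative features
mus[1, 5:10] = -2.0
X = np.vstack([mus[k] + rng.normal(size=(n_per, p)) for k in range(K)])
g = np.repeat(np.arange(K), n_per)
n = K * n_per
n_g = np.array([np.sum(g == k) for k in range(K)])

centroids = np.stack([X[g == k].mean(axis=0) for k in range(K)])
overall = X.mean(axis=0)
Delta = np.sqrt(np.sum((X - centroids[g]) ** 2, axis=0) / (n - K))  # pooled sd
s0 = 0.1                                         # ad hoc stabiliser
m = np.sqrt(1.0 / n_g - 1.0 / n)                 # m_g from the slide

lam = 2.0
d = (centroids - overall) / (m[:, None] * (Delta + s0))   # standardised differences
shrunken = overall + m[:, None] * (Delta + s0) * soft_threshold(d, lam)

active = np.any(shrunken != overall, axis=0)
print("features still contributing:", active.sum(), "of", p)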

  17. Application of nearest shrunken centroids (I)
A gene expression data set with n = 63 and p = 2308. There are four classes (cancer subtypes) with n_BL = 8, n_EWS = 23, n_NB = 12, and n_RMS = 20.
[Figure: 5-fold cross-validation curve of the misclassification rate over λ; the largest λ that leads to the minimal misclassification rate is selected.]

  18. Application of nearest shrunken centroids (II)
[Figure: Average expression per gene for the four classes BL, EWS, NB and RMS. Grey lines show the original centroids and red lines show the shrunken centroids.]

  19. General calculation of the lasso estimates

  20. Calculation of the lasso estimate
Last lecture: When XᵀX = I_p and β_OLS are the OLS estimates, then
β̂_lasso,k(λ) = sign(β_OLS,k)(|β_OLS,k| − λ)₊ = ST(β_OLS,k, λ)
where y₊ = max(y, 0) and ST is the soft-thresholding operator.
What about the general case?
Coordinate descent: The lasso problem
argmin_β ½‖y − Xβ‖²₂ + λ‖β‖₁
can be written in coordinates (omitting terms not dependent on any β_j) as
argmin_{β_1,…,β_p} ½ ∑_{j,k=1}^{p} β_j β_k x_jᵀx_k − ∑_{i=1}^{n} ∑_{j=1}^{p} y_i x_{ij} β_j + λ ∑_{j=1}^{p} |β_j|
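
Writing the problem coordinate-wise leads to cyclic coordinate descent: holding all other coefficients fixed, the optimal β_j is a soft-thresholded least-squares update, β_j ← ST(x_jᵀr_j, λ) / ‖x_j‖²₂ with the partial residual r_j = y − ∑_{k≠j} x_k β_k. The following sketch (my own implementation on synthetic data, not the lecture's code) applies this update.

# Sketch of cyclic coordinate descent for the lasso, the general-case
# algorithm the slide is building toward: each coordinate is updated by
# soft thresholding, beta_j <- ST(x_j^T r_j, lambda) / ||x_j||^2, where r_j
# is the partial residual with beta_j left out.
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual without x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(7)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=20.0)
print("nonzero coefficients:", np.nonzero(beta_hat)[0])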

  21. Subderivative and subdifferential
Let f : I → ℝ be a convex function on an open interval I and x₀ ∈ I. A c ∈ ℝ is called a subderivative of f at x₀ if
f(x) − f(x₀) ≥ c(x − x₀)
It can be shown that for
a = lim_{x→x₀⁻} (f(x) − f(x₀)) / (x − x₀) and b = lim_{x→x₀⁺} (f(x) − f(x₀)) / (x − x₀)
all c ∈ [a, b] are subderivatives. Call ∂f(x₀) := [a, b] the subdifferential of f at x₀.
Example: Let f(x) = |x|, then
∂f(x₀) = {−1} if x₀ < 0, [−1, 1] if x₀ = 0, {+1} if x₀ > 0.
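
To connect the subdifferential to the soft-thresholding operator ST from the previous slide, here is a standard one-dimensional argument (a worked example added here, not part of the slide). Consider the one-dimensional lasso problem
argmin_x ½(x − z)² + λ|x|.
A minimum requires
0 ∈ ∂(½(x − z)² + λ|x|) = {x − z} + λ∂|x|.
Using ∂|x| from the example above: if x̂ > 0 then x̂ = z − λ (valid for z > λ), if x̂ < 0 then x̂ = z + λ (valid for z < −λ), and x̂ = 0 is optimal exactly when |z| ≤ λ. Combining the three cases gives
x̂ = sign(z)(|z| − λ)₊ = ST(z, λ),
which is the coordinate-wise update used in coordinate descent for the lasso.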
