  1. Lecture 9: Regularized/penalized regression
     Felix Held, Mathematical Sciences
     MSA220/MVE440 Statistical Learning for Big Data
     15th April 2019

  2. Revisited: Expectation-Maximization (I)
     New target function: Maximize 𝔼_{r(Z)}[log p(X, Z | θ)] with respect to r(Z) and θ.
     Note:
        log p(X | θ) = 𝔼_{r(Z)}[log p(X, Z | θ)] − 𝔼_{r(Z)}[log p(Z | X, θ)]
     ▶ The left-hand side is independent of r(Z)
     ▶ The difference on the right-hand side therefore always has the same value, irrespective of the chosen r(Z)
     Choosing r(Z) is therefore a trade-off between 𝔼_{r(Z)}[log p(X, Z | θ)] and 𝔼_{r(Z)}[log p(Z | X, θ)].

  3. Revisited: Expectation-Maximization (II)
     1. Expectation step: For given parameters θ^(n), the density r(Z) = p(Z | X, θ^(n)) minimizes the second term and thereby maximizes the first one. Set
        Q(θ, θ^(n)) = 𝔼_{p(Z|X,θ^(n))}[log p(X, Z | θ)]
     2. Maximization step: Maximize the first term with
        θ^(n+1) = arg max_θ Q(θ, θ^(n))
     Note: Since 𝔼_{p(Z|X,θ^(n))}[log (p(Z | X, θ^(n)) / p(Z | X, θ^(n)))] = 0, it follows that
        log p(X | θ^(n)) = 𝔼_{p(Z|X,θ^(n))}[log p(X, Z | θ^(n))] − 𝔼_{p(Z|X,θ^(n))}[log p(Z | X, θ^(n))]
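
To make the two steps concrete, here is a minimal Python sketch (not from the slides) of EM for a two-component Gaussian mixture; the data, starting values and all variable names are invented for illustration. The E-step computes the responsibilities p(Z | X, θ^(n)) and the M-step maximizes Q(θ, θ^(n)), which has a closed form in this model.

```python
# Illustrative sketch (assumed example, not from the lecture): EM for a
# two-component 1D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)])

# Initial parameters theta^(0): mixing weight, means, standard deviations
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for it in range(100):
    # E-step: responsibilities r_ik = p(Z_i = k | x_i, theta^(n))
    w = np.column_stack([pi * normal_pdf(x, mu[0], sigma[0]),
                         (1 - pi) * normal_pdf(x, mu[1], sigma[1])])
    r = w / w.sum(axis=1, keepdims=True)

    # M-step: maximize Q(theta, theta^(n)); closed form for Gaussian mixtures
    n_k = r.sum(axis=0)
    pi = n_k[0] / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(pi, mu, sigma)  # should approach 0.4, (-2, 3), (1, 1.5)
```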

  4. Regularized/penalized regression

  5. Remember ordinary least-squares (OLS)
     Consider the model
        y = Xβ + ε
     ▶ y ∈ ℝ^n is the outcome, X ∈ ℝ^{n×(p+1)} is the design matrix, β ∈ ℝ^{p+1} are the regression coefficients, and ε ∈ ℝ^n is the additive error
     ▶ Five basic assumptions have to be checked: the underlying relationship is linear (1); the errors have zero mean (2), are uncorrelated (3), have constant variance (4) and are (roughly) normally distributed (5)
     ▶ Centring ( (1/n) ∑_{i=1}^n x_ij = 0 ) and standardisation ( (1/n) ∑_{i=1}^n x_ij² = 1 ) of predictors simplifies interpretation
     ▶ Centring the outcome ( (1/n) ∑_{i=1}^n y_i = 0 ) and the features removes the need to estimate the intercept (sketched in the example below)
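
The last point can be checked numerically. Below is a minimal sketch with synthetic data and made-up names (numpy only): after centring y and the columns of X, the fitted intercept is numerically zero.

```python
# Minimal sketch: centring the outcome and the predictors removes the intercept.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Centre y and centre/standardise the columns of X
yc = y - y.mean()
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

# OLS with an explicit intercept column on the centred data
Xc1 = np.column_stack([np.ones(n), Xc])
beta = np.linalg.lstsq(Xc1, yc, rcond=None)[0]
print(beta[0])  # the intercept is (numerically) zero after centring
```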

  6. Feature selection as motivation
     The OLS estimate is
        β̂_OLS = (X^T X)^{-1} X^T y
     An analytical solution exists when X^T X is invertible. This can be unstable or fail in case of
     ▶ high correlation between predictors, or
     ▶ if p > n.
     Solutions: Regularisation or feature selection
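
A small, hypothetical numpy example of both failure modes (synthetic data, invented names):

```python
# Minimal sketch: when the OLS normal equations become unstable or fail.
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print(np.linalg.cond(XtX))            # huge condition number
print(np.linalg.solve(XtX, X.T @ y))  # coefficients are typically huge and unstable

# p > n: X^T X (p x p) has rank at most n and cannot be inverted
X_wide = rng.normal(size=(10, 20))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))  # 10 < 20, so X^T X is singular
```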

  7. Filtering for feature selection
     ▶ Choose features through pre-processing (see the sketch below)
        ▶ Features with maximum variance
        ▶ Use only the first k PCA components
        ▶ Use a univariate criterion, e.g. F-score: features that correlate most with the response
     ▶ Examples of other useful measures
        ▶ Mutual information: reduction in uncertainty about x after observing y
        ▶ Variable importance: determine variable importance with random forests
     ▶ Summary
        ▶ Pro: Fast and easy
        ▶ Con: Filtering mostly operates on single features and is not geared towards a certain method
        ▶ Care with cross-validation and multiple testing necessary
        ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
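
As an illustration of the univariate criteria above, a minimal scikit-learn sketch; the synthetic data and all names are assumptions, not from the lecture.

```python
# Minimal filtering sketch: univariate F-score and mutual-information filters.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # only features 0 and 1 matter

# F-score filter: keep the 5 features that correlate most with the response
f_filter = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(np.flatnonzero(f_filter.get_support()))

# Mutual-information filter: same idea with a more general dependence measure
mi_filter = SelectKBest(score_func=mutual_info_regression, k=5).fit(X, y)
print(np.flatnonzero(mi_filter.get_support()))
```

Since both filters use y, in practice they would be applied inside each cross-validation fold (e.g. in a scikit-learn Pipeline) so that the selection does not leak information into the evaluation.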

  8. Wrapping for feature selection
     ▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance
     ▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance with e.g. cross-validation
     ▶ Forward selection: Start with just an intercept and add in each step the variable that improves fit the most (greedy algorithm; see the sketch below)
     ▶ Backward selection: Start with all variables included and then remove sequentially the one with the least impact (greedy algorithm)
     ▶ As discrete procedures, all of these methods exhibit high variance (small changes could lead to different predictors being selected, resulting in a potentially very different model)
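
A minimal sketch of greedy forward selection with synthetic data; for brevity variables are ranked by in-sample RSS, whereas in practice cross-validated performance would be compared.

```python
# Minimal greedy forward-selection sketch (assumed example).
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=n)

def rss(active):
    """RSS of OLS with an intercept plus the columns listed in `active`."""
    Xa = np.column_stack([np.ones(n)] + [X[:, j] for j in active])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return np.sum((y - Xa @ beta) ** 2)

active = []
for _ in range(3):  # add three variables greedily
    candidates = [j for j in range(p) if j not in active]
    best = min(candidates, key=lambda j: rss(active + [j]))
    active.append(best)
    print(active, rss(active))  # columns 2 and 5 should enter first
```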

  9. Embedding for feature selection
     ▶ Embed/include the feature selection into the model estimation procedure
     ▶ Ideally, penalization on the number of included features
        β̂ = arg min_β ‖y − Xβ‖₂² + λ ∑_{j=1}^p 1(β_j ≠ 0)
       However, discrete optimization problems are hard to solve
     ▶ Softer regularisation methods can help
        β̂ = arg min_β ‖y − Xβ‖₂² + λ‖β‖_q^q
       where λ is a tuning parameter and q ≥ 1 or q = ∞.

  10. Constrained regression
     The optimization problem
        β̂ = arg min_β ‖y − Xβ‖₂² + λ‖β‖_q^q
     is, for q ≥ 1, equivalent to the constrained problem
        arg min_β ‖y − Xβ‖₂²  subject to  ‖β‖_q^q ≤ t
     for a suitable t > 0; the penalized form is the Lagrangian of the constrained problem.
     ▶ Clear when q > 1: convex constraint + target function and both are differentiable
     ▶ Harder to prove for q = 1, but possible (e.g. with subgradients)

  11. Ridge regression
     For q = 2 the constrained problem is ridge regression
        β̂_ridge(λ) = arg min_β ‖y − Xβ‖₂² + λ‖β‖₂²
     where ‖β‖₂² = ∑_{j=1}^p β_j². An analytical solution exists if X^T X + λI_p is invertible:
        β̂_ridge(λ) = (X^T X + λI_p)^{-1} X^T y
     If X^T X = I_p, then
        β̂_ridge(λ) = β̂_OLS / (1 + λ),
     i.e. β̂_ridge(λ) is biased but has lower variance.
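
A minimal numerical sketch with synthetic data; scikit-learn's Ridge is used only as a cross-check, with its alpha playing the role of λ here.

```python
# Minimal ridge sketch: closed-form solution vs. a library solver.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)             # centred predictors, no intercept needed
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)
y = y - y.mean()

lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)

# sklearn's Ridge minimizes ||y - X beta||^2 + alpha * ||beta||^2, the same objective
print(Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_)
```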

  12. SVD and ridge regression
     Recall: The SVD of a matrix X ∈ ℝ^{n×p} (n ≥ p) was
        X = U D V^T
     The analytical solution for ridge regression becomes
        β̂_ridge(λ) = (X^T X + λI_p)^{-1} X^T y
                    = (V D² V^T + λI_p)^{-1} V D U^T y
                    = V (D² + λI_p)^{-1} D U^T y
                    = ∑_{j=1}^p v_j (d_j / (d_j² + λ)) u_j^T y
     Ridge regression acts most on principal components with lower eigenvalues, e.g. in the presence of correlation between features.
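
The SVD form can be verified directly. A minimal sketch (synthetic data, assumed names) that also prints the shrinkage factors d_j² / (d_j² + λ) relative to OLS:

```python
# Minimal sketch: ridge regression computed via the SVD of X.
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
lam = 5.0

beta_svd = Vt.T @ (d / (d ** 2 + lam) * (U.T @ y))
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_svd, beta_direct))          # True

print(d ** 2 / (d ** 2 + lam))  # shrinkage relative to OLS; strongest for small d_j
```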

  13. Effective degrees of freedom
     Recall the hat matrix H = X(X^T X)^{-1} X^T in OLS. Its trace is
        tr(H) = tr(X(X^T X)^{-1} X^T) = tr(X^T X (X^T X)^{-1}) = tr(I_p) = p,
     the degrees of freedom for the regression coefficients.
     In analogy, define for ridge regression
        H(λ) := X(X^T X + λI_p)^{-1} X^T
     and
        df(λ) := tr(H(λ)) = ∑_{j=1}^p d_j² / (d_j² + λ),
     the effective degrees of freedom.
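
A quick numerical check with synthetic data that the trace of H(λ) and the singular-value formula agree, and that df(λ) decreases from p towards 0 as λ grows:

```python
# Minimal sketch: effective degrees of freedom of ridge regression.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
d = np.linalg.svd(X, compute_uv=False)

for lam in [0.0, 1.0, 10.0, 100.0]:
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
    df_trace = np.trace(H)
    df_svd = np.sum(d ** 2 / (d ** 2 + lam))
    print(lam, round(df_trace, 3), round(df_svd, 3))  # identical; equals p = 6 at lam = 0
```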

  14. Lasso regression
     For q = 1 the constrained problem is known as the lasso
        β̂_lasso(λ) = arg min_β ‖y − Xβ‖₂² + λ‖β‖₁
     ▶ Smallest q in the penalty for which the constraint is still convex
     ▶ Performs feature selection
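
A minimal scikit-learn sketch with synthetic data contrasting the sparsity of the lasso with ridge; note that sklearn's Lasso scales the squared error by 1/(2n), so its alpha is not on the same scale as the λ above.

```python
# Minimal lasso sketch: the l1 penalty sets coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(8)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)

print(np.flatnonzero(lasso.coef_))    # only a few non-zero coefficients
print(np.count_nonzero(ridge.coef_))  # ridge keeps all 20 (shrunk) coefficients
```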

  15. Intuition for the penalties (I)
     Assume the OLS solution β̂_OLS exists and set r = y − Xβ̂_OLS. For the residual sum of squares (RSS) it follows that
        ‖y − Xβ‖₂² = ‖(Xβ̂_OLS + r) − Xβ‖₂² = ‖X(β − β̂_OLS) − r‖₂²
                   = (β − β̂_OLS)^T X^T X (β − β̂_OLS) − 2 r^T X(β − β̂_OLS) + r^T r
                   = (β − β̂_OLS)^T X^T X (β − β̂_OLS) + r^T r,
     since X^T r = 0 by the normal equations. The RSS contours are therefore ellipses (at least in 2D) centred on β̂_OLS.

  16. Intuition for the penalties (II)
     The least-squares RSS is minimized at β̂_OLS. If a constraint ‖β‖_q^q ≤ t is added, then the RSS is minimized by the closest β possible that fulfills the constraint.
     [Figure: RSS contour lines (blue) around β̂_OLS in the (β₁, β₂) plane; left panel: lasso constraint region with solution β̂_lasso, right panel: ridge constraint region with solution β̂_ridge.]

  17. Intuition for the penalties (III)
     Depending on q, the different constraints lead to different solutions. If β̂_OLS is in one of the coloured areas or on a line, the corresponding constrained solution will be at the dot.
     [Figure: four panels with constraint regions for q = 0.7, q = 1, q = 2 and q = ∞ in the (β₁, β₂) plane, with dots marking where the constrained solutions end up.]

  18. Computational aspects of the Lasso (I)
     ▶ What estimates does the lasso produce?
     ▶ How do we find the solution β̂ in presence of the non-differentiable penalisation ‖β‖₁?
     Target function:
        arg min_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁
     Special case: X^T X = I_p. Then
        (1/2)‖y − Xβ‖₂² + λ‖β‖₁ = (1/2) y^T y − y^T X β + (1/2) β^T β + λ‖β‖₁
     where y^T X = β̂_OLS^T, since β̂_OLS = (X^T X)^{-1} X^T y = X^T y in this case.

  19. Computational aspects of the Lasso (II)
     For X^T X = I_p the target function can be written as
        arg min_β ∑_{j=1}^p ( (1/2) β_j² − β̂_OLS,j β_j + λ|β_j| )
     This results in p uncoupled optimization problems.
     ▶ If β̂_OLS,j > 0, then β_j ≥ 0 to minimize the target
     ▶ If β̂_OLS,j ≤ 0, then β_j ≤ 0
     Each case results in
        β̂_j = sign(β̂_OLS,j) (|β̂_OLS,j| − λ)₊ = ST(β̂_OLS,j, λ),
     where (x)₊ = x if x > 0 and 0 otherwise, and ST is the soft-thresholding operator.
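
A minimal sketch of this special case: the columns of X are orthonormalized so that X^T X = I_p, the lasso solution is obtained by soft-thresholding the OLS coefficients, and scikit-learn's Lasso is used only as a cross-check, with its alpha rescaled to match the objective above. The data and names are assumptions for illustration.

```python
# Minimal sketch: lasso under an orthonormal design equals soft-thresholded OLS.
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(b, lam):
    """ST(b, lam) = sign(b) * max(|b| - lam, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

rng = np.random.default_rng(9)
n, p = 100, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))    # orthonormal columns: Q^T Q = I_p
y = Q @ np.array([2.0, -0.1, 0.0, 1.0, 0.05]) + rng.normal(scale=0.1, size=n)

lam = 0.3
beta_ols = Q.T @ y                              # OLS when X^T X = I_p
beta_lasso = soft_threshold(beta_ols, lam)

# sklearn minimizes (1/(2n))||y - X beta||^2 + alpha ||beta||_1,
# so alpha = lam / n matches the objective (1/2)||y - X beta||^2 + lam ||beta||_1
check = Lasso(alpha=lam / n, fit_intercept=False).fit(Q, y).coef_
print(beta_lasso)                               # small coefficients are set to zero
print(np.allclose(beta_lasso, check, atol=1e-3))  # agrees up to solver tolerance
```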
