Lecture 9: Regularized/penalized regression
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
15th April 2019
Revisited: Expectation-Maximization (I)

New target function: Maximize E_{q(z)}[log p(x, z | θ)] with respect to q(z) and θ.

Note:
  log p(x | θ) = E_{q(z)}[log p(x, z | θ)] − E_{q(z)}[log p(z | x, θ)]

▶ The left hand side is independent of q(z)
▶ The difference on the right hand side always has the same value, irrespective of the chosen q(z)

Choosing q(z) is therefore a trade-off between E_{q(z)}[log p(x, z | θ)] and E_{q(z)}[log p(z | x, θ)].
Revisited: Expectation-Maximization (II)

1. Expectation step: For given parameters θ^(l), the density q(z) = p(z | x, θ^(l)) minimizes the second term and thereby maximizes the first one. Set
   Q(θ, θ^(l)) = E_{p(z|x,θ^(l))}[log p(x, z | θ)]
2. Maximization step: Maximize the first term with
   θ^(l+1) = arg max_θ Q(θ, θ^(l))

Note: With q(z) = p(z | x, θ^(l)) it follows that
   log p(x | θ^(l)) = E_{p(z|x,θ^(l))}[log p(x, z | θ^(l))] − E_{p(z|x,θ^(l))}[log p(z | x, θ^(l))]
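As a concrete illustration of the two steps above, here is a minimal numpy/scipy sketch of EM for a two-component univariate Gaussian mixture. This is not the course's reference implementation; the function name, the initialisation and the fixed iteration count are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100):
    """Minimal EM sketch for a two-component univariate Gaussian mixture."""
    # Crude initialisation of theta^(0) = (pi, mu_1, mu_2, sd_1, sd_2)
    pi, mu = 0.5, np.percentile(x, [25, 75])
    sd = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: responsibilities q(z) = p(z | x, theta^(l))
        d1 = pi * norm.pdf(x, mu[0], sd[0])
        d2 = (1 - pi) * norm.pdf(x, mu[1], sd[1])
        r = d1 / (d1 + d2)
        # M-step: maximise Q(theta, theta^(l)); closed form for a Gaussian mixture
        pi = r.mean()
        mu = np.array([np.average(x, weights=r), np.average(x, weights=1 - r)])
        sd = np.sqrt([np.average((x - mu[0])**2, weights=r),
                      np.average((x - mu[1])**2, weights=1 - r)])
    return pi, mu, sd
```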
Regularized/penalized regression
Remember ordinary least-squares (OLS)

Consider the model
  y = Xβ + ε
where
▶ y ∈ R^n is the outcome, X ∈ R^{n×(p+1)} is the design matrix, β ∈ R^{p+1} are the regression coefficients, and ε ∈ R^n is the additive error
▶ Five basic assumptions have to be checked: the underlying relationship is linear (1); the errors have zero mean (2), are uncorrelated (3), have constant variance (4), and are (roughly) normally distributed (5)
▶ Centring ( (1/n) ∑_{i=1}^n x_{ij} = 0 ) and standardisation ( (1/n) ∑_{i=1}^n x_{ij}^2 = 1 ) of the predictors simplifies interpretation
▶ Centring the outcome ( (1/n) ∑_{i=1}^n y_i = 0 ) and the features removes the need to estimate the intercept
Feature selection as motivation

An analytical solution exists when X^T X is invertible:
  β̂_OLS = (X^T X)^{-1} X^T y
This can be unstable or fail in case of
▶ high correlation between predictors, or
▶ if p > n.
Solutions: Regularisation or feature selection.
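A short numpy sketch (the simulated data set is made up for illustration) shows how nearly collinear predictors make the normal-equation solution ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # nearly collinear columns
beta_true = np.array([2.0, 0.0, 1.0, 0.0, -1.0])
y = X @ beta_true + rng.standard_normal(n)

# OLS via the normal equations; X^T X is nearly singular here
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.cond(X.T @ X))   # huge condition number -> unstable estimates
print(beta_ols)                  # coefficients on the collinear pair blow up
```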
Filtering for feature selection

▶ Choose features through pre-processing
  ▶ Features with maximum variance
  ▶ Use only the first q PCA components
  ▶ Use a univariate criterion, e.g. the F-score: features that correlate most with the response (see the sketch below)
▶ Examples of other useful measures
  ▶ Mutual information: reduction in uncertainty about the response after observing a feature
  ▶ Variable importance: determine variable importance with random forests
▶ Summary
  ▶ Pro: fast and easy
  ▶ Con: filtering mostly operates on single features and is not geared towards a certain method
  ▶ Care with cross-validation and multiple testing is necessary
  ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
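As an illustration of univariate filtering, the following sketch uses scikit-learn's SelectKBest with the F-score and mutual-information criteria. The simulated data set and the choice k = 10 are arbitrary, not part of the slides.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Univariate F-score filter: keep the 10 features most correlated with the response
f_filter = SelectKBest(score_func=f_regression, k=10).fit(X, y)
print(np.flatnonzero(f_filter.get_support()))

# Mutual-information filter as an alternative univariate criterion
mi_filter = SelectKBest(score_func=mutual_info_regression, k=10).fit(X, y)
print(np.flatnonzero(mi_filter.get_support()))
```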
Wrapping for feature selection

▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance
▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance with e.g. cross-validation
▶ Forward selection: Start with just an intercept and add in each step the variable that improves the fit the most (greedy algorithm; see the sketch after this list)
▶ Backward selection: Start with all variables included and then remove sequentially the one with the least impact (greedy algorithm)
▶ As discrete procedures, all of these methods exhibit high variance (small changes could lead to different predictors being selected, resulting in a potentially very different model)
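A sketch of greedy forward selection using scikit-learn's SequentialFeatureSelector (available in scikit-learn >= 0.24); the simulated data and the target of four selected features are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=3.0, random_state=1)

# Greedy forward selection: at each step add the variable that improves CV fit the most
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features
```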
Embedding for feature selection

▶ Embed/include the feature selection into the model estimation procedure
▶ Ideally, penalize the number of included features
    β̂ = arg min_β ‖y − Xβ‖_2^2 + λ ∑_{j=1}^p 1(β_j ≠ 0)
  However, discrete optimization problems are hard to solve.
▶ Softer regularisation methods can help
    β̂ = arg min_β ‖y − Xβ‖_2^2 + λ‖β‖_q^q
  where λ is a tuning parameter and q ≥ 1 or q = ∞.
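To see why the counting penalty leads to a hard problem, here is a brute-force best-subset sketch (the function name is made up for illustration) whose run time grows exponentially in the number of features:

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Brute-force minimiser of ||y - X beta||^2 + lam * (number of nonzero betas).
    Enumerates all subsets, so it is only feasible for very small p."""
    n, p = X.shape
    best_cost, best_beta = np.sum(y**2), np.zeros(p)   # cost of the empty model
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            cols = list(S)
            beta_S, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            cost = np.sum((y - X[:, cols] @ beta_S)**2) + lam * k
            if cost < best_cost:
                best_cost = cost
                best_beta = np.zeros(p)
                best_beta[cols] = beta_S
    return best_beta
```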
Constrained regression

The optimization problem
  β̂ = arg min_β ‖y − Xβ‖_2^2 + λ‖β‖_q^q
for λ > 0 is equivalent to
  arg min_β ‖y − Xβ‖_2^2   subject to   ‖β‖_q^q ≤ t
when q ≥ 1. The former is the Lagrangian of the constrained problem.
▶ Clear when q > 1: convex constraint + target function and both are differentiable
▶ Harder to prove for q = 1, but possible (e.g. with subgradients)
Ridge regression

For q = 2 the constrained problem is ridge regression
  β̂_ridge(λ) = arg min_β ‖y − Xβ‖_2^2 + λ‖β‖_2^2
where ‖β‖_2^2 = ∑_{j=1}^p β_j^2.

An analytical solution exists if X^T X + λ I_p is invertible:
  β̂_ridge(λ) = (X^T X + λ I_p)^{-1} X^T y
If X^T X = I_p, then β̂_ridge(λ) = β̂_OLS / (1 + λ), i.e. β̂_ridge(λ) is biased but has lower variance.
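A quick numerical check (simulated data, names illustrative) that the closed-form ridge solution matches scikit-learn's Ridge, whose alpha parameter plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
lam = 5.0

# Closed-form ridge solution (X^T X + lam I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# sklearn's Ridge with alpha = lam solves the same penalised problem
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))
```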
SVD and ridge regression

Recall: The SVD of a matrix X ∈ R^{n×p} (n ≥ p) is X = U D V^T.

The analytical solution for ridge regression becomes
  β̂_ridge(λ) = (X^T X + λ I_p)^{-1} X^T y
             = (V D^2 V^T + λ I_p)^{-1} V D U^T y
             = V (D^2 + λ I_p)^{-1} D U^T y
and therefore
  X β̂_ridge(λ) = ∑_{j=1}^p u_j (d_j^2 / (d_j^2 + λ)) u_j^T y.

Ridge regression acts most on the principal components with lower eigenvalues, e.g. in presence of correlation between features.
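The shrinkage interpretation can be checked numerically. The following sketch (simulated data with two correlated columns) computes the ridge fit through the SVD and prints the per-direction shrinkage factors d_j^2 / (d_j^2 + λ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
X[:, 3] = X[:, 2] + 0.1 * rng.standard_normal(n)   # correlation -> one small singular value
y = rng.standard_normal(n)
lam = 10.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T

# Ridge fit via the SVD: V (D^2 + lam I)^{-1} D U^T y
beta_ridge = Vt.T @ (d / (d**2 + lam) * (U.T @ y))

# Shrinkage factors per principal direction: small singular values are shrunk the most
print(d**2 / (d**2 + lam))
print(np.allclose(beta_ridge, np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)))
```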
Effective degrees of freedom

Recall the hat matrix H = X (X^T X)^{-1} X^T in OLS. Its trace is
  tr(H) = tr(X (X^T X)^{-1} X^T) = tr(X^T X (X^T X)^{-1}) = tr(I_p) = p,
the degrees of freedom for the regression coefficients.

In analogy, define for ridge regression
  H(λ) := X (X^T X + λ I_p)^{-1} X^T
and
  df(λ) := tr(H(λ)) = ∑_{j=1}^p d_j^2 / (d_j^2 + λ),
the effective degrees of freedom.
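A small sketch of df(λ) computed from the singular values (simulated design matrix; the helper name eff_df is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
d = np.linalg.svd(X, compute_uv=False)   # singular values of X

def eff_df(lam):
    """Effective degrees of freedom df(lam) = sum_j d_j^2 / (d_j^2 + lam)."""
    return np.sum(d**2 / (d**2 + lam))

print(eff_df(0.0))     # equals p = 8, the OLS degrees of freedom
print(eff_df(10.0))    # decreases as lambda grows
print(eff_df(1e6))     # approaches 0 for very strong penalisation
```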
Lasso regression

For q = 1 the constrained problem is known as the lasso
  β̂_lasso(λ) = arg min_β ‖y − Xβ‖_2^2 + λ‖β‖_1
▶ Smallest q in the penalty such that the constraint is still convex
▶ Performs feature selection
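For completeness, a scikit-learn lasso sketch on simulated data. Note that sklearn's Lasso minimises (1/(2n))‖y − Xβ‖_2^2 + α‖β‖_1, so α corresponds to a rescaled λ:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# Lasso produces sparse coefficient vectors: most entries are exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print(np.flatnonzero(lasso.coef_))

# Choose the penalty strength by cross-validation
lasso_cv = LassoCV(cv=5).fit(X, y)
print(lasso_cv.alpha_, np.flatnonzero(lasso_cv.coef_))
```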
Intuition for the penalties (I)

Assume the OLS solution β̂_OLS exists and set
  e = y − X β̂_OLS.
It follows for the residual sum of squares (RSS) that
  ‖y − Xβ‖_2^2 = ‖(X β̂_OLS + e) − Xβ‖_2^2 = ‖X(β − β̂_OLS) − e‖_2^2
               = (β − β̂_OLS)^T X^T X (β − β̂_OLS) − 2 e^T X (β − β̂_OLS) + e^T e,
which is an ellipse (at least in 2D) centred on β̂_OLS.
Intuition for the penalties (II)

The least-squares RSS is minimized by β̂_OLS. If a constraint is added ( ‖β‖_q^q ≤ t ), then the RSS is minimized by the closest β possible that fulfills the constraint. The blue lines are the contour lines of the RSS.

[Figure: lasso (left) and ridge (right) constraint regions in the (β_1, β_2) plane, with β̂_OLS, β̂_lasso and β̂_ridge marked.]
Intuition for the penalties (III)

Depending on q, the different constraints lead to different constrained solutions. If β̂_OLS lies in one of the coloured areas or on a line, the corresponding solution will be at the dot.

[Figure: constraint regions and solution points in the (β_1, β_2) plane for q = 0.7, q = 1, q = 2 and q = ∞.]
Computational aspects of the Lasso (I)

What estimates does the lasso produce? How do we find the solution β̂ in the presence of the non-differentiable penalisation ‖β‖_1?

Target function:
  arg min_β (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1

Special case: X^T X = I_p. Then X^T y = β̂_OLS and
  (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1 = (1/2) y^T y − β̂_OLS^T β + (1/2) β^T β + λ‖β‖_1 =: f(β)
Computational aspects of the Lasso (II)

For X^T X = I_p the target function can be written as
  arg min_β ∑_{j=1}^p ( (1/2) β_j^2 − β̂_OLS,j β_j + λ|β_j| ).
This results in p uncoupled optimization problems.
▶ If β̂_OLS,j > 0, then β_j ≥ 0 minimizes the target
▶ If β̂_OLS,j ≤ 0, then β_j ≤ 0
Each case results in
  β̂_j = sign(β̂_OLS,j) (|β̂_OLS,j| − λ)_+ = ST(β̂_OLS,j, λ),
where
  x_+ = { x if x > 0; 0 otherwise }
and ST is the soft-thresholding operator.
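The soft-thresholding solution is easy to verify numerically. This sketch builds an orthonormal design via a QR decomposition (so that X^T X = I as in the special case above); the data and threshold value are illustrative:

```python
import numpy as np

def soft_threshold(b, lam):
    """ST(b, lam) = sign(b) * max(|b| - lam, 0), applied elementwise."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))   # columns of Q are orthonormal
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.5])
y = Q @ beta_true + 0.1 * rng.standard_normal(50)

# OLS coefficients: since Q^T Q = I, beta_OLS = Q^T y
beta_ols = Q.T @ y

# Lasso solution in the orthonormal case: soft-thresholded OLS coefficients
print(soft_threshold(beta_ols, 0.3))   # small coefficients are set exactly to zero
```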