Ridge/Lasso Regression, Model Selection
Xuezhi Wang
Computer Science Department, Carnegie Mellon University
10-701 recitation, Apr 22
Outline
1 Ridge/Lasso Regression
  - Linear Regression
  - Regularization
  - Probabilistic Interpretation
2 Model Selection
  - Variable Selection
  - Model selection
Linear Regression

Data X: an N × P matrix; target y: an N × 1 vector. There are N samples, each with P features.
We want to find θ so that y and Xθ are as close as possible, so we pick the θ that minimizes the cost function

    L(θ) = (1/2) Σ_i (y_i − X_i θ)² = (1/2) ‖y − Xθ‖₂²

One option is gradient descent:

    θ_j^(t+1) = θ_j^t − step · ∂L/∂θ_j = θ_j^t − step · Σ_i (y_i − X_i θ^t)(−X_ij)
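A minimal NumPy sketch of this gradient-descent update (my illustration, not from the slides; the step size, iteration count, and synthetic data are arbitrary choices):

```python
import numpy as np

def linear_regression_gd(X, y, step=0.01, n_iters=1000):
    """Minimize (1/2)||y - X theta||^2 by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta           # y_i - X_i theta
        grad = -X.T @ residual             # sum_i (y_i - X_i theta)(-X_ij)
        theta = theta - step * grad
    return theta

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(linear_regression_gd(X, y))
```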
Linear Regression: matrix form

    L(θ) = (1/2) Σ_i (y_i − X_i θ)² = (1/2) ‖y − Xθ‖₂²
         = (1/2) (y − Xθ)⊤ (y − Xθ)
         = (1/2) (y⊤y − y⊤Xθ − θ⊤X⊤y + θ⊤X⊤Xθ)

Taking the derivative with respect to θ and setting it to zero,

    ∂L/∂θ = (1/2) (−2 X⊤y + 2 X⊤Xθ) = 0

hence θ = (X⊤X)⁻¹ X⊤y.
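A sketch of the closed-form solution in NumPy; solving the normal equations with np.linalg.solve (rather than forming an explicit inverse) is my implementation choice, not something the slides prescribe.

```python
import numpy as np

def linear_regression_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed by solving the normal equations."""
    # Requires X to have full column rank; see the rank discussion below.
    return np.linalg.solve(X.T @ X, X.T @ y)
```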
Linear Regression: iterative vs. matrix methods

- Matrix methods reach the solution in a single step, but can be infeasible for real-time data or very large data sets.
- Iterative methods can be used on large practical problems, but require choosing a learning rate.

Any problems? Data X is an N × P matrix. Usually N > P, i.e., the number of data points exceeds the feature dimension, and usually X has full column rank. In that case X⊤X has rank P, i.e., it is invertible. What if X has less than full column rank?
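A quick numerical check (my illustration) of the failure mode: when a column of X is duplicated, X⊤X loses full rank and the inverse in the closed-form solution no longer exists.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_dup = np.hstack([X, X[:, :1]])                 # duplicate a column -> not full column rank

print(np.linalg.matrix_rank(X.T @ X))            # 3: invertible
print(np.linalg.matrix_rank(X_dup.T @ X_dup))    # 3, but the matrix is 4 x 4: singular
```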
Regularization: ℓ₂ norm

Ridge Regression:

    min_θ (1/2) Σ_i (y_i − X_i θ)² + λ ‖θ‖₂²

The solution is given by:

    θ = (X⊤X + λI)⁻¹ X⊤y

- Results in a solution with small θ
- Solves the problem that X⊤X is not invertible
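A minimal NumPy sketch of the ridge solution; the choice of λ and whether to leave an intercept column unpenalized are details the slides do not fix, so they are left to the caller here.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """theta = (X^T X + lambda * I)^{-1} X^T y; well-defined for any lam > 0."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
```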
Regularization: ℓ₁ norm

Lasso Regression:

    min_θ (1/2) Σ_i (y_i − X_i θ)² + λ ‖θ‖₁

The optimality condition is obtained by taking the subgradient:

    Σ_i (y_i − X_i θ)(−X_ij) + λ t_j = 0

where t_j is the subgradient of the ℓ₁ norm: t_j = sign(θ_j) if θ_j ≠ 0, and t_j ∈ [−1, 1] otherwise.

- Sparse solution, i.e., θ will be a vector with more zero coordinates
- Good for high-dimensional problems
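A small illustration of the sparsity claim using scikit-learn (my example, not part of the slides); note that sklearn's Lasso and Ridge scale their objectives slightly differently from the formulas above, so alpha corresponds to λ only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [2.0, -1.5, 1.0]                  # only 3 of 20 features matter
y = X @ theta_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # lasso: many coefficients exactly zero
print(np.sum(ridge.coef_ == 0))   # ridge: coefficients shrink but stay nonzero
```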
Solving Lasso regression

Efron et al. proposed LARS (least angle regression), which computes the LASSO path efficiently.

Forward stagewise algorithm (assume X is standardized and y is centered; choose a small ε):
1 Start with the initial residual r = y and θ_1 = ... = θ_P = 0
2 Find the predictor Z_j (the j-th column of X) most correlated with r
3 Update θ_j ← θ_j + δ_j, where δ_j = ε · sign(Z_j⊤ r), and set r ← r − δ_j Z_j
Repeat steps 2 and 3.
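A direct NumPy transcription of the forward stagewise steps above (the step size ε and the number of iterations are illustrative choices):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Forward stagewise: repeatedly take a tiny step along the most correlated predictor."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize X
    r = y - y.mean()                               # center y; initial residual r = y
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ r                             # correlation of each column Z_j with r
        j = np.argmax(np.abs(corr))                # most correlated predictor
        delta = eps * np.sign(corr[j])             # delta_j = eps * sign(Z_j^T r)
        theta[j] += delta                          # theta_j <- theta_j + delta_j
        r = r - delta * X[:, j]                    # r <- r - delta_j * Z_j
    return theta
```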
Comparison of Ridge and Lasso regression: the two-dimensional case (figure).
Comparison of Ridge and Lasso regression: the higher-dimensional case (figure).
Choosing λ

Standard practice now is to use cross-validation.
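One common way to do this in practice is scikit-learn's cross-validated estimators (a usage sketch of mine; the grid of candidate λ values and the 5-fold split are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200)

alphas = np.logspace(-4, 1, 50)                    # candidate regularization strengths
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(lasso_cv.alpha_, ridge_cv.alpha_)            # lambda chosen by 5-fold cross-validation
```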
Probabilistic Interpretation of Linear Regression

Assume y_i = X_i θ + ε_i, where ε_i is random noise with ε_i ∼ N(0, σ²). Then

    p(y_i | X_i; θ) = (1 / (√(2π) σ)) exp{ −(y_i − X_i θ)² / (2σ²) }

Since the data points are i.i.d., the data likelihood is

    L(θ) = Π_{i=1}^N p(y_i | X_i; θ) ∝ exp{ −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) }

and the log-likelihood is

    ℓ(θ) = −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) + const

Maximizing the log-likelihood is equivalent to minimizing Σ_{i=1}^N (y_i − X_i θ)², i.e., the loss function of linear regression!
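A numerical sanity check of this equivalence (my own illustration): maximizing the Gaussian log-likelihood over θ recovers the ordinary least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)
sigma2 = 0.3 ** 2                                    # assumed known noise variance

def neg_log_likelihood(theta):
    # -ell(theta) = sum_i (y_i - X_i theta)^2 / (2 sigma^2), up to a constant
    return np.sum((y - X @ theta) ** 2) / (2 * sigma2)

theta_mle = minimize(neg_log_likelihood, np.zeros(3)).x
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_mle, theta_ols, atol=1e-4))  # True: MLE coincides with least squares
```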
Probabilistic Interpretation of Ridge Regression

Assume a Gaussian prior θ ∼ N(0, τ²I), i.e., p(θ) ∝ exp{ −θ⊤θ / (2τ²) }. Now take the MAP estimate of θ:

    p(θ | X, y) ∝ p(y | X; θ) p(θ) = exp{ −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) } · exp{ −θ⊤θ / (2τ²) }

The log-posterior is

    ℓ(θ | X, y) = −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) − θ⊤θ / (2τ²) + const

which matches min_θ (1/2) Σ_i (y_i − X_i θ)² + λ ‖θ‖₂², where λ is a constant determined by σ² and τ² (for the objective as written, λ = σ² / (2τ²)).
Probabilistic Interpretation of Lasso Regression

Assume an i.i.d. Laplace prior on each coordinate, θ_j ∼ Laplace(0, t), i.e., p(θ_j) ∝ exp{ −|θ_j| / t }. Now take the MAP estimate of θ:

    p(θ | X, y) ∝ p(y | X; θ) p(θ) = exp{ −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) } · Π_j exp{ −|θ_j| / t }

The log-posterior is

    ℓ(θ | X, y) = −Σ_{i=1}^N (y_i − X_i θ)² / (2σ²) − Σ_j |θ_j| / t + const

which matches min_θ (1/2) Σ_i (y_i − X_i θ)² + λ ‖θ‖₁, where λ is a constant determined by σ² and t (for the objective as written, λ = σ² / t).
Variable Selection

- Considering all "best" subsets is of order O(2^P): a combinatorial explosion.
- Stepwise selection: a new variable may be added to the model even when it gives only a small improvement in the least-squares fit; applying stepwise selection to a perturbation of the data will likely cause a different set of variables to enter the model at each stage.
- LASSO produces sparse solutions, which takes care of model selection; we can even see when variables jump into the model by looking at the LASSO path, as in the sketch below.
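A quick way to inspect the LASSO path with scikit-learn (my usage sketch): for each variable we can read off the largest λ at which its coefficient becomes nonzero, i.e., when it enters the model.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * X[:, 5] + 0.1 * rng.normal(size=100)

alphas, coefs, _ = lasso_path(X, y)                # coefs: (P, n_alphas), alphas decreasing
for j in range(X.shape[1]):
    active = np.nonzero(coefs[j])[0]
    entry = alphas[active[0]] if active.size else None
    print(f"variable {j} enters the model at lambda ~ {entry}")
```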
Example

Suppose you have data Y_1, ..., Y_n and you want to model the distribution of Y. Some popular models are:
- the Exponential distribution: f(y; θ) = θ e^{−θy}
- the Gaussian distribution: f(y; μ, σ²), the N(μ, σ²) density
- ...
How do you know which model is better?
AIC

Suppose we have models M_1, ..., M_k, where each model is a set of densities:

    M_j = { p(y; θ_j) : θ_j ∈ Θ_j }

We have data Y_1, ..., Y_n drawn from some density f (not necessarily from one of these models). Define

    AIC(j) = ℓ_j(θ̂_j) − d_j

where ℓ_j(θ_j) is the log-likelihood, θ̂_j is the parameter that maximizes it, and d_j is the dimension of Θ_j. We choose the model with the largest AIC.
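A sketch (mine, with synthetic data) of computing AIC for the two candidate models from the earlier example, exponential vs. Gaussian, using the convention AIC(j) = ℓ_j(θ̂_j) − d_j above:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=200)            # synthetic positive data
n = len(y)

# Model 1: Exponential(theta).  MLE: theta_hat = 1 / mean(y);  d = 1
theta_hat = 1.0 / y.mean()
ll_exp = n * np.log(theta_hat) - theta_hat * y.sum()
aic_exp = ll_exp - 1

# Model 2: Gaussian(mu, sigma^2).  MLE: sample mean and (biased) sample variance;  d = 2
mu_hat, sigma2_hat = y.mean(), y.var()
ll_gauss = -0.5 * n * np.log(2 * np.pi * sigma2_hat) - 0.5 * n
aic_gauss = ll_gauss - 2

print(aic_exp, aic_gauss)                           # choose the model with the larger AIC
```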
BIC

Bayesian Information Criterion: we choose j to maximize

    BIC(j) = ℓ_j(θ̂_j) − (d_j / 2) log n

which is similar to AIC, but the penalty is harsher, so BIC tends to choose simpler models.
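The same comparison under BIC only changes the penalty term; a small helper (mine) that can be applied to the maximized log-likelihoods from the AIC sketch above:

```python
import numpy as np

def bic(log_lik_max, d, n):
    """BIC(j) = ell_j(theta_hat_j) - (d_j / 2) * log(n); larger is better."""
    return log_lik_max - 0.5 * d * np.log(n)

# e.g. compare bic(ll_exp, d=1, n=200) with bic(ll_gauss, d=2, n=200)
```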
Simple example

Let Y_1, ..., Y_n ∼ N(μ, 1). We want to compare two models: M_0: N(0, 1) and M_1: N(μ, 1).
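The slide stops at the setup; a sketch of how the comparison could proceed under the AIC/BIC definitions above (my illustration): M_0 has no free parameters (d = 0), while M_1 fits μ by its MLE μ̂ = Ȳ (d = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=100)        # data from N(mu, 1) with mu = 0.3
n = len(y)

# M0: N(0, 1), no parameters fit (d = 0)
ll0 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(y ** 2)
# M1: N(mu, 1), mu fit by mu_hat = mean(y) (d = 1)
mu_hat = y.mean()
ll1 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - mu_hat) ** 2)

aic0, aic1 = ll0, ll1 - 1
bic0, bic1 = ll0, ll1 - 0.5 * np.log(n)
print("AIC prefers", "M1" if aic1 > aic0 else "M0")
print("BIC prefers", "M1" if bic1 > bic0 else "M0")
```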