Linear Regression Models
Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

Here the X's might be:
•Raw predictor variables (continuous or coded-categorical)
•Transformed predictors (X_4 = \log X_3)
•Basis expansions (X_4 = X_3^2, X_5 = X_3^3, etc.)
•Interactions (X_4 = X_2 X_3)

Popular choice for estimation is least squares:

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
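A minimal R sketch (simulated data and illustrative names x1, x2, x3): transformed predictors, basis expansions, and interactions are still linear in the coefficients, so lm() fits them all by least squares.

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- runif(n, 1, 5)
y  <- 1 + 2 * x1 - x2 + 0.5 * log(x3) + rnorm(n)

# raw predictors, a transformed predictor, a basis expansion, and an interaction
fit <- lm(y ~ x1 + x2 + log(x3) + I(x1^2) + x1:x2)
coef(fit)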
Least Squares

RSS(\beta) = (y - X\beta)^T (y - X\beta)

\hat{\beta} = (X^T X)^{-1} X^T y

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y, where X (X^T X)^{-1} X^T is the "hat" matrix

Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals
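A small R sketch on simulated data of the closed-form solution and the hat matrix, checked against lm():

set.seed(1)
N <- 50
X <- cbind(1, rnorm(N), rnorm(N))              # design matrix with an intercept column
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true) + rnorm(N)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)          # the hat matrix
y_hat <- H %*% y                               # fitted values

coef(lm(y ~ X - 1))                            # same estimates from lm()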
Gauss-Markov Theorem

Consider any linear combination of the \beta's:  \theta = a^T \beta

The least squares estimate of \theta is:

\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat{\theta}) = E( a^T (X^T X)^{-1} X^T y ) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y (i.e., E(c^T y) = a^T \beta):

Var( a^T \hat{\beta} ) \le Var( c^T y )

Of course, there might be a biased estimator with lower MSE…
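A simulation sketch of the theorem (my own illustrative setup): the slope estimated by least squares on all the data versus another linear unbiased estimator, here least squares on only half of the data. Both are unbiased; the full-data OLS estimate has the smaller variance.

set.seed(1)
N <- 100
x <- rnorm(N)                         # fixed design, reused across replicates
ols_full <- ols_half <- numeric(2000)
for (r in 1:2000) {
  y <- 1 + 2 * x + rnorm(N)
  ols_full[r] <- coef(lm(y ~ x))[2]                 # OLS on all N observations
  ols_half[r] <- coef(lm(y[1:50] ~ x[1:50]))[2]     # also linear and unbiased, but noisier
}
c(var_full = var(ols_full), var_half = var(ols_half))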
Bias-Variance

For any estimator \tilde{\theta}:

MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                    = E\big(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta\big)^2
                    = E\big(\tilde{\theta} - E(\tilde{\theta})\big)^2 + \big(E(\tilde{\theta}) - \theta\big)^2
                    = Var(\tilde{\theta}) + bias^2

Note MSE is closely related to prediction error:

E(Y_0 - x_0^T \tilde{\beta})^2 = E(Y_0 - x_0^T \beta)^2 + E(x_0^T \beta - x_0^T \tilde{\beta})^2 = \sigma^2 + MSE(x_0^T \tilde{\beta})
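A quick simulation sketch (arbitrary example: a deliberately biased estimator, 0.8 times a sample mean) confirming MSE = variance + bias^2:

set.seed(1)
theta <- 5
est <- replicate(1e5, 0.8 * mean(rnorm(20, mean = theta, sd = 2)))

mse    <- mean((est - theta)^2)
decomp <- var(est) + (mean(est) - theta)^2
c(MSE = mse, Var_plus_bias2 = decomp)    # the two should nearly agree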
Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions":

1. Subset selection (all-subsets + leaps-and-bounds, stepwise methods; score with AIC, BIC, etc.)
2. Shrinkage / ridge regression
3. Derived inputs
Subset Selection

•Standard "all-subsets" finds the subset of size k, k = 1, …, p, that minimizes RSS
•Choice of subset size requires a tradeoff – AIC, BIC, marginal likelihood, cross-validation, etc.
•"Leaps and bounds" is an efficient algorithm to do all-subsets
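A sketch of all-subsets selection on simulated data, assuming the leaps package (whose regsubsets() implements a leaps-and-bounds search) is installed:

library(leaps)
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)
dat <- data.frame(y, X)

best <- regsubsets(y ~ ., data = dat, nvmax = p)   # best subset of each size k = 1, ..., p
summary(best)$bic                                  # e.g., choose k by minimizing BIC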
Cross-Validation

e.g. 10-fold cross-validation:
•Randomly divide the data into ten parts
•Train the model using 9 tenths and compute the prediction error on the remaining 1 tenth
•Do this for each 1 tenth of the data
•Average the 10 prediction error estimates

"One standard error rule": pick the simplest model within one standard error of the minimum
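A base-R sketch of 10-fold cross-validation for a simple linear model on simulated data:

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x, y)

fold <- sample(rep(1:10, length.out = n))           # randomly assign each row to a fold
cv_err <- numeric(10)
for (k in 1:10) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])        # train on the other 9 tenths
  pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out tenth
  cv_err[k] <- mean((dat$y[fold == k] - pred)^2)
}
c(cv_estimate = mean(cv_err), std_error = sd(cv_err) / sqrt(10))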
Shrinkage Methods

•Subset selection is a discrete process – individual variables are either in or out
•This method can have high variance – a different dataset from the same source can result in a totally different model
•Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
Ridge Regression

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
subject to: \sum_{j=1}^{p} \beta_j^2 \le s

Equivalently:

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}

This leads to:

\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y     (works even when X^T X is singular)

Choose \lambda by cross-validation. Predictors should be centered.
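A sketch of the closed-form ridge solution on simulated, centered data (λ = 1 is arbitrary; in practice choose it by cross-validation):

set.seed(1)
N <- 50; p <- 5
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)   # centered predictors
y <- X[, 1] - 2 * X[, 2] + rnorm(N)
y <- y - mean(y)                                                       # center the response

lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)  # (X'X + lambda I)^{-1} X'y
beta_ridge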
[Figure: ridge coefficient profiles, plotted against the effective number of X's (effective degrees of freedom)]
Ridge Regression = Bayesian Regression

y_i \sim N( \beta_0 + x_i^T \beta, \sigma^2 )
\beta_j \sim N( 0, \tau^2 )

Same as ridge with \lambda = \sigma^2 / \tau^2
The Lasso

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
subject to: \sum_{j=1}^{p} |\beta_j| \le s

A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.

More generally, with a penalty |\beta_j|^q:

\tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}

q = 0: variable selection;  q = 1: lasso;  q = 2: ridge.  Learn q?
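A sketch of fitting the lasso with a cross-validated penalty, assuming the glmnet package is installed (alpha = 1 gives the lasso penalty):

library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + X[, 2] + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)   # 10-fold CV over a grid of lambda values
coef(cvfit, s = "lambda.1se")         # sparse coefficient vector at the "one SE" lambda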
[Figure: lasso coefficient profiles, plotted as a function of 1/λ]
Principal Component Regression

Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X, since X is first centered):

X^T X = V D^2 V^T        (X is N x p)

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 \ge d_2 \ge \dots \ge d_p.

Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:  Var(Xv_1) = d_1^2 / N

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.
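A short base-R check on simulated data of the eigen-decomposition and the variance property of the first principal component:

set.seed(1)
N <- 100; p <- 4
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)   # centered X

e  <- eigen(t(X) %*% X)      # X'X = V D^2 V'
V  <- e$vectors
d2 <- e$values               # the d_j^2, in decreasing order

z1 <- X %*% V[, 1]                                        # scores on the first PC direction
c(sample_var = sum(z1^2) / N, d1_sq_over_N = d2[1] / N)   # these should match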
Principal Component Regression

PC regression regresses y on the first M principal components, where M < p.

Similar to ridge regression in some respects – see HTF, p. 66
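A base-R sketch of principal component regression on simulated data (M = 2 is arbitrary; in practice choose M by cross-validation):

set.seed(1)
N <- 100; p <- 6
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(N)

pc <- prcomp(X, center = FALSE)   # X is already centered
M  <- 2
Z  <- pc$x[, 1:M]                 # scores on the first M principal components
coef(lm(y ~ Z))                   # regress y on the derived inputs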
www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf
# Demo: incremental forward stagewise regression with two predictors.
# Simulated data: y depends strongly on x1 and more weakly on x2.
x1 <- rnorm(10)
x2 <- rnorm(10)
y  <- (3 * x1) + x2 + rnorm(10, 0.1)

# Plot y against each predictor, with each predictor's own least squares fit
par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1          # step size
r <- y                  # current residual
beta <- c(0, 0)         # both coefficients start at zero
numIter <- 25

for (i in 1:numIter) {
  # track the correlations with the residual and the current coefficients
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    # x1 is more correlated with the residual: nudge beta[1] by epsilon
    # in the direction of sign(r'x1), then update the residual
    delta <- epsilon * sign(sum(r * x1))
    beta[1] <- beta[1] + delta
    r <- r - (delta * x1)
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  }
  if (cor(x1, r) <= cor(x2, r)) {
    # x2 is at least as correlated with the current residual: nudge beta[2]
    delta <- epsilon * sign(sum(r * x2))
    beta[2] <- beta[2] + delta
    r <- r - (delta * x2)
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}
LARS

►Start with all coefficients b_j = 0
►Find the predictor x_j most correlated with y
►Increase b_j in the direction of the sign of its correlation with y. Take residuals r = y - \hat{y} along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
►Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
►Continue until all predictors are in the model
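A sketch of running LARS on simulated data, assuming the lars package is installed (type = "lar" traces the LARS path):

library(lars)
set.seed(1)
n <- 100; p <- 6
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + X[, 2] + rnorm(n)

fit <- lars(X, y, type = "lar")   # predictors enter one at a time along the path
plot(fit)                         # coefficient paths over the LARS steps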
Fused Lasso

•If there are many correlated features, the lasso gives non-zero weight to only one of them
•Maybe correlated features (e.g. time-ordered) should have similar coefficients?
•The fused lasso penalizes both the coefficients and their successive differences:
  \lambda_1 \sum_{j} |\beta_j| + \lambda_2 \sum_{j} |\beta_j - \beta_{j-1}|

Tibshirani et al. (2005)
Group Lasso

•Suppose you represent a categorical predictor with indicator variables
•Might want the set of indicators to be in or out of the model together

regular lasso penalty:  \lambda \sum_{j=1}^{p} |\beta_j|
group lasso penalty:    \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2,  where \beta_g is the block of coefficients for group g and p_g is its size

Yuan and Lin (2006)
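A tiny base-R illustration (my own example, with made-up coefficients and group labels) of how the two penalties treat a grouped coefficient vector:

beta   <- c(0.5, -0.3, 0.2,  0, 0,  1.2)   # six coefficients
groups <- c(1, 1, 1,  2, 2,  3)            # e.g., blocks of indicators for three factors

lasso_pen <- sum(abs(beta))                             # sum_j |beta_j|
group_pen <- sum(tapply(beta, groups,
                 function(b) sqrt(length(b)) * sqrt(sum(b^2))))   # sum_g sqrt(p_g) ||beta_g||_2
c(lasso = lasso_pen, group = group_pen)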