Linear Regression Models
Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

Here the X's might be:
•Raw predictor variables (continuous or coded-categorical)
•Transformed predictors (X_4 = \log X_3)
•Basis expansions (X_4 = X_3^2, X_5 = X_3^3, etc.)
•Interactions (X_4 = X_2 X_3)

Popular choice for estimation is least squares:

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
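A minimal R sketch (simulated data and illustrative names x1, x2, x3): transformed predictors, basis expansions, and interactions are still linear in the coefficients, so lm() fits them all by least squares.

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- runif(n, 1, 5)
y  <- 1 + 2 * x1 - x2 + 0.5 * log(x3) + rnorm(n)

# raw predictors, a transformed predictor, a basis expansion, and an interaction
fit <- lm(y ~ x1 + x2 + log(x3) + I(x1^2) + x1:x2)
coef(fit)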
Least Squares

RSS(\beta) = (y - X\beta)^T (y - X\beta)

\hat{\beta} = (X^T X)^{-1} X^T y

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y, where X (X^T X)^{-1} X^T is the "hat" matrix

Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals
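A small R sketch on simulated data of the closed-form solution and the hat matrix, checked against lm():

set.seed(1)
N <- 50
X <- cbind(1, rnorm(N), rnorm(N))              # design matrix with an intercept column
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true) + rnorm(N)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)          # the hat matrix
y_hat <- H %*% y                               # fitted values

coef(lm(y ~ X - 1))                            # same estimates from lm()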
Gauss-Markov Theorem

Consider any linear combination of the \beta's:  \theta = a^T \beta

The least squares estimate of \theta is:

\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat{\theta}) = E( a^T (X^T X)^{-1} X^T y ) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y (i.e., E(c^T y) = a^T \beta):

Var( a^T \hat{\beta} ) \le Var( c^T y )

Of course, there might be a biased estimator with lower MSE…
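A simulation sketch of the theorem (my own illustrative setup): the slope estimated by least squares on all the data versus another linear unbiased estimator, here least squares on only half of the data. Both are unbiased; the full-data OLS estimate has the smaller variance.

set.seed(1)
N <- 100
x <- rnorm(N)                         # fixed design, reused across replicates
ols_full <- ols_half <- numeric(2000)
for (r in 1:2000) {
  y <- 1 + 2 * x + rnorm(N)
  ols_full[r] <- coef(lm(y ~ x))[2]                 # OLS on all N observations
  ols_half[r] <- coef(lm(y[1:50] ~ x[1:50]))[2]     # also linear and unbiased, but noisier
}
c(var_full = var(ols_full), var_half = var(ols_half))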
Bias-Variance

For any estimator \tilde{\theta}:

MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                    = E\big(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta\big)^2
                    = E\big(\tilde{\theta} - E(\tilde{\theta})\big)^2 + \big(E(\tilde{\theta}) - \theta\big)^2
                    = Var(\tilde{\theta}) + bias^2

Note MSE is closely related to prediction error:

E(Y_0 - x_0^T \tilde{\beta})^2 = E(Y_0 - x_0^T \beta)^2 + E(x_0^T \beta - x_0^T \tilde{\beta})^2 = \sigma^2 + MSE(x_0^T \tilde{\beta})
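A quick simulation sketch (arbitrary example: a deliberately biased estimator, 0.8 times a sample mean) confirming MSE = variance + bias^2:

set.seed(1)
theta <- 5
est <- replicate(1e5, 0.8 * mean(rnorm(20, mean = theta, sd = 2)))

mse    <- mean((est - theta)^2)
decomp <- var(est) + (mean(est) - theta)^2
c(MSE = mse, Var_plus_bias2 = decomp)    # the two should nearly agree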
Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions":

1. Subset selection (all-subsets + leaps-and-bounds, stepwise methods; score with AIC, BIC, etc.)
2. Shrinkage / ridge regression
3. Derived inputs
Subset Selection

•Standard "all-subsets" finds the subset of size k, k = 1, …, p, that minimizes RSS
•Choice of subset size requires a tradeoff – AIC, BIC, marginal likelihood, cross-validation, etc.
•"Leaps and bounds" is an efficient algorithm to do all-subsets
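A sketch of all-subsets selection on simulated data, assuming the leaps package (whose regsubsets() implements a leaps-and-bounds search) is installed:

library(leaps)
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)
dat <- data.frame(y, X)

best <- regsubsets(y ~ ., data = dat, nvmax = p)   # best subset of each size k = 1, ..., p
summary(best)$bic                                  # e.g., choose k by minimizing BIC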
Cross-Validation

e.g. 10-fold cross-validation:
•Randomly divide the data into ten parts
•Train the model using 9 tenths and compute the prediction error on the remaining 1 tenth
•Do this for each 1 tenth of the data
•Average the 10 prediction error estimates

"One standard error rule": pick the simplest model within one standard error of the minimum
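A base-R sketch of 10-fold cross-validation for a simple linear model on simulated data:

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x, y)

fold <- sample(rep(1:10, length.out = n))           # randomly assign each row to a fold
cv_err <- numeric(10)
for (k in 1:10) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])        # train on the other 9 tenths
  pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out tenth
  cv_err[k] <- mean((dat$y[fold == k] - pred)^2)
}
c(cv_estimate = mean(cv_err), std_error = sd(cv_err) / sqrt(10))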
Shrinkage Methods

•Subset selection is a discrete process – individual variables are either in or out
•This method can have high variance – a different dataset from the same source can result in a totally different model
•Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
Ridge Regression

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
subject to: \sum_{j=1}^{p} \beta_j^2 \le s

Equivalently:

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}

This leads to:

\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y     (works even when X^T X is singular)

Choose \lambda by cross-validation. Predictors should be centered.
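A sketch of the closed-form ridge solution on simulated, centered data (λ = 1 is arbitrary; in practice choose it by cross-validation):

set.seed(1)
N <- 50; p <- 5
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)   # centered predictors
y <- X[, 1] - 2 * X[, 2] + rnorm(N)
y <- y - mean(y)                                                       # center the response

lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)  # (X'X + lambda I)^{-1} X'y
beta_ridge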
[Figure: ridge coefficient profiles, plotted against the effective number of X's (effective degrees of freedom)]
Ridge Regression = Bayesian Regression

y_i \sim N( \beta_0 + x_i^T \beta, \sigma^2 )
\beta_j \sim N( 0, \tau^2 )

Same as ridge with \lambda = \sigma^2 / \tau^2
The Lasso

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
subject to: \sum_{j=1}^{p} |\beta_j| \le s

A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.

More generally, with a penalty |\beta_j|^q:

\tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}

q = 0: variable selection;  q = 1: lasso;  q = 2: ridge.  Learn q?
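A sketch of fitting the lasso with a cross-validated penalty, assuming the glmnet package is installed (alpha = 1 gives the lasso penalty):

library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + X[, 2] + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)   # 10-fold CV over a grid of lambda values
coef(cvfit, s = "lambda.1se")         # sparse coefficient vector at the "one SE" lambda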
[Figure: lasso coefficient profiles, plotted as a function of 1/λ]
Principal Component Regression

Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X, since X is first centered):

X^T X = V D^2 V^T        (X is N x p)

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 \ge d_2 \ge \dots \ge d_p.

Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:  Var(Xv_1) = d_1^2 / N

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.
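A short base-R check on simulated data of the eigen-decomposition and the variance property of the first principal component:

set.seed(1)
N <- 100; p <- 4
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)   # centered X

e  <- eigen(t(X) %*% X)      # X'X = V D^2 V'
V  <- e$vectors
d2 <- e$values               # the d_j^2, in decreasing order

z1 <- X %*% V[, 1]                                        # scores on the first PC direction
c(sample_var = sum(z1^2) / N, d1_sq_over_N = d2[1] / N)   # these should match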
Principal Component Regression

PC regression regresses y on the first M principal components, where M < p.

Similar to ridge regression in some respects – see HTF, p. 66
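A base-R sketch of principal component regression on simulated data (M = 2 is arbitrary; in practice choose M by cross-validation):

set.seed(1)
N <- 100; p <- 6
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(N)

pc <- prcomp(X, center = FALSE)   # X is already centered
M  <- 2
Z  <- pc$x[, 1:M]                 # scores on the first M principal components
coef(lm(y ~ Z))                   # regress y on the derived inputs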
www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf
# Demo: incremental forward stagewise regression with two predictors.
# Simulated data: y depends strongly on x1 and more weakly on x2.
x1 <- rnorm(10)
x2 <- rnorm(10)
y  <- (3 * x1) + x2 + rnorm(10, 0.1)

# Plot y against each predictor, with each predictor's own least squares fit
par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1          # step size
r <- y                  # current residual
beta <- c(0, 0)         # both coefficients start at zero
numIter <- 25

for (i in 1:numIter) {
  # track the correlations with the residual and the current coefficients
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    # x1 is more correlated with the residual: nudge beta[1] by epsilon
    # in the direction of sign(r'x1), then update the residual
    delta <- epsilon * sign(sum(r * x1))
    beta[1] <- beta[1] + delta
    r <- r - (delta * x1)
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  }
  if (cor(x1, r) <= cor(x2, r)) {
    # x2 is at least as correlated with the current residual: nudge beta[2]
    delta <- epsilon * sign(sum(r * x2))
    beta[2] <- beta[2] + delta
    r <- r - (delta * x2)
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}
LARS

►Start with all coefficients b_j = 0
►Find the predictor x_j most correlated with y
►Increase b_j in the direction of the sign of its correlation with y. Take residuals r = y - \hat{y} along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
►Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
►Continue until all predictors are in the model
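A sketch of running LARS on simulated data, assuming the lars package is installed (type = "lar" traces the LARS path):

library(lars)
set.seed(1)
n <- 100; p <- 6
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + X[, 2] + rnorm(n)

fit <- lars(X, y, type = "lar")   # predictors enter one at a time along the path
plot(fit)                         # coefficient paths over the LARS steps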
Fused Lasso

•If there are many correlated features, the lasso gives non-zero weight to only one of them
•Maybe correlated features (e.g. time-ordered) should have similar coefficients?
•The fused lasso penalizes both the coefficients and their successive differences:
  \lambda_1 \sum_{j} |\beta_j| + \lambda_2 \sum_{j} |\beta_j - \beta_{j-1}|

Tibshirani et al. (2005)
Group Lasso

•Suppose you represent a categorical predictor with indicator variables
•Might want the set of indicators to be in or out of the model together

regular lasso penalty:  \lambda \sum_{j=1}^{p} |\beta_j|
group lasso penalty:    \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2,  where \beta_g is the block of coefficients for group g and p_g is its size

Yuan and Lin (2006)
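A tiny base-R illustration (my own example, with made-up coefficients and group labels) of how the two penalties treat a grouped coefficient vector:

beta   <- c(0.5, -0.3, 0.2,  0, 0,  1.2)   # six coefficients
groups <- c(1, 1, 1,  2, 2,  3)            # e.g., blocks of indicators for three factors

lasso_pen <- sum(abs(beta))                             # sum_j |beta_j|
group_pen <- sum(tapply(beta, groups,
                 function(b) sqrt(length(b)) * sqrt(sum(b^2))))   # sum_g sqrt(p_g) ||beta_g||_2
c(lasso = lasso_pen, group = group_pen)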