Statistics for Applications
Chapter 7: Regression
Heuristics of the linear regression (1)

Consider a cloud of i.i.d. random points $(X_i, Y_i)$, $i = 1, \ldots, n$:
Heuristics of the linear regression (2)

- Idea: Fit the line that best fits the data.
- Approximation: $Y_i \approx a + b X_i$, $i = 1, \ldots, n$, for some (unknown) $a, b \in \mathbb{R}$.
- Find estimators $\hat{a}, \hat{b}$ that approach $a$ and $b$.
- More generally: $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $Y_i \approx a + X_i^\top b$, with $a \in \mathbb{R}$, $b \in \mathbb{R}^d$.
- Goal: Write a rigorous model and estimate $a$ and $b$.
Heuristics of the linear regression (3)

Examples:
- Economics: demand and price, $D_i \approx a + b\, p_i$, $i = 1, \ldots, n$.
- Ideal gas law ($PV = nRT$): $\log P_i \approx a + b \log V_i + c \log T_i$, $i = 1, \ldots, n$.
Linear regression of a r.v. Y on a r.v. X (1)

Let $X$ and $Y$ be two real random variables (not necessarily independent) with two finite moments and such that $\mathrm{Var}(X) \neq 0$.

The theoretical linear regression of $Y$ on $X$ is the best approximation in quadratic mean of $Y$ by a linear function of $X$, i.e. the r.v. $a + bX$, where $a$ and $b$ are the two real numbers minimizing $\mathbb{E}\left[(Y - a - bX)^2\right]$.

By some simple algebra:
- $b = \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}$,
- $a = \mathbb{E}[Y] - b\,\mathbb{E}[X] = \mathbb{E}[Y] - \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}\,\mathbb{E}[X]$.
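As a sanity check, here is a minimal sketch (not from the slides) that approximates $b = \mathrm{cov}(X, Y)/\mathrm{Var}(X)$ and $a = \mathbb{E}[Y] - b\,\mathbb{E}[X]$ by their empirical counterparts on simulated data; the true values $a = 1$, $b = 3$ and the distributions below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(loc=2.0, scale=1.5, size=n)
Y = 1.0 + 3.0 * X + rng.normal(scale=0.5, size=n)   # illustrative true a = 1, b = 3

b = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)  # cov(X, Y) / Var(X)
a = Y.mean() - b * X.mean()                         # E[Y] - b E[X]
print(a, b)                                         # close to (1.0, 3.0)
```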
Linear regression of a r.v. Y on a r.v. X (2)

- If $\varepsilon = Y - (a + bX)$, then $Y = a + bX + \varepsilon$, with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{cov}(X, \varepsilon) = 0$.
- Conversely: assume that $Y = a + bX + \varepsilon$ for some $a, b \in \mathbb{R}$ and some centered r.v. $\varepsilon$ that satisfies $\mathrm{cov}(X, \varepsilon) = 0$. E.g., if $X \perp\!\!\!\perp \varepsilon$ or if $\mathbb{E}[\varepsilon \mid X] = 0$, then $\mathrm{cov}(X, \varepsilon) = 0$.
- Then $a + bX$ is the theoretical linear regression of $Y$ on $X$.
Linear regression of a r.v. Y on a r.v. X (3)

A sample of $n$ i.i.d. random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as $(X, Y)$ is available. We want to estimate $a$ and $b$.
Linear regression of a r.v. Y on a r.v. X (4)

Definition: The least squared error (LSE) estimator of $(a, b)$ is the minimizer of the sum of squared errors:
$$\sum_{i=1}^n (Y_i - a - bX_i)^2.$$

$(\hat{a}, \hat{b})$ is given by
$$\hat{b} = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}, \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X}.$$
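A minimal sketch of these closed-form estimators on simulated data; the sample size and true coefficients below are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0.0, 10.0, size=n)
Y = 1.0 + 3.0 * X + rng.normal(scale=2.0, size=n)   # illustrative true a = 1, b = 3

# b_hat = (mean(XY) - mean(X) mean(Y)) / (mean(X^2) - mean(X)^2)
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean()**2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)                                 # roughly (1, 3)
```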
Linear regression of a r.v. Y on a r.v. X (5)

[Figure slide; no text recovered.]
Multivariate case (1)

$$Y_i = X_i^\top \beta + \varepsilon_i, \qquad i = 1, \ldots, n.$$

- Vector of explanatory variables or covariates: $X_i \in \mathbb{R}^p$ (w.l.o.g., assume its first coordinate is 1).
- Dependent variable: $Y_i$.
- $\beta = (a, b)$; $\beta_1\,(= a)$ is called the intercept.
- $\{\varepsilon_i\}_{i=1,\ldots,n}$: noise terms satisfying $\mathrm{cov}(X_i, \varepsilon_i) = 0$.

Definition: The least squared error (LSE) estimator of $\beta$ is the minimizer of the sum of squared errors:
$$\hat{\beta} = \operatorname*{argmin}_{t \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i^\top t)^2.$$
Multivariate case (2)

LSE in matrix form:
- Let $Y = (Y_1, \ldots, Y_n)^\top \in \mathbb{R}^n$.
- Let $X$ be the $n \times p$ matrix whose rows are $X_1^\top, \ldots, X_n^\top$ ($X$ is called the design).
- Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top \in \mathbb{R}^n$ (unobserved noise).
- $Y = X\beta + \varepsilon$.
- The LSE $\hat{\beta}$ satisfies:
$$\hat{\beta} = \operatorname*{argmin}_{t \in \mathbb{R}^p} \|Y - Xt\|_2^2.$$
Multivariate case (3)

Assume that $\mathrm{rank}(X) = p$.

Analytic computation of the LSE:
$$\hat{\beta} = (X^\top X)^{-1} X^\top Y.$$

Geometric interpretation of the LSE: $X\hat{\beta}$ is the orthogonal projection of $Y$ onto the subspace spanned by the columns of $X$:
$$X\hat{\beta} = PY, \quad \text{where } P = X(X^\top X)^{-1}X^\top.$$
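A minimal NumPy sketch of the analytic LSE on an assumed synthetic design (dimensions, coefficients, and noise level are illustrative); solving the normal equations $X^\top X\, b = X^\top Y$ is preferred to forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])                          # illustrative coefficients
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves the normal equations X'X b = X'Y
Y_fitted = X @ beta_hat                        # = P Y, orthogonal projection of Y
print(beta_hat)
```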
Linear regression with deterministic design and Gaussian noise (1)

Assumptions:
- The design matrix $X$ is deterministic and $\mathrm{rank}(X) = p$.
- The model is homoscedastic: $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d.
- The noise vector $\varepsilon$ is Gaussian: $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$, for some known or unknown $\sigma^2 > 0$.
Linear regression with deterministic design and Gaussian noise (2)

- LSE = MLE: $\hat{\beta} \sim \mathcal{N}_p\left(\beta, \sigma^2 (X^\top X)^{-1}\right)$.
- Quadratic risk of $\hat{\beta}$: $\mathbb{E}\left[\|\hat{\beta} - \beta\|_2^2\right] = \sigma^2 \,\mathrm{tr}\left((X^\top X)^{-1}\right)$.
- Prediction error: $\mathbb{E}\left[\|Y - X\hat{\beta}\|_2^2\right] = \sigma^2 (n - p)$.
- Unbiased estimator of $\sigma^2$: $\hat{\sigma}^2 = \dfrac{1}{n - p}\|Y - X\hat{\beta}\|_2^2$.

Theorem:
- $(n - p)\,\dfrac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$;
- $\hat{\beta} \perp\!\!\!\perp \hat{\sigma}^2$.
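A minimal sketch of the unbiased variance estimator $\hat{\sigma}^2 = \|Y - X\hat{\beta}\|_2^2 / (n - p)$ on assumed synthetic data (the noise level $\sigma = 0.3$ is an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
sigma = 0.3                                     # illustrative noise level
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)    # unbiased estimator of sigma^2
print(sigma2_hat)                               # close to sigma**2 = 0.09
```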
Significance tests (1)

Test whether the $j$-th explanatory variable is significant in the linear regression ($1 \le j \le p$):
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0.$$

If $\gamma_j$ is the $j$-th diagonal coefficient of $(X^\top X)^{-1}$ ($\gamma_j > 0$):
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 \gamma_j}} \sim t_{n-p}.$$

Let $T_n^{(j)} = \dfrac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 \gamma_j}}$.

Test with non-asymptotic level $\alpha \in (0, 1)$:
$$\delta_\alpha^{(j)} = \mathbf{1}\left\{|T_n^{(j)}| > q_{\alpha/2}(t_{n-p})\right\},$$
where $q_{\alpha/2}(t_{n-p})$ is the $(1 - \alpha/2)$-quantile of $t_{n-p}$.
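A minimal sketch of this t-test on assumed synthetic data, where the last coefficient is set to zero so that $H_0$ holds for it; the design, coefficients, and level are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0])            # last covariate is truly irrelevant
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

alpha, j = 0.05, 2                               # test the last coefficient (0-indexed)
T_j = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])   # T_n^(j)
reject = abs(T_j) > stats.t.ppf(1 - alpha / 2, df=n - p)  # compare to q_{alpha/2}(t_{n-p})
print(T_j, reject)                               # typically no rejection here
```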
Significance tests (2)

Test whether a group of explanatory variables is significant in the linear regression:
$$H_0: \beta_j = 0, \ \forall j \in S \quad \text{vs.} \quad H_1: \exists j \in S, \ \beta_j \neq 0,$$
where $S \subseteq \{1, \ldots, p\}$.

Bonferroni's test: $\delta_\alpha^B = \max_{j \in S} \delta_{\alpha/k}^{(j)}$, where $k = |S|$.

$\delta_\alpha^B$ has non-asymptotic level at most $\alpha$.
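A minimal sketch of Bonferroni's group test, written as a small helper (the function name and defaults are illustrative, not from the slides): it rejects as soon as any single-coefficient test at level $\alpha/k$ rejects.

```python
import numpy as np
from scipy import stats

def bonferroni_test(X, Y, S, alpha=0.05):
    """Reject H_0: beta_j = 0 for all j in S (non-asymptotic level at most alpha)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
    k = len(S)
    # Each individual test is run at level alpha / k, i.e. with the
    # (1 - alpha / (2k))-quantile of t_{n-p}.
    threshold = stats.t.ppf(1 - alpha / (2 * k), df=n - p)
    T = beta_hat[S] / np.sqrt(sigma2_hat * np.diag(XtX_inv)[S])
    return bool(np.any(np.abs(T) > threshold))

# Example usage with the simulated X, Y from the previous sketch:
# bonferroni_test(X, Y, S=[1, 2], alpha=0.05)
```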
More tests (1)

Let $G$ be a $k \times p$ matrix with $\mathrm{rank}(G) = k$ ($k \le p$) and $\lambda \in \mathbb{R}^k$. Consider the hypotheses:
$$H_0: G\beta = \lambda \quad \text{vs.} \quad H_1: G\beta \neq \lambda.$$

The setup of the previous slide is a particular case.

If $H_0$ is true, then:
$$G\hat{\beta} - \lambda \sim \mathcal{N}_k\left(0, \sigma^2\, G(X^\top X)^{-1}G^\top\right),$$
and
$$\sigma^{-2}\,(G\hat{\beta} - \lambda)^\top \left(G(X^\top X)^{-1}G^\top\right)^{-1}(G\hat{\beta} - \lambda) \sim \chi^2_k.$$
More tests (2)

Let $S_n = \dfrac{1}{k\hat{\sigma}^2}\,(G\hat{\beta} - \lambda)^\top \left(G(X^\top X)^{-1}G^\top\right)^{-1}(G\hat{\beta} - \lambda)$.

If $H_0$ is true, then $S_n \sim F_{k, n-p}$.

Test with non-asymptotic level $\alpha \in (0, 1)$:
$$\delta_\alpha = \mathbf{1}\left\{S_n > q_\alpha(F_{k, n-p})\right\},$$
where $q_\alpha(F_{k, n-p})$ is the $(1 - \alpha)$-quantile of $F_{k, n-p}$.

Definition: The Fisher distribution with $p$ and $q$ degrees of freedom, denoted by $F_{p,q}$, is the distribution of $\dfrac{U/p}{V/q}$, where $U \sim \chi^2_p$, $V \sim \chi^2_q$, and $U \perp\!\!\!\perp V$.
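A minimal sketch of this F-test on assumed synthetic data, testing $H_0: \beta_2 = \beta_3 = 0$ via the constraint matrix $G$ below (the data-generating choices are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.0, 0.0])            # H_0 holds for the last two coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

G = np.array([[0.0, 1.0, 0.0],                   # G beta = (beta_2, beta_3)
              [0.0, 0.0, 1.0]])
lam = np.zeros(2)                                # H_0: G beta = 0
k = G.shape[0]

diff = G @ beta_hat - lam
S_n = diff @ np.linalg.solve(G @ XtX_inv @ G.T, diff) / (k * sigma2_hat)
reject = S_n > stats.f.ppf(1 - 0.05, dfn=k, dfd=n - p)   # (1 - alpha)-quantile of F_{k,n-p}
print(S_n, reject)
```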
Concluding remarks

- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: One can use goodness-of-fit tests to test whether the residuals $\hat{\varepsilon}_i = Y_i - X_i^\top \hat{\beta}$ are Gaussian.
- Deterministic design: If $X$ is not deterministic, all of the above can be understood conditionally on $X$, if the noise is assumed to be Gaussian conditionally on $X$.
Linear regression and lack of identifiability (1)

Consider the following model: $Y = X\beta + \varepsilon$, with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.

Previously, we assumed that $X$ had rank $p$, so we could invert $X^\top X$.

What if $X$ is not of rank $p$? E.g., if $p > n$?

Then $\beta$ would no longer be identified: estimation of $\beta$ is vain (unless we add more structure).
Linear regression and lack of identifiability (2)

What about prediction? $X\beta$ is still identified.

$\hat{Y}$: orthogonal projection of $Y$ onto the linear span of the columns of $X$:
$$\hat{Y} = X\hat{\beta} = X(X^\top X)^\dagger X^\top Y,$$
where $A^\dagger$ stands for the (Moore-Penrose) pseudo-inverse of a matrix $A$.

Similarly as before, if $k = \mathrm{rank}(X)$:
- $\dfrac{\|\hat{Y} - Y\|_2^2}{\sigma^2} \sim \chi^2_{n-k}$,
- $\hat{Y} \perp\!\!\!\perp \|\hat{Y} - Y\|_2^2$.
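A minimal sketch with an assumed rank-deficient design (duplicated columns, an illustrative construction): $\beta$ is not identified, but the prediction $\hat{Y} = X(X^\top X)^\dagger X^\top Y$ is still well defined, and the residuals give the variance estimator of the next slide.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
B = rng.normal(size=(n, 5))
X = np.hstack([B, B])                   # p = 10 columns but rank(X) = 5: beta not identified
beta_true = rng.normal(size=10)
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

P = X @ np.linalg.pinv(X.T @ X) @ X.T   # projector onto the column span of X
Y_hat = P @ Y                           # X beta is still identified: Y_hat is well defined
k = np.linalg.matrix_rank(X)            # k = 5 here
sigma2_hat = np.sum((Y - Y_hat) ** 2) / (n - k)   # unbiased variance estimator (next slide)
print(k, sigma2_hat)
```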
Linear regression and lack of identifiability (3)

In particular:
$$\mathbb{E}\left[\|\hat{Y} - Y\|_2^2\right] = (n - k)\,\sigma^2.$$

Unbiased estimator of the variance:
$$\hat{\sigma}^2 = \frac{1}{n - k}\|\hat{Y} - Y\|_2^2.$$
Linear regression in high dimension (1)

Consider again the following model: $Y = X\beta + \varepsilon$, with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown: to be estimated;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.

For each $i$, $X_i \in \mathbb{R}^p$ is the vector of covariates of the $i$-th individual.

If $p$ is too large ($p > n$), there are too many parameters to estimate (the model overfits), even though some covariates may be irrelevant.

Solution: reduction of the dimension.
Linear regression in high dimension (2)

Idea: Assume that only a few coordinates of $\beta$ are nonzero (but we do not know which ones).

Based on the sample, select a subset of covariates and estimate the corresponding coordinates of $\beta$.

For $S \subseteq \{1, \ldots, p\}$, let
$$\hat{\beta}_S \in \operatorname*{argmin}_{t \in \mathbb{R}^S} \|Y - X_S\, t\|_2^2,$$
where $X_S$ is the submatrix of $X$ obtained by keeping only the covariates indexed in $S$.
Linear regression in high dimension (3)

Select a subset $S$ that minimizes the prediction error penalized by the complexity (or size) of the model:
$$\|Y - X_S \hat{\beta}_S\|_2^2 + \lambda |S|,$$
where $\lambda > 0$ is a tuning parameter.

- If $\lambda = 2\hat{\sigma}^2$, this is Mallows' $C_p$ or the AIC criterion.
- If $\lambda = \hat{\sigma}^2 \log n$, this is the BIC criterion.
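A minimal sketch of this penalized subset selection by exhaustive search (feasible only for small $p$; the data, the pre-estimated $\hat{\sigma}^2$, and the AIC choice of $\lambda$ are illustrative assumptions).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0])   # only two relevant covariates
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

sigma2_hat = 0.25            # assumed pre-estimated noise variance for the penalty
lam = 2 * sigma2_hat         # AIC / Mallows' C_p choice of the tuning parameter

best_S, best_crit = (), np.inf
for k in range(p + 1):
    for S in combinations(range(p), k):
        if S:
            X_S = X[:, S]
            beta_S, *_ = np.linalg.lstsq(X_S, Y, rcond=None)
            rss = np.sum((Y - X_S @ beta_S) ** 2)
        else:
            rss = np.sum(Y ** 2)
        crit = rss + lam * k                 # ||Y - X_S beta_S||^2 + lambda |S|
        if crit < best_crit:
            best_S, best_crit = S, crit
print(best_S)                # typically recovers the truly nonzero coordinates (0, 2)
```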