Statistics for Applications
Chapter 7: Regression
Heuristics of the linear regression (1)

Consider a cloud of i.i.d. random points $(X_i, Y_i)$, $i = 1, \ldots, n$:
Heuristics of the linear regression (2)

- Idea: Fit the line that best fits the data.
- Approximation: $Y_i \approx a + b X_i$, $i = 1, \ldots, n$, for some (unknown) $a, b \in \mathbb{R}$.
- Find estimators $\hat{a}, \hat{b}$ that approach $a$ and $b$.
- More generally: $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $Y_i \approx a + X_i^\top b$, with $a \in \mathbb{R}$, $b \in \mathbb{R}^d$.
- Goal: Write a rigorous model and estimate $a$ and $b$.
Heuristics of the linear regression (3)

Examples:
- Economics: demand and price, $D_i \approx a + b\, p_i$, $i = 1, \ldots, n$.
- Ideal gas law ($PV = nRT$): $\log P_i \approx a + b \log V_i + c \log T_i$, $i = 1, \ldots, n$.
Linear regression of a r.v. Y on a r.v. X (1)

Let $X$ and $Y$ be two real random variables (not necessarily independent) with two finite moments and such that $\mathrm{Var}(X) \neq 0$.

The theoretical linear regression of $Y$ on $X$ is the best approximation in quadratic mean of $Y$ by a linear function of $X$, i.e. the r.v. $a + bX$, where $a$ and $b$ are the two real numbers minimizing $\mathbb{E}\left[(Y - a - bX)^2\right]$.

By some simple algebra:
- $b = \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}$,
- $a = \mathbb{E}[Y] - b\,\mathbb{E}[X] = \mathbb{E}[Y] - \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}\,\mathbb{E}[X]$.
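As a sanity check, here is a minimal sketch (not from the slides) that approximates $b = \mathrm{cov}(X, Y)/\mathrm{Var}(X)$ and $a = \mathbb{E}[Y] - b\,\mathbb{E}[X]$ by their empirical counterparts on simulated data; the true values $a = 1$, $b = 3$ and the distributions below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(loc=2.0, scale=1.5, size=n)
Y = 1.0 + 3.0 * X + rng.normal(scale=0.5, size=n)   # illustrative true a = 1, b = 3

b = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)  # cov(X, Y) / Var(X)
a = Y.mean() - b * X.mean()                         # E[Y] - b E[X]
print(a, b)                                         # close to (1.0, 3.0)
```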
Linear regression of a r.v. Y on a r.v. X (2)

- If $\varepsilon = Y - (a + bX)$, then $Y = a + bX + \varepsilon$, with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{cov}(X, \varepsilon) = 0$.
- Conversely: assume that $Y = a + bX + \varepsilon$ for some $a, b \in \mathbb{R}$ and some centered r.v. $\varepsilon$ that satisfies $\mathrm{cov}(X, \varepsilon) = 0$. E.g., if $X \perp\!\!\!\perp \varepsilon$ or if $\mathbb{E}[\varepsilon \mid X] = 0$, then $\mathrm{cov}(X, \varepsilon) = 0$.
- Then $a + bX$ is the theoretical linear regression of $Y$ on $X$.
Linear regression of a r.v. Y on a r.v. X (3)

A sample of $n$ i.i.d. random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as $(X, Y)$ is available. We want to estimate $a$ and $b$.
Linear regression of a r.v. Y on a r.v. X (4)

Definition: The least squared error (LSE) estimator of $(a, b)$ is the minimizer of the sum of squared errors:
$$\sum_{i=1}^n (Y_i - a - bX_i)^2.$$

$(\hat{a}, \hat{b})$ is given by
$$\hat{b} = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}, \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X}.$$
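A minimal sketch of these closed-form estimators on simulated data; the sample size and true coefficients below are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0.0, 10.0, size=n)
Y = 1.0 + 3.0 * X + rng.normal(scale=2.0, size=n)   # illustrative true a = 1, b = 3

# b_hat = (mean(XY) - mean(X) mean(Y)) / (mean(X^2) - mean(X)^2)
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean()**2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)                                 # roughly (1, 3)
```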
Linear regression of a r.v. Y on a r.v. X (5)

[Figure slide; no text recovered.]
Multivariate case (1)

$$Y_i = X_i^\top \beta + \varepsilon_i, \qquad i = 1, \ldots, n.$$

- Vector of explanatory variables or covariates: $X_i \in \mathbb{R}^p$ (w.l.o.g., assume its first coordinate is 1).
- Dependent variable: $Y_i$.
- $\beta = (a, b)$; $\beta_1\,(= a)$ is called the intercept.
- $\{\varepsilon_i\}_{i=1,\ldots,n}$: noise terms satisfying $\mathrm{cov}(X_i, \varepsilon_i) = 0$.

Definition: The least squared error (LSE) estimator of $\beta$ is the minimizer of the sum of squared errors:
$$\hat{\beta} = \operatorname*{argmin}_{t \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i^\top t)^2.$$
Multivariate case (2)

LSE in matrix form:
- Let $Y = (Y_1, \ldots, Y_n)^\top \in \mathbb{R}^n$.
- Let $X$ be the $n \times p$ matrix whose rows are $X_1^\top, \ldots, X_n^\top$ ($X$ is called the design).
- Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top \in \mathbb{R}^n$ (unobserved noise).
- $Y = X\beta + \varepsilon$.
- The LSE $\hat{\beta}$ satisfies:
$$\hat{\beta} = \operatorname*{argmin}_{t \in \mathbb{R}^p} \|Y - Xt\|_2^2.$$
Multivariate case (3)

Assume that $\mathrm{rank}(X) = p$.

Analytic computation of the LSE:
$$\hat{\beta} = (X^\top X)^{-1} X^\top Y.$$

Geometric interpretation of the LSE: $X\hat{\beta}$ is the orthogonal projection of $Y$ onto the subspace spanned by the columns of $X$:
$$X\hat{\beta} = PY, \quad \text{where } P = X(X^\top X)^{-1}X^\top.$$
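A minimal NumPy sketch of the analytic LSE on an assumed synthetic design (dimensions, coefficients, and noise level are illustrative); solving the normal equations $X^\top X\, b = X^\top Y$ is preferred to forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])                          # illustrative coefficients
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves the normal equations X'X b = X'Y
Y_fitted = X @ beta_hat                        # = P Y, orthogonal projection of Y
print(beta_hat)
```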
Linear regression with deterministic design and Gaussian noise (1)

Assumptions:
- The design matrix $X$ is deterministic and $\mathrm{rank}(X) = p$.
- The model is homoscedastic: $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d.
- The noise vector $\varepsilon$ is Gaussian: $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$, for some known or unknown $\sigma^2 > 0$.
Linear regression with deterministic design and Gaussian noise (2)

- LSE = MLE: $\hat{\beta} \sim \mathcal{N}_p\left(\beta, \sigma^2 (X^\top X)^{-1}\right)$.
- Quadratic risk of $\hat{\beta}$: $\mathbb{E}\left[\|\hat{\beta} - \beta\|_2^2\right] = \sigma^2 \,\mathrm{tr}\left((X^\top X)^{-1}\right)$.
- Prediction error: $\mathbb{E}\left[\|Y - X\hat{\beta}\|_2^2\right] = \sigma^2 (n - p)$.
- Unbiased estimator of $\sigma^2$: $\hat{\sigma}^2 = \dfrac{1}{n - p}\|Y - X\hat{\beta}\|_2^2$.

Theorem:
- $(n - p)\,\dfrac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$;
- $\hat{\beta} \perp\!\!\!\perp \hat{\sigma}^2$.
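A minimal sketch of the unbiased variance estimator $\hat{\sigma}^2 = \|Y - X\hat{\beta}\|_2^2 / (n - p)$ on assumed synthetic data (the noise level $\sigma = 0.3$ is an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
sigma = 0.3                                     # illustrative noise level
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)    # unbiased estimator of sigma^2
print(sigma2_hat)                               # close to sigma**2 = 0.09
```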
Significance tests (1)

Test whether the $j$-th explanatory variable is significant in the linear regression ($1 \le j \le p$):
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0.$$

If $\gamma_j$ is the $j$-th diagonal coefficient of $(X^\top X)^{-1}$ ($\gamma_j > 0$):
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 \gamma_j}} \sim t_{n-p}.$$

Let $T_n^{(j)} = \dfrac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 \gamma_j}}$.

Test with non-asymptotic level $\alpha \in (0, 1)$:
$$\delta_\alpha^{(j)} = \mathbf{1}\left\{|T_n^{(j)}| > q_{\alpha/2}(t_{n-p})\right\},$$
where $q_{\alpha/2}(t_{n-p})$ is the $(1 - \alpha/2)$-quantile of $t_{n-p}$.
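A minimal sketch of this t-test on assumed synthetic data, where the last coefficient is set to zero so that $H_0$ holds for it; the design, coefficients, and level are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0])            # last covariate is truly irrelevant
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

alpha, j = 0.05, 2                               # test the last coefficient (0-indexed)
T_j = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])   # T_n^(j)
reject = abs(T_j) > stats.t.ppf(1 - alpha / 2, df=n - p)  # compare to q_{alpha/2}(t_{n-p})
print(T_j, reject)                               # typically no rejection here
```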
Significance tests (2)

Test whether a group of explanatory variables is significant in the linear regression:
$$H_0: \beta_j = 0, \ \forall j \in S \quad \text{vs.} \quad H_1: \exists j \in S, \ \beta_j \neq 0,$$
where $S \subseteq \{1, \ldots, p\}$.

Bonferroni's test: $\delta_\alpha^B = \max_{j \in S} \delta_{\alpha/k}^{(j)}$, where $k = |S|$.

$\delta_\alpha^B$ has non-asymptotic level at most $\alpha$.
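A minimal sketch of Bonferroni's group test, written as a small helper (the function name and defaults are illustrative, not from the slides): it rejects as soon as any single-coefficient test at level $\alpha/k$ rejects.

```python
import numpy as np
from scipy import stats

def bonferroni_test(X, Y, S, alpha=0.05):
    """Reject H_0: beta_j = 0 for all j in S (non-asymptotic level at most alpha)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
    k = len(S)
    # Each individual test is run at level alpha / k, i.e. with the
    # (1 - alpha / (2k))-quantile of t_{n-p}.
    threshold = stats.t.ppf(1 - alpha / (2 * k), df=n - p)
    T = beta_hat[S] / np.sqrt(sigma2_hat * np.diag(XtX_inv)[S])
    return bool(np.any(np.abs(T) > threshold))

# Example usage with the simulated X, Y from the previous sketch:
# bonferroni_test(X, Y, S=[1, 2], alpha=0.05)
```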
More tests (1)

Let $G$ be a $k \times p$ matrix with $\mathrm{rank}(G) = k$ ($k \le p$) and $\lambda \in \mathbb{R}^k$. Consider the hypotheses:
$$H_0: G\beta = \lambda \quad \text{vs.} \quad H_1: G\beta \neq \lambda.$$

The setup of the previous slide is a particular case.

If $H_0$ is true, then:
$$G\hat{\beta} - \lambda \sim \mathcal{N}_k\left(0, \sigma^2\, G(X^\top X)^{-1}G^\top\right),$$
and
$$\sigma^{-2}\,(G\hat{\beta} - \lambda)^\top \left(G(X^\top X)^{-1}G^\top\right)^{-1}(G\hat{\beta} - \lambda) \sim \chi^2_k.$$
More tests (2)

Let $S_n = \dfrac{1}{k\hat{\sigma}^2}\,(G\hat{\beta} - \lambda)^\top \left(G(X^\top X)^{-1}G^\top\right)^{-1}(G\hat{\beta} - \lambda)$.

If $H_0$ is true, then $S_n \sim F_{k, n-p}$.

Test with non-asymptotic level $\alpha \in (0, 1)$:
$$\delta_\alpha = \mathbf{1}\left\{S_n > q_\alpha(F_{k, n-p})\right\},$$
where $q_\alpha(F_{k, n-p})$ is the $(1 - \alpha)$-quantile of $F_{k, n-p}$.

Definition: The Fisher distribution with $p$ and $q$ degrees of freedom, denoted by $F_{p,q}$, is the distribution of $\dfrac{U/p}{V/q}$, where $U \sim \chi^2_p$, $V \sim \chi^2_q$, and $U \perp\!\!\!\perp V$.
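A minimal sketch of this F-test on assumed synthetic data, testing $H_0: \beta_2 = \beta_3 = 0$ via the constraint matrix $G$ below (the data-generating choices are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.0, 0.0])            # H_0 holds for the last two coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

G = np.array([[0.0, 1.0, 0.0],                   # G beta = (beta_2, beta_3)
              [0.0, 0.0, 1.0]])
lam = np.zeros(2)                                # H_0: G beta = 0
k = G.shape[0]

diff = G @ beta_hat - lam
S_n = diff @ np.linalg.solve(G @ XtX_inv @ G.T, diff) / (k * sigma2_hat)
reject = S_n > stats.f.ppf(1 - 0.05, dfn=k, dfd=n - p)   # (1 - alpha)-quantile of F_{k,n-p}
print(S_n, reject)
```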
Concluding remarks

- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: One can use goodness-of-fit tests to test whether the residuals $\hat{\varepsilon}_i = Y_i - X_i^\top \hat{\beta}$ are Gaussian.
- Deterministic design: If $X$ is not deterministic, all of the above can be understood conditionally on $X$, if the noise is assumed to be Gaussian conditionally on $X$.
Linear regression and lack of identifiability (1)

Consider the following model: $Y = X\beta + \varepsilon$, with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.

Previously, we assumed that $X$ had rank $p$, so we could invert $X^\top X$.

What if $X$ is not of rank $p$? E.g., if $p > n$?

Then $\beta$ would no longer be identified: estimation of $\beta$ is vain (unless we add more structure).
Linear regression and lack of identifiability (2)

What about prediction? $X\beta$ is still identified.

$\hat{Y}$: orthogonal projection of $Y$ onto the linear span of the columns of $X$:
$$\hat{Y} = X\hat{\beta} = X(X^\top X)^\dagger X^\top Y,$$
where $A^\dagger$ stands for the (Moore-Penrose) pseudo-inverse of a matrix $A$.

Similarly as before, if $k = \mathrm{rank}(X)$:
- $\dfrac{\|\hat{Y} - Y\|_2^2}{\sigma^2} \sim \chi^2_{n-k}$,
- $\hat{Y} \perp\!\!\!\perp \|\hat{Y} - Y\|_2^2$.
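A minimal sketch with an assumed rank-deficient design (duplicated columns, an illustrative construction): $\beta$ is not identified, but the prediction $\hat{Y} = X(X^\top X)^\dagger X^\top Y$ is still well defined, and the residuals give the variance estimator of the next slide.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
B = rng.normal(size=(n, 5))
X = np.hstack([B, B])                   # p = 10 columns but rank(X) = 5: beta not identified
beta_true = rng.normal(size=10)
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

P = X @ np.linalg.pinv(X.T @ X) @ X.T   # projector onto the column span of X
Y_hat = P @ Y                           # X beta is still identified: Y_hat is well defined
k = np.linalg.matrix_rank(X)            # k = 5 here
sigma2_hat = np.sum((Y - Y_hat) ** 2) / (n - k)   # unbiased variance estimator (next slide)
print(k, sigma2_hat)
```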
Linear regression and lack of identifiability (3)

In particular:
$$\mathbb{E}\left[\|\hat{Y} - Y\|_2^2\right] = (n - k)\,\sigma^2.$$

Unbiased estimator of the variance:
$$\hat{\sigma}^2 = \frac{1}{n - k}\|\hat{Y} - Y\|_2^2.$$
Linear regression in high dimension (1)

Consider again the following model: $Y = X\beta + \varepsilon$, with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown: to be estimated;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.

For each $i$, $X_i \in \mathbb{R}^p$ is the vector of covariates of the $i$-th individual.

If $p$ is too large ($p > n$), there are too many parameters to estimate (the model overfits), even though some covariates may be irrelevant.

Solution: reduction of the dimension.
Linear regression in high dimension (2)

Idea: Assume that only a few coordinates of $\beta$ are nonzero (but we do not know which ones).

Based on the sample, select a subset of covariates and estimate the corresponding coordinates of $\beta$.

For $S \subseteq \{1, \ldots, p\}$, let
$$\hat{\beta}_S \in \operatorname*{argmin}_{t \in \mathbb{R}^S} \|Y - X_S\, t\|_2^2,$$
where $X_S$ is the submatrix of $X$ obtained by keeping only the covariates indexed in $S$.
Linear regression in high dimension (3)

Select a subset $S$ that minimizes the prediction error penalized by the complexity (or size) of the model:
$$\|Y - X_S \hat{\beta}_S\|_2^2 + \lambda |S|,$$
where $\lambda > 0$ is a tuning parameter.

- If $\lambda = 2\hat{\sigma}^2$, this is Mallows' $C_p$ or the AIC criterion.
- If $\lambda = \hat{\sigma}^2 \log n$, this is the BIC criterion.
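A minimal sketch of this penalized subset selection by exhaustive search (feasible only for small $p$; the data, the pre-estimated $\hat{\sigma}^2$, and the AIC choice of $\lambda$ are illustrative assumptions).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0])   # only two relevant covariates
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

sigma2_hat = 0.25            # assumed pre-estimated noise variance for the penalty
lam = 2 * sigma2_hat         # AIC / Mallows' C_p choice of the tuning parameter

best_S, best_crit = (), np.inf
for k in range(p + 1):
    for S in combinations(range(p), k):
        if S:
            X_S = X[:, S]
            beta_S, *_ = np.linalg.lstsq(X_S, Y, rcond=None)
            rss = np.sum((Y - X_S @ beta_S) ** 2)
        else:
            rss = np.sum(Y ** 2)
        crit = rss + lam * k                 # ||Y - X_S beta_S||^2 + lambda |S|
        if crit < best_crit:
            best_S, best_crit = S, crit
print(best_S)                # typically recovers the truly nonzero coordinates (0, 2)
```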