Multivariate Regression
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Table of Contents
1. Multivariate Regression
2. Confidence Intervals and Significance Tests
3. ANOVA Tables for Multivariate Regression
4. Chapter #11 R Assignment
Multivariate Regression
Given multivariate data,

    (x_1^{(1)}, x_2^{(1)}, …, x_k^{(1)}, y_1), (x_1^{(2)}, x_2^{(2)}, …, x_k^{(2)}, y_2), …, (x_1^{(n)}, x_2^{(n)}, …, x_k^{(n)}, y_n),

where x_1^{(i)}, x_2^{(i)}, …, x_k^{(i)} is a predictor of the response y_i, one explores the following possible model.

Definition (Statistical Model of Multivariate Linear Regression)
Given a k-dimensional multivariate predictor, (x_1^{(i)}, x_2^{(i)}, …, x_k^{(i)}), the response, y_i, is

    y_i = β_0 + β_1 x_1^{(i)} + ⋯ + β_k x_k^{(i)} + ε_i,

where β_0 + β_1 x_1^{(i)} + ⋯ + β_k x_k^{(i)} is the mean response. The noise terms, the ε_i's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β_0, β_1, …, β_k and σ.
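To make the model concrete, here is a minimal R sketch that simulates data from it with k = 2 predictors; the coefficient values, noise level and variable names are assumptions chosen for illustration, not taken from these slides.

# Illustrative simulation of y_i = beta_0 + beta_1*x1 + beta_2*x2 + eps_i
# (the parameter values below are assumed for the example).
set.seed(1)
n <- 50
beta0 <- 5; beta1 <- 2; beta2 <- -3     # assumed coefficients
sigma <- 1.5                            # assumed noise standard deviation
x1 <- runif(n); x2 <- runif(n)          # k = 2 predictors
eps <- rnorm(n, mean = 0, sd = sigma)   # noise terms, iid N(0, sigma)
y <- beta0 + beta1*x1 + beta2*x2 + eps  # mean response plus noise

Fitting lm(y ~ x1 + x2) to such simulated data should recover coefficients near the assumed β's.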
Definition
Given a multivariate normal sample, (x_1^{(1)}, …, x_k^{(1)}, y_1), …, (x_1^{(n)}, …, x_k^{(n)}, y_n), the least-squares multiple regression equation,

    ŷ = b_0 + b_1 x_1 + ⋯ + b_k x_k,

is the linear equation that minimizes

    Σ_{j=1}^{n} (ŷ_j − y_j)²,

where ŷ_j := b_0 + b_1 x_1^{(j)} + ⋯ + b_k x_k^{(j)}.
There must be at least k + 2 data points to obtain the estimators b_0, the b_j's and

    s² := Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − k − 1)

of β_0, the β_j's and σ², where
b_0, the y-intercept, is the unbiased, least-squares estimator of β_0.
b_j, the coefficient of x_j, is the unbiased, least-squares estimator of β_j.
s² is an unbiased estimator of σ² and s is an estimator of σ.
Due to computational intensity, computers are used to obtain b_0, the b_j's and s².
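As a minimal sketch of what the software computes, and anticipating the mtcars fit used in the examples below, the estimates can be read off an lm object:

# Minimal sketch: extracting b_0, the b_j's, s^2 and s from an lm fit.
g.lm <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
coef(g.lm)                                              # b_0 and the b_j's
n <- nrow(mtcars); k <- 4
s2 <- sum((mtcars$mpg - fitted(g.lm))^2) / (n - k - 1)  # s^2 = SSE/(n - k - 1)
sqrt(s2)   # s, reported by summary(g.lm) as the residual standard error (2.622)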
Confidence Intervals and Significance Tests
Due to computational intensity, computer programs are used with multiple regression. In particular, computers are used to calculate the SE_{b_j}'s, the standard errors of the b_j's.

Theorem
To test the hypothesis H_0 : β_j = 0, use the test statistic

    t = b_j / SE_{b_j} ∼ t(n − k − 1) under H_0.

A level (1 − α)100% confidence interval for β_j is

    b_j ± t*(n − k − 1) SE_{b_j}.

Accepting H_0 : β_j = 0 is accepting that there is no linear association between X_j and Y, i.e. that the correlation between X_j and Y is zero.
Example
> g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars)
> par(mfrow=c(2,2))
> plot(g.lm)
> par(mfrow=c(1,1))
Does the linear model fit?
[Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance); Chrysler Imperial, Fiat 128, Toyota Corolla and Maserati Bora are flagged as notable points.]
Example (cont.)
> summary(g.lm)

Call:
lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8664 -1.5819 -0.3788  1.1712  5.6468

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.329638   8.639032   3.164  0.00383 **
disp         0.002666   0.010738   0.248  0.80576
hp          -0.018666   0.015613  -1.196  0.24227
wt          -4.609123   1.265851  -3.641  0.00113 **
qsec         0.544160   0.466493   1.166  0.25362
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.622 on 27 degrees of freedom
Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107
F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10
Example (cont.)
And to find confidence intervals for the coefficients:
> confint(g.lm)
                  2.5 %      97.5 %
(Intercept)  9.60380809 45.05546784
disp        -0.01936545  0.02469831
hp          -0.05070153  0.01336912
wt          -7.20643496 -2.01181027
qsec        -0.41300458  1.50132521
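These intervals can be reproduced by hand from the theorem above, b_j ± t*(n − k − 1) SE_{b_j}; a minimal sketch for the wt coefficient:

# Minimal sketch: 95% confidence interval for the wt coefficient by hand.
est   <- summary(g.lm)$coefficients      # estimates, SEs, t values, p-values
b.wt  <- est["wt", "Estimate"]
SE.wt <- est["wt", "Std. Error"]
tstar <- qt(0.975, df = 32 - 4 - 1)      # t*(n - k - 1) = t*(27)
b.wt + c(-1, 1) * tstar * SE.wt          # -7.206 to -2.012, matching confint(g.lm)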
ANOVA Tables for Multivariate Regression
Definition

    SS_A   := Sum of Squares of Model = Σ_{j=1}^{n} (ŷ_j − ȳ)²
    SS_E   := Sum of Squares of Error = Σ_{j=1}^{n} (y_j − ŷ_j)²
    SS_TOT := Sum of Squares of Total = Σ_{j=1}^{n} (y_j − ȳ)²
    MS_A   := Mean Square of Model = SS_A / k
    MS_E   := Mean Square of Error = SS_E / (n − k − 1)

Theorem
    SS_TOT = SS_A + SS_E.
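A minimal R sketch (refitting the mtcars model from the earlier example) computes these sums of squares and checks the decomposition:

# Minimal sketch: sums of squares for the mtcars fit and SS_TOT = SS_A + SS_E.
g.lm <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
y <- mtcars$mpg; yhat <- fitted(g.lm)
SSA   <- sum((yhat - mean(y))^2)   # Sum of Squares of Model
SSE   <- sum((y - yhat)^2)         # Sum of Squares of Error
SSTOT <- sum((y - mean(y))^2)      # Sum of Squares of Total
all.equal(SSTOT, SSA + SSE)        # TRUE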
Theorem (ANOVA F Test for Multivariate Regression)
The test statistic for H_0 : β_1 = β_2 = ⋯ = β_k = 0 versus H_A : not H_0 is

    f = MS_A / MS_E.

The p-value of the above test is P(F ≥ f), where F ∼ F(k, n − k − 1) under H_0.

Statistical software usually summarizes the calculations and conclusion above in an ANOVA table:

Definition (ANOVA Table)
Source   df           SS       MS     F             p-value
Model    k            SS_A     MS_A   MS_A / MS_E   P(F(k, n − k − 1) ≥ f)
Error    n − k − 1    SS_E     MS_E
Total    n − 1        SS_TOT
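The F statistic and p-value can be computed directly from MS_A and MS_E; this minimal sketch reproduces the "F-statistic: 34.19 on 4 and 27 DF, p-value: 3.311e-10" line of summary(g.lm) for the mtcars fit.

# Minimal sketch: overall ANOVA F test, f = MS_A / MS_E, for the mtcars fit.
g.lm <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
y <- mtcars$mpg; n <- length(y); k <- 4
MSA <- sum((fitted(g.lm) - mean(y))^2) / k        # Mean Square of Model
MSE <- sum((y - fitted(g.lm))^2) / (n - k - 1)    # Mean Square of Error
f <- MSA / MSE                                    # 34.19
pf(f, k, n - k - 1, lower.tail = FALSE)           # p-value: 3.311e-10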
Definition
The squared multiple correlation is given by R² := SS_A / SS_TOT. The multiple correlation coefficient is just R = √R².

SS_A measures how much of the variation in the data is explained by the model. By taking the ratio of SS_A to the total amount of variation, SS_TOT, one obtains R², the portion of the variation that is explained by the model. In fact, R is just the correlation between the observations and the predicted values.

Inflation Problem: As k increases, R² increases, but the increase in predictability is illusory.
Solution: It is best to use

Definition
The adjusted coefficient of determination is

    R²_adj = 1 − [(n − 1)/(n − k − 1)] (1 − R²).
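A minimal sketch verifying these identities on the mtcars fit: R² = SS_A/SS_TOT matches the squared correlation between the observations and the fitted values, and the adjusted formula reproduces the values reported by summary(g.lm) on the next slide.

# Minimal sketch: R^2 and adjusted R^2 for the mtcars fit.
g.lm <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
y <- mtcars$mpg; n <- length(y); k <- 4
SSA <- sum((fitted(g.lm) - mean(y))^2); SSTOT <- sum((y - mean(y))^2)
R2 <- SSA / SSTOT                         # 0.8351  (Multiple R-squared)
cor(y, fitted(g.lm))^2                    # the same value
1 - (n - 1)/(n - k - 1) * (1 - R2)        # 0.8107  (Adjusted R-squared)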
ANOVA Tables for Multivariate Regression > g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars) > summary(g.lm) Call: lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars) Residuals: Min 1Q Median 3Q Max -3.8664 -1.5819 -0.3788 1.1712 5.6468 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 27.329638 8.639032 3.164 0.00383 ** disp 0.002666 0.010738 0.248 0.80576 hp -0.018666 0.015613 -1.196 0.24227 wt -4.609123 1.265851 -3.641 0.00113 ** qsec 0.544160 0.466493 1.166 0.25362 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 2.622 on 27 degrees of freedom Multiple R-squared: 0.8351, Adjusted R-squared: 0.8107 F-statistic: 34.19 on 4 and 27 DF, p-value: 3.311e-10 Over 80% of variation explained by the model, but it seems like only weight matters. Marc Mehlman Marc Mehlman Marc Mehlman (University of New Haven) Multivariate Regression 16 / 21