  1. Lecture 13. Multiple regression 2020

  2. (1) Introduction Now there is one response variable Y and two predictor variables, X and Z. Data (X_1, Z_1, Y_1), ..., (X_n, Z_n, Y_n). We want to either a) predict the value of Y associated with particular values of X and Z, or b) describe the relationship between Y, X and Z, or c) estimate the effect of changes in X and Z on Y.

  3. (2) Data example

     Race                Time (mins)   Distance (miles)   Climb (1000 ft)
     Greenmantle Dash          16.08              2.5               0.65
     Carnethy 5 Hill           48.35              6.0               2.50
     Craig Dunain              33.65              6.0               0.90
     Ben Rha                   45.60              7.5               0.80
     Ben Lomond                62.27              8.0               3.07
     Goat Fell                 73.22              8.0               2.87
     Bens of Jura             204.62             16.0               7.50
     Cairnpapple               36.37              6.0               0.80
     Scolty                    29.75              5.0               0.80
     Traprain Law              39.75              6.0               0.65
     ... and so on ...

  4. (3) Prediction equation As for simple linear regression, it may be that a) predictors X, Z and response Y are all random, or b) values of predictors X and Z are fixed, e.g. by experimental design. In either case, there is a prediction equation

     Y = b_0 + b_1 X + b_2 Z + e

     The prediction error e is assumed N(0, σ²).
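As a minimal sketch (not part of the lecture), the prediction equation can be simulated and fitted in R with lm; the coefficient values, sample size and variable names below are arbitrary choices for illustration.

     # simulate data from Y = b_0 + b_1 X + b_2 Z + e, with e ~ N(0, sigma^2)
     set.seed(1)
     n <- 50
     x <- runif(n, 0, 10)
     z <- runif(n, 0, 5)
     y <- 3 + 1.5 * x + 2.0 * z + rnorm(n, sd = 2)   # true b_0 = 3, b_1 = 1.5, b_2 = 2
     fit <- lm(y ~ x + z)                            # multiple regression of y on x and z
     coef(fit)                                       # estimates of b_0, b_1, b_2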

  5. (4) The multiple regression surface [3-D scatter plot of the data in (x, z, y) space, illustrating the fitted regression surface; axes labelled x, z and y]

  6. (5) Sums of squares and products The starting point for all calculations is this 3 × 3 matrix of sums of squares and products:

     | S_xx   S_xz   S_xy |
     | S_zx   S_zz   S_zy |
     | S_yx   S_yz   S_yy |

  7. (6) Estimation equations The estimates b̂_1 and b̂_2 are the solutions of the two equations

     b_1 S_xx + b_2 S_xz = S_xy
     b_1 S_zx + b_2 S_zz = S_zy

     When appropriate, corrected sums of squares and products are replaced by variances and covariances. The 'partial' regression coefficient b_1 is the effect on E(Y) of changing X while holding Z constant. The 'partial' regression coefficient b_2 is the effect on E(Y) of changing Z while holding X constant.
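A small illustrative check (simulated data, same arbitrary setup as the sketch above, not lecture code): build the corrected sums of squares and products, solve the two equations directly, and compare with lm.

     set.seed(1)
     x <- runif(50, 0, 10); z <- runif(50, 0, 5)
     y <- 3 + 1.5 * x + 2.0 * z + rnorm(50, sd = 2)
     Sxx <- sum((x - mean(x))^2);  Szz <- sum((z - mean(z))^2)
     Sxz <- sum((x - mean(x)) * (z - mean(z)))
     Sxy <- sum((x - mean(x)) * (y - mean(y)))
     Szy <- sum((z - mean(z)) * (y - mean(y)))
     A <- matrix(c(Sxx, Sxz, Sxz, Szz), nrow = 2)    # coefficients of the two equations
     solve(A, c(Sxy, Szy))                           # estimates of b_1 and b_2
     coef(lm(y ~ x + z))[2:3]                        # lm gives the same values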

  8. (7) Partial regression coefficients When X is increased by one unit, the total effect on Y is the sum of two parts, one due to the change in X, the other due to the concomitant change in Z. If the model includes X and not Z, we only see the total effect. Including both X and Z in the model allows us to separate the two parts. The partial regression coefficient estimates the part specific to X.

  9. (8) Estimate of regression coefficient The estimate of b_1 is

     (S_xy − S_xz S_yz / S_zz) / (S_xx − S_xz² / S_zz)

     Compare this with the estimate S_xy / S_xx obtained when Z is ignored. The denominator is the residual sum of squares obtained after regressing X on Z. The numerator is the sum of products of Y and the residual of X after fitting Z. There is a similar expression for the estimate of b_2.
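A hedged numerical check of this formula (again on simulated, made-up data): the closed-form expression, the residual-based version described above, and the lm coefficient all agree.

     set.seed(1)
     x <- runif(50, 0, 10); z <- runif(50, 0, 5)
     y <- 3 + 1.5 * x + 2.0 * z + rnorm(50, sd = 2)
     Sxx <- sum((x - mean(x))^2);  Szz <- sum((z - mean(z))^2)
     Sxz <- sum((x - mean(x)) * (z - mean(z)))
     Sxy <- sum((x - mean(x)) * (y - mean(y)))
     Szy <- sum((z - mean(z)) * (y - mean(y)))
     b1 <- (Sxy - Sxz * Szy / Szz) / (Sxx - Sxz^2 / Szz)      # closed-form estimate
     rx <- resid(lm(x ~ z))                                   # residual of X after fitting Z
     c(b1, sum(rx * y) / sum(rx^2), coef(lm(y ~ x + z))["x"]) # all three agree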

  10. (9) Residuals and fitted values The fitted value is now

     Ŷ = Ȳ + b̂_1 (X − X̄) + b̂_2 (Z − Z̄)

     and the anova equation still holds:

     Σ (Y − Ȳ)² = Σ (Ŷ − Ȳ)² + Σ (Y − Ŷ)²

     The regression SSQ simplifies to b̂_1 S_xy + b̂_2 S_zy, with 2 d.f. The residual sum of squares has n − 3 d.f.
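A short sketch (simulated data as above, for illustration only) verifying the anova identity and the simplified form of the regression sum of squares:

     set.seed(1)
     x <- runif(50, 0, 10); z <- runif(50, 0, 5)
     y <- 3 + 1.5 * x + 2.0 * z + rnorm(50, sd = 2)
     fit  <- lm(y ~ x + z)
     yhat <- fitted(fit)
     Sxy <- sum((x - mean(x)) * (y - mean(y)))
     Szy <- sum((z - mean(z)) * (y - mean(y)))
     sum((y - mean(y))^2)                              # total SSQ ...
     sum((yhat - mean(y))^2) + sum((y - yhat)^2)       # ... equals regression SSQ + residual SSQ
     c(sum((yhat - mean(y))^2),
       coef(fit)["x"] * Sxy + coef(fit)["z"] * Szy)    # regression SSQ computed two ways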

  11. (10) The anova table Sums of squares and mean squares are set out in an anova table, as for simple linear regression, but now the degrees of freedom for the regression, residual and total sums of squares are 2, n − 3, and n − 1. The ANOVA F statistic (with 2 and n − 3 d.f.) tests the null hypothesis that b_1 = b_2 = 0, i.e. that E(Y) = b_0 (constant). The regression sum of squares may be split into two components, each with 1 d.f. See later.

  12. (11) Tests for regression coefficients There is a t test for the hypothesis b_1 = 0. As usual, the test statistic is the estimate of b_1 divided by its standard error. The null distribution is t with n − 3 d.f. S_xx determined the size of the s.e. for simple linear regression. Now this role is played by S_xx − S_xz² / S_zz. Correlation between the two predictors reduces this quantity and 'inflates' the standard error. There is a similar result for b_2 (switch x and z in the previous paragraph).
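As a rough check of this quantity (simulated data again, illustrative only), the variance reported by vcov() for the first slope coefficient equals the residual mean square divided by S_xx − S_xz²/S_zz:

     set.seed(1)
     x <- runif(50, 0, 10); z <- runif(50, 0, 5)
     y <- 3 + 1.5 * x + 2.0 * z + rnorm(50, sd = 2)
     fit <- lm(y ~ x + z)
     Sxx <- sum((x - mean(x))^2);  Szz <- sum((z - mean(z))^2)
     Sxz <- sum((x - mean(x)) * (z - mean(z)))
     s2  <- summary(fit)$sigma^2                        # residual mean square, n - 3 d.f.
     c(vcov(fit)["x", "x"], s2 / (Sxx - Sxz^2 / Szz))   # variance of the estimate of b_1, two ways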

  13. (12) Cow and her relatives [Pedigree diagram: the Cow with its Mother and Father, and a paternal half-sib whose dam is unknown (??)]

  14. (13) Estimated breeding value Y is the breeding value of a cow; X and Z are the phenotypes of its mother and a paternal half-sister. We want to use X and Z to predict Y. The covariance matrix for X, Z and Y is

     | V_P         0           (1/2) V_A |
     | 0           V_P         (1/4) V_A |
     | (1/2) V_A   (1/4) V_A   V_A       |

     where V_P = V_A + V_E.

  15. (14) Estimated breeding value

     | V_P         0           (1/2) V_A |
     | 0           V_P         (1/4) V_A |
     | (1/2) V_A   (1/4) V_A   V_A       |

     The two equations to be solved are b_1 V_P = (1/2) V_A and b_2 V_P = (1/4) V_A, and the prediction is Ŷ = h² (X/2 + Z/4), where h² = V_A / V_P is the heritability of the trait.
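A tiny R sketch of these prediction weights; the variance components below are made-up numbers, only the algebra comes from the slide.

     V_A <- 0.4;  V_E <- 0.6          # illustrative (not real) variance components
     V_P <- V_A + V_E                 # phenotypic variance
     h2  <- V_A / V_P                 # heritability
     b1  <- (V_A / 2) / V_P           # solves b_1 V_P = V_A / 2, i.e. b_1 = h^2 / 2
     b2  <- (V_A / 4) / V_P           # solves b_2 V_P = V_A / 4, i.e. b_2 = h^2 / 4
     ebv <- function(x, z) h2 * (x / 2 + z / 4)   # predicted breeding value
     c(b1 * 1.2 + b2 * (-0.5), ebv(1.2, -0.5))    # same prediction for example phenotype deviations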

  16. END OF LECTURE

  17. Lecture 14. Hill race data 2020

  18. (15) A special case If Z takes values 0 and 1, the model gives b_0 + b_2 X when Z = 0 and b_0 + b_1 + b_2 X when Z = 1. The common slope of the parallel lines is b_2. The intercept for the first line is b_0; the intercept for the second line is b_0 + b_1. So b_1 is the difference between the intercepts (the constant vertical distance between the two lines).
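A sketch of this special case with simulated data (arbitrary parameter values, not lecture code): the fitted model gives a common slope and an intercept shift equal to the coefficient of the 0/1 variable.

     set.seed(1)
     x <- runif(60, 0, 10)
     z <- rep(0:1, each = 30)                       # Z takes values 0 and 1
     y <- 5 + 8 * z + 2 * x + rnorm(60, sd = 2)     # two parallel lines, vertical gap 8
     fit <- lm(y ~ z + x)
     coef(fit)                                      # b_0, b_1 (intercept shift), b_2 (common slope)
     plot(x, y, pch = 19, col = c("grey60", "grey20")[z + 1])
     abline(coef(fit)[1], coef(fit)[3])                  # line for Z = 0
     abline(coef(fit)[1] + coef(fit)[2], coef(fit)[3])   # line for Z = 1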

  19. (16) A special case [Graph of Y against X for the model Y = b_0 + b_1 Z + b_2 X, showing the two parallel lines: the Z = 0 line with intercept b_0 and the Z = 1 line with intercept b_0 + b_1]

  20. (17) Hill-race data The difficulty of a hill race is measured by a) X = total distance covered, b) Z = total climb required. Given distance, climb, and record time for 31 Scottish hill races, multiple regression can find a relationship between record time Y and the two measures of difficulty X and Z.

  21. (18) Hill-race data [Scatter plot of Time (mins) against Distance (miles) for the 31 races, points shaded light or dark gray by climb group, with a single fitted regression line]

  22. (19) Hill-race data For this analysis, values of climb are grouped as low (climb < 1000 feet, Z = 0) or high (climb > 1000 feet, Z = 1), corresponding to light and dark gray dots on the graph.

                   Estimate   Std Error       t
     (Distance)      6.8731      0.4564   15.06
     (Climb)        10.3651      2.3175    4.472

     Both partial regression coefficients are highly significant (P < 0.001). The single regression line shown on the previous slide fails to capture the effect of different amounts of climb.

  23. (20) An F test Anovas for the regression on distance alone, and the regression on both distance and climb:

                           DF     SSQ
     Distance               1   12081
     Residual              29    1474
     Distance + Climb       2   12695
     Residual              28     860

     The two anovas can be combined into one:

                                      DF     SSQ
     Distance (ignoring Climb)         1   12081
     Climb (adjusted for Distance)     1     614
     Residual                         28     860

  24. (21) An F test

                          DF     SSQ     MSQ      F
     Distance              1   12081   12081
     Climb (adjusted)      1     614     614   20.0
     Residual             28     860    30.7

     Test b(Climb) = 0: F = 20.0 on 1 and 28 d.f. (P < 0.001). Adding climb to the equation significantly improves the fit. The hypothesis is firmly rejected. There is strong evidence for an effect of climb, after allowing for the effect of distance. Exactly the same result was obtained with a t test based on the estimated partial regression coefficient (F = t²).
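The same extra-sum-of-squares F test can be reproduced with anova() on two nested models. The sketch below assumes the hills31 data frame built on the R slide in Lecture 15, and assumes the Climb variable there is recorded in feet; the 0/1 grouping and the name ClimbHigh are illustrative.

     # assumes hills31 as defined in the Lecture 15 R code
     hills31$ClimbHigh <- as.numeric(hills31$Climb > 1000)    # grouped (0/1) climb, assuming feet
     fit1 <- lm(Time ~ Distance, data = hills31)              # distance alone
     fit2 <- lm(Time ~ Distance + ClimbHigh, data = hills31)  # distance + grouped climb
     anova(fit1, fit2)                                        # extra SSQ for climb, F on 1 and 28 d.f.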

  25. (22) Using the original climb data What happens if we use the original climb data rather than the grouped (0/1) version? The model is now E(Y) = (b_0 + b_1 Z) + b_2 X. On the (X, Y) graph, this specifies a family of parallel lines. The vertical position of the line changes smoothly and continuously as Z changes. The regression coefficient b_1 measures the rate at which this happens (in minutes per 1000 feet of climb). The grouped (0/1) version of Z gave just two lines, one for low-climb races, the other for high-climb races.

  26. (23) Comparing the two analyses The regression coefficient and s.e. for distance are similar in the two analyses. The table below shows the estimated effects of climb.

                    Estimate   Std Error       t
     Grouped Z       10.3651      2.3175   4.472
     Original Z       6.8288      1.1134   6.133

     The ungrouped analysis tells us that the (X, Y) line moves up (the predicted race time increases) by 6.8 mins for every additional 1000 feet of climb. The grouped analysis told us that the line for a 'high' climb race is 10.4 mins above the line for a 'low' climb race.

  27. (24) Diagnostic plot for hill race data [Residuals vs Fitted plot for lm(Time ~ Distance + Climb)]

  28. (25) Diagnostic plot for hill race data [Normal Q-Q plot of standardized residuals for lm(Time ~ Distance + Climb)]

  29. END OF LECTURE

  30. Lecture 15. Using R, . . . 2020

  31. (26) Using R The lm function deals with multiple regression. Diagnostic plots and analysis of variance tables are produced as for simple linear regression.

     library(sda)                                   # load the package used in this course
     hills31 <- subset(hills, Time < 100)           # keep the 31 races with time under 100 mins
     fit <- lm(Time ~ Distance + Climb, data = hills31)   # multiple regression of time on distance and climb
     summary(fit)                                   # estimates, standard errors, t tests
     anova(fit)                                     # sequential ('extra') sums of squares
     plot(fit, which = 1:2, add.smooth = FALSE)     # residuals vs fitted, normal Q-Q
     confint(fit, parm = 2:3)                       # confidence intervals for the two slope coefficients

  32. (27) summary and anova summary(fit) produces estimates and standard errors for the partial regression coefficients. Each coefficient is adjusted for all other effects in the model. The results do not depend on the order of terms. anova(fit) produces 'extra' sums of squares, which do depend on the order of terms.
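A short illustration of this point, again assuming the hills31 data frame from the previous slide: refitting with the terms in the opposite order changes anova() but not summary().

     fitA <- lm(Time ~ Distance + Climb, data = hills31)
     fitB <- lm(Time ~ Climb + Distance, data = hills31)
     anova(fitA)                    # Distance first, Climb adjusted for Distance
     anova(fitB)                    # Climb first, Distance adjusted for Climb
     summary(fitA)$coefficients     # same estimates and standard errors ...
     summary(fitB)$coefficients     # ... whichever order the terms are written in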
