
Regression: Marc H. Mehlman, University of New Haven (PowerPoint PPT Presentation)



1. Regression
Marc H. Mehlman, marcmehlman@yahoo.com, University of New Haven

“…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” – George Box

2. Table of Contents
1. Simple Regression
2. Confidence Intervals and Significance Tests
3. Variation
4. Chapter #10 R Assignment

3. Simple Regression

4. Simple Regression

Let
X = the predictor or independent variable,
Y = the response or dependent variable.

Given a bivariate random variable (X, Y), is there a linear (straight line) association between X and Y (plus some randomness)? And if so, what is it and how much randomness?

Definition (Statistical Model of Simple Linear Regression)
Given a predictor, x, the response, y, is
y = β₀ + β₁x + ε_x,
where β₀ + β₁x is the mean response for x. The noise terms, the ε_x's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β₀, β₁ and σ.
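A minimal R sketch (not from the slides) that simulates data from this model; the parameter values β₀ = 3, β₁ = 0.5 and σ = 1 are arbitrary choices for illustration, echoing the line y = 3 + 0.5x used in a later figure.

> set.seed(1)
> beta0 <- 3; beta1 <- 0.5; sigma <- 1          # hypothetical model parameters
> x <- runif(50, 0, 10)                         # predictor values
> eps <- rnorm(50, mean=0, sd=sigma)            # independent N(0, sigma) noise
> y <- beta0 + beta1*x + eps                    # responses: mean response plus noise
> plot(y ~ x); abline(beta0, beta1, col="red")  # simulated data around the true line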

5. Simple Regression: Conditions for Regression Inference

The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line µ_y = β₀ + β₁x. For each possible value of the explanatory variable x, the mean of the responses µ(y|x) moves along this line. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ).

[Figure: Normal curves of y, each with standard deviation σ, centered on the population regression line µ_y = β₀ + β₁x at several values of x.]

6. Simple Regression

[Figure: four scatterplots, each with fitted line ŷ = 3 + 0.5x, illustrating when regression is appropriate:
(a) moderate linear association; regression OK;
(b) obvious nonlinear relationship; regression inappropriate;
(c) one extreme outlier, requiring further examination;
(d) only two values for x; a redesign is due here…]

7. Simple Regression

Given a bivariate random sample from the simple linear regression model,
(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ),
one wishes to estimate the parameters of the model, (β₀, β₁, σ). Given an arbitrary line, y = mx + b, define the sum of the squares of errors to be
$\sum_{i=1}^{n} [y_i - (mx_i + b)]^2$.
Using calculus, one can find the least-squares regression line, y = b₀ + b₁x, that minimizes the sum of squares of errors.

8. Simple Regression

Theorem (Estimating β₀ and β₁)
Given the bivariate random sample (x₁, y₁), …, (xₙ, yₙ), the least-squares regression line y = b₀ + b₁x is obtained by letting
$b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$,
where b₀ is an unbiased estimator of β₀ and b₁ is an unbiased estimator of β₁.

Note: The point (x̄, ȳ) will lie on the regression line, though there is no reason to believe that (x̄, ȳ) is one of the data points. One can also calculate b₁ using
$b_1 = \frac{n \sum_{j=1}^{n} x_j y_j - (\sum_{j=1}^{n} x_j)(\sum_{j=1}^{n} y_j)}{n \sum_{j=1}^{n} x_j^2 - (\sum_{j=1}^{n} x_j)^2}.$
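As a sanity check (not from the slides), the theorem's formulas can be computed directly in R and compared with the coefficients lm() reports for the built-in “trees” data used in the examples that follow.

> x <- trees$Height; y <- trees$Girth
> b1 <- cor(x, y) * sd(y) / sd(x)   # b1 = r * (sy/sx)
> b0 <- mean(y) - b1 * mean(x)      # b0 = ybar - b1*xbar
> c(b0, b1)                         # agrees with coef(lm(Girth~Height,data=trees))
[1] -6.1883945  0.2557471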

9. Simple Regression: Example

> plot(trees$Girth ~ trees$Height, main="girth vs height")
> abline(lm(trees$Girth ~ trees$Height), col="red")

[Figure: scatterplot of trees$Girth versus trees$Height with the fitted regression line in red.]

Since both variables come from “trees”, in order for the R command “lm” (linear model) to work, “trees” has to be in the R format, “data.frame”.

> class(trees) # "trees" is in data.frame format - lm will work.
[1] "data.frame"
> g.lm=lm(Girth~Height,data=trees)
> coef(g.lm)
(Intercept)      Height
 -6.1883945   0.2557471

11. Simple Regression

Definition
The predicted value of y at x_j is ŷ_j := b₀ + b₁x_j. The predicted value, ŷ, is an unbiased estimator of the mean response, µ_y.

Example
Using the R dataset “trees”, one wants the predicted girth of three trees, of heights 74, 83 and 91 respectively. One uses the regression model “girth~height” for our predictions. The work below is done in R.

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)))
       1        2        3
12.73689 15.03862 17.08459
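A quick hand-check (not from the slides) that predict() simply evaluates ŷ = b₀ + b₁x at the new predictor value:

> sum(coef(g.lm) * c(1, 74))  # b0*1 + b1*74, the predicted girth at height 74
[1] 12.73689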

12. Simple Regression

“Never make forecasts, especially about the future.” – Samuel Goldwyn

The regression line only has predictive value for y at x if:
1. ρ ≉ 0 (if there is no significant linear correlation, don't use the regression line for predictions). If ρ ≈ 0, then ȳ is the best predictor of y at x.
2. One only predicts y for x's within the range of the x_j's; one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate.

|r| (or r²) is a measure of how well the regression equation fits the data: bigger |r| ⇒ data fits the regression line better ⇒ better prediction. A quick check of condition 1 in R is sketched below.
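A hedged sketch (not from the slides) of how condition 1 might be checked with a correlation test, again on the “trees” data:

> cor.test(trees$Height, trees$Girth)  # tests H0: rho = 0; a small p-value supports using the line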

13. Simple Regression

Definition
The variance of the observed y_j's about the predicted ŷ_j's is
$s^2 \stackrel{\text{def}}{=} \frac{\sum (y_j - \hat{y}_j)^2}{n-2} = \frac{\sum y_j^2 - b_0 \sum y_j - b_1 \sum x_j y_j}{n-2},$
which is an unbiased estimator of σ². The standard error of estimate (also called the residual standard error) is s, an estimator of σ.

Note: (b₀, b₁, s) is an estimator of the parameters of the simple linear regression model, (β₀, β₁, σ). Furthermore, b₀, b₁ and s² are unbiased estimators of β₀, β₁ and σ².
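The definition can be checked against R's built-in value; a sketch (not from the slides) on the “trees” data, where the “residual standard error” reported by summary() is exactly this s:

> g.lm <- lm(Girth~Height, data=trees)
> s <- sqrt(sum(residuals(g.lm)^2) / (nrow(trees) - 2))  # s^2 = SSE/(n-2)
> s                                # same value as summary(g.lm)$sigma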

14. Simple Regression: Outliers and Influential Points

Outlier: an observation that lies outside the overall pattern.
“Influential individual”: an observation that markedly changes the regression if removed. This is often an isolated point.

[Figure: scatterplot of a children's growth dataset. Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically, with a large residual). Child 18 is isolated from the rest of the points, and might be an influential point.]

15. Simple Regression

[Figure: three fitted lines compared: all data; without Child 18 (influential); without Child 19 (outlier).]

Child 18 changes the regression line substantially when it is removed. So, Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little upon its removal).
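The children's growth data behind these figures is not available here, but the refit-without-the-point idea is easy to reproduce; a sketch (not from the slides) on the “trees” data, with observation 31 chosen arbitrarily for removal:

> full  <- coef(lm(Girth ~ Height, data = trees))
> minus <- coef(lm(Girth ~ Height, data = trees[-31, ]))  # refit without observation 31
> rbind(full, minus)   # a large coefficient change would flag an influential point
> cooks.distance(lm(Girth ~ Height, data = trees))  # R's built-in influence measure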

16. Simple Regression

Definition
Given a data point (x_j, y_j), the residual of that point is y_j − ŷ_j.

Note:
1. Outliers are data points with large residuals.
2. The residuals should be approximately N(0, σ).

17. Simple Regression: R command for finding residuals

Example
> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842
         8          9         10         11         12         13         14
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428
        15         16         17         18         19         20         21
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187
        22         23         24         25         26         27         28
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244
        29         30         31
 3.7286244  3.7286244  4.5383945
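To check Note 2 of the previous slide, that the residuals are approximately N(0, σ), a standard sketch (not from the slides):

> qqnorm(residuals(g.lm)); qqline(residuals(g.lm))  # points near the line support normality
> shapiro.test(residuals(g.lm))                     # formal test of residual normality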
