STAT 213 Regression Inference II
Colin Reimer Dawson
Oberlin College
18 February 2016
Outline
• Key Ideas: Last Time
  • Influence and Outliers
• Regression Inference
  • Simulation Approaches
  • Partitioning Variability
Reading Quiz

A regression equation was fit to a set of data for which the correlation, r, between X and Y was 0.6. Which of the following must be true?

(a) The slope of the regression line is 0.6.
(b) The regression model explains 60% of the variability in Y.
(c) The regression model explains 36% of the variability in Y.
(d) At least half of the residuals are smaller than 0.6 in absolute value.
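(A quick arithmetic note, not on the original slide: in simple linear regression the coefficient of determination equals the squared correlation, so R² = r² = 0.6² = 0.36, i.e., 36% of the variability in Y.)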
For Tuesday...

• Write and turn in: Ex. 1.10, 1.12, 1.26, 2.14a, 2.34
• Read: Ch. 2.4, 4.6
• Answer:
  1. Exercise 2.5
  2. Exercise 2.6
  3. In a randomization distribution to test whether a regression slope is significantly different from zero, the P-value is the proportion of ________ obtained by ________ that exceed ________.
Transformations and Outliers

Data transformations can be used to
• Address non-linearity
• Stabilize (homogenize) variance
• “Unskew” the residual distribution
• Reduce the influence of outliers
Brain and Body Weight of Terrestrial Mammals

library(mosaic)
BrainBodyWeight <- read.file(
  "http://colinreimerdawson.com/data/BrainBodyWeight.csv")
xyplot(
  brain.weight.grams ~ body.weight.kilograms,
  data = BrainBodyWeight,
  type = c("p", "r"))

[Scatterplot of brain.weight.grams vs. body.weight.kilograms with fitted regression line]
Brain and Body Weight of Terrestrial Mammals

brain.model <- lm(brain.weight.grams ~ body.weight.kilograms,
                  data = BrainBodyWeight)
par(mfrow = c(1, 2))  # to create a 1-by-2 plotting grid
plot(brain.model, which = 1)  # residuals by predicted
plot(brain.model, which = 2)  # quantile-quantile

[Residuals vs Fitted and Normal Q-Q plots for brain.model]
Log Brain and Log Body Weight

xyplot(
  log(brain.weight.grams) ~ log(body.weight.kilograms),
  data = BrainBodyWeight,
  type = c("p", "r"))

[Scatterplot of log(brain.weight.grams) vs. log(body.weight.kilograms) with fitted regression line]
Log Brain and Log Body Weight

log.brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                      data = BrainBodyWeight)
par(mfrow = c(1, 2))
plot(log.brain.model, which = 1)  # residuals by predicted
plot(log.brain.model, which = 2)  # quantile-quantile

[Residuals vs Fitted and Normal Q-Q plots for log.brain.model]
Percent Brain Weight by Body Weight

library(mosaic)
transform(
  BrainBodyWeight,
  # brain weight as a proportion of body weight (grams / grams)
  percent.brain = brain.weight.grams / (body.weight.kilograms * 1000)
) %>%
  xyplot(
    log(percent.brain) ~ log(body.weight.kilograms),
    data = .,
    type = c("p", "r"))

[Scatterplot of log(percent.brain) vs. log(body.weight.kilograms) with fitted regression line]
Percent Brain Weight by Body Weight

[Residuals vs Fitted and Normal Q-Q plots for the percent-brain model]
Unusual Cases

Detecting unusual cases:
• Residual plots
• Standardized/Studentized residuals
• Leverage measurement
(a short R sketch of these follows below)
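A minimal sketch (not from the slides) of computing these diagnostics for the log-log brain-weight model fit above; rstandard(), rstudent(), and hatvalues() are base-R functions.

plot(log.brain.model, which = 1)   # residual plot (residuals vs. fitted)
head(rstandard(log.brain.model))   # standardized residuals
head(rstudent(log.brain.model))    # studentized residuals
head(hatvalues(log.brain.model))   # leverage (hat) values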
Men’s Long Jump

library(Stat2Data)
data(LongJumpOlympics)
xyplot(
  Gold ~ Year,
  data = LongJumpOlympics,
  type = c("p", "r"),
  groups = (Year == 1968)  ## highlight the outlier
)

[Scatterplot of Gold vs. Year with fitted regression line; the 1968 case is highlighted]
Men’s Long Jump: Residuals

long.jump.model <- lm(Gold ~ Year, data = LongJumpOlympics)
par(mfrow = c(1, 2))
plot(long.jump.model, which = 1)  # residuals vs. fitted
plot(long.jump.model, which = 2)  # normal quantile-quantile

[Residuals vs Fitted and Normal Q-Q plots for long.jump.model; one case stands out with a large positive residual]
Influence

Two characteristics contribute to the influence of a data point on the regression line:
1. Distance in Y from the trend (think: the residual for a line fit without that point)
2. Distance of X from X̄ (think: distance from the center on a see-saw)
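A minimal sketch (not from the slides) of gauging influence directly: refit the long-jump regression without the unusual 1968 case and compare slopes. Variable names follow the LongJumpOlympics data used above.

no.1968.model <- lm(Gold ~ Year, data = subset(LongJumpOlympics, Year != 1968))
coef(long.jump.model)["Year"]  # slope using all cases
coef(no.1968.model)["Year"]    # slope with the 1968 case left out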
Standardized and Studentized Residuals

Standardized residuals:

    (y_i − ŷ_i) / σ̂_ε                (1)

“Studentized” residuals:

    (y_i − ŷ_i) / σ̂_ε(i)             (2)

where σ̂_ε(i) is the standard deviation of all the residuals other than residual i.
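A minimal sketch (not from the slides) of formula (2) for a single case of the long-jump model, taking σ̂_ε(i) as the residual standard error from a refit without case i. Note that R’s built-in rstudent() also divides by sqrt(1 − leverage), so its value will differ slightly from this simplified version.

i <- which.max(abs(resid(long.jump.model)))  # the case with the largest residual
loo.model <- lm(Gold ~ Year, data = LongJumpOlympics[-i, ])  # refit leaving out case i
resid(long.jump.model)[i] / summary(loo.model)$sigma  # formula (2): e_i / sigma.hat_(i)
rstudent(long.jump.model)[i]  # R's version (also adjusts for leverage)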