Regression – Marc H. Mehlman (marcmehlman@yahoo.com), University of New Haven – PowerPoint presentation


Slide 1

Regression

Marc H. Mehlman marcmehlman@yahoo.com

University of New Haven

“... the statistician knows ... that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” – George Box

Marc Mehlman (University of New Haven) Regression 1 / 41

Slide 2

Table of Contents

1. Simple Regression
2. Confidence Intervals and Significance Tests
3. Variation
4. Chapter #10 R Assignment

Slide 3

Simple Regression

Slide 4

Let X = the predictor or independent variable and Y = the response or dependent variable. Given a bivariate random variable, (X, Y), is there a linear (straight-line) association between X and Y (plus some randomness)? And if so, what is it, and how much randomness?

Definition (Statistical Model of Simple Linear Regression): Given a predictor, x, the response, y, is

y = β0 + β1x + εx,

where β0 + β1x is the mean response for x. The noise terms, the εx's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β0, β1 and σ.
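The model can be made concrete with a small simulation. Below is an added Python sketch (the slides themselves use R): data are generated from y = β0 + β1x + εx with arbitrary illustrative parameter values, and the parameters are then recovered by least squares.

```python
import random
import statistics

random.seed(1)
beta0, beta1, sigma = 3.0, 0.5, 1.0   # arbitrary "true" model parameters

# Simulate n responses from the simple linear regression model.
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Least-squares estimates: b1 = Sxy/Sxx and b0 = ybar - b1*xbar.
xbar, ybar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

print(b0, b1)  # close to beta0 = 3.0 and beta1 = 0.5
```

With n = 500 the estimates land near the true β0 and β1, illustrating why b0 and b1 are useful estimators despite the noise.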

Slide 5

Conditions for Regression Inference

The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line µy = β0 + β1x. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ). For each possible value of the explanatory variable x, the mean of the responses µ(y | x) moves along this line.

Slide 6

Four scatterplots, each with the same fitted line ŷ = 3 + 0.5x:
• Moderate linear association; regression OK.
• Obvious nonlinear relationship; regression inappropriate.
• One extreme outlier, requiring further examination.
• Only two values for x; a redesign is due here.

Slide 7

Given a bivariate random sample from the simple linear regression model, (x1, y1), (x2, y2), · · · , (xn, yn), one wishes to estimate the parameters of the model, (β0, β1, σ). Given an arbitrary line, y = mx + b, define the sum of the squares of errors to be

Σ_{i=1}^{n} [yi − (mxi + b)]².

Using calculus, one can find the least–squares regression line, y = b0 + b1x, that minimizes the sum of squares of errors.

Slide 8

Theorem (Estimating β0 and β1): Given the bivariate random sample, (x1, y1), · · · , (xn, yn), the least–squares regression line, y = b0 + b1x, is obtained by letting

b1 = r (sy / sx)  and  b0 = ȳ − b1 x̄,

where b0 is an unbiased estimator of β0 and b1 is an unbiased estimator of β1.

Note: The point (x̄, ȳ) will lie on the regression line, though there is no reason to believe that (x̄, ȳ) is one of the data points. One can also calculate b1 using

b1 = [ n Σ_{j=1}^{n} xj yj − (Σ_{j=1}^{n} xj)(Σ_{j=1}^{n} yj) ] / [ n Σ_{j=1}^{n} xj² − (Σ_{j=1}^{n} xj)² ].
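As an added check, the two formulas for b1 in the theorem can be verified to agree, and the point (x̄, ȳ) confirmed to lie on the fitted line. A Python sketch on a small made-up dataset (the data values are arbitrary):

```python
import math
import statistics

# Small made-up dataset, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)
r = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

# Form 1: b1 = r * sy / sx, b0 = ybar - b1 * xbar.
b1_a = r * sy / sx
b0 = ybar - b1_a * xbar

# Form 2: the computational formula from the theorem.
b1_b = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / (
    n * sum(a * a for a in x) - sum(x) ** 2)

assert math.isclose(b1_a, b1_b)              # both formulas give the same slope
assert math.isclose(ybar, b0 + b1_a * xbar)  # (xbar, ybar) lies on the line
```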

Slide 9

Example

> plot(trees$Girth~trees$Height,main="girth vs height")
> abline(lm(trees$Girth ~ trees$Height), col="red")

[Scatterplot “girth vs height” of trees$Girth against trees$Height, with the fitted regression line drawn in red.]

Since both variables come from “trees”, in order for the R command “lm” (linear model) to work, “trees” has to be in the R format, “data.frame”.

> class(trees) # "trees" is in data.frame format - lm will work.
[1] "data.frame"
> g.lm=lm(Girth~Height,data=trees)
> coef(g.lm)
(Intercept)      Height
 -6.1883945   0.2557471


Slide 11

Definition: The predicted value of y at xj is ŷj := b0 + b1xj. The predicted value, ŷ, is an unbiased estimator of the mean response, µy.

Example: Using the R dataset “trees”, one wants the predicted girth of three trees, of heights 74, 83 and 91 respectively. One uses the regression model “girth~height” for the predictions. The work below is done in R.

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)))
       1        2        3
12.73689 15.03862 17.08459
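The predicted values above can be reproduced by plain arithmetic from the printed coefficients. An added Python check (not part of the original slides):

```python
# Coefficients printed by R for lm(Girth ~ Height, data = trees).
b0, b1 = -6.1883945, 0.2557471

# Predicted girth at each of the three heights: yhat = b0 + b1 * x.
preds = [b0 + b1 * h for h in (74, 83, 91)]
print(preds)  # approximately 12.73689, 15.03862, 17.08459, as in the R output
```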

Slide 12

“Never make forecasts, especially about the future.” – Samuel Goldwyn

The regression line only has predictive value for y at x if:
1. ρ is not ≈ 0 (if there is no significant linear correlation, don't use the regression line for predictions). If ρ ≈ 0, then ȳ is the best predictor of y at x.
2. One only predicts y for x's within the range of the xj's – one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate.

|r| (or r²) is a measure of how well the regression equation fits the data: bigger |r| ⇒ data fits the regression line better ⇒ better prediction.

Slide 13

Definition: The variance of the observed yj's about the predicted ŷj's is

s² := Σ_{j=1}^{n} (yj − ŷj)² / (n − 2) = [ Σ yj² − b0 Σ yj − b1 Σ xj yj ] / (n − 2),

which is an unbiased estimator of σ². The standard error of estimate (also called the residual standard error) is s, an estimator of σ.

Note: (b0, b1, s) is an estimator of the parameters of the simple linear regression model, (β0, β1, σ). Furthermore, b0, b1 and s² are unbiased estimators of β0, β1 and σ².
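The two expressions for s² in the definition are algebraically equal for a least-squares fit. An added Python sketch verifies this on made-up data:

```python
import math

# Tiny made-up dataset, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.9]
n = len(x)

# Least-squares fit.
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]

# Definitional form of s^2 ...
s2_def = sum((b - f) ** 2 for b, f in zip(y, yhat)) / (n - 2)
# ... and the computational shortcut from the definition.
s2_short = (sum(b * b for b in y) - b0 * sum(y)
            - b1 * sum(a * b for a, b in zip(x, y))) / (n - 2)

assert math.isclose(s2_def, s2_short)
```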

Slide 14

Outlier: an observation that lies outside the overall pattern. “Influential individual”: an observation that markedly changes the regression if removed. This is often an isolated point.

Outliers and influential points (figure): Child 19 is an outlier (large residual); Child 18 is a potential influential individual. Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically). Child 18 is isolated from the rest of the points, and might be an influential point.

Slide 15

Comparing the fits with all data, without Child 18, and without Child 19 (figure): Child 18 changes the regression line substantially when it is removed, so Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little upon its removal).

Slide 16

Definition: Given a data point, (xj, yj), the residual of that point is yj − ŷj.

Note:
1. Outliers are data points with large residuals.
2. The residuals should be approximately N(0, σ).

Slide 17

R command for finding residuals:

Example

> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842
         8          9         10         11         12         13         14
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428
        15         16         17         18         19         20         21
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187
        22         23         24         25         26         27         28
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244
        29         30         31
 3.7286244  3.7286244  4.5383945

Slide 18

Definition: Given bivariate data, (x1, y1), · · · , (xn, yn), the residual plot is a plot of the residuals against the xj's. If (X, Y) is bivariate normal, the residuals satisfy the Homoscedasticity Assumption:

Definition (Homoscedasticity Assumption): The assumption that the variance around the regression line is the same for all values of the predictor variable X. In other words, the pattern of the spread of the residual points around the x–axis does not change as one travels left to right on the x–axis. There should not be discernible patterns in the residual plot.

Slide 19

R command for testing whether the linear model applies (residuals approximately N(0, σ)):

Example

> g.lm=lm(Girth~Height,data=trees)
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(g.lm)
> par(mfrow=c(1,1)) # reset the graphics defaults

[Four diagnostic plots: Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage (with Cook's distance contours); observations 31, 6 and 5 are flagged in the panels.]

Slide 20

Confidence Intervals and Significance Tests

Slide 21

Theorem (Hypothesis Tests and Confidence Intervals for β0 and β1): Let

SEb1 := s / √( Σ_{j=1}^{n} (xj − x̄)² )  and  SEb0 := s √( 1/n + x̄² / Σ_{j=1}^{n} (xj − x̄)² ).

SEb0 and SEb1 are the standard errors of the intercept, b0, and the slope, b1, of the least–squares regression line.

To test the hypothesis H0 : β1 = 0, use the test statistic t = b1 / SEb1 ∼ t(n − 2). A level (1 − α)100% confidence interval for the slope β1 is b1 ± t∗(n − 2) × SEb1.

To test the hypothesis H0 : β0 = b, use the test statistic t = (b0 − b) / SEb0 ∼ t(n − 2). A level (1 − α)100% confidence interval for the intercept β0 is b0 ± t∗(n − 2) × SEb0.

Accepting H0 : β1 = 0 is equivalent to accepting H0 : ρ = 0.
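As an added illustration, the slope test can be carried out by hand from summary output. Using the estimates reported later in these slides for the trees fit (b1 = 0.25575, SEb1 = 0.07816, n − 2 = 29) and the tabled critical value t*(29) ≈ 2.045:

```python
# Values taken from the R summary(g.lm) output shown on a later slide.
b1, se_b1, df = 0.25575, 0.07816, 29

# Test statistic for H0: beta1 = 0.
t = b1 / se_b1
print(round(t, 3))  # 3.272, matching the "t value" column in the R summary

# 95% CI for the slope; t*(29) ~= 2.045 is a standard tabled quantile.
tstar = 2.045
ci = (b1 - tstar * se_b1, b1 + tstar * se_b1)
print(ci)  # roughly (0.096, 0.416); zero is excluded, consistent with rejecting H0
```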

Slide 22

Example

Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants 4 to 10 days old and their later IQ test scores. A snap of a rubber band on the sole of the foot caused the infants to cry. The researchers recorded the crying and measured its intensity by the number of peaks in the most active 20 seconds. They later measured the children's IQ at age three years using the Stanford-Binet IQ test. A scatterplot and Minitab output for the data from a random sample of 38 infants are below.

Do these data provide convincing evidence that there is a positive linear relationship between crying counts and IQ in the population of infants?

Slide 23

Example (cont.)

We want to perform a test of H0 : β1 = 0 versus Ha : β1 > 0, where β1 is the true slope of the population regression line relating crying count to IQ score.
• The scatterplot suggests a moderately weak positive linear relationship between crying peaks and IQ. The residual plot shows a random scatter of points about the residual = 0 line.
• IQ scores of individual infants should be independent.
• The Normal probability plot of the residuals shows a slight curvature, which suggests that the responses may not be Normally distributed about the line at each x-value. With such a large sample size (n = 38), however, the t procedures are robust against departures from Normality.
• The residual plot shows a fairly equal amount of scatter around the horizontal line at 0 for all x-values.

Slide 24

Example (cont.)

With no obvious violations of the conditions, we proceed to inference. The test statistic and P-value can be found in the Minitab output:

t = b1 / SEb1 = 1.4929 / 0.4870 = 3.07

The Minitab output gives P = 0.004 as the P-value for a two-sided test; the P-value for the one-sided test is half of this, P = 0.002. Since the P-value, 0.002, is less than our α = 0.05 significance level, we have enough evidence to reject H0 and conclude that there is a positive linear relationship between intensity of crying and IQ score in the population of infants.

Slide 25

Given x⋆, the mean response is µy = β0 + β1x⋆. However, since β0 and β1 are not known, one uses µ̂y := ŷx⋆ := b0 + b1x⋆ as an estimator of µy.

Theorem ((1 − α)100% Confidence Interval for the mean response, µy): A (1 − α)100% confidence interval for the mean response, µy, when x takes on the value x⋆ is µ̂y ± m, where the margin of error is

m = tα/2(n − 2) · s √( 1/n + (x⋆ − x̄)² / Σ_{j=1}^{n} (xj − x̄)² ) = tα/2(n − 2) · SEµ̂.

The standard error of the mean response is SEµ̂ = s √( 1/n + (x⋆ − x̄)² / Σ_{j=1}^{n} (xj − x̄)² ).

Slide 26

[Figure: a confidence interval for µy. The population regression line is µy = β0 + β1x; at x = x⋆ one predicts µy with µ̂y = ŷ.]

Slide 27

Definition: Let y be a future observation corresponding to x⋆. A (1 − α)100% prediction interval for y is a confidence interval that will contain y (1 − α)100% of the time. A prediction interval is a confidence interval that not only has to contend with the variability of the response variable, but also with the fact that β0 and β1 can only be approximated.

Theorem ((1 − α)100% Prediction Interval for y given x = x⋆): A (1 − α)100% prediction interval for y given x = x⋆ is ŷ ± m, where ŷ = b0 + b1x⋆ and the margin of error is

m = tα/2(n − 2) · s √( 1 + 1/n + (x⋆ − x̄)² / Σ_{j=1}^{n} (xj − x̄)² ) = tα/2(n − 2) · SEŷ.
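The prediction-interval margin differs from the mean-response margin only by the extra 1 under the square root, so it is always wider. A small added Python sketch (all numeric inputs here are arbitrary placeholder values, with the t critical value passed in as a given):

```python
import math

def margin(tstar, s, n, xstar, xbar, sxx, prediction=False):
    """Margin of error for the mean response (CI) or for a future y (PI)."""
    extra = 1.0 if prediction else 0.0  # the PI's additional "+1" term
    return tstar * s * math.sqrt(extra + 1.0 / n + (xstar - xbar) ** 2 / sxx)

# Arbitrary illustrative inputs.
tstar, s, n, xstar, xbar, sxx = 2.045, 2.7, 31, 74.0, 76.0, 1200.0

m_ci = margin(tstar, s, n, xstar, xbar, sxx)
m_pi = margin(tstar, s, n, xstar, xbar, sxx, prediction=True)
assert m_pi > m_ci  # a prediction interval is always wider than the CI for the mean
```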

Slide 28

A confidence interval for y: [figure omitted]

Slide 29

R commands:

Example

> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)),interval="prediction",level=.90)
       fit       lwr      upr
1 12.73689  8.020516 17.45327
2 15.03862 10.238843 19.83839
3 17.08459 11.971691 22.19750
> summary(g.lm)

Call:
lm(formula = Girth ~ Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2386 -1.9205 -0.0714  2.7450  4.5384

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.18839    5.96020  -1.038  0.30772
Height       0.25575    0.07816   3.272  0.00276 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.728 on 29 degrees of freedom
Multiple R-squared:  0.2697,  Adjusted R-squared:  0.2445
F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

Slide 30

Variation

Slide 31

Variation:

yj − ȳ (total deviation) = (ŷj − ȳ) (explained deviation) + (yj − ŷj) (unexplained deviation).

From here, using some math, one gets the following sums of squares (SS):

Σ_{j=1}^{n} (yj − ȳ)² = Σ_{j=1}^{n} (ŷj − ȳ)² + Σ_{j=1}^{n} (yj − ŷj)²,

where SSTOT := Σ (yj − ȳ)² is the total variation, SSA := Σ (ŷj − ȳ)² is the explained variation, and SSE := Σ (yj − ŷj)² is the unexplained variation.
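The sum-of-squares identity can be checked numerically. An added Python sketch on made-up data:

```python
import math

# Small made-up dataset, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.8, 4.3, 5.1, 8.2, 9.6]
n = len(x)

# Least-squares fit.
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]

# The three sums of squares.
sstot = sum((b - ybar) ** 2 for b in y)
ssa = sum((f - ybar) ** 2 for f in yhat)
sse = sum((b - f) ** 2 for b, f in zip(y, yhat))

assert math.isclose(sstot, ssa + sse)  # total = explained + unexplained
```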

Slide 32

Definition: The coefficient of determination is the portion of the variation in y explained by the regression equation:

r² := SSA / SSTOT = Σ_{j=1}^{n} (ŷj − ȳ)² / Σ_{j=1}^{n} (yj − ȳ)².

Properties of the coefficient of determination:
1. r² = (r)² = (correlation coefficient)².
2. r² = the proportion of the variation in Y that is explained by the linear relationship between X and Y.

Example: Using R, since

> (cor(trees$Girth,trees$Height))^2
[1] 0.2696518

one concludes that approximately 27% of the variation in tree Girth is explained by tree Height, and 73% by other factors.
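That r² = SSA/SSTOT equals the squared correlation coefficient can likewise be checked numerically. An added Python sketch on made-up data:

```python
import math

# Made-up bivariate sample, purely for illustration.
x = [65.0, 70.0, 72.0, 75.0, 80.0, 83.0, 87.0]
y = [8.6, 8.3, 10.5, 12.0, 14.2, 10.8, 20.6]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

# Least-squares fit and the explained sum of squares.
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
ssa = sum((f - ybar) ** 2 for f in yhat)
sstot = syy

assert math.isclose(r ** 2, ssa / sstot)  # r^2 = SSA / SSTOT
```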

Slide 33

• r = −0.3, r² = 0.09, or 9%: the regression model explains not even 10% of the variation in y.
• r = −0.7, r² = 0.49, or 49%: the regression model explains nearly half of the variation in y.
• r = −0.99, r² = 0.9801, or ~98%: the regression model explains almost all of the variation in y.

Slide 34

With each of the sums of squares is associated a degrees of freedom, where df of SSTOT = df of SSA + df of SSE. Also associated with SSA and SSE are the mean squares, which equal the sum of squares divided by its degrees of freedom.

Source | SS    | df    | MS
Model  | SSA   | 1     | MSA = SSA / 1
Error  | SSE   | n − 2 | MSE = s² = Σ_{j=1}^{n} (yj − ŷj)² / (n − 2) = SSE / (n − 2)
Total  | SSTOT | n − 1 |

The above is a partial ANOVA table. ANOVA is short for “analysis of variance”.
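As an added check, the mean squares and the F statistic can be computed from the printed sums of squares. Using the trees values that appear in the anova output later in these slides (SSA = 79.665, SSE = 215.772, n = 31):

```python
# Sums of squares from the R anova(g.lm) output for the trees fit.
ssa, sse, n = 79.665, 215.772, 31

msa = ssa / 1        # model mean square, df = 1
mse = sse / (n - 2)  # error mean square = s^2, df = n - 2
f = msa / mse        # ANOVA F statistic

print(round(mse, 3), round(f, 3))  # 7.44 and 10.707, matching the R anova table
```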

Slide 35

Theorem (ANOVA F Test for Simple Regression): In the simple linear regression model, consider H0 : β1 = 0 versus HA : β1 ≠ 0. If H0 holds, f := MSA / MSE is from F(1, n − 2), and one uses a right–sided test.

Remember, H0 : β1 = 0 is equivalent to H0 : ρ = 0. The following is an ANOVA table for simple linear regression:

Source | SS    | df    | MS  | ANOVA F Statistic | p–value
Model  | SSA   | 1     | MSA | f                 | P(F(1, n − 2) ≥ f)
Error  | SSE   | n − 2 | MSE |                   |
Total  | SSTOT | n − 1 |     |                   |

Slide 36

Example (cont.)

> g.lm=lm(Girth~Height,data=trees)
> anova(g.lm)
Analysis of Variance Table

Response: Girth
          Df  Sum Sq Mean Sq F value   Pr(>F)
Height     1  79.665  79.665  10.707 0.002758 **
Residuals 29 215.772   7.440
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Slide 37

Since β1 = 0 ⇔ ρ = 0, the following is equivalent to the ANOVA F test.

Theorem (Test for correlation): Assuming that X and Y are bivariate normal (the conditions for simple linear regression), consider the hypotheses H0 : ρ = 0 vs HA : ρ ≠ 0. The test statistic is

t = r √( (n − 2) / (1 − r²) ) ∼ t(n − 2) under H0.

Remember, accepting H0 : β1 = 0 is equivalent to accepting H0 : ρ = 0. It can be shown that F = t². Also, it makes no difference whether X or Y is the independent or dependent variable – the test is for correlation. An advantage of using the above t test is that one can test one–sided alternative hypotheses.

R command: > cor.test(X,Y) (one can also do one–sided tests with R).
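As an added check, plugging the cor.test values reported on a later slide (r = 0.5192801, n = 31) into the test statistic reproduces the printed t, and t² matches the ANOVA F:

```python
import math

# Values from the trees cor.test output shown on a later slide.
r, n = 0.5192801, 31

t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 4))  # 3.2722, matching cor.test; t^2 is about 10.707, the ANOVA F
```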


Slide 39

Using R

Example (cont.)

> cor.test(trees$Girth,trees$Height)

        Pearson's product-moment correlation

data:  trees$Girth and trees$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2021327 0.7378538
sample estimates:
      cor
0.5192801

Note that one is assuming that the (trees$Height, trees$Girth) are sampled from a bivariate normal distribution.

Slide 40

Example: Each day, for the last 63 days, measurements of the time Joe spends sleeping and the time he spends watching TV are taken. Assume time spent sleeping and time spent watching TV form a bivariate normal random variable. A sample correlation of r = 0.12 is calculated. Find the p–value of H0 : ρ = 0 versus HA : ρ ≠ 0.

Solution:

> tstat=0.12*sqrt((63-2)/(1-0.12^2))
> tstat
[1] 0.9440518
> 2*(1-pt(tstat,61))
[1] 0.3488675

There is little evidence that the time Joe spends sleeping and the time Joe spends watching TV are correlated.
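The test statistic in the R solution can be replicated in Python as an added illustration (the p-value step needs a t-distribution CDF, which the Python standard library lacks, so only the statistic is reproduced):

```python
import math

# Replicating the R computation of the test statistic (r = 0.12, n = 63).
r, n = 0.12, 63

tstat = r * math.sqrt((n - 2) / (1 - r ** 2))
print(tstat)  # approximately 0.9440518, as in the R output
```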


Slide 42

Chapter #10 R Assignment

Slide 43

(From the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition, Duxbury, 1990).) Fifteen alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded along with the snout vent length (in inches; this is the distance from the back of the head to the end of the nose). The purpose of using these data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length: lnLength ~ lnWeight. The authors analyzed the data on the log scale (natural logarithms) and we will follow their approach for consistency.

> lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
+ 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78)
> lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
+ 3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)

Slide 44

Chapter #10 R Assignment

1. Create a scatterplot of “lnLength” ∼ “lnWeight”, complete with the regression line.
2. What is the slope and y–intercept of the regression line?
3. Predict “lnLength” when “lnWeight” is five.
4. Use graphs to decide if “lnLength” ∼ “lnWeight” satisfies the requirements for being a linear model.
5. Find a 95% prediction interval for “lnLength” when “lnWeight” is five.
6. What is the p–value of a test of H0 : β1 = 0 versus HA : β1 ≠ 0?
7. What is the standard error of estimate?
8. What is the coefficient of determination, R²?
9. What are the explained variation, the unexplained variation and the total variation?
10. What is the F statistic of H0 : β1 = 0 versus HA : β1 ≠ 0, and what are its degrees of freedom?
11. Using the correlation test, what is the p–value of a test of H0 : ρ = 0 versus HA : ρ ≠ 0?