Inference for Distributions
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Based on Rare Event Rule: “rare events happen – but not to me”.
Table of Contents
1. t Distribution
2. CI for µ: σ unknown
3. t-test: Mean (σ unknown)
4. t test: Matched Pairs
5. Two Sample z-Test: Means (σ_X and σ_Y known)
6. Two Sample t-test: Means (σ_X and σ_Y unknown)
7. Pooled Two Sample t-test: Means (σ_X = σ_Y unknown)
8. Two Sample F-test: Variance
9. Chapter #8 R Assignment
t Distribution
If $X_1, \dots, X_n$ is a normal random sample, then
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \;\Rightarrow\; \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
If the random sample is not normal but $n \ge 30$, the above is also true (approximately) by the CLT. However, one typically does not know what $\sigma$ equals, so one is tempted to use $s$ instead of $\sigma$. This gives

Definition (Student t Distribution)
If a random sample is normal or the sample size is $\ge 30$, then
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1),$$
where $t(n-1)$ is the Student t distribution with $n - 1$ degrees of freedom.
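The effect of replacing σ by s can be checked by simulation. The sketch below is only an illustration: the sample size, population mean, standard deviation, seed, and number of replications are arbitrary assumptions, not values from the slides. It draws many small normal samples, computes the statistic from the definition above, and compares how often it falls outside the t(n − 1) and the N(0, 1) two-sided 5% cutoffs.

```r
# Simulation sketch (all numeric settings below are arbitrary assumptions).
set.seed(1)
n    <- 5       # small sample size, so t and normal differ visibly
reps <- 20000   # number of simulated samples

t_stats <- replicate(reps, {
  x <- rnorm(n, mean = 10, sd = 2)        # normal random sample
  (mean(x) - 10) / (sd(x) / sqrt(n))      # (x-bar - mu) / (s / sqrt(n))
})

# Share of statistics beyond the two-sided 5% cutoffs of each distribution
p_t <- mean(abs(t_stats) > qt(0.975, df = n - 1))  # should be near 0.05
p_z <- mean(abs(t_stats) > qnorm(0.975))           # noticeably above 0.05
c(p_t = p_t, p_z = p_z)
```

With cutoffs from t(n − 1) the rejection share stays close to the nominal 5%, while the normal cutoffs are too narrow for a small n.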
The t Distributions
When comparing the density curves of the standard Normal distribution and t distributions, several facts are apparent:
- The density curves of the t distributions are similar in shape to the standard Normal curve.
- The spread of the t distributions is a bit greater than that of the standard Normal distribution.
- The t distributions have more probability in the tails and less in the center than does the standard Normal.
- As the degrees of freedom increase, the t density curve approaches the standard Normal curve ever more closely.
We can use Table D in the back of the book to determine critical values t* for t distributions with different degrees of freedom.
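Critical values can also be computed in R instead of being read from a table. As a small sketch (the particular degrees of freedom and the 95% level are chosen here only for illustration), the following compares t* with the corresponding standard Normal critical value z*:

```r
# Two-sided 95% critical values: t* for several degrees of freedom vs. z*.
df     <- c(2, 5, 15, 30, 100, 1000)   # illustrative choices
t_star <- qt(0.975, df = df)           # upper 2.5% point of t(df)
z_star <- qnorm(0.975)                 # upper 2.5% point of N(0, 1)

data.frame(df = df, t_star = round(t_star, 4), z_star = round(z_star, 4))
```

Every t* exceeds z*, reflecting the heavier tails, and t* decreases toward z* as the degrees of freedom grow.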
Robustness
The t procedures are exactly correct when the population is exactly Normal. This is rare.
The t procedures are robust to small deviations from Normality, but:
- The sample must be a random sample from the population.
- Outliers and skewness strongly influence the mean and therefore the t procedures. Their impact diminishes as the sample size gets larger because of the Central Limit Theorem.
As a guideline:
- When n < 15, the data must be close to Normal and without outliers.
- When 15 ≤ n ≤ 40, mild skewness is acceptable, but not outliers.
- When n > 40, the t statistic will be valid even with strong skewness.
CI for µ: σ unknown
Confidence intervals for µ when σ is unknown.
Confidence intervals
A confidence interval is a range of values that contains the true population parameter with probability (confidence level) C.
We have a set of data from a population with both µ and σ unknown. We use $\bar{x}$ to estimate µ and s to estimate σ, using a t distribution with n − 1 degrees of freedom.
C is the area under the t density curve between −t* and t*. We find t* in Table D, in the line for df = n − 1.
The margin of error m is:
$$m = t^{*} \frac{s}{\sqrt{n}}$$
Definition (standard error)
The standard error of the sample mean for σ unknown is
$$SE_{\bar{x}} \stackrel{\text{def}}{=} \frac{s}{\sqrt{n}}.$$

Theorem (CI for µ, σ unknown)
Assume $n \ge 30$ or the population is normal. Let $t^{\star}(n-1)$ be the critical value of the $t(n-1)$ distribution for the chosen confidence level, and let
$$m \stackrel{\text{def}}{=} \text{margin of error} = t^{\star}(n-1) \cdot \frac{s}{\sqrt{n}} = t^{\star}(n-1) \cdot SE_{\bar{x}}.$$
Then the confidence interval is $\bar{x} \pm m$.
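The theorem translates directly into a few lines of R. The helper name `t_ci` and the data vector below are hypothetical, introduced only for illustration; the function is a minimal sketch of the formula $\bar{x} \pm t^{\star}(n-1) \cdot SE_{\bar{x}}$.

```r
# Sketch of the CI formula; t_ci is a hypothetical helper, not a base R function.
t_ci <- function(x, conf = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                     # standard error of the mean
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)  # critical value t*(n - 1)
  m      <- t_star * se                         # margin of error
  c(lower = mean(x) - m, upper = mean(x) + m)
}

# Made-up data, for illustration only
x  <- c(4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7)
ci <- t_ci(x, conf = 0.90)
ci
```

The result should agree with the interval reported by `t.test(x, conf.level = 0.90)`, which is the usual way to obtain it in practice.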
Example
The meters of total rainfall for Jupa, Beliana in the first decade of each of the last sixteen centuries are given below:
3.790155 3.628361 3.989105 5.124677 4.227491 3.183561 5.286963 3.323666
4.116425 2.771781 6.243354 5.040272 6.821760 6.170435 5.439190 5.206938
Find a 90% confidence interval for the mean rainfall per first decade of each century.
Solution: The normal quantile plot indicates that the data come from a normal (or at least almost normal) distribution.
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Since σ is unknown and the distribution is close to normal with no outliers, R gives us
> mean(mdat)-qt(0.95,15)*sd(mdat)/sqrt(16)
[1] 4.124238
> mean(mdat)+qt(0.95,15)*sd(mdat)/sqrt(16)
[1] 5.171278
Thus a 90% confidence interval is (4.124238, 5.171278). Notice that t.test reports the same interval:
> t.test(mdat,mu=4.5,conf.level=0.90)
        One Sample t-test
data:  mdat
t = 0.4948, df = 15, p-value = 0.6279
alternative hypothesis: true mean is not equal to 4.5
90 percent confidence interval:
 4.124238 5.171278
sample estimates:
mean of x
 4.647758
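The transcript above uses a vector `mdat` without showing how it was created. One plausible way to reproduce it, assuming `mdat` simply holds the sixteen rainfall values in the order listed, is:

```r
# Assumption: mdat contains the sixteen rainfall values from the example, in order.
mdat <- c(3.790155, 3.628361, 3.989105, 5.124677,
          4.227491, 3.183561, 5.286963, 3.323666,
          4.116425, 2.771781, 6.243354, 5.040272,
          6.821760, 6.170435, 5.439190, 5.206938)

# Extract only the 90% confidence interval from the t.test result
ci <- t.test(mdat, conf.level = 0.90)$conf.int
ci
```

Omitting `mu` leaves the interval unchanged, because `mu` only affects the test statistic and p-value, not the confidence interval.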
t-test: Mean (σ unknown)
Theorem (t-test for the Mean (σ unknown))
Given a random sample, $X_1, \dots, X_n$, where either the random sample was sampled from a normal population or the sample size $n \ge 30$, let the test statistic be
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}.$$
Then $T \sim t(n-1)$ under $H_0: \mu = \mu_0$. Writing $t$ for the observed value of $T$, the p-value of a test of $H_0$
1. versus $H_1: \mu_X > \mu_0$ is $P(T \ge t)$.
2. versus $H_2: \mu_X < \mu_0$ is $P(T \le t)$.
3. versus $H_3: \mu_X \ne \mu_0$ is $2P(T \ge |t|)$.
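The three p-value formulas can be written out with `pt`. The function name `t_pvalue` and the data vector are hypothetical, chosen only to illustrate the theorem; this is a sketch, and in practice `t.test` with its `alternative` argument does the same job.

```r
# Sketch of the theorem's p-values; t_pvalue is a hypothetical helper.
t_pvalue <- function(x, mu0, alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  n     <- length(x)
  t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # observed test statistic
  switch(alternative,
    greater   = pt(t_obs, df = n - 1, lower.tail = FALSE),          # P(T >= t)
    less      = pt(t_obs, df = n - 1),                              # P(T <= t)
    two.sided = 2 * pt(abs(t_obs), df = n - 1, lower.tail = FALSE)  # 2 P(T >= |t|)
  )
}

# Made-up data, for illustration only
x <- c(4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7)
p_greater <- t_pvalue(x, mu0 = 4.5, alternative = "greater")
p_less    <- t_pvalue(x, mu0 = 4.5, alternative = "less")
p_two     <- t_pvalue(x, mu0 = 4.5, alternative = "two.sided")
c(greater = p_greater, less = p_less, two.sided = p_two)
```

The one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of the two.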
Example
The number of hours of Sesame Street that American 4 year olds watch each year is assumed to be normally distributed. Twenty-five 4 year olds are randomly sampled and one finds $\bar{x} = 125$ and $s_X^2 = 100$. What is the p-value for a test of $H_0: \mu_X = 120$ versus $H_A: \mu_X > 120$?
Solution: The population is normally distributed, so the test statistic
$$t = \frac{125 - 120}{10/\sqrt{25}} = 2.5$$
comes from $t(24)$ under $H_0$. Thus the p-value is
> 1-pt(2.5,24)
[1] 0.009827088
It seems unlikely that the average number of Sesame Street watching hours is 120 or less.