sampling distribution of a statistic

  1. ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II. Sampling Distribution of a Statistic. Recall: a statistic is a summary calculated from a sample. Statistics vary from sample to sample. If samples are chosen randomly, the variation of a statistic is also random. That is, under random sampling, a statistic is a random variable. (Review of Basic Concepts: Sampling Distributions and the CLT)

  2. Sampling Distribution. Every random variable has a probability distribution, usually represented by either: a probability density function, such as a normal density; or a probability mass function, such as the binomial or Poisson probability functions. In the special case of a statistic, its probability distribution is also called its sampling distribution.

  3. Fuel Consumption Example. For example, suppose we view the 100 fuel consumption values as a population, and draw a random sample of size 25:

         mean(sample(epagas$MPG, 25))  # 36.944

     If we draw more samples, we get a different sample mean each time:

         mean(sample(epagas$MPG, 25))  # 37.044
         mean(sample(epagas$MPG, 25))  # 37.088

  4. If we draw many samples, we begin to see the sampling distribution:

         # Draw 1000 samples of size 25 and record each sample mean
         sampleMeans <- rep(NA, 1000)
         for (i in seq_along(sampleMeans)) {
           sampleMeans[i] <- mean(sample(epagas$MPG, 25))
         }
         hist(sampleMeans)  # the histogram approximates the sampling distribution

     Note that the sample means are: distributed around the population mean of 37 mpg; not as widely dispersed as the original 100 measurements.

  5. Some Theoretical Results. If $Y_1, Y_2, \ldots, Y_n$ are randomly sampled from some population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of their mean $\bar{Y}$ satisfies: for any $n$, Mean: $E(\bar{Y}) = \mu_{\bar{Y}} = \mu$, Standard error of estimate: $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$; for large $n$, $\bar{Y}$ is approximately normally distributed (Central Limit Theorem).
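These results can be checked by simulation. The sketch below is only an illustration: the `epagas` data are not reproduced here, so `population` is a hypothetical stand-in of 100 simulated values, and the names `population`, `sample_means`, and `theoretical_se` are assumptions of this example. Sampling is done with replacement so that the draws are independent, which is what the formula $\sigma/\sqrt{n}$ assumes.

```r
set.seed(1)

# Hypothetical stand-in for the 100 fuel consumption values (not the real data)
population <- rnorm(100, mean = 37, sd = 2.4)
mu <- mean(population)
# Population standard deviation: divide by N, not N - 1
sigma <- sqrt(mean((population - mu)^2))

n <- 25
# Sample WITH replacement so that the n draws are independent
sample_means <- replicate(10000, mean(sample(population, n, replace = TRUE)))

theoretical_se <- sigma / sqrt(n)
mean(sample_means)  # should be close to mu
sd(sample_means)    # should be close to theoretical_se
```

Drawing without replacement from a population of only 100 values, as `sample()` does by default, gives a somewhat smaller standard error because of the finite population correction factor $\sqrt{(N - n)/(N - 1)}$.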

  6. Inference About a Parameter: Point Estimation. For example, the population mean, $\mu$. A good estimator of $\mu$ should have a sampling distribution that is: centered around $\mu$; with little dispersion. We often make these ideas specific by using the mean and standard error. Consider the sample mean, $\bar{Y}$: centering: $\mu_{\bar{Y}} = \mu$; $\bar{Y}$ is unbiased; dispersion: $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$; $\bar{Y}$ has a small standard error of estimate when $n$ is large. (Review of Basic Concepts: Point Estimate of a Population Mean)

  7. In fact, when the original data are normally distributed, $\bar{Y}$ has the smallest standard error of any unbiased estimator. That is, the sample mean $\bar{Y}$ is a Minimum Variance Unbiased Estimator (MVUE). In other cases, $\bar{Y}$ is usually a good estimator of $\mu$, but not the best. For data with the uniform distribution, the midrange is better. For data with the double exponential (Laplace) distribution, the sample median is better.
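A small simulation can illustrate these comparisons. This is a sketch under assumed settings (sample size 25, 5000 replications, standard uniform and standard Laplace data); the helper `midrange` and all variable names are introduced here for illustration. Laplace values are generated as the difference of two independent exponential values.

```r
set.seed(2)
n <- 25
reps <- 5000

# Midrange: the average of the smallest and largest observations
midrange <- function(y) (min(y) + max(y)) / 2

# Uniform(0, 1) data; each column of the matrix is one sample of size n
unif_samples <- replicate(reps, runif(n))
se_unif_mean <- sd(colMeans(unif_samples))
se_unif_midrange <- sd(apply(unif_samples, 2, midrange))

# Laplace data centered at 0: difference of two independent Exp(1) values
laplace_samples <- replicate(reps, rexp(n) - rexp(n))
se_lap_mean <- sd(colMeans(laplace_samples))
se_lap_median <- sd(apply(laplace_samples, 2, median))

c(mean = se_unif_mean, midrange = se_unif_midrange)  # midrange is smaller
c(mean = se_lap_mean, median = se_lap_median)        # median is smaller
```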

  8. The sample mean $\bar{Y}$ is always the Best Linear Unbiased Estimator (BLUE): For any constants $w_1, w_2, \ldots, w_n$ with $\sum w_i = 1$, if $W$ is the estimator $W = \sum_{i=1}^{n} w_i Y_i$, then $W$ is unbiased: $\mu_W = \sum_{i=1}^{n} w_i \mu = \mu$; but the standard error of estimate is $\sigma_W = \sqrt{\sum_{i=1}^{n} w_i^2 \sigma^2} \ge \sigma/\sqrt{n} = \sigma_{\bar{Y}}$.
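The inequality on this slide is stated without proof. One way to see it, assuming (as the slide implicitly does) that the $Y_i$ are independent with common variance $\sigma^2$, is to complete the square in the weights, using $\sum w_i = 1$:

```latex
\sigma_W^2 = \sum_{i=1}^{n} w_i^2 \sigma^2,
\qquad
\sum_{i=1}^{n} w_i^2
  = \sum_{i=1}^{n} \left( w_i - \frac{1}{n} \right)^2 + \frac{1}{n}
  \ge \frac{1}{n},
\qquad\text{so}\qquad
\sigma_W \ge \frac{\sigma}{\sqrt{n}} = \sigma_{\bar{Y}} .
```

Equality holds only when every $w_i = 1/n$, that is, when $W$ is the sample mean itself.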

  9. Inference About a Parameter: Interval Estimation. Recall that, by the Central Limit Theorem, when $n$ is large, $\bar{Y}$ is approximately normally distributed. That is, $\dfrac{\bar{Y} - \mu}{\sigma_{\bar{Y}}} = \dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ approximately follows the standard normal distribution. (Review of Basic Concepts: Confidence Interval for a Population Mean)

  10. So the chance that $\mu - 1.96\,\sigma/\sqrt{n} \le \bar{Y} \le \mu + 1.96\,\sigma/\sqrt{n}$ is approximately 95%. Equivalently, the chance that $\bar{Y} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{Y} + 1.96\,\sigma/\sqrt{n}$ is approximately 95%. We say that $\bar{Y} \pm 1.96\,\sigma/\sqrt{n}$ is an approximate 95% confidence interval (CI) for $\mu$.
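The "approximately 95%" claim can be checked by simulation, even for data that are not normal. The sketch below assumes exponential data with mean 2 (and therefore standard deviation 2) and a sample size of 50; these settings and the variable names are choices made for this illustration only.

```r
set.seed(3)

# Exponential data with rate 1/2 have mean 2 and standard deviation 2
mu <- 2
sigma <- 2
n <- 50
half_width <- 1.96 * sigma / sqrt(n)

# For each simulated sample, record whether the interval contains mu
covered <- replicate(10000, {
  ybar <- mean(rexp(n, rate = 1 / mu))
  (ybar - half_width <= mu) && (mu <= ybar + half_width)
})

coverage <- mean(covered)
coverage  # should be close to 0.95
```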

  11. To calculate the end-points of this approximate confidence interval, we need to know the additional parameter $\sigma$. Typically $\sigma$ is unknown, so we cannot use the CI. But we can estimate $\sigma$ by the sample standard deviation $s$, and use the alternative confidence interval $\bar{Y} \pm 1.96\,s/\sqrt{n}$. When $n$ is large, the chance that $\bar{Y} - 1.96\,s/\sqrt{n} \le \mu \le \bar{Y} + 1.96\,s/\sqrt{n}$ is still approximately 95%.

  12. What if $n$ is not large? In small samples, we can still construct a confidence interval, but it has the correct coverage probability only if the original data are approximately normally distributed. The key is to replace $\pm 1.96$, the 2.5% and 97.5% points of the normal distribution, with $\pm t_{.025,\,n-1}$, the 2.5% and 97.5% points of Student's $t$-distribution with $(n - 1)$ degrees of freedom: for normally distributed data, the chance that $\bar{Y} - t_{.025,\,n-1}\, s/\sqrt{n} \le \mu \le \bar{Y} + t_{.025,\,n-1}\, s/\sqrt{n}$ is 95%.
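To see why the $t$ percent points matter in small samples, the following sketch compares the coverage of the two intervals for normal data with $n = 5$. The population values (mean 37, standard deviation 2.4) and the variable names are assumptions made for this illustration.

```r
set.seed(4)
mu <- 37
sigma <- 2.4
n <- 5
t_crit <- qt(0.975, df = n - 1)  # about 2.776 for 4 degrees of freedom

# Each column records whether the 1.96-based and t-based intervals cover mu
results <- replicate(10000, {
  y <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(y) / sqrt(n)
  c(z = abs(mean(y) - mu) <= 1.96 * se,
    t = abs(mean(y) - mu) <= t_crit * se)
})

coverage <- rowMeans(results)
coverage  # the "z" interval falls well short of 0.95; the "t" interval is close to 0.95
```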

  13. Tables of the $t$-distribution show that when $n$ is large, the percent points are very close to those of the normal distribution. So it's reasonable to use the $t$-distribution percent points whenever the confidence interval is based on the sample $s$ instead of the population $\sigma$. M&S give formulas for a general $100(1 - \alpha)\%$ confidence interval: $\bar{Y} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}$; here $\alpha = .05$ for a 95% CI; in some situations, $\alpha = .01$ for a 99% CI is preferred; other values are rarely used.
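The general formula translates directly into R. In this sketch, `mean_ci` is a hypothetical helper written for illustration and `y` is simulated data rather than the course data; the result is compared with the interval reported by the built-in `t.test()`.

```r
# Hypothetical helper: 100(1 - alpha)% t-based confidence interval for a mean
mean_ci <- function(y, conf_level = 0.95) {
  n <- length(y)
  alpha <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)
  mean(y) + c(-1, 1) * t_crit * sd(y) / sqrt(n)
}

set.seed(5)
y <- rnorm(25, mean = 37, sd = 2.4)  # simulated sample, for illustration only

ci95 <- mean_ci(y)
ci99 <- mean_ci(y, conf_level = 0.99)
ci95
ci99                 # wider than the 95% interval
t.test(y)$conf.int   # built-in 95% interval, for comparison
```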

  14. Inference About a Parameter: Testing a Hypothesis. A point estimate is the most likely value of the parameter. A confidence interval is a calibrated range of plausible values. Sometimes we just want to know whether a particular value is plausible. We assess its plausibility by testing statistical hypotheses. (Review of Basic Concepts: Testing a Hypothesis About a Mean)

  15. Example: $\mu_0$ is an interesting value of the population mean $\mu$. Null hypothesis, $H_0: \mu = \mu_0$. Alternative hypothesis, $H_a: \mu \ne \mu_0$. Data are a sample of size $n$ with mean $\bar{y}$ and standard deviation $s$. Basic idea: $H_0$ is implausible if $\bar{y}$ is far from $\mu_0$.

  16. To be precise: $t = \dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}}$ measures how far $\bar{y}$ is from $\mu_0$, as a multiple of the standard error of estimate. Basic idea: reject $H_0$ if $t$ is large. To be precise: choose a level of significance $\alpha$; again often $\alpha = .05$. Reject $H_0$ if $t > t_{\alpha/2,\,n-1}$.
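The test can be carried out by hand in a few lines of R. The data `y` and the hypothesized value `mu0 = 36` below are made up for illustration; the built-in `t.test()` is run alongside as a check on the hand calculation.

```r
set.seed(6)
y <- rnorm(25, mean = 37, sd = 2.4)  # simulated sample, for illustration only
mu0 <- 36                            # hypothetical null value
alpha <- 0.05

n <- length(y)
# Distance of the sample mean from mu0, in standard errors
t_stat <- abs(mean(y) - mu0) / (sd(y) / sqrt(n))
t_crit <- qt(1 - alpha / 2, df = n - 1)
reject <- t_stat > t_crit

# Built-in two-sided one-sample t-test for comparison
builtin <- t.test(y, mu = mu0)
c(t_stat = t_stat, t_crit = t_crit)
reject
builtin$p.value < alpha  # agrees with `reject`
```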

  17. We can show that when $H_0$ is true, that is $\mu = \mu_0$, and the data are normally distributed, the chance of (incorrectly) rejecting $H_0$ is $\alpha$. That is, $\alpha$ is the chance of making a Type I error. If a statistician always follows this procedure, true null hypotheses will be rejected only $100\alpha\%$ of the time. So when a null hypothesis is rejected, either it was actually false, or one of these infrequent errors occurred. Note: We never accept $H_0$, we only fail to reject it.
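The Type I error rate can be estimated by simulating many samples for which $H_0$ is true and counting how often the test rejects. The settings below (normal data, $n = 10$, $\mu_0 = 37$, standard deviation 2.4) are assumptions chosen for this sketch.

```r
set.seed(7)
mu0 <- 37
n <- 10
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Generate data with the null hypothesis TRUE and apply the test each time
rejections <- replicate(10000, {
  y <- rnorm(n, mean = mu0, sd = 2.4)
  abs(mean(y) - mu0) / (sd(y) / sqrt(n)) > t_crit
})

type1_rate <- mean(rejections)
type1_rate  # should be close to alpha = 0.05
```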

  18. This is a two-tailed test: we reject $H_0$ if $\bar{y}$ is too far from $\mu_0$ in either direction. In regression analysis, almost all tests are two-tailed. M&S discuss one-tailed tests, and provide an example. Deciding which hypothesis is $H_0$ and which is $H_a$ may not be easy.
