ACMS 20340 Statistics for Life Sciences Chapter 19: Inference about a Population Proportion
Estimating the population proportion
Recall that we estimated the proportion $p$ of a population having some characteristic with the sample proportion $\hat{p}$.
Some percentage of faculty wore costumes on Halloween. A sample of 42 faculty members showed that 6 wore costumes and 36 did not. The parameter $p$ is the true proportion of faculty that wore costumes. The statistic $\hat{p}$ is the proportion of the sample that wore costumes:
$$\hat{p} = \frac{\text{number who wore costumes}}{\text{total size of the sample}}$$
In this case, $\hat{p} = 6/42 = 0.143$.
Properties of $\hat{p}$
Recall that if the sample size is large enough, then the sampling distribution of $\hat{p}$ is approximately Normal with mean $p$ and s.d. $\sqrt{p(1-p)/n}$.
We can form confidence intervals of the form
$$\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}}.$$
Problem: We do not know $p$.
Confidence intervals for $p$, take 1
First solution: use $\hat{p}$ in place of $p$!
$$\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}} \quad\text{becomes}\quad \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
This works when there are at least 15 successes and 15 failures in the sample.
◮ Why $z^*$ and not $t^*$?
◮ Why is this the “first solution”?
Why not $t^*$?
Previously, when we used the sample mean $\bar{x}$ to estimate the population mean $\mu$, $\sigma$ was involved in the description of the spread of the distribution of $\bar{x}$. But since we also estimated $\sigma$, we were forced to use a $t$ distribution.
Now, we are using the sample proportion $\hat{p}$ to estimate the population proportion $p$. However, since the standard deviation of $\hat{p}$ is $\sqrt{p(1-p)/n}$, this only depends on $p$. So there is really only one parameter, $p$, describing the distribution of $\hat{p}$. Thus, a $t$ distribution isn’t needed.
Why is this only a “first solution”?
This approach works for large samples: at least 15 successes and 15 failures means the sample must contain at least 30 observations.
For smaller samples, however, the estimate $\sqrt{\hat{p}(1-\hat{p})/n}$ is not a very good one. In particular, using $\sqrt{\hat{p}(1-\hat{p})/n}$ gives confidence intervals that are too narrow.
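As a rough illustration of the “take 1” interval and its 15-successes/15-failures condition, here is a minimal Python sketch applied to the faculty costume sample. The function name `large_sample_ci` and its return layout are assumptions made for this example, not part of the course material.

```python
import math


def large_sample_ci(successes, n, z_star=1.96):
    """Large-sample interval: p_hat +/- z* * sqrt(p_hat * (1 - p_hat) / n).

    Returns (low, high, condition_met), where condition_met reports whether
    the sample has at least 15 successes and at least 15 failures.
    """
    failures = n - successes
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated s.d. of p_hat
    margin = z_star * se
    condition_met = successes >= 15 and failures >= 15
    return p_hat - margin, p_hat + margin, condition_met


# Faculty sample from the slides: 6 of 42 wore costumes.
low, high, ok = large_sample_ci(6, 42)
print(f"interval: [{low:.3f}, {high:.3f}]  condition met: {ok}")
```

With only 6 successes the condition is not met, so the interval this sketch returns for the faculty data should not be trusted.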
Confidence intervals for $p$, take 2
A simple modification to calculating $\hat{p}$ which almost always produces better estimates is the so-called “plus four” method.
Before calculating $\hat{p}$, we first add four imaginary observations, two of which are successes and two of which are failures.
$$\hat{p} = \frac{\text{number of successes}}{n} \quad\Rightarrow\quad \tilde{p} = \frac{\text{number of successes} + 2}{n + 4}$$
We then use the resulting statistic $\tilde{p}$ for our estimates.
A few conditions:
◮ We need $n > 10$, and
◮ We should only work with confidence levels of at least 90%.
An example
Find a 95% confidence interval for the proportion $p$ of faculty members that wore costumes for Halloween.
In a sample of size $n = 42$ there were 6 faculty that wore costumes.
$$\tilde{p} = \frac{6 + 2}{42 + 4} = 0.174$$
and
$$SE(\tilde{p}) = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + 4}} = \sqrt{\frac{0.144}{42 + 4}} = 0.056.$$
This yields the interval
$$\tilde{p} \pm z^* \, SE(\tilde{p}) = 0.174 \pm (1.96)(0.056) \approx [0.064, 0.283],$$
where the endpoints are computed from unrounded values.
Note that we used $n + 4$ when calculating the standard error!
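The plus-four calculation can be sketched in a few lines of Python. The helper name `plus_four_ci` and the default `z_star=1.96` (a 95% level) are assumptions for this illustration.

```python
import math


def plus_four_ci(successes, n, z_star=1.96):
    """Plus-four interval: add 2 imaginary successes and 2 imaginary failures.

    Returns (p_tilde, se, (low, high)). Note that n + 4 is used both in
    p_tilde and in the standard error.
    """
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    margin = z_star * se
    return p_tilde, se, (p_tilde - margin, p_tilde + margin)


# Faculty sample from the slides: 6 of 42 wore costumes.
p_tilde, se, (low, high) = plus_four_ci(6, 42)
print(f"p_tilde = {p_tilde:.3f}, SE = {se:.3f}, CI = [{low:.3f}, {high:.3f}]")
```

Because the function carries unrounded values through to the end, its endpoints can differ in the third decimal place from a hand calculation that rounds at each step.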
Estimating the needed sample size
When calculating a confidence interval for $p$ from a large sample, we used the interval
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
(We are not taking the “plus four” estimation into account here.)
Suppose we want a confidence interval with a certain margin of error $m$:
$$\hat{p} \pm m$$
How large a sample do we need? We want
$$m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Estimating the needed sample size
$$m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Solve for $n$:
$$n = \left(\frac{z^*}{m}\right)^2 \hat{p}(1-\hat{p})$$
Problem: $\hat{p}$ is found after taking a sample. Yet, we need to find $n$ before taking the sample.
We need a number to use in place of $\hat{p}$. There are a few options.
◮ Guess some value $p^*$ which we think will be close to $\hat{p}$ (perhaps using some prior knowledge of the population).
◮ Use $p^* = 0.5$ as our guess. (Why 0.5? It maximizes the needed sample size over all possible $p$ values.)
Why does $p^* = 0.5$ maximize the sample size?
Fixing $z^* = 1.96$ and $m = 0.1$, we consider what happens to the sample size as our guess $p^*$ varies between 0 and 1. This is the graph of
$$n = (z^*/m)^2 \, p^*(1 - p^*).$$
[Figure: plot of $(z^*/m)^2 \, p^*(1-p^*)$ against $p^*$ for $p^*$ between 0 and 1.]
The graph is always a parabola with zeros at $p^* = 0$ and $p^* = 1$. The maximum is always at $p^* = 0.5$.
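A quick numerical check of this claim, as a sketch only: evaluate the formula on a grid of guesses and see where it peaks. The function name `sample_size_curve` and the grid spacing of 0.01 are choices made for this illustration.

```python
def sample_size_curve(p_star, z_star=1.96, m=0.1):
    """Value of (z*/m)^2 * p*(1 - p*) for a guessed proportion p*."""
    return (z_star / m) ** 2 * p_star * (1 - p_star)


# Guesses 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]

# The guess on the grid that demands the largest sample
worst_case = max(grid, key=sample_size_curve)

print("p* with largest n:", worst_case)
print("n at that p*:", round(sample_size_curve(worst_case), 2))
print("n at p* = 0 and p* = 1:", sample_size_curve(0.0), sample_size_curve(1.0))
```

The grid search is only a sanity check. The general statement follows from $p^*(1-p^*)$ being a downward-opening parabola that is symmetric about $p^* = 0.5$.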
An Example 1
We want to survey residents in the South Bend area to see how many are aware of the dangers of dihydrogen monoxide.
How large a sample would we need to get a 95% confidence interval with a 2% margin of error?
For a 95% CI we use $z^* = 1.96$. Asking for a 2% margin of error is the same as asking for $m = 0.02$. We have no idea what the true proportion will be, so we take $p^* = 0.5$.
An Example 2
Calculating using the formula from before:
$$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*) = \left(\frac{1.96}{0.02}\right)^2 (0.5)(0.5) = 2401$$
We would need a sample of size at least 2401 to get such a CI.
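The same calculation as a small Python sketch. The name `required_sample_size` and the tolerance of `1e-9` before rounding up are assumptions of this example. The tolerance is there because floating-point arithmetic can land a hair above an exact integer such as 2401, and a plain ceiling would then give 2402.

```python
import math


def required_sample_size(m, p_star=0.5, z_star=1.96):
    """Smallest n with z* * sqrt(p*(1 - p*) / n) <= m, for a guessed p*."""
    n_raw = (z_star / m) ** 2 * p_star * (1 - p_star)
    # Round up to a whole observation; subtract a tiny tolerance so that
    # floating-point noise on an exact integer does not add an extra unit.
    return math.ceil(n_raw - 1e-9)


# 95% confidence, 2% margin of error, conservative guess p* = 0.5
print(required_sample_size(0.02))
```

Passing a guess other than 0.5, for example `required_sample_size(0.02, p_star=0.25)`, gives a smaller required sample, which is why 0.5 is the conservative default.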
Hypotheses about Proportions
Continuing the dihydrogen monoxide study, we think that about 25% of the South Bend population knows about the dangers.
We can formulate this thought as a hypothesis test:
$$H_0: p = 0.25 \qquad H_a: p \neq 0.25$$
In general:
$$H_0: p = p_0 \qquad H_a: p \neq p_0$$
The Test Statistic for a Proportion Hypothesis Test
Remember, when we do a hypothesis test we calculate the test statistic under the assumption that $H_0$ is true.
If $H_0$ is true, then $p = p_0$ and so $\hat{p}$ has mean $p_0$ and standard deviation $\sqrt{p_0(1-p_0)/n}$.
So, from $\hat{p}$ we calculate the test statistic as
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$
where $z$ has (approximately) a standard Normal distribution.
Conditions on proportion hypothesis tests
Some conditions need to be met to do a hypothesis test of $H_0: p = p_0$.
◮ This test requires a sample size $n$ large enough that both $np_0 \geq 10$ and $n(1 - p_0) \geq 10$ (i.e. the expected numbers of successes and failures are both $\geq 10$).
◮ Fortunately, the condition only depends on $p_0$, the proportion we are performing the test against, not on the actual counts which show up in the sample.
◮ The “plus four” technique only applies to confidence intervals. We do not need to use it for the hypothesis test since knowing $p_0$ gives us the standard deviation of $\hat{p}$.
DHMO Example Test, 1
In our survey we find 29 people are aware of DHMO in a sample of size $n = 183$.
Let’s perform a large-sample hypothesis test at the $\alpha = 0.05$ significance level.
$$H_0: p = 0.25 \qquad H_a: p \neq 0.25$$
Do the conditions to use the test apply?
$$np_0 = 183 \times 0.25 = 45.75 \geq 10 \quad\text{and}\quad n(1 - p_0) = 137.25 \geq 10.$$
So, yes they do.
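DHMO Example Test, 2
Finishing the calculation, we have
$$\hat{p} = 29/183 = 0.158$$
and hence
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \approx -2.86$$
(or $-2.87$ if the rounded value $\hat{p} = 0.158$ is used in the numerator).
Using the table for a two-tailed test, we get a $p$-value between 0.002 and 0.005.
Since this is less than $\alpha$, we reject $H_0$ at the level $\alpha = 0.05$.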
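For readers who want to check the test numerically rather than with a table, here is a minimal Python sketch. The function name `one_proportion_z_test` is an assumption of this example. The two-sided $p$-value uses the standard-library identity $2P(Z \geq |z|) = \operatorname{erfc}(|z|/\sqrt{2})$ instead of a table lookup.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Two-sided large-sample z test of H0: p = p0.

    Returns (z, p_value, conditions_met), where conditions_met reports
    whether n*p0 >= 10 and n*(1 - p0) >= 10.
    """
    p_hat = successes / n
    sd0 = math.sqrt(p0 * (1 - p0) / n)  # s.d. of p_hat assuming H0 is true
    z = (p_hat - p0) / sd0
    # Two-sided p-value: 2 * P(Z >= |z|) for a standard Normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    conditions_met = n * p0 >= 10 and n * (1 - p0) >= 10
    return z, p_value, conditions_met


# DHMO survey from the slides: 29 aware out of n = 183, H0: p = 0.25
z, p_value, ok = one_proportion_z_test(29, 183, 0.25)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, conditions met: {ok}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The sketch works from the unrounded $\hat{p} = 29/183$, so its $z$ can differ in the second decimal place from a hand calculation that rounds $\hat{p}$ first.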
DHMO confidence interval, 1 Using our DHMO survey data, n = 183 with 29 successes, find a 95% confidence interval for p . We will use the “plus four” technique. Are the conditions met? ◮ The confidence level is at least 90%. ◮ n = 183 ≥ 10. Yes, the conditions are met.
DHMO confidence interval, 2
Proceeding, we calculate
$$z^* = 1.96, \qquad \tilde{p} = \frac{29 + 2}{183 + 4} = 0.166.$$
Plugging values into the formula
$$\tilde{p} \pm z^* \sqrt{\tilde{p}(1 - \tilde{p})/(n + 4)}$$
yields $0.166 \pm 0.053$, which simplifies to $[0.112, 0.219]$.