
Unit 4: Inference for numerical variables Lecture 1: Bootstrap, paired, and two sample



  1. Unit 4: Inference for numerical variables Lecture 1: Bootstrap, paired, and two sample Statistics 101 Thomas Leininger June 4, 2013

  2. Bootstrap & Randomization testing: Rent in Durham - bootstrap interval
     The dot plot below shows the distribution of means of 100 bootstrap samples from the original sample. Estimate the 90% bootstrap confidence interval based on this bootstrap distribution.
     [Figure: dot plot of the 100 bootstrap means, x-axis from 900 to 1400, with the values 1013.9 and 1354.3 marked.]
     Statistics 101 (Thomas Leininger) U4 - L1: Bootstrap, paired, and two sample June 4, 2013 2 / 27
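
The percentile approach the question relies on can be sketched in R as follows. This is a minimal sketch, not the lecture's own code: the `rent` vector is a hypothetical stand-in generated at random (the Durham rent sample is not reproduced in this transcript), and the names `n_boot`, `boot_means`, and `ci_90` are illustrative.

```r
# Hypothetical stand-in for the rent sample (assumption: the real data are not available here).
set.seed(101)
rent <- round(rnorm(10, mean = 1200, sd = 300))

# Draw 100 bootstrap samples (with replacement, same size as the original)
# and record the mean of each one.
n_boot <- 100
boot_means <- replicate(
  n_boot,
  mean(sample(rent, size = length(rent), replace = TRUE))
)

# 90% percentile interval: cut off 5% of the bootstrap means in each tail.
ci_90 <- quantile(boot_means, probs = c(0.05, 0.95))
print(ci_90)
```

With 100 bootstrap means, this amounts to reading off roughly the 5th smallest and 5th largest dots on the plot.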

  3. Bootstrap & Randomization testing: Bootstrap applet
     http://wise.cgu.edu/bootstrap/

  4. Paired data: Paired observations
     200 observations were randomly sampled from the High School and Beyond survey. The same students took a reading and writing test and their scores are shown below. At first glance, does there appear to be a difference between the average reading and writing test scores?
     [Figure: reading and writing scores (read, write), score axis from 0 to 100.]

  5. Paired data: Paired observations
     Question: The same students took a reading and writing test and their scores are shown below. Are the reading and writing scores of each student independent of each other?

              id     read   write
       1      70     57     52
       2      86     44     33
       3      141    63     44
       4      172    47     52
       ...    ...    ...    ...
       200    137    63     65

     (a) Yes
     (b) No
     (c) No

  6. Paired data: Paired observations: Analyzing paired data
     When two sets of observations have this special correspondence (not independent), they are said to be paired.
     To analyze paired data, it is often useful to look at the difference in outcomes of each pair of observations.
     diff = read − write
     It is important that we always subtract using a consistent order.

              id     read   write   diff
       1      70     57     52      5
       2      86     44     33      11
       3      141    63     44      19
       4      172    47     52      -5
       ...    ...    ...    ...     ...
       200    137    63     65      -2

     [Figure: histogram of the differences, x-axis from −20 to 20, frequency axis from 0 to 40.]
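
As a small illustration of the differencing step, the sketch below rebuilds only the five rows shown on the slide and computes diff = read − write in a fixed order. The data frame name `hsb` is an assumption made for this example; the full 200-row data set is not reproduced here.

```r
# Only the five rows visible on the slide (the full sample has 200 students).
hsb <- data.frame(
  id    = c(70, 86, 141, 172, 137),
  read  = c(57, 44, 63, 47, 63),
  write = c(52, 33, 44, 52, 65)
)

# Always subtract in the same order: read minus write.
hsb$diff <- hsb$read - hsb$write
print(hsb)

# Average difference for these five rows only (not the full-sample value).
print(mean(hsb$diff))
```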

  7. Paired data: Paired observations: Parameter and point estimate
     Parameter of interest: Average difference between the reading and writing scores of all high school students, $\mu_{\text{diff}}$.
     Point estimate: Average difference between the reading and writing scores of sampled high school students, $\bar{x}_{\text{diff}}$.

  8. Paired data: Inference for paired data: Setting the hypotheses
     If in fact there was no difference between the scores on the reading and writing exams, what would you expect the average difference to be? 0
     What are the hypotheses for testing if there is a difference between the average reading and writing scores?
     $H_0$: There is no difference between the average reading and writing score. $\mu_{\text{diff}} = 0$
     $H_A$: There is a difference between the average reading and writing score. $\mu_{\text{diff}} \neq 0$

  9. Paired data: Inference for paired data: Nothing new here
     The analysis is no different than what we have done before. We have data from one sample: differences. We are testing to see if the average difference is different than 0.

  10. Paired data: Inference for paired data: Checking assumptions & conditions
      Question: Which of the following is true?
      (a) Since students are sampled randomly, we can assume that the difference between the reading and writing scores of one student in the sample is independent of another.
      (b) The distribution of differences is bimodal, therefore we cannot continue with the hypothesis test.
      (c) In order for differences to be random we should have sampled with replacement.
      (d) Since students are sampled randomly, we can assume that the sampling distribution of the average difference will be nearly normal.

  11. Paired data: Inference for paired data: Application exercise: Calculating the test-statistic and the p-value
      The observed average difference between the two scores is -0.545 points and the standard deviation of the difference is 8.887 points. Which of the below is the closest p-value for evaluating a difference between the average scores on the two exams? (n = 200)
      (a) 20%
      (b) 40%
      (c) 40%
      (d) 5%
      (e) 48%
      (f) 95%
      $$Z = \frac{-0.545 - 0}{8.887 / \sqrt{200}} = \frac{-0.545}{0.628} = -0.87$$
      $$p\text{-value} = 0.1949 \times 2 = 0.3898$$
      [Figure: distribution centered at 0 with −0.545 and 0.545 marked.]
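
The arithmetic on this slide can be checked with a short R sketch. It uses only the summary statistics given above; the variable names are illustrative, and because R carries full precision rather than a rounded Z and a table lookup, its p-value may differ slightly from the value on the slide while still pointing to the same answer of about 40%.

```r
# Summary statistics from the slide.
x_bar_diff <- -0.545   # observed average difference (read - write)
s_diff     <- 8.887    # standard deviation of the differences
n          <- 200      # number of students

# Standard error of the mean difference.
se <- s_diff / sqrt(n)

# Test statistic under H0: mu_diff = 0.
z <- (x_bar_diff - 0) / se

# Two-sided p-value from the normal model.
p_value <- 2 * pnorm(-abs(z))

print(c(se = se, z = z, p_value = p_value))
```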

  12. Paired data: Inference for paired data: Interpretation of p-value
      Question: Which of the following is the correct interpretation of the p-value?
      (a) Probability that the average scores on the reading and writing exams are equal.
      (b) Probability that the average scores on the reading and writing exams are different.
      (c) Probability of obtaining a random sample of 200 students where the average difference between the reading and writing scores is at least 0.545 (in either direction), if in fact the true average difference between the scores is 0.
      (d) Probability of incorrectly rejecting the null hypothesis if in fact the null hypothesis is true.

  13. Paired data: Inference for paired data: HT ↔ CI
      Question: Suppose we were to construct a 95% confidence interval for the average difference between the reading and writing scores. Would you expect this interval to include 0?
      (a) yes
      (b) no
      (c) cannot tell from the information given
      $$-0.545 \pm 1.96 \frac{8.887}{\sqrt{200}} = -0.545 \pm 1.96 \times 0.628 = -0.545 \pm 1.23 = (-1.775, 0.685)$$
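
The interval above can be reproduced with the following sketch, which uses the same summary statistics. The variable names are illustrative, and `qnorm(0.975)` stands in for the rounded critical value 1.96, so the endpoints may differ from the slide in the third decimal place.

```r
# Summary statistics from the slides.
x_bar_diff <- -0.545
s_diff     <- 8.887
n          <- 200

# Standard error and the 95% critical value from the normal model.
se     <- s_diff / sqrt(n)
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error.
ci_95 <- x_bar_diff + c(-1, 1) * z_star * se
print(ci_95)

# The interval contains 0, which agrees with failing to reject H0 at the 5% level.
print(ci_95[1] < 0 && ci_95[2] > 0)
```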

  14. Difference of two means: Confidence intervals for differences of means
      The General Social Survey (GSS), conducted by NORC at the University of Chicago, contains a standard ‘core’ of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. Below is an excerpt from the 2010 data set. The variables are number of hours worked per week and highest educational attainment.

               degree            hrs1
        1      BACHELOR          55
        2      BACHELOR          45
        3      JUNIOR COLLEGE    45
        ...
        1172   HIGH SCHOOL       40

  15. Difference of two means: Confidence intervals for differences of means: Exploratory analysis
      What can you say about the relationship between educational attainment and hours worked per week?
      [Figure: hours worked per week (axis from 0 to 80) by educational attainment: Less than HS, HS, Jr Coll, Bachelor's, Graduate.]

  16. Difference of two means: Confidence intervals for differences of means: Collapsing levels into two
      Say we are only interested in the difference between the number of hours worked per week by college and non-college graduates. Then we combine the levels of education into two:
      hs or lower ← less than high school or high school
      coll or higher ← junior college, bachelor’s, and graduate
      Here is how you can do this in R:

          # create a new empty variable
          gss$edu = NA
          # logical conditions determine the levels of the new variable
          gss$edu[gss$degree == "LESS THAN HIGH SCHOOL" |
                  gss$degree == "HIGH SCHOOL"] = "hs or lower"
          gss$edu[gss$degree == "JUNIOR COLLEGE" |
                  gss$degree == "BACHELOR" |
                  gss$degree == "GRADUATE"] = "coll or higher"
          # make sure new variable is categorical
          gss$edu = as.factor(gss$edu)
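
The same collapsing step can also be written with `%in%`, which some readers find easier to extend when there are many levels. The sketch below is an alternative under stated assumptions, not the lecture's code: the four-row `gss` data frame is built from the excerpt on slide 14 so the example runs on its own, and the helper vectors `lower` and `higher` are illustrative names.

```r
# Four rows taken from the slide 14 excerpt, so the example is self-contained.
gss <- data.frame(
  degree = c("BACHELOR", "BACHELOR", "JUNIOR COLLEGE", "HIGH SCHOOL"),
  hrs1   = c(55, 45, 45, 40),
  stringsAsFactors = FALSE
)

# Levels that belong to each collapsed group.
lower  <- c("LESS THAN HIGH SCHOOL", "HIGH SCHOOL")
higher <- c("JUNIOR COLLEGE", "BACHELOR", "GRADUATE")

# Start with missing values, then fill in each group by membership.
gss$edu <- NA_character_
gss$edu[gss$degree %in% lower]  <- "hs or lower"
gss$edu[gss$degree %in% higher] <- "coll or higher"

# Store the new variable as a categorical variable.
gss$edu <- factor(gss$edu)
print(table(gss$edu))
```

Any `degree` value that matches neither group stays `NA`, which makes unexpected labels easy to spot.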

  17. Difference of two means: Confidence intervals for differences of means: Exploratory analysis - another look

                         x̄       s        n
        coll or higher   41.8    15.14    505
        hs or lower      39.4    15.12    667

      [Figure: histograms of hours worked per week (axis from 0 to 80) for "coll or higher" and "hs or lower".]
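
From the summary table, the point estimate for the difference in mean hours worked follows directly. The sketch below also computes a standard error using the usual formula for two independent samples, sqrt(s1^2/n1 + s2^2/n2); that formula is an assumption here, since the slides in this excerpt stop before showing how the interval is built. Variable names are illustrative.

```r
# Summary statistics from the table on the slide.
x_bar_coll <- 41.8;  s_coll <- 15.14;  n_coll <- 505   # coll or higher
x_bar_hs   <- 39.4;  s_hs   <- 15.12;  n_hs   <- 667   # hs or lower

# Point estimate: difference in sample means (coll or higher minus hs or lower).
diff_means <- x_bar_coll - x_bar_hs

# Assumed standard error for a difference of two independent sample means.
se_diff <- sqrt(s_coll^2 / n_coll + s_hs^2 / n_hs)

print(c(diff_means = diff_means, se_diff = se_diff))
```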
