stat 113 final exam practice problems


STAT 113: FINAL EXAM PRACTICE PROBLEMS SOLUTIONS

Date: December 15, 2015.

Research Design / Describing Samples.

(1) The following measures can be used to describe distributions (either population or sample distributions). For each one, describe conceptually (without mathematical notation, and without simply describing how to calculate it) and as concisely as possible what information it captures.

(a) The mean
The mean is the “balance point”, or “center of mass”, of the distribution.

(b) The median
The median is the value for which half of the cases are below and half are above.

(c) The range
The range is the difference between the largest and smallest values in the distribution.

(d) The interquartile range (IQR)
The interquartile range is the range of the “middle half” of the data; that is, the difference between the 75th and 25th percentiles.

(e) The variance
The variance is the “mean squared deviation” of the data: it is an average of all of the squared deviation scores from the mean.

(f) The standard deviation
The standard deviation is a measure of the “typical distance” from the mean. Like the variance, it is based on squared deviations, but unlike the variance it is in the same units as the data.
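As a complement to the conceptual descriptions in problem (1), the six measures can be computed on a small data set. The following is a minimal sketch using Python's standard `statistics` module. The data values and the function name `describe` are made up for illustration. The variance here divides by n, to match the “mean squared deviation” wording above, whereas the usual sample variance divides by n - 1. Quartile conventions also differ between textbooks and software, so the IQR may not match a hand calculation exactly.

```python
import statistics


def describe(values):
    """Return the six summary measures from problem (1) for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    # Assumption: the module's default ("exclusive") quartile convention.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),          # balance point
        "median": statistics.median(values),       # half below, half above
        "range": max(values) - min(values),        # largest minus smallest
        "iqr": q3 - q1,                            # spread of the middle half
        "variance": statistics.pvariance(values),  # mean squared deviation (divides by n)
        "sd": statistics.pstdev(values),           # square root of the variance above
    }


sample = [2, 4, 4, 5, 7, 8, 12]  # hypothetical data, not from the exam
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

Replacing `pvariance` and `pstdev` with `variance` and `stdev` would give the n - 1 versions instead.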

(2) Describe what it means for a measure to be robust/resistant (two terms for the same thing). For each of the measures above, indicate whether it is or is not relatively robust/resistant. What considerations go into choosing whether or not to use a robust/resistant measure?
A robust/resistant measure is one that is not strongly influenced by a small handful of extreme values or outliers. The mean and variance/standard deviation are not robust/resistant, because they are heavily influenced by extreme values/outliers (especially the variance and standard deviation, which square the deviations). The range is the least robust of all, because it is determined entirely by the two most extreme observations. The median and interquartile range are considered robust, because they depend only on the values near the middle of the distribution.

(3) (Modified/abridged from A.3) In a study investigating how students use their laptop computers in class, researchers recruited 45 students at one university in the Northeast who regularly take their laptops to class. On average, the students cycled through 65 active windows per lecture, with one student averaging 174 active windows per lecture. They found that, on average, 62% of the windows students open in class are completely unrelated to the class, and students had distracting windows open and active 42% of the time, on average. The study included a measure of how each student performed on a test of the relevant material. Not surprisingly, the study finds that the students who spent more time on distracting websites generally had lower test scores.

(a) Identify the cases and sample size for this study.

(b) Is this an experiment or an observational study?

(c) From the description given, what variables are recorded for each case? Identify each as categorical or quantitative.

(d) What graph is most appropriate to display the data about number of active windows open per lecture if we want to quickly determine whether the maximum value (174) is an outlier?

(e) The last sentence of the paragraph describes an association. Identify a graph and a statistic that could be used to display and quantify this association, respectively.

(f) From the information given, can we conclude that students who allocate their cognitive resources to distracting sites during class get lower grades because of it? Why or why not?

(4) (Modified from A.27) The number of consecutive frost-free days in a year is called the growing season. A farmer considering moving to a new region finds that the median growing season for the area for the last 50 years is 275 days while the mean growing season is 240 days.

(a) Explain how it is possible for the mean to be so much lower than the median, and describe the distribution of the growing season lengths in this area for the last 50 years.

(b) Sketch either a possible histogram or a possible density curve for the shape of this distribution. Label the mean and median on the horizontal axis.

Inference Foundations.

Study Exam 2 and the practice problems for Exam 2.

Inference for Correlation and Regression.

(1) (Modified from D.46) Is depression a possible factor in students missing classes? A study analyzed relationships among various variables pertaining to a population of college students. Two of those variables are DepressionScore, scores on a standard depression scale with higher numbers indicating greater depression, and ClassesMissed, the number of classes missed during the semester. Computer output is shown below for a linear regression model used to predict the number of classes missed based on the depression score.

Coefficients:
                 Estimate  Std. Error  t-value   P-value
(Intercept)       1.77712     0.26714    6.652  1.79e-10
DepressionScore   0.08312     0.03368    2.468    0.0142

Residual standard error: 3.208 on 251 degrees of freedom
Multiple R-squared: 0.0237

(a) Interpret the slope of the regression line in the context of depression and missed classes.
The slope of 0.083 means that for each one-unit increase in the depression score, we would predict about 0.083 additional classes missed, on average. In other words, for each increase of 12 or so points in the depression score, we expect one additional missed class.

(b) Based on the output above, what can we conclude about the relationship between these variables in the population?
The output shows a P-value for the slope (0.0142) that is significant at the 0.05 level. This P-value corresponds to a test of the null hypothesis that the population regression line has a slope of zero. So we can conclude that there is evidence that the number of missed classes tends to increase as depression increases.

(c) Interpret R^2 in the context of depression and missed classes. (What does it tell us about the relationship?)
The R^2 value of 0.0237 indicates that about 2% of the total variation in the number of missed classes is predictable by knowing a person’s depression score. Put another way, the amount of uncertainty we have in predicting the number of missed classes (as measured by variance) goes down by about 2% if we have a depression score and can use this regression model.

(2) (Modified from D.50 and D.51) We can use data from a sample of NBA basketball games to construct a regression model to predict points in a season for a player based on the number of free throws made. For our sample data, the number of free throws made in a season ranges from 16 to 594, while the number of points ranges from 104 to 2161. For the information in (a) and (b), interpret the confidence and prediction intervals given in the context of free throws and points scored per season. Make a specific statement about what the value of 95% means in each case.

(a) The predicted number of points made for a player who makes 100 free throws in a season is 710.8 points, with a 95% confidence interval of 675.7 to 745.8 points. The prediction interval at the same free throw number is 340.7 to 1080.8 points.

The 95% confidence interval indicates that we are 95% sure that the population regression line passes between 675.7 and 745.8 points at the free throw value of 100. In other words, we are 95% confident that the subpopulation of players who make 100 free throws has a mean points scored between those two values. The 95% prediction interval indicates that we are 95% confident that a future individual player who makes 100 free throws in a season will score between 340.7 and 1080.8 points that season. (Technically, our success rate averaged over all possible samples we could have gotten is 95%, but the previous sentence is close enough for present purposes.)

(b) The predicted number of points made for a player who makes 400 free throws in a season is 1613.6 points, with a 95% confidence interval of 1559.3 to 1667.9 points. The prediction interval at the same free throw number is 1241.2 to 1986.0 points.

(c) Use the information above to find the slope of the regression line.
We are given two (x, y) points on the line, (100, 710.8) and (400, 1613.6), so we can solve for the slope by taking (1613.6 - 710.8) / (400 - 100) = 902.8 / 300, or about 3.01 points per free throw.

(d) How do you expect the width of the confidence interval for a player who makes 20 free throws in a season to compare to the intervals given in (a) and (b)? Why?
A free throw value of 20 is much farther out toward the extreme of the range of values in the sample than either of the cases above, so we would expect a much wider confidence interval, due to the higher variability across different possible sample regression lines at the extremes.

Goodness of Fit and Association Tests for Categorical Variables.

(1) An Ipsos/Reuters poll conducted between Dec. 5th and 9th of this year asked a random sample of 494 adult Americans identifying as members of the Republican party who their preferred presidential candidate was. Donald Trump was the choice of 183 respondents, Ben Carson was chosen by 64, Marco Rubio by 59 and Ted Cruz by 54. A total of 104 respondents identified one of the other candidates, and 30 were undecided.

(a) Set aside the undecided respondents and those who identified a candidate outside the top four. Can we conclude that the proportion of the population from which the respondents were selected who prefer Trump is higher than the combined proportion who prefer one of Carson, Rubio and Cruz? Use a chi-square test and show all details.
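One way to set up the calculation for part (a) is sketched below. It is an illustration under stated assumptions, not the official solution. It assumes a null hypothesis that, among respondents choosing one of the top four candidates, Trump and the other three combined are each preferred by half. It uses only the Python standard library, and the function and variable names are hypothetical. With two categories the test has 1 degree of freedom, and the chi-square tail probability for that case can be written with the complementary error function.

```python
import math


def chi_square_gof(observed, null_props):
    """Chi-square goodness-of-fit statistic and expected counts."""
    total = sum(observed)
    expected = [total * p for p in null_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected


trump = 183
others = 64 + 59 + 54  # Carson + Rubio + Cruz

# Assumed null hypothesis: the two groups are equally preferred (0.5 each).
chi2, expected = chi_square_gof([trump, others], [0.5, 0.5])

# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"expected counts: {expected}")
print(f"chi-square statistic: {chi2:.3f}")
print(f"P-value (non-directional): {p_value:.3f}")
```

Under these assumptions the expected counts are 180 in each group, the statistic is 0.1, and the P-value is roughly 0.75. The chi-square test is non-directional. Because the question is one-sided and the observed difference is in the hypothesized direction, the one-sided P-value would be half of this. Either way it is far above 0.05, so this setup would not support concluding that Trump's share exceeds the combined share of the other three.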
