COMM 291 Midterm Review Session By Simon Roberts
Types of Variables • Categorical Variable: Values fall into “bins” or categories • Binary: There are exactly 2 categories (true/false) • Nominal: Categories are simply named (colours, shapes) • Ordinal: Categories have a specific order (Infant, Youth, Teen, Adult) • Quantitative Variable: Values are measured numbers • Identifier Variable: Unique identifiers, such as a Social Insurance Number, Student ID or Amazon Tracking Number
Types of Variables – Activity
Student ID | Age | Tuition | Major | Height | Satisfaction Rating
12345 | 19 | $6800 | Finance | 180 cm | Neutral
12346 | 20 | $5900 | Computer Science | 168 cm | Extremely Likes
Surveys and Sampling • Population: All individuals with a common characteristic you want to generalize about • Sample: A “slice” of the population • Parameter: a fact or characteristic about the population • Statistic: a fact or characteristic about the sample
Biased Samples • Nonresponse/Undercoverage: Members of the population are systematically excluded from the sample • Conducting a telephone survey during the day (excludes commuters) • Voluntary Response Bias: Subjects who feel strongly self-select to participate • Common with hot-button issues (gun control, affirmative action, etc.) • Convenience Bias: Choosing subjects based on whether they’re easy to survey • Standing at a mall and asking the first 50 people who agree to take part
Sampling Designs • Simple Random Sample: Each individual has an equal chance of being selected • Stratified Random Sample: Divide the population into homogeneous groups (strata) and select from each stratum • Divide by age group or political affiliation, then sample within each group • Cluster Random Sample: Divide the population into heterogeneous groups (clusters) and select a few whole clusters • Randomly select a few high schools to represent a district • Systematic Sampling: Select every nth individual
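The four designs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 400 student IDs, the four equal-sized faculties, and the function names are all hypothetical and not part of the review material.

```python
import random

# Hypothetical population: 400 student IDs, 100 per faculty (illustrative only).
faculties = ["commerce", "engineering", "science", "arts"]
population = [(student_id, faculties[student_id % 4]) for student_id in range(400)]

rng = random.Random(291)  # fixed seed so the sketch is reproducible


def simple_random_sample(pop, n):
    """Draw n individuals; every individual has the same chance of selection."""
    return rng.sample(pop, n)


def stratified_sample(pop, n_per_stratum):
    """Sample separately within each faculty (stratum)."""
    chosen = []
    for faculty in faculties:
        stratum = [person for person in pop if person[1] == faculty]
        chosen.extend(rng.sample(stratum, n_per_stratum))
    return chosen


def cluster_sample(pop, n_clusters):
    """Pick whole faculties (clusters) at random and keep everyone in them."""
    picked = set(rng.sample(faculties, n_clusters))
    return [person for person in pop if person[1] in picked]


def systematic_sample(pop, step):
    """Start at a random offset, then take every step-th individual."""
    start = rng.randrange(step)
    return pop[start::step]
```

Calling `stratified_sample(population, 20)` returns 20 students from each faculty, while `cluster_sample(population, 1)` returns every student from one randomly chosen faculty.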
Sampling Designs Activity UBC is interested in improving its food services on campus, so it wishes to sample its students. Identify the method of sampling. 1. There are 4 faculties (commerce, engineering, science, arts). Randomly select 20 students from each. 2. Randomly select one of the faculties and survey all the students in that faculty. 3. Stop every 10th person who enters the Nest on a Thursday. 4. Each student has a student ID. Randomly select 200 participants.
Simpson’s Paradox The direction of association of the population may be the opposite of the direction of association of its relevant subgroups
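A small numeric sketch may make the reversal concrete. The counts below are invented purely for illustration (a hypothetical tutoring program across two courses); they are not data from the course.

```python
# Hypothetical counts as (passed, enrolled); invented purely for illustration.
data = {
    "easy course": {"tutored": (9, 10), "untutored": (80, 100)},
    "hard course": {"tutored": (30, 100), "untutored": (2, 10)},
}


def pass_rate(passed, enrolled):
    return passed / enrolled


# Pass rate within each subgroup (course).
subgroup_rates = {
    course: {arm: pass_rate(*counts) for arm, counts in arms.items()}
    for course, arms in data.items()
}

# Pass rate after pooling both courses together.
overall_rates = {}
for arm in ("tutored", "untutored"):
    passed = sum(data[course][arm][0] for course in data)
    enrolled = sum(data[course][arm][1] for course in data)
    overall_rates[arm] = pass_rate(passed, enrolled)

for course, rates in subgroup_rates.items():
    print(course, rates)
print("overall", overall_rates)
```

In these made-up numbers, tutored students pass more often within each course, yet less often overall, because most tutored students are enrolled in the hard course.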
Describing Categorical Data - Activity
Major | Male | Female | Total
Finance | 200 | 130 | 330
Accounting | 260 | 280 | 540
Marketing | 240 | 300 | 540
Total | 700 | 710 | 1410
1. What percentage of students chose marketing? 2. What percentage of finance students are female? 3. What percentage of male students chose accounting?
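One way to approach the three questions is to decide which total goes in the denominator: the grand total for a marginal percentage, or a row or column total for a conditional percentage. The sketch below uses the counts from the table; the variable names are my own.

```python
# Counts from the activity table (rows are majors, columns are gender).
counts = {
    "Finance": {"Male": 200, "Female": 130},
    "Accounting": {"Male": 260, "Female": 280},
    "Marketing": {"Male": 240, "Female": 300},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# 1. Marginal: marketing students out of all students.
marketing_share = sum(counts["Marketing"].values()) / grand_total

# 2. Conditional on major: females out of finance students only.
female_given_finance = counts["Finance"]["Female"] / sum(counts["Finance"].values())

# 3. Conditional on gender: accounting students out of male students only.
male_total = sum(row["Male"] for row in counts.values())
accounting_given_male = counts["Accounting"]["Male"] / male_total

print(f"{marketing_share:.1%} {female_given_finance:.1%} {accounting_given_male:.1%}")
```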
Displaying Quantitative Data • Typically presented in a histogram, stem/leaf plot or boxplot • Mean = “Center of Mass” • Median = Middle • Mode = Most Frequent Observation • Range = Max – Min • Standard Deviation = $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ • Variance = $\sigma^2$ • IQR = Q3 – Q1
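The summary measures above can be computed with Python's standard `statistics` module. The data set here is hypothetical, and the quartiles use the module's default "exclusive" method; Excel and textbooks may use slightly different quartile conventions, so Q1 and Q3 can differ a little between tools.

```python
from statistics import mean, median, mode, pstdev, pvariance, quantiles

# Hypothetical observations, chosen only to illustrate the calculations.
data = [2, 4, 4, 5, 7, 9, 11]

center_mean = mean(data)
center_median = median(data)
most_frequent = mode(data)
data_range = max(data) - min(data)

# Population formulas (divide by N), matching the sigma formula above.
sigma = pstdev(data)
sigma_squared = pvariance(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(center_mean, center_median, most_frequent, data_range, sigma, iqr)
```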
Histograms vs Stem-and-Leaf • Histogram: Only shows the distribution; excellent for displaying large datasets • Stem-and-Leaf: Shows individual values; impractical for large datasets
Drawing Boxplots 1. Get a five-number summary (Max, Q3, Median, Q1, Min) 2. Calculate IQR = Q3 – Q1 3. Find the inner fences, but do not plot them: upper fence = Q3 + 1.5 × IQR, lower fence = Q1 – 1.5 × IQR 4. Grow whiskers to the most extreme values inside the fences 5. Show outliers (convention uses ○ for outliers just beyond the inner fences and * for far outliers beyond the outer fences, which sit 3 × IQR from the quartiles) 6. Use Excel
Effect of Changing Values Activity Suppose you’ve drawn a boxplot with the following data: Min = 10; Q1 = 20; Med = 35; Q3 = 45; Max = 85 There was an error and the max was actually only 75. How does this affect: • The mean? • The median? • The range? • The IQR?
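The boxplot fence rule can be applied directly to this five-number summary. The sketch below is a minimal illustration; the helper names are hypothetical, and it assumes the maximum is a single observation so that correcting it leaves the quartiles unchanged.

```python
def inner_fences(q1, q3):
    """Return (lower, upper) inner fences at 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Five-number summary from the activity.
minimum, q1, med, q3 = 10, 20, 35, 45
recorded_max, corrected_max = 85, 75

lower_fence, upper_fence = inner_fences(q1, q3)


def is_outlier(value):
    """A value is flagged when it falls outside the inner fences."""
    return value < lower_fence or value > upper_fence


range_before = recorded_max - minimum
range_after = corrected_max - minimum

print(lower_fence, upper_fence)
print(is_outlier(recorded_max), is_outlier(corrected_max))
print(range_before, range_after)
```

Under that assumption the median and IQR depend only on the middle of the data and do not move, the range shrinks by 10, and the mean falls by 10 divided by the number of observations (which the activity does not give).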
Scatterplots, Correlation and Linear Regression • Correlation (r): How strong is the linear clustering around a line? • Only for quantitative data with a linear pattern • Use “Association” for categorical variables, as it’s less descriptive • −1 ≤ r ≤ 1 • Correlation is unitless • The correlation of variables X and Y = The correlation of variables Y and X • Correlation does not necessarily imply causation. • Lurking variables: a third variable that causes both X and Y • Extrapolation: extending results beyond the range of data provided
Lurking Variables and Extrapolation
How to find a regression line • Slope of the estimated regression equation: $b_1 = r \left( \frac{s_y}{s_x} \right)$ • Predicted value from regression equation: $\hat{y} = b_0 + b_1 x$ • R-Squared ($r^2$) is the % of variation in the y value that the model can explain • Always takes on values from 0 to 1 inclusive
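As a rough sketch, the slope formula can be checked on a tiny data set. The five (x, y) points below are hypothetical. The intercept formula $b_0 = \bar{y} - b_1 \bar{x}$ is not stated on the slide; it is the standard least-squares result and is included here as an assumption about what the course uses.

```python
from statistics import mean, stdev

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations

# Correlation: average product of standardized deviations (dividing by n - 1).
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope, as on the slide
b0 = y_bar - b1 * x_bar   # intercept (standard least-squares formula, assumed)
r_squared = r ** 2        # share of variation in y explained by the line


def predict(x_value):
    """Predicted y from the fitted line."""
    return b0 + b1 * x_value


print(b0, b1, r_squared, predict(6))
```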
Residual Plots • Residual = Observed – Predicted at each point • Good fit if there is a symmetric horizontal band around residual = 0 • If it is curved, then the data does not follow a linear trend • If the residuals create a linear trend, there’s either a problem with your algebra or you need to take a root of the observations
Residual Plots - Example • Homoscedastic = Constant variance around the model. This is good. • Heteroscedastic = Non-constant variance around the model. This violates several assumptions about linear inference later on. • Bias = Residuals form a line/curve pattern around residual = 0. Indicator that the linear model is not a good fit OR the data should be transformed
Correlation and Linear Regression Activity • Is there a relationship between an NFL team’s total spending (in millions) on player salary and its league performance? A linear model predicting Wins (out of 16 regular season games) is shown below: $\hat{w} = -16.32 + 0.219s$, where $\hat{w}$ is predicted wins and $s$ is salary spending in millions. A. What is the explanatory variable? What is the response variable? B. What does the slope mean? What does the y-intercept mean? C. Does a team that spends 130 million and wins 13 games over or underperform the model’s prediction? D. The residual SD is 3 games. How practical is this model?
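Part C comes down to computing a residual (observed minus predicted). A minimal sketch using the model from the activity follows; the function and variable names are my own.

```python
INTERCEPT = -16.32   # predicted wins at zero spending (from the activity's model)
SLOPE = 0.219        # additional predicted wins per extra million spent


def predicted_wins(salary_millions):
    """Predicted wins from the linear model given in the activity."""
    return INTERCEPT + SLOPE * salary_millions


spending = 130       # millions
actual_wins = 13

prediction = predicted_wins(spending)
residual = actual_wins - prediction  # positive means the team beat the prediction

print(round(prediction, 2), round(residual, 2))
```

A positive residual means the team won more games than the model predicts. Comparing its size with the residual SD of 3 games gives a sense of how unusual that is.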
Combining Random Variables • E(X) = Expected value of a random variable (think mean) • σ = Standard deviation of a random variable • E(X ± Y) = E(X) ± E(Y). Does not require independence • E(aX) = aE(X) • Var(aX) = a² Var(X) • SD(aX) = |a| SD(X) • Var(X ± Y) = Var(X) + Var(Y). Requires independence! • SD(X ± Y) = $\sqrt{Var(X) + Var(Y)}$. Requires independence!
Combining Random Variables Activity Variable B has an expected value of 9.6 and an SD of 0.8. Variable C has an expected value of 30 and SD of 2.2. Find E(24B + C) and SD(24B + C)
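The rules for combining random variables can be applied step by step. This sketch assumes B and C are independent, which the activity does not state explicitly, and it reads 24B as a single variable multiplied by 24 rather than the sum of 24 separate copies of B.

```python
from math import sqrt

e_b, sd_b = 9.6, 0.8
e_c, sd_c = 30, 2.2
a = 24

# E(aB + C) = a * E(B) + E(C); no independence needed for expected values.
expected_value = a * e_b + e_c

# Var(aB) = a^2 * Var(B); variances add only under the independence assumption.
variance = (a ** 2) * (sd_b ** 2) + sd_c ** 2
standard_deviation = sqrt(variance)

print(round(expected_value, 1), round(standard_deviation, 2))
```

Note that the standard deviations themselves are never added; the variances are added first and the square root is taken at the end.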
Normal Distribution + Empirical Rule • What is the total area under the curve? • Standard deviations are measures of spread • We use Z-Scores to standardize different observations • Does this curve extend beyond 3 SD? • Informally known as a “bell” curve
Finding Probabilities Activity Calculate $Z = \frac{x - \mu}{\sigma}$ or use NORM.DIST(x, mu, sigma, true) Given $\mu = 85$ and $\sigma = 15$, calculate the following: 1. P(X < 90) 2. P(X > 105) 3. P(80 < X < 100)
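For readers working outside Excel, Python's standard library offers `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value, the same quantity as NORM.DIST with the cumulative flag set to true. The sketch below sets up the three probabilities from the activity; the variable names are my own.

```python
from statistics import NormalDist

scores = NormalDist(mu=85, sigma=15)

# 1. Area to the left of 90.
p_below_90 = scores.cdf(90)

# 2. Area to the right of 105 is one minus the area to its left.
p_above_105 = 1 - scores.cdf(105)

# 3. Area between two values is the difference of two left-tail areas.
p_between_80_and_100 = scores.cdf(100) - scores.cdf(80)

# The same first answer via a z-score and the standard normal curve.
z_90 = (90 - 85) / 15
p_below_90_via_z = NormalDist().cdf(z_90)

print(p_below_90, p_above_105, p_between_80_and_100)
```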
Finding X Activity Use NORM.INV(p, mu, sigma) to return the value of x such that the area to the left will have the value p Given a test where $\mu = 75$ and $\sigma = 9$, calculate the following: 1. What score will put you in the top 5% of the class? 2. What score will put you in the bottom 30%?
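The inverse lookup can be sketched with `NormalDist.inv_cdf`, which takes a left-tail area p and returns x, as NORM.INV does. The key step is converting "top 5%" into a left-tail area of 0.95. Variable names are my own.

```python
from statistics import NormalDist

test_scores = NormalDist(mu=75, sigma=9)

# Top 5% means 95% of the class scores below the cutoff.
top_5_cutoff = test_scores.inv_cdf(0.95)

# Bottom 30% means 30% of the class scores below the cutoff.
bottom_30_cutoff = test_scores.inv_cdf(0.30)

print(round(top_5_cutoff, 1), round(bottom_30_cutoff, 1))
```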
Central Limit Theorem • The mean of a random sample has a sampling distribution that is approximated by a normal distribution • Larger sample size = better approximation! • Has implications for probabilities for samples of proportions and means
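A short simulation can illustrate the idea. Everything here is an assumption chosen for the demo: the population is exponential (strongly right-skewed, with mean 1 and SD 1), each sample has 40 observations, and 2000 samples are drawn.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(291)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 40      # observations per sample (assumed)
NUM_SAMPLES = 2000    # number of repeated samples (assumed)

# Draw many samples from a skewed population and record each sample's mean.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = mean(sample_means)            # should sit near the population mean, 1
sd_of_means = stdev(sample_means)             # should sit near sigma / sqrt(n)
theoretical_sd = 1.0 / sqrt(SAMPLE_SIZE)

print(mean_of_means, sd_of_means, theoretical_sd)
```

Even though individual observations are far from normal, a histogram of `sample_means` would look roughly bell-shaped, and its spread shrinks as `SAMPLE_SIZE` grows.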
Sampling Distributions for Proportions • Only for binary categorical data • Sample proportion is $\hat{p}$, population proportion is $p$ (with $q = 1 - p$) • $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$ • $Z = \frac{\hat{p} - p}{SD(\hat{p})}$ • Assumptions: 10% condition, success/failure condition, independence, sample size
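A worked sketch of these formulas follows. The numbers (p = 0.40, n = 150, observed proportion 0.46) are hypothetical, and the success/failure check uses the common "at least 10 expected successes and failures" threshold, which is an assumption about the course's convention.

```python
from math import sqrt
from statistics import NormalDist

p = 0.40        # hypothetical population proportion
n = 150         # hypothetical sample size
p_hat = 0.46    # hypothetical observed sample proportion
q = 1 - p

# Success/failure condition (threshold of 10 is an assumed convention).
conditions_met = n * p >= 10 and n * q >= 10

sd_p_hat = sqrt(p * q / n)
z = (p_hat - p) / sd_p_hat

# Probability of seeing a sample proportion at least this large.
p_at_least = 1 - NormalDist().cdf(z)

print(conditions_met, sd_p_hat, z, p_at_least)
```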
Sampling Distributions for Means • Only for quantitative data • Sample mean is $\bar{x}$, population mean is $\mu$ • Standard error: $SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$ • $Z = \frac{\bar{x} - \mu}{SD(\bar{x})}$ • If the population is normal, the sampling distribution of the sample mean is normal • If the population is not normal but the conditions are met, the sampling distribution will be approximately normal by the Central Limit Theorem (same conditions as for proportions)
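The same pattern applies to means. In this sketch the population values (mean 75, SD 9) are borrowed from the earlier test-score activity, while the sample size of 36 and the observed sample mean of 78 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 75, 9   # population mean and SD (from the test-score activity)
n = 36              # hypothetical sample size
x_bar = 78          # hypothetical observed sample mean

standard_error = sigma / sqrt(n)
z = (x_bar - mu) / standard_error

# Probability that a sample of this size has a mean of at least x_bar.
p_at_least = 1 - NormalDist().cdf(z)

print(standard_error, z, p_at_least)
```

A single score of 78 is only a third of an SD above the mean, but a sample mean of 78 from 36 students is two standard errors above it, which is far less likely.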