Unit 6: Introduction to linear regression


  1. Unit 6: Introduction to linear regression
     1. Introduction to regression
     Sta 101 - Fall 2017
     Duke University, Department of Statistical Science
     Dr. Mukherjee
     Slides posted at http://www2.stat.duke.edu/courses/Fall17/sta101.002/

     Announcements
     ▶ MT 2 scores posted in Sakai!
     ▶ Start working on your Final projects. Due date: Sunday Dec 3, 11:55 PM.
       – Lab: 10:05 AM, 11:45 AM and 1:25 PM will present on Dec 4 during their Lab session. No lab on Monday for 8:30 AM and 3:05 PM.
       – Lab: 8:30 AM and 3:05 PM will present on Dec 5 during our lecture at Social Science 139. Your lab TAs will be here! No lecture on Tuesday for 10:05 AM, 11:45 AM and 1:25 PM.
     ▶ PS 6 due date Nov 17 at 11:55 PM.
     ▶ PA 6 due date Nov 19 at 11:55 PM.

     Modeling numerical variables
     ▶ So far we have worked with single numerical and categorical variables, and explored relationships between numerical and categorical, and two categorical variables.
     ▶ In this unit we will learn to quantify the relationship between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.
     ▶ In the next unit we'll learn to model numerical variables using many explanatory variables at once.

     Guessing the correlation
     Clicker question: Which of the following is the best guess for the correlation between annual murders per million and percentage living in poverty?
     (a) -1.52
     (b) -0.63
     (c) -0.12
     (d) 0.02
     (e) 0.84
     [Scatterplot: annual murders per million vs. % in poverty]
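
A correlation coefficient always lies between -1 and 1, so a value such as -1.52 can be ruled out without looking at the plot. The sketch below illustrates this in R with a small made-up data set (the vectors `x` and `y` are assumptions for illustration, not the murder and poverty data from the slides). It computes the correlation with `cor()` and again from the z-score definition.

```r
# Made-up illustrative data (NOT the murder/poverty data from the slides)
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 11, 14, 21, 24, 33, 38)

# Built-in Pearson correlation
r <- cor(x, y)

# Same quantity from the definition:
# sum of products of z-scores, divided by n - 1
n <- length(x)
r_manual <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)

print(c(r = r, r_manual = r_manual))

# A correlation coefficient is always between -1 and 1
stopifnot(abs(r) <= 1)
```

The sign of `r` gives the direction of the linear association and its magnitude gives the strength.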

  2. Guessing the correlation
     Clicker question: Which of the following is the best guess for the correlation between annual murders per million and population size?
     (a) -0.97
     (b) -0.61
     (c) -0.06
     (d) 0.55
     (e) 0.97
     [Scatterplot: annual murders per million vs. population]

     Assessing the correlation
     Clicker question: Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?
     [Four scatterplots, panels (a) to (d)]

     Spurious correlations
     Remember: correlation does not always imply causation!
     http://www.tylervigen.com/

     (2) Least squares line minimizes squared residuals
     ▶ Residuals are the leftovers from the model fit, calculated as the difference between the observed and predicted y: $e_i = y_i - \hat{y}_i$
     ▶ The least squares line minimizes the sum of squared residuals:
       – Population data: $\hat{y} = \beta_0 + \beta_1 x$
       – Sample data: $\hat{y} = b_0 + b_1 x$
     [Scatterplot: annual murders per million vs. % in poverty]
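
To make the idea of minimizing squared residuals concrete, here is a minimal R sketch. It uses a small made-up data set (an assumption for illustration, not the data from the slides) and a hypothetical helper `ssr()`. It computes the residuals as observed minus predicted, then compares the sum of squared residuals of the least squares line with two nearby lines.

```r
# Made-up illustrative data (NOT the data from the slides)
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 11, 14, 21, 24, 33, 38)

fit <- lm(y ~ x)        # least squares fit
y_hat <- fitted(fit)    # predicted values
e <- y - y_hat          # residuals: observed minus predicted

# Hypothetical helper: sum of squared residuals for the line b0 + b1 * x
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

b <- unname(coef(fit))
ssr_ls <- ssr(b[1], b[2])           # least squares line
ssr_shift <- ssr(b[1] + 1, b[2])    # same slope, intercept moved up by 1
ssr_tilt <- ssr(b[1], b[2] + 0.1)   # same intercept, slightly steeper slope

print(c(least_squares = ssr_ls, shifted = ssr_shift, tilted = ssr_tilt))
```

Any change to the intercept or slope should give a larger sum of squared residuals than the fitted line.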

  3. (3) Interpreting the least squares line
     ▶ Slope: For each unit increase in x, y is expected to be higher/lower on average by the slope.
       $b_1 = \frac{s_y}{s_x} R$
     ▶ Intercept: When x = 0, y is expected to equal the intercept.
       $b_0 = \bar{y} - b_1 \bar{x}$
       – The calculation of the intercept uses the fact that a regression line always passes through $(\bar{x}, \bar{y})$.

     Why does the regression line always pass through $(\bar{x}, \bar{y})$?
     ▶ If there is no relationship between x and y ($b_1 = 0$), the best guess for $\hat{y}$ for any value of x is $\bar{y}$.
     ▶ Even when there is a relationship between x and y ($b_1 \neq 0$), the best guess for $\hat{y}$ when $x = \bar{x}$ is still $\bar{y}$.
     [Three scatterplots with fitted lines, each marking the point $(\bar{x}, \bar{y})$]

     Application exercise: 6.1 Linear model
     See course website for details

     Clicker question: What is the interpretation of the slope?
     (a) Each additional percentage in those living in poverty increases number of annual murders per million by 2.56.
     (b) For each percentage increase in those living in poverty, the number of annual murders per million is expected to be higher by 2.56 on average.
     (c) For each percentage increase in those living in poverty, the number of annual murders per million is expected to be lower by 29.91 on average.
     (d) For each percentage increase annual murders per million, the percentage of those living in poverty is expected to be higher by 2.56 on average.
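
The slope and intercept formulas can be checked numerically. The R sketch below uses a small made-up data set (an assumption for illustration, not the data from the slides). It computes the slope as the ratio of standard deviations times the correlation, computes the intercept from the two means, and compares both with what `lm()` returns. It also confirms that the prediction at the mean of x equals the mean of y, which follows from substituting the intercept formula into the line.

```r
# Made-up illustrative data (NOT the data from the slides)
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 11, 14, 21, 24, 33, 38)

# Slope and intercept from the summary-statistic formulas
b1 <- sd(y) / sd(x) * cor(x, y)
b0 <- mean(y) - b1 * mean(x)

# Same quantities from the least squares fit
fit <- lm(y ~ x)
print(rbind(by_formula = c(b0, b1), by_lm = unname(coef(fit))))

# The line passes through (mean(x), mean(y))
at_mean <- b0 + b1 * mean(x)
print(c(prediction_at_mean_x = at_mean, mean_y = mean(y)))
```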

  4. A note about the intercept
     Sometimes the intercept might be an extrapolation: useful for adjusting the height of the line, but meaningless in the context of the data.
     [Scatterplot: annual murders per million vs. % in poverty, with the x-axis extended from 0 to 60]

     Clicker question: Suppose you want to predict annual murder count (per million) for a series of districts that were not included in the dataset. For which of the following districts would you be most comfortable with your prediction?
     A district where % in poverty =
     (a) 5%
     (b) 15%
     (c) 20%
     (d) 26%
     (e) 40%
     [Scatterplot: annual murders per million vs. % in poverty]

     Calculating predicted values
     By hand:
       $\widehat{murder} = -29.91 + 2.56 \times poverty$
     The predicted number of murders per million per year for a county with 20% poverty rate is:
       $\widehat{murder} = -29.91 + 2.56 \times 20 = 21.29$
     In R:
       # load data
       murder <- read.csv("https://stat.duke.edu/~mc301/data/murder.csv")
       # fit model
       m_mur_pov <- lm(annual_murders_per_mil ~ perc_pov, data = murder)
       # create new data
       newdata <- data.frame(perc_pov = 20)
       # predict
       predict(m_mur_pov, newdata)

              1
       21.28663

     Summary of main ideas
     1. Correlation coefficient describes the strength and direction of the linear association between two numerical variables
     2. Least squares line minimizes squared residuals
     3. Interpreting the least squares line
     4. Predict, but don't extrapolate
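
The point about extrapolation can be sketched in R using only the rounded line from the slides. The helper names `predict_murder()` and `is_extrapolation()` are hypothetical, and the observed poverty range of 14 to 26 is an assumption read off the plot's axis ticks, not taken from the data file. The sketch evaluates the line at each clicker option and flags values outside that assumed range.

```r
# Rounded fitted line reported on the slides
predict_murder <- function(poverty) -29.91 + 2.56 * poverty

# ASSUMPTION: observed range of % in poverty, read off the plot's axis ticks
observed_range <- c(14, 26)

# Hypothetical helper: TRUE when a value lies outside the assumed observed range
is_extrapolation <- function(poverty) {
  poverty < observed_range[1] | poverty > observed_range[2]
}

districts <- c(5, 15, 20, 26, 40)   # the clicker options, in percent
result <- data.frame(
  poverty = districts,
  predicted = predict_murder(districts),
  extrapolation = is_extrapolation(districts)
)
print(result)
```

At 5% poverty the line gives a negative number of murders per million, which cannot be a real rate. The same happens at 0%, where the prediction is the intercept itself.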
