Lecture 5: Multiple Linear Regression CS109A Introduction to Data Science Pavlos Protopapas and Kevin Rader
Lecture Outline

Simple Regression:
• Standard Errors of Predictor Coefficients
• Evaluating Significance of Predictors
• Hypothesis Testing
• How well do we know f̂?
• How well do we know ŷ?

Multiple Linear Regression:
• Categorical Predictors
• Collinearity
• Hypothesis Testing
• Interaction Terms

Polynomial Regression

CS109A, Protopapas, Rader
Standard Errors

The standard deviations of β̂₀ and β̂₁ are also called their standard errors, SE(β̂₀) and SE(β̂₁).

If our data is drawn from a larger set of observations, then we can empirically estimate the standard errors SE(β̂₀) and SE(β̂₁) through bootstrapping.

If we know the variance σ² of the noise ε, we can compute SE(β̂₀) and SE(β̂₁) analytically, using the formulas below:

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )
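The two routes to the standard errors can be compared directly. The sketch below uses synthetic data (the coefficients, sample size, and noise level are assumptions for illustration, not values from the lecture): it evaluates the analytic formulas using the known noise σ, then estimates the same quantities by bootstrapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 + 0.5 x + noise, sigma = 1
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Analytic standard errors, using the known noise sigma
sigma = 1.0
sxx = np.sum((x - x.mean()) ** 2)
se_b0 = sigma * np.sqrt(1 / n + x.mean() ** 2 / sxx)
se_b1 = sigma / np.sqrt(sxx)

# Bootstrap estimate: resample the data with replacement, refit,
# and take the standard deviation of the fitted coefficients
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    b1, b0 = np.polyfit(x[idx], y[idx], 1)  # polyfit returns slope first
    boots.append((b0, b1))
boot_se_b0, boot_se_b1 = np.std(np.array(boots), axis=0)
```

With a reasonably large sample, the bootstrap standard errors should land close to the analytic values, which is exactly the agreement shown in the TV-advertising tables later in the lecture.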
Standard Errors

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )

• More data: n ↑ and Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x), i.e. Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓

In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε.

Remember: yᵢ = f(xᵢ) + εᵢ ⟹ εᵢ = yᵢ − f(xᵢ)
Standard Errors

In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε.

However, if we make the following assumptions:
• the errors εᵢ = yᵢ − ŷᵢ and εⱼ = yⱼ − ŷⱼ are uncorrelated for i ≠ j,
• each εᵢ is normally distributed with mean 0 and variance σ²,

then we can empirically estimate σ from the data and our regression line:

σ̂ = √( Σᵢ (yᵢ − ŷᵢ)² / (n − 2) ) = √( n · MSE / (n − 2) )
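The residual-based estimate of σ can be checked on data where the true noise level is known. This is a minimal sketch on synthetic data (the true σ = 0.5 and the linear coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known noise level: true sigma = 0.5
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# Fit the regression line, then estimate sigma from the residuals,
# dividing by n - 2 (two fitted parameters: intercept and slope)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
```

Dividing by n − 2 rather than n makes σ̂² an unbiased estimate of σ², since two degrees of freedom are spent fitting β̂₀ and β̂₁.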
Standard Errors

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )

• More data: n ↑ and Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x), i.e. Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓
• Better model: residuals (f̂(xᵢ) − yᵢ) ↓ ⟹ σ̂ ↓ ⟹ SE ↓, where
  σ̂ ≈ √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )

Question: What happens to β̂₀ and β̂₁ under these scenarios?
Standard Errors

The following results are for the coefficient for TV advertising:

Method            SE(β̂₁)
Analytic Formula  0.0061
Bootstrap         0.0061

Restricting the coverage of x:

Method            SE(β̂₁)
Analytic Formula  0.0068
Bootstrap         0.0068

Does this make sense?

With extra noise added:

Method            SE(β̂₁)
Analytic Formula  0.0028
Bootstrap         0.0023
Importance of Predictors

We have discussed assessing the importance of a predictor by examining the distribution of its estimated coefficient and determining the cumulative probability beyond 0.
Hypothesis Testing

Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis gathered by random sampling of the data.
Random Sampling of the Data

Shuffle the values of the predictor variable (TV) while keeping the response (sales) fixed. Each shuffle breaks any real association between the predictor and the response.

[Table: the original TV and sales columns alongside several shuffled permutations of TV, with the sales values unchanged.]
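The shuffling idea can be sketched as a permutation test. This is a minimal illustration on synthetic TV/sales-style data (the coefficients and noise level are assumptions, not the actual advertising dataset): shuffling the predictor builds the null distribution of the slope, against which the observed slope is compared.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data in the spirit of the TV/sales example (values assumed)
n = 200
tv = rng.uniform(0, 300, n)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3.0, n)

def slope(x, y):
    # polyfit returns [slope, intercept] for degree 1
    return np.polyfit(x, y, 1)[0]

observed = slope(tv, sales)

# Shuffle the predictor to destroy any real association, then refit:
# the resulting slopes form the null distribution "by accident"
null_slopes = np.array([slope(rng.permutation(tv), sales)
                        for _ in range(1000)])

# Fraction of shuffles producing a slope at least as extreme as observed
p_value = np.mean(np.abs(null_slopes) >= abs(observed))
```

A small p-value here says the observed slope is far outside what shuffled (association-free) data produces, which is the intuition formalized by the t-test on the following slides.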
Importance of Predictors

Translate this to Kevin's language: look at the distance of the estimated coefficient from zero, in units of SE(β̂₁):

t = (β̂₁ − 0) / SE(β̂₁)
Importance of Predictors

We can also evaluate how often a particular value of t occurs by accident (using the shuffled data).

We expect that t will have a t-distribution with n − 2 degrees of freedom. Computing the probability of observing any value equal to |t| or larger, assuming β₁ = 0, is then easy. We call this probability the p-value.

A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.
Hypothesis Testing

Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis gathered by random sampling of the data.

1. State the hypotheses, typically a null hypothesis H₀ and an alternative hypothesis H₁ that is the negation of the former.
2. Choose a type of analysis, i.e. how to use the sample data to evaluate the null hypothesis. Typically this involves choosing a single test statistic.
3. Compute the test statistic.
4. Use the value of the test statistic to either reject or not reject the null hypothesis.
Hypothesis Testing

1. State the hypotheses:
   Null hypothesis H₀: there is no relation between X and Y.
   Alternative hypothesis H₁: there is some relation between X and Y.

2. Choose a test statistic:
   To test the null hypothesis, we need to determine whether our estimate β̂₁ is sufficiently far from zero that we can be confident that β₁ is non-zero. We use the following test statistic:

   t = (β̂₁ − 0) / SE(β̂₁)
Hypothesis Testing

3. Compute the statistic:
   Using the estimated β̂₁ and SE(β̂₁), we calculate the t-statistic.

4. Reject or do not reject the hypothesis:
   If there is really no relationship between X and Y, then we expect that t will have a t-distribution with n − 2 degrees of freedom. Computing the probability of observing any value equal to |t| or larger, assuming β₁ = 0, is easy. We call this probability the p-value.

   A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.
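The four steps above can be carried out in a few lines. This sketch uses synthetic data (the coefficients and noise level are assumptions) and assumes SciPy is available for the t-distribution tail probability:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration): a real slope of 1.2
n = 50
x = rng.uniform(0, 10, n)
y = 3.0 + 1.2 * x + rng.normal(0, 2.0, n)

# Fit the line and estimate sigma from the residuals (n - 2 dof)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard error of the slope, then the t-statistic against beta1 = 0
se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = (b1 - 0) / se_b1

# Two-sided p-value: probability of |t| or larger under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

Since the data were generated with a genuine relationship between x and y, the p-value comes out very small and H₀ is rejected.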
Hypothesis Testing

P-values for all three predictors, evaluated independently:

Method            SE(β̂₀)   SE(β̂₁)
Analytic Formula  0.353     0.0023
Bootstrap         0.328     0.0028

(The slide shows one such table for each of the three predictors.)
Things to Consider

• Comparison of two models: how do we choose between two different models?
• Model fitness: how well does the model predict?
• Evaluating significance of predictors: does the outcome depend on the predictors?
• How well do we know f̂: the confidence intervals of our f̂.
" ? How well do we know 𝑔 Our confidence in 𝑔 is directly connected with the confidence in 𝛾 s. So for each 𝛾 we can determine the model. CS109A, P ROTOPAPAS , R ADER 18
" ? How well do we know 𝑔 Here we show two difference set of models given the fitted coefficients for a given subsample CS109A, P ROTOPAPAS , R ADER 19
" ? How well do we know 𝑔 There is one such regression line for every imaginable sub-sample. CS109A, P ROTOPAPAS , R ADER 20
" ? How well do we know 𝑔 Below we show all regression lines for a thousand of such sub-samples. " , and determine the mean For a given 𝑦 , we examine the distribution of 𝑔 and standard deviation. CS109A, P ROTOPAPAS , R ADER 21