Lecture 5: Multiple Linear Regression CS109A Introduction to Data Science Pavlos Protopapas and Kevin Rader
Lecture Outline

Simple Regression:
• Standard Errors of Predictor Coefficients
• Evaluating Significance of Predictors
• Hypothesis Testing
• How well do we know f̂?
• How well do we know ŷ?

Multiple Linear Regression:
• Categorical Predictors
• Collinearity
• Hypothesis Testing
• Interaction Terms

Polynomial Regression

CS109A, Protopapas, Rader
Standard Errors

The standard deviations of β̂₀ and β̂₁ are also called their standard errors, SE(β̂₀) and SE(β̂₁).

If our data is drawn from a larger set of observations, then we can empirically estimate the standard errors SE(β̂₀) and SE(β̂₁) through bootstrapping.

If we know the variance σ² of the noise ε, we can compute SE(β̂₀) and SE(β̂₁) analytically, using the formulas below:

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )
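The two routes to the standard errors can be compared directly. The sketch below uses synthetic data (the coefficients, sample size, and noise level are assumptions for illustration, not values from the lecture): it evaluates the analytic formulas using the known noise σ, then estimates the same quantities by bootstrapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 + 0.5 x + noise, sigma = 1
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Analytic standard errors, using the known noise sigma
sigma = 1.0
sxx = np.sum((x - x.mean()) ** 2)
se_b0 = sigma * np.sqrt(1 / n + x.mean() ** 2 / sxx)
se_b1 = sigma / np.sqrt(sxx)

# Bootstrap estimate: resample the data with replacement, refit,
# and take the standard deviation of the fitted coefficients
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    b1, b0 = np.polyfit(x[idx], y[idx], 1)  # polyfit returns slope first
    boots.append((b0, b1))
boot_se_b0, boot_se_b1 = np.std(np.array(boots), axis=0)
```

With a reasonably large sample, the bootstrap standard errors should land close to the analytic values, which is exactly the agreement shown in the TV-advertising tables later in the lecture.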
Standard Errors

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )

• More data: n ↑ and Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x), i.e. Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓

In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε.

Remember: yᵢ = f(xᵢ) + εᵢ ⟹ εᵢ = yᵢ − f(xᵢ)
Standard Errors

In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε.

However, if we make the following assumptions:
• the errors εᵢ = yᵢ − ŷᵢ and εⱼ = yⱼ − ŷⱼ are uncorrelated for i ≠ j,
• each εᵢ is normally distributed with mean 0 and variance σ²,

then we can empirically estimate σ from the data and our regression line:

σ̂ = √( Σᵢ (yᵢ − ŷᵢ)² / (n − 2) ) = √( n · MSE / (n − 2) )
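The residual-based estimate of σ can be checked on data where the true noise level is known. This is a minimal sketch on synthetic data (the true σ = 0.5 and the linear coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known noise level: true sigma = 0.5
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# Fit the regression line, then estimate sigma from the residuals,
# dividing by n - 2 (two fitted parameters: intercept and slope)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
```

Dividing by n − 2 rather than n makes σ̂² an unbiased estimate of σ², since two degrees of freedom are spent fitting β̂₀ and β̂₁.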
Standard Errors

SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )

SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )

• More data: n ↑ and Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x), i.e. Σᵢ (xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓
• Better model: residuals (f̂(xᵢ) − yᵢ) ↓ ⟹ σ̂ ↓ ⟹ SE ↓, where
  σ̂ ≈ √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )

Question: What happens to β̂₀ and β̂₁ under these scenarios?
Standard Errors

The following results are for the coefficient for TV advertising:

Method            SE(β̂₁)
Analytic Formula  0.0061
Bootstrap         0.0061

Restricting the coverage of x:

Method            SE(β̂₁)
Analytic Formula  0.0068
Bootstrap         0.0068

Does this make sense?

With extra noise added:

Method            SE(β̂₁)
Analytic Formula  0.0028
Bootstrap         0.0023
Importance of Predictors

We have discussed assessing the importance of a predictor by examining the distribution of its estimated coefficient and determining the cumulative probability beyond 0.
Hypothesis Testing

Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis gathered by random sampling of the data.
Random Sampling of the Data

Shuffle the values of the predictor variable (TV) while keeping the response (sales) fixed. Each shuffle breaks any real association between the predictor and the response.

[Table: the original TV and sales columns alongside several shuffled permutations of TV, with the sales values unchanged.]
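The shuffling idea can be sketched as a permutation test. This is a minimal illustration on synthetic TV/sales-style data (the coefficients and noise level are assumptions, not the actual advertising dataset): shuffling the predictor builds the null distribution of the slope, against which the observed slope is compared.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data in the spirit of the TV/sales example (values assumed)
n = 200
tv = rng.uniform(0, 300, n)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3.0, n)

def slope(x, y):
    # polyfit returns [slope, intercept] for degree 1
    return np.polyfit(x, y, 1)[0]

observed = slope(tv, sales)

# Shuffle the predictor to destroy any real association, then refit:
# the resulting slopes form the null distribution "by accident"
null_slopes = np.array([slope(rng.permutation(tv), sales)
                        for _ in range(1000)])

# Fraction of shuffles producing a slope at least as extreme as observed
p_value = np.mean(np.abs(null_slopes) >= abs(observed))
```

A small p-value here says the observed slope is far outside what shuffled (association-free) data produces, which is the intuition formalized by the t-test on the following slides.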
Importance of Predictors

Translate this to Kevin's language: look at the distance of the estimated coefficient from zero, in units of SE(β̂₁):

t = (β̂₁ − 0) / SE(β̂₁)
Importance of Predictors

We can also evaluate how often a particular value of t occurs by accident (using the shuffled data).

We expect that t will have a t-distribution with n − 2 degrees of freedom. Computing the probability of observing any value equal to |t| or larger, assuming β₁ = 0, is then easy. We call this probability the p-value.

A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.
Hypothesis Testing

Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis gathered by random sampling of the data.

1. State the hypotheses, typically a null hypothesis H₀ and an alternative hypothesis H₁ that is the negation of the former.
2. Choose a type of analysis, i.e. how to use the sample data to evaluate the null hypothesis. Typically this involves choosing a single test statistic.
3. Compute the test statistic.
4. Use the value of the test statistic to either reject or not reject the null hypothesis.
Hypothesis Testing

1. State the hypotheses:
   Null hypothesis H₀: there is no relation between X and Y.
   Alternative hypothesis H₁: there is some relation between X and Y.

2. Choose a test statistic:
   To test the null hypothesis, we need to determine whether our estimate β̂₁ is sufficiently far from zero that we can be confident that β₁ is non-zero. We use the following test statistic:

   t = (β̂₁ − 0) / SE(β̂₁)
Hypothesis Testing

3. Compute the statistic:
   Using the estimated β̂₁ and SE(β̂₁), we calculate the t-statistic.

4. Reject or do not reject the hypothesis:
   If there is really no relationship between X and Y, then we expect that t will have a t-distribution with n − 2 degrees of freedom. Computing the probability of observing any value equal to |t| or larger, assuming β₁ = 0, is easy. We call this probability the p-value.

   A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.
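The four steps above can be carried out in a few lines. This sketch uses synthetic data (the coefficients and noise level are assumptions) and assumes SciPy is available for the t-distribution tail probability:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration): a real slope of 1.2
n = 50
x = rng.uniform(0, 10, n)
y = 3.0 + 1.2 * x + rng.normal(0, 2.0, n)

# Fit the line and estimate sigma from the residuals (n - 2 dof)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard error of the slope, then the t-statistic against beta1 = 0
se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = (b1 - 0) / se_b1

# Two-sided p-value: probability of |t| or larger under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

Since the data were generated with a genuine relationship between x and y, the p-value comes out very small and H₀ is rejected.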
Hypothesis Testing

P-values for all three predictors, evaluated independently:

Method            SE(β̂₀)   SE(β̂₁)
Analytic Formula  0.353     0.0023
Bootstrap         0.328     0.0028

(The slide shows one such table for each of the three predictors.)
Things to Consider

• Comparison of two models: how do we choose between two different models?
• Model fitness: how well does the model predict?
• Evaluating significance of predictors: does the outcome depend on the predictors?
• How well do we know f̂: the confidence intervals of our f̂.
" ? How well do we know 𝑔 Our confidence in 𝑔 is directly connected with the confidence in 𝛾 s. So for each 𝛾 we can determine the model. CS109A, P ROTOPAPAS , R ADER 18
" ? How well do we know 𝑔 Here we show two difference set of models given the fitted coefficients for a given subsample CS109A, P ROTOPAPAS , R ADER 19
" ? How well do we know 𝑔 There is one such regression line for every imaginable sub-sample. CS109A, P ROTOPAPAS , R ADER 20
" ? How well do we know 𝑔 Below we show all regression lines for a thousand of such sub-samples. " , and determine the mean For a given 𝑦 , we examine the distribution of 𝑔 and standard deviation. CS109A, P ROTOPAPAS , R ADER 21