
STAT 213 Logistic Regression: Assessment and Testing



  1. STAT 213 Logistic Regression: Assessment and Testing. Colin Reimer Dawson, Oberlin College. April 13, 2020.

  2. Outline
     • Assessing Conditions
       • Checking Linearity: Binned Data
       • Alternative Residuals
       • Checking Linearity: Unbinned Data
     • Tests and Intervals
       • Test of Coefficients
       • Intervals for Coefficients
       • Intervals for Specific Predictors


  4. Conditions for Logistic Regression
     1. Logit-linearity (log odds depend linearly on X)
     2. Independence (no clustering or time/space dependence)
     3. Random (data come from a random sample or random assignment)
     4. Normality no longer applies! (The response is binary, so it can't be normal.)
     5. Constant variance no longer required! (In fact, the variance is larger when $\hat{\pi}$ is near 0.5.)
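
The remark about variance in condition 5 follows from the variance of a binary (Bernoulli) response. A short derivation, assuming each $Y_i$ is Bernoulli with success probability $\pi_i$:

```latex
\operatorname{Var}(Y_i) = \pi_i (1 - \pi_i), \qquad
\frac{d}{d\pi}\,\pi(1 - \pi) = 1 - 2\pi = 0
\;\Longrightarrow\; \pi = \tfrac{1}{2}, \qquad
\max_{\pi}\, \pi(1 - \pi) = \tfrac{1}{4}.
```

So the variance shrinks toward 0 as $\pi_i$ approaches 0 or 1 and is largest at $\pi_i = 0.5$.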

  5. Checking Linearity
     • Can't just transform the response via logit to check linearity...
       • logit(0) = −∞
       • logit(1) = ∞
     • ...unless the data are binned; then we can take the logit of the proportion in each bin
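
The two endpoint values above can be checked numerically. This is a minimal sketch that uses base R's `qlogis()` as the logit function (instead of the `logit()` helper used elsewhere in the slides); the proportions in `bin_props` are made up purely for illustration.

```r
# Raw binary responses are all 0 or 1, so their logits are infinite.
y <- c(0, 1)
raw_logits <- qlogis(y)          # qlogis(p) = log(p / (1 - p))

# Binned proportions strictly between 0 and 1 have finite logits.
# NOTE: these proportions are invented for illustration only.
bin_props <- c(0.2, 0.5, 0.8)
bin_logits <- qlogis(bin_props)

print(raw_logits)
print(bin_logits)
```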

  6. Example: Golf Putts

     | Distance (ft) | 3 | 4 | 5 | 6 | 7 |
     |---|---|---|---|---|---|
     | # Made | 84 | 88 | 61 | 61 | 44 |
     | # Missed | 17 | 31 | 47 | 64 | 90 |
     | Odds | 4.94 | 2.84 | 1.30 | 0.95 | 0.49 |
     | Log Odds | 1.60 | 1.04 | 0.26 | -0.05 | -0.71 |

         library("mosaic")
         Putts <- data.frame(
           Distance = 3:7,
           Made = c(84, 88, 61, 61, 44),
           Missed = c(17, 31, 47, 64, 90)) %>%
           mutate(
             Total = Made + Missed,
             PropMade = Made / Total)
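
The Odds and Log Odds rows can be recomputed from the counts. This is a base-R sketch (no `mosaic` needed); the lowercase name `putts` is only illustrative. Values computed from the unrounded counts may differ from the table in the last digit, since the table's log odds appear to be taken from already-rounded odds.

```r
putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)

# Odds of making the putt at each distance, and their natural log.
putts$Odds     <- putts$Made / putts$Missed
putts$LogOdds  <- log(putts$Odds)

# The logit of the proportion made is the same quantity as the log odds.
putts$PropMade <- putts$Made / (putts$Made + putts$Missed)
putts$Logit    <- qlogis(putts$PropMade)

print(round(putts, 2))
```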

  7. Binned Data

         xyplot(logit(PropMade) ~ Distance, data = Putts, type = c("p", "r"))

     [Figure: scatterplot of logit(PropMade) against Distance with a fitted regression line.]

     Logits are fairly linear.

  8. Equivalent Model Code for Binned Data

         m2 <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = "binomial")
         m2

         Call:  glm(formula = cbind(Made, Missed) ~ Distance, family = "binomial",
             data = Putts)

         Coefficients:
         (Intercept)     Distance
              3.2568      -0.5661

         Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
         Null Deviance:     81.39
         Residual Deviance: 1.069     AIC: 30.18
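
To see in what sense the binned model code is "equivalent", the sketch below expands the table into one row per putt and refits the model on the 0/1 outcomes. The names `putts`, `long`, `binned_fit`, and `unbinned_fit` are illustrative and not from the slides; the two fits should agree on the coefficients up to numerical tolerance.

```r
putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)

# Binned form: one row per distance, response is (successes, failures).
binned_fit <- glm(cbind(Made, Missed) ~ Distance, data = putts,
                  family = binomial)

# Unbinned form: one row per putt, response is 1 (made) or 0 (missed).
long <- data.frame(
  Distance = rep(rep(putts$Distance, 2),
                 times = c(putts$Made, putts$Missed)),
  Success  = rep(c(1, 0),
                 times = c(sum(putts$Made), sum(putts$Missed)))
)
unbinned_fit <- glm(Success ~ Distance, data = long, family = binomial)

print(coef(binned_fit))
print(coef(unbinned_fit))
```

The deviances of the two fits are not expected to match, because the saturated model they are compared against differs between the binned and unbinned layouts.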

  9. Deviance Residuals
     • Total log likelihood: $\ell := \log P(\text{Data} \mid \text{Model})$
     • Deviance measures "total discrepancy" between data and model: $\text{Deviance} := -2\ell = -2 \log P(\text{Data} \mid \text{Model})$
     • In linear regression, we had $\text{SSE} = \sum_{i=1}^{N} \varepsilon_i^2$, which matches $-2 \log P(\text{Data} \mid \text{Model})$ up to scaling and an additive constant under Normal errors
     • Deviance residuals $d_i$ are "reverse engineered" so that $\text{Deviance} = \sum_{i=1}^{N} d_i^2$
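
The identity between the deviance and the squared deviance residuals can be checked directly. This sketch refits the golf putt model on one-row-per-putt data (variable names are illustrative); with a 0/1 response, the residual deviance reported by R coincides with $-2\ell$.

```r
made_counts   <- c(84, 88, 61, 61, 44)
missed_counts <- c(17, 31, 47, 64, 90)

# One entry per putt: distances 3..7 for the makes, then for the misses.
distance <- rep(c(3:7, 3:7), times = c(made_counts, missed_counts))
made     <- rep(c(1, 0), times = c(sum(made_counts), sum(missed_counts)))

fit <- glm(made ~ distance, family = binomial)

d                <- residuals(fit, type = "deviance")
sum_sq_dev       <- sum(d^2)                      # sum of squared deviance residuals
minus_two_loglik <- -2 * as.numeric(logLik(fit))  # -2 * log likelihood

print(c(sum_sq_dev = sum_sq_dev,
        deviance = deviance(fit),
        minus_two_loglik = minus_two_loglik))
```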

  10. Checking for Outliers

          ### Model of med school acceptance probability by MCAT score
          library(Stat2Data); data(MedGPA)
          mcatModel <- glm(Acceptance ~ MCAT, data = MedGPA, family = "binomial")
          ## Check for outliers by plotting residual distribution
          ## (Note: will almost always be bimodal; *not* expecting normality)
          residuals(mcatModel, type = "deviance") %>% histogram()

      [Figure: density histogram of the deviance residuals, ranging from about −2 to 2.]

  11. Pearson Residuals

      Another way to conceive of residuals is by "standardized distance" from the predicted value:

      $\text{Pearson residual}_i = \dfrac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$

          residuals(mcatModel, type = "pearson") %>% histogram()

      [Figure: density histogram of the Pearson residuals, ranging from about −2 to 2.]
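
The Pearson residual formula can be checked against R's built-in calculation. This sketch uses the golf putt counts expanded to one row per putt so that it runs without the `Stat2Data` package; the variable names are illustrative.

```r
made_counts   <- c(84, 88, 61, 61, 44)
missed_counts <- c(17, 31, 47, 64, 90)
distance <- rep(c(3:7, 3:7), times = c(made_counts, missed_counts))
made     <- rep(c(1, 0), times = c(sum(made_counts), sum(missed_counts)))

fit    <- glm(made ~ distance, family = binomial)
pi_hat <- fitted(fit)   # estimated probability of making each putt

# Residual divided by the estimated standard deviation of a 0/1 outcome.
pearson_by_hand <- (made - pi_hat) / sqrt(pi_hat * (1 - pi_hat))
pearson_builtin <- residuals(fit, type = "pearson")

print(head(cbind(pearson_by_hand, pearson_builtin)))
```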

  12. Pearson Residuals vs. Fitted Values Plot

      Can check logit-linearity for unbinned data by binning the residuals and constructing a plot of fitted values vs. (average) residuals.

          library("arm") ## for binnedplot()
          binnedplot(fitted(mcatModel),
                     residuals(mcatModel, type = "pearson"),
                     nclass = 10 # number of bins to use
                     )

      [Figure: "Binned residual plot" of Average residual against Expected Values, one point per bin.]
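
The idea behind a binned residual plot can be sketched by hand in base R. This is an illustration of the binning step only, not a reproduction of what `arm::binnedplot()` does internally; the data are simulated (an assumption made so the example is self-contained), and all names are illustrative.

```r
set.seed(213)

# Simulated data where logit-linearity holds by construction.
n <- 500
x <- runif(n, min = -2, max = 2)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))

fit         <- glm(y ~ x, family = binomial)
fitted_vals <- fitted(fit)
pearson     <- residuals(fit, type = "pearson")

# Split the fitted values into 10 bins with roughly equal counts.
breaks <- quantile(fitted_vals, probs = seq(0, 1, length.out = 11))
bin    <- cut(fitted_vals, breaks = breaks, include.lowest = TRUE)

# Average fitted value and average residual within each bin.
binned <- data.frame(
  mean_fitted = as.numeric(tapply(fitted_vals, bin, mean)),
  mean_resid  = as.numeric(tapply(pearson, bin, mean))
)
print(round(binned, 3))
```

Plotting `mean_resid` against `mean_fitted` gives the hand-made version of the plot; under logit-linearity the bin averages should scatter around 0 with no clear trend.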

  13. Linear vs. Logistic Regression

      | Goal | Linear | Logistic |
      |---|---|---|
      | Estimate coefs | Minimize SSE | Maximize likelihood |
      | Check conditions | Linearity/Const. var.: Residual vs. Fitted; Normality: QQ Plots | Logit linearity: Binned residuals vs. fitted |

  14. Tests and Intervals
      • Test of Coefficients
      • Intervals for Coefficients
      • Intervals for Specific Predictors

  15. Hypothesis Test for $\beta_1$

      In linear regression, we computed the test statistic

      $t_{\text{obs}} = \dfrac{\hat{\beta}_1 - 0}{\widehat{\text{se}}(\hat{\beta}_1)}$

      (the number of standard errors $\hat{\beta}_1$ is from 0).

      P-value: prob. of getting a test stat this big by chance if $H_0$ is true (i.e., $\beta_1 = 0$)

  16. Hypothesis Test for $\beta_1$

      In logistic regression we can do the same thing, but with the Normal instead of the t distribution:

      $z_{\text{obs}} = \dfrac{\hat{\beta}_1 - 0}{\widehat{\text{se}}(\hat{\beta}_1)}$

      and get the P-value: prob. of a test stat this big if $H_0$ is true

  17. In R

          summary(mcatModel) %>% coef() %>% round(3)
                      Estimate Std. Error z value Pr(>|z|)
          (Intercept)   -8.712      3.236  -2.692    0.007
          MCAT           0.246      0.089   2.752    0.006

      Only a 0.6% chance we'd get $|\hat{\beta}_1| \geq 0.246$ if the association is due solely to chance sampling.
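
The z value and P-value in the MCAT row can be reproduced from the printed numbers. This is a small sketch using only the rounded values shown in the output above, so the recomputed z differs slightly from the printed 2.752 (which R computed from unrounded values).

```r
estimate  <- 0.246   # MCAT slope from the output above
std_error <- 0.089   # its standard error

# z statistic: number of standard errors the estimate is from 0.
z_rounded <- estimate / std_error

# Two-sided P-value from the standard Normal, using the printed z value.
z_table <- 2.752
p_value <- 2 * pnorm(-abs(z_table))

print(round(z_rounded, 2))
print(round(p_value, 3))
```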

  18. Linear vs. Logistic Regression

      | Goal | Linear | Logistic |
      |---|---|---|
      | Estimate coefs | Minimize SSE | Maximize likelihood |
      | Check conditions | Linearity/Const. var.: Residual vs. Fitted; Normality: QQ Plots | Logit linearity: Binned residuals vs. fitted |
      | Test coefs | Measure SEs from 0, P-value using t | Measure SEs from 0, P-value using Normal |

  19. Confidence Interval for $\beta_1$

      Same principle applies for the confidence interval...

      $\text{CI}(\Delta\,\text{logit}): \hat{\beta}_1 \pm z^* \cdot \widehat{\text{se}}(\hat{\beta}_1)$

          confint(mcatModel) %>% round(2)
                       2.5 % 97.5 %
          (Intercept) -15.77  -3.04
          MCAT          0.09   0.44

      But... $\beta_1$ is the rate of change of the log odds, which is hard to understand. More common to report a CI for the odds ratio ($e^{\beta_1}$):

      $\text{CI}(\text{OR}): \left( e^{\beta_1^{(\text{lwr})}},\; e^{\beta_1^{(\text{upr})}} \right)$
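
The interval formula can be worked through with the rounded estimate and standard error from the coefficient table. This is a sketch of the Normal-based (Wald) interval only. `confint()` applied to a `glm` fit uses a profile-likelihood method rather than this formula, which is most likely why the printed interval (0.09, 0.44) differs a little from the result here.

```r
estimate  <- 0.246   # MCAT slope (rounded)
std_error <- 0.089   # its standard error (rounded)

z_star  <- qnorm(0.975)                           # about 1.96 for 95% confidence
wald_ci <- estimate + c(-1, 1) * z_star * std_error

# Exponentiate the endpoints to get an interval for the odds ratio.
wald_or <- exp(wald_ci)

print(round(wald_ci, 3))
print(round(wald_or, 2))
```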

  20. In R...

          confint(mcatModel) %>% round(2)
                       2.5 % 97.5 %
          (Intercept) -15.77  -3.04
          MCAT          0.09   0.44

          confint(mcatModel) %>% exp() %>% round(2)
                      2.5 % 97.5 %
          (Intercept)  0.00   0.05
          MCAT         1.09   1.55

      "We are 95% confident that the odds (not the probability) of admittance increase by a factor of (are multiplied by) between 1.09 and 1.55 for each additional point of MCAT score."

  21. Linear vs. Logistic Regression

      | Goal | Linear | Logistic |
      |---|---|---|
      | Estimate coefs | Minimize SSE | Maximize likelihood |
      | Check conditions | Linearity/Const. var.: Residual vs. Fitted; Normality: QQ Plots | Logit linearity: Binned residuals vs. fitted |
      | Test coefs | Measure SEs from 0, P-value using t | Measure SEs from 0, P-value using Normal |
      | Intervals for params | Slope: $\beta_1$ | Odds ratio: $e^{\beta_1}$ |

  22. CIs at specific values

      Arguably easier to interpret: CIs for $\pi$ at a few specific X values.

          source("http://colindawson.net/stat213/code/helper_functions.R")
          ## functions made with regular makeFun() give point values but not
          ## intervals with logistic models, so I wrote a custom function
          f.hat <- makeFun.logistic(mcatModel)
          quartiles <- quantile(~MCAT, data = MedGPA)
          f.hat(MCAT = quartiles, interval = "confidence", level = 0.95) %>% round(2)
               MCAT pi.hat  lwr  upr
          0%     18   0.01 0.00 0.26
          25%    34   0.41 0.26 0.58
          50%    36   0.54 0.39 0.67
          75%    39   0.71 0.52 0.84
          100%   48   0.96 0.72 0.99

      Interpretation: "We are 95% confident that the probability of acceptance for students with an MCAT score of 39 is between 52% and 84%."
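
For readers without access to the helper file, one common way to get an interval for $\pi$ at a given X is to build a Normal-based interval on the logit scale and back-transform it. The sketch below does this with base R's `predict()` on the golf putt model; it is an assumption that this resembles what `makeFun.logistic()` does, and the names `putts`, `new_x`, and `ci` are illustrative.

```r
putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)
fit <- glm(cbind(Made, Missed) ~ Distance, data = putts, family = binomial)

# Predicted log odds and their standard errors at a few distances.
new_x     <- data.frame(Distance = c(3, 5, 7))
link_pred <- predict(fit, newdata = new_x, type = "link", se.fit = TRUE)

# Interval on the logit scale, then mapped to probabilities with plogis().
z_star <- qnorm(0.975)
ci <- data.frame(
  Distance = new_x$Distance,
  pi.hat   = plogis(link_pred$fit),
  lwr      = plogis(link_pred$fit - z_star * link_pred$se.fit),
  upr      = plogis(link_pred$fit + z_star * link_pred$se.fit)
)
print(round(ci, 2))
```

Building the interval on the logit scale keeps both endpoints between 0 and 1, which an interval built directly on the probability scale would not guarantee.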

  23. Confidence Bands

          ## Also requires sourcing helper_functions.R
          ## Can supply level=, xlim=, xlab= and ylab= to customize graph
          plot.logistic.bands(mcatModel)

      [Figure: fitted curve for P(Acceptance = 1) against MCAT, with confidence bands.]

  24. Linear vs. Logistic Regression

      | Goal | Linear | Logistic |
      |---|---|---|
      | Estimate coefs | Minimize SSE | Maximize likelihood |
      | Check conditions | Linearity/Const. var.: Residual vs. Fitted; Normality: QQ Plots | Logit linearity: Binned residuals vs. fitted |
      | Test coefs | Measure SEs from 0, P-value using t | Measure SEs from 0, P-value using Normal |
      | Intervals for params | Slope: $\beta_1$ | Odds ratio: $e^{\beta_1}$ |
      | Intervals for fitted vals. | Confidence and prediction intervals | Confidence intervals only |
