STAT 213 Logistic Regression: Assessment and Testing
Colin Reimer Dawson
Oberlin College
April 13, 2020

Outline

Assessing Conditions
  • Checking Linearity: Binned Data
  • Alternative Residuals
  • Checking Linearity: Unbinned Data

Tests and Intervals
  • Test of Coefficients
  • Intervals for Coefficients
  • Intervals for Specific Predictors

Conditions for Logistic Regression

1. Logit-Linearity (the log odds depend linearly on X)
2. Independence (no clustering or time/space dependence)
3. Random (data come from a random sample or random assignment)
4. Normality no longer applies! (The response is binary, so it can’t be Normal)
5. Constant Variance no longer required! (In fact, the variance is largest when π̂ is near 0.5)

Checking Linearity

• Can’t just transform the response via logit to check linearity...
  • logit(0) = −∞
  • logit(1) = ∞
• ...unless the data are binned; then we can take the logit of the proportion per bin
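A small sketch of why the raw 0/1 response cannot be logit-transformed while a binned proportion can. It uses base R's qlogis() as the logit function; the 30-out-of-40 bin is a made-up illustration, not data from the slides.

```r
# qlogis(p) computes the logit, log(p / (1 - p)).
raw_response <- c(0, 1)             # individual binary outcomes
logit_raw <- qlogis(raw_response)   # infinite in both cases, so nothing to plot

# A proportion strictly between 0 and 1 has a finite logit.
# Hypothetical bin: 30 successes out of 40 trials.
bin_proportion <- 30 / 40
logit_bin <- qlogis(bin_proportion)

print(logit_raw)
print(logit_bin)
```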

Example: Golf Putts

Distance (ft)      3      4      5      6      7
# Made            84     88     61     61     44
# Missed          17     31     47     64     90
Odds            4.94   2.84   1.30   0.95   0.49
Log Odds        1.60   1.04   0.26  -0.05  -0.71

    library("mosaic")
    Putts <- data.frame(
      Distance = 3:7,
      Made     = c(84, 88, 61, 61, 44),
      Missed   = c(17, 31, 47, 64, 90)) %>%
      mutate(
        Total    = Made + Missed,
        PropMade = Made / Total)
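The Odds and Log Odds rows of the table can be reproduced with a few lines of base R. This is a sketch that avoids the mosaic dependency; the variable names are illustrative. The last digit of a log odds value can differ depending on whether the odds are rounded before taking the log.

```r
# Putting data from the table above
distance <- 3:7
made     <- c(84, 88, 61, 61, 44)
missed   <- c(17, 31, 47, 64, 90)

odds     <- made / missed   # odds of making the putt in each distance bin
log_odds <- log(odds)       # same as qlogis(made / (made + missed))

print(data.frame(distance,
                 odds     = round(odds, 2),
                 log_odds = round(log_odds, 2)))
```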

Binned Data

    xyplot(logit(PropMade) ~ Distance, data = Putts, type = c("p", "r"))

[Figure: scatterplot of logit(PropMade) against Distance (3 to 7 ft) with a fitted regression line]

Logits are fairly linear

Equivalent Model Code for Binned Data

    m2 <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = "binomial")
    m2

    Call:  glm(formula = cbind(Made, Missed) ~ Distance, family = "binomial",
        data = Putts)

    Coefficients:
    (Intercept)     Distance
         3.2568      -0.5661

    Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
    Null Deviance:      81.39
    Residual Deviance: 1.069    AIC: 30.18
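One way to read “equivalent” here, assuming it refers to a fit with one row per putt: the binned cbind() model and the unbinned 0/1 model should give the same coefficients. The sketch below expands the counts into individual rows to check this; the data frame and column names are illustrative.

```r
putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90))

# Binned fit: the response is a two-column matrix of (successes, failures)
binned_fit <- glm(cbind(Made, Missed) ~ Distance, data = putts, family = binomial)

# Expand to one row per putt: all made putts first, then all missed putts
one_row_per_putt <- data.frame(
  Distance = rep(rep(putts$Distance, times = 2),
                 times = c(putts$Made, putts$Missed)),
  Success  = rep(c(1, 0), times = c(sum(putts$Made), sum(putts$Missed))))

unbinned_fit <- glm(Success ~ Distance, data = one_row_per_putt, family = binomial)

print(coef(binned_fit))
print(coef(unbinned_fit))
```

The coefficients agree, but the reported deviances differ between the two fits because each is measured against a different saturated model.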

Deviance Residuals

• Total log likelihood: $\ell := \log P(\text{Data} \mid \text{Model})$
• Deviance measures “total discrepancy” between data and model:
  $$\text{Deviance} := -2\ell = -2 \log P(\text{Data} \mid \text{Model})$$
• In linear regression, we had (up to an additive constant and a scale factor of $\sigma^2$)
  $$\text{SSE} = \sum_{i=1}^{N} \varepsilon_i^2 = -2 \log p(\text{Data} \mid \text{Model})$$
• Deviance residuals $d_i$ are “reverse engineered” so that
  $$\text{Deviance} = \sum_{i=1}^{N} d_i^2$$
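A sketch of the “reverse engineering” for a binary response: each deviance residual is the signed square root of that observation's contribution to −2ℓ. The data here are simulated purely for illustration (the seed and coefficients are arbitrary), and the hand computation is checked against R's own residuals().

```r
set.seed(213)  # arbitrary seed for a reproducible toy data set
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit    <- glm(y ~ x, family = binomial)
pi_hat <- fitted(fit)

# Log likelihood contribution of each observation
loglik_i <- y * log(pi_hat) + (1 - y) * log(1 - pi_hat)

# Deviance residual: sign of (y - pi_hat) times sqrt of the -2 * loglik contribution
d <- sign(y - pi_hat) * sqrt(-2 * loglik_i)

print(sum(d^2))       # sum of squared deviance residuals
print(deviance(fit))  # residual deviance reported by glm()
```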

Checking for Outliers

    ### Model of med school acceptance probability by MCAT score
    library(Stat2Data); data(MedGPA)
    mcatModel <- glm(Acceptance ~ MCAT, data = MedGPA, family = "binomial")
    ## Check for outliers by plotting residual distribution
    ## (Note: will almost always be bimodal; *not* expecting normality)
    residuals(mcatModel, type = "deviance") %>% histogram()

[Figure: density histogram of the deviance residuals, ranging from about −2 to 2]

Pearson Residuals

Another way to conceive of residuals is by “standardized distance” from the predicted value:

$$\text{Pearson residual}_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$$

    residuals(mcatModel, type = "pearson") %>% histogram()

[Figure: density histogram of the Pearson residuals, ranging from about −2 to 2]
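The Pearson residual formula can be checked by hand against residuals(..., type = "pearson"). This is a sketch on simulated data (arbitrary seed and coefficients), not the MedGPA data from the slides.

```r
set.seed(213)  # arbitrary seed for a reproducible toy data set
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.4 - 0.9 * x))

fit    <- glm(y ~ x, family = binomial)
pi_hat <- fitted(fit)

# Raw residual divided by the estimated standard deviation of a Bernoulli outcome
pearson_by_hand <- (y - pi_hat) / sqrt(pi_hat * (1 - pi_hat))

print(head(round(pearson_by_hand, 3)))
print(head(round(residuals(fit, type = "pearson"), 3)))
```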

Pearson Residuals vs. Fitted Values Plot

Can check logit-linearity for unbinned data by binning the residuals and plotting the average residual in each bin against the fitted values

    library("arm")  ## for binnedplot()
    binnedplot(fitted(mcatModel),
               residuals(mcatModel, type = "pearson"),
               nclass = 10  # number of bins to use
               )

[Figure: binned residual plot; x-axis: Expected Values (about 0.2 to 0.8), y-axis: Average residual (about −1.5 to 1.0)]
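To make the binning step concrete, here is a base-R sketch of the idea: sort observations by fitted value, split them into equal-count bins, and average the residuals within each bin. It is not the arm package's implementation (which may choose bins differently), and the data are simulated for illustration.

```r
set.seed(213)  # arbitrary seed for a reproducible toy data set
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

fitted_vals <- fitted(fit)
pearson_res <- residuals(fit, type = "pearson")

# Assign each observation to one of 10 equal-count bins by rank of fitted value
n_bins <- 10
bin_id <- ceiling(rank(fitted_vals, ties.method = "first") * n_bins / n)

binned <- data.frame(
  mean_fitted = as.vector(tapply(fitted_vals, bin_id, mean)),
  mean_resid  = as.vector(tapply(pearson_res, bin_id, mean)),
  count       = as.vector(table(bin_id)))

print(round(binned, 3))
# plot(binned$mean_fitted, binned$mean_resid) would give the binned residual plot
```

If logit-linearity holds, the bin averages should scatter around zero with no clear trend or curve.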

Linear vs. Logistic Regression

Goal                Linear                                       Logistic
Estimate coefs      Minimize SSE                                 Maximize Likelihood
Check conditions    Linearity/Const. var.: Residual vs. Fitted   Logit linearity: Binned residuals vs. fitted
                    Normality: QQ Plots

Hypothesis Test for β1

In linear regression, we computed the test statistic:

$$t_{\text{obs}} = \frac{\hat{\beta}_1 - 0}{\widehat{\text{se}}(\hat{\beta}_1)}$$

(the number of standard errors $\hat{\beta}_1$ is from 0).

P-value: probability of getting a test statistic this big by chance if $H_0$ is true (i.e., $\beta_1 = 0$)

Hypothesis Test for β1

In logistic regression we can do the same thing, but with the Normal instead of the t distribution:

$$z_{\text{obs}} = \frac{\hat{\beta}_1 - 0}{\widehat{\text{se}}(\hat{\beta}_1)}$$

and get the P-value: probability of a test statistic this big if $H_0$ is true

In R

    summary(mcatModel) %>% coef() %>% round(3)

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -8.712      3.236  -2.692    0.007
    MCAT           0.246      0.089   2.752    0.006

Only 0.6% chance we’d get $|\hat{\beta}_1| \geq 0.246$ if the association is due solely to chance sampling
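The z value and P-value in the MCAT row can be approximately reproduced by hand from the reported estimate and standard error. This sketch uses the rounded values shown in the output, so the z statistic comes out slightly different from the 2.752 that R computes from unrounded values.

```r
# Rounded values copied from the summary() output above
estimate  <- 0.246
std_error <- 0.089

# Number of standard errors the estimate is from 0
z_obs <- (estimate - 0) / std_error

# Two-sided P-value from the standard Normal distribution
p_value <- 2 * pnorm(-abs(z_obs))

print(round(z_obs, 3))
print(round(p_value, 3))
```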

Linear vs. Logistic Regression

Goal                Linear                                       Logistic
Estimate coefs      Minimize SSE                                 Maximize Likelihood
Check conditions    Linearity/Const. var.: Residual vs. Fitted   Logit linearity: Binned residuals vs. fitted
                    Normality: QQ Plots
Test coefs          Measure SEs from 0,                          Measure SEs from 0,
                    P-value using t                              P-value using Normal

Confidence Interval for β1

Same principle applies for confidence interval...

$$\text{CI}(\Delta\text{logit}): \hat{\beta}_1 \pm z^* \cdot \widehat{\text{se}}(\hat{\beta}_1)$$

    confint(mcatModel) %>% round(2)

                 2.5 % 97.5 %
    (Intercept) -15.77  -3.04
    MCAT          0.09   0.44

But... β1 is the rate of change of the log odds, which is hard to understand. More common to report a CI for the odds ratio ($e^{\beta_1}$).

$$\text{CI}(\text{OR}): \left(e^{\beta_1^{(\text{lwr})}},\; e^{\beta_1^{(\text{upr})}}\right)$$
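A sketch of the β̂1 ± z* · se formula applied to the rounded MCAT estimate and standard error reported earlier (0.246 and 0.089), followed by exponentiation to the odds-ratio scale. As far as I know, confint() on a glm reports profile-likelihood intervals rather than this Wald interval, which would explain why the limits below differ slightly from the confint() output; confint.default() is the base-R function that gives the Wald version.

```r
# Rounded values from the coefficient table for the MCAT slope
estimate  <- 0.246
std_error <- 0.089

z_star  <- qnorm(0.975)  # critical value for 95% confidence
wald_ci <- estimate + c(-1, 1) * z_star * std_error  # CI for the change in log odds
or_ci   <- exp(wald_ci)                              # CI for the odds ratio

print(round(wald_ci, 2))
print(round(or_ci, 2))
```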

In R...

    confint(mcatModel) %>% round(2)

                 2.5 % 97.5 %
    (Intercept) -15.77  -3.04
    MCAT          0.09   0.44

    confint(mcatModel) %>% exp() %>% round(2)

                2.5 % 97.5 %
    (Intercept)  0.00   0.05
    MCAT         1.09   1.55

“We are 95% confident that the odds (not the probability) of admittance increase by a factor of (are multiplied by) between 1.09 and 1.55 for each additional point of MCAT score”

Linear vs. Logistic Regression

Goal                   Linear                                       Logistic
Estimate coefs         Minimize SSE                                 Maximize Likelihood
Check conditions       Linearity/Const. var.: Residual vs. Fitted   Logit linearity: Binned residuals vs. fitted
                       Normality: QQ Plots
Test coefs             Measure SEs from 0,                          Measure SEs from 0,
                       P-value using t                              P-value using Normal
Intervals for Params   Slope: β1                                    Odds Ratio: e^β1

CIs at specific values

Arguably easier to interpret: CIs for π at a few specific X values

    source("http://colindawson.net/stat213/code/helper_functions.R")
    ## functions made with regular makeFun() give point values but not
    ## intervals with logistic models, so I wrote a custom function
    f.hat <- makeFun.logistic(mcatModel)
    quartiles <- quantile(~MCAT, data = MedGPA)
    f.hat(MCAT = quartiles, interval = "confidence", level = 0.95) %>% round(2)

         MCAT pi.hat  lwr  upr
    0%     18   0.01 0.00 0.26
    25%    34   0.41 0.26 0.58
    50%    36   0.54 0.39 0.67
    75%    39   0.71 0.52 0.84
    100%   48   0.96 0.72 0.99

Interpretation: “We are 95% confident that the probability of acceptance for students with an MCAT score of 39 is between 52% and 84%”
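For readers without access to the helper file, one common base-R approach to intervals for π is to build the interval on the logit scale with predict(..., se.fit = TRUE) and then transform the endpoints with plogis(). This is a sketch of that general technique on simulated data; it is not necessarily how makeFun.logistic() is implemented, and the variable names and coefficients are made up.

```r
set.seed(213)  # arbitrary seed for a reproducible toy data set
n <- 300
score    <- round(rnorm(n, mean = 35, sd = 5))
accepted <- rbinom(n, size = 1, prob = plogis(-8 + 0.23 * score))
fit <- glm(accepted ~ score, family = binomial)

# Predictor values at which we want an interval for pi
new_scores <- data.frame(score = c(30, 35, 40))

# Predictions and standard errors on the log-odds (link) scale
link_pred <- predict(fit, newdata = new_scores, type = "link", se.fit = TRUE)
z_star <- qnorm(0.975)  # 95% confidence

# Build the interval on the logit scale, then map each endpoint to a probability
result <- data.frame(
  score  = new_scores$score,
  pi.hat = plogis(link_pred$fit),
  lwr    = plogis(link_pred$fit - z_star * link_pred$se.fit),
  upr    = plogis(link_pred$fit + z_star * link_pred$se.fit))

print(round(result, 2))
```

Working on the logit scale first keeps both endpoints inside (0, 1), which a symmetric interval built directly on the probability scale would not guarantee.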

Confidence Bands

    ## Also requires sourcing helper_functions.R
    ## Can supply level=, xlim=, xlab= and ylab= to customize graph
    plot.logistic.bands(mcatModel)

[Figure: fitted probability curve with confidence bands; x-axis: MCAT (20 to 45), y-axis: P(Acceptance = 1) (0.0 to 0.8)]

Linear vs. Logistic Regression

Goal                         Linear                                       Logistic
Estimate coefs               Minimize SSE                                 Maximize Likelihood
Check conditions             Linearity/Const. var.: Residual vs. Fitted   Logit linearity: Binned residuals vs. fitted
                             Normality: QQ Plots
Test coefs                   Measure SEs from 0,                          Measure SEs from 0,
                             P-value using t                              P-value using Normal
Intervals for Params         Slope: β1                                    Odds Ratio: e^β1
Intervals for Fitted Vals.   Confidence and prediction intervals          Confidence intervals only
