University of Copenhagen, Department of Biostatistics

Logistic regression

Susanne Rosthøj
Section of Biostatistics
Institute of Public Health
University of Copenhagen
sr@biostat.ku.dk

Outline

• Risk, odds and odds-ratio
• Simple logistic regression:
  • One binary explanatory variable
  • One categorical
  • One quantitative
• Multiple logistic regression:
  • Two binary + interaction

Example 1: gender and CHD

Is the risk of CHD different for males and females?

             CHD = 0          CHD = 1         Total
Females      616 (85.6%)      104 (14.4%)       720
Males        479 (74.5%)      164 (25.5%)       643
Total       1095 (80.3%)      268 (19.7%)      1363

The hypothesis of no difference in risk for the genders is rejected (p<0.0001, Chi-square test).
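The test quoted above can be reproduced from the four cell counts alone. The following is a minimal sketch; the object names `chd_table` and `test` are chosen here for illustration and are not from the slides.

```r
# 2x2 table of counts copied from the slide (rows: sex, columns: CHD status)
chd_table <- matrix(c(616, 104,
                      479, 164),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(sex = c("Female", "Male"),
                                    chd = c("0", "1")))

# Row percentages, matching the percentages shown in the table
round(100 * prop.table(chd_table, margin = 1), 1)

# Pearson chi-square test of no association between sex and CHD
# (chisq.test applies Yates' continuity correction to 2x2 tables by default)
test <- chisq.test(chd_table)
test$p.value
```

The slides do not say whether the continuity correction was used; with these counts the p-value is far below 0.0001 either way.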

Quantifying the difference

Risk of CHD for males: $p_1 \approx 164/643 = 0.26$
Risk of CHD for females: $p_2 \approx 104/720 = 0.14$

Odds of CHD for males: $p_1/(1 - p_1) \approx 164/479 = 0.34$ ($\approx 1:3$)
Odds of CHD for females: $p_2/(1 - p_2) \approx 104/616 = 0.17$ ($\approx 1:6$)

Quantification of the difference in risk:

Absolute Risk Reduction (ARR): $|p_1 - p_2| \approx 0.11$
Risk Ratio (RR): $p_1/p_2 \approx 1.77$
Odds-ratio (OR): $\dfrac{p_1/(1 - p_1)}{p_2/(1 - p_2)} \approx 2.03$

When $p_1$ and $p_2$ are small (<0.1): RR $\approx$ OR.

We have seen that there is a difference for males and females: $p_1 \neq p_2$, i.e. ARR $> 0$, RR $\neq 1$, OR $\neq 1$.
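As a quick check of the arithmetic, the three measures can be computed directly from the counts in Example 1. This is a small sketch; all variable names are illustrative.

```r
# Counts from the 2x2 table in Example 1
chd_m <- 164; n_m <- 643   # males with CHD, all males
chd_f <- 104; n_f <- 720   # females with CHD, all females

p1 <- chd_m / n_m          # risk for males
p2 <- chd_f / n_f          # risk for females

odds1 <- p1 / (1 - p1)     # equals 164 / 479
odds2 <- p2 / (1 - p2)     # equals 104 / 616

arr        <- abs(p1 - p2)     # absolute risk difference
rr         <- p1 / p2          # risk ratio
odds_ratio <- odds1 / odds2    # odds ratio

round(c(ARR = arr, RR = rr, OR = odds_ratio), 2)
```

Because the risks here are well above 0.1, the OR is noticeably larger than the RR.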

The purpose of a logistic regression analysis

Relate a binary outcome variable, e.g.

$$Y_i = \begin{cases} 1 & \text{if } i \text{ developed CHD} \\ 0 & \text{if } i \text{ did not develop CHD} \end{cases}$$

to explanatory variables for individual $i$.

In logistic regression we formulate models for the log-odds:

$$\log\left(\frac{p_i}{1 - p_i}\right)$$

where $p_i = P(Y_i = 1)$ is the risk of CHD for individual $i$.

Odds and log-odds

[Figure: two panels showing the odds p/(1−p) and the log-odds as functions of p, for p between 0 and 1.]
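A few numerical values give a feel for the two curves. The sketch below uses an arbitrary grid of probabilities; `qlogis()` and `plogis()` are base R's logit and inverse logit.

```r
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)

odds     <- p / (1 - p)   # always positive, equal to 1 at p = 0.5
log_odds <- log(odds)     # any real value, equal to 0 at p = 0.5

data.frame(p, odds, log_odds)

# qlogis() computes the log-odds directly; plogis() maps log-odds back to p
all.equal(log_odds, qlogis(p))
all.equal(plogis(log_odds), p)
```

The log-odds are unbounded in both directions, which is what makes them a convenient scale for a linear model.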

The logistic regression model

Explanatory variable:

$$\text{male}_i = \begin{cases} 0 & i \text{ is female} \\ 1 & i \text{ is male} \end{cases}$$

Model:

$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b \cdot \text{male}_i = \begin{cases} a & i \text{ is female} \\ a + b & i \text{ is male} \end{cases}$$

Determine $a$ and $b$ by hand.

The difference in log-odds between males and females is $b$ = (?)

Calculating OR using logistic regression

$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b \cdot \text{male}_i = \begin{cases} a & i \text{ is female} \\ a + b & i \text{ is male} \end{cases}$$

$$b = (a + b) - a = \log(\text{odds for males}) - \log(\text{odds for females}) = \log(\text{OR for males vs. females})$$

i.e. $\exp(b)$ = OR for males vs. females = ____

Now determine the OR of CHD for females vs. males.

Logistic regression in R

glm = Generalized Linear Model.

    > library(foreign)  # provides read.dbf
    > d <- read.dbf('framingham.dbf')
    > glm1 <- glm( chd01 ~ factor(sex), data=d, family=binomial )
    > summary( glm1 )

    Call:
    glm(formula = chd01 ~ factor(sex), family = binomial, data = d)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7674  -0.7674  -0.5586  -0.5586   1.9672

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -1.7789     0.1060 -16.780  < 2e-16 ***
    factor(sex)1   0.7070     0.1394   5.073 3.92e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1324.9  on 1361  degrees of freedom
      (43 observations deleted due to missingness)
    AIC: 1328.9

    Number of Fisher Scoring iterations: 4
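The same estimates can be obtained without the data file, because a model with one binary covariate is determined by the four cell counts. The sketch below fits a grouped binomial model to the counts from Example 1; the data frame `agg` and its column names are invented for this illustration.

```r
# Aggregated counts from the 2x2 table in Example 1
# (male = 0 for females, 1 for males)
agg <- data.frame(male   = c(0, 1),
                  chd    = c(104, 164),   # number with CHD
                  no_chd = c(616, 479))   # number without CHD

# Grouped binomial fit: the response is a two-column matrix (events, non-events)
fit <- glm(cbind(chd, no_chd) ~ male, data = agg, family = binomial)

coef(fit)                    # a (Intercept) and b (male) on the log-odds scale
se <- sqrt(diag(vcov(fit)))  # standard errors
se

# The same numbers by hand: a = log-odds for females, b = log(OR)
a_hand <- log(104 / 616)
b_hand <- log((164 / 479) / (104 / 616))
c(a = a_hand, b = b_hand)
```

The coefficients and standard errors should agree with the `summary( glm1 )` output to the printed precision. The deviances differ because grouped and individual-level fits use different saturated models.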

Finding OR and CI

    > # Estimates in terms of log-odds
    > coef( glm1 )
     (Intercept) factor(sex)1
      -1.7788561    0.7070219
    >
    > # OR's :
    > exp( coef( glm1 ) )
     (Intercept) factor(sex)1
       0.1688312    2.0279428
    >
    > # Confidence intervals :
    > exp( confint.default ( glm1 ) )
                     2.5 %    97.5 %
    (Intercept)  0.1371558 0.2078218
    factor(sex)1 1.5432055 2.6649413
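`confint.default` returns Wald intervals, i.e. estimate ± 1.96 × SE on the log-odds scale, which are then exponentiated. A minimal sketch of that calculation for the sex effect, using the estimate and the (rounded) standard error printed in the outputs above:

```r
# Estimate and standard error for factor(sex)1, copied from the R output
est <- 0.7070219
se  <- 0.1394      # rounded as printed by summary()

z <- qnorm(0.975)  # approximately 1.96

# Wald interval on the log-odds scale, then transformed to the OR scale
ci_log_or <- est + c(-1, 1) * z * se
ci_or     <- exp(ci_log_or)

c(OR = exp(est), lower = ci_or[1], upper = ci_or[2])
```

Small differences in the last digits relative to the printed interval come from the rounded standard error. Note that `confint()` (without `.default`) uses profile likelihood for `glm` fits and can give slightly different limits.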

Logistic regression with a quantitative variable

The model for log-odds is linear:

$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b \cdot \text{age}_i$$

Compare two individuals aged 51 and 50:

$$\text{OR} = \frac{\text{odds}_{51\text{ years}}}{\text{odds}_{50\text{ years}}}$$

$$\log(\text{OR}) = \log(\text{odds}_{51\text{ years}}) - \log(\text{odds}_{50\text{ years}}) = (a + 51 \cdot b) - (a + 50 \cdot b) = b$$

i.e. $\text{OR} = \exp(b) = \exp(0.066) = 1.068$.

Interpretation?

Exercise: Odds ratios

Determine the OR comparing two individuals with a difference in age of two years. Three years? Ten years?

Discuss how to assess whether the linear model is plausible.
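One way to approach the first part: for two individuals whose ages differ by k years, the log-odds differ by (a + (x + k)·b) − (a + x·b) = k·b, so the OR is exp(k·b) regardless of the starting age x. A small sketch using b = 0.066 from the previous slide; the function name `or_for_age_diff` is made up for this example.

```r
b <- 0.066   # age coefficient (log-OR per year), taken from the previous slide

# OR comparing two individuals whose ages differ by k years
or_for_age_diff <- function(k, b) exp(k * b)

k <- c(1, 2, 3, 10)
data.frame(years = k, OR = round(or_for_age_diff(k, b), 3))

# Multiplicativity: the 2-year OR is the 1-year OR squared
all.equal(or_for_age_diff(2, b), or_for_age_diff(1, b)^2)
```

The ORs multiply rather than add, which is a direct consequence of linearity on the log-odds scale.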

Risk of CHD according to the model

Predictions:

$$p_i = \frac{\exp(a + b \cdot \text{age}_i)}{1 + \exp(a + b \cdot \text{age}_i)}$$

[Figure: predicted risk p (from 0 to 1) plotted against age (from 0 to 150 years).]
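The prediction formula is the inverse of the logit transformation. The sketch below evaluates it for a few ages. The slope b = 0.066 is the value quoted earlier, but the intercept a is not given in the slides, so the value used here is purely hypothetical.

```r
# b is the age effect quoted in the slides; a is a made-up value for illustration
a <- -5.0
b <- 0.066

predict_risk <- function(age, a, b) {
  eta <- a + b * age            # linear predictor (log-odds)
  exp(eta) / (1 + exp(eta))     # inverse logit
}

ages <- c(30, 50, 70, 90)
risk <- predict_risk(ages, a, b)
round(risk, 3)

# plogis() is the built-in inverse logit and gives the same values
all.equal(risk, plogis(a + b * ages))
```

For a fitted `glm` object, `predict(fit, newdata, type = "response")` performs the same transformation using the estimated coefficients.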

The additive model

Consider the additive model:

$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b \times \text{male}_i + c \times \text{hypertension}_i$$

or put in tabular form:

log-odds     no hypertension     hypertension
females      a                   a + c
males        a + b               a + b + c

OR of CHD, hypertension vs no hypertension:

Males: $\exp(c)$
Females: $\exp(c)$

Interaction

Is there an interaction between gender and hypertension?

The interaction model:

log-odds     no hypertension     hypertension
females      a                   a + c
males        a + b               a + b + c + d

OR of CHD, hypertension vs no hypertension:

Males: ?
Females: ?

Exercise

On the following slides you find three model outputs. Study the outputs and fill in the blanks on this and the next slide.

We use model ____ to test whether there is an interaction between sex and hypertension.

No interaction, i.e. $d = 0$

Estimated interaction term ($d$) and SE: ____

Test statistic: Wald: $W = \dfrac{d}{\text{SE}} =$ ____, P = ____

Conclude: ____

Exercise cont.

We use model ____ to compute ORs of CHD, hypertension vs no hypertension, for each gender.

Males OR: ____
Females OR: ____

Model 1

    > glm1 <- glm( chd01 ~ factor(sex)*factor(hyper), data=d, family=binomial )
    > summary( glm1 )

    Call:
    glm(formula = chd01 ~ factor(sex) * factor(hyper), family = binomial, data = d)

    Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                  -2.5244     0.1867 -13.524  < 2e-16 ***
    factor(sex)1                  1.2147     0.2196   5.532 3.16e-08 ***
    factor(hyper)1                1.3812     0.2300   6.005 1.92e-09 ***
    factor(sex)1:factor(hyper)1  -0.6815     0.2977  -2.289   0.0221 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1271.7  on 1359  degrees of freedom
      (43 observations deleted due to missingness)
    AIC: 1279.7

    Number of Fisher Scoring iterations: 5

    > exp( coef( glm1 ) )
                    (Intercept)                factor(sex)1
                     0.08010336                  3.36922654
                 factor(hyper)1 factor(sex)1:factor(hyper)1
                     3.97957459                  0.50585702

Model 2

    > glm2 <- glm( chd01 ~ factor(sex) + factor(sex):factor(hyper), data=d, family=binomial)
    > summary( glm2 )

    Call:
    glm(formula = chd01 ~ factor(sex) + factor(sex):factor(hyper), family = binomial, data = d)

    Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                  -2.5244     0.1867 -13.524  < 2e-16 ***
    factor(sex)1                  1.2147     0.2196   5.532 3.16e-08 ***
    factor(sex)0:factor(hyper)1   1.3812     0.2300   6.005 1.92e-09 ***
    factor(sex)1:factor(hyper)1   0.6997     0.1890   3.701 0.000214 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1271.7  on 1359  degrees of freedom
      (43 observations deleted due to missingness)
    AIC: 1279.7

    Number of Fisher Scoring iterations: 5

    > exp( coef( glm2 ) )
                    (Intercept)                factor(sex)1
                     0.08010336                  3.36922654
    factor(sex)0:factor(hyper)1 factor(sex)1:factor(hyper)1
                     3.97957459                  2.01309573
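Models 1 and 2 are two parameterisations of the same fit (identical deviance and AIC), so the sex-specific ORs in Model 2 can be recovered from the Model 1 coefficients. The sketch below does this by hand and also recomputes the Wald test for the interaction. It assumes that sex = 1 codes males, which is consistent with the earlier output where `factor(sex)1` matched the log-OR for males vs. females; the variable names are illustrative.

```r
# Model 1 estimates copied from the output
# (assumed coding: sex = 1 is male, hyper = 1 is hypertension)
c_hat <- 1.3812    # factor(hyper)1
d_hat <- -0.6815   # factor(sex)1:factor(hyper)1
se_d  <- 0.2977    # standard error of the interaction term

# Wald test of no interaction (d = 0)
W <- d_hat / se_d
p_value <- 2 * pnorm(-abs(W))

# ORs of CHD, hypertension vs no hypertension, within each sex
or_females <- exp(c_hat)           # log-OR is c
or_males   <- exp(c_hat + d_hat)   # log-OR is c + d

c(W = W, P = p_value, OR_females = or_females, OR_males = or_males)
```

The value `c_hat + d_hat` reproduces the `factor(sex)1:factor(hyper)1` estimate in Model 2, which is why Model 2 is convenient for reading off the OR and its standard error separately for each sex.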