1 Workshop 3 Building from Linear Models to Generalised Linear Models Part 2: GLMs
2 2 What are linear models? Stage 1: response continuous - General Linear Model Procedure Response Predictors Single linear regression y ~ x Continuous 1 Continuous/discrete Two-sample t-test y ~ x Continuous 1 categorical (2 levels) One-way ANOVA y ~ x Continuous 1 categorical (2 or more levels) Two-way ANOVA y ~ x1*x2 Continuous 2 categorical (2 or more levels each) Stage 2: incl other types of response - Generalised Linear Model
3 3 Key points T-tests, ANOVA and regression are fundamentally the same Collectively ‘general linear model’ Can be extended to ‘generalised linear model’ for different types of response
4 What are Generalised linear models? • Response can be continuous, discrete or categorical • Predictors can be continuous or categorical • Type of response determines type of GLM
5 Two types of GLM Counts (“Poisson”) number of cases, individuals Binary (“Binomial”) true-false, yes-no, proportions These are the two types of GLM we will consider
6 GLM • We have to specify which one lm(y ~ x) Same as: glm(y ~ x, family = gaussian) glm(y ~ x, family = poisson) glm(y ~ x, family = binomial)
7 LM – GLM summary • Linear models (lm) – Response assumed normal – Residuals assumed normal and homoscedastic – Residual SS measures fit • GLM – Response can be other than normal – Variance need not be constant Levels of competence Minimum required – Deviance measures fit Above the minimum Advanced
8 GLM - poisson • Response is a count • Explanatory is continuous, categorical or combination • If explanatory is category then you have a chi-squared
9 Example: Poisson GLM • Does proximity to a nuclear power plant alter the risk of cancer? • Number of cases of cancer reported by a clinic by its distance from nuclear plant (note not good epidemiology) > str(cases) Each row is a clinic 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...
10 Example: Poisson GLM ggplot(cases, aes(x = distance, y = cancers)) + geom_point() Each row is a clinic > str(cases) 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...
11 Example: Poisson GLM Selecting and applying mod <- glm(cancers ~ distance, data = cases, family = poisson) summary(mod) Call: glm(formula = cancers ~ distance, family = poisson, data = cases) Deviance Residuals: Min 1Q Median 3Q Max -1.8423 -0.7437 -0.4834 0.4206 1.8933 Estimate is –’ve Coefficients: Estimate Std. Error z value Pr(>|z|) Cancers decrease with (Intercept) 1.019169 0.308709 3.301 0.000962 *** distance -0.021495 0.005034 -4.270 1.96e-05 *** increasing distance... --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 and significantly (Dispersion parameter for poisson family taken to be 1) Estimate is logged! Null deviance: 54.522 on 42 degrees of freedom Residual deviance: 31.790 on 41 degrees of freedom “Estimate is on the scale of the link function” AIC: 78.163 Number of Fisher Scoring iterations: 5
12 Example: Poisson GLM mod <- glm(cancers ~ distance, data = cases, family = poisson) anova(mod,test=”Chisq”) anova() on the object more useful for models with more than Analysis of Deviance Table one explanatory variable Model: poisson, link: log Response: cancers lm object Terms added sequentially (first to last) anova(mod, test = ”F”) summary(mod) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 42 54.522 distance 1 22.732 41 31.790 1.862e-06 *** --- glm object Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 anova(mod, test = ”Chisq”) summary(mod)
13 Example: Poisson GLM ggplot(cases, aes(x = distance, y = cancers)) + geom_point()+ stat_smooth(method="glm", method.args = list(family = "poisson"), col="black",se=FALSE) + xlab("Distance (km)") + Need to antilog estimates to understand ylab("Number of Cancer cases per year")+ theme_bw() > exp(coef(mod)) (Intercept) distance 2.770892 0.978734 2.77 cancer cases are expected at 0km For every km from the nuclear plant, a clinic reports 0.97 of the cases reported at previous km
14 GLM - Binomial • Response is a binary variable 1 – (0 and 1, Y and N, T and F, red and blue) • Explanatory is continuous or combination 1. actually, response can be in several equivalent formats but here we use 0 and 1
15 Example: Binomial GLM • Do fibrinogen levels affect whether a person’s erythrocyte sedimentation rate (ESR) is classed is healthy or unhealthy • ESR classed as 0 (healthy) or 1 (ill) by fibrinogen levels • aka logistic regression > str(plasma) Each row is an individual 'data.frame': 32 obs. of 2 variables: $ fibrinogen: num 2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 3.15 2.6 ... $ ESR : num 0 0 0 0 0 0 0 0 0 0 ...
16 Example: Binomial GLM ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()
17 Example: Binomial GLM mod2 <- glm(ESR ~ fibrinogen,data = plasma, family = binomial) summary(mod2) Call: glm(formula = ESR ~ fibrinogen, family = binomial, data = plasma) Deviance Residuals: Min 1Q Median 3Q Max -0.9298 -0.5399 -0.4382 -0.3356 2.4794 Estimate is +’ve Coefficients: Estimate Std. Error z value Pr(>|z|) Probability of ‘ill’ (1) is (Intercept) -6.8451 2.7703 -2.471 0.0135 * fibrinogen 1.8271 0.9009 2.028 0.0425 * higher as fibrinogen --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 increases (Dispersion parameter for binomial family taken to be 1) Estimate is on logit scale! Null deviance: 30.885 on 31 degrees of freedom Residual deviance: 24.840 on 30 degrees of freedom “Estimate is on the scale of the link function” AIC: 28.84 Number of Fisher Scoring iterations: 5
18 Example: Binomial GLM Estimates are odds ratios and hard to ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()+ understand, instead: stat_smooth(method = "glm", Use predict() to understand method.args = list(family = "binomial"), col = "black",se = FALSE) + fibv <- seq(1, 6, 1) xlab("Fibrinogen") + newdat <- data.frame(fibrinogen = fibv) ylab("ESR class")+ newdat$pr <- predict(mod2, type = "response", newdata = newdat) theme_bw() newdat fibrinogen pr 1 1 0.006574281 2 2 0.039509110 3 3 0.203618144 4 4 0.613784514 5 5 0.908072941 6 6 0.983974363
19 LM – GLM summary • Linear models (lm) – response assumed continuous; Residuals assumed normal and homoscedastic • GLM – Response can be other than normal; variance need not be constant – Specify the family • poisson for counts • binomial for binary outcome mod <- glm( response ~ explanatory1 , data = mydata , family = family ) summary(mod) mod <- glm( response ~ explanatory1 * explanatory2 , data = mydata, family = family ) summary(mod) anova(mod,test = “Chisq”)
Recommend
More recommend