Workshop 11.1: Generalized linear models Murray Logan 26-011-2013 Other data types • Binary - only 0 and 1 (dead/alive) (present/absent) • Proportional abundance - range from 0 to 100 • Count data - min of zero General linear models General linear models Residual i = y i − E ( Y i ) General linear models General linear models E ( Y ) = β 0 + β 1 x 1 + ... + β p x p + ε, ε ∼ Dist ( ... ) � �� � � �� � Link function Systematic General linear models E ( Y ) = β 0 + β 1 x 1 + ... + β p x p + e � �� � � �� � � �� � Random Link function Systematic • Random component. E ( Y i ) ∼ N ( µ i , σ 2 ) • Systematic component β 0 + β 1 x 1 + ... + β p x p 1
• Link function µ = β 0 + β 1 x 1 + ... + β p x p Generalized linear models g ( µ ) = β 0 + β 1 x 1 + ... + β p x p + e � �� � ���� � �� � Random Link function Systematic • Random component. A nominated distribution (Gaussian, Pois- son, Binomial, Gamma, Beta,. . . ) • Systematic component • Link function Generalized linear models Response variable Probability distribution Link function Model name Continuous measurements Gaussian identity: µ Linear regression Binary Binomial � � π logit: log 1 − π Logistic regression � α + β.X � 2 Z 2 � 1 − 1 probit: exp dZ √ 2 π −∞ Probit regression 2
Complimentary log-log: log ( − log (1 − π )) Logistic regression Counts Poisson log: logµ Poisson regressionlog-linear model Negative binomial � � µ log µ + θ Negative biomial regression Quasi-poisson logµ Poisson regression OLS Maximum Likelihood Logistic regression Binary data � � π Link: log 1 − π Transform scale of linear predictor ( −∞ , ∞ ) into that of the response (0,1) Logistic regression � n � p x (1 − p ) n − x E ( Y ) = x Dispersion Spread assumed to be equal to mean. ( φ = 1) 3
Dispersion Over-dispersion Sample more varied than expected from its mean • variability due to other unmeasured influences – quasi- model • due to more zeros than expected – zero-inflated model Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") • Explore residuals plot (dat.glmL) Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") • Explore residuals 4
plot (dat.glmL) • Explore goodness of fit – Pearson’s χ 2 residuals – Deviance ( G 2 ) Logistic regression Binary data • Explore model parameters summary (data.glmL } Slope parameter is on log odds-ratio scale Logistic regression Binary data • Explore model parameters summary (data.glmL } Slope parameter is on log odds-ratio scale • Quasi R 2 � � deviance quasiR 2 = 1 − null deviance Logistic regression Binary data • LD50 LD 50 = − intercept slope 5
Worked Example polis <- read.table ("../data/polis.csv", header = T, sep = ",", strip.white = T) head (polis) ISLAND RATIO PA 1 Bota 15.41 1 2 Cabeza 5.63 1 3 Cerraja 25.92 1 4 Coronadito 15.17 0 5 Flecha 13.04 1 6 Gemelose 18.85 0 plot (PA ~ RATIO, data = polis) abline ( lm (PA ~ RATIO, data = polis)) with (polis, lines ( lowess (PA ~ RATIO))) polis.glm <- glm (PA ~ RATIO, family = binomial, data = polis) par (mfrow = c (2, 3)) plot (polis.glm) ## Check the model for lack of fit via: Pearson ## chisq polis.resid <- sum ( resid (polis.glm, type = "pearson")^2) 1 - pchisq (polis.resid, polis.glm$df.resid) [1] 0.5715 # No evidence of a lack of fit # Deviance 1 - pchisq (polis.glm$deviance, polis.glm$df.resid) [1] 0.6514 # No evidence of a lack of fit # Estimate dispersal polis.resid/polis.glm$df.resid 6
Figure 1: plot of chunk unnamed-chunk-8 7
[1] 0.9019 # OR polis.glm$deviance/polis.glm$df.resid [1] 0.8365 # Number of zeros polis.tab <- table (polis$PA == 0) polis.tab/ sum (polis.tab) FALSE TRUE 0.5263 0.4737 summary (polis.glm) Call: glm(formula = PA ~ RATIO, family = binomial, data = polis) Deviance Residuals: Min 1Q Median 3Q Max -1.607 -0.638 0.237 0.433 2.099 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.606 1.695 2.13 0.033 RATIO -0.220 0.101 -2.18 0.029 (Intercept) * RATIO * --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 26.287 on 18 degrees of freedom Residual deviance: 14.221 on 17 degrees of freedom AIC: 18.22 Number of Fisher Scoring iterations: 6 8
exp (-0.296) [1] 0.7438 # R2 1 - (polis.glm$deviance/polis.glm$null) [1] 0.459 anova (polis.glm, test = "Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: PA Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 18 26.3 RATIO 1 12.1 17 14.2 0.00051 NULL RATIO *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 par (mar = c (4, 5, 0, 0)) plot (PA ~ RATIO, data = polis, type = "n", ann = F, axes = F) points (PA ~ RATIO, data = polis, pch = 16) xs <- seq ( min (polis$RATIO), max (polis$RATIO), len = 1000) ys <- predict (polis.glm, newdata = data.frame (RATIO = xs), type = "link", se = T) ys$lwr <- polis.glm$family$ linkinv (ys$fit - ys$se.fit) ys$upr <- polis.glm$family$ linkinv (ys$fit + ys$se.fit) ys$fit <- polis.glm$family$ linkinv (ys$fit) 9
lines (ys$fit ~ xs, col = "black") lines (ys$lwr ~ xs, col = "black", lty = 2) lines (ys$upr ~ xs, col = "black", lty = 2) axis (1) mtext ("Island perimeter to area ratio", 1, cex = 1.5, line = 3) axis (2, las = 2) mtext ( expression ( paste ( italic (Uta), " lizard presence/absence")), 2, cex = 1.5, line = 3) box (bty = "l") # LD50 ld50 <- -polis.glm$coef[1]/polis.glm$coef[2] Poisson regression p ( Y i ) = e − λ λ x x ! log ( µ ) = β 0 + β 1 x i + ... + β p x p Dispersion Spread assumed to be equal to mean. ( φ = 1) Dispersion Over-dispersion Sample more varied than expected from its mean • variability due to other unmeasured influences – quasi-poisson model • clumpiness – negative binomial model • due to more zeros than expected – zero-inflated model 10
Figure 2: plot of chunk unnamed-chunk-8 11
Figure 3: Figure 4: 12
Worked Example data.pois <- read.table ("../data/data.pois.csv", header = T, sep = ",", strip.white = T) head (data.pois) y x 1 1 1.025 2 2 2.697 3 3 3.626 4 2 4.949 5 4 6.025 6 8 6.254 library (car) scatterplot (y ~ x, data = data.pois) scatterplot (y ~ x, data = data.pois, log = "y") rug ( jitter (data.pois$y), side = 2) hist (data.pois$y) boxplot (data.pois$y, horizontal = TRUE) rug ( jitter (data.pois$y), side = 1) # plot(y~x, data.pois, log=’y’) with(data.pois, # lines(lowess(y~x))) library(car) # scatterplot(y~x,data=data.pois, log=’y’) # rug(jitter(data.pois$y), side=2) # Zero inflation proportion of 0’s in the data data.pois.tab <- table (data.pois$y == 0) data.pois.tab/ sum (data.pois.tab) FALSE 1 13
Figure 5: plot of chunk unnamed-chunk-9 14
Figure 6: plot of chunk unnamed-chunk-9 15
Figure 7: plot of chunk unnamed-chunk-9 16
Figure 8: plot of chunk unnamed-chunk-9 17
# proportion of 0’s expected from a Poisson # distribution mu <- mean (data.pois$y) cnts <- rpois (1000, mu) data.pois.tab <- table (cnts == 0) data.pois.tab/ sum (data.pois.tab) FALSE TRUE 0.997 0.003 # fit model data.pois.glm <- glm (y ~ x, data = data.pois, family = "poisson") data.pois.glm <- glm (y ~ x, data = data.pois, family = poisson (link = "log")) # Model validation par (mfrow = c (2, 3)) plot (data.pois.glm, ask = F, which = 1:6) # goodness of fit data.pois.resid <- sum ( resid (data.pois.glm, type = "pearson")^2) 1 - pchisq (data.pois.resid, data.pois.glm$df.resid) [1] 0.4897 # Deviance based 1 - pchisq (data.pois.glm$deviance, data.pois.glm$df.resid) [1] 0.5076 data.pois.glmG <- glm (y ~ x, data = data.pois, family = "gaussian") AIC (data.pois.glm, data.pois.glmG) df AIC data.pois.glm 2 90.32 data.pois.glmG 3 98.28 # Poisson deviance data.pois.glm$deviance 18
Figure 9: plot of chunk unnamed-chunk-9 19
[1] 17.23 # Gaussian deviance data.pois.glmG$deviance [1] 118.1 data.pois.resid/data.pois.glm$df.resid [1] 0.9717 data.pois.glm$deviance/data.pois.glm$df.resid [1] 0.957 # Or alternatively, via Pearson’s residuals Resid <- resid (data.pois.glm, type = "pearson") sum (Resid^2)/( nrow (data.pois) - length ( coef (data.pois.glm))) [1] 0.9717 summary (data.pois.glm) Call: glm(formula = y ~ x, family = poisson(link = "log"), data = data.pois) Deviance Residuals: Min 1Q Median 3Q Max -1.635 -0.705 0.044 0.562 2.046 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.5600 0.2539 2.21 0.027 x 0.1115 0.0186 6.00 2e-09 (Intercept) * x *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 20
Recommend
More recommend