Workshop 11.1: Generalized linear models Murray Logan 26-011-2013 - PDF document

Workshop 11.1: Generalized linear models Murray Logan 26-011-2013 Other data types • Binary - only 0 and 1 (dead/alive) (present/absent) • Proportional abundance - range from 0 to 100 • Count data - min of zero General linear models General linear models Residual i = y i − E ( Y i ) General linear models General linear models E ( Y ) = β 0 + β 1 x 1 + ... + β p x p + ε, ε ∼ Dist ( ... ) � �� Link function Systematic General linear models E ( Y ) = β 0 + β 1 x 1 + ... + β p x p + e � �� Random Link function Systematic • Random component. E ( Y i ) ∼ N ( µ i , σ 2 ) • Systematic component β 0 + β 1 x 1 + ... + β p x p 1

• Link function µ = β 0 + β 1 x 1 + ... + β p x p Generalized linear models g ( µ ) = β 0 + β 1 x 1 + ... + β p x p + e � �� Random Link function Systematic • Random component. A nominated distribution (Gaussian, Pois- son, Binomial, Gamma, Beta,. . . ) • Systematic component • Link function Generalized linear models Response variable Probability distribution Link function Model name Continuous measurements Gaussian identity: µ Linear regression Binary Binomial � � π logit: log 1 − π Logistic regression � α + β.X � 2 Z 2 � 1 − 1 probit: exp dZ √ 2 π −∞ Probit regression 2

Complimentary log-log: log ( − log (1 − π )) Logistic regression Counts Poisson log: logµ Poisson regressionlog-linear model Negative binomial � � µ log µ + θ Negative biomial regression Quasi-poisson logµ Poisson regression OLS Maximum Likelihood Logistic regression Binary data � � π Link: log 1 − π Transform scale of linear predictor ( −∞ , ∞ ) into that of the response (0,1) Logistic regression � n � p x (1 − p ) n − x E ( Y ) = x Dispersion Spread assumed to be equal to mean. ( φ = 1) 3

Dispersion Over-dispersion Sample more varied than expected from its mean • variability due to other unmeasured influences – quasi- model • due to more zeros than expected – zero-inflated model Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") • Explore residuals plot (dat.glmL) Logistic regression Binary data • Fit model dat.glmL <- glm (y ~ x, data = dat, family = "binomial") • Explore residuals 4

plot (dat.glmL) • Explore goodness of fit – Pearson’s χ 2 residuals – Deviance ( G 2 ) Logistic regression Binary data • Explore model parameters summary (data.glmL } Slope parameter is on log odds-ratio scale Logistic regression Binary data • Explore model parameters summary (data.glmL } Slope parameter is on log odds-ratio scale • Quasi R 2 � � deviance quasiR 2 = 1 − null deviance Logistic regression Binary data • LD50 LD 50 = − intercept slope 5

Worked Example polis <- read.table ("../data/polis.csv", header = T, sep = ",", strip.white = T) head (polis) ISLAND RATIO PA 1 Bota 15.41 1 2 Cabeza 5.63 1 3 Cerraja 25.92 1 4 Coronadito 15.17 0 5 Flecha 13.04 1 6 Gemelose 18.85 0 plot (PA ~ RATIO, data = polis) abline ( lm (PA ~ RATIO, data = polis)) with (polis, lines ( lowess (PA ~ RATIO))) polis.glm <- glm (PA ~ RATIO, family = binomial, data = polis) par (mfrow = c (2, 3)) plot (polis.glm) ## Check the model for lack of fit via: Pearson ## chisq polis.resid <- sum ( resid (polis.glm, type = "pearson")^2) 1 - pchisq (polis.resid, polis.glm$df.resid) [1] 0.5715 # No evidence of a lack of fit # Deviance 1 - pchisq (polis.glm$deviance, polis.glm$df.resid) [1] 0.6514 # No evidence of a lack of fit # Estimate dispersal polis.resid/polis.glm$df.resid 6

Figure 1: plot of chunk unnamed-chunk-8 7

[1] 0.9019 # OR polis.glm$deviance/polis.glm$df.resid [1] 0.8365 # Number of zeros polis.tab <- table (polis$PA == 0) polis.tab/ sum (polis.tab) FALSE TRUE 0.5263 0.4737 summary (polis.glm) Call: glm(formula = PA ~ RATIO, family = binomial, data = polis) Deviance Residuals: Min 1Q Median 3Q Max -1.607 -0.638 0.237 0.433 2.099 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.606 1.695 2.13 0.033 RATIO -0.220 0.101 -2.18 0.029 (Intercept) * RATIO * --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 26.287 on 18 degrees of freedom Residual deviance: 14.221 on 17 degrees of freedom AIC: 18.22 Number of Fisher Scoring iterations: 6 8

exp (-0.296) [1] 0.7438 # R2 1 - (polis.glm$deviance/polis.glm$null) [1] 0.459 anova (polis.glm, test = "Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: PA Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 18 26.3 RATIO 1 12.1 17 14.2 0.00051 NULL RATIO *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 par (mar = c (4, 5, 0, 0)) plot (PA ~ RATIO, data = polis, type = "n", ann = F, axes = F) points (PA ~ RATIO, data = polis, pch = 16) xs <- seq ( min (polis$RATIO), max (polis$RATIO), len = 1000) ys <- predict (polis.glm, newdata = data.frame (RATIO = xs), type = "link", se = T) ys$lwr <- polis.glm$family$ linkinv (ys$fit - ys$se.fit) ys$upr <- polis.glm$family$ linkinv (ys$fit + ys$se.fit) ys$fit <- polis.glm$family$ linkinv (ys$fit) 9

lines (ys$fit ~ xs, col = "black") lines (ys$lwr ~ xs, col = "black", lty = 2) lines (ys$upr ~ xs, col = "black", lty = 2) axis (1) mtext ("Island perimeter to area ratio", 1, cex = 1.5, line = 3) axis (2, las = 2) mtext ( expression ( paste ( italic (Uta), " lizard presence/absence")), 2, cex = 1.5, line = 3) box (bty = "l") # LD50 ld50 <- -polis.glm$coef[1]/polis.glm$coef[2] Poisson regression p ( Y i ) = e − λ λ x x ! log ( µ ) = β 0 + β 1 x i + ... + β p x p Dispersion Spread assumed to be equal to mean. ( φ = 1) Dispersion Over-dispersion Sample more varied than expected from its mean • variability due to other unmeasured influences – quasi-poisson model • clumpiness – negative binomial model • due to more zeros than expected – zero-inflated model 10

Figure 3: Figure 4: 12

Worked Example data.pois <- read.table ("../data/data.pois.csv", header = T, sep = ",", strip.white = T) head (data.pois) y x 1 1 1.025 2 2 2.697 3 3 3.626 4 2 4.949 5 4 6.025 6 8 6.254 library (car) scatterplot (y ~ x, data = data.pois) scatterplot (y ~ x, data = data.pois, log = "y") rug ( jitter (data.pois$y), side = 2) hist (data.pois$y) boxplot (data.pois$y, horizontal = TRUE) rug ( jitter (data.pois$y), side = 1) # plot(y~x, data.pois, log=’y’) with(data.pois, # lines(lowess(y~x))) library(car) # scatterplot(y~x,data=data.pois, log=’y’) # rug(jitter(data.pois$y), side=2) # Zero inflation proportion of 0’s in the data data.pois.tab <- table (data.pois$y == 0) data.pois.tab/ sum (data.pois.tab) FALSE 1 13

# proportion of 0’s expected from a Poisson # distribution mu <- mean (data.pois$y) cnts <- rpois (1000, mu) data.pois.tab <- table (cnts == 0) data.pois.tab/ sum (data.pois.tab) FALSE TRUE 0.997 0.003 # fit model data.pois.glm <- glm (y ~ x, data = data.pois, family = "poisson") data.pois.glm <- glm (y ~ x, data = data.pois, family = poisson (link = "log")) # Model validation par (mfrow = c (2, 3)) plot (data.pois.glm, ask = F, which = 1:6) # goodness of fit data.pois.resid <- sum ( resid (data.pois.glm, type = "pearson")^2) 1 - pchisq (data.pois.resid, data.pois.glm$df.resid) [1] 0.4897 # Deviance based 1 - pchisq (data.pois.glm$deviance, data.pois.glm$df.resid) [1] 0.5076 data.pois.glmG <- glm (y ~ x, data = data.pois, family = "gaussian") AIC (data.pois.glm, data.pois.glmG) df AIC data.pois.glm 2 90.32 data.pois.glmG 3 98.28 # Poisson deviance data.pois.glm$deviance 18

[1] 17.23 # Gaussian deviance data.pois.glmG$deviance [1] 118.1 data.pois.resid/data.pois.glm$df.resid [1] 0.9717 data.pois.glm$deviance/data.pois.glm$df.resid [1] 0.957 # Or alternatively, via Pearson’s residuals Resid <- resid (data.pois.glm, type = "pearson") sum (Resid^2)/( nrow (data.pois) - length ( coef (data.pois.glm))) [1] 0.9717 summary (data.pois.glm) Call: glm(formula = y ~ x, family = poisson(link = "log"), data = data.pois) Deviance Residuals: Min 1Q Median 3Q Max -1.635 -0.705 0.044 0.562 2.046 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.5600 0.2539 2.21 0.027 x 0.1115 0.0186 6.00 2e-09 (Intercept) * x *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 20

Workshop 11.1: Generalized linear models Murray Logan 26-011-2013 - PDF document

Workshop 11.1: Generalized linear models Murray Logan 26-011-2013 Other data types Binary - only 0 and 1 (dead/alive) (present/absent) Proportional abundance - range from 0 to 100 Count data - min of zero General linear models

Overview of logistic regression Richard Erickson Instructor DataCamp Generalized Linear Models

Limitations of linear models Richard Erickson Instructor DataCamp Generalized Linear Models in

Workshop 11.2a: Generalized Linear Mixed Effects Models (GLMM) Murray Logan February 7, 2017

Introduction to the R Statistical Computing Environment Linear and Generalized Linear Models in R

Generalized linear models Christopher F Baum EC 823: Applied Econometrics Boston College, Spring

Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik

Introduction to General and Generalized Linear Models Generalized Linear Models - part I Henrik

Multiple logistic regression Richard Erickson Instructor DataCamp Generalized Linear Models in

Introduction to General and Generalized Linear Models Generalized Linear Models - part III Henrik

Generalized Nonlinear Models gnm : a Package for Generalized Nonlinear Models Same form as

Generalized Additive Models September 10, 2019 Generalized Additive Models September 10, 2019 1

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Introduction to Data Science: Logistic 0 1 1 according to a data fit criterion. account

Workshop 7: (Generalized) Linear models Murray Logan July 19, 2017 Table of contents 1

Workshop 3 Building from Linear Models to Generalised Linear Models Part 2: GLMs 2 2 What are

Workshop 11.2a: Generalized Linear Mixed Effects Models (GLMM) Murray Logan 07 Feb 2017

Outline Background on dose response (concentration response) Different types of

Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft Julian Hammer

Thinking Fast and Slow with Deep Learning and Tree Search Thomas Anthony, Zheng Tian, and David

A Higher Structure Identity Principle Dimitris Tsementzis (cww B. Ahrens, P. North, M. Shulman)

Dose-response modelling using R Christian Ritz Faculty of Life Sciences, University of

11/1/2017 Environmental Risk Assessments at the Pest Management Regulatory Agency Presented at

Inter-MCU-Platform Hardware Analysis Towards a Clean-slate Timer-API for RIOT-OS INET Seminar /

Measurements on P2 and P3 FE ASIC and Experience of P2 FE ASIC in ProtoDUNE-SP Shanshan Gao on