lecture 3
play

Lecture 3 Residual Analysis + Generalized Linear Models Colin - PowerPoint PPT Presentation

Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2017 1 Residual Analysis 2 3 Atmospheric CO 2 (ppm) from Mauna Loa 360 co2 350 1988 1992 1996 date Where to start? Well, it looks like stuff is going up on


  1. Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2017 1

  2. Residual Analysis 2

  3. 3 Atmospheric CO 2 (ppm) from Mauna Loa 360 co2 350 1988 1992 1996 date

  4. Where to start? Well, it looks like stuff is going up on average … 4

  5. Where to start? Well, it looks like stuff is going up on average … 4 360 co2 350 1988 1992 1996 date 2.5 resid 0.0 −2.5 1988 1992 1996 date

  6. and then? Well there is some periodicity lets add the month … 5

  7. and then? Well there is some periodicity lets add the month … 5 2.5 resid 0.0 −2.5 1988 1992 1996 date 1.0 0.5 resid2 0.0 −0.5 −1.0 1988 1992 1996 date

  8. and then and then? Maybe there is some different effect by year … 6

  9. and then and then? Maybe there is some different effect by year … 6 1.0 0.5 resid2 0.0 −0.5 −1.0 1988 1992 1996 date 0.8 0.4 resid3 0.0 −0.4 −0.8 1988 1992 1996 date

  10. Too much 9.722e-03 9.978e-01 1.065e+00 ## as.factor(year)1990 as.factor(year)1989 ## as.factor(year)1988 -2.585e-01 ## as.factor(year)1991 -6.064e+00 ## as.factor(year)1987 as.factor(year)1986 monthSep ## 7.726e-01 as.factor(year)1992 -4.838e+00 ## NA ## ## as.factor(year)1997 3.768e-01 1.119e-01 -4.146e-01 as.factor(year)1996 as.factor(year)1993 as.factor(year)1995 ## as.factor(year)1994 -7.911e-01 1.236e-02 7.067e-01 ## -6.135e+00 4.821e-01 (lm = lm (co2~date + month + as.factor (year), data=co2_df)) date ## -4.177e+00 1.508e+00 -2.645e+03 ## monthAug (Intercept) monthFeb ## ## Coefficients: ## ## lm(formula = co2 ~ date + month + as.factor(year), data = co2_df) ## Call: ## monthDec monthJan ## -2.035e+00 monthOct monthNov monthMay ## -1.227e+00 -3.251e-01 ## ## monthMar monthJun monthJul ## -2.705e+00 -2.008e+00 -3.612e+00 7

  11. 8 360 co2 350 1988 1992 1996 date

  12. Generalized Linear Models 9

  13. Background A generalized linear model has three key components: 1. a probability distribution (from the exponential family) that describes your response variable 10 2. a linear predictor η = X β , 3. and a link function g such that g ( E ( Y | X )) = µ = η .

  14. Poisson Regression 11

  15. Model Specification A generalized linear model for count data where we assume the outcome variable follows a poisson distribution (mean = variance). 12 Y i ∼ Poisson ( λ i ) log E ( Y | X ) = log λ = X β

  16. 13 Example - AIDS in Belgium AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year

  17. Frequentist glm fit g = glm (cases~year, data=aids, family=poisson) 14 pred = data_frame (year= seq (1981,1993,by=0.1)) pred$cases = predict (g, newdata=pred, type = ”response”) AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year

  18. Residuals Pearson residuals: Deviance residuals: Standard residuals: 15 Y i = Y i − ˆ r i = Y i − ˆ λ i Y i − ˆ Y i − E ( Y i | X ) λ i r i = Var ( Y i | X ) = √ √ ˆ λ i √ 2 ( y i log ( y i / ˆ λ i ) − ( y i − ˆ d i = sign ( y i − λ i ) λ i ))

  19. Deviance and deviance residuals Deviance can be interpreted as the difference between your model’s fit and n d i 2 Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals). 16 the fit of an ideal model (where E (ˆ Y i ) = Y i ). D = 2 ( L ( Y | θ best ) − L ( Y | ˆ ∑ θ )) = i = 1

  20. Deviance residuals derivation 17

  21. Residual plots 18 standard pearson deviance 2.5 2.5 0 0.0 0.0 residual −2.5 −2.5 −50 −5.0 1985 1990 1985 1990 1985 1990 year

  22. 19 print (aids_fit) AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year

  23. Quadratic fit g2 = glm (cases~year+ I (year^2), data=aids, family=poisson) 20 pred2 = data_frame (year= seq (1981,1993,by=0.1)) pred2$cases = predict (g2, newdata=pred, type = ”response”) AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year

  24. Quadratic fit - residuals 21 standard pearson deviance 20 1 1 10 residual 0 0 0 −1 −1 −10 1985 1990 1985 1990 1985 1990 year

  25. Bayesian Poisson Regression Model ## ## } } ## beta[j] ~ dnorm(0,1/100) ## for(j in 1:2){ ## # Prior for beta ## ## } Y_hat[i] ~ dpois(lambda[i]) ## model{ ## # In-sample prediction ## ## log(lambda[i]) <- beta[1] + beta[2]*X[i] ## Y[i] ~ dpois(lambda[i]) ## for(i in 1:length(Y)){ ## # Likelihood ## 22

  26. Bayesian Model fit m = jags.model ( textConnection (poisson_model1), quiet = TRUE, data = list (Y=aids$cases, X=aids$year) ) update (m, n.iter=1000, progress.bar=”none”) n.iter=5000, progress.bar=”none” ) 23 samp = coda.samples ( m, variable.names= c (”beta”,”lambda”,”Y_hat”),

  27. MCMC Diagnostics 24 Trace of beta[1] Density of beta[1] −1 0.2 −3 0.0 −5 2000 3000 4000 5000 6000 7000 −6 −4 −2 0 Iterations N = 5000 Bandwidth = 0.3209 Trace of beta[2] Density of beta[2] 0.0040 400 0.0020 0 2000 3000 4000 5000 6000 7000 0.002 0.003 0.004 0.005 Iterations N = 5000 Bandwidth = 0.0001615

  28. 25 Model fit? AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year

  29. What went wrong? ## 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## <2e-16 *** Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 ## year -25.68 summary (g) 3Q ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median Max 1.546e+01 ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 26

  30. What went wrong? ## 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## <2e-16 *** Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 ## year -25.68 summary (g) 3Q ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median Max 1.546e+01 ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 26

  31. summary ( glm (cases~ I (year-1981), data=aids, family=poisson)) ## 0.007771 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## Null deviance: 872.206 <2e-16 *** on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 ## I(year - 1981) 0.202121 47.13 ## Max ## Call: ## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q ## -4.6784 0.070920 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.342711 27

  32. Revising the model ## ## } } ## beta[j] ~ dnorm(0,1/100) ## for(j in 1:2){ ## # Prior for beta ## ## } Y_hat[i] ~ dpois(lambda[i]) ## model{ ## # In-sample prediction ## ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) ## Y[i] ~ dpois(lambda[i]) ## for(i in 1:length(Y)){ ## # Likelihood ## 28

  33. MCMC Diagnostics 29 Trace of beta[1] Density of beta[1] 3.5 4 3.3 2 3.1 0 2000 3000 4000 5000 6000 7000 3.1 3.2 3.3 3.4 3.5 3.6 Iterations N = 5000 Bandwidth = 0.01336 Trace of beta[2] Density of beta[2] 40 0.21 20 0.18 0 2000 3000 4000 5000 6000 7000 0.18 0.19 0.20 0.21 0.22 0.23 Iterations N = 5000 Bandwidth = 0.001456

  34. Model fit 30 AIDS cases in Belgium (Linear Poisson Model) 300 cases 200 100 0 1985 1990 year

  35. Bayesian Residual Plots 31 standard pearson deviance 4 50 2 2 0 post_mean 0 0 −2 −2 −50 −4 −4 −100 −6 −6 1985 1990 1985 1990 1985 1990 year

  36. Model fit 32 AIDS cases in Belgium (Quadratic Poisson Model) 300 200 cases 100 0 1985 1990 year

  37. Bayesian Residual Plots 33 standard pearson deviance 40 2 2 20 1 1 post_mean 0 0 0 −1 −1 −20 −2 −2 1985 1990 1985 1990 1985 1990 year

  38. Negative Binomial Regression 34

  39. Overdispersion If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction. mean (aids$cases) ## [1] 124.7692 var (aids$cases) ## [1] 8124.526 35 One of the properties of the Poisson distribution is that if X ∼ Pois ( λ ) then E ( X ) = Var ( X ) = λ .

Recommend


More recommend