Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2018 1
Residual Analysis 2
3 Atmospheric CO 2 (ppm) from Mauna Loa 360 co2 350 1988 1992 1996 date
Where to start? Well, it looks like stuff is going up on average … 4
Where to start? Well, it looks like stuff is going up on average … 4 360 co2 350 1988 1992 1996 date 2.5 resid 0.0 −2.5 1988 1992 1996 date
and then? Well there is some periodicity lets add the month … 5
and then? Well there is some periodicity lets add the month … 5 2.5 resid 0.0 −2.5 1988 1992 1996 date 1.0 0.5 resid2 0.0 −0.5 −1.0 1988 1992 1996 date
and then and then? There is still some long term trend in the data, maybe a fancy polynomial can help … 6
and then and then? There is still some long term trend in the data, maybe a fancy polynomial 6 can help … 1.0 0.5 resid2 0.0 −0.5 −1.0 1988 1992 1996 date 1.0 0.5 resid3 0.0 −0.5 1988 1992 1996 date
Putting it all together … -6.102e+00 NA NA ## poly(date, 5)1 < 2e-16 *** -44.832 1.346e-01 -6.036e+00 ## monthSep < 2e-16 *** -45.282 1.348e-01 ## monthOct NA < 2e-16 *** -35.577 1.349e-01 -4.799e+00 ## monthNov 3.631 0.000396 *** 1.344e-01 4.881e-01 ## monthMay -9.175 5.54e-16 *** 1.344e-01 -1.233e+00 NA ## poly(date, 5)2 -1.920e+00 -2.332 0.021117 * < 2e-16 *** p-value: < 2.2e-16 2872 on 16 and 139 DF, ## F-statistic: 0.9966 Adjusted R-squared: 0.997, ## Multiple R-squared: ## Residual standard error: 0.3427 on 139 degrees of freedom ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## Signif. codes: ## --- -12.535 3.427e-01 3.462e-01 ## poly(date, 5)5 -4.340e+00 2.609 0.010062 * 3.428e-01 8.946e-01 ## poly(date, 5)4 < 2e-16 *** 11.358 3.451e-01 3.920e+00 ## poly(date, 5)3 -5.602 1.09e-07 *** ## monthMar 1.345e-01 l_final = lm (co2~date + month + poly (date,5), data=co2_df) 0.17565 ## date < 2e-16 *** 1.460e+01 -177.174 -2.587e+03 ## (Intercept) t value Pr(>|t|) Estimate Std. Error ## ## Coefficients: (1 not defined because of singularities) ## 1.06026 ## -0.72022 -0.19169 -0.00638 7.334e-03 Max 3Q Median 1Q Min ## ## Residuals: ## ## lm(formula = co2 ~ date + month + poly(date, 5), data = co2_df) ## Call: ## summary (l_final) 1.479e+00 201.649 -3.136e-01 -15.041 ## monthJun < 2e-16 *** -15.003 1.345e-01 -2.018e+00 ## monthJul < 2e-16 *** -20.286 1.345e-01 -2.729e+00 ## monthJan < 2e-16 *** 1.345e-01 < 2e-16 *** -2.022e+00 ## monthFeb < 2e-16 *** -26.404 1.350e-01 -3.566e+00 ## monthDec < 2e-16 *** -30.880 1.346e-01 -4.155e+00 ## monthAug 7
Final fit + Residualss 8 360 co2 350 1988 1992 1996 date 1.0 resid_final 0.5 0.0 −0.5 1988 1992 1996 date
Generalized Linear Models 9
Background A generalized linear model has three key components: 1. a probability distribution (from the exponential family) that describes your response variable 2. a linear predictor 𝜽 = 𝐘𝜸 , 3. and a link function such that (𝐹(𝐙|𝐘)) = 𝜽 . 10
Poisson Regression This is a special case of a generalized linear model for count data where we assume the outcome variable follows a poisson distribution (mean = variance). 11 𝑍 𝑗 ∼ Poisson (𝜇 𝑗 ) log 𝐹(𝑍 𝑗 |𝐘 𝑗⋅ ) = log 𝜇 𝑗 = 𝐘 𝑗⋅ 𝜸
Example - AIDS in Belgium These data represent the total number of new AIDS cases reported in Belgium during the early stages of the epidemic. 12 AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year
Frequentist glm fit g = glm (cases~year, data=aids, family=poisson) mutate (cases = predict (g, newdata=., type = ”response”)) 13 pred = data_frame (year= seq (1981,1993,by=0.1)) %>% AIDS cases in Belgium 300 200 cases 100 0 1985 1990 year
aids = aids %>% Residuals? The naive approach is to use standard residuals, 𝜇 𝑗 mutate (pred = predict (g, newdata=., type = ”response”)) %>% mutate (standard = cases - pred) ggplot (aids, aes (x=year, y=standard)) + geom_point () + geom_segment ( aes (xend=year, yend=0)) 14 𝑠 𝑗 = 𝑍 𝑗 − 𝐹(𝑍 𝑗 |𝑌) = 𝑍 𝑗 − ̂
Residuals? The naive approach is to use standard residuals, 𝜇 𝑗 mutate (pred = predict (g, newdata=., type = ”response”)) %>% mutate (standard = cases - pred) ggplot (aids, aes (x=year, y=standard)) + geom_point () + geom_segment ( aes (xend=year, yend=0)) 14 𝑠 𝑗 = 𝑍 𝑗 − 𝐹(𝑍 𝑗 |𝑌) = 𝑍 𝑗 − ̂ aids = aids %>% standard 0 −50 1985 1990 year
aids = aids %>% Accounting for variability Pearson residuals: √𝑊 𝑏𝑠(𝑍 𝑗 |𝑌) 𝜇 𝑗 𝜇 𝑗 mutate (pearson = (cases - pred)/ sqrt (pred)) ggplot (aids, aes (x=year, y=pearson)) + geom_point () + geom_segment ( aes (xend=year, yend=0)) 15 = 𝑍 𝑗 − ̂ 𝑠 𝑗 = 𝑍 𝑗 − 𝐹(𝑍 𝑗 |𝑌) √ ̂
Accounting for variability 𝜇 𝑗 geom_point () + geom_segment ( aes (xend=year, yend=0)) Pearson residuals: mutate (pearson = (cases - pred)/ sqrt (pred)) ggplot (aids, aes (x=year, y=pearson)) + 𝜇 𝑗 √𝑊 𝑏𝑠(𝑍 𝑗 |𝑌) 15 = 𝑍 𝑗 − ̂ 𝑠 𝑗 = 𝑍 𝑗 − 𝐹(𝑍 𝑗 |𝑌) √ ̂ aids = aids %>% 2.5 pearson 0.0 −2.5 1985 1990 year
Deviance Deviance is a way of measuring the difference between your glm’s fit and 𝑍 𝑗 |𝑌) = 𝑍 𝑗 ). It is defined as twice the log of the ratio between the likelihood of a perfect model and the likelihood of the given model, 𝜄|𝑍 )) 𝜄|𝑍 )) 16 the fit of a perfect model (where 𝐹( ̂ 𝐸 = 2 log (ℒ(𝜄 𝑐𝑓𝑡𝑢 |𝑍 )/ℒ( ̂ = 2(𝑚(𝜄 𝑐𝑓𝑡𝑢 |𝑍 ) − 𝑚( ̂
Derivation - Normal 17
Derivation - Poisson 18
glm output ## 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## <2e-16 *** Null deviance: 872.206 on 12 degrees of freedom ## Residual deviance: 80.686 on 11 degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4 ## year -25.68 summary (g) 3Q ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median Max 1.546e+01 ## -4.6784 -1.5013 -0.2636 2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 19
dev_resid = function (obs,pred) aids = aids %>% Deviance residuals We can therefore think of deviance as 𝐸 = ∑ 𝑜 generalized residual. So in the Poisson case we can define, 𝜇 𝑗 )) sign (obs-pred) * sqrt (2*(obs* log (obs/pred)-(obs-pred))) mutate (deviance = dev_resid (cases, pred)) ggplot (aids, aes (x=year, y=deviance)) + geom_point () + geom_segment ( aes (xend=year, yend=0)) 20 𝑗=1 𝑒 2 𝑗 where 𝑒 𝑗 is a 𝑒 𝑗 = sign (𝑧 𝑗 − 𝜇 𝑗 )√2(𝑧 𝑗 log (𝑧 𝑗 / ̂ 𝜇 𝑗 ) − (𝑧 𝑗 − ̂
Deviance residuals 𝜇 𝑗 )) geom_point () + geom_segment ( aes (xend=year, yend=0)) ggplot (aids, aes (x=year, y=deviance)) + We can therefore think of deviance as 𝐸 = ∑ sign (obs-pred) * sqrt (2*(obs* log (obs/pred)-(obs-pred))) mutate (deviance = dev_resid (cases, pred)) 20 generalized residual. So in the Poisson case we can define, 𝑜 𝑗=1 𝑒 2 𝑗 where 𝑒 𝑗 is a 𝑒 𝑗 = sign (𝑧 𝑗 − 𝜇 𝑗 )√2(𝑧 𝑗 log (𝑧 𝑗 / ̂ 𝜇 𝑗 ) − (𝑧 𝑗 − ̂ dev_resid = function (obs,pred) aids = aids %>% 2.5 deviance 0.0 −2.5 −5.0 1985 1990 year
Comparing Residuals 21 standard pearson deviance 2.5 2.5 0 0.0 residual 0.0 −2.5 −2.5 −50 −5.0 1985 1990 1985 1990 1985 1990 year
Updating the model 22
Quadratic fit g2 = glm (cases~year+ I (year^2), data=aids, family=poisson) mutate (cases = predict (g2, newdata=., type = ”response”)) 23 pred2 = data_frame (year= seq (1981,1993,by=0.1)) %>% AIDS cases in Belgium 250 200 150 cases 100 50 0 1985 1990 year
Quadratic fit - residuals 24 standard pearson deviance 2.5 2.5 0 0.0 0.0 −2.5 −2.5 −50 residual −5.0 standard2 pearson2 deviance2 20 1 1 10 0 0 0 −1 −1 −10
Bayesian Model 25
Bayesian Poisson Regression Model poisson_model = ”model{ # Likelihood for (i in 1:length(Y)) { Y[i] ~ dpois(lambda[i]) log(lambda[i]) <- beta[1] + beta[2]*X[i] # In-sample prediction Y_hat[i] ~ dpois(lambda[i]) } # Prior for beta for(j in 1:2){ beta[j] ~ dnorm(0,1/100) } }” 26
Fit Model n_burn=1000; n_iter=5000 m = rjags:: jags.model ( textConnection (poisson_model), quiet = TRUE, data = list (Y=aids$cases, X=aids$year) ) update (m, n.iter=1000, progress.bar=”none”) samp = rjags:: coda.samples ( n.iter=5000, progress.bar=”none” ) 27 m, variable.names= c (”beta”,”lambda”,”Y_hat”,”Y”,”X”),
Model Fit? geom_point () tidybayes:: spread_samples (samp, Y_hat[i], X[i],Y[i]) %>% 28 tidybayes:: stat_lineribbon ( aes (y=Y_hat), alpha=0.5) + ggplot ( aes (x=X,y=Y)) + ungroup () %>% 250 200 0.95 150 Y 0.8 0.5 100 50 0 1985 1990 X
Recommend
More recommend