Lecture 2 - Diagnostics and Model Evaluation
Colin Rundel
1/23/2017
From last time
Linear model and data

library(rjags)
library(dplyr)

set.seed(01172017)

n = 100
eps = rnorm(n, mean=0, sd=1)

X0 = rep(1, n)
X1 = rt(n, df=5)
X2 = rt(n, df=5)
X3 = rt(n, df=5)

beta = c(0.7, 1.5, -2.2, 0.1)
X = cbind(X0, X1, X2, X3)
Y = X %*% beta + eps

d = data.frame(Y, X[,-1])
Bayesian model

model{
  # Likelihood
  for(i in 1:length(Y)){
    mu[i] <- beta[1] + beta[2]*X1[i] + beta[3]*X2[i] + beta[4]*X3[i]
    Y[i] ~ dnorm(mu[i], tau2)
  }

  # Prior for beta
  for(j in 1:4){
    beta[j] ~ dnorm(0, 1/100)
  }

  # Prior for the inverse variance
  tau2 ~ dgamma(1, 1)
  sigma <- 1/sqrt(tau2)
}
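The slides that follow draw posterior samples from this model via an object called samp. As a minimal sketch (not shown in the original slides), assuming the model block above is stored as a string model_str, the samples could be generated with rjags like so:

# Compile the model, burn in, and draw posterior samples of beta and sigma
m = jags.model(textConnection(model_str),
               data = list(Y = c(Y), X1 = X1, X2 = X2, X3 = X3),
               n.chains = 1)
update(m, n.iter = 1000)   # burn-in
samp = coda.samples(m, variable.names = c("beta", "sigma"), n.iter = 5000)

With variable.names = c("beta", "sigma"), coda orders the columns beta[1] through beta[4] followed by sigma, which is the layout assumed when samp is indexed later.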
[Figure: posterior density plots for beta[1], beta[2], beta[3], beta[4], and sigma]
Model Evaluation
Model assessment?

If we think back to our first regression class, one common option is $R^2$, which gives us the variability in $Y$ explained by our model.

Quick review:

$$\underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{\text{Total}} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{\text{Model}} + \underbrace{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}_{\text{Error}}$$

$$R^2 = \text{Corr}(Y, \hat{Y})^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$
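As a quick numerical check (a sketch, not from the original slides), we can verify that the three expressions for $R^2$ agree for the least squares fit of the simulated data:

l = lm(Y ~ ., data=d)
Y_fit = fitted(l)

# All three should match summary(l)$r.squared
c(cor(c(Y), Y_fit)^2,
  sum((Y_fit - mean(Y))^2) / sum((Y - mean(Y))^2),
  1 - sum((Y - Y_fit)^2) / sum((Y - mean(Y))^2))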
Bayesian R²

While we can collapse our posterior parameter distributions to a single value (using mean, median, etc.) before calculating $\hat{Y}$ and then $R^2$, we don't need to.

n_sim = 5000; n = 100

Y_hat = matrix(NA, nrow=n_sim, ncol=n, dimnames=list(NULL, paste0("Yhat[",1:n,"]")))
for(i in 1:n_sim)
{
  beta_post  = samp[[1]][i,1:4]
  sigma_post = samp[[1]][i,5]
  Y_hat[i,]  = beta_post %*% t(X) + rnorm(n, sd = sigma_post)
}

Y_hat[1:5, 1:5]
##         Yhat[1]   Yhat[2]   Yhat[3]    Yhat[4]    Yhat[5]
## [1,] -1.7193735 2.712832  -2.935192 -0.3472465 -0.5000324
## [2,]  1.2504531 1.996477  -2.424596  0.5437237  0.7191012
## [3,]  1.7027229 0.4478495  3.035438 -4.419457   2.2174576
## [4,]  0.3661503 3.2258841  2.608588 -2.472159   1.1553329
## [5,]  1.8120911 2.612417  -3.349495  2.3028439  1.0398770
$\hat{Y}$ - lm vs Bayesian lm

[Figure: posterior predictive density plots for Yhat[1] through Yhat[6]]
Posterior R²

For each posterior sample $s$ we can calculate $R^2_s = \text{Corr}(Y, \hat{Y}_s)^2$,

R2_post = apply(Y_hat, 1, function(Y_hat_s) cor(Y, Y_hat_s)^2)

summary(c(R2_post)) %>% t()
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## [1,] 0.7531  0.8410 0.8549 0.8536  0.8672 0.9168

summary(lm(Y~., data=d))$r.squared
## [1] 0.9262839

[Figure: density plot of the posterior R² values]
What if we collapsed first?

Y_hat_post_mean = apply(Y_hat, 2, mean)
head(Y_hat_post_mean)
##    Yhat[1]    Yhat[2]    Yhat[3]    Yhat[4]    Yhat[5]    Yhat[6]
##  1.3336394  2.7270929 -3.1659989  1.1631532 -0.1003662 -1.5807078

Y_hat_post_med = apply(Y_hat, 2, median)
head(Y_hat_post_med)
##    Yhat[1]    Yhat[2]    Yhat[3]    Yhat[4]    Yhat[5]    Yhat[6]
##  1.3324163  2.7363221 -3.1772690  1.1531411 -0.1029881 -1.5918980

cor(Y_hat_post_mean, Y)^2
##           [,1]
## [1,] 0.9264776

cor(Y_hat_post_med, Y)^2
##           [,1]
## [1,] 0.9264527

summary(lm(Y~., data=d))$r.squared
## [1] 0.9262839
What went wrong?

If our criterion is to maximize $R^2$, then nothing.

Why this result? Remember that $\hat{\beta}_{MLE} = \hat{\beta}_{LS}$, the latter of which is achieved by

$$\underset{\beta}{\arg\min} \sum_{i=1}^n (Y_i - X_{i\cdot}\,\beta)^2 = \underset{\beta}{\arg\min} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$

So if we have $\hat{\beta}$ such that it minimizes the least squares criterion, what does that tell us about

$$R^2 = \text{Corr}(Y, \hat{Y})^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$

Since minimizing the error sum of squares is the same as maximizing $R^2$, an in-sample $R^2$ near that of the least squares fit is guaranteed rather than informative.
Some problems with R²

• R² always increases (or stays the same) when a predictor is added
• R² is highly susceptible to overfitting
• R² is sensitive to outliers
• R² depends heavily on current values of Y
• R² can differ drastically for two equivalent models (i.e. nearly identical inferences about key parameters)
Other Metrics
Root Mean Square Error

The traditional definition of RMSE is as follows:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2}$$

In the Bayesian context, where we have posterior samples for each parameter / prediction of interest, we can express this equation as

$$RMSE = \sqrt{\frac{1}{n_s \, n} \sum_{s=1}^{n_s} \sum_{i=1}^n (Y_i - \hat{Y}_{i,s})^2}$$

Note that, as we just saw with $R^2$, using the first definition with $\hat{Y}_i = \sum_{s=1}^{n_s} \hat{Y}_{i,s} / n_s$ does not necessarily give the same result as the second equation.
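To make the distinction concrete, here is a small sketch (assuming the Y and Y_hat objects from earlier) computing both versions:

# RMSE of the collapsed (posterior mean) predictions -- first definition
Y_hat_mean = apply(Y_hat, 2, mean)
sqrt(mean((c(Y) - Y_hat_mean)^2))

# Pooled RMSE over all posterior samples -- second definition
# sweep subtracts Y_i from every posterior draw Yhat[s,i]
sqrt(mean(sweep(Y_hat, 2, c(Y))^2))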
Continuous Rank Probability Score

RMSE (and related metrics like MAE) are not directly applicable to probabilistic predictions since they require fixed values of $\hat{Y}_i$. We can generalize to a fully continuous case, where $\hat{Y}$ is given by a predictive distribution, using a Continuous Rank Probability Score

$$CRPS = \int_{-\infty}^{\infty} \left( F_{\hat{Y}}(z) - 1\{z \geq Y\} \right)^2 \, dz$$

where $F_{\hat{Y}}$ is the empirical CDF of $\hat{Y}$ (the posterior predictive distribution for $Y$) and $1\{z \geq Y\}$ is the indicator function which equals 1 when $z \geq Y$, the true/observed value of $Y$.
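As an illustration (a sketch, not code from the slides), CRPS for a single observation can be approximated from posterior predictive draws by numerically integrating the squared difference between the empirical CDF and the indicator step function:

# y: observed value; y_rep: posterior predictive draws for that observation
crps_sample = function(y, y_rep, n_grid = 1000)
{
  F_hat = ecdf(y_rep)   # empirical CDF of the predictive draws
  z = seq(min(y_rep, y) - 1, max(y_rep, y) + 1, length.out = n_grid)
  dz = z[2] - z[1]
  sum((F_hat(z) - (z >= y))^2) * dz   # Riemann approximation of the integral
}

# e.g. CRPS for the first observation using its posterior predictive draws
crps_sample(Y[1], Y_hat[,1])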
Accuracy vs. Precision

[Figure: densities of four predictive distributions - dist1 (rmse = 2.002), dist2 (rmse = 2.798), dist3 (rmse = 1.002), dist4 (rmse = 2.22)]
CDF vs Indicator

[Figure: empirical CDFs of dist1 through dist4 plotted against the indicator step function]

The resulting CRPS for each distribution:

## dist1 dist2 dist3 dist4
## 0.468 1.188 0.235 1.432
Empirical Coverage

One final method of assessing model calibration is assessing how well credible intervals, derived from the posterior predictive distributions of the $Y$s, capture the true/observed values.

## 90% CI empirical coverage
## 0.92

[Figure: observed Y values and 90% posterior predictive credible intervals by index, colored by whether each interval captures the observed value]
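A sketch of how the coverage above could be computed from the posterior predictive samples (assuming the Y_hat matrix from earlier):

# 90% credible interval for each observation's posterior predictive distribution
ci = apply(Y_hat, 2, quantile, probs = c(0.05, 0.95))
capture = (c(Y) >= ci[1,]) & (c(Y) <= ci[2,])   # does each interval contain Y_i?
mean(capture)   # empirical coverage of the 90% intervals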
Cross-validation
Cross-validation styles

Kaggle style: split the data once into a training set and one or more held-out test/validation sets.

k-fold: split the data into k folds; fit the model k times, each time holding out one fold as the test set and training on the remaining k-1 folds.
Cross-validation in R with modelr

library(modelr)

d_kaggle = resample_partition(d, c(train=0.70, test1=0.15, test2=0.15))
d_kaggle
## $train
## <resample [69 x 4]> 1, 2, 3, 4, 6, 7, 8, 9, 11, 13, ...
## $test1
## <resample [15 x 4]> 12, 19, 23, 30, 34, 35, 36, 48, 54, 58, ...
## $test2
## <resample [16 x 4]> 5, 10, 18, 21, 31, 41, 45, 46, 57, 68, ...

d_kfold = crossv_kfold(d, k=5)
d_kfold
## # A tibble: 5 × 3
##            train           test   .id
##           <list>         <list> <chr>
## 1 <S3: resample> <S3: resample>     1
## 2 <S3: resample> <S3: resample>     2
## 3 <S3: resample> <S3: resample>     3
## 4 <S3: resample> <S3: resample>     4
## 5 <S3: resample> <S3: resample>     5
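A sketch (not from the slides) of putting the k-fold object to use: fit the linear model on each training resample and score it on the matching held-out fold, here with modelr's rmse() and purrr for iteration:

library(purrr)

fits = map(d_kfold$train, ~ lm(Y ~ ., data = as.data.frame(.)))
test_rmse = map2_dbl(fits, d_kfold$test, ~ rmse(.x, as.data.frame(.y)))
mean(test_rmse)   # average out-of-sample RMSE across the 5 folds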