Conditional AIC for Mixed Effects Models

Florin Vaida
Division of Biostatistics and Bioinformatics, UCSD

Vienna Workshop on Model Selection, July 24, 2008
Model Selection for Mixed Effects Models

• Setting: longitudinal data
• Subjects $i = 1, \dots, m$; observations $j = 1, \dots, n_i$
• Models: linear, generalized linear, and nonlinear mixed effects models (LME, GLME, NLME)
• NLME: $y_{ij} = h(\eta_{ij}) + \epsilon_{ij}$, with $\epsilon_{ij} \sim N(0, \sigma^2)$
• GLME: $E(y_{ij}) = h(\eta_{ij})$, $y_{ij} \sim$ exponential family
• Linear predictor: $\eta_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i$
• Random effects: $b_i \stackrel{iid}{\sim} N(0, G)$
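For concreteness, the snippet below simulates data from an NLME of exactly this form, with $h(\eta) = \exp(\eta)$ (the shape used in the PK example later in the talk). This is a minimal sketch: all variable names and parameter values are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_i = 10, 8                            # subjects, observations per subject
t = np.tile(np.linspace(1, 25, n_i), m)   # measurement times
subj = np.repeat(np.arange(m), n_i)       # subject index for each row

beta = np.array([0.5, -0.2])              # fixed effects (intercept, slope in eta)
G = np.diag([0.3**2, 0.02**2])            # random-effects covariance
b = rng.multivariate_normal(np.zeros(2), G, size=m)

# linear predictor eta_ij = x_ij' beta + z_ij' b_i, here x_ij = z_ij = (1, t_ij)
eta = (beta[0] + b[subj, 0]) + (beta[1] + b[subj, 1]) * t
sigma = 0.05
y = np.exp(eta) + rng.normal(0, sigma, m * n_i)   # NLME with h = exp
```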
Model Selection for Cluster Focus

• Focus of inference/prediction on the subjects (clusters) in the dataset, not on new subjects (clusters)
• Model selection:
  – what covariates to include?
  – what random effects to include?
  – should I fit subject effects as fixed or random?
• Vaida & Blanchard (2005): conditional Akaike information
• For the LME: cAIC $= -2\,\text{loglik} + 2K$, with conditional loglik and effective d.f. $K$
• Extend such a formula to GLME and NLME?
Example: PK of Cadralazine

[Figure: trellis plot of log(Concentration/Dose) (1/L) versus time since drug administration (hrs), one panel per subject (subjects 1–10).]
$$y_{ij} = \beta_{0i} + \beta_{1i} \, t_{ij} + e_{ij}$$

Two models:

1. Linear regression model: $\beta_{0i}, \beta_{1i}$ are fixed, subject-specific parameters
2. Random effects model: $\beta_{0i} = \beta_0 + b_{0i}$, $\beta_{1i} = \beta_1 + b_{1i}$, with $(b_{0i}, b_{1i}) \stackrel{iid}{\sim} N(0, G)$

Which model is better?
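A sketch of how the two candidate fits could be run in Python with statsmodels, using hypothetical simulated data standing in for the cadralazine measurements; the parameter counts $K$ for the naive (marginal) AICs are spelled out by hand.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
m, n_i = 10, 8
subj = np.repeat(np.arange(m), n_i)
t = np.tile(np.linspace(1, 25, n_i), m)
b0 = rng.normal(0, 1.0, m)             # subject intercept deviations
b1 = rng.normal(0, 0.05, m)            # subject slope deviations
y = (-3 + b0[subj]) + (-0.2 + b1[subj]) * t + rng.normal(0, 0.3, m * n_i)
df = pd.DataFrame({"y": y, "t": t, "subject": subj})

# Model 1: subject-specific fixed intercepts and slopes (2m mean parameters)
lr = smf.ols("y ~ 0 + C(subject) + C(subject):t", data=df).fit()

# Model 2: random intercept and slope per subject
re = smf.mixedlm("y ~ t", data=df, groups=df["subject"],
                 re_formula="~t").fit(reml=False)

# Naive (marginal) AICs: -2 loglik + 2 K
print(-2 * lr.llf + 2 * (2 * m + 1))   # K = 2m betas + sigma^2
print(-2 * re.llf + 2 * (2 + 3 + 1))   # K = 2 betas + 3 params of G + sigma^2
```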
[Figure 1: diagnostic plots for the two models. Top row: observed vs fitted values for the random effects model (population and individual predictions) and for the subject-specific (SS) linear regression model. Bottom row: residuals vs fitted values for the same three fits.]
Comparison of the Two Models

        Linear regression   Random effects
AIC          −47.1               11.0

AIC: small is beautiful.

|ΔAIC| < 2: similar fit; |ΔAIC| > 10: overwhelming evidence.

Why is the AIC for the linear regression model so much smaller? Something wrong with AIC?
Effective Degrees of Freedom for LME

• LME: $y_i = X_i \beta + Z_i b_i + e_i$, with $b_i \sim N(0, \sigma^2 D_0)$; or, in general,
  $$y = X\beta + Zb + e, \quad b \sim N(0, G = \sigma^2 D), \quad e \sim N(0, \sigma^2 I)$$
• Inference: maximum likelihood for $\beta, \sigma^2, D$ (or REML)
• $\hat b_i = \arg\sup p(b_i \mid y, \hat\beta, \hat G) = \arg\sup p(y_i, b_i \mid \hat\beta, \hat G)$ = BLUP, or empirical Bayes
• Hodges and Sargent (2001): effective degrees of freedom $\rho = \operatorname{trace}(H)$, where $\hat y = Hy$
• Count each $b_i$ as a fraction of a parameter
Effective Degrees of Freedom

• Henderson's "score" equations for $(\beta, b)$:
  $$X^\top X \beta + X^\top Z b = X^\top y$$
  $$Z^\top X \beta + (Z^\top Z + D^{-1}) b = Z^\top y$$
• Corresponding to the formal linear model
  $$\begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} X & Z \\ 0 & -I \end{pmatrix} \begin{pmatrix} \beta \\ b \end{pmatrix} + \begin{pmatrix} e \\ b \end{pmatrix}$$
• $\hat y = Hy$, with
  $$H = \begin{pmatrix} X & Z \end{pmatrix} \begin{pmatrix} X^\top X & X^\top Z \\ Z^\top X & Z^\top Z + D^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X & Z \end{pmatrix}^\top$$
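Henderson's equations are a single linear system; a minimal numpy sketch, assuming $D$ is known:

```python
import numpy as np

def henderson_solve(y, X, Z, D):
    """Solve Henderson's equations for (beta_hat, b_hat).
    D = G / sigma^2 is assumed known; b_hat is the BLUP."""
    p = X.shape[1]
    C = np.block([[X.T @ X, X.T @ Z],
                  [Z.T @ X, Z.T @ Z + np.linalg.inv(D)]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(C, rhs)
    return sol[:p], sol[p:]   # beta_hat (GLS), b_hat (BLUP)
```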
$$\rho = \operatorname{trace}(H) = \operatorname{trace}\left\{ \begin{pmatrix} X & Z \end{pmatrix} \begin{pmatrix} X^\top X & X^\top Z \\ Z^\top X & Z^\top Z + D^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X & Z \end{pmatrix}^\top \right\}$$
$$= \operatorname{trace}\left\{ \begin{pmatrix} X^\top X & X^\top Z \\ Z^\top X & Z^\top Z + D^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X^\top X & X^\top Z \\ Z^\top X & Z^\top Z \end{pmatrix} \right\}$$

Inspired by Hastie and Tibshirani (1990) for GAM.
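The trace is easy to compute directly; a sketch, with a sanity check of the two limiting cases ($D \to \infty$: every random effect counts as a full parameter, so $\rho \to p + q$; $D \to 0$: the random effects are shrunk away and $\rho \to p$):

```python
import numpy as np

def effective_df(X, Z, D):
    """rho = trace(H) for y_hat = H y in the LME (D = G / sigma^2)."""
    XZ = np.hstack([X, Z])
    C = np.block([[X.T @ X, X.T @ Z],
                  [Z.T @ X, Z.T @ Z + np.linalg.inv(D)]])
    return np.trace(XZ @ np.linalg.solve(C, XZ.T))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
Z = rng.normal(size=(50, 5))
print(effective_df(X, Z, 1e8 * np.eye(5)))   # ~7 = p + q, no shrinkage
print(effective_df(X, Z, 1e-8 * np.eye(5)))  # ~2 = p, random effects shrunk away
```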
Counting Parameters for NLME

• Our idea: take the NLME $y_{ij} = h(\eta_{ij}) + \epsilon_{ij}$, $\eta_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i$
• Linearization:
  $$h(\eta_{ij}) \approx h(\hat\eta_{ij}) + h'(\hat\eta_{ij})(\eta_{ij} - \hat\eta_{ij}) = \{h'(\hat\eta_{ij}) x_{ij}\}^\top \beta + \{h'(\hat\eta_{ij}) z_{ij}\}^\top b_i + \text{const}$$
• This gives the working model $w_{ij} = s_{ij}^\top \beta + t_{ij}^\top b_i + \epsilon_{ij}$, where
  $$w_{ij} = y_{ij} - h(\hat\eta_{ij}) + h'(\hat\eta_{ij})\hat\eta_{ij}, \quad s_{ij} = h'(\hat\eta_{ij}) x_{ij}, \quad t_{ij} = h'(\hat\eta_{ij}) z_{ij}$$
• Compute $\rho$ for the linearized mixed effects model in $w_{ij}$
• Call $\rho$ the effective d.f. for the NLME
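A sketch of the linearization step in numpy; `h` and `h_prime` are the user-supplied mean function and its derivative, and all names are illustrative:

```python
import numpy as np

def linearize_nlme(y, X, Z, subj, beta_hat, b_hat, h, h_prime):
    """Working response w and designs S, T for the linearized NLME
    w_ij = s_ij' beta + t_ij' b_i + eps_ij (sketch)."""
    eta = X @ beta_hat + np.sum(Z * b_hat[subj], axis=1)  # eta_hat_ij
    d = h_prime(eta)
    w = y - h(eta) + d * eta     # w_ij = y_ij - h(eta_hat) + h'(eta_hat) eta_hat
    S = d[:, None] * X           # s_ij = h'(eta_hat_ij) x_ij
    T = d[:, None] * Z           # t_ij = h'(eta_hat_ij) z_ij
    return w, S, T

# rho for the NLME = effective_df of the linearized model in (w, S, T),
# with T expanded to the block-diagonal per-subject random-effects design.
```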
Effective DF for GLME/NLME

• Lu, Carlin and Hodges (2007), for GLME:
• For a GLMM with linear predictor $\eta_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i$, expand the loglikelihood:
  $$l(\eta_{ij}) = l(\hat\eta_{ij}) + l'(\hat\eta_{ij})(\eta_{ij} - \hat\eta_{ij}) + \tfrac{1}{2} l''(\hat\eta_{ij})(\eta_{ij} - \hat\eta_{ij})^2 = -\frac{1}{2\sigma_{ij}^2}(u_{ij} - \eta_{ij})^2 + \text{const},$$
  where $\sigma_{ij}^2 = -1 / E\{l''(\hat\eta_{ij})\} = \sigma^2 / \{h'(\eta_{ij}^*)\}^2$
• Working response: $u_{ij} \approx \eta_{ij} + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma_{ij}^2)$
Formal linear model:
$$\begin{pmatrix} u \\ 0 \end{pmatrix} = \begin{pmatrix} X & Z \\ 0 & -I \end{pmatrix} \begin{pmatrix} \beta \\ b \end{pmatrix} + \begin{pmatrix} e \\ b \end{pmatrix}, \quad \operatorname{Var}(\epsilon) = \sigma^2 W^{-1}, \quad W = \operatorname{diag}[\{h'(\eta_{ij})\}^2]$$

$$\rho = \operatorname{trace}(H) = \operatorname{trace}\left\{ W \begin{pmatrix} X & Z \end{pmatrix} \begin{pmatrix} X^\top W X & X^\top W Z \\ Z^\top W X & Z^\top W Z + D^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X & Z \end{pmatrix}^\top \right\}$$
$$= \operatorname{trace}\left\{ \begin{pmatrix} X^\top W X & X^\top W Z \\ Z^\top W X & Z^\top W Z + D^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X^\top W X & X^\top W Z \\ Z^\top W X & Z^\top W Z \end{pmatrix} \right\}$$

$\rho$ = effective degrees of freedom of GLME/NLME
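The weighted analogue of the `effective_df` sketch above; with `w_diag` all ones it reduces to the unweighted LME case:

```python
import numpy as np

def effective_df_weighted(X, Z, D, w_diag):
    """rho = trace(H) for the working linear model, with
    W = diag(w_diag) = diag({h'(eta_ij)}^2) and D = G / sigma^2 (sketch)."""
    W = np.diag(w_diag)
    XZ = np.hstack([X, Z])
    C = XZ.T @ W @ XZ
    q = Z.shape[1]
    C[-q:, -q:] += np.linalg.inv(D)           # add D^{-1} to the Z'WZ block
    return np.trace(W @ XZ @ np.linalg.solve(C, XZ.T))
```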
Effective DF for NLME

• Result: the two definitions of $\rho$ for NLME are equivalent. They are based on different linearizations: on the scale of $y_{ij}$ and on the scale of $\eta_{ij}$, respectively.
• For NLME/GLME, $\rho$ also depends on $\eta_{ij}$ through $W = \operatorname{diag}[\{h'(\eta_{ij})\}^2]$.
• Relevant values: the true $\rho^*$, using the "true" $\eta_{ij}^*$ and $W^*$; and the estimated $\hat\rho$, using $\hat\eta_{ij}$ and $\hat W$.
• $H$ and $\rho$ correspond to the score equations for $(\beta, b)$:
  $$X^\top W X \beta + X^\top W Z b = X^\top W y$$
  $$Z^\top W X \beta + (Z^\top W Z + D^{-1}) b = Z^\top W y,$$
  which are the PQL equations of Breslow and Clayton (1993) for GLME.
Model Selection Using Akaike Information

$$\mathrm{AI} = E_{f(y)} \left[ -2\, E_{f(y^*)} \log g(y^* \mid \hat\theta(y)) \right]$$

How good is model $g(\cdot \mid \theta)$ at predicting new data $y^*$ from model $f(\cdot)$, based on the sample $y$ from $f(\cdot)$?

Akaike information is not about finding the "true model".

Estimator:
$$\mathrm{AI} \approx \mathrm{AIC} = -2 \log g(y \mid \hat\theta(y)) + 2K, \quad K = \text{d.f.} = \text{number of parameters in the model}$$

Asymptotically, AIC is approximately unbiased for AI.
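A quick Monte Carlo illustration of this unbiasedness (not from the talk) for ordinary linear regression, where both AIC and a fresh-data estimate of AI are easy to compute:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 100, 3, 2000
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 0.0])
gap = []
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)
    bh = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.mean((y - X @ bh) ** 2)                 # MLE of sigma^2
    aic = n * (np.log(2 * np.pi * s2) + 1) + 2 * (p + 1)
    # fresh y* from the same truth: -2 log g(y* | theta_hat) estimates AI
    ystar = X @ beta + rng.normal(size=n)
    dev_star = (n * np.log(2 * np.pi * s2)
                + np.sum((ystar - X @ bh) ** 2) / s2)
    gap.append(aic - dev_star)
print(np.mean(gap))   # near 0: AIC is approximately unbiased for AI
```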
Conditional AIC

• Assume truth $f(\cdot \mid b_0)$ and model $g(\cdot \mid \theta, b)$: GLMM/NLMM/LMM
• $f(y \mid b_0)$ = conditional distribution, $b_0$ = true value of the random effects
• Definition: conditional Akaike information (V & B 2005)
  $$\mathrm{cAI} = -2\, E_{f(y, b_0)} E_{f(y^* \mid b_0)} \log g(y^* \mid \hat\beta(y), \hat b(y)),$$
  where $y^*$ is iid with $y$, conditional on the same $b_0$
• cAI is appropriate for comparing models at the subject-specific level
• E.g., when choosing between fixed and random subject effects
Theorem: Conditional AIC for LME

Assume that $y \sim$ LME, the model class $g$ contains the operational model $f$, and $\sigma^2, D$ are known. Then
$$\mathrm{cAIC} = -2 \log g(y \mid \hat\beta(y), \hat b(y)) + 2\rho$$
is an unbiased estimator of the conditional Akaike information.

• $g(y \mid \hat\beta, \hat b)$ is the conditional distribution
• $\rho$ = effective d.f.
• Unknown $\sigma^2$: the correction is $\rho + 1$ asymptotically; a small-sample correction is available
• No correction needed for unknown $D$ asymptotically
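Combining the earlier sketches (`henderson_solve` and `effective_df` above), cAIC for an LME with known $\sigma^2$ and $D$ could be computed as follows; for estimated $\sigma^2$ the theorem's asymptotic penalty becomes $2(\rho + 1)$:

```python
import numpy as np

def caic_lme(y, X, Z, D, sigma2):
    """cAIC = -2 log g(y | beta_hat, b_hat) + 2 rho for the LME,
    sigma^2 and D known (sketch; reuses henderson_solve, effective_df)."""
    beta_hat, b_hat = henderson_solve(y, X, Z, D)
    resid = y - X @ beta_hat - Z @ b_hat
    cond_loglik = (-0.5 * len(y) * np.log(2 * np.pi * sigma2)
                   - 0.5 * resid @ resid / sigma2)   # Gaussian conditional loglik
    rho = effective_df(X, Z, D)
    # with estimated sigma^2, use 2 * (rho + 1) asymptotically
    return -2 * cond_loglik + 2 * rho
```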
Back to the Cadralazine Data

                          Random effects model      Linear regression model
                            mAIC        cAIC                 AIC
Asymptotic                  11.0       −44.5                −47.1
Finite-sample corrected     12.6       −42.3                −22.8
REML                          −        −43.7                −40.6
Theorem: Conditional AIC for GLMM/NLMM

Assume that $y \sim$ GLMM or NLMM and $\sigma^2, D$ are known. Then, under regularity conditions,
$$\mathrm{cAIC} = -2 \log g(y \mid \hat\beta, \hat b) + 2\rho$$
is an asymptotically unbiased estimator of the conditional Akaike information.

• Regularity conditions include $m/N \to 0$ as $N \to \infty$ and $\min_i |Z_i| \geq c_0$ (Jiang, Jia and Chen, 2001). They ensure consistency of $\hat\beta, \hat b_i$.
• No results yet for unknown $\sigma^2$; correction $= \rho + 1$?
Simulation Study

Bias of cAIC, computed as $-2\,\text{conditional loglik} + 2(\rho + 1)$, as an estimator of cAI:

           σ = .50   σ = .25   σ = .125
n_i = 24     1.7       0.5       0.0
n_i = 12     3.0       1.0       0.4
n_i =  6     4.9       3.0       1.9
n_i =  3     9.5       9.7       8.7

10 clusters with $n_i$ observations each; one-compartment PK model.

Bias decreases with increasing $n_i$ and decreasing $\sigma / \|D\|$. The bias includes the effects of unknown $D$ and of model non-linearity.
Cadralazine Data

[Figure: trellis plot of cadralazine concentration (mg/L) versus time since drug administration (hrs), one panel per subject (subjects 1–10).]

Mean                        random         var     cAIC      ρ
exp{β_1i + β_2 t_ij}        β_1i           σ²      −473.88   10.23
exp{β_1i + β_2i t_ij}       β_1i, β_2i     σ²      −473.88   10.23
exp{β_1i + β_2i t_ij}       none           σ²_i    −577.18   30