6 The simplest models
 6.1 One-way ANOVA with random effects
 6.2 Estimating the one-way ANOVA model
  6.2.1 Mixed model approach
 6.3 EBLUPs
7 Slightly more complex models
 7.1 Means as outcomes regression
 7.2 One-way ANCOVA with random effects
 7.3 Random coefficients model
 7.4 Intercepts and Slopes as outcomes
 7.5 Nonrandom slopes
 7.6 Asking questions: CONTRAST and ESTIMATE statement
8 A second look at multilevel models
 8.1 What is a mixed model really estimating
 8.2 Var(Y|X) and T
  8.2.1 Random slope model
  8.2.2 Two random predictors
  8.2.3 Interpreting Chol(T)
  8.2.4 Recentering and balancing the model
  8.2.5 Random slopes and variance components parametrization
  8.2.6 Testing hypotheses about T
 8.3 Examples
 8.4 Fitting a multilevel model: contextual effects
  8.4.1 Example
9 Longitudinal Data
 9.1 The basic model
 9.2 Analyzing longitudinal data
  9.2.1 Classical or Mixed models
 9.3 Pothoff and Roy
  9.3.1 Univariate ANOVA
  9.3.2 MANOVA repeated measures
  9.3.3 Random Intercept Model with Autocorrelation
  9.3.4 Comparing Different Covariance Models
  9.3.5 Exercises on Pothoff and Roy
10 Bibliography (to be changed)
11 Outputs and pictures
 11.1 what
 11.2 Looking at a single school: Public School P4458

1 Introduction

The last two decades have seen rapid growth in the development and use of models suitable for multilevel or longitudinal data. We can identify at least four broad approaches: the use of derived variables, econometric models, latent trajectory models using structural equation models, and mixed (fixed and random effects) models. This course focuses on the use of mixed models for multilevel and longitudinal data.

Mixed models have a wide potential for applications to data that are otherwise awkward or impossible to model. Some key applications are to nested data structures, e.g.
students within classes within schools within school boards, and to longitudinal data, especially 'messy' data where
measurements are taken at irregular times, e.g. clinical data from patients measured at irregular times or panel data with changing membership. Another application is to panel data with time-varying covariates, e.g. age, where each cohort has a variety of ages and the researcher is interested in studying age effects.

New books and articles on multilevel and longitudinal models are being released much faster than one can read or afford them. One way to get started for someone who intends to start with SAS would be to read:

• Judith D. Singer and John B. Willett (2003). Applied Longitudinal Data Analysis. Oxford University Press, New York. An accessible and comprehensive treatment of methods for longitudinal analysis. In addition to the use of mixed models, this book includes material on latent growth models and on event history analysis.
• Stephen W. Raudenbush and Anthony S. Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd edition). Sage, Thousand Oaks, CA.
• Peter J. Diggle, Patrick Heagerty, Kung-Yee Liang and Scott L. Zeger (2002). Analysis of Longitudinal Data (2nd edition). Oxford University Press, Oxford.
• Tom A. B. Snijders and Roel J. Bosker (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage, London. An excellent book with a broad conceptual coverage.
• Judith Singer (1998). "Using SAS PROC MIXED to fit Multilevel Models, Hierarchical Models, and Individual Growth Models."
• José Pinheiro and Douglas Bates (2000). Mixed-Effects Models in S and S-PLUS. Springer, New York.

The choice of software to fit mixed models continues to grow. Many packages are in active development and offer new functionality. It is safe to hypothesize that no package is uniformly superior to all the others.
• PROC MIXED in SAS is a solid workhorse, although it has been relatively static for some years. The ability to specify a wide variety of structures for both the within-cluster covariance matrix and the random effects covariance matrix is an important feature. Graphics in SAS are improving, and the description of version 9 mentions the ability to produce some important diagnostic plots automatically. NLMIXED is a more recent and interesting addition for non-linear models and for modeling binary and Poisson outcomes.
• SPSS 11.5 has a MIXED command to fit mixed models. See Alastair Leyland (2004), A review of multilevel modeling in SPSS.
• An excellent special-purpose programme is MLwiN, developed by a group associated with Harvey Goldstein at the Institute of Education at the University of London. It uses a graphical interface that is very appealing for multilevel modelling and produces some very interesting plots quite easily. Its basic linear methods are well integrated with more advanced methods (logistic mixed models, MCMC estimation), making the transition relatively easy. One shortcoming is its unstructured parametrization of the variance matrix of random effects. MLwiN seems to be able to handle very large data sets as long as they fit in RAM.
• NLME by Bates and Pinheiro is a library in R and S-Plus. This is my favourite working environment, but it's best for those who enjoy programming. It's a good environment for non-linear normal error models and can be adapted to fit logistic hierarchical models. R and S-Plus are very strong for graphics and for programmability, which allows you to easily refine and reuse analytic solutions.
• GLLAMM in STATA is a very interesting program that is in active development and can fit a wide variety of models, including those with latent variables.
• The current version of HLM, like MLwiN, has a graphical interface. It is developed by Bryk and Raudenbush and is a powerful and popular program.

Some resources on the web include:

• UCLA Academic Technology Services seminars at http://www.ats.ucla.edu/stat/seminars/ contain an excellent collection of seminars
on topics in statistical computing, including extensive materials on multilevel and longitudinal models.
• Two multilevel statistics books are available free online: the Internet Edition of Harvey Goldstein's extensive but challenging Multilevel Statistical Models can be downloaded from http://www.arnoldpublishers.com/support/goldstein.htm, and Applied Multilevel Analysis by Joop Hox is available at http://www.fss.uu.nl/ms/jh/publist/amaboek.pdf.
• Multilevel Modelling Newsletters, produced twice yearly by the Multilevel Models Project in the Institute of Education at the University of London: http://multilevel.ioe.ac.uk/publref/newsletters.html
• The web site for the Multilevel Models Project: http://www.ioe.ac.uk/multilevel/
• One can subscribe to an email discussion list at http://www.mailbase.ac.uk/lists/multilevel/

2 Preliminaries

We begin by reviewing a few concepts from regression and multivariate data analysis. Our emphasis is on concepts that are frequently omitted, or, at least, not well explored in typical courses.

2.1 Is $\beta_1$ the effect of $X_1$?

Consider the familiar regression equation:
$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \qquad (1)$$

To make this more concrete, suppose we are studying the relationship between $Y = \mathit{Health}$, $X_1 = \mathit{Weight}$ and $X_2 = \mathit{Height}$:

$$E(\mathit{Health}) = \beta_0 + \beta_1 \mathit{Weight} + \beta_2 \mathit{Height} \qquad (2)$$

$\beta_1$ is the 'effect' of changing $\mathit{Weight}$. What if we are really interested in the 'effect' of $\mathit{ExcessWeight}$? Maybe we should replace $\mathit{Weight}$ with $\mathit{ExcessWeight}$. Let's suppose that

$$\mathit{ExcessWeight} = \mathit{Weight} - (\phi_0 + \phi_1 \mathit{Height}) \qquad (3)$$

where $\phi_0 + \phi_1 \mathit{Height}$ is the 'normal' $\mathit{Weight}$ for a given $\mathit{Height}$. What happens if we fit the model

$$E(\mathit{Health}) = \gamma_0 + \gamma_1 \mathit{ExcessWeight} + \gamma_2 \mathit{Height} \qquad (4)$$

where we now expect $\gamma_1$ to be the effect of $\mathit{ExcessWeight}$? If we substitute the definition of $\mathit{ExcessWeight}$ in (3) into (4), we get:
$$
\begin{aligned}
E(\mathit{Health}) &= \gamma_0 + \gamma_1 \mathit{ExcessWeight} + \gamma_2 \mathit{Height} \\
&= \gamma_0 + \gamma_1\big(\mathit{Weight} - (\phi_0 + \phi_1 \mathit{Height})\big) + \gamma_2 \mathit{Height} \\
&= (\gamma_0 - \gamma_1\phi_0) + \gamma_1 \mathit{Weight} + (\gamma_2 - \gamma_1\phi_1)\,\mathit{Height}
\end{aligned}
$$

and, using the fact that identical linear functions have identical coefficients:

$$\gamma_0 = \beta_0 + \beta_1\phi_0, \qquad \gamma_1 = \beta_1, \qquad \gamma_2 = \beta_2 + \beta_1\phi_1$$

The coefficient we expected to change is the only one that stays the same! This illustrates the importance of thinking of $\beta_1$ in (2) as the effect of changing $\mathit{Weight}$ keeping $\mathit{Height}$ constant, which turns out to be the same thing as changing $\mathit{ExcessWeight}$ keeping $\mathit{Height}$ constant, as long as $\mathit{ExcessWeight}$ can be defined as a linear function of $\mathit{Weight}$ and $\mathit{Height}$. In multiple regression, the other variables in the model are as important in defining the meaning of a coefficient as the variable to which the coefficient is attached. Often, the major role of the target variable is that of providing a measuring stick for units. This fact will play an important role when we consider controlling for 'contextual variables.' See the appendix for an explanation of this phenomenon using the matrix formulation for regression.
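A quick simulation makes the invariance concrete. The following R sketch uses invented data, and the 'normal weight' line ($\phi_0$, $\phi_1$) is arbitrary; the point is that the coefficient of ExcessWeight is exactly the coefficient of Weight, while the intercept and the Height coefficient shift:

    # Invented data; phi0 and phi1 are an arbitrary 'normal weight' line.
    set.seed(1)
    n      <- 200
    Height <- rnorm(n, 170, 10)
    Weight <- 0.8 * Height + rnorm(n, 0, 8)
    Health <- 5 - 0.1 * Weight + 0.2 * Height + rnorm(n)

    phi0 <- -60
    phi1 <- 0.75
    ExcessWeight <- Weight - (phi0 + phi1 * Height)

    coef(lm(Health ~ Weight + Height))        # beta0, beta1, beta2
    coef(lm(Health ~ ExcessWeight + Height))  # the ExcessWeight coefficient
                                              # equals beta1 exactly

Since ExcessWeight is an exact linear function of Weight and Height, the two models span the same column space, so the fit is identical and only the parametrization changes.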
2.2 Visualizing multivariate variance

Understanding multivariate variance is important to unravel some mysteries of random coefficient models. With univariate random variables, it's generally easier to think in terms of the standard deviation, $SD = \sqrt{VAR}$. What is the equivalent of the SD for a bivariate random vector? The easiest way of thinking about this uses the standard 'concentration ellipse' (which may also be known as the 'dispersion ellipse' or the 'variance ellipse') for the bivariate normal. With a sample, we can define a standard sample data ellipse with the following properties:

• The centre of the 'standard ellipse' is at the point of means.
• The shadow of the ellipse onto either axis gives mean ± 1 SD.
• The shadow of the ellipse onto ANY axis gives mean ± 1 SD for a variable represented by projection onto the axis, i.e. any linear combination of $X_1$ and $X_2$.
• The vertical slice through the center of the ellipse gives the residual SD of $X_2$ given $X_1$. When $X_1$ and $X_2$ have a multivariate normal
distribution, this is also the conditional SD of $X_2$ given $X_1$.
• Mutatis mutandis for $X_1$ given $X_2$.
• The line through the points of vertical tangency is the regression line of $X_2$ on $X_1$.
• Mutatis mutandis for $X_1$ on $X_2$.

The fact that the regression line goes through the points of vertical tangency and NOT the corners of the rectangle containing the ellipse (NOR the principal axis of the ellipse) is the essence of the regression paradox. Indeed, the very word regression comes from the notion of 'regression to mediocrity' embodied in the fact that the regression line (i.e. the expected value of $X_2$ given $X_1$) is flatter than the 'SD line' (the line through the corners of the rectangle).

• The bivariate SD: Take any pair of conjugate axes on the ellipse as shown in the figure above. (In 2D, axes are conjugate if each axis is parallel to the tangent of the ellipse at the point where the other axis intersects the ellipse ... a picture is worth 26 words.) Make them column vectors in a 2 × 2 matrix and you have the 'square root' of the variance matrix, i.e. the SD for the bivariate random vector. The choice of conjugate axes is not unique – neither is the 'square root' of a matrix. Depending on your choice of conjugate axes, you can get various factorizations of the variance matrix: upper or lower Choleski, symmetric non-negative definite 'square root', principal components factorization (singular value decomposition), etc.
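The connection between the ellipse and the 'square root' of the variance matrix can be sketched in a few lines of R. The variance matrix below is illustrative; the lower Choleski factor L satisfies V = LL', and the image of the unit circle under L is the standard concentration ellipse (centred at the origin for simplicity):

    # Illustrative 2x2 variance matrix
    V <- matrix(c(16, 8,
                   8, 25), 2, 2)
    L <- t(chol(V))                  # lower Choleski factor: V = L %*% t(L)

    theta  <- seq(0, 2 * pi, length.out = 200)
    circle <- rbind(cos(theta), sin(theta))  # unit circle
    ell    <- t(L %*% circle)                # its image is the standard ellipse
    plot(ell, type = "l", asp = 1, xlab = "X1", ylab = "X2")
    # The two columns of L are one choice of conjugate axes; the symmetric
    # square root or an SVD-based factor would give a different, equally
    # valid, pair.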
2.3 Data example: Simpson and Robinson

We first introduce the data set we will use to illustrate concepts in multilevel modelling. We use an example from Bryk and Raudenbush (1992) in which mixed effects models are presented as hierarchical linear models. This seems to me to be the easiest way to present and develop the ideas. We will use a subset of the data used in their book. It consists of data on math achievement, SES, minority status and sex of students in 40 schools. The schools are classified as "public" or "catholic". One goal of the analysis is to study the relationship between math achievement and SES in different settings.

[diagram from Bryk and Raudenbush, 1992]
Let's just consider the relationship between math achievement, Y, and SES, X. Consider three ways (we'll see more later) of estimating the relationship between X and Y:

1) An ecological regression of the individual school means of Y on the individual school means of X.
2) Pooled regression of Y on X ignoring school entirely. This estimates what is sometimes called the marginal relationship between Y and X.
3) A regression of Y on X adjusting for school, i.e. including school as a categorical variable or, equivalently, first 'partialing out' the effect of schools. This estimates the conditional relationship between Y and X.

Robinson's paradox refers to the fact that 1) and 3) can have opposite signs. Simpson's paradox refers to the fact that 2) and 3) can also have opposite signs. In fact, estimate 2) will lie between 1) and 3). How much closer it is to 1) than to 3) depends on the variance of X and Y within schools.

2.4 Matrix formulation of regression

$Y = X\beta + \varepsilon$ is such a universal and convenient shorthand that we need to spell out what it means and how it is used. Consider the equation for a single observation, assuming two $X$ variables:

$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad i = 1, \ldots, N$$

with $\varepsilon_i$ iid $N(0, \sigma^2)$. We pile these equations one on top of the other:
$$
\begin{aligned}
Y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \varepsilon_1 \\
Y_2 &= \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \varepsilon_2 \\
&\;\;\vdots \\
Y_i &= \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \\
&\;\;\vdots \\
Y_N &= \beta_0 + \beta_1 x_{1N} + \beta_2 x_{2N} + \varepsilon_N
\end{aligned}
$$

Note that the $\beta$s remain the same from line to line but the $Y$s, $x$s and $\varepsilon$s change. Using vectors and matrices and exploiting the rules for multiplying matrices:

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{1N} & x_{2N} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}
$$

or, in short-hand:

$$Y = X\beta + \varepsilon$$

In multilevel models with, say, $J$ schools indexed by $j = 1, \ldots, J$, and with the $j$th school having $n_j$ students, we block students of the same school together. We can then write for the $j$th school:

$$
\begin{bmatrix} Y_{1j} \\ Y_{2j} \\ \vdots \\ Y_{n_j j} \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11j} & x_{21j} \\ 1 & x_{12j} & x_{22j} \\ \vdots & \vdots & \vdots \\ 1 & x_{1 n_j j} & x_{2 n_j j} \end{bmatrix}
\begin{bmatrix} \beta_{0j} \\ \beta_{1j} \\ \beta_{2j} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{1j} \\ \varepsilon_{2j} \\ \vdots \\ \varepsilon_{n_j j} \end{bmatrix}
$$
or, in short hand:

$$Y_j = X_j \beta_j + \varepsilon_j$$

We can stack schools on top of each other. If all schools are assumed to have the same value for $\beta$, i.e. $\beta_j = \beta$, then we can stack the $X$s vertically:

$$
\begin{bmatrix} Y_1 \\ \vdots \\ Y_j \\ \vdots \\ Y_J \end{bmatrix}
=
\begin{bmatrix} X_1 \\ \vdots \\ X_j \\ \vdots \\ X_J \end{bmatrix}
\beta
+
\begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_j \\ \vdots \\ \varepsilon_J \end{bmatrix}
$$

or, in shorter form: $Y = X\beta + \varepsilon$.

If the $\beta_j$s are different, we need to stack the $X_j$s diagonally:

$$
\begin{bmatrix} Y_1 \\ \vdots \\ Y_j \\ \vdots \\ Y_J \end{bmatrix}
=
\begin{bmatrix}
X_1 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & X_j & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & X_J
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_J \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_j \\ \vdots \\ \varepsilon_J \end{bmatrix}
$$

or, in shorter form: $Y = X\beta + \varepsilon$.

Now, you know why you see $Y = X\beta + \varepsilon$ so often!
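As a small illustration, here is how the vertical and diagonal stacking look in R for two tiny invented 'schools' (the numbers are arbitrary):

    # Two tiny invented 'schools'
    X1 <- cbind(1, x = c(0.1, 0.5, 0.9))   # n1 = 3
    X2 <- cbind(1, x = c(-0.2, 0.4))       # n2 = 2

    # Common beta for all schools: stack the X's vertically
    rbind(X1, X2)

    # A separate beta_j in each school: stack the X's diagonally
    rbind(cbind(X1, matrix(0, 3, 2)),
          cbind(matrix(0, 2, 2), X2))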
Something that gets used over and over again is the fact that, if $\varepsilon \sim N(0, \sigma^2 I)$, then the OLS (ordinary least-squares) estimator is

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$

with variance $\sigma^2 (X'X)^{-1}$.

If the components of $\varepsilon$ are not iid but $\varepsilon \sim N(0, \Sigma)$, where $\Sigma$ is a known variance matrix (or, at least, known up to a proportional factor), then the GLS (generalized least-squares) estimator is

$$\hat{\beta}_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$$

with variance $(X'\Sigma^{-1}X)^{-1}$.

3 General regression analysis

3.1 Detailed description of data set

For multilevel modeling we will use a subset of a data set presented at length in Bryk and Raudenbush (1992). The major variables are math achievement (mathach) and socio-economic status (ses) measured for each student in a sample of high school students in each of 40 selected schools in the U.S. Each school belongs to one of two sectors: 25 are Public schools and 15 are Catholic schools. There are 1,720 students in the sample. The sample size from each school ranges from 14 students to 61 students. The data are available as a text file. The following is a listing of the first 50 lines of the file:

       school minority female    ses mathach size sector meanses
    1    1317        0      1  0.062  18.827  455      1   0.351
    2    1317        0      1  0.132  14.752  455      1   0.351
    3    1317        0      1  0.502  23.355  455      1   0.351
    4    1317        0      1  0.542  18.939  455      1   0.351
    5    1317        0      1  0.582  15.784  455      1   0.351
    6    1317        0      1  0.702  13.677  455      1   0.351
    7    1317        0      1  0.812  20.236  455      1   0.351
    8    1317        0      1  0.882  12.862  455      1   0.351
    9    1317        0      1  0.992  11.621  455      1   0.351
    10   1317        0      1  1.122   4.508  455      1   0.351
    11   1317        0      1  1.372  20.748  455      1   0.351
    12   1317        0      1 -0.098  11.064  455      1   0.351
    13   1317        0      1 -0.578  11.502  455      1   0.351
    14   1317        1      1  0.082  13.373  455      1   0.351
    15   1317        1      1  0.122   7.142  455      1   0.351
    16   1317        1      1  0.132  18.362  455      1   0.351
    17   1317        1      1  0.132  20.261  455      1   0.351
    18   1317        1      1  0.272   3.220  455      1   0.351
    19   1317        1      1  0.302   6.973  455      1   0.351
    20   1317        1      1  0.322  10.394  455      1   0.351
    21   1317        1      1  0.362  21.405  455      1   0.351
    22   1317        1      1  0.472   9.257  455      1   0.351
    23   1317        1      1  0.482  12.283  455      1   0.351
    24   1317        1      1  0.482  21.405  455      1   0.351
    25   1317        1      1  0.492   8.382  455      1   0.351
    26   1317        1      1  0.612  10.956  455      1   0.351
    27   1317        1      1  0.642  17.246  455      1   0.351
    28   1317        1      1  0.722  11.027  455      1   0.351
    29   1317        1      1  0.832   4.810  455      1   0.351
    30   1317        1      1  0.932   8.961  455      1   0.351
    31   1317        1      1  0.942  23.736  455      1   0.351
    32   1317        1      1  0.952   9.337  455      1   0.351
    33   1317        1      1  0.972  11.794  455      1   0.351
    34   1317        1      1  1.152  20.039  455      1   0.351
    35   1317        1      1  1.462  10.661  455      1   0.351
    36   1317        1      1 -0.008  10.066  455      1   0.351
    37   1317        1      1 -0.008  11.290  455      1   0.351
    38   1317        1      1 -0.028   7.031  455      1   0.351
    39   1317        1      1 -0.068  17.869  455      1   0.351
    40   1317        1      1 -0.088   8.057  455      1   0.351
    41   1317        1      1 -0.108  10.121  455      1   0.351
    42   1317        1      1 -0.108  10.493  455      1   0.351
    43   1317        1      1 -0.108  17.203  455      1   0.351
    44   1317        1      1 -0.158   4.756  455      1   0.351
    45   1317        1      1 -0.258  15.555  455      1   0.351
    46   1317        1      1 -0.288  21.405  455      1   0.351
    47   1317        1      1 -0.848  11.531  455      1   0.351
    48   1317        1      1 -1.248   8.253  455      1   0.351
    49   1374        0      0  0.322  16.663 2400      0  -0.007
    50   1374        0      0  0.362  24.041 2400      0  -0.007

The first 48 lines are students belonging to the school labelled 1317. The last 2 lines are the first cases of the second school in the sample, labelled 1374. The remaining variables are:

minority: 1 if a student belongs to a minority group
female: 1 if the student is female (note that school 1317 appears to be a girls' school)
ses: a roughly standardized measure of socio-economic status
mathach: performance on a math achievement test
size: the size of the school
sector: 1 if the school is Catholic, 0 if Public
meanses: a measure of the mean ses in the school – not just the mean ses of the sample

The uniform quantile plot of each variable gives a good snapshot of the data. Think of lining up each variable from shortest to tallest and plotting the result:
[Figure 1: Uniform quantile plots of high school data. Panels: school, minority, female, ses, mathach, size, sector, meanses and school id; N = 1720, no missing values.]
[Figure 2: "Trellis" plot of high school data with least-squares lines for each school: mathach against ses, one panel per school.]
[Figure 3: Trellis plot of high school data with sex of students: mathach against ses by school, with male and female students distinguished.]
First suppose we are dealing with only one school. Let $Y_i$ and $X_i$ be the math achievement score and SES respectively of the $i$th student. We can formulate the model:

$$Y_i = \beta_0 + \beta_1 X_i + r_i$$

where $\beta_1$ is the average change in math achievement for a unit change in SES, $\beta_0$ is the expected math achievement at SES = 0, and $r_i$ is the random deviation of the $i$th student from the expected (linear) pattern. We assume the $r_i$ to be iid $N(0, \sigma^2)$.

3.2 Looking at school 4458

[Figure: MathAch against SES for Public School 4458 with the least-squares line.]

    > summary(lm(MathAch ~ SES, zz1))
    Coefficients:
                  Value Std. Error t value Pr(>|t|)
    (Intercept) 6.9992     1.2380  5.6534   0.0000
    SES         1.1318     1.0074  1.1235   0.2671

    Residual standard error: 4.463 on 46 degrees of freedom
    Multiple R-Squared: 0.02671
[Figure: P4458, estimate of intercept and slope shown as a point in beta space with its confidence ellipse.]

Estimated coefficients in "beta" space:
• Shadows are ordinary 95% confidence intervals.
• The shadow on the horizontal axis is a 95% confidence interval for the true slope.
• Note that the interval includes $\beta_1 = 0$, so we would accept $H_0: \beta_1 = 0$.

3.3 Model to compare two schools

We suppose that the relationship between math achievement and SES might be different in two schools. We can index the $\beta$s by school:

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$

where $j = 1, 2$ and $i = 1, \ldots, n_j$, where $n_j$ is the number of students measured in school $j$. Again, we assume the $r_{ij}$ to be iid $N(0, \sigma^2)$.

We can fit this model and ask various questions:

1. Do the two schools have the same structure, i.e. are the $\beta$s the same?
2. Perhaps the intercepts are different, with one school "better" than the other, but the slopes are the same.
3. Perhaps the slopes are different but one school is better than the other over the relevant range of X.
4. Maybe there's an essential interaction, with one school better for high X and the other for low X – i.e. the two lines cross over the observed range of Xs.
5. We could also allow different $\sigma^2$s and test equality of variances (conditional given X).
6. We should also be ready to question the linearity implied by the model.

The following regression output allows us to answer some of these questions. If it seems reasonable, we could refit a model without an interaction term. Note that inferences based on ordinary regression only generalize to new samples of students from the same schools. They do not justify statements comparing the two sectors.
Comparing 2 schools: P4458 and C1433

[Figure: MathAch against SES for schools P4458 and C1433 with a least-squares line for each school.]

    > summary(lm(MathAch ~ SES * Sector, dataset))
    Coefficients:
                   Value Std. Error t value Pr(>|t|)
    (Intercept)  6.9992     1.1618  6.0244   0.0000
    SES          1.1318     0.9454  1.1972   0.2348
    Sector      11.3997     1.6096  7.0823   0.0000
    SES:Sector   0.7225     1.5340  0.4710   0.6390

    Residual standard error: 4.188 on 79 degrees of freedom
    Multiple R-Squared: 0.7418
    F-statistic: 75.67 on 3 and 79 degrees of freedom, the p-value is 0

    Correlation of Coefficients:
               (Intercept)     SES  Sector
    SES             0.8540
    Sector         -0.7218 -0.6164
    SES:Sector     -0.5263 -0.6163 -0.0410

    > fit$contrasts
    $Sector:
             Catholic
    Public          0
    Catholic        1
3.4 Comparing two Sectors

One possible approach: a two-stage analysis using 'derived variables':

Stage 1: Perform a regression in each school to get $\hat{\beta}_j$ for each school $j$.
Stage 2: Analyze the $\hat{\beta}_j$s from each school using multivariate methods to estimate the mean line in each sector, and use multivariate methods to compare the two estimated mean lines.
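A rough R sketch of the two stages, assuming the data have been read into a data frame hs with variables mathach, ses, school and sector (the variable names are assumptions based on the listing in section 3.1):

    # Stage 1: a separate least-squares fit in each school
    beta.hat <- t(sapply(split(hs, hs$school),
                         function(d) coef(lm(mathach ~ ses, data = d))))
    colnames(beta.hat) <- c("b0", "b.ses")

    # Stage 2: treat the beta-hats as data and compare the sectors
    sec <- tapply(as.character(hs$sector), hs$school, function(s) s[1])
    summary(lm(beta.hat ~ sec))   # two responses: intercepts and slopes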
[Figure 4: The left panel (data space) shows the mean lines from each sector. The right panel (beta space) shows each estimated intercept and slope from each school (i.e. the $\hat{\beta}_j$s). The large ellipses are the sample dispersion ellipses for the $\hat{\beta}_j$s from each sector. The smaller ellipses are 95% confidence ellipses for the mean lines from each sector.]
3.5 What's wrong with 3.4?

Essentially, it gives the least-squares line of each school equal weight in estimating the mean line for each sector. If every school sample had:

1) the same number of subjects,
2) the same set of values of SES, and
3) the same $\sigma^2$ (we've been tacitly assuming this anyways),

then giving each least-squares line the same weight would make sense. In this case, the shape of the confidence ellipses in beta space would be identical for every school and the 2-stage procedure would give the same result as a classical two-level ANOVA.

This data has different N's and different sets of values of SES in each school. The 2-stage process ignores the fact that some $\hat{\beta}_j$'s are measured with greater precision than others (e.g. if the $j$th school has a large N or a large spread in SES, its $\hat{\beta}_j$ is measured with more precision). We could attempt to correct this problem by using weights based on

$$\mathrm{Var}(\hat{\beta}_j \mid \beta_j) = \sigma^2 (X_j'X_j)^{-1}$$

but this is the variance of $\hat{\beta}_j$ as an estimate of the true $\beta_j$ for the $j$th school, NOT as an estimate of the underlying $\beta$ for the sector. We would need to use weights based on

$$\mathrm{Var}(\hat{\beta}_j) = V_j = T + \sigma^2 (X_j'X_j)^{-1}$$

but we don't know $V_j$ without knowing $\sigma^2$ and $T$.

Another approach would be to use ordinary least-squares with School as a categorical predictor. This has other drawbacks. Comparing sectors is difficult. We can't just include a "Sector" effect because it would be collinear with Schools. We could construct a Sector contrast, but the confidence intervals and tests would not generalize to new samples of schools, only to new samples of students from the same sample of schools.
4 The Hierarchical Model

We develop the ideas for mixed and multilevel modeling in two stages:

1. Multilevel models as presented in Bryk and Raudenbush (1992), in which the unobserved parameters at the lower level are modeled at the higher level. This is the representation used in HLM, the software developed by Bryk and Raudenbush, and, to a limited extent, in MLwiN.
2. Mixed models, in which the levels are combined into a single equation with two parts: one for 'fixed effects' and the other for 'random effects.' This is the form used in SAS and in many other packages.

Although the former is more complex, it seems more natural and provides a more intuitive approach to the models. We will use the high school Math Achievement data mentioned above as an extensive example. We think of our data as structured in two levels: students within schools and between schools. We also have two types of predictor variables:

1. within-school variables: individual student variables such as SES, female, and minority status. These variables are also known by many other names, e.g. inner variables, micro variables, level-1 variables¹, or time-varying variables in the longitudinal context.
2. between-school variables: Sector (Catholic or Public), meanses, size. These variables are also known as outer variables, macro variables, level-2 variables, or time-invariant variables in a longitudinal context.

A between-school variable can be created from a within-school variable by taking the average of the within-school variable within each school. Such a derived between-school variable is known as a 'contextual' variable. Of course, such a variable is useful only if the average differs from school to school. Balanced data, in which the set of values of within-school variables would be the same in each school, does not give rise to contextual variables.

¹ In some hierarchical modeling traditions the numbering of levels is reversed, going from the top down instead of from the bottom up. One needs to check which approach an author is using.
Basic structure of the model:

1. Each school has a true regression line that is not directly observed.
2. The observations from each school are generated by taking random observations generated with the school's true regression line.
3. The true regression lines come from a population or populations of regression lines.

4.1 Within School model

For school $j$ (for now we suppose all schools come from the same population, e.g. only one Sector):

1) There is a true but unknown

$$\beta_j = \begin{bmatrix} \beta_{0j} \\ \beta_{1j} \end{bmatrix} = \begin{bmatrix} \beta_{0j} \\ \beta_{SES,j} \end{bmatrix}$$

for each school.

2) The data are generated as

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2) \text{ independent of the } \beta_j\text{'s}$$

4.2 Between School model

We start by supposing that the $\beta_j$ are sampled from a single population of schools. In vector notation:
$$\beta_j = \gamma + u_j, \qquad u_j \sim N(0, T)$$

where

$$T = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$

is a variance matrix. Writing out the elements of the vectors:

$$
\beta_j = \begin{bmatrix} \beta_{0j} \\ \beta_{1j} \end{bmatrix}
= \begin{bmatrix} \gamma_0 \\ \gamma_1 \end{bmatrix}
+ \begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix},
\qquad
\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix}
\sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix} \right)
$$

Note:

$$\mathrm{Var}(\beta_{0j}) = \tau_{00}, \qquad \mathrm{Var}(\beta_{1j}) = \tau_{11}, \qquad \mathrm{Cov}(\beta_{0j}, \beta_{1j}) = \tau_{10} = \tau_{01}$$

4.3 A simulated example

To generate an example we need to do something with SES, although its distribution is not part of the model. In the model the values of SES are taken as given constants. We will take:

$$\gamma = \begin{bmatrix} 12 \\ 2 \end{bmatrix}, \qquad T = \begin{bmatrix} 16 & 8 \\ 8 & 25 \end{bmatrix}, \qquad \sigma^2 = 20$$

Once we have generated $\beta_j$, we generate $N_j \sim \mathrm{Poisson}(30)$ and $SES \sim N(0, 1)$.

Here's our first simulated school in detail:
For $j = 1$:

$$u_j = \begin{bmatrix} -6.756 \\ -6.612 \end{bmatrix}, \qquad
\beta_j = \gamma + u_j = \begin{bmatrix} 12 \\ 2 \end{bmatrix} + \begin{bmatrix} -6.756 \\ -6.612 \end{bmatrix} = \begin{bmatrix} 5.244 \\ -4.612 \end{bmatrix}$$

$$N_j = 23$$

SES:
-1.05 -0.78 1.05 -1.01 0.77 1.85 0.87 -1.18 0.18 2.08 -1.14 -1.71 -0.64 -0.41 0.86 1.29 0.04 0.23 0.90 0.50 -2.10 -1.89 0.38

$\varepsilon_j$:
4.46 -0.73 0.30 7.63 -7.03 1.20 -6.23 -4.66 6.17 0.75 -1.43 0.46 3.64 -2.39 2.24 2.60 3.96 0.71 -3.74 3.30 4.42 -4.59 -3.61

$Y_{ij} = \beta_{0j} + \beta_{1j} SES_{ij} + \varepsilon_{ij}$:
14.53 8.09 0.70 17.56 -5.34 -2.10 -4.99 6.03 10.58 -3.59 9.09 13.57 11.83 4.75 3.51 1.88 9.00 4.91 -2.66 6.24 19.37 9.38 -0.13
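The same recipe can be written as a short R sketch (MASS::mvrnorm draws $u_j$ from $N(0, T)$; the seed is arbitrary, so the numbers will not reproduce the listing above):

    library(MASS)                        # for mvrnorm
    set.seed(101)                        # arbitrary seed

    gamma  <- c(12, 2)
    T.mat  <- matrix(c(16, 8,
                        8, 25), 2, 2)
    sigma2 <- 20

    n.j  <- rpois(1, 30)                 # school size ~ Poisson(30)
    u.j  <- mvrnorm(1, c(0, 0), T.mat)   # random intercept and slope
    beta <- gamma + u.j                  # this school's true line
    ses  <- rnorm(n.j)                   # SES ~ N(0, 1)
    y    <- beta[1] + beta[2] * ses + rnorm(n.j, 0, sqrt(sigma2))

    coef(lm(y ~ ses))                    # least-squares line for this school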
[Figure 5 (data space): Simulation: mean population regression line $\gamma$.]
[Figure 6 (data space): Simulated school: true regression line in School 1, $\beta_j = \gamma + u_j$.]
[Figure 7 (data space): School 1 regression line with data generated by $Y_{ij} = \beta_{0j} + \beta_{1j}\,SES_{ij} + \varepsilon_{ij}$.]
[Figure 8 (data space): Simulated school: true regression line $\beta_j$, data, and least-squares line $\hat{\beta}_j$.]
[Figure 9 (beta space): Simulated school in beta space with the true mean line represented by a point.]
[Figure 10 (beta space): Simulated school: population mean line in beta space with the dispersion ellipse with matrix $T$ for random slopes and intercepts. Note that the shadows of the ellipse yield the mean plus or minus 1 standard deviation.]
[Figure 11 (beta space): A random 'true' intercept and slope from the population. This one happens to be somewhat atypical but not wholly implausible.]
[Figure 12 (beta space): 'True' intercept and slope with the dispersion ellipse with matrix $\sigma^2 (X_j'X_j)^{-1}$ for $\hat{\beta}_j$.]
[Figure 13 (beta space): Observed value of $\hat{\beta}_j$.]
[Figure 14 (beta space): The blue dispersion ellipse with matrix $V_j = T + \sigma^2 (X_j'X_j)^{-1}$ is almost coincident with the dispersion ellipse with matrix $T$.]
Note that with smaller $N_j$, larger $\sigma^2$, or smaller dispersion for SES, the dispersion ellipse for the true $\beta_j$ (with matrix $T$) and the dispersion ellipse for $\hat{\beta}_j$ as an estimate of $\gamma$ (with matrix $V_j = T + \sigma^2(X_j'X_j)^{-1}$) could differ much more than they do here. Also note that the statistical design of the study can make $\sigma^2(X_j'X_j)^{-1}$ smaller but, typically, not $T$.

4.4 Between-School Model: What $\gamma$ means

Instead of supposing that we have a single population of schools, we now add the between-school model that will allow us to suppose that there are two populations of schools, Catholic and Public, and that the population mean slope and intercept may be different in the two sectors. Let $W_j$ represent the between-school sector variable that is the indicator variable for Catholic schools²: $W_j$ is equal to 1 if school $j$ is Catholic and 0 if it is Public.

² Between-school variables are not limited to indicator variables. Any variable suitable as a predictor in a linear model could be used as long as it is a function of schools, i.e. has the same value for every subject within each school.

We have two regression models, one for the intercepts and one for the slopes:

$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}$$

We can work out the following interpretation of the $\gamma$ coefficients by setting $W_j$ to 0 for Public schools and then to 1 for Catholic schools. The interpretation is analogous to that of the ordinary regression to compare two schools, except that we are now comparing the two sectors.

In Public schools:

$$\beta_{0j} = \gamma_{00} + \gamma_{01} \times 0 + u_{0j} = \gamma_{00} + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} \times 0 + u_{1j} = \gamma_{10} + u_{1j}$$

In Catholic schools:
$$\beta_{0j} = \gamma_{00} + \gamma_{01} \times 1 + u_{0j} = \gamma_{00} + \gamma_{01} + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} \times 1 + u_{1j} = \gamma_{10} + \gamma_{11} + u_{1j}$$

Thus:

1. $\gamma_{00}$ is the mean achievement intercept for Public schools, i.e. the mean achievement when SES is 0.
2. $\gamma_{00} + \gamma_{01}$ is the mean achievement intercept for Catholic schools, so that $\gamma_{01}$ is the difference in mean intercepts between Catholic and Public schools.
3. $\gamma_{10}$ is the mean slope in Public schools.
4. $\gamma_{10} + \gamma_{11}$ is the mean slope in Catholic schools, so that $\gamma_{11}$ is the mean difference in (or difference in mean) slopes between Catholic and Public schools.
5. $u_{0j}$ is the unique "effect" of school $j$ on the achievement intercept, conditional given $W_j$.
6. $u_{1j}$ is the unique "effect" of school $j$ on the slope, conditional given $W_j$.

Now, $u_{0j}$ and $u_{1j}$ are Level 2 random variables (random effects) which we assume to have 0 mean and variance-covariance matrix:

$$T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{pmatrix}$$

As mentioned above, this is a multivariate model with the complication that the dependent variables, $\beta_{0j}$ and $\beta_{1j}$, are not directly observable. One way to proceed would be to use a two-stage process:

1. estimate $\beta_{0j}, \beta_{1j}$ with least-squares within each school, and
2. use the estimated values in a Level-2 analysis with the model above.

Some problems with this approach are:

1. Each $\hat{\beta}_{0j}, \hat{\beta}_{1j}$ might have a different variance due to differing $n_j$s and different predictor matrices $X_j$ in each school. A Level-2 analysis that uses OLS will not take these factors into consideration.
2. Even if $X_j$ (and thus $n_j$) is the same for each school, we might be interested in getting information on $T$ itself, not on $\mathrm{Var}(\hat{\beta}_j) = T + \sigma^2 (X_j'X_j)^{-1}$.
3. $\hat{\beta}_{0j}, \hat{\beta}_{1j}$ might be reasonable estimates of the 'parameters' $\beta_{0j}$ and $\beta_{1j}$ but, as 'estimators' of the random variables $\beta_{0j}$ and $\beta_{1j}$, they ignore the information contained in the distribution of $\beta_{0j}$ and $\beta_{1j}$.
4. Some level-1 models might not be estimable, so information from these schools would be lost.

5 Combined (composite) model

5.1 From the multilevel to the combined form

Since

$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}$$

we combine the models by substituting the between-school model above into the within-school model:

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$

We then rearrange the terms to separate fixed parameters from random coefficients:
$$
\begin{aligned}
Y_{ij} &= \beta_{0j} + \beta_{1j} X_{ij} + r_{ij} \\
&= (\gamma_{00} + \gamma_{01} W_j + u_{0j}) + (\gamma_{10} + \gamma_{11} W_j + u_{1j}) X_{ij} + r_{ij} \\
&= \gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij} \\
&\quad + u_{0j} + u_{1j} X_{ij} + r_{ij}
\end{aligned}
$$

The last two lines look like the sum of two linear models:

1) an ordinary linear model with coefficients that are fixed parameters:

$$\gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij}$$

with fixed parameters $\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}$, and

2) a linear model with random coefficients and an error term:

$$u_{0j} + u_{1j} X_{ij} + r_{ij}$$

with random 'parameters' $u_{0j}$ and $u_{1j}$.

Note the following:

1. The fixed model contains both outer variables and inner variables as well as an interaction between inner and outer variables. This kind of interaction is called a 'cross-level' interaction. It allows the effect of X to be different in each Sector.
2. The random effects model only contains an intercept and an inner variable. There are very arcane situations in which it might make sense to include an outer variable in the random effects portion of the model, which we will consider briefly later.
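This split into a fixed formula and a random formula maps directly onto how several packages are told about the model. As a sketch, a hypothetical nlme::lme() call for these data, assuming they have been read into a data frame hs with variables mathach, ses, a 0/1 sector, and school:

    library(nlme)
    # Fixed part:  gamma00 + gamma01*W + gamma10*X + gamma11*W*X
    # Random part: u0j + u1j*X within each school
    fit <- lme(fixed  = mathach ~ ses * sector,
               random = ~ 1 + ses | school,
               data   = hs)
    summary(fit)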
Understanding the connection between the multilevel model and the combined model is useful because some packages require the model to be specified in its multilevel form (e.g. MLwiN), while others require the model to be specified in its combined form as two models, the fixed effects model and the random effects model (e.g. SAS PROC MIXED, R and S-Plus lme() and nlme()).

5.2 GLS form of the model

Another way of looking at this model is to see it as a linear model with a complex form of error. Let $\delta_{ij}$ represent the combined error term, also known as the composite error term:

$$\delta_{ij} = u_{0j} + u_{1j} X_{ij} + r_{ij}$$

We can then write the model as:

$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij} + \delta_{ij}$$

This looks like an ordinary linear model except that the $\delta_{ij}$s are not identically $N(0, \sigma^2)$ and are not independent, since the same $u_{0j}$ and $u_{1j}$ contribute to the random error for all $\delta_{ij}$s in the $j$th school. If we let $\delta_j$ be the vector of errors in the $j$th school, we can express the distribution of the combined errors as follows:

$$\delta_j \sim N(0,\; Z_j T Z_j' + \sigma^2 I), \qquad \delta_j \text{ and } \delta_k \text{ independent for } j \neq k$$

where $Z_j$ is the matrix whose rows are $(1, X_{ij})$, i.e. the design matrix for the random part of the model (defined formally in the next section).

If $T$ and $\sigma^2$ were known, then the variance-covariance matrix of the random errors could be computed and the model fitted with Generalized Least-Squares (GLS). With $T$ and $\sigma^2$ unknown, we can iteratively estimate them and use the estimated values to fit the linear parameters, $\gamma$, by GLS. There are variants depending on the way in which $T$ and $\sigma^2$ are estimated. Using the full likelihood yields what is often called "IGLS," "ML," or "FIML." Using the conditional likelihood of residuals given $\hat{Y}$ yields "RIGLS" or "REML" (R for restricted or reduced).
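Continuing the hypothetical lme() sketch above, the two estimation variants are selected with the method argument (REML is the default in lme()):

    fit.ml   <- update(fit, method = "ML")    # full likelihood: IGLS/ML/FIML
    fit.reml <- update(fit, method = "REML")  # restricted: RIGLS/REML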
5.3 Matrix form

Take all the observations in school $j$ and assemble them into vectors and matrices (this is called the Laird-Ware formulation of the model, from Laird and Ware (1982)):

$$Y_j = X_j \gamma + Z_j u_j + r_j$$

where

$$
Y_j = \begin{bmatrix} Y_{1j} \\ Y_{2j} \\ \vdots \\ Y_{n_j j} \end{bmatrix}, \qquad
X_j = \begin{bmatrix}
1 & W_j & X_{1j} & W_j X_{1j} \\
1 & W_j & X_{2j} & W_j X_{2j} \\
\vdots & \vdots & \vdots & \vdots \\
1 & W_j & X_{n_j j} & W_j X_{n_j j}
\end{bmatrix}, \qquad
Z_j = \begin{bmatrix}
1 & X_{1j} \\
1 & X_{2j} \\
\vdots & \vdots \\
1 & X_{n_j j}
\end{bmatrix}
$$

$$
u_j = \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}, \qquad
\gamma = \begin{pmatrix} \gamma_{00} \\ \gamma_{01} \\ \gamma_{10} \\ \gamma_{11} \end{pmatrix}, \qquad
r_j = \begin{pmatrix} r_{1j} \\ r_{2j} \\ \vdots \\ r_{n_j j} \end{pmatrix}, \qquad
j = 1, \ldots, J
$$

The distribution of the random elements is: $u_j \sim N(0, T)$, $r_j \sim N(0, \sigma^2 I)$, with $u_j$ independent of $r_j$.

Now we put the school matrices together into big matrices:

$$Y = X\gamma + Zu + r$$

where

$$
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_J \end{bmatrix}, \qquad
X = \begin{bmatrix} X_1 \\ \vdots \\ X_J \end{bmatrix}, \qquad
u = \begin{bmatrix} u_1 \\ \vdots \\ u_J \end{bmatrix}, \qquad
r = \begin{bmatrix} r_1 \\ \vdots \\ r_J \end{bmatrix}
$$
$$
Z = \begin{bmatrix}
Z_1 & 0 & \cdots & 0 \\
0 & Z_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & Z_J
\end{bmatrix}
$$

with

$$
u \sim N\!\left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\begin{bmatrix}
T & 0 & \cdots & 0 \\
0 & T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & T
\end{bmatrix} \right)
$$

and $r \sim N(0, \sigma^2 I)$, which might be deceptive because the "$I$" is now much larger than before. The new block-diagonal matrix for the variance of $u$ is often written with the same symbol as the variance of $u_j$. To avoid confusion we can use $\ddot{T}$ for the block-diagonal version.

5.4 Notational Babel

Mixed models were simultaneously and semi-independently developed by researchers in many different disciplines, each developing its own notation. The notation we are using here is that of Bryk and Raudenbush (1992), which has been very influential in social research. Many publications use this notation. It differs from the notation used in SAS documentation, whose development was more influenced by seminal statistical work in animal husbandry. It is, of course, perfectly normal to fit models in SAS but to report findings using the notation in common use in the subject matter area. A short bilingual dictionary follows. Fortunately, $Y$, $X$ and $Z$ are used with the same meaning.
                                       Bryk and Raudenbush    SAS
    Fixed effects parameters           γ                      β
    Cluster random effect              β                      b
    Cluster random effect (centered)   u                      γ
    Variance of random effects         T                      G
    Within cluster error variance      Σ                      R

5.5 The GLS fit

With the matrix formulation of the model, it is easy to express the GLS estimator of $\gamma$. First denote:

$$V = \mathrm{Var}(\delta) = Z\ddot{T}Z' + \sigma^2 I$$

Then the GLS estimator is:

$$\hat{\gamma} = (X'V^{-1}X)^{-1} X'V^{-1} Y$$

We will see that the presence of $V^{-1}$ can result in an estimate that is very different from its OLS analogue.³

³ One ironic twist concerns small estimated values of $\sigma^2$. Normally this would be a cause for rejoicing; however, it can result in a nearly singular $V$. Although this need not imply that $X'V^{-1}X$ is nearly singular, some algorithms seem not to take advantage of this.
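To make the GLS step concrete, here is a self-contained R toy with a random intercept only, using assumed values of $\tau_{00}$ and $\sigma^2$ (in a real fit these would themselves be estimated iteratively):

    # Toy data: two clusters, random intercept only, assumed tau00 and sigma2
    set.seed(7)
    tau00  <- 4
    sigma2 <- 1
    n <- c(3, 2)
    g <- rep(1:2, n)                       # cluster labels
    X <- cbind(1, x = rnorm(sum(n)))       # fixed-effects design
    u <- rnorm(2, 0, sqrt(tau00))
    Y <- X %*% c(10, 2) + u[g] + rnorm(sum(n), 0, sqrt(sigma2))

    Z <- outer(g, 1:2, "==") + 0           # cluster indicator matrix
    V <- tau00 * Z %*% t(Z) + sigma2 * diag(sum(n))

    gamma.gls <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% Y)
    gamma.ols <- solve(t(X) %*% X, t(X) %*% Y)
    cbind(gls = gamma.gls, ols = gamma.ols)

Here $V = \tau_{00} ZZ' + \sigma^2 I$ is block diagonal, one block per cluster, which is what makes large problems computationally feasible.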
6 The simplest models

We have now built up the notation and some theory for a fairly general form of the linear mixed model with both inner and outer variables and a random effects model with a random intercept and a random slope. We will now consider the interpretation of simpler models in which we keep only some components of the more general model. Even when we are interested in the larger model, it is important to understand the simple 'sub-models' because they are used for hypothesis testing in the larger model. We will also consider some extensions of the concepts we have seen so far in the context of some of these simpler models.

6.1 One-way ANOVA with random effects

This is the simplest random effects model and provides a good starting point to illustrate the special characteristics of these models.

Level 1 model:
$$Y_{ij} = \beta_{0j} + r_{ij}$$

Level 2 model:
$$\beta_{0j} = \gamma_{00} + u_{0j}$$

Combined model:
$$Y_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$

$$\mathrm{Var}(Y_{ij}) = \mathrm{Var}(u_{0j}) + \mathrm{Var}(r_{ij}) = \tau_{00} + \sigma^2$$

Note the intraclass correlation coefficient:

$$\rho = \tau_{00} / (\tau_{00} + \sigma^2)$$

Also note that within each school:

$$E(\bar{Y}_{.j} \mid \beta_{0j}) = \beta_{0j}, \qquad \mathrm{Var}(\bar{Y}_{.j} \mid \beta_{0j}) = \frac{\sigma^2}{n_j}$$
but across the population:

$$E(\bar{Y}_{.j}) = \gamma_{00}, \qquad \mathrm{Var}(\bar{Y}_{.j}) = \tau_{00} + \frac{\sigma^2}{n_j}$$

This is an example of two very useful facts:

1. the unconditional (sometimes called 'marginal', but not by economists) mean is equal to the mean of the conditional mean, and
2. the unconditional variance is equal to the mean of the conditional variance plus the variance of the conditional mean, i.e.:

$$
\begin{aligned}
\mathrm{Var}(\bar{Y}_{.j}) &= E(\mathrm{Var}(\bar{Y}_{.j} \mid \beta_{0j})) + \mathrm{Var}(E(\bar{Y}_{.j} \mid \beta_{0j})) \\
&= \frac{\sigma^2}{n_j} + \mathrm{Var}(\beta_{0j}) \\
&= \frac{\sigma^2}{n_j} + \tau_{00}
\end{aligned}
$$

6.2 Estimating the one-way ANOVA model

There are three kinds of parameters that need to be estimated:

1. fixed effect parameters: in this case there is only one, $\gamma_{00}$;
2. variance-covariance components: $\tau_{00}$ and $\sigma^2$; and
3. random effects: $\beta_{0j}$ or, equivalently, $u_{0j}$ (which, combined with $\gamma_{00}$, gives $\beta_{0j}$).
We use a different approach for each type of parameter.

The fixed effects parameters are like linear regression parameters except that they are estimated from observations that are not independent. Instead of using OLS (ordinary least-squares) we use GLS (generalized least-squares), using the estimates of the variance-covariance components as the variance matrix in the GLS procedure.

The variance-covariance parameters are estimated using ML (maximum likelihood) or REML (restricted maximum likelihood). Note that each step above assumes that the other one has been completed. What really happens is that estimation goes back and forth between the two steps until convergence.

The random effects are not just parameters. They are realizations of random variables. This means that we have two sources of information about them: we can 'estimate' them from the observed data and we can 'guess' them from their distribution. Putting these two sources of information together is the essence of Bayesian estimation, or empirical Bayesian estimation, because the distribution of the random effects, determined by $T = [\tau_{00}]$, is estimated from the data and model. The random effects are predicted (in contrast with 'estimated') using EBLUPs (Empirical Best Linear Unbiased Predictors) with the empirical posterior expectation

$$E(\beta_{01}, \ldots, \beta_{0J} \mid Y_1, \ldots, Y_n)$$

i.e. the expected value of what is unknown given what is known. We will look at the estimation of the three types of parameters in detail in this example.

First we consider the analysis of the data using OLS, in which we treat $\beta_{01}, \ldots, \beta_{0J}$ as non-random parameters. The coding of the school effect determines what is estimated by the intercept term $\beta_0$. It is a weighted linear combination of the $\beta_{0j}$s:

$$\psi_w = \sum_{j=1}^{J} w_j \beta_{0j}$$

If the coding uses "true" contrasts (each column of the coding matrix sums to 0), the weights are all equal to $1/J$ and $\psi_w$ is the ordinary mean of the $\beta_{0j}$s:
$$\psi_w = \frac{1}{J} \sum_{j=1}^{J} \beta_{0j}$$

In this case

$$\hat{\psi}_w = \frac{1}{J} \sum_{j=1}^{J} \bar{Y}_j = \bar{Y}_{\mathrm{Schools}}$$

With "sample size" coding, e.g.

                 V1     V2     V3    ...  V(J-1)
    School 1     n1     0      0     ...  0
    School 2     0      n2     0     ...  0
    School 3     0      0      n3    ...  0
    School 4     0      0      0     ...  0
    ...          ...    ...    ...   ...  ...
    School J-1   0      0      0     ...  n(J-1)
    School J     -n1    -n2    -n3   ...  -n(J-1)

each column of the design matrix sums to 0 and the intercept will estimate:

$$\psi_w = \frac{\sum_{j=1}^{J} n_j \beta_{0j}}{\sum_{j=1}^{J} n_j}$$

which weights each school according to its sample size. This can be thought of as the mean of the population of students instead of the population of schools. The estimator would be the overall average of $Y$:

$$\hat{\psi}_w = \frac{\sum_{j=1}^{J} n_j \bar{Y}_j}{\sum_{j=1}^{J} n_j} = \bar{Y}_{..} = \bar{Y}_{\mathrm{Students}}$$

We are not limited to these two obvious choices. A more appropriate set of weights could be school size, with coding:
                 V1     V2     V3    ...  V(J-1)
    School 1     s1     0      0     ...  0
    School 2     0      s2     0     ...  0
    School 3     0      0      s3    ...  0
    School 4     0      0      0     ...  0
    ...          ...    ...    ...   ...  ...
    School J-1   0      0      0     ...  s(J-1)
    School J     -s1    -s2    -s3   ...  -s(J-1)

The intercept would then estimate:

$$\psi_w = \frac{\sum_{j=1}^{J} s_j \beta_{0j}}{\sum_{j=1}^{J} s_j}$$

In each case the form of the estimate is a weighted mean of the individual school averages:

$$\hat{\psi}_w = \sum_{j=1}^{J} w_j \bar{Y}_j$$

with variance:

$$\mathrm{Var}(\hat{\psi}_w \mid \beta_{01}, \ldots, \beta_{0J}) = \sigma^2 \sum_{j=1}^{J} \frac{w_j^2}{n_j}$$

where the weights $w_j$ sum to 1. Note that the variance is minimized when the weights are proportional to $n_j$, i.e. $w_j = n_j/n$ where $n = \sum n_j$ is the total sample size. In this case the variance is $\sigma^2/n$. Thus, the student mean is the parameter estimated with the least variance.
6.2.1 Mixed model approach

With a mixed model we want to estimate $\gamma_{00}$ instead of a particular linear combination of the $\beta_{0j}$s. Any weighted mean $\hat{\psi}_w = \sum_j w_j \bar{Y}_j$ of the $\bar{Y}_j$s will be unbiased for $\gamma_{00}$ because

$$E(\hat{\psi}_w) = E\Big(\sum_j w_j \bar{Y}_j\Big) = \sum_j w_j E(\bar{Y}_j) = \sum_j w_j \gamma_{00} = \gamma_{00}$$

if the $w_j$s are weights with $\sum_j w_j = 1$.

Now, to calculate the variance of $\hat{\psi}_w$ as an estimator of $\gamma_{00}$, we first need the variance of $\bar{Y}_j$ as an estimator of $\gamma_{00}$ with $\beta_{0j}$ random:

$$\mathrm{Var}(\bar{Y}_j) = \tau_{00} + \sigma^2/n_j$$

Thus:

$$\mathrm{Var}(\hat{\psi}_w) = \sum_j w_j^2 \,(\tau_{00} + \sigma^2/n_j)$$

The optimal estimator is obtained by taking weights inversely proportional to $\tau_{00} + \sigma^2/n_j$.

Consider the implications:

1. If $\tau_{00}$ is much larger than $\sigma^2$, the weights will be nearly constant and $\hat{\psi}_w$ will be close to $\bar{Y}_{\mathrm{Schools}}$.
2. Conversely, if $\tau_{00}$ is much smaller than $\sigma^2$, the weights will be nearly proportional to $n_j$ and the estimator will be close to $\bar{Y}_{\mathrm{Students}}$.

If it is not reasonable to treat the $\beta_{0j}$s as a random sample from the same population (with $u_{0j} \sim N(0, \tau_{00})$), then these two estimators could estimate two quantities with very different meanings. Consider, for example, what would happen if there is a strong relationship between $\beta_{0j}$ and the $n_j$s. What gets estimated is governed by the ratio $\tau_{00}/\sigma^2$, a purely statistical consideration quite disconnected from any interpretation of the estimator. It is important to appreciate that your estimator is determined by considerations that might not be relevant.
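A small numeric sketch of the three estimators (the school means, school sizes, $\tau_{00}$ and $\sigma^2$ are all invented):

    tau00  <- 8
    sigma2 <- 40
    n.j    <- c(10, 20, 60)            # invented school sizes
    ybar   <- c(11, 13, 15)            # invented school means

    mean(ybar)                         # Y-bar(Schools): equal weights
    weighted.mean(ybar, n.j)           # Y-bar(Students): weights n_j
    w <- 1 / (tau00 + sigma2 / n.j)    # optimal mixed-model weights
    weighted.mean(ybar, w)             # lies between the two extremes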
β τ σ − 2 between and n s. What gets estimated is governed by the ratio 00 / purely statistical consideration quite disconnected a 0 j j from any interpretation of the estimator. It is important to appreciate that your estimator is determined by considerations that might not be relevant. In SAS, the (minimal) commands would be 4 : PROC MIXED DATA = MIXED.HS; CLASS SCHOOL; MODEL Y = ; RANDOM INTERCEPT / SUBJECT=SCHOOL; RUN; 6.3 EBLUPs This interesting topic can, alas, be skipped. It played a central role in the early development of mixed models for animal husbandry where an important practical problem was estimating the reproductive qualities of a bull from the characteristics of its progeny. In most applications of mixed models in the social sciences, the focus is on the estimation of the fixed parameters and much less so on the ‘prediction’ of the random effects. Estimating the u s involves using two sources of information: the data and their distribution as random variables. First consider the 0 j β OLS estimator for : 0 j ˆ β = Y 0 . j j γ σ are the ‘‘true’’ 2 Now, to get the Empirical Best Linear Unbiased Predictor of u s, we pretend that the estimated values of and 0 j 00 values and we calculate the conditional expectation of u s given y s. This is done most easily using the matrix formulation of the 0 j ij model and a formula for the conditional expectation in the multivariate case. We use partitioned matrices to express the joint distribution of Y and u : 4 To use the HS data set, download the self-extracting file from the course website. Save it in a convenient directory. Click on its icon to create the SAS data set HS.SD2. From SAS, create a library named MIXED that points to this directory. You can then use the data set using the syntax in this example. 58
$$\begin{bmatrix} Y \\ u \end{bmatrix} \sim N\left( \begin{bmatrix} X\gamma \\ 0 \end{bmatrix},\; \begin{bmatrix} Z\mathbf{T}Z' + \sigma^2 I & Z\mathbf{T} \\ \mathbf{T}Z' & \mathbf{T} \end{bmatrix} \right)$$

A 'well-known' formula gives:

$$\mathrm{E}(u \mid Y) = \mathbf{T}Z'V^{-1}(Y - X\gamma)$$

where $V = Z\mathbf{T}Z' + \sigma^2 I$. This formula, with a bit more mechanical work, will give us the EBLUP below, but we will derive it intuitively:

1. We could estimate $u_{0j}$ with the 'obvious' OLS estimate:
$$\hat u_{0j} = \hat\beta_{0j} - \hat\gamma_{00} = \bar Y_j - \hat\gamma_{00}$$
As an estimate of $u_{0j}$ this has variance $\sigma^2/n_j$.
2. We could also guess that $u_{0j}$ is equal to 0 (the mean of its distribution) and our guess would have variance $\tau_{00}$.

How can we 'best' combine these independent sources of information? By using weights proportional to inverse variance! This gives us the EBLUP of $u_{0j}$:

$$\tilde u_{0j} = \frac{\dfrac{1}{\sigma^2/n_j}\,\hat u_{0j} + \dfrac{1}{\tau_{00}}\cdot 0}{\dfrac{1}{\sigma^2/n_j} + \dfrac{1}{\tau_{00}}} = \frac{\hat u_{0j}}{1 + \dfrac{\sigma^2/n_j}{\tau_{00}}}$$

This has the effect of shrinking $\hat u_{0j}$ towards 0 by a factor of

$$\frac{\dfrac{1}{\sigma^2/n_j}}{\dfrac{1}{\sigma^2/n_j} + \dfrac{1}{\tau_{00}}} = \frac{1}{1 + \dfrac{\sigma^2/n_j}{\tau_{00}}}$$
Consider how the amount of shrinking depends on the relative values of $\sigma^2$, $\tau_{00}$ and $n_j$. There will be more shrinkage if:

1. $\tau_{00}$ is small: i.e. the distribution of $u_{0j}$ is known to be close to 0.
2. $\sigma^2$ is large: i.e. $\bar Y_j$ has large variation as an estimate of $\beta_{0j}$.
3. $n_j$ is small: ditto.

The EBLUP estimator of $\beta_{0j}$ (we'll call it $\tilde\beta_{0j}$) works exactly the same way, with the OLS estimator (analyzing each school separately) getting shrunk towards the overall estimator $\hat\gamma_{00}$. This is in exactly the same spirit as shrinkage estimators derived from Bayesian, Empirical Bayes or frequentist approaches. Bradley Efron and Carl Morris wrote an interesting article on the topic in Scientific American, Efron and Morris (1977).
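In PROC MIXED the EBLUPs can be obtained directly: the SOLUTION (S) option on the RANDOM statement prints the predicted random effects. A minimal sketch for the one-way model above:

/* Sketch: SOLUTION on the RANDOM statement prints the EBLUPs of the
   u_0j's, one per school                                            */
PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL Y = / SOLUTION;
   RANDOM INTERCEPT / SUBJECT = SCHOOL SOLUTION;
RUN;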
7 Slightly more complex models

7.1 Means as outcomes regression

Level 1 model:
$$Y_{ij} = \beta_{0j} + r_{ij}$$
Level 2 model:
$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$$
Combined model:
$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + u_{0j} + r_{ij}$$

Note that
$$\operatorname{Var}(Y_{ij}) = \operatorname{Var}(u_{0j}) + \operatorname{Var}(r_{ij})$$
as above but, in this model, $\operatorname{Var}(Y_{ij})$ is a conditional variance, conditional given $W$.

In SAS, the commands for the means as outcomes model would be:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL Y = W;
   RANDOM INTERCEPT / SUBJECT = SCHOOL;
RUN;

7.2 One-way ANCOVA with random effects

Level 1 model:
$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$
Level 2 model:
$$\beta_{0j} = \gamma_{00} + u_{0j}$$
$$\beta_{1j} = \gamma_{10}$$
Combined model:
$$Y_{ij} = \gamma_{00} + \gamma_{10} X_{ij} + u_{0j} + r_{ij}$$

In SAS, the commands for one-way ANCOVA with random effects are:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL Y = X;
   RANDOM INTERCEPT / SUBJECT = SCHOOL;
RUN;

7.3 Random coefficients model

Level 1 model:
$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$
Level 2 model:
$$\beta_{0j} = \gamma_{00} + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + u_{1j}$$
with:
$$\operatorname{Var}\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} = \mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$

Combined model:
$$Y_{ij} = \gamma_{00} + \gamma_{10} X_{ij} + u_{0j} + u_{1j} X_{ij} + r_{ij}$$

In SAS, the commands for the random coefficients model are:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL Y = X;
   RANDOM INTERCEPT X / SUBJECT = SCHOOL TYPE = UN;
RUN;

7.4 Intercepts and Slopes as outcomes

This corresponds to the full model presented in 4.4 above. The SAS commands for this model are:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL Y = X W X*W;
   RANDOM INTERCEPT X / SUBJECT = SCHOOL TYPE = UN;
RUN;

Note the X*W term. It is called a cross-level interaction. Its function is to allow the mean slope with respect to X to vary with W.
7.5 Nonrandom slopes

Consider the full model but with $\tau_{11} = 0$ (hence $\tau_{01} = 0$ also, otherwise $\mathbf{T}$ would not be a variance matrix). This is a model in which the variation in $\hat\beta_{1j}$ from school to school is wholly consistent with the expected variation within schools, so there is no need to postulate that $\tau_{11} > 0$.

It is left as an exercise to specify the SAS commands to produce this model.

7.6 Asking questions: CONTRAST and ESTIMATE statements

The CONTRAST and ESTIMATE statements in SAS allow you to ask specific questions about the model. We will use the 'intercepts and slopes as outcomes' model above and consider asking some specific questions such as: is there a significant difference between Public and Catholic schools at various values of SES, e.g. at some meaningful quantiles?

Proportion         10%      50%      90%
Quantile of SES   -0.889    0.192    1.162

There are a few steps that make it easier to ask a clear question:

1. Sketch the fitted model.
2. Identify what you want to ask on the sketch.
3. Transform your question into coefficients for terms in the model.
4. Write a suitable CONTRAST or ESTIMATE statement.

We will first apply this approach to the question: is there a significant difference between the two sectors at the 10% quantile of SES? Let's have a look at a summary of the data:

          SCHOOL  MINORITY  FEMALE     SES    MATHACH   SIZE  SECTOR  MEANSES
Min.        1317    0.0000   0.000  -2.3280    -2.832    153  0.0000  -0.7510
1st Qu.     2995    0.0000   0.000  -0.4180     7.782    493  0.0000  -0.1010
Median      5783    0.0000   1.000   0.1920    13.620   1068  0.0000   0.1985
Mean        5453    0.1779   0.561   0.1481    13.110   1070  0.4203   0.1541
3rd Qu.     8202    0.0000   1.000   0.7845    18.700   1523  1.0000   0.4480
Max.        9586    1.0000   1.000   1.8320    24.990   2403  1.0000   0.7590

Then we fit the 'intercepts and slopes as outcomes' model with random slopes and have a look at the estimated coefficients. We replace Y and X with the names of the variables in the data set.

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
   RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
RUN;

Note that we use the 'S' option in the MODEL statement to force printing of the estimated coefficients. Some of the output produced follows:

                 Solution for Fixed Effects
                          Standard
Effect        Estimate       Error      DF   t Value   Pr > |t|
Intercept      11.6997      0.4282    31.9     27.32     <.0001
SES             2.8919      0.2659    44.7     10.88     <.0001
SECTOR          2.4715      0.7033    31.2      3.51     0.0014
SES*SECTOR     -1.1004      0.4492    14.8     -2.45     0.0273

            Type 3 Tests of Fixed Effects
              Num    Den
Effect         DF     DF    F Value    Pr > F
SES             1   44.7     118.30    <.0001
SECTOR          1   31.2      12.35    0.0014
SES*SECTOR      1   14.8       6.00    0.0273

We specify the values of the X matrix for the two levels we want to compare:

                          Intercept     SES    SECTOR   SES*SECTOR
Cath at SES = -0.889          1       -0.889      1       -0.889
Public at SES = -0.889        1       -0.889      0        0
Difference                    0        0          1       -0.889
We can now modify PROC MIXED to get the desired estimate:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
   RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
   ESTIMATE 'Sector diff at 10-pctile of SES' SECTOR 1 SES*SECTOR -.889 / CL E;
RUN;

with partial output:

                                               Standard
Label                              Estimate       Error      DF   t Value   Pr > |t|
Sector diff at 10-pctile of SES      3.4497      0.8596    40.7      4.01     0.0003

                            Estimates
Label                              Alpha      Lower      Upper
Sector diff at 10-pctile of SES     0.05     1.7132     5.1862

Thus we see that the estimated MATHACH is 3.4497 higher in the Catholic sector than in the Public sector at the 10th percentile of SES.

Exercise 1: Estimate the difference at the 90th percentile of SES.

We now ask a question that requires 2 constraints: are the two lines identical? The significant values for the SES*SECTOR interaction and for the SECTOR main effect suggest that they are not! However, these two tests don't constitute a formal test of the equality of the two lines. We could test equality at two points, say, SES = 1 and SES = -1:

                      Intercept    SES   SECTOR   SES*SECTOR
Cath at SES = 1           1         1       1          1
Public at SES = 1         1         1       0          0
Difference                0         0       1          1
Cath at SES = -1          1        -1       1         -1
Public at SES = -1        1        -1       0          0
Difference                0         0       1         -1
To test two constraints simultaneously, we use the CONTRAST statement and separate the constraints with a comma:

PROC MIXED DATA = MIXED.HS;
   CLASS SCHOOL;
   MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
   RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
   CONTRAST 'No sector effect' SECTOR 1 SES*SECTOR 1,
                               SECTOR 1 SES*SECTOR -1 / E;
RUN;

Note that the CONTRAST statement does not take the CL option because it only performs a test; it does not produce confidence intervals.

Exercise 2: See if you get the same answer with different equivalent constraints: e.g. if the coefficients of SECTOR and SES*SECTOR are both 0 then the two lines are identical. Try the statement:

CONTRAST 'equivalent?' SECTOR 1, SES*SECTOR 1 / E;

Exercise 3: Writing ESTIMATE and CONTRAST statements is more challenging when a predictor is a CLASS variable. Turn SECTOR into a class variable by adding its name to the CLASS statement:

CLASS SCHOOL SECTOR;

and try to obtain the estimates and perform the tests in Exercises 1 and 2. Thinking in terms of rows of the X matrix is very useful for getting the right specification of coefficients.
8 A second look at multilevel models

We will look more deeply into multilevel models to see what our estimators are really estimating. We will learn about the potential bias of estimators when between-cluster and within-cluster effects are different, and we will see the role of contextual variables (e.g. cluster averages of inner variables) and within-cluster residuals (i.e. inner variables centered by cluster means).

8.1 What is a mixed model really estimating?

See "Handout M", which provides an introduction to the problem of bias in the mixed model parameter estimate when the between-subject and within-subject effects of a predictor are different. The problem can be solved in part by introducing a contextual variable.

8.2 Var(Y | X) and T

There are two ways of visualizing $\mathbf{T}$: directly as $\operatorname{Var}(u_j)$ or through its effect on the shape of the variance of $Y$ as a function of the predictor $X$. This relationship is more than a mathematical curiosity. It is important for an understanding of the interpretation of the components of $\mathbf{T}$ and of the meaning of hypotheses involving $\mathbf{T}$. We will consider the case of a single predictor and the case of two predictors.

8.2.1 Random slope model

It is easy to visualize how a random sample of lines would produce a graph with larger variance as we move to extreme values of X.
The model is:
$$Y_{ij} = \gamma_0 + \gamma_1 X_{ij} + u_{0j} + u_{1j} X_{ij} + \varepsilon_{ij}$$
with
$$\operatorname{Var}\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} = \mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$
and $\operatorname{Var}(\varepsilon_{ij}) = \sigma^2$. Thus
$$\operatorname{Var}(Y \mid X) = [1 \;\; X]\begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}\begin{bmatrix} 1 \\ X \end{bmatrix} + \sigma^2 = \tau_{00} + \sigma^2 + 2\tau_{01}X + \tau_{11}X^2$$
which is a quadratic function of X. Quadratic functions have straightforward properties that we can exploit to understand the meaning of $\mathbf{T}$. Using calculus we can show that the minimum variance occurs at $X = -\tau_{01}/\tau_{11}$. The minimum is $\tau_{00} - \tau_{01}^2/\tau_{11} + \sigma^2$. An alternative way of showing this avoids calculus by using the technique of 'completing the square':

$$\operatorname{Var}(Y \mid X) = \tau_{00} + \sigma^2 + 2\tau_{01}X + \tau_{11}X^2 = (\tau_{00} - \tau_{01}^2/\tau_{11} + \sigma^2) + \tau_{11}\bigl(X - (-\tau_{01}/\tau_{11})\bigr)^2$$

The squared factor $\tau_{11}\bigl(X - (-\tau_{01}/\tau_{11})\bigr)^2$ achieves its minimum of 0 when $X = -\tau_{01}/\tau_{11}$.

The default form of $\mathbf{T}$ for the RANDOM statement in SAS, e.g.

RANDOM INTERCEPT SES / SUBJECT = SCHOOL;

is a diagonal matrix, that is, a matrix in which $\tau_{01} = 0$. Now forcing $\tau_{01} = 0$ is equivalent to forcing the model to have minimum variance over the origin of X, which is in general a quite arbitrary assumption. Since the origin of X is usually arbitrary, the assumption $\tau_{01} = 0$ violates a principle of invariance: the fitted model should not change when something arbitrary is changed. A familiar example of the principle of invariance is the principle of marginality, which says that you shouldn't drop main effects that are included in interactions in the model: the interpretation of these main effects depends on the choice of origin or coding of the other interacting variables. Of course, if 0 is not an arbitrary origin, then the principle of invariance is not necessarily violated by considering the possibility that $\tau_{01} = 0$.
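A sketch of the remedy: ask for a covariance structure that leaves $\tau_{01}$ free, e.g.:

/* Sketch: TYPE = UN frees tau_01, so the point of minimum variance is
   estimated from the data rather than forced to X = 0                */
RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = UN;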
8.2.2 Two random predictors

With two random predictors we can visualize the 'iso-variance' contours in predictor space (see the additional notes entitled "Variance: Making T simpler").

$$\operatorname{Var}(Y \mid X_1, X_2) = [1 \;\; X_1 \;\; X_2]\begin{bmatrix} \tau_{00} & \tau_{01} & \tau_{02} \\ \tau_{10} & \tau_{11} & \tau_{12} \\ \tau_{20} & \tau_{21} & \tau_{22} \end{bmatrix}\begin{bmatrix} 1 \\ X_1 \\ X_2 \end{bmatrix} + \sigma^2$$

We can show that the minimum occurs at:

$$\begin{bmatrix} x_{1\min} \\ x_{2\min} \end{bmatrix} = -\begin{bmatrix} \tau_{11} & \tau_{12} \\ \tau_{21} & \tau_{22} \end{bmatrix}^{-1}\begin{bmatrix} \tau_{10} \\ \tau_{20} \end{bmatrix}$$

and the contour ellipses have the same shape as the variance ellipses for a distribution with variance

$$\begin{bmatrix} \tau_{11} & \tau_{12} \\ \tau_{21} & \tau_{22} \end{bmatrix}^{-1}$$

If we set $\tau_{12} = 0$, then the contour ellipses will not be tilted with respect to the $X_1$ and $X_2$ axes. This corresponds to an assumption that the values of $\beta_{1i}$ are independent of the values of $\beta_{2i}$, i.e. that the strength of the effect of $X_1$ is not related to the strength of the effect of $X_2$, a possibly reasonable hypothesis.

Setting $\tau_{12}$ to 0 also affords an opportunity to reduce the number of parameters in $\mathbf{T}$. With even larger random models it becomes interesting to consider $\mathbf{T}$ matrices of the form:

$$\mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} & \tau_{02} & \tau_{03} \\ \tau_{10} & \tau_{11} & 0 & 0 \\ \tau_{20} & 0 & \tau_{22} & 0 \\ \tau_{30} & 0 & 0 & \tau_{33} \end{bmatrix}$$

8.2.3 Interpreting Chol(T)

A very useful option for the RANDOM statement in PROC MIXED is 'GC', which prints the lower triangular Choleski root of the $\mathbf{T}$ (G in SAS) matrix. For a model with a single random slope we have:

$$\mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$
where $\sqrt{\tau_{00}}$ is the SD of the heights of the true regression lines when X = 0, $\sqrt{\tau_{11}}$ is the SD of the slopes, and $-\tau_{01}/\tau_{11}$ is the value of X where the regression lines have minimum SD.

The lower triangular Choleski root of $\mathbf{T}$ is a lower triangular matrix $C$ with non-negative diagonal elements that is a 'square root' of $\mathbf{T}$ in the sense that $CC' = \mathbf{T}$. The columns of $C$ are 'conjugate axes' of the variance ellipse. The elements of the matrix are:

$$C = \begin{bmatrix} \sqrt{\tau_{00}} & 0 \\ \tau_{01}/\sqrt{\tau_{00}} & \sqrt{\tau_{11} - \tau_{01}^2/\tau_{00}} \end{bmatrix}$$

whose elements may be interesting but are not highly interpretable. If we fit the model with the intercept last, however, we get more interesting results. If we specify (a fuller sketch follows below)

RANDOM SES INT / GC ... ;

the $\mathbf{T}$ matrix will be

$$\mathbf{T} = \begin{bmatrix} \tau_{11} & \tau_{10} \\ \tau_{01} & \tau_{00} \end{bmatrix}$$

and the lower triangular Choleski root:

$$C = \begin{bmatrix} c_{11} & 0 \\ c_{01} & c_{00} \end{bmatrix} = \begin{bmatrix} \sqrt{\tau_{11}} & 0 \\ \tau_{01}/\sqrt{\tau_{11}} & \sqrt{\tau_{00} - \tau_{01}^2/\tau_{11}} \end{bmatrix}$$

is a much more interesting matrix because its diagonal elements have interpretations that are invariant with respect to relocating X:

• $\sqrt{\tau_{11}}$ is the SD of the slopes;
• $\sqrt{\tau_{00} - \tau_{01}^2/\tau_{11}}$ is the SD of the heights of the true regression lines at the value of X where this SD is minimized;
• less directly interpretable but useful: $-c_{01}/c_{11} = -\tau_{01}/\tau_{11}$ is the value of X where the minimum occurs.
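A sketch of the full statement (the remaining options are whatever else the run needs, such as SUBJECT= and the FA0(2) parametrization used elsewhere in these notes):

/* Sketch: intercept listed last, with GC requesting Chol(G) so that the
   diagonal elements have the relocation-invariant interpretations above */
RANDOM SES INTERCEPT / SUBJECT = SCHOOL TYPE = FA0(2) GC;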
8.2.4 Recentering and balancing the model

If the value of X (or the values of $X_1$ and $X_2$ if there are two random slopes) where the minimum SD of the true regression lines occurs is quite far from the origin, we have seen that this induces a strong correlation in $\mathbf{T}$. If this is large enough, the matrix may be close to singular (a phenomenon analogous to multicollinearity in ordinary regression) and the optimization algorithm can fail to converge, or appear to converge but at a sub-optimal point. Recentering the Xs can help considerably with convergence and with the numerical reliability of estimates (a sketch follows below). Also, if the SDs for different slopes are of different orders of magnitude, rescaling the Xs to get similar SDs is desirable.

[See notes on using C to recenter and 'sphere' T.]

[See notes on freeing the RANDOM model from the fixed MODEL: we can center the FIXED model and the RANDOM model independently of each other.]
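A minimal sketch of recentering, using the grand mean of SES (0.1481 in the summary table of section 7.6):

/* Sketch: centering SES at its grand mean moves the (arbitrary) origin
   into the middle of the data, which often improves the conditioning of
   T and the convergence of PROC MIXED                                  */
DATA hs2;
   SET mixed.hs;
   ses_c = ses - 0.1481;   /* grand mean of SES from the summary above */
RUN;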
8.2.5 Random slopes and variance components parametrization

[SEE NOTES]

8.2.6 Testing hypotheses about T

The obvious way of testing a hypothesis about $\mathbf{T}$ is to fit the full model and the restricted model and perform a likelihood ratio test to compare the two models. For example, suppose you want to test whether you need a random slope model with one random X effect or whether a simpler random intercept model would be adequate. Fit both models (with the same fixed effects) and record the deviance for each model. The deviance is a synonym for -2 Log Likelihood and is shown in the 'Fit Statistics' of the SAS output. Suppose the restricted random intercept model shows:

Fit Statistics
-2 Log Likelihood        429.9

and the random slope model shows:

Fit Statistics
-2 Log Likelihood        424.6

Next we need to figure out the difference in the number of parameters. Since we used the same fixed model, the only difference in parameters comes from the variances of the random model. The smaller model has:
$$\mathbf{T} = [\tau_{00}]$$
The larger model has:
$$\mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$
and, since $\mathbf{T}$ is symmetric with $\tau_{01} = \tau_{10}$, the larger model has two more parameters than the smaller model. The usual procedure for likelihood ratio tests calls for taking the difference of the two deviances and comparing it to a Chi-Square distribution with q = 2 degrees of freedom, the difference in the number of parameters. So, here, we would have:
$$\chi^2 = 429.9 - 424.6 = 5.3$$
Using a suitable table for the Chi-Square with two degrees of freedom, we find that the p-value is 0.07065.

Although this is a very common approach to testing this kind of hypothesis, it is somewhat flawed. The reason is that the hypothesis $\tau_{11} = 0$ is at the edge of the parameter space of the full model. (Technically, the null hypothesis is not in the interior of the parameter space of the full model.) The usual asymptotic theory for the likelihood ratio test cannot be relied upon.
Intuitively, what happens is that, when $\tau_{11} = 0$, the fit is as likely to show underdispersion as overdispersion for the between-cluster slope variance. When there is underdispersion, $\hat\tau_{11}$ will be 0. Now, $\hat\tau_{11}$ contributes to the value of the Chi-Square only when $\hat\tau_{11}$ is positive (about 1/2 of the time). The hypothesis $\tau_{10} = 0$ contributes to the Chi-Square through $\hat\tau_{10}^2$ whether $\hat\tau_{11}$ is positive or not. As a result, the distribution of the test statistic under the null hypothesis can be thought of as a Chi-Square with 1 degree of freedom half the time and a Chi-Square with 2 degrees of freedom the other half. The standard approach above assumes the Chi-Square has 2 degrees of freedom all of the time. A Chi-Square with 1 degree of freedom has a 'smaller' distribution than one with 2, so this has the effect of making the actual distribution of the statistic smaller than its reference distribution.

The result is that the p-values computed with the reference distribution will be too large: if the statistic is too small then the probability in the upper tail will be too large. A p-value that is too large implies that the results are interpreted as less significant than they ought to be, i.e. the true probability of Type I error is lower than you think it is. You might think this is fine because it means the test is overly conservative. But remember that this also means that the true probability of Type II error is higher than it would be if you used an 'exact' test. A test that is conservative with respect to rejection is liberal with respect to acceptance. This can be a problem since researchers are frequently interested in dropping random slopes to reduce the complexity of the random model.

We can find the correct p-value for the problem of testing whether to add one random slope, together with all its covariances, to a model that already has k random coefficients (intercept and slopes) with all their covariances. The full model has:

$$\mathbf{T} = \begin{bmatrix} \mathbf{T}_{kk} & \boldsymbol{\tau}_{k+1} \\ \boldsymbol{\tau}_{k+1}' & \tau_{k+1,k+1} \end{bmatrix}$$

where $\mathbf{T}_{kk}$ is the $k \times k$ variance matrix of the existing random coefficients and $\boldsymbol{\tau}_{k+1}$ is the vector of the k covariances of the new random slope with the existing random coefficients. The restricted model has $\mathbf{T} = \mathbf{T}_{kk}$ and provides a test of $H_0: \boldsymbol{\tau}_{k+1} = \mathbf{0},\ \tau_{k+1,k+1} = 0$. The first k parameters are covariance parameters, which can take values greater than or less than 0, but the last parameter, $\tau_{k+1,k+1}$, can only take non-negative values. The likelihood ratio statistic has a nominal Chi-Squared distribution with q = k+1 degrees of freedom, but its real distribution under the null hypothesis is that of a 50:50 mixture of a Chi-Square with k+1 degrees of freedom and a Chi-Square with k degrees of freedom. We can adjust for this in two ways. If you have output providing the nominal p-value, you can adjust the p-value downward using the factor shown in the following figure.
[Figure: adjustment factor for the nominal p-value.]

Alternatively, if you have computed the likelihood ratio statistic as the difference of two deviances, you can use the table below to find a range for the true p-value:
Critical values for the mixture of two Chi-Squares with df = q and q-1:

 q      0.1    0.05    0.01   0.005   0.001  0.0005  0.0001  1e-005
 1     1.64    2.71    5.41    6.63    9.55   10.83   13.83   18.19
 2     3.81    5.14    8.27    9.63   12.81   14.18   17.37   21.94
 3     5.53    7.05   10.50   11.97   15.36   16.80   20.15   24.91
 4     7.09    8.76   12.48   14.04   17.61   19.13   22.61   27.54
 5     8.57   10.37   14.32   15.97   19.69   21.27   24.88   29.96
 6    10.00   11.91   16.07   17.79   21.66   23.29   27.02   32.24
 7    11.38   13.40   17.76   19.54   23.55   25.23   29.06   34.41
 8    12.74   14.85   19.38   21.23   25.37   27.10   31.03   36.51
 9    14.07   16.27   20.97   22.88   27.13   28.91   32.94   38.53

For comparison, this is a table of critical values for the Chi-Square distribution:

 q      0.1    0.05    0.01   0.005   0.001  0.0005  0.0001  1e-005
 1     2.71    3.84    6.63    7.88   10.83   12.12   15.14   19.51
 2     4.61    5.99    9.21   10.60   13.82   15.20   18.42   23.03
 3     6.25    7.81   11.34   12.84   16.27   17.73   21.11   25.90
 4     7.78    9.49   13.28   14.86   18.47   20.00   23.51   28.47
 5     9.24   11.07   15.09   16.75   20.52   22.11   25.74   30.86
 6    10.64   12.59   16.81   18.55   22.46   24.10   27.86   33.11
 7    12.02   14.07   18.48   20.28   24.32   26.02   29.88   35.26
 8    13.36   15.51   20.09   21.95   26.12   27.87   31.83   37.33
 9    14.68   16.92   21.67   23.59   27.88   29.67   33.72   39.34
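Applied to the running example ($\chi^2 = 5.3$ with q = 2), the mixture p-value can also be computed directly with SAS's PROBCHI function; a minimal sketch:

/* Sketch: naive p-value vs the 50:50 mixture p-value for the random
   slope test above (likelihood ratio statistic 5.3, q = 2)          */
DATA pval;
   lrt = 5.3;
   p_naive   = 1 - PROBCHI(lrt, 2);                /* about 0.071 */
   p_mixture = 0.5*(1 - PROBCHI(lrt, 2))
             + 0.5*(1 - PROBCHI(lrt, 1));          /* about 0.046 */
   PUT p_naive= p_mixture=;
RUN;

Note that the adjusted p-value (about 0.046, consistent with the 5% critical value of 5.14 in the q = 2 row of the mixture table) crosses the 5% boundary, while the nominal p-value (0.07065) does not.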
8.3 Examples

We will illustrate many of the concepts introduced so far with our subsample of the Bryk and Raudenbush data. First we consider a fixed effects model for the relationship between MATHACH and SES.

SAS input:

proc glm data = mixed.hs;
   class school;
   model mathach = ses school / solution;
run;

Selected SAS output:

                        The GLM Procedure
Dependent Variable: MATHACH

                             Sum of
Source            DF        Squares    Mean Square   F Value   Pr > F
Model             40    19726.86814      493.17170     13.21   <.0001
Error           1679    62701.74930       37.34470
Corrected Total 1719    82428.61743

R-Square    Coeff Var    Root MSE    MATHACH Mean
0.239321     46.60822    6.111031        13.11149

Source            DF      Type I SS    Mean Square   F Value   Pr > F
SES                1    11770.23946    11770.23946    315.18   <.0001
SCHOOL            39     7956.62868      204.01612      5.46   <.0001

Source            DF    Type III SS    Mean Square   F Value   Pr > F
SES                1    4171.108312    4171.108312    111.69   <.0001
SCHOOL            39    7956.628677     204.016120      5.46   <.0001

                                     Standard
Parameter             Estimate          Error    t Value   Pr > |t|
Intercept       13.41999619 B     0.80723095      16.62     <.0001
SES              2.32422573       0.21992118      10.57     <.0001
SCHOOL 1317     -1.04494131 B     1.18939271      -0.88     0.3798
SCHOOL 1374     -3.66214705 B     1.40930069      -2.60     0.0094
........

We note that the effect of SES is estimated as 2.32. This is the 'within-school' effect of SES: a weighted average of the individual within-school effects of SES.

8.4 Fitting a multilevel model: contextual effects

In this section we address two problems:

1. To what extent is the effect of SES an individual effect and to what extent is it a school effect? Our models so far have conflated these two effects.
2. Can we estimate the individual effect of SES without contaminating our estimate with the school effect?

We solve both problems at once by introducing contextual effects. We decompose SES into two components:

1. the average SES in each school (SES_MEAN), and
2. the deviation of a student's SES from the average SES in the school (SES_ADJ).

There are other ways of creating these two variables and we will discuss them later. The impact of these two variables lies in each variable providing a control for the other. In a model with these two variables, the coefficient of SES_ADJ measures the individual effect keeping the school effect constant. It can be shown that this is the 'within' effect of a 'fixed effects model' as the term is used in econometrics. The effect of SES_MEAN keeps SES_ADJ equal to 0, i.e. the school effect compares two schools and two students with constant SES_ADJ, which is equivalent to having the individual SES of each student equal to the school SES_MEAN. Thus the effect of SES_MEAN is unconditional while the effect of SES_ADJ is conditional.

We will use SAS to go through the steps of creating SES_MEAN and SES_ADJ. If nothing is gained by decomposing SES into SES_MEAN and SES_ADJ, the fitted coefficients for SES_MEAN and SES_ADJ will not differ significantly since, in this case, the original SES fits as well as the two separate variables:

$$\mathrm{E}(Y_{ij}) = \cdots + \beta (X_{ij} - \bar X_{\cdot j}) + \beta \bar X_{\cdot j} + \cdots = \cdots + \beta X_{ij} + \cdots$$

Thus we can use an ESTIMATE statement to test equality (a sketch follows). This test is analogous to the test for a random effects model in econometrics.
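A sketch of such a statement, using the SES_MEAN and SES_ADJ variables constructed in the example below (carrying out the test formally is left to Exercise 1 at the end of this section):

/* Sketch: the contrast SES_MEAN - SES_ADJ estimates the difference
   between the between-school and within-school coefficients; a value
   near 0 says the decomposition gains nothing over raw SES           */
ESTIMATE 'between minus within' ses_mean 1 ses_adj -1 / CL;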
8.4.1 Example

We will work through an analysis of the HS data, including a few extra touches to show some capabilities of SAS. We first define formats for some of the coded variables and summarize the data. These formats allow SAS to show the names of categories in some printed output:

proc format;   /* formatting variables */
   value sexfmt 1=female 0=male;
   value schfmt 1=Catholic 0=Public;
   value minfmt 1=minority 0=non_minority;
run;

The formats are identified with their respective variables:

data hs;
   set mixed.hs;
   format female sexfmt. minority minfmt. sector schfmt.;
   label ses='social economic status'
         mathach='mathematics achievement';
run;

The data set is sorted by the numerical variable SCHOOL so we can avoid declaring it as a CLASS variable in PROC MIXED:

proc sort data=hs;   /* sort by school so we can use it as a numeric var */
   by school;
run;

We have a look at the data:

proc contents data = hs;
run;
proc summary data=hs print min q1 median q3 max;
   var school minority female ses mathach size sector meanses;
run;
proc means data=hs;
   var ses mathach size meanses;
   class sector;
run;
proc freq data=hs;
   tables sector female minority;
run;

We now try a first run with PROC MIXED, deliberately using TYPE = UN in the RANDOM statement.
ods output SolutionF=fixest;   /* data set 'fixest' gets the fixed effect estimates from the fitted model */
ods output SolutionR=ranest;   /* data set 'ranest' gets the random effect estimates */
proc mixed data=hs;
   model mathach = ses sector ses*sector / outp = predictsch outpm = predictpop
         solution ddfm = satterth corrb covb covbi cl;
   random int ses / sub = school type = un solution G GC GCORR GI;
run;

The LOG output gives us a quiet warning:

NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.

The output listing for G shows that SAS is very confused:

              Estimated G Matrix
Row   Effect       Subject      Col1       Col2
  1   Intercept          1    3.5693     0.5808
  2   SES                1    0.5808

            Estimated Inv(G) Matrix
Row   Effect       Subject      Col1       Col2
  1   Intercept          1               1.7218
  2   SES                1    1.7218   -10.5818

            Estimated Chol(G) Matrix
Row   Effect       Subject      Col1       Col2
  1   Intercept          1    1.8893
  2   SES                1    0.3074

         Estimated G Correlation Matrix
Row   Effect       Subject      Col1       Col2
  1   Intercept          1    1.0000     1.0000
  2   SES                1    1.0000     1.0000

The blank for the lower right element of the G matrix is supposed to indicate something close to 0 but, if that is the case, the matrix is not even a variance matrix. This becomes clear when we look at the Inv(G) matrix, which should also be a variance matrix but has a negative diagonal element. Under these circumstances the Chol(G) matrix and the correlation matrix should not exist, but SAS valiantly reports some numbers for them.

We try again, this time with a better parametrization of the G matrix. We also reverse SES and the INT term in the RANDOM statement:

proc mixed data=hs;
   model mathach = ses sector ses*sector / outp=predictsch outpm=predictpop
         solution ddfm=satterth corrb covb covbi cl;
   random ses int / sub = school type = fa0(2) solution G GC GCORR GI;
run;

We now get a legitimate variance matrix for G, but it is 'singular', i.e. the variation falls entirely on a line. The Choleski root tells us that the point of minimum variance is far from the range of the data.

              Estimated G Matrix
Row   Effect       Subject      Col1       Col2
  1   SES                1   0.07119     0.5014
  2   Intercept          1    0.5014     3.5309

            Estimated Inv(G) Matrix
Row   Effect       Subject      Col1       Col2
  1   SES                1   14.0467
  2   Intercept          1

            Estimated Chol(G) Matrix
Row   Effect       Subject      Col1       Col2
  1   SES                1    0.2668
  2   Intercept          1    1.8791

         Estimated G Correlation Matrix
Row   Effect       Subject      Col1       Col2
  1   SES                1    1.0000     1.0000
  2   Intercept          1    1.0000     1.0000

This suggests that the variability in slopes is very small after accounting for the heterogeneity accounted for by SECTOR. We refit with only a random intercept:

proc mixed data=hs;
   model mathach = ses sector ses*sector / outp=predictsch outpm=predictpop
         solution ddfm=satterth corrb covb covbi cl;
   random int / sub=school type=fa0(1) solution;
run;

We compare the deviances (-2 Res Log Like) of the two models and find an increase of 1.2 for 2 degrees of freedom, so we have not lost anything significant by dropping the two variance parameters. Note that we are comparing two models that do not differ in their MODEL statements, which allows us to compare deviances even though the models have been fit with the default METHOD = REML⁵.

⁵ Models using METHOD = REML, the default, can be compared only if they do not differ in their MODEL statements, i.e. only the RANDOM and REPEATED models differ. If the models differ in their MODEL statements as well, full likelihood must be used with METHOD = ML.

The table of estimates of fixed effects shows an interesting story ... so far. Everything seems quite significant:

                    Solution for Fixed Effects
                        Standard
Effect      Estimate      Error      DF   t Value   Pr > |t|   Alpha
Intercept    11.6996     0.4285    1225     27.31     <.0001    0.05
SES           2.8917     0.2659    1716     10.88     <.0001    0.05
SECTOR        2.4716     0.7037    1250      3.51     0.0005    0.05
SES*SECTOR   -1.1003     0.4492    1716     -2.45     0.0144    0.05

We will now consider contextual effects: why do high-SES schools do better? Is it the school or is it the kid? So far we haven't made a distinction. Is there an effect of school SES distinct from child SES? Although it doesn't seem likely, the two effects could, in principle, even go in opposite directions!

We turn SES into two variables: an aggregate 'between-school' SES and a within-school residual or 'school-adjusted' SES. We first create a data set with one observation per school containing the average SES_MEAN for each school:

proc means data=hs;
   var ses;
   by school;
   output out=new mean=ses_mean;
run;

proc print data = new;
run;

Just to be on the safe side we sort before merging:

proc sort data=new;
   by school;
run;

We now merge the school averages back into the main data set and create a variable, SES_ADJ, holding each student's deviation in SES from the school average:

data hs;
   merge hs new;
   by school;
   ses_adj = ses - ses_mean;
run;

We are ready to run PROC MIXED on the new variables. We add the new inner variable SES_ADJ to the RANDOM model.
proc mixed data = hs cl method = ml;
   model mathach = ses_adj ses_mean sector ses_adj*sector ses_mean*sector / cl ddfm=satterth;
   random ses_adj int / sub = school type = fa0(2) G GC GCORR;
run;

The G matrix is still close to singular and the point of minimum variance is far outside the range of SES:

            Estimated Chol(G) Matrix
Row   Effect       Subject      Col1       Col2
  1   ses_adj            1    0.1351
  2   Intercept          1    1.3193    0.02311

         Estimated G Correlation Matrix
Row   Effect       Subject      Col1       Col2
  1   ses_adj            1    1.0000     0.9998
  2   Intercept          1    0.9998     1.0000

We refit with a random intercept and get the following estimates of fixed effects:

                                  Standard
Effect             Estimate         Error      DF   t Value   Pr > |t|   Alpha
Intercept           11.7392        0.3511    1130     33.44     <.0001    0.05
ses_adj              2.6815        0.2736    1672      9.80     <.0001    0.05
ses_mean             6.5073        0.9292    1115      7.00     <.0001    0.05
SECTOR               1.4300        0.7730    1044      1.85     0.0646    0.05
ses_adj*SECTOR      -1.0102        0.4600    1672     -2.20     0.0282    0.05
ses_mean*SECTOR     -1.9568        1.7190    1022     -1.14     0.2552    0.05

This output suggests that SES_MEAN and SES_ADJ have very different effects, at least within the Public school system, which is the reference level for the coding of SECTOR. There is also a suggestion that SECTOR might not be important in this new model. We rerun the model with the following CONTRAST statement:

contrast 'sector' SECTOR 1, ses_adj*SECTOR 1, ses_mean*SECTOR 1;

which allows us to test the hypothesis that all terms involving SECTOR are simultaneously equal to 0. The result is not entirely clear:
                   Contrasts
          Num    Den
Label      DF     DF    F Value    Pr > F
sector      3   1201       2.75    0.0415

Perhaps the apparent interaction between SECTOR and SES might show up as an interaction between SES_MEAN and SES_ADJ, suggesting that high-SES schools have flatter MATHACH/SES slopes than low-SES schools. The apparently greater fairness of Catholic schools in the early analyses might have been due to the fact that they tend to have higher values of SES_MEAN. Another possible explanation is that failing to separate the within-school effect of SES from the between-school effect contaminated the estimate of the between-school effect with the lower within-school effect. This is, ironically, the reverse of the concern usually expressed in econometrics. With a coefficient lower than it should be, SES, acting as a proxy for SES_MEAN in the between-school model, underadjusted for the effect of SES and left variation to be explained by SECTOR. Once SES_MEAN was uncoupled from SES_ADJ, it accounted for the variation that had been attributed to SECTOR.

Note also that residual plots show a ceiling effect in MATHACH, which would tend to flatten the slope for schools with higher MATHACH. Thus the apparently greater 'fairness' of Catholic schools could be an artifact of measurement.

Exercises: The following questions are left as exercises:

1. Use the ESTIMATE or CONTRAST statement to carry out a formal test of the equality of the coefficients of SES_MEAN and SES_ADJ. This test is an analogue of the Wu-Hausman test in econometrics.
2. Explore the usefulness of a SES_MEAN*SES_ADJ interaction. How would you interpret it?
9 Longitudinal Data

The structure of longitudinal data shares much with that of multilevel data. Turn the school into a subject and the students in the school into times or 'occasions' on which the subject is measured and, except for the fact that measurements are ordered in time, the structures are essentially the same. Mixed models offer only one of a number of ways of approaching the analysis of longitudinal data, but mixed models are extensions of some of the traditional ways and provide more flexibility. We will briefly review traditional methods and then discuss the use of mixed models for longitudinal data. The best method depends on a number of factors:

• The number of subjects: if the number is very small it might not be feasible to treat subjects as random. The analysis then focuses on an 'anecdotal' description of the observed subjects and does not generalize to the population of subjects. Statistical models play a role in analyzing the structure of responses within subjects, especially if the subjects were measured on a large number of occasions. This situation is typical of research in psychotherapy, where a few subjects are observed over a long time. With long series and few subjects, time series analysis can be an appropriate method.

• The number of occasions and whether the timing of occasions is fixed or variable from subject to subject. With a small number (<10?) of fixed occasions and with the same 'within-subject' design for each subject, the traditional methods of repeated measures analysis can be used. There are two common repeated measures models: univariate repeated measures and multivariate repeated measures. In univariate repeated measures models the random variability from subject to subject is (almost) equivalent to a random intercept mixed model. In multivariate repeated measures, no constraints are placed on the intercorrelations between occasions. These two models can be thought of as extreme cases: a model with only one covariance parameter and a model with all possible covariance parameters. They turn out to be special cases of mixed models, which provide the ability to specify a whole range of models between these extremes: models that allow simpler covariance structures based on the structure of the data and the fact that, having been observed in time, observations that are close in time are often expected to be more strongly correlated than observations that are farther apart in time.

• Mixed model methods are particularly well suited to data with many subjects and variable within-subject designs, whether due to variable occasions or to time-varying covariates.

We will look at four data examples to illustrate the breadth of applications of mixed models to longitudinal data:
• The classical Potthoff and Roy (1964) data on the jaw sizes of 27 children (16 boys and 11 girls) at ages 8, 10, 12 and 14. This is a simple yet interesting example of longitudinal data and helps in understanding the major concepts in a simple context.

• PSID: Panel Study of Income Dynamics. An excerpt from the PSID, conducted by ISR at the University of Michigan at Ann Arbor, will illustrate the application of these methods to a survey data set with fixed occasions and time-varying covariates.

• IQ recovery of post-coma patients: a long-term non-linear model with highly variable timing of occasions.

• Migraine diaries: a dichotomous dependent variable with a non-linear long-term response model including seasonal periodic components and long-term non-stationarity.

The four data sets can be downloaded as self-extracting files from the course web site at http://www.math.yorku.ca/~georges/Courses/Repeated/. After downloading the files into a convenient directory in Microsoft Windows, double click on each file icon to extract the SAS ssd file. Then, from SAS, define the directory as a library. In the following code, it is assumed that the name of the SAS library is 'MIXED'. The SAS data set names for the four data sets above are PR, PSID, IQ and MIGRAINE, respectively.
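A sketch of the library definition (the path shown is hypothetical; substitute the directory you extracted the files into):

/* Sketch: point the MIXED library at the directory holding the
   extracted SAS data sets PR, PSID, IQ and MIGRAINE             */
LIBNAME MIXED 'C:\mixedcourse';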
[Figure 15: trellis plot of Y (jaw size) against Age (8 to 14), one panel per child (F 1 to F 11, M 12 to M 27). Potthoff and Roy (1964) data on the growth of jaw sizes of 16 boys and 11 girls. Note possibly anomalous case M 20.]
[Figure 16: Recovery of verbal IQ (viq) after coma, plotted against years post coma.]
[Figure 17: Recovery of verbal IQ (viq) post coma (first four years), plotted against months post coma.]
9.1 The basic model

We can think of longitudinal data as multilevel data with t (a time index) replacing the within-cluster index i. We could keep j for subjects but, to be consistent with authors such as Snijders and Bosker (1999), we will switch from j to i. We also switch the order of the indices: $Y_{it}$ denotes subject (cluster) i observed on occasion t, so the cluster index comes first. The other difference is that at least one of the X variables will be time or a function of time.

A simple model for the Potthoff and Roy data, with a linear trend in time that varies randomly from child to child, would look just like a multilevel model with the following level 1 model:

$$Y_{it} = \beta_{0i} + \beta_{1i} X_{it} + \varepsilon_{it}$$

where i ranges over the 27 subjects, t is the time index (t = 1, 2, 3, 4), and $X_{it}$ takes on the actual ages: 8, 10, 12, 14.

Now we come to the main departure from multilevel models: what to assume about $\varepsilon_{it}$. It no longer seems reasonable to unquestioningly assume that the $\varepsilon_{it}$'s are independent as t varies for a given i. Observations close in time might be more strongly related than observations that are farther apart. (Note that this relationship does not necessarily lead to a positive correlation.)

The Laird-Ware form of the model for a single child is:

$$Y_i = X_i \beta_i + \varepsilon_i, \qquad Y_i = X_i \gamma + Z_i u_i + \varepsilon_i$$

if $Z_i = X_i$. With the matrices written out, we get:
$$\begin{bmatrix} Y_{i1} \\ Y_{i2} \\ Y_{i3} \\ Y_{i4} \end{bmatrix} = \begin{bmatrix} 1 & X_{i1} \\ 1 & X_{i2} \\ 1 & X_{i3} \\ 1 & X_{i4} \end{bmatrix}\begin{bmatrix} \beta_{0i} \\ \beta_{1i} \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix} = \begin{bmatrix} 1 & X_{i1} \\ 1 & X_{i2} \\ 1 & X_{i3} \\ 1 & X_{i4} \end{bmatrix}\begin{bmatrix} \gamma_0 \\ \gamma_1 \end{bmatrix} + \begin{bmatrix} 1 & X_{i1} \\ 1 & X_{i2} \\ 1 & X_{i3} \\ 1 & X_{i4} \end{bmatrix}\begin{bmatrix} u_{0i} \\ u_{1i} \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix}$$

with

$$\operatorname{Var}\begin{bmatrix} u_{0i} \\ u_{1i} \end{bmatrix} = \mathbf{T} = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}$$

With a standard multilevel model we would have:

$$\boldsymbol{\Sigma} = \operatorname{Var}\begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & 0 & 0 \\ 0 & \sigma^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{bmatrix}$$

but this is often not realistic for data ordered in time. A model that allows autocorrelation, of a form called AR(1) or 'auto-regressive of order 1', would have the following model for $\boldsymbol{\Sigma}$:
$$\boldsymbol{\Sigma} = \operatorname{Var}\begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix} = \sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{bmatrix}$$

Here we need to be careful because our model is identified only through the marginal variance:

$$V = Z\mathbf{T}Z' + \boldsymbol{\Sigma}$$

We now have 5 variance parameters in the model and 10 variance estimates. It might seem that identifiability should not be a problem, but we need to be on guard for possible near-collinearities. We can't add arbitrary complexity to $\boldsymbol{\Sigma}$ without remembering that the parameters in both $\mathbf{T}$ and $\boldsymbol{\Sigma}$ are estimated together. Writing out the $V$ matrix we see that we don't have 10 independent variance estimates. If we centre the time variable, we get:

$$V = \begin{bmatrix} 1 & -3 \\ 1 & -1 \\ 1 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}\begin{bmatrix} 1 & 1 & 1 & 1 \\ -3 & -1 & 1 & 3 \end{bmatrix} + \sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{bmatrix}$$

$$= \begin{bmatrix} \tau_{00}-6\tau_{01}+9\tau_{11}+\sigma^2 & \tau_{00}-4\tau_{01}+3\tau_{11}+\sigma^2\rho & \tau_{00}-2\tau_{01}-3\tau_{11}+\sigma^2\rho^2 & \tau_{00}-9\tau_{11}+\sigma^2\rho^3 \\ & \tau_{00}-2\tau_{01}+\tau_{11}+\sigma^2 & \tau_{00}-\tau_{11}+\sigma^2\rho & \tau_{00}+2\tau_{01}-3\tau_{11}+\sigma^2\rho^2 \\ & & \tau_{00}+2\tau_{01}+\tau_{11}+\sigma^2 & \tau_{00}+4\tau_{01}+3\tau_{11}+\sigma^2\rho \\ & & & \tau_{00}+6\tau_{01}+9\tau_{11}+\sigma^2 \end{bmatrix}$$

so that the upper triangle of the variance matrix is:
$$\begin{bmatrix} v_{11} \\ v_{12} \\ v_{13} \\ v_{14} \\ v_{22} \\ v_{23} \\ v_{24} \\ v_{33} \\ v_{34} \\ v_{44} \end{bmatrix} = \begin{bmatrix} \tau_{00}-6\tau_{01}+9\tau_{11}+\sigma^2 \\ \tau_{00}-4\tau_{01}+3\tau_{11}+\sigma^2\rho \\ \tau_{00}-2\tau_{01}-3\tau_{11}+\sigma^2\rho^2 \\ \tau_{00}-9\tau_{11}+\sigma^2\rho^3 \\ \tau_{00}-2\tau_{01}+\tau_{11}+\sigma^2 \\ \tau_{00}-\tau_{11}+\sigma^2\rho \\ \tau_{00}+2\tau_{01}-3\tau_{11}+\sigma^2\rho^2 \\ \tau_{00}+2\tau_{01}+\tau_{11}+\sigma^2 \\ \tau_{00}+4\tau_{01}+3\tau_{11}+\sigma^2\rho \\ \tau_{00}+6\tau_{01}+9\tau_{11}+\sigma^2 \end{bmatrix}$$

Heuristically, to estimate the variance parameters, we need to estimate $V$ and then backsolve for a set of variance parameters that reproduce $V$. The transformation is not linear since it involves different powers of $\rho$, but we can consider the Jacobian:

$$J = \frac{\partial V}{\partial \Phi'} = \begin{bmatrix} 1 & -6 & 9 & 1 & 0 \\ 1 & -4 & 3 & \rho & \sigma^2 \\ 1 & -2 & -3 & \rho^2 & 2\sigma^2\rho \\ 1 & 0 & -9 & \rho^3 & 3\sigma^2\rho^2 \\ 1 & -2 & 1 & 1 & 0 \\ 1 & 0 & -1 & \rho & \sigma^2 \\ 1 & 2 & -3 & \rho^2 & 2\sigma^2\rho \\ 1 & 2 & 1 & 1 & 0 \\ 1 & 4 & 3 & \rho & \sigma^2 \\ 1 & 6 & 9 & 1 & 0 \end{bmatrix} \qquad \text{where } \Phi = [\tau_{00} \;\; \tau_{01} \;\; \tau_{11} \;\; \sigma^2 \;\; \rho]'$$
The information on $\Phi$ is related to the Grammian matrix $J'J$ of the Jacobian matrix $J$, whose normalized form has a condition index of over 1,000 for values of $\rho > .62$. The fact that the variance parameters of this model cannot be estimated accurately individually does not mean that it is not a good model for the purpose of estimating fixed effects. To get a reasonable estimate of the autocorrelation $\rho$, longer series are needed for at least some subjects.

As mentioned earlier, both univariate and multivariate repeated measures can be represented as 'extreme' cases of mixed models. Univariate repeated measures are equivalent to a random intercept model, which generates a 'compound symmetric' $V$ matrix. In this model $X$ generates a model for time, generally a saturated model (i.e. equivalent to a categorical model for time), and $Z$ consists of a column of 1s. Using a cubic polynomial model for the centered Potthoff and Roy data:

$$\begin{bmatrix} Y_{i1} \\ Y_{i2} \\ Y_{i3} \\ Y_{i4} \end{bmatrix} = \begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}[u_{0i}] + \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix}$$

with $\operatorname{Var}(u_{0i}) = \tau_{00}$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2 I$. Thus:

$$V = \tau_{00}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}[1\;1\;1\;1] + \sigma^2 I = \begin{bmatrix} \tau_{00}+\sigma^2 & \tau_{00} & \tau_{00} & \tau_{00} \\ \tau_{00} & \tau_{00}+\sigma^2 & \tau_{00} & \tau_{00} \\ \tau_{00} & \tau_{00} & \tau_{00}+\sigma^2 & \tau_{00} \\ \tau_{00} & \tau_{00} & \tau_{00} & \tau_{00}+\sigma^2 \end{bmatrix}$$

A drawback of this model is that it assumes the same covariance between pairs of observations whether they are close or far apart in time. Note that the random slopes model (without autocorrelation) may seem better in this regard. (Setting $\tau_{01} = 0$ would be the same as assuming that the point of minimum variance is at the origin.) With centered time, the random slopes model gives:
$$V = \begin{bmatrix} 1 & -3 \\ 1 & -1 \\ 1 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{bmatrix}\begin{bmatrix} 1 & 1 & 1 & 1 \\ -3 & -1 & 1 & 3 \end{bmatrix} + \begin{bmatrix} \sigma^2 & 0 & 0 & 0 \\ 0 & \sigma^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{bmatrix}$$

$$= \tau_{00}\,U + \begin{bmatrix} -6\tau_{01}+9\tau_{11}+\sigma^2 & -4\tau_{01}+3\tau_{11} & -2\tau_{01}-3\tau_{11} & -9\tau_{11} \\ & -2\tau_{01}+\tau_{11}+\sigma^2 & -\tau_{11} & 2\tau_{01}-3\tau_{11} \\ & & 2\tau_{01}+\tau_{11}+\sigma^2 & 4\tau_{01}+3\tau_{11} \\ & & & 6\tau_{01}+9\tau_{11}+\sigma^2 \end{bmatrix}$$

where $U$ is a matrix of 1's.

Suitable values of $\tau_{00}$, $\tau_{01}$, $\tau_{11}$ and $\sigma^2$ may approximate an AR(1) correlation structure, but the estimation of $\tau_{00}$, $\tau_{01}$, $\tau_{11}$ in this model is driven by the variability of the 'true' regression lines between subjects. Any resulting within-subject correlation is an artifact and does not necessarily reflect the true value of the autocorrelation in the population.

Using a random intercept and an AR(1) parameter requires one fewer parameter than the random slopes model above and yields better identification for $\rho$:

$$V = \tau_{00}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}[1\;1\;1\;1] + \sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{bmatrix}$$

The MANOVA repeated measures design is saturated for $V$. It can be represented as a mixed model through $\mathbf{T}$ or through $\boldsymbol{\Sigma}$. For example, if $X$ yields a saturated model, we could have:

$$Y_i = X\gamma + Zu_i + \varepsilon_i$$
$$\begin{bmatrix} Y_{i1} \\ Y_{i2} \\ Y_{i3} \\ Y_{i4} \end{bmatrix} = \begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix} + \begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}\begin{bmatrix} u_{0i} \\ u_{1i} \\ u_{2i} \\ u_{3i} \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix}$$

which yields

$$\operatorname{Var}\begin{bmatrix} Y_{i1} \\ Y_{i2} \\ Y_{i3} \\ Y_{i4} \end{bmatrix} = \begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}\begin{bmatrix} \tau_{00} & \tau_{01} & \tau_{02} & \tau_{03} \\ \tau_{10} & \tau_{11} & \tau_{12} & \tau_{13} \\ \tau_{20} & \tau_{21} & \tau_{22} & \tau_{23} \\ \tau_{30} & \tau_{31} & \tau_{32} & \tau_{33} \end{bmatrix}\begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}' + \begin{bmatrix} \sigma^2 & 0 & 0 & 0 \\ 0 & \sigma^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{bmatrix}$$

$$V = \Phi + \sigma^2 I$$

where $\Phi$ is free to be any variance matrix. However, $\sigma^2$ is not identifiable, and fitting this model using software for mixed models requires the ability to constrain $\sigma^2$ to be equal to 0 or some very small quantity.

The other approach is to use a saturated model for $\boldsymbol{\Sigma}$ and no random part:

$$Y_i = X\gamma + \varepsilon_i$$

$$\begin{bmatrix} Y_{i1} \\ Y_{i2} \\ Y_{i3} \\ Y_{i4} \end{bmatrix} = \begin{bmatrix} 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \end{bmatrix}\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{bmatrix}$$

yielding
$$V = \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \sigma_{24} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} & \sigma_{34} \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_{44} \end{bmatrix}$$

which is equivalent in form and substance to a MANOVA repeated measures model.

We can therefore think of mixed models as extensions of univariate and multivariate repeated measures, allowing models with variance matrices of intermediate complexity. They also allow one to fit univariate and multivariate repeated measures models to incomplete data, yielding valid estimates provided the missing cases are 'missing at random'.

We fit these models by specifying both the $\Phi$ and the $\boldsymbol{\Sigma}$ matrices. In SAS the structure of the $\boldsymbol{\Sigma}$ matrix is controlled by the REPEATED statement. The options are listed in the SAS help documentation and a summary can be found below starting on page 121. Some of the most important options for specifying the TYPE of covariance structure are:

Structure    Description                  Parms       (i,j)th element
AR(1)        Autoregressive               2           $\sigma^2\rho^{|i-j|}$
ARMA(1,1)    ARMA(1,1)                    3           $\sigma^2[\gamma\rho^{|i-j|-1}(1-\delta_{ij}) + \delta_{ij}]$
ARH(1)       Heterogeneous AR(1)          T+1         $\sigma_i\sigma_j\rho^{|i-j|}$
ANTE(1)      Ante-dependence              2T-1        $\sigma_i\sigma_j\prod_{k=i}^{j-1}\rho_k$
FA0(T)       Choleski unstructured        T(T+1)/2    $\sum_{k=1}^{\min(i,j)}\lambda_{ik}\lambda_{jk}$
UN           Unstructured                 T(T+1)/2    $\sigma_{ij}$
CS           Compound symmetry            2           $\sigma_1 + \tau\,\delta_{ij}$
SP(POW)(t)   Spatial covariance           2           $\sigma^2\rho^{|t_i-t_j|}$
             structure⁶ implementing a
             continuous AR(1) model with
             respect to the time
             variable t

Note that $\delta_{ij}$ represents the 'Kronecker delta', which is equal to 1 when i = j and 0 when i ≠ j.

⁶ Spatial covariance structures are designed for geographic applications where the correlation between observations is a function of their spatial distance. This model applied to the one-dimensional time variable produces a continuous AR(1) model.
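As a usage sketch, an AR(1) within-subject structure for the Potthoff and Roy data could be requested as follows; the variable names CHILD, AGE and JAW are hypothetical stand-ins for the variables in MIXED.PR:

/* Sketch: AR(1) errors within child; the REPEATED statement controls
   the Sigma matrix just as the RANDOM statement controls T           */
PROC MIXED DATA = MIXED.PR;
   CLASS CHILD;
   MODEL JAW = AGE / SOLUTION;
   REPEATED / SUBJECT = CHILD TYPE = AR(1);
RUN;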
AR(1) is used when the dependence over time follows an 'auto-regressive process of order 1', which means that the random error for each observation is equal to a constant times the previous error plus a new independent error (innovation).

ARMA(1,1) combines an autoregressive process with a 'moving average' process. It can be used to model situations where the observed error is composed of an underlying 'hidden' AR(1) process plus independent errors.

ARH(1) is similar to AR(1) except that it allows observations on each occasion to have different variances. The correlation structure is identical to AR(1).

ANTE(1) is similar to ARH(1) except that the 'autocorrelation' between adjoining occasions can vary from one adjoining pair to another. The autocorrelation between pairs that are more than one unit of time apart is determined by the autocorrelations between the intermediate adjoining pairs.

A model that accommodates highly variable times from subject to subject is the continuous AR(1) model, which can be obtained with the spatial covariance structure SP(POW)(t), where t is the name of the time variable.

FA0(T), where T is the number of occasions per subject, produces a Choleski parametrization which may be superior to an unstructured (UN) model since FA0(T) imposes a positive semi-definite structure on $\boldsymbol{\Sigma}$. This approach, like UN, only works when the set of times is the same for each subject. Both FA0(T) and UN can accommodate some missing occasions for some subjects. Methods for doing this are discussed in section 9.3.5 on page 118 below.

CS produces a covariance matrix with a compound symmetry structure which is (almost) identical to the matrix obtained with a random intercept model. One subtle difference is that CS allows negative covariances, which would not be possible under SAS's positivity constraint on the variance in a random intercept model.

The covariance structures for some of the types listed above are:

$$\text{AR(1)}: \quad \sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{bmatrix}$$
$$\text{ARMA(1,1)}: \quad \sigma^2\begin{bmatrix} 1 & \gamma & \gamma\rho & \gamma\rho^2 \\ \gamma & 1 & \gamma & \gamma\rho \\ \gamma\rho & \gamma & 1 & \gamma \\ \gamma\rho^2 & \gamma\rho & \gamma & 1 \end{bmatrix}$$

$$\text{ARH(1)}: \quad \begin{bmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho & \sigma_1\sigma_3\rho^2 & \sigma_1\sigma_4\rho^3 \\ \sigma_2\sigma_1\rho & \sigma_2^2 & \sigma_2\sigma_3\rho & \sigma_2\sigma_4\rho^2 \\ \sigma_3\sigma_1\rho^2 & \sigma_3\sigma_2\rho & \sigma_3^2 & \sigma_3\sigma_4\rho \\ \sigma_4\sigma_1\rho^3 & \sigma_4\sigma_2\rho^2 & \sigma_4\sigma_3\rho & \sigma_4^2 \end{bmatrix}$$

$$\text{ANTE(1)}: \quad \begin{bmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho_1 & \sigma_1\sigma_3\rho_1\rho_2 & \sigma_1\sigma_4\rho_1\rho_2\rho_3 \\ \sigma_1\sigma_2\rho_1 & \sigma_2^2 & \sigma_2\sigma_3\rho_2 & \sigma_2\sigma_4\rho_2\rho_3 \\ \sigma_1\sigma_3\rho_1\rho_2 & \sigma_2\sigma_3\rho_2 & \sigma_3^2 & \sigma_3\sigma_4\rho_3 \\ \sigma_1\sigma_4\rho_1\rho_2\rho_3 & \sigma_2\sigma_4\rho_2\rho_3 & \sigma_3\sigma_4\rho_3 & \sigma_4^2 \end{bmatrix}$$

$$\text{SP(POW)}(time)\text{, supposing a subject with times } 1, 2, 5 \text{ and } 10\,^7: \quad \sigma^2\begin{bmatrix} 1 & \rho & \rho^4 & \rho^9 \\ \rho & 1 & \rho^3 & \rho^8 \\ \rho^4 & \rho^3 & 1 & \rho^5 \\ \rho^9 & \rho^8 & \rho^5 & 1 \end{bmatrix}$$

⁷ Note that the times and the number of times (hence the indices) can change from subject to subject, but $\sigma^2$ and $\rho$ have the same value.

9.2 Analyzing longitudinal data

9.2.1 Classical or Mixed models

A comment from a SAS FAQ comparing PROC MIXED with PROC GLM: