Generalized linear models

Søren Højsgaard
Department of Mathematical Sciences, Aalborg University, Denmark

October 29, 2012
1 Densities for generalized linear models

Consider densities of the form

    f_Y(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) − c(y, φ) }                          (1)

where φ is a fixed parameter (not part of the model).
1.1 Mean and variance

For random variables with densities of the above form we have:

    (*)  E(Y)   = b'(θ)
    (**) Var(Y) = b''(θ) a(φ)                                                 (2)

Notice that (∂/∂θ) f_Y(y; θ, φ) = [(y − b'(θ))/a(φ)] f_Y(y; θ, φ). Assuming that the order of integration and differentiation can be interchanged we get

    0 = (d/dθ) ∫ f_Y dy = ∫ (∂/∂θ) f_Y dy = ∫ [(y − b'(θ))/a(φ)] f_Y dy       (3)
      = (1/a(φ)) [ ∫ y f_Y dy − b'(θ) ∫ f_Y dy ] = (1/a(φ)) [ μ − b'(θ) ]

from which (2)(*) follows.
By differentiating twice we get

    0 = (d²/dθ²) ∫ f_Y dy = ∫ (∂/∂θ) { [(y − b'(θ))/a(φ)] f_Y } dy
      = ∫ { −b''(θ)/a(φ) + [(y − b'(θ))/a(φ)]² } f_Y dy

From this (2)(**) follows since

    0 = ∫ { −b''(θ)/a(φ) + [(y − μ)/a(φ)]² } f_Y dy                           (4)
      = (1/a(φ)²) ∫ (y − μ)² f_Y dy − [b''(θ)/a(φ)] ∫ f_Y dy
      = (1/a(φ)²) Var(Y) − b''(θ)/a(φ)
2 The components of a generalized linear model

Consider the situation where Y_1, ..., Y_n are independent random variables and Y_i has density f_Y(y; θ_i, φ) of the form (1). In the following it is assumed that a(φ) = φ/w where w is a known weight.

2.1 Linear predictor and link function

To each Y there is associated a vector x of explanatory variables and a fixed number ω (an offset). The linear predictor is

    η = η(β) = ω + x^⊤ β.

In generalized linear models (GLIMs) the linear predictor is related to the mean μ = E(Y) by the link function g(), η = g(μ), so that

    μ = g^{-1}(η) = g^{-1}(ω + x^⊤ β).
The density (1) is parameterized by the parameter θ, and we therefore need to establish the connection between θ and μ. From (2) we have that E(Y) = μ = b'(θ). Let the inverse function of b' be denoted by H. Then

    θ = (b')^{-1}(μ) = H(μ) = H(g^{-1}(ω + x^⊤ β))

which establishes the link between the parameter θ and x^⊤ β.
2.2 Variance function

From (2) we also have that Var(Y) = b''(θ) a(φ). Since θ = H(μ) we have Var(Y) = b''(H(μ)) a(φ). The function V(μ) = b''(H(μ)) is called the variance function, as it expresses how the variance depends on the mean.

2.3 Canonical link

The connection between θ and the linear predictor η is given by θ = (b')^{-1}(g^{-1}(η)). So if b' = g^{-1} then θ = η. In this case g is said to be the canonical link function. Using the canonical link function ensures that the log-likelihood is concave, so that it has a unique maximum.
2.4 Example: Bernoulli distribution and canonical link

Consider the Bernoulli distribution

    Pr(Y = y) = p^y (1 − p)^{1−y},   y ∈ {0, 1}.

Rewrite as

    Pr(Y = y) = exp{ y log p + (1 − y) log(1 − p) }
              = exp{ y log[p/(1 − p)] + log(1 − p) }
              = exp{ yθ − log(1 + exp(θ)) }

where we have introduced the logit θ = g(p) = log[p/(1 − p)]. Notice that p = e^θ/(1 + e^θ).

Hence b(θ) = log(1 + e^θ), w = 1 and φ = 1. Now b'(θ) = e^θ/(1 + e^θ) = p and b''(θ) = e^θ/(1 + e^θ)² = p(1 − p).
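These relations can be checked numerically. The small sketch below uses R's built-in binomial() family object, whose components linkfun, linkinv and variance encode g, g^{-1} and V(μ); the particular value p = 0.3 is just an arbitrary choice for illustration.

    ## Numerical check of the Bernoulli/logit relations above.
    fam <- binomial()                      # logit link by default

    p     <- 0.3                           # arbitrary success probability
    theta <- log(p / (1 - p))              # canonical parameter (the logit)
    fam$linkfun(p)                         # same value via the family object

    bprime  <- exp(theta) / (1 + exp(theta))     # b'(theta)
    bprime2 <- exp(theta) / (1 + exp(theta))^2   # b''(theta)

    bprime                                 # equals p, i.e. E(Y)
    fam$linkinv(theta)                     # the same, via g^{-1}
    bprime2                                # equals p * (1 - p), i.e. Var(Y)
    fam$variance(p)                        # variance function V(mu) = mu * (1 - mu)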
3 Estimation: Iteratively reweighted least squares

Consider the situation where Y_1, ..., Y_n are independent and the distribution of Y_i has density of the form (1) with weight w_i. We have

• E(Y_i) = μ_i
• Var(Y_i) = V(μ_i) φ / w_i
• g(μ_i) = η_i = ω_i + x_i^⊤ β

Then the log-likelihood becomes

    l(β) = (1/φ) Σ_i w_i ( y_i θ_i − b(θ_i) ) − Σ_i c(y_i, φ)

where θ_i = H(μ_i) = H(g^{-1}(η_i)) = H(g^{-1}(ω_i + x_i^⊤ β)).
Imagine a Taylor expansion of g(Y_i) around μ_i:

    g(Y_i) ≈ g(μ_i) + g'(μ_i)(Y_i − μ_i) = Z_i

The key to the algorithm is to work with the adjusted dependent variables z_i. To actually calculate these we need an initial guess of μ_i; more about this later.

We have E(Z_i) = g(μ_i) = η_i and Var(Z_i) = (g'(μ_i))² V(μ_i) φ / w_i. In vector form we get

    E(Z) = η = ω + Xβ,    Var(Z) = φ diag( (g'(μ_i))² V(μ_i) / w_i ) = φ W(μ)

The weighted least squares estimate of β is

    β̂ = (X^⊤ W^{-1} X)^{-1} X^⊤ W^{-1} (z − ω)

Notice: both W and z depend on β, so some iterative scheme is needed.
Initialization: Since g(μ_i) = ω_i + x_i^⊤ β we must have g(y_i) ≈ ω_i + x_i^⊤ β. So we may start by applying the link function to the data and obtaining an initial estimate of β as

    β^0 = (X^⊤ X)^{-1} X^⊤ ( g(y) − ω )

Iteration: Given β^m from the m-th iteration we calculate

    z^{m+1} = ω + Xβ^m + diag( g'(μ_i^m) ) ( y − μ^m ) = ω + Xβ^m + r

Update W accordingly as W = W(μ^m) and calculate

    β^{m+1} = (X^⊤ W^{-1} X)^{-1} X^⊤ W^{-1} ( z^{m+1} − ω )

Notice that

    β^{m+1} = β^m + (X^⊤ W^{-1} X)^{-1} X^⊤ W^{-1} r

This means that α = β^{m+1} − β^m is determined as the MLE (weighted least squares estimate) in the model r ∼ N(Xα, φW).
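Below is a minimal sketch in R of the iteration just described, assuming a(φ) = φ/w and an R family object supplying g (linkfun), g^{-1} (linkinv), dμ/dη (mu.eta) and V (variance). The function name irls_sketch, the ad hoc adjustment of y in the initialization (needed so that g(y) is finite for 0/1 data) and the convergence tolerance are our own choices, not part of the slides.

    ## Sketch of the IRLS scheme above (illustration only, not production code).
    ## y: responses, X: model matrix, w: prior weights, off: offsets omega_i,
    ## fam: an R family object (e.g. binomial()).
    irls_sketch <- function(y, X, w = rep(1, length(y)), off = rep(0, length(y)),
                            fam = binomial(), maxit = 25, tol = 1e-8) {
      ## Initialization: apply the link to (slightly adjusted) data and regress on X.
      ## The adjustment keeps g(y) finite for 0/1 responses; it is an ad hoc choice.
      y_adj <- (y + 0.5) / 2
      beta  <- solve(crossprod(X), crossprod(X, fam$linkfun(y_adj) - off))

      for (m in seq_len(maxit)) {
        eta    <- off + drop(X %*% beta)
        mu     <- fam$linkinv(eta)
        gprime <- 1 / fam$mu.eta(eta)               # g'(mu) = 1 / (dmu/deta)
        Wdiag  <- gprime^2 * fam$variance(mu) / w   # diagonal of W(mu)
        z      <- eta + gprime * (y - mu)           # adjusted dependent variable
        ## Weighted least squares step: beta <- (X'W^-1 X)^-1 X'W^-1 (z - omega)
        XtWinv   <- t(X * (1 / Wdiag))
        beta_new <- solve(XtWinv %*% X, XtWinv %*% (z - off))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      list(beta = drop(beta), W = Wdiag, mu = mu, eta = eta)
    }

    ## Example use on simulated data; the fit should agree with glm():
    set.seed(1)
    x <- rnorm(100); X <- cbind(1, x)
    y <- rbinom(100, size = 1, prob = plogis(0.5 + x))
    irls_sketch(y, X)$beta
    coef(glm(y ~ x, family = binomial))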
3.1 Saturated model, deviance and estimation of φ

Recall that the log-likelihood l has the form l(β) = l(μ, y) = l_1(μ, y)/φ. The saturated model is obtained by making no restrictions on μ, so in this case μ̂ = y, and l(y, y) is the log-likelihood for the saturated model.

The goodness of fit of a generalized linear model is investigated using the deviance:

    D(μ̂, y) = −2( l(μ̂, y) − l(y, y) ) = −2( l_1(μ̂, y) − l_1(y, y) )/φ = D_1(μ̂, y)/φ

Under certain assumptions on the covariates X, it can be shown that D is asymptotically χ²_{n−p} distributed, where p is the rank of X.
• When φ is known, this result provides a goodness-of-fit test of the model.

• When φ is unknown, this result provides a natural estimate of φ as

    φ̃ = D_1(μ̂, y) / (n − p)

  (but in this case we do not have a goodness-of-fit test).
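In R this estimate can be computed directly from a fitted glm object. A short illustration follows; the Gamma model and the data frame dat with columns y and x are hypothetical, used only to have a model where φ is unknown.

    ## Deviance-based estimate of phi: D1(mu-hat, y) / (n - p)
    fit <- glm(y ~ x, family = Gamma(link = "log"), data = dat)
    deviance(fit) / df.residual(fit)
    ## For comparison, summary.glm reports a (Pearson-based) dispersion estimate:
    summary(fit)$dispersion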
3.2 Binomial distribution

Suppose Y_i ∼ Bin(n_i, μ_i) where g(μ_i) = η_i = x_i^⊤ β for i = 1, ..., N. There are different ways of putting the binomial distribution into the GLIM framework. First notice that φ = 1 and w_i = 1, so these terms can be ignored. For later use, let X = [x_1 : ⋯ : x_N]^⊤ and v_i = {g'(μ_i)}² μ_i(1 − μ_i).

We may regard Y_i as a sum of n_i independent Bernoulli variables:

    Y_i = Σ_{j=1}^{n_i} u_{ij},    u_{ij} ∼ Bern(μ_i)

We may therefore fit the binomial distribution this way, but this is potentially very inefficient in terms of storage and computing time. An alternative approach is described below.
In the Bernoulli setup:

• The row vector of covariates x_i^⊤ must then be repeated n_i times: let X̃_i = 1_{n_i} x_i^⊤ and X̃ = [X̃_1 : ⋯ : X̃_N]^⊤. With M = Σ_i n_i, X̃ is an M × p matrix.

• Let also Ṽ_i = v_i I_{n_i} and Ṽ = diag(Ṽ_1, ..., Ṽ_N).

• Finally, let z_{ij} = g(μ_i) + g'(μ_i)(u_{ij} − μ_i), z̃_i = (z_{i1}, ..., z_{i n_i})^⊤ and z̃ = (z̃_1, ..., z̃_N)^⊤.

We omit the subscripts n_i indicating the dimensions of 1 and I. We then have

    X̃^⊤ Ṽ^{-1}     = [ (1/v_1) x_1 1^⊤ : ⋯ : (1/v_N) x_N 1^⊤ ]

    X̃^⊤ Ṽ^{-1} z̃   = (1/v_1) x_1 1^⊤ z̃_1 + ⋯ + (1/v_N) x_N 1^⊤ z̃_N

    X̃^⊤ Ṽ^{-1} X̃   = (n_1/v_1) x_1 x_1^⊤ + ⋯ + (n_N/v_N) x_N x_N^⊤
Recall that z_{ij} = g(μ_i) + g'(μ_i)(u_{ij} − μ_i). Let

    z̄_i = (1/n_i) Σ_j z_{ij} = (1/n_i) 1^⊤ z̃_i = g(μ_i) + g'(μ_i)( y_i/n_i − μ_i )

and let z̄ = (z̄_1, ..., z̄_N)^⊤. Then

    E(z̄_i) = g(μ_i),    Var(z̄_i) = (1/n_i) {g'(μ_i)}² μ_i(1 − μ_i) = v_i / n_i

Let V = diag( v_1/n_1, ..., v_N/n_N ). Then

    X^⊤ V^{-1}     = [ (n_1/v_1) x_1 : ⋯ : (n_N/v_N) x_N ]

    X^⊤ V^{-1} z̄   = (n_1/v_1) x_1 z̄_1 + ⋯ + (n_N/v_N) x_N z̄_N

    X^⊤ V^{-1} X   = (n_1/v_1) x_1 x_1^⊤ + ⋯ + (n_N/v_N) x_N x_N^⊤
Notice that 1^⊤ z̃_i = Σ_j z_{ij} = n_i z̄_i. This means that

    X̃^⊤ Ṽ^{-1} z̃ = X^⊤ V^{-1} z̄,    X̃^⊤ Ṽ^{-1} X̃ = X^⊤ V^{-1} X

so iterating using (X̃, Ṽ) or (X, V) produces the same result.
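The equivalence can also be verified numerically: fitting glm() to the expanded 0/1 (Bernoulli) data and to the aggregated binomial counts gives the same coefficient estimates. The data below are simulated purely for illustration.

    ## Expanded Bernoulli data vs aggregated binomial counts: same estimates.
    set.seed(2)
    N  <- 5
    ni <- c(10, 20, 15, 30, 25)                  # group sizes n_i
    x  <- rnorm(N)                               # one covariate value per group
    yi <- rbinom(N, size = ni, prob = plogis(-0.5 + x))   # successes per group

    ## Aggregated fit: response is (successes, failures) per group
    fit_agg <- glm(cbind(yi, ni - yi) ~ x, family = binomial)

    ## Expanded fit: one 0/1 row per Bernoulli trial, covariates repeated n_i times
    u     <- unlist(mapply(function(y, n) rep(c(1, 0), c(y, n - y)), yi, ni))
    x_rep <- rep(x, ni)
    fit_bern <- glm(u ~ x_rep, family = binomial)

    coef(fit_agg)
    coef(fit_bern)   # identical up to numerical accuracy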
4 Resources

• The standard reference on generalized linear models is McCullagh and Nelder, Generalized Linear Models.

• http://www.commanster.eu/ms/glm.pdf
5 EXERCISES: Generalized linear models

The function model.matrix() can be used for generating a model matrix X from a given formula and a dataset. So in the following we shall assume that we have a vector of responses y, a model matrix X and possibly also a weight vector w.

1. Implement your own function myglm() which takes (y, X, w) as input. In doing so, you may find it helpful to use that the various families are implemented in R; see e.g. ?binomial. You may seek inspiration from the function glm.fit.

2. The result from your function could be of a certain class, and you could create interesting methods. For example, it would be nice to obtain the approximate covariance matrix of β̂, which is (X^⊤ W^{-1} X)^{-1}. One possible way to organize this is sketched below.
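As a hint for exercise 2, here is one possible way to wrap the result in a class with a vcov() method. It reuses the irls_sketch() function shown in Section 3; the class name myglm, the extra family argument and the component names are our own choices, and the skeleton ignores φ.

    ## Possible skeleton for exercise 2 (our own naming conventions):
    myglm <- function(y, X, w = rep(1, length(y)), family = binomial()) {
      fit <- irls_sketch(y, X, w = w, fam = family)     # sketch from Section 3
      fit$cov <- solve(t(X * (1 / fit$W)) %*% X)        # (X' W^-1 X)^-1
      class(fit) <- "myglm"
      fit
    }

    ## S3 methods for the new class
    vcov.myglm <- function(object, ...) object$cov
    coef.myglm <- function(object, ...) object$beta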