Cox’s proportional hazards model and Cox’s partial likelihood Rasmus Waagepetersen October 8, 2020 1 / 27
Non-parametric vs. parametric Suppose we want to estimate unknown function, e.g. survival function. Approaches: ◮ Non-parametric using Kaplan-Meier. Advantage: no assumption regarding type of distribution. Disadvantage: requires iid observations. ◮ Parametric model. Advantage: we only need to estimate a few parameters that completely characterize distribution (e.g. exponential or Weibull) - gives low variance of estimates. Can be extended to non- iid observations using regression on covariates. Disadvantage: assumed model class may be (or always is) incorrect leading to model error or in other words, bias. Possible to combine the best of two approaches ? 2 / 27
Semi-parametric approach - Cox’s proportional hazards model Sir David Cox in a ground-breaking paper (‘Regression models and life tables’, 1972) suggested the following model for the hazard function given covariates z ∈ R p : h ( t ; z ) = h 0 ( t ) exp( z T β ) , β ∈ R p . Here h 0 ( · ) completely unspecified function except that it must be non-negative. Thus model combines great flexibility via non-parametric h 0 ( · ) with the possibility of introducing covariate effects via exponential term exp( z T β ) This model has become standard in medical statistics. 3 / 27
Some properties Cumulative hazard: � t H ( t ; z ) = exp( z T β ) h 0 ( u ) d u = exp( z T β ) H 0 ( t ) 0 Survival function S ( t ; z ) = S 0 ( t ) exp( z T β ) S 0 ( t ) = exp( − H 0 ( t )) Proportional hazards: h ( t ; z ) h ( t ; z ′ ) = exp(( z − z ′ ) T β ) i.e. constant hazard ratio for two different subjects - curves can not cross ! - this should be checked in any application. 4 / 27
Estimation - partial likelihood Model useless if we can not estimate parameter β . Problem: we can not use likelihood when h 0 ( · ) unspecified. Second break-through contribution of Cox: invention of partial likelikehood for estimating β . Suppose we have observations ( t i , δ i ) as well as (fixed) covariates z 1 , . . . , z n , i = 1 , . . . , n . We assume no ties (all t i distinct) and define D ⊆ { 1 , . . . , n } as D = { l | δ l = 1 } - i.e. the index set of death times. For any t ≥ 0 we further define the risk set R ( t ) = { l | t l ≥ t } i.e. the index set of subjects at risk at time t . 5 / 27
The partial likelihood The partial likelihood is exp( z T l β ) � L ( β ) = k ∈ R ( t l ) exp( z T � k β ) l ∈ D Cox suggested to estimate β by maximizing L ( β ). ◮ does not depend on h 0 ◮ does not depend on actual death times - only their order ◮ censored observations only appear in risk set (as for Kaplan-Meier) Cox’s idea has proven to work very well - but why ? Lots of people have tried to make sense of this partial likelihood. 6 / 27
Cox’s intuition Consider for simplicity the case of no censoring and let t (1) , . . . , t ( n ) denote the set of ordered death times. We can equivalently represent data as the set of inter-arrival times v i = t ( i ) − t ( i − 1) (taking t 0 = 0) together with the information r 1 , r 2 , . . . , r n about which subject died at each time of death - i.e. r i = l if subject l was the i th subject to die. Cox then factored likelihood of ( v 1 , . . . , v n , r 1 , . . . , r n ) as (using generic notation for densities and probabilities) f ( v 1 ) p ( r 1 | v 1 ) f ( v 2 | v 1 , r 1 ) p ( r 2 | v 1 , v 2 , r 1 ) · · · f ( v n | v 1 , . . . , v n − 1 , r 1 , . . . , r n − 1 ) p ( r n | v 1 , . . . , v n , r 1 , . . . , r n − 1 ) 7 / 27
Cox argued that terms f ( v i | . . . ) could not contribute with information regarding β since the interarrival times can be fitted arbitrary well when h 0 is unrestricted - we can essentially just choose h 0 to consist of ‘spikes’ at each death time. Thus estimation of β should be based on remaining factors n � L ( β ) = p ( r i | H i ) i =1 where H i = { v 1 , . . . , v i , r 1 , . . . , r i − 1 } history/previous observations. Here p ( r i | H i ) is the probability that subject r i is the i th person to die given the previous observations. 8 / 27
More precisely, let R i denote the random index of the i th subject that dies ( R i = l means that T l is the i th smallest death time, i.e. T R i = T ( i ) = T l ). Assume that p ( l | H i ) only depends on H i through the knowledge that the i th death happens at time t ( i ) and that R ( t ( i ) ) are the ones at risk at time t ( i ) . Thus p ( l | H i ) = P ( R i = l | T R i ∈ [ t ( i ) , t ( i ) + dt [ , R ( t ( i ) ) = A ) This is the probability that l is the i th person to die given that the i th death happens at time t ( i ) and that the persons in A are at risk at time t ( i ) (thus probability is zero if l �∈ A ) 9 / 27
We now express the conditional probability in terms of the hazard function: P ( R i = l , T R i ∈ [ t ( i ) , t ( i ) + dt [ | R ( t ( i ) ) = A ) = P ( T l ∈ [ t ( i ) , t ( i ) + dt [ , T k > T l , k ∈ A \ { l }| R ( t ( i ) ) = A ) � ‘ = ′ h 0 ( t ( i ) ) exp( z T (1 − h 0 ( t ( i ) ) exp( z T l β ) d t k β ) d t ) k ∈ A \{ l } Note ‘=’ is because we actually replace T k > T l by T k > t ( i ) + d t . This does not really matter since d t infinitesimal. NB: if R i = l then t ( i ) = t l so in the following we replace t ( i ) with t l . 10 / 27
Finally, P ( R i = l | T R i ∈ [ t l , t l + dt [ , R ( t l ) = A ) = P ( R i = l , T R i ∈ [ t l , t l + dt [ | R ( t l ) = A ) P ( T R i ∈ [ t l , t l + dt [ | R ( t l ) = A ) P ( R i = l , T R i ∈ [ t l , t l + dt [ | R ( t l = A )) = � j ∈ R ( t l ) P ( R i = j , T R i ∈ [ t l , t l + dt [ | R ( t l ) = A ) h 0 ( t l ) exp( z T k ∈ R ( t l ) \{ l } (1 − h 0 ( t l ) exp( z T l β ) d t � k β ) d t ) = j ∈ R ( t l ) h 0 ( t l ) exp( z T k ∈ R ( t l ) \{ j } (1 − h 0 ( t l ) exp( z T � j β ) d t � k β ) d t ) exp( z T l β ) = k ∈ R ( t l ) exp( z T � k β ) Note: last = follows after cancelling h 0 ( t l ) d t and noting that (1 − h 0 ( t l ) exp( z T k β ) d t ) tends to one when d t tends to zero. NB: denominator is hazard for minimum of T k , k ∈ R ( t l ) (exercise 18) 11 / 27
Conditional likelihood for matched case-control study Cox’s idea very closely related to conditional likelihood for matched case-control studies. Let X denote a binary random variable (e.g. sick/healthy) for an individual in a population. We want to study the impact of a covariate z on X . Assume that the population can be divided into homogeneous groups (strata) so that probability of being ill is given by a logistic regression exp( α i + β z ) P ( X = 1) = p i ( z ) = 1 + exp( α i + β z ) for an individual in the i th strata and with the covariate z . 12 / 27
Suppose X 1 = 1 with covariate z 1 is observed for a sick person in the i th stratum. In a matched case-control study this observation is paired with an observation X 2 = 0 with covariate z 2 for a randomly selected healthy person in the same stratum. The conditional likelihood is now based on the conditional probabilities P ( X 1 = 1 | X 1 = 1 , X 2 = 0 or X 1 = 0 , X 2 = 1) = p i ( z 1 )(1 − p i ( z 2 )) p i ( z 1 )(1 − p i ( z 2 )) + (1 − p i ( z 1 )) p i ( z 2 ) This reduces to exp( β z 1 ) exp( β z 1 ) + exp( β z 2 ) which is free of the strata specific intercept α i . Note α i is a nuisance parameter when we are just interested in β . 13 / 27
Invariance argument Again consider the case of no censoring. Kalbfleisch and Prentice noticed that if one applies a strictly monotone differentiable function g to the survival times T 1 , . . . , T n then ˜ T i = g ( T i ) again follows a proportional hazards model with a completely unspecified hazard function ˜ h 0 (exercise 17). Hence estimation problem for β the same regardless of whether we consider T i ’s or ˜ T i ’s. They thus concluded that only the ordering (ranks) of the survival times and not the magnitudes of the survival times could matter for inference on β . One can verify (exercise 23) that for the ranks R i , P ( R 1 = r 1 , . . . , R n = r n ) = P ( T r 1 < T r 2 < · · · < T r n ) is precisely Cox’s partial likelihood. 14 / 27
Profile likelihood Cox’s partial likelihood can also be derived as a profile likelihood. Consider likelihood (assuming no ties) � t i n l β )] δ i exp[ − exp( z T � [ h 0 ( t i ) d t exp( z T i β ) h 0 ( u ) d u ] . 0 i =1 Let’s try to maximize wrt h 0 . First, we need h 0 ( t l ) > 0 for l ∈ D . At the same time we should take h 0 ( u ) = 0 between death times. So we let h 0 ( t ) d t = α l in very small intervals around death times, [ t l , t l + d t [, l ∈ D , and zero elsewhere. Note likelihood does not inform about h 0 ( t ) for t larger than max i t i . 15 / 27
Then likelihood becomes �� � n α l exp[ z T � exp( z T � L ( α, β ) = l β ] exp( − i β ) α l ) l ∈ D i =1 l ∈ D : t l ≤ t i �� � α l exp[ z T � � exp( z T = l β ] exp( − α l i β )) l ∈ D l ∈ D i ∈ R ( t l ) Taking log and differentiating wrt α l we obtain ∂ log L ( α, β ) = 1 � exp( z T − j β ) ∂α l α l j ∈ R ( t l ) Setting equal to zero and solving wrt α l gives 1 α l ( β ) = ˆ j ∈ R ( t l ) exp( z T � j β ) 16 / 27
Recommend
More recommend