Compstat'2010, Paris, August 22–27
Censored Survival Data: Simulation and Kernel Estimates
Jiří Zelinka
Department of Mathematics and Statistics, Faculty of Science, Masaryk University, Brno, Czech Republic
Supported by MŠMT LC06024
Introduction
• Previous research (Horová, Pospíšil & Zelinka (2008) and Horová, Pospíšil & Zelinka (2009)): a combination of kernel smoothing and a dynamic model in survival analysis.
• Verification of the developed method: simulations → problem.
• The subject of this paper is to solve this problem.
Survival and hazard functions
$T \ge 0$: survival time
$F$: cumulative distribution function (c.d.f.) of $T$
$\bar F = 1 - F$: survival function
$\lambda = \lambda(x)$: hazard function
Hazard function (intensity of survival probability):
$$\lambda(x) = -\frac{\bar F'(x)}{\bar F(x)} = \frac{f(x)}{\bar F(x)} = -\left(\log \bar F(x)\right)' \qquad (1)$$
if the density $f$ exists. From (1) we have
$$\bar F(x) = e^{-\int_0^x \lambda(t)\,dt}. \qquad (2)$$
Random censorship model
$T_1, T_2, \dots, T_n$: i.i.d. lifetimes with c.d.f. $F$
$C_1, \dots, C_n$: i.i.d. censoring times with c.d.f. $G$
Censoring times are independent of the lifetimes.
In the random censorship model we observe pairs $(X_i, \delta_i)$, $i = 1, \dots, n$, where
$$X_i = \min(T_i, C_i), \qquad \delta_i = 1_{\{X_i = T_i\}},$$
and $\delta_i$ indicates whether the observation is censored or not.
The $X_i$ are i.i.d. with survival function $\bar L$:
$$\bar L(x) = \bar F(x)\,\bar G(x).$$
Kernel estimates of the hazard function
Let $[0, \tau]$, $\tau > 0$, be an interval for which $L(\tau) < 1$ and $\lambda \in C^2[0, \tau]$, and let $K$ be a continuous and symmetric function on $\mathbb{R}$, called a kernel, satisfying the conditions:
1. $\operatorname{supp} K = [-1, 1]$
2. $K \in \operatorname{Lip}[-1, 1]$
3. $\int_{-1}^{1} x^k K(x)\,dx = \begin{cases} 1, & k = 0 \\ 0, & k = 1 \\ \beta_2 \neq 0, & k = 2. \end{cases}$
Well-known kernels:
Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2)\,1_{[-1,1]}$
quartic kernel: $K(x) = \frac{15}{16}(1 - x^2)^2\,1_{[-1,1]}$
The kernel estimate of the hazard function is given by
$$\hat\lambda_{h,K}(x) = \frac{1}{h} \sum_{i=1}^{n} K\!\left(\frac{x - X_{(i)}}{h}\right) \frac{\delta_{(i)}}{n - i + 1}. \qquad (3)$$
The parameter $h$ is called the bandwidth or smoothing parameter.
Let us denote
$$V(K) = \int_{-1}^{1} K^2(x)\,dx, \qquad \beta_2 = \int_{-1}^{1} x^2 K(x)\,dx,$$
$$\Lambda = \int_0^{\tau} \frac{\lambda(x)}{\bar L(x)}\,dx, \qquad D_2 = \int_0^{\tau} \left(\lambda^{(2)}(x)\right)^2 dx.$$
The global quality of the estimate is measured by the Mean Integrated Square Error:
$$\mathrm{MISE}\left(\hat\lambda_{h,K}\right) = \int_0^{\tau} \mathrm{MSE}\left(\hat\lambda_{h,K}(x)\right) dx = E \int_0^{\tau} \left(\hat\lambda_{h,K}(x) - \lambda(x)\right)^2 dx.$$
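Estimate (3) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes that $X_{(i)}$ denotes the i-th smallest observed time with $\delta_{(i)}$ its indicator, it ignores ties, and the function names `epanechnikov` and `kernel_hazard` are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 3/4 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_hazard(x, times, deltas, h, kernel=epanechnikov):
    """Evaluate the kernel hazard estimate (3) at the point x.

    times  -- observed values X_i, in any order
    deltas -- indicators: 1 for an observed lifetime, 0 for a censored one
    h      -- bandwidth (h > 0)
    """
    n = len(times)
    # Sort the pairs by observed time so that index i corresponds to X_(i).
    pairs = sorted(zip(times, deltas))
    total = 0.0
    for i, (x_i, d_i) in enumerate(pairs, start=1):
        # Each uncensored point contributes a kernel bump of weight 1/(n-i+1).
        total += kernel((x - x_i) / h) * d_i / (n - i + 1)
    return total / h
```

Because the kernel integrates to one, the estimate integrates to the sum of the weights $\delta_{(i)}/(n-i+1)$, which gives a quick sanity check on any implementation.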
The leading term $\overline{\mathrm{MISE}}(\hat\lambda_{h,K})$ of $\mathrm{MISE}(\hat\lambda_{h,K})$ takes the form
$$\overline{\mathrm{MISE}}\left(\hat\lambda_{h,K}\right) = \frac{1}{4} h^4 \beta_2^2 D_2 + \frac{V(K)\Lambda}{nh}.$$
The asymptotically optimal bandwidth minimizing $\overline{\mathrm{MISE}}(\hat\lambda_{h,K})$ with respect to $h$ is given by the formula
$$h_{opt} = n^{-1/5} \left(\frac{\Lambda V(K)}{\beta_2^2 D_2}\right)^{1/5}. \qquad (4)$$
The estimate of $h_{opt}$ will be denoted by $\hat h_{opt}$. See Horová & Zelinka (2006) for a method of computing the estimate $\hat h_{opt}$.
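For completeness, the step from the leading term to formula (4) is a one-line minimization in $h$. The derivation below uses only the symbols defined on the slides.

```latex
\begin{align*}
\frac{d}{dh}\left[\frac{1}{4} h^4 \beta_2^2 D_2 + \frac{V(K)\Lambda}{nh}\right]
  &= h^3 \beta_2^2 D_2 - \frac{V(K)\Lambda}{n h^2} = 0 \\
\Longrightarrow\quad h^5 &= \frac{V(K)\Lambda}{n \beta_2^2 D_2}
\quad\Longrightarrow\quad
h_{opt} = n^{-1/5}\left(\frac{\Lambda V(K)}{\beta_2^2 D_2}\right)^{1/5}.
\end{align*}
```

The second derivative, $3 h^2 \beta_2^2 D_2 + 2 V(K)\Lambda/(n h^3)$, is positive for $h > 0$, so the stationary point is a minimum.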
Kernel estimate of the hazard function: example
[Figure: two plots over the range 0 to 250.]
Simulation of lifetimes
For a given hazard function $\lambda$ we have (see (2))
$$F(x) = 1 - e^{-\int_0^x \lambda(t)\,dt}.$$
The lifetimes $T_1, \dots, T_n$ can be evaluated numerically from random variables $U_1, \dots, U_n$ uniformly distributed on the interval $[0, 1]$, by solving $F(T_i) = U_i$.
[Figure: graph of $F$, with $U$ on the vertical axis and $T$ on the horizontal axis.]
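The inversion of $F$ can be sketched as follows. This is one possible numerical scheme, not necessarily the one used for the slides: the cumulative hazard is approximated by the midpoint rule and $F(t) = u$ is solved by bisection. The function names, the `upper` horizon and the convention of returning infinity when $u \ge F(\text{upper})$ are assumptions of this sketch.

```python
import math
import random


def cum_hazard(lam, x, steps=400):
    """Midpoint-rule approximation of the integral of lam over [0, x]."""
    dx = x / steps
    return sum(lam((k + 0.5) * dx) for k in range(steps)) * dx


def cdf(lam, x):
    """F(x) = 1 - exp(-integral of lam over [0, x]), as in (2)."""
    return 1.0 - math.exp(-cum_hazard(lam, x))


def sample_lifetime(lam, u, upper, tol=1e-8):
    """Solve F(t) = u on [0, upper] by bisection.

    Returns math.inf when u >= F(upper): the lifetime then lies beyond the
    horizon (a convention of this sketch, not taken from the slides).
    """
    if u >= cdf(lam, upper):
        return math.inf
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(lam, mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_lifetimes(lam, n, upper, rng=None):
    """Draw n lifetimes from U_i ~ U[0, 1] via T_i = F^{-1}(U_i)."""
    rng = rng or random.Random()
    return [sample_lifetime(lam, rng.random(), upper) for _ in range(n)]
```

For example, `simulate_lifetimes(lambda x: x * (2.0 - x) ** 2, 100, 2.0, random.Random(1))` draws lifetimes for a hazard that vanishes at the horizon 2; lifetimes returned as infinity would be censored in any case when the censoring times do not exceed the horizon.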
Simulation of censoring times
Real situation: consider a clinical study of some disease. The study begins at time $t_0$ (we can suppose $t_0 = 0$). Patients enter the study at random during the interval $[t_0, t_1]$; the start of treatment is a random variable $B$ with cumulative distribution function $H$. Recruitment of patients stops at time $t_1$, but the study may continue until some time $t_2 \ge t_1$, when it ends.
The censoring time is
$$C = t_2 - B.$$
For the survival function $\bar G$ we have
$$\bar G(x) = H(t_2 - x).$$
[Figure: cumulative distribution function for the arrival of patients ($H$) and survival function of the censoring times ($\bar G = 1 - G$); the horizontal axis is marked at 0, $t_2 - t_1$, $t_1$ and $t_2$.]
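The censoring scheme can be sketched as below. The sketch assumes that the entry times $B$ are uniform on $[0, t_1]$ (the choice made for the simulations later in the talk); the function names are hypothetical.

```python
import random


def simulate_censoring_times(n, t1, t2, rng=None):
    """Draw entry times B ~ U[0, t1] and return censoring times C = t2 - B."""
    rng = rng or random.Random()
    return [t2 - rng.uniform(0.0, t1) for _ in range(n)]


def censor(lifetimes, censoring_times):
    """Form the observed pairs (X_i, delta_i) of the random censorship model."""
    pairs = []
    for t, c in zip(lifetimes, censoring_times):
        x = min(t, c)               # X_i = min(T_i, C_i)
        delta = 1 if t <= c else 0  # 1 = lifetime observed, 0 = censored
        pairs.append((x, delta))
    return pairs
```

With uniform entry times every censoring time falls in $[t_2 - t_1, t_2]$, so a larger $t_1$ relative to $t_2$ produces heavier censoring.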
Let us recall
$$h_{opt}^5 = \frac{V(K)\Lambda}{n \beta_2^2 D_2}$$
for
$$V(K) = \int_{-1}^{1} K^2(x)\,dx, \qquad \beta_2 = \int_{-1}^{1} x^2 K(x)\,dx,$$
$$\Lambda = \int_0^{\tau} \frac{\lambda(x)}{\bar F(x)\,\bar G(x)}\,dx, \qquad D_2 = \int_0^{\tau} \left(\lambda^{(2)}(x)\right)^2 dx.$$
Choice of $\tau$: the natural choice is $\tau = t_2$, but this causes a problem when computing $\Lambda$, since $\bar G(t_2) = 0$.
Solution: for the given $\bar G$, let us take $\lambda$ such that
$$\bar F(t_2) > 0, \qquad \frac{\lambda(x)}{\bar G(x)} = O(1) \quad \text{for } x \to t_2.$$
As a result of this property we have $\lambda(t_2) = 0$ and, for $\lambda \in C^2[0, \tau]$, also $\lambda'(t_2) = 0$, since $\lambda$ is non-negative.
In all simulations let the starts of treatment $B$ be uniformly distributed on $[0, t_1]$. Consequently the censoring time $C$ is uniformly distributed on $[t_2 - t_1, t_2]$.
Simulation 1
$\lambda$: unimodal hazard function on $[0, t_2]$:
$$\lambda(x) = x(2 - x)^2, \qquad F(x) = 1 - e^{-\frac{x^2}{12}(3x^2 - 16x + 24)}.$$
Let $K$ be the Epanechnikov kernel and $n = 100$.
Case A: $t_1 = 1$, $t_2 = 2$, $h_{opt} = 0.4437$
Case B: $t_1 = 1.5$, $t_2 = 2$, $h_{opt} = 0.4721$
Case C: $t_1 = 2$, $t_2 = 2$, $h_{opt} = 0.4993$
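The optimal bandwidths for Simulation 1 can be recomputed from formula (4) by numerical integration. The sketch below is an independent check rather than the author's code. It assumes uniform censoring on $[t_2 - t_1, t_2]$, uses the standard Epanechnikov constants $V(K) = 3/5$ and $\beta_2 = 1/5$, and integrates with a midpoint rule, whose nodes never touch $x = t_2$ where $\bar G$ vanishes. All function names are hypothetical.

```python
import math

V_K = 3.0 / 5.0      # V(K) for the Epanechnikov kernel
BETA_2 = 1.0 / 5.0   # beta_2 for the Epanechnikov kernel


def lam(x):
    """Hazard function of Simulation 1."""
    return x * (2.0 - x) ** 2


def lam_dd(x):
    """Second derivative of lam: (4x - 4x^2 + x^3)'' = 6x - 8."""
    return 6.0 * x - 8.0


def surv_F(x):
    """Survival function of the lifetimes, 1 - F(x)."""
    return math.exp(-x * x * (3.0 * x * x - 16.0 * x + 24.0) / 12.0)


def surv_G(x, t1, t2):
    """Survival function of C ~ U[t2 - t1, t2] (assumed uniform entry times)."""
    if x <= t2 - t1:
        return 1.0
    return (t2 - x) / t1


def midpoint(f, a, b, steps=20000):
    """Midpoint rule; the nodes avoid both endpoints of [a, b]."""
    dx = (b - a) / steps
    return sum(f(a + (k + 0.5) * dx) for k in range(steps)) * dx


def h_opt(n, t1, t2):
    """Asymptotically optimal bandwidth (4) with tau = t2."""
    big_lambda = midpoint(
        lambda x: lam(x) / (surv_F(x) * surv_G(x, t1, t2)), 0.0, t2)
    d2 = midpoint(lambda x: lam_dd(x) ** 2, 0.0, t2)
    return (V_K * big_lambda / (n * BETA_2 ** 2 * d2)) ** 0.2
```

Calling `h_opt(100, 1.0, 2.0)`, `h_opt(100, 1.5, 2.0)` and `h_opt(100, 2.0, 2.0)` should give values close to those reported for Cases A, B and C.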
Estimate of $\lambda$ for optimal bandwidth
[Figure: three panels on $[0, 2]$, for Cases A, B and C. In each panel $\lambda$ is the dashed line and the estimate is the solid line.]
Estimate of $h_{opt}$ for 200 repetitions
[Figure: estimated values for $t_1 = 1.0$, $t_1 = 1.5$ and $t_1 = 2.0$. Dashed lines: optimal bandwidths.]
Simulation 2
$\lambda$: unimodal hazard function on $[0, t_2]$:
$$\lambda(x) = \frac{1}{100}\left(1 - \cos\frac{2\pi x}{t_2}\right), \qquad F(x) = 1 - e^{\frac{1}{100}\left(\frac{t_2}{2\pi}\sin\frac{2\pi x}{t_2} - x\right)}.$$
Let $K$ be the Epanechnikov kernel and $n = 100$.
Case A: $t_1 = 100$, $t_2 = 200$, $h_{opt} = 43.703$
Case B: $t_1 = 150$, $t_2 = 200$, $h_{opt} = 47.122$
Estimate of $\lambda$ for optimal bandwidth
[Figure: two panels on $[0, 200]$, for Cases A and B. In each panel $\lambda$ is the dashed line and the estimate is the solid line.]
Estimate of $h_{opt}$ for 200 repetitions
[Figure: estimated values for $t_1 = 100$ and $t_1 = 150$. Dashed lines: optimal bandwidths.]
Simulation 3
$\lambda$: bimodal hazard function on $[0, t_2]$:
$$\lambda(x) = \frac{1}{100}\left(1 - \cos\frac{4\pi x}{t_2}\right), \qquad F(x) = 1 - e^{\frac{1}{100}\left(\frac{t_2}{4\pi}\sin\frac{4\pi x}{t_2} - x\right)}.$$
Let $K$ be the Epanechnikov kernel and $n = 200$.
Case A: $t_1 = 100$, $t_2 = 200$, $h_{opt} = 23.443$
Case B: $t_1 = 150$, $t_2 = 200$, $h_{opt} = 25.255$
Estimate of $\lambda$ for optimal bandwidth
[Figure: two panels on $[0, 200]$, for Cases A and B. In each panel $\lambda$ is the dashed line and the estimate is the solid line.]
Estimate of $h_{opt}$ for 200 repetitions
[Figure: estimated values for $t_1 = 100$ and $t_1 = 150$. Dashed lines: optimal bandwidths.]
Conclusion
• The simulations indicate that the proposed method of generating randomly censored data, for a given distribution of the censoring times $C$ and a given hazard function $\lambda$, can be applied well to testing the algorithms of survival analysis.
• At the same time, the simulations show that the method of bandwidth choice proposed in Horová & Zelinka (2006) gives worse results as the frequency of censored data grows, but the estimates of the optimal bandwidth remain usable.
References
Collett D.: Modelling Survival Data in Medical Research. Chapman & Hall/CRC, Boca Raton-London-New York-Washington, D.C., 2003.
Horová I., Zelinka J., Budíková M.: Estimates of Hazard Functions for Carcinoma Data Sets. Environmetrics, 17, 239–255, 2006.
Horová I., Zelinka J.: Kernel Estimates of Hazard Functions for Biomedical Data Sets. In Applied Biostatistics: Case Studies and Interdisciplinary Methods, Springer, 2006.
Horová I., Pospíšil Z., Zelinka J.: Semiparametric Estimation of Hazard Function for Cancer Patients. Sankhya, 69, 494–513, 2008.
Horová I., Pospíšil Z., Zelinka J.: Hazard function for cancer patients and cancer cell dynamics. Journal of Theoretical Biology, 258, 437–443, 2009.