Variable Selection and the Assessment of Predictive Accuracy with Interval-Censored Responses
Richard Cook, Statistics and Actuarial Science, University of Waterloo
Statistical Issues in Biomarker and Drug Co-Development, Toronto, Ontario, November 8, 2014
Joint work with Ying Wu and Ker-Ai Lee
Part I: Variable Selection with Interval-Censored Responses
Prognostic Human Leukocyte Antigens in Psoriatic Arthritis
• The University of Toronto Psoriatic Arthritis Clinic is a tertiary referral clinic with 1300 patients, extensive longitudinal follow-up on disease progression, and collected genetic and serum samples.
• Patients with psoriatic arthritis are classified as having arthritis mutilans if they have 5 or more damaged joints.
• Patients are scheduled for radiological assessment every two years.
• The time to development of arthritis mutilans is not observed exactly because it is interval-censored.
Immediate goal: identify HLA markers that predict onset of arthritis mutilans.
[Figure: Joint damage and marker values in continuous time. Total number of damaged joints (left axis) and marker of inflammation, ESR (right axis), against time since onset of psoriatic arthritis. HLA markers are recorded at clinic entry.]
[Figure: Joint damage and marker values in continuous time, with the time T of arthritis mutilans marked on the time axis. Total number of damaged joints (left axis) and marker of inflammation, ESR (right axis), against time since onset of psoriatic arthritis. HLA markers are recorded at clinic entry.]
[Figure: Available data due to intermittent assessments. The number of damaged joints and ESR are observed only at the follow-up assessment times $s_1, \ldots, s_6$ (marked X); T is marked on the time axis. Horizontal axis: time since onset of psoriatic arthritis, with HLA markers recorded at clinic entry.]
Data for Response Model
[Diagram: time axis from PsA onset showing the censoring interval with endpoints L and R, together with the HLA data (X).]
Data for Assessment Process
• $Z(s_j)$ denotes the marker of inflammation at assessment time $s_j$
• $w_j = s_j - s_{j-1}$, $j = 1, 2, \ldots$, are the waiting times between assessments
[Diagram: time axis from PsA onset showing assessment times $s_1, \ldots, s_6$ with marker values $Z(s_1), \ldots, Z(s_6)$, together with the HLA data (X).]
[Figure: Semi-parametric estimates of waiting time distributions. Cumulative probability against time in years (0 to 40), one curve per waiting time: diagnosis to 1st X-ray, then each successive pair of X-rays from 1st to 2nd up to 9th to 10th.]
[Figure: Estimate [1] of the distribution of time to arthritis mutilans. Turnbull estimate of the cumulative probability of arthritis mutilans with pointwise 95% confidence band, against years since diagnosis of PsA (0 to 40).]
[1] Turnbull BW (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society. Series B (Methodological), 38, 290–295.
Penalized Regression for Failure Time Data
• $\log L(\beta)$ is the log-likelihood or log partial likelihood
• Consider a penalized "likelihood" function
$$\log L_{PEN}(\beta) = \log L(\beta) - \sum_{j=1}^{p} \pi_{\gamma,\lambda}(\beta_j) \qquad (1.1)$$
• $\pi_{\gamma,\lambda}(\cdot)$ is a penalty function
• $(\gamma, \lambda)$ are tuning parameters
• $\lambda = (\lambda_1, \ldots, \lambda_p)'$ if we use different penalties for each variable
Some Particular Penalty Functions
• The L2 penalty $\pi_\lambda(|\beta|) = \lambda |\beta|^2$ gives ridge regression [2].
• The L1 penalty $\pi_\lambda(|\beta|) = \lambda |\beta|$ yields the LASSO [3].
Smoothly Clipped Absolute Deviation (SCAD) Penalty
• The smoothly clipped absolute deviation (SCAD) penalty is defined in Fan and Li [4].
Adaptive LASSO
• The adaptive LASSO [5] penalty has the form $\pi_\lambda(|\beta_j|) = \lambda |\beta_j| \tau_j$, with small weights $\tau_j$ chosen for large coefficients and large weights for small coefficients.
[2] Hoerl AE and Kennard RW (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
[3] Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
[4] Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
[5] Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
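The four penalties can be compared side by side in a few lines. The sketch below is illustrative only: the function names are hypothetical, the SCAD branch uses the standard closed form from Fan and Li (2001) with shape parameter `gamma` (assumed here to play the role of $\gamma$ in $\pi_{\gamma,\lambda}$, with their suggested default of 3.7), and the adaptive LASSO weight `tau` is supplied by the caller.

```python
def ridge_penalty(beta, lam):
    """L2 penalty: lam * beta^2."""
    return lam * beta ** 2


def lasso_penalty(beta, lam):
    """L1 penalty: lam * |beta|."""
    return lam * abs(beta)


def scad_penalty(beta, lam, gamma=3.7):
    """SCAD penalty in the standard Fan and Li (2001) form; needs gamma > 2.

    Linear like the LASSO near zero, quadratic in the middle, and constant
    for large |beta|, so large coefficients are not shrunk further.
    """
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= gamma * lam:
        return (2 * gamma * lam * b - b ** 2 - lam ** 2) / (2 * (gamma - 1))
    return lam ** 2 * (gamma + 1) / 2


def adaptive_lasso_penalty(beta, lam, tau):
    """Adaptive LASSO penalty: lam * |beta| * tau, with tau a per-variable weight."""
    return lam * abs(beta) * tau
```

Evaluating `scad_penalty` on a grid of coefficients shows the flat region beyond `gamma * lam`, which is the feature that distinguishes it from the LASSO.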
Penalized Regression with Interval-Censored Data
• For individual $i$, $D_i = \{(L_i, R_i), X_i\}$, where $X_i$ is a $p \times 1$ covariate vector
• The data consist of $D = \{D_i, i = 1, 2, \ldots, m\}$
Observed Data Log-Likelihood
Up to an additive constant,
$$\log L = \sum_{i=1}^{m} \log\left[F(L_i \mid X_i) - F(R_i \mid X_i)\right]$$
where $F(s \mid X)$ is the survivor function.
Penalized Observed Data Log-Likelihood
$$\log L_{penalized} = \sum_{i=1}^{m} \log\left[F(L_i \mid X_i) - F(R_i \mid X_i)\right] - \sum_{j=1}^{p} \pi_{\gamma,\lambda}(\beta_j)$$
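As a rough illustration of the observed-data log-likelihood, the sketch below evaluates it for a toy dataset. It assumes an exponential proportional hazards survivor function with baseline rate `rho` and an L1 penalty, neither of which is specified on the slide; all function names are hypothetical. A right-censored observation is encoded with `R = math.inf`, so that $F(R \mid X) = 0$.

```python
import math


def survivor(s, x, rho, beta):
    """Survivor function F(s | x) under an exponential PH model (an assumption)."""
    if math.isinf(s):
        return 0.0
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(-rho * s * math.exp(eta))


def observed_loglik(data, rho, beta):
    """Sum over individuals of log[F(L_i | X_i) - F(R_i | X_i)]."""
    total = 0.0
    for (left, right), x in data:
        total += math.log(survivor(left, x, rho, beta) - survivor(right, x, rho, beta))
    return total


def penalized_loglik(data, rho, beta, lam):
    """Observed-data log-likelihood minus an L1 penalty on beta (illustrative choice)."""
    return observed_loglik(data, rho, beta) - sum(lam * abs(b) for b in beta)


# Two individuals: one interval-censored in (0, 2], one right-censored at 2.
toy = [((0.0, 2.0), [1.0]), ((2.0, math.inf), [0.0])]
value = observed_loglik(toy, rho=0.5, beta=[0.0])
```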
Penalized Regression with Interval-Censored Data
[Diagram: time axis partitioned by breakpoints $b_0, b_1, \ldots, b_K$ into intervals $B_1, B_2, \ldots, B_K$.]
Breakpoints $0 = b_0 < \cdots < b_K = \infty$ define $B_k = [b_{k-1}, b_k)$, $k = 1, \ldots, K$.
If $I_k(u) = I(u \in B_k)$ and $S_k(u) = \int_0^u I(v \in B_k)\,dv$, then the piecewise-constant hazard is
$$h(u \mid X_i; \theta) = \prod_{k=1}^{K} \left(\rho_k \exp(X_i'\beta)\right)^{I_k(u)}$$
where $\theta = (\rho', \beta')'$, $\rho = (\rho_1, \ldots, \rho_K)'$ and $\beta = (\beta_1, \ldots, \beta_p)'$.
Complete Data Log-Likelihood
With $u_i$ denoting the failure time of individual $i$,
$$\log L_c(\theta) = \sum_{i=1}^{m} \sum_{k=1}^{K} \left\{ I_k(u_i)\left[\log(\rho_k) + X_i'\beta\right] - S_k(u_i)\,\rho_k \exp(X_i'\beta) \right\}$$
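The quantities $I_k(u)$ and $S_k(u)$ are simple to compute, which is what makes the piecewise-constant model convenient. The following sketch shows one individual's contribution to the complete-data log-likelihood; the function names and the example breakpoints are hypothetical, and `k` is 1-based to match the slide.

```python
import math


def indicator(u, k, breaks):
    """I_k(u): 1 if u lies in B_k = [b_{k-1}, b_k), else 0."""
    return 1 if breaks[k - 1] <= u < breaks[k] else 0


def exposure(u, k, breaks):
    """S_k(u): length of time spent in B_k up to time u."""
    return max(0.0, min(u, breaks[k]) - breaks[k - 1])


def complete_loglik_term(u, x, rho, beta, breaks):
    """Contribution of one exactly observed failure time u to log L_c(theta)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    total = 0.0
    for k in range(1, len(breaks)):
        total += indicator(u, k, breaks) * (math.log(rho[k - 1]) + eta)
        total -= exposure(u, k, breaks) * rho[k - 1] * math.exp(eta)
    return total


breaks = [0.0, 5.0, 10.0, math.inf]   # b_0 < b_1 < b_2 < b_3 = infinity
rho = [0.1, 0.2, 0.3]                 # one baseline hazard level per interval
term = complete_loglik_term(7.0, [0.0], rho, [0.0], breaks)
```

For `u = 7` the failure falls in $B_2$, with 5 time units of exposure in $B_1$ and 2 in $B_2$, so the term is the log hazard in $B_2$ minus the cumulative hazard accumulated over both intervals.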
An EM Algorithm [6] with Penalized Regression
The Expectation Step
Take the conditional expectation of the penalized complete data log-likelihood:
$$Q(\theta; \theta^{r-1}) = E\left[\log L_c(\theta) \mid D; \theta^{r-1}\right] - \sum_{j=1}^{p} \pi_{\gamma,\lambda}(\beta_j)$$
If
$$\hat{g}^{r}_{ik} = E\left[I_k(u_i) \mid D_i; \theta^{r-1}\right], \qquad \hat{S}^{r}_{ik} = E\left[S_k(u_i) \mid D_i; \theta^{r-1}\right]$$
then
$$Q(\theta; \theta^{r-1}) = \sum_{i=1}^{m} \sum_{k=1}^{K} \left\{ \hat{g}^{r}_{ik}\left(\log(\rho_k) + X_i'\beta\right) - \hat{S}^{r}_{ik}\,\rho_k \exp(X_i'\beta) \right\} - \sum_{j=1}^{p} \pi_{\gamma,\lambda}(\beta_j)$$
[6] Dempster AP, Laird NM and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
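The slide does not give formulas for the two conditional expectations, so the following is only one possible sketch. It assumes the failure time is known to lie in $(L, R]$, that all $\rho_k > 0$, and it uses closed forms obtained by integration by parts under the piecewise-constant hazard (a derivation made here, not taken from the talk). The names `e_step`, `surv` and `cum_hazard` are hypothetical, and `eta` stands for $X_i'\beta$ at the current estimate.

```python
import math


def cum_hazard(u, rho, breaks, eta):
    """Cumulative hazard: exp(eta) * sum_k rho_k * S_k(u)."""
    if math.isinf(u):
        return math.inf
    total = 0.0
    for k in range(1, len(breaks)):
        total += rho[k - 1] * max(0.0, min(u, breaks[k]) - breaks[k - 1])
    return total * math.exp(eta)


def surv(u, rho, breaks, eta):
    """Survivor function implied by the piecewise-constant hazard."""
    return math.exp(-cum_hazard(u, rho, breaks, eta))


def e_step(left, right, rho, breaks, eta):
    """Return (g, S) with g[k-1] = E[I_k(u) | L < u <= R], S[k-1] = E[S_k(u) | L < u <= R]."""
    denom = surv(left, rho, breaks, eta) - surv(right, rho, breaks, eta)
    g, S = [], []
    for k in range(1, len(breaks)):
        lo, hi = max(left, breaks[k - 1]), min(right, breaks[k])
        # Probability mass of the failure time in the overlap of (L, R] and B_k.
        mass = surv(lo, rho, breaks, eta) - surv(hi, rho, breaks, eta) if lo < hi else 0.0
        # Boundary terms S_k(L) F(L) and S_k(R) F(R); the latter vanishes when R is infinite.
        edge_left = max(0.0, min(left, breaks[k]) - breaks[k - 1]) * surv(left, rho, breaks, eta)
        if math.isinf(right):
            edge_right = 0.0
        else:
            edge_right = max(0.0, min(right, breaks[k]) - breaks[k - 1]) * surv(right, rho, breaks, eta)
        # Integral of the survivor function over the overlap, in closed form.
        integral = mass / (rho[k - 1] * math.exp(eta))
        g.append(mass / denom)
        S.append((edge_left - edge_right + integral) / denom)
    return g, S


g, S = e_step(3.0, 8.0, [0.1, 0.2, 0.3], [0.0, 5.0, 10.0, math.inf], 0.0)
```

Two checks are useful when adapting a sketch like this: the entries of `g` should sum to one, and the entries of `S` should sum to the conditional mean of the failure time, which must lie inside the censoring interval.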
Maximization Step
Let
• $Z_{ikj} = I(j = k)$, $j = 2, \ldots, K$, and $Z_{ik} = (1, Z_{ik2}, \ldots, Z_{ikK})'$
• $\alpha_1 = \log(\rho_1)$, $\alpha_j = \log(\rho_j) - \log(\rho_1)$, $j = 2, \ldots, K$, and $\alpha = (\alpha_1, \ldots, \alpha_K)'$
so that $Z_{ik}'\alpha = \log(\rho_k)$. Then $Q(\theta; \theta^{r-1})$ is
$$\sum_{i=1}^{m} \sum_{k=1}^{K} \left\{ \hat{g}^{r}_{ik}\left(Z_{ik}'\alpha + X_i'\beta\right) - \hat{S}^{r}_{ik} \exp\left(Z_{ik}'\alpha + X_i'\beta\right) \right\} - \sum_{j=1}^{p} \pi_{\gamma,\lambda}(\beta_j)$$
With a pseudo dataset we can maximize $Q(\theta; \theta^{r-1})$ using standard software for penalized regression (e.g. glmnet(.), SIS(.)).
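The slide does not spell out the pseudo dataset. One reading, offered here as an assumption, is that the unpenalized part of $Q$ has the form of a Poisson log-likelihood with response $\hat{g}^{r}_{ik}$, offset $\log \hat{S}^{r}_{ik}$ and covariates $(Z_{ik}, X_i)$, with one row per individual and interval. The sketch below builds such rows for one individual and evaluates the unpenalized objective; the function names are hypothetical and no penalized fitting is attempted.

```python
import math


def pseudo_rows(g, S, x):
    """One row per interval k: (response g_k, offset log S_k, design (Z_k, x))."""
    K = len(g)
    rows = []
    for k in range(1, K + 1):
        if S[k - 1] <= 0.0:
            continue  # no expected exposure in B_k, so the row contributes nothing
        z = [1.0] + [1.0 if j == k else 0.0 for j in range(2, K + 1)]
        rows.append((g[k - 1], math.log(S[k - 1]), z + list(x)))
    return rows


def q_unpenalized(rows, coef):
    """Sum over rows of y * lin - exp(offset + lin), with coef = (alpha, beta)."""
    total = 0.0
    for y, offset, design in rows:
        lin = sum(d * c for d, c in zip(design, coef))
        total += y * lin - math.exp(offset + lin)
    return total


rows = pseudo_rows(g=[0.4, 0.6, 0.0], S=[4.0, 1.5, 0.0], x=[1.0])
q0 = q_unpenalized(rows, [0.0, 0.0, 0.0, 0.0])
```

Under this reading, only the coefficients of $X_i$ would be penalized, while the interval effects $\alpha$ are left unpenalized.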
Selection of Optimal Penalty $\lambda_{OPT}$
• The criterion for selecting the optimal $\lambda$ is similar to traditional cross-validation.
• We partition the dataset $T$ into $R$ subsamples $T_1, \ldots, T_R$.
• $T_r$ and $T - T_r$ are the $r$th testing and training sets.
• For a given $\lambda$, the cross-validation statistic is
$$\widehat{CV}(\lambda) = \sum_{r=1}^{R} \left\{ \log L(\hat{\theta}_{-r}(\lambda)) - \log L_{-r}(\hat{\theta}_{-r}(\lambda)) \right\}$$
• $L$ is the observed likelihood for the full dataset and $L_{-r}$ is the observed likelihood for the $r$th training dataset.
• $\hat{\theta}_{-r}(\lambda)$ is the estimate for the $r$th training data.
• The optimal $\lambda$ maximizes $\widehat{CV}(\lambda)$.
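The cross-validation statistic can be sketched generically, with the fitting routine and the log-likelihood passed in as functions. Everything below is illustrative: `fit` and `loglik` are hypothetical callables, and the toy model (a shrunken mean with a negative squared-error "log-likelihood") only stands in for the penalized interval-censored fit.

```python
def cv_statistic(data, folds, lam, fit, loglik):
    """Sum over folds of log L(theta_{-r}) - log L_{-r}(theta_{-r})."""
    total = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        theta = fit(train, lam)  # estimate from the r-th training set
        total += loglik(theta, data) - loglik(theta, train)
    return total


def select_lambda(data, folds, grid, fit, loglik):
    """Return the value of lambda on the grid that maximizes the CV statistic."""
    return max(grid, key=lambda lam: cv_statistic(data, folds, lam, fit, loglik))


# Toy stand-ins for the penalized fit and the observed-data log-likelihood.
def toy_fit(train, lam):
    return sum(train) / len(train) / (1.0 + lam)


def toy_loglik(theta, sample):
    return -sum((d - theta) ** 2 for d in sample)


data = [1.0, 2.0, 3.0, 4.0]
folds = [[0, 1], [2, 3]]
cv0 = cv_statistic(data, folds, 0.0, toy_fit, toy_loglik)
best = select_lambda(data, folds, [0.0, 1.0], toy_fit, toy_loglik)
```

Because each fold's difference is the full-data log-likelihood minus the training-data log-likelihood at the same estimate, it isolates the contribution of the held-out observations.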