The Analysis of Placement Values for Evaluating Discriminatory Measures
Margaret Sullivan Pepe & Tianxi Cai, Biometrics (2004)
Allison Meisner · May 27, 2014

Overview
When we have a continuous test $Y$ and a binary outcome $D$, the ROC curve plots the (FPR, TPR) pairs for each possible cutoff of the test.
Problem: The ROC curve may differ by patient characteristics. Identifying such variability helps us apply the test optimally.
Solution: ROC regression with placement values

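As a minimal sketch of the definition above, the empirical ROC curve can be traced by sweeping the cutoff over the observed test values. The function name `empirical_roc`, the rule "test positive if $Y \ge$ cutoff", and the toy data are assumptions made for illustration; they do not come from the paper.

```python
def empirical_roc(y_diseased, y_nondiseased):
    """Return (FPR, TPR) pairs for the rule 'test positive if Y >= cutoff'."""
    # Sweep cutoffs from the largest observed value down to the smallest.
    cutoffs = sorted(set(y_diseased) | set(y_nondiseased), reverse=True)
    points = [(0.0, 0.0)]  # a cutoff above every value classifies no one as positive
    for c in cutoffs:
        fpr = sum(y >= c for y in y_nondiseased) / len(y_nondiseased)
        tpr = sum(y >= c for y in y_diseased) / len(y_diseased)
        points.append((fpr, tpr))
    return points


# Hypothetical marker values, chosen only to make the arithmetic easy to follow.
cases = [2.1, 3.4, 4.0, 5.2]
controls = [1.0, 1.8, 2.5, 3.0]
roc_points = empirical_roc(cases, controls)
```

Each pair in `roc_points` corresponds to one cutoff; covariates enter the picture when these pairs differ across subgroups.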
Motivating Example
Prostate-specific antigen (PSA) is a popular, though controversial, way to screen men for prostate cancer (PCa). The biology of PSA and PCa has implications for the usefulness of PSA as a screening tool:
◮ PSA levels differ by age: older men typically have higher PSA, regardless of PCa status
◮ Age can potentially affect the ability of PSA to discriminate PCa cases
◮ Among PCa cases, PSA measured closer to diagnosis does a better job of discriminating PCa

Background: FPR, TPR, ROC
[Figure slides; images not recovered]

Background: Effect of Covariates on ROC
[Figure slides; images not recovered]
Recall, $\mathrm{ROC}(u)$ = (TPR at FPR = $u$).

ROC Model
◮ ROC model (Pepe, 1997): $\mathrm{ROC}_{Z_D}(u) = g(\beta^T Z_D + H_\alpha(u))$
  ◮ $\alpha$ = underlying shape of the ROC curve
  ◮ $\beta$ = impact of $Z_D$ on the shape of the ROC curve
◮ Problem: estimation
  ◮ Pepe (2000) and Alonzo and Pepe (2002) create indicators $I(Y_{Di} \ge F_{\bar D}^{-1}(1-u))$ for some set of FPRs $u$ and then use binary regression techniques
  ◮ Pepe & Cai propose using placement values and what is known about their distribution to estimate the parameters more efficiently

Placement Values
◮ Definitions
  ◮ Placement values: $U_{Di} = 1 - F_{\bar D}(Y_{Di})$ for the $i$th diseased subject. In words, the placement value for the $i$th diseased subject is the proportion of the reference (non-diseased) population with marker values $Y$ above $Y_{Di}$.
  ◮ If $Z_{\bar D}$ affects the distribution of $Y$ in the reference population, $U_{Di} = 1 - F_{\bar D, Z_{\bar D}}(Y_{Di})$.
  ◮ ROC curve: $\mathrm{ROC}(u) = P(Y_D \ge F_{\bar D}^{-1}(1-u))$ = (TPR at FPR = $u$)
◮ Relationship between ROC and placement values
$$\begin{aligned}
\mathrm{ROC}(u) &= P(Y_D \ge F_{\bar D}^{-1}(1-u)) \\
&= P(1 - u \le F_{\bar D}(Y_D)) \\
&= P(1 - F_{\bar D}(Y_D) \le u) \\
&= P(U_D \le u)
\end{aligned}$$

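The identity $\mathrm{ROC}(u) = P(U_D \le u)$ can be illustrated with empirical quantities. The sketch below replaces $F_{\bar D}$ with the empirical distribution of a small, made-up control sample; the function names and the data are hypothetical.

```python
def placement_values(y_diseased, y_nondiseased):
    """Empirical placement values: share of controls with a marker value above each case."""
    n = len(y_nondiseased)
    return [sum(y0 > y for y0 in y_nondiseased) / n for y in y_diseased]


def roc_from_placement_values(pv, u):
    """ROC(u) read off as the empirical CDF of the placement values at u."""
    return sum(v <= u for v in pv) / len(pv)


# Hypothetical marker values, for illustration only.
cases = [2.1, 3.4, 4.0, 5.2]
controls = [1.0, 1.8, 2.5, 3.0]

pv = placement_values(cases, controls)
tpr_at_fpr_25 = roc_from_placement_values(pv, 0.25)
tpr_at_fpr_50 = roc_from_placement_values(pv, 0.50)
```

Here only the case with value 2.1 has controls above it (two of four), so its placement value is 0.5 and the other three are 0; three of four cases therefore have placement values at or below 0.25.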
Proposed Method
◮ ROC model (Pepe, 1997): $\mathrm{ROC}_{Z_D}(u) = g(\beta^T Z_D + H_\alpha(u))$
◮ Proposed model: $H_\alpha(U_D) = -\beta^T Z_D + \epsilon$, where $\epsilon$ has distribution function $g$
◮ Proof of equivalence (taking $H_\alpha$ to be increasing in $u$):
$$\begin{aligned}
\Pr(U_D \le u) &= \Pr(H_\alpha(U_D) \le H_\alpha(u)) \\
&= \Pr(-\beta^T Z_D + \epsilon \le H_\alpha(u)) \\
&= \Pr(\epsilon \le \beta^T Z_D + H_\alpha(u)) \\
&= g(\beta^T Z_D + H_\alpha(u)) \\
&= \mathrm{ROC}_{Z_D}(u)
\end{aligned}$$
Recall that if $Z_{\bar D}$ affects the distribution of $Y$ in the reference population, $U_{Di} = 1 - F_{\bar D, Z_{\bar D}}(Y_{Di})$; then we may write
$$H_\alpha(U_D) = -\beta^T Z_D + \epsilon \iff \mathrm{ROC}_{Z_{\bar D}, Z_D}(u) = g(\beta^T Z_D + H_\alpha(u))$$
◮ In our example, $Z_{\bar D}$ = age and $Z_D$ = (age, time).

Proposed Method: Algorithm
Since $\Pr(U_D \le u) = g(\beta^T Z_D + H_\alpha(u))$, we know the density function is
$$f(u) = \frac{\partial g(\beta^T Z_D + H_\alpha(u))}{\partial u}.$$
Then, for $[a, b] \subset (0, 1)$, the log likelihood is
$$\begin{aligned}
\ell(\theta) = \sum_{i=1}^{n_D} \big[\, & I(U_{Di} < a) \log\{ g(\beta^T Z_{Di} + H_\alpha(a)) \} \\
& + I(U_{Di} > b) \log\{ 1 - g(\beta^T Z_{Di} + H_\alpha(b)) \} \\
& + I(U_{Di} \in (a, b)) \log f(U_{Di}) \,\big]
\end{aligned}$$
where $\theta = (\alpha, \beta)$.

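To make the three terms of $\ell(\theta)$ concrete, here is a minimal sketch for one particular choice of model: $g = \Phi$ and $H_\alpha(u) = \alpha_0 + \alpha_1 \Phi^{-1}(u)$, the probit form used in the simulation and application slides. Under that assumption the density is $f(u) = \alpha_1 \phi(\beta^T Z_D + \alpha_0 + \alpha_1 \Phi^{-1}(u)) / \phi(\Phi^{-1}(u))$. The function names and argument layout are hypothetical, the code assumes $\alpha_1 > 0$, and placement values exactly equal to $a$ or $b$ are assigned to the density term.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf, pdf, inv_cdf


def roc_value(t, shift, alpha1):
    """ROC at FPR t under the assumed probit form, with shift = alpha0 + beta'z."""
    return STD_NORMAL.cdf(shift + alpha1 * STD_NORMAL.inv_cdf(t))


def log_likelihood(theta, placement_values, covariates, a=0.01, b=0.99):
    """Censored log likelihood; theta = (alpha0, alpha1, beta_1, ..., beta_p)."""
    alpha0, alpha1, *beta = theta
    total = 0.0
    for u, z in zip(placement_values, covariates):
        shift = alpha0 + sum(bk * zk for bk, zk in zip(beta, z))
        if u < a:
            # Left-censored at a: contributes log Pr(U <= a).
            total += log(roc_value(a, shift, alpha1))
        elif u > b:
            # Right-censored at b: contributes log Pr(U > b).
            total += log(1.0 - roc_value(b, shift, alpha1))
        else:
            # Inside the interval: contributes the log density, d ROC(u) / du.
            q = STD_NORMAL.inv_cdf(u)
            density = alpha1 * STD_NORMAL.pdf(shift + alpha1 * q) / STD_NORMAL.pdf(q)
            total += log(density)
    return total
```

A quick sanity check: with $\alpha_0 = 0$, $\alpha_1 = 1$, and $\beta = 0$ the ROC curve is the diagonal, so interior placement values contribute $\log 1 = 0$ and each censored value contributes $\log(0.01)$ for the default interval. Maximizing this function over $\theta$ would require a numerical optimizer, which is not shown.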
Proposed Method: Algorithm
Estimating $F_{\bar D, Z_{\bar D}}$
◮ Pepe and Cai advise estimating $F_{\bar D, Z_{\bar D}}$ nonparametrically if $Z_{\bar D}$ is discrete and semiparametrically otherwise.
◮ For semiparametric estimation, Pepe and Cai recommend the semiparametric regression quantile estimation procedure developed by Heagerty and Pepe (1999).
The estimates of the placement values, $\hat U_{Di}$, are substituted into $\ell(\theta)$, yielding a pseudo-log-likelihood, which is maximized to estimate $\theta$.

Competing Method: Algorithm
Alonzo and Pepe proposed an algorithm for fitting ROC regression based on binary regression methods.
1. For $[a, b] \subset (0, 1)$, let $T = \{u_1, \ldots, u_{n_T}\} = \{1 - j/n_{\bar D};\ j = 1, \ldots, n_{\bar D} - 1\} \cap [a, b]$ (the maximal set).
2. Then for each diseased subject $i$, the $n_T$ binary variables $B_{ui}$ are calculated: $B_{ui} = I[\hat U_{Di} \le u]$, $u \in T$.
3. The binary generalized linear regression model $E\{B_{ui}\} = g\{\beta^T Z_D + H_\alpha(u)\}$ is fit using standard techniques.
The Pepe and Cai method is claimed to be more efficient than that of Alonzo and Pepe.

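Steps 1 and 2 of the binary regression approach amount to building a long-format data set with one row per (diseased subject, FPR) pair. The sketch below shows only that data construction; the function names and the toy inputs are hypothetical, and the model fit in step 3 would be handed to a GLM routine that is not shown.

```python
def fpr_grid(n_nondiseased, a, b):
    """Step 1: the maximal set T = {1 - j/n : j = 1, ..., n - 1} restricted to [a, b]."""
    grid = [1 - j / n_nondiseased for j in range(1, n_nondiseased)]
    return sorted(u for u in grid if a <= u <= b)


def binary_indicators(placement_values, covariates, grid):
    """Step 2: rows (u, z, B_ui) with B_ui = I[U_hat_Di <= u] for each case and each u in T."""
    rows = []
    for u_hat, z in zip(placement_values, covariates):
        for u in grid:
            rows.append((u, z, int(u_hat <= u)))
    return rows


# Hypothetical inputs: four controls, two cases with one covariate each.
grid = fpr_grid(4, a=0.01, b=0.99)
rows = binary_indicators([0.5, 0.0], [(60,), (70,)], grid)
```

Each diseased subject contributes $n_T$ correlated rows, which is one way to see why this approach discards information relative to using the placement value itself.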
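Simulations
Set-up
◮ $Y_D = \alpha_1^{-1}\{\alpha_0 + \beta_1 Z_1 + (\beta_2 + 0.5\alpha_1) Z_2 + \epsilon_D\}$, $Y_{\bar D} = 0.5 Z_2 + \epsilon_{\bar D}$
◮ $Z_1 \sim \mathrm{Bernoulli}(0.5)$, $Z_2 \sim \mathrm{Uniform}(0, 1)$
◮ $\epsilon_D \sim N(0, 1)$, $\epsilon_{\bar D} \sim N(0, 1)$
Induced ROC curve (with $F_{\bar D}$ the distribution of $Y_{\bar D}$ given $Z_2 = z_2$, and using the symmetry of $\epsilon_D$ in the fifth line):
$$\begin{aligned}
\mathrm{ROC}_{Z_{\bar D}, Z_D}(u) &= \Pr(U_D \le u) \\
&= \Pr(1 - F_{\bar D}(Y_D) \le u) \\
&= \Pr\big(F_{\bar D}^{-1}(1-u) \le \alpha_1^{-1}\{\alpha_0 + \beta_1 z_1 + (\beta_2 + 0.5\alpha_1) z_2 + \epsilon_D\}\big) \\
&= \Pr\big(\Phi^{-1}(1-u) + 0.5 z_2 \le \alpha_1^{-1}\{\alpha_0 + \beta_1 z_1 + (\beta_2 + 0.5\alpha_1) z_2 + \epsilon_D\}\big) \\
&= \Pr\big(\epsilon_D \le -\alpha_1 \Phi^{-1}(1-u) + \alpha_0 + \beta_1 z_1 + \beta_2 z_2\big) \\
&= \Phi\big(\alpha_1 \Phi^{-1}(u) + \alpha_0 + \beta_1 z_1 + \beta_2 z_2\big) \\
&= g(\beta^T Z_D + H_\alpha(u))
\end{aligned}$$
with $g = \Phi$ and $H_\alpha(u) = \alpha_0 + \alpha_1 \Phi^{-1}(u)$.
Recall, $\alpha$ = shape of ROC, $\beta$ = effects of $Z_D$ on ROC
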
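The induced ROC curve can be checked numerically. The sketch below draws diseased outcomes from the set-up at fixed covariate values, computes their true placement values using $Y_{\bar D} \mid Z_2 = z_2 \sim N(0.5 z_2, 1)$, and compares the share falling at or below $u$ with the closed form. The seed, sample size, covariate values, and function names are arbitrary choices for illustration, not settings from the paper.

```python
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def simulate_diseased(z1, z2, alpha0, alpha1, beta1, beta2, rng):
    """One draw of Y_D given (z1, z2) under the simulation set-up."""
    eps = rng.gauss(0.0, 1.0)
    return (alpha0 + beta1 * z1 + (beta2 + 0.5 * alpha1) * z2 + eps) / alpha1


def true_placement_value(y, z2):
    """1 - F(y) for the reference distribution N(0.5 * z2, 1)."""
    return 1.0 - PHI.cdf(y - 0.5 * z2)


def induced_roc(u, z1, z2, alpha0, alpha1, beta1, beta2):
    """Closed-form ROC value from the last lines of the derivation."""
    return PHI.cdf(alpha1 * PHI.inv_cdf(u) + alpha0 + beta1 * z1 + beta2 * z2)


rng = random.Random(2004)  # arbitrary seed, for reproducibility only
params = dict(alpha0=1.0, alpha1=1.0, beta1=0.5, beta2=0.7)
z1, z2, u = 1, 0.4, 0.2

draws = [simulate_diseased(z1, z2, rng=rng, **params) for _ in range(20000)]
mc_roc = sum(true_placement_value(y, z2) <= u for y in draws) / len(draws)
exact_roc = induced_roc(u, z1, z2, **params)
```

The Monte Carlo proportion `mc_roc` should agree with `exact_roc` up to simulation error.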
Simulations
Note that here $Z_{\bar D} = Z_2$ and $Z_D = (Z_1, Z_2)$.
Despite their recommendations, Pepe and Cai did not use the semiparametric method of Heagerty and Pepe to estimate placement values. Instead, they regressed $Y$ on $Z_2$ among the non-diseased subjects:
$$E(Y_{\bar D} \mid Z_2 = z_2) = \gamma_0 + \gamma_1 z_2 \;\Rightarrow\; \hat\epsilon_{\bar D i} = Y_{\bar D i} - \hat\gamma_0 - \hat\gamma_1 z_{2 \bar D i}.$$
The placement value for diseased subject $i$ was then estimated as
$$\hat U_{Di} = \frac{1}{n_{\bar D}} \sum_{j=1}^{n_{\bar D}} I\big(\hat\epsilon_{\bar D j} > Y_{Di} - \hat\gamma_0 - \hat\gamma_1 z_{2 D i}\big).$$

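The residual-based estimator above can be sketched in a few lines. This version assumes ordinary least squares for the regression of $Y_{\bar D}$ on $Z_2$ (the slide does not say how the fit was obtained), and the function names and toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimated_placement_values(y_dis, z2_dis, y_non, z2_non):
    """U_hat_Di: share of control residuals above each case's standardized value."""
    gamma0, gamma1 = fit_line(z2_non, y_non)
    residuals = [y - gamma0 - gamma1 * z for y, z in zip(y_non, z2_non)]
    n = len(residuals)
    return [
        sum(r > y - gamma0 - gamma1 * z for r in residuals) / n
        for y, z in zip(y_dis, z2_dis)
    ]


# Hypothetical data: four controls and two cases.
z2_controls = [0.0, 1.0, 0.0, 1.0]
y_controls = [-1.0, 0.0, 1.0, 2.0]
u_hat = estimated_placement_values([0.5, 3.0], [0.5, 1.0], y_controls, z2_controls)
```

In this toy data the fitted line has intercept 0 and slope 1, so the control residuals are $-1, -1, 1, 1$; the first case standardizes to 0 (half the residuals lie above it) and the second to 2 (none do).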
Simulations
Two sets of simulations (1000 simulations each):
1. Pepe and Cai method only
  ◮ Bias
  ◮ Empirical SE
  ◮ Mean estimated SE
  ◮ Empirical coverage probability
  ◮ Note: $\alpha_0 = 1$, $\alpha_1 = 1$, $\beta_1 = 0.5$, $\beta_2 = 0.7$ throughout
  ◮ Considered $[a, b] = [0.01, 0.99]$ and $[a, b] = [0.01, 0.20]$
2. Pepe and Cai vs. Alonzo and Pepe
  ◮ Bias
  ◮ MSE
  ◮ Two sets of parameter values considered
    ◮ $\alpha_0 = 1$, $\alpha_1 = 1$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
    ◮ $\alpha_0 = 1.5$, $\alpha_1 = 0.9$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
  ◮ Considered $[a, b] = [0.01, 0.99]$ and $[a, b] = [0.01, 0.50]$

Simulations: Pepe & Cai
◮ $[a, b] = [0.01, 0.99]$
[Results not recovered]

Simulations: Pepe & Cai vs. Alonzo & Pepe
◮ $\alpha_0 = 1$, $\alpha_1 = 1$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
◮ $[a, b] = [0.01, 0.99]$
[Results not recovered]

Application
The proposed method was applied to data from a study on PSA and PCa screening.
◮ 88 PCa cases, 88 age-matched controls
◮ Recall, $Z_{\bar D}$ = age and $Z_D$ = (age, time)
◮ Model: $\mathrm{ROC}_{Z_{\bar D}, Z_D}(u) = \Phi(\alpha_0 + \alpha_1 \Phi^{-1}(u) + \beta_1\,\mathrm{time} + \beta_2\,\mathrm{age})$
◮ SE estimates from the bootstrap (500 replications)

Parameter     Estimate (SE)
$\alpha_0$     4.30 (0.93)
$\alpha_1$     0.84 (0.09)
$\beta_1$     -0.16 (0.03)
$\beta_2$     -0.04 (0.01)

Conclusions
◮ The proposed method is intuitive and makes full use of the data through placement values, rather than reducing them to indicators.
◮ Implementation of the proposed method is less straightforward and is not particularly computationally efficient.
◮ In most scenarios, the proposed method is more statistically efficient than the binary regression technique.
◮ Both methods are susceptible to misspecification in both the estimation of $F_{\bar D}$ and the form of the ROC model.

Effects of Misspecification
What happens when $Y_{\bar D} = 0.5 Z_2^2 + N(0, (Z_2 + 0.5)^2)$ but we still assume $Y_{\bar D} = 0.5 Z_2 + N(0, 1)$? This will impact
1. estimates of placement values
2. form of the induced ROC curve (used in the likelihood calculation)

Effects of Misspecification
◮ $\alpha_0 = 1$, $\alpha_1 = 1$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
[Results not recovered]

Effects of Misspecification
◮ $\alpha_0 = 1.5$, $\alpha_1 = 0.9$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
[Results not recovered]

Simulations: Pepe & Cai
◮ $[a, b] = [0.01, 0.20]$
[Results not recovered]

Simulations: Pepe & Cai vs. Alonzo & Pepe
◮ $\alpha_0 = 1$, $\alpha_1 = 1$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
◮ $[a, b] = [0.01, 0.50]$
[Results not recovered]

Simulations: Pepe & Cai vs. Alonzo & Pepe
◮ $\alpha_0 = 1.5$, $\alpha_1 = 0.9$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
◮ $[a, b] = [0.01, 0.99]$
[Results not recovered]

Simulations: Pepe & Cai vs. Alonzo & Pepe
◮ $\alpha_0 = 1.5$, $\alpha_1 = 0.9$, $\beta_1 = 0.5$, $\beta_2 = 0.7$
◮ $[a, b] = [0.01, 0.50]$
[Results not recovered]