Rater agreement - ordinal ratings
Karl Bang Christensen
Dept. of Biostatistics, Univ. of Copenhagen
NORDSTAT, 2012
http://biostat.ku.dk/~kach/

Methods for analyzing rater agreement are well-established when ratings are dichotomous or if they can be assumed to be normally distributed.

raters $r = 1, \dots, R$
subjects $s = 1, \dots, S$
ratings $X_{rs} \in \{0, 1, \dots, K\}$

Two raters (R=2)

[Table: agreement cells (A) on the diagonal of the cross-table of $X_1$ and $X_2$]

$\Pr(a)$: observed proportion of agreement

$\kappa$ coefficient
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \qquad (1)$$
where
$$\Pr(e) = \sum_k \Pr(X_1 = k)\,\Pr(X_2 = k)$$
is the expected proportion of agreement under independence.

Cohen. Educational and Psychological Measurement, 1960, 20:37-46.

Two raters (R=2)

The $\kappa$ coefficient (1) is widely used, but
(i) the value depends on the margins and thus on the sample
(ii) if ratings are ordinal this is not taken into account
(iii) there is no rationale for saying that, e.g., $\kappa > 0.7$ is good.

The $\kappa$ coefficient (1) is a marginal (population average) measure.

Cohen. Educational and Psychological Measurement, 1960, 20:37-46.

(i) value depends on margins

Three tables with agreement about 70/100 of the subjects:

          X2                 X2                 X2
          +    -             +    -             +    -
X1   +   20   20    X1   +  10   20    X1   +   5   20
     -   10   50         -  10   60         -  10   65

$\kappa = 0.35$          $\kappa = 0.21$          $\kappa = 0.08$

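The three values can be checked directly from formula (1). The following is a minimal sketch in Python; the function name `cohen_kappa` and the layout of the tables as nested lists (rows for $X_1$, columns for $X_2$) are choices made here for illustration and are not part of the talk.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater 1, columns: rater 2)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed proportion of agreement Pr(a): the diagonal cells.
    p_a = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Expected proportion of agreement under independence Pr(e).
    p_e = sum(row[i] * col[i] for i in range(k))
    return (p_a - p_e) / (1 - p_e)


# The three 2x2 tables from the slide, each with 70 of 100 subjects on the diagonal.
tables = [
    [[20, 20], [10, 50]],
    [[10, 20], [10, 60]],
    [[5, 20], [10, 65]],
]
kappas = [cohen_kappa(t) for t in tables]
for t, kap in zip(tables, kappas):
    print(t, round(kap, 2))
```

The observed agreement is 0.70 in all three tables; only the margins, and hence $\Pr(e)$, change.
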
(ii) if ratings are ordinal this is not taken into account

Weighted $\kappa$ coefficient

A       0.75A   0.50A   0.25A
0.75A   A       0.75A   0.50A
0.50A   0.75A   A       0.75A
0.25A   0.50A   0.75A   A

Arbitrary weights (two standards implemented in SAS)

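A weighted $\kappa$ replaces the indicator of exact agreement in (1) by a matrix of agreement weights. The sketch below assumes the usual form, weighted observed agreement against weighted expected agreement under independence. The function names are hypothetical, and the linear weights $1 - |i - j|/(k - 1)$ are one common choice (assumed here to be of the Cicchetti-Allison type), not necessarily the weights shown on the slide. The data are the 150-subject table from the marginal homogeneity example in this talk.

```python
def linear_weights(k):
    """Agreement weights w[i][j] = 1 - |i - j| / (k - 1) for k ordered categories."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def weighted_kappa(table, weights):
    """Weighted kappa for a square table of counts and a matrix of agreement weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Weighted observed agreement and weighted expected agreement under independence.
    p_a = sum(weights[i][j] * table[i][j] / n for i in range(k) for j in range(k))
    p_e = sum(weights[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return (p_a - p_e) / (1 - p_e)


# 150 subjects, two raters, three categories (table from the example in this talk).
table = [[9, 10, 1], [22, 59, 14], [3, 25, 7]]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

kappa_unweighted = weighted_kappa(table, identity)        # identity weights: formula (1)
kappa_linear = weighted_kappa(table, linear_weights(3))   # partial credit for near misses
print(round(kappa_unweighted, 3), round(kappa_linear, 3))
```

Passing any other weight matrix, such as the one on the slide with A = 1, changes the value, which is the sense in which the weights are arbitrary.
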
Marginal homogeneity

Beyond agreement we would want
$$\Pr(X_1 = k) = \Pr(X_2 = k) \quad \text{for all } k = 0, 1, \dots, K.$$
Bowker's test of symmetry tests this hypothesis (symmetry of the joint table implies marginal homogeneity). For $K = 1$ this is McNemar's test, which examines whether the conditional probability
$$\frac{\Pr(X_1 = 1, X_2 = 0)}{\Pr(X_1 = 1, X_2 = 0) + \Pr(X_1 = 0, X_2 = 1)}$$
equals 1/2.

McNemar. Psychometrika 1947, 12:153-157.

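Continuous data: regression model

$$X_{rs} = \delta_r + \gamma_s + \epsilon_{rs}, \qquad \epsilon_{rs} \sim N(0, \omega^2)$$

Limits of agreement / Bland-Altman plot
$$X_{1s} - X_{2s} = \delta_1 - \delta_2 + (\epsilon_{1s} - \epsilon_{2s}), \qquad \epsilon_{1s} - \epsilon_{2s} \sim N(0, 2\omega^2)$$
95% reference interval for the difference: $\delta_1 - \delta_2 \pm 1.96\sqrt{2}\,\omega$.
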
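The limits of agreement can be illustrated on simulated data. This is a sketch only: the parameter values, the sample size, and the normal distribution used for the subject effects $\gamma_s$ are hypothetical choices made for the illustration, not values from the talk.

```python
import random
import statistics

random.seed(1)

# Hypothetical parameter values, chosen only for illustration.
delta = (0.5, 0.0)   # rater effects delta_1, delta_2
omega = 1.0          # residual standard deviation
n_subjects = 200

diffs = []
for _ in range(n_subjects):
    gamma = random.gauss(0.0, 2.0)                    # subject effect gamma_s
    x1 = delta[0] + gamma + random.gauss(0.0, omega)  # rating by rater 1
    x2 = delta[1] + gamma + random.gauss(0.0, omega)  # rating by rater 2
    diffs.append(x1 - x2)                             # gamma_s cancels in the difference

mean_diff = statistics.mean(diffs)   # estimates delta_1 - delta_2
sd_diff = statistics.stdev(diffs)    # estimates sqrt(2) * omega
lower = mean_diff - 1.96 * sd_diff
upper = mean_diff + 1.96 * sd_diff
print(f"limits of agreement: [{lower:.2f}, {upper:.2f}]")
```

Because the subject effect cancels in the difference, the interval does not depend on how the subjects are distributed, only on the rater effects and the residual variance.
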
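Ordinal data: regression models / IRT

divide-by-total models
$$\Pr(X_{rs} = x \mid \theta) = \frac{\exp\left(x\theta_s - \sum_{k=1}^{x} \beta_{rk}\right)}{\sum_{l=0}^{K} \exp\left(l\theta_s - \sum_{k=1}^{l} \beta_{rk}\right)}$$
$$\Pr(X_{rs} = x \mid \theta) = \frac{\exp\left(\alpha_r\left(x\theta_s - \sum_{k=1}^{x} \beta_{rk}\right)\right)}{\sum_{l=0}^{K} \exp\left(\alpha_r\left(l\theta_s - \sum_{k=1}^{l} \beta_{rk}\right)\right)}$$
($K = 1$: logistic regression)

threshold models
$$\Pr(X_{rs} = x \mid \theta) = \Phi(..) - \Phi(..)$$
$$\Pr(X_{rs} = x \mid \theta) = \operatorname{expit}(..) - \operatorname{expit}(..)$$

Thissen, Steinberg. Psychometrika, 1986, 51:567-577.
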
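The divide-by-total probabilities are straightforward to evaluate numerically. The sketch below is one possible implementation; the function names `category_probs` and `expit` and the numerical values are hypothetical. It also checks the remark that $K = 1$ gives logistic regression.

```python
import math


def category_probs(theta, beta, alpha=1.0):
    """Divide-by-total probabilities Pr(X = x | theta) for x = 0, ..., K.

    beta is the list (beta_1, ..., beta_K); alpha = 1 gives the first model.
    """
    # Exponent for category x: alpha * (x * theta - sum_{k <= x} beta_k); zero for x = 0.
    exponents = [0.0]
    cumulative = 0.0
    for x, b in enumerate(beta, start=1):
        cumulative += b
        exponents.append(alpha * (x * theta - cumulative))
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# K = 1: Pr(X = 1 | theta) = expit(theta - beta_1), i.e. logistic regression.
p = category_probs(theta=0.3, beta=[-0.5])
print(p[1], expit(0.3 - (-0.5)))
```
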
[Figure: the latent scale $\theta$; $\theta = \theta_s$ is the latent location of subject $s$, and $(\beta_{rk})_{k=1,\dots,K}$ are the rater parameters.]

Marginal homogeneity

Rater parameters $\beta_r = (\beta_{r1}, \dots, \beta_{rK})$. Test
$$H_0: \beta_r = \beta \quad \text{for all } r = 1, \dots, R$$
using a likelihood ratio test based on (2) or (3).

Example: 150 subjects, two raters

           X1 = 0   X1 = 1   X1 = 2
X2 = 0        9       10        1
X2 = 1       22       59       14
X2 = 2        3       25        7

Bowker's test: $S = 8.6$, $df = 3$, $p = 0.0351$.
LRT based on (2): $-2 \log Q = 12.2$, $df = 2$, $p = 0.0022$.

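Bowker's statistic for the table above can be reproduced with a few lines of Python. The sketch assumes the standard form of the statistic, summing $(n_{ij} - n_{ji})^2 / (n_{ij} + n_{ji})$ over pairs $i < j$, and computes the chi-square tail probability from a closed form for integer degrees of freedom. The function names are hypothetical, and dropping pairs of empty cells from the degrees of freedom is a convention chosen here.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(half ** (j + 0.5) / math.gamma(j + 1.5) for j in range((df - 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def bowker(table):
    """Bowker's test of symmetry: statistic, degrees of freedom and p-value."""
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:  # a pair of empty cells contributes nothing
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df, chi2_sf(stat, df)


# Rows: X2 = 0, 1, 2; columns: X1 = 0, 1, 2 (the statistic is the same if transposed).
table = [[9, 10, 1], [22, 59, 14], [3, 25, 7]]
stat, df, p = bowker(table)
print(round(stat, 1), df, round(p, 4))

# p-value for the reported likelihood ratio statistic 12.2 on 2 degrees of freedom.
p_lrt = chi2_sf(12.2, 2)
print(round(p_lrt, 4))
```

The likelihood ratio statistic itself requires fitting the model and is not recomputed here; only its tail probability is checked.
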
Quantify agreement

Randomly chosen person $s$ with location $\theta_s = \theta$.
Compute a reference interval for $|X_{1s} - X_{2s}|$, $\Pr(X_{1s} = X_{2s})$ or $\Pr(|X_{1s} - X_{2s}| > 1)$.
If $\theta \sim N(0, \omega^2)$: computations for a 'typical' person, $\theta = 0$.
Compare to the population distribution of $\theta$.

Example $X \in \{0, 1, 2\}$

Marginal homogeneity ($H_0$: $\beta_{1k} = \beta_{2k}$ for $k = 1, 2$) accepted. Common estimate of the rater parameters: $(\beta_1, \beta_2) = (-0.82, -0.75)$.

Table
$$\begin{pmatrix}
\Pr(X_1 = 0, X_2 = 0 \mid \theta) & \Pr(X_1 = 0, X_2 = 1 \mid \theta) & \Pr(X_1 = 0, X_2 = 2 \mid \theta) \\
\Pr(X_1 = 1, X_2 = 0 \mid \theta) & \Pr(X_1 = 1, X_2 = 1 \mid \theta) & \Pr(X_1 = 1, X_2 = 2 \mid \theta) \\
\Pr(X_1 = 2, X_2 = 0 \mid \theta) & \Pr(X_1 = 2, X_2 = 1 \mid \theta) & \Pr(X_1 = 2, X_2 = 2 \mid \theta)
\end{pmatrix}$$
(normal latent distribution: typical patient $\sim \theta = 0$)

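The table for the typical patient can be evaluated from the common estimate. The sketch below uses the first divide-by-total model and assumes that the two ratings are independent given $\theta$, which the slides do not state explicitly. The function names are hypothetical.

```python
import math


def pcm_probs(theta, beta):
    """Divide-by-total probabilities Pr(X = x | theta) for x = 0, ..., K (alpha = 1)."""
    exponents, cumulative = [0.0], 0.0
    for x, b in enumerate(beta, start=1):
        cumulative += b
        exponents.append(x * theta - cumulative)
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def joint_table(theta, beta_1, beta_2):
    """Pr(X1 = a, X2 = b | theta), ASSUMING the raters are independent given theta."""
    p1, p2 = pcm_probs(theta, beta_1), pcm_probs(theta, beta_2)
    return [[pa * pb for pb in p2] for pa in p1]


beta_common = [-0.82, -0.75]   # common estimate from the slide
table = joint_table(0.0, beta_common, beta_common)

agreement = sum(table[x][x] for x in range(3))   # Pr(X1 = X2 | theta = 0)
far_apart = table[0][2] + table[2][0]            # Pr(|X1 - X2| > 1 | theta = 0)

for row in table:
    print([round(p, 3) for p in row])
print(round(agreement, 3), round(far_apart, 3))
```

Evaluating `joint_table` over a grid of $\theta$ values gives the agreement probability as a function of the latent location.
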
Example $X \in \{0, 1, 2\}$

Agreement $\Pr((X_1, X_2) \in \{(0,0), (1,1), (2,2)\} \mid \theta)$

Examples of clinical applications

Tönnis grade 0,1,2,3: rating of x-rays in hip surgery population.
Modified Ashworth Scale 0,1,2,3,4,5: spasticity as a complication in spinal cord lesion patients (hospital sample). Sparse tables.
Assessment of exercise-induced laryngeal obstruction 0,1,2,3: subsampling of best and worst cases.

Issues

Conditional inference if person locations cannot be assumed to be normally distributed.
Reduced rank parametrization if tables are sparse.
Interpretation on original scale.

Ordinal data: regression models / IRT (Marginal or Conditional inference)

(C: conditional inference, M: marginal inference)

divide-by-total models
$$\Pr(X_{rs} = x \mid \theta) = \frac{\exp\left(x\theta_s - \sum_{k=1}^{x} \beta_{rk}\right)}{\sum_{l=0}^{K} \exp\left(l\theta_s - \sum_{k=1}^{l} \beta_{rk}\right)} \qquad \text{(C, M)}$$
$$\Pr(X_{rs} = x \mid \theta) = \frac{\exp\left(\alpha_r\left(x\theta_s - \sum_{k=1}^{x} \beta_{rk}\right)\right)}{\sum_{l=0}^{K} \exp\left(\alpha_r\left(l\theta_s - \sum_{k=1}^{l} \beta_{rk}\right)\right)} \qquad \text{(M)}$$
($K = 1$: logistic regression)

threshold models
$$\Pr(X_{rs} = x \mid \theta) = \Phi(..) - \Phi(..) \qquad \text{(M)}$$
$$\Pr(X_{rs} = x \mid \theta) = \operatorname{expit}(..) - \operatorname{expit}(..) \qquad \text{(M)}$$

Thissen, Steinberg. Psychometrika, 1986, 51:567-577.

Let $X_s = (X_{1s}, \dots, X_{Rs})$ and $x_s = (x_{1s}, \dots, x_{Rs})$.

Marginal inference
$$l_M(\beta) = \sum_s \log \int \Pr(X_s = x_s \mid \theta_s)\, \varphi(\theta_s)\, d\theta_s \qquad (2)$$
similar to the model yielding limits of agreement.

Conditional inference
$$l_C(\beta) = \sum_s \log \Pr(X_s = x_s \mid X_{1s} + \dots + X_{Rs} = x_{1s} + \dots + x_{Rs}) \qquad (3)$$
similar to the McNemar test.

Bock, Aitkin. Psychometrika 1981, 46:443-459.
Andersen. Journal of the Royal Statistical Society B, 1972, 34:42-54.

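The marginal log-likelihood (2) can be evaluated numerically for two raters. The sketch below assumes the first divide-by-total model, independence of the ratings given $\theta$, a $N(0, \omega^2)$ latent distribution, and a simple trapezoidal rule in place of the quadrature a real implementation would likely use. The function names and the parameter values at which the likelihood is evaluated are hypothetical; they are not fitted estimates.

```python
import math


def pcm_probs(theta, beta):
    """Divide-by-total probabilities Pr(X = x | theta) for x = 0, ..., K (alpha = 1)."""
    exponents, cumulative = [0.0], 0.0
    for x, b in enumerate(beta, start=1):
        cumulative += b
        exponents.append(x * theta - cumulative)
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def normal_pdf(theta, omega):
    return math.exp(-0.5 * (theta / omega) ** 2) / (omega * math.sqrt(2 * math.pi))


def marginal_cell_probs(beta_row, beta_col, omega, n_points=801, width=8.0):
    """Integrate Pr(row = a | theta) Pr(col = b | theta) phi(theta) over theta."""
    k = len(beta_row) + 1
    lo, hi = -width * omega, width * omega
    step = (hi - lo) / (n_points - 1)
    cells = [[0.0] * k for _ in range(k)]
    for i in range(n_points):
        theta = lo + i * step
        weight = step * normal_pdf(theta, omega)
        if i in (0, n_points - 1):
            weight *= 0.5  # trapezoidal rule: half weight at the end points
        p_row, p_col = pcm_probs(theta, beta_row), pcm_probs(theta, beta_col)
        for a in range(k):
            for b in range(k):
                cells[a][b] += weight * p_row[a] * p_col[b]
    return cells


def marginal_loglik(table, beta_row, beta_col, omega):
    """l_M: subjects with the same pair of ratings share the same marginal probability."""
    cells = marginal_cell_probs(beta_row, beta_col, omega)
    k = len(table)
    return sum(table[a][b] * math.log(cells[a][b]) for a in range(k) for b in range(k))


# 150-subject example table (rows: X2, columns: X1); parameter values are hypothetical.
table = [[9, 10, 1], [22, 59, 14], [3, 25, 7]]
beta = [-0.82, -0.75]
cells = marginal_cell_probs(beta, beta, omega=1.0)
loglik = marginal_loglik(table, beta, beta, omega=1.0)
print(round(loglik, 2))
```

Maximizing this function once with separate and once with common rater parameters would give the two log-likelihoods that enter the likelihood ratio test.
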
Reduced rank parametrization

Interpreting and testing differences in rater parameters
$\beta_r = (\beta_{rx})_{x=1,\dots,K}$ and $\beta_{r'} = (\beta_{r'x})_{x=1,\dots,K}$
can be difficult for $K = 4, 5, \dots$

Reparametrization using 'location' parameter $\mu_r$ and 'spread' parameter $\sigma_r$:
$$\beta_{rx} = \mu_r + (2x - m - 1)\,\sigma_r.$$

Andrich. Psychometrika, 1982, 47:105-113.

Reduced rank parametrization

Reparametrize to
$$\left(\frac{\beta_1 + \beta_2}{2},\ \frac{\beta_2 - \beta_1}{2}\right).$$

Hypotheses:
Raters differ only wrt. location
Raters differ only wrt. spread
Raters do not differ

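The location and spread parametrization can be sketched as a pair of conversion functions. The sketch assumes $m = K$ (the number of rater parameters per rater), which is consistent with the $K = 2$ case above but is not stated on the slides. The function names are hypothetical, and the numerical input is the common estimate $(-0.82, -0.75)$ from the example.

```python
def thresholds_from_location_spread(mu, sigma, K):
    """beta_x = mu + (2x - m - 1) * sigma for x = 1, ..., K, ASSUMING m = K."""
    m = K
    return [mu + (2 * x - m - 1) * sigma for x in range(1, K + 1)]


def location_spread_from_thresholds(beta_1, beta_2):
    """K = 2 case: location (beta_1 + beta_2) / 2 and spread (beta_2 - beta_1) / 2."""
    return (beta_1 + beta_2) / 2, (beta_2 - beta_1) / 2


mu, sigma = location_spread_from_thresholds(-0.82, -0.75)
print(mu, sigma)

# Round trip: the thresholds are recovered from (mu, sigma).
recovered = thresholds_from_location_spread(mu, sigma, K=2)
print(recovered)
```

For $K > 2$ the parametrization has two parameters per rater instead of $K$, which is what makes it useful for sparse tables.
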
Interpretation on original scale

Probability of agreement across values of $\theta$ can be compared to
modeled distribution: $\varphi(\theta)$.
empirical distribution: $\hat\theta_1, \dots, \hat\theta_S$ found by maximizing $L(\theta) = \Pr_{\hat\beta}(X_s = x_s \mid \theta)$.
values $E(X_{rs} \mid \theta_s = \theta)$, same for all $r$ under marginal homogeneity.

Example $X \in \{0, 1, 2\}$

Agreement $\Pr((X_1, X_2) \in \{(0,0), (1,1), (2,2)\} \mid \theta)$

[Figures: the agreement probability plotted against $\theta$, against the estimates $\hat\theta$, and against the expected score, with axis marks at $E(X \mid \hat\theta) = 0.5, 1.0, 1.5$.]