A Model-Data-Fit-Informed Approach to Score Resolution in Performance Assessments
Stefanie A. Wind, University of Alabama
A. Adrienne Walker, Georgia Department of Education
Outline
• Background
• Purpose
• Methods
• Results
• Implications
Background
Score Resolution in Performance Assessment
Usually based on rater agreement:
Raters disagree → Additional ratings collected → Resolved ratings are some combination of original & new ratings
Potential Issues with Agreement-Based Score Resolution
1. Discrepancies in rater judgment may not always indicate inaccurate ratings
   • Two raters could exhibit different levels of severity, and both ratings could be plausible
   • Both ratings could accurately reflect student achievement over domains
   • Statistical adjustments for rater severity (e.g., MFRM) could mitigate severity differences
2. Rater agreement may not always indicate accurate ratings
   • Two raters could agree on inaccurate representations of student achievement
   • Unlikely in high-stakes assessments where raters are highly trained, but still possible
Score Resolution & Person Fit
Agreement-based score resolution has a goal similar to that of individual person fit analysis in modern measurement models: to identify individual students for whom achievement estimates may not be a reasonable representation of their response pattern
Previous Research on Agreement-Based Score Resolution & Person Fit
• Both methods identify similar students whose performances warrant additional investigation (Myford & Wolfe, 2002)
• Applying agreement-based score resolution improves psychometric defensibility from both a rater agreement & person fit perspective…
   • For most students
   • But not all students! (Wind & Walker, 2019)
Brief Illustration: Agreement-Based Resolution Does Not Always Improve Person Fit
Before Resolution:
• Raters disagreed on all domains except D1
• Flagged for resolution
• Overall shape of observed PRF generally aligned with model-expected PRF
After Resolution:
• Raters agreed on all domains
• Improved agreement
• Overall shape of observed PRF deviates from model-expected PRF
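The comparison above contrasts a student's observed ratings with the ratings the model expects for that student on each domain (the person response function, PRF). The sketch below is not the authors' code; it illustrates, with entirely hypothetical values (theta, severity, thresholds, observed ratings), how model-expected ratings can be computed under a partial-credit many-facet Rasch model like the one used in this study.

```python
import numpy as np

def pcm_category_probs(theta, severity, domain_diff, thresholds):
    # Partial-credit many-facet Rasch model: log-odds of category k vs. k-1
    # is theta - severity - domain_diff - threshold_k (see the model in Methods).
    steps = theta - severity - domain_diff - thresholds
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # cumulative numerators, categories 0..K
    p = np.exp(cum - cum.max())                       # subtract max for numerical stability
    return p / p.sum()

def expected_rating(theta, severity, domain_diff, thresholds):
    # Model-expected rating = sum over categories of k * P(category k)
    p = pcm_category_probs(theta, severity, domain_diff, thresholds)
    return float(np.arange(len(p)) @ p)

# Hypothetical values for one student rated on four domains (0-4 scale)
theta, severity = 0.50, 0.10
domain_diffs = np.array([0.00, 2.00, -2.00, 1.00])
thresholds = np.array([-1.5, -0.5, 0.5, 1.5])          # assumed rating-scale thresholds
observed = np.array([3, 1, 4, 2])                      # hypothetical observed ratings, D1-D4

for j, obs in enumerate(observed, start=1):
    exp = expected_rating(theta, severity, domain_diffs[j - 1], thresholds)
    print(f"D{j}: observed={obs}, expected={exp:.2f}, residual={obs - exp:+.2f}")
```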
Purpose
Purpose
To explore a model-data-fit-informed approach to score resolution in the context of mixed-format educational assessments
Research Questions
1. What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
2. To what extent do the model-data fit-informed approach and a rater-agreement approach to score resolution result in similar student achievement estimates and person fit statistics?
Methods
Simulation Study
• Yes, it's kind of weird to use a simulation to look at rater judgments, and especially resolved rater judgments!
• Simulated data are useful because we cannot collect new resolved ratings for performances identified for resolution in a secondary analysis of real data
• We designed the simulation based on results from analyses of large-scale performance assessments in which score resolution procedures are applied (Wind & Walker, 2019)
Design of Simulation Study
In all conditions:
• 5,000 students; student achievement: θ ~ N(0, 1)
• 1 writing task scored on 4 domains with a 5-category rating scale (0, 1, 2, 3, 4)
• Domain difficulty: δ1 = 0.00, δ2 = 2.00, δ3 = -2.00, δ4 = 1.00
• 30 MC items (all students respond to all items); item difficulty: β ~ N(0, 0.5)
• 50 total raters; rater severity: λ ~ N(0, 0.5)
• 2 randomly selected raters scored each student's writing task
Simulation Study, continued
Manipulated factors:
• % of raters exhibiting severity: 0%, 20%, 40%; severity-effect raters: λ ~ N(1.0, 0.5)
• % of students exhibiting misfit: 0%, 5%, 10%; ½ of misfitting students exhibit each type of misfit
• Type of student misfit:
   • Differential achievement over domains: disordered domain difficulty from the order in the complete sample
   • Student*rater interaction: disordered domain difficulty for one rater & the original order kept for the second rater
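To give a rough sense of how one simulation condition could be generated, here is a minimal sketch. It is not the authors' generating code: the rating-scale thresholds, the random seed, the 20%/5% condition shown, and the way disordered domain orders are drawn are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

N_STUDENTS, N_MC, N_RATERS, N_DOMAINS = 5000, 30, 50, 4
DOMAIN_DIFF = np.array([0.00, 2.00, -2.00, 1.00])
THRESHOLDS = np.array([-1.5, -0.5, 0.5, 1.5])            # assumed thresholds for the 0-4 scale
PCT_SEVERE_RATERS, PCT_MISFIT_STUDENTS = 0.20, 0.05      # one example condition

theta = rng.normal(0, 1, N_STUDENTS)                     # student achievement
beta = rng.normal(0, 0.5, N_MC)                          # MC item difficulty
severity = rng.normal(0, 0.5, N_RATERS)                  # rater severity
severe = rng.choice(N_RATERS, int(PCT_SEVERE_RATERS * N_RATERS), replace=False)
severity[severe] = rng.normal(1.0, 0.5, severe.size)     # severity-effect raters

# Dichotomous MC responses under a Rasch model: P(correct) = logistic(theta - beta)
p_mc = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
mc = (rng.random((N_STUDENTS, N_MC)) < p_mc).astype(int)

def sim_rating(th, lam, delta, tau):
    # Draw one 0-4 rating from partial-credit MFR category probabilities
    cum = np.concatenate(([0.0], np.cumsum(th - lam - delta - tau)))
    p = np.exp(cum - cum.max()); p /= p.sum()
    return int(rng.choice(len(p), p=p))

misfit = rng.choice(N_STUDENTS, int(PCT_MISFIT_STUDENTS * N_STUDENTS), replace=False)
half = misfit.size // 2
diff_achieve, interaction = set(misfit[:half]), set(misfit[half:])  # half of misfitting students each

ratings = np.zeros((N_STUDENTS, 2, N_DOMAINS), dtype=int)
for n in range(N_STUDENTS):
    two_raters = rng.choice(N_RATERS, 2, replace=False)  # 2 randomly selected raters per student
    disordered = rng.permutation(DOMAIN_DIFF)            # a disordered domain-difficulty order
    for r, rater in enumerate(two_raters):
        if n in diff_achieve:
            deltas = disordered                           # both raters: differential achievement
        elif n in interaction and r == 0:
            deltas = disordered                           # one rater only: student*rater interaction
        else:
            deltas = DOMAIN_DIFF
        for j in range(N_DOMAINS):
            ratings[n, r, j] = sim_rating(theta[n], severity[rater], deltas[j], THRESHOLDS)
```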
Null Condition
• Conditions with 0% simulated rater severity and 0% simulated person misfit informed our evaluation of rater severity & person fit
• Infit & Outfit MSE statistics
• Bootstrap approach to identify critical values
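The following is one way the infit and outfit mean square statistics and bootstrap-based critical values might be computed; a minimal sketch under the same model assumptions, not the authors' implementation. The 95th-percentile cutoff is illustrative.

```python
import numpy as np

def category_probs(theta, severity, domain_diff, thresholds):
    cum = np.concatenate(([0.0], np.cumsum(theta - severity - domain_diff - thresholds)))
    p = np.exp(cum - cum.max())
    return p / p.sum()

def person_fit(obs, theta, severities, domain_diffs, thresholds):
    """Infit/outfit mean squares for one student's ratings.

    obs, severities, domain_diffs are aligned 1-D sequences with one entry
    per observed rating (each rater-by-domain combination)."""
    ks = np.arange(len(thresholds) + 1)
    E = np.empty(len(obs)); W = np.empty(len(obs))
    for i, (lam, delta) in enumerate(zip(severities, domain_diffs)):
        p = category_probs(theta, lam, delta, thresholds)
        E[i] = ks @ p                        # expected rating
        W[i] = ((ks - E[i]) ** 2) @ p        # model variance of the rating
    sq_resid = (np.asarray(obs, dtype=float) - E) ** 2
    outfit = np.mean(sq_resid / W)           # unweighted mean square (outlier-sensitive)
    infit = sq_resid.sum() / W.sum()         # information-weighted mean square
    return infit, outfit

def bootstrap_critical_value(null_fit_values, q=0.95):
    # Upper critical value from fit statistics computed on data simulated to fit
    # the model (the null condition): values above this quantile are flagged as misfit.
    return float(np.quantile(null_fit_values, q))
```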
Analysis Procedure
(1) Simulate MC responses and CR ratings using the specified conditions
(2) Analyze simulated data using the PC-MFR model: ln[ P_nijlk(x = k) / P_nijlk(x = k − 1) ] = θ_n − λ_i − δ_j − η_l − τ_lk
(3) Evaluate person fit and rater fit
   (3A) Identify misfitting students
   (3B) Identify raters with moderate severity and good fit
   (3C) Identify students with acceptable fit → No resolution needed; ratings from Step 1 are the final ratings
(4) Simulate new ratings for each student in (3A) from one randomly selected rater from (3B), using the student theta parameter from (1) and the "true" MC item difficulty, domain difficulty, and rater severity location values
   • Students who misfit due to differential achievement: maintain disordered domain difficulty → expected person misfit after resolution
   • Students who misfit due to the rater*domain interaction: use expected domain difficulty values from Step 2 → expect improved person fit
(5) From the original ratings (1), identify the original rater whose ratings are closest to the model-expected ratings
(6) Use the ratings identified in (5) and the new ratings from (4) as the final resolved ratings
(7) Analyze the final ratings for all students using the PC-MFR model and evaluate person fit
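A compact sketch of how steps (3) through (6) might be wired together. It is not the authors' code: the cutoff values, the severity band, and the helper callables (simulate_new_rating, the expected-ratings lookup) are placeholders for illustration.

```python
import numpy as np

INFIT_CRIT = OUTFIT_CRIT = 1.3        # illustrative cutoffs; the study derived critical values via bootstrap
SEVERITY_BAND = (-0.5, 0.5)           # hypothetical "moderate severity" band

def fit_informed_resolution(person_fit, rater_fit, severity, original_ratings,
                            expected_ratings, simulate_new_rating, seed=0):
    """Steps (3)-(6): flag misfitting students, choose a well-fitting rater of
    moderate severity to provide a new rating, keep the original rater whose
    ratings are closest to the model-expected ratings, and pair the two as the
    resolved ratings. All inputs are dicts/arrays prepared from Step (2)."""
    rng = np.random.default_rng(seed)
    # (3A) students whose infit or outfit exceeds the critical value
    flagged = [n for n, (infit, outfit) in person_fit.items()
               if infit > INFIT_CRIT or outfit > OUTFIT_CRIT]
    # (3B) raters with acceptable fit and moderate severity
    eligible = [i for i, fit in rater_fit.items()
                if fit <= INFIT_CRIT and SEVERITY_BAND[0] <= severity[i] <= SEVERITY_BAND[1]]
    # (3C) students with acceptable fit keep their Step (1) ratings
    resolved = {n: [np.asarray(r) for r in pair] for n, pair in original_ratings.items()}
    for n in flagged:
        new_rater = int(rng.choice(eligible))
        new_rating = np.asarray(simulate_new_rating(n, new_rater))    # step (4)
        # (5) original rater whose ratings sit closest to the model-expected ratings
        dists = [np.abs(np.asarray(r) - expected_ratings[n]).sum() for r in original_ratings[n]]
        keep = np.asarray(original_ratings[n][int(np.argmin(dists))])
        resolved[n] = [keep, new_rating]                               # step (6) resolved pair
    return resolved
```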
Rater Agreement Analysis
• We also examined rater agreement in our simulated ratings
• Identified performances with discrepancies ≥ 2 raw score points
• Used the same approach to identify a third rater and generate additional ratings
• Compared the 3rd rater's ratings to the original ratings & kept the ratings from the closest 2 raters
• Analyzed resolved ratings using the PC-MFR model
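For comparison, a sketch of the agreement-based rule described here. This is not the authors' code, and whether the ≥ 2-point discrepancy is computed on total raw scores or per domain is an assumption.

```python
import numpy as np

def agreement_resolution(rating_a, rating_b, get_third_rating, threshold=2):
    """If the two original raters' total raw scores differ by `threshold` or
    more, obtain a third rating and keep the two closest ratings."""
    a, b = np.asarray(rating_a), np.asarray(rating_b)
    if abs(int(a.sum()) - int(b.sum())) < threshold:
        return [a, b]                                    # raters agree: no resolution needed
    c = np.asarray(get_third_rating())                   # additional rating from a third rater
    pairs = [(a, b), (a, c), (b, c)]
    return list(min(pairs, key=lambda pr: abs(int(pr[0].sum()) - int(pr[1].sum()))))
```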
Results
Unresolved Ratings
• Person fit statistics reflected the simulation design
• 80-96% of students simulated to exhibit misfit were classified as misfitting
• ≤ 1% of students simulated to fit were classified as misfitting
Resolved Ratings
• Lower average MSE fit statistics for all students in all conditions
• Some differences for student fit subgroups:
   • Fitting students: fit remained acceptable
   • Rater*Performance Interaction students: more acceptable average person fit statistics; 1%-2% classified as "misfitting" following resolution
   • Differential Achievement students: fit statistics still higher (noisier; more misfit) than the overall sample; 79-94% classified as "misfitting"
Comparison to Rater Agreement
• Similar overall average person fit statistics following resolution
• Some differences for student fit subgroups:
   • Fitting students: fit remained acceptable
   • Rater*Performance Interaction students: average person fit statistics did not improve much following resolution; 70%-82% classified as "misfitting" following resolution
   • Differential Achievement students: fit statistics still higher (noisier; more misfit) than the overall sample; 75-93% classified as "misfitting"
• Lack of improvement in person fit = key difference between the model fit-informed approach & the rater agreement-based approach
Implications
Contribution
• Previous research on score resolution has focused almost exclusively on rater agreement methods
• We considered the implications of using indicators of model-data fit to identify performances for resolution & to identify raters to provide the new scores
RQ1: What is the impact of using model-data fit to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
• The fit-informed approach resulted in improved overall person fit statistics for students who exhibited misfit due to the rater*performance interaction
• Effective for improving the quality of achievement estimates in performance assessments
What does this mean?
• The fit-informed approach can help researchers and practitioners identify students with unexpected ratings both before and following score resolution
• If fit does not improve: additional steps may be needed to meaningfully evaluate their achievement related to the construct of interest
   • E.g., additional qualifiers for interpreting and using the score should be considered along with the score itself
RQ2: To what extent do the model-data fit-informed approach and a rater-agreement approach to score resolution result in comparable student achievement estimates and person fit statistics?
• Overall improvement in person fit following resolution
• Person fit did not improve for the rater*performance interaction subgroup
   • Profiles were less discrepant
   • Still misfitting from a measurement perspective