A Model-Data-Fit-Informed Approach to Score Resolution in Performance Assessments
Stefanie A. Wind, University of Alabama
A. Adrienne Walker, Georgia Department of Education
Outline
• Background
• Purpose
• Methods
• Results
• Implications
Background
Score Resolution in Performance Assessment
Usually based on rater agreement:
Raters disagree → Additional ratings collected → Resolved ratings are some combination of original & new ratings
Potential Issues with Agreement-Based Score Resolution
1. Discrepancies in rater judgment may not always indicate inaccurate ratings
   • Two raters could exhibit different levels of severity, and both ratings could be plausible
   • Both ratings could accurately reflect student achievement over domains
   • Statistical adjustments for rater severity (e.g., MFRM) could mitigate severity differences
2. Rater agreement may not always indicate accurate ratings
   • Two raters could agree on inaccurate representations of student achievement
   • Unlikely in high-stakes assessments where raters are highly trained, but still possible
Score Resolution & Person Fit
Agreement-based score resolution has a goal similar to that of individual person fit analysis in modern measurement models: to identify individual students for whom achievement estimates may not be a reasonable representation of their response pattern
Previous Research on Agreement-Based Score Resolution & Person Fit
• Both methods identify similar students whose performances warrant additional investigation (Myford & Wolfe, 2002)
• Applying agreement-based score resolution improves psychometric defensibility from both a rater agreement & person fit perspective…
   • For most students
   • But not all students! (Wind & Walker, 2019)
Brief Illustration: Agreement-Based Resolution Does Not Always Improve Person Fit
Before Resolution:
• Raters disagreed on all domains except D1
• Flagged for resolution
• Overall shape of observed PRF generally aligned with model-expected PRF
After Resolution:
• Raters agreed on all domains
• Improved agreement
• Overall shape of observed PRF deviates from model-expected PRF
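The comparison above contrasts a student's observed ratings with the ratings the model expects for that student on each domain (the person response function, PRF). The sketch below is not the authors' code; it illustrates, with entirely hypothetical values (theta, severity, thresholds, observed ratings), how model-expected ratings can be computed under a partial-credit many-facet Rasch model like the one used in this study.

```python
import numpy as np

def pcm_category_probs(theta, severity, domain_diff, thresholds):
    # Partial-credit many-facet Rasch model: log-odds of category k vs. k-1
    # is theta - severity - domain_diff - threshold_k (see the model in Methods).
    steps = theta - severity - domain_diff - thresholds
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # cumulative numerators, categories 0..K
    p = np.exp(cum - cum.max())                       # subtract max for numerical stability
    return p / p.sum()

def expected_rating(theta, severity, domain_diff, thresholds):
    # Model-expected rating = sum over categories of k * P(category k)
    p = pcm_category_probs(theta, severity, domain_diff, thresholds)
    return float(np.arange(len(p)) @ p)

# Hypothetical values for one student rated on four domains (0-4 scale)
theta, severity = 0.50, 0.10
domain_diffs = np.array([0.00, 2.00, -2.00, 1.00])
thresholds = np.array([-1.5, -0.5, 0.5, 1.5])          # assumed rating-scale thresholds
observed = np.array([3, 1, 4, 2])                      # hypothetical observed ratings, D1-D4

for j, obs in enumerate(observed, start=1):
    exp = expected_rating(theta, severity, domain_diffs[j - 1], thresholds)
    print(f"D{j}: observed={obs}, expected={exp:.2f}, residual={obs - exp:+.2f}")
```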
Purpose
Purpose
To explore a model-data-fit-informed approach to score resolution in the context of mixed-format educational assessments
Research Questions
1. What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
2. To what extent do the model-data fit-informed approach and a rater-agreement approach to score resolution result in similar student achievement estimates and person fit statistics?
Methods
Simulation Study
• Yes, it's kind of weird to use a simulation to look at rater judgments, and especially resolved rater judgments!
• Simulated data are useful because we cannot collect new resolved ratings for performances identified for resolution in a secondary analysis of real data
• We designed the simulation based on results from analyses of large-scale performance assessments in which score resolution procedures are applied (Wind & Walker, 2019)
Design of Simulation Study
In all conditions:
• 5,000 students; student achievement: θ ~ N(0, 1)
• 1 writing task scored on 4 domains with a 5-category rating scale (0, 1, 2, 3, 4)
• Domain difficulty: δ1 = 0.00, δ2 = 2.00, δ3 = -2.00, δ4 = 1.00
• 30 MC items (all students respond to all items); item difficulty: β ~ N(0, 0.5)
• 50 total raters; rater severity: λ ~ N(0, 0.5)
• 2 randomly selected raters scored each student's writing task
Simulation Study, continued
Manipulated factors:
• % of raters exhibiting severity: 0%, 20%, 40%; severity-effect raters: λ ~ N(1.0, 0.5)
• % of students exhibiting misfit: 0%, 5%, 10%; ½ of misfitting students exhibit each type of misfit
• Type of student misfit:
   • Differential achievement over domains: disordered domain difficulty from the order in the complete sample
   • Student*rater interaction: disordered domain difficulty for one rater & the original order kept for the second rater
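To give a rough sense of how one simulation condition could be generated, here is a minimal sketch. It is not the authors' generating code: the rating-scale thresholds, the random seed, the 20%/5% condition shown, and the way disordered domain orders are drawn are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

N_STUDENTS, N_MC, N_RATERS, N_DOMAINS = 5000, 30, 50, 4
DOMAIN_DIFF = np.array([0.00, 2.00, -2.00, 1.00])
THRESHOLDS = np.array([-1.5, -0.5, 0.5, 1.5])            # assumed thresholds for the 0-4 scale
PCT_SEVERE_RATERS, PCT_MISFIT_STUDENTS = 0.20, 0.05      # one example condition

theta = rng.normal(0, 1, N_STUDENTS)                     # student achievement
beta = rng.normal(0, 0.5, N_MC)                          # MC item difficulty
severity = rng.normal(0, 0.5, N_RATERS)                  # rater severity
severe = rng.choice(N_RATERS, int(PCT_SEVERE_RATERS * N_RATERS), replace=False)
severity[severe] = rng.normal(1.0, 0.5, severe.size)     # severity-effect raters

# Dichotomous MC responses under a Rasch model: P(correct) = logistic(theta - beta)
p_mc = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
mc = (rng.random((N_STUDENTS, N_MC)) < p_mc).astype(int)

def sim_rating(th, lam, delta, tau):
    # Draw one 0-4 rating from partial-credit MFR category probabilities
    cum = np.concatenate(([0.0], np.cumsum(th - lam - delta - tau)))
    p = np.exp(cum - cum.max()); p /= p.sum()
    return int(rng.choice(len(p), p=p))

misfit = rng.choice(N_STUDENTS, int(PCT_MISFIT_STUDENTS * N_STUDENTS), replace=False)
half = misfit.size // 2
diff_achieve, interaction = set(misfit[:half]), set(misfit[half:])  # half of misfitting students each

ratings = np.zeros((N_STUDENTS, 2, N_DOMAINS), dtype=int)
for n in range(N_STUDENTS):
    two_raters = rng.choice(N_RATERS, 2, replace=False)  # 2 randomly selected raters per student
    disordered = rng.permutation(DOMAIN_DIFF)            # a disordered domain-difficulty order
    for r, rater in enumerate(two_raters):
        if n in diff_achieve:
            deltas = disordered                           # both raters: differential achievement
        elif n in interaction and r == 0:
            deltas = disordered                           # one rater only: student*rater interaction
        else:
            deltas = DOMAIN_DIFF
        for j in range(N_DOMAINS):
            ratings[n, r, j] = sim_rating(theta[n], severity[rater], deltas[j], THRESHOLDS)
```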
Null Condition
• Conditions with 0% simulated rater severity and 0% simulated person misfit informed our evaluation of rater severity & person fit
• Infit & Outfit MSE statistics
• Bootstrap approach to identify critical values
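The following is one way the infit and outfit mean square statistics and bootstrap-based critical values might be computed; a minimal sketch under the same model assumptions, not the authors' implementation. The 95th-percentile cutoff is illustrative.

```python
import numpy as np

def category_probs(theta, severity, domain_diff, thresholds):
    cum = np.concatenate(([0.0], np.cumsum(theta - severity - domain_diff - thresholds)))
    p = np.exp(cum - cum.max())
    return p / p.sum()

def person_fit(obs, theta, severities, domain_diffs, thresholds):
    """Infit/outfit mean squares for one student's ratings.

    obs, severities, domain_diffs are aligned 1-D sequences with one entry
    per observed rating (each rater-by-domain combination)."""
    ks = np.arange(len(thresholds) + 1)
    E = np.empty(len(obs)); W = np.empty(len(obs))
    for i, (lam, delta) in enumerate(zip(severities, domain_diffs)):
        p = category_probs(theta, lam, delta, thresholds)
        E[i] = ks @ p                        # expected rating
        W[i] = ((ks - E[i]) ** 2) @ p        # model variance of the rating
    sq_resid = (np.asarray(obs, dtype=float) - E) ** 2
    outfit = np.mean(sq_resid / W)           # unweighted mean square (outlier-sensitive)
    infit = sq_resid.sum() / W.sum()         # information-weighted mean square
    return infit, outfit

def bootstrap_critical_value(null_fit_values, q=0.95):
    # Upper critical value from fit statistics computed on data simulated to fit
    # the model (the null condition): values above this quantile are flagged as misfit.
    return float(np.quantile(null_fit_values, q))
```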
Analysis Procedure
(1) Simulate MC responses and CR ratings using the specified conditions
(2) Analyze simulated data using the PC-MFR model: ln[ P_nijlk(x = k) / P_nijlk(x = k − 1) ] = θ_n − λ_i − δ_j − η_l − τ_lk
(3) Evaluate person fit and rater fit
   (3A) Identify misfitting students
   (3B) Identify raters with moderate severity and good fit
   (3C) Identify students with acceptable fit → No resolution needed; ratings from Step 1 are the final ratings
(4) Simulate new ratings for each student in (3A) from one randomly selected rater from (3B), using the student theta parameter from (1) and the "true" MC item difficulty, domain difficulty, and rater severity location values
   • Students who misfit due to differential achievement: maintain disordered domain difficulty → expected person misfit after resolution
   • Students who misfit due to the rater*domain interaction: use expected domain difficulty values from Step 2 → expect improved person fit
(5) From the original ratings (1), identify the original rater whose ratings are closest to the model-expected ratings
(6) Use the ratings identified in (5) and the new ratings from (4) as the final resolved ratings
(7) Analyze the final ratings for all students using the PC-MFR model and evaluate person fit
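A compact sketch of how steps (3) through (6) might be wired together. It is not the authors' code: the cutoff values, the severity band, and the helper callables (simulate_new_rating, the expected-ratings lookup) are placeholders for illustration.

```python
import numpy as np

INFIT_CRIT = OUTFIT_CRIT = 1.3        # illustrative cutoffs; the study derived critical values via bootstrap
SEVERITY_BAND = (-0.5, 0.5)           # hypothetical "moderate severity" band

def fit_informed_resolution(person_fit, rater_fit, severity, original_ratings,
                            expected_ratings, simulate_new_rating, seed=0):
    """Steps (3)-(6): flag misfitting students, choose a well-fitting rater of
    moderate severity to provide a new rating, keep the original rater whose
    ratings are closest to the model-expected ratings, and pair the two as the
    resolved ratings. All inputs are dicts/arrays prepared from Step (2)."""
    rng = np.random.default_rng(seed)
    # (3A) students whose infit or outfit exceeds the critical value
    flagged = [n for n, (infit, outfit) in person_fit.items()
               if infit > INFIT_CRIT or outfit > OUTFIT_CRIT]
    # (3B) raters with acceptable fit and moderate severity
    eligible = [i for i, fit in rater_fit.items()
                if fit <= INFIT_CRIT and SEVERITY_BAND[0] <= severity[i] <= SEVERITY_BAND[1]]
    # (3C) students with acceptable fit keep their Step (1) ratings
    resolved = {n: [np.asarray(r) for r in pair] for n, pair in original_ratings.items()}
    for n in flagged:
        new_rater = int(rng.choice(eligible))
        new_rating = np.asarray(simulate_new_rating(n, new_rater))    # step (4)
        # (5) original rater whose ratings sit closest to the model-expected ratings
        dists = [np.abs(np.asarray(r) - expected_ratings[n]).sum() for r in original_ratings[n]]
        keep = np.asarray(original_ratings[n][int(np.argmin(dists))])
        resolved[n] = [keep, new_rating]                               # step (6) resolved pair
    return resolved
```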
Rater Agreement Analysis
• We also examined rater agreement in our simulated ratings
• Identified performances with discrepancies ≥ 2 raw score points
• Used the same approach to identify a third rater and generate additional ratings
• Compared the 3rd rater's ratings to the original ratings & kept the ratings from the closest 2 raters
• Analyzed resolved ratings using the PC-MFR model
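For comparison, a sketch of the agreement-based rule described here. This is not the authors' code, and whether the ≥ 2-point discrepancy is computed on total raw scores or per domain is an assumption.

```python
import numpy as np

def agreement_resolution(rating_a, rating_b, get_third_rating, threshold=2):
    """If the two original raters' total raw scores differ by `threshold` or
    more, obtain a third rating and keep the two closest ratings."""
    a, b = np.asarray(rating_a), np.asarray(rating_b)
    if abs(int(a.sum()) - int(b.sum())) < threshold:
        return [a, b]                                    # raters agree: no resolution needed
    c = np.asarray(get_third_rating())                   # additional rating from a third rater
    pairs = [(a, b), (a, c), (b, c)]
    return list(min(pairs, key=lambda pr: abs(int(pr[0].sum()) - int(pr[1].sum()))))
```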
Results
Unresolved Ratings
• Person fit statistics reflected the simulation design
• 80-96% of students simulated to exhibit misfit were classified as misfitting
• ≤ 1% of students simulated to fit were classified as misfitting
Resolved Ratings
• Lower average MSE fit statistics for all students in all conditions
• Some differences for student fit subgroups:
   • Fitting students: fit remained acceptable
   • Rater*Performance Interaction students: more acceptable average person fit statistics; 1%-2% classified as "misfitting" following resolution
   • Differential Achievement students: fit statistics still higher (noisier; more misfit) than the overall sample; 79-94% classified as "misfitting"
Comparison to Rater Agreement
• Similar overall average person fit statistics following resolution
• Some differences for student fit subgroups:
   • Fitting students: fit remained acceptable
   • Rater*Performance Interaction students: average person fit statistics did not improve much following resolution; 70%-82% classified as "misfitting" following resolution
   • Differential Achievement students: fit statistics still higher (noisier; more misfit) than the overall sample; 75-93% classified as "misfitting"
• Lack of improvement in person fit = key difference between the model fit-informed approach & the rater agreement-based approach
Implications
Contribution
• Previous research on score resolution has focused almost exclusively on rater agreement methods
• We considered the implications of using indicators of model-data fit to identify performances for resolution & to identify raters to provide the new scores
RQ1: What is the impact of using model-data fit to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
• The fit-informed approach resulted in improved overall person fit statistics for students who exhibited misfit due to the rater*performance interaction
• Effective for improving the quality of achievement estimates in performance assessments
What does this mean?
• The fit-informed approach can help researchers and practitioners identify students with unexpected ratings both before and following score resolution
• If fit does not improve: additional steps may be needed to meaningfully evaluate their achievement related to the construct of interest
   • E.g., additional qualifiers for interpreting and using the score should be considered along with the score itself
RQ2: To what extent do the model-data fit-informed approach and a rater-agreement approach to score resolution result in comparable student achievement estimates and person fit statistics?
• Overall improvement in person fit following resolution
• Person fit did not improve for the rater*performance interaction subgroup
   • Profiles were less discrepant
   • Still misfitting from a measurement perspective