PARCC Research Results
Karen E. Lochbaum, Pearson
June 22, 2016
Presented at the National Conference on Student Assessment, Philadelphia, PA
Research Questions • Do scores assigned by the Intelligent Essay Assessor (IEA) agree with human scores as well as human scores agree with each other? ‒ Across all prompts and traits for all responses? ‒ Across prompts and traits for responses across subgroups? • Do scores assigned by IEA agree with scores assigned by experts to validity papers as well as human scores do?
Series of Studies and Results • 2014: Field Test Study ‒ Promising Initial Results • 2015: Year 1 Operational Studies ‒ Performance ‒ Validity responses ‒ Subgroups • 2016: Year 2 Operational Performance
2015 Research Summary
Year 1 Operational Study • IEA served as the 10% second score • A subset of prompts received an additional human score ‒ One of each prompt type ‒ In each grade level • Study compared IEA-human to human-human performance on 26 prompts
Summary of Human vs. IEA Exact Agreement Rates The exact agreement between IEA and human readers was higher than the agreement between two human readers, and higher still between IEA and the more experienced human backread scorers.
Summary of Human vs. IEA Exact Agreement Rates on Validity Responses IEA’s exact agreement on validity responses was higher than the human scorers’ agreement.
Human vs. IEA Exact Agreement Rates by Subgroup

Comparison            Af Am    Asian    Hispanic    2+ Races    Native Am
Human 2 vs. Human 1   68.6%    62.8%    67.1%       69.8%       65.4%
IEA Op vs. Human 1    74.0%    68.1%    72.5%       72.6%       72.6%

Comparison            White    ELL      SWD         Female      Male
Human 2 vs. Human 1   65.0%    71.2%    75.5%       63.9%       68.2%
IEA Op vs. Human 1    69.9%    76.3%    78.6%       69.0%       73.0%

Min N count: 1,379/14,370 (2+ Races); Max N count: 43,693/448,339 (Whites)
The exact agreement between IEA and human readers was higher than it was between two human readers for various demographic subgroups.
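As a minimal sketch of how a subgroup comparison like the one above could be computed, assuming a pandas DataFrame with illustrative columns subgroup, human1, human2, and iea_op (these names and the layout are assumptions, not the actual PARCC data format):

```python
import pandas as pd

def exact_agreement(a: pd.Series, b: pd.Series) -> float:
    """Percentage of responses on which the two scorers assigned identical scores."""
    return float((a == b).mean() * 100)

def agreement_by_subgroup(df: pd.DataFrame) -> pd.DataFrame:
    """Compare Human 2 vs. Human 1 and IEA vs. Human 1 exact agreement by subgroup."""
    rows = []
    for group, sub in df.groupby("subgroup"):
        rows.append({
            "subgroup": group,
            "n_responses": len(sub),
            "human2_vs_human1": exact_agreement(sub["human2"], sub["human1"]),
            "iea_vs_human1": exact_agreement(sub["iea_op"], sub["human1"]),
        })
    return pd.DataFrame(rows)
```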
2016 Operational Performance
A Reminder: Criteria for Operationally Deploying the AI Scoring Model
1. Primary Criteria – Based on validity responses
• With smart routing applied as needed, IEA agreement is as good as or better than human agreement for both trait scores
2. Contingent Primary Criteria (if validity responses are not available)
• With smart routing applied as needed, IEA-Human exact agreement is within 5.25% of Human-Human exact agreement for both trait scores
3. Secondary Criteria – Based on the training responses
• With smart routing applied as needed, IEA-Human differences on statistical measures for both traits are evaluated against quality-criteria tolerances for subgroups with at least 50 responses
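A minimal sketch of the primary / contingent primary decision logic described above, for a single trait score; the function name, inputs, and structure are illustrative assumptions, not PARCC's actual implementation, and the secondary (training-response) criteria are not shown:

```python
from typing import Optional

def meets_deployment_criteria(
    iea_human_exact: float,                        # IEA-Human exact agreement (%)
    human_human_exact: float,                      # Human-Human exact agreement (%)
    iea_validity_exact: Optional[float] = None,    # IEA agreement (%) with expert validity scores
    human_validity_exact: Optional[float] = None,  # human agreement (%) with expert validity scores
) -> bool:
    """Check the primary / contingent primary deployment criteria for one trait score."""
    if iea_validity_exact is not None and human_validity_exact is not None:
        # Primary criteria: on validity responses, IEA agreement is as good as
        # or better than human agreement.
        return iea_validity_exact >= human_validity_exact
    # Contingent primary criteria: IEA-Human exact agreement is within
    # 5.25 percentage points of Human-Human exact agreement.
    return iea_human_exact >= human_human_exact - 5.25
```

A prompt would be deployed only if both trait scores satisfy the applicable criterion; the secondary criteria on the training responses are evaluated separately against the quality-criteria tolerances described on the subgroup-analysis slide below.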
Summary of Results: Comparison of IEA and Human Scores • Means and standard deviations of IEA and human scores across all prompts were very close • Some variability compared to the first human scorer might be expected item by item because IEA was trained on the “best” score available (backread, resolution, first read)
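A minimal sketch of that per-prompt comparison, assuming a pandas DataFrame with illustrative columns prompt, iea, and human (hypothetical names, not the actual data layout):

```python
import pandas as pd

def mean_sd_by_prompt(scores: pd.DataFrame) -> pd.DataFrame:
    """Per-prompt means and standard deviations of IEA and human scores."""
    return scores.groupby("prompt").agg(
        iea_mean=("iea", "mean"),
        human_mean=("human", "mean"),
        iea_sd=("iea", "std"),
        human_sd=("human", "std"),
    )
```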
IEA Mean vs. Human Mean, Conventions Trait
IEA SD vs. Human SD, Conventions Trait
IEA Mean vs. Human Mean, Expressions Trait
IEA SD vs. Human SD, Expressions Trait
IEA vs. Human Validity Agreement, Conventions Trait
[Table by prompt (grades 3–11): IEA vs. human exact agreement and agreement at each score point (SP0–SP3) on validity responses. Color legend: Blue = IEA exceeds human by more than 5.25; Blue-Green = IEA at or above human; Green = IEA within 5.25 of human; Red = IEA lower than human by more than 5.25.]
IEA vs. Human Validity Agreement, Expressions Trait
[Table by prompt (grades 3–11): IEA vs. human exact agreement and agreement at each score point (SP0–SP4) on validity responses; same color legend as above.]
IEA vs. Human Agreement, Conventions Trait
[Table by prompt (grades 3–11): IEA vs. human exact agreement and agreement at each score point (SP0–SP3); same color legend as above.]
IEA vs. Human Agreement, Expressions Trait
[Table by prompt (grades 3–11): IEA vs. human exact agreement and agreement at each score point (SP0–SP4); same color legend as above.]
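The color coding in the four tables above reduces to a single rule on the difference between IEA and human agreement. A minimal sketch of that mapping (the thresholds come from the legend; the function itself is illustrative):

```python
def agreement_color(iea_agreement: float, human_agreement: float) -> str:
    """Map an IEA-vs.-human agreement comparison (in percentage points)
    to the color categories used in the tables above."""
    diff = iea_agreement - human_agreement
    if diff > 5.25:
        return "Blue"        # IEA exceeds human by more than 5.25
    if diff >= 0.0:
        return "Blue-Green"  # IEA at or above human
    if diff >= -5.25:
        return "Green"       # IEA within 5.25 of human
    return "Red"             # IEA lower than human by more than 5.25
```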
A Reminder: Subgroup Analyses
• For each prompt, we evaluated the performance of IEA for various subgroups
• We calculated various agreement indices (r, Kappa, Quadratic Weighted Kappa, Exact Agreement), comparing human-human results with IEA-human results
• We also looked at standardized mean differences (SMDs) between IEA and human scores
• We flagged differences for any groups based on the quality criteria:

Measure                        Threshold            Human-Machine Difference
Pearson Correlation            Less than 0.7        Greater than 0.1
Kappa                          Less than 0.4        Greater than 0.1
Quadratic Weighted Kappa       Less than 0.7        Greater than 0.1
Exact Agreement                Less than 65%        Greater than 5.25%
Standardized Mean Difference   Greater than 0.15
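A minimal sketch of how the flagging described above could be implemented for one prompt, trait, and subgroup. The thresholds come from the quality-criteria table; the function and variable names are illustrative, the pooled-SD form of the SMD is an assumption, and the "Human-Machine Difference" column is read here as the human-human index minus the IEA-human index (an interpretation, not stated explicitly on the slide):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_indices(a, b) -> dict:
    """Agreement indices between two sets of integer trait scores."""
    a, b = np.asarray(a, int), np.asarray(b, int)
    return {
        "pearson_r": float(np.corrcoef(a, b)[0, 1]),
        "kappa": cohen_kappa_score(a, b),
        "qwk": cohen_kappa_score(a, b, weights="quadratic"),
        "exact": float((a == b).mean() * 100),
    }

def quality_flags(human1, human2, iea, min_n: int = 50) -> dict:
    """Flag quality-criteria violations for one prompt, trait, and subgroup.

    human1 and human2 are the two human reads; iea holds the machine scores.
    """
    human1, human2, iea = (np.asarray(x, float) for x in (human1, human2, iea))
    if len(human1) < min_n:
        return {}  # subgroups with fewer than 50 responses are not evaluated

    hm = agreement_indices(human1, iea)      # human-machine indices
    hh = agreement_indices(human1, human2)   # human-human baseline indices
    # Standardized mean difference; a pooled-SD denominator is assumed here.
    smd = (iea.mean() - human1.mean()) / np.sqrt(
        (iea.var(ddof=1) + human1.var(ddof=1)) / 2)

    return {
        # Absolute thresholds from the "Threshold" column.
        "pearson_r_flag": hm["pearson_r"] < 0.7,
        "kappa_flag": hm["kappa"] < 0.4,
        "qwk_flag": hm["qwk"] < 0.7,
        "exact_agreement_flag": hm["exact"] < 65.0,
        "smd_flag": abs(smd) > 0.15,
        # "Human-Machine Difference" tolerances (degradation relative to human-human).
        "pearson_r_diff_flag": hh["pearson_r"] - hm["pearson_r"] > 0.1,
        "kappa_diff_flag": hh["kappa"] - hm["kappa"] > 0.1,
        "qwk_diff_flag": hh["qwk"] - hm["qwk"] > 0.1,
        "exact_diff_flag": hh["exact"] - hm["exact"] > 5.25,
    }
```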
Subgroup Analyses
• 29/55 prompts had no flags on either trait
• When flags did occur:
‒ Only for one or two groups
‒ Only on one or two of the quality measures
‒ None sufficiently concerning to consider retraining
• Sometimes different measures indicated different results
‒ Lower than humans on exact agreement
‒ Higher on quadratic weighted kappa
• SMD flags were rare
‒ They always indicated higher IEA scores than human scores
Summary of Subgroup Analyses
Spring 2016 Continuous Flow Performance – With 6.5M responses scored year to date (YTD)
Summary • Extensive research was conducted over three years to validate the use of the Continuous Flow system on the PARCC assessment • Initial results indicate its successful operational use in 2016 • Continuous Flow combines the strengths and benefits of both human and automated scoring • Continuous Flow performance exceeds that of a human-only scoring system while routing potentially challenging responses for further review