Continuous Flow Scoring of Prose Constructed Response: A Hybrid of Automated and Human Scoring
Denny Way
Pearson
June 22, 2016
Presented at the National Conference on Student Assessment, Philadelphia, PA
Background
• This presentation and the one that follows are based on systems Pearson has developed to support high-volume, large-scale applications of automated scoring (AS) of written constructed-response items
• Much of Pearson's recent work in this area has been in supporting the PARCC assessment consortium
• The PARCC English Language Arts / Literacy (ELA/L) assessments include a variety of prose constructed response (PCR) tasks, which require students to write extended responses
Scoring Written Responses on PARCC
• The extensive use of writing is a strength of the ELA/L assessment and a primary reason for the strong rating it obtained in evaluations comparing it to other Common Core assessments 1
• Historically, writing has been scored by humans, which takes time and adds cost. Research has indicated that automated scoring can effectively supplement human scoring to reduce cost and increase scoring efficiency

1 See Doorey, N., & Polikoff, M. (2016, February). Evaluating the quality of next generation assessments. Washington, DC: Thomas B. Fordham Institute.
Use of Automated Scoring for PARCC ELA/L
• Automated scoring of writing was assumed for the operational PARCC assessments beginning in 2015-16
• Individual PARCC states may optionally contract to have 100% human scoring
• A single score is reported for each PARCC PCR, with 10% second scoring for purposes of reliability
• To support the use of automated scoring, extensive research has been conducted; this research and the proposed operational procedures were vetted with PARCC's Technical Advisory Committee (TAC) and approved by the PARCC State Leads
Topics for This Presentation
• What is "continuous flow" scoring?
• Training IEA on operational data
• Criteria for operationally deploying the AI scoring model
• Evaluating results
Continuous Flow
• In continuous flow scoring of constructed response items, a hybrid of human scoring and Pearson's Intelligent Essay Assessor (IEA) is used to optimize both the quality and the cost of scoring
• Continuous flow utilizes human scoring alongside automated scoring, such that responses can be branched to flow to either scoring approach
• Part of continuous flow is "smart routing," a process that automatically routes certain responses for an additional human score when the automated score is predicted to be less likely to agree with a human score (see the sketch below)
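As a rough illustration of the routing idea, the minimal Python sketch below sends a response for an additional human read when a predictor estimates a low probability that the automated score will agree with a human score. The `score_response` and `predict_agreement` callables and the 0.75 threshold are hypothetical stand-ins, not the operational IEA interface.

```python
# Minimal sketch of smart routing (illustrative only).
# `score_response` and `predict_agreement` are hypothetical stand-ins for the
# automated scoring engine and the model that predicts human-machine agreement.

def route_response(text, score_response, predict_agreement, threshold=0.75):
    """Decide whether a response needs an additional human score."""
    machine_score = score_response(text)     # automated (IEA) score
    p_agree = predict_agreement(text)        # predicted probability of agreeing with a human
    if p_agree < threshold:
        # Low predicted agreement: route for an additional human score
        return {"machine_score": machine_score, "route": "human"}
    return {"machine_score": machine_score, "route": "automated"}

# Example with toy stand-ins:
result = route_response(
    "Sample student response ...",
    score_response=lambda text: 3,
    predict_agreement=lambda text: 0.6,
)
print(result)   # {'machine_score': 3, 'route': 'human'}
```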
Smart Routing Concept (diagram)
Continuous Flow Process Diagram
Training IEA on Operational Data
• Continuous flow makes it relatively easy to train the automated scoring engine on operational data early in the administration window
• During this process, multiple human scores can be requested, and any backreading scores assigned by supervisors can also be used, to obtain the best possible data to train IEA (a small sketch of this selection follows)
• Human scoring is monitored closely, and when criteria are met, IEA modeling takes place
• Once IEA is trained on a particular prompt, results are evaluated by comparing IEA-human scoring agreement with human-human scoring agreement
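The idea of using back-read scores and multiple human scores to assemble the best possible training data can be illustrated with a small sketch. The record layout and the preference order (supervisor back-read, then exactly agreeing human pairs, then single human scores) are illustrative assumptions, not the operational logic.

```python
# Sketch of choosing the most trusted training label for each response.
# The dictionary keys and the preference order are illustrative assumptions.

def training_label(record):
    """Return the most trusted available score for one response, or None."""
    if record.get("backread") is not None:       # supervisor back-read: most trusted
        return record["backread"]
    human = record.get("human", [])              # human scores, in assignment order
    if len(human) >= 2:
        return human[0] if human[0] == human[1] else None   # keep only exact agreements
    if len(human) == 1:
        return human[0]
    return None

# Assemble (text, label) pairs for IEA model training from scored records.
records = [
    {"text": "response A ...", "human": [3, 3]},
    {"text": "response B ...", "human": [2, 4], "backread": 3},
    {"text": "response C ...", "human": [1, 2]},   # excluded: humans disagree, no back-read
]
training_set = [(r["text"], training_label(r)) for r in records if training_label(r) is not None]
print(training_set)   # [('response A ...', 3), ('response B ...', 3)]
```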
Reporting with Multiple Scores: Best (i.e., Highest Quality) Score Wins
• Although multiple scores may be assigned for a given response, only one can be reported
• When multiple scores exist, there is a hierarchy for deciding which score is actually reported (sketched below):
  • When the automated score is the only score, it is reported
  • When there is an automated score and a human score, the human score is reported
  • When there are two human scores, the first score is reported
  • When there is a supervisor back-read score, the back-read score is reported
  • When there are two non-adjacent scores, a resolution score is provided and the resolution score is reported
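A minimal sketch of this hierarchy is shown below, assuming a simple per-response record holding whatever scores were assigned; the field names are illustrative, not the operational schema.

```python
# Sketch of the "best score wins" hierarchy (field names are illustrative).

def reported_score(record):
    """Select the single reported score from all scores assigned to a response."""
    if record.get("resolution") is not None:   # non-adjacent scores were resolved
        return record["resolution"]
    if record.get("backread") is not None:     # supervisor back-read score
        return record["backread"]
    human = record.get("human", [])            # human scores, in assignment order
    if human:                                  # a human score beats the automated score;
        return human[0]                        # with two human scores, the first is reported
    return record["machine"]                   # automated score is the only score

print(reported_score({"machine": 3}))                                     # 3 (automated only)
print(reported_score({"machine": 3, "human": [4]}))                       # 4 (human reported)
print(reported_score({"machine": 3, "human": [4, 2], "resolution": 3}))   # 3 (resolution)
```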
Daily Scoring Status Calls
• While training IEA operationally, daily status meetings were held with automated scoring experts, performance scoring operational staff, content experts, the program team, and psychometricians
• Scoring statistics were shared daily to review human scoring performance and the operational readiness of automated scoring models
• Interventions were made where scoring challenges were encountered:
  • Resetting human scores where agreement was low
  • Additional training clarifications
  • Efforts to sample high-performance responses
Criteria for Operationally Deploying the AI Scoring Model: Considerations
• Need for automated criteria that can be applied in real time
• Focus on validity as the most important criterion
  • Governs evaluation of human scoring
  • Expressed in terms of agreement rates rather than other statistics
• Need to document performance of AI scoring for subgroups
• Metrics based on the research literature 2 (a flagging sketch follows the table)

Measure                          Threshold            Human-Machine Difference
Pearson Correlation              Less than 0.7        Greater than 0.1
Kappa                            Less than 0.4        Greater than 0.1
Quadratic Weighted Kappa         Less than 0.7        Greater than 0.1
Exact Agreement                  Less than 65%        Greater than 5.25%
Standardized Mean Difference     Greater than 0.15

2 See Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.
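The sketch below shows one way the tolerances in the table could be computed and applied. The statistics follow common definitions (the standardized mean difference uses a pooled standard deviation); the helper names and data layout are assumptions, and scikit-learn is used for the kappa statistics.

```python
# Sketch of computing IEA-human agreement statistics and flagging them against
# the quality-criteria tolerances in the table above (illustrative, not operational code).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(a, b):
    """Agreement statistics between two sets of scores on the same responses."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)   # one common SMD formulation
    return {
        "pearson_r": np.corrcoef(a, b)[0, 1],
        "kappa": cohen_kappa_score(a.astype(int), b.astype(int)),
        "qwk": cohen_kappa_score(a.astype(int), b.astype(int), weights="quadratic"),
        "exact": float(np.mean(a == b)),
        "smd": abs(a.mean() - b.mean()) / pooled_sd,
    }

# (flag if the statistic crosses this value, flag if it trails human-human by more than this)
CRITERIA = {
    "pearson_r": (0.70, 0.10),
    "kappa":     (0.40, 0.10),
    "qwk":       (0.70, 0.10),
    "exact":     (0.65, 0.0525),
    "smd":       (0.15, None),      # SMD is flagged on its own value only
}

def quality_flags(iea_human, human_human):
    """Return a dict of True/False flags, one per statistic."""
    flags = {}
    for name, (threshold, max_drop) in CRITERIA.items():
        if name == "smd":
            flags[name] = iea_human[name] > threshold
        else:
            below = iea_human[name] < threshold
            drop = max_drop is not None and (human_human[name] - iea_human[name]) > max_drop
            flags[name] = below or drop
    return flags
```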
Criteria for Operationally Deploying the AI Scoring Model
1. Primary Criteria – based on validity responses
  • With smart routing applied as needed, IEA agreement is as good as or better than human agreement for both trait scores
2. Contingent Primary Criteria (if validity responses are not available)
  • With smart routing applied as needed, IEA-human exact agreement is within 5.25% of human-human exact agreement for both trait scores
3. Secondary Criteria – based on the training responses
  • With smart routing applied as needed, IEA-human differences on statistical measures for both traits are evaluated against the quality criteria tolerances for subgroups with at least 50 responses (see the sketch below)
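One illustrative reading of how these criteria might combine into a deploy/hold decision is sketched below, reusing the `agreement_stats`/`quality_flags` helpers from the previous sketch. The per-trait summary structure is an assumption, and treating any secondary flag as blocking deployment is only one possible reading of the slide.

```python
# Sketch of a deploy/hold decision based on the primary, contingent primary, and
# secondary criteria above (one illustrative reading, not the operational rule).

def ready_to_deploy(trait_summaries, subgroup_results, min_n=50, tolerance=0.0525):
    """trait_summaries: one dict per trait score; subgroup_results: dicts with n and flags."""
    for trait in trait_summaries:
        if trait.get("validity_available"):
            # Primary: on validity responses, IEA agreement must match or beat human agreement
            if trait["iea_human_exact_validity"] < trait["human_human_exact_validity"]:
                return False
        else:
            # Contingent primary: IEA-human exact agreement within 5.25 points of human-human
            if trait["human_human_exact"] - trait["iea_human_exact"] > tolerance:
                return False
    # Secondary: no quality-criteria flags for subgroups with at least `min_n` responses
    for group in subgroup_results:
        if group["n"] >= min_n and any(group["flags"].values()):
            return False
    return True
```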
Subgroup Analyses
• For each prompt, we evaluated the performance of IEA for various subgroups
• We compared various agreement indices (r, Kappa, Quadratic Weighted Kappa, Exact Agreement) computed from human-human results with those computed from IEA-human results
• We also looked at standardized mean differences (SMDs) between IEA and human scores
• We flagged differences for any groups based on the quality criteria (see the sketch after the table):

Measure                          Threshold            Human-Machine Difference
Pearson Correlation              Less than 0.7        Greater than 0.1
Kappa                            Less than 0.4        Greater than 0.1
Quadratic Weighted Kappa         Less than 0.7        Greater than 0.1
Exact Agreement                  Less than 65%        Greater than 5.25%
Standardized Mean Difference     Greater than 0.15
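The subgroup evaluation can be sketched as a loop over demographic groups, reusing `agreement_stats` and `quality_flags` from the earlier sketch. The pandas DataFrame columns (`subgroup`, `iea`, `human_1`, `human_2`) are assumptions, and the sketch assumes the data are the double-human-scored responses for one prompt.

```python
# Sketch of flagging quality-criteria violations by subgroup (illustrative only).
# Assumes `agreement_stats` and `quality_flags` from the earlier sketch are defined,
# and that `df` holds double-human-scored responses for a single prompt.
import pandas as pd

def subgroup_quality_flags(df, min_n=50):
    rows = []
    for name, g in df.groupby("subgroup"):
        if len(g) < min_n:
            continue                      # only subgroups with at least 50 responses are evaluated
        iea_human = agreement_stats(g["iea"], g["human_1"])
        human_human = agreement_stats(g["human_1"], g["human_2"])
        rows.append({"subgroup": name, "n": len(g), **quality_flags(iea_human, human_human)})
    return pd.DataFrame(rows)
```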
Summary
• Continuous flow scoring involves integrating human and automated scoring processes to support high-quality and efficient scoring
• This presentation described various processes involved in continuous flow scoring as applied to the PARCC assessment program
• The presentation that follows will share some of the research and initial operational results for the PARCC program based on continuous flow scoring