Automated Scoring and Rater Drift
National Conference on Student Assessment, Detroit, 2010
Wayne Camara, The College Board
Rater Drift
• When ratings are made over a period of time, there is a concern that they may become more lenient or more harsh.
• This occurs in all rating contexts: performance appraisals, scoring performance assessments, judging athletic events…
• The risk increases when:
  • Rubrics (criteria) are more subjective.
  • Scoring occurs over time (within year, between years).
  • There is pressure to score many tasks quickly.
Detecting and Correcting Rater Drift
• Tools may differ between assessments completed on paper and on computer.
• Multiple readers, with mixed assignments.
• Read-behinds.
• Seed papers from a previous administration; benchmark papers with an established mark (see the sketch after this list).
• Calibration of readers; retraining.
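To make the seed-paper approach concrete, here is a minimal sketch, not taken from the presentation, of how seed papers carrying an established mark can be used to flag readers whose scoring has drifted toward leniency or harshness. The reader IDs, scores, and tolerance are hypothetical.

```python
# Minimal sketch (hypothetical data): flag readers whose scores on seed papers
# deviate systematically from the established mark.
from statistics import mean

# reader -> list of (score_given, established_mark) pairs on seed papers
seed_results = {
    "reader_01": [(4, 4), (3, 4), (4, 4), (2, 3), (3, 3)],
    "reader_02": [(5, 4), (4, 3), (5, 4), (4, 3), (4, 4)],  # consistently above the mark
}

FLAG_THRESHOLD = 0.5  # mean deviation (score points) that triggers recalibration; illustrative value

for reader, pairs in seed_results.items():
    bias = mean(given - mark for given, mark in pairs)  # positive = lenient, negative = harsh
    status = "recalibrate/retrain" if abs(bias) >= FLAG_THRESHOLD else "within tolerance"
    print(f"{reader}: mean deviation {bias:+.2f} -> {status}")
```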
Automated Scoring
• Automated scoring systems handle essays, spoken responses, short content items, and numerical/graphical responses to math questions (with a verifiable, limited set of correct responses).
• Typically evaluated through comparison with human readers:
  • Correlations and weighted kappa, which is preferred over percent agreement because percent agreement is misleading and sensitive to the rating scale (4-point vs. 9-point); a weighted-kappa sketch follows this list.
  • Exact and adjacent agreement are likewise affected by the score scale (4-point vs. 9-point).
  • Score distributions should be similar to human readers' (variation in ratings, use of the extremes of the scale).
• Also validated against external criteria (other test sections, previous scores on the same test, scores on similar tests, grades).
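As an illustration of the weighted-kappa comparison mentioned above, the sketch below computes quadratic weighted kappa between a human reader and an engine on a 4-point scale. The scores are invented, and the presentation does not prescribe this particular implementation; it is only one common way to compute the statistic.

```python
# Minimal sketch (hypothetical scores): quadratic weighted kappa between a
# human reader and an automated engine on a 4-point scale.
from collections import Counter

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Agreement statistic that penalizes large disagreements more heavily."""
    cats = range(min_score, max_score + 1)
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint counts
    marg_a, marg_b = Counter(a), Counter(b)
    span = (max_score - min_score) ** 2
    num = den = 0.0
    for i in cats:
        for j in cats:
            w = (i - j) ** 2 / span                  # quadratic disagreement weight
            num += w * obs[(i, j)]                   # weighted observed disagreement
            den += w * marg_a[i] * marg_b[j] / n     # weighted disagreement expected by chance
    return 1.0 - num / den

human  = [3, 2, 4, 3, 1, 2, 4, 3, 2, 3]
engine = [3, 2, 3, 3, 2, 2, 4, 4, 2, 3]
print(round(quadratic_weighted_kappa(human, engine, 1, 4), 3))
```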
Automated Scoring: Issues to Consider in Using Scores to Detect Drift
• The rubric: general vs. task specific; holistic vs. mechanistic; unidimensional.
• Using other sections of the test (such as the MC items) as a criterion is useful, but MC items also have weaknesses.
• The relationship between performance tasks and MC items is expected to differ (they presumably measure different parts of the construct), so consistency across tasks must be established before the MC-section correlation is used as a criterion.
• Works best when computed separately for each dimension (not the combined score) and for each rater (not the total score); a per-rater sketch follows this list.
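The per-rater check in the last bullet might look like the following sketch, which correlates each rater's essay scores with the same examinees' MC-section scores and compares the result to a baseline from a period of stable scoring. The raters, data, baseline, and cutoff are all hypothetical, and `statistics.correlation` requires Python 3.10 or later.

```python
# Minimal sketch (hypothetical data): per-rater correlation between essay
# scores and MC-section scores, compared against a baseline value.
from statistics import correlation  # available in Python 3.10+

# rater -> list of (essay_score, mc_score) pairs for the papers that rater scored
by_rater = {
    "rater_A": [(2, 18), (3, 25), (4, 31), (3, 27), (1, 12), (4, 29)],
    "rater_B": [(4, 15), (4, 30), (3, 22), (2, 28), (4, 19), (3, 21)],
}

BASELINE_R = 0.55   # illustrative correlation from earlier, stable scoring
MAX_DROP   = 0.20   # illustrative tolerance before flagging a rater for review

for rater, pairs in by_rater.items():
    essay, mc = zip(*pairs)
    r = correlation(essay, mc)
    status = "review" if (BASELINE_R - r) > MAX_DROP else "ok"
    print(f"{rater}: r(essay, MC) = {r:+.2f} -> {status}")
```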
Papers by Lottridge and Schulz: Best Practices
• The scoring engine must be trained; if drift exists, training on papers from a brief time period can introduce the same error into the system.
• Note that raters tend to avoid extreme scores, and some AS systems also avoid extreme scores.
• Selection of the training sample: tasks already calibrated, representativeness of the tasks.
• Compare reader agreement AND the distribution of scores across all readers (see the distribution sketch after this list).
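One way to compare score distributions across readers and the engine, as the last bullet suggests, is simply to tabulate the proportion of papers at each score point and look at the tails. The scores below are invented to show an engine that under-uses the extremes of a 4-point scale.

```python
# Minimal sketch (hypothetical scores): compare how often each scorer uses
# each point on the scale, especially the extremes.
from collections import Counter

def score_distribution(scores, scale=(1, 2, 3, 4)):
    counts = Counter(scores)
    n = len(scores)
    return {s: counts[s] / n for s in scale}

human_scores  = [1, 2, 2, 3, 3, 3, 4, 4, 2, 3, 1, 4]
engine_scores = [2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3]   # avoids 1s and 4s

for label, scores in (("human", human_scores), ("engine", engine_scores)):
    dist = score_distribution(scores)
    print(label.ljust(7), ", ".join(f"{s}: {p:.0%}" for s, p in dist.items()))
```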
Papers by Lottridge and Schulz: Best Practices
• Year-to-year drift should be checked (e.g., rescore papers, N = 500 to 1,000; a rescoring sketch follows this list).
• Intrareader correlations and agreement increased over time.
• AS is treated as a single scorer in the comparison with each reader.
• The papers propose using AS as a second reader or solely to monitor reader quality.
• Its utility as a second reader rests on the knowledge that AS focuses on selected dimensions (grammar, mechanics, vocabulary, semantic content or relevance, organization).
• AS does not evaluate rhetorical skills, voice, the accuracy of the concepts described, or whether arguments are well founded.
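A minimal sketch of the year-to-year rescoring check in the first bullet: rescore a sample of last year's papers this year and compare the new scores with the originals. The scores, sample size, and tolerance are illustrative only, and `statistics.correlation` requires Python 3.10 or later.

```python
# Minimal sketch (hypothetical scores): year-to-year drift check on a sample
# of papers rescored from the previous administration.
from statistics import mean, correlation  # correlation: Python 3.10+

original_scores = [3, 2, 4, 3, 1, 2, 4, 3, 2, 3, 4, 2]   # last year's operational scores
rescored_scores = [3, 3, 4, 4, 2, 2, 4, 3, 3, 3, 4, 3]   # same papers, rescored this year

shift = mean(r - o for o, r in zip(original_scores, rescored_scores))
r_yy  = correlation(original_scores, rescored_scores)

print(f"mean score shift: {shift:+.2f} points")          # positive = this year more lenient
print(f"year-to-year correlation: {r_yy:.2f}")
if abs(shift) > 0.25:                                    # illustrative tolerance
    print("possible drift: review calibration and retrain readers or the engine")
```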
Automated Scoring
• Some cautions, but also promise.
• Common Core assessments: many leading proponents of the different assessment models have overestimated the efficacy of AS and underestimated the cost and time required.
• As noted earlier, AS is limited to certain types of tasks and subjects.
• As noted in the papers, AS does not make judgments; it scores on selected features. Readers also score on context and differential features, but they can make judgments and consider all aspects of a paper (if time permits).
• AS is moving beyond the big three (ETS, Vantage, PEM) to many new players, including Pacific Metrics, AIR…