Humans and Machines: Modeling the Stochastic Behavior of Raters in Educational Assessment
Richard J. Patz
BEAR Seminar, UC Berkeley Graduate School of Education
February 13, 2018
Outline of Topics
• Natural human responses in educational assessment
• Technology in education, assessment, and scoring
• Computational methods for automated scoring (NLP, LSA, ML)
• Rating information in statistical and psychometric analysis: Challenges
• Unreliability and bias
• Combining information from multiple ratings
• Hierarchical rater model (HRM)
• Applications
• Comparing machines to humans
• Simulating human rating errors to further related research
Why natural, constructed-response formats in assessment?
• Learning involves constructing knowledge and expressing it through language (written and/or oral)
• Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction
• Artificially contrived item formats (e.g., multiple-choice) favor skills unrelated to the intended construct
• Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats
Disadvantages of Constructed-Response Formats
• Time-consuming for examinees (fewer items per unit time)
• Require expensive human ratings (typically)
• Delay the delivery of scores and reports
• Human rating is error-prone
• Consistency across rating events is difficult to maintain
• Inconsistency impairs comparability
• Combining multiple ratings creates modeling and scoring problems
Practical Balancing of Priorities
• Mix constructed-response formats with selected-response formats to realize the benefits of each
• Leverage technology in the scoring of CR items
• Rule-based scoring (exhaustively enumerated/constrained)
• Natural language processing and subsequent automated rating for written (sometimes spoken) responses
• Made more practical with computer-based test delivery
Technology for Automated Scoring
• Ten years ago there were relatively few providers
• Expensive, proprietary algorithms
• Specialized expertise (NLP, LSA, AI)
• Laborious, ‘hand-crafted’ engine training
• Today solutions are far more widely available
• Students fit AS models in CS and STAT classes
• Open-source libraries abound
• Machine learning and neural networks are accessible, powerful, and up to the job
• Validity and reliability challenges remain
• Impact of algorithms on instruction, e.g., in writing? Also the threat of gaming strategies
• Managing algorithm improvements and examinee adaptations over time
• Quality human scores are needed to train the machines (supervised learning)
• Biases or other problems in human ratings are ‘learned’ by algorithms
• Combining scores from machines and humans
Machine Learning for Automated Essay Scoring
Example architecture: Taghipour & Ng (2016)
Example characteristics:
• Words processed in relation to a corpus for frequency, etc.
• N-grams (word pairs, triplets, etc.)
• Transformations (non-linear, sinusoidal) and dimensionality reduction
• Iterative improvement along a gradient, with memory of previous states
• Data split into training and validation; prediction accuracy maximized on the validation set; little else “interpretable” about the parameters (a simplified pipeline sketch follows below)
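To make the generic pipeline concrete, here is a minimal sketch, not the Taghipour & Ng neural architecture: n-gram features, dimensionality reduction, a regression model fit by gradient iterations, and validation accuracy measured with quadratic weighted kappa. The corpus and scores below are synthetic placeholders.

```python
# Minimal automated-essay-scoring pipeline sketch (not Taghipour & Ng's neural
# model): n-gram features -> dimensionality reduction -> gradient-fit regression,
# evaluated on a held-out validation set with quadratic weighted kappa.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

# Synthetic placeholder data: random "essays" and human scores on a 1-6 rubric.
rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(200)]
essays = [" ".join(rng.choice(vocab, size=150)) for _ in range(400)]
human_scores = rng.integers(1, 7, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    essays, human_scores, test_size=0.25, random_state=0)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),   # words, word pairs, triplets
    TruncatedSVD(n_components=100),        # dimensionality reduction
    SGDRegressor(max_iter=1000))           # iterative improvement along a gradient

model.fit(X_train, y_train)

# Round predictions back onto the rubric, then check agreement with human scores.
preds = np.clip(np.rint(model.predict(X_val)), 1, 6).astype(int)
print("validation QWK:", cohen_kappa_score(y_val, preds, weights="quadratic"))
```

With real essays, the human ratings used to fit such a model bound its quality, which is the link to the rater models discussed in the remainder of the talk.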
Focus of Research
• Situating the rating process within the overall measurement and statistical context
• The hierarchical rater model (HRM)
• Accounting for multiple ratings ‘correctly’
• Contrast with alternative approaches, e.g., Facets
• Simultaneous analysis of human and machine ratings
• Example from a large-scale writing assessment
• Leveraging models of human rating behavior for better simulation and examination of impacts on inferences
Hierarchical Structure of Rated Item Response Data (Patz, Junker & Johnson, 2002)
• If all levels follow normal distributions, then Generalizability Theory applies
• Estimates at any level weigh the data mean against the prior mean, using the ‘generalizability coefficient’
• If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, we have the HRM

HRM levels:
$$
\left.
\begin{aligned}
\theta_i &\sim \text{i.i.d. } N(\mu, \sigma^2), && i = 1, \dots, N \\
\xi_{ij} &\sim \text{an IRT model (e.g., PCM)}, && j = 1, \dots, J, \text{ for each } i \\
X_{ijr} &\sim \text{a signal detection model}, && r = 1, \dots, R, \text{ for each } i, j
\end{aligned}
\right\}
$$
Hierarchical Rater Model
• Raters detect the true item score (i.e., the ‘ideal rating’) with a degree of bias and imprecision
• Example: a rater with bias $\phi_r = -.2$ and variability $\psi_r = .5$ rating a response whose ideal rating is $\xi = 3$ has rating probabilities $p_{33r} = .64$, $p_{32r} = .08$, $p_{34r} = .27$, where $p_{\xi k r} = P(X_{ijr} = k \mid \xi_{ij} = \xi)$ (a computational sketch of this rating stage follows below)
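A minimal sketch of the discrete signal detection ("rating") stage typically used in the HRM, where the probability that rater $r$ assigns category $k$ to a response with ideal rating $\xi$ is proportional to $\exp\{-(k - \xi - \phi_r)^2 / (2\psi_r^2)\}$. Exact parameterizations and sign conventions for the bias vary across HRM papers, so treat this form, and the illustrative values, as assumptions.

```python
import numpy as np

def rating_probs(xi, phi_r, psi_r, categories):
    """Rating-stage probabilities P(X = k | ideal rating xi) for one rater.

    Assumes the discrete-normal form p(k | xi) proportional to
    exp(-(k - xi - phi_r)**2 / (2 * psi_r**2)); sign conventions for the
    bias phi_r differ across HRM formulations.
    """
    k = np.asarray(list(categories), dtype=float)
    logits = -((k - xi - phi_r) ** 2) / (2.0 * psi_r ** 2)
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return p / p.sum()

# Illustrative values on a 1-5 rubric: a rater who tends to rate slightly high
# and is fairly precise, scoring a response whose ideal rating is 3.
print(rating_probs(xi=3, phi_r=0.3, psi_r=0.7, categories=range(1, 6)))
```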
Hierarchical Rater Model (cont.)
• Examinees respond to items according to a polytomous item response theory model (here the PCM; could be the GPCM, GRM, or others); a numeric sketch follows below the equation:
$$
P\left[\xi_{ij} = \xi \mid \theta_i, \beta_j, \boldsymbol{\gamma}_j\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi} \left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}}
       {\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} \left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}},
\qquad \theta_i \sim \text{i.i.d. } N(\mu, \sigma^2), \; i = 1, \dots, N
$$
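A small numeric sketch of the PCM category probabilities as written above; the parameter values are made up for illustration.

```python
import numpy as np

def pcm_probs(theta, beta_j, gamma_j):
    """Partial credit model category probabilities for one examinee and item.

    gamma_j holds the K-1 step parameters; categories are 0, ..., K-1, and the
    empty sum for category 0 contributes exp(0) = 1 to the denominator.
    """
    steps = theta - beta_j - np.asarray(gamma_j, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # sum_{k=1}^{h} terms
    p = np.exp(cum - cum.max())                       # stabilized exponentials
    return p / p.sum()

# Made-up values: 5 categories, so 4 step parameters.
print(pcm_probs(theta=0.5, beta_j=0.0, gamma_j=[-1.0, -0.3, 0.4, 1.2]))
```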
HRM Estimation
• Most straightforward to estimate using Markov chain Monte Carlo (MCMC)
• Uninformative priors specified in Patz et al. (2002) and Casabianca et al. (2016)
• WinBUGS/JAGS (may be called from within R)
• HRM has also been estimated using maximum likelihood and posterior modal estimation (Donoghue & Hombo, 2001; DeCarlo et al., 2011)
Facets Alternative
• Facets (Linacre) models can capture rater effects:
$$
P\left[\xi_{ij} = \xi \mid \theta_i, \beta_j, \boldsymbol{\gamma}_j, \lambda_{rjk}\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi} \left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
       {\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} \left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
$$
where $\lambda_{rjk}$ is the effect rater $r$ has on category $k$ of item $j$.
Note: rater effects $\lambda$ may be constant for all levels of an item, for all items at a given level, or for all levels of all items. Every rater-item combination has a unique ICC.
Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004).
Dependence structure of Facets models
• Ratings are directly related to proficiency
• Arbitrarily precise $\theta$ estimation is achievable by increasing the number of ratings $R$, because each additional rating is treated as independent evidence about $\theta$
• Alternatives (other than the HRM) include:
• Rater Bundle Model (Wilson & Hoskens, 2001)
• Design-effect-like correction (Bock, Brennan, & Muraki, 1999)
Applications & Extensions of the HRM
• Detecting rater effects and “modality” effects in the Florida assessment program (Patz, Junker, Johnson, 2002)
• 360-degree feedback data (Barr & Raju, 2003)
• Rater covariates, applied to the Golden State Exam (image vs. paper study) (Mariano & Junker, 2007)
• Latent classes for raters, applied to a large-scale language assessment (DeCarlo et al., 2011)
• Machine (i.e., automated) and human scoring (Casabianca et al., 2016)
HRM with rater covariates
• Introduce a design matrix $\Phi$ associating individual raters with their covariates
• Bias and variability of ratings vary according to rater characteristics, via regressions of rater bias and of rater variability on those covariates
Application with Human and Machine Ratings
• Statewide writing assessment program (provided by CTB)
• 5 dimensions of writing (“items”), each on a 1-6 rubric
• 487 examinees
• 36 raters: 18 male, 17 female, 1 machine
• Each paper scored by four raters (1 machine, 3 humans)
• 9,740 ratings in total
Results by “gender”
• Male and female raters show very similar (and on average negligible) bias
• The machine is less variable (especially than male raters) and more severe (not significantly)
• Individual rater bias and severity estimates are informative (next slide)
Individual rater estimates may be diagnostic
• Most lenient: r = 11
• Most harsh and least variable: r = 20 (problematic pattern confirmed)
• Most variable: r = 29
Continued Research
• The HRM presents a systematic way to simulate rater behavior
• What ranges of variability and bias are typical? Good? Problematic?
• Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
• What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training?
• To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
• Under what conditions should different (especially more granular) signal detection models be used within the HRM framework?
Quadratic Weighted Kappa
$$
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
\qquad w_{i,j} = \frac{(i - j)^2}{(N - 1)^2}
$$
where $O_{i,j}$ is the observed count in cell $i,j$, $E_{i,j}$ is the expected count in cell $i,j$, and $N$ is the number of score categories.
• Penalizes non-adjacent disagreement more than unweighted kappa or linearly ($|i - j|$) weighted kappa
• Widely used as a prediction accuracy metric in machine learning
• Kappa statistics are an important supplement to rates of agreement (exact/adjacent) in operational rating
(A direct implementation of the formula follows below.)
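A direct implementation of the formula above; ratings are assumed to be coded 0 to K-1, and the two rating vectors are made up for the toy check.

```python
import numpy as np

def quadratic_weighted_kappa(ratings_a, ratings_b, n_categories):
    """QWK between two integer rating vectors coded 0..n_categories-1.

    Implements kappa = 1 - sum(w * O) / sum(w * E) with weights
    w[i, j] = (i - j)**2 / (n_categories - 1)**2.
    """
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    # Observed contingency table O and expected table E under independence.
    O = np.zeros((n_categories, n_categories))
    np.add.at(O, (a, b), 1)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    i, j = np.indices((n_categories, n_categories))
    w = (i - j) ** 2 / (n_categories - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

# Toy check with two hypothetical raters on a 5-category (0-4) scale.
r1 = np.array([0, 1, 2, 2, 3, 4, 4, 1])
r2 = np.array([0, 2, 2, 3, 3, 4, 3, 1])
print(quadratic_weighted_kappa(r1, r2, n_categories=5))
```

This should agree with scikit-learn's cohen_kappa_score(weights="quadratic") when every category appears in the data, since the constant in the weights cancels in the ratio.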
HRM Rater Noise
• How does HRM signal detection accuracy impact reliability and agreement statistics for rated items?
• Use the HRM to simulate realistic patterns of rater behavior
• Example (a simulation sketch follows below):
• For 10,000 examinees with normally distributed proficiencies
• True item scores (ideal ratings) from the PCM/RSM: 10 items, 5 levels per item
• Vary the rater variability parameter $\psi_r$, with rater bias $\phi_r = 0$
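A sketch of that simulation design under stated assumptions: RSM-style shared step parameters, the discrete-normal rating stage sketched earlier, two ratings per response, and made-up parameter values; the number of examinees is reduced from 10,000 to keep the toy run fast.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcm_probs(theta, beta, gamma):
    """PCM category probabilities (categories 0..K-1) for one examinee/item."""
    cum = np.concatenate(([0.0], np.cumsum(theta - beta - gamma)))
    p = np.exp(cum - cum.max())
    return p / p.sum()

def rate(xi, phi, psi, n_cat):
    """Draw one observed rating given ideal rating xi and rater bias/variability."""
    k = np.arange(n_cat)
    logit = -((k - xi - phi) ** 2) / (2.0 * psi ** 2)
    p = np.exp(logit - logit.max())
    return rng.choice(k, p=p / p.sum())

def qwk(a, b, n_cat):
    """Quadratic weighted kappa between two integer rating vectors (0..n_cat-1)."""
    O = np.zeros((n_cat, n_cat))
    np.add.at(O, (np.asarray(a), np.asarray(b)), 1)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    i, j = np.indices((n_cat, n_cat))
    w = (i - j) ** 2 / (n_cat - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

N, J, K = 2_000, 10, 5                    # examinees (slide uses 10,000), items, levels
theta = rng.normal(0.0, 1.0, size=N)      # proficiencies
beta = rng.normal(0.0, 0.5, size=J)       # item locations (made up)
gamma = np.array([-1.5, -0.5, 0.5, 1.5])  # shared RSM-style step parameters (made up)

# Ideal ratings xi[i, j] drawn from the PCM.
xi = np.array([[rng.choice(K, p=pcm_probs(t, b, gamma)) for b in beta] for t in theta])

# Two unbiased raters per response, at several levels of rater variability psi.
for psi in (0.2, 0.5, 0.8):
    r1 = np.array([rate(x, 0.0, psi, K) for x in xi.ravel()])
    r2 = np.array([rate(x, 0.0, psi, K) for x in xi.ravel()])
    exact = (r1 == r2).mean()
    print(f"psi={psi}: exact agreement={exact:.2f}, QWK={qwk(r1, r2, K):.2f}")
```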
Ideal ratings follow the PCM (plot of ideal ratings, $\psi_r = 0$)
Results