

  1. Humans and Machines: Modeling the Stochastic Behavior of Raters in Educational Assessment Richard J. Patz BEAR Seminar UC Berkeley Graduate School of Education February 13, 2018

  2. Outline of Topics • Natural human responses in educational assessment • Technology in education, assessment, and scoring • Computational methods for automated scoring (NLP, LSA, ML) • Rating information in statistical and psychometric analysis: challenges • Unreliability and bias • Combining information from multiple ratings • Hierarchical rater model (HRM) • Applications • Comparing machines to humans • Simulating human rating errors to further related research

  3. Why natural, constructed-response formats in assessment? • Learning involves constructing knowledge and expressing it through language (written and/or oral) • Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction • Artificially contrived item formats (e.g., multiple-choice) reward skills unrelated to the intended construct • Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats

  4. Disadvantages of Constructed-Response Formats • Time consuming for examinees (fewer items per unit time) • Require expensive human ratings (typically) • Create delay in providing scores, reports • Human rating is error-prone • Consistency across rating events difficult to maintain • Inconsistency impairs comparability • Combining multiple ratings creates modeling, scoring problems

  5. Practical Balancing of Priorities • Mix constructed-response formats with selected-response formats to realize the benefits of each • Leverage technology in the scoring of CR items • Rule-based scoring (exhaustively enumerated/constrained) • Natural language processing and subsequent automated rating for written (sometimes spoken) responses • Made more practical with computer-based test delivery

  6. Technology for Automated Scoring • Ten years ago there were relatively few providers • Expensive, proprietary algorithms • Specialized expertise (NLP, LSA, AI) • Laborious, ‘hand-crafted’ engine training • Today solutions are much more ubiquitous • Students fit automated-scoring models in CS and STAT classes • Open-source libraries abound • Machine learning and neural networks are accessible, powerful, and up to the job • Validity and reliability challenges remain • Impact of algorithms on instruction, e.g., in writing? Also the threat of gaming strategies • Managing algorithm improvements and examinee adaptations over time • Quality human scores needed to train the machines (supervised learning) • Biases or other problems in human ratings are ‘learned’ by algorithms • Combining scores from machines and humans

  7. Machine Learning for Automated Essay Scoring Example architecture: Taghipour & Ng (2016) Example characteristics: • Words processed in relation to a corpus for frequency, etc. • N-grams (word pairs, triplets, etc.) • Transformations (non-linear, sinusoidal) and dimensionality reduction • Iterations improving along a gradient, with memory of previous states • Data split into training and validation sets; prediction accuracy on the validation set is maximized; little else is “interpretable” about the parameters (a much simpler feature-based sketch follows below)
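
To make the train/validate workflow concrete, here is a minimal, hedged sketch of a feature-based essay scorer in Python using scikit-learn. It is deliberately not the Taghipour & Ng (2016) neural architecture; it only illustrates the split-train-evaluate loop and the quadratic-weighted-kappa criterion discussed later. The `essays` and `scores` inputs, the 1–6 score range, and all hyperparameters are illustrative assumptions.

```python
# Minimal feature-based automated-essay-scoring baseline (NOT the
# Taghipour & Ng neural architecture): TF-IDF word/bigram features,
# a ridge-regression scorer, and quadratic weighted kappa on a held-out split.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

def train_and_evaluate(essays, scores, min_score=1, max_score=6, seed=0):
    """essays: list of response strings; scores: human ratings (hypothetical inputs)."""
    X_train, X_test, y_train, y_test = train_test_split(
        essays, scores, test_size=0.25, random_state=seed)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)  # words + word pairs
    model = Ridge(alpha=1.0)
    model.fit(vectorizer.fit_transform(X_train), y_train)

    # Round predictions back onto the rubric scale before computing agreement.
    raw = model.predict(vectorizer.transform(X_test))
    pred = np.clip(np.rint(raw), min_score, max_score).astype(int)

    qwk = cohen_kappa_score(y_test, pred, weights="quadratic")
    return model, vectorizer, qwk
```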

  8. Focus of Research • Situating rating process within overall measurement and statistical context • The hierarchical rater model (HRM) • Accounting for multiple ratings ‘correctly’ • Contrast with alternative approaches, e.g., Facets • Simultaneous analysis of human and machine ratings • Example from large-scale writing assessment • Leveraging models of human rating behavior for better simulation, examination of impacts on inferences

  9. Hierarchical Structure of Rated Item Response Data (Patz, Junker & Johnson, 2002) • If all levels follow normal distributions, then Generalizability Theory applies • Estimates at any level weigh the data mean and the prior mean, using the ‘generalizability coefficient’ • If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, the HRM results. HRM levels:
θ_i ~ i.i.d. N(μ, σ²), i = 1, …, N (examinee proficiency)
ξ_ij ~ an IRT model (e.g., PCM), j = 1, …, J, for each i (ideal ratings)
X_ijr ~ a signal detection model, r = 1, …, R, for each i, j (observed ratings)

  10. Hierarchical Rater Model • Raters detect the true item score (i.e., the ‘ideal rating’) with a degree of bias and imprecision • Illustration from the slide: a rater with bias φ_r = −.2 and variability ψ_r = .5 rating a response whose ideal rating is ξ = 3 yields rating probabilities p_33r = .64, p_32r = .08, p_34r = .27 (a small code sketch of this signal detection step follows below)
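
A minimal sketch of the signal detection step, assuming the normal-kernel form of Patz et al. (2002), in which the probability of observed category k is proportional to exp{−(k − (ξ + φ_r))² / (2ψ_r²)}. The sign convention, the treatment of ψ_r as a standard deviation, and the category range are assumptions here, so the resulting probabilities need not match the slide's example exactly.

```python
import numpy as np

def rating_probs(xi, phi_r, psi_r, categories):
    """P(X_ijr = k | ideal rating xi) under a normal-kernel signal detection model:
    probability proportional to exp(-(k - (xi + phi_r))^2 / (2 * psi_r^2)).
    Sign and scale conventions are one common choice (cf. Patz et al., 2002)."""
    k = np.asarray(categories, dtype=float)
    kernel = np.exp(-(k - (xi + phi_r)) ** 2 / (2.0 * psi_r ** 2))
    return kernel / kernel.sum()

# A rater with slight negative bias and moderate noise scoring an ideal rating of 3
# (illustrative values; the slide's exact probabilities depend on its parameterization):
print(rating_probs(xi=3, phi_r=-0.2, psi_r=0.5, categories=range(1, 7)))
```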

  11. Hierarchical Rater Model (cont.) • Examinees respond to items according to a polytomous item response theory model (here PCM; could be GPCM, GRM, others), sketched in code below:

$$P(\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_j) = \frac{\exp\left\{\sum_{k=1}^{\xi} (\theta_i - \beta_j - \gamma_{jk})\right\}}{\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} (\theta_i - \beta_j - \gamma_{jk})\right\}}, \qquad \theta_i \sim \text{i.i.d. } N(\mu, \sigma^2),\ i = 1, \dots, N$$
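
The PCM formula above translates directly into a short function; this is a minimal sketch in which the empty sum for category 0 contributes exp(0) = 1, giving the leading term of the denominator.

```python
import numpy as np

def pcm_probs(theta, beta, gamma):
    """Partial credit model category probabilities for one examinee and one item.
    gamma holds the step parameters gamma_j1..gamma_j(K-1); returns probabilities
    for categories 0..K-1, matching the formula on the slide."""
    gamma = np.asarray(gamma, dtype=float)
    # Cumulative numerators: sum_{k=1}^{h} (theta - beta - gamma_k), with h = 0 giving 0.
    steps = theta - beta - gamma
    numerators = np.concatenate(([0.0], np.cumsum(steps)))
    weights = np.exp(numerators)
    return weights / weights.sum()

# Example: a 5-category item (K = 5, so 4 step parameters); values are illustrative.
print(pcm_probs(theta=0.4, beta=0.0, gamma=[-1.0, -0.3, 0.3, 1.0]))
```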

  12. HRM Estimation • Most straightforward to estimate using Markov chain Monte Carlo • Uninformative priors specified in Patz et al. (2002) and Casabianca et al. (2016) • WinBUGS/JAGS (may be called from within R) • The HRM has also been estimated using maximum likelihood and posterior modal estimation (Donoghue & Hombo, 2001; DeCarlo et al., 2011)

  13. Facets Alternative • Facets (Linacre) models can capture rater effects:

$$P(X_{ijr} = \xi \mid \theta_i, \beta_j, \gamma_j, \lambda_{rj}) = \frac{\exp\left\{\sum_{k=1}^{\xi} (\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk})\right\}}{\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} (\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk})\right\}}$$

where λ_rjk is the effect rater r has on category k of item j. Note: rater effects λ may be constant across all levels of an item, across all items at a given level, or across all levels of all items. Every rater–item combination has a unique ICC (a small shift-based sketch follows below). Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004)
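
Because the rater effect enters the same linear predictor as the step parameters, a Facets-style rating probability can be sketched by shifting the PCM steps. This reuses pcm_probs from the previous sketch and assumes, for simplicity, a single rater severity λ_r applied to every category (one of the constrained cases noted above).

```python
def facets_probs(theta, beta, gamma, lam_r):
    """Facets-style probabilities for a rating by rater r: a rater effect lam_r
    (here constant across categories) simply shifts every PCM step parameter,
    since theta - beta - (gamma_k + lam_r) = theta - beta - gamma_k - lam_r."""
    shifted = [g + lam_r for g in gamma]
    return pcm_probs(theta, beta, shifted)

# A severe rater (lam_r > 0) pushes probability mass toward lower categories:
print(facets_probs(theta=0.4, beta=0.0, gamma=[-1.0, -0.3, 0.3, 1.0], lam_r=0.5))
```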

  14. Dependence structure of Facets models • Ratings are directly related to proficiency • Arbitrarily precise θ estimation is achievable by increasing the number of ratings R • Alternatives (other than the HRM) include: • Rater Bundle Model (Wilson & Hoskens, 2001) • Design-effect-like correction (Bock, Brennan & Muraki, 1999)

  15. Applications & Extensions of HRM • Detecting rater effects and “modality” effects in the Florida assessment program (Patz, Junker & Johnson, 2002) • 360-degree feedback data (Barr & Raju, 2003) • Rater covariates, applied to the Golden State Exam (image vs. paper study) (Mariano & Junker, 2007) • Latent classes for raters, applied to a large-scale language assessment (DeCarlo et al., 2011) • Machine (i.e., automated) and human scoring (Casabianca et al., 2016)

  16. HRM with rater covariates • Introduce a design matrix associating individual raters with their covariates • Bias and variability of ratings vary according to rater characteristics; both are modeled as functions of the covariates (one plausible parameterization is sketched below)
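
The bias and variability formulas did not survive extraction, so the following is only one plausible parameterization in the spirit of Mariano & Junker (2007): rater bias as a linear function of covariates and rater variability through a log link so it stays positive. The design-matrix layout, link functions, and coefficient names are assumptions, not the slide's exact specification.

```python
import numpy as np

def rater_bias_and_variability(Z, coef_bias, coef_logvar):
    """Z: R x p design matrix of rater covariates (e.g., intercept, machine flag).
    Bias phi_r is linear in the covariates; variability psi_r uses a log link.
    This parameterization is an assumption, in the spirit of Mariano & Junker (2007)."""
    Z = np.asarray(Z, dtype=float)
    phi = Z @ np.asarray(coef_bias, dtype=float)             # rater biases phi_r
    psi = np.exp(Z @ np.asarray(coef_logvar, dtype=float))   # rater variabilities psi_r
    return phi, psi

# Two human raters and one machine (columns: intercept, is_machine); values illustrative.
Z = [[1, 0], [1, 0], [1, 1]]
print(rater_bias_and_variability(Z, coef_bias=[0.1, -0.1], coef_logvar=[-0.7, -0.3]))
```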

  17. Application with Human and Machine Ratings • Statewide writing assessment program (provided by CTB) • 5 dimensions of writing (“items”), each scored on a 1–6 rubric • 487 examinees • 36 raters: 18 male, 17 female, 1 machine • Each paper scored by four raters (1 machine, 3 humans) • 9,740 ratings in total (487 examinees × 5 items × 4 raters)

  18. Results by “gender” • Male and female raters show very similar (and on average negligible) bias • The machine is less variable (especially compared with male raters) and more severe (not significant) • Individual rater bias and severity estimates are informative (next slide)

  19. Individual rater estimates may be diagnostic [Figure: estimated bias and variability for each rater; most lenient: r = 11; most harsh and least variable: r = 20 (problematic pattern confirmed); most variable: r = 29]

  20. Continued Research • The HRM presents a systematic way to simulate rater behavior • What ranges of variability and bias are typical? Good? Problematic? • Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.? • What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training? • To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines? • Under what conditions should different (especially more granular) signal detection models be used within the HRM framework?

  21. Quadratic Weighted Kappa

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}, \qquad w_{i,j} = \frac{(i - j)^2}{(N - 1)^2}$$

where O_ij is the observed count in cell (i, j) of the rater-by-rater contingency table, E_ij is the expected count under independence of the marginals, and N is the number of score categories. • Penalizes non-adjacent disagreement more than unweighted kappa or linearly (|i − j|) weighted kappa • Widely used as a prediction accuracy metric in machine learning • Kappa statistics are an important supplement to rates of (exact/adjacent) agreement in operational rating (a direct implementation is sketched below)
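
A direct implementation of the formula above; a minimal sketch assuming two integer rating vectors already coded 0 … N−1 (ratings on a 1–6 rubric would be shifted down by 1 first).

```python
import numpy as np

def quadratic_weighted_kappa(ratings_a, ratings_b, num_categories):
    """Quadratic weighted kappa between two rating vectors on categories 0..num_categories-1.
    O = observed contingency table, E = table expected from the marginals,
    w_ij = (i - j)^2 / (num_categories - 1)^2, kappa = 1 - sum(wO) / sum(wE)."""
    a = np.asarray(ratings_a, dtype=int)
    b = np.asarray(ratings_b, dtype=int)
    K = num_categories

    O = np.zeros((K, K))
    np.add.at(O, (a, b), 1)                                   # observed counts
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()      # expected under independence

    i, j = np.indices((K, K))
    w = (i - j) ** 2 / (K - 1) ** 2                           # quadratic disagreement weights
    return 1.0 - (w * O).sum() / (w * E).sum()
```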

  22. HRM Rater Noise • How does HRM signal detection accuracy impact reliability and agreement statistics for rated items? • Use the HRM to simulate realistic patterns of rater behavior • Example (a simulation sketch follows below): • 10,000 examinees with normally distributed proficiencies • True item scores (ideal ratings) from the PCM/RSM: 10 items, 5 levels per item • Vary the rater variability parameter ψ_r, with rater bias φ_r = 0
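
A minimal sketch of this kind of simulation, reusing pcm_probs, rating_probs, and quadratic_weighted_kappa from the earlier sketches. The step and item parameters, the reduced number of examinees, and the use of two independent raters per response are illustrative assumptions, not the slide's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_raters(n_examinees, item_betas, gamma, phi_r, psi_r):
    """HRM-style data: normal proficiencies -> PCM ideal ratings -> two independent
    raters, each with bias phi_r and variability psi_r (bias fixed at 0 in the
    slide's example). Reuses pcm_probs and rating_probs defined in earlier sketches."""
    n_items = len(item_betas)
    cats = np.arange(len(gamma) + 1)
    theta = rng.normal(0.0, 1.0, size=n_examinees)

    obs1 = np.empty((n_examinees, n_items), dtype=int)
    obs2 = np.empty_like(obs1)
    for i in range(n_examinees):
        for j in range(n_items):
            xi = rng.choice(cats, p=pcm_probs(theta[i], item_betas[j], gamma))
            p = rating_probs(xi, phi_r, psi_r, cats)
            obs1[i, j], obs2[i, j] = rng.choice(cats, p=p, size=2)
    return obs1, obs2

# Illustrative run (fewer examinees than the slide's 10,000 for speed):
# agreement between two simulated raters drops as rater variability psi_r grows.
gamma = [-1.0, -0.3, 0.3, 1.0]                  # illustrative common steps (5 categories)
item_betas = np.linspace(-0.5, 0.5, 10)         # 10 items
for psi in (0.3, 0.5, 0.8):
    r1, r2 = simulate_two_raters(2000, item_betas, gamma, phi_r=0.0, psi_r=psi)
    print(psi, round(quadratic_weighted_kappa(r1[:, 0], r2[:, 0], num_categories=5), 3))
```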

  23. Ideal ratings follow PCM [Figure: distribution of ideal ratings, i.e., ψ_r = 0 (no rater noise)]

  24. Results
