Developing Automated Scoring for Large-scale Assessments of Three-dimensional Learning
Jay Thomas 1, Ellen Holste 2, Karen Draney 3, Shruti Bathia 3, and Charles W. Anderson 2
1. ACT, Inc.  2. Michigan State University  3. UC Berkeley, BEAR Center
Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014)
• Need assessment tasks with multiple components to get at all 3 dimensions (C 2-1)
• Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)
• Traditional selected-response items cannot assess the full breadth and depth of NGSS
• Technology can address some of the problems
  • Particularly scalability and cost
Example of a Carbon TIME Item
Comparing FC vs. CR vs. Both
• Compare spread of data
• Adding CR (or using CR only) increases the confidence that we have classified students correctly
• Since explanation is a practice that we focus on in the LP, CR is required to assess the construct fully
Recursive Feedback Loops for Item Development Using WEW
[Flow diagram: Item Development → WEW (Rubric) Development → Students respond to Items → Human scoring to create the ML training set → Machine Learning (ML) Models → Computer scoring using the ML model, with a QWK Check for coding Reliability and a Backcheck by the larger research group (human) → Psychometric Analysis (IRT, WLE) → Interpretation. Legend: processes moving towards the final interpretation; feedback loops that indicate that a question, rubric, or coding potentially has a problem that needs to be addressed.]
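The QWK check in the diagram is a quadratic weighted kappa computed between human and machine scores on a back-check sample. A minimal sketch of such a check; the score arrays and the 0.7 flagging threshold are illustrative assumptions, not the project's actual data or criteria:

```python
# Minimal sketch of a QWK (quadratic weighted kappa) reliability check
# between human and machine scores. The scores and the 0.7 threshold
# below are illustrative assumptions, not the project's actual values.
from sklearn.metrics import cohen_kappa_score

human_scores = [1, 2, 3, 2, 4, 1, 3, 3, 2, 4]    # human-coded LP levels (hypothetical)
machine_scores = [1, 2, 3, 3, 4, 1, 2, 3, 2, 4]  # machine-predicted LP levels (hypothetical)

qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")

# If agreement falls below the threshold, the item, rubric, or coding
# goes back through the feedback loop for revision.
if qwk < 0.7:
    print("QWK below threshold: revisit the question, rubric, or coding")
```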
As of March 6, 2019

Consequences of using machine scoring
• Item revision and improvement
• Increase in the size of the usable data set to increase power of statistics
• Increased confidence in reliability of scoring through back-checking samples and revising models
• Reduced costs by needing fewer human coders
• Model to show that the kinds of assessments envisioned by Pellegrino et al. (2014) for NGSS can be reached at scale with low cost

Responses scored by school year:
School Year | Responses Scored
2015-16     | 175,265
2016-17     | 532,825
2017-18     | 693,086
2018-19     | 227,041
TOTAL       | 1,628,217

Cost savings and scalability:
Labor hours needed to human score responses @ 100 per hour        | 16,282.17 hours
Labor cost per hour (undergraduate students including misc. costs) | $18 per hour
Cost to human score all responses                                   | $293,079
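The cost figures follow directly from the response totals in the table above; a quick arithmetic check:

```python
# Reproduce the cost-savings arithmetic from the table above.
total_responses = 1_628_217
responses_per_hour = 100      # assumed human scoring rate from the slide
labor_cost_per_hour = 18      # USD per hour, undergraduate coders incl. misc. costs

labor_hours = total_responses / responses_per_hour   # ~16,282.17 hours
total_cost = labor_hours * labor_cost_per_hour       # ~$293,079

print(f"Labor hours: {labor_hours:,.2f}")
print(f"Cost to human score all responses: ${total_cost:,.0f}")
```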
Types of validity evidence
• As taken from the Standards for Educational and Psychological Testing, 2014 ed.
• Evidence based on test content
• Evidence based on response processes
• Evidence based on internal structure
• Evidence based on relations to other variables
  • Convergent and discriminant evidence
  • Test-criterion evidence
• Evidence for validity and consequences of testing
Comparison of interviews and IRT analysis results
• Comparison of scoring for one written versus interview item
• Overall Spearman rank correlation = 0.81, p < 0.01, n = 49
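The written-versus-interview comparison reduces to a rank correlation between the two sets of paired scores. A minimal sketch using scipy; the paired scores are hypothetical placeholders, not the study's actual n = 49 data:

```python
# Minimal sketch of the written-vs-interview score comparison using a
# Spearman rank correlation. The scores below are hypothetical.
from scipy.stats import spearmanr

written_scores = [2, 3, 1, 4, 2, 3, 4, 1, 2, 3]    # LP levels from written item (hypothetical)
interview_scores = [2, 3, 2, 4, 2, 3, 3, 1, 2, 4]  # LP levels from interviews (hypothetical)

rho, p_value = spearmanr(written_scores, interview_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```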
Evidence based on internal structure
• Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models)
• Provide item and step difficulties and person proficiencies on one scale
• Provide comparisons of step difficulties within items
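For reference, the textbook partial credit model (Masters, 1982) gives the probability that person $n$ scores in category $k$ of item $i$ in terms of the person proficiency $\theta_n$ and the item step difficulties $\delta_{ij}$; this is the standard formulation, not a project-specific parameterization:

```latex
P(X_{ni} = k \mid \theta_n) =
  \frac{\exp\!\left(\sum_{j=1}^{k} (\theta_n - \delta_{ij})\right)}
       {\sum_{m=0}^{K_i} \exp\!\left(\sum_{j=1}^{m} (\theta_n - \delta_{ij})\right)},
\qquad k = 0, 1, \dots, K_i,
\quad \text{with } \sum_{j=1}^{0} (\theta_n - \delta_{ij}) \equiv 0 .
```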
Step difficulties for each item: 2015-16 data
Classifying Students into LP levels: Comparing FC to EX + FC
Classifying Students into LP levels: Comparing EX to EX + FC
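Classification into LP levels can be viewed as applying cut points to the estimated proficiencies (WLEs) from the IRT analysis. A minimal sketch of that step; the cut points and WLE values are hypothetical, not the project's calibrated thresholds:

```python
# Minimal sketch of classifying students into LP levels by applying cut
# points to estimated proficiencies (WLEs). All values are hypothetical.
import bisect

lp_cut_points = [-1.0, 0.0, 1.5]        # hypothetical boundaries between levels 1|2, 2|3, 3|4 (logits)
student_wles = [-1.8, -0.3, 0.7, 2.1]   # hypothetical WLE estimates (logits)

for wle in student_wles:
    level = bisect.bisect_right(lp_cut_points, wle) + 1   # yields levels 1..4
    print(f"WLE = {wle:+.1f}  ->  LP level {level}")
```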
Classifying Classroom Data
95% confidence intervals: average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations)
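The per-teacher intervals summarized here are standard 95% confidence intervals around mean learning gains (posttest minus pretest). A minimal sketch, assuming hypothetical gain scores for one teacher's students:

```python
# Minimal sketch of a 95% confidence interval for one teacher's average
# learning gain (posttest minus pretest). The gain scores are hypothetical.
import numpy as np
from scipy import stats

gains = np.array([0.4, 0.1, 0.7, 0.3, 0.5, 0.2, 0.6, 0.4, 0.0, 0.5,
                  0.3, 0.6, 0.2, 0.4, 0.5])   # one teacher, at least 15 students

mean_gain = gains.mean()
sem = stats.sem(gains)                         # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(gains) - 1, loc=mean_gain, scale=sem)
print(f"Mean gain = {mean_gain:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```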
Questions?
• Contact info
  • Jay Thomas: jay.thomas@act.org
  • Karen Draney: kdraney@berkeley.edu
  • Andy Anderson: andya@msu.edu
  • Ellen Holste: holste@msu.edu
  • Shruti Bathia: shruti_bathia@berkeley.edu