Integrated Profile Method: An Innovative Approach in Standard Setting
National Conference on Student Assessment
Dr. Liru Zhang, liru.zhang@doe.k12.de.us
Theresa Bennett, Theresa.bennett@doe.k12.de.us
June 27-29, 2018
Standard Setting for Performance Assessments
• Standard setting is a critical step in the design of high-stakes testing programs (Kane, 2001). With advancing technology and the increasing use of performance assessments, standard setting methods have evolved over the past decades to meet new challenges, such as verifying the defensibility of cut scores and providing evidence to validate the process (Zieky, 2001).
• For battery tests and performance tasks (e.g., constructed-response (CR) items, direct writing, and scientific simulation), multiple scores are usually reported to capture all aspects of an examinee's performance. Together, those scores form a profile.
• Judgmental policy capturing (Jaeger, 1994) is a judgment-centered method in which panelists review whole score profiles and classify each profile as pass or fail.
Profile Approach
• The profile approach has been used to set performance standards for the Kindergarten Readiness Assessment in GA (Donahue et al., 2000), the California Alternate Performance Assessment (Morgan, 2004), and the redesigned AP tests (Morgan et al., 2015).
• In practice, the profile-based approach presents student performance visually as a profile to facilitate panelists' review and evaluation. In some cases, multiple profiles are arranged by raw score into an ordered profile packet, in which panelists place bookmarks for the achievement levels.
• With the profile approach, a decision strategy (compensatory or non-compensatory, conjunctive or disjunctive) must be chosen for decision making.
Integrated Profile Method (1)
• The Integrated Profile Method (IPM) is an innovative approach to setting cut scores for performance-based assessments with a limited number of tasks.
• A performance task (PT) is typically designed to measure different components or dimensions, defined by content categories at various levels of cognitive complexity. A PT therefore often yields multiple scores.
• Those scores may or may not be aggregated, depending on the test design and the scoring process.
Integrated Profile Method (2)
• The hybrid model is implemented through a two-phase procedure.
• In Phase I, the participants review, component by component, a sample of responses that spans the full performance continuum and identify the minimally acceptable score for each component using a non-compensatory approach.
• In Phase II, the participants review the overall performance on each task to distinguish proficient from non-proficient profiles using a compensatory approach.
Two examples follow.
Two Examples
Assessment A (two tasks)
- T1: Comp. 1a, Comp. 1b, Comp. 1c
- T2: Comp. 2a, Comp. 2b, Comp. 2c
Assessment B (one task)
- Component 1
- Component 2
- Component 3
Integrated Profile Method (3)
The notion of replicability is central to standard setting regardless of the specific context or standard setting method (AERA, APA, NCME, 1999).
• The IPM is designed to be replicated with multiple groups to examine the generalizability of the performance standards (or cut scores) and to provide validity evidence for the standard setting process.
• The IPM employs a two-phase design that enables the participants to establish the decision rules more efficiently.
• The IPM uses a two-round process in each phase, giving participants sufficient time for individual review and multiple opportunities to discuss their professional judgments before making decisions.
An Application of IPM
• The redesigned SAT Essay measures how well students understand a passage and use it as the basis for a well-written, thought-out discussion. In operation, examinees read a passage (topic) adapted from previous publications, analyze its information and evidence, and write an essay within 50 minutes.
• The quality of an essay is evaluated in three categories, and each rater awards 1-4 points in each category. Examinees receive three non-aggregated dimension scores for reading, analysis, and writing, each ranging from 2 to 8 points across the two raters.
• Because hand scoring is inconsistent between raters and the narrow raw-score scale (2-8) leads to misclassifications, only two achievement levels, Proficient and Non-Proficient, are reported.
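The score arithmetic on this slide can be illustrated with a minimal sketch. It assumes that each dimension score is the sum of the two raters' 1-4 ratings, which is what the 2-8 range suggests but is not stated outright in the slides. The names `EssayProfile` and `combine_raters` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass

DIMENSIONS = ("reading", "analysis", "writing")


@dataclass(frozen=True)
class EssayProfile:
    """Three non-aggregated dimension scores, each on the 2-8 scale."""
    reading: int
    analysis: int
    writing: int


def combine_raters(rater_a: dict, rater_b: dict) -> EssayProfile:
    """Assumption: a dimension score is the sum of two raters' 1-4 ratings."""
    for ratings in (rater_a, rater_b):
        for dim in DIMENSIONS:
            if not 1 <= ratings[dim] <= 4:
                raise ValueError(f"{dim} rating must be between 1 and 4")
    return EssayProfile(**{dim: rater_a[dim] + rater_b[dim] for dim in DIMENSIONS})


# Made-up ratings for one essay, for illustration only.
profile = combine_raters(
    {"reading": 3, "analysis": 2, "writing": 3},
    {"reading": 2, "analysis": 2, "writing": 3},
)
print(profile)  # EssayProfile(reading=5, analysis=4, writing=6)
```

Keeping the three scores as separate fields mirrors the non-aggregated reporting described above.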
The Essay
Achievement Level Descriptors
Achievement Level Descriptor – Level 1
Achievement Level Descriptor – Level 2
Achievement Level Descriptor – Level 3
Achievement Level Descriptor – Level 4
SAT Essay Standard Setting
On February 23-24, 2017:
• Educators representing 15 districts and charters engaged in standard setting
• Administrators from 6 districts and charters served as observers to give feedback on the process
• Institutions of higher education were included in the process
• Literacy Cadre Instructional Coaches served as content-expert facilitators to ensure that content aligned to the standards
Roles and Responsibilities of Table Leaders
• Made sure all participants completed a confidentiality agreement
• Facilitated both small-group and large-group discussions
• Made sure participants understood the task and stayed on task
• Documented the results of each phase of standard setting
• Disseminated and collected all materials
• Gave feedback on the process
• Reported breaches of confidentiality
Process and Results (1)
In 2016, Delaware adopted the SAT as its high-school assessment. Essay scores are used as a supplemental indicator for high-stakes accountability. The two-day standard setting was held in February 2017.
• The panel consisted of 26 participants from public schools and higher education. The majority were classroom teachers with expertise in writing at the high-school level. The participants were assigned to four groups of 6-7 members each.
• A sample of 141 essays was randomly selected, based on observed profiles, from grade 11 on the 2016 School Day. The essay sample was split into four packages of 30-35 essays each, with 5-7 overlapping essays.
• Three half-day trainings were designed to fit the needs: the Group Leader training, the Phase I training, and the Phase II training.
Standard Setting (2)
• In Phase I, the panelists built a better understanding of the dimension scores, the scoring rubric, and student performance through a careful review of each dimension of the essays. From the first round to the second, the range of minimally acceptable dimension scores narrowed noticeably (Table 1). The median of the panel ratings for reading, analysis, and writing changed from 4, 4, 4 to 5, 4, 5, respectively.
• The identified dimension scores served as the starting points in Phase II, which reduced the number of profiles to consider and made the process more efficient. The review of overall essay quality helped panelists comprehend the uniqueness of each dimension, as well as the dimensions' connections and contributions to quality writing. The panel focused on borderline performance and meaningful profiles to arrive at the decision rules (Table 2).
Summary of Phase I
Cells show N (%) of panelists.

                   Round One                            Round Two
Dimension Score    Reading     Analysis    Writing      Reading     Analysis    Writing
3                  0 (0.0)     6 (23.1)    0 (0.0)      0 (0.0)     4 (15.4)    0 (0.0)
4                  14 (53.8)   14 (53.8)   16 (61.5)    5 (19.2)    18 (69.2)   12 (46.2)
5                  9 (34.6)    5 (19.2)    9 (34.6)     21 (80.0)   4 (15.4)    14 (53.8)
6                  3 (11.5)    1 (3.8)     1 (3.8)      0 (0.0)     0 (0.0)     0 (0.0)
Median (range)     4 (4-6)     4 (3-6)     4 (4-6)      5 (4-5)     4 (3-5)     5 (4-5)
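The median and range row of the Phase I summary can be recomputed from the panelist counts. The sketch below is illustrative only: the counts are copied from the summary, while the dictionary layout and the helper name `summarize` are assumptions made for this example.

```python
from statistics import median

# Number of panelists choosing each minimally acceptable score,
# copied from the Phase I summary ({score: count}).
ROUND_ONE = {
    "reading": {3: 0, 4: 14, 5: 9, 6: 3},
    "analysis": {3: 6, 4: 14, 5: 5, 6: 1},
    "writing": {3: 0, 4: 16, 5: 9, 6: 1},
}
ROUND_TWO = {
    "reading": {3: 0, 4: 5, 5: 21, 6: 0},
    "analysis": {3: 4, 4: 18, 5: 4, 6: 0},
    "writing": {3: 0, 4: 12, 5: 14, 6: 0},
}


def summarize(counts: dict) -> tuple:
    """Expand a {score: count} table into ratings; return (median, low, high)."""
    ratings = [score for score, n in sorted(counts.items()) for _ in range(n)]
    return median(ratings), min(ratings), max(ratings)


for label, table in (("Round One", ROUND_ONE), ("Round Two", ROUND_TWO)):
    for dimension, counts in table.items():
        mid, low, high = summarize(counts)
        print(f"{label} {dimension}: median {mid:g} ({low}-{high})")
```

With 26 panelists the median is the mean of the 13th and 14th ordered ratings, which coincide in every column here, so the result is a whole score point.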
Impact Data in Phase I

         Reading                   Analysis                  Writing
Score    N       c%      Impact    N       c%      Impact    N       c%      Impact
2        427     5.7               2321    31.2              691     9.3
3        786     16.3              1402    50.0              911     21.5
4        2113    44.7              1735    73.3    50%       2029    48.8
5        1848    69.5    55%       1082    87.9              1594    70.2    51%
6        1794    93.6              657     96.7              1802    94.4
7        376     98.7              186     99.2              324     98.8
8        99      100.0             60      100.0             92      100.0
Total    7443                      7443                      7443
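The cumulative percentages (c%) and impact figures in the Phase I impact data can be reproduced from the score frequencies. This is a minimal sketch: the frequencies are copied from the table, and it assumes that "impact" means the percentage of examinees scoring at or above the cut, which matches the reported 55%, 50%, and 51%. The function names are hypothetical.

```python
# Score frequencies (N) copied from the Phase I impact data.
READING = {2: 427, 3: 786, 4: 2113, 5: 1848, 6: 1794, 7: 376, 8: 99}
ANALYSIS = {2: 2321, 3: 1402, 4: 1735, 5: 1082, 6: 657, 7: 186, 8: 60}
WRITING = {2: 691, 3: 911, 4: 2029, 5: 1594, 6: 1802, 7: 324, 8: 92}


def cumulative_percent(freq: dict) -> dict:
    """Percentage of examinees scoring at or below each score point."""
    total = sum(freq.values())
    running = 0
    result = {}
    for score in sorted(freq):
        running += freq[score]
        result[score] = 100 * running / total
    return result


def impact(freq: dict, cut: int) -> float:
    """Assumed definition: percentage of examinees at or above the cut."""
    total = sum(freq.values())
    return 100 * sum(n for score, n in freq.items() if score >= cut) / total


# Round Two median cuts from Phase I: reading 5, analysis 4, writing 5.
for name, freq, cut in (("Reading", READING, 5),
                        ("Analysis", ANALYSIS, 4),
                        ("Writing", WRITING, 5)):
    print(f"{name}: cut {cut}, impact {impact(freq, cut):.0f}%")
```

Under this definition the impact at a cut equals 100 minus the c% of the score point just below it, for example 100 - 44.7 for reading at a cut of 5.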
Summary of Phase II Round One Round Two Condition 1 Condition 2 Sum N (%) Condition 1 Condition 2 Sum N (%) DS ≥ 3 DS ≥ 3 13, 13-14 7 (.27) 13 8 (.31) one DS ≥ 3 DS ≥ 3 13, 12-14 5 (.19) 13-14 6 (.23) condition 11, 12 AS ≥ 3 DS ≥ 3 5 (.19) 14 2 (.08) 11-12 Other or multiple Other or multiple 12-14 9 (.35) 14 10 (.38) conditions conditions DS ≥ 3 13-14 21
Impact Data in Phase II
Descriptive Statistics for 2016 SAT on School Day

Score       N       Minimum   Maximum   Mean    SD
Reading     7443    2         8         4.7     1.305
Analysis    7443    2         8         3.6     1.463
Writing     7443    2         8         4.6     1.393
HES         7443    6         24        12.9    3.817

• Sum ≥ 14 and Dimension Scores ≥ 3: 42% Proficient
• Sum ≥ 13 and Dimension Scores ≥ 3: 49% Proficient
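The two candidate rules combine a non-compensatory floor on every dimension score with a compensatory threshold on the sum. A minimal sketch of such a rule follows. The function name `is_proficient` and its parameters are hypothetical, and the default thresholds simply mirror the "Sum ≥ 13 and Dimension Scores ≥ 3" rule listed above.

```python
def is_proficient(reading: int, analysis: int, writing: int,
                  min_sum: int = 13, min_dimension: int = 3) -> bool:
    """Classify a profile of three dimension scores (each 2-8).

    Non-compensatory part: every dimension must reach min_dimension.
    Compensatory part: the sum must reach min_sum, so a strong
    dimension can offset a weaker one that is still above the floor.
    """
    scores = (reading, analysis, writing)
    return min(scores) >= min_dimension and sum(scores) >= min_sum


print(is_proficient(5, 4, 5))              # True: sum 14, all dimensions >= 3
print(is_proficient(6, 2, 6))              # False: analysis is below the floor
print(is_proficient(5, 3, 5))              # True: sum 13 meets the threshold
print(is_proficient(5, 3, 5, min_sum=14))  # False under the stricter sum rule
```

The profile (6, 2, 6) shows why the floor matters: its sum of 14 would pass a purely compensatory rule even though one dimension sits at the bottom of the scale.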
Two-Year Impact Data

               2016                              2017
Test Form      N       Mean     SD      Prof.    N       Mean     SD      Prof.
Major Forms    8041    12.7     3.84    47       8566    13.17    3.46    55
Form 1         204     11.01    3.13    28       235     12.40    3.84    43
Form 2         394     9.83     3.21    17       550     10.21    3.18    19
Form 3         7443    12.9     3.82    49       7671    13.46    3.35    58
Form 4         -       -        -       -        110     9.88     2.98    16