Changes in Test Scores with Multiple Sittings of CanTEST
Philip Nagy

Rationale
Research Questions
• Do test scores change on repeating the test?
• Is change related to length of time between sittings?
Test Development Questions
• Can data from repeaters be used in test calibration for new form development?
Context: Receptive Skills
Official Languages and Bilingualism Institute

The Data
Listening Tests: Six forms with 15 short and 25 long passage items
Reading Tests: Seven forms with 15 skim-and-scan, 20 reading passage, and 25 cloze items
The Sample: Mean first score of 3.6, compared to 4.3 for those who write only once
Assumptions
• Difficulty of forms is balanced across sittings (true)
• Samples writing each form are equivalent (untested)

Listening Results: Sitting 2 minus Sitting 1 (N=179)
Change in Raw Score   Total Test (40)   Short Passages (15)   Long Passages (25)
Down >11              3                 -                     1
Down 6 to 10          18                2                     11
Down 3 to 5           18                24                    22
Same ± 2              43                91                    72
Up 3 to 5             42                42                    46
Up 6 to 10            36                20                    24
Up >11                19                -                     3

Listening Results, another look
Change in Raw Score   Total Test (40)    Short Passages (15)   Long Passages (25)
Down some             22%                15%                   19%
About the same        24%                51%                   40%
Up some               54%                34%                   41%
Mean raw gain         2.6                1.3                   1.3
Mean % gain           6.5% of 40 items   8.8% of 15 items      5.2% of 25 items
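The collapsed percentages can be recomputed from the frequency counts on the previous slide. The sketch below is illustrative only: the names `counts` and `collapse` are hypothetical, and blank table cells are assumed to be zero, placed so that each column sums to N=179.

```python
# Listening change counts (sitting 2 minus sitting 1, N = 179), ordered
# from the largest drop to the largest gain. Assumption: 0 stands in for
# the cells the frequency table leaves blank.
counts = {
    "total": [3, 18, 18, 43, 42, 36, 19],
    "short": [0, 2, 24, 91, 42, 20, 0],
    "long": [1, 11, 22, 72, 46, 24, 3],
}


def collapse(bands):
    """Collapse seven change bands into percent (down, same, up)."""
    n = sum(bands)
    down, same, up = sum(bands[:3]), bands[3], sum(bands[4:])
    return tuple(round(100 * x / n, 1) for x in (down, same, up))


for name, bands in counts.items():
    print(name, collapse(bands))
```

The results land within a percentage point of the rounded figures on the slide.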

Listening Results Interpretation
How important is the improvement?
• On average, 3.6 points needed out of 40 to improve one band
• So, 2.6 points is about 75% of a band improvement

Listening Results Interpretation
Can the data be used for test calibration?
• The changes in average item difficulty are different for the subtests
  • .088 for short passages
  • .052 for long passages
• The difference of .036 (.088 - .052) is about the same as the standard error of the difficulty indices
• Listening data from repeaters should not be used for item calibration
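The slides do not say how the standard error of the difficulty indices was obtained. As a rough sketch only, assuming a difficulty index is a proportion correct, that all 179 repeaters are pooled, and that a typical item has difficulty near 0.5, the binomial standard error can be compared with the .036 difference. The function name `proportion_se` and those three assumptions are illustrative, not taken from the source.

```python
import math


def proportion_se(p, n):
    """Binomial standard error of a proportion-correct difficulty index."""
    return math.sqrt(p * (1 - p) / n)


# Changes in average item difficulty reported for the two listening subtests.
shift_short, shift_long = 0.088, 0.052
diff = shift_short - shift_long

# Assumptions (not from the slides): N = 179 pooled repeaters, p = 0.5.
se = proportion_se(0.5, 179)
print(f"difference = {diff:.3f}, SE = {se:.3f}, ratio = {diff / se:.2f}")
```

Under these assumptions the ratio comes out close to 1, in line with the statement that the difference is about one standard error. A smaller per-form sample would give a larger standard error.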

Changes in Listening by Length of Time between Sittings
Time Between Tests    Total Test   Short Passages   Long Passages
> 6 months (N=63)     +2.13        +0.63 ¹          +1.49
< 6 months (N=116)    +2.87        +1.18            +1.69 ¹
¹ Difference significant, p=0.05
Those who repeat sooner do better than those who repeat later
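As a consistency check, the two subgroup means for the total test can be pooled, weighting by group size, and compared with the overall mean raw gain of 2.6 reported earlier. This is a minimal sketch; the variable names are hypothetical.

```python
# Mean total-score gain and group size by time between sittings.
groups = {
    "more than 6 months": (63, 2.13),
    "less than 6 months": (116, 2.87),
}

n_total = sum(n for n, _ in groups.values())
# Size-weighted mean of the subgroup gains.
pooled = sum(n * gain for n, gain in groups.values()) / n_total
print(n_total, round(pooled, 2))
```

The group sizes add up to the full listening sample, and the pooled gain agrees with the overall figure to one decimal place.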

Reading Results: Sitting 2 minus Sitting 1 (N=284)
Change in Raw Score   Total (80)   Skim-&-Scan (15)   Passage (20)   Cloze (25)
Down 21 or more       17           -                  -              -
Down 11 to 20         19           2                  -              12
Down 6 to 10          21           12                 18             32
Down 3 to 5           28           32                 30             34
Same score ± 2        46           139                142            106
Up 3 to 5             33           65                 63             52
Up 6 to 10            47           31                 23             36
Up 11 to 20           48           3                  8              12
Up 21 or more         25           -                  -              -
Note: The reading passage score is doubled to give a total out of 80 rather than 60.
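Because some cells in this table are blank, it helps to confirm that every column still accounts for all 284 repeaters. The sketch below is one way to check this, with `None` marking a blank cell. The dictionary layout is an assumption made for illustration.

```python
N = 284

# Counts per change band, from the largest drop to the largest gain.
# None marks a cell left blank in the table.
table = {
    "total": [17, 19, 21, 28, 46, 33, 47, 48, 25],
    "skim_scan": [None, 2, 12, 32, 139, 65, 31, 3, None],
    "passage": [None, None, 18, 30, 142, 63, 23, 8, None],
    "cloze": [None, 12, 32, 34, 106, 52, 36, 12, None],
}

# Treat blanks as zero and compare each column total with N.
column_sums = {name: sum(c or 0 for c in col) for name, col in table.items()}
for name, total in column_sums.items():
    print(name, total, total == N)
```

Each column summing to 284 is what pins down which subtest a blank cell belongs to in the "Down 11 to 20" row.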

Reading Results, another look
Change in Raw Score   Total (80)   Skim-&-Scan (15)   Reading Passage (20)   Cloze Passage (25)
Down some             30%          16%                17%                    27%
About the same        16%          49%                50%                    37%
Up some               54%          35%                33%                    35%

Reading Results Interpretation
How important is the improvement?
• On average, 6.5 points needed (out of 80) to improve one band
• So, 3.45 points is about 55% of a band improvement
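The band fractions for both skills are a simple ratio of mean gain to points per band. A short sketch using the figures quoted on the interpretation slides (the helper name `band_fraction` is hypothetical):

```python
def band_fraction(mean_gain, points_per_band):
    """Express a mean raw-score gain as a fraction of one band."""
    return mean_gain / points_per_band


# Listening: 2.6-point mean gain, 3.6 points per band (out of 40).
listening = band_fraction(2.6, 3.6)
# Reading: 3.45-point mean gain, 6.5 points per band (out of 80).
reading = band_fraction(3.45, 6.5)

print(f"listening: {listening:.0%} of a band")
print(f"reading: {reading:.0%} of a band")
```

The exact ratios fall slightly below the rounded "about 75%" and "about 55%" quoted on the slides.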

Reading Results Interpretation
Can the data be used for test calibration?
• The changes in average item difficulty are different for the subtests
  • +0.072 for skim-and-scan
  • +0.050 for reading passages
  • +0.002 for cloze
• The largest difference of .070 (.072 - .002) is two to three times larger than the standard error of the difficulty indices
• Reading data from repeaters should not be used for item calibration
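With three subtests there are three pairwise differences to consider. The sketch below finds the largest one and backs out the range of standard errors implied by "two to three times". The slide does not state the standard error itself, so the derived range is an inference, and the variable names are hypothetical.

```python
from itertools import combinations

# Changes in average item difficulty by reading subtest.
shifts = {"skim_scan": 0.072, "passage": 0.050, "cloze": 0.002}

# Absolute difference for every pair of subtests.
gaps = {
    (a, b): abs(shifts[a] - shifts[b])
    for a, b in combinations(shifts, 2)
}
largest_pair = max(gaps, key=gaps.get)
largest = gaps[largest_pair]

# If the largest gap is 2 to 3 standard errors, the SE lies in this range.
se_low, se_high = largest / 3, largest / 2
print(largest_pair, round(largest, 3), round(se_low, 3), round(se_high, 3))
```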

Changes in Reading by Length of Time between Sittings
Time Between Tests    Total (80)   Skim-&-Scan   Reading Passage   Cloze Passage
> 6 months (N=105)    -0.119       -0.292 ¹      -0.017            -0.079
< 6 months (N=179)    +0.070       +0.171 ¹      +0.010            +0.046
¹ Difference significant, p=0.05
Those who repeat later actually do worse than those who repeat sooner

Conclusion
• Listening
  • 30% of sample do more poorly on 2nd sitting
  • Average gain is 75% of a band score
  • Differences in gains across item types vary by an item standard error
• Reading
  • 40% of sample do more poorly on 2nd sitting
  • Average gain is 55% of a band score
  • Differences in gains across item types vary by 2-3 times an item standard error
• Both
  • Those who rewrite within six months do better
  • Data from repeaters should not be used for item calibration