Recognising the error of our ways Dr Paul E. Newton Presentation to the Cambridge Assessment Forum for New Developments in Educational Assessment. Downing College, Cambridge. 10 December 2008.
HOW MANY STATISTICIANS DOES IT TAKE TO CHANGE A LIGHT BULB?
ONE, PLUS OR MINUS THREE!
Other valid responses:
- How many did it take this time last year?
- 3.9967 (after six iterations).
- 75% of the population believe less than four.
- What kind of number did you have in mind?
- Don't bother. Nothing can be inferred from a single light bulb.
- You’d need to use a nonparametric procedure – statisticians are not normal.
- 1-n to change the bulb and n-1 to test its replacement.
- It depends whether the bulb is -vely or +vely screwed.
ONE!
HOW MANY PSYCHICS DOES IT TAKE TO CHANGE A LIGHT BULB?
Francis Ysidro Edgeworth That examination is a very rough, yet not wholly inefficient, test of merit is generally accepted.
What do we mean by ‘error’? Part 1
Variability Whatever precautions have been taken to secure unity of standard, there will occur a certain divergence between the verdicts of competent examiners. Say full marks are thirty; then if one examiner marks 20, another might mark 21, another 19. If we tabulate the marks given by the different examiners, they will tend to be disposed after the fashion of a gendarme’s hat. Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, LI, 599-635.
A gendarme’s hat?
Chapeau de Gendarme
Measurement ‘truth’ This central figure which is, or may be supposed to be, assigned by the greatest number of equally competent judges, is to be regarded as the true value of the Latin prose; just as the true weight of a body is determined by taking the mean of several discrepant measurements. Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, LI, 599-635.
Measurement ‘error’ I think it is intelligible to speak of the mean judgment of competent critics as the true judgment; and deviations from that mean as errors. Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, LI, 599-635.
Reliability and replication
Reliability is about quantifying the luck of the draw. What if the…
- candidate happened to have been in a different state of mind?
- exam happened to have comprised a different set of questions?
- script happened to have been marked by a different marker?
- cut-scores happened to have been set by a different panel?
- etc.
… would the same grade have been awarded?
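The replication idea can be illustrated with a small simulation. This is a minimal sketch, not the method used in the talk: it assumes a classical test theory model (observed score = true score + independent random error), and the score scale, standard deviations and function names are all hypothetical. Under that assumption, the correlation between two replications of the same assessment estimates the reliability, var(true) / (var(true) + var(error)).

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_replications(n_candidates, true_sd, error_sd, seed=0):
    """Give every candidate two independent replications of the same test.

    Assumed model (not from the slides): observed = true + random error.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_candidates)]
    first = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return first, second


# Hypothetical values chosen so that the theoretical reliability is 0.80.
TRUE_SD, ERROR_SD = 10.0, 5.0
theoretical = TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2)

first, second = simulate_replications(20000, TRUE_SD, ERROR_SD)
estimated = pearson(first, second)

print(f"theoretical reliability: {theoretical:.2f}")
print(f"estimated from two replications: {estimated:.2f}")
```

Each "what if" question in the list corresponds to a different source of error; the sketch lumps them into a single error term, which is a simplification.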
What do we know about error? Part 2
The public perception of error? Only limited data have been published about the reliability of national curriculum tests, although it is likely that the reliability of national curriculum tests is around 0.80 – perhaps slightly higher for mathematics and science. Black, P. & Wiliam, D. (2006). The reliability of assessments. In J. Gardner (Ed.). Assessment and learning. London: Sage.
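Test consistency

Key Stage 1 Tests

| Test | Target levels | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spelling | 2,3 | n.a. | 0.94 | 0.97 | 0.95 | 0.97 | 0.95 | 0.94 | 0.89 | 0.92 | 0.92 | - | 0.92 |
| Reading | 2 | n.a. | 0.87 | 0.92 | 0.91 | 0.91 | 0.91 | 0.87 | 0.90 | 0.90 | 0.87 | - | 0.89 |
| Reading | 3 | n.a. | 0.77 | 0.84 | 0.75 | 0.82 | 0.84 | 0.78 | 0.80 | 0.79 | 0.82 | - | 0.76 |
| Mathematics | 2,3 | n.a. | 0.88 | 0.88 | 0.88 | 0.89 | 0.90 | 0.90 | - | - | - | - | - |
| Mathematics | 2 | - | - | - | - | - | - | - | 0.88 | 0.88 | 0.83 | - | 0.85 |
| Mathematics | 3 | - | - | - | - | - | - | - | 0.83 | 0.83 | 0.84 | - | 0.85 |

Key Stage 2 Tests

| Test | Target levels | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reading | 3,4,5 | 0.85 | 0.86 | 0.92 | 0.89 | 0.88 | 0.88 | 0.90 | 0.87 | 0.87 | 0.87 | 0.91 | 0.89 |
| Writing | 3,4,5 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. |
| Spelling | 3,4,5 | 0.91 | 0.90 | 0.92 | 0.92 | 0.91 | 0.89 | 0.90 | 0.90 | 0.90 | 0.91 | 0.91 | 0.89 |
| Handwriting | 3,4,5 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. |
| Mathematics A | 3,4,5 | 0.88 | 0.87 | 0.91 | 0.90 | 0.90 | 0.89 | 0.89 | 0.92 | 0.93 | 0.91 | 0.93 | 0.92 |
| Mathematics B | 3,4,5 | 0.89 | 0.88 | 0.83 | 0.90 | 0.87 | 0.89 | 0.89 | 0.93 | 0.92 | 0.92 | 0.93 | 0.92 |
| Mental mathematics | 3,4,5 | - | - | 0.90 | 0.88 | 0.85 | 0.88 | 0.89 | 0.88 | 0.89 | 0.87 | 0.87 | 0.89 |
| Overall | 3,4,5 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 |
| Science A | 3,4,5 | 0.83 | 0.86 | 0.85 | 0.87 | 0.87 | 0.86 | 0.88 | 0.86 | 0.87 | 0.86 | 0.87 | 0.86 |
| Science B | 3,4,5 | 0.82 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.88 | 0.85 | 0.86 | 0.86 | 0.87 | 0.82 |
| Overall | 3,4,5 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | 0.92 | 0.93 | 0.92 | 0.93 | 0.91 |

Key Stage 3 Tests

| Test | Target levels | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reading | 3,4,5,6,7 | 0.71 | 0.88 | 0.94 | 0.90 | 0.89 | 0.89 | 0.88 | 0.84 | 0.84 | 0.81 | 0.85 | 0.85 |
| Writing | 3,4,5,6,7 | 0.91 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. |
| Shakespeare | 3,4,5,6,7 | - | - | - | - | - | - | - | n.a. | n.a. | n.a. | n.a. | n.a. |
| Mathematics 1 | 3,4,5 | 0.88 | 0.89 | 0.88 | 0.90 | 0.92 | 0.91 | 0.90 | 0.89 | 0.91 | 0.90 | 0.89 | 0.91 |
| Mathematics 2 | 3,4,5 | 0.88 | 0.94 | 0.90 | 0.89 | 0.92 | 0.92 | 0.88 | 0.91 | 0.91 | 0.91 | 0.90 | 0.90 |
| Mathematics 1 | 4,5,6 | 0.86 | 0.81 | 0.84 | 0.86 | 0.85 | 0.85 | 0.87 | 0.84 | 0.86 | 0.88 | 0.86 | 0.88 |
| Mathematics 2 | 4,5,6 | 0.84 | 0.91 | 0.82 | 0.82 | 0.87 | 0.89 | 0.88 | 0.85 | 0.88 | 0.87 | 0.86 | 0.87 |
| Mathematics 1 | 5,6,7 | 0.86 | 0.90 | 0.84 | 0.84 | 0.88 | 0.88 | 0.86 | 0.87 | 0.85 | 0.90 | 0.90 | 0.88 |
| Mathematics 2 | 5,6,7 | 0.88 | 0.87 | 0.85 | 0.83 | 0.88 | 0.91 | 0.88 | 0.88 | 0.88 | 0.89 | 0.90 | 0.87 |
| Mathematics 1 | 6,7,8 | 0.85 | 0.68 | 0.82 | 0.85 | 0.89 | 0.90 | 0.92 | 0.88 | 0.88 | 0.89 | 0.90 | 0.88 |
| Mathematics 2 | 6,7,8 | 0.87 | 0.81 | 0.80 | 0.83 | 0.90 | 0.92 | 0.90 | 0.89 | 0.91 | 0.89 | 0.90 | 0.91 |
| Mental mathematics A | 4,5,6,7,8 | - | - | 0.89 | 0.87 | 0.88 | 0.88 | 0.86 | 0.87 | 0.89 | 0.90 | 0.89 | 0.88 |
| Mental mathematics B | 4,5,6,7,8 | - | - | 0.88 | 0.90 | 0.88 | 0.80 | 0.86 | 0.85 | 0.89 | 0.88 | 0.86 | 0.89 |
| Mental mathematics C | 3,4,5 | - | - | 0.83 | 0.81 | 0.83 | 0.87 | 0.83 | 0.83 | 0.82 | 0.85 | 0.86 | 0.85 |
| Science 1 | 3,4,5,6 | 0.88 | 0.90 | 0.91 | 0.90 | 0.93 | 0.94 | 0.90 | 0.94 | 0.91 | 0.92 | 0.93 | 0.92 |
| Science 2 | 3,4,5,6 | 0.88 | 0.89 | 0.89 | 0.88 | 0.92 | 0.94 | 0.90 | 0.93 | 0.92 | 0.93 | 0.93 | 0.91 |
| Overall | 3,4,5,6 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 |
| Science 1 | 5,6,7 | 0.85 | 0.84 | 0.86 | 0.82 | 0.88 | 0.87 | 0.87 | 0.87 | 0.92 | 0.88 | 0.88 | 0.88 |
| Science 2 | 5,6,7 | 0.85 | 0.85 | 0.86 | 0.88 | 0.87 | 0.86 | 0.87 | 0.88 | 0.90 | 0.90 | 0.90 | 0.91 |
| Overall | 5,6,7 | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | 0.93 | 0.95 | 0.94 | 0.94 | 0.95 |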
Marker consistency
Agreement between markers (n = 9) and Lead Chief Marker

| | English | Reading | Writing |
|---|---|---|---|
| Total marks | 100 | 50 | 50 |
| Levels | N, 3, 4, 5, 6, 7 | B4, 4, 5, 6, 7 | B4, 4, 5, 6, 7 |
| Mean coefficient of correlation (marks) | 0.92 | 0.94 | 0.80 |
| Percentage exact agreement (levels) | 59% | 61% | 52% |
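The two statistics on this slide measure different things, which is why a high correlation can sit alongside modest level agreement. The sketch below shows how each could be computed for one marker against the Lead Chief Marker. All marks and cut-scores are invented for illustration, and the helper names are hypothetical; none of the numbers come from the study reported here.

```python
from bisect import bisect_right


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def mark_to_level(mark, cut_scores):
    """Index of the level band: the number of cut-scores at or below the mark."""
    return bisect_right(cut_scores, mark)


def percent_exact_agreement(levels_a, levels_b):
    """Percentage of scripts given the same level by both markers."""
    same = sum(a == b for a, b in zip(levels_a, levels_b))
    return 100.0 * same / len(levels_a)


# Hypothetical data: eight scripts marked out of 50 by a marker and by the
# Lead Chief Marker, with invented level cut-scores.
cut_scores = [10, 20, 30, 40]
marker_marks = [8, 14, 22, 27, 31, 38, 41, 45]
chief_marks = [9, 12, 19, 28, 33, 36, 44, 45]

correlation = pearson(marker_marks, chief_marks)
agreement = percent_exact_agreement(
    [mark_to_level(m, cut_scores) for m in marker_marks],
    [mark_to_level(m, cut_scores) for m in chief_marks],
)

print(f"correlation (marks): {correlation:.2f}")
print(f"exact agreement (levels): {agreement:.1f}%")
```

In this invented example the marks never differ by more than three, yet one script falls either side of a cut-score and so receives a different level.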
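Level setting consistency (3 to 6 tier)

| Level | Tucker Linear | Confidence Interval: Lower | Confidence Interval: Upper | Final |
|---|---|---|---|---|
| Level 3 | 45 | 42 | 48 | 42 |
| Level 4 | 72 | 70 | 74 | 69 |
| Level 5 | 105 | 103 | 106 | 104 |
| Level 6 | 135 | 133 | 136 | 134 |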
Dylan Wiliam on error […] it is likely that the proportion of students awarded a level higher or lower than they should be because of the unreliability of the tests is at least 30% at key stage 2. Wiliam, D. (2001). Level best? London: ATL.
Overall reliability (parallel forms)
Agreement between performance across test forms

| | English | Reading | Writing |
|---|---|---|---|
| Total marks | 100 | 50 | 50 |
| Levels | B3, 3, 4, 5 | B3, 3, 4, 5 | B3, 3, 4, 5 |
| Classification consistency (two forms) | 73% | 73% | 67% |
| Classification accuracy – rough!! (one form) | 84% | 84% | 79% |
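The difference between the two rows can be made concrete with a simulation. This is a rough sketch under assumptions that are not in the slides: a classical test theory model (observed = true + independent error), a "true level" defined as the level of the true score, and invented score distributions and cut-scores. Consistency asks whether two parallel forms award the same level; accuracy asks whether one form awards the true level.

```python
import random
from bisect import bisect_right


def level(score, cut_scores):
    """Index of the level band: the number of cut-scores at or below the score."""
    return bisect_right(cut_scores, score)


def classification_rates(n_candidates, true_sd, error_sd, cut_scores, seed=0):
    """Return (consistency, accuracy) as proportions under the assumed model."""
    rng = random.Random(seed)
    consistent = 0
    accurate = 0
    for _ in range(n_candidates):
        true_score = rng.gauss(50.0, true_sd)
        form_a = true_score + rng.gauss(0.0, error_sd)
        form_b = true_score + rng.gauss(0.0, error_sd)
        level_a = level(form_a, cut_scores)
        # Consistency: do two error-prone forms award the same level?
        consistent += level_a == level(form_b, cut_scores)
        # Accuracy: does one form award the level of the error-free true score?
        accurate += level_a == level(true_score, cut_scores)
    return consistent / n_candidates, accurate / n_candidates


# Hypothetical parameters (reliability 0.80, three invented cut-scores).
consistency, accuracy = classification_rates(
    20000, true_sd=10.0, error_sd=5.0, cut_scores=[40.0, 50.0, 60.0]
)

print(f"classification consistency (two forms): {consistency:.0%}")
print(f"classification accuracy (one form): {accuracy:.0%}")
```

Accuracy comes out higher than consistency in this model because a comparison of two forms involves two sets of measurement error, whereas a comparison with the true level involves only one. That matches the direction of the difference between the two rows of the table.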
What do we say about error? Part 3
Sometimes we dodge questions The Qualifications and Curriculum Authority said the test was carefully trialled and pre-tested to make sure it was appropriate and stimulating for the age group. Ward, H. (2002). Children exhausted by ‘too wordy’ reading challenge. The TES, 24 May. A QCA spokesman said that all the questions cited were consistent with national curriculum requirements. Shaw, M. (2002). A gender-bending question. The TES, 17 May.
Sometimes we downplay error A Qualifications and Curriculum Authority spokeswoman said: “We are confident that the quality of the marking of tests is robust.” Mansell, W. (2003). Row over test marks at 14. The TES, 11 July.
Occasionally ‘inevitable’ “It was a proof-reading error on our part.” said a spokesman for the authority. “We make no excuses and this error should not have happened, but we have made sure no students suffer as a result.” Mistakes are inevitable in an examinations system which deals with 18 million papers a year, says the QCA. Hook, S. (2002). Anger at blunder in key skills paper. The TES, 24 May.
Occasionally ‘unacceptable’ However, any level of error has to be unacceptable – even just one candidate getting the wrong grade is entirely unacceptable for both the individual student and the system. QCA. (2003). A level of preparation. TES Insert. The TES, 4 April.