Screening Common Items for IRT Equating
Yi Du, Ph.D.
Data Recognition Corporation
Presentation at the 41st National Conference on Student Assessment, June 20, 2011

Overview
● Most states use equating to maintain a common test scale across years
● Common-item equating designs help bring the new items onto the reference scale
● The quality of equating results (proficiency estimates) is affected by the across-year stability of the common items

Selecting Common Items
Historically, common items were selected to be representative of the total test forms in:
● Content
  ● e.g., proportionally match the test specifications
● Statistical characteristics
  ● e.g., item difficulty and the location and spread of item difficulty
Sinharay and Holland (2007)
● Recently suggested that the best common items may come from the middle of the test difficulty range
Equating Using Common Items
● Common items should be interspersed throughout the test
  ● Avoid very early and late item positions
● Common items should be placed in the same relative item position across usages
● The number of common items should be a predetermined ratio of the total items
  ● e.g., 20–30% of total test items

Unstable Common Items
Construct-irrelevant factors:
● Test security
● Overexposure
● Item revisions
● Test booklet (change in format/layout)
● Environmental and societal change
● Test population changes
Construct-relevant factors:
● Curriculum/instruction changes

Unstable Anchors and Test Score Validity
● If common items are differentially difficult across administrations, item drift exists
● Using common items with item drift can result in questionable equating results
● Mechanisms are required to detect item drift during the equating process
● Consider removing unstable items from the anchor set
Today's Presentation
Many assessment programs implement quantitative and qualitative procedures for screening common items.
This presentation will:
● Summarize popular item screening procedures
● Show results from select screening procedures
● Discuss strategies for selecting and implementing different screening procedures

Screening Procedures
● Classical screening procedures
● IRT-related procedures
● Expert judgment

Classical Screening Techniques
● P-value cut-off criterion
● Delta plot method (Angoff, 1972, 1982; Dorans & Holland, 1993)
● DIF methods
  ● Mantel-Haenszel chi-square statistics (Holland, 1985; Holland & Thayer, 1988)
P-Value Cut-Off Procedures
● Computes the differences in common-item p-values (item difficulty) across two test administrations
● Unstable items are removed from the anchor set based on fixed rules (usually established a priori)
● A popular procedure for screening common items
● The method may not accurately control for Type I error (Harris, 1993; Miller, Rotou, & Twing, 2004)

Delta Plot Procedures
● Introduced by Angoff (1972)
● Measures differences based on item-by-group interaction
● Estimates are accurate when all items are equal in discrimination
● Modified by Dorans and Holland (1993)

Dorans and Holland (1993) Delta Plots
● Item p-values are converted to an interval scale using inverse normal deviates (z's)
● The following linear transformation is then applied:

$$ \Delta = 13 + 4\,\Phi^{-1}(1 - p) $$

● The resulting scale has a mean = 13 and an SD = 4
● The perpendicular distance of the paired Delta value to the principal axis line is determined
● An a priori cut-off rule is used to remove common items from the anchor set
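A minimal sketch of these two classical screens, assuming hypothetical p-values for five common items; the 0.10 cut-off is an illustrative a priori rule, not a value from the presentation:

```python
# Hedged sketch: the p-values and the 0.10 cut-off are illustrative
# assumptions, not values taken from the presentation.
import numpy as np
from scipy.stats import norm

p_old = np.array([0.72, 0.55, 0.80, 0.43, 0.66])  # common-item p-values, old administration
p_new = np.array([0.70, 0.41, 0.79, 0.45, 0.64])  # same items, new administration

# P-value cut-off screen: flag items whose difficulty shifted by more
# than a fixed, a priori amount.
CUTOFF = 0.10
flagged = np.abs(p_new - p_old) > CUTOFF
print("Flagged items:", np.where(flagged)[0])

# Dorans and Holland (1993): convert p-values to the delta metric
# (mean 13, SD 4) via inverse normal deviates.
delta_old = 13 + 4 * norm.ppf(1 - p_old)
delta_new = 13 + 4 * norm.ppf(1 - p_new)
```

The delta values feed the perpendicular-distance screen sketched after the next slides.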
Modified Delta Plot Technique, Cont.
An Alternative Approach
● Item p-values are converted using the inverse standard normal distribution:

$$ z = \Phi^{-1}(1 - p) $$

● The perpendicular distance of the paired Delta value is computed by:

$$ D = \frac{A\,Z_{old} - Z_{new} + B}{\sqrt{A^2 + 1}} $$

where

$$ A = \frac{\left(SD^2_{Z_{new}} - SD^2_{Z_{old}}\right) + \sqrt{\left(SD^2_{Z_{new}} - SD^2_{Z_{old}}\right)^2 + 4\,r^2_{(Z_{old})(Z_{new})}\,SD^2_{Z_{new}}\,SD^2_{Z_{old}}}}{2\,r_{(Z_{old})(Z_{new})}\,SD_{Z_{new}}\,SD_{Z_{old}}} $$

Modified Delta Plot Technique, Cont.
An Alternative Approach — Continued
● And:

$$ B = \operatorname{Mean}(Z_{new}) - A \cdot \operatorname{Mean}(Z_{old}) $$

● The standard deviation (SD) of the perpendicular distance is given by:

$$ SD^2_D = SD_{Z_{new}}\,SD_{Z_{old}}\left(1 - r_{(Z_{old})(Z_{new})}\right) $$

● A fixed rule is used to flag items, e.g., D > 3 SD from the fitted line

Mantel-Haenszel (MH) Chi-Square Techniques
● Item drift is a special case of DIF that affects a test's objectivity (Bock, Muraki, & Pfeiffenberger, 1988)
● Many DIF methods could theoretically be used for screening common items
● MH procedures have been studied and used in the common-item screening process (Michaelides, 2006)
● DIF and item drift differ practically in:
  ● Sample size
  ● Subgroup differences vs. local proficiency
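A minimal sketch of the perpendicular-distance screen just described. The slope A, intercept B, and distance D follow the slide formulas; the delta values are hypothetical, and the SD of D is computed empirically here rather than from the closed-form expression above:

```python
import numpy as np

# Hypothetical delta values for five common items (old vs. new form).
delta_old = np.array([11.2, 12.5, 13.1, 14.0, 15.3])
delta_new = np.array([11.0, 12.8, 13.0, 14.9, 15.1])

sd_old, sd_new = delta_old.std(ddof=1), delta_new.std(ddof=1)
r = np.corrcoef(delta_old, delta_new)[0, 1]

# Principal-axis (major-axis) slope A and intercept B.
diff = sd_new**2 - sd_old**2
A = (diff + np.sqrt(diff**2 + 4 * r**2 * sd_new**2 * sd_old**2)) / (2 * r * sd_new * sd_old)
B = delta_new.mean() - A * delta_old.mean()

# Perpendicular distance of each item from the principal-axis line.
D = (A * delta_old - delta_new + B) / np.sqrt(A**2 + 1)

# Flag items farther than 3 SDs of D from the fitted line.
sd_D = D.std(ddof=1)  # empirical SD of the distances
flagged = np.abs(D) > 3 * sd_D
```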
IRT-Related Procedures
● Expected p-value comparison (including the Delta plot method)
● Common-item TCC comparison and preliminary screening
● Item parameter comparison with inferential statistics
  ● Weighted and unweighted t-statistic (Wright & Stone, 1979; Wright & Bell, 1984; Smith & Kramer, 1992)
  ● Information-weighted linking constants (Cohen, Jiang, & Yu, 2008)
  ● Displacement measure (Linacre, 2006)
  ● Robust-z statistic (Huynh, 2002; Tenenaum, 2001)
  ● Lord's (1980) chi-square statistic

Expected P-Value Comparisons
● The expected p-values of common items are compared to make sure the items are similar in difficulty in both forms
● A regression line is fit to the p-values between the estimated new form and old form
● Delta plot methods can be used to evaluate the expected p-value differences in the IRT framework

Expected P-Value Comparisons
[Figure: plot of old-form vs. new-form expected p-values for the common items]
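A minimal sketch of how an expected p-value can be computed under the 3PL model by averaging the item response function over an assumed N(0,1) ability distribution; the parameter values and the 1.7 scaling constant are illustrative conventions, not values from the presentation:

```python
import numpy as np

def expected_p(a, b, c, thetas=np.linspace(-4, 4, 81)):
    """Model-implied p-value of a 3PL item over a N(0,1) ability distribution."""
    w = np.exp(-thetas**2 / 2)
    w /= w.sum()  # normalized quadrature weights
    p_theta = c + (1 - c) / (1 + np.exp(-1.7 * a * (thetas - b)))
    return (w * p_theta).sum()

# Hypothetical old and new (equated) estimates for one common item;
# a large difference in expected p-values suggests drift.
print(expected_p(a=1.1, b=0.2, c=0.20), expected_p(a=1.0, b=0.5, c=0.18))
```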
Common Items TCC Comparisons
● The old and equated common-item set TCCs are compared to make sure that they have reasonable overlap
● The correlation coefficients between the old and new item parameters are compared
● Fixed rules are used to flag unstable common items

Common Items TCC Comparisons
[Figure: overlay of the old and equated common-item test characteristic curves]

Screening for 3-PL IRT Estimates
● IRT item parameters (a, b, and c) of common items are plotted
● Fixed rules are used to flag unstable common items, e.g.:
  ● a < .3 or a > 1.5, or the item drift > 1.0 theta unit
  ● b < -2.0 or b > 2.0, or the item drift > 1.0 theta unit
  ● c > .35
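A minimal sketch of the TCC overlap check and the fixed-rule 3PL screen; the item parameters are hypothetical, and the thresholds mirror the example rules above:

```python
import numpy as np

def tcc(thetas, a, b, c):
    """Test characteristic curve: expected raw score on the common-item set."""
    p = c[:, None] + (1 - c[:, None]) / (
        1 + np.exp(-1.7 * a[:, None] * (thetas[None, :] - b[:, None])))
    return p.sum(axis=0)

thetas = np.linspace(-4, 4, 161)
# Hypothetical old and equated 3PL estimates for three common items.
a_old, b_old, c_old = np.array([1.0, 0.8, 1.3]), np.array([-0.5, 0.2, 1.0]), np.array([0.20, 0.15, 0.25])
a_new, b_new, c_new = np.array([1.1, 0.7, 0.4]), np.array([-0.4, 0.3, 2.1]), np.array([0.20, 0.18, 0.30])

# Maximum vertical gap between the old and equated common-item TCCs.
max_gap = np.max(np.abs(tcc(thetas, a_old, b_old, c_old) - tcc(thetas, a_new, b_new, c_new)))

# Fixed-rule screen per item, using the example thresholds from the slide.
flagged = ((a_new < 0.3) | (a_new > 1.5) |
           (b_new < -2.0) | (b_new > 2.0) |
           (np.abs(b_new - b_old) > 1.0) |  # b-drift > 1.0 theta unit
           (c_new > 0.35))
```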
Wright and Stone's t-statistic
● Mainly used in Rasch equating
● Offers the unweighted and weighted link constants:

$$ (ul)_{jk} = \frac{1}{M}\sum_{i=1}^{M}\left(d_{ij} - d_{ik}\right) $$

$$ (wl)_{jk} = \frac{\sum_{i=1}^{M} w_{ijk}\left(d_{ij} - d_{ik}\right)}{\sum_{i=1}^{M} w_{ijk}}, \qquad w_{ijk} = \frac{1}{(se_{ij})^2 + (se_{ik})^2} $$

Wright and Stone's t-statistic (Cont.)
● The t-statistic identifies two sources of error:
  ● The random fluctuation arising from the finite samples
  ● The differences in item difficulty
● A common practice is to use the critical value at α = 0.05:

$$ t_{ijk} = \frac{d_{ij} - d_{ik} - (ul)_{jk}}{\left[(se_{ij})^2 + (se_{ik})^2\right]^{1/2}} $$

Information-Weighted Linking Constants
● The original linking constant is determined by taking the arithmetic mean difference in item difficulty parameters as the linking constant:

$$ d^t = \frac{1}{J}\sum_{j=1}^{J}\left(b_j^t - b_j^0\right) $$

● The weighted linking constant is weighted by the information δ_j, where d_j^t = b_j^t − b_j^0:

$$ d'^{\,t} = \sum_{j=1}^{J} w_j^t\, d_j^t, \qquad w_j^t = \frac{\delta_j^t}{\sum_{j=1}^{J}\delta_j^t}, \qquad \operatorname{Var}\!\left(d'^{\,t}\right) = \sum_{j=1}^{J}\left(w_j^t\right)^2 \operatorname{Var}\!\left(d_j^t\right) $$

● By weighting with sampling error, the item parameter estimates are believed to be more efficient
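A minimal sketch of Wright and Stone's link constants and the per-item t-statistic for a Rasch anchor set; the difficulty estimates and standard errors are hypothetical:

```python
import numpy as np

# Hypothetical Rasch difficulties (d) and standard errors (se) for four
# common items in administrations j (old) and k (new).
d_j = np.array([-1.2, 0.3, 0.9, 1.6])
d_k = np.array([-1.1, 0.2, 1.4, 1.5])
se_j = np.array([0.08, 0.07, 0.09, 0.10])
se_k = np.array([0.09, 0.07, 0.08, 0.11])

diff = d_j - d_k

# Unweighted link constant: mean difficulty difference.
ul = diff.mean()

# Weighted link constant, weighting by inverse summed error variances.
w = 1.0 / (se_j**2 + se_k**2)
wl = (w * diff).sum() / w.sum()

# Per-item t: shift beyond the link constant relative to its standard error.
t = (diff - ul) / np.sqrt(se_j**2 + se_k**2)
flagged = np.abs(t) > 1.96  # two-tailed critical value at alpha = 0.05
```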