Reliability and Validity of Angoff Ratings
J. Anthony Bayless
Henry Busciglio
Personnel Research and Assessment Division
Office of Human Resources Management

Standard Setting
• Process to establish a performance standard, cut score, or passing score
• Process not purely technical or empirical
• Process involves value judgments (Standards for Educational and Psychological Testing)
• Various methods of standard setting, for example:
  • Contrasting Groups and Borderline Groups (Livingston & Zieky, 1982)
  • Angoff (1971)
  • Ebel (1972)
  • Nedelsky (1954)
OHRM/PRAD June 10, 2008

Angoff Procedure
• SMEs are administered the test
• SMEs estimate the proportion of “minimally qualified” or “minimally competent” examinees who would answer each item correctly
• Average Angoff rating is calculated for each item
• Grand average of the Angoff ratings across items is calculated to represent the recommended performance standard (or cut score)
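The two averaging steps of the Angoff procedure can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: the rating matrix below is hypothetical, and the function name `angoff_cut_score` is an assumption introduced only for this example.

```python
from statistics import mean

# Hypothetical ratings (illustrative values only, not data from the study).
# Rows are SMEs; columns are test items. Each value is the estimated
# proportion of minimally qualified examinees answering the item correctly.
ratings = [
    [0.60, 0.70, 0.50, 0.80],
    [0.65, 0.75, 0.55, 0.70],
    [0.55, 0.65, 0.45, 0.75],
]

def angoff_cut_score(ratings):
    """Return (per-item mean ratings, grand mean across items)."""
    n_items = len(ratings[0])
    # Step 1: average the SME ratings for each item.
    item_means = [mean(sme[i] for sme in ratings) for i in range(n_items)]
    # Step 2: average the item means to get the recommended standard.
    return item_means, mean(item_means)

item_means, cut = angoff_cut_score(ratings)
print([round(m, 3) for m in item_means])
print(round(cut, 4))
```

With these made-up ratings the recommended standard is a proportion correct of about .64.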

Promotional Assessments
• Career Experience Inventory
• Critical Thinking Skills
• In-Basket Job Simulation
• Managerial Writing Skills
• Job Knowledge Test

Job Knowledge Test
• 80 items for each occupation’s (IEA and DO) test
• Multiple-choice items with four response options
• Dichotomously scored items
• Power tests

Research Interest
• How good are SMEs at conceptualizing and consistently applying a hypothetical construct of “minimally qualified” examinees?
  • Specifically, how reliable are the SME estimates?
  • Specifically, how valid are the SME estimates?

Methodology – Angoff
IEA SMEs                   DO SMEs
n=5 (Time 1 + Time 2)      n=8
No group discussion        Group discussion

Methodology – Study
• Two post hoc studies, one per occupation
  • DO sample (N=259 examinees)
  • IEA sample (N=318 examinees)
• Assessed interjudge reliability via internal consistency estimate of reliability
• Assessed validity via correlation of average Angoff rating and actual (observed) item difficulty index for a “minimally qualified” group of examinees
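An internal consistency estimate of interjudge reliability can be sketched as Cronbach's alpha computed over the rating matrix. The setup here is an assumption, not something the slides spell out: each SME is treated as an "item" of the scale and each test item as a case. The ratings are hypothetical and the function name `cronbach_alpha` is introduced only for illustration.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha with SMEs as scale 'items' and test items as cases.

    ratings: list of SME rows, each a list of ratings over the same test items.
    """
    k = len(ratings)
    # Variance of each SME's ratings across the test items.
    sme_vars = [variance(row) for row in ratings]
    # Variance of the summed ratings (one sum per test item).
    totals = [sum(col) for col in zip(*ratings)]
    return k / (k - 1) * (1 - sum(sme_vars) / variance(totals))

# Hypothetical ratings: 3 SMEs by 5 items (illustrative values only).
ratings = [
    [0.60, 0.70, 0.50, 0.80, 0.40],
    [0.65, 0.75, 0.55, 0.70, 0.50],
    [0.55, 0.60, 0.50, 0.75, 0.45],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Alpha rises when SMEs rank the items in a similar order of difficulty, which is the sense of "consistency" at issue here.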

Results – Reliability
• DO Sample (72 scored items, 8 SMEs)
  • Alpha = .863, no removable SMEs
  • Item-total correlations from .582 to .680
• IEA Sample (70 usable items, 5 SMEs)
  • Initial Alpha = .429, with 2 removable SMEs
  • Final Alpha = .547, using 3 SMEs
  • Item-total correlations from .364 to .422
• We used both 5- and 3-SME groups for further analyses.
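The SME-level diagnostics reported above can be sketched as follows. This is an illustration under stated assumptions: a "removable" SME is read as one whose deletion raises alpha, and the item-total correlation is computed in its corrected form (each SME against the sum of the other SMEs), which the slides do not confirm. The ratings are hypothetical, with the fourth SME made deliberately inconsistent; `sme_diagnostics` and `pearson_r` are names introduced for this example.

```python
from statistics import mean, variance

def cronbach_alpha(ratings):
    """Cronbach's alpha with SMEs as scale 'items' and test items as cases."""
    k = len(ratings)
    totals = [sum(col) for col in zip(*ratings)]
    return k / (k - 1) * (1 - sum(variance(r) for r in ratings) / variance(totals))

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def sme_diagnostics(ratings):
    """Corrected item-total r and alpha-if-deleted for each SME (needs >= 3 SMEs)."""
    out = []
    for j, row in enumerate(ratings):
        others = ratings[:j] + ratings[j + 1:]
        rest_totals = [sum(col) for col in zip(*others)]
        out.append({
            "sme": j,
            "item_total_r": pearson_r(row, rest_totals),
            "alpha_if_deleted": cronbach_alpha(others),
        })
    return out

# Hypothetical ratings (illustrative only); the last SME disagrees with the rest.
ratings = [
    [0.60, 0.70, 0.50, 0.80, 0.40],
    [0.65, 0.75, 0.55, 0.70, 0.50],
    [0.55, 0.60, 0.50, 0.75, 0.45],
    [0.70, 0.50, 0.80, 0.45, 0.75],
]
overall = cronbach_alpha(ratings)
for d in sme_diagnostics(ratings):
    print(d["sme"], round(d["item_total_r"], 3), round(d["alpha_if_deleted"], 3))
```

In this made-up example, dropping the inconsistent SME raises alpha well above the overall value, the same pattern as the move from 5 to 3 IEA SMEs, though far more extreme.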

Results – Validity
• Validity: agreement between SMEs’ Angoff estimates and actual p-values among a group of “minimally qualified” test takers.
• “Minimally qualified” defined two ways:
  • Candidates scoring close to 50th percentile
  • Candidates getting 70% of items correct
• Used both correlations and t-tests to assess validity
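Selecting a "minimally qualified" group and computing its item p-values might look like the sketch below. The slides do not say how close to the 50th percentile or to 70% correct a candidate had to be, so the `band` parameters are assumptions, as are the function names and the small 0/1 response matrix.

```python
from statistics import median

def item_p_values(responses):
    """Proportion correct per item; responses are rows of 0/1 item scores."""
    n = len(responses)
    return [sum(col) / n for col in zip(*responses)]

def near_median_group(responses, band=1):
    """Examinees whose total score is within `band` points of the median.

    The width of the band is an assumed definition of 'close to the 50th percentile'.
    """
    totals = [sum(r) for r in responses]
    med = median(totals)
    return [r for r, t in zip(responses, totals) if abs(t - med) <= band]

def near_percent_correct_group(responses, target=0.70, band=0.05):
    """Examinees whose proportion correct is within `band` of `target` (assumed band)."""
    n_items = len(responses[0])
    return [r for r in responses if abs(sum(r) / n_items - target) <= band]

# Hypothetical responses: 6 examinees by 10 items (illustrative only).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # total 9
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # total 7
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],  # total 7
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],  # total 5
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # total 3
    [1, 1, 1, 1, 0, 1, 0, 0, 1, 0],  # total 6
]
grp50 = near_median_group(responses)
grp70 = near_percent_correct_group(responses)
p50 = item_p_values(grp50)
p70 = item_p_values(grp70)
print(len(grp50), [round(p, 2) for p in p50])
print(len(grp70), [round(p, 2) for p in p70])
```

One property worth keeping in mind: the mean p-value of a group selected near 70% correct will sit near .70 by construction, so that group's average difficulty is largely fixed by the selection rule.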

Results – Validity (Corr.)
• For DO sample, correlations were:
  • .591** for 50th percentile group
  • .479** for 70% correct group
• For IEA sample, correlations (for 5- and 3-SME groups, respectively) were:
  • .311** and .243* for 50th percentile group
  • .282* and .183 for 70% correct group
** p < .01. * p < .05.
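The significance flags on these correlations can be checked from the reported r values alone. The sketch below uses the standard t test for a Pearson correlation and assumes that the number of paired observations is the number of items in each analysis (72 for DO, 70 for IEA); the slides do not state how significance was tested, so this is a consistency check, not a reconstruction of the authors' analysis.

```python
from math import sqrt

def r_to_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# Correlations as reported on the slide; n is assumed to be the item count.
reported = {
    "DO, 50th percentile": (0.591, 72),
    "DO, 70% correct": (0.479, 72),
    "IEA 5-SME, 50th percentile": (0.311, 70),
    "IEA 3-SME, 50th percentile": (0.243, 70),
    "IEA 5-SME, 70% correct": (0.282, 70),
    "IEA 3-SME, 70% correct": (0.183, 70),
}
for label, (r, n) in reported.items():
    print(f"{label}: r = {r:.3f}, t({n - 2}) = {r_to_t(r, n):.2f}")
```

With about 70 degrees of freedom the two-tailed critical values are roughly 2.0 at the .05 level and 2.65 at the .01 level, and the computed t values fall on the same side of those thresholds as the asterisks indicate.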

Results – Validity (T-tests)
• Agreement: magnitude of mean differences between the Angoff ratings for each item and the corresponding p-value among minimally qualified test takers.
• Used paired-samples t-tests
• For DO sample:
  • Grand average Angoff rating = .6310
  • Average p-value for 50th percentile group = .6315
    • t = 0.025, df = 71, p = .980
  • Average p-value for 70% correct group = .6906
    • t = 2.750, df = 71, p = .008
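A paired-samples t statistic over items can be sketched as below. The items are the paired units, which matches the reported df of 71 for 72 DO items. The values are hypothetical, the function name `paired_t` is introduced for illustration, and the difference is taken as p-value minus Angoff rating, an assumption that matches the signs of the reported t values.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for the differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical per-item values (illustrative only, not data from the study).
p_values = [0.70, 0.55, 0.80, 0.65, 0.60]   # observed proportion correct
angoff = [0.75, 0.60, 0.78, 0.72, 0.70]     # mean Angoff rating per item

t, df = paired_t(p_values, angoff)
print(round(t, 3), df)
```

A negative t under this convention means the SMEs rated items as easier than the minimally qualified group actually found them.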

Results – Validity (T-tests)
For IEA sample:
• Grand average Angoff ratings
  • 5-SME = .7716
  • 3-SME = .7710
• Average p-values
  • 50th percentile group = .6810
  • 70% correct group = .6980

Results – Validity (T-tests)
For IEA sample, continued:
• Comparisons:
  • 1: 50th percentile p-values compared to 5-SME Angoffs
    • t = -3.233, p = .002
  • 2: 70% correct p-values compared to 5-SME Angoffs
    • t = -2.685, p = .009
  • 3: 50th percentile p-values compared to 3-SME Angoffs
    • t = -3.148, p = .002
  • 4: 70% correct p-values compared to 3-SME Angoffs
    • t = -2.587, p = .012

Results – Validity (T-tests)
IEA T-Test Comparisons
                            50th Percentile p-values    70% Correct p-values
Avg. Angoffs for 5 SMEs     t = -3.233, p = .002        t = -2.685, p = .009
Avg. Angoffs for 3 SMEs     t = -3.148, p = .002        t = -2.587, p = .012

Results – Summary
• DO SMEs gave reasonably reliable and valid estimates of actual p-values, especially for test takers at the 50th percentile.
• IEA SMEs gave less reliable and valid estimates by exhibiting less interrater agreement, demonstrating less insight into the relative difficulty of items, and overestimating p-values.
• The notably superior performance of the DO SMEs is reasonable given the differences between the procedures used to obtain Angoff estimates from the two groups.

Limitations of Current Study
• Post hoc studies
• Did not retain initial round of Angoff ratings prior to group discussions during second round

How Does This Help You?
• The more SMEs, the merrier!
• Group discussion is critical
• SMEs need to be experienced and representative of occupational workforce

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington, DC: American Council on Education.
Cizek, G. J. (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50.

References (continued)
Ebel, R. L. (1972). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline examinees. Applied Measurement in Education, 12(1), 13-28.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.
