

1. Software Tools for Evaluating Data Quality to Support Evidence-Building
Zachary H. Seeskin, NORC at the University of Chicago
AcademyHealth Annual Research Meeting, Washington, DC, June 3, 2019
Acknowledgments: Rupa Datta, Gabriel Ugarte, Evan Herring-Nathan, Andrew Latterner, NORC; Bob George, Emily Wiegand, Chapin Hall at the University of Chicago
Disclaimer: This research was supported by the Family Self-Sufficiency Research Consortium, Grant Number 90PD0272, funded by the Office of Planning, Research, and Evaluation in the Administration for Children and Families, U.S. Department of Health and Human Services, to the University of Chicago, with NORC at the University of Chicago as a sub-grantee. The views expressed are solely those of the authors and do not necessarily represent the views of the Office of Planning, Research, and Evaluation.

2. Motivation
- Increasing research use of administrative data, including for health and health care research
  - Report of the Commission on Evidence-Based Policymaking, 2017
  - Passage of the Foundations for Evidence-Based Policymaking Act in January 2019
- Understanding data quality is critical for expanding informed use of such data sources for evidence-building
  - But resources are needed to inform evaluations of data quality
  - Literature largely focused on federal statistical agencies
- NORC is developing software tools to fill this need
  - Provide best practices
  - Incorporate descriptive statistics and multivariate visualization

3. Overview
1. Growing diversity of data sources for health research
2. Overview of data quality and assessment
3. Dimensions of data quality
4. NORC’s Data File Orientation Toolkit with examples
5. Conclusion

4. Administrative Data Sources Used in Health and Health Care Research
Range of data sources being used for research:
- Medicare and Medicaid enrollment
- E-prescription data
- State registries (ex: immunization)
- Consumer purchase data
- Insurance claims
- Electronic health records/electronic medical records
- Many others
Uses of data sources:
- Directly for analysis/estimation: monitoring, surveillance; with or without linkage to other sources
- Indirectly to support estimation with other sources (such as surveys): survey frames, imputation, calibration
Further background from Seeskin et al. (2018)

5. Challenges with Administrative Data Sources
- Data collected for administration rather than to support statistical analyses
- Common data quality concerns: data entry errors, missing data, duplicate records
- Varying quality for different variables based on importance for administration
- Represent special populations without ready official statistics available
- Subject to changes over time and differential treatment for different groups

6. Principles for Data Quality Analyses: Know Your Data
- Conduct a careful review of metadata and documentation
- Understand the context in which data are collected and maintained
  - Including legal and compliance issues affecting measures in the data file
- Focus on data exploration to detect possible quality issues
- Seek potential validation data related to measures in your file
  - By unit or in aggregate
  - If available, conduct detailed comparisons
- Ask: Are your data fit for the purpose at hand?
  - Data quality needs differ for different kinds of research questions (ex: cross-sectional, time series, longitudinal)
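A minimal sketch of what a first exploration pass might look like in R; the file name and contents are hypothetical placeholders, not the Toolkit's own code:

```r
# First-pass exploration: load the file, inspect structure, and profile
# missingness. "analysis_file.csv" is a hypothetical placeholder.
dat <- read.csv("analysis_file.csv", stringsAsFactors = FALSE)

str(dat)      # variable types and a peek at raw values
summary(dat)  # distributions; watch for implausible minimums and maximums

# Share of missing values per variable, worst first
sort(sapply(dat, function(x) mean(is.na(x))), decreasing = TRUE)
```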

7. Data Quality Dimensions from Literature
- Relevance: Degree to which statistics meet needs of user, including whether data provide what is needed for use or research topic.
- Accuracy: Whether data values reflect true values and are processed correctly.
- Completeness: Whether data cover the population of interest, include correct records, and do not contain duplicate or out-of-scope records. Additionally, whether cases have information filled in for all appropriate fields without missing data.
- Timeliness: Whether the data are available in time to inform policy matters of interest.
- Accessibility: The conditions in which users can obtain and work with the data, including physical conditions and legal requirements for access.
- Clarity/Interpretability: Whether data are accompanied by sufficient and appropriate metadata to understand the data and their quality.
- Coherence/Consistency: Data from different sources are based on the same approaches, classifications, and methodologies, with enough metadata available to support combining information from different sources.
- Comparability: Extent to which differences between statistics reflect real phenomena rather than methodological differences. Types of comparability: over time, across geographies, among domains.

8. Recommended Checks from Literature
Accuracy:
- Validity of units: Assesses validity of identification keys for units in the dataset.
- Validity of variable values: Assesses sensibility of values of single variables and among variables using the metadata.
- Trustworthy variable values: Determines values in the data that, while valid, are suspicious from judgment or experience.
Completeness:
- Coverage of units: Assesses whether there are units that are missing or not available for the analysis.
- Duplicates: Looks at the occurrence of multiple registrations of identical units in the dataset.
- Missing values: Looks at the absence of values for the variables and analyzes whether characteristics of the units with missing data are different from those of units with complete data.
Comparability:
- Distribution of variables: Assesses distribution of relevant variables to look for incongruences with expected distributions.
- Relationships among variables: Looks for unexpected patterns in relationships among variables.
- Consistency over time: Looks for unexpected patterns in variables over time.
Key sources on methods and frameworks: Daas et al. 2011, Laitila et al. 2011, Iwig et al. 2013, Office for National Statistics UK 2013, Statistics Canada 2018
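As an illustration, a few of these checks can be sketched in R with dplyr; the column names below (case_id, benefit_amt, age, year) are hypothetical placeholders rather than fields from any particular file:

```r
library(dplyr)

# Duplicates: identification keys registered more than once
dat %>% count(case_id) %>% filter(n > 1)

# Missing values: do cases missing a key field differ from complete cases?
dat %>%
  mutate(benefit_missing = is.na(benefit_amt)) %>%
  group_by(benefit_missing) %>%
  summarise(n = n(), mean_age = mean(age, na.rm = TRUE))

# Consistency over time: unexpected jumps in counts or means by year
dat %>%
  group_by(year) %>%
  summarise(n = n(), mean_benefit = mean(benefit_amt, na.rm = TRUE))
```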

9. NORC Software Tools for Evaluating Data Quality
- Data File Orientation Toolkit for the Family Self-Sufficiency Data Center
  - Produces a report applying data quality analyses to your data file
  - Provides detailed written guidance on how to interpret the analyses, organized by dimensions
  - Based in R Markdown; primarily designed for researchers and R programmers
  - Planned release upcoming at http://www.norc.org/Research/Projects/Pages/family-self-sufficiency-data-center.aspx
- Future plans: Data Quality Dashboard
  - Designed for a broader set of users
  - Load a data file and use a point-and-click interface to conduct recommended data quality checks
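Since the Toolkit is built on R Markdown, a report of this kind is typically rendered along these lines; the file name and parameters below are illustrative assumptions, not the Toolkit's actual interface:

```r
library(rmarkdown)

# Render a parameterized data quality report to HTML.
# "data_quality_report.Rmd" and the params list are hypothetical placeholders.
render("data_quality_report.Rmd",
       params = list(data_path = "analysis_file.csv", id_var = "case_id"),
       output_file = "data_quality_report.html")
```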

10. Example Analysis: Tableplots (Tennekes et al. 2011)
Note: From a simulated data source with about 1.5 million observations representing a five-year range with 100,000 cases
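Tableplots bin all rows by a sort variable and summarize each column within the bins, so distributions and missingness stay visible even with millions of records. A minimal sketch, assuming the tabplot R package by Tennekes et al. (which may need to be installed from the CRAN archive) and hypothetical variable names:

```r
library(tabplot)

# Sort ~1.5 million rows by year and summarize each selected column per bin;
# missing values appear as a separate color band within each column.
tableplot(dat,
          select  = c(year, benefit_amt, age, case_status),
          sortCol = year)
```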

11. Example Analysis: Tableplots
Note: From a simulated data source with about 1.5 million observations representing a five-year range with 100,000 cases

12. Example Analysis: Tableplots
Note: From a simulated data source with about 1.5 million observations representing a five-year range with 100,000 cases

13. Example Analysis: Tableplots
Note: From a simulated data source with about 1.5 million observations representing a five-year range with 100,000 cases

14. Example Analysis: Tableplots
Note: From a simulated data source with about 1.5 million observations representing a five-year range with 100,000 cases

15. Example Analysis: Tableplots
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records

16. Example Analysis: Tableplots
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records

17. Example Analysis: Letter Value Plots (Hofmann et al. 2015)
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records
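Letter value plots extend the boxplot with additional letter-value boxes, so the tails of very large samples are described in detail instead of being flagged as thousands of "outliers." A minimal sketch, assuming the lvplot R package by Hofmann et al. and hypothetical variable names:

```r
library(ggplot2)
library(lvplot)

# Letter value plot of claim amounts by year; boxes beyond the quartiles show
# tail behavior that a standard boxplot would reduce to a cloud of outlier dots.
ggplot(dat, aes(x = factor(year), y = claim_amt)) +
  geom_lv(fill = "steelblue") +
  labs(x = "Year", y = "Claim amount")
```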

18. Example Analysis: Letter Value Plots
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records

19. Example Analysis: Letter Value Plots
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records

20. Example Analysis: Letter Value Plots
Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 records

21. Conclusion
- Use of administrative data sources for health and health care research is expanding
  - Key issues and recommendations described in the Report of the Commission on Evidence-Based Policymaking, 2017
- These tools provide a much-needed resource to support evaluations of the data quality of administrative data sources for evidence-building
- Value of using software tools to explore your data
  - Advantages of the R environment
- Current tools are geared toward researchers and programmers
  - In the future, we aim to develop and provide tools that are more broadly accessible
