MHPE 494: Data Analysis

Alan Schwartz, PhD
Matt Lineberry, PhD
Department of Medical Education
College of Medicine
University of Illinois at Chicago

Welcome!
- Your name, specialty, institution, position
- Experience in data analysis
- Why this class?
- What are your expectations and goals?

The Analytic Process
- Formulate research questions
- Design study (covered in Research Design/Grant Writing)
- Collect data
- Record data
- Check data for problems
- Explore data for patterns
- Test hypotheses with the data
- Interpret and report results (covered in Writing for Scientific Publication)

(c) Alan Schwartz, UIC DME, 1999

Monday AM
- Introduction
- Syllabus
- Data Entry
- Data Checking
- Exploratory Data Analysis

Data entry
or, "Garbage in, garbage out"

Data Entry
- Data entry is the process of recording the behavior of research subjects (or other data) in a format that is efficient for:
  - Understanding the coded responses
  - Exploring patterns in the data
  - Conducting statistical analyses
  - Distributing your data set to others
- Data entry is often given low regard, but a little time spent now can save a lot of time later!

Methods of data entry
- Direct entry by participants
- Direct entry from observations
- Entry via coding sheets
- Entry to statistical software
- Entry to spreadsheet software
- Entry to database software

Data file layout
- Most data files in most statistical software use "standard data layout":
  - Each row represents one subject
  - Each column represents one variable measurement
- Special formats are sometimes used for particular analyses/software:
  - Doubly multivariate data (each row is a subject at a given time)
  - Matrix data

"Standard data layout"

Id   Female   YrsOld   GPA
1    1        19       3.5
2    0        21       3.4
3    1        20       3.4
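
The standard data layout can be illustrated with a short sketch. It assumes the slide's example table has been saved as comma-separated text; the names `RAW`, `load_standard_layout`, and `column` are hypothetical and only serve the illustration.

```python
import csv
import io

# The slide's example table written as CSV text (an assumed storage format):
# one row per subject, one column per variable.
RAW = """Id,Female,YrsOld,GPA
1,1,19,3.5
2,0,21,3.4
3,1,20,3.4
"""


def load_standard_layout(text):
    """Return a list of dicts, one per subject (row), keyed by variable name."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            "Id": int(record["Id"]),
            "Female": int(record["Female"]),   # 1 = female, 0 = not female
            "YrsOld": int(record["YrsOld"]),
            "GPA": float(record["GPA"]),
        })
    return rows


def column(rows, name):
    """Pull out one variable (column) across all subjects."""
    return [row[name] for row in rows]


subjects = load_standard_layout(RAW)
print(len(subjects), "subjects")
print("GPA column:", column(subjects, "GPA"))
```

Keeping one subject per row means that any single variable can be pulled out as a column, which is the form most statistical routines expect.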

Missing data
- Data can be missing for many reasons:
  - Random missing responses
  - Drop-out in longitudinal studies (censoring)
  - Systematic failure to respond
  - Structure of research design
- Knowing why data is missing is often the key to deciding how to handle missing data

Missing data
- Approaches to dealing with missing data:
  - Leave data missing, and exclude that cell or subject from analyses
  - Impute values for missing data (requires a model of how data is missing)
  - Use an analytic technique that incorporates missing data as part of data structure

Naming Variables
- Variables should have both a short name (for the software) and a descriptive name (for reporting)
- Name for what is measured, not inferred
- Short names should capture something useful about the variable (its scale, its coding)
- Better names: Q1-Q20, IQ, MALE, IN_TALL, IN_TALLZ
- Worse names: INTEL, SEX, SIZE
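
Two of the approaches to missing data listed above can be sketched as follows. The data are hypothetical, `None` is an assumed marker for a missing response, and mean imputation stands in for "a model of how data is missing" only as the simplest possible example (it is reasonable only when responses are missing at random); the slides do not prescribe a particular imputation method.

```python
from statistics import mean

# Hypothetical questionnaire scores; None marks a missing response.
scores = [
    {"id": 1, "q1": 4, "q2": 5},
    {"id": 2, "q1": None, "q2": 3},
    {"id": 3, "q1": 2, "q2": None},
    {"id": 4, "q1": 3, "q2": 4},
]


def listwise_delete(rows, variables):
    """Exclude every subject with a missing value on any listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def mean_impute(rows, variable):
    """Return copies of the rows with missing values on one variable
    replaced by the mean of the observed values (assumes random missingness)."""
    observed = [r[variable] for r in rows if r[variable] is not None]
    fill = mean(observed)
    return [
        dict(r, **{variable: fill if r[variable] is None else r[variable]})
        for r in rows
    ]


complete = listwise_delete(scores, ["q1", "q2"])
imputed = mean_impute(scores, "q1")
print("complete cases:", [r["id"] for r in complete])
print("q1 after imputation:", [r["q1"] for r in imputed])
```

Both functions return new lists and leave the original records untouched, so the raw data set stays available for other analyses.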

Coding Variables
- Depends on measurement scale
  - Nominal, two categories: Name variable for one category and code 1 or 0
  - Nominal, many categories: Use a string coding or meaningful numbers
  - Ordinal: Code ranks as numbers, decide if lower or higher ranks are better
  - Interval/Ratio: Code exact value

Labeling Variable Values
- For nominal and ordinal variables, values should also be labeled unless using string coding.
- Value labels should precisely indicate the response to which the value refers.
- Example: Educational level ordinal variable:
  1 = grade school not completed
  2 = grade school completed
  3 = middle school completed
  4 = high school completed
  5 = some college
  6 = college degree

Error Checking
- Goal: Identify errors made due to:
  - Faulty data entry
  - Faulty measurement
  - Faulty responses
- Done prior to analyses; not hypothesis-based
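
The coding and labeling rules above can be sketched in a few lines. The label table reproduces the slide's educational-level example; the function names `label_value` and `code_binary` are hypothetical and chosen only for this illustration.

```python
# Value labels for the ordinal educational-level variable from the slide.
EDUCATION_LABELS = {
    1: "grade school not completed",
    2: "grade school completed",
    3: "middle school completed",
    4: "high school completed",
    5: "some college",
    6: "college degree",
}


def label_value(code, labels):
    """Translate a numeric code into its value label; reject unknown codes."""
    if code not in labels:
        raise ValueError(f"code {code!r} has no value label")
    return labels[code]


def code_binary(response, category):
    """Code a two-category nominal response as 1 (in the named category) or 0."""
    return 1 if response.strip().lower() == category.lower() else 0


# A variable named FEMALE is coded 1 for female respondents and 0 otherwise.
female = [code_binary(r, "female") for r in ["Female", "male", "female "]]
print("FEMALE:", female)
print("4 =", label_value(4, EDUCATION_LABELS))
```

Rejecting unlabeled codes when a value is looked up doubles as a simple error check, since a mistyped code such as 7 fails immediately.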

Range checking
- The first basic check that should be performed on all variables
- Print out the range (lowest and highest value) of every variable
- Quickly catches common typos involving extra keystrokes

Distribution checking
- Examining the distribution of variables to ensure that they'll be amenable to analysis.
- Problems to detect include:
  - Floor and ceiling effects
  - Lack of variance
  - Non-normality (including skew and kurtosis)
  - Heteroscedasticity (in joint distributions)

Eccentric subjects
- Patterns of data can suggest that particular subjects are eccentric
  - Subjects may have misunderstood instructions
  - Subjects may understand instructions but use response scale incorrectly
  - Subjects may intentionally misreport (to protect themselves or to subvert the study as they see it)
  - Subjects may actually have different, but coherent views!
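
A range check, plus one of the distribution checks (lack of variance), might look like the sketch below. The data set and the permitted ranges in `ALLOWED` are hypothetical; the value 200 in `YrsOld` imitates an extra-keystroke typo for 20.

```python
from statistics import pvariance

# Hypothetical data set stored by variable (column).
data = {
    "YrsOld": [19, 21, 20, 200],   # 200 is a deliberate typo for 20
    "Female": [1, 0, 1, 1],
    "Consent": [1, 1, 1, 1],       # every subject gave the same answer
}

# Assumed permitted (lowest, highest) value for each variable.
ALLOWED = {"YrsOld": (0, 120), "Female": (0, 1), "Consent": (0, 1)}


def variable_range(values):
    """Return the lowest and highest observed value."""
    return min(values), max(values)


def out_of_range(values, low, high):
    """Return (row index, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


def has_no_variance(values):
    """True when every subject has the same value on the variable."""
    return pvariance(values) == 0


for name, values in data.items():
    low, high = variable_range(values)
    suspects = out_of_range(values, *ALLOWED[name])
    print(f"{name}: range {low} to {high}, "
          f"out of range: {suspects}, no variance: {has_no_variance(values)}")
```

Printing the range of every variable is cheap, and an impossible maximum such as an age of 200 stands out at once; the zero-variance flag marks variables that cannot contribute to most analyses.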