Dealing with missing data in practice: Methods, applications, and implications for HIV cohort studies
Belen Alejos Ferreras
Centro Nacional de Epidemiología, Instituto de Salud Carlos III
19 October 2017
What is missing or incomplete data?
Missing or incomplete data: data that were intended to be collected on the observations but, for various reasons, were not collected.

V1  V2  V3  V4
X   X   X   .
X   X   X   .
X   X   X   .
X   X   X   .
Do I need to be worried about missing data?
Importance and consequences
• There is no universal rule for the proportion of missing data that produces bias or invalid results.
• The success of a statistical analysis in the presence of missing data depends on the reasons why the data are missing (the missing data mechanisms).
Which missing data mechanisms are there?
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)
Missing data mechanisms
• Missing completely at random (MCAR): whether an observation is missing is unrelated to the unseen value and to any other values (observed or missing): $P(R \mid Y) = P(R)$
• Missing at random (MAR): whether an observation is missing is unrelated to the unseen value, but it is related to some of the observed data: $P(R \mid Y) = P(R \mid Y_{\text{obs}})$
• Missing not at random (MNAR): whether an observation is missing depends on the unseen value itself.
Notation: R = missing data point; Y = variables; $Y_{\text{obs}}$ = observed part of Y.
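The three mechanisms can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library; the variables `x` and `y`, the missingness probabilities and the sample size are assumptions chosen for illustration and do not come from the talk.

```python
import random

def simulate_missingness(n=10_000, seed=0):
    """Mean of the observed y values under MCAR, MAR and MNAR, plus the true mean."""
    rng = random.Random(seed)
    # x is always observed; y depends on x and may be missing.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]

    def observed_mean(p_missing):
        # p_missing(xi, yi) is the probability that yi is missing.
        kept = [yi for xi, yi in zip(x, y) if rng.random() >= p_missing(xi, yi)]
        return sum(kept) / len(kept)

    return {
        "true": sum(y) / n,
        # MCAR: missingness ignores both x and y.
        "MCAR": observed_mean(lambda xi, yi: 0.3),
        # MAR: missingness depends only on the observed x.
        "MAR": observed_mean(lambda xi, yi: 0.6 if xi > 0 else 0.1),
        # MNAR: missingness depends on the unseen y itself.
        "MNAR": observed_mean(lambda xi, yi: 0.6 if yi > 0 else 0.1),
    }

means = simulate_missingness()
for name, value in means.items():
    print(f"{name}: {value:.3f}")
```

In this setup the observed mean stays close to the true mean only under MCAR. Under MAR the bias can in principle be removed by using the observed `x`, whereas under MNAR the observed data alone are not enough.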
Methods to deal with missing data
If it is not possible to recover the original value, the problem has to be addressed with statistical techniques.
Methods to deal with missing data
Ad hoc or conventional methods:
• Complete-case (CC)
• Indicator method (IM)
• Simple mean or regression mean imputation
• Stochastic regression imputation
Characteristics:
• Easy to implement
• No specific software needed
• Not based on statistical principles
• Might produce biased results and loss of power
Advanced or complex methods:
• Multiple Imputation by Chained Equations (MICE)
• Maximum likelihood estimation
• Bayesian methods
• Inverse probability weighting
Characteristics:
• Maximize the use of available information
• More precise results (higher statistical power)
• Depend on the missing data mechanism
• Some are not implemented in statistical software
Complete-case analysis
Restricts the statistical analyses to the cases with complete information for all the variables in the model.

Original data:
ID  Outcome  Variable  Complete-case
1   5        4         Yes
2   4        .         No
3   .        2         No
4   3        .         No
5   4        5         Yes

Complete-case data:
ID  Outcome  Variable  Complete-case
1   5        4         Yes
5   4        5         Yes
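The filtering step in the table above can be sketched in a few lines of Python. The row layout and the helper name `complete_cases` are assumptions made for illustration; `None` stands for the "." in the table.

```python
# Rows from the example table; None marks a missing value.
rows = [
    {"id": 1, "outcome": 5, "variable": 4},
    {"id": 2, "outcome": 4, "variable": None},
    {"id": 3, "outcome": None, "variable": 2},
    {"id": 4, "outcome": 3, "variable": None},
    {"id": 5, "outcome": 4, "variable": 5},
]

def complete_cases(data, columns):
    """Keep only the rows with no missing value in any of the model columns."""
    return [r for r in data if all(r[c] is not None for c in columns)]

cc = complete_cases(rows, ["outcome", "variable"])
print([r["id"] for r in cc])  # IDs 1 and 5 remain, as in the table
```

Three of the five rows are discarded here, which shows how quickly the method can lose statistical power.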
Indicator method
Creates an extra category for missing values in each incomplete, independent, categorical variable, so that all the observations are included in the analyses.

Original data:
ID  Outcome  Variable  Complete-case
1   5        0         1
2   4        .         0
3   4        1         1
4   3        .         0
5   4        1         1

Indicator method:
ID  Outcome  Variable  Complete-case
1   5        0         1
2   4        9         0
3   4        1         1
4   3        9         0
5   4        1         1
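The recoding shown in the table can be sketched as follows. The helper name `indicator_method` and the dictionary layout are illustrative assumptions; the code 9 for the extra category is the one used in the table.

```python
MISSING_CODE = 9  # extra category used in the table for missing values

rows = [
    {"id": 1, "outcome": 5, "variable": 0},
    {"id": 2, "outcome": 4, "variable": None},
    {"id": 3, "outcome": 4, "variable": 1},
    {"id": 4, "outcome": 3, "variable": None},
    {"id": 5, "outcome": 4, "variable": 1},
]

def indicator_method(data, column, missing_code=MISSING_CODE):
    """Recode missing values of a categorical covariate as their own category."""
    recoded = []
    for r in data:
        new = dict(r)  # copy, so the original rows are left untouched
        new["complete_case"] = 0 if r[column] is None else 1
        if r[column] is None:
            new[column] = missing_code
        recoded.append(new)
    return recoded

im = indicator_method(rows, "variable")
print([(r["id"], r["variable"], r["complete_case"]) for r in im])
```

All five rows stay in the analysis; the covariate now has three categories (0, 1 and 9).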
Simple imputation methods
The information collected in the sample is used to assign a single value to each missing value.
• Simple mean imputation replaces each missing observation with the mean of the completers.
• Regression mean imputation replaces each missing observation with the value predicted by a regression model.
• Random or stochastic regression imputation creates each imputed value by adding an appropriate random residual to the value predicted by regression mean imputation.
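The three simple imputation methods can be compared on a toy example. This is a sketch under stated assumptions: one fully observed predictor `x`, one incomplete variable `y`, a simple linear regression, and normally distributed residuals. The data values and the function names `fit_line` and `impute` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5

def impute(x, y, method, rng=None):
    """Fill the None entries of y, using the fully observed x where needed."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs, ys = zip(*observed)
    mean_y = sum(ys) / len(ys)
    a, b, sd = fit_line(xs, ys)
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)                      # observed values are kept
        elif method == "mean":
            filled.append(mean_y)                  # completers' mean
        elif method == "regression":
            filled.append(a + b * xi)              # predicted value
        elif method == "stochastic":
            filled.append(a + b * xi + rng.gauss(0, sd))  # prediction + residual
        else:
            raise ValueError(f"unknown method: {method}")
    return filled

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

mean_filled = impute(x, y, "mean")
reg_filled = impute(x, y, "regression")
stoch_filled = impute(x, y, "stochastic", random.Random(1))
print(mean_filled, reg_filled, stoch_filled, sep="\n")
```

Mean imputation assigns the same value to both gaps regardless of `x`; regression imputation places the imputed points exactly on the fitted line; the stochastic version scatters them around it.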
Simple imputation methods
Solution: multiple imputation
Multiple imputation methods
Imputation techniques that assign several imputed values to each missing value using the following procedure:
• The missing values in the dataset are imputed M times, giving imputed datasets 1, 2, 3, ..., M.
• The final model is fitted to each imputed dataset, giving estimators 1, 2, 3, ..., M.
• The M estimators are combined into the final estimator.
• The total variance is the sum of the within-imputation variance and the between-imputation variance, corrected for a finite number of imputations M.
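The combination step is commonly done with Rubin's rules, where the total variance is T = W + (1 + 1/M) B, with W the within-imputation variance and B the between-imputation variance. A minimal sketch follows; the function name `pool_rubin` and the three example estimates and variances are hypothetical and are not results from the talk.

```python
def pool_rubin(estimates, variances):
    """Combine M point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled estimate
    within = sum(variances) / m                               # within-imputation variance W
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance B
    total = within + (1 + 1 / m) * between                    # finite-M correction on B
    return q_bar, total

# Hypothetical results from M = 3 imputed datasets.
estimate, total_variance = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(estimate, total_variance)
```

The factor (1 + 1/M) shrinks towards 1 as the number of imputations grows, so the correction matters most when M is small.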
Multiple Imputation by Chained Equations (MICE)
A particular multiple imputation technique that imputes missing values in several variables under the MAR assumption. Logistic, multinomial or ordered regression can be used instead of linear regression for non-normal variables.
[Figure: missing values in X1, X2 and X3]
Multiple imputation: the complete process is repeated m times.
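The chained-equations idea can be sketched with two incomplete continuous variables. This is a deliberately simplified illustration, not the algorithm of any particular package: it uses only two variables, simple linear regression, and a random residual without drawing the regression parameters, which a full implementation would also do. The data values and the names `chained_imputation` and `mice_sketch` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5

def chained_imputation(x1, x2, rng, n_cycles=10):
    """Build one imputed dataset for two incomplete variables (None = missing)."""
    cols = [list(x1), list(x2)]
    miss = [[v is None for v in col] for col in cols]
    n = len(x1)
    # Starting point: fill every gap with the mean of the observed values.
    for col, flags in zip(cols, miss):
        obs = [v for v in col if v is not None]
        mean = sum(obs) / len(obs)
        for i in range(n):
            if flags[i]:
                col[i] = mean
    # Cycle through the variables: regress the target on the other variable
    # using rows where the target was observed, then redraw its missing values.
    for _ in range(n_cycles):
        for target in (0, 1):
            other = 1 - target
            xs = [cols[other][i] for i in range(n) if not miss[target][i]]
            ys = [cols[target][i] for i in range(n) if not miss[target][i]]
            a, b, sd = fit_line(xs, ys)
            for i in range(n):
                if miss[target][i]:
                    cols[target][i] = a + b * cols[other][i] + rng.gauss(0, sd)
    return cols

def mice_sketch(x1, x2, m=5, seed=0):
    """Repeat the complete process m times to obtain m imputed datasets."""
    rng = random.Random(seed)
    return [chained_imputation(x1, x2, rng) for _ in range(m)]

x1 = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
x2 = [1.2, None, 3.1, 3.9, None, 6.2, 7.1, 7.8]
datasets = mice_sketch(x1, x2)
for d in datasets:
    print([round(v, 2) for v in d[0]])
```

Each of the m datasets keeps the observed values and differs only in the imputed ones; that variation is what feeds the between-imputation variance when the estimators are combined.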
Other advanced methods
• Maximum likelihood estimation simultaneously models the outcome and the reason why data are missing.
• Bayesian methods estimate a statistical model for the full data (including the missingness mechanism and the outcome).
• Inverse probability weighting calculates, for each patient, the predicted probability that a certain variable is observed and uses the inverse of these probabilities as weights in the outcome model.
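Inverse probability weighting can be illustrated with a toy example in which the probability of being observed is estimated within the levels of a fully observed covariate. This is a sketch under assumptions: in practice the probabilities usually come from a regression model, and the data, the group labels and the function name `ipw_mean` are hypothetical.

```python
def ipw_mean(x, y):
    """Inverse-probability-weighted mean of y, where None marks a missing y."""
    # Estimated probability of y being observed within each level of x.
    p_obs = {}
    for level in set(x):
        in_level = [yi for xi, yi in zip(x, y) if xi == level]
        p_obs[level] = sum(yi is not None for yi in in_level) / len(in_level)
    # Each complete case is weighted by 1 / P(observed | x).
    num = sum(yi / p_obs[xi] for xi, yi in zip(x, y) if yi is not None)
    den = sum(1 / p_obs[xi] for xi, yi in zip(x, y) if yi is not None)
    return num / den

x = ["a", "a", "a", "a", "b", "b", "b", "b"]
y = [1, 1, 1, 1, 3, 3, None, None]

observed = [yi for yi in y if yi is not None]
naive = sum(observed) / len(observed)   # complete-case mean
weighted = ipw_mean(x, y)
print(naive, weighted)
```

Group "b" is half missing, so its two observed patients each count twice; the weighted mean therefore reflects both groups equally, while the complete-case mean is pulled towards group "a".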
Real-world data case
Different Approaches to Account for Missing Data in a Cohort of HIV-Positive Patients
Objective: to compare three different methods to deal with missing data in both the outcome (cause of death) and the covariates in a cohort of HIV-positive patients (CoRIS).
• CoRIS (N = 10,469)
• Cancer mortality
• Poisson regression: mortality rates and rate ratios for the effect of hepatitis C virus coinfection
• Methods compared: complete-case, indicator method, MICE