a missing value tour

A missing value tour. Julie Josse, Ecole Polytechnique, INRIA. 26 January 2020, Workshop of the Applied Machine Learning Days 2020, Lausanne.


  1. A missing value tour. Julie Josse, Ecole Polytechnique, INRIA. 26 January 2020, Workshop of the Applied Machine Learning Days 2020, Lausanne.

  2. Overview
     1. Introduction
     2. Handling missing values (inferential framework)
     3. Supervised learning with missing values
     4. Discussion and challenges

  3. Introduction

  4. Collaborators
     • PhD students and postdocs: W. Jiang, M. Le Morvan, I. Mayer, G. Robin (former), A. Sportisse
     • Colleagues: C. Boyer (LPSM), G. Bogdan (Wroclaw), F. Husson (Agrocampus; package missMDA), J-P Nadal (EHESS), E. Scornet (X), G. Varoquaux (INRIA), S. Wager (Stanford)
     • Traumabase (hospital): T. Gauss, S. Hamada, J-D Moyer / Capgemini

  5. Traumabase
     • 20000 patients
     • 250 continuous and categorical variables: heterogeneous
     • 11 hospitals: multilevel data
     • 4000 new patients / year

     Center   Accident  Age  Sex  Weight  Lactates  BP   Shock  ...
     Beaujon  fall      54   m    85      NM        180  yes
     Pitie    gun       26   m    NR      NA        131  no
     Beaujon  moto      63   m    80      3.9       145  yes
     Pitie    moto      30   w    NR      Imp       107  no
     HEGP     knife     16   m    98      2.5       118  no
     ...

  6. Traumabase (same data description and table as the previous slide)
     ⇒ Estimate causal effect: administration of the treatment "tranexamic acid" (within 3 hours after the accident) on the outcome mortality for traumatic brain injury patients

  7. Traumabase (same data description and table as the previous slide)
     ⇒ Predict the risk of hemorrhagic shock given pre-hospital features
     Ex: random forests / logistic regression with covariates with missing values

  8. Missing values
     Multilevel data / data integration: systematic missing variable in one hospital
     [Figure: bar chart of the percentage of missing values per variable (Acide.tranexamique, ..., Cause.du.DC), broken down by type of missingness: Impossible, Not Applicable, Not made, Not Informed, NA]

  9. Complete-case analysis
     ?lm, ?glm, na.action = na.omit
     "One of the ironies of Big Data is that missing data play an ever more significant role" (R. Samworth, 2019)
     An n × p matrix, each entry is missing with probability 0.01:
     • p = 5 ⇒ ≈ 95% of rows kept
     • p = 300 ⇒ ≈ 5% of rows kept
     [Figure: bar chart of the percentage of missing values per variable in the Traumabase, by type of missingness]
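The row counts on this slide follow from a one-line calculation: if each of the p entries of a row is missing independently with probability 0.01, the row survives listwise deletion with probability (1 - 0.01)^p. The sketch below checks this both analytically and by simulation. It is written in Python with the standard library only (the slide itself points to R's `lm`/`glm`), and the function names, the sample size and the seed are illustrative assumptions, not part of the talk.

```python
import random


def expected_complete_fraction(p, miss_prob=0.01):
    """Probability that a row with p independently missing entries is fully observed."""
    return (1.0 - miss_prob) ** p


def simulated_complete_fraction(n, p, miss_prob=0.01, seed=0):
    """Share of fully observed rows in a simulated n x p missingness pattern."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n):
        # Listwise deletion keeps a row only if none of its p entries is missing.
        if all(rng.random() >= miss_prob for _ in range(p)):
            complete += 1
    return complete / n


for p in (5, 300):
    print(p, round(expected_complete_fraction(p), 3), simulated_complete_fraction(2000, p))
```

With p = 5 almost every row is kept, while with p = 300 only about one row in twenty remains, which is why complete-case analysis breaks down on wide tables.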

  10. Handling missing values (inferential framework)

  11. Solutions to handle missing values
      Books: Schafer (2002); Little & Rubin (2002); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2018), etc.
      Aim: estimate parameters and their variance from incomplete data ⇒ inferential framework
      Modify the estimation process to deal with missing values
      • Maximum likelihood: EM algorithm to obtain point estimates + Supplemented EM (Meng & Rubin, 1991) / Louis formula for their variability
      • Ex logistic regression: EM to get $\hat\beta$ + Louis to get $\hat V(\hat\beta)$

  12. Solutions to handle missing values (continued): modify the estimation process
      Cons: difficult to establish; little software is available, even for simple models; one specific algorithm is needed for each statistical method

  13. Solutions to handle missing values (continued)
      Imputation (multiple) to get a complete data set
      • Any analysis can be performed
      • Ex logistic regression: impute and apply the logistic model to get $\hat\beta$, $\hat V(\hat\beta)$
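The slide does not say how the results from several imputed data sets are combined into a single estimate and variance. One standard choice, assumed here and not confirmed by the slides, is Rubin's rules: average the M point estimates, and add the average within-imputation variance to (1 + 1/M) times the between-imputation variance. A minimal sketch in Python follows; the function name `pool_rubin` and the numbers are hypothetical and serve only as illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine M per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divides by M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical estimates of one logistic-regression coefficient and of its
# variance, as if obtained from M = 5 imputed data sets (made-up numbers).
beta_hats = [0.52, 0.47, 0.55, 0.50, 0.49]
var_hats = [0.010, 0.011, 0.009, 0.010, 0.012]

beta_pooled, var_pooled = pool_rubin(beta_hats, var_hats)
print(beta_pooled, var_pooled)
```

The pooled variance is larger than the average within-imputation variance, reflecting the extra uncertainty due to the missing values that a single imputation would ignore.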

  14. Mean imputation
      • $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{N}_2((\mu_x, \mu_y), \Sigma_{xy})$
      True values: $\mu_y = 0$, $\sigma_y = 1$, $\rho = 0.6$
      Estimates on the complete data: $\hat\mu_y = -0.01$, $\hat\sigma_y = 1.01$, $\hat\rho = 0.66$

      X      Y
      -0.56  -1.93
      -0.86  -1.50
      ...    ...
      2.16   0.7
      0.16   0.74

      [Figure: scatter plot of Y against X, complete data]

  15. Mean imputation (continued)
      • 70% of entries missing completely at random on Y
      True values: $\mu_y = 0$, $\sigma_y = 1$, $\rho = 0.6$
      Estimates on the observed data: $\hat\mu_y = 0.18$, $\hat\sigma_y = 0.9$, $\hat\rho = 0.6$

      X      Y
      -0.56  NA
      -0.86  NA
      ...    ...
      2.16   0.7
      0.16   NA

      [Figure: scatter plot of Y against X, observed points only]

  16. Mean imputation (continued)
      • Estimate parameters on the mean imputed data
      True values: $\mu_y = 0$, $\sigma_y = 1$, $\rho = 0.6$
      Estimates on the mean imputed data: $\hat\mu_y = 0.01$, $\hat\sigma_y = 0.5$, $\hat\rho = 0.30$

      X      Y
      -0.56  0.01
      -0.86  0.01
      ...    ...
      2.16   0.7
      0.16   0.01

      [Figure: scatter plot of Y against X after mean imputation; the imputed points lie on a horizontal line]
      Mean imputation deforms joint and marginal distributions
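The distortion reported on this slide can be reproduced with a small simulation. The sketch below draws a bivariate Gaussian sample with correlation 0.6, hides about 70% of Y completely at random, replaces the hidden values by the observed mean, and compares the standard deviation and the correlation before and after. The sample size, seed and helper names are assumptions made for illustration; the talk's own figures come from a different sample. Under these assumptions, with a fraction π of Y imputed, the standard deviation shrinks by roughly a factor sqrt(1 - π) and so does the correlation, which is in line with the values of about 0.5 and 0.30 shown on the slide.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def sd(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def corr(u, v):
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))


def simulate(n=5000, rho=0.6, miss_prob=0.7, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # y = rho * x + sqrt(1 - rho^2) * noise has unit variance and correlation rho with x.
    y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
    # MCAR mask: each y is hidden with the same probability, independently of x and y.
    observed = [rng.random() >= miss_prob for _ in range(n)]
    y_obs_mean = mean([b for b, o in zip(y, observed) if o])
    # Mean imputation: every hidden y is replaced by the mean of the observed y.
    y_imputed = [b if o else y_obs_mean for b, o in zip(y, observed)]
    return {
        "sd_full": sd(y),
        "corr_full": corr(x, y),
        "sd_imputed": sd(y_imputed),
        "corr_imputed": corr(x, y_imputed),
    }


result = simulate()
print({k: round(v, 2) for k, v in result.items()})
```

The mean of Y is preserved by construction, but its spread and its link with X are not, so any downstream estimate that relies on variances or covariances is biased.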
