EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases) - PowerPoint PPT Presentation

  1. EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases)
     Po-Hsiang (Barnett) Chiu

  2. Phenotypes and phenotyping
     • Physically observable traits of genotypes (and their interactions with environments)
     • Biochemical or physiological properties, behavior, and products of behavior
     • Attributions of diseases (e.g. susceptibility)
     • Diseases (and disease subtypes)

  3. Data-Driven Phenotyping
     • Data-driven phenotyping
       – Two main methodologies
         • Rule-based approach (e.g. eMerge, https://emerge.mc.vanderbilt.edu)
         • Predictive analytics
       – Data sources:
         • EHRs/EMRs: medicinal treatments, diagnoses, lab measurements, etc.
         • Genomic data: SNP arrays, copy number variations (CNVs), etc.
       – Phenotypes
         • Diseases, subtypes, or variables attributed to disease predictions

  4. Diagnostic Concept Units
     • Various diseases share the same set of diagnostic concept units
     • Infectious diseases
       – Lab tests
         • Microorganism, blood, urine, body tissues, stool
       – Medications
         • Antibiotic, antiviral, anthelmintic
     • Build statistical models for each diagnostic component and combine them appropriately
       – Ensemble learning

  5. Bulk Learning in a Nutshell …
     Bulk learning is a batch-phenotyping framework that uses multiple diseases collectively (i.e. the bulk learning set) as a substrate for model learning and evaluation. A given medical ontology is used to perform feature selection, and model stacking is used to construct an abstract feature representation of low sample complexity in order to reduce training requirements.
     Key concepts:
     1. Build phenotyping models on top of multiple diseases
     2. Automatic feature selection using an existing ontology
     3. Models are combined via model stacking (a form of ensemble learning)
     4. Abstract features for dimensionality reduction
     5. Less labeled data required for model evaluations

  6. Phenotyping via Bulk Learning
     • Under model stacking, we arrive at the notion of “concept-driven phenotyping”
       – A subset or combination of lab tests is more attributable to some diseases, while others are better explained by medications
     • In this study, infectious diseases associated with 100 ICD-9 codes serve as the domain of study for bulk learning
       – For simplicity, consider different diagnostic codes as different diseases …
       – Why 100 codes?
       – Code selection strategy?

  7. Bulk Learning Basics I
     • Addresses two central issues in the predictive analytics approach to computational phenotyping
       – Feature engineering
         • Medical ontology for feature decomposition
         • Medical Entities Dictionary (http://med.dmi.columbia.edu)
       – Data annotation
         • Ensemble learning (e.g. stacked generalization [Wolpert 1992])
         • Feature abstraction for dimensionality reduction

  8. Medical Ontology for Grouping Features
     • Snapshot of the Medical Entities Dictionary (http://med.dmi.columbia.edu)
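As a rough illustration of ontology-driven feature grouping, the sketch below partitions raw EHR feature names into diagnostic concept units. The lookup table is a hand-made stand-in, not the actual Medical Entities Dictionary hierarchy, and the feature names and the `group_features` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for an ontology lookup: raw feature -> concept unit.
# A real system would derive this mapping from the ontology's hierarchy.
TOY_ONTOLOGY = {
    "urine_culture": "urine_test",
    "urine_wbc": "urine_test",
    "blood_culture": "blood_test",
    "wbc_count": "blood_test",
    "gram_stain": "microbiology",
    "ceftriaxone": "antibiotic",
    "vancomycin": "antibiotic",
}


def group_features(feature_names, ontology):
    """Partition feature names by concept unit; unknown features go to 'other'."""
    groups = defaultdict(list)
    for name in feature_names:
        groups[ontology.get(name, "other")].append(name)
    return dict(groups)


groups = group_features(
    ["urine_wbc", "ceftriaxone", "gram_stain", "wbc_count", "heart_rate"],
    TOY_ONTOLOGY,
)
```

Each resulting group could then feed its own base model, which is one way to read the "feature decomposition" step.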

  9. Model Stacking
     • Why inspect multiple (infectious) diseases?
       – Use multiple diseases as a substrate and identify their common elements
       – Example stacking architecture (under the stacked generalization method)
     [Figure: example stacking architecture]
       Level 2. Attributes: level-1 probabilities and ICD-9. Target: true labels (gold standard)
       Level 1. Attributes: level-0 probabilities and indicators. Target: diagnostic codes (silver standard)
       Level 0. Urinary chemistry measure, microbiology measure, intravenous chemistry measure, antibiotic measure, other phenotypic measures (e.g. antiviral)
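The first two tiers of such an architecture can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the data are synthetic, every concept unit gets two made-up binary features, both tiers are simple logistic regressions trained on a surrogate label, and level-0 probabilities are produced out-of-fold, as stacked generalization prescribes. The third tier, trained on true labels, is omitted.

```python
import math
import random

GROUPS = ["microbiology", "antibiotic", "blood_test", "urine_test"]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def make_toy_data(n, seed=0):
    """Synthetic patients: two binary features per group, noisily tied to the label."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append({g: [float(rng.random() < (0.7 if y else 0.3)) for _ in range(2)]
                     for g in GROUPS})
        labels.append(y)
    return rows, labels


def level0_out_of_fold(rows, labels, k=2):
    """Level-1 attributes: each group's out-of-fold probability for every row."""
    n = len(rows)
    meta = [[0.0] * len(GROUPS) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        for gi, g in enumerate(GROUPS):
            model = fit_logistic([rows[i][g] for i in train],
                                 [labels[i] for i in train])
            for i in held:
                meta[i][gi] = predict_proba(model, rows[i][g])
    return meta


rows, icd9 = make_toy_data(200)               # icd9 plays the surrogate label
meta_features = level0_out_of_fold(rows, icd9)
level1_model = fit_logistic(meta_features, icd9)
# Base models refit on all data are used when scoring new patients.
base_models = {g: fit_logistic([r[g] for r in rows], icd9) for g in GROUPS}


def predict(row):
    meta = [predict_proba(base_models[g], row[g]) for g in GROUPS]
    return predict_proba(level1_model, meta)
```

The level-1 model sees only four probabilities per patient instead of the raw features, which is the denser, lower-dimensional representation the framework relies on.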

  10. Surrogate Labels vs True Labels
      • Model stacking is used to:
        – Improve upon base model performance
        – Transform EHR data into a denser form
      • Uses diagnostic codes (e.g. ICD-9) as surrogate labels to establish “approximate predictive models.”
      • Why surrogate labels (e.g. ICD-9)?
        – The feature set extracted from EHRs can be large
        – Used to derive a compact representation of the training data
        – “Free” supervised signals that are sufficiently close but can be obtained without extra work
      • Objective: build statistical models in the abstract feature space
        – Create a sparse annotation set (i.e. gold standard) that serves as a proxy dataset for downstream model evaluations
        – 83 annotated cases

  11. Four Example Base Models
      [Figure: raw features feed four logistic units (microbiology m1, antibiotic a1, blood test b1, urine test u1), whose outputs are combined by a level-2 global model (global2)]

  12. Performance Evaluations
      • How well does the model predict ICD-9 codes (using separate test data)?
      • How well does the model predict annotated data (associated with “true labels”)?
        – The (binarized) ICD-9 code becomes a candidate feature among the abstract features (e.g. probability scores, indicators)
      • The annotated sample consists of randomly selected cases in which ICD-9 coding errors are corrected
      • Data annotation and coding procedures are two independent processes
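One way to picture the second evaluation is sketched below: the binarized ICD-9 flag is appended to the abstract features, and predictions are scored against annotated labels. The helper names and the tiny example values are made up for illustration and are not results from the study.

```python
def binarize_icd9(patient_codes, target_code):
    """Return 1 if the target ICD-9 code was assigned to the patient, else 0."""
    return 1 if target_code in patient_codes else 0


def top_level_row(abstract_features, patient_codes, target_code):
    """Abstract features (e.g. probability scores) plus the binarized ICD-9 flag."""
    return list(abstract_features) + [binarize_icd9(patient_codes, target_code)]


def accuracy(predicted, truth):
    """Fraction of cases where the prediction matches the annotated label."""
    matches = sum(int(p == t) for p, t in zip(predicted, truth))
    return matches / len(truth)


# Illustrative values only: one patient coded with 053.9 (Herpes zoster).
row = top_level_row([0.9, 0.2], {"053.9", "009.1"}, "053.9")

# How often the coded flag alone agrees with hypothetical annotated labels.
coded_flags = [1, 0, 1, 1]
true_labels = [1, 0, 0, 1]
baseline = accuracy(coded_flags, true_labels)
```

Comparing a model's accuracy on the annotated cases against this coded-flag baseline indicates whether the abstract features add anything beyond the ICD-9 code itself.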

  13. Base Level Performances

  14. • 127.4 Enterobiasis
      • 009.1 Gastroenteritis
      • ...
      • 117.9 Mycoses
      • 047.8 (Other) viral meningitis
      • 053.9 Herpes zoster

  15. Other Components
      • Semi-supervised learning and virtual annotation set
      • The 3rd tier in the model stacking hierarchy
        – Trade-off between learned abstract features and the ICD-9 codes as surrogate labels
        – Performance evaluation on predicting annotated labels
      • Ontology-based feature engineering
      • Proper design of treatment and control (training) data

  16. Modeling Perspective
      • EHR data consist of observations and latent variables
        – Observations can be obtained directly via simple queries
          • Did the patient have tests for E. coli?
          • Did the patient take ceftriaxone?
      • Latent variables represent quantities that cannot be directly observed in the EHR or computed via simple queries
        – Does the patient have an infection?
        – Diagnostic question: specifically, which infections does the patient have?
      • Learn classifiers to predict latent variables (with access only to observations)

  17. Medical Perspective
      • Seemingly different infectious diseases may share similar sets of lab tests and medications
        – Staph. aureus
          • Skin infections, pneumonia, blood poisoning
        – Ceftriaxone
          • Meningitis
          • Infections at different sites of the body (e.g. bloodstream, lungs, urinary tract)
      • Multiple classifiers for the same disease
        – 4 classifiers per ICD-9 code, each of which is a binary classifier
          • 400 classifiers at the base level
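The base-level count follows from pairing every ICD-9 code with every concept unit. The short sketch below enumerates those pairs; the code identifiers are placeholders, since the actual 100 codes are not listed in the slides, and the four unit names are taken from the four example base models.

```python
from itertools import product

CONCEPT_UNITS = ["microbiology", "antibiotic", "blood_test", "urine_test"]

# Placeholder identifiers: the actual 100 ICD-9 codes are not listed in the slides.
ICD9_CODES = [f"code_{i:03d}" for i in range(100)]

# One binary base classifier per (code, concept unit) pair.
base_classifier_keys = list(product(ICD9_CODES, CONCEPT_UNITS))

# Group the keys by code to confirm each code gets four classifiers.
units_per_code = {}
for code, unit in base_classifier_keys:
    units_per_code.setdefault(code, []).append(unit)

print(len(base_classifier_keys))  # 100 codes x 4 units
```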

  18. Data Distribution Perspective
      “Can we build a joint model applicable to all diseases?”

  19. Abstract Feature Representation: Design Choices
      • Related work in constructing high-level features
        – PCA, unsupervised feature learning, manifold learning, etc.
      • Design choices
        – Data characteristics
        – Interpretability
      • Deep neural network
        – Linear combination
        – Non-linear transformation (e.g. sigmoid, rectifier, etc.)
      • Feature set: continuous, dense, and “homogeneous”
        – Image pixels
        – Time series of lab measurements
        – word2vec
      • EHR data, however, are very different
        – Sparse and incomplete
        – Consist of many different types (binary, categorical, continuous, etc.)
        – Features associated with multiple concepts

  20. Moving Forward …
      • Summary
        – Bulk learning is a framework with at least the following system choices
          • The bulk learning set (of target conditions) => base models
          • Classification algorithms (guideline: probabilistic, well-calibrated classifiers)
          • Stacking architecture (multiple tiers => levels of abstraction)
          • Strategy for combining individual (local) disease models into a global model
        – Advantage: a small annotated sample can be used for model construction and evaluation within the abstract feature space (e.g. level-1 data)
          • 83 clinical cases were labeled in this study
        – Challenge: the model involving the interaction between abstract features and ICD-9 does not generalize well to regions of the data where the ICD-9 coding is incorrect
          • Multiple types of surrogate labels
      • Ongoing and future work
        – Complex decision boundary?
        – Other surrogate labels
        – Semi-supervised learning
        – Active learning
      [Figure: per-disease level-1 outputs (m1, a1, b1, u1) feed local level-2 models (local2), which are combined into a global level-2 model (global2)]

  21. References
      [1] D.H. Wolpert, Stacked generalization, Neural Networks. 5 (1992) 241–259.
      [2] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
      [3] J. Chen, C. Wang, R. Wang, Using Stacked Generalization to Combine SVMs in Magnitude and Shape Feature Spaces for Classification of Hyperspectral Data, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2193–2205.
      [4] David Baorto, James Cimino, et al. Available: http://med.dmi.columbia.edu. Access date: Oct 20, 2016.
      [5] T.A. Lasko, J.C. Denny, M.A. Levy, Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data, PLoS One. 8 (2013) e66341.

  22. Thank You

  23. [Figure: at Level 0, raw features (f11, f12, …, f1j; f21, …; f31, …; f41, …) pass through four logistic units; their outputs form the Level 1 features m1 (Microbiology), a1 (Antibiotic), b1 (Blood test), and u1 (Urine test)]

  24. Example Features
