Obtaining phenotype and outcome data from EHRs Josh Denny, MD MS Vanderbilt University Medical Center 3/26/2018
EHR data are dense and efficient for discovery: Vanderbilt’s experience (BioVU)
[Figure: Vanderbilt biobank enrollment and EHR data from the Vanderbilt biobank, with the BioVU start marked]
eMERGE Goals: • To perform genomic studies using the EHR • To implement genomic medicine
Making text documents useful for research
Inputs: billing codes; clinical notes, test reports, etc
Example note:
CC: SOB
HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.
Processing steps:
• Deidentify: remove HIPAA identifiers + …. Output for the research EHR: CC: SOB HPI: Mr. **jones** is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.
• Customized classifiers (smoking status, etc)
• Medication extraction. Structured output:
  DrugName: atenolol
  Strength: 50 mg
  Frequency: daily
• Find biomedical concepts and qualifiers; create structured data: certainty (positive, negated); who experienced it (patient or family member?). Structured output:
  chief_complaint: Shortness of Breath
  history_present_illness: Congestive Heart Failure; Type 2 diabetes, negated
  mother_medical_history: rheumatoid arthritis
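The medication extraction step can be illustrated with a toy example. This is only a minimal sketch, assuming a single regular expression is enough for the note text on the slide; the pattern, the function name `extract_medications`, and the list of frequency words are hypothetical, and the actual system behind the slide is not described here.

```python
import re

# Hypothetical pattern: a drug name, a numeric strength with a unit,
# and a frequency word. Real clinical NLP needs far more than this.
MED_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<strength>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\s+"
    r"(?P<frequency>daily|bid|tid|qid)",
    re.IGNORECASE,
)


def extract_medications(note):
    """Return one structured record per medication mention found in the note."""
    results = []
    for match in MED_PATTERN.finditer(note):
        results.append({
            "DrugName": match.group("drug").lower(),
            "Strength": f"{match.group('strength')} {match.group('unit').lower()}",
            "Frequency": match.group("frequency").lower(),
        })
    return results


note = "HPI: Mr. Smith is a 65yo w/ h/o CHF, no dm2, on atenolol 50mg daily. Mother had RA."
print(extract_medications(note))
```

The field names mirror the structured output on the slide (DrugName, Strength, Frequency). Negation ("no dm2") and experiencer ("Mother had RA") need separate handling that this sketch does not attempt.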
Finding a “simple” disease in the EHR: Who has hypertension?
Definition: SBP > 140 or DBP > 90
[Figure: Patient 1, who doesn’t have hypertension; Patient 2, who has hypertension]
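The textbook definition is easy to write down as a rule. The sketch below applies it to a list of readings; the function name and the readings are hypothetical, not data from the slide. It shows why the rule alone is fragile: one elevated reading flags a patient, while a treated patient with controlled pressures is missed.

```python
def naive_hypertension(readings, sbp_cutoff=140, dbp_cutoff=90):
    """Flag a patient if any (SBP, DBP) reading exceeds either cutoff."""
    return any(sbp > sbp_cutoff or dbp > dbp_cutoff for sbp, dbp in readings)


# Hypothetical readings, in mmHg.
one_high_reading = [(118, 76), (152, 88), (121, 79)]    # e.g. a single stressful visit
controlled_on_meds = [(128, 82), (131, 84), (126, 80)]  # treated, so never above cutoff

print(naive_hypertension(one_high_reading))    # True
print(naive_hypertension(controlled_on_meds))  # False
```

This is consistent with the point of the next slide, that multiple components work better than blood pressure alone.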
Our “simple” example: Hypertension Multiple components are better (and blood pressure is the worst) Teixeira, JAMIA 2016
What we learned - Finding phenotypes in the EHR
True cases are found by combining:
• Billing codes (ICD9 & CPT)
• Clinical notes (NLP - natural language processing)
• Medications (ePrescribing & NLP)
• Labs & test results (NLP)
Algorithm Development and Implementation:
Identify phenotype of interest → Case & control algorithm development and refinement → Manual review; assess precision → if <95%, return to algorithm development and refinement; if ≥95%, deploy in BioVU → Genetic association tests
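The review loop in the workflow amounts to a precision check against the 95% bar. Here is a minimal sketch, assuming manual chart review yields a true/false verdict for each algorithm-identified case; the function names and the review counts are hypothetical.

```python
def precision(review_verdicts):
    """Fraction of algorithm-called cases that manual review confirmed."""
    if not review_verdicts:
        raise ValueError("no reviewed records")
    return sum(review_verdicts) / len(review_verdicts)


def next_step(review_verdicts, threshold=0.95):
    """Deploy when precision meets the threshold; otherwise keep refining."""
    return "deploy" if precision(review_verdicts) >= threshold else "refine"


# Hypothetical review of 50 charts: 48 confirmed, 2 not.
verdicts = [True] * 48 + [False] * 2
print(precision(verdicts), next_step(verdicts))
```

Precision only measures how many called cases are real; it says nothing about cases the algorithm misses, which is why controls get their own algorithm in the workflow.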
Early discovery science in eMERGE – Hypothyroidism Algorithms can be deployed across multiple EHRs Analyses can be performed using extant data Am J Hum Genet. 2011;89:529-42
GWAS of QRS Duration in eMERGE n=5,272 SCN5A/SCN10A Ritchie et al., Circulation 2013
What happens in the “heart healthy” population?
• Examined the n=5272 “heart healthy” population
• Followed for development of atrial fibrillation based on genotype
[Figure: atrial fibrillation-free survival by genotype (AA, AG, GG) against years since normal ECG (and no heart disease); HR=1.49 per G allele, p=0.001]
Ritchie et al., Circulation 2013
EHRs for drug response: Clopidogrel adverse events associated with CYP2C19
[Figure panels: from clinical trials; from the EHR. Curves: normal metabolizers vs. carriers. N=807, P=0.005]
Mega et al., NEJM 2009
Delaney et al. Clin Pharm Ther. 2012
Deep learning for Diabetic Retinopathy Train a machine learning algorithm over >128k images Gulshan et al. JAMA 2016
Phenome scanning (PheWAS) in the EHR
• A phenotype → dense genomic information → associated genotypes
• A genetic variant → the curated EHR-based phenome → associated phenotypes
Replications of GWAS associations via PheWAS
[Figure panels: Binary traits; Continuous traits]
P-value for replication:
• All - 210/751: 2×10⁻⁹⁸
• Powered - 51/77: 3×10⁻⁴⁷
Nat Biotech 2013; 31:1102-1111
PheWAS across all HLA types (n= 37,270) Karnes et al, Sci Trans Med 2017
The potential for “call back” deeper phenotyping: Long QT genes (SCN5A and KCNH2) in 2,200 sequenced patients in eMERGE
• 83 rare (MAF < 1%) in SCN5A, 45 in KCNH2
• 121/128 MAF < 0.5%, 92 singletons
• Three labs assessed known/likely pathogenicity
[Venn diagram: Lab 1, 16/121; Lab 2, 24/121; Lab 3, 17/121; 4 in the overlap]
Van Driest et al, JAMA 2016
Calculating a Phenotype Risk Score (PheRS)
OMIM features (feature 1, feature 2, ..., feature k) are mapped through the Human Phenotype Ontology to EHR phenotypes.
For each record i, generate PheRS:
$$\mathrm{PheRS}_i = \sum_{j=1}^{k} w_j \, x_{ij}$$
• $\mathrm{PheRS}_i$: score for subject i
• Sum: add up terms for k phenotypes
• $w_j$: weight for phenotype j, derived from entire EHR
• $x_{ij}$: 0 = phenotype j absent, 1 = phenotype j present
Repeat this for all Mendelian diseases
Bastarache et al, Science 2018
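The score is a weighted count of the disease's phenotypes that appear in a subject's record. The sketch below is one way to compute it, not the authors' implementation. The slide only says the weight is derived from the entire EHR; taking it as the negative log of each phenotype's prevalence is an assumption here, and the prevalence values and function names are hypothetical.

```python
import math


def phenotype_weights(prevalence):
    """Assumed weighting: rarer phenotypes count more, w = -log(prevalence)."""
    return {name: -math.log(p) for name, p in prevalence.items()}


def phers(subject_phenotypes, disease_phenotypes, weights):
    """Sum the weights of the disease's phenotypes present in the subject's record."""
    return sum(weights[p] for p in disease_phenotypes if p in subject_phenotypes)


# Hypothetical EHR-wide prevalences, for illustration only.
prevalence = {"bronchiectasis": 0.01, "pneumonia": 0.10, "asthma": 0.10}
weights = phenotype_weights(prevalence)
disease = ["bronchiectasis", "pneumonia", "asthma"]

print(phers({"bronchiectasis", "pneumonia"}, disease, weights))
print(phers({"hypertension"}, disease, weights))
```

A subject with none of the disease's phenotypes scores zero, and phenotypes outside the disease's feature list do not contribute.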
Example: a phenotype risk score in Cystic Fibrosis
Group                  CF cases                  CF controls
Age/Sex                18F   26M   29F   29M     18F   26M   29F   29M
Phenotype Risk Score   9.8   4.4   6.3   7.8     2.5   0.7   0.0   0.7
Phenotypes shown: Chronic airway obstruction; Pneumonia; Diseases of pancreas; Hypovolemia; Acute upper respiratory infections; Asthma; Bronchiectasis; Intestinal malabsorption; Hepatomegaly; Acute pulmonary heart disease
Bastarache et al, Science 2018
PheRS identified potentially pathogenic SNVs N=21k on exome chip 6k SNVs Bastarache et al, Science 2018
The All of Us Research Program – Breaking Down Data Silos
Overview of the All of Us approach and protocol
Participants: Direct Volunteers; Health Care Provider Organizations
Data types: EHR data; Health Surveys; Baseline measurements; Biospecimens; Smartphones & Wearables
Multiple data types linked together by semantic standards
All of Us will aggregate data from many sources
Sources: From Direct Volunteers; From Healthcare Provider Orgs; Data added centrally by DRC
• Version 1 (2018): Sync for Science; Billing codes, Meds, Visits, Labs; Death Index; Claims & Rx Data; …
• Version 2: Health data aggregators (PicnicHealth); Clinical Notes & Reports; Clinical Messaging
• Much longer term: Geospatial data; Local Registries; Images
Also collected: Participant provided data (Health surveys, activity monitors, etc); Participant exams and biospecimens
Flow: Raw Data Repository → Curated Data Repository → APIs, Analysis tools, etc
Sync 4 Science (S4S) – a technology to share health data S4S: - FHIR-based - Starting with MU Common Clinical Data set S4S Pilot Sites
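Because S4S is FHIR-based, shared records arrive as FHIR resources, typically JSON. As a rough illustration of what consuming one looks like, the sketch below reads the systolic and diastolic values from a blood pressure Observation using the standard LOINC component codes. The example resource and the function name are hypothetical, and nothing here reflects how the S4S pilot sites actually deliver data.

```python
import json

# LOINC codes for the two blood pressure components.
SYSTOLIC = "8480-6"
DIASTOLIC = "8462-4"

# Hypothetical example resource; real payloads carry many more fields.
observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {"coding": [{"system": "http://loinc.org", "code": "85354-9"}]},
  "component": [
    {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
     "valueQuantity": {"value": 150, "unit": "mmHg"}},
    {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
     "valueQuantity": {"value": 95, "unit": "mmHg"}}
  ]
}
"""


def blood_pressure(observation):
    """Return (systolic, diastolic); a part is None if the resource lacks it."""
    values = {}
    for component in observation.get("component", []):
        for coding in component.get("code", {}).get("coding", []):
            if coding.get("system") == "http://loinc.org":
                values[coding.get("code")] = component.get("valueQuantity", {}).get("value")
    return values.get(SYSTOLIC), values.get(DIASTOLIC)


obs = json.loads(observation_json)
print(blood_pressure(obs))
```

Using coded components rather than free-text labels is what lets the same parsing logic run against data from different provider systems.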
Data Access is centralized in All of Us
Traditional Approach: Bring data to researchers (data download from public repository)
Problems:
• Data sharing = data copying
• Security (data handoffs)
• Huge infrastructure needed
• Siloed compute
AoU Approach: Bring researchers to the data (public cloud)
Advantages:
• Cost
• Threat detection and auditing
• Increased Accessibility
• Shared compute
The power of a data biosphere of common semantics and APIs
Obtaining phenotype and outcome data from e-health records and digital platforms: the experience of UK Biobank Cathie Sudlow Professor of Neurology and Clinical Epidemiology Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh Director of Health Data Research UK Scottish substantive site Chief Scientist, UK Biobank International Cohorts Summit, Durham, North Carolina March 2018
UK Biobank in a nutshell
• 500,000 UK men and women aged 40-69 years when recruited during 2006-2010
• Consent for all types of health research by both academic and commercial researchers
• Extensive baseline questions and physical measures, with biological samples stored for future assays
• Subsequent enhancements in all or large subsets of participants:
– Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)
– Sample assays in all or large subsets:
  Complete: genome-wide genotyping; biochemistry panel
  Underway/planned: exome and whole genome sequencing; proteomics; infectious disease assays; stool microbiome
– Multimodal imaging of 100,000 (>22,000 so far)
– Web questionnaires
• Comprehensive, long term follow-up for a wide range of health-related outcomes
• Open access for approved research: see www.ukbiobank.ac.uk
Follow-up of participants in very large prospective cohorts
Aim: identify a wide range of incident diseases and other health related outcomes
Active methods requiring participant re-engagement
• face to face reassessment
• postal or web-based surveys
• expensive
• prone to incomplete coverage & selective loss to follow-up
• miss cases emerging between assessments
Passive methods via linkages to national health records
• can follow all participants without need for re-engagement
• efficient and cost effective
• need adequate consent at recruitment
• rely on universal healthcare system & availability of relevant datasets
• can only detect cases of disease diagnosed in a healthcare setting
• data need to be accurate and sufficiently detailed for research studies
Web questionnaires
• Using email and web questionnaires
– for more detailed assessment of exposures
– and to obtain information on outcomes that cannot be obtained through linking to health records
• Of 350,000 with email, >150,000 complete each questionnaire
– Details of dietary intake
– Cognitive function
– Mental health (thoughts and feelings)
– Gastrointestinal symptoms
Useful for following change over time…but beware selective attrition