Genomic Medicine Big Data: Production, Analysis and Management Liz Worthey Ph.D Assistant Professor Clinical and Translational Genomic Analysis Lab Human and Molecular Genetics Center Pediatric Genomics, Department of Pediatrics Medical College of Wisconsin
Human Genome Sequencing
First Human Genome Sequence
• 10 years to complete
• $2.7 billion
• 100s of ABI 3730 sequencing machines
• 3 Gbases data total
Single Human Genome Sequence now
• 8 days to complete
• $5,000
• 1 Illumina HiSeq2500 machine
• 100 Gbases per day
The cost to sequence one base has dropped 100 million fold
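The per-base figure can be sanity-checked from the numbers on this slide. This is a back-of-envelope sketch, not the speaker's calculation: it assumes the modern run produces 100 Gbases on each of the 8 days, and that the full cost is spread over every base produced.

```python
# Back-of-envelope check of the per-base cost drop, using the slide's figures.
# Assumption (not stated on the slide): 100 Gbases/day for all 8 days of the run.
first_cost_usd = 2.7e9        # first human genome
first_bases = 3e9             # 3 Gbases of data in total
now_cost_usd = 5_000          # single genome today
now_bases = 100e9 * 8         # 100 Gbases per day for 8 days

first_per_base = first_cost_usd / first_bases
now_per_base = now_cost_usd / now_bases
fold_drop = first_per_base / now_per_base

print(f"then: ${first_per_base:.2f} per base")
print(f"now:  ${now_per_base:.2e} per base")
print(f"fold drop: {fold_drop:.2e}")
```

Under these assumptions the drop comes out on the order of 10^8, consistent with the "100 million fold" claim.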
Many technologies exist
Analysis phases (primary, secondary, tertiary)
Assay design → NGS run → base calling (fastq) → lane-level QC → de-multiplex → alignment (bam, other) → QC metrics, thresholds → genotype calling (vcf, other): SUB, IND, CNV, SV → variants list (known, novel) → annotate → prioritize → classify → clinical report
Derived from figure from Birgit Funke
Challenge: A variety of events lead to production of many sequencing errors • Polymerase errors (~300,000 per genome) • De-Phasing – Occurs with step-wise addition methods when growing primers move out of synchronicity for any given cycle • Dark Nucleotides causing false deletion errors – The nucleotide does not contain a fluorescent label – Breakdown of dye-label nucleotides – Contamination with unlabeled nucleotides
Sequencing the same sample with known variants on different platforms produces different sequencing errors
• Illumina more likely to have error after ‘G’
• PCR-based methods miss GC- and AT-rich regions
• PolyA miscount errors for pyrosequencing methods
Courtesy of Chris Mason and SeQC Consortium
Secondary analysis – read mapping
Many methods are available: Which to use?
• Speed
• Accuracy
• Computational requirements
• Robustness
• Support
[Figure labels: MECP2, NEMO; variants p.R175P p.L227P p.D311N p.Q384X p.Q45X p.L80P p.V146G p.Q183H p.Q236X p.E315A p.E391X p.E57K p.Q86X p.L153R p.R217G p.Q239X p.R319Q p.D406V p.R62X p.Q98X p.R173G p.E240X p.A323P p.C417F p.R123W p.R256X p.A288G p.R359W p.X420W p.Q290X]
Some of these regions are of significant interest
[Figure: overlap of gaps, poorly covered regions, and clinically actionable genes in WGS]
Genes: CSH1, CES1, CR1, OCLN, SFTPA1, PMS2, RPS17, NCF1, HBA2, SMN1, SOX18, STRC, HEXB, KRT81, CCL3L1, OPN1MW, DAZ1, CFC1, CFHR3, CHST14, OPN1LW, SFTPA2, D2HGDH, HRAS, C4A, SLC7A9, EVC, IKBKG, DSG1, ATP8B1, C4B, FKTN, NEB, FCGR2C, CYP2R1, SSX4, FCGR2B, SLC6A8, SMN2, INPP5E, ABCC2, ANTXR2, CHRNA3, F13A1, F7, MME, OTOA, CFD, CYP2C19, LPA, ENAM, TFE3
>90 diseases, e.g. Amelogenesis imperfecta; Ehlers-Danlos syndrome, musculocontractural type; Systemic lupus erythematosus; Ellis-van Creveld syndrome; C4a deficiency
Strategies for spanning gaps • Some gaps are caused by overrepresentation of certain genomic regions in the sequence output, with underrepresentation of others • Sanger fill in – too expensive • WGS plus WES strategy? • Some gaps are caused by repeats that can’t be sequenced through or cause issues with mapping – add longer read data from a PacBio or ? • De Novo assembly?
Variant callers give different results on the same data
Concordance good for latest renditions of SNV callers, ok for indel callers, and poor for SVs
[Figure: Comparison of variants called from 32 HapMap genomes by Complete Genomics and 1K Genomes Project. Blue=CG, Red=1KGP. Various mapping and variant calling algorithms used in each analysis.]
Rosenfeld, Mason, and Smith. PLoS One. 2012.
Differences between sequencing/analysis pipelines are between 4 and 14% of variants per sample
Even calling of SNVs has issues Concordance between Illumina pipeline and BWA/GATK SNV calls Patient 2C2 Patient 2F Rosenfeld, Mason, and Smith. PLoS One. 2012.
But cannot simply exclude variants called by a single method
Transition mutations occur more frequently than transversions, thus: (Ti/Tv > 1)
Sequencing errors are more often transversions, thus: (Ti/Tv < 1)
• Elliott Margulies et al. at Illumina have shown that only 95% of high quality SNVs are called the same in 14 stringently performed replicates
• Solutions:
– Ongoing algorithm development
– Hold off until methods mature – we don’t yet “do” clinical SVs
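The Ti/Tv check mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not any particular caller's QC step; the function name `ti_tv_ratio` and the toy call sets are hypothetical. Each SNV is given as a (reference base, alternate base) pair.

```python
# Transitions swap purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv

# Hypothetical toy call set: 4 transitions and 2 transversions.
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy_calls))

# If every one of the 12 possible substitutions were equally likely (pure noise),
# 4 would be transitions and 8 transversions.
uniform_errors = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(ti_tv_ratio(uniform_errors))
```

A call set enriched for errors drifts toward the uniform value of 0.5, which is why a low Ti/Tv among method-specific calls is a warning sign rather than proof that all of them are wrong.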
Tertiary analysis Variant storage, analysis, prioritization, and reporting ~4,800,000 ~250,000 ~1,800 ~1,200 ~50 ~10 0-5
Analytical considerations/challenges on the variant impact axis • Existing WGS technologies produce many sequencing errors • Existing mapping algorithms in combination with short read technology give rise to many mapping errors • Bioinformatics limitations with variant calling (especially indels and SV) • Data is incorrect/outdated - Allele Frequencies • Data must be used appropriately - nucleotide or amino acid conservation scores • Data must be understood - SIFT, PolyPhen, Condel
e.g. Allele frequency data cannot be relied upon without review The allele frequency reported for a particular variant can vary widely amongst commonly used data sources These variants were not randomly selected – these variants are all associated with disease in HGMD Solution: compare multiple data sources and consider the sources in terms of possible source disease status as well as technological aspects etc.
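One way to apply "compare multiple data sources" is to flag variants whose reported frequencies disagree strongly before trusting any single value. The sketch below is an assumption-laden illustration: the function name, the 10-fold threshold, the source labels, and the frequencies are all hypothetical and do not come from any real database.

```python
def flag_discordant(freqs_by_source, max_fold=10.0):
    """Return True when the highest and lowest reported allele frequencies
    differ by more than max_fold. Sources with no value (None) are ignored."""
    values = [f for f in freqs_by_source.values() if f is not None]
    if len(values) < 2:
        return False              # nothing to compare against
    lowest, highest = min(values), max(values)
    if lowest == 0:
        return highest > 0        # absent in one source, present in another
    return highest / lowest > max_fold

# Hypothetical variants with made-up frequencies from three unnamed sources.
variant_a = {"source_1": 0.001, "source_2": 0.05, "source_3": None}
variant_b = {"source_1": 0.010, "source_2": 0.012, "source_3": 0.011}

print(flag_discordant(variant_a))  # sources disagree by 50-fold
print(flag_discordant(variant_b))  # sources broadly agree
```

A flagged variant still needs manual review of each source's cohort (disease status, ancestry) and technology, as the slide notes; the flag only prioritizes that review.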
Analytical considerations/challenges on the functional classification axis
• Many of the datasets being used were never intended to be used in the way we are currently applying them
• We estimate that up to ~8% of genotype to phenotype data in larger mutation databases is incorrect:
– Typos/human entry errors are common, e.g. the term “breast cancer” entered rather than “cervical cancer”
– Old/outdated: <70% of what is in PubMed is captured in DBs
– Many entries subsequently disproven or even retracted
– These are not seen as part of the DBs’ “job”
• It is important to understand the quality of the datasets and to curate/track corrections or in-house annotations
• Data sharing is hugely important
• Identifying a variant with a functional impact on a gene is not the same as identifying the cause of the disease
e.g. Many reported “rare mutations” are not so rare polymorphisms • ALG6 • GLB1 • HEXB • HESX1 • DPYD • ATP7B • NPHS1 • HGSNAT • HADHA • ETFB • LAMA2 • ACADM • GAA • NHLRC1 • ADA • FKTN • AHI1 • IGHMBP2 • MTHFR • WNT10A • AMPD1 • SERPINA1 • GALC • PMM2 • ATP7B • NPHS1 • MEFV • MPL • CDH23 • SLC26A2 • CYP21A2 • POLG • SBDS • MYO5A • ARSB • BTD • NEFL • DPYD • CPT1A • NTRK1 • HUGE curation task to ensure data is correct and up to date. • Currently – never believe anything – many hours of verification required • Data sharing will be critical to our cumulative success!
Challenge: Data storage Data storage costs currently ~$1,200 per year per patient per WGS (on high performance disks; without compression)
At 1 TB per genome
• 2013: 100,000 genomes at 1 TB each = 100 PetaBytes
• WGS for all new U.S. babies/year: 4,000,000 genomes = 4 ExaBytes
• WGS for all U.S. citizens over 50: 310,000,000 genomes = 310 ExaBytes
• WGS for all: 7,000,000,000 genomes = 7 ZettaBytes
At 400 GB per genome
• 2013: 100,000 genomes at 400 GB each = 40 PetaBytes
• WGS for all new U.S. babies/year: 4,000,000 genomes = 1.6 ExaBytes
• WGS for all U.S. citizens over 50: 310,000,000 genomes = 124 ExaBytes
• WGS for all: 7,000,000,000 genomes = 2.8 ZettaBytes
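The storage totals are genome count multiplied by per-genome size. The sketch below recomputes them for both per-genome sizes, assuming decimal (SI) units where each prefix is a factor of 1,000; the helper `human_bytes` and the cohort labels are illustrative names, not part of the talk.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_bytes(n_bytes):
    """Format a byte count with SI prefixes (1 KB = 1,000 B)."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

# Genome counts as given on the slides.
COHORTS = {
    "2013": 100_000,
    "all new U.S. babies/year": 4_000_000,
    "U.S. cohort of 310,000,000": 310_000_000,
    "everyone": 7_000_000_000,
}

PER_GENOME_BYTES = {"1 TB": 10**12, "400 GB": 400 * 10**9}

for size_label, size in PER_GENOME_BYTES.items():
    print(f"At {size_label} per genome")
    for cohort, genomes in COHORTS.items():
        print(f"  {cohort}: {human_bytes(genomes * size)}")
```

At 400 GB per genome every total is 0.4 times the corresponding 1 TB total, so the unit prefix only changes where the number crosses a factor of 1,000.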
How much data is this?
@400 GB/genome: All = 2.8 ZB
@1 TB/genome: All = 7 ZB
http://www.allthingstechnology.net/2011/07/how-much-byte-make-yottabyte.html
Solutions • Development of and agreement on sustainable retention guidelines • Compression? – By retaining only the data related to variant calls we can compress 50-100 fold. – But we need to keep in mind clinical use cases. • Transparency of data used to determine that no variant existed (no call)? • Transparency of data that shows that a particular region was not covered sufficiently to be sure of accuracy of the call – gap calls?
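To put the 50-100 fold compression figure in concrete terms, the sketch below applies it to the two per-genome sizes used earlier in the talk (1 TB and 400 GB). It is simple division and says nothing about which data would need to be retained for clinical use; the names are illustrative.

```python
def size_after_compression_gb(raw_gb, fold):
    """Per-genome size in GB after an idealized fold-compression."""
    return raw_gb / fold

RAW_SIZES_GB = {"1 TB genome": 1000, "400 GB genome": 400}
FOLDS = (50, 100)

retained = {
    (label, fold): size_after_compression_gb(raw_gb, fold)
    for label, raw_gb in RAW_SIZES_GB.items()
    for fold in FOLDS
}

for (label, fold), size_gb in retained.items():
    print(f"{label}, {fold}-fold: {size_gb:g} GB retained")
```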
Performing WGS in a clinically appropriate timeline
Data generation/analysis/reporting steps      HiSeq2000 (Current)   HiSeq2500 (RapidRun)
Barcode and Accessioning of samples           0.5                   0.5
DNA extraction and Sequencing                 289*                  33.5
  DNA Extraction                              2                     2
  Sequencing Lib Prep and Quant               77*                   4.5
  Verification and Sequencing                 210*                  27
Secondary analysis                            19                    19
Tertiary analysis                             5                     5
  Variant Annotation                          3                     3
  Production of Clinical Report, Delivery     2                     2
Interpretation and reporting                  11                    11
  Interpretation                              9                     9
  Preparation and review of final report      2                     2
Total (hours)                                 324.5                 69
* 3 samples processed simultaneously in these steps
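The step times on this slide can be summed to confirm the two totals and to see where the time goes. A minimal sketch using only the hours shown; the data layout and names are illustrative, and stages with no listed sub-steps are entered as a single step.

```python
# stage -> list of (sub-step, HiSeq2000 hours, HiSeq2500 RapidRun hours)
STAGES = {
    "Barcode and accessioning of samples": [("single step", 0.5, 0.5)],
    "DNA extraction and sequencing": [
        ("DNA extraction", 2, 2),
        ("Sequencing lib prep and quant", 77, 4.5),
        ("Verification and sequencing", 210, 27),
    ],
    "Secondary analysis": [("single step", 19, 19)],
    "Tertiary analysis": [
        ("Variant annotation", 3, 3),
        ("Production of clinical report, delivery", 2, 2),
    ],
    "Interpretation and reporting": [
        ("Interpretation", 9, 9),
        ("Preparation and review of final report", 2, 2),
    ],
}

def stage_hours(stages):
    """Return {stage: (HiSeq2000 hours, HiSeq2500 hours)} summed over sub-steps."""
    return {
        name: (sum(s[1] for s in subs), sum(s[2] for s in subs))
        for name, subs in stages.items()
    }

per_stage = stage_hours(STAGES)
total_2000 = sum(hours[0] for hours in per_stage.values())
total_2500 = sum(hours[1] for hours in per_stage.values())

for name, (h2000, h2500) in per_stage.items():
    print(f"{name}: {h2000} h vs {h2500} h")
print(f"Total: {total_2000} h vs {total_2500} h "
      f"({total_2000 / total_2500:.1f}x faster)")
```

All of the saving comes from the wet-lab and sequencing stage; analysis and interpretation take the same 35 hours on both instruments.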
Currently a $5,000 sequence, a $200 analysis, and a $1,250 interpretation
• Analysis/Interpretation:
– Bioinformatician: loading, processing, report generation - $200
– Clinical Geneticist: interpretation - $1,000
• Follow up:
– Technician: Sanger confirmation/analysis - $150
• Reporting:
– Analyst / Clinical geneticist: Report finalization - $100
Getting better but not good enough
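The line items above add up as follows. This sketch assumes the "$1,250 interpretation" in the slide title groups the clinical geneticist, Sanger follow-up, and report finalization items; that grouping is inferred from the arithmetic, not stated on the slide.

```python
# Cost per case in USD, as listed on the slide.
COSTS = {
    "sequencing": 5_000,
    "bioinformatician (analysis)": 200,
    "clinical geneticist (interpretation)": 1_000,
    "technician (Sanger confirmation)": 150,
    "report finalization": 100,
}

total = sum(COSTS.values())
analysis = COSTS["bioinformatician (analysis)"]
# Assumed grouping behind the "$1,250 interpretation" headline figure.
interpretation = (
    COSTS["clinical geneticist (interpretation)"]
    + COSTS["technician (Sanger confirmation)"]
    + COSTS["report finalization"]
)

print(f"total per case: ${total:,}")
print(f"analysis: ${analysis:,}; interpretation and follow-up: ${interpretation:,}")
print(f"sequencing share of total: {COSTS['sequencing'] / total:.1%}")
```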
Summary
• Capabilities
– We are a diverse group of researchers with varied backgrounds and expertise in genetics, genomics, physiology, computer science, clinical informatics, statistics, clinical research, clinical diagnostics, bioinformatics, etc.
• Needs
– Progress is being made but many challenges remain in the continued development of Genomic Medicine. In the informatics realm we have significant challenges in:
– Data storage and management, transfer and sharing
– Secondary analysis – reference genome, mapping, variant calling
– Tertiary analysis – annotation, prioritization, visualization
– Clinical interpretation – data mining, phenotype data extraction and analysis, curation of genotype-phenotype correlations
– Education – across the board
• Goals
– Collaboration to further our joint capabilities:
– Identification of disease cohorts for collaborations in translational research
– Collaborations to develop novel or reuse existing methodologies/algorithms addressing the challenges outlined
– Identification of areas of expertise for submission of joint grants
Acknowledgements • Families and Patients • Bioinformatics/Curation • CHW • Referring Physicians /Systems support • Juliet Kersten • MCW- HMGC • Brandon Wilk • Paula North • Howard Jacob • Jeremy Harris • Tara Schmit • David Bick • Wendy Demos • Jack Routes • David Dimmock • Arthur Weborg • Altheia Roquemore-Goins • Mary Shimoyama • George Kowalski • Gail Bernadi • Jill Northup • Weihong Jin • Michael Gutzeit • Jenny Guerts • Weisong Liu • Steven Leuthner • Brad Taylor • Jeff DePons • Rodney Willoughby • CHW Genetics Center • Sharon Tsaih • Thomas May • Regan Veith • Oliver Hummel • Robert Kliegman • Angela Pickart • Stacy Zacher • Funding/Support: • William Rhead • Marek Tutaj • MCW Children’s • AGEN-Seq Technicians • Greg McQuestion Research Institute • Mike Tschannen • Kent Brodie • Jeffrey Modell • Daniel Helbling • Stan Laulederkind Foundation • Brett Chirempes • Victoria Petri • Private Donors • Jayme Wittke • Jennifer Smith • Jamie Wendt-Andrae • Alex Stoddard • Pushkala Jayaraman
Genomics & Personalized Medicine: Analysis & Clinical Implementation Breakout Sessions 1 & 2
Genomics and Science Education Dr. Tim Herman Milwaukee School of Engineering Dr. Neil Lamb HudsonAlpha Institute for Biotechnology
Teachers FIRST From Interesting Research to Scientific Teaching An NIH Science Education Partnership Award (SEPA) project Tim Herman Center for BioMolecular Modeling Milwaukee School of Engineering
The MSOE Center for BioMolecular Modeling …an instructional materials development laboratory, …with a science education outreach mission.
The Problem….. … recent advances in the technologies that have delivered the promise of genomic and proteomic medicine to clinics have completely outstripped the education of the public who stand to benefit from this new science ….. … as well as the very health care professionals who suddenly find themselves in a position to make use of this new technology.
In the old days …..
Genes, Genomes and Personalized Medicine …. Molecular Stories ….
• Zinc Finger Nucleases and Genome Editing
• The CCR5 Gene and Resistance to HIV
• Cytochrome P450s and Pharmacogenomics
• Beery Twins Story – Sepiapterin Reductase
• Nic Volker Story – XIAP
http://cbm.msoe.edu/stupro/so/module2012/xiapHome.html
an online game about complex disease, risk assessment and prevention/treatment, funded through a 5-year NIH Science Education Partnership Award (SEPA) Neil E. Lamb, Ph.D. Director of Educational Outreach HudsonAlpha Institute for Biotechnology
What is Touching Triton?
• setting: preparing to launch the Argos 1 - a 20-year exploratory mission to Triton, a moon of Neptune
• goal: assess disease risks for the six-person crew and pack the ship with supplies that reduce risk and/or provide treatment
• game can be played in single-person or classroom format
• students’ activities and decisions are available to the teacher
• accompanying website with background information
Touching Triton Gameplay
select mission → choose crew member → determine risks (family history, genomic findings, medical records) → identify ‘combined disease risks’ → make packing recommendations → launch mission → observe outcomes
Crew Selection
Crew Member Dashboard
Medical Records
Genomic Data MACULAR DEGENERATION
Family History
Final Risk Assessment
Packing
Advisor support
About Risks
• grounded in research
• risk from genomic data - multiply odds ratios from each SNP, compare to population risk
• risks from medical records and family history selected using a slider bar
• overall risk assessment - slider bar
• objective: understand how students perceive risks in the context of complex disorders
• ‘consensus lifetime risk’ - determined from a survey of medical geneticists, genetic counselors and clinical researchers - teachers can compare this to student risk estimates
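The genomic risk step (multiply the odds ratios from each SNP, then compare to population risk) can be sketched as below. This is not the game's actual implementation: it assumes the SNPs act independently, treats the population risk as the baseline odds, and uses hypothetical numbers and a hypothetical function name.

```python
from math import prod

def combined_risk(population_risk, odds_ratios):
    """Scale the population odds by the product of per-SNP odds ratios
    and convert the result back to a probability."""
    if not 0 < population_risk < 1:
        raise ValueError("population_risk must be strictly between 0 and 1")
    baseline_odds = population_risk / (1 - population_risk)
    odds = baseline_odds * prod(odds_ratios)
    return odds / (1 + odds)

# Hypothetical example: 10% population lifetime risk, two risk SNPs.
population_risk = 0.10
snp_odds_ratios = [2.0, 1.5]

risk = combined_risk(population_risk, snp_odds_ratios)
print(f"population risk: {population_risk:.0%}, genomic estimate: {risk:.0%}")
```

Working in odds rather than probabilities keeps the estimate below 100% however many risk alleles are combined; multiplying a probability directly by odds ratios would not.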