Big Data Meets DNA How Biological Data Science is improving our health, foods, and energy needs Michael Schatz April 8, 2014 IEEE Fellows Night Syracuse @mike_schatz
The secret of life Your DNA, along with your environment and experiences, shapes who you are • Height • Hair, eye, skin color • Broad/narrow, small/large features • Susceptibility to disease • Response to drug treatments • Longevity and Intelligence Physical traits tend to be strongly genetic, social characteristics tend to be strongly environmental, and everything else is a combination
Cells & DNA Each cell of your body contains an exact copy of your 3 billion base pair genome. Your specific nucleotide sequence encodes the genetic program for your cells and ultimately your traits
The Origins of DNA Sequencing Sanger et al. (1977) Nature 1st Complete Organism Radioactive Chain Termination Bacteriophage φX174; 5375 bp 5000 bp / week / person http://en.wikipedia.org/wiki/File:Sequencing.jpg Awarded Nobel Prize in 1980 http://www.answers.com/topic/automated-sequencer
Milestones in DNA Sequencing Applied Biosystems Sanger Sequencing 768 x 1000 bp reads / day = ~1Mbp / day (TIGR/Celera, 1995-2001)
Cost per Genome http://www.genome.gov/sequencingcosts/
Massively Parallel Sequencing Illumina HiSeq 2000 Sequencing by Synthesis >60 Gbp / day 1. Attach 2. Amplify 3. Image Metzker (2010) Nature Reviews Genetics 11:31-46 http://www.youtube.com/watch?v=l99aKKHcxC4
Genomics across the tree of life
Unsolved Questions in Biology
• What is your genome sequence?
• How does your genome compare to my genome?
• Where are the genes and how active are they?
• How does gene activity change during development?
• How does splicing change during development?
• How does methylation change during development?
• How does chromatin change during development?
• How is your genome folded in the cell?
• Where do proteins bind and regulate genes?
• What viruses and microbes are living inside you?
• How do your mutations relate to disease?
• What drugs should we give you?
• Plus hundreds and hundreds more
The instruments provide the data, but not the answers to any of these questions. What software and systems will?
Quantitative Biology Technologies
• Results
• Domain Knowledge
• Machine Learning: classification, modeling, visualization & data integration
• Scalable Algorithms: Streaming, Sampling, Indexing, Parallel
• Compute Systems: CPU, GPU, Distributed, Clouds, Workflows
• IO Systems: Hard drives, Networking, Databases, Compression, LIMS
• Sensors & Metadata: Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies
Sequencing Centers Worldwide capacity exceeds 15 Pbp/year (25 Pbp/year as of Jan 15) Next Generation Genomics: World Map of High-throughput Sequencers http://omicsmaps.com
How much is a petabyte?
Unit       Size (bytes)
Byte       1
Kilobyte   1,000
Megabyte   1,000,000
Gigabyte   1,000,000,000
Terabyte   1,000,000,000,000
Petabyte   1,000,000,000,000,000
*Technically a kilobyte is 2^10 bytes and a petabyte is 2^50 bytes
How much is a petabyte?
• 100 GB / Genome
• 4.7 GB / DVD, so ~20 DVDs / Genome
• × 10,000 Genomes = 1 PB of data
• 200,000 DVDs: 787 feet of DVDs, ~1/6 of a mile tall
• 500 2 TB drives: $500k
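The figures on the petabyte slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch in decimal units: the genome size, DVD capacity, and genome count come from the slide, while the 1.2 mm disc thickness is an assumption that is not stated in the talk.

```python
# Back-of-the-envelope check of the petabyte slide (decimal units throughout).
GB = 10**9
TB = 10**12
PB = 10**15

GENOME_BYTES = 100 * GB      # from the slide: 100 GB per genome
DVD_BYTES = 4.7 * GB         # from the slide: 4.7 GB per DVD
DVDS_PER_GENOME = 20         # from the slide: rounded down from 100 / 4.7
DVD_THICKNESS_MM = 1.2       # ASSUMPTION: standard disc thickness, not on the slide
MM_PER_FOOT = 304.8
FEET_PER_MILE = 5280

genomes = 10_000
total_bytes = genomes * GENOME_BYTES
exact_dvds_per_genome = GENOME_BYTES / DVD_BYTES
total_dvds = genomes * DVDS_PER_GENOME
stack_feet = total_dvds * DVD_THICKNESS_MM / MM_PER_FOOT
drives_2tb = total_bytes / (2 * TB)

print(f"total data:      {total_bytes / PB:.1f} PB")
print(f"DVDs per genome: {exact_dvds_per_genome:.1f} (slide rounds to {DVDS_PER_GENOME})")
print(f"DVD stack:       {total_dvds:,} discs, {stack_feet:.0f} feet, "
      f"{stack_feet / FEET_PER_MILE:.2f} miles")
print(f"2 TB drives:     {drives_2tb:.0f}")
```

Under these assumptions the stack height and drive count agree with the slide. Changing `genomes` to 10,000,000,000 gives the zettabyte version of the same calculation.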
DNA Data Tsunami
Current world-wide sequencing capacity is growing at ~3x per year!
[Chart: petabytes per year, 2014 to 2018; ~1 exabyte by 2018]
DNA Data Tsunami
Current world-wide sequencing capacity is growing at ~3x per year!
[Chart: exabytes per year, 2014 to 2024; ~1 exabyte by 2018, ~1 zettabyte by 2024]
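The exabyte and zettabyte milestones follow from compounding the ~3x yearly growth. The sketch below assumes a 2014 starting point of 15 petabytes per year, borrowed from the 15 Pbp/year capacity figure on the Sequencing Centers slide; the chart's actual baseline is not stated, so treat the starting value (and the equating of base pairs with bytes) as an illustrative assumption.

```python
# Compound-growth projection behind the "DNA Data Tsunami" charts.
BASE_YEAR = 2014
BASE_PB_PER_YEAR = 15   # ASSUMPTION: taken from the ~15 Pbp/year capacity figure
GROWTH_PER_YEAR = 3     # from the slide: ~3x per year

PB_PER_EB = 1_000
PB_PER_ZB = 1_000_000


def projected_pb(year):
    """Projected output in petabytes per year, assuming steady 3x growth."""
    return BASE_PB_PER_YEAR * GROWTH_PER_YEAR ** (year - BASE_YEAR)


for year in range(BASE_YEAR, 2025):
    pb = projected_pb(year)
    print(f"{year}: {pb:>9,} PB = {pb / PB_PER_EB:>8.2f} EB = {pb / PB_PER_ZB:.3f} ZB")
```

With this baseline the projection passes 1 exabyte per year in 2018 and approaches 1 zettabyte per year in 2024, which matches the scale of both charts.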
How much is a zettabyte?
Unit       Size (bytes)
Byte       1
Kilobyte   1,000
Megabyte   1,000,000
Gigabyte   1,000,000,000
Terabyte   1,000,000,000,000
Petabyte   1,000,000,000,000,000
Exabyte    1,000,000,000,000,000,000
Zettabyte  1,000,000,000,000,000,000,000
How much is a zettabyte?
• 100 GB / Genome
• 4.7 GB / DVD, so ~20 DVDs / Genome
• × 10,000,000,000 Genomes = 1 ZB of data
• 200,000,000,000 DVDs: 150,000 miles of DVDs, ~½ distance to moon
• Both currently ~100Pb, but growing exponentially
Sequencing Centers 2014 Next Generation Genomics: World Map of High-throughput Sequencers http://omicsmaps.com
Sequencing Centers 2024 Next Generation Genomics: World Map of High-throughput Sequencers http://omicsmaps.com
Biological Sensor Network Oxford Nanopore DC Metro via the LA Times The rise of a digital immune system Schatz, MC, Phillippy, AM (2012) GigaScience 1:4
Data Production & Collection
Expect massive growth in sequencing and other biological sensor data over the next 10 years
• Exascale biology is certain, zettascale on the horizon
• Compression helps, but need to aggressively throw out data
• Requires careful consideration of the “preciousness” of the sample
Major data producers concentrated in hospitals, universities, agricultural companies, research institutes
• Major efforts in human health and disease, agriculture, bioenergy
But also widely distributed mobile sensors
• Schools, offices, sports arenas, transportation centers, farms & food distribution centers
• Monitoring and surveillance, as ubiquitous as weather stations
• The rise of a digital immune system?
Sequencing Centers 2024
Informatics Centers 2024 The cloud? The DNA Data Deluge. Schatz, MC and Langmead, B (2013) IEEE Spectrum. July, 2013
Informatics Centers 2014 The DNA Data Deluge. Schatz, MC and Langmead, B (2013) IEEE Spectrum. July, 2013
DOE Systems Biology Knowledgebase http://kbase.us: Predictive Biology in Microbes, Plants, and Meta-communities
Personal Genomics How does your genome compare to the reference? Heart Disease Cancer Creates magical technology
MUMmerGPU
http://mummergpu.sourceforge.net/
• Map many reads simultaneously on GPU
• Find matches by walking the tree
• Find coordinates with depth first search
• Performance on nVidia GTX 8800
• Match kernel was ~10x faster than CPU
• Search kernel was ~4x faster than CPU
• End-to-end runtime ~4x faster than CPU
• Cores are only part of the solution.
• Need fast storage & IO
• Locality is king
High-throughput sequence alignment using Graphics Processing Units. Schatz, MC, Trapnell, C, Delcher, AL, Varshney, A. (2007) BMC Bioinformatics 8:474.
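The two steps named on the MUMmerGPU slide, walking the tree to find a match and then a depth first search to recover coordinates, can be illustrated on a single CPU thread. The sketch below is not the MUMmerGPU code: it uses a naive, uncompressed suffix trie and reports only exact occurrences of the whole query, and the names `Node`, `build_suffix_trie`, and `find_occurrences` are hypothetical.

```python
class Node:
    """One node of a naive suffix trie (uncompressed, quadratic space)."""

    def __init__(self):
        self.children = {}
        self.start = None  # set on leaves: where this suffix begins in the reference


def build_suffix_trie(reference):
    """Insert every suffix of reference + '$' into the trie."""
    root = Node()
    text = reference + "$"  # terminator so that every suffix ends at its own leaf
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start
    return root


def find_occurrences(root, query):
    """Return the sorted reference positions where query occurs exactly."""
    # Step 1: walk the tree along the query, one character per edge.
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []  # the query fell off the tree: no match
    # Step 2: depth first search below the match node; every leaf in this
    # subtree is a suffix that begins with the query.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            positions.append(current.start)
        stack.extend(current.children.values())
    return sorted(positions)


trie = build_suffix_trie("GATTACAGATTA")
for read in ("ATTA", "TTAC", "CCC"):
    print(read, find_occurrences(trie, read))
```

Each query is independent of the others, which is what makes it natural to map many reads at once. The quadratic trie is chosen here only for brevity and would not scale to a real genome.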
Crossbow
http://bowtie-bio.sourceforge.net/crossbow
• Align billions of reads and find SNPs
– Reuse software components: Hadoop Streaming
– Mapping with Bowtie, SNP calling with SOAPsnp
• 4 hour end-to-end runtime including upload
– Costs $85; today's costs <$10
• Very compelling example of cloud computing in genomics
• Commercial vendors probably have better security than your institution
• Need more applications!
Searching for SNPs with Cloud Computing. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Genome Biology 10:R134
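The map, shuffle, and reduce pattern that Crossbow builds on can be shown with a toy, in-memory example. This is an illustrative sketch only: it does not call Hadoop, Bowtie, or SOAPsnp, the reads are assumed to arrive already aligned, the majority-vote rule stands in for a real statistical SNP caller, and every name and data value below is hypothetical.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTAC"  # hypothetical reference sequence

# Hypothetical input: (alignment offset, read sequence), already aligned.
ALIGNED_READS = [
    (0, "ACGTTC"),
    (2, "GTTCGT"),
    (3, "TTCGTAC"),
    (4, "ACGT"),
]


def map_read(offset, read):
    """Map step: emit one (reference position, observed base) pair per base."""
    for i, base in enumerate(read):
        yield offset + i, base


def shuffle(pairs):
    """Shuffle step: group the observed bases by reference position."""
    grouped = defaultdict(list)
    for position, base in pairs:
        grouped[position].append(base)
    return grouped


def reduce_position(position, bases, min_depth=3):
    """Reduce step: report a SNP when a strict majority disagrees with the reference."""
    if len(bases) < min_depth:
        return None  # too little coverage to make a call
    base, count = Counter(bases).most_common(1)[0]
    if base != REFERENCE[position] and count > len(bases) / 2:
        return position, REFERENCE[position], base
    return None


def call_snps(aligned_reads):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for offset, read in aligned_reads for pair in map_read(offset, read))
    grouped = shuffle(pairs)
    calls = (reduce_position(pos, bases) for pos, bases in sorted(grouped.items()))
    return [call for call in calls if call is not None]


print(call_snps(ALIGNED_READS))
```

Because the map step treats each read independently and the reduce step treats each position independently, both can be spread across many machines, which is the property the cloud deployment relies on.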
Genomics Algorithms
• De novo Assembly
• Differential Analysis
• Phylogeny, Evolution, and Modeling
Compute & Algorithmic Challenges
Expect to see many dozens of major informatics centers that consolidate regional / topical information
• Clouds for Cancer, Autism, Heart Disease, etc
• Plus many smaller warehouses down to individuals
• Move the code to the data
Parallel hardware and algorithms are required
• Expect to see >1000 cores in a single computer
• Compute & IO needs to be considered together
• Rewriting efficient parallel software is complex and expensive
Applications will shift from individuals to populations
• Read mapping & assembly fade out
• Population analysis and time series analysis fade in
• Need for network analysis, probabilistic techniques