
Big Data Meets DNA: How Biological Data Science is improving our health, foods, and energy needs

Michael Schatz, April 8, 2014, IEEE Fellows Night, Syracuse (@mike_schatz)


  1. Big Data Meets DNA How Biological Data Science is improving our health, foods, and energy needs Michael Schatz April 8, 2014 IEEE Fellows Night Syracuse @mike_schatz

  2. The secret of life
     Your DNA, along with your environment and experiences, shapes who you are
     • Height
     • Hair, eye, skin color
     • Broad/narrow, small/large features
     • Susceptibility to disease
     • Response to drug treatments
     • Longevity and Intelligence
     Physical traits tend to be strongly genetic, social characteristics tend to be strongly environmental, and everything else is a combination

  3. Cells & DNA Each cell of your body contains an exact copy of your 3 billion base pair genome. Your specific nucleotide sequence encodes the genetic program for your cells and ultimately your traits

  4. The Origins of DNA Sequencing
     Sanger et al. (1977) Nature
     1st Complete Organism: Bacteriophage φX174; 5375 bp
     Radioactive Chain Termination
     5000 bp / week / person
     Awarded Nobel Prize in 1980
     http://en.wikipedia.org/wiki/File:Sequencing.jpg
     http://www.answers.com/topic/automated-sequencer

  5. Milestones in DNA Sequencing Applied Biosystems Sanger Sequencing 768 x 1000 bp reads / day = ~1Mbp / day (TIGR/Celera, 1995-2001)

  6. Cost per Genome http://www.genome.gov/sequencingcosts/

  7. Massively Parallel Sequencing
     Illumina HiSeq 2000, Sequencing by Synthesis, >60 Gbp / day
     1. Attach
     2. Amplify
     3. Image
     Metzker (2010) Nature Reviews Genetics 11:31-46
     http://www.youtube.com/watch?v=l99aKKHcxC4
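
The throughput figures quoted on slides 3 to 7 can be compared with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes a single pass over a 3 billion base pair genome with no redundancy, which understates the work of a real project, and the helper name `days_for_one_pass` is made up for this example.

```python
# Figures quoted on the slides.
GENOME_BP = 3_000_000_000          # slide 3: 3 billion base pair genome
CAPILLARY_READS_PER_DAY = 768      # slide 5: Sanger capillary instrument
CAPILLARY_READ_LENGTH_BP = 1_000   # slide 5: 1000 bp per read
HISEQ_BP_PER_DAY = 60_000_000_000  # slide 7: ">60 Gbp / day" (lower bound)


def days_for_one_pass(bp_per_day: float, genome_bp: float = GENOME_BP) -> float:
    """Days needed to read genome_bp bases once (assumes 1x coverage only)."""
    return genome_bp / bp_per_day


capillary_bp_per_day = CAPILLARY_READS_PER_DAY * CAPILLARY_READ_LENGTH_BP
capillary_days = days_for_one_pass(capillary_bp_per_day)
hiseq_days = days_for_one_pass(HISEQ_BP_PER_DAY)

print(f"Capillary Sanger: {capillary_bp_per_day:,} bp/day, {capillary_days:,.0f} days per pass")
print(f"HiSeq 2000:       {HISEQ_BP_PER_DAY:,} bp/day, {hiseq_days * 24:.1f} hours per pass")
```

Under these assumptions a single capillary instrument needs more than ten years for one pass over a human-sized genome, while the massively parallel instrument needs a little over an hour.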

  8. Genomics across the tree of life

  9. Unsolved Questions in Biology
     • What is your genome sequence?
     • How does your genome compare to my genome?
     • Where are the genes and how active are they?
     • How does gene activity change during development?
     • How does splicing change during development?
     • How does methylation change during development?
     • How does chromatin change during development?
     • How is your genome folded in the cell?
     • Where do proteins bind and regulate genes?
     • What viruses and microbes are living inside you?
     • How do your mutations relate to disease?
     • What drugs should we give you?
     • Plus hundreds and hundreds more
     The instruments provide the data, but not the answers to any of these questions. What software and systems will?

  10. Quantitative Biology Technologies
      Results
      Domain Knowledge
      Machine Learning: classification, modeling, visualization & data integration
      Scalable Algorithms: streaming, sampling, indexing, parallel
      Compute Systems: CPU, GPU, distributed, clouds, workflows
      IO Systems: hard drives, networking, databases, compression, LIMS
      Sensors & Metadata: sequencers, microscopy, imaging, mass spec, metadata & ontologies

  12. Sequencing Centers
      Worldwide capacity exceeds 15 Pbp/year (25 Pbp/year as of Jan 15)
      Next Generation Genomics: World Map of High-throughput Sequencers
      http://omicsmaps.com

  13. How much is a petabyte?
      Unit        Size (bytes)
      Byte        1
      Kilobyte    1,000
      Megabyte    1,000,000
      Gigabyte    1,000,000,000
      Terabyte    1,000,000,000,000
      Petabyte    1,000,000,000,000,000
      *Technically a kilobyte is 2^10 bytes and a petabyte is 2^50 bytes
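
The footnote on slide 13 contrasts the decimal units in the table with the binary definitions. A minimal sketch of the difference, using only integer arithmetic (the function names are invented for this example):

```python
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def decimal_bytes(step: int) -> int:
    """Decimal size of the step-th prefix: 1000**step bytes."""
    return 10 ** (3 * step)


def binary_bytes(step: int) -> int:
    """Binary size of the step-th prefix: 1024**step bytes."""
    return 2 ** (10 * step)


for step, prefix in enumerate(PREFIXES, start=1):
    ratio = binary_bytes(step) / decimal_bytes(step)
    print(f"{prefix}byte: 10^{3 * step} vs 2^{10 * step} (binary is {ratio:.3f}x larger)")
```

The gap widens with each prefix: a binary kilobyte is about 2.4% larger than 1,000 bytes, while 2^50 bytes is about 12.6% larger than the 10^15 bytes shown in the table.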

  14. How much is a petabyte?
      100 GB / Genome
      4.7 GB / DVD
      ~20 DVDs / Genome
      x 10,000 Genomes = 1 PB Data
      200,000 DVDs: 787 feet of DVDs, ~1/6 of a mile tall
      500 2 TB drives: $500k
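
The petabyte figures on slide 14 can be reproduced from the per-genome numbers. This is a rough sketch: the 1.2 mm disc thickness is the standard DVD thickness and is an assumption added here (it is not stated on the slide), and `storage_footprint` is a hypothetical helper.

```python
GENOME_BYTES = 100 * 10**9     # slide: 100 GB / genome
DVD_BYTES = 4.7 * 10**9        # slide: 4.7 GB / DVD
DVDS_PER_GENOME = 20           # slide: ~20 DVDs / genome (100 / 4.7 is about 21.3)
DRIVE_BYTES = 2 * 10**12       # slide: 2 TB drives
DVD_THICKNESS_M = 1.2e-3       # assumption: standard 1.2 mm disc, not from the slide
METERS_PER_FOOT = 0.3048


def storage_footprint(n_genomes: int) -> dict:
    """Total bytes, DVD count, DVD stack height in feet, and 2 TB drive count."""
    total_bytes = n_genomes * GENOME_BYTES
    dvds = n_genomes * DVDS_PER_GENOME
    return {
        "total_bytes": total_bytes,
        "dvds": dvds,
        "stack_feet": dvds * DVD_THICKNESS_M / METERS_PER_FOOT,
        "drives": total_bytes // DRIVE_BYTES,
    }


footprint = storage_footprint(10_000)
print(footprint)
print(f"Exact DVDs per genome: {GENOME_BYTES / DVD_BYTES:.1f}")
```

With 10,000 genomes this gives 10^15 bytes, 200,000 DVDs, a stack of roughly 787 feet, and 500 drives, matching the slide.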

  15. DNA Data Tsunami
      Current world-wide sequencing capacity is growing at ~3x per year!
      [Chart: petabytes per year, 2014-2018 (y-axis 0 to 1400); ~1 exabyte by 2018]

  16. DNA Data Tsunami
      Current world-wide sequencing capacity is growing at ~3x per year!
      [Chart: exabytes per year, 2014-2024 (y-axis 0 to 900); ~1 exabyte by 2018, ~1 zettabyte by 2024]
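
The shape of the two "DNA Data Tsunami" charts follows from compounding the ~3x annual growth. The sketch below assumes a 2014 baseline of 15 petabytes per year, taken from the "15 Pbp/year" capacity on slide 12 and treating one base pair as roughly one byte. That conversion is an assumption made for illustration, not something the slides state.

```python
def project_capacity(start_year: int, start_pb: int, growth: int, end_year: int) -> dict:
    """Compound start_pb (petabytes/year) by `growth` once per year."""
    return {
        year: start_pb * growth ** (year - start_year)
        for year in range(start_year, end_year + 1)
    }


# Assumed baseline: 15 PB/year in 2014, growing 3x per year.
projection = project_capacity(start_year=2014, start_pb=15, growth=3, end_year=2024)

for year, petabytes in projection.items():
    print(f"{year}: {petabytes:>9,} PB/year ({petabytes / 1000:,.1f} EB/year)")
```

Under this assumed baseline the projection reaches about 1,200 PB (1.2 exabytes) per year in 2018 and about 886 exabytes per year in 2024, which is in line with the "~1 exabyte by 2018" and "~1 zettabyte by 2024" labels on the charts.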

  17. How much is a zettabyte?
      Unit        Size (bytes)
      Byte        1
      Kilobyte    1,000
      Megabyte    1,000,000
      Gigabyte    1,000,000,000
      Terabyte    1,000,000,000,000
      Petabyte    1,000,000,000,000,000
      Exabyte     1,000,000,000,000,000,000
      Zettabyte   1,000,000,000,000,000,000,000

  18. How much is a zettabyte?
      100 GB / Genome
      4.7 GB / DVD
      ~20 DVDs / Genome
      x 10,000,000,000 Genomes = 1 ZB Data
      200,000,000,000 DVDs: 150,000 miles of DVDs, ~ ½ distance to moon
      Both currently ~100Pb, but growing exponentially
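
The zettabyte comparison on slide 18 can be checked the same way. Two values below are assumptions that do not appear on the slide: the standard 1.2 mm DVD thickness and a mean Earth-Moon distance of 384,400 km.

```python
GENOME_BYTES = 100 * 10**9       # slide: 100 GB / genome
DVDS_PER_GENOME = 20             # slide: ~20 DVDs / genome
DVD_THICKNESS_M = 1.2e-3         # assumption: standard 1.2 mm disc
METERS_PER_MILE = 1609.344
EARTH_MOON_M = 384_400_000       # assumption: mean Earth-Moon distance

n_genomes = 10**10               # slide: 10,000,000,000 genomes
total_bytes = n_genomes * GENOME_BYTES
dvds = n_genomes * DVDS_PER_GENOME
stack_m = dvds * DVD_THICKNESS_M

stack_miles = stack_m / METERS_PER_MILE
moon_fraction = stack_m / EARTH_MOON_M

print(f"Total data: 10^{len(str(total_bytes)) - 1} bytes")
print(f"DVDs: {dvds:,}")
print(f"Stack height: {stack_miles:,.0f} miles ({moon_fraction:.2f} of the way to the Moon)")
```

With these inputs the stack comes to just under 150,000 miles, or roughly 0.6 of the mean Earth-Moon distance, the same order as the slide's "~ ½ distance to moon".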

  19. Sequencing Centers 2014 Next Generation Genomics: World Map of High-throughput Sequencers http://omicsmaps.com

  20. Sequencing Centers 2024 Next Generation Genomics: World Map of High-throughput Sequencers http://omicsmaps.com

  21. Biological Sensor Network
      Oxford Nanopore
      DC Metro via the LA Times
      The rise of a digital immune system. Schatz, MC, Phillippy, AM (2012) GigaScience 1:4

  22. Data Production & Collection
      Expect massive growth in sequencing and other biological sensor data over the next 10 years
      • Exascale biology is certain, zettascale on the horizon
      • Compression helps, but need to aggressively throw out data
      • Requires careful consideration of the “preciousness” of the sample
      Major data producers concentrated in hospitals, universities, agricultural companies, research institutes
      • Major efforts in human health and disease, agriculture, bioenergy
      But also widely distributed mobile sensors
      • Schools, offices, sports arenas, transportation centers, farms & food distribution centers
      • Monitoring and surveillance, as ubiquitous as weather stations
      • The rise of a digital immune system?

  23. Quantitative Biology Technologies
      Results
      Domain Knowledge
      Machine Learning: classification, modeling, visualization & data integration
      Scalable Algorithms: streaming, sampling, indexing, parallel
      Compute Systems: CPU, GPU, distributed, clouds, workflows
      IO Systems: hard drives, networking, databases, compression, LIMS
      Sensors & Metadata: sequencers, microscopy, imaging, mass spec, metadata & ontologies

  24. Sequencing Centers 2024

  25. Informatics Centers 2024
      The cloud?
      The DNA Data Deluge. Schatz, MC and Langmead, B (2013) IEEE Spectrum. July, 2013

  26. Informatics Centers 2014
      The DNA Data Deluge. Schatz, MC and Langmead, B (2013) IEEE Spectrum. July, 2013

  27. DOE Systems Biology Knowledgebase http://kbase.us: Predictive Biology in Microbes, Plants, and Meta-communities

  28. Personal Genomics How does your genome compare to the reference? Heart Disease Cancer Creates magical technology

  29. MUMmerGPU
      http://mummergpu.sourceforge.net
      • Map many reads simultaneously on GPU
        • Find matches by walking the tree
        • Find coordinates with depth first search
      • Performance on nVidia GTX 8800
        • Match kernel was ~10x faster than CPU
        • Search kernel was ~4x faster than CPU
        • End-to-end runtime ~4x faster than CPU
      • Cores are only part of the solution.
        • Need fast storage & IO
        • Locality is king
      High-throughput sequence alignment using Graphics Processing Units. Schatz, MC, Trapnell, C, Delcher, AL, Varshney, A. (2007) BMC Bioinformatics 8:474.
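
The two steps named on the MUMmerGPU slide, walking the tree to find a match and then a depth first search to recover coordinates, can be illustrated on the CPU with a toy structure. The sketch below is not MUMmerGPU's implementation: it uses a naive, quadratic-size suffix trie built from nested dictionaries rather than a compact suffix tree, runs on one read at a time, and only reports exact full-length matches. The function names and the example reference are invented for this illustration.

```python
def build_suffix_trie(reference: str) -> dict:
    """Insert every suffix of `reference`; the "$" key stores the suffix start.

    Naive O(n^2) construction, suitable only for tiny examples.
    """
    root: dict = {}
    for start in range(len(reference)):
        node = root
        for base in reference[start:]:
            node = node.setdefault(base, {})
        node["$"] = start  # marks the end of the suffix that began at `start`
    return root


def find_occurrences(trie: dict, read: str) -> list:
    """Walk the trie along `read`, then depth first search for all positions."""
    node = trie
    for base in read:               # step 1: find the match by walking the tree
        if base not in node:
            return []
        node = node[base]

    positions = []
    stack = [node]                  # step 2: depth first search below the match
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("GATTACAGATTACA")
print(find_occurrences(trie, "ATTA"))
print(find_occurrences(trie, "GGG"))
```

Each read only touches the part of the tree along its own path, which is why many reads can be matched independently and in parallel; it is also why memory locality matters, since those paths scatter across the structure.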

  30. Crossbow
      http://bowtie-bio.sourceforge.net/crossbow/
      • Align billions of reads and find SNPs
        – Reuse software components: Hadoop Streaming
        – Mapping with Bowtie, SNP calling with SOAPsnp
      • 4 hour end-to-end runtime including upload
        – Costs $85; today's costs <$10
      • Very compelling example of cloud computing in genomics
      • Commercial vendors probably have better security than your institution
      • Need more applications!
      Searching for SNPs with Cloud Computing. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Genome Biology 10:R134
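
The map, sort, and reduce pattern that Hadoop Streaming provides can be mimicked in memory to show how read alignment and SNP calling split into independent steps. This is a toy sketch under strong assumptions: reads arrive already aligned (Crossbow uses Bowtie for that step), and the reducer is a simple majority vote rather than SOAPsnp's statistical model. The reference string, thresholds, and function names are all invented for illustration.

```python
from collections import Counter
from itertools import groupby

REFERENCE = "ACGTACGTAC"  # hypothetical reference sequence

# Hypothetical (read, alignment position) pairs; position 4 carries an A>G change.
ALIGNED_READS = [("GTGC", 2), ("TGCG", 3), ("GCGT", 4)]


def map_read(read: str, position: int):
    """Mapper: emit one (reference position, observed base) pair per base."""
    for offset, base in enumerate(read):
        yield position + offset, base


def reduce_position(position: int, bases: list, min_depth: int = 3, min_fraction: float = 0.8):
    """Reducer: majority-vote call at one position; returns None if no SNP."""
    base, count = Counter(bases).most_common(1)[0]
    is_supported = len(bases) >= min_depth and count / len(bases) >= min_fraction
    if is_supported and base != REFERENCE[position]:
        return position, REFERENCE[position], base
    return None


def call_snps(aligned_reads: list) -> list:
    """Run map, then sort by key (the shuffle step), then reduce per position."""
    pairs = sorted(kv for read, pos in aligned_reads for kv in map_read(read, pos))
    calls = []
    for position, group in groupby(pairs, key=lambda kv: kv[0]):
        call = reduce_position(position, [base for _, base in group])
        if call is not None:
            calls.append(call)
    return calls


snps = call_snps(ALIGNED_READS)
print(snps)
```

Because each mapper sees only its own reads and each reducer sees only one position, both stages can be spread across as many machines as are available, which is what makes the approach a natural fit for rented cloud clusters.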

  31. Genomics Algorithms
      De novo Assembly
      Differential Analysis
      Phylogeny, Evolution, and Modeling

  32. Compute & Algorithmic Challenges
      Expect to see many dozens of major informatics centers that consolidate regional / topical information
      • Clouds for Cancer, Autism, Heart Disease, etc
      • Plus many smaller warehouses down to individuals
      • Move the code to the data
      Parallel hardware and algorithms are required
      • Expect to see >1000 cores in a single computer
      • Compute & IO needs to be considered together
      • Rewriting efficient parallel software is complex and expensive
      Applications will shift from individuals to populations
      • Read mapping & assembly fade out
      • Population analysis and time series analysis fade in
      • Need for network analysis, probabilistic techniques
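
The advice to "move the code to the data" is easy to motivate with transfer-time arithmetic. The link speeds below are hypothetical examples chosen for illustration, and the calculation ignores protocol overhead unless an efficiency factor is supplied.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(n_bytes: float, link_bits_per_second: float, efficiency: float = 1.0) -> float:
    """Days needed to move n_bytes over a link at the given usable fraction."""
    seconds = n_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


PETABYTE = 10**15

# Hypothetical link speeds: 1 Gb/s and 10 Gb/s.
for label, bits_per_second in [("1 Gb/s", 1e9), ("10 Gb/s", 10e9)]:
    print(f"1 PB over {label}: {transfer_days(PETABYTE, bits_per_second):.1f} days")
```

Even on a fully utilized 10 Gb/s link, a single petabyte takes more than nine days to move, whereas an analysis program is typically a few megabytes and ships in seconds.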

  33. Quantitative Biology Technologies
      Results
      Domain Knowledge
      Machine Learning: classification, modeling, visualization & data integration
      Scalable Algorithms: streaming, sampling, indexing, parallel
      Compute Systems: CPU, GPU, distributed, clouds, workflows
      IO Systems: hard drives, networking, databases, compression, LIMS
      Sensors & Metadata: sequencers, microscopy, imaging, mass spec, metadata & ontologies
