
NGS Data Analysis: Methods and Protocols


  1. NGS Data Analysis: Methods and Protocols

  2. Shifting Paradigms
     - A thousand years ago: science was empirical, describing natural phenomena
     - Last few hundred years: a theoretical branch, using models and generalizations
     - Last few decades: a computational branch, simulating complex phenomena
     - Today: data exploration (eScience), unifying theory, experiment, and simulation
       - Data captured by instruments or generated by simulators
       - Processed by software
       - Information/knowledge stored in computers
       - Scientists analyze databases/files using data management and statistics

     Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009

     NGS Data Analysis Training Workshop - Part I, 11/11/2015

  3. Big Data Biology
     - The term “Big Data” is not only about size:
       - Speed
       - Volume
       - Computational and analytical capacity to manage data and derive insight
     - The “Fourth Paradigm” is at hand in the life sciences: the analysis of massive data sets

  4. “It’s the data, stupid”
     - It’s a new scientific methodology based on the power of data-intensive science:
       - Capturing
       - Curation, and
       - Analysis of large data
     - The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
     - At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.

  5. Big Data Biology
     - Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
     - Small biology labs can become
       - Big data generators
       - Big data users

  6. The story so far…
     “We can know more than we can tell” Michael Polanyi (1891-1976)
     - [Figure: "Grid Computing"[Title/Abstract] vs. "Cloud Computing"[Title/Abstract]]
     - [Figure: "Grid Computing" vs. "Cloud Computing" in SCOPUS (Life and Health sciences)]
     - 2007-2008: sequencers begin giving flurries of data

  7. Words of the story...
     - 391 abstracts from PubMed
     - 4,770 unique terms
     - Common terms: comput, data, system, provid, technolog, applic, resour, analysi
     - Grid terms: grid, model, distribut, bioinformat, molecular
     - Cloud terms: cloud, servic, sequenc, health, genom
     - [Figures: word cloud for the “Grid” abstracts; word cloud for the “Cloud” abstracts]

  8. Any field in particular?
     - Research areas from SCOPUS
     - Grid:
       - Biochemistry, Genetics and Molecular Biology (1,029)
       - Medicine (201)
       - Health Professions (109)
       - Multidisciplinary (65)
       - Agricultural and Biological Sciences (44)
       - Pharmacology, Toxicology and Pharmaceutics (23)
       - Nursing (22)
       - Environmental Science (10)
       - Neuroscience (9)
       - Immunology and Microbiology (1)
     - Cloud:
       - Medicine (203)
       - Biochemistry, Genetics and Molecular Biology (116)
       - Health Professions (85)
       - Multidisciplinary (69)
       - Agricultural and Biological Sciences (42)
       - Pharmacology, Toxicology and Pharmaceutics (21)
       - Environmental Science (13)
       - Immunology and Microbiology (12)
       - Nursing (11)
       - Neuroscience (9)

  9. Making the bridge…
     “Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.” Kitaro Nishida
     - “Grid computing” in 2004:
     - “Cloud computing” in 2014:

  11. NGS pushes bioinformatics needs up
      - Need for a large amount of CPU power
        - Informatics groups must manage compute clusters
        - Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
        - Another level of software complexity and challenges to interoperability
      - VERY large text files (~10 million lines long)
        - Can’t do “business as usual” with familiar tools such as Perl/Python
        - Impossible memory usage and execution time
        - Impossible to browse for problems
        - Need sequence quality filtering
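The memory problem on this slide usually comes from loading a whole file at once. A minimal sketch of the alternative, reading one record at a time and applying a quality filter, could look like the following. It assumes 4-line FASTQ records with Phred+33 quality encoding; the function names and the cutoff of 20 are illustrative choices, not something the workshop prescribes.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one 4-line FASTQ record at a time instead of loading the file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 ASCII encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(handle: TextIO, min_mean_q: float = 20.0) -> Iterator[Record]:
    """Stream records, keeping those whose mean quality reaches the cutoff."""
    for header, sequence, quality in read_fastq(handle):
        if quality and mean_quality(quality) >= min_mean_q:
            yield header, sequence, quality


# Tiny in-memory stand-in for a multi-gigabyte FASTQ file.
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n"
kept = [header for header, _, _ in filter_reads(io.StringIO(SAMPLE))]
```

Because `filter_reads` is a generator, memory use stays roughly constant regardless of file size; the same function works on an open file handle in place of the `io.StringIO` sample.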

  12. Data Management Issues
      - Raw data are large. How long should they be kept?
      - Processed data are manageable for most people
        - 20 million reads (50 bp) ~ 1 GB
      - More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
      - Certain studies are much more data-intensive than others
        - Whole genome sequencing
          - A 30X coverage genome pair (tumor/normal) ~ 500 GB
          - 50 genome pairs ~ 25 TB
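The figures on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes one byte per base and decimal units (1 GB = 10^9 bytes), which is one way to arrive at the slide's "~ 1 GB" for 20 million 50 bp reads; the 500 GB per tumor/normal pair is taken from the slide as given, and the helper names are hypothetical.

```python
GB = 10**9   # decimal gigabyte, an assumption for this estimate
TB = 10**12  # decimal terabyte


def read_bytes(n_reads: int, read_length: int, bytes_per_base: float = 1.0) -> float:
    """Rough size of the sequence content alone (no headers or qualities)."""
    return n_reads * read_length * bytes_per_base


def cohort_bytes(n_pairs: int, bytes_per_pair: int) -> int:
    """Total storage for a number of tumor/normal genome pairs."""
    return n_pairs * bytes_per_pair


processed_gb = read_bytes(20_000_000, 50) / GB  # 20 million reads of 50 bp
cohort_tb = cohort_bytes(50, 500 * GB) / TB     # 50 pairs at ~500 GB each

print(f"20M x 50 bp reads: ~{processed_gb:.0f} GB")
print(f"50 genome pairs:   ~{cohort_tb:.0f} TB")
```

Real files are larger than the one-byte-per-base estimate once read names and quality strings are included, so this is best read as a lower bound.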

  13. So what?
      - In NGS we have to process very large amounts of data, which is not trivial in computing terms.
      - Big NGS projects require supercomputing infrastructure.
      - Put another way: not everyone can study everything.
      - Small facilities must carefully choose projects that match their computing capabilities.

  14. Intermediate Solution #1: Cloud Computing
      - Pros:
        - Flexibility
        - You pay only for what you use
        - No need to maintain a data center
      - Cons:
        - Transferring big datasets over the internet is slow
        - You pay for consumed bandwidth, which is a problem with big datasets
        - Lower performance, especially in disk read/write
        - Privacy/security concerns
        - More expensive for big, long-term projects
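To put a number on "slow", the rough calculation below estimates upload time for a dataset of a given size. The link speeds (100 Mbit/s and 1 Gbit/s) are assumptions chosen for illustration, the sizes are the 500 GB genome pair and 25 TB cohort quoted in the Data Management Issues slide, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_bytes: float, link_mbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link of the given speed.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 means a perfect, uncontended link, which is optimistic).
    """
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1e6 * efficiency)
    return seconds / 3600


GB = 10**9
TB = 10**12

pair_hours = transfer_hours(500 * GB, 100)    # one genome pair, 100 Mbit/s
cohort_hours = transfer_hours(25 * TB, 1000)  # 50 pairs, 1 Gbit/s

print(f"500 GB at 100 Mbit/s: ~{pair_hours:.1f} h")
print(f"25 TB at 1 Gbit/s:    ~{cohort_hours:.1f} h")
```

Even under these idealized conditions a single genome pair takes the better part of a working day to upload, which is why data transfer appears among the cons.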

  15. Intermediate Solution #2: Grid Computing
      - Pros:
        - Cheaper
        - More resources available
      - Cons:
        - Heterogeneous environment
        - Slow connectivity
        - Much time required to find good resources in the grid

  16. AppDB: Ready-to-use Apps in EGI
      - The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
        - software solutions for scientists and developers to use,
        - the programmers and the scientists who developed them, and
        - the publications derived from the registered solutions

  17. What about the data?
      - There is a VT on this!
        - Support for dataset retrieval and replication in AppDB
        - Support for multiple versions and locations per dataset in AppDB

  18. Crossbow
      - Identifies SNPs from high-coverage, short-read resequencing data
      - Combines the aligner Bowtie and the SNP caller SOAPsnp
      - Hadoop MapReduce approach
      - Amazon EC2 / Local Cluster

  19. Rainbow
      - Large-scale Whole Genome Sequencing (WGS) analysis
      - Supports FASTQ and BAM input
      - Load balancing
      - Active workflow monitoring
      - Amazon EC2

  20. CloudMap
      - Greatly simplifies the analysis of mutant whole genome sequences
      - Offers predefined workflows to pinpoint variations in animal genomes
      - Available on the Galaxy web platform
      - Amazon EC2 / Local Cluster

  21. CloudBurst
      - Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
      - Modeled after the short read-mapping program RMAP
      - Parallelization overcomes computational barriers and allows deeper analysis
      - Hadoop MapReduce approach
      - Almost linear increase in performance with the number of CPU cores available
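As a toy illustration of how read mapping fits the MapReduce pattern, the sketch below emits k-mers keyed by their sequence (map), groups them by key (shuffle), and pairs reads with reference positions that share a seed (reduce). This is not CloudBurst's code: it runs in a single process without Hadoop, reports exact seed hits only with no extension or mismatches, and all function names and the example sequences are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, Tuple[str, object]]  # (k-mer key, (source tag, payload))


def map_reference(ref: str, k: int) -> Iterator[Pair]:
    """Map step: emit every k-mer of the reference with its position."""
    for i in range(len(ref) - k + 1):
        yield ref[i:i + k], ("ref", i)


def map_read(read_id: str, read: str, k: int) -> Iterator[Pair]:
    """Map step: emit the first k bases of a read as its seed."""
    yield read[:k], ("read", read_id)


def shuffle(pairs: List[Pair]) -> Dict[str, list]:
    """Shuffle step: group all emitted values by their k-mer key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values: list) -> List[Tuple[str, int]]:
    """Reduce step: pair every read with every reference position per seed."""
    positions = [payload for tag, payload in values if tag == "ref"]
    read_ids = [payload for tag, payload in values if tag == "read"]
    return [(rid, pos) for rid in read_ids for pos in positions]


def seed_hits(ref: str, reads: Dict[str, str], k: int = 4) -> List[Tuple[str, int]]:
    pairs = list(map_reference(ref, k))
    for rid, read in reads.items():
        pairs.extend(map_read(rid, read, k))
    hits: List[Tuple[str, int]] = []
    for values in shuffle(pairs).values():
        hits.extend(reduce_seed(values))
    return sorted(hits)


hits = seed_hits("ACGTACGTTAGC", {"r1": "GTTAGC", "r2": "ACGTAC", "r3": "TTTTTT"})
print(hits)  # [('r1', 6), ('r2', 0), ('r2', 4)]
```

Each seed group can be reduced independently of the others, which is the property that lets a framework such as Hadoop spread the work across many cores.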

  22. RSD-Cloud
      - Large comparative genomics analysis tool
      - Redesigned the reciprocal smallest distance (RSD) algorithm to run in a cloud computing environment
      - Fast and cost-efficient solution
      - Amazon EC2
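The "reciprocal" part of the idea can be sketched in a few lines: a pair of genes from two genomes is kept only if each is the other's closest partner. The example below assumes the pairwise distances are already available as a dictionary; in the actual tool they come from sequence comparisons, and the gene names, distance values, and function name here are invented for illustration.

```python
from typing import Dict, List, Tuple


def reciprocal_smallest_distance(
    dist: Dict[Tuple[str, str], float]
) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's nearest gene and a is b's nearest.

    dist maps (gene in genome A, gene in genome B) to a distance.
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), d in dist.items():
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep only pairs whose best hit points back at the starting gene.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Made-up distances between three genes of genome A and two of genome B.
distances = {
    ("a1", "b1"): 0.1,
    ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.3,
    ("a2", "b2"): 0.2,
    ("a3", "b2"): 0.4,
}
pairs = reciprocal_smallest_distance(distances)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Here `a3` is dropped: its closest gene is `b2`, but `b2` is closer to `a2`, so the relationship is not reciprocal. Since each genome pair can be processed independently, the workload splits naturally across cloud nodes.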

  23. Cloud BioLinux
      - Publicly accessible VM
      - Platform for developing bioinformatics infrastructures on the cloud
      - Quick provision of on-demand infrastructures for HPC in bioinformatics
      - Pre-configured tools and GUI
      - Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox

  24. CloVR
      - Portable VM
      - Several automated analysis pipelines for microbial genomics provided, including 16S, whole genome and metagenome sequence analysis
      - Runs on a local PC but also supports the use of remote resources on multiple cloud computing platforms

  25. Mercury
      - Integration of multiple sequence analysis tools in a single DNAnexus-based platform
      - Simplified workflow construction GUI
      - Applet-based workflows
      - Amazon EC2 / Local Cluster
