NGS Data Analysis: Methods and Protocols
Shifting Paradigms
• A thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch, using models and generalizations
• Last few decades: a computational branch, simulating complex phenomena
• Today: data exploration (eScience), unifying theory, experiment, and simulation
  • Data captured by instruments or generated by simulators
  • Processed by software
  • Information/knowledge stored in computers
  • Scientists analyze databases/files using data management and statistics
Source: Jim Gray on eScience, The Fourth Paradigm, Microsoft Research, 2009
NGS Data Analysis Training Workshop - Part I, 11/11/2015
Big Data Biology
• The term “Big Data” is not only about size:
  • Speed
  • Volume
  • Computational and analytical capacity to manage data and derive insight
• The “Fourth Paradigm” is at hand in the life sciences: the analysis of massive data sets
“It’s the data, stupid”
• It’s a new scientific methodology based on the power of data-intensive science
  • Capture, curation, and analysis of large data
• The goal, Dr. Gray insisted, was not to have the biggest, fastest single computer, but rather “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.”
• At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order, but of dimensionally agnostic statistics.
Big Data Biology
• Moving from traditional small-scale, focused experiments to more hypothesis-neutral studies
• Small biology labs can become:
  • Big data generators
  • Big data users
The story so far…
“We can know more than we can tell.” Michael Polanyi (1891-1976)
[Figure: two charts comparing Grid Computing and Cloud Computing: results for the queries "Grid Computing"[Title/Abstract] and "Cloud Computing"[Title/Abstract], and results for "Grid Computing" and "Cloud Computing" in SCOPUS (Life and Health sciences). Annotation: 2007-2008: sequencers begin giving flurries of data.]
Words of the story...
• 391 abstracts from PubMed
• 4,770 unique terms
• Common terms: comput, data, system, provid, technolog, applic, resour, analysi
• Grid terms: grid, model, distribut, bioinformat, molecular
• Cloud terms: cloud, servic, sequenc, health, genom
[Figures: word cloud for “Grid” abstracts; word cloud for “Cloud” abstracts]
Any field in particular?
Research areas from SCOPUS (two charts, labelled Grid and Cloud). Counts per area, in the order the two charts list them:
• Medicine: 203, 201
• Biochemistry, Genetics and Molecular Biology: 1,029, 116
• Health Professions: 109, 85
• Multidisciplinary: 65, 69
• Agricultural and Biological Sciences: 44, 42
• Pharmacology, Toxicology and Pharmaceutics: 23, 21
• Nursing: 22, 11
• Environmental Science: 13, 10
• Immunology and Microbiology: 12, 1
• Neuroscience: 9, 9
Making the bridge…
“Ba: Knowledge creation requires a time and place in which people share knowledge and work together as a community.” Kitaro Nishida
[Figures: “Grid computing” in 2004; “Cloud computing” in 2014]
NGS pushes bioinformatics needs up
• Need for large amounts of CPU power
  • Informatics groups must manage compute clusters
  • Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  • Another level of software complexity and challenges to interoperability
• VERY large text files (~10 million lines long)
  • Can’t do “business as usual” with familiar tools such as Perl/Python
  • Impossible memory usage and execution time
  • Impossible to browse for problems
  • Need sequence quality filtering
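The memory problem with very large text files comes mostly from loading everything at once. As a minimal sketch of the quality-filtering step, assuming 4-line FASTQ records and Phred+33 quality encoding (both assumptions, as are the function names), records can be streamed one at a time so that memory use stays constant regardless of file size:

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records one at a time from any iterable of lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip("\n")
            sep = next(it).rstrip("\n")
            qual = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header, seq, sep, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_by_quality(records: Iterable[Record], min_mean: float = 20.0) -> Iterator[Record]:
    """Keep only records whose mean Phred score reaches the threshold."""
    for rec in records:
        if mean_phred(rec[3]) >= min_mean:
            yield rec


# Tiny in-memory demo; in practice `lines` would be an open file handle.
demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "ACGT\n", "+\n", "!!!!\n"]
for header, seq, _, qual in filter_by_quality(read_fastq(demo)):
    print(header, seq, mean_phred(qual))
```

Streaming solves the memory side only; execution time in an interpreted language can still be a bottleneck at this scale, which is why dedicated tools are usually preferred.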
Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  • 20 million reads (50 bp) ~ 1 GB
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
  • Whole genome sequencing
  • A 30X coverage genome pair (tumor/normal) ~ 500 GB
  • 50 genome pairs ~ 25 TB
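The storage figures above follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 TB = 1000 GB) to match the slide's round numbers; the function names are illustrative only.

```python
GB_PER_TB = 1000  # decimal units, matching the round figures on the slide


def project_storage_tb(n_pairs: int, gb_per_pair: float = 500.0) -> float:
    """Total storage in TB for n tumor/normal genome pairs."""
    return n_pairs * gb_per_pair / GB_PER_TB


def bytes_per_base(n_reads: int, read_len: int, total_bytes: float) -> float:
    """Average on-disk bytes per sequenced base implied by a file size."""
    return total_bytes / (n_reads * read_len)


# 50 genome pairs at ~500 GB each
print(project_storage_tb(50), "TB")

# 20 million reads of 50 bp in ~1 GB
print(bytes_per_base(20_000_000, 50, 1e9), "bytes per base")
```

The second figure suggests a rule of thumb of about one byte per sequenced base for processed reads, which makes it easy to scale the estimate to other read counts and lengths.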
So what?
• In NGS we have to process very large amounts of data, which is not trivial in computing terms
• Big NGS projects require supercomputing infrastructure
• Or put another way: not everyone can study everything
• Small facilities must carefully choose projects that scale with their computing capabilities
Intermediate Solution #1: Cloud Computing
Pros:
• Flexibility
• You pay for what you use
• No need to maintain a data center
Cons:
• Transferring big datasets over the internet is slow
• You pay for consumed bandwidth, which is a problem with big datasets
• Lower performance, especially in disk read/write
• Privacy/security concerns
• More expensive for big and long-term projects
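To put the transfer problem in numbers, here is a rough back-of-the-envelope sketch. The link speeds and the `efficiency` parameter are assumptions chosen for illustration, not measurements; the dataset sizes are the ones quoted for whole genome sequencing (about 500 GB per genome pair, about 25 TB for 50 pairs).

```python
def transfer_hours(size_gb: float, link_mbit_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb gigabytes over a link of link_mbit_s Mbit/s.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 = ideal link, an optimistic assumption).
    """
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_gb * 1e9 * 8  # decimal gigabytes to bits
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 3600


for size_gb in (500, 25_000):          # one genome pair, 50 genome pairs
    for link in (100, 1000):           # hypothetical link speeds in Mbit/s
        hours = transfer_hours(size_gb, link)
        print(f"{size_gb} GB at {link} Mbit/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

Even on an ideal 100 Mbit/s link, a single genome pair takes roughly 11 hours and the 50-pair project roughly 23 days, before any bandwidth charges are considered.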
Intermediate Solution #2: Grid Computing
Pros:
• Cheaper
• More resources available
Cons:
• Heterogeneous environment
• Slow connectivity
• Much time required to find good resources in the grid
AppDB: Ready-to-use Apps in EGI
The EGI Applications Database (AppDB) is a central service that stores and provides to the public information about:
• software solutions for scientists and developers to use,
• the programmers and the scientists who developed them, and
• the publications derived from the registered solutions
What about the data?
• There is a VT on this!
• Support for dataset retrieval and replication in AppDB
• Support for multiple versions and locations per dataset in AppDB
Crossbow
• Identifies SNPs from high-coverage, short-read resequencing data
• Combines the aligner Bowtie and the SNP caller SOAPsnp
• Hadoop MapReduce approach
• Amazon EC2 / Local Cluster
Rainbow
• Large-scale Whole Genome Sequencing (WGS) analysis
• Supports FASTQ and BAM input
• Load balancing
• Active workflow monitoring
• Amazon EC2
CloudMap
• Greatly simplifies the analysis of mutant whole genome sequences
• Offers predefined workflows to pinpoint variations in animal genomes
• Available on the Galaxy web platform
• Amazon EC2 / Local Cluster
CloudBurst
• Parallel read-mapping algorithm optimized for mapping NGS data to the human and other reference genomes
• Modeled after the short-read mapping program RMAP
• Parallelization overcomes computational barriers and allows deeper analysis
• Hadoop MapReduce approach
• Performance increases almost linearly with the number of CPU cores available
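To illustrate how read mapping can be phrased as map, shuffle, and reduce steps, here is a toy single-process sketch. It is not CloudBurst's implementation: it handles exact matches only, runs in plain Python rather than Hadoop, and all function names are made up for this example. The point is that each k-mer group can be processed independently, which is what allows the work to be spread over many cores.

```python
from collections import defaultdict


def map_kmers(name, seq, k, is_ref):
    """Map step: emit (k-mer, (is_ref, name, offset)) pairs.

    Every reference position is emitted; reads emit non-overlapping seeds.
    """
    step = 1 if is_ref else k
    for i in range(0, len(seq) - k + 1, step):
        yield seq[i:i + k], (is_ref, name, i)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their k-mer key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values):
    """Reduce step: pair reference hits with read seeds sharing one k-mer.

    Yields (read_name, candidate_start) where candidate_start is the
    reference position at which the whole read would begin.
    """
    ref_offsets = [off for is_ref, _, off in values if is_ref]
    read_seeds = [(name, off) for is_ref, name, off in values if not is_ref]
    for ref_off in ref_offsets:
        for read_name, read_off in read_seeds:
            yield read_name, ref_off - read_off


def map_reads(reference, reads, k=4):
    """Run map, shuffle, reduce, then verify each candidate by exact match."""
    pairs = list(map_kmers("ref", reference, k, True))
    for name, seq in reads.items():
        pairs.extend(map_kmers(name, seq, k, False))
    hits = defaultdict(set)
    for values in shuffle(pairs).values():
        for read_name, start in reduce_seed(values):
            read = reads[read_name]
            if start >= 0 and reference[start:start + len(read)] == read:
                hits[read_name].add(start)
    return {name: sorted(starts) for name, starts in hits.items()}


print(map_reads("ACGTACGTTTGACCA", {"r1": "GTTTGACC", "r2": "TTTTTTTT"}))
```

A real mapper would replace the exact-match check with an alignment step that tolerates mismatches, but the grouping by shared seed would stay the same.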
RSD-Cloud
• Tool for large comparative genomics analyses
• Redesign of the reciprocal smallest distance (RSD) algorithm to run in a cloud computing environment
• Fast and cost-efficient solution
• Amazon EC2
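The reciprocity idea behind RSD can be sketched in a few lines. This is a simplification under stated assumptions: distances between genes of two genomes are taken as already computed and passed in as a dictionary, whereas the actual algorithm derives them from sequence alignments. The function names and the example distances are hypothetical.

```python
def closest(dist, by_first):
    """For each gene on one side, find its smallest-distance partner.

    dist maps (gene_a, gene_b) -> distance; by_first selects whether the
    lookup is keyed on genome A genes (True) or genome B genes (False).
    """
    best = {}
    for (a, b), d in dist.items():
        key, other = (a, b) if by_first else (b, a)
        if key not in best or d < best[key][1]:
            best[key] = (other, d)
    return best


def rsd_pairs(dist):
    """Return (gene_a, gene_b, distance) where each gene is the other's closest."""
    best_a = closest(dist, True)
    best_b = closest(dist, False)
    return sorted(
        (a, b, d) for a, (b, d) in best_a.items() if best_b[b][0] == a
    )


# Made-up distances between two genes of genome A and two of genome B.
example = {
    ("a1", "b1"): 0.1,
    ("a1", "b2"): 0.9,
    ("a2", "b1"): 0.3,
    ("a2", "b2"): 0.4,
}
print(rsd_pairs(example))
```

Because each gene's best partner can be computed independently, the expensive distance calculations split naturally across many cloud workers, with only the final reciprocity check needing both directions.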
Cloud BioLinux
• Publicly accessible VM
• Platform for developing bioinformatics infrastructures on the cloud
• Quick provisioning of on-demand infrastructures for HPC in bioinformatics
• Pre-configured tools and GUI
• Tested on Amazon EC2, Eucalyptus, Okeanos and VirtualBox
CloVR
• Portable VM
• Provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis
• Runs on a local PC but also supports remote cloud computing resources on multiple cloud computing platforms
Mercury
• Integration of multiple sequence analysis tools in a single DNAnexus-based platform
• Simplified workflow construction GUI
• Applet-based workflows
• Amazon EC2 / Local Cluster