Highly Scalable Genome Assembly on Campus Grids
Christopher Moretti, Michael Olson, Scott Emrich, Douglas Thain
University of Notre Dame
11/16/2009
Overview
- Scientists get stuck in a loop: CODE -> DEBUG -> SCALE UP -> RE-CODE -> ... cloud.
- We believe: the Many-Task Paradigm, coordinating thousands of serial programs on commodity hardware, is an effective mechanism for designing solutions that don't require scientists to change their existing codes when scaling up to multi-institutional campus grid resources.
Genome Assembly
Genome sequencing extracts DNA ({A,G,T,C}) from biological samples in reads of 25-1000 bases each. Biologists need much longer DNA strings to perform their analyses. Assembly is the process of putting the pieces together into long contiguous sequences.
Assembly Pipeline
(1) Unordered reads from sequencing

Assembly Pipeline – Candidate Selection
(2) Candidates based on short exact matches

Assembly Pipeline – Alignment
(3) Actual overlaps are computed

Assembly Pipeline – Consensus
(4) Alignments are ordered and combined into contigs
Complete Assembly of the A. gambiae Mosquito

                 Candidate Sel.                 Alignment                       Consensus
  Celera         4.5 hours (combined)                                           3 hours
  Complete SW    5 minutes (1.5 hrs serially)   45 minutes (12 days serially)   3 hours
  Banded SW      5 minutes (1.5 hrs serially)   11 minutes (7 hrs serially)     3 hours

Similarly, we can bring the candidate selection and alignment time for the much larger S. bicolor grass down from more than 9 days on Celera to 3 hours (Complete) and 1.25 hours (Banded).

So why did we choose to attack Candidate Selection and Alignment? And what about Amdahl's Law?
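As a rough Amdahl's Law check (my arithmetic, using only the A. gambiae stage times from the table above and ignoring anything outside these three stages):

    \text{whole-pipeline speedup} \approx \frac{4.5 + 3}{(5 + 11)/60 + 3} \;\text{(hours)} = \frac{7.5}{3.27} \approx 2.3\times

Even a near-total elimination of candidate selection and alignment time leaves the 3-hour consensus stage as the dominant cost, which bounds the end-to-end speedup.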
Candidate Selection
- 1M reads -> 1 trillion alignments
- 8M reads -> 64 trillion alignments ... 50,000 CPU-years!

k-mer counting heuristic: "two sequences that share a short exact match are more likely to overlap significantly than two sequences that don't share an exact match."

Even optimized k-mer counting is extremely memory intensive: 16 GB for the 8M-read data set. Worse, it is not naturally parallelizable.
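To make the heuristic concrete, here is a minimal Python sketch, assuming a toy k of 22 and reads small enough to hold in memory; the real optimized k-mer counters are far more involved:

    def kmers(seq, k):
        """Return the set of all length-k substrings of a read."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def is_candidate(read_a, read_b, k=22):
        """Two reads that share a short exact match (a k-mer) are more
        likely to overlap significantly, so keep them as a candidate pair."""
        return not kmers(read_a, k).isdisjoint(kmers(read_b, k))

    # These two reads share a 24-base exact substring, so they are a candidate.
    a = "ACGTACGTACGTACGTACGTACGTTTTT"
    b = "GGGGACGTACGTACGTACGTACGTACGT"
    print(is_candidate(a, b))   # True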
Parallel Candidate Selection
We chose to trade increased computational complexity for the ability to parallelize Candidate Selection into tens of thousands of separate tasks, with decreased memory consumption per node.

k-mer counting is O(nm) for n reads of average length m.
Instead, we divide the input into n/l subsets of size l.
We compare every pair of subsets: O(n^2/l^2) tasks, each completed in O(lm).
Total complexity: O(n^2 m / l).

[Figure: the grid of subset-versus-subset comparisons, e.g. subset 0 vs. subsets 0, 1, 2; subset 1 vs. subsets 1, 2; subset 2 vs. subset 2.]
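A small sketch of the partitioning scheme, assuming the reads are already in memory and using made-up names; each emitted pair of subsets becomes one independent task that only needs O(lm) memory:

    def partition(reads, l):
        """Split n reads into consecutive subsets of size l (n/l subsets)."""
        return [reads[i:i + l] for i in range(0, len(reads), l)]

    def subset_pair_tasks(num_subsets):
        """Yield every pair (i, j) with i <= j: each subset is compared
        against itself and every later subset, about n^2 / (2 l^2) tasks."""
        for i in range(num_subsets):
            for j in range(i, num_subsets):
                yield (i, j)

    reads = ["read%d" % k for k in range(10)]   # stand-in for real sequence reads
    subsets = partition(reads, l=4)             # here: subsets 0, 1, 2
    for i, j in subset_pair_tasks(len(subsets)):
        print("task: count shared k-mers between subset %d and subset %d" % (i, j))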
Parallel Candidate Selection (continued)
[Figure: memory per node and total CPU cost of the partitioned approach.]
Alignment
Now we have candidate pairs whose alignments can be computed independently in parallel using sequential programs:

    for i in Candidates; do
        batch_submit aligner $i
    done

What's wrong with this?
- Batch system latency.
- Local and remote replication of many copies of each sequence, and/or the requirement of a global file system.
Alignment: Master/Worker
The Master holds the candidate (work) list (Seq1-Seq2, Seq1-Seq3, Seq2-Seq3, Seq4-Seq5, ...) and the input sequence data in raw format (>Seq1 ATGCTAG, >Seq2 AGCTGA, ...). For each task, the Master sends a Worker:

    put "Align"
    put ">Seq1\nATGCTAG\n..." 1.in
    run "Align < 1.in > 1.out"
    get 1.out

The Worker runs the aligner on its local copy of the input data, and the Master collects the alignment results as output.
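As a minimal sketch (not the actual Work Queue implementation; the sequence names and the "Align" binary are just the placeholders from the figure), here is how a master might turn each candidate pair into that put/run/get exchange:

    def task_messages(task_id, pair, sequences):
        """Build the per-task input file and the messages the master sends
        to one worker for one candidate pair."""
        a, b = pair
        fasta = ">%s\n%s\n>%s\n%s\n" % (a, sequences[a], b, sequences[b])
        infile, outfile = "%d.in" % task_id, "%d.out" % task_id
        return [
            ("put", "Align"),                     # ship the aligner executable
            ("put", fasta, infile),               # ship only the sequences this task needs
            ("run", "Align < %s > %s" % (infile, outfile)),
            ("get", outfile),                     # retrieve the alignment result
        ]

    sequences = {"Seq1": "ATGCTAG", "Seq2": "AGCTGA", "Seq3": "TTGACC"}
    candidates = [("Seq1", "Seq2"), ("Seq1", "Seq3"), ("Seq2", "Seq3")]
    for i, pair in enumerate(candidates, start=1):
        for msg in task_messages(i, pair, sequences):
            print(msg)

The design point is that each worker receives only the executable and the few kilobytes of sequence data its own tasks need, avoiding both whole-dataset replication and a global file system.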
12.6M candidates from 1.8M reads.
121.3M candidates from 7.9M reads.
Scaling to larger numbers of workers
[Animation: a single Master (M) hands out work to Workers (W) one at a time; while the Master is serving one Worker, the remaining Workers sit idle, and Workers that finish must wait for the Master to get back to them.]
This is exacerbated when network links slow down, for instance when harnessing resources at another institution.
Putting it all together
Finally, we can run our distributed Candidate Selection and Alignment concurrently in order to pipeline these stages of the assembly (and save a bit of time versus running the two modules back-to-back). Inserting our distributed modules in place of the default candidate selection and alignment procedures, we decrease these two steps of the assembly from hours to minutes on one of our genomes, and from nine days to less than one hour on our largest genome.
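A toy, in-process sketch of the pipelining idea (the real system runs the two distributed masters concurrently; here a Python generator stands in for the stream of candidates flowing from selection into alignment):

    def candidate_selection(subset_pair_results):
        """Yield candidate pairs one at a time as subset comparisons finish."""
        for pairs in subset_pair_results:
            for pair in pairs:
                yield pair

    def alignment(candidates):
        """Consume candidate pairs as soon as they arrive, without waiting
        for candidate selection to finish."""
        for a, b in candidates:
            print("aligning %s with %s" % (a, b))

    # Stand-in for results trickling out of distributed candidate selection.
    results = [[("Seq1", "Seq2")], [("Seq1", "Seq3"), ("Seq2", "Seq3")]]
    alignment(candidate_selection(results))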
For More Information
- Christopher Moretti and Prof. Douglas Thain
  Cooperative Computing Lab
  http://cse.nd.edu/~ccl
  cmoretti@cse.nd.edu, dthain@cse.nd.edu
- Michael Olson and Prof. Scott Emrich
  ND Bioinformatics Laboratory
  http://www.nd.edu/~biocmp
  molson3@nd.edu, semrich@nd.edu
- Funding acknowledgements:
  University of Notre Dame strategic initiative for Global Health
  National Institutes of Health (NIAID contract 266200400039C)
  National Science Foundation (grant CNS06-43229)
How?
- On my workstation:
  Write my program (making sure it is partitionable, because it takes a really long time and might crash), debug it. Now run it for 39 days to 2.3 years.
- On my department's 128-node research cluster:
  Learn MPI, determine how I want to move many GBs of data around, re-write my program and re-debug, then wait until the cluster can give me 8-128 homogeneous nodes at once, or go buy my own. Now run it.
- On a BlueGene:
  Get $$$ or access, learn a custom MPI-like computation and communication language, determine how I want to handle communication and data movement, re-write my program, wait for configuration or access, re-debug my program, re-run.
So?
- Serially? Cluster? Supercomputer?
- So I can either take my program as-is and it'll take forever, or I can do a new custom implementation for a particular architecture, and re-write and re-debug it every time we upgrade (assuming I'm lucky enough to have a BlueGene in the first place)?
- Well, what about Condor?