SPAdes: a New Genome Assembler for Single-Cell Sequencing

SPAdes: a New Genome Assembler for Single-Cell Sequencing
Algorithmic Biology Lab, St. Petersburg Academic University (SPbAU)
A. Bankevich, S. Nurk, D. Antipov, A.A. Gurevich, M. Dvorkin, A. Korobeynikov, A.S. Kulikov, V.M. Lesin, S.I. Nikolenko, S. Pham, A.D. Prjibelski, A.V. Pyshkin, A.V. Sirotkin, N. Vyahhi, G. Tesler, M.A. Alekseyev, P.A. Pevzner
August 27, 2012

Outline
1. Problem setting and preparing data
   Assembly: problem and pipeline
   Error correction: BayesHammer
2. The SPAdes approach
   De Bruijn graphs, mate-pairs, and simplification
   Paired de Bruijn graphs and repeats

Single-cell sequencing
Recent years have seen the advent of single-cell sequencing as a way to sequence genomes that previously could not be sequenced. It turns out that many bacteria (the "dark matter of life") cannot be sequenced by standard techniques, most often because they cannot be cloned millions of times to get the large DNA samples needed for regular sequencing. This is usually because these bacteria come in metagenomic samples (ocean samples, microbiomes of larger organisms, etc.) and cannot be cultivated alone. For now, metagenomic analysis yields little more than individual genes. Single-cell sequencing can get reasonable sequencing coverage by amplifying the DNA of a single cell (more on that later).

Single-cell sequencing
Important examples:
[Woyke et al., 2009, 2010]: sequencing marine cells from metagenomic ocean samples;
[Dalerba et al., 2011]: studying tumor heterogeneity;
[Islam et al., 2011]: characterizing the single-cell transcriptome;
[Xu et al., 2012; Hou et al., 2012]: single-cell sequencing of tumor cells;
[Yoon et al., 2011]: sequencing bacteria from the human microbiome;
[Chitsaz et al., 2011]: assembling single-cell sequencing data from marine samples.
So how does single-cell sequencing work, and why do we need new assemblers for it?

The assembly problem
First, let us go back to the general assembly problem. We have a long DNA sequence but cannot read it directly. Instead, we can perform sequencing, getting many random small pieces of the DNA (reads). For popular sequencing technologies, reads are usually L ≈ 35-400 bp long. The assembly problem is to reconstruct the original sequence from the set of reads.

The assembly problem
In many sequencing methods, we read ≈ L nucleotides from both ends of a longer DNA fragment. This leads to paired-end reads (read pairs): two reads for which we know the distance between them [Chaisson et al., 2009].
[Figure: a left read and a right read placed on the genome, annotated with read length, gap distance, and insert size.]
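The geometry of a read pair can be sketched in a few lines. This is an illustration only, not part of the original slides: the function names, the toy fragment, the convention that insert size means the full fragment length (so the gap between the reads is the insert size minus 2L), and the opposite-strand orientation of the right read are all assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def read_pair(fragment, read_length):
    """Read `read_length` nucleotides from both ends of a fragment.

    Assumptions: the right read is reported on the opposite strand,
    and insert size is the length of the whole fragment.
    """
    if len(fragment) < 2 * read_length:
        raise ValueError("fragment is shorter than two reads")
    left = fragment[:read_length]
    right = reverse_complement(fragment[-read_length:])
    insert_size = len(fragment)
    gap = insert_size - 2 * read_length  # unread middle of the fragment
    return left, right, insert_size, gap


# Hypothetical 20 bp fragment with read length L = 6.
left, right, insert_size, gap = read_pair("ACGTGTGATGCATGATCGGA", 6)
```

In real data the fragment length is not known exactly for each pair; only its approximate distribution is known, which is why the insert size distribution matters to an assembler.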

MDA technology
Recent single-cell sequencing projects use multiple displacement amplification (MDA) [Dean et al., 2001; 2002].

MDA technology
However, MDA has a number of problems that seriously complicate assembly. First and foremost, it produces highly non-uniform coverage.

MDA technology
With non-uniform coverage, a coverage threshold (often used in conventional assemblers) would throw a lot of the genome away.
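A toy calculation makes the effect of a coverage threshold concrete. The coverage values and the threshold below are hypothetical numbers chosen for illustration, not measurements from any dataset, and `fraction_below` is an illustrative helper name.

```python
def fraction_below(coverage, threshold):
    """Fraction of genome positions whose coverage is below the threshold."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)


# Hypothetical per-position coverage values (not real data).
uniform = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54]
non_uniform = [1, 2, 500, 3, 1000, 4, 2, 800, 1, 5]

threshold = 10
lost_uniform = fraction_below(uniform, threshold)
lost_non_uniform = fraction_below(non_uniform, threshold)
```

With these made-up numbers the threshold discards nothing from the uniform profile but most positions of the non-uniform one, even though the non-uniform profile has far more reads in total.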

MDA technology
The insert size distribution is also noisier and more complicated than in conventional datasets. Finally, there are simply more errors, including many chimeric connections.

MDA technology
Computational problems of MDA:
1. highly non-uniform coverage;
2. noisier, more complicated insert size distribution;
3. more chimeric connections (sometimes with large coverage);
4. more frequent errors.
First single-cell assembler: E+V-SC (Euler+Velvet-Single-Cell assembler) [Chitsaz et al., 2011]; it captured 25% more genes than "regular" assemblers on single-cell data.
Meet SPAdes, a new assembler specifically designed for the single-cell case.

Assembly pipeline

BayesHammer
Basic idea: break reads into k-mers and study the set of k-mers. For example, the read ACGTGTGATGCATGATCG breaks into the following 11-mers:
ACGTGTGATGCATGATCG
ACGTGTGATGC
 CGTGTGATGCA
  GTGTGATGCAT
   TGTGATGCATG
    GTGATGCATGA
     TGATGCATGAT
      GATGCATGATC
       ATGCATGATCG
k-mers are the basic building block in modern assemblers (de Bruijn graph). However, de Bruijn graphs are not directly applicable because of errors in reads.
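Breaking a read into k-mers is a sliding window over the string. A minimal sketch, using the read from the slide and k = 11 (the function name `kmers` is illustrative):

```python
def kmers(read, k):
    """Return all substrings of length k, in order of their start position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGTGTGATGCATGATCG"
eleven_mers = kmers(read, 11)
for offset, kmer in enumerate(eleven_mers):
    # Indent each k-mer by its start position to mirror the slide layout.
    print(" " * offset + kmer)
```

A read of length n yields n - k + 1 k-mers, so this 18-nucleotide read yields 8 of them.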

BayesHammer
The first step in assembly is to fix as many errors as we can. If we know which k-mers are correct (solid) and how to correct the others, it's easy to correct reads.
original read:   ACGTGTGATGCATGATCG
corrected read:  ACGTGAGATGCATGATCG
k-mer in read    after correction    status
ACGTGTGATGC      ACGTGTGATGC         solid
CGTGTGATGCA      CGTGAGATGCA         corrected
GTGTGATGCAT      GTGAGATGCAT         corrected
TGTGATGCATG      TGTGATGCATG         solid
GTGATGCATGA      GTGATGCATGA         solid
TGATGCATGAT      AGATGCATGAT         corrected
GATGCATGATC      GATGCATGATC         solid
ATGCATGATCG      ATGCATGATCG         solid

BayesHammer
In regular (multi-cell) error correction, we can look at how many copies of a k-mer there are and assume that rare k-mers represent errors. In single-cell datasets, this idea fails due to non-uniform coverage.
[Figure: plot with a logarithmic vertical axis (10^0 to 10^4) and a horizontal axis from 0 to 4,000 KBases.]
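The multi-cell heuristic described above amounts to counting k-mers and flagging the rare ones. The sketch below uses hypothetical toy reads and an arbitrary cutoff; the function names are illustrative and this is not BayesHammer's method.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count):
    """k-mers seen fewer than `min_count` times are treated as likely errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Hypothetical toy reads: three identical copies and one with a G -> A error.
reads = ["ACGTGTG", "ACGTGTG", "ACGTGTG", "ACATGTG"]
counts = kmer_counts(reads, 5)
suspects = rare_kmers(counts, min_count=2)
```

Under uniform coverage a fixed cutoff like this separates errors from genuine k-mers reasonably well. Under highly non-uniform coverage, correct k-mers from poorly covered regions fall below the cutoff too, while errors in heavily covered regions can rise above it.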

BayesHammer
Basic idea: a k-mer is covered many times (even if non-uniformly). Thus, if we look at a set of similar k-mers, we can tell which one is correct, because the nucleotides of the central k-mer will be better covered than the alternatives.
Problem: there may be several centers in such a cluster.
ATGTGTGATGC ATGTGGGATGC ACGTGGGATGC ACGTGGGATGC ACGTGTGATGC
ACGTGGGATGC ACGTGTGATGC ACGTGTGATGC ATGTGTGATGC ATGTGTGATGC
ACGTGAGATGC ACGTGTGATAC ACGTGTGATGC ACGTGTGATGC ACGTGGGATGC
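The "better covered nucleotides" idea can be illustrated with a column-wise majority vote over the cluster listed on the slide. This is a deliberately simple stand-in, not BayesHammer's probabilistic model, and the function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(cluster):
    """Column-wise majority vote over equal-length k-mers."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*cluster)
    )


# The 15 k-mers listed on the slide.
cluster = [
    "ATGTGTGATGC", "ATGTGGGATGC", "ACGTGGGATGC", "ACGTGGGATGC", "ACGTGTGATGC",
    "ACGTGGGATGC", "ACGTGTGATGC", "ACGTGTGATGC", "ATGTGTGATGC", "ATGTGTGATGC",
    "ACGTGAGATGC", "ACGTGTGATAC", "ACGTGTGATGC", "ACGTGTGATGC", "ACGTGGGATGC",
]
center = consensus(cluster)
multiplicity = Counter(cluster)
```

The vote returns a single center, ACGTGTGATGC, which occurs 5 times. But ACGTGGGATGC occurs 4 times, which is hard to explain as a sequencing error alone. A single consensus would wrongly "correct" it, and that is the several-centers problem the slide points out.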

BayesHammer
In Hammer, Medvedev et al. (2011) constructed and roughly clustered the Hamming graph of k-mers; BayesHammer uses probabilistic subclustering to get multiple centers in a cluster.
Reads      k-mers
ACGTGTG    ACGTG, CGTGT, GTGTG
ACATGTG    ACATG, CATGT, ATGTG
ACCTGTC    ACCTG, CCTGT, CTGTC
[Figure: the Hamming graph on these k-mers, in which k-mers at a small Hamming distance are connected.]
As a result, BayesHammer corrects single-cell datasets much better than other tools.
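A Hamming graph and its rough clusters (connected components) can be sketched as follows for the three reads on the slide. The distance threshold of 1 and the all-pairs comparison are assumptions made to keep the example short; they are not taken from the slides, and the sketch does not include BayesHammer's probabilistic subclustering.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_components(kmers, max_distance=1):
    """Connected components of the graph linking k-mers within max_distance."""
    kmers = sorted(set(kmers))
    neighbours = {kmer: set() for kmer in kmers}
    for a, b in combinations(kmers, 2):  # quadratic; fine for a toy example
        if hamming(a, b) <= max_distance:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, components = set(), []
    for start in kmers:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


reads = ["ACGTGTG", "ACATGTG", "ACCTGTC"]
five_mers = [r[i:i + 5] for r in reads for i in range(len(r) - 4)]
clusters = hamming_components(five_mers)
```

With this threshold the nine 5-mers fall into three components. Note that a component can chain together k-mers that are more than one mismatch apart, which is one reason a rough cluster may contain more than one true center.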

Part 2. The SPAdes approach: De Bruijn graphs, mate-pairs, and simplification

De Bruijn graphs
After error correction, relatively few k-mers are left, and we can begin global processing (the actual assembly). Basic idea: de Bruijn graph [Idury & Waterman, 1995; Pevzner et al., 2001].
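In its standard textbook form, a de Bruijn graph has the (k-1)-mers as vertices, and every k-mer contributes an edge from its prefix to its suffix. A minimal sketch of that construction on a hypothetical toy read follows; it is the generic definition, not SPAdes' own data structure.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Build adjacency lists: each k-mer is an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy read and k; real assemblers use much larger k.
read = "ACGTGTG"
k = 4
kmer_set = sorted({read[i:i + k] for i in range(len(read) - k + 1)})
graph = de_bruijn(kmer_set)
```

Here the repeated substring GTG produces a cycle (GTG to TGT and back), a small-scale picture of how genomic repeats tangle the graph.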

