Towards More Effective Formulations of the Genome Assembly Problem

Alexandru Tomescu
Department of Computer Science, University of Helsinki, Finland
DACS, June 26, 2015

Outline: High-throughput sequencing · Genome assembly problem · Contig assembly · Gap filling · End
CENTRAL DOGMA OF BIOLOGY

[Figure: DNA (gene 1, gene 2, exons and introns, promoter, enhancer, silencer, TFBS) → transcription → pre-mRNA → alternative splicing → mature mRNA transcripts → translation → proteins (binding, protein-protein interaction). Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015.]
SEQUENCING ATLAS

[Figure: sequencing protocols placed on the central dogma diagram: de novo sequencing / whole genome resequencing, targeted resequencing (gene, primers), bisulfite sequencing (DNA methylation), ChIP sequencing (TFBS, binding proteins), RNA sequencing (mature mRNA transcripts). Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015.]
HIGH-THROUGHPUT SEQUENCING

DNA ⇒ amplification ⇒ break apart ⇒ size selection ⇒ sequencing ⇒ paired-end reads

[Figure: paired-end read layout, with length labels ~450, 100, ~250, 100.]

We assume here that the DNA is single-stranded and consists of a single chromosome.
ILLUMINA HISEQ X
GENOME ASSEMBLY PROBLEM

Genome sizes (base pairs): E. coli 4.6 · 10^6, human 3.2 · 10^9, spruce 25 · 10^9

INPUT: A collection of paired-end reads
OUTPUT: The genome

Initial formulations:
◮ Shortest superstring problem (NP-hard)
◮ Build a graph with reads as nodes, and significant overlaps between reads as directed edges:
  ◮ Find a walk that passes through every node exactly once (NP-complete)
  ◮ Find a walk that passes through every node at least once

These formulations are unrealistic:
◮ Longer repeated regions are collapsed
◮ Genome coverage is not uniform
◮ We cannot choose between multiple solutions
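The overlap-graph formulation can be illustrated on a toy input. The sketch below is only an illustration, not the method of any particular assembler: the function names, the `min_len` threshold standing in for a "significant" overlap, and the three toy reads are all assumptions. It builds the overlap graph and then brute-forces every ordering of the reads that visits each node exactly once, which is feasible only for a handful of reads.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest proper suffix of a that is a prefix of b.

    Returns 0 if no such overlap has length >= min_len (hypothetical threshold).
    """
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Map each directed edge (a, b) between distinct reads to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph


def hamiltonian_superstrings(reads, min_len=3):
    """Brute force: spell every walk that visits each read exactly once."""
    graph = overlap_graph(reads, min_len)
    spelled = []
    for order in permutations(reads):
        steps = list(zip(order, order[1:]))
        if all(step in graph for step in steps):
            s = order[0]
            for a, b in steps:
                s += b[graph[(a, b)]:]  # append only the non-overlapping tail
            spelled.append(s)
    return spelled


toy_reads = ["ACGTACG", "GTACGATA", "GATATCTA"]  # toy input, chosen for illustration
print(hamiltonian_superstrings(toy_reads))
```

The factorial number of orderings tried here is a reminder of why the exactly-once formulation does not scale.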
PRACTICAL FORMULATIONS / PIPELINE

1. Contig assembly: assemble the reads into strings (contigs) that are guaranteed to occur in the genome

   reads:  ACGTACG, GTACGATA, CTAATTCGA, GATATCTA, CTAGTACCC
   contig: ACGTACGATATCTA

2. Scaffolding: using paired-end reads, chain the contigs into scaffolds that are guaranteed to occur in the genome
PRACTICAL FORMULATIONS / PIPELINE (2)

3. Gap filling: fill the gaps in the scaffolds

Tens of genome assembly programs are available: ABySS, Velvet, ALLPATHS-LG, Bambus2, MSR-CA, SGA, Cortex, SOAPdenovo, Opera-LG, SPAdes, ...
DE BRUIJN GRAPHS

DEFINITION. Given a set $R$ of strings, the de Bruijn graph of order $k$ of $R$ is the directed graph $DB_k(R)$ with
◮ node set: the set of $k$-mers of the strings of $R$
◮ edge set: the set of $(k+1)$-mers of the strings of $R$

Note that the edges, too, occur in the strings of $R$.

[Figure: example with $k = 2$, showing the string ATGCGTGGCA, the strings ATGCG, CGTG, TGGCA, and the nodes AT, TG, GC, CG, GT, GG, CA.]
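The definition translates almost directly into code. The following is a minimal sketch, assuming edges are stored as (prefix $k$-mer, suffix $k$-mer) pairs; the function name `de_bruijn_graph` is hypothetical. It is run on the three strings ATGCG, CGTG, TGGCA with $k = 2$ from the example.

```python
def de_bruijn_graph(strings, k):
    """Return (nodes, edges) of the de Bruijn graph of order k.

    nodes: all k-mers occurring in the input strings.
    edges: one pair (prefix k-mer, suffix k-mer) per distinct (k+1)-mer
           occurring in the input strings.
    """
    nodes, edges = set(), set()
    for s in strings:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
        for i in range(len(s) - k):
            kplus1_mer = s[i:i + k + 1]
            edges.add((kplus1_mer[:-1], kplus1_mer[1:]))
    return nodes, edges


nodes, edges = de_bruijn_graph(["ATGCG", "CGTG", "TGGCA"], k=2)
print(sorted(nodes))  # the 7 nodes AT, CA, CG, GC, GG, GT, TG
print(sorted(edges))  # 8 edges, one per distinct 3-mer
```

Because every edge comes from a $(k+1)$-mer that was actually read, no edge is inferred merely from two $k$-mers overlapping.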
CONTIG ASSEMBLY

Joint work with Paul Medvedev
CONTIG ASSEMBLY

◮ No previous formal definition of a contig
◮ Usually, contigs are maximal unary paths (i.e., paths whose internal nodes have in-degree and out-degree 1, also known as unitigs)

[Figure: a path $v_0, v_1, v_2, \dots, v_k, v_{k+1}, \dots, v_t, v_{t+1}$.]

Given a de Bruijn graph (dBG) $G$:
◮ a genomic walk of $G$ is a circular edge-covering walk of $G$
◮ a walk is safe if it is a sub-walk of all genomic walks of $G$

We now assume that the dBG admits a genomic walk (i.e., is strongly connected) and is not a single cycle.
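The assumption that a genomic walk exists can be tested with two graph searches. This is a minimal sketch, assuming the graph is given as a list of (source, target) node pairs and that nodes without any edge are ignored; the function name is hypothetical.

```python
from collections import defaultdict


def is_strongly_connected(edges):
    """Check strong connectivity of the graph induced by the given edges.

    A graph is strongly connected iff, from an arbitrary start node, every
    node is reachable both along the edges and along the reversed edges.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))
    if not nodes:
        return False

    def reachable(start, adjacency):
        seen, stack = {start}, [start]
        while stack:
            for w in adjacency[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    start = next(iter(nodes))
    return reachable(start, succ) == nodes and reachable(start, pred) == nodes


# A 3-cycle with one extra chord: strongly connected, and not a single cycle.
print(is_strongly_connected([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]))
```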
CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm is
◮ sound: if every output walk is safe
◮ complete: if every safe walk is in the output

The unitig algorithm:
◮ outputs all maximal unitigs
◮ is sound
◮ is not complete

Is there a sound and complete contig assembly algorithm?
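One common way to enumerate maximal unitigs is to start at every node that is not unary (in-degree and out-degree both 1) and extend along each outgoing edge until the next non-unary node. The sketch below follows that idea and is not necessarily the implementation used by any assembler; the function names and the edge-list input format are assumptions. It relies on the graph not being a single cycle, so that every unitig starts at a non-unary node.

```python
from collections import defaultdict


def maximal_unitigs(edges):
    """List maximal unary paths as node sequences.

    edges: iterable of (source, target) pairs. Isolated cycles made only of
    unary nodes are not reported by this sketch.
    """
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    def is_unary(v):
        return indeg[v] == 1 and len(succ[v]) == 1

    unitigs = []
    for u in list(succ):  # snapshot: the defaultdicts may grow during lookups
        if is_unary(u):
            continue
        for v in succ[u]:
            path = [u, v]
            while is_unary(path[-1]):  # extend through unary internal nodes
                path.append(succ[path[-1]][0])
            unitigs.append(path)
    return unitigs


def spell(path):
    """String spelled by a path of k-mers in a de Bruijn graph."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Edges of the order-2 de Bruijn graph of ATGCG, CGTG, TGGCA.
example_edges = [("AT", "TG"), ("TG", "GC"), ("GC", "CG"), ("CG", "GT"),
                 ("GT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CA")]
print(sorted(spell(p) for p in maximal_unitigs(example_edges)))
```

On this small graph the unitigs spell ATG, GCA, GCGTG, TGC and TGGC. The graph is used only to exercise the code: it is not strongly connected, so it does not satisfy the assumptions made for safe walks.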
NON-SWITCHING CONTIGS

[Figure: a path $v_0, v_1, v_2, \dots, v_k, v_{k+1}, \dots, v_t, v_{t+1}$.]

◮ A path with all out-branching nodes before all in-branching nodes
◮ Related to transformation-based algorithms of
  ◮ Kingsford, Schatz, Pop 2010
  ◮ Jackson 2009
  ◮ Medvedev, Georgiou, Myers, Brudno 2007

THEOREM. There is an $O(|G| + |\text{output}|)$-time algorithm to output all maximal non-switching contigs of $G$.

THEOREM. The non-switching contig assembly algorithm is sound but not complete.
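The one-line description of a non-switching path can be turned into a simple predicate. The sketch below encodes one literal reading of it, and several details are assumptions not stated on the slide: only internal nodes are inspected, "out-branching" means out-degree greater than 1, "in-branching" means in-degree greater than 1, and a node that is both counts as a violation. It only tests a given path; it is not the linear-time enumeration algorithm of the theorem.

```python
from collections import Counter


def is_non_switching(path, edges):
    """Check that, among the internal nodes of path, every out-branching node
    comes strictly before every in-branching node (assumed reading)."""
    outdeg, indeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    internal = path[1:-1]
    out_positions = [i for i, v in enumerate(internal) if outdeg[v] > 1]
    in_positions = [i for i, v in enumerate(internal) if indeg[v] > 1]
    if not out_positions or not in_positions:
        return True  # nothing to switch between
    return max(out_positions) < min(in_positions)


# Hypothetical toy graph: b branches out, then c is branched into.
toy_edges = [("a", "b"), ("b", "c"), ("b", "x"), ("y", "c"), ("c", "d")]
print(is_non_switching(["a", "b", "c", "d"], toy_edges))
```

Under this reading, a unitig is the special case in which no internal node branches at all.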
OMNITIGS

[Figure: a walk from $v_0$ to $v_{t+1}$ with edges $e_0, \dots, e_t$, highlighting the nodes $v_i$ and $v_j$ and the edges $e_{i-1}, e_i, e_{j-1}, e_j$.]

We say that a walk $w = (v_0, e_0, v_1, e_1, \dots, v_t, e_t, v_{t+1})$ is an omnitig if, for all $1 \le i \le j \le t$, there is no proper $v_j$-$v_i$ path with first edge different from $e_j$ and last edge different from $e_{i-1}$.