

  1. Bioinformatics Algorithms (Fundamental Algorithms, module 2)
     Zsuzsanna Lipták, Masters in Medical Bioinformatics, academic year 2018/19, II. semester

     Fragment Assembly with de Bruijn Graphs [1]
     [1] These slides are mainly based on Compeau, Pevzner, Tesler: How to apply de Bruijn graphs to genome assembly, Nature Biotechnology 29 (11).

     Sequencing of a genome
     From the DNA molecules (input of experiment) we want to get the sequence of the nucleotides (desired output).
     ...AACAGTACCATGCTAGGTCAATCGA...
     ...TTGTCATGGTACGATCCAGTTAGCT...

     Sequence assembly
     Sequence assembly (also called Fragment Assembly Problem)
     Input: Molecule (many identical copies) broken up into fragments. Many short sequences/strings (the fragments).
     Goal: Reconstruct original string (the target sequence).

     Overlap graph approach
     (Recall from the first module of this course)
     Previous approach (Sanger sequencing technology):
     Shortest common superstring ≙ a heaviest path in the overlap graph of F = {TACC, ACTAC, CGGACT, ACGGA} ≙ a heaviest Hamiltonian path.
     [Figure: overlap graph on the vertices a = TACC, b = ACTAC, c = CGGACT, d = ACGGA; the edge weights are the overlap lengths.]

     Sanger sequencing vs. short read sequencing (NGS)
     Next generation sequencing technologies (Illumina, 454, SOLiD, ...) generate a much larger number of reads:
     • high-throughput: fast acquisition, low cost
     • lower quality (more errors)
     • short reads (Illumina: typically 60-100 bp)
     • much higher number of reads
     While the overlap graph approach (with many additional details and modifications!) worked for Sanger type sequences, it no longer works for NGS data.
     Reason: Input too large, no efficient algorithms are known (efficient = polynomial time in input size), since SCS (and all other problem variants) are NP-hard.
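To make the overlap graph approach recalled above concrete, here is a minimal Python sketch for the example F = {TACC, ACTAC, CGGACT, ACGGA}. The function names (`overlap`, `heaviest_hamiltonian_path`, `merge`) are illustrative and not from the slides, and the sketch assumes that no fragment is a substring of another. It tries every ordering of the fragments, which is exactly the kind of exhaustive search that does not scale to NGS-sized inputs.

```python
from itertools import permutations

def overlap(u: str, v: str) -> int:
    """Length of the longest proper suffix of u that is also a prefix of v."""
    for length in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:length]):
            return length
    return 0

def heaviest_hamiltonian_path(fragments):
    """Brute force over all n! orderings; only feasible for tiny inputs."""
    best_weight, best_order = -1, None
    for order in permutations(fragments):
        weight = sum(overlap(a, b) for a, b in zip(order, order[1:]))
        if weight > best_weight:
            best_weight, best_order = weight, order
    return best_weight, best_order

def merge(order):
    """Glue the fragments together along the path, collapsing overlaps."""
    result = order[0]
    for a, b in zip(order, order[1:]):
        result += b[overlap(a, b):]
    return result

F = ["TACC", "ACTAC", "CGGACT", "ACGGA"]
weight, order = heaviest_hamiltonian_path(F)
superstring = merge(order)
print(weight, order, superstring)
```

For this input the heaviest path is d, c, b, a with total weight 4 + 3 + 3 = 10, giving the superstring ACGGACTACC of length 20 − 10 = 10.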

  2. Solution: Use Euler cycle/path approach
     Solution: Use Euler cycle/path in a de Bruijn graph (instead of heaviest Hamiltonian cycle/path in an overlap graph).

     Euler cycle/path vs. Hamiltonian cycle/path
     • Hamiltonian cycle/path: uses every vertex exactly once
     • Euler cycle/path: uses every edge exactly once

     Fact
     Finding an Euler cycle (or Euler path) can be solved in polynomial time.
     But: We have to find a way of modelling our problem in the right way.

     Recall: Eulerian cycles and the bridges of Königsberg

     Recall Euler cycle/path
     Theorem
     A directed graph has an Euler cycle (= Euler tour) if and only if it is connected and for all vertices v: indeg(v) = outdeg(v) (i.e. all vertices are balanced). Such a graph is called Eulerian.
     Theorem
     A directed graph has an Euler path if and only if
     • it is Eulerian, or
     • it is connected, there are two vertices s, t, for which indeg(s) = outdeg(s) − 1 and indeg(t) = outdeg(t) + 1, and all other vertices are balanced.
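The two theorems above translate directly into a linear-time test. The following is a sketch only: the names `degree_imbalance`, `is_connected` and `classify` are made up for illustration, the graph is given as a list of directed edges, and "connected" is interpreted as weakly connected on the vertices that carry at least one edge (an assumption, since the slides do not spell this out).

```python
from collections import defaultdict

def degree_imbalance(edges):
    """Map each vertex to outdeg(v) - indeg(v)."""
    diff = defaultdict(int)
    for u, v in edges:
        diff[u] += 1
        diff[v] -= 1
    return dict(diff)

def is_connected(edges):
    """Weak connectivity, ignoring edge directions and isolated vertices."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)

def classify(edges):
    """Return 'cycle' (Eulerian), 'path' (Euler path only) or 'none'."""
    if not is_connected(edges):
        return "none"
    unbalanced = sorted(d for d in degree_imbalance(edges).values() if d != 0)
    if not unbalanced:
        return "cycle"          # every vertex is balanced
    if unbalanced == [-1, 1]:
        return "path"           # exactly one s (+1) and one t (-1)
    return "none"

print(classify([("a", "b"), ("b", "c"), ("c", "a")]))  # balanced triangle
print(classify([("a", "b"), ("b", "c")]))              # s = a, t = c
print(classify([("a", "b"), ("a", "c")]))              # a has surplus 2
```

Note that a graph classified as 'cycle' also has an Euler path, by the first bullet of the second theorem.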

  3. Recall Euler cycle/path
     Theorem
     If G is Eulerian, then an Euler cycle can be found in time O(|E|).
     Proof
     Use Hierholzer's algorithm:
     • Start from any vertex v, go along so far untraversed edges. This is always possible, because every vertex is balanced.
     • Eventually we get back to v (why?). Now if there are still untraversed edges, then there must be a vertex u in the cycle so far visited which has untraversed incident edges, since the graph is connected.
     • Create a new cycle starting from u, unite the new cycle with the old one.
     • Repeat until no untraversed edges are left.
     Note: Similar for Eulerian path, start from s, will end up in t.

     Application to the Fragment Assembly problem
     We will use de Bruijn graphs for modelling our problem:
     • create a de Bruijn graph from the input fragments
     • find an Eulerian path in this de Bruijn graph
     • this Eulerian path will yield the desired string

     De Bruijn graphs
     [Figure: the de Bruijn graph over {0, 1} with vertices 000, 001, ..., 111 and edges 0000, 0001, ..., 1111. The numbers (1 to 16) give the order of the edges in an Eulerian cycle.]
     Named after Nicolaas de Bruijn, who introduced these graphs in 1946, for a different problem.

     Definition of (full) de Bruijn graphs
     Let Σ be our alphabet. (E.g. Σ = {A, C, G, T} or Σ = {0, 1} or Σ = {a, b, c})
     Definition [2]
     The de Bruijn graph over Σ of order k is a directed graph G = (V, E) s.t. $V = \Sigma^{k-1}$ and $(u, v) \in E$ if $u_2 \ldots u_{k-1} = v_1 \ldots v_{k-2}$.
     (Equivalently: $(u, v) \in E$ if there exists a word $w \in \Sigma^k$ s.t. u is the (k − 1)-length prefix of w and v is the (k − 1)-length suffix of w.)
     N.B. Note that $E = \Sigma^k$, and that the graph has loops (e.g. (000, 000) ∈ E).
     [2] Some people call these de Bruijn graphs of order k − 1.

     Modelling our problem with de Bruijn graphs
     N.B. For simplicity, for now our sequence to be reconstructed is assumed to be circular. E.g. bacterial genomes are circular.
     [Figure: the letters A, T, G, G, C, G, T, G, C, A arranged on a circle.]
     String can be read as: ATGGCGTGCA, TGGCGTGCAA, GGCGTGCAAT, ...

     Alternative definition of de Bruijn (sub)graphs
     Let Σ be our alphabet. (E.g. Σ = {A, C, G, T} or Σ = {0, 1} or Σ = {a, b, c})
     Definition
     A directed graph G = (V, E) is called a de Bruijn (sub)graph of order k if $V \subseteq \Sigma^{k-1}$ and for all u, v ∈ V: if (u, v) ∈ E then there exists a word $w \in \Sigma^k$ s.t. u is the (k − 1)-length prefix of w and v is the (k − 1)-length suffix of w.
     Example
     u = GCA, v = CAA, w = GCAA.
     N.B. These are subgraphs of the original de Bruijn graph. Many researchers, esp. in bioinformatics, call these graphs de Bruijn graphs. There exists also the version with multiple edges (multigraph, later).
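As an illustration of the definition of the full de Bruijn graph and of Hierholzer's algorithm from the proof above, the following sketch builds the graph over Σ = {0, 1} of order k = 4 and finds an Eulerian cycle in it. The function names are hypothetical, and the stack-based formulation is one common way to implement Hierholzer's idea of splicing cycles together; it is not taken from the slides. The cycle it finds need not use the same edge order as the numbering in the figure.

```python
from itertools import product

def full_de_bruijn_edges(alphabet, k):
    """One edge per word w in alphabet^k: (k-1)-prefix of w -> (k-1)-suffix of w."""
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return [(w[:-1], w[1:]) for w in words]

def euler_cycle(edges):
    """Hierholzer's algorithm, iterative version; assumes the graph is Eulerian.

    Returns the cycle as a list of vertices whose first and last entries agree.
    Each edge is pushed and popped once, so the running time is O(|E|).
    """
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    stack, cycle = [edges[0][0]], []
    while stack:
        u = stack[-1]
        if out.get(u):
            # Follow an untraversed edge out of u.
            stack.append(out[u].pop())
        else:
            # u has no untraversed edges left: it is finished.
            cycle.append(stack.pop())
    return cycle[::-1]

edges = full_de_bruijn_edges("01", 4)
cycle = euler_cycle(edges)
# Each step u -> v of the cycle corresponds to the word u + v[-1].
words = [u + v[-1] for u, v in zip(cycle, cycle[1:])]
print(len(edges), ("000", "000") in edges)
print(words)
```

The printed list contains each of the 16 binary words of length 4 exactly once, in the order in which the cycle traverses the corresponding edges.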

  4. Modelling our problem with de Bruijn graphs
     Input: A collection F of strings.
     First step: Generate all k-length substrings of fragments in F.
     Example
     F = {ATGGCGT, CAATGGC, CGTGCAA, GGCGTGC, TGCAATG}.
     For k = 3, we get: AAT, ATG, CAA, CGT, GCA, GCG, GGC, GTG, TGC, TGG.

     Modelling our problem with de Bruijn graphs
     Now from the k-mers, we generate the (k − 1)-length prefixes and suffixes: AA, AT, CA, CG, GC, GG, GT, TG. These are the vertices. The edges are the k-mers.
     • F = {ATGGCGT, CAATGGC, CGTGCAA, GGCGTGC, TGCAATG}, k = 3
     • edges: AAT, ATG, CAA, CGT, GCA, GCG, GGC, GTG, TGC, TGG (remember to only put an edge if the k-mer is present!)
     • vertices: AA, AT, CA, CG, GC, GG, GT, TG
     [Figure: the de Bruijn graph on these vertices, with each edge labelled by its k-mer and numbered 1 to 10; figure label "Genome: ATGGCGTGCAATGGCGT".]
     The numbers on the edges give an Eulerian cycle in this graph: ATGGCGTGCA

     Comparison to other models
     Compare to modelling the same problem with overlap graphs:
     F = {ATGGCGT, CAATGGC, CGTGCAA, GGCGTGC, TGCAATG}
     [Figure: overlap graph whose vertices are the five fragments, with numbered edges.]
     Note that not all non-zero weight edges are included in the figure.
     The numbers on the edges give a Hamiltonian cycle: ATGGCGTGCA.
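The construction above can be sketched in a few lines of Python for the example F and k = 3. The helper names (`kmers`, `de_bruijn_from_reads`, `spell_circular`) are illustrative assumptions, not part of the slides. The sketch uses the simple-graph version (each distinct k-mer is one edge) and assumes that the resulting graph is Eulerian and the target string is circular.

```python
def kmers(reads, k):
    """Set of all k-length substrings of the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def de_bruijn_from_reads(reads, k):
    """Edges are the distinct k-mers; vertices are their (k-1)-length prefixes and suffixes."""
    edges = sorted(kmers(reads, k))
    vertices = sorted({w[:-1] for w in edges} | {w[1:] for w in edges})
    return vertices, edges

def spell_circular(edges):
    """Walk an Eulerian cycle (Hierholzer, stack-based) and spell a circular string."""
    out = {}
    for w in edges:
        out.setdefault(w[:-1], []).append(w[1:])
    stack, cycle = [edges[0][:-1]], []
    while stack:
        u = stack[-1]
        if out.get(u):
            stack.append(out[u].pop())
        else:
            cycle.append(stack.pop())
    cycle.reverse()
    # Each traversed edge contributes the last letter of its end vertex.
    return "".join(v[-1] for v in cycle[1:])

F = ["ATGGCGT", "CAATGGC", "CGTGCAA", "GGCGTGC", "TGCAATG"]
k = 3
vertices, edges = de_bruijn_from_reads(F, k)
genome = spell_circular(edges)
print(vertices)
print(edges)
print(genome)
```

The Eulerian cycle in this graph is not unique: besides ATGGCGTGCA, the cycle spelling ATGCGTGGCA also uses every edge exactly once. The sketch may therefore print a different cycle, or a different rotation, than the one numbered on the slide; what it guarantees is that the circular 3-mers of the printed string are exactly the ten edges.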
