
splitMEM: graphical pan-genome analysis with suffix skips



  1. splitMEM: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

  2. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  3. Objective • Input: several complete genomes. Available today for many microbial species, in the near future for higher eukaryotes. • Pan-genome: analyze multiple genomes of a species together. • Output: compressed de Bruijn graph. Graphical representation depicts how population variants relate to each other, especially where they diverge at branch points. • How well conserved is a sequence? • What are network properties?

  4. de Bruijn graph • Node for each distinct k-mer • Directed edge connects consecutive k-mers • Nodes overlap by k-1 bp • Self-loops, multi-edges • Example sequences: AGAAGTCC, ATAAGTTA • Reconstruct original sequence: Eulerian path through graph, visit each edge once
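
The construction on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not state k, so k = 3 is an assumption, and the name `de_bruijn` is hypothetical. Edges are stored in a `Counter` so that multi-edges keep their multiplicity.

```python
from collections import Counter

def de_bruijn(seqs, k):
    """Return (nodes, edges) of the k-mer de Bruijn graph of the sequences.

    nodes: set of distinct k-mers.
    edges: Counter mapping (kmer, next_kmer) to how often that pair is
           consecutive in the input, so multi-edges are preserved.
    """
    nodes = set()
    edges = Counter()
    for s in seqs:
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        nodes.update(kmers)
        # consecutive k-mers overlap by k-1 characters
        edges.update(zip(kmers, kmers[1:]))
    return nodes, edges

# k = 3 is an assumed value, chosen only to keep the example small
nodes, edges = de_bruijn(["AGAAGTCC", "ATAAGTTA"], k=3)
print(len(nodes), "nodes;", sum(edges.values()), "edges")
print("AAG -> AGT multiplicity:", edges[("AAG", "AGT")])
```

Both example sequences contain AAGT, so the edge AAG -> AGT is traversed once per sequence and appears as a multi-edge.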

  5. Compressed de Bruijn graph • Merge non-branching chains of nodes • Min. number of nodes that preserve path labels • Usually built from uncompressed graph • We build directly in O(n log n) time and space
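
As a rough sketch of what "merge non-branching chains" means, the snippet below compresses the graph the usual way, starting from the uncompressed k-mer graph. It is not the direct O(n log n) construction the talk describes. The function name `compressed_de_bruijn` and the choice k = 3 are assumptions made for illustration.

```python
from collections import defaultdict

def compressed_de_bruijn(seqs, k):
    """Return the sorted node labels of the compressed de Bruijn graph,
    obtained by merging non-branching chains of the k-mer graph."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for s in seqs:
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)

    def mergeable(a, b):
        # a -> b is non-branching: b is a's only successor
        # and a is b's only predecessor (self-loops are never merged)
        return a != b and succ[a] == {b} and pred[b] == {a}

    labels, seen = [], set()

    def walk(start):
        label, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if nxt in seen or not mergeable(cur, nxt):
                break
            label += nxt[-1]  # each merged k-mer adds one new character
            seen.add(nxt)
            cur = nxt
        labels.append(label)

    # chain heads: k-mers that cannot be merged into a predecessor
    for v in sorted(nodes):
        if not any(mergeable(p, v) for p in pred[v]):
            walk(v)
    # anything left lies on an isolated non-branching cycle
    for v in sorted(nodes):
        if v not in seen:
            walk(v)
    return sorted(labels)

labels = compressed_de_bruijn(["AGAAGTCC", "ATAAGTTA"], k=3)
print(labels)
```

With these inputs the ten 3-mers collapse into five nodes, and each input sequence is still spelled by a path through them (for example AGAA, AAGT, GTCC for the first sequence).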

  6. Compressed de Bruijn graph 9 strains of Bacillus anthracis k=25

  7. Compressed de Bruijn graph 9 strains of Bacillus anthracis k=1000

  8. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  9. Suffix Tree • Rooted, directed tree with leaf for each suffix. • Each internal node, except the root, has at least two children. • Each edge is labeled with nonempty substring. • No two siblings begin with the same character. • Path from root to leaf i spells suffix S[i . . . n]. • Append special character $ to guarantee each suffix ends at leaf.

  10. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: inserting suf 1 = banana$]

  11. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: inserting suf 2 = anana$]

  12. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: inserting suf 3 = nana$]

  13. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: inserting suf 4 = ana$]

  14. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: suf 4 = ana$ added to the tree holding suf 1, suf 2 and suf 3]

  15. Constructing Suffix Tree Naïve algorithm S = banana$ [Figure: complete suffix tree with leaves suf 1 to suf 7] O(n^2) time
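
The naïve construction animated on slides 10 to 15 can be sketched as below: insert the suffixes one at a time, walking from the root and splitting an edge where a suffix diverges from it. The names `Node`, `build_suffix_tree` and `leaves` are hypothetical, and the code assumes the input ends in a unique terminator such as `$`. It makes no attempt at the linear-time construction.

```python
class Node:
    def __init__(self):
        self.children = {}  # first char of edge label -> (edge label, child)
        self.suffix = None  # 1-based suffix start, set on leaves only

def build_suffix_tree(s):
    """Naive O(n^2) suffix tree; assumes s ends with a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while rest:
            entry = node.children.get(rest[0])
            if entry is None:
                # no edge starts with this character: hang a new leaf here
                leaf = Node()
                leaf.suffix = i + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            j = 0  # length of the common prefix of the edge label and rest
            while j < len(label) and j < len(rest) and label[j] == rest[j]:
                j += 1
            if j == len(label):
                node, rest = child, rest[j:]  # whole edge matched, descend
            else:
                # mismatch inside the edge: split it at offset j
                mid = Node()
                mid.children[label[j]] = (label[j:], child)
                node.children[rest[0]] = (label[:j], mid)
                node, rest = mid, rest[j:]
    return root

def leaves(node):
    """Sorted suffix numbers of all leaves below node."""
    if not node.children:
        return [node.suffix]
    return sorted(x for _, c in node.children.values() for x in leaves(c))

root = build_suffix_tree("banana$")
for first, (label, child) in sorted(root.children.items()):
    print(repr(label), "->", leaves(child))
```

The printed root edges are `$`, `a`, `banana$` and `na`, with leaf numbers following the slides' 1-based suffix numbering.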

  16. Constructing Suffix Tree O(n) time using suffix links [Figure: suffix tree of banana$ with suffix links] On-line Construction of Suffix Trees, E. Ukkonen, Algorithmica (1995)

  17. Suffix Tree Query S = banana$ Search for ban [Figure: suffix tree of banana$]

  18. Suffix Tree Query S = banana$ Search for ban [Figure: b matched on edge banana$]

  19. Suffix Tree Query S = banana$ Search for ban [Figure: ba matched on edge banana$]

  20. Suffix Tree Query S = banana$ Search for ban [Figure: ban matched on edge banana$] Found 1 occurrence

  21. Suffix Tree Query S = banana$ Search for band [Figure: ban matched on edge banana$, then mismatch] Not found

  22. Suffix Tree Query S = banana$ Search for an [Figure: an matched, two leaves below] Found 2 occurrences
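
The three queries on slides 17 to 22 can be reproduced with a small sketch. For brevity it uses an uncompressed suffix trie (one character per edge) rather than a suffix tree, which is a simplification. The search is the same idea: walk the pattern down from the root, then count the leaves below the point reached. The names `build_suffix_trie` and `count_occurrences` are hypothetical.

```python
def build_suffix_trie(s):
    """Nested-dict suffix trie; the key None marks a leaf and stores
    the 1-based start of the suffix ending there."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1
    return root

def count_occurrences(trie, pattern):
    """Number of occurrences of pattern = number of leaves below its path."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0  # fell off the trie: pattern does not occur
        node = node[ch]
    total, stack = 0, [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                total += 1
            else:
                stack.append(child)
    return total

trie = build_suffix_trie("banana$")
for pattern in ("ban", "band", "an"):
    print(pattern, count_occurrences(trie, pattern))
```

A trie takes quadratic space in the worst case, which is why the slides use a suffix tree. The counts agree with the slides: one occurrence of ban, none of band, two of an.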

  23. Suffix Tree Many applications in computational biology. Linear time construction algorithms. Linear time solutions to • Genome alignment • Finding longest common substring • All-pairs suffix-prefix matching • Locating all maximal repetitions • And many more…

  24. MEMs Maximal Exact Match (MEM): exact match within sequence that cannot be extended left or right without introducing mismatch. [Figure: T G C A C G C A A] We are interested in MEMs of length ≥ k

  25. MEMs Maximal Exact Match (MEM): exact match within sequence that cannot be extended left or right without introducing mismatch. MEMs are internal nodes in the suffix tree that have left-diverse descendants (descendant leaves that represent suffixes with different characters preceding them). Linear-time suffix tree traversal to locate MEMs.

  26. MEMs in Suffix Tree Possible MEMs: a, ana, na S = banana$ [Figure: suffix tree of banana$ with internal nodes a, ana, na marked "MEM?"] MEMs are internal nodes in suffix tree with left-diverse descendants

  27. MEMs in Suffix Tree MEMs: a, ana S = banana$ [Figure: node ana has leaves suf 2 (b|anana$) and suf 4 (n|ana$), preceded by different characters] MEMs are internal nodes in suffix tree with left-diverse descendants

  28. MEMs in Suffix Tree MEMs: a, ana S = banana$ [Figure: node na has leaves suf 3 (a|nana$) and suf 5 (a|na$), both preceded by a, so na is marked X; a and ana are marked MEM] MEMs are internal nodes in suffix tree with left-diverse descendants
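
The banana$ example can be checked with a brute-force sketch that applies the same two conditions without building a suffix tree: a repeated substring corresponds to an internal node when its occurrences are followed by at least two different characters, and it is left-diverse when they are preceded by at least two different characters. The function name `find_mems` is hypothetical, and the code enumerates every substring, so it is only meant for tiny inputs, not as the linear-time traversal mentioned on slide 25.

```python
from collections import defaultdict

def find_mems(s, min_len=1):
    """Brute-force MEMs of s (assumed to end with a unique terminator):
    repeated substrings of length >= min_len whose occurrences differ in
    both the preceding and the following character."""
    n = len(s)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            starts_of[s[i:j]].append(i)

    mems = []
    for w, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not a repeat
        # None stands for "no character" at either end of s
        left = {s[i - 1] if i > 0 else None for i in starts}
        right = {s[i + len(w)] if i + len(w) < n else None for i in starts}
        if len(left) > 1 and len(right) > 1:
            mems.append(w)
    return sorted(mems)

print(find_mems("banana$"))             # all MEMs
print(find_mems("banana$", min_len=2))  # only MEMs of length >= 2
```

Here na is rejected because both of its occurrences are preceded by a, matching the X on slide 28. The `min_len` parameter plays the role of the length ≥ k filter.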

  29. Outline 1 Overview 2 Data Structures 3 splitMEM Algorithm 4 Pan-genome Analysis

  30. Compressed de Bruijn graph Types of nodes: i. repeatNodes ii. uniqueNodes Input: AGAAGTCC$ATAAGTTA

  31. splitMEM Nodes in compressed de Bruijn graph classified as i. repeatNodes ii. uniqueNodes Algorithm: 1 Construct set of repeatNodes 2 Sort start positions of repeatNodes 3 Create edges and uniqueNodes to link non-contiguous repeatNodes

  32. repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k

  33. 1 MEM occurs twice [Figure: T G C A C … G G C A A, with the MEM GCA marked in both occurrences]

  34. Overlapping MEMs [Figure: TGCCATCGCCAACCAT, shown twice with overlapping MEMs highlighted]

  35. Tandem Repeat [Figure: AGGCTTGGCTTGGCTTGGCTA, shown five times with different repeat occurrences highlighted]

  36. repeatNodes 1 Construct set of repeatNodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length ≥ k 3. Preprocess suffix tree for LMA queries 4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k

  37. Split MEM to repeatNodes [Figure: sequences containing segments labeled α, β and γ; the MEM is decomposed into nodes labeled with those segments]

  38. Split MEM to repeatNodes [Figure: sequences containing segments labeled α, β and γ, with the MEM marked] Find MEM in suffix tree.
