  1. Maximal Common Subsequence Enumeration: How Graph Structure Helped Solve a String Problem
     Giulia Punzi, PhD Student in Computer Science, Department of Computer Science
     Mauriana Pesaresi PhD Seminars – April 20th 2020
     Reference: A. Conte, R. Grossi, G. Punzi, T. Uno; “Maximal Common Subsequence Enumeration”, SPIRE 2019.
     Giulia Punzi MCS Enumeration Pesaresi Seminar – April 20th 2020 1 / 24

  2. Introduction: Outline
     1. Strings and Graphs
     2. Our String Problem: Enumerating Maximal Common Subsequences
     3. Why is it hard?
     4. A Change of Perspective: Graphs
     5. Conclusions and Future Work

  3. Introduction: Strings and Graphs
     [Figure: example string "a b a c b"]
     Strings and graphs are both ubiquitous in Computer Science.
     ◮ Strings: most information is textual.
     ◮ Graphs: essential to represent relationships and network structure.

  5. Introduction: Combining Strings and Graphs
     Oftentimes, the two structures are combined:
     ◮ Bioinformatics: DNA sequences are represented with de Bruijn graphs;
     ◮ Search Engines: textual information is naturally linked with a graph structure;
     ◮ DFAs: graphs which correspond to regular languages.
     We will study one instance where a difficult string problem was solved using the underlying graph structure: Maximal Common Subsequence Enumeration.

  8. Introduction: Maximal Common Subsequences
     Given an alphabet Σ, a string is a concatenation of any number of its characters. A subsequence of a string X, denoted S ⊂ X, is a string S obtained from X by removing any number of not necessarily contiguous characters.
     Definition. Given X, Y over Σ, a Longest Common Subsequence (LCS) of X and Y is a common subsequence of maximum length.
     Definition (Sakai 2018). Given X, Y over Σ, a string S is a Maximal Common Subsequence of X and Y, denoted S ∈ MCS(X, Y), if
     1. S ⊂ X and S ⊂ Y;
     2. S ⊂ W with W ⊂ X, W ⊂ Y ⇒ S = W.

  10. Introduction: Maximal Common Subsequences
      Example. Let Σ = {A, C, G, T} and consider X = ATCAGGT and Y = GACTAT. Then:
      1. S = ACT is a common subsequence of X and Y;
      2. MCS(X, Y) = {ACAT, ATAT, GT}.
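
The two definitions can be checked mechanically on the example above. The following is a minimal sketch, not the method from the talk: the function names `is_subsequence`, `is_common` and `is_mcs` are illustrative. It relies on the observation that if a common subsequence S has a longer common supersequence W, then some string obtained from S by inserting a single character is itself common, so testing all one-character insertions is enough to decide maximality.

```python
def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    # Each `ch in it` resumes scanning x where the previous search stopped.
    return all(ch in it for ch in s)


def is_common(s: str, x: str, y: str) -> bool:
    """Condition 1 of the definition: s is a subsequence of both x and y."""
    return is_subsequence(s, x) and is_subsequence(s, y)


def is_mcs(s: str, x: str, y: str) -> bool:
    """Check both conditions of the MCS definition by brute force."""
    if not is_common(s, x, y):
        return False
    alphabet = set(x) & set(y)
    # Condition 2: no single-character insertion may stay common.
    for i in range(len(s) + 1):
        for c in alphabet:
            if is_common(s[:i] + c + s[i:], x, y):
                return False
    return True


X, Y = "ATCAGGT", "GACTAT"
print(is_common("ACT", X, Y), is_mcs("ACT", X, Y))   # True False
print([s for s in ("ACAT", "ATAT", "GT") if is_mcs(s, X, Y)])
# ['ACAT', 'ATAT', 'GT']
```

ACT fails the maximality test because inserting A between C and T gives ACAT, which is still common to X and Y.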

  13. Introduction: MCS vs LCS
      LCS: one of the main string comparison tools.
      Limitation: LCS has a conditional quadratic lower bound: no strongly subquadratic algorithm exists unless SETH fails (Abboud et al., 2015).
      MCS are a natural generalization of LCS:
      ◮ One MCS can be found in O(n log log n) time (Sakai 2018);
      ◮ MCS might reveal alternative smaller alignments.
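
For comparison, here is the textbook dynamic program for the LCS length, which runs in time proportional to |X|·|Y| and so matches the quadratic behaviour mentioned above. It is a standard illustration, not Sakai's MCS algorithm, and the name `lcs_length` is chosen here for the sketch.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, using O(|x| * |y|) time."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)      # extend a match diagonally
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length("ATCAGGT", "GACTAT"))  # 4
```

On X = ATCAGGT and Y = GACTAT the LCS length is 4 (ACAT and ATAT), while GT is an MCS of length 2: a maximal alignment that an LCS computation never reports.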

  17. Our Aim: Efficient MCS Enumeration
      Enumeration algorithm: lists every element of a given set exactly once.
      Polynomial delay: the time between the output of two consecutive solutions is polynomial in the input size.
      Problem (MCS Enumeration). List all distinct maximal common subsequences S ∈ MCS(X, Y), for X, Y of length O(n) over Σ of size σ, with polynomial delay.
      Note that by distinct we mean as elements of the set MCS(X, Y): a string with multiple occurrences must be output only once.
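
To see why the remark on distinctness matters, the sketch below counts in how many ways a string occurs as a subsequence of another. It uses the strings X = TAAGCC and Y = TAGACT from the enumeration example; the helper name `count_embeddings` is an assumption made for this illustration. An algorithm that enumerates occurrences rather than strings would report the same MCS several times.

```python
def count_embeddings(s: str, x: str) -> int:
    """Number of distinct index choices in x that spell out s."""
    # ways[k] = number of ways to embed s[:k] in the part of x read so far.
    ways = [1] + [0] * len(s)
    for ch in x:
        # Go backwards so the current character of x is used at most once
        # per embedding.
        for k in range(len(s), 0, -1):
            if s[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(s)]


X, Y = "TAAGCC", "TAGACT"
for s in ("TAGC", "TAAC"):
    print(s, count_embeddings(s, X), count_embeddings(s, Y))
# TAGC 4 1
# TAAC 2 1
```

TAGC occurs four times in X (two choices for A, two for C) but is a single element of MCS(X, Y), so it must be listed once.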

  26. Our Aim: MCS Enumeration
      Example (Enumeration). X = TAAGCC, Y = TAGACT.
      Output:
      ◮ TAGC
      ◮ TAAC
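
The expected output of this example can be reproduced with a naive enumerator. The sketch below is a brute-force baseline, not the polynomial-delay algorithm of the talk: it explores every common subsequence by one-character insertions starting from the empty string, so its running time can be exponential. The name `enumerate_mcs` is illustrative. A `seen` set guarantees that each string is visited, and therefore output, at most once.

```python
from collections import deque


def is_subsequence(s: str, x: str) -> bool:
    it = iter(x)
    return all(ch in it for ch in s)


def enumerate_mcs(x: str, y: str):
    """Yield every distinct maximal common subsequence of x and y once."""
    alphabet = sorted(set(x) & set(y))
    seen = {""}
    queue = deque([""])
    while queue:
        s = queue.popleft()
        # All common strings reachable from s by inserting one character.
        extensions = {
            s[:i] + c + s[i:]
            for i in range(len(s) + 1)
            for c in alphabet
        }
        extensions = {
            w for w in extensions
            if is_subsequence(w, x) and is_subsequence(w, y)
        }
        if not extensions:
            yield s          # nothing can be inserted: s is maximal
        for w in sorted(extensions):
            if w not in seen:
                seen.add(w)
                queue.append(w)


print(sorted(enumerate_mcs("TAAGCC", "TAGACT")))  # ['TAAC', 'TAGC']
```

Every common subsequence is reachable this way because it can be built from the empty string one character at a time, and each intermediate string is again common.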

  29. Pitfalls of MCS Enumeration
      1. Using a divide and conquer approach: MCS do not naturally combine.
         Example. X = AGA | TGA, Y = TAG | GAT, with MCS(X, Y) = {AGGA, AGAT, TGA}: the combination AGT of AG ∈ MCS(AGA, TAG) and T ∈ MCS(TGA, GAT) (the two submaximals highlighted in blue on the slide) is not maximal, since AGT ⊂ AGAT.
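
The failure of divide and conquer can be verified directly on this example. The sketch below assumes a brute-force helper `mcs_set` (an illustrative name, exponential in the input length) and compares the concatenations of the MCS of the two halves with the MCS of the whole strings.

```python
from itertools import combinations, product


def is_subsequence(s: str, x: str) -> bool:
    it = iter(x)
    return all(ch in it for ch in s)


def mcs_set(x: str, y: str) -> set:
    """Brute force: collect all common subsequences, keep the maximal ones."""
    short, long_ = sorted((x, y), key=len)
    common = set()
    for k in range(len(short) + 1):
        for idx in combinations(range(len(short)), k):
            s = "".join(short[i] for i in idx)
            if s not in common and is_subsequence(s, long_):
                common.add(s)
    alphabet = set(x) & set(y)
    # s is maximal iff no one-character insertion is still common.
    return {
        s for s in common
        if not any(s[:i] + c + s[i:] in common
                   for i in range(len(s) + 1) for c in alphabet)
    }


X1, X2 = "AGA", "TGA"
Y1, Y2 = "TAG", "GAT"
left = mcs_set(X1, Y1)
right = mcs_set(X2, Y2)
combined = {a + b for a, b in product(left, right)}
whole = mcs_set(X1 + X2, Y1 + Y2)

print(sorted(left), sorted(right))   # ['AG'] ['GA', 'T']
print(sorted(combined))              # ['AGGA', 'AGT']
print(sorted(whole))                 # ['AGAT', 'AGGA', 'TGA']
print(sorted(combined - whole))      # ['AGT']
```

Combining the halves both produces a non-maximal string (AGT) and misses two true MCS (AGAT and TGA), which use characters from both sides of the cut in ways the halves cannot see.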

  31. Pitfalls of MCS Enumeration
      2. Thinking that MCS are few in number: MCS can be exponentially many even for |Σ| = 2.
         Example. The two strings X = A ◦ (CCA)^n and Y = A ◦ (CA)^⌊3n/2⌋ have an exponential number of MCS.
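
The family above can be explored for small n with a brute-force count. This is only a sketch for inspecting the growth: the helpers `mcs_set` and `family` are illustrative names, the enumeration is exponential in the string length, and no specific counts are claimed here beyond what the code prints when run.

```python
from itertools import combinations


def is_subsequence(s: str, x: str) -> bool:
    it = iter(x)
    return all(ch in it for ch in s)


def mcs_set(x: str, y: str) -> set:
    """Brute force: collect all common subsequences, keep the maximal ones."""
    short, long_ = sorted((x, y), key=len)
    common = set()
    for k in range(len(short) + 1):
        for idx in combinations(range(len(short)), k):
            s = "".join(short[i] for i in idx)
            if s not in common and is_subsequence(s, long_):
                common.add(s)
    alphabet = set(x) & set(y)
    return {
        s for s in common
        if not any(s[:i] + c + s[i:] in common
                   for i in range(len(s) + 1) for c in alphabet)
    }


def family(n: int):
    """X = A(CCA)^n and Y = A(CA)^floor(3n/2), as on the slide."""
    return "A" + "CCA" * n, "A" + "CA" * (3 * n // 2)


# Keep n small: the brute force enumerates 2^|shorter string| index sets.
for n in range(1, 5):
    x, y = family(n)
    print(n, len(x), len(y), len(mcs_set(x, y)))
```

For n = 1 the second string ACA is itself a subsequence of ACCA, so it is the only MCS; from n = 2 on, several maximal strings of the same length coexist (for instance ACCACA and ACACCA).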

  32. Pitfalls of MCS Enumeration
      3. Using an incremental approach? Let X and Y be any two strings and c a character; is it true that MCS(X, Y) ◦ c ↔ MCS(X, Y ◦ c)?

  33. Pitfalls of MCS Enumeration: Incremental Approach is Inefficient
      Some incremental properties can be derived, but they are intrinsically inefficient.
