A Linear Time Algorithm for Seeds Computation


A Linear Time Algorithm for Seeds Computation. Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń. University of Warsaw. SODA 2012, Kyoto, January 18, 2012.


  1. A Linear Time Algorithm for Seeds Computation. Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń. University of Warsaw. SODA 2012, Kyoto, January 18, 2012. Tomasz Kociumaka A Linear Time Algorithm for Seeds Computation 1/20

  7. Periodicity and quasiperiodicity. Periodicity: a b a a a b a a a b a a a b a a a b. One of the key concepts in text algorithms.

  9. Periodicity and quasiperiodicity. Periodicity: a b a a a b a a a b a a a b a a a b. Quasiperiodicity: a a b a a b a a a a b a a a b a a b.
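
The periodicity example on this slide can be checked mechanically. The following is a minimal sketch, not taken from the talk: it computes the shortest period of a word with the classical prefix function (border array). The function names `prefix_function` and `shortest_period` are illustrative choices.

```python
def prefix_function(w):
    """pi[i] = length of the longest proper border of w[:i + 1]."""
    pi = [0] * len(w)
    k = 0
    for i in range(1, len(w)):
        # Fall back to shorter borders until the next letter matches.
        while k > 0 and w[i] != w[k]:
            k = pi[k - 1]
        if w[i] == w[k]:
            k += 1
        pi[i] = k
    return pi


def shortest_period(w):
    """Shortest p such that w[i] == w[i + p] for all valid i (0 for the empty word)."""
    if not w:
        return 0
    # A border of length b corresponds to a period of length len(w) - b.
    return len(w) - prefix_function(w)[-1]


# The periodic word from the slide, written without spaces.
word = "abaaabaaabaaabaaab"
print(shortest_period(word))  # 4, i.e. the block "abaa" repeats
```

The sketch runs in time linear in the length of the word.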

  11. Covers and seeds. Cover: a b a a b a a b a b a a b a b a a b a a b. Each letter of the word is covered by an occurrence of the cover.

  13. Covers and seeds. Seed: a a b a a b a b a a b a b a a b a a. Each letter of the word is covered by an occurrence of the seed. The occurrences can be external, that is, they may overhang either end of the word.
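
To make the difference between covers and seeds concrete, here is a brute-force sketch (quadratic time, so unrelated to the linear-time algorithm of the talk). It assumes the usual formalisation: a seed is a subword of w, and an external occurrence is one that overhangs an end of w and agrees with w on the overlapping part. The names `is_cover` and `is_seed` are hypothetical helpers.

```python
def is_cover(v, w):
    """True if occurrences of v inside w cover every letter of w."""
    covered_up_to = 0  # letters w[0:covered_up_to] are covered so far
    for i in range(len(w) - len(v) + 1):
        if w.startswith(v, i):
            if i > covered_up_to:
                return False  # letter at covered_up_to is never covered
            covered_up_to = i + len(v)
    return len(v) > 0 and covered_up_to == len(w)


def is_seed(v, w):
    """True if v is a subword of w and occurrences of v, possibly
    overhanging either end of w (assumed meaning of 'external'),
    cover every letter of w."""
    m, n = len(v), len(w)
    if m == 0 or v not in w:
        return False
    covered_up_to = 0
    for p in range(-(m - 1), n):  # p < 0 or p + m > n means overhang
        lo, hi = max(p, 0), min(p + m, n)
        if w[lo:hi] == v[lo - p:hi - p]:  # v agrees with w on the overlap
            if p > covered_up_to:
                return False
            covered_up_to = max(covered_up_to, p + m)
    return covered_up_to >= n


cover_word = "abaabaababaababaabaab"  # the cover example from the slides
seed_word = "aabaababaababaabaa"      # the seed example from the slides
print(is_cover("abaab", cover_word))  # True
print(is_cover("abaab", seed_word))   # False: not even a prefix
print(is_seed("abaab", seed_word))    # True, using external occurrences
```

In these two examples the word `abaab` is a cover of the first word and a seed, but not a cover, of the second.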

  16. The main problem. Problem (Shortest-Seed): Given a word w of length n over an alphabet Σ, compute the shortest seed of w. Problem (All-Seeds): Given a word w of length n over an alphabet Σ, compute an O(n)-sized representation of all the seeds of w. Theorem (Our result): The All-Seeds Problem for Σ = {0, 1, ..., n^{O(1)}} can be solved in O(n) time.

  19. Background. Seeds were introduced in 1993 by Iliopoulos, Moore & Park. The same paper gives an O(n log n)-time algorithm for the All-Seeds Problem over a fixed-size alphabet. Until now, no o(n log n)-time algorithm was known, even for the Shortest-Seed Problem over a binary alphabet. In his survey (2000), W.F. Smyth stated finding a linear-time algorithm for the All-Seeds Problem as a hard open problem.

  22. Background. An O(log n)-time PRAM algorithm using n processors: Ben-Amram et al., SODA 1994. For covers, linear-time algorithms for similar problems are known: shortest covers of each prefix (Breslauer, 1992); all covers (Moore & Smyth, SODA 1994). Variants of seeds have been studied: approximate seeds (Christodoulakis et al., 2003); λ-seeds (Guo, Zhang & Iliopoulos, 2006).

  24. Constraints for seeds. There are two different types of constraints: border constraints (easier) and maxgap constraints (harder). Example word: a b a a b a a a a b a a a b a a b, with gaps marked ≤ 5. The maxgap is the maximal distance between the starting positions of two consecutive occurrences of a given subword.
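
The maxgap definition translates directly into code. This is a naive sketch for a single subword, not the suffix-tree computation for all subwords that the talk is about. The slide does not say which subword its "≤ 5" marks refer to, so the choice of `ab` below is an assumption, as is the convention of returning 0 when there are fewer than two occurrences.

```python
def occurrences(v, w):
    """Starting positions of all occurrences of v in w."""
    return [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def maxgap(v, w):
    """Largest distance between starting positions of two consecutive
    occurrences of v in w (0 if v occurs fewer than twice; this
    convention is an assumption of the sketch)."""
    occ = occurrences(v, w)
    if len(occ) < 2:
        return 0
    return max(b - a for a, b in zip(occ, occ[1:]))


word = "abaabaaaabaaabaab"     # the example word from the slide
print(occurrences("ab", word))  # [0, 3, 8, 12, 15]
print(maxgap("ab", word))       # 5
```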

  26. Quasiseeds. The All-Seeds Problem can be linearly reduced to computing the maxgaps of all subwords (encoded in a suffix tree). No o(n log n)-time algorithm is known for that. Definition (Quasiseed): A subword v is a quasiseed of w if there are fewer than |v| letters both before its first occurrence and after its last one, and each letter between those two occurrences is covered by an occurrence of v. Example: a a b a a b a b a a b a b a a b a a, with fewer than 5 letters left over at each end and all letters in between covered.
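
The quasiseed definition can be tested directly on the example word. The sketch below is a brute-force check for one candidate subword, written only to illustrate the definition; `is_quasiseed` is a hypothetical helper name, and the candidate `abaab` (length 5, matching the "< 5" marks) is an assumed reading of the figure.

```python
def is_quasiseed(v, w):
    """Check the quasiseed definition for subword v of w by brute force."""
    m = len(v)
    occ = [i for i in range(len(w) - m + 1) if w.startswith(v, i)]
    if m == 0 or not occ:
        return False
    letters_before = occ[0]                # letters before the first occurrence
    letters_after = len(w) - (occ[-1] + m)  # letters after the last occurrence
    if letters_before >= m or letters_after >= m:
        return False
    # Every letter between first and last occurrence is covered exactly
    # when consecutive occurrences start at most m positions apart.
    return all(b - a <= m for a, b in zip(occ, occ[1:]))


word = "aabaababaababaabaa"          # the example word from the slide
print(is_quasiseed("abaab", word))   # True
print(is_quasiseed("aab", word))     # False: a gap of 5 exceeds |v| = 3
```

The last line of the function is where the maxgap constraint appears: the condition says that the maxgap of v is at most |v|.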

  27. Useful properties of quasiseeds. An O(n) representation on the suffix tree. [Figure: suffix tree with nodes labelled by subwords, regions marked "quasiseeds" and "non-quasiseeds", and a node v.]

  29. Useful properties of quasiseeds. Lemma (Restricted-Quasiseeds): Given an integer d and a word w of length n, the representation of all quasiseeds of length in {d, d+1, ..., 2d} can be found in O(n) time. The All-Seeds Problem can be linearly reduced to computing (the representation of) all quasiseeds.

  30. Main problem. Problem (All-Quasiseeds): Given a word of length n, compute the representation of all its quasiseeds.

  32. Recursive structure of the algorithm. Interval m-staircase. [Figure: overlapping intervals over the word w, with lengths m and 3m marked.] Lemma (Short Quasiseeds): A subword v of length < m is a quasiseed of w if and only if it is a quasiseed of each subword corresponding to an m-staircase interval.
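
The slide only shows the staircase as a picture, so the following sketch rests on an assumed reading of it: intervals of length 3m whose starting positions are m apart, with the last interval ending at the end of the word. The function `staircase` and its half-open interval convention are illustrative, not the authors' definition.

```python
def staircase(n, m):
    """Half-open intervals (start, end) of an assumed m-staircase over a
    word of length n: length 3m (clipped to n), starts spaced m apart."""
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    intervals = []
    start = 0
    while True:
        end = min(start + 3 * m, n)
        intervals.append((start, end))
        if end == n:  # the word is fully covered, stop
            break
        start += m
    return intervals


ivs = staircase(1000, 10)
total = sum(end - start for start, end in ivs)
print(len(ivs), total)  # 98 2940, so the total length is close to 3 * 1000
```

Under this reading, consecutive intervals overlap in 2m positions, which is why the total length of the staircase is roughly three times the length of the word when m is small compared with n.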

  35. Recursive structure of the algorithm. The total length of the intervals in the staircase (the size of the staircase) is about 3n. If it were n/2, the recursion could yield a linear algorithm. We need to reduce the staircase.
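
The remark about n/2 can be read as a standard geometric-series argument. As a sketch, assume (the slides do not spell out the recursion) that one level of the algorithm costs at most cn on input of total size n and then recurses on input of total size at most n/2. Then the running time T(n) satisfies:

```latex
T(n) \;\le\; c\,n + T\!\left(\tfrac{n}{2}\right)
     \;\le\; c\,n\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right)
     \;\le\; 2\,c\,n \;=\; O(n).
```

With a staircase of size about 3n the total input does not shrink from one level to the next, so this bound is not available, which is the motivation for reducing the staircase.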
