
Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes


  1. Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes

     Maxime Crochemore (3, 5), Costas S. Iliopoulos (3, 4), Tomasz Kociumaka (1), Marcin Kubica (1), Alessio Langiu (3), Solon P. Pissis (3, 4), Jakub Radoszewski (1), Wojciech Rytter (1, 2), Tomasz Waleń (1)

     (1) University of Warsaw, Warsaw, Poland; (2) Copernicus University, Toruń, Poland; (3) King's College London, London, UK; (4) University of Western Australia, Perth, Australia; (5) Université Paris-Est, France

     SPIRE 2013, 2013-10-09

  2. Order preserving model

     Relation ≈: two words x and y are called order-isomorphic, written x ≈ y, iff |x| = |y| and for all i, j we have x_i ≤ x_j ⇔ y_i ≤ y_j.

     Example: 1 3 2 4 ≈ 2 6 3 8, but 2 6 3 8 ≉ 3 7 4 5. For x = 2 6 3 8 and y = 3 7 4 5 there are positions i, j (here i = 2, j = 4) with x_i < x_j but y_i > y_j.
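The definition of ≈ can be checked directly by comparing every pair of positions. The following is a minimal quadratic-time sketch; the function name `order_isomorphic` is chosen here for illustration and does not come from the slides.

```python
def order_isomorphic(x, y):
    """Return True iff x and y are order-isomorphic (x ≈ y).

    Direct transcription of the definition: equal lengths, and for all
    pairs (i, j) the comparisons x[i] <= x[j] and y[i] <= y[j] agree.
    Runs in O(n^2) time; meant only to illustrate the relation.
    """
    if len(x) != len(y):
        return False
    n = len(x)
    return all(
        (x[i] <= x[j]) == (y[i] <= y[j])
        for i in range(n)
        for j in range(n)
    )


# The two words from the slide's example.
print(order_isomorphic([1, 3, 2, 4], [2, 6, 3, 8]))
print(order_isomorphic([2, 6, 3, 8], [3, 7, 4, 5]))
```

The first call compares the order-isomorphic pair from the example, the second the pair that differs at positions 2 and 4.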

  6. Applications

     Motivation:
     - melody matching of two musical scores,
     - recognition of trends in the stock market,
     - plain equality (=) is boring, while ≈ has a nice combinatorial definition.

     Related problems:
     - suffix trees for quasi-suffix families,
     - pattern avoidance (as subsequences, not as subwords!),
     - parameterized matching,
     - partial words.

  7. Previous results

     Pattern matching in the order-preserving model: for a pattern of length m and a text of length n, detect the order-preserving occurrences.

     Known results:
     - single pattern matching: O(n + m), Kubica et al., IPL 2013,
     - multiple pattern matching: O(n + M), Kim et al., arXiv 2013,
     - pattern matching with k mismatches: O(n (log log m + k log log k)), Gawrychowski, Uznański, arXiv 2013.

  8. Our results

     Problem: preprocess a text w of length n in such a way that occurrence queries can be answered efficiently.

     Our results:
     - preprocessing time: O(n log log n),
     - query time: O(m + Occ) for a pattern of length m.

  9. Algorithm outline

     - an encoding function Code that reduces testing the ≈ relation to regular equality,
     - a relaxation of the suffix tree definition to make the implementation easier,
     - a modification of Ukkonen's algorithm,
     - an algorithmic toolbox for speeding up factor encoding and suffix tree navigation.

  10. Encoding function (1/2)

      For any i ∈ {1, ..., n} define:
      - α_w(i) = distance to the predecessor of w[i] among the values in w[1..(i−1)],
      - β_w(i) = distance to the successor of w[i] among the values in w[1..(i−1)].

      Example: for w = 6 3 2 5 1 4 we get α_w(6) = 4 and β_w(6) = 2, since w[6] = 4 has predecessor 3 at position 2 and successor 5 at position 4.

  11. Encoding function (2/2)

      Code(w) = (α_w(1), β_w(1)), ..., (α_w(|w|), β_w(|w|)).

      Example: for w = 1 4 2 3, Code(w) = (−, −) (1, −) (2, 1) (1, 2).

      Observation: x ≈ y ⇔ Code(x) = Code(y).
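To make the encoding concrete, here is a naive quadratic sketch of Code. It uses 0-based Python lists and `None` for the undefined entry written "−" on the slide. The tie-breaking is an assumption, since the slides do not spell it out: the predecessor is taken as the largest earlier value ≤ w[i], the successor as the smallest earlier value ≥ w[i], and among equal candidates the most recent position wins.

```python
def code(w):
    """Naive O(n^2) computation of Code(w) as a list of (alpha, beta) pairs.

    alpha: distance back to the predecessor of w[i] among w[0..i-1],
    beta:  distance back to the successor of w[i] among w[0..i-1].
    None stands for the undefined entry ("−" on the slide).
    Tie-breaking (assumption): <= / >= comparisons, nearest position wins.
    """
    result = []
    for i, value in enumerate(w):
        pred = succ = None
        for j in range(i):
            # Largest earlier value not exceeding w[i]; later j replaces ties.
            if w[j] <= value and (pred is None or w[j] >= w[pred]):
                pred = j
            # Smallest earlier value not below w[i]; later j replaces ties.
            if w[j] >= value and (succ is None or w[j] <= w[succ]):
                succ = j
        alpha = i - pred if pred is not None else None
        beta = i - succ if succ is not None else None
        result.append((alpha, beta))
    return result


print(code([1, 4, 2, 3]))
# Order-isomorphic words receive the same code.
print(code([1, 3, 2, 4]) == code([2, 6, 3, 8]))
```

The first call reproduces the slide's example for w = 1 4 2 3; the second illustrates the observation that order-isomorphic words have equal codes.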

  12. How to compute the encoding function?

      Lemma (off-line Code computation): for a string w of length n, Code(w) can be computed in O(n) time.

      Lemma (arbitrary factor Code computation): for a string w of length n, after O(n) preprocessing, any element of Code(v) for any factor v of w can be computed in O(log n) time.

      Restricted case: if we restrict the computation of Code to a sliding window over w, we can reduce the computation time to O(log log n) per code element.
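One common way to compute all codes off-line is to link the positions in sorted order of their values and then delete them from right to left; when position i is deleted, its neighbours in the list are exactly its predecessor and successor among the earlier positions. The sketch below illustrates this idea and is not necessarily the construction behind the lemma. It assumes pairwise distinct values, uses 0-based indices, and relies on Python's `sorted`, so the sort costs O(n log n) here; everything after the sort is linear.

```python
def code_offline(w):
    """Compute Code(w) with a doubly linked list over the sorted order.

    Assumption: all values in w are pairwise distinct.
    Returns a list of (alpha, beta) pairs, None meaning "undefined".
    """
    n = len(w)
    order = sorted(range(n), key=lambda i: w[i])  # positions by increasing value
    prev = [None] * n  # prev[i]: position holding the next smaller value
    nxt = [None] * n   # nxt[i]:  position holding the next larger value
    for a, b in zip(order, order[1:]):
        nxt[a] = b
        prev[b] = a

    result = [None] * n
    # Scan right to left: when i is handled, the list holds positions 0..i only.
    for i in range(n - 1, -1, -1):
        p, s = prev[i], nxt[i]
        result[i] = (
            i - p if p is not None else None,
            i - s if s is not None else None,
        )
        # Unlink position i so that earlier positions no longer see it.
        if p is not None:
            nxt[p] = s
        if s is not None:
            prev[s] = p
    return result


print(code_offline([1, 4, 2, 3]))
```

Handling repeated values would need an explicit tie-breaking rule, which this sketch leaves out.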

  13. Order-preserving suffix trees

      The order-preserving suffix tree of w (of length n) is a compacted trie of all the sequences in:

      { Code(w[1..n])#, Code(w[2..n])#, ..., Code(w[n..n])# }

      Example: w = (1, 2, 4, 4, 2, 5, 5, 1).

      [Figure: the order-preserving suffix tree of w; edges are labelled with code pairs, leaves with the suffix numbers 1 to 8.]

      Additionally, each explicit node stores a suffix link.

  14. Suffix links

      - in standard suffix trees, suffix links always point to explicit nodes,
      - in order-preserving suffix trees, a suffix link may point to an implicit node.

  15. Incomplete suffix trees

      Relaxed definition: the incomplete order-preserving suffix tree of w is an order-preserving suffix tree in which each explicit node v can have one outgoing edge that does not store its first character.

      [Figure: a node v below parent(v), with edge labels (2, 5), (3, 2) and (5, 10) shown; one edge is marked "?" because it misses its label.]

  16. Why are incomplete edges not harmful?

      Lemma: let x and y be two strings of length t, and let x′ = x[1..t−1], y′ = y[1..t−1]. Then:

      x ≈ y ⇔ x′ ≈ y′ ∧ (y_i ≤ y_t ≤ y_j), where i = t − α_x(t), j = t − β_x(t).

      [Figure: x and y drawn side by side, marking x_i, x_t, x_j and y_i, y_t, y_j together with the distances α_x(t) and β_x(t).]

      So we need Code only for x.
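The lemma says that extending a known match by one symbol needs only the last code pair of x and two comparisons inside y. A small sketch of that check follows. It assumes 0-based indices, `None` for an undefined α or β (in which case the corresponding bound is skipped), and pairwise distinct values in both words; with repeated values the plain ≤ test alone is not sufficient (for instance x = 1 2 and y = 1 1 pass it without being order-isomorphic). The name `extends_match` is illustrative.

```python
def extends_match(y, t, alpha, beta):
    """Check the condition y_i <= y_t <= y_j from the lemma.

    y:     candidate word as a list, indexed from 0
    t:     index of the newly added last symbol
    alpha: alpha_x(t) for the reference word x, or None if undefined
    beta:  beta_x(t) for the reference word x, or None if undefined
    Assumes y[0..t-1] is already known to be order-isomorphic to x[0..t-1]
    and that values are pairwise distinct.
    """
    lower_ok = alpha is None or y[t - alpha] <= y[t]
    upper_ok = beta is None or y[t] <= y[t - beta]
    return lower_ok and upper_ok


# Reference word x = 1 4 2 3 has last code pair (1, 2).
print(extends_match([10, 40, 20, 30], 3, 1, 2))
print(extends_match([10, 40, 20, 50], 3, 1, 2))
```

Both candidate words agree with x on the first three symbols; only the first one keeps the last symbol between its predecessor and successor.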

  17. Algorithm for constructing the incomplete suffix tree

      We basically re-implement Ukkonen's algorithm. The suffix tree is constructed using two basic operations; the first one is Branch.

      Branch(v, (p, q)): create a new branch starting in v with code (p, q). If v is an implicit node, then the existing edge becomes incomplete (that is why the suffix tree can be incomplete).

      [Figure: example of Branch at a node v; the new edge carries the label (p, q) and the pre-existing edge is marked as incomplete.]

  20. How to implement Transition

      Transition(v, (p, q)): checks whether v has a child v′ such that the edge from v to v′ represents the code (p, q); returns v′ in that case, or nil if there is no such node.

      Implementation, for a node v whose labelled child edges are (1, 2) and (3, 1):
      - Case 1: (p, q) is present among the child edges, e.g. (p, q) = (3, 1). The matching child v′ is found directly.
      - Case 2: (p, q) is not present among the child edges, e.g. (p, q) = (4, 2). We have to verify the (single) incomplete edge.

  23. Algorithmic toolbox, continued

      We also require the following data structures:
      - Weak Character Oracle: a data structure based on y-fast tries (Willard 1983) for computing codes for newly created branches of the tree,
      - Dynamic Weighted Ancestor data structure (Kopelowitz, Lewenstein 2007), used for fast navigation over the constructed suffix tree.

  24. Example usage

      Theorem: given a word w of length n, the incomplete order-preserving suffix tree can be constructed in O(n log log n) expected time.

      Theorem: given the op-suffix tree T(w) and a pattern x, we can locate all order-preserving occurrences of x in w in time O(|x| + Occ).

      - Compute Code(x) and traverse the tree T(w) using successive symbols of the code. At each step we use the function Transition.
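The query procedure can be illustrated on a much simpler structure than the tree from the slides: an uncompacted trie over the codes of all suffixes, built naively. This sketch is not the O(n log log n) construction and has no incomplete edges, suffix links or terminator symbol; it takes polynomial time and quadratic space and only shows that walking the index with Code(x) finds the order-preserving occurrences. All names (`code`, `build_naive_index`, `find_occurrences`) are illustrative, positions are 0-based, and ties are broken by the nearest earlier position (an assumption).

```python
def code(w):
    """Naive Code(w): list of (alpha, beta) pairs, None for undefined."""
    result = []
    for i, value in enumerate(w):
        pred = succ = None
        for j in range(i):
            if w[j] <= value and (pred is None or w[j] >= w[pred]):
                pred = j
            if w[j] >= value and (succ is None or w[j] <= w[succ]):
                succ = j
        result.append((
            i - pred if pred is not None else None,
            i - succ if succ is not None else None,
        ))
    return result


def build_naive_index(w):
    """Uncompacted trie of Code(w[s:]) for every suffix start s.

    Each node records the start positions of all suffixes passing through it.
    """
    root = {"children": {}, "starts": list(range(len(w)))}
    for start in range(len(w)):
        node = root
        for symbol in code(w[start:]):
            node = node["children"].setdefault(
                symbol, {"children": {}, "starts": []}
            )
            node["starts"].append(start)
    return root


def find_occurrences(index, x):
    """Walk the trie with Code(x); return 0-based starts of all occurrences."""
    node = index
    for symbol in code(x):
        node = node["children"].get(symbol)
        if node is None:
            return []
    return node["starts"]


index = build_naive_index([5, 1, 4, 2, 3, 9])
print(find_occurrences(index, [10, 40, 20, 30]))
print(find_occurrences(index, [1, 2]))
```

The walk works because the code of a prefix of a suffix is a prefix of that suffix's code: each code pair depends only on the symbols before it.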

  25. Complete suffix trees for the op-model

      Theorem: the order-preserving suffix tree of a string of length n can be constructed in O(n log n / log log n) expected time.

      - This can be achieved with a slightly different encoding function that allows a character oracle with O(log n / log log n) query time and o(n log n / log log n) preprocessing.

  26. Thank you for your attention!
