On-line Construction of Compact Suffix Vectors and Maximal Repeats
Élise Prieur and Thierry Lecroq, elise.prieur@univ-rouen.fr
Laboratoire d'Informatique de Traitement de l'Information et des Systèmes
Journées Montoises, August 30th, 2006, Rennes
Plan
1 Introduction
2 Suffix Vectors
3 Computing maximal repeats
4 Conclusion
1 Introduction
- Motivation
- Suffix trees
- Ukkonen's algorithm
Motivation
- Detecting repeats in long biological sequences.
- This requires an index structure adapted to the task.
Notations
- y is a sequence of length n on the alphabet A.
- $ is a terminator symbol.
Suffix tree
- index structure;
- all substrings represented;
- edges labeled (begin position, length);
- leaves represent suffixes.
[Figure: suffix tree of tata$. The root has the edges (0,2) ta, (1,1) a and (4,1) $; the nodes ta and a each have the edges (2,3) ta$ and (4,1) $; the leaves are numbered 0 to 4.]
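The (begin position, length) edge labeling can be tried out with a small sketch. It uses a naive quadratic insertion of every suffix, not Ukkonen's algorithm, and the names `Node`, `build_suffix_tree`, `contains` and `leaves` are illustrative assumptions, not part of the talk.

```python
class Node:
    def __init__(self):
        # first letter of the edge -> (begin position, length, child node)
        self.children = {}
        # start position of the suffix, set on leaves only
        self.suffix_start = None


def build_suffix_tree(y):
    """Naive O(n^2) construction; y must end with a unique terminator."""
    root, n = Node(), len(y)
    for s in range(n):
        node, pos = root, s
        while True:
            c = y[pos]
            if c not in node.children:
                leaf = Node()
                leaf.suffix_start = s
                node.children[c] = (pos, n - pos, leaf)
                break
            begin, length, child = node.children[c]
            k = 0
            while k < length and y[begin + k] == y[pos + k]:
                k += 1
            if k < length:
                # mismatch inside the edge: split it at offset k
                mid = Node()
                mid.children[y[begin + k]] = (begin + k, length - k, child)
                node.children[c] = (begin, k, mid)
                child = mid
            node, pos = child, pos + k


    return root


def contains(root, y, pattern):
    """True if pattern is a substring of y, read off the edge labels."""
    node, k = root, 0
    while k < len(pattern):
        edge = node.children.get(pattern[k])
        if edge is None:
            return False
        begin, length, child = edge
        for t in range(length):
            if k == len(pattern):
                return True
            if y[begin + t] != pattern[k]:
                return False
            k += 1
        node = child
    return True


def leaves(node):
    """Sorted start positions of the suffixes below node."""
    if not node.children:
        return [node.suffix_start]
    return sorted(s for _, _, ch in node.children.values() for s in leaves(ch))
```

On `tata$` the root ends up with the edges (0,2), (1,1) and (4,1), as in the figure, and `leaves(root)` lists the five suffix positions.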
Ukkonen's algorithm
On-line algorithm
- The construction is split into n phases, each of which is split into extensions.
- During phase i, the implicit tree of y[0..i] is built from the implicit tree of y[0..i−1].
- During extension j of phase i, the suffix y[j+1..i] is added to the tree.
- The last added substring is w = y[j+1..i−1].
The 3 rules
Ukkonen's algorithm is based on 3 rules expressed by Gusfield (1), with w = y[j+1..i−1]:
- Rule 1: w ends at a leaf; y[i] is appended to the leaf edge, which then spells y[j+1..i].
- Rule 2: w is followed in the tree only by letters x different from y[i]; a new leaf edge beginning with y[i] is created after w.
- Rule 3: w is already followed by y[i] in the tree; nothing is done.
(1) D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
Some properties
- leaves are added in increasing order;
- rule 1 does not need any treatment;
- phase i begins at extension j_ℓ + 1, where j_ℓ is the number of the last created leaf;
- phase i ends at the first extension j > j_ℓ such that rule 3 is applied.
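The rules and the properties above can be observed on a toy simulation. The sketch below is an assumption-laden illustration: it works on an uncompacted suffix trie (nested dictionaries) rather than a suffix tree, runs every extension of every phase instead of stopping early, and indexes an extension by the start s = j + 1 of the suffix it adds. The function names are hypothetical.

```python
def classify_extensions(y):
    """For each phase i, list the rule (1, 2 or 3) applied by each extension.

    The trie is a dict of dicts; an empty dict is a leaf.
    """
    root, phases = {}, []
    for i, c in enumerate(y):
        rules = []
        for s in range(i + 1):
            # locate w = y[s..i-1], which is already in the trie
            node = root
            for ch in y[s:i]:
                node = node[ch]
            if s < i and not node:
                rule = 1          # w ends at a leaf: extend it
            elif c in node:
                rule = 3          # w is already followed by y[i]
            else:
                rule = 2          # new leaf below w
            if rule != 3:
                node[c] = {}
            rules.append(rule)
        phases.append(rules)
    return phases


def check_properties(phases):
    """Check the properties phase by phase and return the number of leaves."""
    leaf_count = 0
    for rules in phases:
        # within a phase: rule 1 first, then rule 2, then rule 3
        assert rules == sorted(rules)
        # rule 1 applies exactly to the leaves created in earlier phases
        assert rules.count(1) == leaf_count
        leaf_count += rules.count(2)
    return leaf_count
```

For `tata$` the phases are classified as `[[2], [1, 2], [1, 1, 3], [1, 1, 3, 3], [1, 1, 2, 2, 2]]`: the leading block of rule 1 costs nothing in a real implementation, and everything after the first rule 3 can be skipped.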
2 Suffix Vectors
- Introduction
- Compact Suffix Vectors
- On-line construction of a compact suffix vector
Introduction to suffix vectors
[Figure: the suffix tree of tata$ and the corresponding suffix vector, laid out under positions 0 to 4 of t a t a $; the root box holds the edges (0,2) − (1,1) − (4,1).]
Introduction to suffix vectors
[Figure: the suffix tree of y = aatttatttatta$ (positions 0 to 13) and the corresponding suffix vector; the root box holds the edges (0,1) − (2,1) − (13,1).]
Suffix vectors
- alternative data structure to suffix trees;
- same information in reduced space;
- introduced by K. Monostori in 2001.
Definition
A succession of boxes whose lines contain:
- the depth of the node;
- the natural edge;
- the edge list.
The root is a special box.
Notations
- B_j: box at position j in y,
- The natural edge of a line in B_j is the end position of the edge beginning with y[j+1].
[Figure: suffix vector of aatttatttatta$ with root box (0,1) − (2,1) − (13,1). Box lines shown: 3 | 2 | (13,1); 2 | 2 | (13,1); 3 | 4 | (12,2); 2 | 4 | (5,1); 7 | 6 | (12,2); 6 | 6 | (12,2); 1 | 2 | (5,1); 5 | 6 | (12,2); 4 | 6 | (12,2); 1 | 13 | (2,2) − (13,1).]
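The definition suggests a simple record layout, sketched below. The class names and fields are assumptions made for illustration, and the placement of the line 1 | 13 | (2,2) − (13,1) in box B_0 is inferred (the node a of depth 1 first ends at position 0), not stated in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (begin position, length), as in the suffix tree


@dataclass
class Line:
    depth: int            # depth of the node
    natural_edge: int     # value of the natural-edge field, copied as shown
    edges: List[Edge] = field(default_factory=list)  # the edge list


@dataclass
class Box:
    position: int                                    # j in B_j
    lines: List[Line] = field(default_factory=list)  # deepest node first


@dataclass
class SuffixVector:
    text: str
    root_edges: List[Edge]                           # the special root box
    boxes: Dict[int, Box] = field(default_factory=dict)

    def edge_label(self, edge: Edge) -> str:
        """Spell the letters carried by an edge."""
        begin, length = edge
        return self.text[begin:begin + length]


# Values copied from the slides; the box position 0 is an assumption.
y = "aatttatttatta$"
vector = SuffixVector(y, root_edges=[(0, 1), (2, 1), (13, 1)])
vector.boxes[0] = Box(0, [Line(depth=1, natural_edge=13, edges=[(2, 2), (13, 1)])])
```

With this layout, `vector.edge_label((2, 2))` spells `tt`, the first letters read when leaving the node a towards the longer repeats.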
Example
Is tatt a substring of y?
- The root contains the edge (2,1), beginning with t, which leads to B_2.
- The edge (5,1), beginning with a, leads to B_5.
- The natural edge begins with tt.
[Figure: the suffix vector of aatttatttatta$ on which the search is followed.]
Compacting a vector
Definition
A group of nodes is a set of nodes that are in the same box and have exactly the same edges.
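The definition of a group can be illustrated with a short sketch that merges the lines of one box. It rests on assumptions not stated in the slides: lines are given as (depth, natural edge, edge list) tuples sorted by decreasing depth, a group is made of consecutive depths, and a group is stored as the line of its deepest node plus a node count. The function name `compact_box` is hypothetical.

```python
def compact_box(lines):
    """Merge lines of one box that share the same edges.

    lines: list of (depth, natural_edge, edges), deepest node first.
    Returns a list of (deepest depth, natural_edge, edges, node count).
    """
    groups = []
    for depth, natural, edges in lines:
        if groups:
            top, g_natural, g_edges, count = groups[-1]
            same_edges = g_natural == natural and g_edges == edges
            if same_edges and depth == top - count:
                # same edges and next depth down: the node joins the group
                groups[-1] = (top, g_natural, g_edges, count + 1)
                continue
        groups.append((depth, natural, edges, 1))
    return groups
```

Fed with the four lines of depths 7 to 4 that all carry 6 | (12,2), the sketch returns a single group of 4 nodes; two lines with different edge lists stay separate.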
Compact suffix vectors
3 rules of compaction of a box:
- Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1,
- Rule B: the node with depth d − 1 has the same edges as the node with depth d and some extra edges,
- Rule C: the node with depth d − 3 has different edges from the node with depth d − 2.
[Figure: nodes of depths d, d − 1, d − 2 and d − 3 in one box; Rule B relates d and d − 1, Rule A relates d − 1 and d − 2, Rule C relates d − 2 and d − 3.]
Compacting V(aatttatttatta$)
[Figure: the suffix vector before and after compaction. The lines 3 | 2 | (13,1) and 2 | 2 | (13,1) become the single line 3 | 2 | (13,1) annotated with 2; the lines 7 | 6 | (12,2), 6 | 6 | (12,2), 5 | 6 | (12,2) and 4 | 6 | (12,2) become the single line 7 | 6 | (12,2) annotated with 4; the lines 3 | 4 | (12,2), 2 | 4 | (5,1), 1 | 2 | (5,1) and 1 | 13 | (2,2) − (13,1) are unchanged.]
y --[Monostori, O(n)]--> extended vector --[Monostori, O(n)]--> compact vector
On-line construction of a compact vector
y --[Monostori, O(n)]--> extended vector --[Monostori, O(n)]--> compact vector
y --[Prieur, Lecroq, O(n)]--> compact vector (direct construction)
Faster and more space-economical construction.
On-line construction of a compact vector
Proposition
When an edge is added to the node w of depth d in a box B_p, this edge will also be added to all the nodes in B_p of depth smaller than d in the group of nodes of w.
[Figure: illustration of the proposition on occurrences of v and w in y.]
On-line construction of a compact vector
Skip k − 1 extensions, where k is the number of nodes in the group to which the edge is added.
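The skip can be pictured with a very small sketch. It assumes a group is stored as (deepest depth, natural edge, edge list, node count) and that the edge is added at the deepest node of the group; the function name, the added edge (13,1) and the extension number 3 in the usage below are hypothetical values chosen for illustration.

```python
def add_edge_to_group(group, edge, j):
    """Add edge to a whole group during extension j.

    group: (deepest depth, natural_edge, edges, node count).
    Returns the updated group and the next extension to process.
    """
    top, natural, edges, count = group
    if count < 1:
        raise ValueError("a group holds at least one node")
    updated = (top, natural, edges + [edge], count)
    # one explicit extension handles the k nodes of the group,
    # so the k - 1 following extensions are skipped
    return updated, j + count


# Hypothetical usage: a group of 4 nodes receives an edge in extension 3.
group, next_j = add_edge_to_group((7, 6, [(12, 2)], 4), (13, 1), 3)
```

Here `next_j` is 7: extensions 4, 5 and 6 are never run, because the nodes they would reach share the line that was just updated.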