suffix tries

Suffix Tries: slides adapted from the course by Ben Langmead



  1. Suffix Tries. Slides adapted from the course by Ben Langmead (ben.langmead@gmail.com).

  2. Indexing with suffixes. Until now, our indexes have been based on extracting substrings from T. A very different approach is to extract suffixes from T. This will lead us to some interesting and practical index data structures: the suffix tree, suffix trie, suffix array, and FM index. [Figure: for T = BANANA$, the rotations of T in sorted order and the suffix array with its sorted suffixes: 6 $, 5 A$, 3 ANA$, 1 ANANA$, 0 BANANA$, 4 NA$, 2 NANA$.]

  3. Tries. A trie (pronounced “try”) is a tree representing a collection of strings, with one node per common prefix. It is the smallest tree such that:
     - Each edge is labeled with a character c ∈ Σ.
     - A node has at most one outgoing edge labeled c, for each c ∈ Σ.
     - Each key is “spelled out” along some path starting at the root.
     A trie is a natural way to represent either a set or a map whose keys are strings.
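The definition above can be sketched with nested dictionaries, one dictionary per node and one entry per outgoing edge. This is a minimal illustration, not code from the slides: the names `trie_insert`, `trie_lookup`, and the `END` marker are assumptions made for the example.

```python
END = ""  # hypothetical end-of-key marker; cannot collide with a one-character edge label

def trie_insert(root, key, value):
    """Walk (and create as needed) one edge per character, then store the value."""
    node = root
    for c in key:
        node = node.setdefault(c, {})
    node[END] = value

def trie_lookup(root, key):
    """Return the value stored for key, or None if key is not in the trie."""
    node = root
    for c in key:
        if c not in node:
            return None  # fell off the trie
        node = node[c]
    return node.get(END)

root = {}
for value, key in enumerate(["try", "tree", "trie"]):
    trie_insert(root, key, value)

print(trie_lookup(root, "trie"))  # 2
print(trie_lookup(root, "tr"))    # None: "tr" is a shared prefix, not a key
```

The three keys share the nodes for "t" and "tr", which is the "one node per common prefix" property.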

  4. Suffix trie. Build a trie containing all suffixes of a text T. T: GTTATAGCTGATCGCGGCGTAGCGG$. [Figure: all suffixes of T listed one per line, from the full text down to $. For a text of length m, the suffixes contain m(m+1)/2 characters in total.]

  5. Suffix trie. First add a special terminal character $ to the end of T.
     - $ is a character that does not appear elsewhere in T, and we define it to be less than all other characters (for DNA: $ < A < C < G < T).
     - $ enforces a rule we are all used to: e.g. “as” comes before “ash” in the dictionary.
     - $ also guarantees that no suffix is a prefix of any other suffix.
     T: GTTATAGCTGATCGCGGCGTAGCGG$. [Figure: the suffixes of T$ listed one per line.]
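Both effects of the terminator can be checked on a small string. The sketch below uses the hypothetical helpers `suffixes` and `has_prefix_suffix`, and relies on the fact that `$` sorts before letters in ASCII, which matches the ordering defined above.

```python
def suffixes(t):
    """All non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]

def has_prefix_suffix(t):
    """True if some suffix of t is a proper prefix of another suffix of t."""
    sufs = suffixes(t)
    return any(a != b and b.startswith(a) for a in sufs for b in sufs)

print(has_prefix_suffix("abaaba"))   # True: "a" is a prefix of "aba"
print(has_prefix_suffix("abaaba$"))  # False: every suffix ends in the unique $
print(sorted(suffixes("abaaba$")))
# ['$', 'a$', 'aaba$', 'aba$', 'abaaba$', 'ba$', 'baaba$']
```

In the sorted list, `a$` precedes `aaba$` and `ba$` precedes `baaba$`, the same "as" before "ash" rule.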

  6. Tries. Smallest tree such that:
     - Each edge is labeled with a character from Σ.
     - A node has at most one outgoing edge labeled with c, for any c ∈ Σ.
     - Each key is “spelled out” along some path starting at the root.

  7. Suffix trie. T: abaaba, T$: abaaba$. Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf. Would this still be the case if we hadn’t added $? [Figure: suffix trie of abaaba$, with the shortest (non-empty) suffix and the longest suffix marked.]

  8. Suffix trie. T: abaaba. Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf. Would this still be the case if we hadn’t added $? No: without $, a suffix such as a or ba ends at an internal node rather than at a leaf. [Figure: suffix trie of abaaba built without $.]

  9. Suffix trie. We can think of nodes as having labels, where the label spells out the characters on the path from the root to the node. [Figure: suffix trie of abaaba$ with the node labeled baa highlighted.]

  10. Suffix trie. How do we check whether a string S is a substring of T? Note: each of T’s substrings is spelled out along a path from the root, i.e. every substring is a prefix of some suffix of T.
      - Start at the root and follow the edges labeled with the characters of S.
      - If we “fall off” the trie, i.e. there is no outgoing edge for the next character of S, then S is not a substring of T.
      - If we exhaust S without falling off, S is a substring of T.
      Example: S = baa. Yes, it’s a substring.

  11. Suffix trie. How do we check whether a string S is a substring of T? Note: each of T’s substrings is spelled out along a path from the root, i.e. every substring is a prefix of some suffix of T.
      - Start at the root and follow the edges labeled with the characters of S.
      - If we “fall off” the trie, i.e. there is no outgoing edge for the next character of S, then S is not a substring of T.
      - If we exhaust S without falling off, S is a substring of T.
      Example: S = abaaba. Yes, it’s a substring.

  12. Suffix trie. How do we check whether a string S is a substring of T? Note: each of T’s substrings is spelled out along a path from the root, i.e. every substring is a prefix of some suffix of T.
      - Start at the root and follow the edges labeled with the characters of S.
      - If we “fall off” the trie, i.e. there is no outgoing edge for the next character of S, then S is not a substring of T.
      - If we exhaust S without falling off, S is a substring of T.
      Example: S = baabb. No, not a substring: the walk falls off at the final b.

  13. Suffix trie. How do we check whether a string S is a suffix of T? Same procedure as for substring, but additionally check whether the final node in the walk has an outgoing edge labeled $. Example: S = baa. Not a suffix.

  14. Suffix trie. How do we check whether a string S is a suffix of T? Same procedure as for substring, but additionally check whether the final node in the walk has an outgoing edge labeled $. Example: S = aba. Is a suffix.
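The substring and suffix queries from the preceding slides can be sketched together as follows. This is an illustrative version under stated assumptions: the function names are hypothetical, and the trie is a plain nested dictionary in which each `$` edge leads to an empty dictionary (no offsets are stored).

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'; leaves are empty dicts."""
    t += "$"
    root = {}
    for i in range(len(t)):          # for each suffix
        node = root
        for c in t[i:]:              # for each character of the suffix
            node = node.setdefault(c, {})
    return root

def follow_path(root, s):
    """Return the node reached by spelling out s, or None if we fall off."""
    node = root
    for c in s:
        if c not in node:
            return None
        node = node[c]
    return node

def is_substring(root, s):
    return follow_path(root, s) is not None

def is_suffix(root, s):
    node = follow_path(root, s)
    return node is not None and "$" in node  # needs an outgoing $ edge

root = build_suffix_trie("abaaba")
print(is_substring(root, "baa"), is_substring(root, "baabb"))  # True False
print(is_suffix(root, "baa"), is_suffix(root, "aba"))          # False True
```

Each query takes time proportional to the length of S, independent of the length of T.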

  15. Suffix trie. How do we count the number of times a string S occurs as a substring of T? Follow the path corresponding to S. Either we fall off, in which case the answer is 0, or we end up at node n and the answer is the number of leaf nodes in the subtree rooted at n. Leaves can be counted with a depth-first traversal. Example: S = aba, 2 occurrences.
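A minimal sketch of the counting query, assuming a nested-dictionary trie whose leaves are empty dictionaries. The helper names are hypothetical, and the recursive traversal is only suitable for short texts, since its depth grows with the length of T.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'; leaves are empty dicts."""
    t += "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root

def count_leaves(node):
    """Depth-first count of the leaves below (and including) node."""
    if not node:                      # an empty dict is a leaf
        return 1
    return sum(count_leaves(child) for child in node.values())

def count_occurrences(root, s):
    node = root
    for c in s:
        if c not in node:
            return 0                  # fell off: s does not occur
        node = node[c]
    return count_leaves(node)

root = build_suffix_trie("abaaba")
print(count_occurrences(root, "aba"))  # 2
print(count_occurrences(root, "a"))    # 4
print(count_occurrences(root, "bb"))   # 0
```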

  16. Suffix trie. How do we find the longest repeated substring of T? Find the deepest node with more than one child. Example: for T = abaaba, the answer is aba.
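The "deepest node with more than one child" rule can be sketched with an explicit stack, again assuming a nested-dictionary trie. The function names are hypothetical; when several repeated substrings share the maximum length, this version returns whichever one it visits first.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'; leaves are empty dicts."""
    t += "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root

def longest_repeated_substring(root):
    """Label of the deepest node that has more than one child."""
    best = ""
    stack = [(root, "")]              # (node, label of the node)
    while stack:
        node, label = stack.pop()
        if len(node) > 1 and len(label) > len(best):
            best = label
        for c, child in node.items():
            stack.append((child, label + c))
    return best

print(longest_repeated_substring(build_suffix_trie("abaaba")))  # aba
print(longest_repeated_substring(build_suffix_trie("aaaa")))    # aaa
```

A node with two or more children is a string followed by at least two different characters (possibly $), so it occurs at least twice in T.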

  17. Suffix trie implementation (derived from Ben Langmead):

          class SuffixTrie(object):
              ''' building a suffix Trie '''
              def __init__(self, t):
                  """ Make suffix trie from t """
                  if not t.endswith('$'):
                      t += '$'  # special terminator symbol
                  self.root = {}
                  for i in range(len(t)):  # for each suffix
                      cur = self.root
                      for c in t[i:]:  # for each character in i'th suffix
                          if c == '$':
                              cur[c] = i  # add outgoing edge and suffix position
                          elif c not in cur:
                              cur[c] = {}  # add outgoing edge if necessary
                          cur = cur[c]

  18. Suffix trie implementation: followPath

          class SuffixTrie(object):
              ...
              def followPath(self, s):
                  """ Follow path given by characters of s. Return node at
                      end of path, or None if we fall off. """
                  cur = self.root
                  for c in s:
                      if c not in cur:
                          return None
                      cur = cur[c]
                  return cur

  19. Suffix trie implementation: find all positions

          class SuffixTrie(object):
              ...
              def findLeaves(self, v):
                  """ Return the leaves from a given vertex v """
                  leaves = []
                  if v is None:
                      return leaves
                  for c in v:
                      if c == '$':
                          leaves.append(v[c])  # stored suffix position
                      else:
                          leaves += self.findLeaves(v[c])
                  return leaves

              def findPositions(self, s):
                  """ Return a list of matching positions of s """
                  return self.findLeaves(self.followPath(s))

  20. Examples

          if __name__ == '__main__':
              seq = 'abaaba'
              print("seq=", seq)
              strie = SuffixTrie(seq)
              for p in ['a', 'ba', 'aa', 'bb']:
                  print("find position of", p, "in seq", strie.findPositions(p))
              print("find the leaves=", strie.findLeaves(strie.root))

      Output:

          $ python ../codes/ST/STrie.py
          seq= abaaba
          find position of a in seq [2, 0, 3, 5]
          find position of ba in seq [1, 4]
          find position of aa in seq [2]
          find position of bb in seq []
          find the leaves= [2, 0, 3, 5, 1, 4, 6]

      The order of positions within each list follows dictionary iteration order, so it can differ between Python versions; the set of positions is the same.

  21. Suffix trie. How many nodes does the suffix trie have? Is there a class of strings where the number of suffix trie nodes grows linearly with m? Yes: e.g. a string of m a’s in a row (a^m).
      - 1 root
      - m nodes with an incoming a edge
      - m + 1 nodes with an incoming $ edge
      Total: 2m + 2 nodes. [Figure: suffix trie for T = aaaa.]

  22. Suffix trie. Is there a class of strings where the number of suffix trie nodes grows with m^2? Yes: a^n b^n.
      - 1 root
      - n nodes along the “b chain” (right)
      - n nodes along the “a chain” (middle)
      - n chains of n “b” nodes, one hanging off each “a chain” node
      - 2n + 1 $ leaves (not shown)
      Total: n^2 + 4n + 2 nodes, where m = 2n. (Figure and example by Carl Kingsford.)
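Both node counts, 2m + 2 for a^m and n^2 + 4n + 2 for a^n b^n, can be checked empirically on small inputs. The sketch below assumes a nested-dictionary trie and uses the hypothetical helper `count_nodes`; it is a sanity check, not a proof.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'; leaves are empty dicts."""
    t += "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root

def count_nodes(node):
    """Number of nodes in the subtree rooted at node, including node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())

# Linear case: T = a^m has 2m + 2 nodes.
for m in range(1, 8):
    assert count_nodes(build_suffix_trie("a" * m)) == 2 * m + 2

# Quadratic case: T = a^n b^n has n^2 + 4n + 2 nodes.
for n in range(1, 8):
    assert count_nodes(build_suffix_trie("a" * n + "b" * n)) == n * n + 4 * n + 2

print(count_nodes(build_suffix_trie("aaaa")))    # 10
print(count_nodes(build_suffix_trie("aaabbb")))  # 23
```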

  23. Suffix trie: upper bound on size. Could the worst-case number of nodes be worse than O(m^2)?
      - Max # nodes from top to bottom (root to deepest leaf) = length of longest suffix + 1 = m + 1.
      - Max # nodes from left to right = max # distinct substrings of any length ≤ m.
      So O(m^2) is the worst case.
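The bound can be made explicit with a short count. This derivation is a sketch added for illustration; it counts the terminator, so the text T$ has length m + 1. Every node corresponds to a distinct substring of T$ (the root being the empty string), and a string of length m + 1 has at most m + 2 - k distinct substrings of length k:

```latex
\[
\#\text{nodes} \;\le\; 1 + \sum_{k=1}^{m+1} (m + 2 - k)
  \;=\; 1 + \frac{(m+1)(m+2)}{2}
  \;=\; O(m^2)
\]
```

For a^n b^n with n = 1 (m = 2), the bound gives 1 + 3·4/2 = 7, which equals the actual count n^2 + 4n + 2 = 7.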
