algorithms in bioinformatics a practical introduction
play

Algorithms in Bioinformatics: A Practical Introduction Suffix tree - PowerPoint PPT Presentation

Algorithms in Bioinformatics: A Practical Introduction Suffix tree Overview What is suffix tree? Simple application of suffix tree Linear time algorithm for constructing suffix tree Suffix array FM-index 1-mismatch search


  1. Common substrings of more than 2 strings (VI) By traversing T, for each internal node v, compute C(v). [C(v) 3. is defined as the number of distinct termination symbols in the subtree rooted at v] This step takes O(Kn) time.  3 c a % # $ g 3 3 4 5 5 g c g a 3 3 2 c g # % a c a c c $ $ # $ $ # # % % 4 3 3 1 2 1 4 2 3 2 1

  2. Common substrings of more than 2 strings (VII) Traverse T and visit every internal node v. For each v, if V(C(v)) < 4. string-depth of v, set V(C(v)) = string-depth of v. [After step 4, V(k) = the length of the longest substring common to exactly k of these strings.] l(k)=V(k). For i=k-1 downto 2, l(i)=max{l(i+1), V(i)} . 5. This two steps take O(n) time.  For our example, V(2) = 3, V(3) = 2.  Thus, l(3) = 2, l(2) = 3.  In total, this algorithm takes O(Kn) time.  Actually, we can improve this algorithm to O(n) time by mean  of lcp!

  3. Linear time algorithm for constructing suffix tree

  4. Straightforward construction of suffix tree  Consider S = s 1 s 2 … s n where s n =$  Algorithm:  Initialize the tree with only a root  For i = n to 1  Includes S[i..n] into the tree  Time: O(n 2 )

  5. Example of construction  S=acca$ Init For-loop c c a c a a a $ a $ $ $ c $ $ $    c  c $ $ a c a a a $ $ a $ $ $ $ 4 4 5 5 5 3 4 3 2 4 1 3 2 5 5 I 4 I 3 I 5 I 2 I 1

  6. Construction of generalized suffix tree  S ’ = c# Init For-loop c c c a a a c c c $ # $ # $ c  c  c c a c a c a a a a # $ $ a $ $ a $ $ a $ $ $ $ $ $ 4 1 3 2 2 4 1 3 2 2 4 1 3 2 1 5 5 5 I 1 J 2 J 1

  7. Can we construct a suffix tree in o(n 2 ) time? Yes. We can construct it in O(n) time.  Weiner ’ s algorithm [1973]  Linear time for constant size alphabet, but much space  McGreight ’ s algorithm [JACM 1976]  Linear time for constant size alphabet, quadratic space  Ukkonen ’ s algorithm [Algorithmica, 1995]  Online algorithm, linear time for constant size alphabet, less space  Farach ’ s algorithm [FOCS 1997]  Linear time for general alphabet  Hon,Sadakane, and Sung ’ s algorithm [FOCS 2003]  O(n) bit space O(n log e n) time for 0<e<1  O(n) bit space O(n) time for suffix array construction  We will discuss Farach ’ s algorithm later. 

  8. Idea  Build Odd Suffix Tree and Even Suffix Tree  Then, merge odd and even suffix tree. $ a g $ a g ca ca $ $ 7 6 7 6 g g c c a a $ $ c c a a g g g g 5 5 $ $ c c $ $ a a g g g g $ $ 4 4 $ $ 2 2 1 3 1 3 Even Suffix Tree Odd Suffix Tree

  9. Idea Input: a string S of length n  Recursively compute the suffix tree T o of all 1. suffixes beginning at the odd positions. T o is of size n/2.  From T o , compute T e which is the suffix tree 2. for all suffixes beginning at the even positions. Merge T o and T e to form the suffix tree for 3. S.

  10. Stage 1: Constructing odd suffix tree  Given a string S[1..n], we generate a new string S ’ [1..n/2] as follows.  we map pairs of characters into single characters as follows:  S[1..2], S[3..4], S[5..6], … , S[n-1..n].  Remove the duplicates from the pairs of characters and sort them by radix sort.  S ’ [i] = rank of S[2i-1..2i] in the sorted list, for i=1, 2, … , n/2.  By recursion, we get the suffix tree T ’ for S ’  Convert T ’ to the odd suffix tree T o .

  11. Example (I)  S = aaabbbabbaba$  S[1..2]=aa, S[3..4]=ab, S[5..6]=bb, S[7..8]=ab, S[9..10]=ba, S[11..12]=ba.  By stable sort, aa < ab < ba < bb.  Rank(aa)=1, Rank(ab)=2, Rank(ba)=3, Rank(bb)=4.  So, S ’ =124233$.

  12. Example (II)  By recursion, construct the suffix tree T ’ for S ’ : $ 2 3 7 3 $ 6 5 1 4 2

  13. Example (III)  Convert T ’ to the odd tree: $ a 13 5 b i  2i-1 $ 11 9 7 1 This is not a suffix tree 3

  14. Example (IV)  Refine the odd tree T o : $ b a 13 b 5 a b $ 11 9 7 1 3

  15. Time complexity for building the odd tree  Let Time(n) be the time to build a suffix tree for a string of length n.  Stable sorting and refinement of the odd trees take O(n) time.  Build suffix tree for S ’ takes Time(n/2).  So, Stage 1 takes Time(n/2)+O(n) time.

  16. Stage 2: Build the even tree Generate the lex-ordering of the 1. leaves in T e . For any two adjacent leaves 2i and 2j, 2. we find lcp(2i, 2j). Construct the even tree T e from left to 3. right (according to the lex-ordering).

  17. Build the even tree (Step 1)  We get the lex-ordering of the leaves in T o .  Generate the lex-ordering of the leaves in T e .  For each leaf i in T o , get the preceding character c=S[i-1] and form a pair (c,i). Each pair represents a even suffix i-1.  Perform stable sorting on those pairs. We get the lex-ordering of the leaves in T e .

  18. Example S = aaabbbabbaba$  Lex-ordering of the leaves in T o :  13 < 1 < 7 < 3 < 11 < 9 < 5  The pairs are:  (a,13), ($,1), (b,7), (a, 3), (a, 11), (b, 9), (b, 5).  After stable sorting, we have  ($, 1), (a, 13), (a, 3), (a, 11), (b, 7), (b, 9), (b, 5).  Hence, the lex-ordering of the leaves of T e :  12 < 2 < 10 < 6 < 8 < 4

  19. Build the even tree (Step 2)  For any two adjacent leaves 2i and 2j, we first find lcp(2i, 2j).  Observation: lcp(2i, 2j) =  lcp(2i+1, 2j+1)+1 if S[2i]=S[2j]  0 otherwise  Proof:  If S[2i] ≠ S[2j], lcp(2i,2j)=0.  Otherwise, lcp(2i,2j)=1+lcp(2i+1,2j+1).

  20. Example  Recall that the lex- $ b a ordering of leaves: 13 b  12 < 2 < 10 < 6 < 8 < 4. 5 a b  By the previous observation, we have $  lcp(8,4)=lcp(9,5)+1=2  Similarly, we have 11 9  lcp(12,2)=1, lcp(2,10)=1, 7 lcp(10,6)=0, lcp(6,8)=1, 1 lcp(8,4)=2 3

  21. Build the even tree (Step 3)  Construct the even tree T e from left to right. a a $ $ a a b b 12 12 b b b b a 12 a 10 b b b b a a b b a a $ $ 2 2

  22. Build the even tree (Step 3) b b a a a b $ $ $ a a a b b b 12 12 12 b b b b b b a a a 10 10 8 10 b b b 8 b b b 6 6 a a a b b b 6 a a a 4 $ $ $ 2 2 2

  23. Time complexity for building the even tree  Step 1: O(n) time  Step 2: O(n) time  Step 3: O(n) time

  24. Stage 3: Merge odd and even trees We can merge T o and T e by DFS. However, it takes O(n 2 ) time.  $ $ b b a a babbaba$ babbaba$ 13 13 b b a,1 a,1 b,1 b,1 a a 5 5 $,1 $,1 a a b b a a b b 13 13 b,1 b,1 b b a,1 a,1 b,1 b,1 b b b b b b b b $,1 $,1 a a b b $ $ a a a a a a b b a a b b a,1 a,1 a,2 a,2 12 12 $ $ b b a a b b b b b,1 b,1 a a b b $ $ a,2 a,2 $ $ b,1 b,1 b b 11 11 9 9 a a a a $,1 $,1 a,2 a,2 a a b b 10 10 b,8 b,8 b b $ $ 8 8 a a 12 12 7 7 a,2 a,2 b b $ $ a,11 1 1 b b 11 11 9 9 b,5 b,5 a a 1 1 3 3 10 10 b,10 b,5 b,5 b b 4 4 8 8 a,4 b,9 a,4 b,9 Odd tree 5 5 b b a a 7 7 3 3 6 6 2 2 b b 6 6 a a 4 4 $ $ 2 2 Even tree

  25. Stage 3: Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the  same character. The merge is ended when one edge is longer than the other. $ b a b a 13 5 a b $ a b 12 b $ b a 10 b 8 b 11 9 a b 6 7 a 4 1 $ 2 3

  26. Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the  same character. The merge is ended when one edge is longer than the other. a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3

  27. Merge odd and even trees The merging may over-merged some nodes.  To correct the tree, we need to unmerge some nodes.  a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3

  28. Definition of L() and d()  For every node u which may be over-merged, there exist two leaves 2i and 2j-1 such that u=lca(2i, 2j-1).  Denote L(u) be the correct depth of u, that is, lcp(2i,2j-1).  Note that lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise.  Let v be lca(2i+1,2j).  Denote d(u) = v.  Note that d() is equivalent to suffix link!

  29. Example of d() a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 d() $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3

  30. Relationship between L() and d() Suppose u = lca(2i, 2j-1).  Note1: if u is not the root, then S[2i]=S[2j-1].  Note2: lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0  otherwise Note3: d(u) = lcp(2i+1,2j)  Hence, L(u) = 1 + L(d(u)) if u is not the root. Otherwise,  L(u)=0. Lemma: L(u) = the length of the purple path from u to the root. 

  31. Example of L() L( )=1 a,1 b,1 $,1 L( )=1 b,1 13 a,1 b,2 $,1 L( )=2 L( )=2 12 L( )=2 a,4 a,11 $,1 $,1 b,3 b,8 10 8 L( )=4 a,4 11 9 b,3 2 b,9 L( )=2 L( )=3 4 7 b,3 5 $,1 6 1 3

  32. Unmerge the border nodes based on L() (I) L( )=1 a,1 b,1 $,1 L( )=1 b,1 13 a,1 b,2 $,1 L( )=2 L( )=2 12 L( )=2 a,4 a,11 $,1 $,1 b,3 b,8 10 8 L( )=4 a,4 11 9 b,3 2 b,9 L( )=2 L( )=3 4 7 b,3 5 $,1 6 1 3

  33. Unmerge the border nodes based on L() (II) a,1 b,1 $,1 b,1 13 a,1 b,1 $,1 a,1 a,2 12 b,1 a,2 b,1 $,1 a,2 b,8 10 8 a,2 a,11 11 9 b,5 1 b,10 b,5 b,9 a,4 4 5 7 3 6 2

  34. Time complexity for merging  Merge the tree using DFS takes O(n) time.  Compute the links d() takes O(n) time.  Compute L() takes O(n) time.  Unmerge takes O(n) time.

  35. Total time complexity of Farach ’ s algorithm  Stage 1: Time(n/2)+O(n)  Stage 2: O(n)  Stage 3: O(n)  Thus, Time(n) = Time(n/2)+O(n).  By solving the equation,  Time(n)= O(n).

  36. Disadvantage of suffix tree  Suffix tree is space inefficient. It requires O(n| Σ |log n) bits.  Manber and Myers (SIAM J. Comp 1993) proposes a new data structure, called suffix array, which has a similar functionality as suffix tree. Moreover, it only requires O(n log n) bits.

  37. Suffix Array (I) It is just sorted suffixes.  E.g. consider S = acacag$  Suffix Position SA[i] Suffix acacag$ 1 7 $ cacag$ 2 1 acacag$ acag$ 3 => 3 acag$ 5 cag$ 4 Sort ag$ ag$ 5 2 cacag$ g$ 6 4 cag$ $ 7 6 g$ Suffix array is an array of n indices. Thus, it takes O(n log n)  bits.

  38. Observation  The leaves of a suffix tree is in suffix array order. SA[i] Suffix $ a g ca 7 $ $ 7 6 1 acacag$ g c 3 acag$ a $ c 5 ag$ a g g 2 cacag$ 5 $ c $ a g 4 cag$ g $ 2 4 $ 6 g$ 1 3

  39. Linear time construction of suffix array from suffix tree  Recall that the suffix tree T of S[1..n] can be constructed in O(n) time.  Then, by “ lexical ” depth-first traversal of T , the suffix array of S is obtained.  This takes O(n) time.  However, the space used during construction is the same as that for suffix tree! This defeats the purpose of suffix array.  Today, we can build suffix array using O(n) bit space and O(n) time.

  40. range(T,Q)  For a pattern Q, its SA[i] Suffix occurrences in T form a 1 7 $ consecutive SA range. 2 1 acacag$ 3 3 acag$  Example: For T= acacag$, ca 4 5 ag$ occurs in SA[5] and SA[6]. 5 2 cacag$  Definition: 6 4 cag$  We called range(T,Q)=[st..ed] 7 6 g$ if  Q is a prefix of every T j for j=SA[st], SA[st+1], … , SA[ed]  where T j = j suffix of T = T[j..n].  Example: range(T,ca)=[5..6]

  41. Find occurrence of query Q in a string S using suffix array Input: (1) the suffix array of a string T of  length n and (2) a query Q of length m Aim: check if Q occurs in T  Idea: binary search! 

  42. Algorithm

  43. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$  L=1 4 5 ag$ 5 2 cacag$  R=7 6 4 cag$ 7 6 g$  M=(L+R)/2=4

  44. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$  L=1 3 3 acag$ 4 5  R=7 ag$ 5 2 cacag$  M=(L+R)/2=4 6 4 cag$ 7 6 g$  suffix-SA[M] > Q.  Set R=M=4.

  45. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$  L=1 3 3 acag$ 4 5  R=4 ag$ 5 2 cacag$  M=(L+R)/2=2 6 4 cag$ 7 6 g$  suffix-SA[M] < Q.  Set L=M=2.

  46. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$  L=2 4 5 ag$ 5 2 cacag$  R=4 6 4 cag$  M=(L+R)/2=3 7 6 g$  The pattern Q is found at SA[M]=3.

  47. Can we do better?  During each step of binary search,  we need to compare Q with a suffix using O(m) time, which is time consuming.  Can we do better?  We have the following observation.  Suppose LCP(Q, suffix-SA[L]) is l and LCP(Q, suffix-SA[R]) is r.  Then, LCP(Q, suffix-SA[M]) > min{l,r}.  Below, we describe how to utilize this observation to speedup the computation.

  48. Algorithm

  49. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$  L=1, l=0 4 5 ag$ 5 2 cacag$  R=7, r=0 6 4 cag$ 7 6 g$  mlr = min(l,r)=0  M=(L+R)/2=4

  50. Example Consider T = acacag$  i SA[i] Suffix Pattern Q = acag  1 7 $ L=1, l=0 2 1 acacag$  R=7, r=0 3 3 acag$  mlr = min(l,r)=0  4 5 ag$ M=(L+R)/2=4, m=1  5 2 cacag$ 6 4 cag$ The (m+1) char of suffix-SA[M] is g.  The (m+1) char of Q is c. 7 6 g$  So, suffix-SA[M] > Q.  Set R=M=4 and r=m=1. 

  51. Example Consider T = acacag$  i SA[i] Suffix Pattern Q = acag  1 7 $ L=1, l=0 2 1 acacag$  R=4, r=1 3 3 acag$  mlr = min(l,r)=0  4 5 ag$ M=(L+R)/2=2, m=3  5 2 cacag$ 6 4 cag$ The (m+1) char of suffix-SA[M] is c.  The (m+1) char of Q is g. 7 6 g$  So, suffix-SA[M] < Q.  Set L=M=2 and l=m=3. 

  52. Example  Consider T = acacag$ i SA[i] Suffix  Pattern Q = acag 1 7 $ 2 1 acacag$  L=2, l=3 3 3 acag$ 4 5  R=4, r=1 ag$ 5 2 cacag$  mlr = min(l,r)=1 6 4 cag$  M=(L+R)/2=3, m=4 7 6 g$  The pattern Q is found at SA[M]=3.

  53. Time analysis  Binary search will perform log n comparisons  Each comparison takes at most O(m) time  In the worst case, O(m log n) time.  Myers and Manber report that, in practice, the time is O(m + log n).

  54. Suffix array and suffix tree  We show one example of replacing suffix tree by suffix array  Note that most applications related to suffix tree can be solved using suffix array with some time blow up!  When space is limited, replacing suffix tree by suffix array is a good choice.

  55. The size is still too big!  Why?  DNA sequences can be very long! E.g. Fly: ~100M bases, Human: ~3G bases, Tree: ~9G bases  Storage to store indexing data structure for human genome Suffix Tree: ~40G bytes Suffix Array: ~13G bytes  Can we further reduce the space?

  56. Solution  Grossi, Vitter (STOC2000)  Compressed suffix array (CSA)  Ferragine, Manzini (FOCS2000)  FM-index  Both of them can be stored in O(n) bit space  For Human Genome  Both CSA and FM-index can be stored within 2G bytes.

  57. FM-index Consider a text T=acacag$  FM-index stores:  SA[i] Suffix T[SA[i]-1] The string BW=g$ccaaa A. 7 $ g C[x] = total no. of 1 acacag$ $ B. occurrences of each symbol 3 acag$ c less than x. E.g. C[a]=1, 5 ag$ c C[c]=4, C[g]=6, C[t]=7 2 cacag$ a A data-structure occ(x, i) C. 4 cag$ a which tells us the number of 6 g$ a occurrences of x in BW[1..i] using O(1) time.

  58. Data-structure for answering the occ(x, i) query? n/log 2 n 1 2 3 4 BW ……… log 2 n log n ……… 1 2 3 log n 00x0xxx0x0 Given the text BW[1..n], we divide BW[1..n] into buckets of size log 2 n.  For each bucket i = 1, … , n/log 2 n, we store  P[i] = number of x ’ s in BW[1.. ilog 2 n].  Each bucket is further subdivided into log n sub-buckets of size log n.  For each sub-bucket j of the bucket i, we store  Q[i][j] = number of x ’ s in the first j sub-buckets. 

  59. Data-structure for answering the occ(x, i) query?  We also need a lookup table rank(b,k)  b is any string of length (log n)/2  1 ≤ k ≤ (log n)/2  rank(b,k) = number of x in the first k characters of b. 1 2 3 ……………… (log n)/2 Number of entries 00 … 0 log n log n log n = = 2 n 2 2 2 … 1 1 2 ……………………… x0x..0 Each entry take s O (log log n ) bits … Total space xx … x = = O ( n log n log log n ) o ( n ) bits

  60. Space complexity of the occ() data-structure  P[1..n/log 2 n] uses O(n/log n) bits  Q[1..n/log 2 n][1..log n] uses O(n log log n / log n) bits  rank(b,k) uses 2 logn/2 (log n/2) = o(n) bits  In total, we use O(n log log n / log n) bits.

  61. How to compute occ(x,i)? n/log 2 n 1 2 3 4 BW ……… log 2 n log n ……… 1 2 3 log n 00x0xxx0x0 Suppose log n = 10.  To compute occ(x, 327),  The result is P[3]+Q[4][2]+rank(00x0x,5)+rank(xx0x0,2)  Hence, O(1) time to compute occ(x,i). 

  62. Size of FM-index  Structure A can be store in 2n bits  Structure B can be store in O(log n) bits  Structure C can be store in O(n log log n/log n) bits  In total, the size of FM-index is O(n) bits.

  63. Observation  C[x]+occ(x,i) is the number of SA[i] Suffix T[SA[i]-1] suffixes smaller than of xT SA[i] . 7 $ g 1 acacag$ $  Example: 3 acag$ c  C[c]=4 5 ag$ c  occ(c, 6)=2 2 cacag$ a  Number of suffixes smaller than 4 cag$ a cT[SA[6]..n]=ccag$ is 6. 6 g$ a

  64. Lemma  Suppose range(T,Q) is [st..ed].  Then, range(T,xQ) = [p..q] where  p = C[x] + occ(x,st-1) + 1  q = C[x] + occ(x,ed)  Proof:  p = 1+number of suffixes strictly smaller than xQ. The latter term = number of suffixes smaller than or equal to xT SA[st-1] .  q = number of suffixes smaller than or equal to xT SA[ed] .

  65. Backward search Given the text T and the FM-index, we want to determine if Q  exists in T. Algorithm BW_exist(Q[1..m]) x=Q[m], i=m-1; 1. /* find range(T,Q[m]) */ 2. st = C[x]+1, ed = C[x+1]; while (st ≤ ed and i>1) { 3. /* find range(T, Q[i-1..m]) */ x = Q[i-1];  st = C[x] + occ(x, st-1) + 1;  ed = C[x] + occ(x, ed);  i = i – 1;  } if st > ed, then pattern not found else pattern found. 4.

  66. Example  T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g  Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c  Q[3..3]=a 2 cacag$ a  sp=C[a] +1 4 cag$ a =1+1=2 6 g$ a  ep=C[c] =4 g $ c c a a a

  67. Example  T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g  Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c  Q[2..3]=ca 2 cacag$ a  st=C[c]+occ(c,st old -1)+1 4 cag$ a =4+0+1=5 6 g$ a  ed=C[c]+occ(c,ed old ) =4+2=6 g $ c c a a a

  68. Example  T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g  Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c  Q[1..3]=aca 2 cacag$ a  st=C[a]+occ(a,st old -1)+1 4 cag$ a =1+0+1=2 6 g$ a  ed=C[a]+occ(a,ed old ) =1+2=3 g $ c c a a a

  69. Example  T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g  Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c  Q[1..3]=aca 2 cacag$ a  st=C[a]+occ(a,st old -1)+1 4 cag$ a =1+0+1=2 Q occurrences in T! 6 g$ a  ed=C[a]+occ(a,ed old ) =1+2=3 g $ c c a a a

  70. Time complexity of Backward Search  To find a pattern Q[1..m]  Step 1, 2, and 4 can be computed in O(1) time.  For step 3,  We need to iterate the loop for m-1 times.  Each iteration of the loop can be computed in O(1) time.  The loop takes O(m) time.  In total, O(m) time for backward search.

  71. Conclusion  Suffix tree is a powerful data-structure which has a lot of applications in Computational Biology.  Problems:  Suffix tree is too big!  Can be solved using CSA and FM-index  Suffix tree can only solve exact match problem! (Most of the biology problems are approximate match!)  Many works have been done on this area! But still not pratical. One of the important area to explore!

Recommend


More recommend