Common substrings of more than 2 strings (VI) By traversing T, for each internal node v, compute C(v). [C(v) 3. is defined as the number of distinct termination symbols in the subtree rooted at v] This step takes O(Kn) time. 3 c a % # $ g 3 3 4 5 5 g c g a 3 3 2 c g # % a c a c c $ $ # $ $ # # % % 4 3 3 1 2 1 4 2 3 2 1
Common substrings of more than 2 strings (VII) Traverse T and visit every internal node v. For each v, if V(C(v)) < 4. string-depth of v, set V(C(v)) = string-depth of v. [After step 4, V(k) = the length of the longest substring common to exactly k of these strings.] l(k)=V(k). For i=k-1 downto 2, l(i)=max{l(i+1), V(i)} . 5. This two steps take O(n) time. For our example, V(2) = 3, V(3) = 2. Thus, l(3) = 2, l(2) = 3. In total, this algorithm takes O(Kn) time. Actually, we can improve this algorithm to O(n) time by mean of lcp!
Linear time algorithm for constructing suffix tree
Straightforward construction of suffix tree Consider S = s 1 s 2 … s n where s n =$ Algorithm: Initialize the tree with only a root For i = n to 1 Includes S[i..n] into the tree Time: O(n 2 )
Example of construction S=acca$ Init For-loop c c a c a a a $ a $ $ $ c $ $ $ c c $ $ a c a a a $ $ a $ $ $ $ 4 4 5 5 5 3 4 3 2 4 1 3 2 5 5 I 4 I 3 I 5 I 2 I 1
Construction of generalized suffix tree S ’ = c# Init For-loop c c c a a a c c c $ # $ # $ c c c c a c a c a a a a # $ $ a $ $ a $ $ a $ $ $ $ $ $ 4 1 3 2 2 4 1 3 2 2 4 1 3 2 1 5 5 5 I 1 J 2 J 1
Can we construct a suffix tree in o(n 2 ) time? Yes. We can construct it in O(n) time. Weiner ’ s algorithm [1973] Linear time for constant size alphabet, but much space McGreight ’ s algorithm [JACM 1976] Linear time for constant size alphabet, quadratic space Ukkonen ’ s algorithm [Algorithmica, 1995] Online algorithm, linear time for constant size alphabet, less space Farach ’ s algorithm [FOCS 1997] Linear time for general alphabet Hon,Sadakane, and Sung ’ s algorithm [FOCS 2003] O(n) bit space O(n log e n) time for 0<e<1 O(n) bit space O(n) time for suffix array construction We will discuss Farach ’ s algorithm later.
Idea Build Odd Suffix Tree and Even Suffix Tree Then, merge odd and even suffix tree. $ a g $ a g ca ca $ $ 7 6 7 6 g g c c a a $ $ c c a a g g g g 5 5 $ $ c c $ $ a a g g g g $ $ 4 4 $ $ 2 2 1 3 1 3 Even Suffix Tree Odd Suffix Tree
Idea Input: a string S of length n Recursively compute the suffix tree T o of all 1. suffixes beginning at the odd positions. T o is of size n/2. From T o , compute T e which is the suffix tree 2. for all suffixes beginning at the even positions. Merge T o and T e to form the suffix tree for 3. S.
Stage 1: Constructing odd suffix tree Given a string S[1..n], we generate a new string S ’ [1..n/2] as follows. we map pairs of characters into single characters as follows: S[1..2], S[3..4], S[5..6], … , S[n-1..n]. Remove the duplicates from the pairs of characters and sort them by radix sort. S ’ [i] = rank of S[2i-1..2i] in the sorted list, for i=1, 2, … , n/2. By recursion, we get the suffix tree T ’ for S ’ Convert T ’ to the odd suffix tree T o .
Example (I) S = aaabbbabbaba$ S[1..2]=aa, S[3..4]=ab, S[5..6]=bb, S[7..8]=ab, S[9..10]=ba, S[11..12]=ba. By stable sort, aa < ab < ba < bb. Rank(aa)=1, Rank(ab)=2, Rank(ba)=3, Rank(bb)=4. So, S ’ =124233$.
Example (II) By recursion, construct the suffix tree T ’ for S ’ : $ 2 3 7 3 $ 6 5 1 4 2
Example (III) Convert T ’ to the odd tree: $ a 13 5 b i 2i-1 $ 11 9 7 1 This is not a suffix tree 3
Example (IV) Refine the odd tree T o : $ b a 13 b 5 a b $ 11 9 7 1 3
Time complexity for building the odd tree Let Time(n) be the time to build a suffix tree for a string of length n. Stable sorting and refinement of the odd trees take O(n) time. Build suffix tree for S ’ takes Time(n/2). So, Stage 1 takes Time(n/2)+O(n) time.
Stage 2: Build the even tree Generate the lex-ordering of the 1. leaves in T e . For any two adjacent leaves 2i and 2j, 2. we find lcp(2i, 2j). Construct the even tree T e from left to 3. right (according to the lex-ordering).
Build the even tree (Step 1) We get the lex-ordering of the leaves in T o . Generate the lex-ordering of the leaves in T e . For each leaf i in T o , get the preceding character c=S[i-1] and form a pair (c,i). Each pair represents a even suffix i-1. Perform stable sorting on those pairs. We get the lex-ordering of the leaves in T e .
Example S = aaabbbabbaba$ Lex-ordering of the leaves in T o : 13 < 1 < 7 < 3 < 11 < 9 < 5 The pairs are: (a,13), ($,1), (b,7), (a, 3), (a, 11), (b, 9), (b, 5). After stable sorting, we have ($, 1), (a, 13), (a, 3), (a, 11), (b, 7), (b, 9), (b, 5). Hence, the lex-ordering of the leaves of T e : 12 < 2 < 10 < 6 < 8 < 4
Build the even tree (Step 2) For any two adjacent leaves 2i and 2j, we first find lcp(2i, 2j). Observation: lcp(2i, 2j) = lcp(2i+1, 2j+1)+1 if S[2i]=S[2j] 0 otherwise Proof: If S[2i] ≠ S[2j], lcp(2i,2j)=0. Otherwise, lcp(2i,2j)=1+lcp(2i+1,2j+1).
Example Recall that the lex- $ b a ordering of leaves: 13 b 12 < 2 < 10 < 6 < 8 < 4. 5 a b By the previous observation, we have $ lcp(8,4)=lcp(9,5)+1=2 Similarly, we have 11 9 lcp(12,2)=1, lcp(2,10)=1, 7 lcp(10,6)=0, lcp(6,8)=1, 1 lcp(8,4)=2 3
Build the even tree (Step 3) Construct the even tree T e from left to right. a a $ $ a a b b 12 12 b b b b a 12 a 10 b b b b a a b b a a $ $ 2 2
Build the even tree (Step 3) b b a a a b $ $ $ a a a b b b 12 12 12 b b b b b b a a a 10 10 8 10 b b b 8 b b b 6 6 a a a b b b 6 a a a 4 $ $ $ 2 2 2
Time complexity for building the even tree Step 1: O(n) time Step 2: O(n) time Step 3: O(n) time
Stage 3: Merge odd and even trees We can merge T o and T e by DFS. However, it takes O(n 2 ) time. $ $ b b a a babbaba$ babbaba$ 13 13 b b a,1 a,1 b,1 b,1 a a 5 5 $,1 $,1 a a b b a a b b 13 13 b,1 b,1 b b a,1 a,1 b,1 b,1 b b b b b b b b $,1 $,1 a a b b $ $ a a a a a a b b a a b b a,1 a,1 a,2 a,2 12 12 $ $ b b a a b b b b b,1 b,1 a a b b $ $ a,2 a,2 $ $ b,1 b,1 b b 11 11 9 9 a a a a $,1 $,1 a,2 a,2 a a b b 10 10 b,8 b,8 b b $ $ 8 8 a a 12 12 7 7 a,2 a,2 b b $ $ a,11 1 1 b b 11 11 9 9 b,5 b,5 a a 1 1 3 3 10 10 b,10 b,5 b,5 b b 4 4 8 8 a,4 b,9 a,4 b,9 Odd tree 5 5 b b a a 7 7 3 3 6 6 2 2 b b 6 6 a a 4 4 $ $ 2 2 Even tree
Stage 3: Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other. $ b a b a 13 5 a b $ a b 12 b $ b a 10 b 8 b 11 9 a b 6 7 a 4 1 $ 2 3
Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other. a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3
Merge odd and even trees The merging may over-merged some nodes. To correct the tree, we need to unmerge some nodes. a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3
Definition of L() and d() For every node u which may be over-merged, there exist two leaves 2i and 2j-1 such that u=lca(2i, 2j-1). Denote L(u) be the correct depth of u, that is, lcp(2i,2j-1). Note that lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise. Let v be lca(2i+1,2j). Denote d(u) = v. Note that d() is equivalent to suffix link!
Example of d() a,1 b,1 $,1 b,1 13 a,1 b,2 $,1 12 a,4 a,11 d() $,1 $,1 b,3 b,8 10 8 a,4 11 9 b,3 2 b,9 4 7 b,3 5 $,1 6 1 3
Relationship between L() and d() Suppose u = lca(2i, 2j-1). Note1: if u is not the root, then S[2i]=S[2j-1]. Note2: lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise Note3: d(u) = lcp(2i+1,2j) Hence, L(u) = 1 + L(d(u)) if u is not the root. Otherwise, L(u)=0. Lemma: L(u) = the length of the purple path from u to the root.
Example of L() L( )=1 a,1 b,1 $,1 L( )=1 b,1 13 a,1 b,2 $,1 L( )=2 L( )=2 12 L( )=2 a,4 a,11 $,1 $,1 b,3 b,8 10 8 L( )=4 a,4 11 9 b,3 2 b,9 L( )=2 L( )=3 4 7 b,3 5 $,1 6 1 3
Unmerge the border nodes based on L() (I) L( )=1 a,1 b,1 $,1 L( )=1 b,1 13 a,1 b,2 $,1 L( )=2 L( )=2 12 L( )=2 a,4 a,11 $,1 $,1 b,3 b,8 10 8 L( )=4 a,4 11 9 b,3 2 b,9 L( )=2 L( )=3 4 7 b,3 5 $,1 6 1 3
Unmerge the border nodes based on L() (II) a,1 b,1 $,1 b,1 13 a,1 b,1 $,1 a,1 a,2 12 b,1 a,2 b,1 $,1 a,2 b,8 10 8 a,2 a,11 11 9 b,5 1 b,10 b,5 b,9 a,4 4 5 7 3 6 2
Time complexity for merging Merge the tree using DFS takes O(n) time. Compute the links d() takes O(n) time. Compute L() takes O(n) time. Unmerge takes O(n) time.
Total time complexity of Farach ’ s algorithm Stage 1: Time(n/2)+O(n) Stage 2: O(n) Stage 3: O(n) Thus, Time(n) = Time(n/2)+O(n). By solving the equation, Time(n)= O(n).
Disadvantage of suffix tree Suffix tree is space inefficient. It requires O(n| Σ |log n) bits. Manber and Myers (SIAM J. Comp 1993) proposes a new data structure, called suffix array, which has a similar functionality as suffix tree. Moreover, it only requires O(n log n) bits.
Suffix Array (I) It is just sorted suffixes. E.g. consider S = acacag$ Suffix Position SA[i] Suffix acacag$ 1 7 $ cacag$ 2 1 acacag$ acag$ 3 => 3 acag$ 5 cag$ 4 Sort ag$ ag$ 5 2 cacag$ g$ 6 4 cag$ $ 7 6 g$ Suffix array is an array of n indices. Thus, it takes O(n log n) bits.
Observation The leaves of a suffix tree is in suffix array order. SA[i] Suffix $ a g ca 7 $ $ 7 6 1 acacag$ g c 3 acag$ a $ c 5 ag$ a g g 2 cacag$ 5 $ c $ a g 4 cag$ g $ 2 4 $ 6 g$ 1 3
Linear time construction of suffix array from suffix tree Recall that the suffix tree T of S[1..n] can be constructed in O(n) time. Then, by “ lexical ” depth-first traversal of T , the suffix array of S is obtained. This takes O(n) time. However, the space used during construction is the same as that for suffix tree! This defeats the purpose of suffix array. Today, we can build suffix array using O(n) bit space and O(n) time.
range(T,Q) For a pattern Q, its SA[i] Suffix occurrences in T form a 1 7 $ consecutive SA range. 2 1 acacag$ 3 3 acag$ Example: For T= acacag$, ca 4 5 ag$ occurs in SA[5] and SA[6]. 5 2 cacag$ Definition: 6 4 cag$ We called range(T,Q)=[st..ed] 7 6 g$ if Q is a prefix of every T j for j=SA[st], SA[st+1], … , SA[ed] where T j = j suffix of T = T[j..n]. Example: range(T,ca)=[5..6]
Find occurrence of query Q in a string S using suffix array Input: (1) the suffix array of a string T of length n and (2) a query Q of length m Aim: check if Q occurs in T Idea: binary search!
Algorithm
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$ L=1 4 5 ag$ 5 2 cacag$ R=7 6 4 cag$ 7 6 g$ M=(L+R)/2=4
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ L=1 3 3 acag$ 4 5 R=7 ag$ 5 2 cacag$ M=(L+R)/2=4 6 4 cag$ 7 6 g$ suffix-SA[M] > Q. Set R=M=4.
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ L=1 3 3 acag$ 4 5 R=4 ag$ 5 2 cacag$ M=(L+R)/2=2 6 4 cag$ 7 6 g$ suffix-SA[M] < Q. Set L=M=2.
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$ L=2 4 5 ag$ 5 2 cacag$ R=4 6 4 cag$ M=(L+R)/2=3 7 6 g$ The pattern Q is found at SA[M]=3.
Can we do better? During each step of binary search, we need to compare Q with a suffix using O(m) time, which is time consuming. Can we do better? We have the following observation. Suppose LCP(Q, suffix-SA[L]) is l and LCP(Q, suffix-SA[R]) is r. Then, LCP(Q, suffix-SA[M]) > min{l,r}. Below, we describe how to utilize this observation to speedup the computation.
Algorithm
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ 3 3 acag$ L=1, l=0 4 5 ag$ 5 2 cacag$ R=7, r=0 6 4 cag$ 7 6 g$ mlr = min(l,r)=0 M=(L+R)/2=4
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ L=1, l=0 2 1 acacag$ R=7, r=0 3 3 acag$ mlr = min(l,r)=0 4 5 ag$ M=(L+R)/2=4, m=1 5 2 cacag$ 6 4 cag$ The (m+1) char of suffix-SA[M] is g. The (m+1) char of Q is c. 7 6 g$ So, suffix-SA[M] > Q. Set R=M=4 and r=m=1.
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ L=1, l=0 2 1 acacag$ R=4, r=1 3 3 acag$ mlr = min(l,r)=0 4 5 ag$ M=(L+R)/2=2, m=3 5 2 cacag$ 6 4 cag$ The (m+1) char of suffix-SA[M] is c. The (m+1) char of Q is g. 7 6 g$ So, suffix-SA[M] < Q. Set L=M=2 and l=m=3.
Example Consider T = acacag$ i SA[i] Suffix Pattern Q = acag 1 7 $ 2 1 acacag$ L=2, l=3 3 3 acag$ 4 5 R=4, r=1 ag$ 5 2 cacag$ mlr = min(l,r)=1 6 4 cag$ M=(L+R)/2=3, m=4 7 6 g$ The pattern Q is found at SA[M]=3.
Time analysis Binary search will perform log n comparisons Each comparison takes at most O(m) time In the worst case, O(m log n) time. Myers and Manber report that, in practice, the time is O(m + log n).
Suffix array and suffix tree We show one example of replacing suffix tree by suffix array Note that most applications related to suffix tree can be solved using suffix array with some time blow up! When space is limited, replacing suffix tree by suffix array is a good choice.
The size is still too big! Why? DNA sequences can be very long! E.g. Fly: ~100M bases, Human: ~3G bases, Tree: ~9G bases Storage to store indexing data structure for human genome Suffix Tree: ~40G bytes Suffix Array: ~13G bytes Can we further reduce the space?
Solution Grossi, Vitter (STOC2000) Compressed suffix array (CSA) Ferragine, Manzini (FOCS2000) FM-index Both of them can be stored in O(n) bit space For Human Genome Both CSA and FM-index can be stored within 2G bytes.
FM-index Consider a text T=acacag$ FM-index stores: SA[i] Suffix T[SA[i]-1] The string BW=g$ccaaa A. 7 $ g C[x] = total no. of 1 acacag$ $ B. occurrences of each symbol 3 acag$ c less than x. E.g. C[a]=1, 5 ag$ c C[c]=4, C[g]=6, C[t]=7 2 cacag$ a A data-structure occ(x, i) C. 4 cag$ a which tells us the number of 6 g$ a occurrences of x in BW[1..i] using O(1) time.
Data-structure for answering the occ(x, i) query? n/log 2 n 1 2 3 4 BW ……… log 2 n log n ……… 1 2 3 log n 00x0xxx0x0 Given the text BW[1..n], we divide BW[1..n] into buckets of size log 2 n. For each bucket i = 1, … , n/log 2 n, we store P[i] = number of x ’ s in BW[1.. ilog 2 n]. Each bucket is further subdivided into log n sub-buckets of size log n. For each sub-bucket j of the bucket i, we store Q[i][j] = number of x ’ s in the first j sub-buckets.
Data-structure for answering the occ(x, i) query? We also need a lookup table rank(b,k) b is any string of length (log n)/2 1 ≤ k ≤ (log n)/2 rank(b,k) = number of x in the first k characters of b. 1 2 3 ……………… (log n)/2 Number of entries 00 … 0 log n log n log n = = 2 n 2 2 2 … 1 1 2 ……………………… x0x..0 Each entry take s O (log log n ) bits … Total space xx … x = = O ( n log n log log n ) o ( n ) bits
Space complexity of the occ() data-structure P[1..n/log 2 n] uses O(n/log n) bits Q[1..n/log 2 n][1..log n] uses O(n log log n / log n) bits rank(b,k) uses 2 logn/2 (log n/2) = o(n) bits In total, we use O(n log log n / log n) bits.
How to compute occ(x,i)? n/log 2 n 1 2 3 4 BW ……… log 2 n log n ……… 1 2 3 log n 00x0xxx0x0 Suppose log n = 10. To compute occ(x, 327), The result is P[3]+Q[4][2]+rank(00x0x,5)+rank(xx0x0,2) Hence, O(1) time to compute occ(x,i).
Size of FM-index Structure A can be store in 2n bits Structure B can be store in O(log n) bits Structure C can be store in O(n log log n/log n) bits In total, the size of FM-index is O(n) bits.
Observation C[x]+occ(x,i) is the number of SA[i] Suffix T[SA[i]-1] suffixes smaller than of xT SA[i] . 7 $ g 1 acacag$ $ Example: 3 acag$ c C[c]=4 5 ag$ c occ(c, 6)=2 2 cacag$ a Number of suffixes smaller than 4 cag$ a cT[SA[6]..n]=ccag$ is 6. 6 g$ a
Lemma Suppose range(T,Q) is [st..ed]. Then, range(T,xQ) = [p..q] where p = C[x] + occ(x,st-1) + 1 q = C[x] + occ(x,ed) Proof: p = 1+number of suffixes strictly smaller than xQ. The latter term = number of suffixes smaller than or equal to xT SA[st-1] . q = number of suffixes smaller than or equal to xT SA[ed] .
Backward search Given the text T and the FM-index, we want to determine if Q exists in T. Algorithm BW_exist(Q[1..m]) x=Q[m], i=m-1; 1. /* find range(T,Q[m]) */ 2. st = C[x]+1, ed = C[x+1]; while (st ≤ ed and i>1) { 3. /* find range(T, Q[i-1..m]) */ x = Q[i-1]; st = C[x] + occ(x, st-1) + 1; ed = C[x] + occ(x, ed); i = i – 1; } if st > ed, then pattern not found else pattern found. 4.
Example T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c Q[3..3]=a 2 cacag$ a sp=C[a] +1 4 cag$ a =1+1=2 6 g$ a ep=C[c] =4 g $ c c a a a
Example T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c Q[2..3]=ca 2 cacag$ a st=C[c]+occ(c,st old -1)+1 4 cag$ a =4+0+1=5 6 g$ a ed=C[c]+occ(c,ed old ) =4+2=6 g $ c c a a a
Example T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c Q[1..3]=aca 2 cacag$ a st=C[a]+occ(a,st old -1)+1 4 cag$ a =1+0+1=2 6 g$ a ed=C[a]+occ(a,ed old ) =1+2=3 g $ c c a a a
Example T = acacag$ SA[i] Suffix T[SA[i]-1] 7 $ g Q = aca 1 acacag$ $ 3 acag$ c 5 ag$ c Q[1..3]=aca 2 cacag$ a st=C[a]+occ(a,st old -1)+1 4 cag$ a =1+0+1=2 Q occurrences in T! 6 g$ a ed=C[a]+occ(a,ed old ) =1+2=3 g $ c c a a a
Time complexity of Backward Search To find a pattern Q[1..m] Step 1, 2, and 4 can be computed in O(1) time. For step 3, We need to iterate the loop for m-1 times. Each iteration of the loop can be computed in O(1) time. The loop takes O(m) time. In total, O(m) time for backward search.
Conclusion Suffix tree is a powerful data-structure which has a lot of applications in Computational Biology. Problems: Suffix tree is too big! Can be solved using CSA and FM-index Suffix tree can only solve exact match problem! (Most of the biology problems are approximate match!) Many works have been done on this area! But still not pratical. One of the important area to explore!
Recommend
More recommend