BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches
Yangjun Chen, Yujia Wu
Department of Applied Computer Science, University of Winnipeg

Outline
Motivation
- Statement of Problem
- Related work
BWT Arrays – A space-economic Index for String Matching
String Matching with k Mismatches
- Search trees
- Mismatching information
- Mismatching trees
Experiments
Conclusion and Future Work

Statement of Problem
String matching with k mismatches: find all occurrences of a pattern string r in a target string s, where each occurrence differs from r in at most k positions.
- In DNA databases, because of polymorphisms or mutations among individuals, or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions with any of its occurrences in a genome.
Example: k = 4
pattern: a a a a a c a a a c
target:  a c a c a c a g a a g c c c

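The problem statement can be checked directly with a brute-force scan. The following is a minimal sketch, not the method of this talk: it assumes the example strings exactly as they appear on the slide, and the function name `k_mismatch_naive` is hypothetical.

```python
def k_mismatch_naive(target: str, pattern: str, k: int) -> list[int]:
    """Return 0-based start offsets where pattern occurs in target
    with at most k mismatching positions (Hamming distance <= k)."""
    hits = []
    m = len(pattern)
    for start in range(len(target) - m + 1):
        mismatches = 0
        for a, b in zip(target[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


# Strings copied from the slide example (an assumption about the figure).
pattern = "aaaaacaaac"
target = "acacacagaagccc"
print(k_mismatch_naive(target, pattern, 4))
```

This scan costs O(|s| * |r|) character comparisons in the worst case, which is the cost an index-based method tries to avoid.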
Related Work
Exact string matching
- On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick
- Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT transformation (Burrows, Wheeler), hash (Karp, Rabin)
Inexact string matching
- String matching with k mismatches, Hamming distance (Landau, Vishkin; Amir et al.; Cole)
- String matching with k differences, Levenshtein distance (Chang, Lampe)
- String matching with wild-cards (Manber, Baeza-Yates)

BWT-Index
Burrows-Wheeler Transform (BWT)
s = a1 c1 a2 g1 a3 c2 a4 $ (the subscripts number the occurrences of each character in s)
BWT construction (SA is the suffix array):
  L[i] = $, if SA[i] = 1;
  L[i] = s[SA[i] - 1], otherwise.
Rank correspondence: rk_F(e) = rk_L(e)
Sorted rotations of s, with first column F, last column L, and the rank of each character within its column:
  i  rotation                  F   rk_F  L   rk_L
  1  $  a1 c1 a2 g1 a3 c2 a4   $   -     a4  1
  2  a4 $  a1 c1 a2 g1 a3 c2   a4  1     c2  1
  3  a3 c2 a4 $  a1 c1 a2 g1   a3  2     g1  1
  4  a1 c1 a2 g1 a3 c2 a4 $    a1  3     $   -
  5  a2 g1 a3 c2 a4 $  a1 c1   a2  4     c1  2
  6  c2 a4 $  a1 c1 a2 g1 a3   c2  1     a3  2
  7  c1 a2 g1 a3 c2 a4 $  a1   c1  2     a1  3
  8  g1 a3 c2 a4 $  a1 c1 a2   g1  1     a2  4

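The construction rule above can be reproduced in a few lines. This is a minimal sketch that builds the suffix array by plain sorting (fine for a slide-sized string, not for a genome); the helper names `suffix_array` and `bwt_last_column` are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """1-based start positions of the suffixes of text in sorted order.
    Naive construction by sorting suffix slices."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def bwt_last_column(text: str, sa: list[int]) -> str:
    """L[i] = '$' if SA[i] = 1, else s[SA[i] - 1] (1-based indexing,
    so the 0-based Python index is SA[i] - 2)."""
    return "".join("$" if p == 1 else text[p - 2] for p in sa)


s = "acagaca$"
sa = suffix_array(s)
L = bwt_last_column(s, sa)
F = "".join(sorted(s))  # first column: the characters of s in sorted order
print(sa, F, L)
```

The printed columns agree with the F, L, and SA columns on the slides for s = acagaca$.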
Backward Search of BWT-Index
s = a1 c1 a2 g1 a3 c2 a4 $
search(z, π) = <z, [α, β]> if z appears in L_π, and is empty otherwise.
- z: a character
- π: a range in F
- L_π: a range in L, corresponding to π
Example: search p = aca by backward search.
  i  F   L   SA
  1  $   a4  8
  2  a4  c2  7
  3  a3  g1  5
  4  a1  $   1
  5  a2  c1  3
  6  c2  a3  6
  7  c1  a1  2
  8  g1  a2  4

Backward Search of BWT-Index
search(c, <a, [2, 5]>)
search(a, <c, [1, 2]>)
Search sequence: <a, [2, 5]>, <c, [1, 2]>, <a, [3, 4]>
The F, L, and suffix array columns are the same as on the previous slide. Rows 3 and 4 of F hold SA entries 5 and 1, the two occurrences of aca in s.

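The backward search for p = aca can be traced with a small sketch. It assumes the textbook FM-index formulation (every range is expressed as rows of F, and ranks are counted by scanning L), so its intermediate ranges are all F-row ranges; the names `build_fm` and `backward_search` are hypothetical.

```python
def build_fm(text: str):
    """Return (0-based suffix array, last column L, first row of each char in F)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    L = "".join(text[i - 1] for i in sa)  # i = 0 wraps around to the final '$'
    first = {}
    for row, ch in enumerate(sorted(text)):
        first.setdefault(ch, row)
    return sa, L, first


def backward_search(text: str, pattern: str):
    """Return ((top, bot), positions): the 1-based inclusive row range in F
    and the sorted 1-based match positions, or None if pattern is absent."""
    sa, L, first = build_fm(text)
    lo, hi = 0, len(text)  # half-open row range, initially all rows
    for ch in reversed(pattern):
        if ch not in first:
            return None
        # Rank of ch before each boundary, counted by a plain scan of L.
        lo = first[ch] + L[:lo].count(ch)
        hi = first[ch] + L[:hi].count(ch)
        if lo >= hi:
            return None
    return (lo + 1, hi), sorted(sa[i] + 1 for i in range(lo, hi))


print(backward_search("acagaca$", "aca"))
```

For this input the first and last F ranges are rows [2, 5] and [3, 4], and the reported positions are 1 and 5.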
rankAll
Arrange |Σ| arrays, one for each character X ∈ Σ, such that A_X[i] (the i-th entry in the array for X) is the number of appearances of X within L[1 .. i].
Instead of scanning a certain segment L[α .. β] (α ≤ β) to find a subrange for a certain X, we can simply look up A_X to see whether A_X[α - 1] = A_X[β]. If that is the case, then X does not occur in L[α .. β]. Otherwise, [A_X[α - 1] + 1, A_X[β]] is the found range.
  i  A_$  A_a  A_c  A_g  A_t  F   L
  1  0    1    0    0    0    $   a4
  2  0    1    1    0    0    a4  c2
  3  0    1    1    1    0    a3  g1
  4  1    1    1    1    0    a1  $
  5  1    1    2    1    0    a2  c1
  6  1    2    2    1    0    c2  a3
  7  1    3    2    1    0    c1  a1
  8  1    4    2    1    0    g1  a2
Example: to find the first and the last appearance of c in L[2 .. 5], we only need A_c[2 - 1] = A_c[1] = 0 and A_c[5] = 2. So the corresponding range is [A_c[2 - 1] + 1, A_c[5]] = [1, 2].

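The lookup described above can be sketched as follows. The function names `rank_all` and `rank_range` are hypothetical, and each array carries an extra entry A_X[0] = 0 so that α = 1 needs no special case.

```python
def rank_all(L: str, alphabet: str) -> dict[str, list[int]]:
    """A[x][i] = number of appearances of x within L[1 .. i] (1-based);
    A[x][0] = 0 is a sentinel."""
    A = {x: [0] * (len(L) + 1) for x in alphabet}
    for i, ch in enumerate(L, start=1):
        for x in alphabet:
            A[x][i] = A[x][i - 1] + (1 if x == ch else 0)
    return A


def rank_range(A, x: str, alpha: int, beta: int):
    """Rank range of x inside L[alpha .. beta], or None if x does not occur."""
    lo, hi = A[x][alpha - 1], A[x][beta]
    return None if lo == hi else (lo + 1, hi)


L = "acg$caaa"  # last column of the BWT of acagaca$
A = rank_all(L, "$acgt")
print(rank_range(A, "c", 2, 5))
```

Each lookup costs two array reads, at the price of |Σ| counters per position of L.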
Reduce rankAll-Index Size
- F-ranks: F_a = <a; x_a, y_a> for each character a. Here: F_$ = <$; 1, 1>, F_a = <a; 2, 5>, F_c = <c; 6, 7>, F_g = <g; 8, 8>.
- BWT array: L
- Reduced appearance array: A, stored with a bucket size
- Reduced suffix array: SA*, stored with a bucket size
Find a range:
  top ← F(x) + A[⌊(top - 1) / bucket size⌋] + r + 1
  bot ← F(x) + A[⌊bot / bucket size⌋] + r'
where r is the number of x's appearances in L from the start of the bucket containing top - 1 up to position top - 1, and r' is the number of x's appearances in L from the start of the bucket containing bot up to position bot.
  i  F   L   rk_L  SA  A_$  A_a  A_c  A_g  A_t
  1  $   a4  1     8   0    1    0    0    0
  2  a4  c2  1     7   0    1    1    0    0
  3  a3  g1  1     5   0    1    1    1    0
  4  a1  $   -     1   1    1    1    1    0
  5  a2  c1  2     3   1    1    2    1    0
  6  c2  a3  2     6   1    2    2    1    0
  7  c1  a1  3     2   1    3    2    1    0
  8  g1  a2  4     4   1    4    2    1    0

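The idea of keeping only one counter per bucket and recovering the remainder by a short scan can be sketched as below. This is the generic checkpointing technique, written under the assumption that a bucket boundary falls after every `bucket` positions of L; it is not a transcription of the slide's formulas, and the names `build_checkpoints` and `rank` are hypothetical.

```python
def build_checkpoints(L: str, alphabet: str, bucket: int) -> dict[str, list[int]]:
    """cp[x][b] = number of appearances of x within L[1 .. b * bucket]."""
    cp = {x: [0] for x in alphabet}
    counts = dict.fromkeys(alphabet, 0)
    for i, ch in enumerate(L, start=1):
        counts[ch] += 1
        if i % bucket == 0:  # bucket boundary: store one counter per character
            for x in alphabet:
                cp[x].append(counts[x])
    return cp


def rank(L: str, cp, bucket: int, x: str, i: int) -> int:
    """Number of appearances of x within L[1 .. i], using the nearest
    checkpoint at or before i plus a scan of at most bucket - 1 characters."""
    b = i // bucket
    r = L[b * bucket:i].count(x)  # the scanned remainder inside the bucket
    return cp[x][b] + r


L = "acg$caaa"
cp = build_checkpoints(L, "$acgt", bucket=4)
print(rank(L, cp, 4, "c", 5), rank(L, cp, 4, "c", 1))
```

With bucket size 4 the index stores 3 counters per character instead of 9, and each query scans at most 3 characters of L.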
String Matching with k Mismatches
Search Trees
pattern: r = tcaca; target: s = acagaca; k = 2.
[Figure: search tree T. The root v0 is <-, [1, 8]>. Levels 1 to 5 of T are compared with r[1] = t, r[2] = c, r[3] = a, r[4] = c, and r[5] = a. Each node v1, ..., v19 is a pair of a character and a range, for example <a, [1, 4]>, <c, [1, 2]>, and <g, [1, 1]>. The root-to-leaf paths are P1, P2, P3, and P4.]

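A search tree of this kind can be explored by a depth-first traversal over the BWT index that abandons a branch once it has used more than k mismatches. The sketch below makes two assumptions that may differ from the slide: it consumes r from its last character (the usual backward-search orientation), and it uses a full occurrence table rather than reduced arrays. The names `build_index` and `k_mismatch_search` are hypothetical.

```python
def build_index(s: str):
    text = s + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    L = [text[i - 1] for i in sa]  # i = 0 wraps around to '$'
    alphabet = sorted(set(text))
    C, total = {}, 0               # C[ch] = number of characters smaller than ch
    for ch in alphabet:
        C[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of appearances of ch in L[0:i]
    occ = {ch: [0] * (len(L) + 1) for ch in alphabet}
    for i, ch in enumerate(L):
        for x in alphabet:
            occ[x][i + 1] = occ[x][i] + (1 if x == ch else 0)
    return sa, C, occ


def k_mismatch_search(s: str, r: str, k: int) -> list[int]:
    """0-based offsets in s where r occurs with at most k mismatches."""
    sa, C, occ = build_index(s)
    hits = []

    def dfs(j: int, lo: int, hi: int, used: int) -> None:
        # [lo, hi) is the row range for the part of r matched so far.
        if j < 0:
            hits.extend(sa[lo:hi])
            return
        for ch in C:
            if ch == "$":
                continue
            cost = used + (ch != r[j])
            if cost > k:
                continue               # prune: mismatch budget exceeded
            nlo = C[ch] + occ[ch][lo]
            nhi = C[ch] + occ[ch][hi]
            if nlo < nhi:              # ch actually extends some occurrence
                dfs(j - 1, nlo, nhi, cost)

    dfs(len(r) - 1, 0, len(s) + 1, 0)
    return sorted(hits)


print(k_mismatch_search("acagaca", "tcaca", 2))
```

Every root-to-leaf path of the traversal spells a distinct string of length |r| that occurs in s, so no offset is reported twice.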
String Matching with k Mismatches
Mismatching information
- R: mismatching table for r, with |r| = m.
- R_ij: contains the positions of the first 2k + 1 mismatches between r[i .. m - q + i] and r[j .. m - q + j], where q = max{i, j}, such that if R_ij[l] = x, then r[i + x - 1] ≠ r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them.
Example: r = tcacg (r_i is the suffix of r that starts at position i)
  r_1 = tcacg, r_2 = cacg: R_12 = 1 2 3 4
  r_1 = tcacg, r_3 = acg:  R_13 = 1 3
  r_1 = tcacg, r_4 = cg:   R_14 = 1 2
  r_1 = tcacg, r_5 = g:    R_15 = 1

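The first row of the mismatching table can be computed by comparing r with each of its own suffixes. This minimal sketch assumes that only the overlapping part r[1 .. m - j + 1] versus r[j .. m] is compared, which is what the example values on the slide show; the name `mismatch_row` is hypothetical.

```python
def mismatch_row(r: str, k: int) -> dict[int, list[int]]:
    """Return {j: R_1j} for j = 2 .. m, where R_1j lists the 1-based
    positions of the first 2k + 1 mismatches between r and r[j .. m]."""
    m = len(r)
    limit = 2 * k + 1
    table = {}
    for j in range(2, m + 1):
        overlap = m - j + 1  # length of the suffix r[j .. m]
        positions = []
        for x in range(1, overlap + 1):
            # r[1 + x - 1] versus r[j + x - 1], converted to 0-based indices
            if r[x - 1] != r[j + x - 2]:
                positions.append(x)
                if len(positions) == limit:
                    break
        table[j] = positions
    return table


print(mismatch_row("tcacg", 2))
```

For r = tcacg and k = 2 the limit of 2k + 1 = 5 entries is never reached, so every row is complete.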
String Matching with k Mismatches
Derivation of mismatching information
We store only part of the mismatching information, specifically R_12, …, R_1m; all other mismatching information is derived dynamically.
Example: derive the mismatching information between β1 = r[2 .. 5] = cacg and β2 = r[3 .. 5] = acg from R_12 and R_13. Pointer p scans A1 = R_12 = 1 2 3 4 and pointer q scans A2 = R_13 = 1 3.
- Step 1: A1[p] = A2[q] = 1; β1[1] = c and β2[1] = a differ, so A = 1.
- Step 2: A1[p] = 2 < A2[q] = 3, so A = 1 2.
- Step 3: A1[p] = A2[q] = 3; β1[3] = c and β2[3] = g differ, so A = 1 2 3.

String Matching with k Mismatches
Algorithm for Derivation of mismatching information
Let α, β1, and β2 be three strings. Let A1 and A2 be two arrays containing all the positions of mismatches between α and β1, and between α and β2, respectively. Create a new array A such that if A[i] = j, then β1[j] ≠ β2[j], or one of them does not exist; j is the i-th mismatch between β1 and β2.
1. p := 1; q := 1; l := 1.
2. If A2[q] < A1[p], then {A[l] := A2[q]; q := q + 1; l := l + 1;}
3. If A1[p] < A2[q], then {A[l] := A1[p]; p := p + 1; l := l + 1;}
4. If A1[p] = A2[q], then {if β1[A1[p]] ≠ β2[A2[q]], then {A[l] := A1[p]; l := l + 1;} p := p + 1; q := q + 1;}
5. If p > |A1|, q > |A2|, or both A1[p] and A2[q] are empty, stop (if A1 or A2 has some remaining non-empty elements, first append them to the rear of A, and then stop).
6. Otherwise, go to (2).

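The merge in steps 1 to 6 can be sketched as a two-pointer loop. This sketch assumes that A1 and A2 are complete, sorted lists of 1-based positions, and it compares the two strings at the shared position when both arrays report it; the names `derive_mismatches` and `char_at` are hypothetical.

```python
def derive_mismatches(a1: list[int], a2: list[int], b1: str, b2: str) -> list[int]:
    """Derive the mismatch positions between b1 and b2 from their mismatch
    positions a1 and a2 against a common third string."""

    def char_at(s: str, pos: int):
        # 1-based access; None stands for "does not exist"
        return s[pos - 1] if 1 <= pos <= len(s) else None

    out = []
    p = q = 0
    while p < len(a1) and q < len(a2):
        if a2[q] < a1[p]:
            out.append(a2[q])  # only b2 differs from the common string here
            q += 1
        elif a1[p] < a2[q]:
            out.append(a1[p])  # only b1 differs from the common string here
            p += 1
        else:
            pos = a1[p]        # both differ: compare them directly
            if char_at(b1, pos) != char_at(b2, pos):
                out.append(pos)
            p += 1
            q += 1
    out.extend(a1[p:])         # leftover positions are mismatches as well
    out.extend(a2[q:])
    return out


# Example values from the slides: A1 = R_12, A2 = R_13, strings cacg and acg.
print(derive_mismatches([1, 2, 3, 4], [1, 3], "cacg", "acg"))
```

Only positions where both arrays agree require a character comparison; everywhere else the answer follows from the stored arrays alone. In the slide example the last entry comes from the leftover of A1, a position where the shorter string has no character.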
String Matching with k Mismatches
Derivation of mismatching information for paths in a search tree
This part of P3 (marked in the figure) will not be created. We derive the mismatching information for it according to P1 and R_21.
[Figure: search tree rooted at <-, [1, 8]>, with levels compared with r[1] = t, r[2] = c, r[3] = a, r[4] = c, and r[5] = a, showing paths P1 and P3 and marked positions i and j. The node pairs shown include <a, [1, 4]>, <c, [1, 2]>, <a, [2, 3]>, <g, [1, 1]>, <c, [2, 2]>, and <a, [4, 4]>.]