
Yangjun Chen, Yujia Wu, Department of Applied Computer Science



  1. BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches. Yangjun Chen, Yujia Wu, Department of Applied Computer Science, University of Winnipeg

  2. Outline
     - Motivation
       - Statement of Problem
       - Related Work
     - BWT Arrays: A Space-Economic Index for String Matching
     - String Matching with k Mismatches
       - Search trees
       - Mismatching information
       - Mismatching trees
     - Experiments
     - Conclusion and Future Work

  3. Statement of Problem
     String matching with k mismatches: find all the occurrences of a pattern string r in a target string s, where each occurrence differs from r in at most k positions.
     - In DNA databases, because of polymorphisms or mutations among individuals, or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions with any of its occurrences in a genome.

     Example (k = 4):
     - pattern: a a a a a c a a a c
     - target: a c a c a c a g a a g c c c
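
As a baseline for the problem statement above, the following is a minimal brute-force sketch, not the method of the talk: it slides the pattern over the target and counts differing positions. The function name `hamming_occurrences` and the 1-based start positions are assumptions made for illustration.

```python
def hamming_occurrences(r, s, k):
    """Return the 1-based start positions in s where r occurs with at most k mismatches."""
    m = len(r)
    hits = []
    for start in range(len(s) - m + 1):
        window = s[start:start + m]
        # Count the positions where pattern and window disagree.
        mismatches = sum(1 for a, b in zip(r, window) if a != b)
        if mismatches <= k:
            hits.append(start + 1)
    return hits


if __name__ == "__main__":
    # Pattern, target and k taken from the search-tree slide later in the deck.
    print(hamming_occurrences("tcaca", "acagaca", 2))
```

This scan costs O(|r| · |s|) character comparisons, which is the cost an index-based approach tries to avoid.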

  4. Related Work
     - Exact string matching
       - On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick
       - Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT transformation (Burrows-Wheeler), hashing (Karp, Rabin)
     - Inexact string matching
       - String matching with k mismatches, Hamming distance (Landau, U. Vishkin; Amir et al.; Cole)
       - String matching with k differences, Levenshtein distance (Chang, Lampe)
       - String matching with wild-cards (Manber, Baeza-Yates)

  5. BWT-Index
     Burrows-Wheeler Transform (BWT) of s = a1 c1 a2 g1 a3 c2 a4 $, where the subscripts number the occurrences of each character in s.

     BWT construction (SA is the suffix array of s):
     - L[i] = $, if SA[i] = 1;
     - L[i] = s[SA[i] - 1], otherwise.

     Rank correspondence: rkF(e) = rkL(e). For example, a1 has rank 3 in both F and L.

     | i | sorted rotation | F | rkF | L | rkL |
     |---|-----------------|---|-----|---|-----|
     | 1 | $ a1 c1 a2 g1 a3 c2 a4 | $ | - | a4 | 1 |
     | 2 | a4 $ a1 c1 a2 g1 a3 c2 | a4 | 1 | c2 | 1 |
     | 3 | a3 c2 a4 $ a1 c1 a2 g1 | a3 | 2 | g1 | 1 |
     | 4 | a1 c1 a2 g1 a3 c2 a4 $ | a1 | 3 | $ | - |
     | 5 | a2 g1 a3 c2 a4 $ a1 c1 | a2 | 4 | c1 | 2 |
     | 6 | c2 a4 $ a1 c1 a2 g1 a3 | c2 | 1 | a3 | 2 |
     | 7 | c1 a2 g1 a3 c2 a4 $ a1 | c1 | 2 | a1 | 3 |
     | 8 | g1 a3 c2 a4 $ a1 c1 a2 | g1 | 1 | a2 | 4 |
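
The construction rule above can be checked with a small sketch. It builds the suffix array by plain sorting, which is only suitable for toy inputs; the names `bwt_index` and `occurrence_ranks` are hypothetical, and the sketch assumes that s ends with a `$` that sorts before every other character.

```python
def bwt_index(s):
    """Return (SA, F, L) for s, with SA holding 1-based suffix start positions."""
    # Naive suffix sorting; fine for short examples only.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    first = "".join(s[i] for i in order)
    # L[i] is the character preceding suffix SA[i]; for SA[i] = 1 this wraps to '$'.
    last = "".join(s[i - 1] for i in order)
    return [i + 1 for i in order], first, last


def occurrence_ranks(column):
    """Rank of each entry among equal characters, counted from the top of the column."""
    seen = {}
    ranks = []
    for ch in column:
        seen[ch] = seen.get(ch, 0) + 1
        ranks.append(seen[ch])
    return ranks


if __name__ == "__main__":
    sa, first, last = bwt_index("acagaca$")
    print(sa, first, last)
    print(occurrence_ranks(first), occurrence_ranks(last))
```

Unlike the slide, this sketch also assigns rank 1 to `$` instead of leaving it unranked.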

  6. Backward Search of BWT-Index
     s = a1 c1 a2 g1 a3 c2 a4 $

     search(z, π) = <z, [α, β]>, if z appears in L_π; ∅, otherwise.
     - z: a character
     - π: a range in F
     - L_π: a range in L, corresponding to π

     Backward search for p = aca uses the F and L columns together with the suffix array:

     | i | F | L | SA |
     |---|---|---|----|
     | 1 | $ | a4 | 8 |
     | 2 | a4 | c2 | 7 |
     | 3 | a3 | g1 | 5 |
     | 4 | a1 | $ | 1 |
     | 5 | a2 | c1 | 3 |
     | 6 | c2 | a3 | 6 |
     | 7 | c1 | a1 | 2 |
     | 8 | g1 | a2 | 4 |

  7. Backward Search of BWT-Index
     Search for p = aca, with the same F, L and suffix array columns as on the previous slide:
     - search(c, <a, [2, 5]>)
     - search(a, <c, [1, 2]>)

     Search sequence: <a, [2, 5]>, <c, [1, 2]>, <a, [3, 4]>
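
The backward search can be sketched as follows for exact matching. This is an illustration under assumptions rather than the authors' code: the function name `backward_search` is hypothetical, every step is reported as a 1-based inclusive row range in F (the slide mixes row ranges and rank ranges), and occurrences are counted by scanning L directly.

```python
def backward_search(p, s):
    """Exact search for p in s (s ends with '$').

    Returns (trace, positions): the F-row range after each step, as 1-based
    inclusive pairs, and the sorted 1-based start positions of p in s.
    """
    order = sorted(range(len(s)), key=lambda i: s[i:])
    first = "".join(s[i] for i in order)
    last = "".join(s[i - 1] for i in order)
    lo, hi = 0, len(s)              # current rows as a half-open range [lo, hi)
    trace = []
    for z in reversed(p):           # consume the pattern from its last character
        start = first.find(z)       # first row of F that holds z
        if start < 0:
            return trace, []
        # Occurrences of z in L above lo and above hi select the new rows.
        lo, hi = start + last[:lo].count(z), start + last[:hi].count(z)
        if lo >= hi:
            return trace, []
        trace.append((lo + 1, hi))
    return trace, sorted(order[row] + 1 for row in range(lo, hi))


if __name__ == "__main__":
    print(backward_search("aca", "acagaca$"))
```

For p = aca the final rows are 3 and 4, whose suffix array entries are 5 and 1, the two start positions of aca in s.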

  8. rankAll
     Arrange |Σ| arrays, one for each character X ∈ Σ, such that A_X[i] (the i-th entry in the array for X) is the number of appearances of X within L[1 .. i].

     Instead of scanning a certain segment L[α .. β] (α ≤ β) to find a subrange for a certain X ∈ Σ, we can simply look up A_X to see whether A_X[α - 1] = A_X[β]. If it is the case, then X does not occur in L[α .. β]. Otherwise, [A_X[α - 1] + 1, A_X[β]] should be the found range.

     Example: to find the first and the last appearance of c in L[2 .. 5], we only need to find A_c[2 - 1] = A_c[1] = 0 and A_c[5] = 2. So the corresponding range is [A_c[2 - 1] + 1, A_c[5]] = [1, 2].

     | i | A_$ | A_a | A_c | A_g | A_t | F | L |
     |---|-----|-----|-----|-----|-----|---|---|
     | 1 | 0 | 1 | 0 | 0 | 0 | $ | a4 |
     | 2 | 0 | 1 | 1 | 0 | 0 | a4 | c2 |
     | 3 | 0 | 1 | 1 | 1 | 0 | a3 | g1 |
     | 4 | 1 | 1 | 1 | 1 | 0 | a1 | $ |
     | 5 | 1 | 1 | 2 | 1 | 0 | a2 | c1 |
     | 6 | 1 | 2 | 2 | 1 | 0 | c2 | a3 |
     | 7 | 1 | 3 | 2 | 1 | 0 | c1 | a1 |
     | 8 | 1 | 4 | 2 | 1 | 0 | g1 | a2 |
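
The lookup described above can be sketched in a few lines. The names `rank_all` and `rank_range` are hypothetical, and an extra entry A_X[0] = 0 is assumed so that A_X[α - 1] is defined for α = 1.

```python
def rank_all(last, alphabet):
    """Build A[X][i] = number of appearances of X within L[1 .. i], with A[X][0] = 0."""
    arrays = {x: [0] * (len(last) + 1) for x in alphabet}
    for i, ch in enumerate(last, start=1):
        for x in alphabet:
            arrays[x][i] = arrays[x][i - 1] + (1 if ch == x else 0)
    return arrays


def rank_range(arrays, x, alpha, beta):
    """Rank range of x within L[alpha .. beta] (1-based, inclusive), or None if x is absent."""
    below, upto = arrays[x][alpha - 1], arrays[x][beta]
    if below == upto:
        return None                 # no appearance of x in the segment
    return (below + 1, upto)


if __name__ == "__main__":
    arrays = rank_all("acg$caaa", "$acgt")
    print(rank_range(arrays, "c", 2, 5))
```

Each lookup costs two array accesses, at the price of storing |Σ| counters per position of L.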

  9. Reduce rankAll-Index Size
     - F-ranks: F_α = <α; x_α, y_α>
     - BWT array: L
     - Reduced appearance array: A_α with bucket size β
     - Reduced suffix array: SA* with bucket size γ

     Find a range:
     - top ← F(x_α) + A_α[⌊(top - 1)/β⌋] + r + 1
     - bot ← F(x_α) + A_α[⌊bot/β⌋] + r'

     Here r is the number of α's appearances within L[⌊(top - 1)/β⌋β .. top - 1], and r' is the number of α's appearances within L[⌊bot/β⌋β .. bot].

     Example F-ranks: F_$ = <$; 1, 1>, F_a = <a; 2, 5>, F_c = <c; 6, 7>, F_g = <g; 8, 8>.

     [Figure: L and the reduced suffix array SA* shown next to the arrays A_$, A_a, A_c, A_g, A_t and the columns F, L, rk_L and SA for the example string.]
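
The bucket idea can be illustrated with a generic sampling sketch: keep one counter per bucket boundary and scan at most one bucket of L for the remainder. This shows the general technique only and does not reproduce the slide's exact indexing; `build_samples`, `occurrences` and the parameter `bucket` are hypothetical names.

```python
def build_samples(last, alphabet, bucket):
    """samples[x][b] = number of appearances of x within the first b * bucket entries of L."""
    samples = {x: [0] for x in alphabet}
    for b in range(1, len(last) // bucket + 1):
        block = last[(b - 1) * bucket:b * bucket]
        for x in alphabet:
            samples[x].append(samples[x][-1] + block.count(x))
    return samples


def occurrences(last, samples, bucket, x, i):
    """Number of appearances of x within L[1 .. i] (1-based, inclusive; i may be 0)."""
    b = i // bucket
    # Sampled count up to the bucket boundary plus a scan of the remainder.
    return samples[x][b] + last[b * bucket:i].count(x)


if __name__ == "__main__":
    last = "acg$caaa"
    samples = build_samples(last, "$acgt", 4)
    print(occurrences(last, samples, 4, "c", 5))
```

A larger bucket stores fewer counters but scans more of L per lookup, which is the space and time trade-off behind the reduced arrays.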

  10. String Matching with k Mismatches: Search Trees
      pattern: r = tcaca; target: s = acagaca; k = 2.

      [Figure: search tree T for r. The root is v0 = <-, [1, 8]>, and the levels below it are compared with r[1] = t, r[2] = c, r[3] = a, r[4] = c and r[5] = a in turn. The nodes v1 to v19 are labelled with pairs: <a, [1, 4]>, <c, [1, 2]> and <g, [1, 1]> on the first level; <c, [1, 2]>, <g, [1, 1]>, <a, [2, 3]> and <a, [4, 4]> on the second; one leaf is <$, [-, -]>. The paths of the tree are named P1, P2, P3 and P4.]
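
A search tree of this kind can be explored with a depth-first traversal that carries a mismatch budget. The sketch below makes several assumptions: it indexes the reverse of s so that r is consumed from left to right, it uses half-open row ranges instead of the rank pairs shown in the figure, it counts occurrences by scanning L, and it does not use mismatching trees. The name `k_mismatch_search` is hypothetical.

```python
def k_mismatch_search(r, s, k):
    """Return the sorted 1-based start positions in s where r occurs with at most k mismatches."""
    t = s[::-1] + "$"                       # index the reversed target; '$' must not occur in s
    order = sorted(range(len(t)), key=lambda i: t[i:])
    last = "".join(t[i - 1] for i in order)
    alphabet = sorted(set(s))
    # smaller[c] = number of characters in t that sort before c (the first F row of c).
    smaller = {c: sum(1 for ch in t if ch < c) for c in alphabet}
    n, m = len(s), len(r)
    hits = []

    def explore(depth, lo, hi, used):
        if depth == m:
            # Map each surviving row back to a start position in the original s.
            hits.extend(n - m - order[row] + 1 for row in range(lo, hi))
            return
        for c in alphabet:                  # one child per character, as in the search tree
            cost = used + (1 if c != r[depth] else 0)
            if cost > k:
                continue                    # prune: mismatch budget exhausted
            new_lo = smaller[c] + last[:lo].count(c)
            new_hi = smaller[c] + last[:hi].count(c)
            if new_lo < new_hi:
                explore(depth + 1, new_lo, new_hi, cost)

    explore(0, 0, len(t), 0)
    return sorted(hits)


if __name__ == "__main__":
    print(k_mismatch_search("tcaca", "acagaca", 2))
```

For the slide's example the three children of the root cover 4, 2 and 1 rows for a, c and g, matching the first level of the figure.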

  11. String Matching with k Mismatches: Mismatching Information
      - R: mismatching table for r, with |r| = m.
      - R_ij: contains the positions of the first 2k + 1 mismatches between r[i .. m - q + i] and r[j .. m - q + j], where q = max{i, j}, such that if R_ij[l] = x (x ≠ ∅), then r[i + x - 1] ≠ r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them.

      Example: r = tcacg, with r_i denoting the suffix r[i .. m].

      | entry | compared strings | mismatch positions |
      |-------|------------------|--------------------|
      | R_12 | r_1 = tcacg, r_2 = cacg | 1, 2, 3, 4 |
      | R_13 | r_1 = tcacg, r_3 = acg | 1, 3 |
      | R_14 | r_1 = tcacg, r_4 = cg | 1, 2 |
      | R_15 | r_1 = tcacg, r_5 = g | 1 |
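
The entries R_12, ..., R_1m of the example can be recomputed with a direct comparison. This is a minimal sketch that follows the definition literally; the function name `mismatch_row` and the dictionary keyed by j are assumptions.

```python
def mismatch_row(r, k):
    """Return {j: R_1j} for j = 2 .. m, each holding at most 2k + 1 mismatch positions."""
    m = len(r)
    table = {}
    for j in range(2, m + 1):
        overlap = m - j + 1                 # length of r[1 .. m - j + 1] and of r[j .. m]
        positions = [x for x in range(1, overlap + 1)
                     if r[x - 1] != r[j + x - 2]]   # r[x] vs r[j + x - 1] in 1-based terms
        table[j] = positions[:2 * k + 1]    # keep only the first 2k + 1 mismatches
    return table


if __name__ == "__main__":
    print(mismatch_row("tcacg", 2))
```

Computing all m - 1 entries this way takes O(m^2) comparisons, which is acceptable for short reads.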

  12. String Matching with k Mismatches: Derivation of Mismatching Information
      We store only part of the mismatching information, specifically R_12, …, R_1m; all the other mismatching information is derived dynamically.

      Example: derive the mismatching information between α1 = r[2 .. 5] = cacg and α2 = r[3 .. 5] = acg from A1 = R_12 = [1, 2, 3, 4] and A2 = R_13 = [1, 3], with a pointer p into A1 and a pointer q into A2.
      - Step 1: A1[p] = A2[q] = 1 and α1[1] = c ≠ α2[1] = a, so A = [1].
      - Step 2: A1[p] = 2 < A2[q] = 3, so 2 is copied: A = [1, 2].
      - Step 3: A1[p] = A2[q] = 3 and α1[3] = c ≠ α2[3] = g, so A = [1, 2, 3].

  13. String Matching with k Mismatches: Algorithm for Derivation of Mismatching Information
      Let α, α1 and α2 be three strings. Let A1 and A2 be two arrays containing all the positions of mismatches between α and α1, and between α and α2, respectively.

      Create a new array A such that if A[i] = j (j ≠ ∅), then α1[j] ≠ α2[j], or one of them does not exist; j is the i-th mismatch between α1 and α2.

      1. p := 1; q := 1; l := 1;
      2. If A2[q] < A1[p], then {A[l] := A2[q]; q := q + 1; l := l + 1;}
      3. If A1[p] < A2[q], then {A[l] := A1[p]; p := p + 1; l := l + 1;}
      4. If A1[p] = A2[q], then {if α1[A1[p]] ≠ α2[A2[q]], then {A[l] := A1[p]; l := l + 1;} p := p + 1; q := q + 1;}
      5. If p > |A1|, q > |A2|, or both A1[p] and A2[q] are ∅, stop. (If A1 or A2 has some remaining elements that are not ∅, first append them to the rear of A, and then stop.)
      6. Otherwise, go to (2).
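
The merge procedure can be sketched as a loop over two sorted position lists. This is an illustrative version under assumptions: Python lists stand in for the arrays (so the end of a list plays the role of ∅), positions are 1-based, at a shared position the characters at that position are compared, and a position past the end of a string counts as a non-existing character. The name `derive_mismatches` is hypothetical.

```python
def derive_mismatches(a1, a2, s1, s2):
    """Merge mismatch positions a1 (base vs s1) and a2 (base vs s2) into those of s1 vs s2."""
    def char_at(text, pos):
        return text[pos - 1] if pos <= len(text) else None   # None: character does not exist

    merged = []
    p = q = 0
    while p < len(a1) and q < len(a2):
        if a2[q] < a1[p]:
            merged.append(a2[q])        # only s2 differs from the base string here
            q += 1
        elif a1[p] < a2[q]:
            merged.append(a1[p])        # only s1 differs from the base string here
            p += 1
        else:
            pos = a1[p]                 # both differ from the base: compare them directly
            if char_at(s1, pos) != char_at(s2, pos):
                merged.append(pos)
            p += 1
            q += 1
    merged.extend(a1[p:])               # append whatever remains in either array
    merged.extend(a2[q:])
    return merged


if __name__ == "__main__":
    # A1 = R_12 and A2 = R_13 for r = tcacg, as in the worked example.
    print(derive_mismatches([1, 2, 3, 4], [1, 3], "cacg", "acg"))
```

The loop touches each array entry once, so deriving an entry costs O(|A1| + |A2|) instead of a fresh character-by-character comparison.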

  14. String Matching with k Mismatches: Derivation of Mismatching Information for Paths in a Search Tree
      This part of P3 will not be created. We derive the mismatching information for it according to P1 and R_21.

      [Figure: two paths P1 and P3 of the search tree below the root <-, [1, 8]>, aligned with r[1] = t, r[2] = c, r[3] = a, r[4] = c and r[5] = a; the nodes carry pairs such as <a, [1, 4]>, <c, [1, 2]>, <a, [2, 3]>, <g, [1, 1]>, <c, [2, 2]> and <a, [4, 4]>, and positions i and j are marked on the paths.]
