String Matching with Variable Length Gaps
By Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj and David Kofoed Wind
Presented by Hjalte Wedel Vildhøj
October 13, 2010
SPIRE 2010, Los Cabos, Mexico
The Variable Length Gap Problem
Given a string T ∈ Σ+ and a variable length gap pattern P = P1 · g{a1, b1} · P2 · g{a2, b2} · · · g{ak−1, bk−1} · Pk, find the end positions of all occurrences of P in T. Here g{ai, bi} matches any string x ∈ Σ∗ with ai ≤ |x| ≤ bi.
Example:
P = A · g{6, 7} · CC · g{2, 6} · GT
T = ATCGGCTCCAGACCAGTACCCGTTCCGTGGT
Solution (end positions in T): 17, 28, 31
- 17: gap lengths 6 and 6.
- 28: gap lengths 6 and 6, or 7 and 5.
- 31: gap lengths 6 and 3.
- Gap lengths 8 and 6 do not give a valid match, since the first gap must have length between 6 and 7.
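The problem statement can be checked against the example with a small reference solver. This is a minimal sketch, not the algorithm of the talk: it is a plain dynamic program over end positions, and the function name `vlg_end_positions` and its argument layout (a list of subpatterns plus a list of `(a, b)` gap bounds) are assumptions made for illustration.

```python
def vlg_end_positions(text, parts, gaps):
    """Return the 1-based end positions of all occurrences of the gapped pattern.

    parts: the strings P1..Pk; gaps: the (a_i, b_i) bounds between consecutive parts.
    """
    assert len(gaps) == len(parts) - 1
    n = len(text)
    # ends[e] is True if P1..Pi can be matched so that Pi ends at position e.
    ends = [False] * (n + 1)
    first = parts[0]
    for e in range(len(first), n + 1):
        ends[e] = text[e - len(first):e] == first
    for part, (a, b) in zip(parts[1:], gaps):
        new_ends = [False] * (n + 1)
        for e in range(len(part), n + 1):
            if text[e - len(part):e] != part:
                continue
            start = e - len(part) + 1      # 1-based start of this occurrence
            # The previous part must end between start-1-b and start-1-a.
            lo = max(start - 1 - b, 0)
            hi = start - 1 - a
            if any(ends[p] for p in range(lo, hi + 1)):
                new_ends[e] = True
        ends = new_ends
    return [e for e in range(1, n + 1) if ends[e]]


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(vlg_end_positions(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))  # [17, 28, 31]
```

The sketch trades efficiency for clarity: it rescans up to b − a + 1 earlier positions for every occurrence, so it is only meant as a correctness check on small inputs.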
A Closer Look At The Problem
P = A · g{6, 7} · CC · g{2, 6} · GT
T = ATCGGCTCCAGACCAGTACCCGTTCCGTGGT
[Figure: T with positions 1 to 31 and every occurrence of P1, P2 and P3 marked (5 of P1, 5 of P2, 4 of P3).]
Parameters
- n = |T|
- $m = \sum_{i=1}^{k} |P_i|$
- $A = \sum_{i=1}^{k-1} a_i$
- α = number of occurrences of P1, P2, . . . , Pk in T
Known Upper Bounds
By                    Time                                                          Space
Bille & Thorup [1]    $O\big(n(k \frac{\log w}{w} + \log k) + m \log m + A\big)$    O(m + A)
Morgante et al. [2]   O((n + m) log k + α)                                          O(m + α)
[1] P. Bille and M. Thorup. Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, 2010.
[2] M. Morgante, A. Policriti, N. Vitacolonna, and A. Zuccolo. Structured motifs search. J. Comput. Bio., 12(8):1065-1082, 2005.
Can you get the best of both?
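For the running example the four parameters can be computed directly. The helper names `count_occurrences` and `parameters` below are hypothetical, and A is taken as the sum of the lower bounds of the k − 1 gaps.

```python
def count_occurrences(text, part):
    """Count (possibly overlapping) occurrences of part in text."""
    return sum(text.startswith(part, i) for i in range(len(text) - len(part) + 1))


def parameters(text, parts, gaps):
    """Return (n, m, A, alpha) for a text, subpatterns P1..Pk and gap bounds."""
    n = len(text)                                            # n = |T|
    m = sum(len(p) for p in parts)                           # total subpattern length
    A = sum(a for a, _ in gaps)                              # sum of gap lower bounds
    alpha = sum(count_occurrences(text, p) for p in parts)   # occurrences of all Pi
    return n, m, A, alpha


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(parameters(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))  # (31, 5, 8, 14)
```

Here α = 14 comes from 5 occurrences of A, 5 of CC and 4 of GT, matching the marks in the figure.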
Illustrating the Algorithm
P = A · g{6, 7} · CC · g{2, 6} · GT
T = ATCGGCTCCAGACCAGTACCCGTTCCGTGGT
[Animation: T is scanned from position 1 to 31 with the occurrences of P1, P2 and P3 marked. Ranges are added to the lists L2 and L3 as occurrences are reported, and ranges marked "dead" are removed.]
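The list-based scan shown in the animation can be sketched as follows. Several details are assumptions, since the slides do not spell them out: R(x) is taken to be the interval [x + a + 1, x + b + 1] of start positions allowed for the next subpattern after an occurrence ending at x, overlapping or adjacent ranges are merged, and a naive comparison at every position stands in for the AC automaton. The name `vlg_match` is hypothetical.

```python
from collections import deque


def vlg_match(text, parts, gaps):
    """Report 1-based end positions of the gapped pattern in text."""
    k = len(parts)
    # lists[i] holds the live ranges of allowed start positions for parts[i].
    lists = [deque() for _ in range(k)]
    result = []
    for j in range(1, len(text) + 1):        # j = current end position in T
        for i, part in enumerate(parts):
            start = j - len(part) + 1
            if start < 1 or text[start - 1:j] != part:
                continue                      # parts[i] does not end at j
            if i > 0:
                ranges = lists[i]
                while ranges and ranges[0][1] < start:
                    ranges.popleft()          # dead: the range ends before start
                if not ranges or ranges[0][0] > start:
                    continue                  # no live range contains start
            if i == k - 1:
                result.append(j)              # a full occurrence of P ends at j
            else:
                a, b = gaps[i]
                lo, hi = j + a + 1, j + b + 1
                nxt = lists[i + 1]
                if nxt and lo <= nxt[-1][1] + 1:
                    nxt[-1][1] = hi           # merge with the most recent range
                else:
                    nxt.append([lo, hi])
    return result


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(vlg_match(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))  # [17, 28, 31]
```

Because occurrences of a fixed subpattern are reported with increasing start positions, a range that ends before the current start can never become useful again, which is why dead ranges are only ever removed from the front of a list.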
Time and Space
Claim: The algorithm runs in O((n + m) log k + α) time and uses O(m + A) space.
Time
◮ Processing T using the Aho-Corasick (AC) automaton takes O((n + m) log k + α) time.
◮ At most α ranges are added and removed, so O(α) extra time is spent maintaining the lists.
Space
◮ The AC automaton takes O(m) space.
◮ How much space is used by L2, . . . , Lk?
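To make the role of the AC automaton concrete, here is a minimal Aho-Corasick sketch that reports every occurrence of every subpattern in one pass over T. It stores transitions in Python dictionaries, which is a simplification and not necessarily the representation behind the log k factor in the claim. The names `build_automaton` and `find_occurrences` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton for a list of non-empty patterns."""
    goto = [{}]   # goto[s][ch] -> child state in the trie
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns that end at each state
    for idx, pat in enumerate(patterns):
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches of the suffix state
    return goto, fail, out


def find_occurrences(text, patterns):
    """Yield (end_position, pattern_index) pairs; end positions are 1-based."""
    goto, fail, out = build_automaton(patterns)
    s = 0
    for j, ch in enumerate(text, start=1):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            yield j, idx


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
occ = list(find_occurrences(T, ["A", "CC", "GT"]))
print(len(occ))  # 14
```

The 14 reported pairs are the α occurrences that drive the range lists; each one triggers at most one insertion and one validity check.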
Maximum Size of Li
[Figure: a stretch of T (axis: position in T) showing an occurrence of Pi−1 ending at x1 with its range R(x1), an occurrence of Pi, and later occurrences of Pi−1 up to xℓ with ranges R(x2), . . . , R(xℓ). Labelled lengths: d, |Pi| − 1, ai−1, ci−1 and bi−1 + 1. Annotations: "x1 is reported and R(x1) is added to Li" and "Last position where R(x1) is still alive".]
$|L_i| \le \left\lceil \frac{d}{c_{i-1} + 1} \right\rceil + 1 = \left\lceil \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rceil = O(|P_i| + a_{i-1})$.
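The last step of the bound can be checked term by term. This is a short derivation under two assumptions: the brackets in the bound are read as ceilings, and c_{i−1} ≥ 0.

```latex
\[
\left\lceil \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rceil
< \frac{2c_{i-1}}{c_{i-1} + 1} + \frac{|P_i| + a_{i-1}}{c_{i-1} + 1} + 1
< 2 + \bigl(|P_i| + a_{i-1}\bigr) + 1
= O(|P_i| + a_{i-1}).
\]
```

The first inequality uses ⌈x⌉ < x + 1. The second uses 2c/(c + 1) < 2 and the fact that dividing by c + 1 ≥ 1 cannot increase |Pi| + ai−1.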
Total space: $\sum_{i=2}^{k} |L_i| = O\Big( \sum_{i=2}^{k} |P_i| + \sum_{i=1}^{k-1}$