Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester
Suffix Trees (and other string indexes)
Some of these slides are based on slides by Jens Stoye.
Text indexes
Let T be a string of length n over an alphabet Σ; in the following we refer to T as the text. A text index (or string index) is a data structure built on the text that answers a certain type of query (e.g. pattern matching) without traversing the whole text. Typically, we want
1. the index not to use too much space (linear or sublinear in n), and
2. the query time to be fast (ideally: independent of n).
A common string problem: Pattern matching
Pattern matching (aka exact string matching) is at the core of almost every text-managing application.
Pattern matching: Given a (typically long) string T (the text) and a (typically much shorter) string P (the pattern) over the same alphabet Σ, find all occurrences of P as a substring of T.
Variants:
• output all occurrences of P in T ("all-occurrences version")
• decide whether P occurs in T (yes/no) ("decision version")
• output the number of occurrences of P in T ("counting version")
We usually refer to the number of occurrences of P as occ_P.
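The three variants can be illustrated with a naive scan over the text. This is only a sketch for fixing the definitions: the function names are made up for this example, positions are reported 1-based, and the running time is O(n · m), not the linear time of the better algorithms.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """All-occurrences version: 1-based start positions of pattern in text."""
    n, m = len(text), len(pattern)
    # Try every possible start position and compare the window with the pattern.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


def occurs(text: str, pattern: str) -> bool:
    """Decision version: does pattern occur in text at all?"""
    return len(occurrences(text, pattern)) > 0


def count(text: str, pattern: str) -> int:
    """Counting version: the number of occurrences, i.e. occ_P."""
    return len(occurrences(text, pattern))


if __name__ == "__main__":
    print(occurrences("ACAGGGCA", "CA"))
    print(occurs("ACAGGGCA", "TT"))
    print(count("ACAGGGCA", "GG"))
```

Note that occurrences may overlap: both occurrences of ANA in BANANA are reported.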
Pattern matching
Pattern matching (p.m.): text T = T_1 ... T_n of length n, pattern P = P_1 ... P_m of length m.
• The best non-index-based algorithms solve this problem in time O(n + m) (e.g. Knuth-Morris-Pratt).
• This is optimal, since one has to read both strings at least once.
• But time linear in n for every query is not tolerable with the data sizes we are seeing now!
• That is why we need text indexes.
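For reference, here is a minimal sketch of Knuth-Morris-Pratt, the O(n + m) algorithm named above. It is a standard textbook formulation, not taken from the slides; the names failure_function and kmp_search are chosen for this example, and positions are reported 1-based.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper border of pattern[0..i]."""
    fail = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # 0-based start is i - k + 1
            k = fail[k - 1]  # continue, allowing overlapping occurrences
    return hits


if __name__ == "__main__":
    print(kmp_search("BANANA", "ANA"))
```

The text is read once from left to right and never re-read, which is where the O(n + m) bound comes from; the cost is still paid in full for every new pattern.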
The k-mer index
The k-mer index
Recall that a k-mer (or k-gram) is a string of length k.
Earlier in this course, we saw the k-mer profile P_k(s) (or q-gram profile) of a string s.
Ex. s = ACAGGGCA; the table shows P_2(s), with the 2-mers u_r listed in lexicographic order (r = 0, ..., 15).

 r   u_r  P_2(s)
 0   AA   0
 1   AC   1
 2   AG   1
 3   AT   0
 4   CA   2
 5   CC   0
 6   CG   0
 7   CT   0
 8   GA   0
 9   GC   1
10   GG   2
11   GT   0
12   TA   0
13   TC   0
14   TG   0
15   TT   0
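The profile in the table can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function name kmer_profile and the default DNA alphabet are chosen for this example, and s is assumed to contain only characters of that alphabet.

```python
from itertools import product


def kmer_profile(s: str, k: int, alphabet: str = "ACGT") -> list[int]:
    """Return P_k(s) as a list indexed by the rank r of each k-mer u_r."""
    # product() over a sorted alphabet yields the k-mers in lexicographic
    # order, so the list index of a k-mer is its rank r.
    kmers = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    counts = {u: 0 for u in kmers}
    for i in range(len(s) - k + 1):
        counts[s[i:i + k]] += 1  # KeyError if s leaves the alphabet
    return [counts[u] for u in kmers]


if __name__ == "__main__":
    print(kmer_profile("ACAGGGCA", 2))
```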
The k-mer index
k-mer index of s: Replacing the number of occurrences by the occurrences themselves, we get the k-mer index of s.
Ex. s = ACAGGGCA; the table shows the 2-mer index of s (positions are 1-based, "-" marks a 2-mer that does not occur).

 r   u_r  occurrences
 0   AA   -
 1   AC   1
 2   AG   3
 3   AT   -
 4   CA   2, 7
 5   CC   -
 6   CG   -
 7   CT   -
 8   GA   -
 9   GC   6
10   GG   4, 5
11   GT   -
12   TA   -
13   TC   -
14   TG   -
15   TT   -

Analysis (for p.m.)
Space: total space is O(σ^k + n), where σ = |Σ|, since the number of rows is σ^k and the total number of entries is n − k + 1.
Time (p.m.): O(k) for decision, O(k + occ_P) for all-occurrences.
N.B.: works only for patterns of length exactly k.
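A k-mer index along these lines might be sketched as follows. One deliberate deviation from the slide: instead of an array with one row per rank r (σ^k rows), this sketch uses a hash table that stores only the k-mers that actually occur, so the empty rows take no space. The names build_kmer_index and lookup are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(s: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer occurring in s to its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i + 1)
    return dict(index)


def lookup(index: dict[str, list[int]], k: int, pattern: str) -> list[int]:
    """All-occurrences query; only defined for patterns of length exactly k."""
    if len(pattern) != k:
        raise ValueError("the k-mer index only answers queries of length k")
    return index.get(pattern, [])


if __name__ == "__main__":
    idx = build_kmer_index("ACAGGGCA", 2)
    print(lookup(idx, 2, "CA"))
    print(lookup(idx, 2, "TT"))
```

The query never touches the text itself: hashing the pattern takes O(k) time and reporting the stored positions takes O(occ_P), which matches the analysis above.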
The suffix tree
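The suffix tree
T = BANANA$ (add sentinel character $ ∉ Σ)
[Figure: the suffix tree of BANANA$, drawn twice. In one panel ("labels only conceptual!") the edges carry their labels as strings ($, A, NA, NA$, BANANA$); in the other ("two pointers into string") each edge carries a pair of positions into the string instead ([7,7], [2,2], [3,4], [5,7], [1,7]). The leaves are numbered 1 to 7.]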
The suffix tree
Given a string T over Σ (a finite ordered alphabet), and $ ∉ Σ.
Definitions
• ST(T) is a rooted tree with edge labels from (Σ ∪ {$})^+ such that
  • the labels of all edges outgoing from a node begin with different characters;
  • the paths from the root to the leaves of ST(T) spell the suffixes of T$;
  • each node in ST(T) is either the root, a leaf, or a branching node;
• L(u) is the path-label of node u: the concatenation of the edge labels on the path from the root to u;
• a leaf v has leaf-label i if and only if L(v) = T_i ... T_n $ (the i'th suffix);
• sd(v), the string-depth of a node v, is the length of its path-label;
• a locus (u, d) is a position on an edge (v, u), where u is a node of ST(T) and sd(v) < d ≤ sd(u); d is the string-depth of the locus (u, d).
• N.B.: the edge labels are not stored explicitly:
  • they are represented by two pointers [b, e] into T: the beginning and end of an occurrence of the edge label;
  • this representation is not necessarily unique;
  • e.g. in the example, any edge with label NA can be represented by [3,4] or [5,6].
[Figure: the suffix tree of BANANA$ with conceptual edge labels ("labels only conceptual!") next to the same tree with each edge label replaced by its pointer pair ("two pointers into string").]
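The definitions and the pointer representation can be made concrete with a small construction. The sketch below inserts the suffixes one at a time, which takes quadratic time in the worst case; it is meant for illustration only and is not one of the linear-time construction algorithms. The class Node and the functions build_suffix_tree and leaves are names invented for this example, and all positions are 1-based.

```python
class Node:
    def __init__(self, b: int = 0, e: int = 0):
        self.b = b              # edge label into this node is t[b..e], 1-based
        self.e = e
        self.children = {}      # first character of edge label -> child node
        self.leaf_label = None  # i if this leaf spells the i'th suffix


def build_suffix_tree(text: str) -> Node:
    """Naive construction of ST(text): insert the suffixes of text$ in turn."""
    if "$" in text:
        raise ValueError("the sentinel $ must not occur in the text")
    t = text + "$"
    n = len(t)
    root = Node()
    for i in range(1, n + 1):
        node, j = root, i       # j = next position of suffix i to be matched
        while True:
            c = t[j - 1]
            child = node.children.get(c)
            if child is None:   # no edge starts with c: hang a new leaf here
                leaf = Node(j, n)
                leaf.leaf_label = i
                node.children[c] = leaf
                break
            k = child.b         # walk along the edge label t[child.b..child.e]
            while k <= child.e and t[k - 1] == t[j - 1]:
                k += 1
                j += 1
            if k > child.e:     # whole edge matched: continue below the child
                node = child
                continue
            # Mismatch inside the edge: split it at k with a branching node.
            mid = Node(child.b, k - 1)
            node.children[c] = mid
            child.b = k
            mid.children[t[k - 1]] = child
            leaf = Node(j, n)
            leaf.leaf_label = i
            mid.children[t[j - 1]] = leaf
            break
    return root


def leaves(node: Node, t: str, path: str = ""):
    """Yield (leaf-label, path-label) for every leaf below node."""
    if not node.children:
        yield node.leaf_label, path
    for c in sorted(node.children):
        child = node.children[c]
        yield from leaves(child, t, path + t[child.b - 1:child.e])


if __name__ == "__main__":
    root = build_suffix_tree("BANANA")
    for label, path in sorted(leaves(root, "BANANA$")):
        print(label, path)
```

Each node stores only the two integers b and e for its incoming edge, so the space per edge is constant no matter how long its label is. For BANANA$ this construction happens to choose [3,4] for both NA edges; [5,6] would describe the same labels.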
Suffix tree properties
• The leaves of ST(T) correspond to the suffixes of T$.