The q-gram distance

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

The q-gram distance

• In many situations, edit distance is a good model for differences / similarity between strings.
• But sometimes, other distance functions serve the purpose better.

Motivations for using q-gram distance

1. If two parts of a sequence are exchanged (e.g. two paragraphs, two long substrings, two genes), then one can argue that the resulting strings still have high similarity; however, the edit distance will be big. The q-gram distance can be more appropriate in this case.
2. The edit distance needs quadratic computation time, but this is often too slow. The q-gram distance can be computed in linear time.

What is a q-gram?

Let $\Sigma$ be the alphabet, with $|\Sigma| = \sigma$.

Def. A q-gram is a string of length $q$.

Note. q-grams are also called k-mers, w-words, or k-tuples. Typically, $q$ (or $k$, $w$, etc.) is small, much smaller than the strings we will want to compare.

We will fix $q$, and use the number of occurrences of q-grams to compute distances between strings.
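As a small illustration of the definition, the following Python sketch lists the q-grams of a string with a sliding window. The function name `qgrams` is a hypothetical helper chosen for this example, not something from the lecture.

```python
def qgrams(s, q):
    """Return all q-grams of s from left to right, repeats included."""
    if q <= 0:
        raise ValueError("q must be positive")
    # One q-gram starts at each 0-based position 0 .. len(s) - q.
    # If len(s) < q, the range is empty and so is the result.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("ACAGGGCA", 2))
# ['AC', 'CA', 'AG', 'GG', 'GG', 'GC', 'CA']
```

A string of length n has n − q + 1 q-gram occurrences (here 8 − 2 + 1 = 7), one for each position at which a window of size q fits.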
Occurrence count

Let $s$ be a string of length $n \ge q$, and $u$ be a q-gram. The occurrence count of $u$ in $s$ is

$$N(s, u) = |\{\, i : s_i \ldots s_{i+q-1} = u \,\}|,$$

the number of times q-gram $u$ occurs in $s$.

Ex. Let $s$ = ACAGGGCA and $q = 2$. Then $N(s, \mathrm{AC}) = N(s, \mathrm{AG}) = N(s, \mathrm{GC}) = 1$, $N(s, \mathrm{CA}) = N(s, \mathrm{GG}) = 2$, and for all other q-grams $u$ over $\Sigma$, $N(s, u) = 0$.

q-gram profile

Fix some enumeration (listing) of $\Sigma^q$, i.e. some order in which we want to list all q-grams; e.g. the lexicographic order.

Def. Let $s$ be a string over $\Sigma$, $|s| \ge q$. The q-gram profile of $s$, $P_q(s)$, is an array of size $\sigma^q$, where the $i$th entry is

$$P_q(s)[i] = N(s, u_i),$$

and $u_i$ is the $i$th q-gram in the enumeration.

q-gram distance

(Introduced by Ukkonen, 1992)

Def.: Given two strings $s, t$, the q-gram distance of $s$ and $t$ is

$$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{u \in \Sigma^q} |N(s, u) - N(t, u)|.$$

Equivalent def.: Given two strings $s, t$, the q-gram distance of $s$ and $t$ is

$$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{i=1}^{\sigma^q} |P_q(s)[i] - P_q(t)[i]|,$$

which is the Manhattan distance of the q-gram profiles of $s$ and $t$. (The Manhattan distance, or $L_1$-distance, of two vectors $x, y \in \mathbb{R}^n$ is defined as $\sum_{i=1}^{n} |x_i - y_i|$.)

Example:

Let $\Sigma = \{A, C, G, T\}$ and $q = 2$. Let

s = ACAGGGCA,
t = GGGCAACA,
v = AAGGACA.

Then the q-gram profiles of $s, t, v$ are:

u     P_q(s)  P_q(t)  P_q(v)
AA      0       1       1
AC      1       1       1
AG      1       0       1
AT      0       0       0
CA      2       2       1
CC      0       0       0
CG      0       0       0
CT      0       0       0
GA      0       0       1
GC      1       1       0
GG      2       2       1
GT      0       0       0
TA      0       0       0
TC      0       0       0
TG      0       0       0
TT      0       0       0

Notice that the sum of all entries of $P_q(s)$ = $|s| - q + 1$ = total number of q-gram occurrences in $s$ = number of distinct positions in $s$ where a q-gram starts.

In the previous example ($q = 2$, s = ACAGGGCA, t = GGGCAACA, and v = AAGGACA), we have

$$\mathrm{dist}_{2\text{-gram}}(s, t) = 2, \quad \mathrm{dist}_{2\text{-gram}}(s, v) = 5, \quad \text{and} \quad \mathrm{dist}_{2\text{-gram}}(t, v) = 5.$$

Note that it is possible to have distinct strings with q-gram distance 0, e.g. for w = AGGGCACA, we have $\mathrm{dist}_{2\text{-gram}}(s, w) = 0$. (Don't just believe this, double check it!)

The q-gram distance is a pseudo-metric

Lemma
The q-gram distance is a pseudo-metric, i.e. it is non-negative, symmetric, and obeys the triangle inequality (but it is possible to have $x \ne y$ with $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$).

Proof: The three properties follow from the fact that the Manhattan metric is a metric. The example above shows that $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$ does not imply $x = y$.

Exercise: Prove the lemma explicitly.
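The definitions of occurrence count, q-gram profile and q-gram distance can be checked against the examples with a short Python sketch. It stores the profile sparsely in a `collections.Counter` rather than in an array of size σ^q; that is a convenience chosen for this illustration, and the names `qgram_profile` and `qgram_distance` are hypothetical.

```python
from collections import Counter


def qgram_profile(s, q):
    """Sparse q-gram profile: maps each q-gram u occurring in s to N(s, u)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(s, t, q):
    """Manhattan distance between the q-gram profiles of s and t."""
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    # A Counter returns 0 for absent keys, so it is enough to sum over
    # the q-grams that occur in at least one of the two strings.
    return sum(abs(ps[u] - pt[u]) for u in ps.keys() | pt.keys())


s, t, v, w = "ACAGGGCA", "GGGCAACA", "AAGGACA", "AGGGCACA"
print(qgram_distance(s, t, 2))  # 2
print(qgram_distance(s, v, 2))  # 5
print(qgram_distance(t, v, 2))  # 5
print(qgram_distance(s, w, 2))  # 0, although s != w
```

The last line reproduces the pseudo-metric observation: s and w have identical 2-gram profiles but are different strings.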
Connection to edit distance

q-gram Lemma
Let $d_{\mathrm{edit}}(s, t)$ denote the (unit-cost) edit distance of $s$ and $t$. Then

$$\frac{\mathrm{dist}_{q\text{-gram}}(s, t)}{2q} \le d_{\mathrm{edit}}(s, t).$$

Proof
Every edit operation contributes to the q-gram distance at most $2q$: Consider the simplest case, a substitution in position $i$ of $s$, where character $s_i$ is substituted by character $x$, and let $s'$ be the resulting string. If $q \le i \le |s| - q + 1$, then there are exactly $q$ q-grams of $s$ affected by the substitution: $s_{i-q+1} \ldots s_i$, up to $s_i \ldots s_{i+q-1}$ (otherwise fewer); the counts of all these are decremented by 1, while the counts of the new q-grams $s_{i-q+1} \ldots s_{i-1} x$, up to $x\, s_{i+1} \ldots s_{i+q-1}$, are incremented by 1. Therefore, $\mathrm{dist}_{q\text{-gram}}(s, s') \le 2q$ (it could be less because these q-grams need not be all distinct). For a deletion, the number of q-grams whose count is decremented is at most $q$, while those whose count is incremented is at most $q - 1$; for an insertion the other way around. The claim follows by induction on the number of edit operations, using the triangle inequality for the q-gram distance.

Examples
With the earlier examples, we have

1. Exchange of two long substrings: $d_{\mathrm{edit}}(s, t) = 6$, $d_{\mathrm{edit}}(s, w) = 4$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, t) = 2$, $\mathrm{dist}_{q\text{-gram}}(s, w) = 0$, with $q = 2$).
2. The q-gram distance is at most $2q$ times edit distance (q-gram lemma): $d_{\mathrm{edit}}(s, v) = 2$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, v) = 5 \le 8 = d_{\mathrm{edit}}(s, v) \cdot 2q$, with $q = 2$).

Based on the q-gram lemma and the fact that the q-gram distance can be computed in linear time, we can use the q-gram distance as a filter for edit distance computations.

Computation of the q-gram distance

Basic ideas
• Use a sliding window of size $q$ over $s$ and $t$
• Use an array $d_q$ of size $\sigma^q$
• First slide a window over $s$, increment respective entry for every q-gram seen
• Then slide over $t$, decrement respective entry for every q-gram seen
• Now $d_q[r] = N(s, u_r) - N(t, u_r)$.
• Sum up the absolute values of the entries: $\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_i |d_q[i]|$

We will see: This algorithm runs in linear time.
But: how do we know where to find the entry for the current q-gram? This is called ranking (coming soon).

Algorithm for computing q-gram distance
input: Strings $s, t$ of length $|s| = n$ and $|t| = m$
output: $\mathrm{dist}_{q\text{-gram}}(s, t)$

1. initialize $d_q[0 \ldots \sigma^q - 1]$ with 0s
2. for $i = 1, \ldots, n - q + 1$:
       $r \leftarrow \mathrm{rank}(s_i \ldots s_{i+q-1})$
       $d_q[r] \leftarrow d_q[r] + 1$
3. for $i = 1, \ldots, m - q + 1$:
       $r \leftarrow \mathrm{rank}(t_i \ldots t_{i+q-1})$
       $d_q[r] \leftarrow d_q[r] - 1$
4. $d \leftarrow 0$
5. for $i = 0, \ldots, \sigma^q - 1$: $d \leftarrow d + |d_q[i]|$
6. return $d$

Example:

s = ACAGGGCA,
t = GGGCAACA.

The table shows the array $d_q$ after line 2 of the algorithm (now $d_q$ equals $P_q(s)$) and after line 3.

r    u_r   d_q after the pass thru s   d_q after the pass thru t
0    AA              0                          -1
1    AC              1                           0
2    AG              1                           1
3    AT              0                           0
4    CA              2                           0
5    CC              0                           0
6    CG              0                           0
7    CT              0                           0
8    GA              0                           0
9    GC              1                           0
10   GG              2                           0
11   GT              0                           0
12   TA              0                           0
13   TC              0                           0
14   TG              0                           0
15   TT              0                           0

Finally, we have $\mathrm{dist}_{2\text{-gram}}(s, t) = |-1| + |1| = 2$.

Goal
Given q-gram $u$, we want to know which entry of the array $u$ corresponds to.
Ex.: Where is the q-gram CG? In position 6.

Ranking functions
• A ranking function is a bijection $\mathrm{rank} : \Sigma^q \to [0 \ldots \sigma^q - 1]$.
• $\mathrm{rank}(u)$ gives us the position of $u$ in the enumeration of $\Sigma^q$
• needs to be very efficiently computable
• the ranking function we use will give us constant time per q-gram of $s$
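The array-based algorithm and the idea of a ranking function can be sketched in Python as follows. The concrete ranking used here is an assumption, since the section has not yet stated which ranking function the course uses: it reads a q-gram as a base-σ number in lexicographic order (A < C < G < T), which does agree with the example rank(CG) = 6. The names `rank` and `qgram_distance_ranked` are hypothetical.

```python
def rank(u, alphabet="ACGT"):
    """Assumed ranking: read u as a base-sigma number (lexicographic order)."""
    sigma = len(alphabet)
    r = 0
    for c in u:
        r = r * sigma + alphabet.index(c)
    return r


def qgram_distance_ranked(s, t, q, alphabet="ACGT"):
    """q-gram distance using an array d of size sigma**q indexed by rank."""
    sigma = len(alphabet)
    code = {c: k for k, c in enumerate(alphabet)}
    d = [0] * (sigma ** q)      # step 1: all entries 0
    top = sigma ** (q - 1)      # place value of the leading character

    def slide(x, delta):
        """Add delta to the entry of every q-gram of x."""
        if len(x) < q:
            return
        r = rank(x[:q], alphabet)   # rank of the first window, O(q)
        d[r] += delta
        for c in x[q:]:
            # Drop the leading character, shift, append c: O(1) per window.
            r = (r % top) * sigma + code[c]
            d[r] += delta

    slide(s, +1)                # step 2: increment for q-grams of s
    slide(t, -1)                # step 3: decrement for q-grams of t
    return sum(abs(e) for e in d)   # steps 4-6


print(rank("CG"))                                       # 6
print(qgram_distance_ranked("ACAGGGCA", "GGGCAACA", 2))  # 2
```

In this sketch each window after the first is ranked in constant time from the previous one, so the two passes take O(n + m) time; the final summation visits all σ^q array entries.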