
The q-gram distance

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

• In many situations, edit distance is a good model for differences / similarity between strings.
• But sometimes, other distance functions serve the purpose better.

Motivations for using q-gram distance

1. If two parts of a sequence are exchanged (e.g. two paragraphs, two long substrings, two genes), then one can argue that the resulting strings still have high similarity; however, the edit distance will be big. The q-gram distance can be more appropriate in this case.
2. The edit distance needs quadratic computation time, but this is often too slow. The q-gram distance can be computed in linear time.

What is a q-gram?

Let $\Sigma$ be the alphabet, with $|\Sigma| = \sigma$.

Def. A q-gram is a string of length q.

Note. q-grams are also called k-mers, w-words, or k-tuples. Typically, q (or k, w, etc.) is small, much smaller than the strings we will want to compare.

We will fix q, and use the number of occurrences of q-grams to compute distances between strings.

Occurrence count

Let s be a string of length $n \ge q$, and u be a q-gram. The occurrence count of u in s is

$N(s, u) = |\{\, i : s_i \ldots s_{i+q-1} = u \,\}|$,

the number of times q-gram u occurs in s.
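The occurrence count defined above can be sketched in a few lines of Python. This is only an illustration: the function name `occurrence_count` is an assumption, not part of the slides, and the string s = ACAGGGCA with q = 2 is used as input.

```python
def occurrence_count(s: str, u: str) -> int:
    """N(s, u): number of start positions i with s[i:i+q] == u, where q = len(u)."""
    q = len(u)
    # Overlapping occurrences count separately, so every window of length q is checked.
    return sum(1 for i in range(len(s) - q + 1) if s[i:i + q] == u)


s = "ACAGGGCA"
for u in ("AC", "AG", "GC", "CA", "GG", "TT"):
    print(u, occurrence_count(s, u))
```

Checking every window against u costs O(n · q) time per q-gram, so this version is meant for understanding the definition, not for computing whole profiles.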

Ex. (occurrence counts) Let s = ACAGGGCA and q = 2, and let N(s, u) denote the number of times the q-gram u occurs in s. Then N(s, AC) = N(s, AG) = N(s, GC) = 1, N(s, CA) = N(s, GG) = 2, and for all other q-grams u over $\Sigma$, N(s, u) = 0.

q-gram profile

Fix some enumeration (listing) of $\Sigma^q$, i.e. some order in which we want to list all q-grams; e.g. the lexicographic order.

Def. Let s be a string over $\Sigma$, $|s| \ge q$. The q-gram profile of s, $P_q(s)$, is an array of size $\sigma^q$, where the i-th entry is

$P_q(s)[i] = N(s, u_i)$,

and $u_i$ is the i-th q-gram in the enumeration.

q-gram distance

(Introduced by Ukkonen, 1992)

Def.: Given two strings s, t, the q-gram distance of s and t is

$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{u \in \Sigma^q} |N(s, u) - N(t, u)|$.

Equivalent def.: Given two strings s, t, the q-gram distance of s and t is

$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{i=1}^{\sigma^q} |P_q(s)[i] - P_q(t)[i]|$,

which is the Manhattan distance of the q-gram profiles of s and t. (The Manhattan distance, or $L_1$-distance, of two vectors $x, y \in \mathbb{R}^n$ is defined as $\sum_{i=1}^{n} |x_i - y_i|$.)

Example:

Let $\Sigma$ = {A, C, G, T} and q = 2. Let

s = ACAGGGCA,
t = GGGCAACA,
v = AAGGACA.

Then the q-gram profiles of s, t, v are:

u    P_q(s)  P_q(t)  P_q(v)
AA   0       1       1
AC   1       1       1
AG   1       0       1
AT   0       0       0
CA   2       2       1
CC   0       0       0
CG   0       0       0
CT   0       0       0
GA   0       0       1
GC   1       1       0
GG   2       2       1
GT   0       0       0
TA   0       0       0
TC   0       0       0
TG   0       0       0
TT   0       0       0

Notice that the sum of all entries of $P_q(s)$ equals $|s| - q + 1$ = total number of q-gram occurrences in s = number of distinct positions in s where a q-gram starts.

In this example (q = 2, s = ACAGGGCA, t = GGGCAACA, and v = AAGGACA), we have

$\mathrm{dist}_{2\text{-gram}}(s, t) = 2$, $\mathrm{dist}_{2\text{-gram}}(s, v) = 5$, and $\mathrm{dist}_{2\text{-gram}}(t, v) = 5$.

Note that it is possible to have distinct strings with q-gram distance 0, e.g. for w = AGGGCACA, we have $\mathrm{dist}_{2\text{-gram}}(s, w) = 0$. (Don't just believe this, double check it!)

The q-gram distance is a pseudo-metric

Lemma. The q-gram distance is a pseudo-metric, i.e. it is non-negative, symmetric, and obeys the triangle inequality (but it is possible to have $x \ne y$ with $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$).

Proof: The three properties follow from the fact that the Manhattan metric is a metric. The example above shows that $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$ does not imply x = y.

Exercise: Prove the lemma explicitly.
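To double check the example values, the profile and distance definitions above can be sketched in Python. This is a minimal sketch, not the method from the slides: it stores the profile sparsely in a `collections.Counter` instead of an array of size σ^q, and the names `qgram_profile` and `qgram_distance` are illustrative assumptions.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Sparse q-gram profile: maps each q-gram occurring in s to N(s, u)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(s: str, t: str, q: int) -> int:
    """Manhattan (L1) distance between the q-gram profiles of s and t."""
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    # q-grams absent from both strings contribute 0, so the union of keys suffices.
    # A Counter returns 0 for a missing key.
    return sum(abs(ps[u] - pt[u]) for u in ps.keys() | pt.keys())


s, t, v, w = "ACAGGGCA", "GGGCAACA", "AAGGACA", "AGGGCACA"
print(qgram_distance(s, t, 2))  # 2
print(qgram_distance(s, v, 2))  # 5
print(qgram_distance(t, v, 2))  # 5
print(qgram_distance(s, w, 2))  # 0, although s != w
```

Because each q-gram is sliced out as a separate string, this sketch takes O((n + m) · q) time rather than strictly linear time; it is intended for checking small examples by machine.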

Connection to edit distance

q-gram Lemma. Let $d_{\mathrm{edit}}(s, t)$ denote the (unit-cost) edit distance of s and t. Then

$\dfrac{\mathrm{dist}_{q\text{-gram}}(s, t)}{2q} \le d_{\mathrm{edit}}(s, t)$.

Proof. Every edit operation contributes at most 2q to the q-gram distance. Consider the simplest case, a substitution in position i of s, where character $s_i$ is substituted by character x, and let $s'$ be the resulting string. If $q \le i \le |s| - q + 1$, then there are exactly q q-grams of s affected by the substitution: $s_{i-q+1} \ldots s_i$, up to $s_i \ldots s_{i+q-1}$ (otherwise fewer); the counts of all these are decremented by 1, while the counts of the new q-grams $s_{i-q+1} \ldots s_{i-1} x$, up to $x\, s_{i+1} \ldots s_{i+q-1}$, are incremented by 1. Therefore, $\mathrm{dist}_{q\text{-gram}}(s, s') \le 2q$ (it could be less because these q-grams need not be all distinct). For a deletion, the number of q-grams whose count is decremented is at most q, while the number of those whose count is incremented is at most q − 1; for an insertion the other way around. The claim follows by induction on the number of edit operations, using the triangle inequality.

Examples

With the earlier examples, we have

1. Exchange of two long substrings: $d_{\mathrm{edit}}(s, t) = 6$, $d_{\mathrm{edit}}(s, w) = 4$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, t) = 2$, $\mathrm{dist}_{q\text{-gram}}(s, w) = 0$, with q = 2).
2. The q-gram distance is at most 2q times edit distance (q-gram lemma): $d_{\mathrm{edit}}(s, v) = 2$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, v) = 5 \le 8 = d_{\mathrm{edit}}(s, v) \cdot 2q$, with q = 2).

Based on the q-gram lemma and the fact that the q-gram distance can be computed in linear time, we can use the q-gram distance as a filter for edit distance computations.

Computation of the q-gram distance

Basic ideas

• Use a sliding window of size q over s and t
• Use an array $d_q$ of size $\sigma^q$
• First slide a window over s, increment respective entry for every q-gram seen
• Then slide over t, decrement respective entry for every q-gram seen
• Now $d_q[r] = N(s, u_r) - N(t, u_r)$.
• Sum up the absolute values of the entries: $\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_i |d_q[i]|$

We will see: This algorithm runs in linear time.

But: how do we know where to find the entry for the current q-gram? This is called ranking (coming soon).

Algorithm for computing q-gram distance

input: Strings s, t of length |s| = n and |t| = m
output: $\mathrm{dist}_{q\text{-gram}}(s, t)$

1. initialize $d_q[0 \ldots \sigma^q - 1]$ with 0s
2. for i = 1, ..., n − q + 1:
     r ← rank($s_i \ldots s_{i+q-1}$)
     $d_q[r]$ ← $d_q[r]$ + 1
3. for i = 1, ..., m − q + 1:
     r ← rank($t_i \ldots t_{i+q-1}$)
     $d_q[r]$ ← $d_q[r]$ − 1
4. d ← 0
5. for i = 0, ..., $\sigma^q - 1$: d ← d + $|d_q[i]|$
6. return d

Example:

s = ACAGGGCA,
t = GGGCAACA.

The table shows the array $d_q$ after line 2 of the algorithm (now $d_q$ equals $P_q(s)$) and after line 3.

r    u_r   d_q after the pass thru s   d_q after the pass thru t
0    AA    0                           −1
1    AC    1                           0
2    AG    1                           1
3    AT    0                           0
4    CA    2                           0
5    CC    0                           0
6    CG    0                           0
7    CT    0                           0
8    GA    0                           0
9    GC    1                           0
10   GG    2                           0
11   GT    0                           0
12   TA    0                           0
13   TC    0                           0
14   TG    0                           0
15   TT    0                           0

Finally, we have $d_2(s, t) = |-1| + 1 = 2$.

Ranking functions

Goal: Given q-gram u, we want to know which entry of the array u corresponds to.

Ex.: Where is the q-gram CG? In position 6.

• A ranking function is a bijection rank : $\Sigma^q \to [0 \ldots \sigma^q - 1]$.
• rank(u) gives us the position of u in the enumeration of $\Sigma^q$
• needs to be very efficiently computable
• the ranking function we use will give us constant time per q-gram of s
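The algorithm above can be sketched in Python as follows. The slides leave the concrete ranking function for later, so this sketch makes an assumption: it reads a q-gram as a base-σ number under the lexicographic order A < C < G < T, which places CG at position 6 as in the example. The names `rank`, `qgram_distance_ranked`, and the `alphabet` parameter are illustrative only.

```python
def rank(u: str, alphabet: str = "ACGT") -> int:
    """Assumed ranking: read u as a number in base sigma (lexicographic order)."""
    r = 0
    for c in u:
        r = r * len(alphabet) + alphabet.index(c)
    return r


def qgram_distance_ranked(s: str, t: str, q: int, alphabet: str = "ACGT") -> int:
    sigma = len(alphabet)
    code = {c: k for k, c in enumerate(alphabet)}  # character -> digit
    d = [0] * (sigma ** q)                         # the array d_q, all zeros
    top = sigma ** (q - 1)                         # weight of the leading digit

    def slide(x: str, delta: int) -> None:
        """Add delta to the entry of every q-gram occurrence in x."""
        if len(x) < q:
            return
        r = rank(x[:q], alphabet)                  # rank of the first window
        d[r] += delta
        for i in range(q, len(x)):
            # Drop the leading character, shift, append the new character:
            # constant time per window.
            r = (r - code[x[i - q]] * top) * sigma + code[x[i]]
            d[r] += delta

    slide(s, +1)  # after this pass, d equals P_q(s)
    slide(t, -1)  # now d[r] = N(s, u_r) - N(t, u_r)
    return sum(abs(e) for e in d)


print(rank("CG"))                                       # 6
print(qgram_distance_ranked("ACAGGGCA", "GGGCAACA", 2))  # 2
```

Under these assumptions the two passes take O(n + m) time and the final summation takes O(σ^q) time, which is small when q is small relative to the string lengths.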
