CS 466: Introduction to Bioinformatics
Lecture 2, Part 2
Mohammed El-Kebir
January 28, 2020

Outline
1. Edit distance recap
2. Global alignment
3. Fitting alignment
4. Local alignment
5. Gapped alignment

Reading:
• Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
• Lecture notes 2

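Weighted Edit Distance – Practice Problem
• Compute the weighted edit distance between $\mathbf{v} = \text{AGT}$ and $\mathbf{w} = \text{ATCG}$.

$$
d[i,j] = \min
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
d[i-1,j] + 1, & \text{if } i > 0,\\
d[i,j-1] + 1, & \text{if } j > 0,\\
d[i-1,j-1] + 2, & \text{if } i > 0,\ j > 0 \text{ and } v_i \neq w_j,\\
d[i-1,j-1], & \text{if } i > 0,\ j > 0 \text{ and } v_i = w_j.
\end{cases}
$$

Table to fill in (rows: prefixes of v, columns: prefixes of w):

            A   T   C   G
        0   1   2   3   4
    0   0   .   .   .   .
A   1   .   .   .   .   .
G   2   .   .   .   .   .
T   3   .   .   .   .   .
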
Weighted Edit Distance – Practice Problem
• Compute the weighted edit distance between $\mathbf{v} = \text{AGT}$ and $\mathbf{w} = \text{ATCG}$.

Filled table of $d[i,j]$ (rows: prefixes of v, columns: prefixes of w):

            A   T   C   G
        0   1   2   3   4
    0   0   1   2   3   4
A   1   1   0   1   2   3
G   2   2   1   2   3   2
T   3   3   2   1   2   3

The weighted edit distance is the bottom-right entry, $d[3,4] = 3$.

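The table can be reproduced with a few lines of code. The following is a minimal sketch of the recurrence on this slide, assuming the column string is ATCG (as in the table's column labels); the function name `weighted_edit_distance_table` and its parameters `indel` and `mismatch` are hypothetical names chosen for illustration.

```python
def weighted_edit_distance_table(v, w, indel=1, mismatch=2):
    """Fill the table d, where d[i][j] is the weighted edit distance
    between the prefixes v[:i] and w[:j]."""
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # base case: d[0][0] = 0
            candidates = []
            if i > 0:
                candidates.append(d[i - 1][j] + indel)  # deletion of v_i
            if j > 0:
                candidates.append(d[i][j - 1] + indel)  # insertion of w_j
            if i > 0 and j > 0:
                # diagonal move: free on a match, cost `mismatch` otherwise
                cost = 0 if v[i - 1] == w[j - 1] else mismatch
                candidates.append(d[i - 1][j - 1] + cost)
            d[i][j] = min(candidates)
    return d


table = weighted_edit_distance_table("AGT", "ATCG")
for row in table:
    print(row)
```

Each printed row corresponds to one row of the table (the empty prefix, then A, G, T).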
Edit Distance – Additional Insights
• An alignment corresponds to a series of elementary operations
• But not every series of elementary operations corresponds to an alignment! Why?
Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf

Distance Function / Metric
A distance function (metric) on a set $X$ is a function $d : X \times X \to \mathbb{R}$ s.t. for all $x, y, z \in X$:
i. $d(x, y) \geq 0$ [non-negativity]
ii. $d(x, y) = 0$ if and only if $x = y$ [identity of indiscernibles]
iii. $d(x, y) = d(y, x)$ [symmetry]
iv. $d(x, y) \leq d(x, z) + d(z, y)$ [triangle inequality]
Question: Is edit distance a distance function?

Edit Distance is a Distance Function
Edit distance $d(\mathbf{v}, \mathbf{w})$ is the minimum number of elementary operations to transform $\mathbf{v} \in \Sigma^*$ into $\mathbf{w} \in \Sigma^*$.
Claim: edit distance is a distance function.
Proof: Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.
i. $d(\mathbf{v}, \mathbf{w}) \geq 0$ [non-negativity]
Edit distance is defined by an alignment. This in turn uniquely determines a series of elementary operations, each with cost either 0 (match) or 1 (otherwise). Thus, $d(\mathbf{v}, \mathbf{w}) \geq 0$.

ii. $d(\mathbf{v}, \mathbf{w}) = 0$ if and only if $\mathbf{v} = \mathbf{w}$ [identity of indiscernibles]
(=>) By the premise, $d(\mathbf{v}, \mathbf{w}) = 0$. By definition, the optimal alignment can only consist of operations with cost 0. That is, the alignment consists of only matches. Thus, $\mathbf{v} = \mathbf{w}$.
(<=) By the premise, $\mathbf{v} = \mathbf{w}$. This means that $|\mathbf{v}| = |\mathbf{w}|$ and each letter $v_i$ equals $w_i$ (where $i \in [|\mathbf{v}|]$). Thus, there exists an alignment where every column is a match. Moreover, only match operations have cost 0; all other operations have cost 1. Hence, this is an optimal alignment with cost $d(\mathbf{v}, \mathbf{w}) = 0$.

iii. $d(\mathbf{v}, \mathbf{w}) = d(\mathbf{w}, \mathbf{v})$ [symmetry]
Let $\mathbf{A} = [a_{i,j}]$ be an optimal alignment corresponding to $d(\mathbf{v}, \mathbf{w})$, i.e. $\mathbf{A}$ is a $2 \times k$ matrix where $k \in \{\max(|\mathbf{v}|, |\mathbf{w}|), \ldots, |\mathbf{v}| + |\mathbf{w}|\}$. Define the function $f(\mathbf{A}) = \mathbf{B}$ such that $\mathbf{B}$ is obtained by interchanging the two rows of $\mathbf{A}$. Interchanging the rows turns every insertion into a deletion and vice versa, and leaves matches and mismatches in place. Since the cost of any insertion, deletion and mismatch is 1, alignment $\mathbf{B}$ of $\mathbf{w}$ and $\mathbf{v}$ has cost $d(\mathbf{v}, \mathbf{w})$. The existence of an alignment from $\mathbf{w}$ to $\mathbf{v}$ with cost less than $d(\mathbf{v}, \mathbf{w})$ yields a contradiction: interchanging its rows would give an alignment from $\mathbf{v}$ to $\mathbf{w}$ with lower cost, so $\mathbf{A}$ would not be an optimal alignment from $\mathbf{v}$ to $\mathbf{w}$. Hence, $d(\mathbf{w}, \mathbf{v}) = d(\mathbf{v}, \mathbf{w})$.

iv. $d(\mathbf{v}, \mathbf{w}) \leq d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ [triangle inequality]
Assume for a contradiction that $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$. Let $S$ be a shortest sequence of elementary operations for transforming $\mathbf{v}$ into $\mathbf{u}$. Let $S'$ be a shortest sequence of elementary operations for transforming $\mathbf{u}$ into $\mathbf{w}$. Note that $d(\mathbf{v}, \mathbf{u}) = |S|$ and $d(\mathbf{u}, \mathbf{w}) = |S'|$. Concatenate $S$ and $S'$ and remove redundant operations, yielding sequence $S''$, which transforms $\mathbf{v}$ into $\mathbf{w}$. By definition, $|S''| \leq |S| + |S'|$. We can obtain an alignment of $\mathbf{v}$ and $\mathbf{w}$ from $S''$ with cost $|S''| \leq d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$. This contradicts the assumption that $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ is the cost of the optimal alignment of $\mathbf{v}$ and $\mathbf{w}$.

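The four properties can also be checked empirically on small inputs. The sketch below is a sanity check under stated assumptions, not a substitute for the proof: it enumerates all strings of length at most 3 over a two-letter alphabet and tests each axiom for unit-cost edit distance. The helper names `edit_distance`, `all_strings` and `check_metric_axioms` are hypothetical.

```python
from itertools import product


def edit_distance(v, w):
    """Unit-cost edit distance: insertion, deletion and mismatch cost 1."""
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of v[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of w[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + mismatch)
    return d[m][n]


def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def check_metric_axioms(alphabet="AC", max_len=3):
    """Return True if all four metric axioms hold on the enumerated strings."""
    strings = list(all_strings(alphabet, max_len))
    dist = {(v, w): edit_distance(v, w) for v in strings for w in strings}
    for v in strings:
        for w in strings:
            if dist[v, w] < 0:                      # i. non-negativity
                return False
            if (dist[v, w] == 0) != (v == w):       # ii. identity of indiscernibles
                return False
            if dist[v, w] != dist[w, v]:            # iii. symmetry
                return False
            for u in strings:                       # iv. triangle inequality
                if dist[v, w] > dist[v, u] + dist[u, w]:
                    return False
    return True


print(check_metric_axioms())
```

A passing check on small strings only illustrates the claim; the proof above is what establishes it for all of $\Sigma^*$.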
Biological Sequence Alignment
• Weighted edit distance: find alignment with minimum distance
  • Shortest path in weighted edit graph
• Sequence alignment: find alignment with maximum similarity
  • Longest path in weighted edit graph
• Score function: $\delta : (\Sigma \cup \{-\})^2 \to \mathbb{R}$
  • deletion: $\delta(v_i, -)$
  • insertion: $\delta(-, w_j)$
  • match/mismatch: $\delta(v_i, w_j)$
[Figure: edit graph with rows labeled by v = ATGT and columns labeled by w = ATCG]
Question: What is an example of $\delta$?

Scoring Matrices
Transitions: interchanges among purines (two rings) or pyrimidines (one ring)
• A <--> G
• C <--> T
Transversions: interchanges between purines (two rings) and pyrimidines (one ring)
• A <--> C, A <--> T
• G <--> C, G <--> T
Transitions more likely than transversions!

Scoring matrix $\delta$:

 δ    A    T    C    G    -
 A    1   -2   -2   -1   -1
 T   -2    1   -1   -2   -1
 C   -2   -1    1   -2   -1
 G   -1   -2   -2    1   -1
 -   -1   -1   -1   -1   -∞

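The scoring matrix on this slide follows directly from the transition/transversion rule. As a minimal sketch, the function below derives each entry from the purine and pyrimidine classes; the name `delta`, its keyword parameters, and the constants `PURINES`, `PYRIMIDINES` and `GAP` are assumptions made for illustration, with default values taken from the matrix.

```python
PURINES = {"A", "G"}      # two rings
PYRIMIDINES = {"C", "T"}  # one ring
GAP = "-"


def delta(a, b, match=1, transition=-1, transversion=-2, gap=-1):
    """Score of aligning symbol a with symbol b (either may be a gap)."""
    if a == GAP and b == GAP:
        return float("-inf")  # a column of two gaps is never allowed
    if a == GAP or b == GAP:
        return gap            # insertion or deletion
    if a == b:
        return match
    # Same chemical class (both purines or both pyrimidines): transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


for a in "ATCG-":
    print(a, [delta(a, b) for b in "ATCG-"])
```

Penalizing transversions more heavily than transitions encodes the observation that transitions are more likely.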
Global Alignment – Needleman-Wunsch Algorithm
Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find alignment with maximum score.
• An alignment is a source-to-sink path in the edit graph
• An alignment $\mathbf{A} = [a_{i,j}]$ is a $2 \times k$ matrix s.t. (i) $k \in \{\max(m, n), \ldots, m + n\}$, (ii) $a_{i,j} \in \Sigma \cup \{-\}$ and (iii) there is no $j \in [k]$ where $a_{1,j} = a_{2,j} = -$

$$
s[i,j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1,j] + \delta(v_i, -), & \text{if } i > 0 \quad \text{(deletion)},\\
s[i,j-1] + \delta(-, w_j), & \text{if } j > 0 \quad \text{(insertion)},\\
s[i-1,j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0 \quad \text{(match/mismatch)}.
\end{cases}
$$

Demonstration
• http://alfehrest.org/sub/nwa/index.html
• $\mathbf{v} = \text{ATGTTAT}$ and $\mathbf{w} = \text{ATCGTAC}$.

Scoring matrix $\delta$:

 δ    A    T    C    G    -
 A    1   -2   -2   -1   -1
 T   -2    1   -1   -2   -1
 C   -2   -1    1   -2   -1
 G   -1   -2   -2    1   -1
 -   -1   -1   -1   -1   -∞

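For readers without access to the online demonstration, here is a minimal sketch of the global alignment recurrence with traceback, run on the two strings and the scoring matrix from this slide. The names `needleman_wunsch`, `DELTA`, `ROWS` and the backpointer labels are hypothetical; ties between equally good moves are broken arbitrarily, so the printed alignment is one optimal alignment among possibly several.

```python
NEG_INF = float("-inf")
ALPHABET = "ATCG-"
# Scoring matrix from the slide; rows and columns are ordered A, T, C, G, -.
ROWS = [
    [1, -2, -2, -1, -1],
    [-2, 1, -1, -2, -1],
    [-2, -1, 1, -2, -1],
    [-1, -2, -2, 1, -1],
    [-1, -1, -1, -1, NEG_INF],
]
DELTA = {(a, b): ROWS[i][j]
         for i, a in enumerate(ALPHABET)
         for j, b in enumerate(ALPHABET)}


def needleman_wunsch(v, w, delta=DELTA):
    """Return (score, aligned_v, aligned_w) for a maximum-score global alignment."""
    m, n = len(v), len(w)
    s = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    s[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:  # deletion: v_i aligned with a gap
                cand = s[i - 1][j] + delta[v[i - 1], "-"]
                if cand > s[i][j]:
                    s[i][j], back[i][j] = cand, "del"
            if j > 0:  # insertion: gap aligned with w_j
                cand = s[i][j - 1] + delta["-", w[j - 1]]
                if cand > s[i][j]:
                    s[i][j], back[i][j] = cand, "ins"
            if i > 0 and j > 0:  # match or mismatch
                cand = s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]]
                if cand > s[i][j]:
                    s[i][j], back[i][j] = cand, "diag"
    # Follow backpointers from the sink (m, n) to the source (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "del":
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        elif move == "ins":
            top.append("-"); bottom.append(w[j - 1]); j -= 1
        else:
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
    return s[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_v, aligned_w = needleman_wunsch("ATGTTAT", "ATCGTAC")
print(score)
print(aligned_v)
print(aligned_w)
```

The table has $(m+1)(n+1)$ entries and each is computed in constant time, so both running time and memory are $O(mn)$.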
Outline
1. Edit distance recap
2. Global alignment
3. Fitting alignment
4. Local alignment
5. Gapped alignment

Reading:
• Jones and Pevzner. Chapters 6.6, 6.7 and 6.9
• Lecture notes

Next Generation Sequencing (NGS) Technology
[Figure: log-scale plot with axis ticks from 1,000 to 100,000,000 and an annotation labeled NGS; dated November 2017]

NGS Characterized by Short Reads
[Figure: a genome (millions to billions of nucleotides) goes through next-generation DNA sequencing, producing tens to hundreds of millions of short reads such as … CATTCAGTAG …, … AGCCATTAG …, … GGTAGTTAG …, … GGTAAACTAG …, … TATAATTAG …, … CGTACCTAG …]
Short read: 100 nucleotides
Allow for inexact matches due to:
• Sequencing errors
• Polymorphisms/mutations relative to the reference genome
The human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!
Question: How to account for the discrepancy between the lengths of the reference and a short read?

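A quick calculation shows why global alignment is a poor fit here. The sketch below uses the two lengths from this slide and assumes, for illustration only, a score of +1 per match and -1 per gap column (the values in the scoring matrix used earlier); all variable names are hypothetical.

```python
GENOME_LENGTH = 3_300_000_000  # human reference genome, from the slide
READ_LENGTH = 100              # one short read, from the slide

# A global alignment must account for every genome position, so at least
# GENOME_LENGTH - READ_LENGTH genome positions are aligned with gaps.
min_gap_columns = GENOME_LENGTH - READ_LENGTH

# Size of the full dynamic programming table for one read.
dp_cells = (GENOME_LENGTH + 1) * (READ_LENGTH + 1)

# Assumed scores: +1 per match, -1 per gap column. Even if all 100 read
# positions match, the forced gaps dominate the score.
best_possible_score = READ_LENGTH * 1 + min_gap_columns * (-1)

print(f"forced gap columns:  {min_gap_columns:,}")
print(f"table entries:       {dp_cells:,}")
print(f"best possible score: {best_possible_score:,}")
```

Under these assumptions the score of any global alignment is determined almost entirely by the forced gaps rather than by how well the read matches its true location.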