CS 466 Introduction to Bioinformatics
Lecture 6
Mohammed El-Kebir
February 6, 2020
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir)
• Office hours: Wednesdays, 3:15-4:15pm
TA:
• Aswhin Ramesh (aramesh7)
• Office hours: Fridays, 11:00-11:59am in SC 3405
Homework 1: Due on Sept. 18 (11:59pm)
Midterm: 10/4, 11-1pm @Transportation Building 103 (conflict: 10/7, 7-9pm @Siebel 1302; to sign up, email me)
Global, Fitting and Local Alignment
Global Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find an alignment of $v$ and $w$ with maximum score. [Needleman-Wunsch algorithm]
Fitting Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find an alignment of $v$ and a substring of $w$ with maximum global alignment score $s^*$ among all global alignments of $v$ and all substrings of $w$.
Local Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find a substring of $v$ and a substring of $w$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $v$ and $w$. [Smith-Waterman algorithm]
Question: How to assess resulting algorithms?
[Figure: DP table with rows $v$ = AGG and columns $w$ = TACGGC.]
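The three problems above differ only in where a path may start and end in the DP table. The following is a minimal sketch of that idea, not the lecture's reference code. The function name `align_score`, the `mode` parameter, and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for illustration.

```python
def align_score(v, w, mode="global", match=1, mismatch=-1, gap=-1):
    """Score-only DP for global, fitting, or local alignment of v and w.

    Rows index v (0..m), columns index w (0..n). Scoring values are assumed.
    """
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(s[i - 1][j] + gap)       # v_i aligned to a gap
            if j > 0:
                cands.append(s[i][j - 1] + gap)       # w_j aligned to a gap
            if i > 0 and j > 0:
                sub = match if v[i - 1] == w[j - 1] else mismatch
                cands.append(s[i - 1][j - 1] + sub)   # v_i aligned to w_j
            # Free start: fitting may skip a prefix of w (row 0 only),
            # local may start at any vertex.
            if mode == "local" or (mode == "fitting" and i == 0):
                cands.append(0)
            s[i][j] = max(cands)
    if mode == "global":
        return s[m][n]
    if mode == "fitting":
        return max(s[m])                   # free end: skip a suffix of w
    return max(max(row) for row in s)      # local: best vertex anywhere


if __name__ == "__main__":
    for mode in ("global", "fitting", "local"):
        print(mode, align_score("AGG", "TACGGC", mode))
```

With the strings from the figure, the global score is lowest because every unmatched character of `w` must be paid for with a gap, while fitting and local alignment may skip the flanking characters for free.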
Time Complexity
Edit graph is a weighted, directed grid graph $G = (V, E)$ with source vertex $(0, 0)$ and target vertex $(m, n)$. Each edge $((i, j), (k, l))$ has weight depending on direction.
Alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
Running time is $O(mn)$ [quadratic time]
Question: Compute alignment faster than $O(mn)$ time? [subquadratic time]
[Figure: edit graph grid with rows $v$ = ATGT ($m = 4$) and columns $w$ = ATCG ($n = 4$).]
Space Complexity
Size of DP table is $(m + 1) \times (n + 1)$.
Thus, space complexity is $O(mn)$ [quadratic space]
Example: To align a short read ($m = 100$) to the human genome ($n = 3 \cdot 10^9$), we need roughly 300 GB of memory (at one byte per table entry).
Question: How long is an alignment?
Question: Compute alignment in $O(m)$ space? [linear space]
[Figure: empty DP table with rows $v$ = ATGT ($m = 4$) and columns $w$ = ATCG ($n = 4$).]
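The memory estimate above is simple arithmetic. This short sketch reproduces it, assuming one byte per table entry; the slide does not state the entry size, and the helper names are made up for illustration. The exact table size $(m+1)(n+1)$ gives about 303 GB, which the slide rounds to $m \cdot n$ = 300 GB.

```python
def dp_table_bytes(m, n, bytes_per_entry=1):
    """Memory for the full (m + 1) x (n + 1) DP table."""
    return (m + 1) * (n + 1) * bytes_per_entry


def two_column_bytes(m, bytes_per_entry=1):
    """Memory when only two columns of length m + 1 are kept."""
    return 2 * (m + 1) * bytes_per_entry


if __name__ == "__main__":
    m, n = 100, 3 * 10**9   # short read vs. approximate human genome length
    print(f"full table:  {dp_table_bytes(m, n) / 1e9:.0f} GB")   # full table:  303 GB
    print(f"two columns: {two_column_bytes(m)} bytes")           # two columns: 202 bytes
```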
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
Space Efficient Alignment
Computing $s[i, j]$ requires access to $s[i-1, j]$, $s[i, j-1]$ and $s[i-1, j-1]$:
$$
s[i, j] = \max \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
$$
Thus it suffices to store only two columns to compute the optimal alignment score $s[m, n]$, i.e., $2(m + 1) = O(m)$ space.
Question: What if we want the alignment itself?
[Figure: DP table with column index $j$ from 0 to $n$ and row index $i$ from 0 to $m$.]
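The two-column idea can be sketched as follows. This is an illustrative implementation under assumed scoring (match +1, mismatch -1, gap -1), not code from the lecture. The function name `global_score_two_columns` is hypothetical.

```python
def global_score_two_columns(v, w, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score s[m, n] using two columns of length m + 1."""
    m, n = len(v), len(w)
    prev = [i * gap for i in range(m + 1)]     # column j = 0
    for j in range(1, n + 1):
        curr = [j * gap] + [0] * m             # curr[0] = s[0, j]
        for i in range(1, m + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(
                curr[i - 1] + gap,             # s[i-1, j]   + delta(v_i, -)
                prev[i] + gap,                 # s[i, j-1]   + delta(-, w_j)
                prev[i - 1] + sub,             # s[i-1, j-1] + delta(v_i, w_j)
            )
        prev = curr                            # column j-1 is no longer needed
    return prev[m]


if __name__ == "__main__":
    print(global_score_two_columns("ATGTC", "ATCGC"))
```

Only the score survives this computation. The backpointers of discarded columns are gone, which is why recovering the alignment itself needs a further idea.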
Space Efficient Alignment – First Attempt
• What if we also want the optimal alignment?
• Easy: keep the best pointers as we fill in the table.
• No! We do not know which path to keep until the recurrence has been computed at each step.
Best score for a column might not be part of the best alignment!
[Figure: two DP tables with rows $v$ and columns $w$.]
Space Efficient Alignment – Second Attempt
Alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
Maximum weight path from $(0, 0)$ to $(m, n)$ passes through $(i^*, n/2)$.
Question: What is $i^*$?
[Figure: DP table with rows $i$ from 0 to $m$, with middle column $n/2$ and crossing row $i^*$ marked.]
Hirschberg($i, j, i', j'$)
1. if $j' - j > 1$
2.     $i^* \leftarrow \arg\max_{i \le i'' \le i'} \mathrm{wt}(i'')$
3.     Report $(i^*, j + \frac{j' - j}{2})$
4.     Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
5.     Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct alignment from reported vertices?
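The recursion above can be sketched in Python as follows. This is one possible reading of the pseudocode, not the lecture's implementation. It assumes match +1, mismatch -1, gap -1, takes $\mathrm{wt}(i'')$ to be the weight of the best path through $(i'', \text{mid})$, and rounds the middle column down. The helper names `delta`, `last_column`, and `hirschberg` are hypothetical.

```python
GAP = -1  # assumed gap score


def delta(a, b):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def last_column(v, w):
    """Scores of aligning each prefix of v with all of w, keeping two columns."""
    prev = [i * GAP for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * GAP] + [0] * len(v)
        for i in range(1, len(v) + 1):
            curr[i] = max(curr[i - 1] + GAP, prev[i] + GAP,
                          prev[i - 1] + delta(v[i - 1], w[j - 1]))
        prev = curr
    return prev


def hirschberg(v, w, i, j, i2, j2, report):
    """Append to `report` one vertex (i*, mid) per interior column of [j, j2]."""
    if j2 - j > 1:
        mid = j + (j2 - j) // 2
        # prefix[t]: best path from (i, j) to (i + t, mid)
        prefix = last_column(v[i:i2], w[j:mid])
        # suffix[t]: best path from (i + t, mid) to (i2, j2), via reversed strings
        suffix = last_column(v[i:i2][::-1], w[mid:j2][::-1])[::-1]
        t_star = max(range(i2 - i + 1), key=lambda t: prefix[t] + suffix[t])
        i_star = i + t_star
        report.append((i_star, mid))
        hirschberg(v, w, i, j, i_star, mid, report)
        hirschberg(v, w, i_star, mid, i2, j2, report)


if __name__ == "__main__":
    v, w = "ATGTC", "ATCGC"
    vertices = []
    hirschberg(v, w, 0, 0, len(v), len(w), vertices)
    print(sorted(vertices, key=lambda p: p[1]))
```

The string slices are used for readability and cost O(m + n) extra space per call. An index-based version would avoid the copies without changing the recursion.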
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from $(0, 0)$ to $(m, n)$ through $(i^*, n/2)$:
$$
i^* = \arg\max_{0 \le i \le m} \{ \mathrm{prefix}(i) + \mathrm{suffix}(i) \}
$$
Compute $\{ \mathrm{prefix}(i) \mid 0 \le i \le m \}$ in $O(mj)$ time and $O(m)$ space, by running the DP from $(0, 0)$ to $(m, j)$ while keeping only two columns in memory. [single-source multiple destinations]
Want: Compute $\{ \mathrm{suffix}(i) \mid 0 \le i \le m \}$ in $O(mj)$ time and $O(m)$ space.
Doing a longest path from each $(i, j)$ to $(m, n)$ (for all $0 \le i \le m$) will not achieve the desired running time!
Reversing edges enables a single-source multiple-destination computation in the desired time and space bound!
[Figure: DP table with rows $i$ from 0 to $m$, column $j$ and crossing row $i^*$ marked.]
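To make the argument above concrete, the sketch below computes the suffix scores twice: once with a separate DP from every vertex $(i, j)$ of the chosen column, and once with a single DP on the reversed strings, which corresponds to reversing all edges and starting from $(m, n)$. Scoring values (match +1, mismatch -1, gap -1) and all function names are assumptions for illustration.

```python
GAP = -1  # assumed gap score


def delta(a, b):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def last_column(v, w):
    """Scores of aligning each prefix of v with all of w, keeping two columns."""
    prev = [i * GAP for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * GAP] + [0] * len(v)
        for i in range(1, len(v) + 1):
            curr[i] = max(curr[i - 1] + GAP, prev[i] + GAP,
                          prev[i - 1] + delta(v[i - 1], w[j - 1]))
        prev = curr
    return prev


def suffix_naive(v, w, j):
    """suffix(i) for all i via m + 1 separate DPs, one per start vertex (i, j)."""
    return [last_column(v[i:], w[j:])[-1] for i in range(len(v) + 1)]


def suffix_reversed(v, w, j):
    """suffix(i) for all i via one DP on the reversed strings."""
    scores = last_column(v[::-1], w[j:][::-1])
    return scores[::-1]   # entry k of `scores` covers the last k characters of v


if __name__ == "__main__":
    v, w, j = "ATGTC", "ATCGC", 2
    print(suffix_naive(v, w, j))
    print(suffix_reversed(v, w, j))
```

Both functions return the same list, but the naive version repeats the DP for every row, while the reversed version fills the region to the right of column $j$ only once.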
Hirschberg Algorithm: Reconstructing Alignment
Hirschberg($i, j, i', j'$)
1. if $j' - j > 1$
2.     $i^* \leftarrow \arg\max_{i \le i'' \le i'} \mathrm{wt}(i'')$
3.     Report $(i^*, j + \frac{j' - j}{2})$
4.     Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
5.     Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
DP table for rows $v$ = ATGTC and columns $w$ = ATCGC:
  v \ w       A    T    C    G    C
         0    1    2    3    4    5
     0   0   -1   -2   -3   -4   -5
  A  1  -1    1    0   -1   -2   -3
  T  2  -2    0    2    1    0   -1
  G  3  -3   -1    1    1    2    1
  T  4  -4   -2    0    0    1    1
  C  5  -5   -3   -1    1    0    2
Resulting alignment:
  A T - G T C
  A T C G - C
Problem: Given reported vertices and scores $\{ (i_0, 0, s_0), \ldots, (i_n, n, s_n) \}$, find intermediary vertices.
Transposing the matrix does not help, because gaps could occur in both input sequences.
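The table on this slide is consistent with a scoring function of match +1, mismatch -1 and gap -1. That scoring is inferred from the numbers and is not stated on the slide. Under that assumption, the sketch below recomputes the full table and scores the alignment shown. The function names are hypothetical.

```python
GAP = -1  # assumed gap score, inferred from the table


def delta(a, b):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def dp_table(v, w):
    """Full (m + 1) x (n + 1) global alignment table; rows index v, columns index w."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = i * GAP
    for j in range(1, n + 1):
        s[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j] + GAP, s[i][j - 1] + GAP,
                          s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]))
    return s


def alignment_score(row_v, row_w):
    """Score of a given alignment, written as two equal-length gapped strings."""
    return sum(GAP if "-" in (a, b) else delta(a, b)
               for a, b in zip(row_v, row_w))


if __name__ == "__main__":
    for row in dp_table("ATGTC", "ATCGC"):
        print(" ".join(f"{x:3d}" for x in row))
    print(alignment_score("AT-GTC", "ATCG-C"))
```

The score of the displayed alignment equals the bottom-right table entry, so it is one optimal global alignment under this scoring; other optimal alignments may exist.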
Linear Space Alignment – The Hirschberg Algorithm