
CS 466 Introduction to Bioinformatics - PowerPoint PPT Presentation


  1. CS 466 Introduction to Bioinformatics
     Lecture 6
     Mohammed El-Kebir
     February 6, 2020

  2. Course Announcements
     Instructor:
     • Mohammed El-Kebir (melkebir)
     • Office hours: Wednesdays, 3:15-4:15pm
     TA:
     • Aswhin Ramesh (aramesh7)
     • Office hours: Fridays, 11:00-11:59am in SC 3405
     Homework 1: Due on Sept. 18 (11:59pm)
     Midterm: 10/4, 11-1pm @Transportation Building 103 (conflict: 10/7, 7-9pm @Siebel 1302; to sign up, email me)

  3. Global, Fitting and Local Alignment
     Global Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find an alignment of $v$ and $w$ with maximum score. [Needleman-Wunsch algorithm]
     Fitting Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find an alignment of $v$ and a substring of $w$ with maximum global alignment score $s^*$ among all global alignments of $v$ and all substrings of $w$.
     Local Alignment problem: Given strings $v \in \Sigma^m$ and $w \in \Sigma^n$ and scoring function $\delta$, find a substring of $v$ and a substring of $w$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $v$ and $w$. [Smith-Waterman algorithm]
     Question: How to assess resulting algorithms?
     [Figure: empty DP table with v = AGG down the rows and w = TACGGC across the columns.]
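
The three problems above differ only in how the DP table is initialized and where the optimum is read off. A minimal sketch, assuming a simple scoring of +1 match, -1 mismatch, -1 gap (the function name `alignment_score` and the `mode` parameter are hypothetical, not from the lecture):

```python
def alignment_score(v, w, mode="global", match=1, mismatch=-1, gap=-1):
    """Score of an optimal global, fitting, or local alignment of v and w."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        # Column 0: only local alignment may skip a prefix of v for free.
        s[i][0] = 0 if mode == "local" else s[i - 1][0] + gap
    for j in range(1, n + 1):
        # Row 0: fitting and local alignment may skip a prefix of w for free.
        s[0][j] = s[0][j - 1] + gap if mode == "global" else 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            best = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            # Local alignment may also restart from an empty alignment.
            s[i][j] = max(best, 0) if mode == "local" else best
    if mode == "global":
        return s[m][n]                    # all of v against all of w
    if mode == "fitting":
        return max(s[m])                  # all of v, any end position in w
    return max(max(row) for row in s)     # any end position in v and w
```

All three modes fill the same $(m + 1) \times (n + 1)$ table, so they share the same quadratic time and space cost.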

  4. Time Complexity
     Edit graph is a weighted, directed grid graph $G = (V, E)$ with source vertex $(0, 0)$ and target vertex $(m, n)$. Each edge $((i, j), (k, l))$ has weight depending on direction.
     Alignment is a path from source $(0, 0)$ to target $(m, n)$ in edit graph.
     Running time is $O(mn)$ [quadratic time]
     [Figure: edit graph for v = ATGT (rows 0 to m = 4) and w = ATCG (columns 0 to n = 4).]

  5. Time Complexity (continued)
     Question: Compute alignment faster than $O(mn)$ time? [subquadratic time]

  6. Space Complexity
     Size of DP table is $(m + 1) \times (n + 1)$.
     Thus, space complexity is $O(mn)$ [quadratic space]
     Example: To align a short read ($m = 100$) to human genome ($n = 3 \cdot 10^9$), we need 300 GB memory.
     [Figure: empty DP table for v = ATGT (rows 0 to m = 4) and w = ATCG (columns 0 to n = 4).]
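
As a rough check of the 300 GB figure, assuming one byte per DP cell (an assumption; the slide does not state the cell size):

$$(m + 1)(n + 1) = 101 \times (3 \cdot 10^9 + 1) \approx 3.03 \cdot 10^{11} \text{ cells} \approx 300 \text{ GB at 1 byte per cell}.$$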

  7. Space Complexity (continued)
     Question: How long is an alignment?

  8. Space Complexity (continued)
     Question: Compute alignment in $O(m)$ space? [linear space]

  9. Outline
     1. Recap of global, fitting, local and gapped alignment
     2. Space-efficient alignment
     3. Subquadratic time alignment
     Reading:
     • Jones and Pevzner. Chapters 7.1-7.4
     • Lecture notes

  10. Space Efficient Alignment
      Computing $s[i, j]$ requires access to: $s[i-1, j]$, $s[i, j-1]$ and $s[i-1, j-1]$.
      $$s[i, j] = \max \begin{cases}
      0, & \text{if } i = 0 \text{ and } j = 0, \\
      s[i-1, j] + \delta(v_i, -), & \text{if } i > 0, \\
      s[i, j-1] + \delta(-, w_j), & \text{if } j > 0, \\
      s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
      \end{cases}$$
      [Figure: DP table with columns 0 to n and rows 0 to m, marking row i and column j.]

  11. Space Efficient Alignment (continued)
      Thus it suffices to store only two columns to compute optimal alignment score $s[m, n]$, i.e., $2(m + 1) = O(m)$ space.
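
The two-column idea can be sketched as follows. This is an illustrative sketch, not the lecture's code: the function name `global_score_two_columns` is hypothetical, and the scoring of +1 match, -1 mismatch, -1 gap is an assumption.

```python
def global_score_two_columns(v, w, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score s[m][n] using only two columns."""
    m = len(v)
    prev = [i * gap for i in range(m + 1)]      # column j = 0
    for j in range(1, len(w) + 1):
        curr = [prev[0] + gap] + [0] * m        # s[0][j]: w[:j] against gaps
        for i in range(1, m + 1):
            diag = prev[i - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # prev[i] is s[i][j-1]; curr[i-1] is s[i-1][j].
            curr[i] = max(diag, prev[i] + gap, curr[i - 1] + gap)
        prev = curr                             # column j-1 is no longer needed
    return prev[m]
```

Only the score survives: once a column is overwritten, the pointers needed for a traceback through it are gone.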

  12. Space Efficient Alignment (continued)
      Question: What if we want alignment itself?

  13. Space Efficient Alignment – First Attempt
      • What if we also want the optimal alignment?
      • Easy: keep best pointers as we fill in the table.
      • No! We do not know which path to keep until computing the recurrence at each step.

  15. Space Efficient Alignment – First Attempt (continued)
      Best score for column might not be part of best alignment!

  16. Space Efficient Alignment – Second Attempt
      Alignment is a path from source $(0, 0)$ to target $(m, n)$ in edit graph.
      Maximum weight path from $(0, 0)$ to $(m, n)$ passes through $(i^*, n/2)$.
      Question: What is $i^*$?
      [Figure: DP table with the middle column n/2 and the vertex in row i* marked.]

  17. Hirschberg($i, j, i', j'$)
      1. if $j' - j > 1$
      2.   $i^* \leftarrow \arg\max_{i \le i'' \le i'} \mathrm{wt}(i'')$
      3.   Report $(i^*, j + \frac{j' - j}{2})$
      4.   Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
      5.   Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)

  18. Hirschberg($i, j, i', j'$) (continued)
      Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = $O(mn)$
      Space: $O(m)$

  19. Hirschberg($i, j, i', j'$) (continued)
      Question: How to reconstruct alignment from reported vertices?

  20. Hirschberg Algorithm: Reversing Edges Necessary?
      Max weight path from $(0, 0)$ to $(m, n)$ through $(i^*, n/2)$:
      $i^* = \arg\max_{0 \le i \le m} \{\mathrm{prefix}(i) + \mathrm{suffix}(i)\}$
      Compute $\{\mathrm{prefix}(i) \mid 0 \le i \le m\}$ in $O(mj)$ time and $O(m)$ space, by starting from $(0, 0)$ to $(m, j)$ keeping only two columns in memory. [single-source multiple destinations]
      [Figure: DP table with column j and the vertex in row i* marked.]

  21. Hirschberg Algorithm: Reversing Edges Necessary? (continued)
      Want: Compute $\{\mathrm{suffix}(i) \mid 0 \le i \le m\}$ in $O(mj)$ time and $O(m)$ space.
      Doing a longest path from each $(i, j)$ to $(m, n)$ (for all $0 \le i \le m$) will not achieve desired running time!
      Reversing edges enables single-source multiple destination computation in desired time and space bound!
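
One common way to realize the reversed-edge computation is to run the same two-column DP on the reversed strings. The sketch below finds $i^*$ for the middle column under that approach. The names `last_column` and `middle_vertex` are hypothetical, and the scoring of +1 match, -1 mismatch, -1 gap is an assumption.

```python
def last_column(v, w, match=1, mismatch=-1, gap=-1):
    """Return [s[i][len(w)] for i in 0..len(v)], keeping two columns."""
    prev = [i * gap for i in range(len(v) + 1)]
    for c in w:
        curr = [prev[0] + gap]
        for i in range(1, len(v) + 1):
            diag = prev[i - 1] + (match if v[i - 1] == c else mismatch)
            curr.append(max(diag, prev[i] + gap, curr[i - 1] + gap))
        prev = curr
    return prev


def middle_vertex(v, w):
    """Return (i_star, mid, score) for the middle column mid = len(w) // 2."""
    mid = len(w) // 2
    # prefix[i]: best score of a path from (0, 0) to (i, mid).
    prefix = last_column(v, w[:mid])
    # suffix[i]: best score of a path from (i, mid) to (m, n), obtained by
    # aligning the reversed suffixes and reading the result back to front.
    suffix = last_column(v[::-1], w[mid:][::-1])[::-1]
    i_star = max(range(len(v) + 1), key=lambda i: prefix[i] + suffix[i])
    return i_star, mid, prefix[i_star] + suffix[i_star]
```

For v = ATGTC and w = ATCGC this yields the vertex (2, 2) with total score 2, which equals the optimal global alignment score for that pair under the assumed scoring.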

  22. Hirschberg Algorithm: Reconstructing Alignment
      Hirschberg($i, j, i', j'$)
      1. if $j' - j > 1$
      2.   $i^* \leftarrow \arg\max_{0 \le i \le m} \mathrm{wt}(i)$
      3.   Report $(i^*, j + \frac{j' - j}{2})$
      4.   Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
      5.   Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
      DP table for v = ATGTC (rows) and w = ATCGC (columns):
      v \ w         A    T    C    G    C
               0    1    2    3    4    5
         0     0   -1   -2   -3   -4   -5
      A  1    -1    1    0   -1   -2   -3
      T  2    -2    0    2    1    0   -1
      G  3    -3   -1    1    1    2    1
      T  4    -4   -2    0    0    1    1
      C  5    -5   -3   -1    1    0    2
      Problem: Given reported vertices and scores $\{(i_0, 0, s_0), \ldots, (i_n, n, s_n)\}$, find intermediary vertices.
      Transposing matrix does not help, because gaps could occur in both input sequences:
      A T - G T C
      A T C G - C
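
One possible answer to the reconstruction problem, offered as a sketch rather than as the lecture's solution: record one row index per column during the recursion, then fill in each gap between consecutive reported vertices by aligning a piece of v against a single character of w. All function names are hypothetical, the scoring of +1 match, -1 mismatch, -1 gap is an assumption, and scores are recomputed instead of being carried along with the vertices.

```python
def last_column(v, w, match=1, mismatch=-1, gap=-1):
    """Return [s[i][len(w)] for i in 0..len(v)], keeping two columns."""
    prev = [i * gap for i in range(len(v) + 1)]
    for c in w:
        curr = [prev[0] + gap]
        for i in range(1, len(v) + 1):
            diag = prev[i - 1] + (match if v[i - 1] == c else mismatch)
            curr.append(max(diag, prev[i] + gap, curr[i - 1] + gap))
        prev = curr
    return prev


def report_vertices(v, w, i, j, i2, j2, rows):
    """Set rows[c] to a row on an optimal path for every column j < c < j2."""
    if j2 - j <= 1:
        return
    mid = j + (j2 - j) // 2
    sub = v[i:i2]
    prefix = last_column(sub, w[j:mid])
    suffix = last_column(sub[::-1], w[mid:j2][::-1])[::-1]
    rows[mid] = i + max(range(len(sub) + 1), key=lambda k: prefix[k] + suffix[k])
    report_vertices(v, w, i, j, rows[mid], mid, rows)
    report_vertices(v, w, rows[mid], mid, i2, j2, rows)


def align_to_single_char(seg, c, match=1, mismatch=-1, gap=-1):
    """Optimally align the string seg against the single character c."""
    best_k, best = None, (len(seg) + 1) * gap    # option: c faces a gap
    for k, x in enumerate(seg):                  # option: c faces seg[k]
        score = (len(seg) - 1) * gap + (match if x == c else mismatch)
        if score > best:
            best_k, best = k, score
    if best_k is None:
        return seg + "-", "-" * len(seg) + c
    return seg, "-" * best_k + c + "-" * (len(seg) - best_k - 1)


def hirschberg(v, w):
    """Return an optimal global alignment of v and w as two gapped strings."""
    m, n = len(v), len(w)
    if n == 0:
        return v, "-" * m
    rows = [0] * (n + 1)      # rows[0] = 0 is the source, rows[n] = m the target
    rows[n] = m
    report_vertices(v, w, 0, 0, m, n, rows)
    top, bottom = [], []
    for j in range(n):        # stitch the path between columns j and j + 1
        a, b = align_to_single_char(v[rows[j]:rows[j + 1]], w[j])
        top.append(a)
        bottom.append(b)
    return "".join(top), "".join(bottom)
```

Each stitched piece is a one-column subproblem, so gaps in either sequence are handled locally without storing the full table. The `rows` list adds $O(n)$ memory on top of the two DP columns.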

  23. Linear Space Alignment – The Hirschberg Algorithm

  24. Outline
      1. Recap of global, fitting, local and gapped alignment
      2. Space-efficient alignment
      3. Subquadratic time alignment
      Reading:
      • Jones and Pevzner. Chapters 7.1-7.4
      • Lecture notes
