Sequence-Structure Alignment: A General Formulation

  1. Sequence-Structure Alignment: A General Formulation
     “Unifying view on Edit Distance, SA&F, ...”
     IN
     • $S_1, \dots, S_k \in \Sigma^*$
     • $P_1, \dots, P_k$ with $P_i \subseteq \{1, \dots, |S_i|\}^2$: sets of basepairs
     • score on alignments
     OUT
     Alignment $A = (S^*_1, P^*_1, \dots, S^*_k, P^*_k)$ that maximizes $\mathrm{score}(A)$, where $S^*_i|_\Sigma = S_i$, “$P^*_i|_\Sigma$” $\subseteq P_i$, ...
     Exact conditions and score vary.
     Problem classes: restrict input and output structures, score.
     S. Will, 18.417, Fall 2011

  2. Alignment with Fixed Input Structures
     Jiang et al. A General Edit Distance between RNA Structures. JCB, 2002.
     • “$P^*_i|_\Sigma$” $= P_i$, i.e. output structure = input structure
     • score is a rather general edit distance (breaking of basepairs)
     • only pairwise, $k = 2$
     • efficient only for NESTED/CROSSING with “not so general score”

  3. Alignment with Fixed Input Structures: Pseudoknots
     • CROSSING/CROSSING, i.e. pseudoknots allowed
     • restricted pseudoknots: e.g., no crossing of 3 basepairs
       Patricia A. Evans. Finding common RNA pseudoknot structures in polynomial time. CPM 2006.
       [Figure: a) a three-knot, b) interleaved left-right endpoints]
       Möhl, Will, Backofen. Lifting prediction to alignment of RNA pseudoknots. RECOMB 2009.
     • general crossing:
       Möhl, Will, Backofen. Fixed parameter tractable alignment of RNA structures including arbitrary pseudoknots. CPM 2008.

  4. Simultaneous Alignment and Folding (SA&F)
     David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 1985.
     • “$P^*_i|_\Sigma$” $\subseteq P_i$
     • input structures crossing (all potential basepairs)
     • output structures non-crossing
     Example input:
       S_1 = ACGGACUUACGGACUUGACUCGGACU         (positions 1 to 26)
       S_2 = CGGAACGUAUACGGACUCCAGACUACGUGCA    (positions 1 to 31)
     [Figure: $P_1$ and $P_2$ drawn over the positions of $S_1$ and $S_2$]

  5. Example SA&F
     IN:
       S_1 = ACGGACUUACGGACUUGACUCGGACU         (positions 1 to 26)
       S_2 = CGGAACGUAUACGGACUCCAGACUACGUGCA    (positions 1 to 31)
     [Figure: $P_1$ and $P_2$ drawn over the positions of $S_1$ and $S_2$]
     OUT:
       P*_1 ≡ ----.(.((..(........)..)).)...----
       S*_1 = ----ACGGACUUACGGACUUGACUCGGACU----
       S*_2 = CGGAACGUAUACGGACUCCAGACUACG---UGCA
       P*_2 ≡ .....(.((..(........)..)).)---....
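The conditions of the general formulation can be checked mechanically on this example. The following is a minimal sketch, not part of the original slides; the helper names `project` and `column_pairs` are made up for illustration. It drops gap symbols to test $S^*_i|_\Sigma = S_i$ and reads the base pairs of each dot-bracket row as pairs of alignment columns.

```python
def project(row):
    """Project an alignment row to the sequence alphabet by dropping gaps."""
    return row.replace("-", "")


def column_pairs(dot_bracket):
    """Return base pairs as (open_column, close_column), 0-based and sorted."""
    stack, pairs = [], []
    for col, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            pairs.append((stack.pop(), col))
        # '.' (unpaired) and '-' (gap) contribute no base pair
    return sorted(pairs)


# Input sequences and output alignment rows, copied from the slide.
S1 = "ACGGACUUACGGACUUGACUCGGACU"
S2 = "CGGAACGUAUACGGACUCCAGACUACGUGCA"
P1_star = "----.(.((..(........)..)).)...----"
S1_star = "----ACGGACUUACGGACUUGACUCGGACU----"
S2_star = "CGGAACGUAUACGGACUCCAGACUACG---UGCA"
P2_star = ".....(.((..(........)..)).)---...."

# Removing the gaps from each aligned row must give back the input sequence.
print(project(S1_star) == S1, project(S2_star) == S2)
# In this example both structures pair exactly the same alignment columns.
print(column_pairs(P1_star) == column_pairs(P2_star))
```

Both structure rows pair the same four column pairs, which is what makes the output a consensus structure for the two aligned sequences.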

  6. Incomplete history of SA&F
     • 1985 Sankoff. Computationally heavy, no implementation
     • 1997 Foldalign (Gorodkin et al.), only stems, simpler energy
     • 2002 Dynalign (Mathews, Turner), first “full” implementation
     • 2004 PMcomp (Hofacker et al.), clever simplification
     • 2007 FoldalignM (Torarinsson et al.), PMcomp implementation
     • 2007 LocARNA (Will et al.), PMcomp-based, more time and space efficient, optionally local
     • 2008 RAF (Do et al.), PMcomp-based, sequence-sparsity, machine learning
     • 2011 LocARNA-P (Will et al.), efficient partition function

  7. PMcomp: A Realistic Nussinov-style Sankoff-Algorithm
     Idea:
     • Simplify the energy model of SA&F: loop-based (Zuker-style) ⇒ base-pair-based (Nussinov-style)
     • Advantage?
     • Problem?
     • Add realistic energy scoring again: McCaskill pair probabilities

  8. PMcomp: Nussinov-style Sankoff: Recursion
     \[
     M_{i\,j;\,k\,l} = \max
     \begin{cases}
       M_{i\,j-1;\,k\,l-1} + \sigma(A_j, B_l) \\
       M_{i\,j-1;\,k\,l} + \gamma \\
       M_{i\,j;\,k\,l-1} + \gamma \\
       \max_{j', l'} \; M_{i\,j'-1;\,k\,l'-1} + D_{j'\,j;\,l'\,l}
     \end{cases}
     \]
     \[
     D_{i\,j;\,k\,l} = M_{i+1\,j-1;\,k+1\,l-1} + \tau(i, j, k, l)
     \]
     [Figure: base-pair match case with positions $i, j', j$ in A and $k, l', l$ in B]
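The recursion can be written down almost literally as a memoized function. The following is a minimal sketch for toy inputs, not the PMcomp implementation. Assumptions not stated on the slide: positions are 1-based and inclusive, an empty subsequence aligned to a subsequence of length $d$ scores $d \cdot \gamma$, the inner maximum runs over $i \le j' < j$ and $k \le l' < l$, and `tau` returns negative infinity for pairs that must not be matched. The function name `pmcomp_score` and the toy scoring functions are made up for illustration.

```python
from functools import lru_cache

NEG_INF = float("-inf")


def pmcomp_score(A, B, sigma, gamma, tau):
    """Score of the best alignment of A and B under the base-pair-based recursion."""
    n, m = len(A), len(B)

    @lru_cache(maxsize=None)
    def M(i, j, k, l):
        # Best score for A[i..j] versus B[k..l]; j == i - 1 means "empty".
        if j < i:
            return gamma * (l - k + 1)
        if l < k:
            return gamma * (j - i + 1)
        best = max(
            M(i, j - 1, k, l - 1) + sigma(A[j - 1], B[l - 1]),  # match A_j, B_l
            M(i, j - 1, k, l) + gamma,                          # gap for A_j
            M(i, j, k, l - 1) + gamma,                          # gap for B_l
        )
        # Match base pair (j', j) of A with base pair (l', l) of B.
        for jp in range(i, j):
            for lp in range(k, l):
                best = max(best, M(i, jp - 1, k, lp - 1) + D(jp, j, lp, l))
        return best

    @lru_cache(maxsize=None)
    def D(i, j, k, l):
        return M(i + 1, j - 1, k + 1, l - 1) + tau(i, j, k, l)

    return M(1, n, 1, m)


# Toy scoring (hypothetical values): only G-C pairs may be matched.
A, B = "GAAAC", "GAAC"
sigma = lambda x, y: 1 if x == y else 0
can_pair = lambda s, i, j: (s[i - 1], s[j - 1]) == ("G", "C")
tau = lambda i, j, k, l: 3 if can_pair(A, i, j) and can_pair(B, k, l) else NEG_INF
print(pmcomp_score(A, B, sigma, -1, tau))
```

The two nested loops over `jp` and `lp` inside each of the $O(n^2 m^2)$ table entries are where the cubic factors in the running time come from.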

  10. PMcomp: Scoring
      \[
      M_{i\,j;\,k\,l} = \max
      \begin{cases}
        M_{i\,j-1;\,k\,l-1} + \sigma(A_j, B_l) \\
        M_{i\,j-1;\,k\,l} + \gamma \\
        M_{i\,j;\,k\,l-1} + \gamma \\
        \max_{j', l'} \; M_{i\,j'-1;\,k\,l'-1} + D_{j'\,j;\,l'\,l}
      \end{cases}
      \]
      \[
      D_{i\,j;\,k\,l} = M_{i+1\,j-1;\,k+1\,l-1} + \tau(i, j, k, l)
      \]
      Idea:
      • $\tau(i, j, k, l) = \Psi^A_{ij} + \Psi^B_{kl}$
      • $\Psi^A_{ij}$, $\Psi^B_{kl}$: log odds scores for base-pairs
      • “McCaskill”-basepair probabilities vs. background
      Hofacker et al. Alignment of RNA base pairing probability matrices. Bioinformatics, 2004.
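As an illustration of the scoring idea, the sketch below computes a plain log odds score from a base-pair probability and a background probability and adds the two scores to obtain $\tau$. This is an assumption-laden sketch: the slide does not give the exact formula, any normalization used by PMcomp is omitted, and the probabilities and the background value 0.05 are invented for the example.

```python
import math


def psi(p, p_background):
    """Plain log odds score of a base pair: log(p / p_background)."""
    return math.log(p / p_background)


def tau(psi_a, psi_b, a, b):
    """Score for matching base pair a of A with base pair b of B."""
    return psi_a[a] + psi_b[b]


# Hypothetical base-pair probabilities, e.g. as a McCaskill-style
# partition function computation might report them.
probs_a = {(1, 20): 0.9, (2, 19): 0.02}
probs_b = {(3, 25): 0.6}
P_BACKGROUND = 0.05  # assumed background probability

psi_a = {pair: psi(p, P_BACKGROUND) for pair, p in probs_a.items()}
psi_b = {pair: psi(p, P_BACKGROUND) for pair, p in probs_b.items()}

print(tau(psi_a, psi_b, (1, 20), (3, 25)))  # both pairs likelier than background
print(tau(psi_a, psi_b, (2, 19), (3, 25)))  # first pair rarer than background
```

A pair that is more probable than the background gets a positive score and one that is less probable gets a negative score, so matching two well-supported base pairs is rewarded.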

  11. Complexity PMcomp
      \[
      M_{i\,j;\,k\,l} = \max
      \begin{cases}
        M_{i\,j-1;\,k\,l-1} + \sigma(A_j, B_l) \\
        M_{i\,j-1;\,k\,l} + \gamma \\
        M_{i\,j;\,k\,l-1} + \gamma \\
        \max_{j', l'} \; M_{i\,j'-1;\,k\,l'-1} + D_{j'\,j;\,l'\,l}
      \end{cases}
      \]
      \[
      D_{i\,j;\,k\,l} = M_{i+1\,j-1;\,k+1\,l-1} + \tau(i, j, k, l)
      \]
      • $O(n^2 \cdot m^2)$ entries in $M$
      • per entry: $O(nm)$ time
      Total Complexity: $O(n^3 m^3)$ time, $O(n^2 m^2)$ space

  12. LocARNA: Making PMcomp/Sankoff practical
      Ideas:
      • follow PMcomp idea for scoring
      • only consider significant base pairs: “cut-off probability”
        [Figure: base pairs over sequence positions 1 to 64]
      • reformulate recursion
      • profit in time and space complexity

  13. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.005$ [Figure: base pairs over sequence positions 1 to 64]

  14. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.01$ [Figure: base pairs over sequence positions 1 to 64]

  15. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.05$ [Figure: base pairs over sequence positions 1 to 64]

  16. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.1$ [Figure: base pairs over sequence positions 1 to 64]
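The filtering step itself is simple. Here is a minimal sketch, assuming base-pair probabilities are given as a dictionary and that a pair is kept when its probability is at least the cutoff (whether the comparison is strict is an assumption). The function name and the toy probabilities are invented for illustration.

```python
def filter_base_pairs(pair_probs, p_cutoff):
    """Keep only base pairs whose probability reaches the cutoff."""
    return {pair: p for pair, p in pair_probs.items() if p >= p_cutoff}


# Hypothetical base-pair probabilities for one sequence.
toy_probs = {
    (1, 20): 0.92,
    (2, 19): 0.40,
    (3, 18): 0.07,
    (5, 12): 0.008,
    (6, 11): 0.002,
}

# The four cutoffs shown on the slides.
for cutoff in (0.005, 0.01, 0.05, 0.1):
    kept = filter_base_pairs(toy_probs, cutoff)
    print(cutoff, len(kept), sorted(kept))
```

Raising the cutoff shrinks the sets $P_1$ and $P_2$, which is what the sparse recursion exploits.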

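  17. LocARNA Basic Algorithm: Matrices
      [Figure: sequence A (positions 1 to n) with base pairs a1, a2, a3 and sequence B (positions 1 to m) with base pairs b1, b2, b3, b4; matrix D with rows a1 to a3 and columns b1 to b4]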

  18. LocARNA Basic Algorithm: Matrices
      [Figure: as on the previous slide, with an additional matrix M over positions 1 to n of A and 1 to m of B]

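  20. LocARNA Basic Algorithm: Recursion
      For base pairs $a = (a_l, a_r)$ and $b = (b_l, b_r)$:
      \[
      D(a, b) = M(a, b;\, a_r - 1, b_r - 1) + \tau(a, b)
      \]
      [Figure: the alignment enclosed by $a$ and $b$ splits into the alignment strictly inside the two base pairs plus the base-pair match]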

  21. LocARNA Basic Algorithm: Recursion
      For base pairs $a = (a_l, a_r)$ and $b = (b_l, b_r)$:
      \[
      M(a, b;\, i, j) = \max
      \begin{cases}
        M(a, b;\, i-1, j-1) + \sigma(A_i, B_j) \\
        M(a, b;\, i, j-1) + \gamma \\
        M(a, b;\, i-1, j) + \gamma \\
        \max_{a', b'} \; M(a, b;\, a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\ b'_r = j
      \end{cases}
      \]
      [Figure: the four cases, each drawn on A from $a_l + 1$ to $i$ and on B from $b_l + 1$ to $j$]

  22. LocARNA Basic Algorithm: Recursion
      \[
      M^{ab}(i, j) = \max
      \begin{cases}
        M^{ab}(i-1, j-1) + \sigma(A_i, B_j) \\
        M^{ab}(i-1, j) + \gamma \\
        M^{ab}(i, j-1) + \gamma \\
        \max_{a', b'} \; M^{ab}(a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\ b'_r = j
      \end{cases}
      \]
      \[
      D(a, b) = M^{ab}(a_r - 1, b_r - 1) + \tau(a, b)
      \]
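The recursion can be sketched in a few lines of Python. This is an illustrative toy version under stated assumptions, not the LocARNA implementation: base pairs are 1-based tuples `(left, right)` stored as keys of the dictionaries `psi_a` and `psi_b`, $\tau(a, b)$ is taken as `psi_a[a] + psi_b[b]`, an empty prefix aligned to a prefix of length $d$ scores $d \cdot \gamma$, one matrix is shared by all base-pair pairs with the same left ends $(a_l, b_l)$, and the final score comes from one extra matrix over the whole sequences. The name `locarna_score` and the toy scores are made up.

```python
def locarna_score(A, B, psi_a, psi_b, sigma, gamma):
    """Best global score when only base pairs in psi_a / psi_b may be matched."""
    n, m = len(A), len(B)
    by_right_a, by_right_b = {}, {}
    for a in psi_a:
        by_right_a.setdefault(a[1], []).append(a)
    for b in psi_b:
        by_right_b.setdefault(b[1], []).append(b)
    D = {}  # D[a, b]: best score of the alignment enclosed by a and b

    def fill(al, bl, i_max, j_max):
        # M[i, j]: best score for A[al+1..i] versus B[bl+1..j].
        M = {}
        for i in range(al, i_max + 1):
            for j in range(bl, j_max + 1):
                if i == al or j == bl:
                    M[i, j] = gamma * ((i - al) + (j - bl))
                    continue
                best = max(
                    M[i - 1, j - 1] + sigma(A[i - 1], B[j - 1]),
                    M[i - 1, j] + gamma,
                    M[i, j - 1] + gamma,
                )
                # Match base pairs a', b' that end at i, j and start after al, bl.
                for a2 in by_right_a.get(i, ()):
                    for b2 in by_right_b.get(j, ()):
                        if a2[0] > al and b2[0] > bl:
                            best = max(best, M[a2[0] - 1, b2[0] - 1] + D[a2, b2])
                M[i, j] = best
        return M

    # Decreasing left ends guarantee that inner D entries already exist.
    for al in sorted({a[0] for a in psi_a}, reverse=True):
        rights_a = [a[1] for a in psi_a if a[0] == al]
        for bl in sorted({b[0] for b in psi_b}, reverse=True):
            rights_b = [b[1] for b in psi_b if b[0] == bl]
            M = fill(al, bl, max(rights_a) - 1, max(rights_b) - 1)
            for ar in rights_a:
                for br in rights_b:
                    D[(al, ar), (bl, br)] = (
                        M[ar - 1, br - 1] + psi_a[al, ar] + psi_b[bl, br]
                    )
    return fill(0, 0, n, m)[n, m]


sigma = lambda x, y: 1 if x == y else 0
print(locarna_score("GAAAC", "GAAC", {(1, 5): 1.5}, {(1, 4): 1.5}, sigma, -1))
```

Only base pairs that are present in the two dictionaries are ever considered, so the amount of work shrinks with the number of pairs that survive the probability cutoff.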

  23. Complexity LocARNA
      \[
      M^{ab}(i, j) = \max
      \begin{cases}
        M^{ab}(i-1, j-1) + \sigma(A_i, B_j) \\
        M^{ab}(i-1, j) + \gamma \\
        M^{ab}(i, j-1) + \gamma \\
        \max_{a', b'} \; M^{ab}(a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\ b'_r = j
      \end{cases}
      \]
      \[
      D(a, b) = M^{ab}(a_r - 1, b_r - 1) + \tau(a, b)
      \]
      • compute $D(a, b)$ for all base-pair edges $a \in P_1$, $b \in P_2$ [and $a$, $b$ compatible] ⇒ $O(|P_1||P_2|)$
      • combine $D(a, b)$-computation for common $(a_l, b_l)$ ⇒ $O(nm)$
      • per $(a_l, b_l)$: $O(nm \cdot \mathrm{rdeg}_1 \, \mathrm{rdeg}_2)$
      Total Complexity: $O(nm\,|P_1||P_2|)$ time, $O(|P_1||P_2| + nm)$ space

  24. Affine Gap Cost
      • Basic algorithm: linear gap cost
      • Affine gap cost $g(k) = \alpha + \beta \cdot k$: à la Gotoh
      [Figure: matrices M, E and F over positions 1 to m and 1 to n]

  25. Affine Gap Cost
      \[
      M^{ab}(i, j) = \max
      \begin{cases}
        M^{ab}(i-1, j-1) + \sigma(A_i, B_j) \\
        E^{ab}_i(j) \\
        F^{ab}_{i\,j} \\
        \max_{a', b'} \; M^{ab}(a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\ b'_r = j
      \end{cases}
      \]
      \[
      D(a, b) = M^{ab}(a_r - 1, b_r - 1) + \tau(a, b)
      \]
      \[
      E^{ab}_i(j) = \max \{\, E^{ab}_{i-1}(j) + \beta,\; M^{ab}(i-1, j) + \alpha + \beta \,\}
      \]
      \[
      F^{ab}_{i\,j} = \max \{\, F^{ab}_{i\,j-1} + \beta,\; M^{ab}(i, j-1) + \alpha + \beta \,\}
      \]
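To isolate the gap handling, the sketch below applies the same E and F recurrences to plain sequence alignment, leaving out the base-pair case and the $ab$ superscripts. Assumptions: $\alpha$ and $\beta$ are non-positive score contributions, so a gap of length $k$ adds $\alpha + \beta \cdot k$; leading gaps are scored by the same recurrences; the function name `gotoh_score` and the toy scores are invented for the example.

```python
NEG_INF = float("-inf")


def gotoh_score(A, B, sigma, alpha, beta):
    """Global alignment score with affine gap score alpha + beta * k."""
    n, m = len(A), len(B)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best overall score
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # A_i aligned to a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # B_j aligned to a gap
    M[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                # Extend an open gap (beta) or open a new one (alpha + beta).
                E[i][j] = max(E[i - 1][j] + beta, M[i - 1][j] + alpha + beta)
            if j > 0:
                F[i][j] = max(F[i][j - 1] + beta, M[i][j - 1] + alpha + beta)
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = M[i - 1][j - 1] + sigma(A[i - 1], B[j - 1])
            M[i][j] = max(diag, E[i][j], F[i][j])
    return M[n][m]


sigma = lambda x, y: 2 if x == y else -1
# One gap of length 2: 2 + 2 + (alpha + 2 * beta) = 4 - 5
print(gotoh_score("ACGU", "AU", sigma, alpha=-3, beta=-1))
```

With an affine score a single long gap is cheaper than several short ones of the same total length, because the opening term $\alpha$ is paid once per gap.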
