Pairwise RNA Edit Distance


  1. → Pairwise RNA Edit Distance
     • In the following:
       • sequences $S_1$ and $S_2$
       • associated structures $P_1$ and $P_2$
       • scoring of alignment: different edit operations
     • Input 1), sequence and structure:
           ACGUUGACUGACAACAC
           ..(((....))).....
     • Input 2), sequence and structure:
           ACGAUCACGUACUAGCCUGAC
           ....(((.((....)).))).
     • Alignment (rows: structure 1, sequence 1, sequence 2, structure 2):
           ---.---.(((....)))...--..
           ---A---CGUUGACUGACAAC--AC
           ACGAUCACGU--ACUAGC--CUGAC
           ....(((.((--....))--.))).
       (figure labels on this alignment: arc altering, arc removing, base deletion, arc mismatch, base match, arc match)
     • Notation:
       • $S_k[i]$: position $i$ in sequence $k$ (for $k = 1, 2$)
       • $S_k[i]$ is free if there is no arc in $P_k$ incident to $i$
     • Jiang et al., 2002:
       • above scoring scheme
       • complexity of different problem classes
       • algorithms
     S. Will, 18.417, Fall 2011
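
The notation above can be made concrete with a small sketch. It assumes structures are given as nested dot-bracket strings, as on the slide; the helper names `arcs_from_dot_bracket` and `free_positions` are hypothetical and not part of the lecture material.

```python
def arcs_from_dot_bracket(structure):
    """Return the arcs (i, j), 1-based with i < j, of a nested dot-bracket string."""
    stack, arcs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)                 # remember the open left end
        elif ch == ")":
            arcs.append((stack.pop(), pos))   # close the innermost open arc
    return sorted(arcs)


def free_positions(structure):
    """Positions with no incident arc, i.e. the free bases S_k[i]."""
    paired = {p for arc in arcs_from_dot_bracket(structure) for p in arc}
    return [i for i in range(1, len(structure) + 1) if i not in paired]


P1 = "..(((....)))....."  # structure of input 1) on the slide
print(arcs_from_dot_bracket(P1))  # [(3, 12), (4, 11), (5, 10)]
print(free_positions(P1))         # [1, 2, 6, 7, 8, 9, 13, 14, 15, 16, 17]
```

A stack suffices here only because the input is assumed to be nested; crossing structures would need a richer notation or an explicit arc list.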

  2. → Edit Distance – Scores
     • base scoring: base mismatch $w_m$, base indel $w_d$
     • case 1: arc match and arc mismatch, for arcs $(i_1, i_2) \in P_1$ and $(j_1, j_2) \in P_2$
       • arc match (cost 0): $S_1[i_1] = S_2[j_1]$ and $S_1[i_2] = S_2[j_2]$
       • arc mismatch: $S_1[i_1] \neq S_2[j_1]$ or $S_1[i_2] \neq S_2[j_2]$
       • cost for mismatch:
         • if both ends differ: $w_{am}$
         • if only one differs: $\frac{w_{am}}{2}$
     • in the following: different ways of deleting arcs
       cost: cost for deleting arc + cost for base operations
     • case 2: arc breaking
       • $(i_1, i_2)$ is in $P_1$, but $(j_1, j_2)$ is not in $P_2$
       • cost: $w_b$ + possibly $2 \cdot w_m$

  3. → Edit Distance – Scores (Cont.)
     • case 3: arc altering
       • cost: $w_a$ + possibly $w_m$
     • case 4: arc removing
       • cost: $w_r$
     • remark: arc breaking/altering/removal can overlap

  4. → Edit Distance – Scores Summary
     • operations on single bases:
       • base insertion/deletion ($w_d$)
       • base mismatch ($w_m$)
     • operations that act on both ends of an arc:
       1. arc mismatch ($w_{am}$)
       2. arc breaking ($w_b$)
       3. arc altering ($w_a$)
       4. arc removing ($w_r$)
     Example:
           1234567890123456
           (..)((.(.)))(..)
           CCGGAGGCCGCUCCCG
           CCG-ACCC-CGU-CC-
           (.).((....))....

  5. → Plan
     1. Jiang algorithm solves the edit problem given the following restrictions:
        • non-crossing (aka nested aka pseudoknot-free) input structures [1]
        • pairwise alignment only
        • scoring restricted by $w_a = \frac{w_r + w_b}{2}$
     2. show MAX-SNP-hardness without the restriction $w_a = \frac{w_r + w_b}{2}$
     [1] actually, we will see that crossing in at most one structure is OK

  6. → Restriction $w_a = \frac{w_r + w_b}{2}$
     • Arc altering is at one end like arc removing and at the other end like arc breaking
     • Restriction $w_a = \frac{w_r + w_b}{2}$ captures that
       ⇒ left and right ends of arcs can be scored independently if they are broken, deleted or altered.
       ⇒ cost for arc end deletion $w^{\mathrm{end}}_d$ and breaking $w^{\mathrm{end}}_b$ instead of $w_r$, $w_b$, and $w_a$:
           $w_b = 2 \cdot w^{\mathrm{end}}_b$
           $w_r = 2 \cdot w^{\mathrm{end}}_d$
           $w_a = \frac{w_r + w_b}{2} = w^{\mathrm{end}}_b + w^{\mathrm{end}}_d$
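
As a small illustration of this restriction, the sketch below splits the three arc costs into the two per-end costs and rejects scorings that violate $w_a = \frac{w_r + w_b}{2}$. The function name `end_weights` is hypothetical and chosen only for this example.

```python
import math


def end_weights(w_b, w_r, w_a):
    """Return (w_b_end, w_d_end) for arc costs that satisfy w_a = (w_r + w_b) / 2."""
    if not math.isclose(w_a, (w_r + w_b) / 2):
        raise ValueError("scoring violates w_a = (w_r + w_b) / 2")
    # w_b = 2 * w_b_end and w_r = 2 * w_d_end, so each end carries half the arc cost
    return w_b / 2, w_r / 2


w_b_end, w_d_end = end_weights(w_b=1.0, w_r=3.0, w_a=2.0)
print(w_b_end, w_d_end)   # 0.5 1.5
print(w_b_end + w_d_end)  # 2.0, which is w_a again
```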

  7. → Independent Arc Scoring
     • cost for arc end deletion $w^{\mathrm{end}}_d$ and breaking $w^{\mathrm{end}}_b$
     • arc breaking: $w_b = 2 \cdot w^{\mathrm{end}}_b$
     • arc removing: $w_r = 2 \cdot w^{\mathrm{end}}_d$
     • arc altering: $w_a = w^{\mathrm{end}}_b + w^{\mathrm{end}}_d$
     Hence: The cost of breaking or removing one end of the arc is independent of whether the other end is broken/removed or not. Only the cost of matching one end of an arc depends on whether the other end is matched, too.

  8. → Example
     • cost for arc end deletion $w^{\mathrm{end}}_d$ and breaking $w^{\mathrm{end}}_b$
     • arc breaking: $w_b = 2 \cdot w^{\mathrm{end}}_b$
     • arc removing: $w_r = 2 \cdot w^{\mathrm{end}}_d$
     • arc altering: $w_a = w^{\mathrm{end}}_b + w^{\mathrm{end}}_d$
           1234567890123456
           (..)((.(.)))(..)
           CCGGAGGCCGCUCCCG
           CCG-ACCC-CGU-CC-
           (.).((....))....
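
The example alignment can be scored column by column once arc ends are scored independently. The following is a sketch under assumptions, not the lecture's code: the function names and the concrete weight values are made up for illustration, and two aligned arcs are always scored as an arc match/mismatch (never as two broken arcs).

```python
def partners(structure):
    """Map each paired column (0-based) of an aligned dot-bracket row to its partner."""
    stack, partner = [], {}
    for col, ch in enumerate(structure):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            left = stack.pop()
            partner[left], partner[col] = col, left
    return partner


def alignment_cost(seq1, str1, seq2, str2, w_m, w_d, w_am, w_b_end, w_d_end):
    """Cost of a given alignment (rows of equal length, '-' marks a gap)."""
    p1, p2 = partners(str1), partners(str2)
    cost = 0.0
    for col, (a, b) in enumerate(zip(seq1, seq2)):
        if a == "-" or b == "-":
            # base aligned to a gap: arc end deletion if the base is paired
            paired = col in (p1 if b == "-" else p2)
            cost += w_d_end if paired else w_d
        elif col in p1 and col in p2 and p1[col] == p2[col]:
            # both arcs span the same two columns: arc match or mismatch,
            # each differing end contributes w_am / 2
            cost += (a != b) * w_am / 2
        else:
            # base (mis)match plus one broken arc end per paired base
            cost += (a != b) * w_m + ((col in p1) + (col in p2)) * w_b_end
    return cost


# Illustrative weights (assumed): w_b = 1, w_r = 3, hence w_a = 2.
weights = dict(w_m=1.0, w_d=1.0, w_am=2.0, w_b_end=0.5, w_d_end=1.5)
print(alignment_cost("CCGGAGGCCGCUCCCG", "(..)((.(.)))(..)",
                     "CCG-ACCC-CGU-CC-", "(.).((....))....", **weights))  # 12.0
```

With these weights the total decomposes as $w_a + 2 w_b + w_r + w_{am} + 2 w_m + w_d = 2 + 2 + 3 + 2 + 2 + 1 = 12$: one altered arc, two broken arcs, one removed arc, one arc mismatch at both ends, two base mismatches and one deletion of a free base.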

  9. → How to make a DP algorithm for alignment?
     dynamic programming ⇒ compute optimal alignment recursively from optimal alignments of “fragments”
     questions to answer:
     • what kind of “fragments” do we consider? (⇒ semantics of a matrix entry)
     • how to compute the solutions for all these fragments? (⇒ recursion equation)
     • complexity
     • details (evaluation order, implementation details, ...)

  10. → Semantics of DP entry $D(i, i', j, j')$
      $D(i, i', j, j')$ is the minimum cost of aligning the fragment $[i, i']$ of the first sequence to the fragment $[j, j']$ of the second sequence, given that no arcs are matched that have one end inside these fragments and one end outside.
      Remarks
      • The additional restriction makes the alignment of the fragments independent of the alignment of the remaining parts.
      • We will see later why it is not sufficient to look at (alignments of) prefixes, as is done for plain sequence alignment.

  11. → Recursion for $D(i, i', j, j')$
      $$
      D(i, i', j, j') = \min
      \begin{cases}
      D(i, i'-1, j, j') + w_d + \psi_1(i')\,(w^{\mathrm{end}}_d - w_d) \\
      D(i, i', j, j'-1) + w_d + \psi_2(j')\,(w^{\mathrm{end}}_d - w_d) \\
      D(i, i'-1, j, j'-1) + \chi(i', j')\, w_m + (\psi_1(i') + \psi_2(j'))\, w^{\mathrm{end}}_b \\
      D(i, i_1-1, j, j_1-1) + D(i_1+1, i'-1, j_1+1, j'-1) + (\chi(i_1, j_1) + \chi(i', j'))\, \frac{w_{am}}{2}
      \end{cases}
      $$
      The last case applies only if $\exists\, (a_1, a_2) = ((i_1, i'), (j_1, j')) \in P_1 \times P_2$ for some $i_1$, $j_1$.
      Notation
      • $\psi_1(i) = 1$ if $i$ is paired in structure 1, 0 otherwise ($\psi_2(i)$ analogous)
      • $\chi(i, j) = 1$ if $S_1[i] \neq S_2[j]$, 0 otherwise
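
The recursion can be transcribed almost literally into a memoized function. This is a minimal sketch, assuming nested dot-bracket input, 1-based positions, an empty fragment written as $i' = i - 1$, and the requirement $i_1 \geq i$, $j_1 \geq j$ for the arc case (which follows from the semantics of $D$). The names `edit_distance` and `right_to_left` and the weight values are illustrative assumptions. It uses the full four-dimensional table, so it is only meant for short sequences.

```python
from functools import lru_cache
from math import inf


def right_to_left(structure):
    """Map the right end of each arc to its left end (1-based, nested dot-bracket)."""
    stack, left_of = [], {}
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            left_of[pos] = stack.pop()
    return left_of


def edit_distance(s1, p1, s2, p2, w_m, w_d, w_am, w_b_end, w_d_end):
    left1, left2 = right_to_left(p1), right_to_left(p2)
    paired1 = set(left1) | set(left1.values())  # psi_1(i) = 1 iff i in paired1
    paired2 = set(left2) | set(left2.values())

    def chi(i, j):
        return s1[i - 1] != s2[j - 1]

    @lru_cache(maxsize=None)
    def D(i, ip, j, jp):
        if ip < i and jp < j:  # both fragments empty
            return 0.0
        best = inf
        if ip >= i:  # i' aligned to a gap
            best = min(best, D(i, ip - 1, j, jp) + (w_d_end if ip in paired1 else w_d))
        if jp >= j:  # j' aligned to a gap
            best = min(best, D(i, ip, j, jp - 1) + (w_d_end if jp in paired2 else w_d))
        if ip >= i and jp >= j:
            # i' aligned to j'; every arc end at i' or j' is broken
            best = min(best, D(i, ip - 1, j, jp - 1) + chi(ip, jp) * w_m
                       + ((ip in paired1) + (jp in paired2)) * w_b_end)
            i1, j1 = left1.get(ip), left2.get(jp)
            if i1 is not None and j1 is not None and i1 >= i and j1 >= j:
                # arcs (i1, i') and (j1, j') are matched
                best = min(best, D(i, i1 - 1, j, j1 - 1) + D(i1 + 1, ip - 1, j1 + 1, jp - 1)
                           + (chi(i1, j1) + chi(ip, jp)) * w_am / 2)
        return best

    return D(1, len(s1), 1, len(s2))


weights = dict(w_m=1.0, w_d=1.0, w_am=2.0, w_b_end=0.5, w_d_end=1.5)
print(edit_distance("CCGGAGGCCGCUCCCG", "(..)((.(.)))(..)",
                    "CCGACCCCGUCC", "(.)((...))..", **weights))
```

The usage line aligns the two ungapped inputs of the example slide; its result cannot exceed the cost of the alignment shown there under the same weights, since that alignment is one of the candidates the recursion considers.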

  12. → An optimized version: Jiang Algorithm
      • $D(i, i', j, j')$: alignment of subsequences
      • in principle: all regions $[i..i']$ and $[j..j']$ ⇒ $O(n^2 m^2)$ space
      • But: not all entries are considered
        (figure: only fragments $[a_1^l+1..i]$ and $[a_2^l+1..j]$ that start directly after the left ends $a_1^l$, $a_2^l$ of arcs $a_1$, $a_2$)
      • Hence: $O(nm)$ matrices $M^{a_1}_{a_2}$, one for each pair of arcs $a_1$, $a_2$. Each matrix: $O(nm)$ entries $M^{a_1}_{a_2}(i, j)$

  13. → Jiang Recursion
      • reformulated recursion:
      $$
      M^{a_1}_{a_2}(i, j) = \min
      \begin{cases}
      M^{a_1}_{a_2}(i-1, j) + w_d + \psi_1(i)\,(w^{\mathrm{end}}_d - w_d) & \text{($i$ aligned to gap)} \\
      M^{a_1}_{a_2}(i, j-1) + w_d + \psi_2(j)\,(w^{\mathrm{end}}_d - w_d) & \text{($j$ aligned to gap)} \\
      M^{a_1}_{a_2}(i-1, j-1) + \chi(i, j)\, w_m + (\psi_1(i) + \psi_2(j))\, w^{\mathrm{end}}_b & \text{($i$ aligned to $j$, bonds broken)} \\
      M^{a_1}_{a_2}(i'-1, j'-1) + M^{a'_1}_{a'_2}(i-1, j-1) + (\chi(i', j') + \chi(i, j))\, \frac{w_{am}}{2} & \text{(arcs $a'_1$, $a'_2$ matched)}
      \end{cases}
      $$
      The last case applies if $i$ and $j$ are the right ends of arcs $a'_1 = (i', i) \in P_1$ and $a'_2 = (j', j) \in P_2$.

  14. → Complexity
      • time complexity: $O(nm)$ arc pairs × $O(nm)$ alignment below arcs = $O(n^2 m^2)$ time
      • remaining question: space complexity
        • each entry of some $M^{a_1}_{a_2}$ only depends on
          • other entries of the same matrix $M^{a_1}_{a_2}$
          • and final entries of arc pairs of smaller arcs
        (figure: arc $a_1$ spans $a_1^l$ to $a_1^r$ and its matrix covers positions $a_1^l+1$ to $a_1^r-1$; likewise for $a_2$)
        ⇒ store final values in a separate $O(nm)$ matrix $F$ (in the recursion, replace the lookup $M^{a'_1}_{a'_2}(i-1, j-1)$ by $F(a'_1, a'_2)$)
      • ⇒ it suffices to keep only $F$ and one $M^{a_1}_{a_2}$ in memory simultaneously
      • compute all $M^{a_1}_{a_2}$ ordered (increasing) according to the size of $a_1$ and $a_2$

  15. → Complexity
      • Matrix $F$: $O(nm)$ space
      • only one matrix $M^{a_1}_{a_2}$ at a time: $O(nm)$ space
        argument: for computing one entry $M^{a_1}_{a_2}(i, j)$, recurse only to $F(a'_1, a'_2)$ for “smaller” $a'_1$, $a'_2$ or to entries of the same matrix $M^{a_1}_{a_2}$
        consequence: reuse space for $M^{a_1}_{a_2}$
      • TOTAL: $O(nm + nm) = O(nm)$ space
      drawback: traceback requires recomputation, but only $O(\min(n, m))$ many matrices $M^{a_1}_{a_2}$ need to be recomputed.
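
The space-saving scheme can be sketched as follows: arc pairs are processed by increasing arc size, a single matrix $M$ is alive at any time, and only its final entry is kept in $F$. This is an illustrative sketch, not the lecture's implementation. It assumes nested dot-bracket input, returns the distance only (no traceback), and adds a pseudo-arc $(0, n+1)$ around each sequence so that the overall result is also a final entry; the names `jiang_distance` and `arcs_of` and the weights are assumptions.

```python
from math import inf


def arcs_of(structure):
    """Arcs (left, right), 1-based, of a nested dot-bracket string."""
    stack, arcs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            arcs.append((stack.pop(), pos))
    return arcs


def jiang_distance(s1, p1, s2, p2, w_m, w_d, w_am, w_b_end, w_d_end):
    n, m = len(s1), len(s2)
    arcs1, arcs2 = arcs_of(p1), arcs_of(p2)
    left1 = {r: l for l, r in arcs1}
    left2 = {r: l for l, r in arcs2}
    paired1 = set(left1) | set(left1.values())
    paired2 = set(left2) | set(left2.values())

    def chi(i, j):
        return s1[i - 1] != s2[j - 1]

    def size(arc):
        return arc[1] - arc[0]

    # The pseudo-arcs (0, n+1) and (0, m+1) enclose everything and come last.
    order1 = sorted(arcs1 + [(0, n + 1)], key=size)
    order2 = sorted(arcs2 + [(0, m + 1)], key=size)
    F = {}  # final values, one per arc pair: O(nm) entries
    for l1, r1 in order1:
        for l2, r2 in order2:
            rows, cols = r1 - l1, r2 - l2
            # M[x][y] aligns [l1+1 .. l1+x] with [l2+1 .. l2+y]; reused per arc pair
            M = [[0.0] * cols for _ in range(rows)]
            for x in range(rows):
                for y in range(cols):
                    if x == 0 and y == 0:
                        continue
                    i, j, best = l1 + x, l2 + y, inf
                    if x > 0:
                        best = min(best, M[x - 1][y] + (w_d_end if i in paired1 else w_d))
                    if y > 0:
                        best = min(best, M[x][y - 1] + (w_d_end if j in paired2 else w_d))
                    if x > 0 and y > 0:
                        best = min(best, M[x - 1][y - 1] + chi(i, j) * w_m
                                   + ((i in paired1) + (j in paired2)) * w_b_end)
                        i1, j1 = left1.get(i), left2.get(j)
                        if i1 is not None and j1 is not None and i1 > l1 and j1 > l2:
                            # smaller arc pair: look up its final value in F
                            best = min(best, M[i1 - 1 - l1][j1 - 1 - l2]
                                       + F[(i1, i), (j1, j)]
                                       + (chi(i1, j1) + chi(i, j)) * w_am / 2)
                    M[x][y] = best
            F[(l1, r1), (l2, r2)] = M[rows - 1][cols - 1]
    return F[(0, n + 1), (0, m + 1)]


weights = dict(w_m=1.0, w_d=1.0, w_am=2.0, w_b_end=0.5, w_d_end=1.5)
print(jiang_distance("CCGGAGGCCGCUCCCG", "(..)((.(.)))(..)",
                     "CCGACCCCGUCC", "(.)((...))..", **weights))
```

Sorting by arc size guarantees that every arc pair nested inside the current one already has its entry in $F$, which is why the matrix of a finished arc pair can be discarded.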

  16. → What about Pseudoknots?
      • Why doesn’t the algorithm work for pseudoknots?
        ⇒ the last recursion case does not cover cases where matched arcs cross (compare Nussinov)
      • only matching of crossing arcs is a problem
        ⇒ pseudoknots in only one of the structures are OK.

  17. → The alignment hierarchy
      • Alignment approaches have different limitations concerning
        • the two input structures
        • the common superstructure (e.g. for tree alignment ⇒ nested)
        • the set of edit operations
      • the alignment hierarchy classifies alignment problems as
            input1 × input2 → superstructure
        with input1, input2, superstructure each being one of
        • plain: only plain sequence (no base pairs at all)
        • nest: only nested structures (no pseudoknots)
        • cross: crossing structures (pseudoknots)
        • unlim: unlimited, also several base pairs per base possible
      • Examples:
        • cross × nest → unlim: Jiang algorithm
        • nest × nest → nest: tree alignment
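
To place a given structure in this hierarchy, one can test its arcs directly. The sketch below is an illustration with a hypothetical function name `classify`; it takes a list of arcs $(i, j)$ with $i < j$ and returns the most restrictive of the four classes that still contains the structure.

```python
def classify(arcs):
    """Return 'plain', 'nest', 'cross' or 'unlim' for a list of arcs (i, j), i < j."""
    arcs = sorted(set(arcs))
    if not arcs:
        return "plain"  # no base pairs at all
    ends = [p for arc in arcs for p in arc]
    if len(ends) != len(set(ends)):
        return "unlim"  # some base takes part in several base pairs
    for idx, (i, j) in enumerate(arcs):
        for k, l in arcs[idx + 1:]:
            if i < k < j < l:  # (k, l) starts inside (i, j) and ends outside
                return "cross"
    return "nest"


print(classify([]))                            # plain
print(classify([(3, 12), (4, 11), (5, 10)]))   # nest
print(classify([(1, 5), (3, 8)]))              # cross
print(classify([(1, 5), (1, 8)]))              # unlim
```

The pairwise crossing test is quadratic in the number of arcs, which is adequate for a quick check on inputs of the size used in these slides.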
