Part 2: Comparative Analysis of RNAs, S.Will, 18.417, Fall 2011 (PowerPoint presentation)


  1. → Part 2 Comparative Analysis of RNAs S.Will, 18.417, Fall 2011

  2. → Example
       Given: set of related RNA sequences
         >AF008220
         GGAGGAUUAGCUCAGCUGGGAGAGCAUCUGCCUUACAAGCAGAGGGUCGGCGGUUCGAGCCCGUCAUCCUCCA
         >M68929
         GCGGAUAUAACUUAGGGGUUAAAGUUGCAGAUUGUGGCUCUGAAAACACGGGUUCGAAUCCCGUUAUUCGCC
         >X02172
         GCCUUUAUAGCUUAGUGGUAAAGCGAUAAACUGAAGAUUUAUUUACAUGUAGUUCGAUUCUCAUUAAGGGCA
         >Z11880
         GCCUUCCUAGCUCAGUGGUAGAGCGCACGGCUUUUAACCGUGUGGUCGUGGGUUCGAUCCCCACGGAAGGCG
         >D10744
         GGAAAAUUGAUCAUCGGCAAGAUAAGUUAUUUACUAAAUAAUAGGAUUUAAUAACCUGGUGAGUUCGAAUCUCACAUUUUCCG
       Wanted: learn about evolutionary relation
         AF008220  GGAGGAUU-AGCUCAGCUGGGAGAGCAUCUGCCUUACAAGC---------AGAGGGUCGGCGGUUCGAGCCCGUCAUCCUCCA
         M68929    GCGGAUAU-AACUUAGGGGUUAAAGUUGCAGAUUGUGGCUC---------UGAAAA-CACGGGUUCGAAUCCCGUUAUUCGCC
         X02172    GCCUUUAU-AGCUUAG-UGGUAAAGCGAUAAACUGAAGAUU---------UAUUUACAUGUAGUUCGAUUCUCAUUAAGGGCA
         Z11880    GCCUUCCU-AGCUCAG-UGGUAGAGCGCACGGCUUUUAACC---------GUGUGGUCGUGGGUUCGAUCCCCACGGAAGGCG
         D10744    GGAAAAUUGAUCAUCGGCAAGAUAAGUUAUUUACUAAAUAAUAGGAUUUAAUAACCUGGUGAGUUCGAAUCUCACAUUUUCCG
         consensus (((((((...((((........))))((((((.......)).........))))....(((((.......)))))))))))).
       Remarks
         • Usually, we only know the sequences of RNAs. Why?
         • Important for evolution: sequence AND structure. Why?

  3. → Comparative RNA Analysis
       [Figure: two RNA sequences A and B, their alignment, the consensus, and the consensus structure; adopted from Gardner & Giegerich, BMC 2004]

  4. → Comparative RNA Analysis
       [Figure: two RNA sequences A and B, their alignment, the consensus, and the consensus structure; adopted from Gardner & Giegerich, BMC 2004]
       Remarks
         • Here, Comparative RNA Analysis refers to this problem: given a set of RNA sequences, how to match them (alignment) and what is their common structure (consensus structure)?
         • In general: multiple sequences; here: only pairwise.

  6. → Comparative RNA Analysis: Plan A
       [Figure: single sequences A and B → ALIGN → alignment of A and B → FOLD → consensus and consensus structure]

  7. → Comparative RNA Analysis: Plan A
       [Figure: single sequences A and B → ALIGN → alignment of A and B → FOLD → consensus and consensus structure]
       Remarks
         • First, simplest way. We will see two further plans.
         • ALIGN: sequence alignment
         • FOLD: we will generalize prediction for single sequences

  8. → Sequence Alignment, a slightly new definition
       Example
         In:  A = ACGTAA, B = ACCCT
         Out: AC-GTAA
              ACCCT--
         “match/mismatch”, “insertion”, “deletion”
       Definition (Alignment (as set of alignment edges))
         An alignment of two (RNA) sequences $A$ and $B$, $n = |A|$, $m = |B|$, is a set $\mathcal{A}$ of alignment edges, where
         1. for $1 \le i \le n$ and $1 \le j \le m$, an alignment edge is either a matching edge $(i, j)$ or a gap edge $(i, -)$ or $(-, j)$.
         2. matching edges do not conflict: $\forall (i, j), (i', j') \in \mathcal{A}: i < i' \implies j < j'$
         3. “degree is 1”:
            • $\forall i: (i, -) \in \mathcal{A} \lor \exists! j: (i, j) \in \mathcal{A}$
            • $\forall j: (-, j) \in \mathcal{A} \lor \exists! i: (i, j) \in \mathcal{A}$

  9. → Sequence Alignment, a slightly new definition
       Definition (Alignment (as set of alignment edges)): as on the previous slide.
       Remark
         The new definition is equivalent to the previous one via alignment strings:
           AC-GTAA
           ACCCT--
         $\equiv \{ (1, 1), (2, 2), (-, 3), (3, 4), (4, 5), (5, -), (6, -) \}$
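
The equivalence in the remark above can be made concrete with a short sketch that reads off the edge set from two alignment strings. The function name `edges_from_strings` and the use of `None` for the gap symbol − are choices made here for illustration; they do not come from the slides.

```python
def edges_from_strings(row_a, row_b, gap="-"):
    """Convert two alignment strings into a list of alignment edges.

    Positions are 1-based, as in the definition; None stands for the gap symbol.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment strings must have equal length")
    edges = []
    i = j = 0  # number of residues of A and B consumed so far
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            raise ValueError("a column must not consist of gaps only")
        if x != gap:
            i += 1
        if y != gap:
            j += 1
        # matching edge (i, j), or gap edge (i, None) / (None, j)
        edges.append((i if x != gap else None, j if y != gap else None))
    return edges


print(edges_from_strings("AC-GTAA", "ACCCT--"))
```

For the slide's example this prints `[(1, 1), (2, 2), (None, 3), (3, 4), (4, 5), (5, None), (6, None)]`, which is the edge set given in the remark with `None` in place of −.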

  10. → Recall: The Best Sequence Alignment
        Idea: define best alignment as alignment with minimal edit distance
        Definition (Sequence Alignment Problem)
          Given two (RNA) sequences $A$ and $B$, find the alignment $\mathcal{A}$ of $A$ and $B$ with minimal edit distance
          $$\operatorname{dist}_{A,B}(\mathcal{A}) = \sum_{(i,j) \in \mathcal{A}} d(i,j), \quad \text{where } d(i,j) = \begin{cases} \gamma & i = - \text{ or } j = - \\ w_m & A_i \neq B_j \\ 0 & A_i = B_j. \end{cases}$$
        • Idea: how can we transform $A$ into $B$? Find a sequence of edit operations (match/mismatch, insertion, deletion) with minimal weight.
        • $d(i,j)$ weights the edit operation from position $i$ to position $j$.
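
As a small worked instance of the edit distance, the following sketch sums d(i, j) over a given edge set. The function name `alignment_distance`, the `None` encoding of the gap symbol, and the default weights γ = 1 and w_m = 1 are assumptions made for illustration.

```python
def alignment_distance(a, b, edges, gamma=1, w_m=1):
    """Sum d(i, j) over all alignment edges (1-based positions, None = gap)."""
    total = 0
    for i, j in edges:
        if i is None or j is None:
            total += gamma          # gap edge: insertion or deletion
        elif a[i - 1] != b[j - 1]:
            total += w_m            # matching edge with mismatching bases
        # matching edge with equal bases contributes 0
    return total


# Edge set of the alignment AC-GTAA / ACCCT-- from the earlier slide.
edges = [(1, 1), (2, 2), (None, 3), (3, 4), (4, 5), (5, None), (6, None)]
print(alignment_distance("ACGTAA", "ACCCT", edges))
```

With unit weights the example alignment has three gap edges and one mismatch (G against C), so the printed distance is 4.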

  11. → Recall: Needleman-Wunsch Algorithm
        Idea: Minimize edit distance by DP. Get best alignment by traceback.
        Definition (Needleman-Wunsch Matrix)
          Define the matrix $D = (D_{ij})_{0 \le i \le n,\, 0 \le j \le m}$ by
          $$D_{ij} := \min \{ \operatorname{dist}_{A,B}(\mathcal{A}) \mid \mathcal{A} \text{ alignment of } A_1, \dots, A_i \text{ and } B_1, \dots, B_j \}.$$
        For $1 \le i \le n$, $1 \le j \le m$:
          Init: $D_{00} = 0$, $D_{i0} = i\gamma$, $D_{0j} = j\gamma$
          Recurse:
          $$D_{ij} = \min \begin{cases} D_{i-1,j-1} + d(i,j) \\ D_{i-1,j} + d(i,-) \\ D_{i,j-1} + d(-,j) \end{cases}$$
        Remarks:
          • recursively compute edit distances of prefix alignments
          • obtain alignment by trace-back
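
The initialization, recursion, and trace-back can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function name `needleman_wunsch`, the default weights γ = 1 and w_m = 1, and the tie-breaking order in the trace-back (diagonal first, then deletion, then insertion) are assumptions. Integer weights are assumed so that the equality tests in the trace-back are exact.

```python
def needleman_wunsch(a, b, gamma=1, w_m=1):
    """Return (minimal edit distance, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # Init: D_i0 = i * gamma
        D[i][0] = i * gamma
    for j in range(1, m + 1):          # Init: D_0j = j * gamma
        D[0][j] = j * gamma
    for i in range(1, n + 1):          # Recurse over prefix alignments
        for j in range(1, m + 1):
            d = 0 if a[i - 1] == b[j - 1] else w_m
            D[i][j] = min(D[i - 1][j - 1] + d,   # match/mismatch
                          D[i - 1][j] + gamma,   # gap edge (i, -)
                          D[i][j - 1] + gamma)   # gap edge (-, j)

    # Trace-back from D[n][m] to D[0][0]; rows are built in reverse.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            d = 0 if a[i - 1] == b[j - 1] else w_m
            if D[i][j] == D[i - 1][j - 1] + d:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + gamma:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


dist, row_a, row_b = needleman_wunsch("ACGTAA", "ACCCT")
print(dist)
print(row_a)
print(row_b)
```

For A = ACGTAA and B = ACCCT with unit weights the minimal distance is 4, the same as the distance of the example alignment AC-GTAA / ACCCT--. Several alignments reach this optimum, so which one is printed depends on the tie-breaking order.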

  12. → Recall: From Pairwise to Multiple
        Problem: Given a set of $K$ RNA sequences, find the best multiple alignment
        Definition (Multiple Alignment)
          Define a multiple alignment $\mathcal{A}$ of $K$ (RNA) sequences $S_1, \dots, S_K$ as a matrix of $a_{\ell i} \in \{A, C, G, U, -\}$ ($1 \le \ell \le K$, $1 \le i \le m$), s.t.
          • for each $\ell$: deleting each occurrence of $-$ from $a_{\ell 1} \dots a_{\ell m}$ yields $S_\ell$.
          • for each $i$: $a_{1i} \dots a_{Ki} \neq - \cdots -$ (no column consists of gaps only).
          Call $m$ the length of $\mathcal{A}$.
        Recall: Progressive Alignment
          • pairwise alignments all-vs-all
          • construct guide tree
          • progressively construct multiple alignment following guide tree

  13. → You are here
        [Figure: Plan A: single sequences A and B → ALIGN → alignment of A and B → FOLD → consensus and consensus structure]
        Example: $S_1$ = CGAUACG, $S_2$ = CGAAUACG, $S_3$ = CCGAUUCGG
          C-GA-UAC-G
          C-GAAUAC-G
          CCGA-UUCGG
        Next: fold the alignment
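
The two conditions of the multiple-alignment definition can be checked mechanically on this example. The sketch below is illustrative only; the function name `is_multiple_alignment` is hypothetical, and rows are represented as plain strings.

```python
def is_multiple_alignment(rows, seqs, gap="-"):
    """Check that rows form a multiple alignment of seqs.

    Condition 1: removing gaps from row l yields sequence S_l.
    Condition 2: no column consists of gaps only.
    """
    if not rows or len(rows) != len(seqs):
        return False
    m = len(rows[0])                      # length of the alignment
    if any(len(row) != m for row in rows):
        return False
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, seqs)):
        return False
    return all(any(row[i] != gap for row in rows) for i in range(m))


rows = ["C-GA-UAC-G", "C-GAAUAC-G", "CCGA-UUCGG"]
seqs = ["CGAUACG", "CGAAUACG", "CCGAUUCGG"]
print(is_multiple_alignment(rows, seqs))
```

The example alignment has length m = 10 and satisfies both conditions, so the check prints `True`.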

  14. → How to fold an alignment: The Idea of RNAalifold
        Given: a $K$-way multiple alignment of length $m$.
        Goal: predict the (non-crossing) consensus structure of the alignment.
        A consensus structure is a (non-crossing) RNA structure of length $m$.
        An optimal consensus structure minimizes a combination of
          • the sum of free energy over all $K$ RNA sequences and
          • a conservation score (= evidence for base pairing).
        Remarks
          • Think of the alignment as a sequence of alignment columns. Folding this sequence is analogous to folding an RNA sequence. The consensus structure is a structure of the alignment.
          • Thus, same decomposition as Zuker, except for modified scoring: sum the loop energies for all sequences and add the conservation score.
          • The conservation score $\gamma(i,j)$ for each base pair $(i,j)$ rewards mutations and penalizes non-complementarity.
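
To illustrate what kind of evidence a conservation score can draw on, the following toy sketch counts, for a pair of alignment columns, how many sequences have complementary bases there and how many different pair types occur. This is not RNAalifold's γ(i, j), whose definition is not given in this part of the slides; the function name `column_pair_stats` and the choice to treat G-U as complementary are assumptions made for illustration.

```python
# Assumed set of complementary base pairs (Watson-Crick plus G-U wobble).
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"),
                 ("G", "C"), ("G", "U"), ("U", "G")}


def column_pair_stats(rows, i, j):
    """For alignment columns i and j (1-based), return a tuple
    (number of sequences with complementary bases,
     number of distinct complementary pair types)."""
    pairs = [(row[i - 1], row[j - 1]) for row in rows]
    complementary = [p for p in pairs if p in COMPLEMENTARY]
    return len(complementary), len(set(complementary))


rows = ["GAAAC", "CAAAG", "AAAAU", "GAAAA"]
print(column_pair_stats(rows, 1, 5))
```

In this toy alignment, columns 1 and 5 pair as G-C, C-G and A-U in the first three sequences and as the non-complementary G-A in the fourth, so the result is `(3, 3)`. Many distinct pair types with few non-complementary rows is the pattern a conservation score would reward.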

  15. → RNAalifold: Example
          AF008220  GGAGGAUU-AGCUCAGCUGGGAGAGCAUCUGCCUUACAAGC---------AGAGGGUCGGCGGUUCGAGCCCGUCAUCCUCCA
          M68929    GCGGAUAU-AACUUAGGGGUUAAAGUUGCAGAUUGUGGCUC---------UGAAAA-CACGGGUUCGAAUCCCGUUAUUCGCC
          X02172    GCCUUUAU-AGCUUAG-UGGUAAAGCGAUAAACUGAAGAUU---------UAUUUACAUGUAGUUCGAUUCUCAUUAAGGGCA
          Z11880    GCCUUCCU-AGCUCAG-UGGUAGAGCGCACGGCUUUUAACC---------GUGUGGUCGUGGGUUCGAUCCCCACGGAAGGCG
          D10744    GGAAAAUUGAUCAUCGGCAAGAUAAGUUAUUUACUAAAUAAUAGGAUUUAAUAACCUGGUGAGUUCGAAUCUCACAUUUUCCG
          alifold   (((((((...((((........))))((((((.......)).........))))....(((((.......)))))))))))). (-49.58 = -17.46 + -32.12)

  16. → RNAalifold Recursions
        $$W_{ij} = \min \begin{cases} W_{i,j-1} \\ \min_{i \le k < j-m} W_{i,k-1} + V_{kj} \end{cases}$$
        $$V_{ij} = \beta\gamma(i,j) + \min \begin{cases} \sum_{1 \le \ell \le K} \mathrm{eH}(i,j,S_\ell) \\ \min_{i < i' < j' < j} V_{i'j'} + \sum_{1 \le \ell \le K} \mathrm{eSBI}(i,j,i',j',S_\ell) \\ \min_{i < k < j} WM_{i+1,k} + WM_{k+1,j-1} + aK \end{cases}$$
        $$WM_{ij} = \min \begin{cases} WM_{i,j-1} + cK,\quad WM_{i+1,j} + cK,\quad V_{ij} + bK \\ \min_{i < k < j} WM_{ik} + WM_{k+1,j} \end{cases}$$
        Remarks
          • $\mathrm{eH}(i,j,S_\ell)$ and $\mathrm{eSBI}(i,j,i',j',S_\ell)$ yield energy contributions for the respective $S_\ell$.

  17. → RNAalifold Recursions
        Recursions for $W_{ij}$, $V_{ij}$ and $WM_{ij}$: as on the previous slide.
        Remarks
          • $\mathrm{eH}(i,j,S_\ell)$ and $\mathrm{eSBI}(i,j,i',j',S_\ell)$ yield energy contributions for the respective $S_\ell$.
          • RNAalifold implements an unambiguous variant of these recursions for computing the partition function and base pair probabilities for the consensus structure.
          • $\beta$ weights the conservation score vs. the sum of free energy. For $\gamma$ see next slide.
