Multiple Sequence Alignment

  1. Multiple Sequence Alignment

  2. Alignment can be easy or difficult

     Easy:
         GCGGCCCA TCAGGTAGTT GGTGG
         GCGGCCCA TCAGGTAGTT GGTGG
         GCGTTCCA TCAGCTGGTT GGTGG
         GCGTCCCA TCAGCTAGTT GGTGG
         GCGGCGCA TTAGCTAGTT GGTGA
         ******** ********** *****

     Difficult due to insertions or deletions (indels):
         TTGACATG CCGGGG---A AACCG
         TTGACATG CCGGTG--GT AAGCC
         TTGACATG -CTAGG---A ACGCG
         TTGACATG -CTAGGGAAC ACGCG
         TTGACATC -CTCTG---A ACGCG
         ******** ?????????? *****

  3. Homology: Definition
     • Homology: similarity that is the result of inheritance from a common ancestor. Identification and analysis of homologies is central to phylogenetic systematics.
     • An alignment is a hypothesis of positional homology between bases or amino acids.

  4. Multiple Sequence Alignment - Goals
     • To generate a concise, information-rich summary of sequence data.
     • Sometimes used to illustrate the dissimilarity between a group of sequences.
     • Alignments can be treated as models that can be used to test hypotheses.
     • Does this model of events accurately reflect known biological evidence?

  5. Alignment of 16S rRNA can be guided by secondary structure

     Alignment of 16S rRNA sequences from different bacteria:

     <---------------(--------------------HELIX 19---------------------)
     <---------------(22222222-000000-111111-00000-111111-0000-22222222
     Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
     Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
     E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
     Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
     B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
     Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
     match            **        ***        * ** ** *                 **

  6. Protein Alignment may be guided by Tertiary Structure Interactions
     [Figure: DjlA protein from Homo sapiens and DjlA protein from Escherichia coli]

  7. Multiple Sequence Alignment - Methods
     3 main methods of alignment:
     • Manual
     • Automatic
     • Combined

  8. Manual Alignment - reasons
     • Might be carried out because:
       – Alignment is easy.
       – There is some extraneous information (structural).
       – Automated alignment methods have encountered the local minimum problem.
       – An automated alignment method can be “improved”.

  9. Dynamic programming
     2 methods:
     • Dynamic programming
       – Consider 2 protein sequences, each 100 amino acids long. If it takes 100^2 seconds to align these sequences exhaustively,
       – then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, and so on.
       – Aligning 20 sequences exhaustively would take more time than the universe has existed.
     • Progressive alignment
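The growth described on this slide can be checked with a few lines of Python. This is only a sketch of the slide's rough cost model (100^n seconds for n sequences of length 100); the age of the universe is assumed to be about 13.8 billion years, and the function names are made up for illustration.

```python
# Rough cost model from the slide: exhaustively aligning n sequences of
# length 100 takes about 100**n seconds.
SEQ_LENGTH = 100

# Assumption: the universe is about 13.8 billion years old (in seconds).
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600


def exhaustive_seconds(n_sequences: int, length: int = SEQ_LENGTH) -> int:
    """Seconds needed under the slide's length**n cost model."""
    return length ** n_sequences


def first_infeasible_n() -> int:
    """Smallest n whose cost exceeds the assumed age of the universe."""
    n = 2
    while exhaustive_seconds(n) <= AGE_OF_UNIVERSE_S:
        n += 1
    return n


if __name__ == "__main__":
    for n in (2, 3, 4, 20):
        print(n, "sequences:", exhaustive_seconds(n), "seconds")
    print("first n beyond the age of the universe:", first_infeasible_n())
```

Under these assumptions the limit is reached long before 20 sequences, which is why heuristic methods such as progressive alignment are used instead.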

  10. Progressive Alignment
      • Devised by Feng and Doolittle in 1987.
      • Essentially a heuristic method, and as such it is not guaranteed to find the ‘optimal’ alignment.
      • Requires (n-1) + (n-2) + ... + 1 = n(n-1)/2 pairwise alignments as a starting point.
      • Most successful implementation is Clustal (Des Higgins).

  11. Overview of ClustalW Procedure

      • Quick pairwise alignment: calculate distance matrix

            Hbb_Human  1   -
            Hbb_Horse  2  .17   -
            Hba_Human  3  .59  .60   -
            Hba_Horse  4  .59  .59  .13   -
            Myg_Whale  5  .77  .77  .75  .75   -

      • Neighbor-joining tree (guide tree)
        [Figure: guide tree joining Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]

      • Progressive alignment following guide tree
        [Figure annotation: alpha-helices]

            1 PEEKSAVTALWGKVN--VDEVGG
            2 GEEKAAVLALWDKVN--EEEVGG
            3 PADKTNVKAAWGKVGAHAGEYGA
            4 AADKTNVKAAWSKVGGHAGEYGA
            5 EHEWQLVLHVWAKVEADVAGHGQ

  12. ClustalW - Pairwise Alignments
      • First perform all possible pairwise alignments between each pair of sequences. There are (n-1) + (n-2) + ... + 1 = n(n-1)/2 possibilities.
      • Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments.
      • Generate a distance matrix.
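A minimal Python sketch of this step follows. It makes two simplifying assumptions that are not in the slides: the five fragments from the ClustalW overview slide are reused as if they were already pairwise aligned, and the ‘distance’ is taken to be the fraction of mismatched columns (ignoring columns with a gap), since the exact distance measure is not given here. The helper names are hypothetical.

```python
from itertools import combinations

# Aligned fragments copied from the ClustalW overview slide.
FRAGMENTS = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}


def pair_count(n: int) -> int:
    """Number of pairwise alignments: (n-1) + (n-2) + ... + 1."""
    return n * (n - 1) // 2


def p_distance(a: str, b: str) -> float:
    """Assumed distance: mismatch fraction over columns without a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs: dict) -> dict:
    """Map each unordered pair of names to its distance."""
    return {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }


if __name__ == "__main__":
    matrix = distance_matrix(FRAGMENTS)
    print(pair_count(len(FRAGMENTS)), "pairs")
    for pair, d in sorted(matrix.items(), key=lambda kv: kv[1]):
        print(sorted(pair), round(d, 2))
```

Because the fragments are short, the values will not generally reproduce the full-length distances shown on the overview slide.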

  13. Path Graph for aligning two sequences.

  14. Possible alignment
      Scoring scheme:
      • Match: +1
      • Mismatch: 0
      • Indel: -1
      Score for this path = 2

  15. Alignment using this path

          GATTC-
          GAATTC
      Column scores: 1 1 0 1 0 -1

  16. Optimal Alignment 1
      Alignment using this path:

          GA-TTC
          GAATTC
      Column scores: 1 1 -1 1 1 1
      Alignment score: 4

  17. Optimal Alignment 2
      Alignment using this path:

          G-ATTC
          GAATTC
      Column scores: 1 -1 1 1 1 1
      Alignment score: 4
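The path-graph example can be reproduced with a standard dynamic-programming (Needleman-Wunsch style) global alignment. The sketch below uses the scoring scheme from the slides (match +1, mismatch 0, indel -1); the function names are illustrative, and the traceback returns only one of the equally good optimal alignments.

```python
def score_alignment(x: str, y: str, match=1, mismatch=0, indel=-1) -> int:
    """Score two already-aligned strings column by column."""
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def global_alignment(a: str, b: str, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + indel,
                score[i][j - 1] + indel,
            )

    # Trace one optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(score_alignment("GATTC-", "GAATTC"))  # the non-optimal path
    print(global_alignment("GATTC", "GAATTC"))
```

Scoring `GATTC-` against `GAATTC` gives 2, while the dynamic-programming search finds the optimal score of 4 shared by the two optimal alignments.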

  18. ClustalW - Guide Tree
      • Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances.
      • This guide tree gives the order in which the progressive alignment will be carried out.

  19. Neighbor joining method
      • The neighbor joining method is a greedy heuristic that, at each step, joins the two closest sub-trees that are not already joined.
      • It is based on the minimum evolution principle.
      • One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree.
      [Figure: taxa A and B connected through Node 1]

  20. What is required for the Neighbour joining method? A distance matrix

      Distance matrix (PAM):

      PAM        Spinach   Rice  Mosquito  Monkey  Human
      Spinach        0.0   84.9     105.6    90.8   86.3
      Rice          84.9    0.0     117.8   122.4  122.6
      Mosquito     105.6  117.8       0.0    84.7   80.8
      Monkey        90.8  122.4      84.7     0.0    3.3
      Human         86.3  122.6      80.8     3.3    0.0

  21. First Step
      The PAM distance 3.3 (Human - Monkey) is the minimum, so we join Human and Monkey into MonHum and calculate the new distances.
      [Figure: tree so far; labels: Mon-Hum, Mosquito, Spinach, Rice, Human, Monkey]

  22. Calculation of New Distances
      After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

          Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2
                                = (90.8 + 86.3) / 2
                                = 88.55

      [Figure: labels: Mon-Hum, Spinach, Human, Monkey]

  23. Next Cycle

      PAM        Spinach   Rice  Mosquito  MonHum
      Spinach        0.0   84.9     105.6    88.6
      Rice          84.9    0.0     117.8   122.5
      Mosquito     105.6  117.8       0.0    82.8
      MonHum        88.6  122.5      82.8     0.0

      [Figure: tree so far; labels: Mos-(Mon-Hum), Mon-Hum, Rice, Spinach, Mosquito, Human, Monkey]

  24. Penultimate Cycle

      PAM         Spinach   Rice  MosMonHum
      Spinach         0.0   84.9       97.1
      Rice           84.9    0.0      120.2
      MosMonHum      97.1  120.2        0.0

      [Figure: tree so far; labels: Mos-(Mon-Hum), Spin-Rice, Mon-Hum, Rice, Spinach, Mosquito, Human, Monkey]

  25. Last Joining

      PAM         SpinRice  MosMonHum
      SpinRice         0.0      108.7
      MosMonHum      108.7        0.0

      [Figure: final tree (Spin-Rice)-(Mos-(Mon-Hum)); labels: Mos-(Mon-Hum), Spin-Rice, Mon-Hum, Rice, Spinach, Mosquito, Human, Monkey]
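The joining cycles above can be replayed in Python. This sketch follows the simplified procedure exactly as the slides describe it: join the closest pair, then take a simple average of distances. Note that the neighbor-joining algorithm as published by Saitou and Nei selects pairs with a corrected criterion and uses a different distance update, so this is an illustration of the slides' worked example rather than a full NJ implementation. The function name is hypothetical.

```python
from itertools import combinations

NAMES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
MATRIX = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
# Distances keyed by unordered pair of node labels.
DIST = {
    frozenset((NAMES[i], NAMES[j])): MATRIX[i][j]
    for i, j in combinations(range(len(NAMES)), 2)
}


def join_by_average(names, dist):
    """Join the closest pair repeatedly, averaging distances as on the slides.

    Returns (tree as nested tuples, list of (a, b, distance) joins).
    """
    d = dict(dist)
    nodes = list(names)
    joins = []
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda p: d[frozenset(p)])
        joins.append((a, b, d[frozenset((a, b))]))
        new = (a, b)
        rest = [n for n in nodes if n != a and n != b]
        for other in rest:
            # Simple average of the distances to the two joined nodes.
            d[frozenset((new, other))] = (
                d[frozenset((a, other))] + d[frozenset((b, other))]
            ) / 2
        nodes = rest + [new]
    return nodes[0], joins


if __name__ == "__main__":
    tree, joins = join_by_average(NAMES, DIST)
    for a, b, dist_ab in joins:
        print(a, "+", b, "at", round(dist_ab, 2))
    print(tree)
```

The slides round each intermediate distance to one decimal place, so the final value appears there as 108.7; carrying unrounded values through, as this sketch does, gives a slightly smaller figure of about 108.6.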

  26. Unrooted Neighbor-Joining Tree
      [Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach and Rice]

  27. Multiple Alignment - First pair
      • Align the two most closely-related sequences first.
      • This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
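The ‘fixed’ alignment rule can be illustrated with a small Python helper. This is a sketch, not ClustalW code: `insert_gap_column` is a hypothetical function, and the two ungapped strings are the first two fragments from the ClustalW overview slide with their gaps removed.

```python
def insert_gap_column(group, position, gap="-"):
    """Insert one gap at the same column in every sequence of an aligned group."""
    if any(len(seq) != len(group[0]) for seq in group):
        raise ValueError("all sequences in a group must have equal length")
    return [seq[:position] + gap + seq[position:] for seq in group]


# A fixed pairwise alignment (no gaps yet).
pair = ["PEEKSAVTALWGKVNVDEVGG", "GEEKAAVLALWDKVNEEEVGG"]

# A later step needs two gap columns after column 15: both sequences
# receive the gaps in the same place, so their relative alignment
# (which residue sits opposite which) is unchanged.
gapped = insert_gap_column(insert_gap_column(pair, 15), 15)

if __name__ == "__main__":
    for seq in gapped:
        print(seq)
```

Treating the group as a unit in this way is what lets each later step still be handled as a pairwise alignment.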

  28. ClustalW - Decision time
      • Next, consult the guide tree to see what alignment is performed next:
        – Option 1: align a third sequence to the first two, or
        – Option 2: align two entirely different sequences to each other.

  29. ClustalW - Alternative 1
      • If a third sequence is aligned to the first two, then whenever a gap has to be introduced to improve the alignment, the two entities (the existing aligned pair and the new sequence) are each treated as a single sequence.

  30. ClustalW - Alternative 2
      • If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

  31. ClustalW - Progression
      • The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.

  32. ClustalW - Good points/Bad points
      • Advantages:
        – Speed.
      • Disadvantages:
        – No objective function.
        – No way of quantifying whether or not the alignment is good.
        – No way of knowing if the alignment is ‘correct’.

  33. ClustalW - Local Minimum
      • Potential problems:
        – Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.
        – Arbitrary alignment.

  34. Increasing the sophistication of the alignment process
      • Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
      • Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

  35. ClustalW - Caveats
      • Sequence weighting.
      • Varying substitution matrices.
      • Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
      • Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.

  36. Sequence weighting
      • First we must be able to categorise sequences according to whether they have close relatives or are only distantly related to the other sequences (calculated directly from the guide tree).
      • Weights are normalised, so that the largest weight is 1.
      • Closely-related sequences carry a large amount of the same information, so they are downweighted.
      • These weights are multiplication factors.
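One way such weights could be derived from a guide tree is sketched below in Python. The slides do not give the formula, so this assumes the branch-sharing scheme described for Clustal W by Thompson, Higgins and Gibson (1994): each sequence's raw weight is the sum of the branch lengths on its path from the root, with every shared branch divided equally among the sequences below it. The tree, its branch lengths and the names `seqA`, `seqB`, `seqC` are made up for illustration.

```python
# A node is (label_or_children, branch_length_to_parent).
# Leaves carry a string label; internal nodes carry a list of child nodes.
# Hypothetical rooted guide tree: seqA and seqB are close relatives,
# seqC is a distant relative.
TREE = (
    [
        ([("seqA", 0.25), ("seqB", 0.25)], 0.5),
        ("seqC", 1.0),
    ],
    0.0,
)


def leaf_names(node):
    """List the leaf labels below (and including) this node."""
    label, _ = node
    if isinstance(label, str):
        return [label]
    return [name for child in label for name in leaf_names(child)]


def raw_weights(node, inherited=0.0, out=None):
    """Sum branch lengths from root to leaf, sharing each branch equally."""
    if out is None:
        out = {}
    label, length = node
    share = inherited + length / len(leaf_names(node))
    if isinstance(label, str):
        out[label] = share
    else:
        for child in label:
            raw_weights(child, share, out)
    return out


def normalised_weights(tree):
    """Scale the raw weights so that the largest weight is 1."""
    raw = raw_weights(tree)
    largest = max(raw.values())
    return {name: w / largest for name, w in raw.items()}


if __name__ == "__main__":
    print(normalised_weights(TREE))
```

In this made-up tree the two close relatives share one branch and so each receive a smaller weight than the distant sequence, which matches the downweighting described above.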
