Inferring trees is difficult!!! MODELS OF PROTEIN EVOLUTION: 1. The method problem AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES A Method 1 Method 1 Dataset 1 Dataset 1 B ? C A Robert Hirt Method 2 Method 2 Department of Zoology, The C Dataset 1 Dataset 1 Natural History Museum, B London From DNA/protein sequences to trees Inferring trees is difficult!!! * 1 Sequence data Sequence data * 2 Align Sequences Align Sequences 2. The dataset problem 2. The dataset problem Phylogenetic signal? Phylogenetic signal? 3 * Patterns—>evolutionary Patterns >evolutionary processes? processes? A Method 1 Method 1 Distances methods Characters based methods Dataset 1 Dataset 1 B Distance calculation Choose Choose a a method method * 4 (which model?) ? C MB ML MP A Wheighting? Wheighting Model? Model? Model? Model? Method 1 Method 1 (sites, changes)? (sites, changes)? Optimality criterion Single tree C Dataset 2 Dataset 2 LS LS ME ME NJ NJ B Calculate or Calculate or estimate estimate best best fit fit tree tree 5 Test phylogenetic Test phylogenetic reliability reliability Modified from Hillis et al., (1993). Methods in Enzymology 224, 456-487 1
Agenda Why protein phylogenies? Why protein phylogenies? • Some general considerations • For historical reasons - the first sequences • For historical reasons - the first sequences – why protein phylogenetics? • Most genes encode proteins • Most genes encode proteins – What are we comparing? Protein sequences - some basic features • To study protein structure, function and evolution • To study protein structure, function and evolution – Protein structure/function and its impact on patterns of mutations • Comparing DNA and protein based phylogenies can • Comparing DNA and protein based phylogenies can • Amino acid exchange matrices: where do they come from be useful be useful and when do we use them? – Different genes - e.g. 18S Different genes - e.g. 18S rRNA rRNA versus EF-2 protein versus EF-2 protein – – Database searches (Blast, FASTA) – Protein encoding gene - Protein encoding gene - codons codons versus amino acids versus amino acids – – Sequence alignment (ClustalX) – Phylogenetics (model based methods) Proteins were the first molecular Phylogenies from proteins sequences to be used for phylogenetic inference • Parsimony • Distance matrices • Fitch and Margoliash (1967). • Maximum likelihood Construction of phylogenetic trees. • Bayesian methods Science 155, 279-284. 2
Evolutionary models for amino acid Character states in DNA and changes protein alignments • All methods have explicit or implicit evolutionary models • DNA sequences have four states (five): A, C, G, T, (and ± indels) • Can be in the form of simple formula – Kimura formula to estimate distances • Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and ± indels) • Most models for amino acid changes typically include – 20x20 rate matrix – Correction for rate heterogeneity among sites ( G [a]+ G [a]+ pinv) —> more information in DNA or protein alignments? – Assume neutrality - what if there are biases, or non neutral changes - such as selection? DNA—>Protein DNA->Protein: the code • The code is degenerate: • 3 nucleotides (a codon) code for one amino acid 20 amino acids are encoded by 61 possible codons (3 stop (61 codons! 61x61 rate matrices?) codons) • Complex patterns of changes among codons: • Degeneracy of the code: most amino acids are – Synonymous/non synonymous changes coded by several codons – Synonymous changes correspond to codon changes not affecting the coded amino acid —> more data/information in DNA? 3
Codon degeneracy: protein alignments as a guide DNA->Protein: code usage for DNA alignments Glu- Glu -Gly Gly- -Ser Ser- -Ser Ser- -Trp Trp- -Leu Leu- -Leu Leu- -Leu Leu- -Gly Gly- -Ser Ser • Difference in codon usage can lead to large base Glu- Glu -Gly Gly- -Ser Ser- -Ser Ser- -Tyr Tyr- -Leu Leu- -Leu Leu- -Ile Ile- -Gly Gly- -Ser Ser composition bias - in which case one often needs to Asp Asp- -Gly Gly- -Ser Ser- -Ala Ala- -Trp Trp- -Leu Leu- -Leu Leu- -Leu Leu- -Gly Gly- -Ser Ser remove the 3rd codon, the more bias prone site… Asp Asp- -Gly Gly- -Ser Ser- -Ala Ala- -Tyr Tyr- -Leu Leu- -Leu Leu- -Ala Ala- -Gly Gly- -Ser Ser and possibly the 1st GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC • Comparing protein sequences can reduce the GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT compositional bias problem GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA —> more information in DNA or protein? Ask James for PUTGAPS… Evolutionary models for amino acid changes Models for DNA and Protein evolution • All methods have explicit or implicit evolutionary • DNA: 4 x 4 rate matrices models – Easy to estimate (can be combined with tree search) • Can be in the form of simple formula – Kimura formula to estimate distances • Protein: 20 x 20 matrices – More complex: time and estimation problems (rare changes?) -> • Most models for amino acid changes typically include empirical models from large datasets are typically used – 20x20 rate matrix – Correction for rate heterogeneity between sites ( G [a]+ G [a]+ pinv) 4
Amino acid physico-chemical Proteins and amino acids properties • Proteins determine shape and structure of cells and – Major factor in protein folding carry most catalytic processes - 3D • Proteins are polymers of 20 different amino acids – Key to protein functions • Amino acids sequences determine the structure (2ndary, 3ary…) and function of the protein —> Major influence in pattern — > Major influence in pattern • Amino acids can be categorized by their side chain of amino acid mutations of amino acid mutations physicochemical properties As for Ts versus Tv in DNA sequences, some – Polarity (hydrophobic versus hydrophilic, +/- charges) – Size (small versus large) amino acid changes are more common than others: very important for sequence comparisons (alignment and phylogenetics!) Small <—> small > small <—> big Amino acid substitution matrices Estimation of relative rates of residue based on observed substitutions: replacement (models of evolution) “empirical models” • Differences/changes in protein alignments can be pooled and patterns of changes investigate. • Summarise the substitution pattern from large – Selected sequence, alignment and counting method dependent! Empirical amount of existing data models! • Patterns of changes give insights into the evolutionary processes • Based on a selection of proteins underlying protein diversification -> estimation of evolutionary – Globular proteins, membrane proteins? models – How general is such a model? – Mitochondrial proteins? • Choice of protein evolutionary models can be important for the • Uses a given counting method and the counted sequence analysis we perform (database searching, sequence alignment, phylogenetics) changes to be recorded – tree dependent/independent – restriction on the sequence divergence 5
Taylor’s Venn diagram of amino acids Amino acid physico-chemical properties properties Tiny Small – Size P – Polarity A Aliphatic C S-S G S N Polar – Hydrophilic (polar, +/- charges) C S-H Q V D + T I E – Hydrophobic (non polar) L Charged K M - Y F H R W Hydrophobic Aromatic Amino acids categories 1: Amino acids categories 2 Doolittle (1985). Sci. Am. 253, 74-85. –Sulfhydryl: C –Small polar: S, G, D, N –Small hydrophilic: S, T, A, P, G –Small non-polar: T, A, P, C –Acid, amide: D, E, N, Q –Large polar: E, Q, K, R –Basic: H, R, K –Large non-polar: V, I, L, M, F –Small hydrophobic : M, I, L, V –Intermediate polarity: W, Y, H –Aromatic: F, Y, W 6
Phylogenetic trees from protein alignments • Parsimony based methods - unweighted/weighted • Distance methods - model for distance estimation – probability of amino acid changes, site rate heterogeneity • Maximum likelihood and Bayesian methods- model for ML calculations – probability of amino acid changes, site rate heterogeneity —> Colour coding of different categories is useful for protein — > Colour coding of different categories is useful for protein alignment visual inspection alignment visual inspection Trees from protein alignment: Parsimony: unweighted matrix Parsimony methods - cost matrices for amino acid changes • All changes weighted equally – Ile -> Leu cost = 1 • Differential weighting of changes: an attempt to correct for homoplasy!: – Trp -> Asp cost = 1 – Based on the minimal number of amino acid substitutions, the genetic code matrix (PHYLIP -PROTPARS) – Ser -> Arg cost = 1 – Weights based on physico-chemical properties of amino acids – Weights based on observed frequency of amino acid substitutions in – Lys -> Asp cost = 1 alignments 7
Recommend
More recommend