22 ‐ Mar ‐ 15 Similarity Quantifying sequence similarity • Analogy • Homology – Similar function – Similar ancestry • Homology can be used to… – …determine evolutionary relationships – …predict functions Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 19 th 2015 Archaeopteryx Evolution Biological sequences • Replication (copy) • Represented as 1 ‐ dimensional string • Nucleic acids – DNA: A, C, G, and T – RNA: A, C, G, and U • Mutation (modify) – N is “unknown” – Some other letters are ambiguities: Some other letters are ambiguities: • Proteins: • Selection (fitness) – X is “unknown” Time – All other letters except B, J, O, U, and Z are amino acids: Fasta files Fasta file extensions • Biological sequences are stored in Fasta files • The file extension of a Fasta file is .fa or .fasta • Fasta files are plain text files (open e.g. in ) • The preferred extension for protein Fasta files is .faa – Fasta Amino Acid Every new sequence entry starts with a “>” sign at the start of a line >protein_seque >protein_sequence_A nce_A MTQSSHAVAA FDL MTQSSHAVAA FDLGAALRQE GLTETDYSE GAALRQE GLTETDYSEI QRDPNRAELG TFGV I QRDPNRAELG TFGV Each sequence has an identifier >protein_seque >protein_sequence_B nce_B that has to be unique in the file MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_sequence_A >protein_seque nce_A >protein_seque >protein_sequence_C nce_C MTQSSHAVAA FDL MT Q Q SSHAVAA FDLGAALR GAALRQE GLTETDYSE Q Q E GLTETDYSEI I QRDPNRAELG TFGV Q RDPNRAELG TFGV MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AAL MHAVAAFDLG AAL RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA L GRAMFGVMWS EHCCYRNDDA YRNDDA YRNDDA >protein_seque >protein_sequence_B nce_B RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV MLTETDYSEI QRRLGRDPNR AELGMFGVM MLTETDYSEI QRR LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_sequence_C >protein_seque nce_C • The preferred extension for DNA Fasta files is .fna MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AAL RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA – Fasta Nucleic Acid RPLLRPIKSP FGAWVVIV RPLLRPIKSP FGA WVVIV >DNA_sequence_X >DNA_sequence_ GAGGAATTCA TAG GAGGAATTCA TAGCTGACGA GTCGAGTGA CTGACGA GTCGAGTGAA AACCGTGTCG TAAA A AACCGTGTCG TAAAAGA AGA >DNA_sequence_ >DNA_sequence_Y The sequence can be on one or more lines Spaces and newlines just make CTGACGAGTC GCC CTGACGAGTC GCCCCCCCCC ATAGAGTGG CCCCCCC ATAGAGTGGT TTCCGTTTCC GGAA T TTCCGTTTCC GGAAGGGTCG GGGTCG until the next “>” at the start of a new line sequences easier to read/count, >DNA_sequence_Z >DNA_sequence_ they do not have any meaning GAAGCTGACC CGTTTCCGGA AGAGGGAGG GAAGCTGACC CGT TTCCGGA AGAGGGAGG 1
22 ‐ Mar ‐ 15 Sequence alignment Identity versus similarity • How to determine if two sequences are homologous? • These three sequences are all 66.7% identical… • …but what are those colors? * – Some amino acids are more similar than others • How to quantify their similarity? • Make a sequence alignment! (* hint) • Aligned sequences allow you to calculate percent identity: – Di ff erences: 7 posi � ons → 32 posi � ons iden � cal – Alignment length: 39 positions 32 – The two sequences are 100 * = 82% identical Cysteine (C) Aspartic acid (D) Glutamic acid (E) 39 • Note: identity can be quantified, but homology cannot – Saying “82% homologous” is wrong! George Orwell George Orwell Animal Farm Animal Farm Amino acid properties The similarity signal • What is the most relevant definition of a “similarity signal” between amino acids? – Size? – Polarity? – Hydrophobicity? – Preferred protein fold? …any other measures that might be important? any other measures that might be important? – • Similarity signal of what, exactly? * – “Of the function of the amino acid in the protein” – “Of an evolutionary relationship” (* hint) • Evolution has tested this for us – So let’s use evolution to score similarity! http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf BLOcks SUbstitution Matrix (BLOSUM) Are we looking at well ‐ aligned homologs? 1. Take some aligned homologous sequences You will learn how to do this next week – Unaligned sequences 2. Group highly identical sequences (e.g. >62% identity) To remove redundancy biases in the sequence database To remove redundancy biases in the sequence database – • Well ‐ aligned homologs show you how often two amino ll li d h l h h f i 3. Identify well ‐ aligned blocks acids really mutated into each other during evolution So that only real mutations are compared – 4. Count how two often amino acids mutated into each other This is the most relevant definition of “similarity”! – Well ‐ aligned homologs 2
22 ‐ Mar ‐ 15 How to quantify this similarity? An example • BLOSUM defines similarity between amino acids based on the • Let’s say that we have a well ‐ aligned block of: statistical probability that they align in well ‐ aligned homologs – 100 amino acids long – Observed/expected ratio (odds ratio) – 1,000 proteins “deep” • An obeserved/expected ratio of 2 means something happens twice as often as – With no gaps you expected by random chance 7,400 – Observed: aligned in well ‐ aligned homologs • 7.4% are alanine (A) → F A = = 0.074 100 ∙ 1,000 – Expected: “aligned” in unaligned sequences • 1.3% are tryptophan (W) → F W = 0.013 BLOSUM score: • Randomly, we expect a fraction of A ‐ W alignments of: Randomly, we expect a fraction of A W alignments of: log ( aligned in well ‐ aligned homologs observed ) → S i,j = 2 ∙ log 2 ( F i,j ) “aligned” in unaligned sequences expected F i ∙ F j F A ∙ F W = 0.074 ∙ 0.013 = 0.000962 • The observed/expected ratio for a whole alignment can be • In reality, we observe a fraction of A ‐ W alignments of: 0.00034 calculated by multiplying the ratios for all aligned amino acids – The A ‐ W mutation occurred less frequently in evolution than expected! – This is a lot of multiplying for long sequences – … so the substitution score must be negative • Logarithms allow us to sum the scores of aligned residues • The A ‐ W substitution score is: – Much easier to use for a scoring system S i,j = 2 ∙ log 2 ( F i,j ) S A,W = 2 ∙ log 2 ( 0.00034 ) = ‐ 3 • Hence: the BLOSUM score F i ∙ F j 0.000962 BLOSUM62 So what do the numbers mean? 2 (S i,j /2) = F i,j F i,j S i,j = 2 ∙ log 2 ( ) → F i ∙ F j F i ∙ F j Low for infrequent substitutions Small • A score of ‐ 3 for the substitution A ‐ W means that this substitution is High for frequent substitutions observed in well ‐ aligned homologs 0.35 times as often as expected: Low for common amino acids 2 ( ‐ 3/2) = 0.35 0.00034 Polar, non ‐ hydrophobic = 0.000962 1 High for rare amino acids – … or: = 2.83 times less often than expected 0.35 Positive • A score of 2 for the substitution D ‐ E means that this substitution is Hydrophobic observed in well ‐ aligned homologs twice as often as expected: 2 (2/2) = 2 Aromatic • Why are the highest numbers in each row/column on the • A score of 9 for the substitution C ‐ C means that this substitution is diagonal? observed in well ‐ aligned homologs 22.6 times as often as expected: – Because in well ‐ aligned homologs, every amino acid is most 2 (9/2) = 22.6 often aligned to itself (not mutated) Length of the alignment Odds ratio that sequences are well ‐ aligned homologs • As you see a more and more length of two sequences… • How likely is it that two sequences are well ‐ aligned homologs? – seqA – seqB : ‐ 1 + ‐ 3 + ‐ 2 = ‐ 6 – If two sequences are well ‐ aligned homologs, most of the aligned – seqA – seqC : ‐ 1 + ‐ 2 + ‐ 2 = ‐ 5 amino acids will have positive scores – seqB – seqC : 4 + 2 + 4 = 10 – … so the alignment score keeps increasing as you see more 2 ( ‐ 6/2) = 0.125 2 ( ‐ 5/2) = 0.177 2 (10/2) = 32 length of the sequences – … so the likelihood that they are well ‐ aligned homologs keeps increasing • When you observe RYD aligned to SDA, these sequences are 0.125 times more likely to be well ‐ aligned homologs than expected – … or 8 times less likely to be well ‐ aligned homologs than expected – If two sequences are not homologous, most of the aligned amino • When you observe RYD aligned to SEA, these sequences are 0.177 acids will have negative scores times more likely to be well ‐ aligned homologs than expected – … so the score keeps decreasing as you see more sequence – … or 5.6 times less likely to be well ‐ aligned homologs than expected – … so the likelihood that they are well ‐ aligned homologs keeps • When you observe SDA aligned to SEA, these sequences are 32 decreasing times more likely to be well ‐ aligned homologs than expected 3
Recommend
More recommend