Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast
The Central dogma of Biology The Central dogma of Biology
Protein translation Protein translation gene DNA Transcription mRNA Translation ATG CGT AAT TCA TGA ATG M R N S *
Blast Variants Blast Variants What is a 6 frame translation of a nucleotide sequence? ß blastn ß blastn: nucleotide versus nucleotide : nucleotide versus nucleotide ß blastp ß blastp: protein versus protein : protein versus protein ß blastx ß blastx: nucleotide query, protein database : nucleotide query, protein database ß tblastn ß tblastn: protein query, : protein query, nt nt database database ß tblastx ß tblastx: nucleotide query, nucleotide : nucleotide query, nucleotide database, alignments at a protein level. database, alignments at a protein level.
Homology versus similarity Homology versus similarity ß Sequence alignment tools compute similarity ß Sequence alignment tools compute similarity ß Two sequences are homologous if they shared a ß Two sequences are homologous if they shared a common evolutionary ancestor. common evolutionary ancestor. ß Orthologous Orthologous if the pair was created in a speciation if the pair was created in a speciation ß event event ß Paralogous Paralogous if the pair was created by a gene if the pair was created by a gene ß duplication duplication ß Homology does not always imply sequence ß Homology does not always imply sequence similarity. Ancient protein families might be similarity. Ancient protein families might be homologous, but greatly diverged. The score homologous, but greatly diverged. The score function must reflect this. function must reflect this.
Scoring Function Scoring Function ß For DNA, we worked with a simple ß For DNA, we worked with a simple match/mismatch criteria. match/mismatch criteria. ß Purines Purines (AG) & (AG) & Pyrimidines Pyrimidines (CT) (CT) ß ß Transitions (nucleotide substitution within a Transitions (nucleotide substitution within a ß group) are more likely than transversions transversions. . group) are more likely than ß When aligning When aligning cDNA cDNA, the third base , the third base ß substitution is more likely, then other substitution is more likely, then other positions. positions. ß Q: Can you devise an algorithm to score ß Q: Can you devise an algorithm to score appropriately? appropriately?
Scoring proteins Scoring proteins ß Scoring protein sequence alignments is a much more Scoring protein sequence alignments is a much more ß complex task than scoring DNA complex task than scoring DNA ß Not all substitutions are equal ß Not all substitutions are equal ß Problem was first worked on by Problem was first worked on by Pauling Pauling and and ß collaborators collaborators ß In the 1970s, In the 1970s, Dayhoff Dayhoff created the first similarity created the first similarity ß matrices. matrices. ß “ ß “One size does not fit all One size does not fit all” ” ß Homologous proteins which are evolutionarily close should ß Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily be scored differently than proteins that are evolutionarily distant distant ß Different proteins might evolve at different rates and we need ß Different proteins might evolve at different rates and we need to normalize for that to normalize for that
PAM1 matrix PAM1 matrix ß Align many proteins that are very similar ß Align many proteins that are very similar ß Is this a problem? Is this a problem? ß ß PAM1 distance is the probability of a ß PAM1 distance is the probability of a substitution when 1% of the residues have substitution when 1% of the residues have changed changed ß Estimate the frequency ß Estimate the frequency P P ab of every pair of ab of every pair of substitions a,b a,b substitions ß S(a,b) = log ß S(a,b) = log 10 (P P ab /P P a P b ) = log10(P P b /P P b ) 10 ( ab / a P b ) = log10( |a / b ) b|a
PAM 1 PAM 1
Recommend
More recommend