
Bioinformatics Sequence comparison 1: global pairwise alignment


  1. Bioinformatics Sequence comparison 1: global pairwise alignment. David Gilbert, Bioinformatics Research Centre (www.brc.dcs.gla.ac.uk), Department of Computing Science, University of Glasgow

  2. Lecture contents
     • Evolutionary relationships, sequence comparison and alignment
     • How to compare and align sequences using scoring schemes
     • Naïve approach to finding optimal score and alignment
     • Dynamic programming as an efficient method for finding optimal scores & alignments
     • Variations on dynamic programming
       – Gap penalties
       – Substitution matrices
     (c) David Gilbert 2008 Sequence Comparison (1) 2

  3. Why compare sequences?
     • Assume a genome has been sequenced
     • We can find out where “putative” genes are by gene-finding (see http://www.ensembl.org/ )
     • → What does such a gene do?
     • We can make the protein that it encodes – but we can’t easily find out its biological function
     • So, we can try to find other sequences which are similar to it, for which we know the function...

  4. So, what is this sequence similar to?
     Amino-acid (protein sequence):
       MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH
       GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQ
       AAYQKVVAGVANALAHKYH
     cDNA (nucleotide sequence):
       acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc
       tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag
       ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg
       agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc
       atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg
       gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact
       tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca
       ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc
       acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc
       ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc
       taataaaaaa catttatttt cattgc
     Where does the coding start on this sequence?
     Search using BLAST: http://www.ncbi.nlm.nih.gov/BLAST/ or http://www.ebi.ac.uk/blastall/
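
One simple way to approach the question of where the coding starts is to look for the first `atg` and read codons from there. The sketch below does this on the first 80 bases of the cDNA. It assumes the first `atg` is the true start codon, which need not hold for every transcript, and the names `CDNA_PREFIX`, `first_start_codon` and `codons_from` are made up for this illustration.

```python
# First 80 bases of the cDNA on the slide, with the spacing removed.
CDNA_PREFIX = (
    "acatttgctt" "ctgacacaac" "tgtgttcact" "agcaacctca"
    "aacagacacc" "atggtgcacc" "tgactcctga" "ggagaagtct"
)


def first_start_codon(dna: str) -> int:
    """Return the 0-based index of the first 'atg', or -1 if there is none."""
    return dna.lower().find("atg")


def codons_from(dna: str, start: int) -> list[str]:
    """Split dna into complete codons, reading from index start."""
    return [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]


start = first_start_codon(CDNA_PREFIX)
print(start)                            # offset of the first atg
print(codons_from(CDNA_PREFIX, start))  # codons in that reading frame
```

The codons found this way (atg gtg cac ctg act ...) correspond to the M V H L T ... at the start of the protein sequence on the slide.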

  5. Evolution - basic concepts
     • Mutation in DNA is a natural evolutionary process
     • DNA replication errors (nucleotide):
       – substitutions
       – insertions and deletions (together: indels)
     • Similarity between sequences
       – clue to common evolutionary origin, or
       – clue to common function
     • This is a simplistic story: in fact the altered function of the expressed protein will determine if the organism will survive to reproduce, and hence pass on [transmit] the altered gene

  6. Human genetic variations (Single Nucleotide Polymorphisms)
     • SNPs: “genetic individuality”
     • ~ 1/1000 bases variable (between 2 humans)
     • Make us more/less susceptible to diseases
     • May influence the effect of drug treatments
       TTT TAC GGC ATC    Phe Tyr Asn Met
       TTT TAC GTC ATC    Phe Tyr Ser Met
     Associated with high cholesterol

  7. ACCESSION J02799
     /translation="MESKVVVPAQGKKITLQNGKLNVPENPIIPYIEGDGIGVDVTPA
     MLKVVDAAVEKAYKGERKISWMEIYTGEKSTQVYGQDVWLPAETLDLIREYRVAIKGP
     LTTPVGGGIRSLNVALRQELDLYICLRPVRYYQGTPSPVKHPELTDMVIFRENSEDIY
     AGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYAI
     ANDRDSVTLVHKGNIMKFTEGAFKDWGYQLAREEFGGELIDGGPWLKVKNPNTGKEIV
     IKDVIADAFLQQILLRPAEYDVIACMNLNGDYISDALAAQVGGIGIAPGANIGDECAL
     FEATHGTAPKYAGQDKVNPGSIILSAEMMLRHMGWTEAADLIVKGMEGAINAKTVTYD
     FERLMDGAKLLKCSEFGDAIIENM"
     ORIGIN MluI site; 25.3 min on K12 map.
        1 cgcgtggcgt ggttttcagg tttacgcctg gtagaacgtt gcgagctgaa tcgcttaacc
       61 tggtgatttc taaaagaagt tttttgcatg gtattttcag agattatgaa ttgccgcatt
      121 atagcctaat aacgcgcatc tttcatgacg gcaaacaata gggtagtatt gacaagccaa
      181 ttacaaatca ttaacaaaaa attgctctaa agcatccgta tcgcaggacg caaacgcata
      241 tgcaacgtgg tggcagacga gcaaaccagt agcgctcgaa ggagaggtga atggaaagta
      301 aagtagttgt tccggcacaa ggcaagaaga tcaccctgca aaacggcaaa ctcaacgttc
      361 ctgaaaatcc gattatccct tacattgaag gtgatggaat cggtgtagat gtaaccccag
      421 ccatgctgaa agtggtcgac gctgcagtcg agaaagccta taaaggcgag cgtaaaatct
      481 cctggatgga aatttacacc ggtgaaaaat ccacacaggt ttatggtcag gacgtctggc
      541 tgcctgctga aactcttgat ctgattcgtg aatatcgcgt tgccattaaa ggtccgctga
      601 ccactccggt tggtggcggt attcgctctc tgaacgttgc cctgcgccag gaactggatc
      661 tctacatctg cctgcgtccg gtacgttact atcagggcac tccaagcccg gttaaacacc
      721 ctgaactgac cgatatggtt atcttccgtg aaaactcgga agacatttat gcgggtatcg
      781 aatggaaagc agactctgcc gacgccgaga aagtgattaa attcctgcgt gaagagatgg
      841 gggtgaagaa aattcgcttc ccggaacatt gtggtatcgg tattaagccg tgttcggaag
      901 aaggcaccaa acgtctggtt cgtgcagcga tcgaatacgc aattgctaac gatcgtgact
      961 ctgtgactct ggtgcacaaa ggcaacatca tgaagttcac cgaaggagcg tttaaagact
     1021 ggggctacca gctggcgcgt gaagagtttg gcggtgaact gatcgacggt ggcccgtggc
     1081 tgaaagttaa aaacccgaac actggcaaag agatcgtcat taaagacgtg attgctgatg
     1141 cattcctgca acagatcctg ctgcgtccgg ctgaatatga tgttatcgcc tgtatgaacc
     1201 tgaacggtga ctacatttct gacgccctgg cagcgcaggt tggcggtatc ggtatcgccc
     1261 ctggtgcaaa catcggtgac gaatgcgccc tgtttgaagc cacccacggt actgcgccga
     1321 aatatgccgg tcaggacaaa gtaaatcctg gctctattat tctctccgct gagatgatgc
     1381 tgcgccacat gggttggacc gaagcggctg acttaattgt taaaggtatg gaaggcgcaa
     1441 tcaacgcgaa aaccgtaacc tatgacttcg agcgtctgat ggatggcgct aaactgctga
     1501 aatgttcaga gtttggtgac gcgatcatcg aaaacatgta atgccgtagt ttgttaaatt
     1561 tattaacg
     //

  8. Mutations, frameshifts
       at[tca] at[tca] ga[ag] aa[tc] atg taa   (regex)
       I       I       E      N      M   Ter
       atc     atc     gaa    aac    atg taa
     Compute the translation of
     1. atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg
     2. tcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg
     3. catcgaaaacatgtaatgccgtagtttgttaaatttattaacg
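
The regex on this slide describes every DNA spelling of the peptide I I E N M followed by the stop codon taa. A minimal Python check is sketched below. The pattern is written without commas inside the brackets, since a comma inside a regex character class is matched literally. The second test string is a hand-built synonymous variant and does not come from the slides.

```python
import re

# Codon patterns for Ile, Ile, Glu, Asn, Met, then the stop codon taa.
PATTERN = re.compile(r"at[tca]at[tca]ga[ag]aa[tc]atgtaa")

# End of the coding region plus trailing bases, as given on the slide.
tail = "atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg"

match = PATTERN.search(tail)
print(match.start(), match.group())

# A different DNA spelling of the same peptide still matches
# (illustrative sequence, not taken from the slides).
variant = "attatagagaatatgtaa"
print(PATTERN.fullmatch(variant) is not None)
```

The pattern only covers the one stop codon shown on the slide; matching any stop codon would need an alternation over taa, tag and tga.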

  9. Frameshift mutations
     1. atc atc gaa aac atg taa tgc cgt agt ttg tta aat tta tta acg
     2. tca tcg aaa aca tgt aat gcc gta gtt tgt taa att tat taa cg
     3. cat cga aaa cat gta atg ccg tag ttt gtt aaa ttt att aac g
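
The three translations can be computed mechanically. The sketch below builds the standard genetic code from a compact 64-letter string and translates the sequence after deleting zero, one or two leading bases. The helper names (`CODON_TABLE`, `translate`) are illustrative, `*` stands for a stop codon (Ter), and translation deliberately does not halt at stop codons so that the whole shifted frame is visible.

```python
BASES = "tcag"
# Amino acids for codons in the order ttt, ttc, tta, ttg, tct, ... ggg.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.lower()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


SEQ = "atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg"
frames = [translate(SEQ[deleted:]) for deleted in range(3)]
for number, protein in enumerate(frames, start=1):
    print(number, protein)
# 1 IIENM*CRSLLNLLT
# 2 SSKTCNAVVC*IY*
# 3 HRKHVMP*FVKFIN
```

Deleting one or two bases changes every codon downstream, so the IIENM ending is lost in frames 2 and 3.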

  10. Evolution, DNA->Amino-acids
      Triplet code, hence the difference between DNA base
      • Substitution: 1 amino acid changes
      • Insertion / Deletion: “frame shift” (all subsequent amino acids change)
        – NB, Indels can be in multiples of 3, and hence...
      Also
      • “Silent mutation”: DNA changes but the amino acid doesn’t change - why?
      • “Nonsense mutation”: a single DNA base substitution resulting in a stop codon.
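
Silent mutations are possible because the genetic code is redundant: 64 codons encode 20 amino acids plus stop, so several codons share one amino acid. The sketch below classifies a single-codon substitution under the standard genetic code. The function name and the label "missense" (an amino-acid change that is not a stop) are not from the slides and are used here only for illustration.

```python
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Label a codon change as 'silent', 'nonsense' or 'missense'."""
    before = CODON_TABLE[codon_before.lower()]
    after = CODON_TABLE[codon_after.lower()]
    if before == after:
        return "silent"      # same amino acid despite the DNA change
    if after == "*":
        return "nonsense"    # the change creates a stop codon
    return "missense"        # a different amino acid


print(classify_substitution("gaa", "gag"))  # Glu -> Glu
print(classify_substitution("tgc", "tga"))  # Cys -> stop
print(classify_substitution("gaa", "gta"))  # Glu -> Val
```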

  11. Some evolutionary relationships revealed by comparing α-haemoglobins (figure; species shown: moose, giant panda, lesser panda, duck, goshawk, vulture, alligator, axolotl)

  12. Evolution - example (figure; sequences shown: ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt)

  13. Evolution - related sequences
      What are the mutations in the following? Highlight the other mutations!
      Sequences in the figure (labelled “ancestral sequences” and “living examples”): ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt
      Mutations already marked: g → a (giving agcatt); c → g (giving aggatt)
      Q: How many changes between 2 sequences?
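
For equal-length sequences without indels, the number of changes between two sequences can be counted position by position. A minimal sketch follows; `count_changes` is an illustrative name. Note that this counts observed differences only, so it is a lower bound on the number of mutations that actually occurred.

```python
def count_changes(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(count_changes("ggcatt", "agcatt"))  # the g -> a step
print(count_changes("agcatt", "aggatt"))  # the c -> g step
print(count_changes("ggcatt", "agccta"))  # two sequences further apart
```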

  14. Other evolutionary issues
      • Convergent evolution: same sequence evolved from different ancestors
      • Back evolution: mutate to a previous sequence
      Sequences in the figure: ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt, aggata, ggcatt, aggatc, aggatc

  15. Evolutionary Relationships
      • Evolutionary relationships between sequences
        – Two sequences evolved from same ancestor
        – sequences are homologous
      h : GLVST
        V → I gives GLIST, then → V (insertion) gives q : GLISVT
        S → (deletion) gives GLVT, then L → I gives d : GIVT

  16. Evolutionary Relationships
      • ‘True’ Alignment (2 sub, 1 ins, 1 del)
          h : G L V S - T
          q : G L I S V T
          d : G I V - - T
      • ‘True’ evolutionary history (& h) unknown
      • Alignment can be interpreted as
        – Two substitutions & 2 insertions OR 2 deletions OR 1 deletion & 1 insertion
      • Unable to obtain evolutionary history even with ‘true’ alignment
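
Reading an alignment as a set of mutations can be done column by column. The sketch below counts substitution columns and indel columns for two aligned rows. It cannot tell an insertion from a deletion, which is the ambiguity this slide points out. `count_events` is an illustrative name, and `-` is assumed as the gap character.

```python
def count_events(row_a: str, row_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (substitutions, indel columns) for two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    substitutions = indels = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            indels += 1          # insertion or deletion: direction unknown
        elif x != y:
            substitutions += 1
    return substitutions, indels


# q against d as they appear in the 'true' alignment.
print(count_events("GLISVT", "GIV--T"))
# A different alignment of the same two sequences.
print(count_events("GLISVT", "G-I-VT"))
```

The first call gives two substitutions and two indel columns; the second gives no substitutions and two indel columns, so the same pair of sequences supports alignments with different mutation counts.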

  17. 2 del              | 1 del, 1 ins
      h : G I V T        | h : G I S V T
      q : G L I S V T    | q : G L I S V T
      d : G I V T        | d : G I S V T

  18. Evolutionary Relationships
      • Require model to reconstruct evolutionary history
      • Minimise number of mutations
          q : GLISVT; I ↔ L; V ↔ I; ← S → ; ← V → ; d : GIVT
      • A number of histories can be obtained e.g.
          q : GLISVT; L → I ; I → V; S → ; V → ; d : GIVT
      • Even with the ‘true’ alignment, obtaining the evolutionary history is ill-posed: many histories are possible.

  19. Evolutionary Relationships
      • What if ‘true’ alignment not known?
      • Again require a model to construct alignment
      • Minimise the number of mutations
      • One alignment would be
          GLISVT
          G_I_VT
      • Showing two insertions/deletions (indels)
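
An alignment that minimises the number of mutations can be found by dynamic programming, which the lecture contents list as the efficient method. These slides do not give the algorithm, so the following is only a minimal sketch under an assumed unit-cost model (each substitution and each indel costs 1). The function name and the `_` gap symbol are choices made for this illustration.

```python
def align_min_mutations(a: str, b: str, gap: str = "_") -> tuple[int, str, str]:
    """Align a and b so that substitutions + indels is minimal (unit costs)."""
    n, m = len(a), len(b)
    # cost[i][j] = fewest mutations needed to align a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


result = align_min_mutations("GLISVT", "GIVT")
print(result)
```

For GLISVT and GIVT this returns a cost of 2 with the rows GLISVT and G_I_VT, the alignment shown on the slide. Other cost models (gap penalties, substitution matrices) would change which alignment is optimal.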

  20. Evolutionary Relationships
      • This alignment would produce a number of possible histories e.g.
          GLIVT; → S (insertion) gives GLISVT; L → (deletion) gives GIVT
      • GLIVT not equal to ‘true’ h GLVST
