blast
play

BLAST Michael Schroeder Biotechnology Center TU Dresden Contents - PowerPoint PPT Presentation

BLAST Michael Schroeder Biotechnology Center TU Dresden Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align sequences? Levensthein


  1. BLAST Michael Schroeder Biotechnology Center TU Dresden

  2. Contents  Why to compare and align sequences?  How to judge an alignment?  Z-score, E-value, P-value, structure and function  How to compare and align sequences?  Levensthein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix,  How to compute an alignment?  Dynamic programming  How to compute an alignment fast?  BLAST  How to align many sequences  Multiple sequence alignment, phylogenetic trees  Alignments and structure  How to predict protein structure from protein sequence 2

  3. Motivation  Two sequences of length n  Dynamic programming:  matrix = time proportional to n 2  Database of m sequences:  time proportional to m n 2  Manual search: How long does it take?  1 cell = 10 sec  1.000.000 sequences  Sequence = 100 amino acids 3

  4. Motivation  How can we reduce the database size?  How can we reduce the matrix size? 4

  5. lev(petra, peter) = ? 5

  6. Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j p e t e r 0 1 2 3 4 5 p 1 0 1 2 3 4 e 2 1 0 1 2 3 t 3 2 1 0 1 2 r 4 3 2 1 1 1 a 5 4 3 2 2 2 6

  7. Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j p e t e r 0 1 2 3 4 5 p 1 0 1 2 3 4 e 2 1 0 1 2 3 t 3 2 1 0 1 2 r 4 3 2 1 1 1 not used a 5 4 3 2 2 2 maybe used used For the two alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How? 7

  8. Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 5 5 What is the worst possible distance for two strings of size 5? 5 mismatches. This means all paths of length >5 can be excluded 8

  9. Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 not used 5 5 maybe used used Paths through red cells have all length >5 Only 24 out of 36 can contribute to results. 9

  10. Are we solving the right problem? Are all alignments useful? Only results with reasonable edit distance. For size 5 strings, let‘s say that‘s 3. 10

  11. Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 not used 5 5 maybe used used Paths through red cells have all length >3 Only 16 out of 36 can contribute to results. 11

  12. Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer 12

  13. Are we Solving the Right Problem? 13

  14. Are we Solving the Right Problem?  Aligning completely unrelated sequences not needed  A reasonable alignment has some perfect matches  BLAST Idea: 1. Identify small perfect matches (termed words ) 2. Extend perfect matches on the same diagonal 3. Merge extended perfect matches (with dynamic programming) 14

  15. BLAST Idea  Do we need the alignments for all sequences in the database?  No, only for “reasonable” ➞ hits introduce a threshold  A “reasonable” alignment will contain short stretches of perfect matches  Find these first, then extend them to connect them as best possible 15

  16. Perfect Matches (Words) Formally 16

  17. BLAST: Filtration Algorithm  detects all matching words of length p (of a query string a in a target string b , both of length n ), combining it to an alignment of a and b with no more than k mismatches  Potential match detection: Find all matches of p - tuples between a and b (can be done in linear time by inserting them into a hash table)  Potential match verification: Verify each potential match by extending it to the left and right until either the first k +1 mismatches are found or the beginning or end of the sequences are found 17

  18. BLAST Mechanism Summary word length p (here: p = 4) no mismatches or gaps allowed only within the grey areas 18

  19. Zipf's Law 19

  20. Example for BLAST Search SWISSPROT for Ig-alpha: SWISS_PROT:C79A_HUMAN P11912 20

  21. Example for BLAST 21

  22. Example for BLAST 22

  23. Example for BLAST Distribution of Hits: 23

  24. Example for BLAST: Alignment >gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A) Length = 220 Score = 278 bits (711), Expect = 5e-75 Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%) Query: 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG + LL LS LGPGCQAL + P SL V+LGE+A C N+ Sbjct: 1 MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query: 61 NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120 N N+TWW L N TWPP LGPG+ G L VNK+ G C+V E N ++SCG Sbjct: 56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114 Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180 TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D D Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174 Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226 +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220 24

  25. Example for BLAST: Taxonomy Lineage Report root . cellular organisms . . Eukaryota [eukaryotes] . . . Fungi/Metazoa group [eukaryotes] . . . . Bilateria [animals] . . . . . Coelomata [animals] . . . . . . Gnathostomata [vertebrates] . . . . . . . Tetrapoda [vertebrates] . . . . . . . . Amniota [vertebrates] . . . . . . . . . Eutheria [mammals] . . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] . . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] . . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] . . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] . . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] . . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] . . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] . . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] . . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] . . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] . . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] . . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] . . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] . . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] . . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] . Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] . Human adenovirus type 7 ................................... 30 1 hit [viruses] 25

  26. PSI-Blast  Globin family (oxygen transport) of proteins occurs in many species  Proteins have same function and structure and  But there are pairs of members of the family sharing less than 10% identical residues 50% 50%  A B C Only 10%  PSI-BLAST idea: score via intermediaries may be better than score from direct comparison 26

  27. PSI-BLAST  PSI-BLAST  1. BLAST  2. Collect top hits  3. Build multiple sequence alignment from significant local matches  4. Build profile  5. Re-probe database with profile  6. Go back to 2. 27

  28. PSI-BLAST  But beware of PSI-BLAST:  False positives propagate and spread through iterations  If protein A consists of domains D and E, and protein B of domains E and F and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain 28

Recommend


More recommend