BLAST Michael Schroeder Biotechnology Center TU Dresden
Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align sequences? Levensthein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix, How to compute an alignment? Dynamic programming How to compute an alignment fast? BLAST How to align many sequences Multiple sequence alignment, phylogenetic trees Alignments and structure How to predict protein structure from protein sequence 2
Motivation Two sequences of length n Dynamic programming: matrix = time proportional to n 2 Database of m sequences: time proportional to m n 2 Manual search: How long does it take? 1 cell = 10 sec 1.000.000 sequences Sequence = 100 amino acids 3
Motivation How can we reduce the database size? How can we reduce the matrix size? 4
lev(petra, peter) = ? 5
Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j p e t e r 0 1 2 3 4 5 p 1 0 1 2 3 4 e 2 1 0 1 2 3 t 3 2 1 0 1 2 r 4 3 2 1 1 1 a 5 4 3 2 2 2 6
Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j p e t e r 0 1 2 3 4 5 p 1 0 1 2 3 4 e 2 1 0 1 2 3 t 3 2 1 0 1 2 r 4 3 2 1 1 1 not used a 5 4 3 2 2 2 maybe used used For the two alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How? 7
Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 5 5 What is the worst possible distance for two strings of size 5? 5 mismatches. This means all paths of length >5 can be excluded 8
Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 not used 5 5 maybe used used Paths through red cells have all length >5 Only 24 out of 36 can contribute to results. 9
Are we solving the right problem? Are all alignments useful? Only results with reasonable edit distance. For size 5 strings, let‘s say that‘s 3. 10
Levenshtein Distance with Dynamic Programming: Are all cells in the matrix equally important? i \ j 0 1 2 3 4 5 1 1 2 2 3 3 4 4 not used 5 5 maybe used used Paths through red cells have all length >3 Only 16 out of 36 can contribute to results. 11
Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer 12
Are we Solving the Right Problem? 13
Are we Solving the Right Problem? Aligning completely unrelated sequences not needed A reasonable alignment has some perfect matches BLAST Idea: 1. Identify small perfect matches (termed words ) 2. Extend perfect matches on the same diagonal 3. Merge extended perfect matches (with dynamic programming) 14
BLAST Idea Do we need the alignments for all sequences in the database? No, only for “reasonable” ➞ hits introduce a threshold A “reasonable” alignment will contain short stretches of perfect matches Find these first, then extend them to connect them as best possible 15
Perfect Matches (Words) Formally 16
BLAST: Filtration Algorithm detects all matching words of length p (of a query string a in a target string b , both of length n ), combining it to an alignment of a and b with no more than k mismatches Potential match detection: Find all matches of p - tuples between a and b (can be done in linear time by inserting them into a hash table) Potential match verification: Verify each potential match by extending it to the left and right until either the first k +1 mismatches are found or the beginning or end of the sequences are found 17
BLAST Mechanism Summary word length p (here: p = 4) no mismatches or gaps allowed only within the grey areas 18
Zipf's Law 19
Example for BLAST Search SWISSPROT for Ig-alpha: SWISS_PROT:C79A_HUMAN P11912 20
Example for BLAST 21
Example for BLAST 22
Example for BLAST Distribution of Hits: 23
Example for BLAST: Alignment >gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A) Length = 220 Score = 278 bits (711), Expect = 5e-75 Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%) Query: 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG + LL LS LGPGCQAL + P SL V+LGE+A C N+ Sbjct: 1 MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query: 61 NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120 N N+TWW L N TWPP LGPG+ G L VNK+ G C+V E N ++SCG Sbjct: 56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114 Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180 TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D D Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174 Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226 +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220 24
Example for BLAST: Taxonomy Lineage Report root . cellular organisms . . Eukaryota [eukaryotes] . . . Fungi/Metazoa group [eukaryotes] . . . . Bilateria [animals] . . . . . Coelomata [animals] . . . . . . Gnathostomata [vertebrates] . . . . . . . Tetrapoda [vertebrates] . . . . . . . . Amniota [vertebrates] . . . . . . . . . Eutheria [mammals] . . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] . . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] . . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] . . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] . . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] . . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] . . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] . . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] . . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] . . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] . . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] . . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] . . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] . . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] . . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] . Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] . Human adenovirus type 7 ................................... 30 1 hit [viruses] 25
PSI-Blast Globin family (oxygen transport) of proteins occurs in many species Proteins have same function and structure and But there are pairs of members of the family sharing less than 10% identical residues 50% 50% A B C Only 10% PSI-BLAST idea: score via intermediaries may be better than score from direct comparison 26
PSI-BLAST PSI-BLAST 1. BLAST 2. Collect top hits 3. Build multiple sequence alignment from significant local matches 4. Build profile 5. Re-probe database with profile 6. Go back to 2. 27
PSI-BLAST But beware of PSI-BLAST: False positives propagate and spread through iterations If protein A consists of domains D and E, and protein B of domains E and F and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain 28
Recommend
More recommend