CSE182-L7: Dictionary matching, pattern matching
October 09 CSE182

Dictionary Matching
Database: POTASTPOTATO
Dictionary: 1:POTATO, 2:POTASSIUM, 3:TASTE
• Q: Given k words (word s_i has length l_i) and a database of size n, find all matches to these words in the database string.
• How fast can this be done?
Fa05 CSE 182

Dict. Matching & string matching
• How fast can you do it if you have only one word of length m?
– Trivial algorithm: O(nm) time
– Pre-processing O(m), search O(n) time
• Dictionary matching
– Trivial algorithm: O((l_1 + l_2 + l_3 + …)n) time
– Using a keyword tree: O(l_p n) time (l_p is the length of the longest pattern)
– Aho-Corasick: O(n) search after O(l_1 + l_2 + …) preprocessing
• We will consider the most general case

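As a baseline for the bounds above, here is a minimal sketch of the trivial dictionary matcher, which tries every word at every database position. The function name `naive_dictionary_match` and the 1-based output convention are illustrative assumptions, not part of the lecture.

```python
def naive_dictionary_match(database, words):
    """Return sorted (start, word_index) pairs, both 1-based, for every occurrence.

    Worst-case time is O((l_1 + l_2 + ...) * n): each word is compared
    character by character at each of the n possible start positions.
    """
    hits = []
    for index, word in enumerate(words, start=1):
        for start in range(len(database) - len(word) + 1):
            if database[start:start + len(word)] == word:
                hits.append((start + 1, index))
    return sorted(hits)


hits = naive_dictionary_match("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
print(hits)  # [(7, 1)]: word 1 (POTATO) starts at database position 7
```
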
Direct Algorithm
[Figure: the database POTASTPOTATO with the pattern POTATO aligned at successive shifts]
Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton
• Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon reading x.
– Example: A[u,'T'] = v, and A[u,'S'] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: trie rooted at r for 1:POTATO, 2:POTASSIUM, 3:TASTE. The words POTATO and POTASSIUM share the path P-O-T-A ending at node u, which has a T-child v and an S-child w. Terminal nodes are labeled 1, 2, 3.]

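The construction can be sketched as follows. This is one possible representation, assuming the automaton is stored as a list of per-node dictionaries; the names `build_trie`, `goto`, and `terminal` are illustrative choices, with `goto[v][x]` playing the role of A[v,x] and node 0 playing the role of the root r.

```python
def build_trie(words):
    """Build the trie automaton for a list of dictionary words.

    goto[v][x] is the node reached from node v upon reading x (A[v, x]).
    terminal[v] is the 1-based dictionary index of the word that ends at v.
    """
    goto = [{}]          # node 0 is the root r
    terminal = {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})              # create a new node
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = index
    return goto, terminal


goto, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk to the node u reached by P-O-T-A; it branches on T and on S.
u = 0
for x in "POTA":
    u = goto[u][x]
print(sorted(goto[u]))  # ['S', 'T']
```

Construction time is proportional to the total length of the words, since each character either follows an existing edge or creates one node.
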
An O(l_p n) algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If the transition succeeds
– Increment the 'current' pointer
– Move to the new node
– If the node is terminal, report "success"
• Else
– Retract the 'current' pointer
– Increment the 'start' pointer
– Move to the root & repeat

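The steps above can be sketched as a short loop. This is a minimal sketch rather than the lecture's own code: `build_trie` and `keyword_tree_search` are hypothetical names, and hits are reported as 1-based (start position, word index) pairs.

```python
def build_trie(words):
    """goto[v][x] is the child of node v on character x; node 0 is the root."""
    goto, terminal = [{}], {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = index
    return goto, terminal


def keyword_tree_search(database, words):
    """Report (start, word_index) pairs, both 1-based, in O(l_p * n) time."""
    goto, terminal = build_trie(words)
    hits = []
    for start in range(len(database)):       # increment the 'start' pointer
        v = 0                                 # move to the root
        current = start                       # retract the 'current' pointer
        while current < len(database) and database[current] in goto[v]:
            v = goto[v][database[current]]    # successful transition
            current += 1
            if v in terminal:                 # terminal node: success
                hits.append((start + 1, terminal[v]))
    return hits


print(keyword_tree_search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))  # [(7, 1)]
```

From each start position the walk goes at most l_p steps down the trie, which gives the O(l_p n) bound.
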
Illustration:
[Figure: the database POTASTPOTATO with the start pointer l and the current pointer c marked, next to the trie with the current node v marked]

Idea for improving the time
• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
– then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i.
[Figure: the database POTASTPOTATO with l and c marked. Pattern i = POTASSIUM is aligned at l; pattern j = TASTE is aligned under the suffix TAS of the matched part. Dictionary: 1:POTATO, 2:POTASSIUM, 3:TASTE]

Failure function
• Every node v corresponds to a string s_v that is a prefix of some pattern.
• Define F[v] to be the node u such that s_u is the longest proper suffix of s_v (so u ≠ v; if no non-empty suffix qualifies, u is the root).
• If we fail to match at v, we should jump to F[v], and commence matching from there
• Let lp[v] = |s_u|
[Figure: the trie for POTATO, POTASSIUM, TASTE with nodes labeled n_1, …, n_10 and the current node v marked]

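One common way to compute F and lp is a breadth-first pass over the trie, so that the failure link of each node's parent is known before the node itself is processed. The sketch below assumes that approach; the names `build_trie`, `failure_links`, and `label` are illustrative, and nodes are identified by their strings s_v rather than by the n_i labels of the figure.

```python
from collections import deque


def build_trie(words):
    """Return goto (per-node transition dicts) and label (label[v] is s_v)."""
    goto, label = [{}], [""]
    for word in words:
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                label.append(label[v] + x)
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
    return goto, label


def failure_links(goto, label):
    """Compute F[v] and lp[v] = |s_F[v]| for every node, in breadth-first order."""
    F = [0] * len(goto)
    lp = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            u = F[v]
            while u != 0 and x not in goto[u]:
                u = F[u]                 # try shorter and shorter suffixes
            F[w] = goto[u].get(x, 0)     # fall back to the root if none fits
            lp[w] = len(label[F[w]])
            queue.append(w)
    return F, lp


goto, label = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = failure_links(goto, label)
node = label.index("POTAS")
print(label[F[node]], lp[node])  # TAS 3
```

For the example dictionary, failing after POTAS moves the automaton to the node for TAS, which is the jump that lets TASTE continue matching without rereading the database.
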
Illustration
• What is F[n_10]?
• What is F[n_5]?
• F[n_3]?
• lp[n_10]?
[Figure: the trie for POTATO, POTASSIUM, TASTE with nodes labeled n_1, …, n_10]

Illustration
[Figure sequence: the database POTASTPOTATO with pointers l and c, and the current node v marked on the trie. Successive slides show:]
• l = 1, c = 1
• l = 1, c = 2
• l = 1, c = 6
• l = 3, c = 6
• l = 3, c = 7
• l = 7, c = 7
• l = 7, c = 8

Time analysis
• In each step, either c is incremented, or l is incremented
• Neither pointer is ever decremented (lp[v] < c-l).
• l and c do not exceed n
• Total time <= 2n
[Figure: the database POTASTPOTATO with pointers l and c marked]

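The bound can be checked empirically with a sketch of the full search that counts its own steps. This is an illustrative implementation under stated assumptions: the names `build_automaton` and `search` are hypothetical, pointers are 0-based internally, and the `out` lists (words ending at a node or at any node on its failure chain) are an addition beyond the slides so that words nested inside other words are also reported.

```python
from collections import deque


def build_automaton(words):
    """Trie plus failure links F and per-node output lists of (index, length)."""
    goto, out = [{}], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        out[v].append((index, len(word)))
    F = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            u = F[v]
            while u != 0 and x not in goto[u]:
                u = F[u]
            F[w] = goto[u].get(x, 0)
            out[w] = out[w] + out[F[w]]   # inherit words ending at F[w]
            queue.append(w)
    return goto, F, out


def search(database, words):
    """Return (hits, steps); hits are 1-based (start, word_index) pairs."""
    goto, F, out = build_automaton(words)
    hits, steps = [], 0
    v, c = 0, 0      # c = next character to read; implicitly l = c - |s_v|
    while c < len(database):
        steps += 1
        x = database[c]
        if x in goto[v]:
            v = goto[v][x]                # success: c is incremented
            c += 1
            hits.extend((c - length + 1, index) for index, length in out[v])
        elif v != 0:
            v = F[v]                      # failure: |s_v| shrinks, so l grows
        else:
            c += 1                        # stuck at the root: l and c both advance
    return hits, steps


hits, steps = search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
print(hits, steps <= 2 * len("POTASTPOTATO"))  # [(7, 1)] True
```

Each loop iteration either advances c or follows a failure link, which advances l, so the step counter cannot exceed 2n.
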
Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff
• Blast

Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of "local alignment" algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.

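Steps 1 and 2 can be illustrated with a deliberately simplified sketch. It assumes exact words of a fixed size w and uses a hash table as a stand-in for the automaton; it does not generate neighborhood words, extend hits, or compute E-values. The function names `query_words` and `find_hits` are hypothetical and are not BLAST's own interface.

```python
from collections import defaultdict


def query_words(query, w):
    """Map every length-w word of the query to its 0-based query positions."""
    positions = defaultdict(list)
    for i in range(len(query) - w + 1):
        positions[query[i:i + w]].append(i)
    return positions


def find_hits(query, database, w):
    """Return (query_pos, database_pos) pairs, 0-based, where a word matches exactly."""
    positions = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        for i in positions.get(database[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_hits("TASTE", "POTASTPOTATO", 3))  # [(0, 2), (1, 3)]
```

The two hits (TAS and AST) lie on the same diagonal, which is the kind of seed that step 3 would then extend into a local alignment.
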
BLAST output
• Look up Blast Results with RID
– HA5YXH5C012

Distant hits

Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
– Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
• A: Accept hits at higher E-value.
– This increases the probability that the sequence similarity is a chance event.
– How can we get around this paradox?
– Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?
[Figure: sequences A, B, C]

Silly Quiz
[Figure: images labeled "Skin patterns" and "Facial Features"]

Not all features (residues) are important
[Figure: images labeled "Skin patterns" and "Facial Features"]

Diverged family members provide key features

Protein sequence motifs
• Premise:
• The sequence of a protein gives clues about its structure and function.
• Not all residues are equally important in determining function.
• Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
• How can we identify these key residues?
[Figure: family Fam(B) and sequences A, C]

Prosite
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

Basic idea
• It is a heuristic approach. Start with the following:
– A collection of sequences with the same function.
– Region/residues known to be significant for maintaining structure and function.
• Develop a pattern of conserved residues around the residues of interest
• Iterate for appropriate sensitivity and specificity

EX: Zinc Finger domain

Proteins containing zf domains
How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
Alignment (a * marks a column in the original figure):
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Pattern: ATH-[DE]
• Search Swissprot with the resulting pattern
• Refine pattern to eliminate false positives
• Iterate

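The step from alignment columns to a pattern can be sketched as follows, assuming an ungapped alignment of equal-length sequences and a column range chosen by hand. The function name `column_pattern` is hypothetical, and the output uses hyphens between all positions (A-T-H-[DE]), which denotes the same pattern as ATH-[DE].

```python
def column_pattern(aligned, start, end):
    """Build a pattern from alignment columns start..end-1 (0-based).

    A column with a single residue becomes that residue; a column with
    several residues becomes a bracketed set such as [DE].
    """
    elements = []
    for col in range(start, end):
        residues = sorted({seq[col] for seq in aligned})
        if len(residues) == 1:
            elements.append(residues[0])
        else:
            elements.append("[" + "".join(residues) + "]")
    return "-".join(elements)


alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(column_pattern(alignment, 5, 9))  # A-T-H-[DE]
```

Refinement in the sense of the slide would mean adjusting the column range or the allowed sets after checking which database sequences the pattern hits.
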
The sequence analysis perspective
• Zinc Finger motif
– C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
– 2 conserved C, and 2 conserved H
• How can we search a database using these motifs?
– The motif is described using a regular expression. What is a regular expression?

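One way to search with such a motif is to translate it into the syntax of a general-purpose regular expression engine. The sketch below converts the subset of the pattern notation used on this slide (single residues, x, bracketed sets, and repeat counts) into a Python `re` pattern. The function name `prosite_to_regex` is hypothetical, other PROSITE constructs are not handled, and the test sequence is made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate elements like C, x(2,4), [LIVMFYWC], x(3) into a Python regex."""
    element_re = r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?"
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = m.groups()
        part = "." if residue == "x" else residue   # x means any residue
        if low is not None:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(part)
    return "".join(parts)


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(ZINC_FINGER)
print(regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# A made-up sequence that satisfies the motif, for illustration only.
made_up = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(re.search(regex, made_up) is not None)  # True
```
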
Regular Expressions
• Concise representation of a set of strings over alphabet Σ.
• Described by a string over {Σ, ⋅, ∗, +}
• R is a r.e. if and only if
– R = {ε} (base case)
– R = {σ}, σ ∈ Σ
– R = R_1 + R_2 (union of strings)
– R = R_1 ⋅ R_2 (concatenation)
– R = R_1* (0 or more repetitions)

• End of L7

Regular Expression
• Q: Let Σ = {A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
• Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?

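The membership questions can be checked mechanically. The sketch below assumes that the slide's + denotes union and rewrites it as Python's |, then requires the whole string to match; the helper name `in_language` is hypothetical, and the naive substitution is only valid for simple expressions like this one.

```python
import re


def in_language(s, slide_regex):
    """Test whether the whole string s belongs to the set described by slide_regex.

    Assumption: '+' in the slide notation means union, written '|' in Python.
    """
    python_regex = slide_regex.replace("+", "|")
    return re.fullmatch(python_regex, s) is not None


R = "(A+C)*EEC*"
for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s, R))
# CEEC True
# AEC False
# ACEE True
```

AEC is rejected because R requires two consecutive E's after the (A+C)* part.
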