Pattern Discovery EECS 458 CWRU Fall 2004 Roadmap • Pattern types • Motivation • Exhaustive searches • TEIRESIAS, • WINNOWER (Graph) • Gibbs sampler • EM algorithm (MEME) Problems • Pattern matching: find the exact occurrences of a given pattern in a given structure ( string matching) • Pattern recognition: recognizing approximate occurrences of a given pattern in a given structure (image recognition) • Pattern discovery: identifying significant patterns in a given structure, when the patterns are unknown (promoter discovery) 1
Data • Positive examples: a sample set S+ of sequences with unknown patterns • Negative examples: a sample set S- of sequences w/o patterns • Noise data Pattern discovery: the problem • Given a set of sequences S+ and a model of the source for S- • Find a set of patterns in S+ which have a support that is statistically significant w.r.t. the probabilistic model • If we are also given negative examples for S-, we must ensure that patterns do not appear in S- Motivation • Transcription factor binding sites: . . . Given a collection of genes with common expression, Find the TF-binding motif in common 2
Characteristics of Regulatory Motifs • Tiny • Highly Variable • ~Constant Size – Because a constant-size transcription factor binds • Often repeated • Low-complexity Pattern discovery • Methods – Combinatorial: graph, suffix tree, – Learning: clustering, classification – Statistical: EM, Gibbs sampling • Type of patterns – Deterministic, rigid, flexible, profiles, … • Measure of significance Training Adjust parameters Pos example Learning system Output Model Machine Neg example Predictions New data 3
Output • True positive: a pattern belonging to the positive set which has been correctly predicted • True negative: a pattern belonging to the negative set which has been correctly predicted • False positive: a pattern belonging to the negative set which has been predicted as positive • False negative: a pattern belonging to the positive set which has been predicted as negative Measuring pattern discovery • Complete: if no true pattern is missed – But may report too many patterns • Sound: if no false pattern is reported – But may miss true positives • Usually there is a tradeoff between soundness and completeness: if you increase one, you will decrease the other. Measuring the performance • Information retrieval measures: – Precision (sensitivity) = true pos/(true pos + false pos): the proportion of discovered patterns out of the total reported positive – Recall = true pos / (true pos + true neg): the proportion of discovered patterns out of the total true patterns 4
Measuring the performance • Specificity = true neg / (true neg + false pos): proportion of true neg of all negative samples • Correlation coefficient = (tp tn –fp fn) / sqrt( (tp+fp)(tp+fn)(tn+fn)(tn+fp)): is 1 when there are no false pos and no fase net, is 0 if the systems chooses randomly and -1 when there are only false pos and false neg Types of patterns • Deterministic patterns • Rigid patterns – Hamming distance • Flexible patterns (F-x(5)-G-x(2-4)-G-*-H – Edit distance • Matrix profiles (Position weighted matrix PWM or Position-specific score matrix PSSM) Deterministic patterns • Def: Deterministic patterns are substring over alphabet S – E.g. “TATAAA” (TATA-box) • Discovery algorithms are faster on these types of patterns • Usually not flexible enough for the needs of molecular biology 5
Rigid patterns • Def: Rigid patterns are patterns which allow substitutions/”don’t care” symbols – E.g. the patterns under IUPAC alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N} where for example: R=[A|G], Y=[C|T], etc. • Note that the size of the pattern is not allowed to change Flexible patterns • Def: Flexible patterns are patterns which allow substitutions/”don’t care” symbols and variable length gap – E.g. F-x(5)-G-x(2-4)-G-*-H. • Note that the size of the pattern is variable • Very expressive • Space of all pattern is huge Position weighted matrix 6
Hamming distance • Def: given two string y and w with the same length, the Hamming distance h(y,w) is given by the number of mismatches between y and w • Example: – y=GATTACA – w=TATAATA – h(w,y)=h(y,w)=3 Hamming neighborhood • Def: given a string y, all strings at Hamming distance at most d from y are in its d-neighborhood • Fact: the size of N(m,d) of the d- neighborhood of a string y with length m is ⎛ ⎞ d m = ∑ ⎜ ⎟ Σ − ∈ Σ j d d ( , ) (| | 1 ) ( | | ) N m d ⎜ ⎟ O m ⎝ ⎠ j = 0 j Hamming neighborhood • Example: y=ATA, the 1-neighborhood is { CTA, GTA, TTA, AAA, ACA, AGA, ATC, ATG, ATT, ATA} • This set can be written as a rigid pattern: {NTA|ANA|ATN} 7
Rigid pattern discovery • Note that we are going to observe occurrences of the neighbors of y, but we may never observe an occurrences of the pattern y y is the unknown pattern 2d d is the number of allowed w3 w1 mismatches d y W1, w2, w3 belong to the d- neighborhood of y w2 Hamming neighborhood • Fact: given two string w 1 and w 2 in the d- neighborhood of the pattern y, then h(w 1 ,w 2 ) ≤ 2d • The problem of finding y given w 1 , w 2 ,… is sometimes called Steiner sequence problem • Unfortunately, even we know all the w i in the neighborhood, there is no guarantee to find the unknown pattern y. Discrete formulations • Closest string problem: Given a multisequence set X= {x 1 , …, x k } each of length n, find a string y of the same length and the minimum d such that h(y,x i ) ≤ d. • Consensus pattern problem: Given a multisequence X= {x 1 , …, x k } each of length n and an integer m, find a string y of length m and substring t i of length m for each x i minimizing Sum i h(y,t i ), denote as d(y,X). 8
Enumerative approach: idea • Define the search space • List exhaustively all the patterns in the search space • Compute the statistical significance of all of them • Report the pattern with the highest statistical significance Enumerative approach 1. Pattern-driven algorithm: (4 m possibilities) For t = AA…A to TT…T Find d( t, S ) Report y = argmin( d(t, S) ) Running time: O(m|S|4 m ) Advantage: Finds provably “best” motif y Time Disadvantage: Complexity results • Thm: the consensus pattern problem is NP-hard • Thm: the closest string problem is NP- hard. • And many of other formulations in pattern discovery turn out to be NP-hard also (Li et al. STOC 99) 9
NP-hard, what to do? • Change the problem – E.g. relax the class of patterns • Accept the fact that methods may fail to find the optimal patterns – Heuristics – Randomized algorithms – Approximation schemes Exhaustive Searches 2. Sample-driven algorithm: For any m-long word occurring in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) or, Report a local improvement of W * Running time: O( m|S| 2 ) Time Advantage: If the true motif is weak and does not occur in data Disadvantage: then a random motif may score better than any instance of true motif Two naïve approaches • n=1,000,000, m=20, | ∑ |=4 • Approach 1: patterns to be tested O(nm| ∑ | m ), in this case: 1,000,000*20*4^20 • Approach 2: pattern to be tested O(n 2 ), in this case: 1,000,000^2 10
Discovering rigid patterns • Some “recent” algorithms • Teiresias • Winnower • Projection • Weeder Teiresias Teiresias • Teiresias by Rigoustos and Floratos from IBM [Bioinformatics, 1998] • Server at http://cbcsrv.watson.ibm.com/Tspd.html • The worst case running time is exponential, but works reasonably fast on average 11
Teiresias patterns • Teiresias searches for rigid patterns on the alphabet ∑ U{.}, where “.” is the don’t care symbol • However, there are some constrains on the density of “.” that can appear in a pattern <L,W> patterns • Def: given integers L and W, L ≤ W, y is a <L,W> pattern if – y is a string over ∑ U{.} – y starts and ends with a symbol from ∑ – any substring of y containing exactly L symbols from ∑ has to be shorter or equal to W (i.e. any substring containing exactly L symbols, the number of don’t care symbol is less than or equal to W-L) Example of <3,5> patterns • AT..GG..T is a <3,5> pattern • AT..GG..T. is not a <3,5> pattern • AT.G.G..T is not a <3,5> pattern 12
Teiresias • Def: a pattern w is more specific than a pattern y, if w can be obtained from y by changing one or more “.” to symbols from ∑ , or by appending any sequence of ∑ U{.} to the ends of y • Example: given y = AT..GG..T, the following patterns are more specific than y: • ATC.GG..T, CAT..GG..T, A.AT..GGT.T.A Teiresias • Def: a pattern y is maximal w.r.t. the sequences {x 1 , …, x k } if there exists no pattern w which is more specific than y and have the same number of occurrences. • Given {x 1 , …, x k } and parameter L,W and K, Teiresias reports all the maximal <L,W> patterns that have at least K colors. Some notation • Given {x 1 , …, x k } and a pattern y • Occurrences: the total number of occurrences of y in {x 1 , …, x k } is denoted as f(y) • Colors: assume each sequence has a different color, the total number of sequences that contain y (thus the total number of different colors) is denoted as c(y). • These quantities are called the support of y. 13
Recommend
More recommend