Probability Theory as Extended Logic: Applications to motif finding and clustering

• Ab initio motif finding: regular expressions, MEME, Gibbs sampling.
• Probabilistic clustering of sequences.
• Discovering regulatory modules.
• Regulatory motif discovery in phylogenetically related sequences.

Erik van Nimwegen
Division of Bioinformatics, Biozentrum, Universität Basel
Swiss Institute of Bioinformatics

Transcription Regulation Networks

[Figure: regulators (transcription factors), promoters, and genes.]
[Figure: the same diagram with binding sites marked in the promoters, and the resulting regulatory network.]

To reconstruct the network we need to identify all binding sites genome-wide and the factor(s) that bind at each site.
Transcription Regulation Networks

[Figure: data series for metabolic genes, transcription factors, and cell cycle related genes.]

• The number of transcription regulators increases roughly quadratically with the size of the genome.
• The number of regulators per gene thus increases linearly with the size of the genome.

From: E. van Nimwegen, Trends in Genetics 19, 479-484 (2003)

Transcription Regulation Networks

Knowledge from direct experimentation:

E. coli:
• Almost 200,000 papers in PubMed. Over 17,000 on transcription.
• About 300 TFs.
• Less than 100 TFs with at least 1 known binding site.
• About 750 known sites in total (of 2,500-8,000?).

S. cerevisiae:
• Almost 60,000 papers in PubMed. Over 10,000 on transcription.
• About 350 TFs.
• About 65 TFs with at least 1 known binding site.
• About 450 known sites in total (of > 10,000?).

Even in intensely studied model organisms the majority of regulatory sites is not known.
Ab initio discovery of regulatory sites

General Approaches:
1. Collect sets of (intergenic) sequences that are thought to contain binding sites for a common regulatory factor. Examples:
   • Upstream regions of co-regulated genes.
   • Sequence fragments pulled down with ChIP.
   Then search for overrepresented short sequence motifs among them.

[Figure: microarray experiments (gene expression), binding experiments (ChIP-on-chip), and other external biological knowledge each yield sets of sequences containing sites for a common regulatory factor.]

Representation by consensus sequence or regular expression

The experimentally known binding sites of MBP1 (yeast TF):

ACGCGT
ACGCGT
ACGCGA
ACGCGT
ACGCGA
CCGCGT
TCGCGA
ACGCGT
ACGCGT
ACGCGT
ACGCGT
ACGCGT

So-called IUPAC symbols are used to represent sets of nucleotides. For instance: W = {A,T} and H = {A,C,T}.

Consensus sequence (take the majority base in each column): ACGCGT
Regular expression (take the IUPAC symbol for the set of bases occurring in each column): HCGCGW
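The two column-wise rules above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `consensus` and `iupac_expression` are hypothetical, and the lookup table is the standard IUPAC nucleotide code.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

# The MBP1 sites as listed on the slide.
MBP1_SITES = [
    "ACGCGT", "ACGCGT", "ACGCGA", "ACGCGT", "ACGCGA", "CCGCGT",
    "TCGCGA", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT",
]

def consensus(sites):
    """Majority base in each alignment column (ties go to the base seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def iupac_expression(sites):
    """IUPAC symbol for the set of bases occurring in each alignment column."""
    return "".join(IUPAC["".join(sorted(set(col)))] for col in zip(*sites))

print(consensus(MBP1_SITES))         # ACGCGT
print(iupac_expression(MBP1_SITES))  # HCGCGW
```

Both functions assume the sites are already aligned and of equal length, as they are on the slide.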
Scan for over-represented patterns

[Figure: upstream regions of Genes A-E (each gene starting at ATG), with the motif occurrences AGCTCG, TGCTCG, TGCTCG, AGCACG, and TGCACG marked.]

• Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.
• For a given motif, say s = WGCWCG, find all occurrences.
• Determine the significance of the motif. Roughly speaking, the significance is given by the probability of getting that many occurrences in random sequences, e.g. P(WGCWCG) = 0.034.
• Rank all motifs by significance and report the motifs with the highest significance.
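The scan-and-score steps above can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: a uniform background (each base with probability 1/4), independent positions (so the number of matches is treated as binomial), and forward-strand matches only. The function names are hypothetical, and the sketch is not expected to reproduce the slide's value of 0.034, which depends on the actual sequences and background model used.

```python
import re
from math import comb

# Standard IUPAC nucleotide codes: symbol -> bases it represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_occurrences(motif, seqs):
    """Return (sequence index, start, matched word) for every match, overlaps included."""
    body = "".join("[" + IUPAC[c] + "]" for c in motif)
    pattern = re.compile("(?=(" + body + "))")  # lookahead so overlapping matches count
    return [(i, m.start(), m.group(1))
            for i, s in enumerate(seqs) for m in pattern.finditer(s)]

def match_probability(motif):
    """Probability that a random word matches the motif (ASSUMES uniform background)."""
    p = 1.0
    for c in motif:
        p *= len(IUPAC[c]) / 4
    return p

def tail_probability(n_obs, n_positions, p):
    """P(at least n_obs matches in n_positions trials), ASSUMING independent positions."""
    return sum(comb(n_positions, k) * p**k * (1 - p) ** (n_positions - k)
               for k in range(n_obs, n_positions + 1))

def significance(motif, seqs):
    """Binomial tail probability of the observed number of matches in seqs."""
    n_obs = len(find_occurrences(motif, seqs))
    n_pos = sum(max(len(s) - len(motif) + 1, 0) for s in seqs)
    return tail_probability(n_obs, n_pos, match_probability(motif))
```

Calling `significance` for every candidate motif and sorting by the returned value (smallest first) gives the ranking described in the last bullet. A realistic tool would replace the uniform background with one estimated from the genome.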
Over-representation of consensus and regular expression patterns

Example algorithms:
• YMF (Sinha and Tompa)
• Weeder (Pavesi et al.)

Advantages:
• The search is exhaustive. If a significant motif exists it is guaranteed to be found.

Disadvantages:
• Consensus sequences and regular expressions are not necessarily a good representation of binding sites (see the next slide).
• The significant motifs are often partially redundant. For example:
  ATTACTAT
  WWACTWTTA
  AATTAC
  ATTACGG
  Now which motif is the "correct" motif?

The weight matrix representation of regulatory motifs

Alignment of known fruR binding sites:

AAGCTGAATCGATTTTATGATTTGGT
AGGCTGAATCGTTTCAATTCAGCAAG
CTGCTGAATTGATTCAGGTCAGGCCA
GTGCTGAAACCATTCAAGAGTCAATT
GTGGTGAATCGATACTTTACCGGTTG
CGACTGAAACGCTTCAGCTAGGATAA
TGACTGAAACGTTTTTGCCCTATGAG
TTCTTGAAACGTTTCAGCGCGATCTT
ACGGTGAATCGTTCAAGCAAATATAT
GCACTGAATCGGTTAACTGTCCAGTC
ATCGTTAAGCGATTCAGCACCTTACC
**gcTGAAtCG*TTcAg**c******

$w^i_\alpha$ = probability of finding base $\alpha$ at position $i$.

For instance: $w^3_A = 0.267$, $w^3_C = 0.2$, $w^3_G = 0.467$, $w^3_T = 0.067$.

Probability that a site for the TF represented by $w$ will have sequence $s$:

$$P(s \mid w) = \prod_{i=1}^{l} w^i_{s_i}$$
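As a rough illustration, the weight matrix and the site probability $P(s \mid w)$ can be computed from the fruR alignment as follows. The slides do not say how the entries of $w$ are estimated; the example values for position 3 are consistent with adding a pseudocount of 1 to each base count (column 3 has A: 3, C: 2, G: 6, T: 0 among 11 sites), so the sketch assumes that. The function names are hypothetical.

```python
from collections import Counter

# The fruR binding sites as listed on the slide (11 aligned sites of length 26).
FRUR_SITES = [
    "AAGCTGAATCGATTTTATGATTTGGT", "AGGCTGAATCGTTTCAATTCAGCAAG",
    "CTGCTGAATTGATTCAGGTCAGGCCA", "GTGCTGAAACCATTCAAGAGTCAATT",
    "GTGGTGAATCGATACTTTACCGGTTG", "CGACTGAAACGCTTCAGCTAGGATAA",
    "TGACTGAAACGTTTTTGCCCTATGAG", "TTCTTGAAACGTTTCAGCGCGATCTT",
    "ACGGTGAATCGTTCAAGCAAATATAT", "GCACTGAATCGGTTAACTGTCCAGTC",
    "ATCGTTAAGCGATTCAGCACCTTACC",
]

def build_weight_matrix(sites, pseudocount=1.0):
    """One dict per column: base -> (count + pseudocount) / (n_sites + 4 * pseudocount).

    The pseudocount of 1 is an ASSUMPTION inferred from the slide's example values.
    """
    wm = []
    for col in zip(*sites):
        counts = Counter(col)
        total = len(col) + 4 * pseudocount
        wm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return wm

def site_probability(s, wm):
    """P(s | w): product over positions i of the entry for base s_i in column i."""
    p = 1.0
    for column, base in zip(wm, s):
        p *= column[base]
    return p

wm = build_weight_matrix(FRUR_SITES)
# Position 3 on the slide is index 2 here (0-based).
print({b: round(wm[2][b], 3) for b in "ACGT"})
```

The printed column matches the slide's values for position 3 (0.267, 0.2, 0.467, 0.067), which supports the pseudocount assumption without proving it.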
Ab initio motif discovery with weight matrices

Assume the input set of 'co-regulated' sequences is a mixture of "random" background sequence plus a number of samples from a weight matrix.

Unknowns:
1. The weight matrix
2. The number of sites
3. The positions of the sites

MEME approach: Search the space of WMs for the WM that maximizes the likelihood of the data (summing over all possible binding site configurations for each WM). The likelihood is maximized using "Expectation Maximization".

Probability of the sequence given a configuration

Probability of the sequence at positions (i+1) through (i+L), assuming it is a site for WM $w$:

$$P(s_{[i,L]} \mid w) \equiv P(s_{i+1} s_{i+2} \cdots s_{i+L} \mid w) = \prod_{k=1}^{L} w^k_{s_{i+k}}$$

Probability of a base not in a site ("background"): $P(s \mid b) = b_s$

Probability of a given configuration of sites $\rho$ for a set of sequences $S$:

Likelihood of a configuration:

$$P(S \mid \rho, w) = \prod_{i \in \text{sites}} P(s_{[i,L]} \mid w) \prod_{j \notin \text{sites}} b_{s_j}$$

Prior of a configuration:

$$P(\rho) = p_w^{\,n_{\text{sites}}} (1 - p_w)^{n_{\text{nonsites}}}$$
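The likelihood and prior of a single configuration can be evaluated with a short Python sketch, shown here in log space for one sequence. Several details are assumptions rather than statements from the slides: sites are given as 0-based start positions and must not overlap, the count of non-sites is taken to be the number of positions not covered by any site, and `p_site` stands in for the prior parameter $p_w$. The function name is hypothetical.

```python
from math import log

def log_prob_configuration(seq, starts, wm, bg, p_site):
    """Return (log likelihood, log prior) of a configuration of sites on one sequence.

    seq    : DNA string
    starts : 0-based start positions of the sites (ASSUMED non-overlapping)
    wm     : weight matrix, one dict base -> probability per motif position
    bg     : background dict base -> probability
    p_site : prior parameter (plays the role of p_w; an assumption of this sketch)
    """
    L = len(wm)
    covered = set()
    log_lik = 0.0
    for i in starts:
        for k in range(L):
            if i + k in covered:
                raise ValueError("sites overlap")
            covered.add(i + k)
            log_lik += log(wm[k][seq[i + k]])   # site term: WM entry at motif position k
    nonsite = [j for j in range(len(seq)) if j not in covered]
    log_lik += sum(log(bg[seq[j]]) for j in nonsite)  # background term for uncovered bases
    # ASSUMPTION: the non-site count is the number of uncovered positions.
    log_prior = len(starts) * log(p_site) + len(nonsite) * log(1 - p_site)
    return log_lik, log_prior

# Toy example with a 2-column weight matrix and a uniform background.
toy_wm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform_bg = {b: 0.25 for b in "ACGT"}
print(log_prob_configuration("TACG", [1], toy_wm, uniform_bg, p_site=0.1))
```

An EM procedure of the kind the MEME approach uses would sum terms like these over all configurations rather than score a single one; that step is not shown here.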