Regulatory Motif Prediction in DNA
• Introduction: toward transcription regulatory networks
• Ab initio discovery of motifs by over-representation of regular expressions
• The weight matrix representation of regulatory motifs
• Ab initio discovery with weight matrices: MEME and the Gibbs Sampler
• Discovery of regulatory modules in higher eukaryotes
• Ab initio regulatory motif discovery in phylogenetically related sequences: PhyloGibbs
Erik van Nimwegen
Division of Bioinformatics, Biozentrum, Universität Basel, Swiss Institute of Bioinformatics
E. van Nimwegen, EMBnet Geneve, Feb 2006.

Transcription Regulation Networks
[Figure: regulators (transcription factors) bind to binding sites in the promoters upstream of genes (ATG…); the links from regulators to genes form the regulatory network.]
To reconstruct the network we need to identify all binding sites genome-wide and the factor(s) that bind at each site.

Transcription Regulation Networks
[Figure: metabolic genes, transcription factors, and cell cycle related genes.]
• The number of transcription regulators increases roughly quadratically with the size of the genome.
• The number of regulators per gene thus increases linearly with the size of the genome.
From: E. van Nimwegen, Trends in Genetics 19, 479-484 (2003).

Transcription Regulation Networks
Knowledge from direct experimentation:
E. coli:
• Almost 200,000 papers in PubMed. Over 17,000 on transcription.
• About 300 TFs.
• Fewer than 100 TFs with at least 1 known binding site.
• About 750 known sites in total (of 2,500-8,000?).
S. cerevisiae:
• Almost 60,000 papers in PubMed. Over 10,000 on transcription.
• About 350 TFs.
• About 65 TFs with at least 1 known binding site.
• About 450 known sites in total (of > 10,000?).
Even in intensely studied model organisms, the majority of regulatory sites are not known.

Ab initio discovery of regulatory sites
General Approaches:
1. Collect sets of (intergenic) sequences that are thought to contain binding sites for a common regulatory factor, then search for over-represented short sequence motifs among them. Examples:
• Upstream regions of co-regulated genes.
• Sequence fragments pulled down with ChIP.
[Figure: microarray experiments (gene expression), binding experiments (ChIP-on-chip), and other external biological knowledge each yield sets of sequences containing sites for a common regulatory factor.]

Representation by consensus sequence or regular expression
The experimentally known binding sites of MBP1 (yeast TF):
ACGCGT
ACGCGT
ACGCGA
ACGCGT
ACGCGA
CCGCGT
TCGCGA
ACGCGT
ACGCGT
ACGCGT
ACGCGT
ACGCGT
So-called IUPAC symbols are used to represent sets of nucleotides. For instance: W = {A,T} and H = {A,C,T}.
Consensus sequence (take the majority base in each column): ACGCGT
Regular expression (take the IUPAC symbol for the set of bases occurring in each column): HCGCGW

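The two rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `consensus` and `iupac_expression` are hypothetical, the site list is typed in from the slide, and the lookup table uses the standard IUPAC nucleotide codes.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

# The MBP1 sites as listed on the slide.
MBP1_SITES = [
    "ACGCGT", "ACGCGT", "ACGCGA", "ACGCGT", "ACGCGA", "CCGCGT",
    "TCGCGA", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT",
]


def consensus(sites):
    """Majority base in each alignment column (ties go to the base seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))


def iupac_expression(sites):
    """IUPAC symbol for the set of bases observed in each alignment column."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))


if __name__ == "__main__":
    print(consensus(MBP1_SITES))
    print(iupac_expression(MBP1_SITES))
```

Note how the regular expression keeps the rare C and T in the first column that the consensus sequence discards, but gives them the same status as the dominant A.
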
Scan for over-represented patterns
[Figure: upstream regions of Genes A to E (each ending at ATG…), with the motif occurrences AGCTCG, TGCTCG, TGCTCG, AGCACG and TGCACG marked.]
• Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.
• For a given motif, say s = WGCWCG, find all occurrences.
• Determine the significance of the motif. Roughly speaking, the significance is given by the probability of getting at least this many occurrences in random sequences, e.g. P(WGCWCG) = 0.034.
• Rank all motifs by significance and report the motifs with the highest significance.

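A minimal sketch of the counting and significance steps, under strong simplifying assumptions that are not taken from the slides: the "random sequences" are modelled as independent bases with fixed frequencies, only one strand is scanned, and the number of occurrences is treated as binomial, which ignores correlations between overlapping positions. All function names are hypothetical, and published tools use more careful statistics.

```python
import math
import re

# Bases covered by each standard IUPAC symbol.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_occurrences(motif, sequences):
    """Count matches of an IUPAC motif, including overlapping ones."""
    body = "".join("[" + IUPAC_SETS[c] + "]" for c in motif)
    pattern = re.compile("(?=" + body + ")")  # lookahead keeps overlapping matches
    return sum(len(pattern.findall(seq)) for seq in sequences)


def match_probability(motif, base_freqs):
    """Probability that one random position matches the motif."""
    p = 1.0
    for c in motif:
        p *= sum(base_freqs[b] for b in IUPAC_SETS[c])
    return p


def tail_probability(motif, sequences, base_freqs):
    """Return (n_obs, P(X >= n_obs)) with X ~ Binomial(n_positions, p)."""
    n_obs = count_occurrences(motif, sequences)
    n_pos = sum(max(len(s) - len(motif) + 1, 0) for s in sequences)
    p = match_probability(motif, base_freqs)
    below = sum(math.comb(n_pos, k) * p**k * (1 - p) ** (n_pos - k)
                for k in range(n_obs))
    return n_obs, max(0.0, 1.0 - below)


if __name__ == "__main__":
    uniform = {b: 0.25 for b in "ACGT"}
    toy = ["AGCTCG", "TGCTCGTGCTCG", "AGCACG", "TGCACG"]  # toy input, not real promoters
    print(tail_probability("WGCWCG", toy, uniform))
```

Ranking motifs then amounts to sorting all candidate patterns by this tail probability, smallest first.
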
Over-representation of consensus and regular expression patterns
Example algorithms:
• YMF (Sinha and Tompa)
• Weeder (Pavesi et al.)
Advantages:
• The search is exhaustive. If a significant motif exists it is guaranteed to be found.
Disadvantages:
• Consensus sequences and regular expressions are not necessarily a good representation of binding sites (next slides).
• The significant motifs are often partially redundant. For example: ATTACTAT, WWACTWTTA, AATTAC, ATTACGG. Now which motif is the “correct” motif?

The weight matrix representation of regulatory motifs
Alignment of known fruR binding sites:
CTGAATCGATTTTAT
CTGAATCGTTTCAAT
CTGAATTGATTCAGG
CTGAAACCATTCAAG
GTGAATCGATACTTT
CTGAAACGCTTCAGC
CTGAAACGTTTTTGC
TTGAAACGTTTCAGC
GTGAATCGTTCAAGC
CTGAATCGGTTAACT
GTTAAGCGATTCAGC
cTGAAtCG*TTcAg*
$w^i_\alpha$ = probability of finding base $\alpha$ at position $i$.
For instance: $w^1_A = 0.07$, $w^1_C = 0.53$, $w^1_G = 0.27$, $w^1_T = 0.13$.
Probability that a site for the TF represented by $w$ will have sequence $s$:
$$P(s \mid w) = \prod_{i=1}^{l} w^i_{s_i}$$

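The weight matrix and the site probability $P(s \mid w)$ can be sketched as follows. The slide does not say how the matrix entries were estimated; a pseudocount of 1 per base is assumed here because it reproduces the first-column values quoted on the slide (for example (7 + 1) / (11 + 4) ≈ 0.53 for C). The function names are hypothetical.

```python
# The fruR sites as listed on the slide.
FRUR_SITES = [
    "CTGAATCGATTTTAT", "CTGAATCGTTTCAAT", "CTGAATTGATTCAGG", "CTGAAACCATTCAAG",
    "GTGAATCGATACTTT", "CTGAAACGCTTCAGC", "CTGAAACGTTTTTGC", "TTGAAACGTTTCAGC",
    "GTGAATCGTTCAAGC", "CTGAATCGGTTAACT", "GTTAAGCGATTCAGC",
]


def weight_matrix(sites, pseudocount=1.0):
    """One dict per column mapping base -> estimated probability.

    The pseudocount is an assumption of this sketch, not stated on the slide.
    """
    total = len(sites) + 4 * pseudocount
    return [
        {b: (col.count(b) + pseudocount) / total for b in "ACGT"}
        for col in zip(*sites)
    ]


def site_probability(site, wm):
    """P(s | w): product over positions of the probability of the observed base."""
    p = 1.0
    for i, base in enumerate(site):
        p *= wm[i][base]
    return p


if __name__ == "__main__":
    wm = weight_matrix(FRUR_SITES)
    print({b: round(v, 2) for b, v in wm[0].items()})
    print(site_probability("CTGAATCGATTTTAT", wm))
```

Unlike a regular expression, the matrix assigns every 15-mer a graded probability, so a site with one unusual base is penalised rather than rejected outright.
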
The weight matrix representation of regulatory motifs
The quality of an alignment of putative sites, such as the alignment of known fruR binding sites (cTGAAtCG*TTcAg*), can be measured by the information score $I$:
$$f^i_\alpha = \frac{n^i_\alpha}{n}, \qquad b_\alpha = \text{background frequency of base } \alpha, \qquad I = \sum_{i,\alpha} f^i_\alpha \log\left(\frac{f^i_\alpha}{b_\alpha}\right)$$
where $n^i_\alpha$ is the number of sites with base $\alpha$ at position $i$ and $n$ is the total number of sites.
$$\frac{P(\text{sites from a WM})}{P(\text{sites from bg})} \approx e^{nI}$$

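A short sketch of the information score, assuming natural logarithms (so that the likelihood ratio is approximately $e^{nI}$), a uniform background unless one is supplied, and the usual convention that a base with zero count contributes nothing to the sum. The function name is hypothetical.

```python
import math


def information_score(sites, background=None):
    """Sum over columns and bases of f * log(f / b), with f the column frequency."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    n = len(sites)
    score = 0.0
    for col in zip(*sites):
        for base in "ACGT":
            f = col.count(base) / n
            if f > 0:  # 0 * log(0) is taken as 0
                score += f * math.log(f / background[base])
    return score


if __name__ == "__main__":
    # A perfectly conserved column contributes log(4); a uniform column contributes 0.
    print(information_score(["ACGT"] * 5))
    print(information_score(["A", "C", "G", "T"]))
```

A perfectly conserved alignment of width 4 therefore scores 4 log 4, and an alignment whose columns look like background scores 0, which is why higher scores indicate a sharper motif.
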
Ab initio motif discovery with weight matrices
Assume the input set of ‘co-regulated’ sequences is a mixture of “random” background sequence plus a number of samples from a weight matrix.
Unknowns:
1. The weight matrix
2. The number of sites
3. The positions of the sites
MEME approach: Search the space of WMs for the WM that maximizes the likelihood of the data (summing over all possible binding site configurations for each WM). The likelihood is maximized using “Expectation Maximization”.
Gibbs Sampler approach: Search the space of binding site configurations for the configuration that maximizes the likelihood of all sites deriving from a common WM (integrating over all possible WMs) and all other sequence deriving from background. The space of configurations is searched through “Gibbs Sampling”.

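To make the Gibbs sampling idea concrete, here is a deliberately simplified sketch. It is not the formulation described on the slide: it assumes exactly one site of known width in every sequence, a uniform background, and a pseudocount-based matrix estimate, whereas the slide treats the number of sites as unknown. The function name and all parameters are hypothetical, and at least two input sequences are required.

```python
import random


def gibbs_site_sampler(seqs, width, n_sweeps=200, pseudocount=1.0, seed=0):
    """Sample one site start per sequence; returns the final list of starts."""
    rng = random.Random(seed)
    bg = 0.25  # assumed uniform background probability per base
    # Start from a random configuration of site positions.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k, seq in enumerate(seqs):
            # Build a weight matrix from the current sites in all other sequences.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, pos)) if j != k]
            total = len(others) + 4 * pseudocount
            wm = [{b: (col.count(b) + pseudocount) / total for b in "ACGT"}
                  for col in zip(*others)]
            # Score every window of the held-out sequence: WM versus background.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    ratio *= wm[i][seq[start + i]] / bg
                weights.append(ratio)
            # Resample this sequence's site in proportion to the likelihood ratio.
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


if __name__ == "__main__":
    toy = ["TTACGCGTAA", "GGACGCGTCC", "CAACGCGTTG", "ACGCGTGGGA"]  # toy input
    print(gibbs_site_sampler(toy, width=6, n_sweeps=50))
```

Because each update is a random draw rather than a greedy choice, the sampler can leave poor configurations; in practice one would run several restarts and keep the best-scoring configuration.
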