Analysis methods: Identifying gene modules
• Introduction
• Background on regulatory networks
• Data available for analysis
• Analysis methods
  • Identifying gene modules
  • Modeling regulatory elements
  • Predicting binding sites
  • Conservation of regulatory elements
  • Motif discovery
  • Cis-regulatory modules
Smith & Sumazin (CSHL & Columbia) Transcriptional regulatory circuits ISMB’07 18 / 109

Gene modules
What is a gene module?
• Many possible definitions, but let's keep it informal
• Usually a set of genes that function together
• Think: the genes whose regulation you want to understand
• Gene modules might have 10 genes, or 500 genes

Differentially expressed genes
[Figure: expression heat map comparing Condition 1 and Condition 2; rows labeled Ppp2r5e, Sgpp1, Zbtb1, Esr2, Ttc9, Tex21, Mthfd1, Zbtb25, Prkch, Hif1a, Gphb5, Snapc1, Syt16, Dbpht2, Kcnh5, Rhoj, pm1a, Six6os1, Tmem30b, Six6, Six1, Six4, Mnat1, Trmt5]
The context
• Want to understand, for example:
  • Expression in diseased cells
  • Cells from a developmental state
• Get expression from 2 conditions:
  • Before and after some perturbation
  • Samples taken at different time-points
  • Different types of cells
Simplest gene modules
• Genes showing differential expression
• Maybe interested only in genes “over-expressed” or “under-expressed”
• Mann-Whitney U-test

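The Mann-Whitney U-test named on this slide can be sketched for a single gene as follows. This is a minimal illustration, not the presenters' procedure: the expression values are made up, the function name `mann_whitney_u` is hypothetical, and the p-value uses the normal approximation without tie correction, which is rough for samples this small.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    # Pool the values, remembering which group (0 = x, 1 = y) each came from.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # tied values share the mean of ranks i+1..j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # assumption: no tie correction
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p

# Hypothetical expression levels of one gene in replicate samples.
condition_1 = [5.1, 4.8, 6.0, 5.5]
condition_2 = [7.2, 8.1, 6.9, 7.7]
u_stat, p_value = mann_whitney_u(condition_1, condition_2)
```

Running the test once per gene and keeping the genes with small p-values gives the "simplest" module described above; with many genes, a multiple-testing correction would be needed.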
Gene expression profiles
Gene expression matrix
\[
\begin{pmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,m} \\
x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & x_{n,3} & \cdots & x_{n,m}
\end{pmatrix},
\qquad
x_3 = (x_{3,1}, x_{3,2}, x_{3,3}, \ldots, x_{3,m})
\]
• Columns ⇔ experiments
• Rows ⇔ genes
• x_{i,j} ⇔ level of gene i in experiment j
Gene expression profile
• Each gene has a profile: a row of the matrix
• Statistical issues (e.g. normalization) outside current scope
• More experiments means more information in each profile
• Similar expression profiles suggest similar regulation

Using data from multiple experiments
Clustering gene expression profiles
• Get gene modules based on expression from multiple experiments
• Cluster genes with similar or correlated expression profiles
• Any clustering algorithm can be used (e.g. k-means, hierarchical)
• Best algorithm depends on data and analysis goals
Measuring profile similarity
• Examples: correlation, Euclidean distance, mutual information
• Again, best measure depends on data and analysis goals

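Two of the similarity measures listed above can be sketched on toy profiles. The gene names and values are hypothetical; the point is only that correlation and Euclidean distance can rank the same pairs differently, which is one reason the best measure depends on the data and goals.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles: one row of the expression matrix per gene.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same shape as geneA, doubled
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend to geneA
}

corr_ab = pearson(profiles["geneA"], profiles["geneB"])
corr_ac = pearson(profiles["geneA"], profiles["geneC"])
dist_ab = euclidean(profiles["geneA"], profiles["geneB"])
dist_ac = euclidean(profiles["geneA"], profiles["geneC"])
```

Here geneA and geneB are perfectly correlated yet farther apart in Euclidean distance than geneA and the anti-correlated geneC.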
Inferring influence networks
Obtaining the direction of a relationship
• Clusters suggest association, but not causation
• More interesting: infer which genes are regulators and which are targets
• Need sophisticated tools and the right kind/amount of data
• Examples of methods: Bayesian networks, ARACNE
How to use influence networks
• Influence networks can provide a framework
• Connections can be annotated with direct information

Analysis methods: Modeling regulatory elements

Modeling binding sites
[Figure: a transcription factor bound at a binding site within the sequence ACGTGACACAATTGGCATACGATCTACGTACAA]
Binding sites
• Genomic sequences recognized and bound by binding domains of TFs
• Binding sites for same TF might be different from each other
• Often 8-12 bp, but examples can be found from 5 bp to ~30 bp

Modeling binding sites
[Figure: ten genomic sequences, each with a highlighted binding site; the highlighted sites are similar but not identical]
What is a motif?
• Motifs are how we model the set of binding sites for a TF
• Should describe information important for binding
• Motifs ≠ binding sites

Consensus sequence representation
Alignment of binding sites:
  GCCATCTGT
  GCCATCCGC
  GCCATCTTG
  GCCATGTAC
  GCCATATTT
  GCCATCTTT
  GACATTTTG
  TCCATTTTG
  TCTAGGTTT
  GCTCCATTT
  TCCATGGTT
  GCCATCTTG
  GCCATTTTG
  GCCATCTTG
  ACCATGTCA
  TCCATGTGT
  GCCATCACA
Consensus sequence: GCCATCTTG
Consensus sequences
• Pros: Easy to understand, easy to manipulate computationally
• Cons: Does not express all important information

Consensus sequence representation
Degenerate nucleotides
  M ⇒ A or C        V ⇒ A, C or G
  R ⇒ A or G        H ⇒ A, C or T
  W ⇒ A or T        D ⇒ A, G or T
  S ⇒ C or G        B ⇒ C, G or T
  Y ⇒ C or T        N ⇒ A, C, G or T
  K ⇒ G or T
Degenerate consensus (for the same 17-site alignment as on the previous slide): DMYMBNNNN
Degenerate consensus sequences
• IUPAC degenerate nucleotide codes
• Provides more flexible representation, but usually not enough

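The degenerate consensus can be derived mechanically from the alignment: collect the set of bases seen in each column and look up its IUPAC code. The sketch below is illustrative; the names `IUPAC`, `SITES`, `degenerate_consensus` and `matches` are hypothetical, and the sites are the 17 aligned binding sites from these slides.

```python
# Map each set of observed bases to its IUPAC degenerate nucleotide code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}

SITES = [
    "GCCATCTGT", "GCCATCCGC", "GCCATCTTG", "GCCATGTAC", "GCCATATTT",
    "GCCATCTTT", "GACATTTTG", "TCCATTTTG", "TCTAGGTTT", "GCTCCATTT",
    "TCCATGGTT", "GCCATCTTG", "GCCATTTTG", "GCCATCTTG", "ACCATGTCA",
    "TCCATGTGT", "GCCATCACA",
]

def degenerate_consensus(sites):
    """One IUPAC code per alignment column, covering every base seen there."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*sites))

def matches(pattern, seq):
    """True if seq is compatible with the degenerate pattern at every position."""
    code_to_bases = {code: bases for bases, code in IUPAC.items()}
    return len(pattern) == len(seq) and all(
        base in code_to_bases[code] for code, base in zip(pattern, seq)
    )

consensus = degenerate_consensus(SITES)
```

Because a single rare base is enough to widen a column's code, the last four positions collapse to N, which illustrates why this representation is "usually not enough".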
Matrix-based representation
Count matrix for the 17 aligned binding sites:
       1    2    3    4    5    6    7    8    9
  A    1    1    0   16    0    2    1    1    2
  C    0   16   15    1    1    7    1    2    2
  G   12    0    0    0    1    5    1    3    6
  T    4    0    2    0   15    3   14   11    7
What is the matrix representation?
• Matrix columns correspond to positions in sites
• Matrix rows correspond to nucleotides
• Entries correspond to base counts at each position of the sites
• Assumptions: independent positions, fixed width, no gaps

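Building the count matrix from an alignment is a matter of tallying bases column by column. The following sketch uses the 17 aligned sites from these slides; the dictionary-of-rows layout and the name `count_matrix` are illustrative choices, not something the slides prescribe.

```python
SITES = [
    "GCCATCTGT", "GCCATCCGC", "GCCATCTTG", "GCCATGTAC", "GCCATATTT",
    "GCCATCTTT", "GACATTTTG", "TCCATTTTG", "TCTAGGTTT", "GCTCCATTT",
    "TCCATGGTT", "GCCATCTTG", "GCCATTTTG", "GCCATCTTG", "ACCATGTCA",
    "TCCATGTGT", "GCCATCACA",
]

def count_matrix(sites):
    """Return {base: [count at position 1, ..., count at position w]}."""
    width = len(sites[0])  # assumes fixed width and no gaps
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

counts = count_matrix(SITES)
```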
Matrix-based representation
Counts
       1     2     3     4     5     6     7     8     9
  A    1     1     0    16     0     2     1     1     2
  C    0    16    15     1     1     7     1     2     2
  G   12     0     0     0     1     5     1     3     6
  T    4     0     2     0    15     3    14    11     7
Probabilities (normalized counts)
       1     2     3     4     5     6     7     8     9
  A  0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
  C  0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
  G  0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
  T  0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41
Different kinds of matrices
• Probability matrix: columns are position-specific nucleotide distributions
• Many names: position-weight matrix (PWM), position-frequency matrix (PFM), profile, alignment matrix, etc.
• We use PWM to refer to both count and probability matrices
• Only 3 different kinds of matrices: count, probability, and scoring (we will see a scoring matrix later)

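Going from counts to probabilities means dividing each entry by its column total (17 here). A minimal sketch, with `COUNTS` copied from the slide and `to_probabilities` a hypothetical helper name:

```python
COUNTS = {
    "A": [1, 1, 0, 16, 0, 2, 1, 1, 2],
    "C": [0, 16, 15, 1, 1, 7, 1, 2, 2],
    "G": [12, 0, 0, 0, 1, 5, 1, 3, 6],
    "T": [4, 0, 2, 0, 15, 3, 14, 11, 7],
}

def to_probabilities(counts):
    """Normalize each column of a count matrix so that it sums to 1."""
    width = len(counts["A"])
    totals = [sum(counts[b][j] for b in "ACGT") for j in range(width)]
    return {b: [counts[b][j] / totals[j] for j in range(width)] for b in "ACGT"}

probs = to_probabilities(COUNTS)
```

For example, G at position 1 becomes 12/17 ≈ 0.71 and T at position 1 becomes 4/17 ≈ 0.24, matching the rounded table.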
Sequence Logos
[Figure: sequence logo drawn from the count matrix, with each base's letter height proportional to its count at that position]
       1    2    3    4    5    6    7    8    9
  A    1    1    0   16    0    2    1    1    2
  C    0   16   15    1    1    7    1    2    2
  G   12    0    0    0    1    5    1    3    6
  T    4    0    2    0   15    3   14   11    7
• Cartoon depiction of a motif
• Size of base is proportional to frequency in matrix
• Sometimes sizes are scaled by “information content” (not covered)

Sequence Logos (continued)
[Figure: sequence logo of the same motif from weblogo.berkeley.edu; vertical axis in bits (0 to 2), horizontal axis positions 1 to 9, 5′ to 3′]

Resources
Motif Databases
• JASPAR (free) and TRANSFAC (BIOBASE)
• Hundreds of known motifs and binding sites
• Essential resources for regulatory sequence analysis

Analysis methods: Predicting binding sites

Probability from a motif
       1     2     3     4     5     6     7     8     9
  A  0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
  C  0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
  G  0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
  T  0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41
Sequence: T C T A T G T T T
0.24 × 0.94 × 0.12 × 0.94 × 0.88 × 0.29 × 0.82 × 0.65 × 0.41 = 0.001419188
• Possible to compute probability of a sequence from a motif
• Multiply values corresponding to nucleotide at each position
• This works because we assume positions are independent
• In the example Pr(TCTATGTTT) = 0.001419188
• ... but does that mean anything?

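The multiplication on this slide can be written as a short function. The sketch uses the rounded probability matrix exactly as printed; `PROBS` and `sequence_probability` are illustrative names.

```python
PROBS = {
    "A": [0.06, 0.06, 0.00, 0.94, 0.00, 0.12, 0.06, 0.06, 0.12],
    "C": [0.00, 0.94, 0.88, 0.06, 0.06, 0.41, 0.06, 0.12, 0.12],
    "G": [0.71, 0.00, 0.00, 0.00, 0.06, 0.29, 0.06, 0.18, 0.35],
    "T": [0.24, 0.00, 0.12, 0.00, 0.88, 0.18, 0.82, 0.65, 0.41],
}

def sequence_probability(probs, seq):
    """Probability of seq under the motif, assuming independent positions."""
    p = 1.0
    for pos, base in enumerate(seq):
        p *= probs[base][pos]
    return p

p_motif = sequence_probability(PROBS, "TCTATGTTT")
```

Note that any sequence using a base with probability 0.00 at some position gets probability exactly zero, which is one practical reason pseudocounts are introduced before scoring.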
Likelihood from motif vs base composition
Base composition (the same at every position): A 0.2, C 0.3, G 0.3, T 0.2
Sequence: T C T A T G T T T
0.2 × 0.3 × 0.2 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 × 0.2 = 0.000001152
• Likelihood from motif was ≈ 0.00142
• Assume each position sampled independently from base frequencies
• Ratio of the likelihoods: 0.001419/0.000001152 ≈ 1232
• Match-score: obtained by taking log of this ratio
• Positive match-score ⇒ sequence more likely from motif

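The likelihood ratio and match-score for the example sequence can be checked numerically. Multiplying out the base-composition product gives 0.2^7 × 0.3^2 = 1.152 × 10^-6. The sketch assumes a base-2 logarithm, which is consistent with the scoring-matrix example log(0.94/0.30) = 1.6 in these slides; the slides themselves only say "log".

```python
import math

BACKGROUND = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def background_probability(seq, background=BACKGROUND):
    """Probability of seq when every position is drawn from base composition."""
    p = 1.0
    for base in seq:
        p *= background[base]
    return p

p_motif = 0.001419188  # Pr(TCTATGTTT) under the motif, as quoted on the slides
p_background = background_probability("TCTATGTTT")

ratio = p_motif / p_background
match_score = math.log2(ratio)  # assumption: log base 2
```

A positive `match_score` means the sequence is more likely under the motif than under base composition alone.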
Making a scoring matrix
Probabilities from motif
       1     2     3     4     5     6     7     8     9
  A  0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
  C  0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
  G  0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
  T  0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41
Base composition: A 0.20, C 0.30, G 0.30, T 0.20
Each entry is a log-ratio; for example, C at position 2:
\[
\log_2 \frac{\text{probability from motif}}{\text{probability from base composition}}
= \log_2 \frac{0.94}{0.30} \approx 1.6
\]
Scoring matrix
       1     2     3     4     5     6     7     8     9
  A  -1.6  -1.6  -4.2   2.2  -4.2  -0.7  -1.6  -1.6  -0.7
  C  -4.2   1.6   1.5  -2.1  -2.1   0.4  -2.1  -1.2  -1.2
  G   1.2  -4.2  -4.2  -4.2  -2.1   0.0  -2.1  -0.7   0.2
  T   0.2  -4.2  -0.7  -4.2   2.1  -0.2   2.0   1.6   1.0

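The scoring matrix has finite entries (-4.2) where the probability matrix shows 0.00, so some pseudocount must have been added before taking logs. The slides do not say which one. The sketch below assumes a single pseudocount per column, split among the bases according to base composition, and a base-2 logarithm; that choice appears to reproduce the rounded entries, but it is an inference, not the presenters' stated method.

```python
import math

COUNTS = {
    "A": [1, 1, 0, 16, 0, 2, 1, 1, 2],
    "C": [0, 16, 15, 1, 1, 7, 1, 2, 2],
    "G": [12, 0, 0, 0, 1, 5, 1, 3, 6],
    "T": [4, 0, 2, 0, 15, 3, 14, 11, 7],
}
BACKGROUND = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def scoring_matrix(counts, background, pseudocount=1.0):
    """Log2 ratio of pseudocounted motif probability to background probability."""
    width = len(counts["A"])
    scores = {}
    for base in "ACGT":
        row = []
        for j in range(width):
            total = sum(counts[b][j] for b in "ACGT")
            # Assumed scheme: the pseudocount is shared out by base composition.
            prob = (counts[base][j] + pseudocount * background[base]) / (
                total + pseudocount
            )
            row.append(math.log2(prob / background[base]))
        scores[base] = row
    return scores

scores = scoring_matrix(COUNTS, BACKGROUND)
```

Changing `pseudocount` or `BACKGROUND` shifts every entry, most visibly the strongly negative ones for unobserved bases.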
Scanning a sequence
Scoring matrix
       1     2     3     4     5     6     7     8     9
  A  -1.6  -1.6  -4.2   2.2  -4.2  -0.7  -1.6  -1.6  -0.7
  C  -4.2   1.6   1.5  -2.1  -2.1   0.4  -2.1  -1.2  -1.2
  G   1.2  -4.2  -4.2  -4.2  -2.1   0.0  -2.1  -0.7   0.2
  T   0.2  -4.2  -0.7  -4.2   2.1  -0.2   2.0   1.6   1.0
Sequence: AGTATCACTCTATGTTTGTTGCACA
Window TCTATGTTT: 0.2 + 1.6 - 0.7 + 2.2 + 2.1 + 0.0 + 2.0 + 1.6 + 1.0 = 10
Basic steps
• Slide matrix along sequence
• Calculate score at each position
• Keep scores that meet some criterion (e.g. above a cutoff)

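The three basic steps translate directly into a loop over windows. The sketch uses the rounded scoring matrix and the example sequence from this slide; the cutoff of 5.0 and the name `scan` are arbitrary illustrative choices.

```python
SCORES = {
    "A": [-1.6, -1.6, -4.2, 2.2, -4.2, -0.7, -1.6, -1.6, -0.7],
    "C": [-4.2, 1.6, 1.5, -2.1, -2.1, 0.4, -2.1, -1.2, -1.2],
    "G": [1.2, -4.2, -4.2, -4.2, -2.1, 0.0, -2.1, -0.7, 0.2],
    "T": [0.2, -4.2, -0.7, -4.2, 2.1, -0.2, 2.0, 1.6, 1.0],
}

def scan(scores, seq, cutoff):
    """Slide the matrix along seq; return (start, window, score) for each hit."""
    width = len(scores["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(scores[base][pos] for pos, base in enumerate(window))
        if score >= cutoff:  # hypothetical criterion: simple score cutoff
            hits.append((start, window, score))
    return hits

hits = scan(SCORES, "AGTATCACTCTATGTTTGTTGCACA", cutoff=5.0)
```

With this cutoff the only hit is the window TCTATGTTT at 0-based offset 8. A real scan would usually also score the reverse complement strand, which this sketch omits.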
Remarks
About scoring matrices
• Match-scores are sensitive to the base composition assumed
• Also sensitive to pseudocount
• Several algorithms exist for calculating scores fast
• Statistical significance of matches can be measured multiple ways
About predicted sites
• Provide mechanistic link between regulator and target in networks
• High false positive rate: match-scores only tell part of the story
• Should be combined with cross-species conservation (more later)

What does enrichment mean?
[Figure: two sets of sequences compared side by side ("VS.") for each property]
Three desirable properties
1. More total occurrences
2. Stronger occurrences (i.e. higher scoring)
3. More sequences containing an occurrence
But different assumptions are valid for different TFs/contexts

Enrichment based on likelihood
  TCM: Two Component Mixture (any number per sequence)
  OOPS: One Occurrence Per Sequence
  ZOOPS: Zero Or One Occurrence Per Sequence
• Mixture models: rigorous statistical foundation for enrichment
• These models capture the 3 aspects of enrichment: each sequence is a mixture of sites and non-sites
• Likelihoods calculated for entire set of sequences
• Necessary calculations closely related to match-scores

Using a set of background sequences
[Figure: foreground sequences with occurrences of a yellow motif and a blue motif marked]
Which motif is more enriched?
• Yellow motif occurs many times
• Blue motif also occurs many times (and in consistent location)
• Both may appear enriched

Using a set of background sequences
[Figure: the foreground sequences shown next to a set of background sequences]
Why use a background set?
• Statistical models of “random” promoters don’t work
• Using a background can control many unknown variables
• Different backgrounds can be used to examine different questions

Selecting background sequences
Examples of desirable properties
• Similar to foreground in terms of primary sequence features (e.g. GC-content, CpG-content)
• Uniform length sequences (both FG and BG) can facilitate statistics
• Share similar biological properties (e.g. compare promoters to other promoters)
Common mistakes
• Comparing promoters to exons (very bad)
• Comparing CpG-related promoters to non-CpG-related promoters
• Having different repeat composition in background
• Comparing sequences between species
• Using too few sequences (results in over-fitting)

Identifying enriched motifs
Why identify enriched motifs?
• Identify motifs that are important regulators of a gene module
• Obtain more information for connections in networks
• Identify candidates for site prediction
Significance of motif enrichment
• Enrichment scores more useful if p-values can be obtained
• Empirical p-values can be obtained in multiple ways: shuffle sequences, permute sequence labels, permute matrix columns
• Correct for multiple testing if evaluating enrichment of multiple motifs

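One of the options listed, permuting sequence labels, can be sketched as follows. This is an illustration under simplifying assumptions: the motif is a plain word rather than a matrix, the enrichment statistic (difference in mean occurrences per sequence) is a hypothetical stand-in for a real enrichment score, and the sequences are toy data.

```python
import random

def count_occurrences(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def enrichment(foreground, background, word):
    """Mean occurrences per foreground sequence minus the background mean."""
    fg_mean = sum(count_occurrences(s, word) for s in foreground) / len(foreground)
    bg_mean = sum(count_occurrences(s, word) for s in background) / len(background)
    return fg_mean - bg_mean

def permutation_p_value(foreground, background, word, n_perm=1000, seed=0):
    """Empirical p-value: how often shuffled labels do at least as well."""
    rng = random.Random(seed)
    observed = enrichment(foreground, background, word)
    pooled = list(foreground) + list(background)
    n_fg = len(foreground)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign foreground/background labels at random
        if enrichment(pooled[:n_fg], pooled[n_fg:], word) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)  # add-one keeps the estimate above zero

# Toy data: the word GCCAT occurs only in the foreground.
fg = ["ACGCCATGT", "TTGCCATAA", "GCCATGCCAT", "AAGCCATCC"]
bg = ["ACGTACGTAC", "TTTTAAAACC", "GGGACCCTTA", "CATCATCATC"]
p_value = permutation_p_value(fg, bg, "GCCAT")
```

When several motifs are tested this way, the resulting p-values still need a multiple-testing correction.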
Analysis methods: Conservation of regulatory elements

Cross-species conservation
[Figure: UCSC Genome Browser view of chr19:50518000-50519000 at the CKM gene, with UCSC Known Genes, RefSeq Genes, and Vertebrate Multiz Alignment & Conservation (17 Species) tracks; species rows: mouse, rat, rabbit, dog, armadillo, elephant, opossum, chicken, x_tropicalis, tetraodon]
Why do we use it?
• Negative selection: things that are important will be conserved
• Helps distinguish functional from non-functional sites

How to use conservation
Conserved regions
• Search in pre-defined regions
• e.g. ultraconserved regions
[Figure: conserved regions along a sequence, with conserved sites and non-conserved sites marked]
Conservation profile
• Assign conservation score to each individual base
• e.g. phastCons scores
[Figure: conservation profile along the same sequence]
Use alignments directly
• Much information in alignments
• Requires more complex methods
[Figure: Vertebrate Multiz Alignment & PhastCons Conservation (28 Species) track showing aligned bases for Human, Chimp, Rhesus, Bushbaby, TreeShrew, Mouse, Rat, GuineaPig, Rabbit, Shrew, Hedgehog, Dog, Cat, Horse, Cow, Armadillo, Elephant, Tenrec, Opossum, and Platypus]

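The conservation-profile approach can be combined with predicted sites by averaging the per-base scores under each site and keeping the well-conserved ones. The sketch below is a simple illustration, not a published method: the profile values are invented, scores are assumed to lie between 0 and 1 (as phastCons scores do), and the threshold of 0.8 is arbitrary.

```python
def mean_conservation(profile, start, width):
    """Average per-base conservation score over one predicted site."""
    window = profile[start:start + width]
    return sum(window) / len(window)

def conserved_sites(sites, profile, min_mean=0.8):
    """Keep the (start, width) sites whose mean conservation passes the cutoff."""
    return [
        (start, width)
        for start, width in sites
        if mean_conservation(profile, start, width) >= min_mean
    ]

# Hypothetical per-base scores for a 20 bp region: a conserved 9 bp block
# flanked by poorly conserved sequence.
profile = [0.1] * 5 + [0.9] * 9 + [0.2] * 6

# Hypothetical predicted sites as (0-based start, width).
predicted = [(0, 9), (5, 9)]
kept = conserved_sites(predicted, profile)
```

Only the site that sits on the conserved block survives the filter, which is how conservation can help separate functional from non-functional predictions.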
Turnover and non-alignment methods
[Figure: phylogenetic tree of Human, Chimp, Mouse, Rat, Dog, Cow, and Frog showing present-day and ancestral binding sites]
• Turnover: functionally analogous sites (in different species) that do not align
• Sites presumed to evolve under similar evolutionary constraints
• Importance of turnover still not clear, but some evidence exists
• Non-alignment methods in general less useful for predicting sites
• Can indicate important motifs (cross-species enrichment)

Things to consider
Which alignments to use
• Precomputed alignments: multiz17way (recently 28-way), mlagan
• Creating your own alignments raises many issues
Species to use
• Understand the network being investigated
• Make sure protein and function conserved in species compared
• Accounts for compensatory substitutions in sites

Analysis methods: Motif discovery

Motif discovery
What is motif discovery?
• Start with just sequences
• Identify strongly enriched motifs de novo
• Algorithmically one of the most challenging analysis tasks
• Use it when you suspect important unknown motifs in your data
Motif discovery methods
• Can be classified by motif representation
  • Word-based representation
  • Matrix-based representation
• Also by algorithmic strategy
  • Discrete optimization
  • General statistical algorithms

Motif discovery by word counting
For each word of width k:
• Count number of occurrences
• Apply statistics to counts
[Figure: table of words and their occurrences (5-mers from AAAAA to TTTTT with counts); the current word, GAGTC, is being counted across a set of sequences]

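The counting step can be sketched with a dictionary of words. The two sequences below are short toy fragments chosen so that the word GAGTC occurs once in each; the statistics applied to the counts afterwards are not shown, and `count_words` is an illustrative name.

```python
from collections import Counter

def count_words(sequences, k):
    """Count every word of width k (overlapping) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy sequence fragments for illustration only.
sequences = [
    "AAGTCTACATGGAGTCGATGG",
    "CAGCAGAGTCACACGCGA",
]
word_counts = count_words(sequences, k=5)
gagtc_count = word_counts["GAGTC"]
```

In practice the counts in the sequences of interest would be compared against counts in a background set before calling any word enriched.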
Gibbs Sampling
Start with a given motif and a set of occurrences
Occurrences (one per sequence):
  GCCATCTTT
  GACATTTTG
  TCCATTTTG
  TCTAGGTTT
  GCTCCATTT
  TCCATGGTT
  GCCATCTTG
  GCCATTTTG
  GCCATCTTG
  ACCATGTCA
  GCCATGACA
  TCCATGTGT
Count matrix:
       1    2    3    4    5    6    7    8    9
  A    1    1    0   11    0    1    1    0    2
  C    0   11   10    1    1    3    0    2    0
  G    7    0    0    0    1    5    1    1    5
  T    4    0    2    0   10    3   10    9    5
[Figure: the set of sequences with the current occurrence marked in each]

Gibbs Sampling (continued)
Iterate these steps:
1) Sample a new occurrence from one sequence
• Probability of selecting a particular site is related to the strength of its match to the matrix
[Figure: the count matrix and the set of sequences, with a newly sampled occurrence (GCCATCTTT) marked in one sequence]

Gibbs Sampling (continued)
2) Update the matrix based on new occurrence
• Usually the changes will move the matrix toward a stronger motif
[Figure: the count matrix with the entries changed by the new occurrence overwritten, next to the set of sequences]

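The sample-and-update loop can be sketched as a small site sampler. This is a simplified variant written for illustration, not the Gibbs Motif Sampler itself: it assumes exactly one occurrence per sequence, starts from random positions rather than a given motif, rebuilds the matrix from the other sequences' occurrences at each step, and uses a uniform background and a pseudocount of 0.5. All names and the toy sequences are hypothetical.

```python
import random

def build_probs(sites, pseudocount=0.5):
    """Position-specific base probabilities from the current occurrences."""
    width = len(sites[0])
    probs = []
    for j in range(width):
        column = [site[j] for site in sites]
        total = len(column) + 4 * pseudocount
        probs.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return probs

def window_weight(probs, window, background=0.25):
    """Likelihood ratio of a window: motif versus uniform base composition."""
    weight = 1.0
    for j, base in enumerate(window):
        weight *= probs[j][base] / background
    return weight

def gibbs_sample(sequences, width, n_iter=200, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(n_iter):
        i = rng.randrange(len(sequences))
        # Matrix built from the occurrences in all sequences except sequence i.
        others = [
            s[p:p + width]
            for k, (s, p) in enumerate(zip(sequences, starts))
            if k != i
        ]
        probs = build_probs(others)
        seq = sequences[i]
        weights = [
            window_weight(probs, seq[p:p + width])
            for p in range(len(seq) - width + 1)
        ]
        # Step 1: sample a new occurrence with probability tied to match strength.
        # Step 2 (the matrix update) happens when the matrix is rebuilt next round.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy sequences, each containing a site resembling the motif in these slides.
sequences = [
    "AAGTCTACATGCCATCTTGAA",
    "TTGACGCCATTTTGCAGTCA",
    "CAGTGCCATCTTGACGTACG",
    "ACGTTCCATGTGTACCAGTT",
]
starts = gibbs_sample(sequences, width=9)
```

Because the procedure is stochastic, different seeds and starting positions can converge to different motifs, so several runs are normally compared.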
Other techniques
Expectation Maximization (EM)
• Instead of sampling sites with particular probability:
  • All possible sites contribute to the matrix
  • Contribution of each site related to probability (score)
• Iterate through motifs instead of sites
• Like deterministic version of Gibbs: no random choices after setting the starting point
Variants of EM or Gibbs
• Gibbs Motif Sampler (Lawrence et al., 1993)
• MEME (Bailey & Elkan, 1995)
• AlignACE (Hughes et al., 2000)
• MDscan (Liu et al., 2002)
Good starting points are critical for Gibbs and EM

Things to consider
Current status
• Field starting to mature: many great algorithms exist!
• Probably none will be “perfect” for your application
• Try several algorithms, understand what they do
How to improve
• Combine best aspects of different algorithms
• Incorporate more biological knowledge
DME: Discriminating Motif Enumerator
• Enumerative search strategy, matrix-based motifs
• Smith, Sumazin & Zhang (PNAS, 2005)

Cis-regulatory modules: What is a cis-regulatory module?

The IFNβ Enhancer
• Figure from Maniatis et al. (CSHL Symposium 1998)
• Critical property: sites that work together tend to cluster

Cis-regulatory modules: What is a cis-regulatory module?

Sea Urchin Endo16 promoter
• Figure from Yuh et al. (2001)
• Promoter logic: CRMs are autonomous units encoding regulation

Cis-regulatory modules: Identifying cis-regulatory modules

PReMod (Blanchette et al., 2006)
[Figure: UCSC Genome Browser view of chr2:236,860,000-236,885,000 around GBX2 and ASB18, with the PReMod Predicted Regulatory Modules track shown above the gene, mRNA, EST, conservation, SNP and RepeatMasker tracks. Callouts on the predicted module: occurrences tightly clustered; far from gene; strong occurrences of known motifs; highly conserved region.]

Cis-regulatory modules: Motif modules

What are they?
• A set of motifs for sites that frequently work together
• CRMs are the occurrences of motif modules
• Often can predict expression better than individual motifs
• Simplest kind: pair of sites for dimerizing TFs

Interesting properties
• Relative order: some motifs must be beside each other
• Total span and spacing of sites can be restricted
• Relative orientation sometimes important
• Weaker individual sites: combined affinity is important

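The order, spacing and orientation properties above can be illustrated with a small sketch for the simplest module, a pair of motifs. It assumes site predictions are already available as intervals; the `MotifSite` record, the function name and the `max_gap` and `same_strand` parameters are hypothetical and not taken from any tool mentioned in the slides.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MotifSite:
    motif: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"

def module_occurrences(sites, first, second, max_gap, same_strand=None):
    """Return (a, b) pairs where a `first` site lies upstream of a `second` site.

    max_gap restricts the spacing between the two sites; same_strand, when
    not None, requires the pair to agree (True) or disagree (False) in strand.
    """
    firsts = [s for s in sites if s.motif == first]
    seconds = [s for s in sites if s.motif == second]
    hits = []
    for a, b in product(firsts, seconds):
        gap = b.start - a.end
        if gap < 0 or gap > max_gap:      # wrong order, overlap, or too far
            continue
        if same_strand is not None and (a.strand == b.strand) != same_strand:
            continue                      # orientation constraint violated
        hits.append((a, b))
    return hits
```

A combined-affinity variant would add the two site scores and threshold the sum, so that two individually weak sites can still qualify as a module occurrence.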
Cis-regulatory modules: Discovering motif modules

Library based
• Given a library of motifs, construct modular motifs
• Many known motifs have important interactions

De novo discovery
• Discover modular motifs from sequence alone
• Currently no generally practical methods
• Anchoring strategy: almost de novo, and can be useful
• CisModule: one of the most sophisticated algorithms

Part II: Worked Examples

Overview

Analyzing sets of co-regulated genes
• An example gene module
• Identifying enriched known motifs
• Predicting functional binding sites

Analysis of transcription factor localization data
• ChIP-chip data examples
• Identifying enriched known motifs
• Identifying co-factors
• Discovering motifs de novo

Example gene module

LPS responsive genes
• Bacterial LPS (lipopolysaccharide) stimulates B-cell activation, proliferation, and differentiation
• Gene module compiled through individual experiments
• Ramirez-Carrozzi et al. (Genes & Dev, 2006): Selective and antagonistic functions of SWI/SNF and Mi-2β nucleosome remodeling complexes during an inflammatory response

Properties of the gene module
• The gene module comprises 35 genes
• Some are TFs (e.g. Irf1, Irf7, Junb, Fos, Nfkbiz, Egr1, Zfp369)
• Several known binding sites in promoters of these genes (e.g. IFNβ enhancer)

An example gene module: Analysis tasks

• Identify enriched known motifs
• Use known motifs to predict functional binding sites

An example gene module: Obtaining promoter sequences

Promoter databases
• Examples: EPD, DBTSS, CSHLmpd
• Use when promoter choice really matters (e.g. small data sets, many alternative promoters)

UCSC Table Browser to get promoters
• Start with set of RefSeq IDs for genes in module
• Select the appropriate table (refGene for mm8)
• Upload the RefSeq IDs
• Select sequence output format
• Select “upstream by 1000bp”

Identifying enriched known motifs: The motifclass program

How it evaluates enrichment
• Compares set of foreground sequences to background sequences
• For a given motif, each sequence is assigned a score
• The score is the maximum match-score of any site in the sequence
• The scores are used to classify foreground and background sequences
• Sequences with higher scores are classified as foreground
• Better classification ability means greater enrichment
• p-values obtained by randomly permuting sequence labels

[Figure: foreground sequences compared with background sequences]

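The evaluation scheme described for motifclass can be sketched roughly as follows. This is not the motifclass source code: the log-odds scoring against a uniform background, the balanced error used as the enrichment measure, and all function names are assumptions made for illustration. Sequences are assumed to contain only A, C, G and T.

```python
import math
import random

def max_match_score(seq, pwm, background=0.25):
    """Score a sequence by its best-scoring window (log-odds, base 2)."""
    width = len(pwm)
    return max(
        sum(math.log2(pwm[j][seq[i + j]] / background) for j in range(width))
        for i in range(len(seq) - width + 1)
    )

def balanced_error(fg_scores, bg_scores, cutoff):
    """Mean of the two misclassification rates; scores >= cutoff mean foreground."""
    fg_missed = sum(s < cutoff for s in fg_scores) / len(fg_scores)
    bg_called = sum(s >= cutoff for s in bg_scores) / len(bg_scores)
    return (fg_missed + bg_called) / 2

def best_error(fg_scores, bg_scores):
    """Lowest error over all cutoffs that coincide with an observed score."""
    cutoffs = set(list(fg_scores) + list(bg_scores))
    return min(balanced_error(fg_scores, bg_scores, c) for c in cutoffs)

def permutation_p_value(fg_scores, bg_scores, shuffles=1000, seed=0):
    """Fraction of label shuffles that classify at least as well as observed."""
    rng = random.Random(seed)
    observed = best_error(fg_scores, bg_scores)
    pooled = list(fg_scores) + list(bg_scores)
    n_fg = len(fg_scores)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        if best_error(pooled[:n_fg], pooled[n_fg:]) <= observed:
            hits += 1
    return hits / shuffles
```

Shuffling the scores between the two groups is equivalent to permuting the sequence labels, because each sequence's score does not depend on its label.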
Identifying enriched known motifs: Using motifclass to evaluate motif enrichment

Sequence files
• Foreground: the 35 proximal promoters
• Background: 1000 random mm8 RefSeq promoters
• Promoter sequences taken -1000 to -1 relative to the TSS
• Sequences given in FASTA format

Motif library
• Known motifs from the JASPAR database
• Total of 123 motifs (some redundancy)
• Motifs must be converted into CREAD motif format

Identifying enriched known motifs: The CREAD motif file format

AC: the accession
• Identifier for each motif
• Best to keep them unique

TY: the type of pattern
• Type of this pattern is “Motif”
• Just to tell programs what they are looking at

The matrix lines
• This is the actual PWM
• Transposed: one line per column
• Either counts or probabilities

AT: the attributes
• Annotate motifs with additional information
• Attribute=value pairs
• Usually optional
• Some programs require certain attributes

BS: the binding site lines
• To store sites for each motif
• More details on this later

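Since the matrix lines may hold either counts or probabilities, a conversion between the two is often useful. The sketch below works on an in-memory list with one entry per motif column, mirroring the transposed layout; it does not parse or write the CREAD file syntax, and the A, C, G, T ordering within each entry is an assumption.

```python
def counts_to_probabilities(columns, pseudocount=0.0):
    """Convert per-column base counts into per-column probabilities.

    columns: list of [A, C, G, T] counts, one entry per motif position
    (base order assumed). A positive pseudocount avoids zero probabilities.
    """
    probabilities = []
    for counts in columns:
        total = sum(counts) + 4 * pseudocount
        probabilities.append([(c + pseudocount) / total for c in counts])
    return probabilities
```

For example, a column with counts `[8, 0, 0, 2]` becomes probabilities of 0.8 for A and 0.2 for T when no pseudocount is used.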
Identifying enriched known motifs: Running motifclass on LPS-responsive promoters

• -r: use relative error as enrichment measure
• -O: find the score cutoff optimizing that enrichment
• -P 1000: report a p-value for each motif using 1000 shuffles
• -v: print progress information while running

Identifying enriched known motifs: What the output looks like

Attributes from motifclass
• Relative error rate
• Sensitivity and specificity
• Optimal score cutoff (functional depth and threshold)
• p-value and rank (in set of motifs)

Identifying enriched known motifs: Interpreting the results

Rank  Name        Sn     Sp     Error  p-value
1     NFKB1       0.743  0.603  0.327  0
2     RELA        0.686  0.655  0.33   0
3     NF-kappaB   0.4    0.9    0.35   0.002
4     Dorsal 1    0.886  0.413  0.351  0
5     REL         0.314  0.956  0.365  0.008
6     En1         0.686  0.584  0.365  0.009
7     IRF2        0.371  0.872  0.378  0.015
8     TBP         0.371  0.867  0.381  0.018
9     Dorsal 2    0.429  0.798  0.387  0.032
10    ZNF42 5-13  0.629  0.59   0.391  0.03

[Logo column: sequence logo for each motif]

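The Error column appears to equal one minus the mean of sensitivity and specificity. That relationship is inferred here from the tabulated values and is not stated in the slides; the short check below recomputes it for every row and compares against the reported figure to within rounding.

```python
# (name, Sn, Sp, reported error) copied from the results table.
ROWS = [
    ("NFKB1", 0.743, 0.603, 0.327),
    ("RELA", 0.686, 0.655, 0.33),
    ("NF-kappaB", 0.4, 0.9, 0.35),
    ("Dorsal 1", 0.886, 0.413, 0.351),
    ("REL", 0.314, 0.956, 0.365),
    ("En1", 0.686, 0.584, 0.365),
    ("IRF2", 0.371, 0.872, 0.378),
    ("TBP", 0.371, 0.867, 0.381),
    ("Dorsal 2", 0.429, 0.798, 0.387),
    ("ZNF42 5-13", 0.629, 0.59, 0.391),
]

def relative_error(sensitivity, specificity):
    """Assumed definition: one minus the mean of Sn and Sp."""
    return 1.0 - (sensitivity + specificity) / 2.0

def max_discrepancy(rows):
    """Largest absolute gap between the recomputed and reported error."""
    return max(abs(relative_error(sn, sp) - err) for _, sn, sp, err in rows)
```

Under this reading, a motif with no discriminating power scores an error of 0.5, so the top-ranked motifs at about 0.33 are well separated from chance.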
Identifying enriched known motifs: Implications for the LPS network

NF-κB motif highly enriched
• Top 5 motifs all NF-κB family members
• Likely a master regulator
• Expected to have multiple direct targets (next task)

Other motifs and TFs
• IRF motif is important
• Could be Irf1, Irf7 or some other Irf family member
• Other IRF motifs ranked high

Predicting functional binding sites

Inputs
• Motif library: known or novel motifs whose sites we want to identify
• Sequences: where we will search for sites (e.g. promoters)
• Genomic regions: where functional sites are not likely (e.g. inside CDS)
• Alignments: conservation in sequences searched

Steps
1) Identify candidate sites: scan sequences for sites scoring above the cutoff for each motif.
2) Filter by location: eliminate candidate sites occurring inside these regions.
3) Filter by conservation: eliminate candidate sites without desired conservation properties.

Output
• Predicted sites: final set of predicted sites, to be evaluated experimentally

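The three steps above can be sketched as a small pipeline. Everything named here is hypothetical: the `Site` record, the per-motif `(width, cutoff, score_fn)` tuples, the interval list for excluded regions, and the use of a mean per-base conservation score with a 0.8 threshold are illustrative choices, not the procedure used in the tutorial.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    motif: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    score: float

def find_candidates(seq, motifs):
    """Step 1: report every window scoring at or above its motif's cutoff.

    motifs maps a motif name to (width, cutoff, score_fn), where score_fn
    takes a window string and returns a number.
    """
    sites = []
    for name, (width, cutoff, score_fn) in motifs.items():
        for i in range(len(seq) - width + 1):
            score = score_fn(seq[i:i + width])
            if score >= cutoff:
                sites.append(Site(name, i, i + width, score))
    return sites

def filter_by_location(sites, excluded):
    """Step 2: drop sites overlapping any excluded (start, end) interval."""
    return [
        s for s in sites
        if not any(s.start < end and start < s.end for start, end in excluded)
    ]

def filter_by_conservation(sites, conservation, min_mean=0.8):
    """Step 3: keep sites whose mean per-base conservation is high enough."""
    return [
        s for s in sites
        if sum(conservation[s.start:s.end]) / (s.end - s.start) >= min_mean
    ]

def predict_sites(seq, motifs, excluded, conservation):
    """Run the three steps in order and return the surviving sites."""
    candidates = find_candidates(seq, motifs)
    located = filter_by_location(candidates, excluded)
    return filter_by_conservation(located, conservation)
```

Running the cheap location filter before the conservation filter keeps the number of alignment lookups small, although the final set does not depend on the order of the two filters.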