Bioinformatics for the Identification of Sequences Regulating Gene Transcription Wyeth W. Wasserman University of British Columbia www.cisreg.ca
Acknowledgements Collaborators Wasserman Group Dave Arenillas Jenny Bryan (UBC) Jochen Brumm Brenda Gallie (OCI) Alice Chou Jens Lagergren (KTH) Debra Fulton Chip Lawrence (Brown) Shannan Ho Sui Carol Huang Boris Lenhard (K.I.) Danielle Kemmer (KI) James Mortimer (MF) Byron Kuo Jacob Odeberg (KTH) Jonathan Lim Raf Podowski (KI) Dora Pak Group Alumni Chris Walsh Wynand Alkema Dimas Yusuf Elena Herzog Annette Höglund Collaborating Trainees William Krivan Malin Andersson (KTH) Öjvind Johansson (KTH) Luis Mendoza Stuart Lithwick (U.Toronto) Albin Sandelin Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation
Overview CMMT • DISCRIMINATION: TFBS Prediction with Motif Models • Phylogenetic Footprinting • Combinatorial Interactions • Current Activities • DISCOVERY: Inferring Regulatory Mechanisms for Co-Expressed (Co-Regulated) Genes • Motif Over-representation • Pattern Discovery • Current Activities
Transcription Factor Binding Sites (over-simplified for pedagogical purposes) [Figure labels: Pol-II, TATA, URF, URE]
Teaching a computer to find TFBS…
Representing Binding Sites for a TF
• A single site
  – AAGTTAATGA
• A set of sites represented as a consensus
  – VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
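The step from a set of aligned sites to a count matrix can be sketched as below. This is a minimal illustration, not the procedure used to build the matrix on the slide: the function name `count_matrix` is hypothetical, and only a handful of the listed sites are used.

```python
from collections import Counter

# A few of the aligned sites listed on the slide (a subset, for illustration only).
sites = [
    "AAGTTAATGA",
    "CAGTTAATAA",
    "GAGTTAAACA",
    "CAGTTAATTA",
    "GAGTTAATAA",
]

def count_matrix(aligned_sites):
    """Return {base: [count in column 0, column 1, ...]} for equal-length sites."""
    width = len(aligned_sites[0])
    if any(len(s) != width for s in aligned_sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column; missing bases count as 0.
    columns = [Counter(s[i] for s in aligned_sites) for i in range(width)]
    return {b: [col[b] for col in columns] for b in "ACGT"}

pfm = count_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of sites supplied, which is a quick sanity check on any count matrix.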
PFMs to PWMs (PSSMs)

f matrix
A   5  0  1  0  0
C   0  2  2  4  0
G   0  3  1  0  4
T   0  0  1  1  1

$w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)}$

w matrix
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
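The conversion can be sketched in a few lines. The slide gives the formula only in abbreviated form, so several details here are assumptions: the pseudocount term s(N) is taken as a total of sqrt(N) spread over the bases according to the background, the corrected count is normalised by N + sqrt(N), the background p(b) is uniform at 0.25, and the logarithm is base 2. The function name `pfm_to_pwm` is hypothetical.

```python
import math

# Count matrix (f matrix) from the slide: rows are bases, columns are positions.
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log-odds weights (assumed sqrt(N) pseudocounts, log base 2)."""
    width = len(pfm["A"])
    pwm = {b: [] for b in "ACGT"}
    for i in range(width):
        n = sum(pfm[b][i] for b in "ACGT")   # number of sites in this column
        total_pseudo = math.sqrt(n)          # ASSUMPTION: sqrt(N) total pseudocount
        for b in "ACGT":
            corrected = (pfm[b][i] + total_pseudo * background) / (n + total_pseudo)
            pwm[b].append(math.log2(corrected / background))
    return pwm

pwm = pfm_to_pwm(pfm)
for base in "ACGT":
    print(base, [round(w, 1) for w in pwm[base]])
```

Under these assumptions the printed rows round to the w matrix shown on the slide, which suggests (but does not prove) that this is the scheme behind it.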
Detecting binding sites in a single sequence

Scanning a sequence against a PWM (Sp1)
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

Abs_score = 13.4 (sum of column scores)
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

Calculating the relative score

$$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\%$$

Scanning 1300 bp of the human insulin receptor gene with Sp1 at a rel_score threshold of 75%: Ouch.
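A minimal sketch of the scan follows, assuming forward-strand scanning only and a sequence containing only A, C, G and T. The function names are hypothetical. The score bounds are computed from the matrix as transcribed here, so they need not equal the Max_score and Min_score quoted on the slide.

```python
# Sp1 weight matrix as transcribed from the slide (rows: bases, columns: positions).
SP1 = {
    "A": [-0.2284, 0.4368, -1.5, -1.5, -1.5, 0.4368, -1.5, -1.5, -0.2284, 0.4368],
    "C": [-0.2284, -0.2284, -1.5, -1.5, 1.5128, -1.5, -0.2284, -1.5, -0.2284, -1.5],
    "G": [1.2348, 1.2348, 2.1222, 2.1222, 0.4368, 1.2348, 1.5128, 1.7457, 1.7457, -1.5],
    "T": [0.4368, -0.2284, -1.5, -1.5, -0.2284, 0.4368, 0.4368, 0.4368, -1.5, 1.7457],
}

def abs_score(window, pwm):
    """Sum of the column scores for one window of the same width as the matrix."""
    return sum(pwm[base][i] for i, base in enumerate(window))

def score_bounds(pwm):
    """Highest and lowest achievable scores (sums of per-column extremes)."""
    width = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    return max_score, min_score

def rel_score(score, max_score, min_score):
    """Relative score as a percentage of the achievable range."""
    return 100.0 * (score - min_score) / (max_score - min_score)

def scan(sequence, pwm, threshold=75.0):
    """Return (0-based start, window, abs_score, rel_score) for windows at or above threshold."""
    width = len(pwm["A"])
    max_score, min_score = score_bounds(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = abs_score(window, pwm)
        rel = rel_score(score, max_score, min_score)
        if rel >= threshold:
            hits.append((start, window, score, rel))
    return hits

seq = "ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC"
hits = scan(seq, SP1)
for start, window, score, rel in hits:
    print(f"{start:3d} {window} abs={score:6.2f} rel={rel:5.1f}%")

# The slide's worked numbers, plugged into the same formula.
print(round(rel_score(13.4, 15.2, -10.3)))  # 93
```

Under this matrix the window GGGGGGCGGT scores about 13.4, in line with the Abs_score on the slide. Even in this 43 bp example several overlapping windows can pass a 75% threshold, which is the specificity problem the slide points at.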
Performance of Profiles • 95% of predicted sites bound in vitro (Tronche 1997) • MyoD binding sites predicted about once every 600 bp (Fickett 1995) • The Futility Theorem – Nearly 100% of predicted transcription factor binding sites have no function in vivo
JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES
DISCRIMINATION: Overcoming the Specificity Problems
Phylogenetic Footprinting Dramatically Reduces Spurious Hits [Figure: human vs. mouse comparison, actin, alpha cardiac]
Performance: Human vs. Mouse [Figure panels: SELECTIVITY, SENSITIVITY] • Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with a 100+ site set) • 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
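One simple way to apply a conservation filter is to keep only predicted sites that fall inside well-conserved windows of a pairwise alignment. The sketch below is an assumption-laden illustration, not the ConSite implementation: the window size, the identity cutoff, the toy alignment and all function names are hypothetical.

```python
def window_identity(aln_a, aln_b, window):
    """Percent identity for every window start over two gapped, equal-length aligned strings."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues are identical and not gaps.
    matches = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    return [100.0 * sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]

def conserved_columns(aln_a, aln_b, window, cutoff):
    """Alignment columns covered by at least one window with identity >= cutoff."""
    keep = set()
    for start, identity in enumerate(window_identity(aln_a, aln_b, window)):
        if identity >= cutoff:
            keep.update(range(start, start + window))
    return keep

def filter_sites(sites, conserved):
    """Keep (start, end) sites (alignment coordinates, end exclusive) lying fully in conserved columns."""
    return [(s, e) for s, e in sites if all(c in conserved for c in range(s, e))]

# Toy alignment (invented): the first 10 columns are identical, the rest are not.
human = "AAGTTAATGACCCCCCCCCC"
mouse = "AAGTTAATGAGGGG--GGGG"
conserved = conserved_columns(human, mouse, window=10, cutoff=80.0)

predicted = [(0, 10), (8, 18)]  # hypothetical predicted sites
print(filter_sites(predicted, conserved))  # [(0, 10)]
```

The trade-off reported on the slide comes from this kind of filter: most real sites sit in conserved windows, while most spurious predictions do not.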
Now Featuring: Ortholog Sequence Retrieval Service ConSite (www.cisreg.ca)
Current Activity: Analysis of Genetic Variation in TFBS
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
Sequence Variation in TFBS [Figure labels: URF, AaGT, TSS]

GENE        DISEASE/CONDITION (associated)     REFERENCE
UGT1A1      Gilbert’s Syndrome – jaundice      PJ Bosma et al., 1995
UCP3        Elevated Body Mass                 S Otabe et al., 2000
TNFalpha    Malaria Susceptibility             JC Knight et al., 1999
Resistin    Elevated Body Mass                 JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R               H Hackstein et al., 2001
ABCA1       Coronary artery disease            KY Zwarts et al., 2002
Ob          Leptin levels                      J Hager et al., 1998
PEPCK       Obesity                            Y Olswang et al., 2002
PR          Endometrial cancer                 I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia      Koivisto et al., 1994
Identifying allele-specific binding site predictions [Figure: S_wt - S_mt plotted by window position]

1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
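The score difference S_wt - S_mt can be sketched by scoring both alleles in every window that overlaps the variant. The slide does not say which profile was used; as an assumption, this example borrows the 15-column count matrix from the "Representing Binding Sites" slide and converts it to weights with sqrt(N) pseudocounts, a uniform background and log base 2. All function names are hypothetical.

```python
import math

# ASSUMPTION: 15-column count matrix borrowed from the "Representing Binding Sites" slide.
PFM = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}

def to_pwm(pfm, bg=0.25):
    """Log-odds weights with an assumed sqrt(N) total pseudocount and uniform background."""
    pwm = {b: [] for b in "ACGT"}
    for i in range(len(pfm["A"])):
        n = sum(pfm[b][i] for b in "ACGT")
        root = math.sqrt(n)
        for b in "ACGT":
            freq = (pfm[b][i] + root * bg) / (n + root)
            pwm[b].append(math.log2(freq / bg))
    return pwm

def window_score(seq, start, pwm):
    """Score the window of matrix width that begins at 0-based position start."""
    width = len(pwm["A"])
    return sum(pwm[base][i] for i, base in enumerate(seq[start:start + width]))

def allele_differences(wt, mt, pwm):
    """Return {start: (S_wt, S_mt, S_wt - S_mt)} for windows where the alleles differ."""
    width = len(pwm["A"])
    result = {}
    for start in range(len(wt) - width + 1):
        if wt[start:start + width] != mt[start:start + width]:
            s_wt = window_score(wt, start, pwm)
            s_mt = window_score(mt, start, pwm)
            result[start] = (s_wt, s_mt, s_wt - s_mt)
    return result

WT = "ACGCATAAGTTAATGAATAACAGAT"  # t allele at position 14
MT = "ACGCATAAGTTAACGAATAACAGAT"  # c allele at position 14

diffs = allele_differences(WT, MT, to_pwm(PFM))
for start, (s_wt, s_mt, delta) in sorted(diffs.items()):
    print(f"window {start + 1:2d}: S_wt={s_wt:6.2f} S_mt={s_mt:6.2f} diff={delta:5.2f}")
```

A large positive difference at a window that also scores well for the wildtype allele flags a candidate allele-specific site; windows with low scores for both alleles are of little interest whatever the difference.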
RAVEN screenshots
Recent and Active Projects • JUMBO-JASPAR – Building a second generation open-access database • NHR-scan – Identification of binding sites for nuclear hormone receptors
Discrimination of Regulatory Modules: TFs do NOT act in isolation
Layers of Complexity in Metazoan Transcription
Detecting Clusters of TF Binding Sites • Trained Methods – Sufficient examples of real clusters to establish weights on the relative importance of each TF • Statistical Over-Representation of Combinations – Binding profiles available for a set of biologically motivated TFs
Training for the detection of liver cis-regulatory modules (CRMs)
Building a predictive model (brief, as this is well described in the literature) [Figure labels: HNF1, C/EBP, HNF3, HNF4] • At 60% sensitivity, predictions made ~1/30,000 bp
UGT1A1 [Figure: Liver Module Model Score vs. “Window” Position in Sequence; series: Wildtype, Other]
MSCAN: An untrained method for CRM detection (w/ J. Lagergren, KTH Royal Institute of Technology, Sweden) • MSCAN takes as input a user-defined set of TF profiles • Calculates significance for each observed “site” based on local sequence characteristics • Calculates cluster significance using a dynamic programming approach • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence • Filters out statistically significant clusters of sites that contain local repeats • Identification of non-random characteristics in DNA http://mscan.cgb.ki.se
Current Activities on Combinatorial Binding Prediction • Social network analysis to identify a reliable set of genes regulated by a given set of TFs
Making better predictions • Profiles make far too many false predictions to have predictive value in isolation • Phylogenetic footprinting eliminates ~90% of false predictions • Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context
Linking co-expressed genes from microarrays to candidate transcription factors
DISCOVERY: Inferring regulatory mechanisms for subsets of co-expressed genes
Deciphering Regulation of Co-Expressed Genes
oPOSSUM Procedure: Set of co-expressed genes → Automated sequence retrieval from EnsEMBL → Phylogenetic Footprinting (ORCA) → Detection of transcription factor binding sites → Statistical significance of binding sites → Putative mediating transcription factors
Statistical Methods for Identifying Over-represented TFBS • Z scores – Based on the number of occurrences of the TFBS relative to background – Normalized for sequence length – Simple binomial distribution model • Fisher exact probability scores – Based on the number of genes containing the TFBS relative to background – Hypergeometric probability distribution
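Both statistics can be sketched with the standard library alone. This is a simplified reading of the two bullets, not the oPOSSUM implementation: the z-score treats each nucleotide position as a binomial trial at the background hit rate, the Fisher score is computed as a one-tailed hypergeometric probability with the background taken as the full gene population, and all names and example numbers are hypothetical. `math.comb` requires Python 3.8 or later.

```python
import math

def z_score(target_hits, target_length, background_hits, background_length):
    """Occurrence-based z-score under a simple binomial model, normalised for sequence length."""
    p = background_hits / background_length      # background hit rate per nucleotide
    expected = target_length * p                 # expected hits in the target sequence
    sd = math.sqrt(target_length * p * (1.0 - p))
    return (target_hits - expected) / sd

def fisher_one_tailed(target_with, target_total, background_with, background_total):
    """P(X >= target_with) under the hypergeometric distribution.

    ASSUMPTION: the background is the whole gene population, and the target
    gene set is a sample of target_total genes drawn from it.
    """
    upper = min(background_with, target_total)
    numerator = sum(
        math.comb(background_with, x)
        * math.comb(background_total - background_with, target_total - x)
        for x in range(target_with, upper + 1)
    )
    return numerator / math.comb(background_total, target_total)

# Hypothetical counts: 30 hits in 10 kb of target sequence vs. 1000 hits in 1 Mb of background.
print(round(z_score(30, 10_000, 1_000, 1_000_000), 2))

# Hypothetical counts: 2 of 3 target genes carry the site vs. 4 of 10 background genes.
print(round(fisher_one_tailed(2, 3, 4, 10), 4))
```

The two scores answer different questions: the z-score rewards many occurrences, even if concentrated in a few genes, while the Fisher score rewards sites spread across many genes in the set.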