Bioinformatics for the Identification of Sequences Regulating Gene - PowerPoint PPT Presentation

Bioinformatics for the Identification of Sequences Regulating Gene Transcription Wyeth W. Wasserman University of British Columbia www.cisreg.ca

Acknowledgements Collaborators Wasserman Group Dave Arenillas Jenny Bryan (UBC) Jochen Brumm Brenda Gallie (OCI) Alice Chou Jens Lagergren (KTH) Debra Fulton Chip Lawrence (Brown) Shannan Ho Sui Carol Huang Boris Lenhard (K.I.) Danielle Kemmer (KI) James Mortimer (MF) Byron Kuo Jacob Odeberg (KTH) Jonathan Lim Raf Podowski (KI) Dora Pak Group Alumni Chris Walsh Wynand Alkema Dimas Yusuf Elena Herzog Annette Höglund Collaborating Trainees William Krivan Malin Andersson (KTH) Öjvind Johansson (KTH) Luis Mendoza Stuart Lithwick (U.Toronto) Albin Sandelin Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation

Overview
• DISCRIMINATION: TFBS Prediction with Motif Models
  – Phylogenetic Footprinting
  – Combinatorial Interactions
  – Current Activities
• DISCOVERY: Inferring Regulatory Mechanisms for Co-Expressed (Co-Regulated) Genes
  – Motif Over-representation
  – Pattern Discovery
  – Current Activities

Transcription Factor Binding Sites (over-simplified for pedagogical purposes)
[Figure labels: Pol-II, TATA, URF, URE]

Teaching a computer to find TFBS…

Representing Binding Sites for a TF
• A single site
  – AAGTTAATGA
• A set of sites represented as a consensus
  – VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
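As a rough illustration of how a count matrix is derived from aligned sites, the sketch below tallies base counts per column. It uses only five of the 10-bp sites from the slide, so its counts are not the matrix shown above; the function name `build_pfm` is hypothetical.

```python
# Illustrative subset of the aligned 10-bp sites listed on the slide.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]


def build_pfm(aligned_sites):
    """Count how often each base occurs at each position of the aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in aligned_sites:
        for position, base in enumerate(site):
            pfm[base][position] += 1
    return pfm


pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column sums to the number of sites, which is a quick sanity check on any position frequency matrix.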

PFMs to PWMs (PSSMs)

Each entry of the f matrix is converted to an entry of the w matrix with

$$\log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$$

f matrix
A  5  0  1  0  0
C  0  2  2  4  0
G  0  3  1  0  4
T  0  0  1  1  1

w matrix
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
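One way to reproduce the w matrix from the f matrix is sketched below. It rests on assumptions the slide does not state: a pseudocount of sqrt(N)/4 per base, counts normalized by N + sqrt(N), a uniform background p(b) = 0.25, and a base-2 logarithm. Under those assumptions the values agree with the slide after rounding to one decimal. The function name `pfm_to_pwm` is hypothetical.

```python
import math

# The f matrix from the slide (rows: bases, columns: positions).
f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log-odds weights (assumed pseudocount: sqrt(N) * background)."""
    width = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        n = sum(pfm[base][i] for base in "ACGT")   # number of sites in this column
        pseudo = math.sqrt(n) * background          # assumption, not stated on the slide
        for base in "ACGT":
            corrected = (pfm[base][i] + pseudo) / (n + 4 * pseudo)
            pwm[base].append(math.log2(corrected / background))
    return pwm


w_matrix = pfm_to_pwm(f_matrix)
for base in "ACGT":
    print(base, [round(x, 1) for x in w_matrix[base]])
```

The pseudocount keeps zero counts from producing a weight of minus infinity, which is why unobserved bases score -1.7 rather than being excluded outright.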

Detecting binding sites in a single sequence

Scanning a sequence against a PWM (Sp1)

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5   ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5   ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457]

Abs_score = 13.4 (sum of column scores)

Calculating the relative score
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

$$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\%$$

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.
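The scanning procedure can be sketched as a sliding window that sums one weight per column and rescales the sum between the lowest and highest achievable scores. The 5-column matrix below is a hypothetical toy example, not the Sp1 matrix, and all function names are assumptions for illustration. Only the forward strand is scanned, and the sequence is assumed to contain only uppercase A, C, G, and T.

```python
# Hypothetical toy PWM (rows: bases, columns: positions); not the Sp1 matrix.
PWM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def abs_score(pwm, window):
    """Sum of the column scores selected by the bases in the window."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def max_score(pwm):
    """Sum of the highest score in each column."""
    return sum(max(pwm[b][i] for b in "ACGT") for i in range(len(pwm["A"])))


def min_score(pwm):
    """Sum of the lowest score in each column."""
    return sum(min(pwm[b][i] for b in "ACGT") for i in range(len(pwm["A"])))


def rel_score(abs_s, max_s, min_s):
    """Relative score in percent, as defined on the slide."""
    return (abs_s - min_s) / (max_s - min_s) * 100.0


def scan(pwm, sequence, threshold=75.0):
    """Return (start, window, rel_score) for every window at or above the threshold."""
    width = len(pwm["A"])
    highest, lowest = max_score(pwm), min_score(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = rel_score(abs_score(pwm, window), highest, lowest)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


hits = scan(PWM, "TTAGCCGTT")
slide_value = rel_score(13.4, 15.2, -10.3)  # the numbers quoted on the slide
print(hits)
print(round(slide_value))
```

Because the relative score depends only on the window and the matrix, lowering the threshold on a long sequence quickly increases the number of reported windows.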

Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Theorem
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES

DISCRIMINATION: Overcoming the Specificity Problems

Phylogenetic Footprinting Dramatically Reduces Spurious Hits
[Figure: human versus mouse comparison for Actin, alpha cardiac]

Performance: Human vs. Mouse
[Figure panels: SELECTIVITY, SENSITIVITY]
• Testing set: 40 experimentally defined sites in 15 well studied genes (replicated with a set of 100+ sites)
• 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

Now Featuring: Ortholog Sequence Retrieval Service
ConSite (www.cisreg.ca)

Current Activity: Analysis of Genetic Variation in TFBS
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT

Sequence Variation in TFBS
[Figure labels: URF, AaGT, TSS]

GENE | DISEASE/CONDITION (associated) | REFERENCE
UGT1A1 | Gilbert’s Syndrome (jaundice) | PJ Bosma et al., 1995
UCP3 | Elevated Body Mass | S Otabe et al., 2000
TNFalpha | Malaria Susceptibility | JC Knight et al., 1999
Resistin | Elevated Body Mass | JC Engert et al., 2002
IL4Ralpha | Reduced soluble IL4R | H Hackstein et al., 2001
ABCA1 | Coronary artery disease | KY Zwarts et al., 2002
Ob | Leptin levels | J Hager et al., 1998
PEPCK | Obesity | Y Olswang et al., 2002
PR | Endometrial cancer | I DeVivo et al., 2002
LDLR | Familial hypercholesterolemia | Koivisto et al., 1994

Identifying allele-specific binding site predictions
[Figure: score difference S_wt - S_mt plotted against position in the sequence]

Position:  1234567890123456789012345
Wild type: ACGCATAAGTTAAtGAATAACAGAT
Variant:   .............c...........
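A minimal sketch of the S_wt - S_mt comparison follows: every window is scored on both alleles and the difference is recorded, so only windows overlapping the variant can differ from zero. The two 25-bp alleles are the ones shown on the slides; the 5-column matrix is a hypothetical toy PWM and the function names are assumptions for illustration.

```python
# Hypothetical toy PWM (rows: bases, columns: positions); not a real TF profile.
PWM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}

WT = "ACGCATAAGTTAATGAATAACAGAT"  # wild-type allele (T at position 14)
MT = "ACGCATAAGTTAACGAATAACAGAT"  # variant allele (C at position 14)


def window_score(pwm, window):
    """Sum of the column scores selected by the bases in the window."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def allele_score_differences(pwm, wt, mt):
    """S_wt - S_mt for every window start (0-based) along two equal-length alleles."""
    if len(wt) != len(mt):
        raise ValueError("alleles must have the same length")
    width = len(pwm["A"])
    return [
        window_score(pwm, wt[s:s + width]) - window_score(pwm, mt[s:s + width])
        for s in range(len(wt) - width + 1)
    ]


diffs = allele_score_differences(PWM, WT, MT)
for start, diff in enumerate(diffs):
    if abs(diff) > 1e-9:
        print(start + 1, round(diff, 1))  # 1-based window start, score difference
```

A large positive difference suggests the wild-type allele matches the profile better in that window, and a large negative one suggests the variant does.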

RAVEN screenshots

Recent and Active Projects
• JUMBO-JASPAR
  – Building a second generation open-access database
• NHR-scan
  – Identification of binding sites for nuclear hormone receptors

Discrimination of Regulatory Modules
TFs do NOT act in isolation

Layers of Complexity in Metazoan Transcription

Detecting Clusters of TF Binding Sites
• Trained Methods
  – Sufficient examples of real clusters to establish weights on the relative importance of each TF
• Statistical Over-Representation of Combinations
  – Binding profiles available for a set of biologically motivated TFs

Training for the detection of liver cis-regulatory modules (CRMs)

Building a predictive model (Brief, as this is well described in the literature)
[Figure labels: HNF1, C/EBP, HNF3, HNF4]
At 60% sensitivity, predictions made ~1/30,000 bp

UGT1A1
[Figure: Liver Module Model Score (y-axis) versus “Window” Position in Sequence (x-axis); series: Wildtype, Other]

MSCAN: An untrained method for CRM detection (with J. Lagergren, KTH Royal Institute of Technology, Sweden)
• MSCAN takes as input a user-defined set of TF profiles
• Calculates significance for each observed “site” based on local sequence characteristics
• Calculates cluster significance using a dynamic programming approach
• Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
• Filters out statistically significant clusters of sites that contain local repeats
• Identification of non-random characteristics in DNA
http://mscan.cgb.ki.se

Current Activities on Combinatorial Binding Prediction
• Social network analysis to identify a reliable set of genes regulated by a given set of TFs

Making better predictions
• Profiles make far too many false predictions to have predictive value in isolation
• Phylogenetic footprinting eliminates ~90% of false predictions
• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

Linking co-expressed genes from microarrays to candidate transcription factors

DISCOVERY: Inferring regulatory mechanisms for subsets of co-expressed genes

Deciphering Regulation of Co-Expressed Genes

oPOSSUM Procedure
Set of co-expressed genes → Automated sequence retrieval from EnsEMBL → Phylogenetic Footprinting (ORCA) → Detection of transcription factor binding sites → Statistical significance of binding sites → Putative mediating transcription factors

Statistical Methods for Identifying Over-represented TFBS
• Z scores
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model
• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution
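The two statistics can be sketched with the standard library alone. This is a simplified reading of the bullets above, not the oPOSSUM implementation: the Z score uses a plain normal approximation to the binomial with no continuity correction, the Fisher score is the one-sided hypergeometric tail, and the target genes are assumed to be a subset of the background set. All function and parameter names are hypothetical, and the input numbers are made up for illustration.

```python
import math


def binomial_z_score(target_hits, target_length, background_hits, background_length):
    """Z score for site occurrences, with the hit rate per nucleotide taken from the background."""
    rate = background_hits / background_length          # normalizes for sequence length
    expected = target_length * rate
    std_dev = math.sqrt(target_length * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev


def hypergeometric_tail(target_with_site, target_genes, background_with_site, background_genes):
    """P(at least target_with_site genes carry the site) when target_genes are drawn at random."""
    upper = min(target_genes, background_with_site)
    favourable = sum(
        math.comb(background_with_site, k)
        * math.comb(background_genes - background_with_site, target_genes - k)
        for k in range(target_with_site, upper + 1)
    )
    return favourable / math.comb(background_genes, target_genes)


# Made-up counts, for illustration only.
z = binomial_z_score(target_hits=30, target_length=10_000,
                     background_hits=1_000, background_length=1_000_000)
p = hypergeometric_tail(target_with_site=2, target_genes=3,
                        background_with_site=4, background_genes=10)
print(round(z, 2), round(p, 4))
```

The Z score counts every occurrence, so a few genes with many sites can dominate it, whereas the Fisher score counts each gene once; the two can therefore rank the same TF differently.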
