Discovery of Transcription-Regulating Regions in Genes

Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics (CMMT)
Children’s and Women’s Hospital
University of British Columbia

Overview
• Bioinformatics for detection of transcription factor binding sites
• The Specificity Problem
• Methods to enhance specificity of discrimination algorithms
• Pattern discovery for the analysis of regulatory sequences in sets of co-expressed genes
• Methods to enhance sensitivity of discovery algorithms
• Current activities

Layers of Complexity in Metazoan Transcription

Transcription Simplified
[Figure: simplified transcription diagram with labels URF, Pol-II, URE, TATA]

Teaching a computer to find TFBS…

Representing Binding Sites for a TF
• A single site
  – AAGTTAATGA
• A set of sites represented as a consensus
  – VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites

  A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
  C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
  G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
  T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites (distinct sequences shown): AAGTTAATGA, CAGTTAATAA, GAGTTAAACA, CAGTTAATTA, GAGTTAATAA, CAGTTATTCA, CAGTTAATCA, AGATTAAAGA, AAGTTAACGA, AGGTTAACGA, ATGTTGATGA, AAATTAATGA, GAGTTAATGA, AAGTTAATCA, AAGTTGATGA, ATGTTAATGA, AAGTAAATGA
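The step from a set of aligned sites to a count matrix and a degenerate consensus can be sketched as follows. This is a minimal illustration, not the author's code: the five sites are copied from the slide's site list, the function names are hypothetical, and the consensus rule (emit the IUPAC code covering every base observed in a column) is an assumption. The counts it yields are for these 10-mers only and are not the 15-column matrix shown on the slide.

```python
from collections import Counter

# A few aligned sites copied from the slide's site list (all the same length).
SITES = [
    "AAGTTAATGA",
    "CAGTTAATAA",
    "GAGTTAAACA",
    "CAGTTAATTA",
    "GAGTTAATAA",
]

# IUPAC degenerate DNA codes, keyed by the alphabetically sorted bases they cover.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def count_matrix(sites):
    """Count each base at each position of a set of equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(column) for column in zip(*sites)]
    return {base: [col[base] for col in columns] for base in "ACGT"}


def consensus(pfm):
    """Assumed rule: one IUPAC code per column covering every observed base."""
    width = len(pfm["A"])
    codes = []
    for i in range(width):
        observed = "".join(base for base in "ACGT" if pfm[base][i] > 0)
        codes.append(IUPAC[observed])
    return "".join(codes)


PFM = count_matrix(SITES)
CONSENSUS = consensus(PFM)
```

A stricter consensus rule (for example, ignoring bases seen in only a small fraction of sites) would give a less degenerate string; the matrix keeps the counts that any consensus throws away.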

PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic

f matrix
  A    5    0    1    0    0
  C    0    2    2    4    0
  G    0    3    1    0    4
  T    0    0    1    1    1

Conversion: $w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$

w matrix
  A   1.6 -1.7 -0.2 -1.7 -1.7
  C  -1.7  0.5  0.5  1.3 -1.7
  G  -1.7  1.0 -0.2 -1.7  1.3
  T  -1.7 -1.7 -0.2 -0.2 -0.2

Score of TGCTG = 0.9
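The conversion can be worked through numerically. The sketch below is an assumption-laden reconstruction rather than the author's stated method: it takes a uniform background p(b) = 0.25, a total pseudocount of sqrt(N) split according to the background, normalisation of the corrected counts to frequencies, and a base-2 logarithm. Under those assumptions it reproduces the w matrix values on the slide to one decimal place and the score of 0.9 for TGCTG.

```python
import math

# Count matrix (f matrix) from the slide; every column sums to N = 5.
F = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

# Assumption: uniform background base frequencies.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def pwm_from_counts(counts, background):
    """Convert counts to log2-odds weights with a sqrt(N) pseudocount (assumed)."""
    n = sum(counts[base][0] for base in "ACGT")  # number of sites, N
    pseudo_total = math.sqrt(n)                  # assumed pseudocount mass
    width = len(counts["A"])
    weights = {}
    for base in "ACGT":
        row = []
        for i in range(width):
            corrected = (counts[base][i] + pseudo_total * background[base]) / (
                n + pseudo_total
            )
            row.append(math.log2(corrected / background[base]))
        weights[base] = row
    return weights


def score(site, weights):
    """Sum the weight of the observed base at each position."""
    return sum(weights[base][i] for i, base in enumerate(site))


W = pwm_from_counts(F, BACKGROUND)
TGCTG_SCORE = score("TGCTG", W)
```

Because the weights are logarithms, scoring a candidate site is a sum of five table lookups, which is the "easy arithmetic" the third feature refers to.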

Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Theorem
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo
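How a profile produces predictions along a sequence can be sketched as a sliding-window scan. This is an illustration only: the weights are the rounded w matrix from the "PFMs to PWMs" slide, while the function name, the toy sequence and the threshold of 3.0 are hypothetical. Only the forward strand is scanned and the input is assumed to contain uppercase A, C, G and T only.

```python
# Rounded w matrix copied from the "PFMs to PWMs" slide.
W = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def scan(sequence, weights, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(weights["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(weights[base][i] for i, base in enumerate(window))
        if total >= threshold:
            hits.append((start, window, round(total, 1)))
    return hits


# Hypothetical toy sequence and threshold, chosen only for illustration.
HITS = scan("TTAGCCGTT", W, threshold=3.0)
```

Lowering the threshold raises the number of hits per base pair, which is the trade-off behind prediction rates such as one site every few hundred base pairs.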

A 1 kbp promoter screened with a collection of TF profiles

Phylogenetic Footprinting for better specificity
70,000,000 years of evolution reveals most regulatory regions.

Phylogenetic Footprinting to Identify Functional Segments
[Figure: % identity in 200 bp windows plotted against window start position (human sequence). Actin gene compared between human and mouse with DPB.]

Phylogenetic Footprinting (2)
[Figure: FoxC2, % identity plotted against start position of 200 bp window (0 to 7000)]
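The windowed identity plotted in these figures can be computed from a pairwise alignment roughly as follows. This is a minimal sketch under stated assumptions: the two strings are already aligned and gap-padded with "-", a column counts as identical only when both sequences carry the same base, and windows are measured in alignment columns rather than human sequence coordinates. The function name and the toy alignment are hypothetical.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (start, fraction identical) for each window over a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # A gap aligned to a gap does not count as a match.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


# Toy alignment with a 4-column window so the arithmetic is easy to follow.
HUMAN = "ACGTACGTAC"
MOUSE = "ACGTTCGAAC"
PROFILE = window_identity(HUMAN, MOUSE, window=4)
```

Windows whose identity exceeds a chosen cut-off would then be treated as conserved segments; the cut-off itself is a tuning choice not fixed by the slides.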

Recall...

The 1 kbp promoter screen with phylogenetic footprinting

Choosing the "right" species...
[Figure: comparison panels labelled CHICKEN/HUMAN, MOUSE/HUMAN, COW/HUMAN]

ConSite (www.phylofoot.org)
Now driven by the ORCA Aligner

Performance: Human vs. Mouse
[Figure: panels labelled SELECTIVITY and SENSITIVITY]
• Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100+ site set)
• 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
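The two quantities reported above, the share of known sites still detected and the share of all predictions retained, can be illustrated with a small conservation filter. Everything in this sketch is hypothetical: the coordinates are toy values rather than data from the study, sites are reduced to single positions, and the function names are invented for the example.

```python
def in_conserved_region(position, regions):
    """True if the position falls inside any half-open (start, end) region."""
    return any(start <= position < end for start, end in regions)


def evaluate_filter(predictions, known_sites, conserved_regions):
    """Return (sensitivity, fraction of predictions retained) after filtering."""
    kept = [p for p in predictions if in_conserved_region(p, conserved_regions)]
    detected = [site for site in known_sites if site in kept]
    sensitivity = len(detected) / len(known_sites)
    retained = len(kept) / len(predictions)
    return sensitivity, retained


# Toy data: ten predicted sites, two of them experimentally defined.
PREDICTIONS = [10, 40, 120, 250, 300, 410, 480, 520, 700, 900]
KNOWN_SITES = [120, 520]
CONSERVED = [(100, 200), (500, 600)]

SENSITIVITY, RETAINED = evaluate_filter(PREDICTIONS, KNOWN_SITES, CONSERVED)
```

In this toy case both known sites survive while 80% of the predictions are discarded, the same shape of result as the figures quoted on the slide.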

Emerging Issues
• Multiple sequence comparisons
  – Incorporate phylogenetic trees
  – Visualization
• Analysis of closely related species
  – Phylogenetic shadowing
• Genome rearrangements
  – Inversion compatible alignment algorithm
• Higher order models of TFBS

Regulatory Modules for better specificity
TFs do NOT act in isolation

Layers of Complexity in Metazoan Transcription

Liver regulatory modules

Models for Liver TFs… (10 second slide for 3 months of work)
[Figure: models for HNF3, HNF1, HNF4, C/EBP]

Statistically Significant Clusters of Sites
• Can we identify dense clusters of sites that are statistically significant?
• Diverse methods have been introduced over the past few years… Berman; Markstein; Frith; Noble; Wagner; …
• In the best cases, we have enough data to train a discriminant function
• Rare to have sufficient data
• For general purpose, we identify statistically significant clusters of TFBS
• Non-trivial to correct for non-random properties of DNA
  – Most difficulty comes from local direct repeats
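One simple way to ask whether a dense cluster is unexpected is to compare the site count in the densest window with a Poisson expectation. This sketch is not any of the cited methods: it assumes predicted sites fall uniformly and independently along the sequence, which is exactly the assumption the last bullet warns about, and the p-value is for a single window with no correction for testing many windows. Function names and positions are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def densest_window(site_positions, window):
    """Return (start, count) of the window of given length holding the most sites."""
    positions = sorted(site_positions)
    best_start, best_count = None, 0
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until the span fits inside one window.
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_start, best_count = positions[left], count
    return best_start, best_count


def cluster_p_value(site_positions, sequence_length, window):
    """Uncorrected Poisson tail probability for the densest window (naive model)."""
    start, count = densest_window(site_positions, window)
    expected = len(site_positions) / sequence_length * window
    return start, count, poisson_tail(count, expected)


# Toy input: seven predicted sites in a 5000 bp sequence, 200 bp window.
POSITIONS = [50, 1200, 1210, 1250, 1290, 3000, 4700]
START, COUNT, P_VALUE = cluster_p_value(POSITIONS, 5000, 200)
```

A local direct repeat would place several near-identical hits side by side and make such a window look significant under this model, which is why repeat handling matters in practice.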

MSCAN (collaboration with Jens Lagergren)
• MSCAN allows users to submit any set of TF profiles
• Calculates significance for each site based on local sequence characteristics
• Calculates cluster significance using a dynamic programming approach
• Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
• Filters out “significant” clusters of sites that contain local repeats
• Identification of non-random characteristics in DNA
http://mscan.cgb.ki.se

Training predictive models for modules
• MSCAN and similar methods assume that any combination of sites is meaningful
• Reality: Some factors critical, others secondary
• An alternative is to teach the computer which combinations are better
• Limited by small size of positive training set
• Our original method: Logistic Regression Analysis
• Recent method from Frith et al.: Hidden Markov Model (COMET)

Liver regulatory modules

Logistic Regression Analysis
Optimize α vector to maximize the distance between output values for positive and negative training data.
[Figure: four inputs, each multiplied by a weight (α1 to α4), are summed (Σ) to give the “logit”]
Output value is: $p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$
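The output calculation on this slide can be written out directly. In this sketch the logit is the weighted sum of four inputs and the output is e^logit / (1 + e^logit), as on the slide; what the inputs are is an assumption (here, one score per liver TF profile in a sequence window), and the α values are made up for illustration rather than fitted weights.

```python
import math


def logit(features, alphas):
    """Weighted sum of the inputs: sum of alpha_i * x_i."""
    if len(features) != len(alphas):
        raise ValueError("features and alphas must have the same length")
    return sum(a * x for a, x in zip(alphas, features))


def module_probability(features, alphas):
    """Logistic output p(x) = e^logit / (1 + e^logit)."""
    z = logit(features, alphas)
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical weights and inputs, one per profile (e.g. HNF1, HNF3, HNF4, C/EBP).
ALPHAS = [0.5, 0.5, 0.5, 0.5]
P_STRONG = module_probability([1.0, 1.0, 1.0, 1.0], ALPHAS)
P_NEUTRAL = module_probability([0.0, 0.0, 0.0, 0.0], ALPHAS)
```

Training consists of choosing the α vector so that windows from known modules score near 1 and control windows near 0; that fitting step is not shown here.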

PERFORMANCE
• Liver (Genome Research, 2001)
  – At 1 hit per 35 kbp, identifies 60% of modules
  – Limited to genes expressed late in liver development
LRA models do not account for multiple sites for the same TF*
*Frith et al.’s COMET and CISTER algorithms circumvent this problem

UDPGT1 (Gilbert’s Syndrome)
[Figure: liver module model score plotted against “window” position in sequence, for wildtype and mutant]

Making better predictions
• Profiles make far too many false predictions to have predictive value in isolation
• Phylogenetic footprinting eliminates about 90% of false predictions
• Detection of clusters of binding sites offers better predictive performance, especially through trained discriminant functions

Active Issues
• Significance of clusters of sites
• Segmentation of DNA into regions of different composition
• Methods using training to find clusters
• Where to place weights?
• Lack of large reference collections of modules
• Limited profile databases

de novo Discovery of TF Binding Sites

Pattern Discovery

Pattern Discovery Methods
• Exhaustive
  – e.g. “Moby Dick” (Bussemaker, Li & Siggia)
  – Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling
  – e.g. AnnSpec (Workman & Stormo)
  – Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
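The exhaustive idea, counting every oligomer in the "+" set and comparing with the "-" set, can be sketched in a few lines. This is not the Moby Dick algorithm, which builds a dictionary of words; it is a plain frequency-ratio ranking with an assumed pseudocount of 1 per oligomer, and the function names and toy promoters are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_enriched(positive, negative, k, pseudocount=1.0):
    """Rank k-mers by their frequency ratio between "+" and "-" collections."""
    pos = kmer_counts(positive, k)
    neg = kmer_counts(negative, k)
    pos_total = sum(pos.values())
    neg_total = sum(neg.values())
    vocabulary = 4 ** k  # number of possible DNA k-mers
    scores = {}
    for kmer, count in pos.items():
        pos_freq = (count + pseudocount) / (pos_total + pseudocount * vocabulary)
        neg_freq = (neg[kmer] + pseudocount) / (neg_total + pseudocount * vocabulary)
        scores[kmer] = pos_freq / neg_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy "+" promoters share the 6-mer GACGTC; the "-" set lacks it.
POSITIVE = ["TTGACGTCATT", "AAGACGTCGG", "CCGACGTCAA"]
NEGATIVE = ["TTTTTTTTTTT", "AAAAAAAAAA", "CCGGCCGGCC"]
RANKED = rank_enriched(POSITIVE, NEGATIVE, k=6)
```

A sampling method would instead search for a weight matrix that best separates the "+" sequences from a background model, which tolerates variation between sites that exact oligomer counting misses.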

Yeast Regulatory Sequence Analysis (YRSA) system

Tests of YRSA System
• DNA-damage response partially mediated by MCB
• Classic cell-cycle array data re-clustered by Getz et al.
• PDR3-regulated genes from array study

Yeast genomes are ideal for such studies
Metazoan genomes are far from ideal

Biochemical complexity enables greater complexity in regulation
[Figure: Yeast: ORF A with three sites marked GO, scale 500 bp. Humans: exons 1 to 3 with nine sites marked GO, scale 20 000 bp.]

Applied Pattern Discovery is Acutely Sensitive to Noise
[Figure: pattern similarity vs. true Mef2 profile plotted against sequence length (0 to 600), using true Mef2 binding sites]

Four Approaches to Improve Sensitivity
• Better background models
  – Higher-order properties of DNA
• Phylogenetic Footprinting
  – Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
  – Architectural rules
• Limit the types of binding profiles allowed
  – TFBS patterns are NOT random
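As one concrete reading of "better background models", a first-order Markov model conditions each base on the one before it instead of treating bases as independent. The sketch below is only an illustration of that idea: the order (one), the pseudocount of 1, the uniform probability for the first base, and the toy training sequence are all assumptions, and the slides do not say which background model was actually used.

```python
import math


def train_markov1(sequences, pseudocount=1.0):
    """Estimate P(next base | current base) from background sequences."""
    counts = {a: {b: pseudocount for b in "ACGT"} for a in "ACGT"}
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in "ACGT"}
        for a in "ACGT"
    }


def log2_probability(seq, transitions, start_prob=0.25):
    """Log2 probability of a sequence under the first-order background model."""
    total = math.log2(start_prob)
    for current, following in zip(seq, seq[1:]):
        total += math.log2(transitions[current][following])
    return total


# Toy background dominated by alternating AT, a simple non-random property.
TRANSITIONS = train_markov1(["ATATATATATAT"])
LOGP_ATAT = log2_probability("ATAT", TRANSITIONS)
LOGP_AAAA = log2_probability("AAAA", TRANSITIONS)
```

Under this background ATAT is far more probable than AAAA, so a discovered pattern resembling ATAT would be judged less surprising than the same pattern under an independent-base model.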

Phylogenetic Footprinting to Identify Conserved Regions
[Figure: panels labelled Bayes Block Aligner (Lawrence Group) and ORCA]

Skeletal Muscle Genes
• One of the most extensively studied tissues for transcriptional regulation
  – 45 genes partially analyzed
  – 26 genes with orthologous genomic sequence from human and rodent
• Five primary classes of transcription factors
  – Principal: Myf (myoD), Mef2, SRF
  – Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)

de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites
[Figure: discovered patterns labelled Mef2-Like, SRF-Like, Myf-Like]

Pattern discovery methods using biochemical constraints
