Gene Regulation Bioinformatics Wyeth W. Wasserman University of British Columbia www.cisreg.ca
The Grand Challenge: Reliably Define Cis-Regulatory Mechanisms of Regulons CO-EXPRESSED GROUPS EXPRESSION DATA SEQUENCE ANALYSIS BIRS 2006 2
REGULATORY PATHWAY INFERENCE from CO-EXPRESSED GENES
• What is the appeal?
• Understand how signals perceived at the cell surface result in downstream changes in cell phenotype
• TFs occasionally serve as therapeutically relevant targets
  • PPARγ, Estrogen Receptor, Glucocorticoid Receptor
• Builds on data from powerful profiling technologies
  • Expression profiling; ChIP-chip
Bioinformatics and Promoter Analysis: What can we do?
Binding Profiles for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
Count matrix (rows: bases; columns: positions 1 to 15):
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
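A minimal sketch of how a count matrix like the one on this slide can be turned into a log-odds scoring profile and used to score a candidate site. The pseudocount of 0.5, the uniform 0.25 background, and the function names `counts_to_pwm`, `score` and `consensus` are illustrative assumptions, not something stated on the slide. Note that the matrix is 15 positions wide while the sites listed on the slide are shown as 10-mers, so the matrix is taken as given rather than rebuilt from those sites.

```python
import math

# Count matrix transcribed from the slide (15 positions).
COUNTS = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def counts_to_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        column = {}
        for b in "ACGT":
            freq = (counts[b][i] + pseudocount) / total
            column[b] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Sum the per-position log-odds for a site of the same width as the PWM."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal profile width")
    return sum(column[base] for column, base in zip(pwm, site.upper()))


def consensus(counts):
    """Most frequent base at each position."""
    width = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(width))


if __name__ == "__main__":
    pwm = counts_to_pwm(COUNTS)
    best = consensus(COUNTS)
    print(best, round(score(pwm, best), 2))
```

A higher score means the site looks more like the profile than like background. Choosing the score threshold above which a site counts as a prediction is a separate decision that this sketch does not address.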
Phylogenetic Footprinting
[Figure: ACTIN gene; % identity in a 200 bp window vs. window start position (human sequence); annotated with selectivity and sensitivity]
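The plot on this slide shows percent identity in a sliding window along an alignment. A minimal sketch of that calculation follows, assuming two pre-aligned sequences of equal length with "-" for gaps. The function name `window_identity` is hypothetical, and window starts are reported in alignment coordinates rather than the human-sequence coordinates used on the slide.

```python
def window_identity(aln_a, aln_b, window=200, step=1):
    """Return (window_start, percent_identity) for each window of an alignment.

    A column counts as identical only when both sequences carry the same
    non-gap character. Coordinates are alignment columns (a simplification).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    matches = [
        1 if (x == y and x != "-") else 0
        for x, y in zip(aln_a.upper(), aln_b.upper())
    ]

    # Prefix sums let each window total be read in constant time.
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + m)

    results = []
    for start in range(0, len(matches) - window + 1, step):
        identical = prefix[start + window] - prefix[start]
        results.append((start, 100.0 * identical / window))
    return results


if __name__ == "__main__":
    for start, pct in window_identity("ACGTACGT", "ACGTTCGA", window=4):
        print(start, pct)
```

Regions whose window identity exceeds a chosen cutoff would then be kept as candidate conserved regions; the cutoff trades sensitivity against selectivity.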
Deciphering Regulation of Co-Expressed Genes
[Figure labels: Co-Expressed; Controls]
oPOSSUM Procedure
Set of co-expressed or co-precipitated genes → Automated sequence retrieval from EnsEMBL → Phylogenetic footprinting (ORCA) → Detection of transcription factor binding sites → Statistical significance of binding sites → Putative mediating transcription factors
Empirical Selection of Parameters based on Reference Studies
[Figure: Z-score (y-axis, -20 to 40) vs. Fisher p-value (x-axis, 1.0E-09 to 1.0E-01) for the Muscle and Liver reference sets, with the Z-score cutoff and Fisher cutoff marked; labelled TFs: p65, SRF, c-Rel, HNF-1, p50, NF-κB, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1, HNF-3β]
CRM Models
Trained models take as input a set of TF binding profiles and return significant clusters of TFBS.
[Figure: plot with y-axis from -0.2 to 1 and x-axis from 100 to 5840]
oPOSSUM Server
WHAT CAN WE DO?
Identifying over-represented pairs of TFBSs in co-expressed genes
[Figure: a pair of sites with spacing d, shown in Background and Target sequences]
• Calculate a Fisher exact probability that the pair of sites is over-represented
• Correct for multiple testing
Over-represented Pairs of Sites in Yeast Fermentation Clusters
                          Target          Background
Cluster  Motif1  Motif2   Hits  No hits   Hits  No hits   p-value   Adjusted
4        CSRE    STRE       15      46     362    6311    8.33E-07  6.49E-04
4        CSRE    GCR1       43      18    2881    3792    1.62E-05  1.26E-02
7        STRE    ADR1P      67     262     835    5838    6.38E-05  4.97E-02
7        STRE    PHO2       70     259     881    5792    5.63E-05  4.39E-02
7        STRE    TBP        69     260     868    5805    6.36E-05  4.96E-02
7        STRE    UASPHR     55     274     628    6045    3.77E-05  2.94E-02
7        STRE    GCR1       68     261     813    5860    1.58E-05  1.23E-02
8        STRE    CAR1_r     25     150     372    6301    2.24E-05  1.75E-02
16       PAC     RRPE      188     293    1958    4715    6.54E-06  5.10E-03
16       RRPE    XBP1      424      57    5354    1319    5.11E-06  3.98E-03
16       RRPE    SCB       411      70    5121    1552    2.78E-06  2.17E-03
16       RRPE    PHO2      425      56    5388    1285    9.28E-06  7.24E-03
16       RRPE    ROX1      273     208    3056    3617    2.09E-06  1.63E-03
16       RRPE    TBP       425      56    5362    1311    3.74E-06  2.92E-03
16       RRPE    FKH1      404      77    5097    1576    4.72E-05  3.68E-02
17       LYS14   RRPE       31      23    1857    4816    5.47E-06  4.27E-03
18       PAC     RRPE      152     206    1958    4715    1.98E-07  1.55E-04
18       RAP1    RRPE      204     154    2901    3772    3.91E-07  3.05E-04
18       RRPE    XBP1      326      32    5354    1319    3.08E-08  2.40E-05
18       RRPE    SCB       309      49    5121    1552    6.59E-06  5.14E-03
18       RRPE    PHO2      325      33    5388    1285    2.38E-07  1.86E-04
18       RRPE    TBP       323      35    5362    1311    5.07E-07  3.96E-04
18       RRPE    UASPHR    256     102    4051    2622    2.02E-05  1.57E-02
18       RRPE    FKH1      312      46    5097    1576    4.20E-07  3.28E-04
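The hits and no-hits columns in this table form a 2x2 contingency table per motif pair. The sketch below shows one way a one-sided Fisher exact probability and a Bonferroni adjustment could be computed from such counts. It assumes the target and background sets are disjoint and that the test is one-sided for enrichment in the target; neither is stated on the slide, so the values it returns need not reproduce the table exactly. The function names are hypothetical. The Adjusted column appears consistent with multiplying each p-value by about 780, which is an inference from the ratios in the table rather than something the slide states.

```python
from math import comb


def fisher_one_sided(target_hits, target_no_hits, bg_hits, bg_no_hits):
    """P(at least `target_hits` hits in the target set) under the
    hypergeometric null, treating target and background as disjoint (assumed)."""
    n_target = target_hits + target_no_hits
    total_hits = target_hits + bg_hits
    total = n_target + bg_hits + bg_no_hits

    denominator = comb(total, n_target)
    upper = min(n_target, total_hits)
    # Exact integer arithmetic avoids underflow for very small tail probabilities.
    numerator = sum(
        comb(total_hits, k) * comb(total - total_hits, n_target - k)
        for k in range(target_hits, upper + 1)
    )
    return numerator / denominator


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # First row of the table: cluster 4, CSRE + STRE.
    p = fisher_one_sided(15, 46, 362, 6311)
    print(p, bonferroni(p, 780))  # 780 tests is an inferred, not stated, count
```

Bonferroni is the simplest correction and is conservative when many motif pairs are tested; other corrections could be substituted without changing the Fisher step.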
What can we do?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
• Motif Over-Representation
• Motif Discovery
Gibbs Sampling (grossly over-simplified)
Sequences:
ttcgctcc
cgatacgc
tgctacct
tgacttcc
agacctca
ctgtagtg
acgcatct
Count matrix (columns: positions 1 to 8):
   1 2 3 4 5 6 7 8
A  2 0 2 2 2 1 0 1
C  0 2 3 3 2 1 6 2
G  0 4 1 0 1 0 1 1
T  4 1 1 2 2 5 0 2
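To make the idea concrete, here is a minimal sketch of a Gibbs site sampler in the same over-simplified spirit as the slide. It assumes exactly one site per sequence, a fixed motif width, a uniform background and a pseudocount of 1. All function names and parameters are illustrative, and this is not presented as the method used by any particular tool.

```python
import random

BASES = "ACGT"


def build_profile(seqs, starts, width, skip, pseudocount=1.0):
    """Base frequencies per motif column from all current sites except `skip`."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append({b: column[b] / total for b in BASES})
    return profile


def gibbs_sample(seqs, width, iterations=200, seed=0):
    """Return one sampled site start per sequence (sequences must be A/C/G/T)."""
    rng = random.Random(seed)
    seqs = [s.upper() for s in seqs]
    # Start from random site positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Hold one sequence out and rebuild the profile from the rest.
            profile = build_profile(seqs, starts, width, skip=i)
            weights = []
            for start in range(len(seq) - width + 1):
                # Likelihood ratio: profile vs. uniform background (assumed 0.25).
                w = 1.0
                for j in range(width):
                    w *= profile[j][seq[start + j]] / 0.25
                weights.append(w)
            # Sample a new site in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    slide_seqs = ["ttcgctcc", "cgatacgc", "tgctacct", "tgacttcc",
                  "agacctca", "ctgtagtg", "acgcatct"]
    print(gibbs_sample(slide_seqs, width=4, iterations=50, seed=1))
```

Because positions are sampled rather than chosen greedily, the sampler can escape poor starting points, but different seeds can give different answers. In practice one would run several restarts and compare the resulting profiles.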
There are problems… Exploring limitations
Combinatorial interactions between TFs
Why can’t we do better?
• Predict TFBS
Futility Conjecture
Human cardiac α-actin gene analyzed with the JASPAR set of profiles (each vertical line represents a TFBS prediction).
Futility Conjecture: TFBS predictions are almost always wrong.
Red boxes are protein-coding exons; TFBS predictions within them are excluded in this analysis.
Why can’t we do better?
• Predict TFBS
• Predict CRMs
Cis-regulatory modules (CRMs) for specific expression in hepatocytes
Why can’t we do better?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
Regulatory Resolution Varies Widely Between Genes
Gene: NR2E1
Why can’t we do better?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
• Motif Over-Representation
Ets TF Family
TFs within the same structural class often bind identical target sequences, so we cannot specify which TF interacts with a motif.
Challenges for Motif Over-Representation
• Methods fail when noise (genes not co-regulated) exceeds 20-50%
• Most expression profiling experiments are not sufficiently resolved to identify such co-regulated clusters
• Works well for studies linked to a primary TF response, but fails over long time periods or for complex (multi-pathway) responses
Why can’t we do better?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
• Motif Over-Representation
• Motif Discovery
Applied Pattern Discovery is Acutely Sensitive to Noise
[Figure: pattern similarity to the true MEF2 profile (y-axis, 10 to 18) vs. sequence length (x-axis, 0 to 600) for true Mef2 binding sites; the pink line is a negative control with no Mef2 sites included]
The Signal-to-Noise Battle
• Background models
• Phylogenetic footprinting
• Motif combinations
• Familial Binding Profiles
• Concurrent motif discovery and expression clustering
Where are we going now? Snippets of Active Projects
An impending transition in promoter analysis…
• Transitions in promoter analysis algorithms are separated by periods of slow progress
  • Focus on the same tired reference collections using progressively more convoluted algorithms
• Advances can be triggered by new data-producing technologies, but more commonly come from adopting principles well known to laboratory researchers
  • CpG islands; CRMs; phylogenetic footprinting
• The next transition: incorporating data from laboratory studies
Informed Motif Discovery: Enhance the Signal or Reduce the Noise
Informed Initial Choice
FBPs enhance sensitivity of pattern detection
A new direction?
• Laboratory (WET) data indicating the locations of regulatory regions and/or specific TFBS can constrain the motif discovery process to improve the success rate
• Extension: we should be able to determine how much WET data is required for successful prediction
[Diagram labels: TF binding data; rod-specific genes; METHOD; predicted regulatory regions; METHOD; identification of overrepresented patterns corresponding to putative TFBS]
Knowledge-Directed CRM Discovery
[Diagram labels]
• Co-expressed genes; known RR; known TFBS; profile for known TF
• Retrieve orthologs; align sequences; phylogenetic footprinting
• Prior prob of being part of a RR; prior prob of being part of a TFBS
• Pattern discovery algorithm: 1) sample regions; 2) sample sites within regions
• CRMs, TFBS and profiles
ROC curve (exons excluded)
[Figure: sensitivity (y-axis, 0 to 1) vs. 1 - specificity (x-axis, 0 to 1), one curve per setting: windows = 10, 20, 50, 100, 200, 300]