Gene Regulation Bioinformatics Wyeth W. Wasserman University of British Columbia www.cisreg.ca
The Grand Challenge: Reliably Define Cis-Regulatory Mechanisms of Regulons CLUSTERING EXPRESSION DATA SEQUENCE ANALYSIS Lake Barkley 2006 2
Inferring Gene Regulation from Expression Profiling Data
REGULATORY PATHWAY INFERENCE from CO-EXPRESSED GENES
• What is the appeal?
  • Understand how signals perceived at the cell surface result in downstream changes in cell phenotype
  • TFs occasionally serve as therapeutically relevant targets
    • PPARγ, Estrogen Receptor, Glucocorticoid Receptor
  • Builds on data from powerful profiling technologies
    • Expression profiling; ChIP-chip
Bioinformatics and Promoter Analysis: What can we do?
What can we do?
• Predict Transcription Factor Binding Sites (TFBS)
Representing Binding Sites for a TF
• A single site
  • AAGTTAATGA
• A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites:

  A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
  C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
  G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
  T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Logo – A graphical representation of the frequency matrix. The y-axis is information content, which reflects the strength of the pattern in each column of the matrix.
[Figure: "Set of binding sites", a column of aligned 10-bp sites such as AAGTTAATGA, CAGTTAATAA and GAGTTAAACA, shown beside the sequence logo.]
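Note: the step from aligned sites to a count matrix, and from a count matrix to the logo's y-axis, can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `build_pfm` and `information_content` are hypothetical, and the information content assumes a uniform background (0.25 per base) with no small-sample correction.

```python
import math

# Count matrix copied from the slide (rows A, C, G, T; 15 columns).
PFM = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def build_pfm(sites):
    """Count each base at each position of equal-length, aligned sites."""
    width = len(sites[0])
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts


def information_content(pfm):
    """Per-column information content in bits (assumed uniform background)."""
    width = len(pfm["A"])
    bits = []
    for i in range(width):
        column = [pfm[base][i] for base in "ACGT"]
        total = sum(column)
        # Shannon entropy of the column; zero counts contribute nothing.
        entropy = -sum((c / total) * math.log2(c / total) for c in column if c > 0)
        bits.append(2.0 - entropy)
    return bits
```

A fully conserved column (such as column 4 above, where all 21 sites carry T) reaches the maximum of 2 bits; a column with equal counts of all four bases scores 0.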
Conversion of PFM to Position Specific Scoring Matrix (PSSM)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

Each pfm entry f(b,i) is converted to a pssm entry by:

$$\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$

pfm
  A  5  0  1  0  0
  C  0  2  2  4  0
  G  0  3  1  0  4
  T  0  0  1  1  1

pssm
  A   1.6  -1.7  -0.2  -1.7  -1.7
  C  -1.7   0.5   0.5   1.3  -1.7
  G  -1.7   1.0  -0.2  -1.7   1.3
  T  -1.7  -1.7  -0.2  -0.2  -0.2

Example score: TGCTG = 0.9
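Note: the slide does not spell out the pseudocount s(n), the normalisation, or the log base. The sketch below is one set of assumptions that happens to reproduce the slide's numbers: a total pseudocount of sqrt(N) shared according to the background, a uniform background p(b) = 0.25, division by N + sqrt(N) to obtain a corrected probability, and log base 2. The function names are hypothetical.

```python
import math

# Count matrix from the slide (5 sites, 5 columns).
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds scores (assumed sqrt(N) pseudocount)."""
    background = background or {base: 0.25 for base in "ACGT"}
    width = len(pfm["A"])
    # Depth N: number of sites, taken from the first column.
    n = sum(pfm[base][0] for base in "ACGT")
    total_pseudo = math.sqrt(n)  # assumption, not stated on the slide
    pssm = {base: [] for base in "ACGT"}
    for base in "ACGT":
        for i in range(width):
            corrected = (pfm[base][i] + total_pseudo * background[base]) / (n + total_pseudo)
            pssm[base].append(math.log2(corrected / background[base]))
    return pssm


def score(pssm, site):
    """Sum the column scores of a candidate site of matching width."""
    return sum(pssm[base][i] for i, base in enumerate(site))
```

Under these assumptions `round(score(pfm_to_pssm(PFM), "TGCTG"), 1)` gives 0.9, matching the slide, and a count of 5 in column 1 maps to about 1.6.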
What can we do?
• Predict TFBS
• Predict Cis-Regulatory Modules (CRMs)
Combinatorial interactions between TFs
CRM Models
Trained models take as input a set of TF binding profiles and return significant clusters of TFBS.
[Figure: plot with y-axis from -0.2 to 1 and x-axis from 100 to 5840.]
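Note: as a rough illustration of what "clusters of TFBS" means, the sketch below reports windows in which predicted sites for several different TFs fall close together. It is a naive heuristic and not the trained models referred to on the slide; the function name, the 200 bp window and the two-TF threshold are all assumptions, and no significance is computed.

```python
def find_clusters(hits, window=200, min_distinct_tfs=2):
    """Return (start, end, tf_names) for windows holding enough distinct TFs.

    hits: list of (position, tf_name) tuples for predicted binding sites.
    """
    hits = sorted(hits)
    clusters = []
    for i, (start, _) in enumerate(hits):
        # Collect every hit that begins within `window` bp of this hit.
        members = [(pos, tf) for pos, tf in hits[i:] if pos - start < window]
        tfs = {tf for _, tf in members}
        if len(tfs) >= min_distinct_tfs:
            clusters.append((start, members[-1][0], sorted(tfs)))
    return clusters
```

For example, `find_clusters([(100, "MEF2"), (150, "SRF"), (900, "MEF2")])` returns one cluster spanning positions 100 to 150.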
What can we do?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
Phylogenetic Footprinting
[Figure: % identity in 200 bp windows (y-axis) against window start position in the human sequence (x-axis). Actin gene compared between human and mouse.]
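Note: a conservation profile of this kind can be sketched as percent identity over sliding windows of a pairwise alignment. This is a simplified illustration: the function name is hypothetical, window starts are reported in alignment coordinates rather than human-sequence coordinates, and positions with a gap in either sequence count as mismatches.

```python
def windowed_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity in sliding windows over two gapped, aligned sequences.

    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two sequences agree on a base, 0 for mismatches and gaps.
    matches = [
        int(x == y and x != "-")
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    ]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile
```

Windows whose identity exceeds a chosen threshold would then be kept as candidate conserved regions for the binding-site search.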
What can we do?
• Predict TFBS
• Predict CRMs
• Phylogenetic Footprinting
• Motif Over-Representation
Deciphering Regulation of Co-Expressed Genes
[Figure: two gene sets labelled "Co-Expressed" and "Controls".]
oPOSSUM Procedure
1. Set of co-expressed or co-precipitated genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors
Statistical Methods for Identifying Over-represented TFBS
• Z scores
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model
• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution
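Note: the two scores can be sketched with the Python standard library as follows. This is an illustrative reading of the bullet points rather than the oPOSSUM implementation: the Z score uses a plain normal approximation to the binomial with the background hit rate per nucleotide as the success probability, and the Fisher score is a one-tailed hypergeometric probability over a 2x2 table in which target and background genes are treated as disjoint sets. Function and parameter names are hypothetical.

```python
import math


def binomial_z_score(target_hits, target_length, background_hits, background_length):
    """Z score for the number of site occurrences, normalised by sequence length."""
    rate = background_hits / background_length   # hits per nucleotide in background
    expected = target_length * rate
    std_dev = math.sqrt(target_length * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev


def fisher_one_tailed(target_with, target_total, background_with, background_total):
    """P(at least `target_with` target genes contain the site), hypergeometric."""
    population = target_total + background_total
    successes = target_with + background_with
    upper = min(successes, target_total)
    numerator = sum(
        math.comb(successes, x) * math.comb(population - successes, target_total - x)
        for x in range(target_with, upper + 1)
    )
    return numerator / math.comb(population, target_total)
```

The Z score responds to how many sites occur overall, while the Fisher score responds to how many genes carry at least one site, so a TF with many sites in a few genes can rank highly on one and poorly on the other.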
Validation using Reference Gene Sets

A. Muscle-specific (23 input; 16 analyzed)
TF           Rank  Z-score  Fisher
SRF          1     21.41    1.18e-02
MEF2         2     18.12    8.05e-04
c-MYB_1      3     14.41    1.25e-03
Myf          4     13.54    3.83e-03
TEF-1        5     11.22    2.87e-03
deltaEF1     6     10.88    1.09e-02
S8           7     5.874    2.93e-01
Irf-1        8     5.245    2.63e-01
Thing1-E47   9     4.485    4.97e-02
HNF-1        10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF           Rank  Z-score  Fisher
HNF-1        1     38.21    8.83e-08
HLF          2     11.00    9.50e-03
Sox-5        3     9.822    1.22e-01
FREAC-4      4     7.101    1.60e-01
HNF-3beta    5     4.494    4.66e-02
SOX17        6     4.229    4.20e-01
Yin-Yang     7     4.070    1.16e-01
S8           8     3.821    1.61e-02
Irf-1        9     3.477    1.69e-01
COUP-TF      10    3.286    2.97e-01

Legend: TFs with experimentally-verified sites in the reference sets.
Empirical Selection of Parameters based on Reference Studies
[Figure: Z-score (y-axis, -20 to 40) plotted against Fisher p-value (x-axis, 1.0E-09 to 1.0E-01) for the Muscle, Liver and NF-κB reference sets, with the Z-score cutoff and the Fisher cutoff marked. Labelled points: p65, SRF, c-Rel, HNF-1, p50, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1, HNF-3β.]
c-Myc SAGE Data
• c-Myc transcription factor dimerizes with the Max protein
• Key regulator of cell proliferation, differentiation and apoptosis
• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR
Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

TF        Class            Rank  Z-score  Fisher    No. Genes
Myc-Max   bHLH-ZIP         1     21.68    5.35e-03  7
Staf      ZN-FINGER, C2H2  2     20.17    1.70e-02  2
Max       bHLH-ZIP         3     18.32    2.16e-02  12
SAP-1     ETS              4     13.23    1.61e-04  13
USF       bHLH-ZIP         5     11.90    1.84e-01  16
SP1       ZN-FINGER, C2H2  6     11.68    4.40e-02  12
n-MYC     bHLH-ZIP         7     11.11    1.55e-01  20
ARNT      bHLH             8     11.11    1.55e-01  20
Elk-1     ETS              9     10.92    3.88e-03  19
Ahr-ARNT  bHLH             10    10.17    1.11e-01  25
c-Fos Microarray Experiment
• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs
Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

TF               Class             Rank  Z-score  Fisher    No. Genes
c-FOS            bZIP              1     17.53    2.60e-05  45
RREB-1           ZN-FINGER, C2H2   2     8.899    1.41e-01  1
PPARgamma-RXRal  NUCLEAR RECEPTOR  3     3.991    2.98e-01  1
CREB             bZIP              4     3.626    1.25e-01  10
E2F              Unknown           5     2.965    7.67e-02  15
NF-κB inhibition microarray study
Genes significantly down-regulated by the NF-κB pathway inhibitor (326 input; 179 analyzed)

TF         Class        Rank  Z-score  Fisher    No. Genes
p65        REL          1     36.57    5.66e-12  62
NF-kappaB  REL          2     32.58    5.82e-11  61
c-REL      REL          3     26.02    8.59e-08  63
Irf-2      TRP-CLUSTER  4     20.39    5.74e-04  6
SPI-B      ETS          5     16.59    1.23e-03  135
Irf-1      TRP-CLUSTER  6     15.4     9.55e-04  23
Sox-5      HMG          7     15.38    2.56e-02  126
p50        REL          8     14.72    2.23e-03  19
Nkx        HOMEO        9     13.66    2.29e-03  111
Bsap       PAIRED       10    13.2     9.92e-02  1
FREAC-4    FORKHEAD     11    12.05    1.66e-03  92
Identifying over-represented pairs of TFBSs in co-expressed genes
[Figure: Background and Target sequence sets, with site pairs annotated d.]
• Calculate a Fisher exact probability that the pair of sites is over-represented
• Correct for multiple testing
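Note: the pair analysis can be sketched as counting, for every pair of TFs, the genes that contain sites for both, then applying a one-tailed Fisher exact test and a multiple-testing correction. The slide does not name the correction; Bonferroni is used here purely as an assumption. The sketch also ignores the spacing d between the two sites, and all function names and the input layout (gene name mapped to the set of TFs with a predicted site) are hypothetical.

```python
from itertools import combinations
from math import comb


def fisher_upper_tail(k, n, successes, population):
    """P(X >= k) when n genes are drawn from `population` holding `successes`."""
    upper = min(successes, n)
    numerator = sum(
        comb(successes, x) * comb(population - successes, n - x)
        for x in range(k, upper + 1)
    )
    return numerator / comb(population, n)


def overrepresented_pairs(target, background):
    """Rank TF pairs by Bonferroni-adjusted Fisher probability.

    target, background: dict mapping gene name -> set of TFs with a site.
    Returns (tf_a, tf_b, p_value, adjusted_p_value) tuples, best first.
    """
    tfs = sorted(set().union(*target.values(), *background.values()))
    pairs = list(combinations(tfs, 2))
    results = []
    for tf_a, tf_b in pairs:
        in_target = sum(1 for s in target.values() if tf_a in s and tf_b in s)
        in_background = sum(1 for s in background.values() if tf_a in s and tf_b in s)
        p = fisher_upper_tail(
            in_target,
            len(target),
            in_target + in_background,
            len(target) + len(background),
        )
        # Bonferroni: multiply by the number of pairs tested (assumed correction).
        results.append((tf_a, tf_b, p, min(1.0, p * len(pairs))))
    return sorted(results, key=lambda row: row[3])
```

Because the number of pairs grows quadratically with the number of TF profiles, the correction step matters far more here than in the single-TF analysis.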