
Discovery and Analysis of Regulatory Regions in the Human Genome - PowerPoint PPT Presentation



  1. Discovery and Analysis of Regulatory Regions in the Human Genome Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital University of British Columbia

  2. Acknowledgements
     Wasserman Group – CMMT: Dave Arenillas, Jochen Brumm, Danielle Kemmer, Jonathan Lim
     Wasserman Group – Karolinska: Albin Sandelin, Raf Podowski, Wynand Alkema
     Collaborators: Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Jens Lagergren (SBC/KTH), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ), Boris Lenhard (K.I.)
     Collaborating Trainees: Malin Andersson (KTH), Öjvind Johansson (UCSD), Stuart Lithwick (U.Toronto)
     Group Alumni: Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza
     Support: CIHR, CGDN, Merck-Frosst, BC Children’s Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder

  3. Overview
     • Basics of promoter analysis
       – Bioinformatics for detection of transcription factor binding sites
         • The Specificity Problem
     • Phylogenetic Footprinting
       – Pattern recognition for discovery of novel regulatory mechanisms
         • A signal-to-noise problem
     • Discrimination of Regulatory Regions
       – Given binding models for relevant TFs, identify potential regulatory sequences
       – Analyze potentially important genetic variation within predicted regulatory regions
     • Pattern discovery (as time permits)
       – Given a set of co-regulated genes, predict important classes of TFBS
       – Given a newly discovered binding profile, predict candidate regulon members

  4. Transcription Simplified [Diagram labels: URF, Pol-II, URE, TATA]

  5. Teaching a computer to find TFBS…

  6. Representing Binding Sites for a TF
     • A single site
       – AAGTTAATGA
     • A set of sites represented as a consensus
       – VDRTWRWWSHD (IUPAC degenerate DNA)
     • A matrix describing a set of sites

           A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
           C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
           G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
           T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

     Set of binding sites (distinct sequences shown on the slide): AAGTTAATGA, CAGTTAATAA, GAGTTAAACA, CAGTTAATTA, GAGTTAATAA, CAGTTATTCA, CAGTTAATCA, AGATTAAAGA, AAGTTAACGA, AGGTTAACGA, ATGTTGATGA, AAATTAATGA, GAGTTAATGA, AAGTTAATCA, AAGTTGATGA, ATGTTAATGA, AAGTAAATGA
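As a rough illustration of the matrix representation, the sketch below counts bases per position across a set of aligned, equal-length sites. The function name `build_pfm` and the choice of four example sites (taken from the slide's list) are assumptions made for illustration only; the slide's own matrix was built from a different, larger collection.

```python
def build_pfm(sites):
    """Count each base at each position of equal-length, pre-aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


# Arbitrary subset of the sites listed on the slide (illustration only).
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each row of the printed matrix corresponds to one base and each column to one position in the site, which is the same layout as the matrix on the slide.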

  7. PFMs to PWMs
     One would like to add the following features to the model:
     1. Correcting for the base frequencies in DNA
     2. Weighting for the confidence (depth) in the pattern
     3. Converting to log-scale probability for easy arithmetic

     $$ w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right) $$

     f matrix:
           A   5    0    1    0    0
           C   0    2    2    4    0
           G   0    3    1    0    4
           T   0    0    1    1    1

     w matrix:
           A   1.6 -1.7 -0.2 -1.7 -1.7
           C  -1.7  0.5  0.5  1.3 -1.7
           G  -1.7  1.0 -0.2 -1.7  1.3
           T  -1.7 -1.7 -0.2 -0.2 -0.2

     Score of TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9
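The conversion can be sketched in a few lines of Python. The slide does not define s(N) or the base of the logarithm, so the following is an assumption: a pseudocount of sqrt(N) split across the bases according to a uniform background of 0.25, counts normalised by N + sqrt(N), and log base 2. Under those assumptions the sketch reproduces the rounded values in the w matrix; the function names `pfm_to_pwm` and `score` are hypothetical.

```python
import math


def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix to log2-odds weights (assumed sqrt(N) pseudocount)."""
    bases = "ACGT"
    background = background or {b: 0.25 for b in bases}
    length = len(pfm["A"])
    pwm = {b: [] for b in bases}
    for i in range(length):
        n = sum(pfm[b][i] for b in bases)   # depth of this column
        pseudo = math.sqrt(n)               # assumed small-sample correction
        for b in bases:
            prob = (pfm[b][i] + pseudo * background[b]) / (n + pseudo)
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


def score(pwm, site):
    """Sum the per-position weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


# The f matrix from the slide.
f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pwm = pfm_to_pwm(f_matrix)
print(round(score(pwm, "TGCTG"), 1))
```

Working in log space is what makes scoring a site a simple sum over positions, which is the "easy arithmetic" the slide refers to.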

  8. Performance of Profiles
     • 95% of predicted sites bound in vitro (Tronche 1997)
     • MyoD binding sites predicted about once every 600 bp (Fickett 1995)
     • The Futility Theorem
       – Nearly 100% of predicted transcription factor binding sites have no function in vivo

  9. A 1 kbp promoter screened with a collection of TF profiles

  10. Phylogenetic Footprinting for better specificity: 70,000,000 years of evolution reveals most regulatory regions.

  11. SIDENOTE: Global Progressive Alignments (ORCA Algorithm)
      • Memory required for a global alignment scales with the product of the sequence lengths
      • Progressive alignment by banding with local alignments (e.g. BLAST) and running a global method on the banded sub-segments
      • Recursion with decreasingly stringent parameters

  12. Phylogenetic Footprinting to Identify Functional Segments [Plot: % identity (200 bp window) versus start position (human sequence)] Actin gene compared between human and mouse by ORCA.

  13. Phylogenetic Footprinting (2) [Plot: FoxC2, % identity versus start position of 200 bp window]
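The percent-identity profiles on these slides can be approximated with a sliding window over a pairwise alignment. The sketch below is a minimal illustration, not the ORCA implementation: it assumes the two sequences are already globally aligned with "-" as the gap character, and the function name `window_identity` is hypothetical. Window starts are reported in alignment coordinates rather than human sequence coordinates.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity in each window of a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        # A column counts as identical only if both bases match and neither is a gap.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Toy alignment; a real analysis would use the 200 bp window from the slides.
human = "ACGTACGTAC"
mouse = "ACGTTCG-AA"
profile = window_identity(human, mouse, window=5, step=5)
print(profile)
```

Regions where the profile stays high across non-coding sequence are the candidates that phylogenetic footprinting highlights.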

  14. Recall...

  15. 1 kbp promoter with phylogenetic footprinting

  16. Choosing the “right” species... [Panel labels: CHICKEN/HUMAN, MOUSE/HUMAN, COW/HUMAN]

  17. Performance: Human vs. Mouse [Plot panels: SELECTIVITY, SENSITIVITY]
      • Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with a 100+ site set)
      • 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

  18. ConSite (www.phylofoot.org): now driven by the ORCA Aligner

  19. Emerging Issues
      • Multiple sequence comparisons
        – Incorporate phylogenetic trees
        – Visualization
      • Analysis of closely related species
        – Phylogenetic shadowing
      • Genome rearrangements
        – Inversion-compatible alignment algorithm
      • Higher order models of TFBS

  20. Improving Pattern Discrimination: TFs do NOT act in isolation

  21. Layers of Complexity in Metazoan Transcription

  22. Biochemical complexity enables greater complexity in regulation [Diagram: yeast ORF A with “GO” signals within 500 bp; human gene (exons 1 to 3) with “GO” signals spread over 20 000 bp]

  23. Statistically Significant Clusters of Sites
      • Can we identify dense clusters of sites that are statistically significant?
      • Diverse methods have been introduced over the past few years: Berman; Markstein; Frith; Noble; Wagner; …
      • In the best cases, we have enough data to train a discriminant function
      • Rare to have sufficient data
      • For general purpose, we identify statistically significant clusters of TFBS
      • Non-trivial to correct for non-random properties of DNA
        – Most difficulty comes from local direct repeats
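To make the idea of a "statistically significant cluster" concrete, here is a deliberately simple sketch. It is not one of the published methods named on the slide: it assumes predicted sites fall independently and uniformly along the sequence, so the number of sites in a window is treated as Poisson distributed. It does not correct for the non-random properties of DNA or for multiple testing, which is exactly the hard part the slide points to. All function names and parameters are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def find_clusters(positions, seq_length, window=200, alpha=0.01):
    """Report windows, anchored at each site, holding unexpectedly many sites."""
    positions = sorted(positions)
    lam = len(positions) / seq_length * window  # expected sites per window
    clusters = []
    for i, start in enumerate(positions):
        count = sum(1 for p in positions[i:] if p < start + window)
        p_value = poisson_tail(count, lam)
        if p_value < alpha:
            clusters.append((start, count, p_value))
    return clusters


# Toy example: four predicted sites close together in a 10 kbp sequence.
site_positions = [100, 120, 150, 180, 5000, 9000]
clusters = find_clusters(site_positions, seq_length=10000)
for start, count, p_value in clusters:
    print(start, count, f"{p_value:.2e}")
```

Overlapping windows are reported separately here; a practical tool would merge them and, as the slide notes, would need a background model that accounts for local repeats.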

  24. Liver regulatory modules

  25. Models for Liver TFs… (10 second slide for 3 months of work): HNF3, HNF1, HNF4, C/EBP

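  26. Logistic Regression Analysis
      Each input is multiplied by a weight ($\alpha_1, \ldots, \alpha_4$) and the products are summed ($\Sigma$) to give the “logit”.
      Optimize the $\alpha$ vector to maximize the distance between output values for positive and negative training data.
      Output value is:

      $$ p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}} $$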
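The output function is small enough to write out directly. In this sketch the inputs are assumed to be one score per TF model (for example the four liver TFs), and the weights shown are made-up numbers for illustration, not the trained values from the published liver model. The function name `module_probability` is hypothetical.

```python
import math


def module_probability(scores, alphas):
    """Logistic output p(x) = e^logit / (1 + e^logit), with logit = sum(alpha_i * x_i)."""
    logit = sum(a * x for a, x in zip(alphas, scores))
    # 1 / (1 + e^-logit) is algebraically the same as e^logit / (1 + e^logit).
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical weights and per-TF scores, for illustration only.
alphas = [0.8, 1.2, 0.5, 0.9]
strong_window = [2.0, 1.5, 1.0, 2.5]
weak_window = [-1.0, -0.5, 0.0, -2.0]
print(module_probability(strong_window, alphas))
print(module_probability(weak_window, alphas))
```

Training consists of choosing the weights so that windows from known modules score near 1 and background windows score near 0.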

  27. PERFORMANCE
      • Liver (Genome Research, 2001)
        – At 1 hit per 35 kbp, identifies 60% of modules
        – Limited to genes expressed late in liver development
      LRA models do not account for multiple sites for the same TF and require a significant reference collection

  28. UDPGT1 (Gilbert’s Syndrome) [Plot: liver module model score versus “window” position in sequence, for wildtype and mutant sequences]

  29. I.23 Predicted Muscle Regulatory Module [Plot: Kcna7, score versus position]

  30. MSCAN: A more general method (w/ Jens Lagergren, Royal Technical University of Sweden)
      • MSCAN allows users to submit any set of TF profiles
      • Calculates significance for each site based on local sequence characteristics
      • Calculates cluster significance using a dynamic programming approach
      • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
      • Filters out “significant” clusters of sites that contain local repeats
      • Identification of non-random characteristics in DNA
      http://mscan.cgb.ki.se

  31. JASPAR (jaspar.cgb.ki.se): open-access database of TF binding profiles

  32. Making better predictions
      • Profiles make far too many false predictions to have predictive value in isolation
      • Phylogenetic footprinting eliminates about 90% of false predictions
      • Detection of clusters of binding sites offers better predictive performance, especially through trained discriminant functions

  33. RAVEN Project: Regulatory Analysis of Variation in ENhancers. Genetic variation in TFBS can result in biomedically important phenotypes.

  34. Sequence Variation in TFBS [Diagram labels: URF, AaGT, TSS]

      GENE         DISEASE/CONDITION (associated)    REFERENCE
      UDP-GT1      Gilbert’s Syndrome (jaundice)     PJ Bosma et al., 1995
      UCP3         Elevated Body Mass                S Otabe et al., 2000
      TNFalpha     Malaria Susceptibility            JC Knight et al., 1999
      Resistin     Elevated Body Mass                JC Engert et al., 2002
      IL4Ralpha    Reduced soluble IL4R              H Hackstein et al., 2001
      ABCA1        Coronary artery disease           KY Zwarts et al., 2002
      Ob           Leptin levels                     J Hager et al., 1998
      PEPCK        Obesity                           Y Olswang et al., 2002
      PR           Endometrial cancer                I DeVivo et al., 2002
      LDLR         Familial hypercholesterolemia     Koivisto et al., 1994

  35. Stage 1: Prediction of Regulatory Regions

  36. Stage 1: Identify Putative Regulatory Regions
      • Retrieves orthologous human and mouse gene sequences from GeneLynx
      • Aligns sequences with ORCA Aligner
      • Finds most significant non-coding regions
      • Designs primers
      [Plot: FoxC2]

  37. Data/Orthology obtained from GeneLynx (www.genelynx.org)
