
NGS Sequence Analysis for Regulation and Epigenomics, Timothy Bailey



  1. NGS Sequence Analysis for Regulation and Epigenomics. Timothy Bailey, Winter School in Mathematical and Computational Biology, July 2, 2013

  2. NGS Analysis and Transcriptional Regulation • RNA-seq – Measuring transcription levels (gene expression) – Detecting RNA regulators (e.g., miRNA) • ChIP-seq – Chromatin modifications – Binding of transcription factor proteins

  3. Talk Overview I. Transcriptional Regulation 101 II. ChIP-seq 101 III. Analyzing ChIP-seq data IV. Combining ChIP-seq and RNA-seq

  4. Part I: Basic Transcriptional Regulation (Source: Steven Chu)

  5. Transcription Factors • Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins. • These proteins control transcription in two main ways: – Directly, by promoting (or preventing) the assembly of the pre-initiation complex. – Indirectly, by modifying chromatin.

  6. BASAL TRANSCRIPTION: • The pre-initiation complex assembles at the core promoter. • This results in only low levels of transcription because the interaction is unstable. [Figure: DNA with TATA and INR elements in the core promoter]

  7. PROXIMAL PROMOTER: • The proximal promoter extends upstream of the promoter. • It contains binding sites for repressor and activator transcription factors. [Figure: DNA with TATA and INR elements and the proximal promoter]

  8. ACTIVATORS: • Some transcription factors (“activators”) stabilize the transcriptional machinery when they bind to sites in the proximal promoter. • This increases transcription. [Figure: DNA with TATA and INR elements and the proximal promoter]

  9. REPRESSORS: • Some factors do not stabilize the transcriptional machinery. • Their binding can block binding by co-factors and activators. • This reduces transcription. [Figure: DNA with TATA and INR elements and the proximal promoter]

  10. ENHANCER REGIONS: • Groups of binding sites located upstream or downstream of a promoter. • Often very distant: 1000s of base pairs. [Figure: DNA with TATA and INR elements, proximal promoter, and enhancer region; distance label 1-100 kb]

  11. ENHANCER REGIONS: • Activator and repressor transcription factors compete to occupy enhancer regions. • DNA looping brings factors into contact with transcriptional machinery. • Bound activators increase transcription. [Figure: DNA with TATA and INR elements, enhancer region, and proximal promoter]

  12. Chromatin modification by TFs: • Example: Histone Acetyltransferases (HATs) acetylate histones. • Tissue-specific transcription factors can bind to HATs, causing chromatin to open. • This can increase transcription. [Figure: HAT, specific and general factors, and DNA with TATA and INR elements, enhancer region, and proximal promoter]

  13. Part II: ChIP-seq Overview (Source: Steven Chu)

  14. ChIP-seq • Chromatin ImmunoPrecipitation followed by high-throughput sequencing. • TF binding sites (“punctate peaks”) • Chromatin mods (“broad peaks”)

  15. Steps in ChIP-seq • Cross-link proteins to DNA • Fragment chromatin • Immunoprecipitate with antibody to protein • Size-select and ligate • Amplify • Sequence

  16. What can I learn from ChIP-seq? • What chromatin regions are marked as active promoters or enhancers? • Where is my TF bound? • What is its DNA-binding motif? • What genes might it regulate?

  17. Part III: Analyzing ChIP-seq Data (Source: Steven Chu)

  18. Analyzing TF ChIP-seq Data • Key messages of this talk: – Use controls! – Validate your data at each step. – But this is Science! What could possibly go wrong…?

  19. Things that can go wrong in ChIP-seq… 1. Low affinity antibody 2. Non-specific antibody 3. Contamination 4. Poor choice of peak calling algorithm (or parameters) … etc.

  20. Steps in ChIP-seq Data Analysis 1. Mapping: where do the sequence “tags” map to the genome? 2. Peak Calling: where are the regions of significant tag concentration? 3. Motif Discovery: what is the binding motif? 4. Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?

  21. 1) Mapping ChIP-seq Tags • Tags: ChIP-seq produces a pool of “tags” (~100 bp) • Tag Count: measure of enrichment of region • Negative Control: “input DNA” tag count (Tallack et al., Genome Res., 2010)

  22. Do the mapped tags make sense? • Each ~100 bp tag is the 5’ end of a DNA fragment. • But DNA is double-stranded so there are tags from both strands. • We expect pairs of clusters of tags on opposite strands, separated by the fragment length. (Wilbanks and Facciotti, PLoS One, 2010)

  23. Strand Cross Correlation Analysis (SCCA) • If we shift the anti-sense tags left by the (average) fragment length, we should see maximum correlation between the reads on the two strands. (Kharchenko et al., Nature Biotechnology, 2009)
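The shift-and-correlate idea on this slide can be sketched in a few lines. This is a minimal illustration, not the code of any published tool: the function names, the toy tag positions, and the use of a plain Pearson correlation over per-base tag counts are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strand_cross_correlation(fwd_starts, rev_starts, region_len, max_shift):
    """Return {shift: correlation} between forward and left-shifted reverse tag counts."""
    fwd = [0] * region_len
    rev = [0] * region_len
    for pos in fwd_starts:
        fwd[pos] += 1
    for pos in rev_starts:
        rev[pos] += 1
    profile = {}
    for shift in range(max_shift + 1):
        # Pair fwd[i] with rev[i + shift], i.e. move reverse-strand tags left by `shift`.
        profile[shift] = pearson(fwd[: region_len - shift], rev[shift:])
    return profile


# Toy data (hypothetical): two binding sites, reverse-strand tags 50 bp downstream.
fwd_starts = [74, 75, 75, 76, 274, 275, 276]
rev_starts = [124, 125, 125, 126, 324, 325, 326]
profile = strand_cross_correlation(fwd_starts, rev_starts, region_len=400, max_shift=100)
best_shift = max(profile, key=profile.get)
print(best_shift)
```

On real data the profile would be computed genome-wide, and the shift with the highest correlation serves as an estimate of the average fragment length.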

  24. SCCA often shows two maxima • Fragment-length peak at average fragment length (as we expected) • Read-length peak at average read length (due to variable and dispersed mappability of genomic positions) [Figure labels: fragment-length peak, read-length peak] (Landt S G et al., Genome Res. 2012;22:1813-1831)

  25. Quality control 1: SCCA identifies failed ChIP-seq ENCODE Guidelines: • Normalized Strand Correlation, NSC > 1.05 • Relative Strand Correlation, RSC > 0.8 • https://code.google.com/p/phantompeakqualtools (Landt S G et al., Genome Res. 2012;22:1813-1831)
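As a rough sketch of how the two scores relate to a cross-correlation profile, the snippet below assumes the usual definitions: NSC is the correlation at the fragment-length peak divided by the minimum correlation, and RSC is (fragment-length peak minus minimum) divided by (read-length peak minus minimum). The function name `nsc_rsc` and the profile values are hypothetical; this is not the phantompeakqualtools implementation.

```python
def nsc_rsc(profile, fragment_shift, read_shift):
    """Compute (NSC, RSC) from a {shift: cross-correlation} profile."""
    cc_min = min(profile.values())
    cc_fragment = profile[fragment_shift]
    cc_read = profile[read_shift]
    if cc_min <= 0 or cc_read == cc_min:
        raise ValueError("profile is not suitable for computing NSC/RSC")
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_read - cc_min)
    return nsc, rsc


# Hypothetical profile: read-length peak at shift 36, fragment-length peak at shift 200.
profile = {0: 0.11, 36: 0.14, 100: 0.12, 200: 0.20, 400: 0.10}
nsc, rsc = nsc_rsc(profile, fragment_shift=200, read_shift=36)
passes = nsc > 1.05 and rsc > 0.8  # thresholds quoted on the slide
print(nsc, rsc, passes)
```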

  26. 2) ChIP-seq Peak Calling • Peak callers combine overlapping tags to get the “peak height”. • Often, strand information and shifting are used to combine tags on opposite strands. • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak. (Wilbanks and Facciotti, PLoS One, 2010)
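The fold-enrichment criterion can be illustrated with a small sketch. The depth scaling, the pseudocount, the `min_fold` threshold, and the region names are assumptions added for the example; real peak callers use more elaborate statistical models.

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total, pseudocount=1.0):
    """Fold enrichment of ChIP over control tags in one region, after depth scaling."""
    scale = chip_total / control_total  # put the control on the ChIP library's scale
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)


# Hypothetical candidate regions: (name, ChIP tag count, control tag count).
regions = [("region_a", 50, 5), ("region_b", 8, 6), ("region_c", 30, 40)]
chip_total, control_total = 1_000_000, 2_000_000
min_fold = 5.0  # illustrative threshold, not a recommended value

called = [
    name
    for name, chip, control in regions
    if fold_enrichment(chip, control, chip_total, control_total) >= min_fold
]
print(called)
```

The pseudocount keeps the ratio finite when the control has no tags in a region, at the cost of shrinking the enrichment of low-count regions.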

  27. Some ChIP-seq peak callers use SCCA [Figure, with two items labeled “Uses SCCA”] (Bailey et al., PLoS Comp Bio, in press)

  28. Sanity checks: Are your peaks reasonable? • Width: TF ChIP-seq peaks should be relatively short (< 300 bp) compared to histone modification peaks. – Are your peaks too wide? • Number: Is the number of TF ChIP-seq peaks reasonable? – Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?) • Location: Do your peaks co-occur with histone marks and genes that your TF regulates? – Examine some peaks using the UCSC genome browser and ENCODE histone tracks
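The width and number checks are easy to automate. The sketch below assumes peaks are available as `(chrom, start, end)` tuples with half-open coordinates. The function name `summarize_peaks` and the example peaks are hypothetical, and the 300 bp cutoff is simply the figure quoted on the slide.

```python
from statistics import median


def summarize_peaks(peaks, max_tf_width=300):
    """Summarize peak count and widths for (chrom, start, end) half-open intervals."""
    widths = [end - start for _, start, end in peaks]
    too_wide = sum(1 for w in widths if w > max_tf_width)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths),
        "fraction_too_wide": too_wide / len(peaks),
    }


# Hypothetical peak calls.
peaks = [("chr1", 100, 250), ("chr1", 900, 1100), ("chr2", 500, 1500), ("chr2", 4000, 4180)]
summary = summarize_peaks(peaks)
print(summary)
```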

  29. Quality control 2: Fraction of Reads in Peaks (FRiP) • Only a fraction of reads typically fall within ChIP-seq peaks. • ENCODE guideline: FRiP > 1% • Caveat: A lower FRiP threshold may be appropriate if there are very few peaks. (Landt S G et al., Genome Res. 2012;22:1813-1831)
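FRiP is a simple ratio, and a minimal sketch makes the bookkeeping explicit. The data layout (reads as `(chrom, position)` pairs, peaks as sorted non-overlapping half-open intervals per chromosome) and the function name `frip` are assumptions for this example, not part of the ENCODE pipeline.

```python
from bisect import bisect_right


def frip(reads, peaks_by_chrom):
    """Fraction of reads whose position falls inside a peak.

    reads: list of (chrom, position).
    peaks_by_chrom: {chrom: sorted, non-overlapping list of half-open (start, end)}.
    """
    starts_by_chrom = {c: [s for s, _ in p] for c, p in peaks_by_chrom.items()}
    in_peaks = 0
    for chrom, pos in reads:
        starts = starts_by_chrom.get(chrom)
        if not starts:
            continue
        i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
        if i >= 0 and pos < peaks_by_chrom[chrom][i][1]:
            in_peaks += 1
    return in_peaks / len(reads)


# Hypothetical reads and peaks.
peaks_by_chrom = {"chr1": [(100, 200), (500, 600)]}
reads = [("chr1", 150), ("chr1", 250), ("chr1", 599), ("chr1", 600), ("chr2", 150)]
score = frip(reads, peaks_by_chrom)
print(score, score > 0.01)  # compare against the 1% guideline
```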

  30. How many of my peaks are “real”? • Irreproducible Discovery Rate (IDR) compares the ranks of peaks from two biological replicates. – Rank peaks by significance (p-value or q-value) – Reproducible discoveries (peaks) should have similar ranks between replicates. • ENCODE: reports peaks at 1% IDR • https://sites.google.com/site/anshulkundaje/projects/idr
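To make the rank-comparison idea concrete, the sketch below ranks matched peaks in two replicates and computes a Spearman rank correlation. This is deliberately not the IDR procedure, which fits a statistical model to the paired ranks; it is a swapped-in, much cruder measure of rank agreement. The peak IDs and scores are hypothetical, and scores are assumed to be untied with higher meaning more significant (for example, -log10 p-values).

```python
def ranks(scores):
    """Map each peak ID to its rank (1 = most significant); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {peak: i + 1 for i, peak in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation over peaks present in both replicates (no ties)."""
    shared = sorted(set(scores_a) & set(scores_b))
    rank_a = ranks({p: scores_a[p] for p in shared})
    rank_b = ranks({p: scores_b[p] for p in shared})
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared peaks")
    d2 = sum((rank_a[p] - rank_b[p]) ** 2 for p in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical significance scores for the same five peaks in two replicates.
rep1 = {"p1": 50, "p2": 40, "p3": 30, "p4": 20, "p5": 10}
rep2 = {"p1": 45, "p2": 48, "p3": 25, "p4": 5, "p5": 12}
rho = spearman_rho(rep1, rep2)
print(rho)
```

A value near 1 indicates that the replicates order their peaks similarly; it does not by itself give a per-peak reproducibility threshold the way IDR does.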

  31. Quality control 3: IDR identifies failed ChIP-seq [Figure panels: High Reproducibility, Low Reproducibility] (Landt S G et al., Genome Res. 2012;22:1813-1831)

  32. 3) Motif Discovery & Enrichment Analysis • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif. • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.
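The notion of central enrichment can be sketched by recording where motif matches fall relative to the center of each peak sequence. The example below uses exact string matching against a made-up pattern. It is not how CentriMo works (which scores sites with a motif model and tests enrichment statistically), and the function names, pattern, and sequences are all hypothetical.

```python
def site_offsets(sequences, motif):
    """Offsets of each motif match's midpoint from the midpoint of its sequence."""
    offsets = []
    for seq in sequences:
        center = len(seq) / 2
        start = seq.find(motif)
        while start != -1:
            offsets.append(start + len(motif) / 2 - center)
            start = seq.find(motif, start + 1)
    return offsets


def central_fraction(offsets, half_window):
    """Fraction of matches whose midpoint lies within +/- half_window of the center."""
    if not offsets:
        return 0.0
    return sum(1 for o in offsets if abs(o) <= half_window) / len(offsets)


# Hypothetical 21-bp "peak" sequences and an arbitrary 7-bp pattern.
motif = "GATTACA"
sequences = [
    "A" * 7 + motif + "A" * 7,  # match exactly at the center
    motif + "C" * 14,           # match at the left edge
    "C" * 21,                   # no match
]
offsets = site_offsets(sequences, motif)
print(offsets, central_fraction(offsets, half_window=2))
```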

  33. Caveats in ChIP-seq Motif Analysis • Peak regions may contain other TF motifs due to looping. • The binding of the ChIP-ed factor “X” may be indirect. • ChIP-ed motif might be weak due to assisted binding. (Farnham, Nature Reviews Genetics, 2009)

  34. TF Binding Motif Discovery • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor. • In principle, discovering the motif is simple. • ChIP-seq peaks tend to be within +/- 50 bp of the bound factor. • So we just examine the peak regions for enriched patterns.
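Examining peak regions for enriched patterns can be illustrated, in its simplest form, by comparing how often each k-mer occurs in peak sequences versus background sequences. This is a toy sketch under stated assumptions (exact k-mers, one count per sequence, an arbitrary pseudocount, made-up sequences); it is not the algorithm used by MEME or DREME.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Number of sequences containing each k-mer (each k-mer counted once per sequence)."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i : i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(peak_seqs, background_seqs, k, pseudocount=1.0):
    """Rank k-mers by the ratio of peak to background occurrence frequency."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = (n + pseudocount) / (len(peak_seqs) + pseudocount)
        bg_frac = (bg[kmer] + pseudocount) / (len(background_seqs) + pseudocount)
        scores[kmer] = fg_frac / bg_frac
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sequences: the peak set shares a planted 7-mer, the background does not.
peak_seqs = ["ACGTTGACTCAGG", "TTTGACTCATTT", "CCTGACTCACC"]
background_seqs = ["ACGTACGTACGT", "TTTTCCCCGGGG", "GGGACCCTTTAA"]
top = enriched_kmers(peak_seqs, background_seqs, k=7)
print(top[0])
```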

  35. MEME Suite tools for ChIP-seq motif discovery and enrichment • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis. – Discovery & Enrichment: MEME-ChIP – Discovery: MEME, DREME, GLAM2 – Enrichment: CentriMo, AME

  36. Example: Motif discovery in NFIC ChIP-seq data • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data. • They do not report using motif discovery on these peaks. • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions. (Machanick & Bailey, Bioinformatics, 2011)
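One way to obtain fixed-width regions like the 100-bp ones mentioned above is to center a window on each peak position and clip it to the chromosome. The sketch below assumes the windows are centered on a peak summit, which the slide does not state. The function name `centered_window` and the coordinates are hypothetical.

```python
def centered_window(summit, width, chrom_length):
    """Half-open (start, end) window of `width` bp around `summit`, kept inside the chromosome."""
    half = width // 2
    start = max(0, summit - half)
    end = min(chrom_length, start + width)
    start = max(0, end - width)  # slide the window back if it hit the chromosome end
    return start, end


# Hypothetical summit positions on a 10,000-bp chromosome.
windows = [centered_window(s, 100, 10_000) for s in (1000, 20, 9990)]
print(windows)
```

Windows near a chromosome end are shifted inward rather than truncated, so every region passed to motif discovery has the same length.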
