NGS Sequence Analysis for Regulation and Epigenomics Timothy Bailey Winter School in Mathematical and Computational Biology July 3, 2012
NGS Analysis and Transcriptional Regulation • RNA-seq – Measuring transcription levels (gene expression) – Detecting RNA regulators (e.g., miRNA) • ChIP-seq (and ChIP-exo) – Chromatin modifications – Binding of transcription factor proteins
Talk Overview I. Basic Transcriptional Regulation II. ChIP-seq and ChIP-exo III. Analyzing ChIP-seq & ChIP-exo data a) Mapping b) Peak calling c) Motif discovery & Enrichment Analysis d) Location analysis
Part I: Basic Transcriptional Regulation Source: Steven Chu
Transcription Factors • Mammalian transcription is controlled (in part) by about 1400 transcription factor (TF) proteins. • These proteins control transcription in two main ways: – Directly, by promoting (or preventing) the assembly of the pre-initiation complex. – Indirectly, by modifying the chromatin.
BASAL TRANSCRIPTION: • The pre-initiation complex assembles at the core promoter. • This results in only low levels of transcription because the interaction is unstable. [Figure labels: DNA, TATA, INR, Core Promoter]
PROXIMAL PROMOTER: • The proximal promoter extends upstream of the promoter. • It contains binding sites for repressor and activator transcription factors. [Figure labels: DNA, TATA, INR, Proximal Promoter]
ACTIVATORS: • Some transcription factors (“activators”) bind to sites in the proximal promoter. • This stabilizes the transcriptional machinery. • This increases transcription. [Figure labels: DNA, TATA, INR, Proximal Promoter]
REPRESSORS: • Some factors do not stabilize the transcriptional machinery. • Their binding can block binding by co-factors and activators. • This reduces transcription. [Figure labels: DNA, TATA, INR, Proximal Promoter]
ENHANCER REGIONS: • Groups of binding sites located upstream or downstream of a promoter. • Often very distant (1000s of base pairs). [Figure labels: DNA, TATA, INR, 1-100Kb, Enhancer Region, Proximal Promoter]
ENHANCER REGIONS: • Both activator and repressor transcription factors can occupy enhancer regions. • DNA looping brings factors into contact with transcriptional machinery. • Bound activators increase transcription. [Figure labels: DNA, TATA, INR, Enhancer Region, Proximal Promoter]
Chromatin modification by TFs: • Histone Acetyltransferases (HATs) acetylate histones. • Tissue-specific transcription factors can bind to HATs, causing chromatin to open. • This can increase transcription. [Figure labels: HAT, Specific, General, DNA, TATA, INR, Enhancer Region, Proximal Promoter]
Part II: ChIP-seq & ChIP-exo Source: Steven Chu
ChIP-seq
ChIP-Exo Rhee and Pugh, Cell 2011.
ChIP-seq & ChIP-exo Rhee and Pugh, Cell 2011.
Part III: Analyzing ChIP-seq Data Source: Steven Chu
Analyzing TF ChIP-seq Data • Key messages of this talk: – Use controls! – Validate your data at each step. – But this is Science! What could possibly go wrong…?
Things that can go wrong in ChIP-seq… 1. Low affinity antibody 2. Non-specific antibody 3. Contamination 4. Poor choice of peak calling algorithm (or parameters) … etc.
Steps in ChIP-seq Data Analysis 1. Mapping: where do the sequence “tags” map to the genome? 2. Peak Calling: where are the regions of significant tag concentration? 3. Motif Discovery: what is the binding motif? 4. Location Analysis: where are the peaks w/ respect to genes, promoters, introns etc?
1) Mapping ChIP-seq Tags • Tags: ChIP-seq produces a pool of “tags” (~100bp) • Tag Count: measure of enrichment of region • Negative Control: “input DNA” tag count Tallack et al, Genome Res., 2019
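To make the "tag count" idea concrete, here is a minimal sketch that counts mapped tags per fixed-width bin for a ChIP sample and an input-DNA control. The bin size, the function name `bin_tag_counts`, and the toy tag positions are all assumptions for illustration; real pipelines read tag positions from an aligner's output.

```python
from collections import Counter

def bin_tag_counts(tag_starts, bin_size=100):
    """Count tags per fixed-width bin, keyed by the bin's start coordinate."""
    counts = Counter()
    for pos in tag_starts:
        counts[(pos // bin_size) * bin_size] += 1
    return counts

# Hypothetical 5' tag positions on one chromosome (toy data, not from the talk).
chip_tags = [105, 130, 150, 199, 420, 1010]
input_tags = [20, 140, 480, 1030]

chip = bin_tag_counts(chip_tags)
control = bin_tag_counts(input_tags)

# Print ChIP and control counts side by side for every bin with ChIP tags.
for start in sorted(chip):
    print(start, chip[start], control.get(start, 0))
```

Bins where the ChIP count clearly exceeds the control count are the candidate enriched regions that peak calling then evaluates.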
2) ChIP-seq Peak Calling • ChIP-seq produces a pool of “tags”. • Tags are currently about 100 bp long. • Tag is the 5’ end of a DNA fragment. • But DNA is double-stranded so… Wilbanks and Facciotti, PLoS One, 2010
ChIP-seq Peak Calling • Peak callers combine overlapping tags to get the “peak height”. • Sometimes strand information is used to combine tags on opposite strands. • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
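The two ideas on this slide, combining tags from opposite strands and scoring by fold-enrichment, can be sketched as follows. This is an illustrative toy, not the method of any particular peak caller: the fragment length of 200 bp, the pseudocount, the library-size scaling, and all function names are assumptions.

```python
def shifted_positions(tags, fragment_length=200):
    """Shift each tag's 5' end toward the estimated fragment center.

    tags: iterable of (position, strand) with strand '+' or '-'.
    Tags on '+' move right and tags on '-' move left by half a fragment.
    """
    half = fragment_length // 2
    return [pos + half if strand == "+" else pos - half for pos, strand in tags]

def count_in_window(positions, start, end):
    """Number of positions in the half-open window [start, end)."""
    return sum(start <= p < end for p in positions)

def fold_enrichment(chip_count, control_count, chip_total, control_total,
                    pseudocount=1.0):
    """Tag count / control tag count, with the control scaled to the ChIP
    library size and a pseudocount (an assumption) to avoid division by zero."""
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)

# Toy data (hypothetical): tags on both strands flanking a site near 1000.
chip_tags = [(900, "+"), (920, "+"), (1100, "-"), (1090, "-"), (5000, "+")]
input_tags = [(910, "+"), (3000, "-"), (4000, "+"), (6000, "-"), (7000, "+")]

chip_count = count_in_window(shifted_positions(chip_tags), 950, 1050)
control_count = count_in_window(shifted_positions(input_tags), 950, 1050)
fe = fold_enrichment(chip_count, control_count, len(chip_tags), len(input_tags))
print(chip_count, control_count, fe)
```

After shifting, tags from both strands pile up over the same window, which is why strand-aware callers can report sharper peaks than a plain overlap count.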
…ChIP-seq Peak Callers Wilbanks and Facciotti, PLoS One, 2010
Sanity check: are your peaks reasonable? • Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks. – Are your peaks too wide? • Number: Is the number of TF ChIP-seq peaks reasonable? – Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?). • Location: Do your peaks co-occur with histone marks and genes your TF regulates? • The next analysis steps will help you answer these questions!
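The width and number checks above are easy to automate. The sketch below summarizes a peak list given as (chrom, start, end) tuples; the 300-bp threshold comes from the slide, while the function name, the tuple layout, and the toy peaks are assumptions.

```python
from statistics import median

def peak_sanity_report(peaks, max_tf_width=300):
    """Summarize peak count and widths for a list of (chrom, start, end)."""
    if not peaks:
        raise ValueError("no peaks to summarize")
    widths = [end - start for _, start, end in peaks]
    # Peaks at or above the threshold are flagged as suspiciously wide for a TF.
    n_wide = sum(w >= max_tf_width for w in widths)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths),
        "fraction_too_wide": n_wide / len(peaks),
    }

# Toy peaks (hypothetical coordinates).
peaks = [("chr1", 100, 250), ("chr1", 1000, 1200),
         ("chr2", 500, 1500), ("chr2", 3000, 3180)]
report = peak_sanity_report(peaks)
print(report)
```

A large fraction of wide peaks, or a peak count far above what is plausible for the factor, is a cue to revisit the peak caller or its parameters before moving on to motif analysis.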
3) Motif Discovery & Enrichment Analysis • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif. • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.
Caveats in ChIP-seq Motif Analysis • Peak regions may contain other TF motifs due to looping. • The binding of the ChIP-ed factor “X” may be indirect. • ChIP-ed motif might be weak due to assisted binding. Farnham, Nature Reviews Genetics, 2009
TF Binding Motif Discovery • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor. • In principle, discovering the motif is simple. • ChIP-seq peaks tend to be within +/- 50bp of the bound factor. • So we just examine the peak regions for enriched patterns.
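Since peaks tend to lie within +/- 50bp of the bound factor, a natural first step is to cut 100-bp windows around each peak summit and hand those sequences to a motif discovery tool. The sketch below does this against an in-memory toy genome; the function names, the dict-based genome, and the summit coordinates are assumptions, and a real analysis would read sequences from a FASTA file.

```python
def summit_windows(summits, chrom_sizes, flank=50):
    """Return half-open (chrom, start, end) windows of +/- flank bp around
    each summit, clipped to the chromosome boundaries."""
    windows = []
    for chrom, pos in summits:
        start = max(0, pos - flank)
        end = min(chrom_sizes[chrom], pos + flank)
        windows.append((chrom, start, end))
    return windows

def extract_sequences(windows, genome):
    """Slice each window out of a {chrom: sequence} dictionary."""
    return [genome[chrom][start:end] for chrom, start, end in windows]

# Toy genome (hypothetical): one 200-bp chromosome.
genome = {"chrT": "ACGT" * 50}
chrom_sizes = {name: len(seq) for name, seq in genome.items()}

summits = [("chrT", 100), ("chrT", 10)]
windows = summit_windows(summits, chrom_sizes)
seqs = extract_sequences(windows, genome)
print(windows, [len(s) for s in seqs])
```

The second window is shorter because it is clipped at the chromosome start; whether to keep or drop such truncated regions is a choice left to the analyst.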
MEME Suite tools for ChIP-seq motif discovery and enrichment • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis. – Discovery & Enrichment: MEME-ChIP – Discovery: MEME, DREME, GLAM2 – Enrichment: CentriMo, AME
Example: Motif discovery in NFIC ChIP-seq data • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data. • They do not report using motif discovery on these peaks. • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions. Machanick & Bailey, Bioinformatics, 2011
Motif discovery fails in the (original) NFIC dataset • An NFIC motif is known from in vitro data, based on only 16 sites. • MEME and DREME fail to find this motif in the NFIC data. • But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.
The problem: poor peak calling! • We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000). • MEME discovers the NFI-family binding motif in this new set of peaks.
Central Motif Enrichment Analysis: CentriMo • CentriMo searches 500-bp ChIP-seq regions for known motifs whose sites are most centrally enriched in the ChIP-seq regions. • Use 500bp regions centered on each ChIP-seq peak. [Figure: “site-probability” curve (Probability vs. Position of Best Site) for motif MA0119.1, with an example central window W=120 in regions of length L=500, S = number of “successes” = 4, T = number of “trials” = 5] Bailey et al, NAR 2012
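The S and T in the figure suggest a binomial view of central enrichment: of T regions containing a best site, S have that site inside the central window. The sketch below computes a one-sided binomial p-value under a simplified null in which the best site is equally likely anywhere in the region, so the success probability is taken as W/L. This is an assumption-laden illustration of the idea, not the CentriMo implementation; the function names, the fixed single window, and the toy positions are hypothetical.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def central_enrichment(best_site_positions, region_length=500, window=120):
    """best_site_positions: best-site centers relative to the region center.

    Returns (S, T, p-value), where a "success" is a best site whose center
    lies within the central window and the null success rate is W / L.
    """
    trials = len(best_site_positions)
    successes = sum(abs(pos) <= window / 2 for pos in best_site_positions)
    p_null = window / region_length
    return successes, trials, binomial_tail(successes, trials, p_null)

# Toy positions (hypothetical) chosen to match the slide's S = 4, T = 5.
positions = [-10, 30, 55, -60, 200]
s, t, pvalue = central_enrichment(positions)
print(s, t, pvalue)
```

With only five regions the p-value is modest; with thousands of peaks, even a slight central bias yields very small p-values, which is why the method can succeed where motif discovery fails.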
Central Motif Enrichment confirms the known NFIC motif, even in the original peaks [Figure: site-probability curves (probability vs. position of best site in sequence, -250 to 250) for the five most centrally enriched motifs. Legend: MA0119.1 (NFIC) p=2.4e-031, w=295, n=5409; MA0244.1 p=4.6e-015, w=381, n=39398; MA0161.1 p=7.3e-015, w=329, n=39356; MA0099.1 p=5.5e-014, w=343, n=34267; MA0406.1 p=8.1e-012, w=323, n=31383] • NFIC motif is most centrally enriched of 862 JASPAR+UniPROBE motifs (p = 10^-31). • However, standard motif enrichment algorithms (including AME) do not show the NFIC motif as the most enriched motif.