NGS Sequence Analysis for Regulation and Epigenomics Timothy Bailey Winter School in Mathematical and Computational Biology July 2, 2013
NGS Analysis and Transcriptional Regulation • RNA-seq – Measuring transcription levels (gene expression) – Detecting RNA regulators (e.g., miRNA) • ChIP-seq – Chromatin modifications – Binding of transcription factor proteins
Talk Overview I. Transcriptional Regulation 101 II. ChIP-seq 101 III. Analyzing ChIP-seq data IV. Combining ChIP-seq and RNA-seq
Part I: Basic Transcriptional Regulation Source: Steven Chu
Transcription Factors • Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins. • These proteins control transcription in two main ways: – Directly, by promoting (or preventing) the assembly of the pre-initiation complex. – Indirectly, by modifying chromatin.
BASAL TRANSCRIPTION: • The pre-initiation complex assembles at the core promoter. • This results in only low levels of transcription because the interaction is unstable. [Diagram labels: DNA, TATA, INR, Core Promoter]
PROXIMAL PROMOTER: • The proximal promoter extends upstream of the core promoter. • It contains binding sites for repressor and activator transcription factors. [Diagram labels: DNA, TATA, INR, Proximal Promoter]
ACTIVATORS: • Some transcription factors (“activators”) stabilize the transcriptional machinery when they bind to sites in the proximal promoter. • This increases transcription. [Diagram labels: DNA, TATA, INR, Proximal Promoter]
REPRESSORS: • Some factors do not stabilize the transcriptional machinery. • Their binding can block binding by co-factors and activators. • This reduces transcription. [Diagram labels: DNA, TATA, INR, Proximal Promoter]
ENHANCER REGIONS: • Groups of binding sites located upstream or downstream of a promoter. • Often very distant (1000s of base pairs). [Diagram labels: DNA, TATA, INR, 1-100 Kb, Enhancer Region, Proximal Promoter]
ENHANCER REGIONS: • Activator and repressor transcription factors compete to occupy enhancer regions. • DNA looping brings factors into contact with transcriptional machinery. • Bound activators increase transcription. [Diagram labels: DNA, TATA, INR, Enhancer Region, Proximal Promoter]
Chromatin modification by TFs: • Example: Histone Acetyltransferases (HATs) acetylate histones. • Tissue-specific transcription factors can bind to HATs, causing chromatin to open. • This can increase transcription. [Diagram labels: HAT, Specific, General, DNA, TATA, INR, Enhancer Region, Proximal Promoter]
Part II: ChIP-seq Overview Source: Steven Chu
ChIP-seq • Chromatin ImmunoPrecipitation followed by high-throughput sequencing. • TF binding sites (“punctate peaks”) • Chromatin mods (“broad peaks”)
Steps in ChIP-seq • Cross-link proteins to DNA • Fragment chromatin • Immunoprecipitate with antibody to protein • Size-select and ligate • Amplify • Sequence
What can I learn from ChIP-seq? • What chromatin regions are marked as active promoters or enhancers? • Where is my TF bound? • What is its DNA-binding motif? • What genes might it regulate?
Part III: Analyzing ChIP-seq Data Source: Steven Chu
Analyzing TF ChIP-seq Data • Key messages of this talk: – Use controls! – Validate your data at each step. – But this is Science! What could possibly go wrong…?
Things that can go wrong in ChIP-seq… 1. Low affinity antibody 2. Non-specific antibody 3. Contamination 4. Poor choice of peak calling algorithm (or parameters) … etc.
Steps in ChIP-seq Data Analysis 1. Mapping: where do the sequence “tags” map to the genome? 2. Peak Calling: where are the regions of significant tag concentration? 3. Motif Discovery: what is the binding motif? 4. Location Analysis: where are the peaks w/ respect to genes, promoters, introns etc?
1) Mapping ChIP-seq Tags • Tags: ChIP-seq produces a pool of “tags” (~100bp) • Tag Count: measure of enrichment of region • Negative Control: “input DNA” tag count Tallack et al., Genome Res., 2010
Do the mapped tags make sense? • Each ~100 bp tag is the 5’ end of a DNA fragment. • But DNA is double-stranded so there are tags from both strands. • We expect pairs of clusters of tags on opposite strands, separated by the fragment length. Wilbanks and Facciotti, PLoS One, 2010
Strand Cross Correlation Analysis (SCCA) • If we shift the anti-sense tags left by the (average) fragment length, we should see maximum correlation between the reads on the two strands. Kharchenko et al., Nature Biotechnology, 2009
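The shifting idea above can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the method of any particular tool: the function names, the per-base count vectors `plus` and `minus`, and the simulated 150 bp fragment length are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts with left-shifted minus-strand counts.

    plus[i] and minus[i] are per-base counts of tag 5' ends.
    Returns {shift: correlation} for shift = 0..max_shift.
    """
    n = len(plus)
    profile = {}
    for shift in range(max_shift + 1):
        # Shifting the minus strand left by `shift` pairs minus[i + shift] with plus[i].
        profile[shift] = pearson(plus[: n - shift], minus[shift:])
    return profile

# Toy data (assumed): two binding sites, minus-strand clusters 150 bp downstream.
length = 600
plus = [0] * length
minus = [0] * length
for center in (100, 400):
    for offset in range(-5, 6):
        plus[center + offset] += 6 - abs(offset)
        minus[center + 150 + offset] += 6 - abs(offset)

cc = strand_cross_correlation(plus, minus, max_shift=300)
best_shift = max(cc, key=cc.get)
print("estimated fragment length:", best_shift)
```

On real data the count vectors would come from mapped tags per chromosome, and the profile would be noisier than in this toy case.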
SCCA often shows two maxima • Fragment-length peak at average fragment length (as we expected) • Read-length peak at average read length (due to variable and dispersed mappability of genomic positions) [Figure labels: fragment-length peak, read-length peak] Landt S G et al. Genome Res. 2012;22:1813-1831
Quality control 1: SCCA identifies failed ChIP-seq ENCODE Guidelines: • Normalized Strand Correlation, NSC > 1.05 • Relative Strand Correlation, RSC > 0.8 • https://code.google.com/p/phantompeakqualtools Landt S G et al. Genome Res. 2012;22:1813-1831
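A sketch of how the two scores might be computed from a cross-correlation profile. The slide gives only the thresholds; the definitions used here (NSC as the fragment-length correlation divided by the minimum correlation, RSC as the background-subtracted ratio of the fragment-length peak to the read-length peak) are an assumption about how these scores are usually defined, and the profile values are invented for illustration.

```python
def nsc_rsc(cc, fragment_shift, read_shift):
    """Compute (NSC, RSC) from a cross-correlation profile {shift: value}.

    Assumed definitions:
      NSC = cc[fragment] / min(cc)
      RSC = (cc[fragment] - min(cc)) / (cc[read] - min(cc))
    """
    background = min(cc.values())
    fragment_cc = cc[fragment_shift]
    read_cc = cc[read_shift]
    if background <= 0 or read_cc == background:
        raise ValueError("NSC/RSC undefined for this profile")
    nsc = fragment_cc / background
    rsc = (fragment_cc - background) / (read_cc - background)
    return nsc, rsc

# Hypothetical profile: read-length peak at 36 bp, fragment-length peak at 150 bp.
cc = {0: 0.10, 36: 0.14, 100: 0.13, 150: 0.18, 200: 0.12, 300: 0.10}
nsc, rsc = nsc_rsc(cc, fragment_shift=150, read_shift=36)

# Thresholds taken from the ENCODE guideline quoted above.
passes_qc = nsc > 1.05 and rsc > 0.8
print(round(nsc, 2), round(rsc, 2), passes_qc)
```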
2) ChIP-seq Peak Calling • Peak callers combine overlapping tags to get the “peak height”. • Often, strand information and shifting is used to combine tags on opposite strands. • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak. Wilbanks and Facciotti, PLoS One, 2010
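The fold-enrichment criterion can be sketched as follows. This is a simplified illustration rather than the algorithm of any published peak caller: the fixed windows, the library-size scaling, the pseudocount, and the threshold of 4 are all assumptions chosen for the example.

```python
def fold_enrichment(chip_counts, control_counts, chip_total, control_total,
                    pseudocount=1.0):
    """Per-window ChIP/control ratio after scaling control to the ChIP library size."""
    scale = chip_total / control_total
    return [
        (chip + pseudocount) / (control * scale + pseudocount)
        for chip, control in zip(chip_counts, control_counts)
    ]

def call_peaks(enrichment, threshold):
    """Indices of windows whose fold enrichment meets the threshold."""
    return [i for i, value in enumerate(enrichment) if value >= threshold]

# Toy tag counts per window (assumed). Window 5 is high in ChIP AND in control.
chip = [3, 4, 40, 5, 2, 30]
control = [3, 5, 4, 4, 2, 25]

fe = fold_enrichment(chip, control, chip_total=1_000_000, control_total=1_000_000)
peaks = call_peaks(fe, threshold=4.0)
print(peaks)
```

Window 5 has a high ChIP tag count but is not called because the control is equally high there, which is the point of using an input-DNA control.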
Some ChIP-seq peak callers use SCCA [Figure labels: “Uses SCCA” (twice)] Bailey et al., PLoS Comp Bio, in press.
Sanity checks: Are your peaks reasonable? • Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks. – Are your peaks too wide? • Number: Is the number of TF ChIP-seq peaks reasonable? – Some key TFs bind ~30,000 sites but your TF probably binds far fewer (~1000?) • Location: Do your peaks co-occur with histone marks and genes that your TF regulates? – Examine some peaks using the UCSC genome browser and ENCODE histone tracks
Quality control 2: Fraction of Reads in Peaks (FRiP) • Only a fraction of reads typically fall within ChIP-seq peaks. • ENCODE guideline: FRiP > 1% • Caveat: A lower FRiP threshold may be appropriate if there are very few peaks. Landt S G et al. Genome Res. 2012;22:1813-1831
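FRiP is simple enough to compute directly. The sketch below assumes reads are reduced to a single `(chrom, position)` each and that peaks are non-overlapping half-open intervals `(chrom, start, end)`; the function name and the toy data are illustrative only.

```python
from bisect import bisect_right

def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, pos).
    peaks: list of (chrom, start, end), half-open [start, end),
           assumed non-overlapping within each chromosome.
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {chrom: [s for s, _ in iv] for chrom, iv in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak starting at or before pos, if any.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0

peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
reads = [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr1", 550),
         ("chr2", 10), ("chr2", 60), ("chr3", 5), ("chr1", 50)]

value = frip(reads, peaks)
print(value, value > 0.01)  # compare against the 1% guideline
```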
How many of my peaks are “real”? • Irreproducible Discovery Rate (IDR) compares the ranks of peaks from two biological replicates. – Rank peaks by significance (p-value or q-value) – Reproducible discoveries (peaks) should have similar ranks between replicates. • ENCODE: reports peaks at 1% IDR • https://sites.google.com/site/anshulkundaje/projects/idr
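The intuition that reproducible peaks keep similar ranks can be illustrated with a plain Spearman rank correlation. This is not the IDR statistic itself, which fits a statistical model to the paired ranks; it is only a rough stand-in for the rank-comparison idea. The scores are invented, are taken to be significance scores where higher means more significant, and are assumed to have no ties.

```python
def ranks(scores):
    """Rank scores from most significant (rank 1) to least; higher score = rank 1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(scores_a, scores_b):
    """Spearman rank correlation for matched peaks (assumes no tied scores)."""
    n = len(scores_a)
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for the same five peaks in three replicates.
rep1 = [50, 40, 30, 20, 10]
rep2 = [48, 35, 41, 18, 12]   # ranks mostly preserved
rep3 = [12, 18, 41, 35, 48]   # ranks mostly reversed

good = spearman(rep1, rep2)
bad = spearman(rep1, rep3)
print(round(good, 2), round(bad, 2))
```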
Quality control 3: IDR identifies failed ChIP-seq [Figure panels: High Reproducibility, Low Reproducibility] Landt S G et al. Genome Res. 2012;22:1813-1831
3) Motif Discovery & Enrichment Analysis • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif. • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.
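A rough way to see what “centrally enriched” means is to measure where motif matches fall relative to the peak centers. The sketch below uses exact string matching on one strand and a simple fraction, which is a deliberate simplification and not how a dedicated tool would score sites or assess significance. The motif string, sequences, and window size are made up for illustration.

```python
def site_offsets(sequences, motif):
    """Offset (bp) of each exact motif match's midpoint from its sequence midpoint."""
    offsets = []
    for seq in sequences:
        seq = seq.upper()
        start = seq.find(motif)
        while start != -1:
            offsets.append((start + len(motif) / 2) - len(seq) / 2)
            start = seq.find(motif, start + 1)
    return offsets

def central_fraction(offsets, half_width):
    """Fraction of sites lying within half_width bp of the sequence center."""
    if not offsets:
        return 0.0
    return sum(1 for o in offsets if abs(o) <= half_width) / len(offsets)

# Toy 20 bp "peak regions" (assumed); two have the site near the center.
peak_seqs = [
    "AAAAAAAATTGGCAAAAAAA",
    "CCCCCCCTTGGCCCCCCCCC",
    "TTGGCAAAAAAAAAAAAAAA",
]
offsets = site_offsets(peak_seqs, "TTGGC")
fraction = central_fraction(offsets, half_width=2)
print(offsets, round(fraction, 2))
```

If sites were spread uniformly along the regions, the central fraction would be close to the window's share of the region length; a much larger value suggests central enrichment.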
Caveats in ChIP-seq Motif Analysis • Peak regions may contain other TF motifs due to looping. • The binding of the ChIP-ed factor “X” may be indirect. • ChIP-ed motif might be weak due to assisted binding. Farnham, Nature Reviews Genetics, 2009
TF Binding Motif Discovery • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor. • In principle, discovering the motif is simple. • ChIP-seq peaks tend to be within +/- 50bp of the bound factor. • So we just examine the peak regions for enriched patterns.
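“Examining the peak regions for enriched patterns” can be illustrated with a naive k-mer comparison between peak sequences and background sequences. This is a toy sketch, not a real motif discovery algorithm: it ignores the reverse strand, uses exact words instead of a motif model, and applies no statistical test. All names, sequences, and the pseudocount are assumptions for the example.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(peak_seqs, background_seqs, k, pseudocount=1.0):
    """Rank k-mers by (fraction of peaks containing it) / (fraction of background)."""
    peak_counts = count_kmers(peak_seqs, k)
    bg_counts = count_kmers(background_seqs, k)
    scores = {}
    for kmer, n_peak in peak_counts.items():
        peak_frac = (n_peak + pseudocount) / (len(peak_seqs) + pseudocount)
        bg_frac = (bg_counts[kmer] + pseudocount) / (len(background_seqs) + pseudocount)
        scores[kmer] = peak_frac / bg_frac
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy sequences (assumed): every peak region contains the word TTGGCA.
peak_seqs = ["ACGTTGGCAT", "GGTTGGCACC", "TATTGGCAGA", "CCATTGGCAA"]
background_seqs = ["ACGTACGTAC", "GGGTTTCCCA", "TATATAGAGA", "CCAACCAAGG"]

ranked = enriched_kmers(peak_seqs, background_seqs, k=6)
top_kmer, top_score = ranked[0]
print(top_kmer, top_score)
```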
MEME Suite tools for ChIP-seq motif discovery and enrichment • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis. – Discovery & Enrichment: MEME-ChIP – Discovery: MEME, DREME, GLAM2 – Enrichment: CentriMo, AME
Example: Motif discovery in NFIC ChIP-seq data • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data. • They do not report using motif discovery on these peaks. • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions. Machanick & Bailey, Bioinformatics, 2011