NGS Sequence Analysis for Regulation and Epigenomics Timothy Bailey - PowerPoint PPT Presentation

NGS Sequence Analysis for Regulation and Epigenomics Timothy Bailey Winter School in Mathematical and Computational Biology July 3, 2012

NGS Analysis and Transcriptional Regulation • RNA-seq – Measuring transcription levels (gene expression) – Detecting RNA regulators (e.g., miRNA) • ChIP-seq (and ChIP-exo) – Chromatin modifications – Binding of transcription factor proteins

Talk Overview I. Basic Transcriptional Regulation II. ChIP-seq and ChIP-exo III. Analyzing ChIP-seq & ChIP-exo data a) Mapping b) Peak calling c) Motif discovery & Enrichment Analysis d) Location analysis

Part I: Basic Transcriptional Regulation Source: Steven Chu

Transcription Factors • Mammalian transcription is controlled (in part) by about 1400 transcription factor (TF) proteins. • These proteins control transcription in two main ways: – Directly, by promoting (or preventing) the assembly of the pre-initiation complex. – Indirectly, by modifying the chromatin.

BASAL TRANSCRIPTION: • The pre-initiation complex assembles at the core promoter. • This results in only low levels of transcription because the interaction is unstable. [Diagram labels: DNA, TATA, INR, Core Promoter]

PROXIMAL PROMOTER: • The proximal promoter extends upstream of the promoter. • It contains binding sites for repressor and activator transcription factors. [Diagram labels: DNA, TATA, INR, Proximal Promoter]

ACTIVATORS: • Some transcription factors (“activators”) bind to sites in the proximal promoter. • This stabilizes the transcriptional machinery. • This increases transcription. [Diagram labels: DNA, TATA, INR, Proximal Promoter]

REPRESSORS: • Some factors do not stabilize the transcriptional machinery. • Their binding can block binding by co-factors and activators. • This reduces transcription. [Diagram labels: DNA, TATA, INR, Proximal Promoter]

ENHANCER REGIONS: • Groups of binding sites located upstream or downstream of a promoter. • Often very distant: 1000s of base pairs. [Diagram labels: DNA, TATA, INR, Enhancer Region (1-100 kb away), Proximal Promoter]

ENHANCER REGIONS: • Both activator and repressor transcription factors can occupy enhancer regions. • DNA looping brings factors into contact with transcriptional machinery. • Bound activators increase transcription. [Diagram labels: DNA, TATA, INR, Enhancer Region, Proximal Promoter]

Chromatin modification by TFs: • Histone Acetyltransferases (HATs) acetylate histones. • Tissue-specific transcription factors can bind to HATs, causing chromatin to open. • This can increase transcription. [Diagram labels: HAT, Specific, General, DNA, TATA, INR, Enhancer Region, Proximal Promoter]

Part II: ChIP-seq & ChIP-exo Source: Steven Chu

ChIP-seq

ChIP-exo Rhee and Pugh, Cell 2011.

ChIP-seq & ChIP-exo Rhee and Pugh, Cell 2011.

Part III: Analyzing ChIP-seq Data Source: Steven Chu

Analyzing TF ChIP-seq Data • Key messages of this talk: – Use controls! – Validate your data at each step. – But this is Science! What could possibly go wrong…?

Things that can go wrong in ChIP-seq… 1. Low affinity antibody 2. Non-specific antibody 3. Contamination 4. Poor choice of peak calling algorithm (or parameters) … etc.

Steps in ChIP-seq Data Analysis 1. Mapping: where do the sequence “tags” map to the genome? 2. Peak Calling: where are the regions of significant tag concentration? 3. Motif Discovery: what is the binding motif? 4. Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?

1) Mapping ChIP-seq Tags • Tags: ChIP-seq produces a pool of “tags” (~100bp) • Tag Count: measure of enrichment of a region • Negative Control: “input DNA” tag count Tallack et al, Genome Res., 2010
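The tag-count idea on this slide can be illustrated with a minimal sketch. It assumes tags have already been mapped and reduced to (chromosome, position) pairs; the function name `count_tags`, the 100-bp bin size, and the toy coordinates are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def count_tags(tag_starts, bin_size=100):
    """Count mapped tags per fixed-width genomic bin.

    tag_starts: iterable of (chrom, position) pairs (hypothetical input format).
    Returns a Counter keyed by (chrom, bin_index).
    """
    counts = Counter()
    for chrom, pos in tag_starts:
        counts[(chrom, pos // bin_size)] += 1
    return counts

# Toy data: ChIP tags and negative-control ("input DNA") tags.
chip_tags = [("chr1", 120), ("chr1", 150), ("chr1", 180), ("chr1", 950), ("chr2", 40)]
input_tags = [("chr1", 130), ("chr1", 900), ("chr2", 60)]

chip_counts = count_tags(chip_tags)
input_counts = count_tags(input_tags)

# Bin 1 of chr1 (positions 100-199) holds 3 ChIP tags but only 1 input tag.
print(chip_counts[("chr1", 1)], input_counts[("chr1", 1)])
```

Comparing the two counters bin by bin is what lets the input DNA act as a negative control for each region.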

2) ChIP-seq Peak Calling • ChIP-seq produces a pool of “tags”. • Tags are currently about 100 bp long. • A tag is the 5’ end of a DNA fragment. • But DNA is double-stranded so… Wilbanks and Facciotti, PLoS One, 2010

ChIP-seq Peak Calling • Peak callers combine overlapping tags to get the “peak height”. • Sometimes strand information is used to combine tags on opposite strands. • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
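The fold-enrichment criterion can be sketched as follows. This is a toy illustration, not any published peak caller: the pseudocount (added to avoid division by zero when the control has no tags), the threshold of 3.0, and the names `fold_enrichment` and `call_peaks` are all assumptions made for the example.

```python
def fold_enrichment(chip_count, control_count, pseudocount=1.0):
    """Ratio of ChIP tag count to control tag count.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    defined when the control count is zero.
    """
    return (chip_count + pseudocount) / (control_count + pseudocount)

def call_peaks(chip_counts, control_counts, min_fold=3.0):
    """Return (region, fold) pairs whose fold-enrichment meets the threshold."""
    peaks = []
    for region, chip in sorted(chip_counts.items()):
        fold = fold_enrichment(chip, control_counts.get(region, 0))
        if fold >= min_fold:
            peaks.append((region, fold))
    return peaks

# Toy tag counts per region (hypothetical region names).
chip = {"bin_a": 30, "bin_b": 4, "bin_c": 12}
control = {"bin_a": 4, "bin_b": 3}

peaks = call_peaks(chip, control)
for region, fold in peaks:
    print(region, round(fold, 2))
```

Real peak callers typically also normalize for sequencing depth and assess statistical significance, which this sketch leaves out.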

…ChIP-seq Peak Callers Wilbanks and Facciotti, PLoS One, 2010

Sanity check: are your peaks reasonable? • Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks. – Are your peaks too wide? • Number: Is the number of TF ChIP-seq peaks reasonable? – Some key TFs bind ~30,000 sites but your TF probably binds far fewer (~1000?) • Location: Do your peaks co-occur with histone marks and genes your TF regulates? • The next analysis steps will help you answer these questions!
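The width and number checks can be scripted in a few lines. The sketch below assumes peaks are available as (chrom, start, end) tuples with an exclusive end, as in BED-style coordinates; the function name `summarize_peaks` and the toy peaks are illustrative, and the 300-bp cutoff is taken from the slide.

```python
from statistics import median

def summarize_peaks(peaks, max_tf_width=300):
    """Summarize peak count and widths for a quick sanity check.

    peaks: list of (chrom, start, end) tuples, end exclusive (assumed format).
    """
    if not peaks:
        raise ValueError("no peaks to summarize")
    widths = [end - start for _, start, end in peaks]
    too_wide = sum(1 for w in widths if w >= max_tf_width)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths),
        "fraction_too_wide": too_wide / len(peaks),
    }

# Toy peak list: three narrow peaks and one suspiciously wide one.
peaks = [("chr1", 1000, 1180), ("chr1", 5000, 5220),
         ("chr2", 300, 1300), ("chr3", 40, 250)]

summary = summarize_peaks(peaks)
print(summary)
```

A large fraction of wide peaks, or a peak count far above what is plausible for the factor, would be a prompt to revisit the peak-calling step.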

3) Motif Discovery & Enrichment Analysis • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif. • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.

Caveats in ChIP-seq Motif Analysis • Peak regions may contain other TF motifs due to looping. • The binding of the ChIP-ed factor “X” may be indirect. • ChIP-ed motif might be weak due to assisted binding. Farnham, Nature Reviews Genetics, 2009

TF Binding Motif Discovery • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor. • In principle, discovering the motif is simple. • ChIP-seq peaks tend to be within +/- 50bp of the bound factor. • So we just examine the peak regions for enriched patterns.
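The "examine the peak regions for enriched patterns" idea can be sketched naively by cutting out +/- 50 bp around each peak summit and counting how many regions contain each k-mer. This is a toy stand-in, not the algorithm used by MEME, DREME, or any other tool named in the talk; the function names, the k-mer length of 6, and the synthetic sequences are all assumptions.

```python
from collections import Counter

def central_region(sequence, summit, flank=50):
    """Return the sequence within +/- flank bp of the peak summit."""
    start = max(0, summit - flank)
    end = min(len(sequence), summit + flank + 1)
    return sequence[start:end]

def count_kmers(regions, k=6):
    """Count, for each k-mer, the number of regions that contain it."""
    counts = Counter()
    for region in regions:
        seen = {region[i:i + k] for i in range(len(region) - k + 1)}
        counts.update(seen)
    return counts

# Synthetic peak sequences: two carry the toy pattern TTGGCA near the summit.
peak_seqs = [
    ("A" * 60 + "TTGGCA" + "A" * 60, 62),
    ("C" * 60 + "TTGGCA" + "C" * 60, 62),
    ("G" * 126, 62),
]

regions = [central_region(seq, summit) for seq, summit in peak_seqs]
kmer_counts = count_kmers(regions)
print(kmer_counts.most_common(1))
```

A real analysis would compare these counts against a background (for example shuffled or control sequences) before calling any pattern enriched.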

MEME Suite tools for ChIP-seq motif discovery and enrichment • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis. – Discovery & Enrichment: MEME-ChIP – Discovery: MEME, DREME, GLAM2 – Enrichment: CentriMo, AME

Example: Motif discovery in NFIC ChIP-seq data • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data. • They do not report using motif discovery on these peaks. • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions. Machanick & Bailey, Bioinformatics, 2011

Motif discovery fails in the (original) NFIC dataset • An NFIC motif is known from in vitro data, based on only 16 sites. • MEME and DREME fail to find this motif in the NFIC data. • But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.

The problem: poor peak calling! • We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000). • MEME discovers the NFI-family binding motif in this new set of peaks.

Central Motif Enrichment Analysis: CentriMo • CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions. • Use 500bp regions centered on each ChIP-seq peak. [Figure: 500-bp ChIP-seq regions (L=500) with a central window (W=120); S = number of “successes” = 4, T = number of “trials” = 5; “site-probability” curve of probability vs. position of best site, shown with the MA0119.1 motif logo.] Bailey et al, NAR 2012
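The slide's figure values (L=500, W=120, S=4 successes in T=5 trials) suggest a binomial view of central enrichment, which the sketch below works through. It is a simplification under stated assumptions: best sites are taken to be uniformly distributed under the null, so the chance of landing in the central window is W/L, and motif width and the choice of window size are ignored. The function names and the toy positions are hypothetical, and the published CentriMo method may handle these details differently.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def central_enrichment(best_site_positions, seq_len=500, window=120):
    """Count best sites in the central window and compute a binomial p-value.

    Assumes a uniform null, so a best site falls in the window with
    probability window / seq_len (a simplification of this sketch).
    """
    lo = seq_len // 2 - window // 2   # inclusive start of central window
    hi = lo + window                  # exclusive end of central window
    successes = sum(1 for pos in best_site_positions if lo <= pos < hi)
    p_null = window / seq_len
    return successes, binomial_tail(successes, len(best_site_positions), p_null)

# Toy data: position of the best motif site in each of 5 regions (0-based).
positions = [240, 255, 260, 231, 20]
successes, pvalue = central_enrichment(positions)
print(successes, len(positions), pvalue)
```

With only five regions the p-value is modest (about 0.013); with thousands of ChIP-seq regions the same proportion of central sites gives far smaller values.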

Central Motif Enrichment confirms the known NFIC motif, even in the original peaks • NFIC motif is most centrally enriched of 862 JASPAR+UniPROBE motifs (p = 10^-31). • However, standard motif enrichment algorithms (including AME) do not show the NFIC as the most enriched motif. [Figure: site-probability curves, probability (0 to 0.003) vs. position of best site in sequence (-250 to 250), with the NFIC (MA0119.1) motif logo. Legend: MA0119.1 p=2.4e-031, w=295, n=5409; MA0244.1 p=4.6e-015, w=381, n=39398; MA0161.1 p=7.3e-015, w=329, n=39356; MA0099.1 p=5.5e-014, w=343, n=34267; MA0406.1 p=8.1e-012, w=323, n=31383.]
