6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 6: Regulatory genomics Gene regulation, chromatin accessibility, DNA regulatory code Prof. Manolis Kellis Slides credit: 6.047, Anshul Kundaje, David Gifford http://mit6874.github.io
Deep Learning for Regulatory Genomics
1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role
– Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery
– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling
– Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations
– Key idea: pixels ↔ DNA letters. Patches/filters ↔ Motifs. Higher combinations
– Learning convolutional filters ↔ Motif discovery. Applying them ↔ Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures
– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact
– DeepSEA: Train model directly on mutational impact prediction
– Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs
– ChromPuter: Multi-task prediction of different TFs, reuse partner motifs
– DeepLIFT: Model interpretation based on neuron activation properties
– DanQ: Recurrent Neural Network for sequential data analysis
1a. Basics of gene regulation
One Genome – Many Cell Types • Image source: Wikipedia
DNA packaging • Why packaging – DNA is very long – Cell is very small • Compression – Chromosome is 50,000 times shorter than extended DNA • Using the DNA – Before a piece of DNA is used for anything, this compact structure must open locally • Now emerging: – Role of accessibility – State in chromatin itself – Role of 3D interactions
Combinations of marks encode epigenomic state • Enhancers: H3K4me1, H3K27ac, DNase • Promoters: H3K4me3, H3K9ac, DNase • Transcribed: H3K36me3, H3K79me2, H4K20me1 • Repressed: H3K9me3, H3K27me3, DNA methylation • Marks shown: H3K4me3, H3K4me1, H3K27ac, H3K36me3, H4K20me1, H3K79me3, H3K27me3, H3K9me3, H3K9ac, H3K18ac • 100s of known modifications, many new still emerging • Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
Summarize multiple marks into chromatin states • 30+ epigenomics marks • Chromatin state track summary (WashU Epigenome Browser) • ChromHMM: multivariate hidden Markov model
Transcription factors control activation of cell-type-specific promoters and enhancers • Figure labels: enhancer region, promoter region, protein-coding sequence
TFs use DNA-binding domains to recognize specific DNA sequences in the genome • “Logo” or “motif”: TAATTA, CACGTG, AGATAAGA • DNA-binding domain of Engrailed: TCATTA
Regulator structure ↔ recognized motifs • Proteins ‘feel’ DNA - Read chemical properties of bases - Do NOT open DNA (no base complementarity) • 3D topology dictates specificity - Fully constrained positions: every atom matters - “Ambiguous / degenerate” positions loosely contacted • Other types of recognition - MicroRNAs: complementarity - Nucleosomes: GC content - RNAs: structure / sequence combination
Motifs summarize TF sequence specificity • Summarize information • Integrate many positions • Measure of information • Distinguish motif vs. motif instance • Assumptions: - Independence - Fixed spacing
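A minimal sketch of how a motif can summarize a set of motif instances, under the slide's two assumptions (independent positions, fixed spacing). The instance list is hypothetical (it reuses TAATTA and TCATTA from the earlier slide plus made-up variants), and the function names, the pseudocount of 1, and the uniform background are illustrative choices, not something the lecture specifies.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(instances, pseudocount=1.0):
    """Per-position base frequencies from equal-length motif instances."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        # Fixed-spacing assumption: every instance must have the same length.
        raise ValueError("all instances must share one length")
    total = len(instances) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        # Independence assumption: each column is counted on its own.
        counts = Counter(s[i] for s in instances)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs

def information_content(column):
    """Bits at one position relative to a uniform background: 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

# Hypothetical motif instances (the motif itself is the summary below).
instances = ["TAATTA", "TCATTA", "TAATTA", "TAATTG"]
freqs = position_frequencies(instances)
ic = [information_content(col) for col in freqs]

for pos, (col, bits) in enumerate(zip(freqs, ic)):
    best = max(col, key=col.get)
    print(f"position {pos}: most frequent base {best}, {bits:.2f} bits")
```

Positions where every instance agrees carry more bits than positions where instances differ, which is what the letter heights in a sequence logo depict.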
Regulatory motifs at all levels of pre-/post-transcriptional regulation • Enhancer regions (where in the body?), promoter motifs (when in time?), splicing signals (which variants?), motifs at RNA level (which subsets?) • The parts list: ~20-30k genes - Protein-coding genes, RNA genes (tRNA, microRNA, snRNA) • The circuitry: constructs controlling gene usage - Enhancers, promoters, splicing, post-transcriptional motifs • The regulatory code, complications: - Combinatorial coding of ‘unique tags’ - Data-centric encoding of addresses - Overlaid with ‘memory’ marks - Large-scale on/off states - Modulation of the large-scale coding - Post-transcriptional and post-translational information • Today: discovering motifs in co-regulated promoters and de novo motif discovery & target identification
Disrupted motif at the heart of FTO obesity locus • Strongest association with obesity • C-to-T disruption of AT-rich regulatory motif • Lean vs. obese • Restoring motif restores thermogenesis
1b. Technologies for probing gene regulation
Mapping regulator binding: ChIP-seq (chromatin immunoprecipitation followed by sequencing) • TF = transcription factor • Antibody • Bar-coded multiplexed sequencing
ChIP-chip and ChIP-Seq technology overview • Modification-specific antibodies • Chromatin Immuno-Precipitation followed by: - ChIP-chip: array hybridization - ChIP-Seq: massively parallel next-gen sequencing • Image adapted from Wikipedia
ChIP-Seq Histone Modifications: What the raw data looks like • Each sequence tag is 30 base pairs long • Tags are mapped to unique positions in the ~3 billion base reference genome • Number of reads depends on sequencing depth. Typically on the order of 10 million mapped reads.
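A back-of-envelope sketch of what the numbers on this slide imply for average coverage. It assumes reads are spread uniformly over the genome, which ChIP enrichment deliberately violates, so the result is only a baseline against which local pile-ups stand out. The variable names are illustrative.

```python
# Figures taken from the slide.
read_length_bp = 30
mapped_reads = 10_000_000
genome_size_bp = 3_000_000_000

# Total sequenced bases divided by genome size gives mean fold-coverage.
total_bases = read_length_bp * mapped_reads
mean_coverage = total_bases / genome_size_bp

# Expected number of read starts per 1 kb window under uniform placement.
reads_per_kb = mapped_reads / (genome_size_bp / 1000)

print(f"mean coverage: {mean_coverage:.2f}x")
print(f"expected reads per kb: {reads_per_kb:.2f}")
```

At this depth the genome-wide average is far below 1x, so the signal in a ChIP-seq track comes from regions where reads accumulate well above this uniform expectation.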
Chromatin accessibility can reveal TF binding • Sherwood, RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape” Nat. Biotech 2014.
DNase-seq reveals genome protection profiles
ATAC-seq
ATAC-seq and DNase-seq are not identical • GM12878, Chr. 14; each point is accessibility in a 2 kb window • Hashimoto TB, et al. “A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility” Genome Research 2016
DNase-seq is less defined evidence than ChIP-seq • ChIP-seq reports TF-binding regions (specifically) • DNase-seq reports proximal TF non-binding locations (noisily)
Bound factors leave distinct DNase-seq profiles • Figure panels: Esrrb, Zfx, CTCF, Oct4, Brg motif • Individual binding site prediction is difficult • Figure compares individual CTCF vs. aggregate CTCF profiles
Motifs can predict TF binding • Binding sites change across time • ~50,000 binding sites for a typical TF • ~650,000 TF motifs
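One common way to turn a motif into candidate binding sites is to slide it along a sequence and score each window with a log-odds ratio against a background model. The sketch below illustrates that idea only; the frequency matrix (0.85 for the preferred base at each position of a TAATTA-like motif), the uniform background, the threshold, and the toy sequence are all hypothetical. Sequence matches of this kind are candidates, not confirmed binding events.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25  # assumption: uniform background base frequency

def column(preferred, p=0.85):
    """Hypothetical motif column: probability p on one base, rest shared."""
    rest = (1.0 - p) / 3
    return {b: (p if b == preferred else rest) for b in BASES}

# Hypothetical frequency matrix for a TAATTA-like motif (one dict per position).
MOTIF = [column(b) for b in "TAATTA"]

def log_odds(window):
    """Sum of per-position log2(motif probability / background probability)."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, window))

def scan(sequence, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(MOTIF)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds(sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy sequence containing TAATTA and the one-mismatch variant TCATTA.
sequence = "GGCTAATTACCGTCATTAGG"
hits = scan(sequence, threshold=5.0)
for start, score in hits:
    print(start, sequence[start:start + len(MOTIF)], round(score, 2))
```

Lowering the threshold admits more degenerate matches, which is one reason motif scans alone tend to report many more sites than are observed to be bound.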
Chromatin accessibility influences transcription factor binding • Modeling accessibility profiles yields binding predictions and pioneer factor discovery • Asymmetric accessibility is induced by directional pioneers • The binding of settler factors can be enabled by proximal pioneer factor binding • Sherwood, RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape” Nat. Biotech 2014.
2. Classical regulatory genomics (before Deep Learning)
Enrichment-based discovery methods Given a set of co-regulated/functionally related genes, find common motifs in their promoter regions • Align the promoters to each other using local alignment • Use expert knowledge for what motifs should look like • Find ‘median’ string by enumeration (motif/sample driven) • Start with conserved blocks in the upstream regions
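The "median string by enumeration" bullet can be sketched as follows: try every possible k-mer, measure how far it is from its best match in each promoter, and keep the k-mer with the smallest total distance. This is a toy illustration under stated assumptions; the promoter sequences are hypothetical (each has the CACGTG motif planted in it), and the function names are illustrative. Enumeration costs grow as 4^k, so this is only practical for short motifs.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(kmer, sequence):
    """Distance from kmer to its closest window in one sequence."""
    k = len(kmer)
    return min(hamming(kmer, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))

def median_string(sequences, k):
    """Enumerate all 4^k k-mers; return the one with minimal total distance."""
    best_kmer, best_total = None, float("inf")
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        total = sum(min_distance(kmer, s) for s in sequences)
        if total < best_total:
            best_kmer, best_total = kmer, total
    return best_kmer, best_total

# Hypothetical toy promoters, each containing a planted CACGTG.
promoters = ["GGCACGTGAT", "TTCACGTGCC", "ACACGTGTTA"]
kmer, total = median_string(promoters, k=6)
print(kmer, total)
```

Expectation Maximization and Gibbs Sampling, listed in the outline, address the same problem without enumerating every candidate, trading exhaustiveness for scalability.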