Computational Systems Biology: Deep Learning in the Life Sciences. 6.802 6.874 20.390 20.490 HST.506. David Gifford. Lecture 10, March 12, 2019. Histone Marks, Chromatin 3D Structure. http://mit6874.github.io
Goals for today • Chromatin marks and their models • Hidden Markov Model (HMM) • Deep learning model (DeepSEA) • Three-dimensional chromatin structure • Inferring it • Predicting it
1. Chromatin marks and biological state
Chromatin and Nucleosome Organization. Khorasanizadeh (2004). Green: H3; yellow: H4; red: H2A; pink: H2B; dark and light blue: DNA. Nucleosome DNA: 146 base pairs, wrapped 1.7 times in a left-handed superhelix. Proteins: two copies each of histones H2A, H2B, H3 and H4. Higher organisms have the linker histone H1. Histone variants • H3 variants: H3.3 (transcribed), CENP-A (centromeres) • H2A variants: H2A.X (DNA damage), macroH2A (X chromosome), H2A.Z (transcribed regions)
Chromatin organization has multiple structural layers and organizes chromatin into “domains”. Both DNA methylation and chromatin marks contain important functional information.
Histone Tail Modifications. Sims III et al., 2003
We can observe chromatin marks and other genome-associated proteins using ChIP-seq. Shown: H3K4me3 and RNA Pol II.
Detection of Class I (active) and Class II (poised) enhancers. a), b) hESC ChIP-seq read density profiles were generated for the indicated histone modifications centered on p300-bound regions in the top 1000 Class I and Class II enhancers, respectively. c) hESC Nanog ChIP-seq shows that Nanog binds at the three predicted Class II enhancer positions near the CDX2 gene.
2. Learning chromatin states
Can we find latent state to explain observed marks? Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248
Hidden Markov Models • Hidden state $x$ in [1 .. m]; for example, $m$ can be 15 • Emitted symbol $y$ can be multidimensional; for example, histone and accessibility data at genomic locus $t$ • One node every 200 bp down the genome • Parameters are $P(x_{t+1} \mid x_t)$ and $P(y_t \mid x_t)$
Hidden Markov Models can be used to create latent states that generate chromatin marks. Hidden Markov Model (ChromHMM) • Divide genome into 200 bp windows • Hidden state for a 200 bp window models what histone marks are present in the window • Unsupervised – resulting states must be interpreted with independent data • The number of states is fixed and is a modeling decision
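The windowed HMM described above can be sketched as a small decoder. This is a minimal illustration, not the ChromHMM implementation: it assumes each mark is binarized (present or absent) per 200 bp window and emitted independently given the state, and all names and toy parameter values are hypothetical.

```python
import math

def log_emission(obs, probs):
    """Log P(y_t | x_t) for binary marks, assuming independent Bernoulli emissions."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))

def viterbi(observations, start, trans, emit):
    """Most likely hidden-state path for a sequence of binary mark vectors.

    start[i]    : P(x_1 = i)
    trans[i][j] : P(x_{t+1} = j | x_t = i)
    emit[i][k]  : P(mark k present | state i)
    All probabilities must be strictly between 0 and 1 (logs are taken).
    """
    m = len(start)
    score = [math.log(start[i]) + log_emission(observations[0], emit[i])
             for i in range(m)]
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for j in range(m):
            # Best predecessor state for landing in state j at this window.
            best = max(range(m), key=lambda i: prev[i] + math.log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][j])
                         + log_emission(obs, emit[j]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(m), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical values): 2 states, 2 marks, 6 windows.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.8],    # state 0: both marks usually present
        [0.05, 0.1]]   # state 1: both marks usually absent
windows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 0), (0, 0)]
states = viterbi(windows, start, trans, emit)
```

The decoded `states` list gives one state label per window; as the slide notes, the labels themselves carry no biological meaning until they are interpreted with independent data.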
ChromHMM Model Parameter Visualization. Hoffman M M et al. Nucl. Acids Res. 2013;41:827-841. Emission parameters $P(y_t \mid x_t)$ and transition parameters $P(x_{t+1} \mid x_t)$.
ChromHMM segment-based chromatin states
Tissues and cell types profiled in the Roadmap Epigenomics Consortium. Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248
3. Predicting chromatin state from sequence
DeepSEA learns TF binding, accessibility, and chromatin marks • 125 DNase features, 690 TF features, 104 histone features • 690 TF binding profiles for 160 different TFs, 125 DHS profiles and 104 histone-mark profiles • 17% of genome • Three convolution layers with 320, 480 and 960 kernels • 1000 bp window • Chr 8 and 9 excluded
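To make the input and output shapes concrete, here is a rough sketch of how a DNA window can be one-hot encoded and scanned by a single convolution kernel. It is an illustration only: the kernel width, the kernel values, and the function names are hypothetical, and a real network learns hundreds of kernels per layer rather than using a hand-written one.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); other letters map to zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(x, kernel, bias=0.0):
    """Slide one kernel (width x 4) along the encoded sequence and apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += x[start + offset][channel] * kernel[offset][channel]
        out.append(max(0.0, total))
    return out

# Output size stated on the slide: 125 DNase + 690 TF + 104 histone features.
N_OUTPUTS = 125 + 690 + 104

# Hypothetical hand-written kernel that scores the motif "TATA".
tata_kernel = [[0, 0, 0, 1],   # T
               [1, 0, 0, 0],   # A
               [0, 0, 0, 1],   # T
               [1, 0, 0, 0]]   # A
window = "GGCTATAAGG"          # a real input window would be 1000 bp
activations = conv1d_relu(one_hot(window), tata_kernel)
best_position = max(range(len(activations)), key=activations.__getitem__)
```

Each kernel produces one activation per position, so stacking 320 kernels in the first layer yields a 320-channel feature map that the later layers pool and combine into the per-feature predictions.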
DeepSEA can predict differentially accessible regions based upon SNP value
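One way to turn sequence-based predictions into a variant effect is to score the reference and alternate alleles separately and compare the outputs. The sketch below assumes the model's per-feature probabilities are already available as lists; the two comparison scores (absolute difference and change in log odds) and all names are illustrative choices, not a confirmed description of the DeepSEA pipeline.

```python
import math

def apply_snp(seq, position, alt_base):
    """Return the sequence with the base at a 0-based position replaced."""
    return seq[:position] + alt_base + seq[position + 1:]

def logit(p, eps=1e-7):
    """Log odds of a probability, clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def variant_effect(p_ref, p_alt):
    """Per-feature effect scores given predicted probabilities for each allele."""
    return [
        {"abs_diff": abs(a - r), "log_odds_change": logit(a) - logit(r)}
        for r, a in zip(p_ref, p_alt)
    ]

# Hypothetical predictions for two chromatin features.
ref_seq = "ACGT"
alt_seq = apply_snp(ref_seq, 2, "T")
p_ref = [0.8, 0.1]   # model output on the reference allele (made-up values)
p_alt = [0.2, 0.1]   # model output on the alternate allele (made-up values)
scores = variant_effect(p_ref, p_alt)
```

A large negative `log_odds_change` for an accessibility feature would indicate that the alternate allele is predicted to close the region, while a score near zero indicates no predicted effect.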
An ensemble logistic regression classifier based on DeepSEA output can identify regulatory variants
4. Three-dimensional interactions
HiC, HiChIP, and ChIA-PET data reveal distal genome interactions
Enhancers regulate distal target genes by genome looping. Figure labels: Enhancer, Master Regulators, Mediator, Cohesin, Pol II, Gene
in situ HiC identifies proximal genomic contacts. Cell. 2014 Dec 18; 159(7): 1665–1680.
in situ HiC reveals interactions at 1–5 kb resolution
Observed intrachromosomal interaction distances fall off exponentially
ChIA-PET identifies protein mediated interactions and improves resolution for those events
ChIA-PET data are consistent with HiC data
ChIA-PET discovered enhancer linkages
Issues with ChIA-PET 1. High false negative rate. Libraries produced are not complex enough to permit further discovery by additional sequencing. 2. Specific to a protein (RNA Polymerase II in our example). 3. Hi-C and derivatives may solve these problems eventually.
HiChIP identifies protein mediated interactions
HiChIP is more sensitive than ChIA-PET
HiChIP and ChIA-PET interactions compared. Smc1a antibody (part of the cohesin complex)
XIST promoter interactions show more support from HiChIP than Hi-C
HiChIP (Smc1a) is more sensitive than HiC
5a. Discovering interactions: Anchor-based
Method 1: Discover anchors using ChIP-seq methods. Given anchors, what is the chance of observing an interaction by chance? $N$ total ends; $I_{a,b}$ interactions observed; $c_a$ ends; $c_b$ ends
What is the chance of observing an interaction by chance? $N$ total ends; $I_{a,b}$ interactions observed; $c_a$ ends; $c_b$ ends
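One common way to answer this question is a hypergeometric tail test. The sketch below is an assumption about how the counts fit together, not a formula taken from the slide: it treats $c_a$ and $c_b$ as the number of ends falling in anchors a and b, and asks how likely it is that at least $I_{a,b}$ of the $c_b$ ends would pair with anchor a if ends were joined at random. The function name is hypothetical.

```python
from math import comb

def interaction_pvalue(n_total, c_a, c_b, i_ab):
    """P(X >= i_ab) under a hypergeometric null.

    Assumed model: c_b ends are drawn without replacement from n_total ends,
    of which c_a belong to anchor a; X counts draws that hit anchor a.
    """
    denom = comb(n_total, c_b)
    upper = min(c_a, c_b)
    tail = sum(comb(c_a, i) * comb(n_total - c_a, c_b - i)
               for i in range(i_ab, upper + 1))
    return tail / denom

# Hypothetical counts: 1000 ends in total, 20 at anchor a, 30 at anchor b.
p_ten_links = interaction_pvalue(1000, 20, 30, 10)
p_any = interaction_pvalue(1000, 20, 30, 0)
```

With these made-up counts the expected number of chance links is 20 × 30 / 1000 = 0.6, so observing ten links gives a very small p-value, while observing zero or more is certain.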
Estimating total events from overlap. Imagine we perform two biological replicates of an experiment and obtain 1000 events in each, of which 900 are identical. We can use a hypergeometric model to infer how many possible events exist ($N$) given two sample sizes ($m$ and $n$) and an overlap ($k$): $P(k \mid N, m, n) = \binom{m}{k}\binom{N-m}{n-k} / \binom{N}{n}$. Using this model, we predict ~1100 total events.
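Approximate closed form solution for total number of events. The ML estimate of $N$ is approximately $\hat{N} \approx mn/k$. One way to see this is by using the normal approximation of the binomial approximation to the hypergeometric distribution: the overlap then has mean $mn/N$, and the likelihood is largest when the observed $k$ sits at that mean, giving $N \approx mn/k$. For two replicates of 1000 events with 900 shared, $1000 \times 1000 / 900 \approx 1111$.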
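The estimate can be checked numerically by brute force. This sketch assumes the hypergeometric overlap model $P(k \mid N, m, n) = \binom{m}{k}\binom{N-m}{n-k} / \binom{N}{n}$ and compares the exact maximizer with the capture-recapture style approximation $mn/k$; the function names and the search width are illustrative.

```python
from fractions import Fraction
from math import comb

def overlap_likelihood(N, m, n, k):
    """Exact hypergeometric probability of an overlap of k given N total events."""
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))

def ml_total_events(m, n, k, search_width=200):
    """Scan candidate values of N and return the one with the highest likelihood."""
    smallest = m + n - k   # N cannot be smaller than the number of distinct events seen
    return max(range(smallest, smallest + search_width),
               key=lambda N: overlap_likelihood(N, m, n, k))

# Two replicates of 1000 events each, 900 shared.
m, n, k = 1000, 1000, 900
n_hat = ml_total_events(m, n, k)
approx = m * n / k
```

Exact rational arithmetic avoids floating-point underflow for the very small probabilities involved; for larger problems, log-gamma based likelihoods would be the more practical choice.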
5b. Discovering interactions: Density-based
Method 2: CID uses density-based clustering to discover chromatin interactions. Nucleic Acids Research, 14 February 2019, gkz051, https://doi.org/10.1093/nar/gkz051 • Figure 1. CID uses density-based clustering to discover chromatin interactions. (A) ChIA-PET interactions can be discovered as groups of dense arcs connecting two genomic regions. Each arc is a PET. (B) The PETs plotted on a two-dimensional map using the genomic coordinates of the two reads. Each point is a PET. The colors represent the density values, defined as the number of PETs in the neighborhood. The red dashed square represents the size of the neighborhood. (C) The clustering decision graph. Each point is a PET. The points with high density and high delta values are selected as cluster centers. For simplicity, only large clusters are labelled. (D) The read pairs are assigned to the nearest cluster centers. The clusters are labeled as in (C). (E) The clusters are visualized as arcs. The clusters are labeled as in (C) and (D).
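The steps in the figure caption can be sketched as a small density-peak clustering routine. This is a simplified illustration and not the CID implementation: it assumes each PET is a 2D point (left read position, right read position), uses a square neighborhood for density, defines delta as the distance to the nearest point of higher density, and picks centers by the product of density and delta, which is a heuristic chosen here for brevity. All names and coordinates are hypothetical.

```python
import math

def density_peak_clusters(points, half_width, n_centers):
    """Cluster 2D points by selecting high-density, high-delta centers."""
    n = len(points)
    # Density: number of points (including itself) inside a square neighborhood.
    density = [
        sum(1 for q in points
            if abs(p[0] - q[0]) <= half_width and abs(p[1] - q[1]) <= half_width)
        for p in points
    ]
    # Visit points from densest to sparsest; ties keep their input order.
    order = sorted(range(n), key=lambda i: -density[i])
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor: use its largest distance.
            delta[i] = max(math.dist(points[i], q) for q in points)
        else:
            delta[i] = min(math.dist(points[i], points[j]) for j in order[:rank])
    # Centers: largest density * delta (heuristic assumption, see lead-in).
    centers = sorted(range(n), key=lambda i: -(density[i] * delta[i]))[:n_centers]
    # Assign every point to its nearest center.
    labels = [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels

# Toy PETs: two groups of read pairs linking different pairs of regions.
pets = [(100, 500), (101, 501), (99, 499), (100, 502), (102, 500),
        (300, 900), (301, 901), (299, 899), (300, 902)]
centers, labels = density_peak_clusters(pets, half_width=5, n_centers=2)
```

Each resulting cluster corresponds to one candidate interaction, and its PET count is what a downstream significance model would evaluate.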
Method 2: Density cluster interaction origins https://academic.oup.com/nar/advance-article/doi/10.1093/nar/g We use a three-component mixture model to describe conditional distribution of PET-count from all the PET clusters. One component represents true interaction PET cluster (TiPC), and the other two for random collision PET cluster (RcPC) and random ligation PET cluster (RlPC), respectively. TiPC and RcPC models included a,b distance between clusters https://academic.oup.com/bioinformatics/article/31/23/3
Cluster interaction origins
Jaccard coefficient – measure of set similarity
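For two sets A and B the Jaccard coefficient is |A ∩ B| / |A ∪ B|. A minimal sketch, assuming each interaction call is represented as a hashable item such as a pair of anchor identifiers (the identifiers below are made up):

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0   # convention assumed here: two empty sets count as identical
    return len(a & b) / len(union)

# Hypothetical interaction calls from two replicates.
rep1 = {("a1", "a2"), ("a1", "a3"), ("a4", "a5")}
rep2 = {("a1", "a2"), ("a4", "a5"), ("a6", "a7")}
similarity = jaccard(rep1, rep2)

# Two replicates of 1000 events each that share 900 events.
replicate_similarity = jaccard(range(1000), range(100, 1100))
```

A value of 1 means the two call sets are identical and 0 means they share nothing, which makes the coefficient a convenient reproducibility score between replicates.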
CID is more reproducible and sensitive
6. Predicting enhancer-promoter interactions
TargetFinder uses multiple data types to predict HiC interactions https://www.nature.com/articles/ng.3539
TargetFinder Training Data
TargetFinder – Ratio of the CTCF and RAD21 ChIP-seq signals occurring within interacting enhancers and non-interacting enhancers
TargetFinder – Enrichment of signals at transcription start sites (TSS). Dark – interacting; Light – non-interacting
TargetFinder – Performance. Features for enhancers and promoters only (E/P), extended enhancers and promoters (EE/P), and enhancers and promoters plus the windows bet
Deep learning network for predicting enhancer-promoter interactions