Deep learning approaches to decode the human genome
Anshul Kundaje
Genetics, Computer Science
Stanford University
http://anshul.kundaje.net
The Human Genome Project
2003
~3 billion nucleotides
[Background: raw genome sequence, TGCCAAGCAGCAAAGTTTTGCTGCTG…]
Population sequencing to identify disease-associated genetic variants
Statistically significant association?
Oxford Nanopore technology
Decoding genome function
~3 billion nucleotides
Function?
[Background: raw genome sequence, TGCCAAGCAGCAAAGTTTTGCTGCTG…]
One genome, many cell types
http://www.roadmapepigenomics.org/
Biochemical markers of cell-type-specific functional elements
[Figure labels: repressed gene, active gene, control elements]
99% non-protein-coding
1.5% protein-coding
https://www.broadinstitute.org/news/1504
NIH-funded collaborative consortia
• 100s of cell types and tissues
• Identifying tissue-specific control elements
• Learning the sequence code of control elements
• Interpreting disease-associated genetic variation
Machine learning, probabilistic models, deep learning
A comprehensive functional annotation of the human genome
[Figure labels: active control elements, active genes, repressed elements]
• ~20,000 genes
• ~2 million novel putative control elements!
• cell-type-specific activity
2M control elements show highly modular tissue-specific activity
[Figure: activity matrix (active vs. inactive), 100s of tissues by 2M control elements]
• ~20,000 genes
• ~2 million novel putative control elements!
• modular tissue-specific activity!
Decoding DNA words and grammars that specify tissue-specific control elements
Regulatory proteins bind DNA words (landing pads) in control elements!
‘Motif discovery’
Learning discriminative DNA words from tissue-specific control element sequences
• Class = +1: sequences of control elements active in Tissue 1
• Class = -1: sequences of control elements NOT active in Tissue 1 but active in other tissues
• Training: input sequences (X) -> classification function F(X) -> output labels (Y)
• ‘Training’ means learning the function F(X) from multiple input-output pairs (X, Y)
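To make the training setup on this slide concrete, here is a minimal sketch of learning F(X) from labeled (X, Y) pairs. Everything in it is an assumption made for illustration: the toy sequences, the 3-mer count features, and the perceptron learner are stand-ins, not the method used in the talk.

```python
from itertools import product
import numpy as np

K = 3  # assumed word length for this toy example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_counts(seq):
    """Represent a sequence X as a vector of overlapping k-mer counts."""
    x = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        x[INDEX[seq[i:i + K]]] += 1
    return x

def train_perceptron(X, y, epochs=50):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified: nudge toward the label
                w += yi * xi
                b += yi
    return w, b

def F(seq, w, b):
    """Classification function: +1 (active in Tissue 1) or -1 (not active)."""
    return 1 if kmer_counts(seq) @ w + b > 0 else -1

# Hypothetical toy data: the +1 sequences all contain the word GATAA.
positives = ["ACGATAAGTC", "TTGATAACCA", "CAGATAAGGT"]
negatives = ["ACGCTCAGTC", "TTCCGGACCA", "CAGTTTCGGT"]

X = np.array([kmer_counts(s) for s in positives + negatives])
y = np.array([1] * len(positives) + [-1] * len(negatives))
w, b = train_perceptron(X, y)

# The largest weights point at the discriminative DNA words.
top_words = [KMERS[i] for i in np.argsort(w)[::-1][:3]]
print(top_words)
```

The weights learned by such a linear model can be read off directly as word importances, which is what makes the "discriminative DNA words" framing natural before moving to deeper models.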
Deep convolutional neural network (CNN) on DNA sequence inputs
• Binary output: is the sequence active in cell type 1? Active (1) vs. inactive (0)
• Later layers build on patterns of the previous layer
• Learned pattern detectors
• One-hot encoded input: DNA sequence represented as ones and zeros
[Figure: input sequence C G A T A A C C G A T A T above a matrix of ones and zeros with one row per base (A, C, G, T)]
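The one-hot input described on this slide can be sketched as follows. The row order A, C, G, T follows the slide; the function name and the NumPy layout (4 rows by sequence length) are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"  # row order, as labeled on the slide

def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix with a single 1 per column."""
    rows = np.array([BASES.index(base) for base in seq.upper()])
    mat = np.zeros((4, len(seq)), dtype=np.int8)
    mat[rows, np.arange(len(seq))] = 1
    return mat

x = one_hot("CGATAACCGATAT")
for base, row in zip(BASES, x):
    print(base, " ".join(str(v) for v in row))
```

Each column holds exactly one 1, so the matrix carries the same information as the string while giving the network a numeric input.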
Multi-task deep CNNs learn discriminative DNA word pattern detectors
• Multi-task learning: one output per cell type (is the sequence active in cell type 1? in cell type 2? in cell type 100?)
• Prediction accuracy: mean auROC = 0.82, mean auPRC = 0.65
• Deeper convolutional layers learn DNA word combinations (grammars)
• Convolutional layers score the sequence using filters; neurons learn DNA word pattern detectors
• Input: millions of sequences of control elements
Similar to Kelley et al. 2016 (Basset), Zhou et al. 2015 (DeepSEA)
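As a rough illustration of "score sequence using filters", the sketch below slides a single filter along a one-hot sequence, applies a ReLU, and max-pools. The filter here is set by hand to the word GATA and the bias of -3 is arbitrary; in a trained CNN both would be learned, and there would be many filters feeding one output per cell type.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a 4 x L float matrix."""
    mat = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq):
        mat[BASES.index(base), pos] = 1.0
    return mat

def scan(x, filt):
    """Convolve one 4 x k filter along a 4 x L input (no padding, stride 1)."""
    k = filt.shape[1]
    n_windows = x.shape[1] - k + 1
    return np.array([(filt * x[:, i:i + k]).sum() for i in range(n_windows)])

x = one_hot("CGATAACCGATAT")
gata_filter = one_hot("GATA")                # hypothetical hand-set filter
scores = scan(x, gata_filter)                # number of matching bases per window
activations = np.maximum(scores - 3.0, 0.0)  # ReLU with an assumed bias of -3
pooled = activations.max()                   # max-pool: is the word present at all?

print(scores)
print(pooled)
```

Only windows that match the word exactly survive the ReLU, which is the sense in which a first-layer neuron acts as a DNA word detector.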
How can we identify important parts of the input sequences?
In-silico mutagenesis
• inefficient
• misleading results due to saturation/buffering
[Figure: input sequence C G A T A A C C G A T A T with alternative bases substituted at each position; outputs: is the sequence active in cell type 1? in cell type 2?]
Alipanahi et al. 2015; Zhou & Troyanskaya 2015; Kelley et al. 2016
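Both drawbacks listed on this slide can be seen in a small sketch of in-silico mutagenesis. The "model" below is a hypothetical stand-in for a trained network (the best match to a hand-set GATA filter), and the two test sequences are made up for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    mat = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq):
        mat[BASES.index(base), pos] = 1.0
    return mat

def toy_model(x):
    """Stand-in for a trained network: best match to a GATA filter."""
    motif = one_hot("GATA")
    k = motif.shape[1]
    return max(float((motif * x[:, i:i + k]).sum())
               for i in range(x.shape[1] - k + 1))

def in_silico_mutagenesis(seq, model):
    """Change in model output for every single-base substitution (4 x L)."""
    ref_score = model(one_hot(seq))
    delta = np.zeros((4, len(seq)))
    for pos in range(len(seq)):
        for row, base in enumerate(BASES):
            if base == seq[pos]:
                continue  # not a mutation
            mutant = seq[:pos] + base + seq[pos + 1:]
            delta[row, pos] = model(one_hot(mutant)) - ref_score
    return delta  # costs 3 * L extra forward passes per sequence

# One copy of the word: mutating it lowers the score, so it looks important.
delta_one = in_silico_mutagenesis("CCGATACC", toy_model)
importance_one = -delta_one.min(axis=0)

# Two copies: each buffers the loss of the other, so no single mutation
# changes the output and every position looks unimportant.
delta_two = in_silico_mutagenesis("CGATAACCGATAT", toy_model)

print(importance_one)
print(np.abs(delta_two).max())
```

The second case is the buffering problem: the redundant copy keeps the output saturated, so single-base perturbations report zero importance for bases that clearly matter.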
Efficient “backpropagation”-based approaches
Gradient-based methods
• Saliency maps (Simonyan 2013)
• Deconv networks (Zeiler, Fergus 2013)
• Guided backprop (Springenberg 2014)
• Layerwise relevance propagation (Bach 2015)
• Integrated gradients (Sundararajan 2016)
DeepLIFT
Shrikumar et al. Learning Important Features Through Propagating Activation Differences
https://arxiv.org/abs/1704.02685
CODE: https://github.com/kundajelab/deeplift
[Figure: one-hot encoded input sequence (rows A, C, G, T) with outputs: is the sequence active in cell type 1? in cell type 2?]
Avanti Shrikumar, Peyton Greenside
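To show why a plain gradient can mislead on a saturated output, and how a reference-based method avoids this, here is a minimal sketch of integrated gradients (one of the methods listed above) on a toy two-input function. The function, the all-zero baseline, and the step count are assumptions for illustration; this is not the DeepLIFT algorithm and does not use the deeplift package.

```python
import numpy as np

def f(x):
    """Toy saturating output: grows with x[0] + x[1], then caps at 1."""
    return 1.0 - max(0.0, 1.0 - (x[0] + x[1]))

def grad_f(x):
    """Analytic gradient of f: 1 for each input below the cap, 0 once saturated."""
    g = 1.0 if (x[0] + x[1]) < 1.0 else 0.0
    return np.array([g, g])

def integrated_gradients(x, baseline, steps=100):
    """Average the gradient along the straight path from baseline to x
    (midpoint Riemann sum), then scale by the input difference."""
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = [grad_f(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(path_grads, axis=0)

x = np.array([1.0, 1.0])
baseline = np.array([0.0, 0.0])  # assumed reference input

saliency = grad_f(x)                             # plain gradient at the input
attributions = integrated_gradients(x, baseline)

print(saliency)
print(attributions)
print(f(x) - f(baseline))
```

At the saturated input the plain gradient is zero for both inputs, while the path-integrated attributions split the full change in output between them and sum to f(x) - f(baseline).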
DeepLIFT identifies combinatorial grammars of DNA words defining tissue-specific control elements!
Shrikumar et al. https://arxiv.org/abs/1704.02685
CODE: https://github.com/kundajelab/deeplift
Distinct combinations of DNA words can activate the same control element in different tissues
[Figure: DeepLIFT scores by position along the control element sequence, one track for blood stem cells and one for red blood cells; word labels: SPI1, Gata, Gata (Rc)]
Validation experiment results: SPI1 protein binding data, GATA1 protein binding data
Peyton Greenside
Decoding tissue-specific combinatorial grammars in millions of genomic control elements!
Peyton Greenside