decode the human genome

decode the human genome - PowerPoint PPT Presentation


  1. Deep learning approaches to decode the human genome. Anshul Kundaje, Genetics, Computer Science, Stanford University. http://anshul.kundaje.net [Slide background: a block of raw DNA sequence.]

  2. The Human Genome Project (2003): ~3 billion nucleotides. [Slide background: a block of raw DNA sequence.]

  3. Population sequencing to identify disease-associated genetic variants. Statistically significant association? Oxford Nanopore technology.

  4. Decoding genome function. ~3 billion nucleotides: function? [Slide background: a block of raw DNA sequence.]

  5. One genome → many cell types. http://www.roadmapepigenomics.org/ [Slide background: a block of raw DNA sequence.]

  6. Biochemical markers of cell-type specific functional elements. Figure labels: repressed gene, active gene, control elements. 99 % non-protein coding, 1.5 % protein coding. https://www.broadinstitute.org/news/1504

  7. NIH funded collaborative consortia: 100s of cell types and tissues. Workflow: 100s of cell types/tissues → identifying tissue-specific control elements → learning the sequence code of control elements (machine learning, probabilistic models, deep learning) → interpreting disease-associated genetic variation.

  8. A comprehensive functional annotation of the human genome • ~20,000 genes • ~2 million novel putative control elements! • cell-type specific activity. Figure labels: active control elements, active genes, repressed elements.

  9. 2M control elements show highly modular tissue-specific activity • ~20,000 genes • ~2 million novel putative control elements! • modular tissue-specific activity! Figure: activity matrix (active vs. inactive) of 2M control elements across 100s of tissues.

  10. Decoding DNA words and grammars that specify tissue-specific control elements. Regulatory proteins bind DNA words (landing pads) in control elements! 'Motif Discovery'

  11. Learning discriminative DNA words from tissue-specific control element sequences. Class = +1: sequences of control elements active in Tissue 1. Class = -1: sequences of control elements NOT active in Tissue 1 but active in other tissues. Training: input sequences (X) → classification function F(X) → output labels (Y). 'Training' means learning the function F(X) from multiple input, output pairs (X,Y).
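
The training setup on this slide can be made concrete with a very small stand-in for F(X). The sketch below is not the method used in the talk: it assumes 3-letter DNA words as features and a plain perceptron as the classifier, and the six toy sequences are invented for illustration (positives contain the word GATA, negatives do not).

```python
from itertools import product

K = 3  # word length; an illustrative choice, not taken from the slides
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def featurize(seq):
    """Count every overlapping K-letter DNA word in seq."""
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        word = seq[i:i + K]
        if word in INDEX:  # words containing other letters (e.g. N) are skipped
            counts[INDEX[word]] += 1.0
    return counts


def score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(seqs, labels, epochs=100):
    """Learn weights w and bias b from (X, Y) pairs with labels +1 / -1."""
    w, b = [0.0] * len(KMERS), 0.0
    for _ in range(epochs):
        mistakes = 0
        for seq, y in zip(seqs, labels):
            x = featurize(seq)
            if y * score(x, w, b) <= 0:  # misclassified: move towards the label
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training pair is classified correctly
            break
    return w, b


def predict(seq, w, b):
    """The learned classification function F(X)."""
    return 1 if score(featurize(seq), w, b) > 0 else -1


# Hypothetical toy data: class +1 = "active in Tissue 1", class -1 = not.
train_seqs = ["CCGATAACC", "TTGATAAGG", "ACGATATCA",
              "CCGCTAACC", "TTGGTAAGG", "ACGCGATCA"]
train_labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(train_seqs, train_labels)
print([predict(s, w, b) for s in train_seqs])
```

Words that receive large positive weights are the ones that discriminate class +1 from class -1, which is the sense in which the classifier "learns discriminative DNA words".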

  12. Deep convolutional neural network (CNN) on DNA sequence inputs. Binary output: is seq. active in cell type 1? Active (1) vs Inactive (0). Later layers build on patterns of the previous layer. Learned pattern detectors. One-hot encoded input: DNA sequence represented as ones and zeros. Figure: a 4-row matrix of ones and zeros (rows A, C, G, T, one column per position) drawn under the example sequence C G A T A A C C G A T A T.
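
As a minimal sketch of the one-hot encoding named on this slide: each position becomes a column with a single 1 in the row of its base. The row order A, C, G, T follows the slide; the function name and the handling of unknown letters (an all-zero column) are assumptions made here.

```python
BASES = "ACGT"  # row order of the encoded matrix


def one_hot(seq):
    """Return a 4 x len(seq) matrix (rows A, C, G, T) of ones and zeros.

    Letters outside A/C/G/T (for example N) give an all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == row_base else 0 for base in seq] for row_base in BASES]


matrix = one_hot("CGATAACCGATAT")
for row_base, row in zip(BASES, matrix):
    print(row_base, " ".join(str(v) for v in row))
```

A matrix like this is the input format for the convolutional layers: each filter in the first layer spans a few adjacent columns.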

  13. Multi-task deep CNNs learn discriminative DNA word pattern detectors. Multi-task learning: is seq. active in cell type 1? In cell type 2? ... In cell type 100? Convolutional layers: neurons learn DNA word pattern detectors and score the sequence using filters. Deeper conv. layers learn DNA word combinations (grammars). Input: millions of sequences of control elements. Prediction accuracy: mean auROC = 0.82, mean auPRC = 0.65. Similar to Kelley et al. 2016 (Basset), Zhou et al. 2015 (DeepSEA).
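
To illustrate what "score sequence using filters" means, here is a rough sketch of a single convolutional filter sliding along a sequence. In a trained CNN the filter weights are learned from data; in this sketch they are set by hand to detect the word GATA (+1 for a matching base, -1 otherwise), and the helper names are hypothetical.

```python
BASES = "ACGT"


def make_filter(word, match=1.0, mismatch=-1.0):
    """Hand-built filter: one column of 4 weights (A, C, G, T) per position."""
    return [[match if base == ch else mismatch for base in BASES] for ch in word]


def conv_scan(seq, filt, bias=0.0):
    """Slide the filter along seq and return the ReLU activation at each offset."""
    width = len(filt)
    activations = []
    for start in range(len(seq) - width + 1):
        total = bias
        for offset in range(width):
            base = seq[start + offset]
            if base in BASES:  # unknown letters contribute nothing
                total += filt[offset][BASES.index(base)]
        activations.append(max(0.0, total))  # ReLU
    return activations


gata = make_filter("GATA")
acts = conv_scan("CCGATAACC", gata)
print(acts)       # one activation per window
print(max(acts))  # global max pooling: "is the word present anywhere?"
```

The filter fires only in the window that contains the word. Stacking further layers on top of many such activation tracks is how deeper layers can respond to combinations of words.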

  14. How can we identify important parts of the input sequences? Is seq. active in cell type 1? In cell type 2? In-silico mutagenesis • inefficient • misleading results due to saturation/buffering. Figure: the input sequence with each position mutated to every alternative base. Alipanahi et al. 2015; Zhou & Troyanskaya 2015; Kelley et al. 2016.
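
Both drawbacks listed on this slide show up in a small sketch of in-silico mutagenesis. The model below is a made-up stand-in for a trained network: it outputs 1 as soon as one copy of the word GATA is present, so it saturates. The function names and sequences are illustrative assumptions.

```python
BASES = "ACGT"


def toy_model(seq):
    """Stand-in for a trained network; saturates once one GATA is present."""
    return 1.0 if "GATA" in seq else 0.0


def in_silico_mutagenesis(seq, model):
    """Change in model output for every single-base substitution.

    Needs one extra model call per mutation, i.e. 3 * len(seq) calls.
    """
    reference_output = model(seq)
    effects = []
    for pos, original in enumerate(seq):
        for alt in BASES:
            if alt == original:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects.append((pos, alt, model(mutant) - reference_output))
    return effects


def important_positions(seq, model):
    """Positions where at least one substitution changes the output."""
    return sorted({pos for pos, _, delta in in_silico_mutagenesis(seq, model)
                   if delta != 0.0})


print(important_positions("CCGATACC", toy_model))    # one copy of the word
print(important_positions("GATACCGATA", toy_model))  # two redundant copies
```

With a single copy of the word, the four positions of GATA are flagged. With two copies, no single mutation changes the output because the other copy buffers it, so every position looks unimportant even though both words matter.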

  15. Efficient “Backpropagation” based approaches. Is seq. active in cell type 1? In cell type 2? Gradient based methods: • Saliency maps (Simonyan 2013) • Deconv networks (Zeiler, Fergus 2013) • Guided backprop (Springenberg 2014) • Layerwise relevance propagation (Bach 2015) • Integrated gradients (Sundararajan 2016). DeepLIFT: Shrikumar et al., Learning Important Features Through Propagating Activation Differences, https://arxiv.org/abs/1704.02685 CODE: https://github.com/kundajelab/deeplift Figure: an example DNA sequence above its one-hot encoded matrix (rows A, C, G, T). Avanti Shrikumar, Peyton Greenside.
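
The difference between a plain gradient and a difference-from-reference score can be worked through by hand on a tiny saturating network, y = 1 - ReLU(1 - i1 - i2). The sketch below is a toy calculation in the spirit of DeepLIFT's activation-difference idea; it does not use the deeplift package or its API, and the all-zero reference input is an assumption chosen for illustration.

```python
def relu(x):
    return max(0.0, x)


def model(i1, i2):
    """Toy saturating network: y = 1 - ReLU(1 - i1 - i2)."""
    return 1.0 - relu(1.0 - i1 - i2)


def gradient(i1, i2):
    """Analytic gradient of y with respect to i1 and i2 (identical for both)."""
    x = 1.0 - i1 - i2
    drelu = 1.0 if x > 0 else 0.0
    # dy/di = (dy/dh) * (dh/dx) * (dx/di) = (-1) * drelu * (-1)
    g = (-1.0) * drelu * (-1.0)
    return g, g


def reference_contributions(inputs, reference):
    """Split y(inputs) - y(reference) across the inputs.

    The ReLU multiplier is the ratio of output difference to input
    difference (delta_h / delta_x) rather than the local slope.
    """
    x, x_ref = 1.0 - sum(inputs), 1.0 - sum(reference)
    delta_x = x - x_ref
    delta_h = relu(x) - relu(x_ref)
    m_relu = delta_h / delta_x if delta_x != 0 else 0.0
    # The input weight into x is -1, and y = 1 - h gives a multiplier of -1.
    return tuple((-1.0) * m_relu * (-1.0) * (i - r)
                 for i, r in zip(inputs, reference))


print(gradient(1.0, 1.0))                              # saturated region
print(reference_contributions((1.0, 1.0), (0.0, 0.0)))  # credit vs reference
```

At the input (1, 1) the ReLU is switched off, so the gradient with respect to both inputs is zero. The reference-based contributions still add up to the full output difference between the input and the reference, divided evenly between the two inputs, and a single backward-style pass is enough to obtain them.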

  16. DeepLIFT identifies combinatorial grammars of DNA words defining tissue-specific control elements! Shrikumar et al. https://arxiv.org/abs/1704.02685 CODE: https://github.com/kundajelab/deeplift

  17. Distinct combinations of DNA words can activate the same control element in different tissues. Figure: for one control element sequence, DeepLIFT scores by position along the sequence in two tissues (blood stem cells and red blood cells), with motif labels SPI1, Gata and Gata (Rc). Validation experiment results: SPI1 protein binding data, GATA1 protein binding data. Peyton Greenside.

  18. Decoding tissue-specific combinatorial grammars in millions of genomic control elements! Peyton Greenside.
