Kipoi: model zoo for genomics

  1. Kipoi: model zoo for genomics. Žiga Avsec, PhD candidate, Technical University of Munich. www.gagneurlab.in.tum.de, @gagneurlab, @KipoiZoo, @Avsecz

  2. Genomics: ACGTGTCAGTAGTTAAGCTAGTAGCTGATCGGTAACGTAGTGCACGTGTCAGTAGTTAAGCTAGTAGCTGATC. 3 billion letters (x2) = 1 genome; 37 trillion cells

  3. Proteins = main building blocks (~100k - 1M). [Figure: Genome; Protein1, Protein2, Protein3]

  4. Proteins = main building blocks (~100k - 1M). [Figure: Genome; Protein1, Protein2, Protein3]

  5. Proteins = main building blocks (~100k - 1M). [Figure: Genome; Protein1, Protein2, Protein3; protein complex; Function1, Function2]

  6. How to make proteins from the genome?

  7. Gene expression: How information in DNA is read out
      - DNA (2): atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg. Gene (~20,000, 1% of DNA)
      - Transcription: precursor RNA aucaugauauggauacgcauagaucaugacuca (exons and introns)
      - Splicing: mature RNA aucaugauacauagaucaugacuca
      - Translation: protein (~100,000)

  8. Gene expression: How information in DNA is read out. Transcription factor and transcription factor binding site: a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg

  9. Gene expression: How information in DNA is read out. RNA polymerase: a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg

  10. Gene expression: How information in DNA is read out. a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg

  11. Gene expression: How information in DNA is read out. a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg

  12. Gene expression: How information in DNA is read out. a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg
      The regulatory elements control:
      - The position of transcription initiation (what)
      - The frequency of transcription (how much)

  13. Genetic variants can disrupt regulatory elements
      - Patient: atc[t]tatatatcatgatatggatacgcatagatcatgactcaggatacg
      - Reference: a[tcgta]tatatcatgatatggatacgcatagatcatgactcaggatacg

  14. Thousands of regulatory elements across all steps of gene expression
      - Transcription (DNA): cttatc a cagtg tatatcatgatatggatacgcatagatcatgactcaggatacg
      - Splicing (precursor RNA): aucaug auau gg auacg cauagaucaugacuca
      - RNA degradation and translation (mature RNA): aucaug auacauagaucau gacuca
      - Protein degradation

  15. Measuring the regulatory steps via sequencing: experimental data

  16. Learning the regulatory steps: experimental data, predictive models

  17. Learning the regulatory steps: experimental data, predictive models

  18. Detecting regulatory elements with convolutional neural networks. Motifs: GATA, TAL. Sequence: [cttatc]a[cagtg]tatatcatgatatggatacgcatagatcatgactcaggatacg
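
The idea on this slide can be sketched in a few lines of NumPy: a first-layer convolutional filter behaves like a motif detector that is slid along the one-hot encoded sequence. This is a minimal illustration, not code from the talk. The helper names `one_hot` and `scan` are hypothetical, and the exact-match "GATA" filter is a stand-in for weights that a real network would learn from data.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; non-ACGT bases become all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def scan(seq, motif):
    """Slide a one-hot motif filter along the sequence and return one score per position."""
    x, w = one_hot(seq), one_hot(motif)
    k = len(motif)
    return np.array([(x[i:i + k] * w).sum() for i in range(len(seq) - k + 1)])

# Sequence taken from the slide; the filter is an illustrative exact-match motif.
seq = "cttatcacagtgtatatcatgatatggatacgcatagatcatgactcaggatacg"
scores = scan(seq, "GATA")
hits = np.flatnonzero(scores == len("GATA"))  # positions where all four bases match
print(hits)
```

A trained network would use real-valued filter weights and a nonlinearity on top of these scores, so partial matches also contribute.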

  19. Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (in press)

  25. That’s why we need GPUs in regulatory genomics.

  26. Learning the regulatory steps: experimental data, predictive models

  27. List of published predictive models
      - Transcriptional regulation
        - TF binding: PWM scanning (Jaspar, Cis-BP, MEME), DeepBind, Improved DeepBind, FactorNet, GERV, DanQ, CKN-seq
        - Chromatin: DeepSEA, DeepChrome, Basenji
        - DNA methylation: CpGenie, DeepCpG
        - DNA accessibility: Basset
        - TSS: FIDDLE
        - Gene expression: Basenji, Expecto
      - Post-transcriptional regulation
        - RBP binding: iDeep, rbp_eclip (Avsec et al)
        - miRNA binding: TargetScan, deepMiRGene
        - Splicing: MaxEntScan 5’, 3’, Labranchor, HAL, MMSplice, SpliceAI
        - mRNA half-life: Cheng et al 2017
        - Polyadenylation: APARENT
        - Translation: Optimus_5Prime, Cuperus et al 2017
      See also: https://github.com/greenelab/deep-review

  28. List of published predictive models
      - Can we easily apply these models to new data?
      - Can we easily re-use these models?

  29. Lack of standardization leads to poor sharing and re-use of trained predictive models in genomics. Trained predictive models: code repository, paper supplements, author-maintained web page

  30. Lack of standardization leads to poor sharing and re-use of trained predictive models in genomics. Trained predictive models: code repository, paper supplements, author-maintained web page. Data. Bioinformatics software. ....

  31. The “Findable, Accessible, Interoperable, Reusable” (FAIR) principle: not only for data but also for trained predictive models

  32. Challenges
      - Making predictions end-to-end: predict <model> -i input.data -o output.data

  33. Challenges
      - Making predictions end-to-end: predict <model> -i input.data -o output.data
      - Data heterogeneity (think one paper = one dataset)

  34. Challenges
      - Making predictions end-to-end: predict <model> -i input.data -o output.data
      - Data heterogeneity (think one paper = one dataset)
      - Model heterogeneity (from deep learning frameworks to custom code)
      - Dependency issues

  35. Kipoi.org [Kípi], with A. Kundaje, Stanford, and O. Stegle, DKFZ. Avsec et al., Nature Biotechnology (in press)

  36. Trained model (model.yaml)

  37. Model: a “parameterized function” that maps input (e.g. CGTGAGTTT, TGATCGAGG, GTAGCTAGC) and parameters to output. Can be implemented using:

  38. Model: data-loader, model

  39. Data-loader (github.com/kipoi/kipoiseq). Inputs: intervals.bed (chr1 1000 2000; chr2 5000 7000) and genome.fa (>chr1 NNNNNNNNNNNN...). Steps: resize, extract, transform. Output passed to the model: array([[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [ … ]], [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [ … ]]])
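
The resize, extract and transform steps named on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the kipoiseq implementation: the functions `resize`, `extract` and `one_hot` are hypothetical helpers, the genome is a small in-memory dictionary rather than a FASTA file, and the target width of 8 is arbitrary.

```python
import numpy as np

ALPHABET = "ACGT"

def resize(start, end, width):
    """Resize an interval to a fixed width around its center (0-based, half-open)."""
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width

def extract(genome, chrom, start, end):
    """Pull the sequence for an interval out of an in-memory genome."""
    return genome[chrom][start:end]

def one_hot(seq):
    """Transform a DNA string into a (length, 4) array; N becomes an all-zero row."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

# Toy stand-ins for genome.fa and intervals.bed (the coordinates match the slide).
genome = {"chr1": "ACGT" * 1000, "chr2": "TTGCA" * 2000}
intervals = [("chr1", 1000, 2000), ("chr2", 5000, 7000)]

batch = np.stack([
    one_hot(extract(genome, chrom, *resize(start, end, 8)))
    for chrom, start, end in intervals
])
print(batch.shape)
```

Resizing every interval to the same width is what allows the sequences to be stacked into a single array that a model can consume as one batch.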

  40. Dependencies: data-loader, model

  41. Test predictions (example inputs: CGTGAGTTT, TGATCGAGG, GTAGCTAGC, ...)

  42. Schema: supports multiple inputs/outputs (TGATC, GAGGA, ...)

  43. General information

  44. Model repository

  47. Number of model groups: 20. Number of models: 2073

  48. Testing models
      - Install new conda environment
      - Test the pipeline: data -> dataloader -> model
      - Test that the predictions match

  49. Testing models
      - Install new conda environment
      - Test the pipeline: data -> dataloader -> model
      - Test that the predictions match
      When?
      - Pull-request
      - Nightly (all model groups)
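
The "predictions match" step can be pictured as a regression test: run the pipeline on stored example inputs and compare the result with stored expected predictions, within a numerical tolerance. The sketch below is an assumption about the general idea, not Kipoi's actual test code. `check_predictions` and `toy_model` are hypothetical names, and the tolerances are arbitrary.

```python
import numpy as np

def check_predictions(model_fn, example_inputs, expected, rtol=1e-5, atol=1e-6):
    """Return True if the model reproduces the stored predictions within tolerance."""
    preds = np.asarray(model_fn(example_inputs))
    return preds.shape == expected.shape and bool(
        np.allclose(preds, expected, rtol=rtol, atol=atol)
    )

def toy_model(x):
    """Hypothetical stand-in for a trained model: one score per input sequence."""
    return x.sum(axis=(1, 2))

rng = np.random.default_rng(0)
example_inputs = rng.random((4, 10, 4))   # 4 example sequences, length 10, 4 channels
expected = toy_model(example_inputs)      # stored alongside the model at contribution time

ok = check_predictions(toy_model, example_inputs, expected)
broken = check_predictions(lambda x: toy_model(x) + 0.1, example_inputs, expected)
print(ok, broken)
```

Comparing with a tolerance rather than exact equality matters because the same model can give slightly different floating-point results across hardware and library versions.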

  50. Using models

  51. For the impatient: 30-second introduction to Kipoi

  53. Case study 1: Benchmarking models

  54. Benchmarking alternative models

  55. Benchmarking alternative models

  56. Benchmarking alternative models

          # Create and activate a new conda environment with
          # all model dependencies installed
          kipoi env create <Model>
          source activate kipoi-<Model>

          # Run model prediction
          kipoi predict <Model> \
            --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
            -o '<Model>.preds.h5'

  57. Benchmarking alternative models. Works very nicely with workflow-management tools like Snakemake.

          # Create and activate a new conda environment with
          # all model dependencies installed
          kipoi env create <Model>
          source activate kipoi-<Model>

          # Run model prediction
          kipoi predict <Model> \
            --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
            -o '<Model>.preds.h5'

  58. Why not a single command? predict <model> -i input.data -o output.data

          # Create and activate a new conda environment with
          # all model dependencies installed
          kipoi env create <Model>
          source activate kipoi-<Model>

          # Run model prediction
          kipoi predict <Model> \
            --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
            -o '<Model>.preds.h5'

  59. Why not a single command? predict <model> -i input.data -o output.data

          # Run model prediction
          kipoi predict <Model> \
            --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
            -o '<Model>.preds.h5' \
            --singularity
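
For a benchmark over several models, the same command can be generated programmatically, for instance from a workflow rule. The sketch below only builds the command lines shown on the slides and does not run them. `predict_command` is a hypothetical helper and the model names are placeholders. It uses only the flags that appear in the slides (`--dataloader_args`, `-o`, `--singularity`).

```python
import json
import shlex

def predict_command(model, intervals_file, fasta_file, singularity=False):
    """Build the `kipoi predict` argument list for one model (the command is not executed)."""
    dataloader_args = json.dumps(
        {"intervals_file": intervals_file, "fasta_file": fasta_file}
    )
    cmd = [
        "kipoi", "predict", model,
        f"--dataloader_args={dataloader_args}",
        "-o", f"{model}.preds.h5",
    ]
    if singularity:
        cmd.append("--singularity")
    return cmd

# Placeholder model names; substitute the models being benchmarked.
models = ["<ModelA>", "<ModelB>"]
for model in models:
    print(shlex.join(predict_command(model, "intervals.bed", "hg38.fa", singularity=True)))
```

Serializing the data-loader arguments with `json.dumps` avoids the quoting mistakes that are easy to make when the JSON is typed by hand in a shell. If a model name contains a slash, the output path would need adjusting.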
