Kipoi: model zoo for genomics. Žiga Avsec, PhD candidate, Technical University of Munich. www.gagneurlab.in.tum.de @gagneurlab, @KipoiZoo, @Avsecz
Genomics
ACGTGTCAGTAGTTAAGCTAGTAGCTGATCGGTAACGTAGTGCACGTGTCAGTAGTTAAGCTAGTAGCTGATC
3 billion letters (x2) = 1 genome
37 trillion cells
Proteins = main building blocks (~100k - 1M)
[Figure: the genome gives rise to proteins (Protein1, Protein2, Protein3); proteins assemble into protein complexes that carry out functions (Function1, Function2)]
How to make proteins from the genome?
Gene expression: How information in DNA is read out
DNA (2): atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg, containing a gene (~20,000 genes, 1% of DNA)
-> Transcription -> precursor RNA (exons and introns): aucaugauauggauacgcauagaucaugacuca
-> Splicing -> mature RNA: aucaugauacauagaucaugacuca
-> Translation -> Protein (~100,000)
[Figure: a transcription factor at its transcription factor binding site, and RNA polymerase, on the DNA sequence a tcgta tatatcatgatatggatacgcatagatcatgactcaggatacg]
The regulatory elements control:
● The position of transcription initiation (what)
● The frequency of transcription (how much)
Genetic variants can disrupt regulatory elements
Patient:   atc t tatatatcatgatatggatacgcatagatcatgactcaggatacg
Reference: a tcgta tatatcatgatatggatacgcatagatcatgactcaggatacg
Thousands of regulatory elements across all steps of gene expression: Transcription, Splicing, RNA degradation, Translation, Protein degradation
[Figure: regulatory elements highlighted along the DNA (cttatc a cagtg tatatcatgatatggatacgcatagatcatgactcaggatacg), the precursor RNA (aucaug auau gg auacg cauagaucaugacuca), and the mature RNA (aucaug auacauagaucau gacuca); Ø marks degradation]
Measuring the regulatory steps via sequencing: Experimental data
Learning the regulatory steps: Experimental data -> Predictive models
Detecting regulatory elements with convolutional neural networks
[Figure: GATA and TAL motifs detected in the sequence cttatc a cagtg tatatcatgatatggatacgcatagatcatgactcaggatacg]
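A convolutional filter can be read as a motif detector that is slid along the one-hot encoded sequence. The following is a minimal sketch of that idea, assuming a hand-set filter that matches the leading `cttatc` segment of the slide's sequence exactly. A trained network would learn its filter weights from data; the function names and the perfect-match threshold are illustrative and not taken from any published model.

```python
import numpy as np

ALPHABET = "acgt"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown letters become all zeros."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.lower()):
        j = ALPHABET.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def scan(seq, motif):
    """Slide a one-hot filter for `motif` along `seq` (1D convolution, no padding)."""
    x = one_hot(seq)
    w = one_hot(motif)  # hand-set filter weights; a CNN would learn these
    k = len(motif)
    return np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])

# Sequence from the slide with the highlighting spaces removed
seq = "cttatcacagtgtatatcatgatatggatacgcatagatcatgactcaggatacg"
motif = "cttatc"

scores = scan(seq, motif)
# Positions where every base of the filter matches
hits = np.flatnonzero(scores == len(motif))
print(hits)
```

Each entry of `scores` counts how many bases agree with the filter at that offset, so a real model's learned, real-valued filter generalizes this exact-match count to a weighted match score.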
[Figure] Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (in press)
That’s why we need GPUs in regulatory genomics.
Learning the regulatory steps: Experimental data -> Predictive models
List of published predictive models
- Transcriptional regulation
  - TF Binding
    - PWM Scanning (Jaspar, Cis-BP, MEME)
    - DeepBind
    - Improved DeepBind
    - FactorNet
    - GERV
    - DanQ
    - CKN-seq
  - Chromatin
    - DeepSEA
    - DeepChrome
    - Basenji
  - DNA methylation
    - CpGenie
    - DeepCpG
  - DNA Accessibility
    - Basset
  - TSS
    - FIDDLE
  - Gene-Expression
    - Basenji
    - Expecto
- Post-transcriptional regulation
  - RBP binding
    - iDeep
    - rbp_eclip (Avsec et al)
  - miRNA binding
    - TargetScan
    - deepMiRGene
  - Splicing
    - MaxEntScan 5’, 3’
    - Labranchor
    - HAL
    - MMSplice
    - SpliceAI
  - mRNA half-life
    - Cheng et al 2017
  - Polyadenylation
    - APARENT
  - Translation
    - Optimus_5Prime
    - Cuperus et al 2017
See also: https://github.com/greenelab/deep-review
Can we easily apply these models to new data?
Can we easily re-use these models?
Lack of standardization leads to poor sharing and re-use of trained predictive models in genomics
[Figure: trained predictive models are shared via code repositories, paper supplements, and author-maintained web pages; shown alongside data and bioinformatics software]
The “Findable, Accessible, Interoperable, Reusable” (FAIR) principle: not only for data but also for trained predictive models
Challenges
- Making predictions end-to-end
  - predict <model> -i input.data -o output.data
- Data heterogeneity (think one paper = one dataset)
- Model heterogeneity (from deep learning frameworks to custom code)
- Dependency issues
Kipoi.org [Kípi], with A. Kundaje, Stanford, and O. Stegle, DKFZ. Avsec et al., Nature Biotechnology (in press)
Trained model (model.yaml)
Model (data-loader -> model): a “parameterized function” that maps input (e.g. CGTGAGTTT, TGATCGAGG, GTAGCTAGC) to output using its parameters. Can be implemented using: [figure]
Data-loader (github.com/kipoi/kipoiseq)
Inputs:
  intervals.bed
    chr1 1000 2000
    chr2 5000 7000
  genome.fa
    >chr1
    NNNNNNNNNNNN...
Steps: resize -> extract -> transform
Output (data-loader -> model):
  array([[[1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0],
          [1, 0, 0, 0],
          [ … ]],
         [[0, 1, 0, 0],
          [0, 0, 1, 0],
          [1, 0, 0, 0],
          [0, 0, 0, 1],
          [ … ]]])
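The resize, extract, and transform steps of such a data-loader can be sketched in a few lines. This is a simplified illustration under stated assumptions and does not use the kipoiseq API: the genome is a plain in-memory dictionary rather than a FASTA file, the intervals are a toy list rather than a BED file, and the helper names `resize`, `extract`, and `transform` are chosen here to mirror the slide. Bounds checking is omitted.

```python
import numpy as np

def resize(start, end, width):
    """Return a window of `width` bases centred on the original interval."""
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width

def extract(genome, chrom, start, end):
    """Slice the sequence of a 0-based, half-open interval from an in-memory genome."""
    return genome[chrom][start:end]

def transform(seq, alphabet="ACGT"):
    """One-hot encode a sequence; letters outside the alphabet (e.g. N) stay all zeros."""
    out = np.zeros((len(seq), len(alphabet)), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        j = alphabet.find(base)
        if j >= 0:
            out[i, j] = 1
    return out

# Toy stand-ins for genome.fa and intervals.bed (hypothetical data)
genome = {"chr1": "ACGTACGTACGT", "chr2": "TTTTGGGGCCCCAAAA"}
intervals = [("chr1", 2, 8), ("chr2", 4, 12)]

# Resizing to a fixed width lets the per-interval arrays stack into one batch
batch = np.stack([
    transform(extract(genome, chrom, *resize(start, end, width=4)))
    for chrom, start, end in intervals
])
print(batch.shape)
```

Resizing comes first because a model typically expects a fixed input length, while the intervals in a BED file can have arbitrary widths.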
Dependencies (data-loader and model)
Test predictions [Figure: example input sequences CGTGAGTTT, TGATCGAGG, GTAGCTAGC]
Schema: supports multiple inputs/outputs
General information
Model repository
# of model groups: 20
# of models: 2073
Testing models
- Install new conda environment
- Test the pipeline: data -> dataloader -> model
- Test the predictions match
When?
- Pull-request
- Nightly (all model groups)
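The "predictions match" check amounts to comparing freshly computed outputs against stored reference outputs within a numerical tolerance. The sketch below shows one way to express that comparison. It is not Kipoi's actual test code: the function name `predictions_match`, the output name `binding_prob`, the tolerances, and the numbers are all hypothetical.

```python
import numpy as np

def predictions_match(predicted, expected, rtol=1e-5, atol=1e-8):
    """Return True if every predicted array is close to its stored counterpart."""
    if predicted.keys() != expected.keys():
        return False
    return all(
        predicted[name].shape == expected[name].shape
        and np.allclose(predicted[name], expected[name], rtol=rtol, atol=atol)
        for name in expected
    )

# Hypothetical stored test predictions and two fresh runs of the pipeline
expected = {"binding_prob": np.array([0.12, 0.87, 0.45])}
fresh = {"binding_prob": np.array([0.12, 0.87, 0.45]) + 1e-9}  # float noise only
drifted = {"binding_prob": np.array([0.12, 0.80, 0.45])}       # real change

ok = predictions_match(fresh, expected)
broken = predictions_match(drifted, expected)
print(ok, broken)
```

A tolerance is used instead of exact equality because the same model can give slightly different floating-point results across library versions and hardware.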
Using models
For the impatient: a 30-second introduction to Kipoi
Case study 1: Benchmarking models
Benchmarking alternative models
# Create and activate a new conda environment with
# all model dependencies installed
kipoi env create <Model>
source activate kipoi-<Model>
# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5'
Works very nicely with workflow-management tools like Snakemake.
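When benchmarking several models, the same `kipoi predict` call is repeated with only the model name and output file changing. The sketch below builds those commands programmatically and serializes the data-loader arguments with `json.dumps`, which avoids hand-quoting mistakes. The helper `kipoi_predict_cmd` and the model names `ModelA` and `ModelB` are placeholders introduced for illustration; the commands are only printed, not executed.

```python
import json
import shlex

def kipoi_predict_cmd(model, intervals_file, fasta_file, output):
    """Build the argument list for the `kipoi predict` call shown on the slide."""
    dataloader_args = json.dumps(
        {"intervals_file": intervals_file, "fasta_file": fasta_file}
    )
    return [
        "kipoi", "predict", model,
        f"--dataloader_args={dataloader_args}",
        "-o", output,
    ]

# Placeholder model names; substitute the models being benchmarked
models = ["ModelA", "ModelB"]
commands = [
    kipoi_predict_cmd(m, "intervals.bed", "hg38.fa", f"{m}.preds.h5")
    for m in models
]

for cmd in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell
    print(shlex.join(cmd))
```

Each command would still need to run inside the matching `kipoi-<Model>` environment, which is one reason a workflow manager that handles per-rule environments fits this pattern well.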
Why not a single command?
predict <model> -i input.data -o output.data
With --singularity, creating and activating the conda environment is no longer a separate step:
# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5' --singularity