Lecture 12: Predicting gene expression and splicing Prof. Manolis - PowerPoint PPT Presentation

6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 12: Predicting gene expression and splicing Prof. Manolis Kellis Slides credit: David Gifford, et al http://mit6874.github.io


  1. 6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 12: Predicting gene expression and splicing Prof. Manolis Kellis Slides credit: David Gifford, et al http://mit6874.github.io

  2. Today: Predicting gene expression and splicing
     0. Review: Expression, unsupervised learning, clustering
     1. Up-sampling: predict 20,000 genes from 1000 genes
     2. Compressive sensing: Composite measurements
     3. DeepChrome+LSTMs: predict expression from chromatin
     4. Predicting splicing from sequence: 1000s of features
     5. Unsupervised deep learning: Restricted Boltzmann machines
     6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang)

  3. RNA-Seq: De novo tx reconstruction / quantification
     RNA-Seq technology:
     • Sequence short reads from mRNA, map to genome
     • Variations:
       – Count reads mapping to each known gene
       – Reconstruct transcriptome de novo in each experiment
     • Advantage: Digital measurements, de novo
     Microarray technology:
     • Synthesize DNA probe array, complementary hybridization
     • Variations:
       – One long probe per gene
       – Many short probes per gene
       – Tiled k-mers across genome
     • Advantage: Can focus on small regions, even if few molecules / cell

  4. Expression Analysis Data Matrix
     • Measure 20,000 genes in 100s of conditions
     • Each experiment measures expression of thousands of ‘spots’, typically genes
     • Study resulting matrix
     [Figure: data matrix of m genes × n experiments (Condition 1, Condition 2, Condition 3, …). Each row is the expression profile of a gene. Annotations: gene similarity questions, experiment similarity questions.]

  5. Clustering vs. Classification
     • Goal of Clustering: Group similar items that likely come from the same category, and in doing so reveal hidden structure
       – Unsupervised learning
     • Goal of Classification: Extract features from the data that best assign new elements to ≥1 of well-defined classes
       – Supervised learning
     • Independent validation of groups that emerge (Alizadeh, Nature 2000): chronic lymphocytic leukemia; B-cell genes in blood cell lines; proliferation genes in transformed cell lines; lymph node genes in diffuse large B-cell lymphoma (DLBCL)
     [Figure: two genes × conditions heatmaps, one annotated with known classes. Alizadeh, Nature 2000]

  6. PCA, Dimensionality reduction

  7. Geometric interpretation of SVD
     $Mx = M(x) = U(S(V^*(x)))$: the map $M$ (a shearing in the figure) decomposes into a rotation $V^*$, a scaling $S$, and a rotation $U$

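  8. Low-rank Approximation
     • Solution via SVD: $A_k = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, V^T$ (set smallest $r-k$ singular values to zero)
     • Column notation, sum of rank 1 matrices: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
     • Error: $\min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ (in the spectral norm, the error is $\|A - A_k\|_2 = \sigma_{k+1}$)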
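A minimal NumPy sketch of the truncated-SVD approximation on this slide. The random matrix is only a stand-in for a genes × conditions expression matrix, and the helper name `low_rank_approx` and all sizes are illustrative assumptions. The sketch also computes the approximation error in two norms: the Frobenius error equals the root of the sum of squared discarded singular values, and the spectral-norm error equals $\sigma_{k+1}$.

```python
import numpy as np

def low_rank_approx(A, k):
    """Return the rank-k truncated-SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keeping the k largest singular values is the same as zeroing the rest.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8))    # toy stand-in for an expression matrix (assumption)
k = 3
A_k, s = low_rank_approx(A, k)

frob_err = np.linalg.norm(A - A_k, "fro")   # sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
spec_err = np.linalg.norm(A - A_k, 2)       # sigma_{k+1}
```

With 0-based indexing, `s[k]` is $\sigma_{k+1}$, so `spec_err` can be compared against `s[k]` and `frob_err` against `np.sqrt(np.sum(s[k:] ** 2))`.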

  9. PCA of MNIST digits

  10. t-SNE of MNIST digits
      [Figure: t-SNE embedding with one cluster per digit, labeled 0 to 9]

  11. t-SNEs of single-cell Brain data
      • Brain hippocampus sub-structures: CA1, CA2-4, Subiculum, Dentate Gyrus (DG)
      • scRNA-seq in 48 individuals, 84k cells, Nature, 2019
      • scATAC-seq of 262k cells across 7 brain regions
      • 16 Sz/16 BP/16 Controls, 300k cells

  12. Autoencoder: dimensionality reduction with neural net
      • Tricking a supervised learning algorithm to work in unsupervised fashion
      • Feed input as output function to be learned. But! Constrain model complexity
      • Pretraining with RBMs to learn representations for future supervised tasks. Use RBM output as “data” for training the next layer in stack
      • After pretraining, “unroll” RBMs to create deep autoencoder
      • Fine-tune using backpropagation [Hinton et al., 2006]
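A small sketch of the "feed input as output, but constrain model complexity" idea, written in plain NumPy. It is a single-hidden-layer autoencoder trained directly by backpropagation, not the RBM pretraining and unrolling procedure of Hinton et al.; the toy data, bottleneck width, learning rate, and step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 samples, 10 features generated from 2 latent factors.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize each feature

n, d, h = X.shape[0], X.shape[1], 2         # h = bottleneck width (constrains complexity)
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

lr, losses = 0.01, []
for step in range(1000):
    H = np.tanh(X @ W1 + b1)                # encoder: input -> low-dimensional code
    X_hat = H @ W2 + b2                     # decoder: code -> reconstruction
    err = X_hat - X                         # the training target is the input itself
    losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))

    # Backpropagation of the mean reconstruction loss.
    d_out = err / n
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    dZ = (d_out @ W2.T) * (1.0 - H ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

codes = np.tanh(X @ W1 + b1)                # 2-D embedding of each sample
```

The rows of `codes` play the role that principal-component scores play in PCA; the nonlinearity in the encoder is what distinguishes the two.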

  14. 1. Up-sampling gene expression patterns

  15. Challenge: Measure few values, infer many values
      • Image up-scaling (https://arxiv.org/pdf/1902.06068.pdf)
        – Inverse of convolution (de-convolution)
        – Transfer learning from corpus of images
        – Low-dim. re-projection to high-dim. img
      • Digital signal upscaling
        – Interpolating low-pass filter (e.g. FIR finite impulse response)
        – Low-dim. capture of higher-dim. signal
        – Nyquist rate (discrete) / freq. (contin.)
      • Gene expression measurements
        – Measure 1000 genes, infer the rest
        – Rapid, cheap, reference assay
        – Apply to millions of conditions
      • Which 1000 genes? Compressed sensing
        – Measure few combinations of genes
        – Better capture high-dimensional vector

  16. Deep Learning architectures for up-sampling images
      • Architectures: pre-sampling super-resolution (SR), post-sampling SR, progressive up-sampling, iterative up-and-down sampling
      • Representation/abstract learning
        – Enables compression, re-upscaling, denoising
        – Example: autoencoder bottleneck. High-low-high
        – Modification: de-compression, up-scaling, low-high only

  17. D-GEX - Deep Learning for up-scaling L1000 gene expression • Multi-task Multi-Layer Feed-Forward Neural Net • Non-linear activation function (hyperbolic tangent) • Input: 943 genes, Output: 9520 targets (partition to fit in memory)
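A rough sketch of the kind of network this slide describes: a multi-task feed-forward net with tanh hidden units that maps 943 measured genes to 9520 target genes, with one linear output unit per target. The hidden-layer sizes, initialization, batch size, and random inputs are assumptions for illustration and are not taken from D-GEX itself; only the forward pass and the loss are shown.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(scale=np.sqrt(1.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """tanh hidden layers followed by a linear output layer (regression)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b                      # one output per target gene (multi-task)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
n_landmark, n_target = 943, 9520          # sizes from the slide
params = init_mlp([n_landmark, 256, 256, n_target], rng)   # hidden sizes are hypothetical

x = rng.normal(size=(4, n_landmark))      # 4 toy expression profiles (assumption)
y = rng.normal(size=(4, n_target))        # toy targets, for shape checking only
pred = forward(params, x)
loss = mse(pred, y)
```

Because every target gene shares the same hidden layers, what the net learns for one gene can help the others, which is the sense in which the model is multi-task.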

  18. D-GEX outperforms Linear Regression or K-nearest-Neighbors
      • Lower error than LR or KNN
      • Training rapidly converges
      • Strictly better for nearly all genes
      • Deeper = better
      However: performance still not great, computational limitations

  20. 2. Composite measurements for compressed sensing

  21. Key insight: Composite measurements better capture modules • Sparse Module Activity Factorization (SMAF)

  22. Making composite measurements in practice • Combinations of probes + barcodes for measurement • More consistent signal-to-noise ratios
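To make the idea concrete, here is a toy sketch of composite measurements followed by sparse decoding. It assumes expression is a sparse combination of gene modules, `x = U @ w`, measures a few random weighted sums of genes, `y = A @ x`, and recovers the module activities with orthogonal matching pursuit. This is not SMAF: the module dictionary `U` is random here rather than learned, and the sizes, the random measurement design `A`, and the choice of OMP as decoder are all assumptions.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: greedy sparse solution of y ≈ D @ w."""
    norms = np.linalg.norm(D, axis=0)
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(n_nonzero):
        # Pick the (normalized) column most correlated with the residual.
        j = int(np.argmax(np.abs(D.T @ residual) / norms))
        if j not in support:
            support.append(j)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    w = np.zeros(D.shape[1])
    w[support] = coef
    return w

rng = np.random.default_rng(0)
n_genes, n_modules, n_meas, k = 1000, 50, 25, 3   # hypothetical sizes

U = rng.normal(size=(n_genes, n_modules))         # module dictionary (random stand-in)
w_true = np.zeros(n_modules)
w_true[rng.choice(n_modules, size=k, replace=False)] = rng.normal(size=k)
x = U @ w_true                                    # full expression profile

A = rng.normal(size=(n_meas, n_genes))            # each row: one composite of many genes
y = A @ x                                         # only 25 numbers are measured

D = A @ U                                         # how modules appear in the measurements
w_hat = omp(D, y, k)
x_hat = U @ w_hat                                 # decoded 1000-gene profile
```

The point of the sketch is the bookkeeping: 25 composite values stand in for 1000 genes, and decoding is possible only because few modules are active at once.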

  24. 3. Predicting Expression from Chromatin

  25. Can we predict gene expression from chromatin information? • DNA methylation vs. gene expression • Promoters: high. Gene body: low

  26. Strong enhancers (+H3K27ac) vs. weak enhancers (H3K4me1 only)

  27. DeepChrome: positional histone features predictive of expression
      • Positional information for each mark
      • Convolution, pooling, drop-out, Multi-Layer-Perceptron (MLP) alternating lin/non-linear
      • Outperforms previous methods
      • Meaningful features selected
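The layer sequence named on this slide (convolution, pooling, drop-out, MLP) can be sketched as a forward pass in NumPy. The input is assumed to be a marks × bins matrix of histone signal around a gene, and the output a probability of high versus low expression. The number of marks and bins, filter sizes, and random weights are illustrative assumptions, and no training is shown.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(z):
    return np.maximum(z, 0.0)

def conv1d(X, filters, bias):
    """Valid 1-D cross-correlation over bins. X: (marks, bins); filters: (f, marks, width)."""
    windows = sliding_window_view(X, filters.shape[2], axis=1)   # (marks, positions, width)
    return np.einsum("mpw,fmw->fp", windows, filters) + bias[:, None]

def max_pool(Z, size):
    n_filters, length = Z.shape
    usable = (length // size) * size          # drop the ragged tail, if any
    return Z[:, :usable].reshape(n_filters, -1, size).max(axis=2)

def dropout(h, rate, rng, train):
    if not train:
        return h                              # dropout is disabled at prediction time
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_marks, n_bins, n_filters, width, pool, hidden = 5, 100, 8, 10, 5, 32  # hypothetical

X = rng.poisson(2.0, size=(n_marks, n_bins)).astype(float)   # toy read counts per bin
filters = rng.normal(scale=0.1, size=(n_filters, n_marks, width))
conv_bias = np.zeros(n_filters)
flat_dim = n_filters * ((n_bins - width + 1) // pool)
W1 = rng.normal(scale=0.1, size=(flat_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 2)); b2 = np.zeros(2)

h = max_pool(relu(conv1d(X, filters, conv_bias)), pool)      # positional features
h = dropout(h.ravel(), rate=0.5, rng=rng, train=False)
h = relu(h @ W1 + b1)                                         # MLP hidden layer
probs = softmax(h @ W2 + b2)                                  # [P(low), P(high)]
```

Each filter spans all marks at once over a window of neighboring bins, which is how positional and cross-mark patterns enter the model.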

  28. AttentiveChrome: Selectively attend to specific marks/positions
      • Attention: LSTM: Long short-term memory module
      • Hierarchical LSTM modules: interactions across marks
      • Attention focuses on specific positions for specific marks
      • Consistent improvement over DeepChrome
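A sketch of the two-level attention idea only: attention weights over positions within each mark, then attention weights over marks. The LSTM encoders are not implemented; random arrays stand in for their hidden states. The dimensions and the dot-product scoring vectors `v_bin` and `v_mark` are assumptions for illustration, not the published architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, v):
    """Soft attention: weight the rows of H by softmax(H @ v) and sum them."""
    alpha = softmax(H @ v)
    return alpha @ H, alpha

rng = np.random.default_rng(0)
n_marks, n_bins, d = 5, 100, 16             # hypothetical sizes

# Stand-in for per-bin LSTM hidden states of each mark (assumption).
H = rng.normal(size=(n_marks, n_bins, d))
v_bin = rng.normal(size=(n_marks, d))       # one position-level scoring vector per mark
v_mark = rng.normal(size=d)                 # mark-level scoring vector

mark_summaries, bin_weights = [], []
for m in range(n_marks):
    summary, alpha = attend(H[m], v_bin[m]) # which positions matter for mark m
    mark_summaries.append(summary)
    bin_weights.append(alpha)
mark_summaries = np.stack(mark_summaries)   # (marks, d)
bin_weights = np.stack(bin_weights)         # (marks, bins), each row sums to 1

gene_vector, mark_weights = attend(mark_summaries, v_mark)   # which marks matter
```

The weights in `bin_weights` and `mark_weights` are what make such a model inspectable: they can be plotted per gene to see where the model looked.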

  30. 4. Predicting splicing from sequence

  31. Deciphering tissue-specific splicing code
      • Input: alternatively spliced exon2 with adjacent exon1 and exon3; four 300 nt windows (figure)
      • Feature set (RNA feature extraction): known motifs, transcript structure in target exon and adjacent exons
      • Splicing code: RNA features + tissue type → 3-class softplus prediction model with outputs q_inc, q_exc, q_nc
      • Targets:
        – Exon inclusion: t_inc = 1, t_exc = 0, t_nc = 0
        – Exon exclusion: t_inc = 0, t_exc = 1, t_nc = 0
      [Barash et al., 2010]

  32. Bayesian neural network splicing code
      • Data: 1014 RNA features × 3665 exons
      • Bayesian neural network:
        – # hidden units follows Poisson(λ)
        – Network weights follow spike-and-slab prior Bern(1 − α)
        – Likelihood is cross-entropy
        – Network weights are sampled from the posterior
      • 4 mouse tissues, each with 3 classes (i.e., 12 output units)
      [Xiong et al., 2011]
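To illustrate what the two priors mean, the sketch below draws one network from the prior: a Poisson-distributed number of hidden units and spike-and-slab weights. It assumes that 1 − α is the probability that a weight is non-zero and that the slab is a zero-mean Gaussian; both readings, and the values of λ, α, and the slab width, are assumptions. Posterior sampling, which is the hard part, is not shown.

```python
import numpy as np

def sample_prior(n_inputs, n_outputs, lam, alpha, slab_std, rng):
    """Draw one network architecture and weight set from the prior."""
    n_hidden = max(1, int(rng.poisson(lam)))        # number of hidden units ~ Poisson(lam)

    def spike_and_slab(shape):
        # Spike: weight is exactly zero. Slab: weight drawn from a Gaussian.
        included = rng.random(shape) < (1.0 - alpha)   # Bernoulli(1 - alpha), assumed meaning
        slab = rng.normal(scale=slab_std, size=shape)
        return np.where(included, slab, 0.0)

    W_in = spike_and_slab((n_inputs, n_hidden))
    W_out = spike_and_slab((n_hidden, n_outputs))
    return W_in, W_out

rng = np.random.default_rng(0)
# 1014 features and 12 outputs come from the slide; the rest are hypothetical.
W_in, W_out = sample_prior(n_inputs=1014, n_outputs=12, lam=10, alpha=0.9,
                           slab_std=1.0, rng=rng)
sparsity = float(np.mean(W_in == 0.0))              # fraction of weights switched off
```

With α close to 1 most input weights are exactly zero, so the prior itself performs a form of feature selection over the 1014 RNA features.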

  33. Predicts disease-causing mutations from splicing code [Xiong et al., 2011]

  34. Predicts disease-causing mutations from splicing code
      Scoring splicing changes due to SNP $\Delta\psi$:
      • Train splice code model on 10,689 exons to predict the 3 splicing classes over 16 human tissues using 1393 sequence features (motifs & RNA structures)
      • Score both the reference $\psi^{\text{ref}}$ and alternative $\psi^{\text{alt}}$ sequences harboring one of the 658,420 common variants
      • Calculate $\Delta\psi_t = \psi_t^{\text{ref}} - \psi_t^{\text{alt}}$ over each tissue $t$
      • Obtain largest absolute or aggregate $\Delta\psi_t$ to score effects of SNPs
      [Xiong et al., 2011]
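The scoring step can be sketched in a few lines once per-tissue predictions are available. The ψ values below are made-up numbers for three tissues, standing in for the output of a trained splicing model, and reading "aggregate" as the sum of Δψ over tissues is an assumption.

```python
import numpy as np

def score_variant(psi_ref, psi_alt):
    """Per-tissue delta-psi plus two summary scores for one variant."""
    delta = np.asarray(psi_ref) - np.asarray(psi_alt)   # delta_psi_t = psi_ref_t - psi_alt_t
    largest = float(delta[np.argmax(np.abs(delta))])    # signed value with largest magnitude
    aggregate = float(delta.sum())                      # assumed meaning of "aggregate"
    return delta, largest, aggregate

# Hypothetical model outputs for one exon in three tissues.
psi_ref = [0.9, 0.8, 0.5]     # predicted inclusion with the reference allele
psi_alt = [0.6, 0.85, 0.1]    # predicted inclusion with the alternative allele

delta, largest, aggregate = score_variant(psi_ref, psi_alt)
# delta is approximately [0.3, -0.05, 0.4]; largest is 0.4; aggregate is 0.65
```

Keeping the sign in `largest` preserves whether the variant is predicted to decrease or increase exon inclusion relative to the reference.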
