Diet Networks: Thin Parameters for Fat Genomics


  1. Institut des algorithmes d’apprentissage de Montréal. Diet Networks: Thin Parameters for Fat Genomics. Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio

  2. Outline: Motivation & Challenges; Deep Learning Architectures; Diet Networks; Results; Wrap up and future research directions.

  3. Motivation & Challenges

  4. Motivation: Deep Learning. Figure sources: Zhou et al., 2015; http://quincypublicschools.com/library/contact-school/book-stack/; https://en.wikipedia.org/wiki/GeForce_10_series

  5. Motivation: Deep Learning and Genomics. Figure sources: http://quincypublicschools.com/library/contact-school/book-stack/; https://www.genome.gov/sequencingcostsdata/; https://en.wikipedia.org/wiki/GeForce_10_series

  6. Genomic Data as Fat Data. Target: millions of simple variants across the genome (SNPs). Number of participants: limited, even for large datasets. [Figure: data matrix of # participants (samples) by # SNPs (features).] Fat data: dire imbalance between # samples and # input features.

  7. Challenges: Parameter explosion. Naive setup for SNP data with a linear classifier: # inputs = hundreds of thousands, # samples = thousands, and # inputs = # parameters (W), so # samples << # parameters. In deep networks, the # parameters in the 1st layer grows linearly with the # inputs.
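
The parameter counts on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the concrete sizes (300,000 inputs, 3,000 samples, 100 hidden units) are assumed stand-ins for "hundreds of thousands" and "thousands", and `first_layer_params` is a hypothetical helper name.

```python
def first_layer_params(n_inputs: int, n_hidden: int, bias: bool = True) -> int:
    """Free parameters of one fully connected layer: weights plus optional biases."""
    return n_inputs * n_hidden + (n_hidden if bias else 0)


n_snps = 300_000    # assumed: "hundreds of thousands" of inputs
n_samples = 3_000   # assumed: "thousands" of samples
n_hidden = 100      # assumed width of the first hidden layer

# Linear classifier with a single output: one weight per input.
linear = first_layer_params(n_snps, 1, bias=False)

# First layer of a deep network: grows linearly with the number of inputs.
deep = first_layer_params(n_snps, n_hidden, bias=False)

print(linear)            # weights in the linear classifier
print(deep)              # weights in the first hidden layer
print(deep / n_samples)  # first-layer weights per training sample
```

Under these assumed sizes the first layer alone holds 30 million weights, which is 10,000 weights for every training sample.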

  8. Challenges: Overfitting. Figure source: https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/

  9. Challenges: The curse of dimensionality. Considering 300K SNPs with 3 possible values each (0, 1, 2), there are $3^{300\mathrm{K}}$ combinations! Figure source: http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/
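
To get a feel for how large that number of combinations is, one can count its decimal digits. The sketch below uses logarithms rather than converting the full integer to a string, since recent Python versions cap integer-to-string conversion by default; `num_configurations_digits` is a hypothetical helper name.

```python
import math


def num_configurations_digits(n_snps: int, n_values: int = 3) -> int:
    """Decimal digits of n_values ** n_snps, computed via log10."""
    return math.floor(n_snps * math.log10(n_values)) + 1


# Sanity check on a small case: 3 ** 5 = 243 has 3 digits.
print(num_configurations_digits(5))

# 300K SNPs with 3 possible values each.
print(num_configurations_digits(300_000))
```

The second value is on the order of 143 thousand digits, so no dataset of a few thousand individuals can cover more than a vanishing fraction of the input space.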

  10. Why deep learning? Capturing information directly from the raw input data is not trivial and often involves complex and non-linear functions. Many problems become easier if the input data is transformed into a representation that emphasizes its most relevant characteristics.

  11. Multi-Layer Perceptron (MLP). Supervised learning: desired output values are provided. Describe data as a hierarchy of concepts. Unsupervised learning: aims to discover hidden structure in the input data.

  12. CNN: reducing the number of parameters. Parameter sharing. Exploit spatially local correlations. Suitable for data with a grid-like topology. Problem: when the full DNA sequence is unavailable, other types of methods seem more appropriate.

  13. Diet Networks

  14. The idea: Use a novel neural network reparametrization, which considerably reduces the number of free parameters when the input is very high-dimensional and orders of magnitude larger than the number of training samples.

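  15. The model. Input data: N x F, N << F. [Diagram: the main MLP takes one sample (1 x F) as input; the auxiliary network takes one feature (SNP, 1 x N) as input and passes it through an embedding (Emb.) and an MLP; F x 100 blocks appear on both the main and the auxiliary paths. Other size labels in the diagram: 100, 500, 50K, 300K, 30M.]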
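
The reparametrization can be sketched in NumPy as a forward pass in which the F x 100 first-layer weights are predicted from per-feature embeddings instead of being stored as free parameters. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes are shrunk, the embedding is a fixed random projection standing in for the "Emb." block, and all names (`diet_forward`, `feat_emb`, `params`) are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def diet_forward(X, feat_emb, params):
    """Forward pass: the auxiliary map predicts W1, then the main MLP classifies."""
    W1 = np.tanh(feat_emb @ params["A"])            # (F, d_hid), predicted, not free
    h = np.maximum(X @ W1 + params["b1"], 0.0)      # (N, d_hid), ReLU hidden layer
    return softmax(h @ params["W2"] + params["b2"]), W1


rng = np.random.default_rng(0)
N, F = 20, 500                     # toy sizes with N << F
d_emb, d_hid, n_cls = 8, 100, 4    # d_hid = 100 follows "F x 100"; the rest is assumed

X = rng.integers(0, 3, size=(N, F)).astype(float)   # genotypes in {0, 1, 2}

# Assumption: a fixed random projection of each SNP's 1 x N column as its embedding.
feat_emb = X.T @ rng.normal(scale=0.1, size=(N, d_emb))   # (F, d_emb)

params = {
    "A": rng.normal(scale=0.1, size=(d_emb, d_hid)),   # auxiliary network weights
    "b1": np.zeros(d_hid),
    "W2": rng.normal(scale=0.1, size=(d_hid, n_cls)),
    "b2": np.zeros(n_cls),
}

probs, W1 = diet_forward(X, feat_emb, params)
n_free = sum(p.size for p in params.values())
print(probs.shape, W1.shape, n_free, F * d_hid)
```

In this sketch the number of free parameters depends on the embedding size and the hidden width but not on F, whereas a directly parametrized first layer would need F x 100 weights.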

  16. Embeddings. Raw (learnt embedding, end-to-end training). Per class histograms. [Diagram labels: MLP, MLP, Emb., MLP.]

  17. Per class histogram. Example for one SNP, with genotypes across individuals: 0 0 2 1 0 2 0 1. Class 1: 1 x 0, 2 x 1, 1 x 2, giving 0.25, 0.50, 0.25. Class 2: 3 x 0, 0 x 1, 1 x 2, giving 0.75, 0, 0.25. Resulting embedding for this SNP: 0.25 0.50 0.25 0.75 0 0.25.
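
The per class histogram from this slide can be reproduced with a short function. The genotype values are those shown on the slide; the assignment of individuals to classes is not recoverable from the slide, so the `labels` list below is a hypothetical assignment chosen to match the stated counts, and `per_class_histogram` is an illustrative name.

```python
from collections import Counter


def per_class_histogram(genotypes, labels, classes, values=(0, 1, 2)):
    """Embedding of one SNP: per class, the fraction of members with each genotype value.

    Assumes every class in `classes` has at least one member.
    """
    embedding = []
    for c in classes:
        members = [g for g, y in zip(genotypes, labels) if y == c]
        counts = Counter(members)
        embedding.extend(counts[v] / len(members) for v in values)
    return embedding


genotypes = [0, 0, 2, 1, 0, 2, 0, 1]   # one SNP across eight individuals (from the slide)
labels = [1, 2, 1, 1, 2, 2, 2, 1]      # hypothetical class of each individual

print(per_class_histogram(genotypes, labels, classes=(1, 2)))
# [0.25, 0.5, 0.25, 0.75, 0.0, 0.25]
```

The embedding length is 3 times the number of classes, independent of the number of individuals.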

  18. The 1000 Genomes Project (1). Large-scale comparison of DNA sequences from populations, thanks to the presence of genetic variations. Represents 26 populations from 5 geographical regions, in total 3,450 individuals. SNP inclusion/exclusion criteria: genetic variants with frequencies of at least 5%; excluded SNPs positioned on sex chromosomes; only included SNPs in approximate linkage equilibrium with each other. As a result, we obtained 315,345 SNPs, encoded as having 0, 1 or 2 copies of a genetic mutation (non-reference nucleotide).
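
Two of the inclusion criteria above can be sketched directly on the 0/1/2 encoding. This is an assumption-laden illustration: it reads "frequency" as the frequency of the non-reference allele, it labels sex chromosomes as "X" and "Y", and it leaves out the linkage equilibrium filter entirely; `alt_allele_frequency` and `keep_snp` are hypothetical names.

```python
def alt_allele_frequency(genotypes):
    """Frequency of the non-reference allele, given 0/1/2 copies per diploid individual."""
    return sum(genotypes) / (2 * len(genotypes))


def keep_snp(chrom, genotypes, min_freq=0.05):
    """Apply the frequency and sex-chromosome criteria (linkage filtering not included)."""
    if chrom in {"X", "Y"}:   # assumed chromosome labels
        return False
    return alt_allele_frequency(genotypes) >= min_freq


print(keep_snp("1", [0, 1, 0, 0]))        # frequency 1/8, kept
print(keep_snp("X", [2, 2, 1, 0]))        # sex chromosome, dropped
print(keep_snp("2", [0] * 19 + [1]))      # frequency 1/40, dropped
```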

  19. Experimental setup. Ethnicity prediction from SNPs on 1000 Genomes data. Metric: misclassification error and number of free parameters. 5-fold cross-validation.
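
The evaluation protocol can be sketched as follows. This is a generic 5-fold cross-validation loop, not the authors' code: it assumes the reported spread is a standard deviation across folds, and `fit_predict` is a hypothetical callback that trains on the training split and returns predictions for the held-out split. A majority-class predictor stands in for a real model.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def misclassification_error(y_true, y_pred):
    """Percentage of predictions that differ from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return 100.0 * wrong / len(y_true)


def cross_validate(X, y, fit_predict, k=5, seed=0):
    """Mean and standard deviation of the held-out error over k folds."""
    errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        y_pred = fit_predict([X[i] for i in train], [y[i] for i in train],
                             [X[i] for i in held_out])
        errors.append(misclassification_error([y[i] for i in held_out], y_pred))
    return mean(errors), stdev(errors)


def majority_class(X_train, y_train, X_test):
    """Toy baseline: always predict the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * len(X_test)


X = [[g] for g in range(20)]   # toy inputs, ignored by the baseline
y = ["A"] * 20                 # a single class, so the baseline is always right
print(cross_validate(X, y, majority_class))
```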

  20. Quantitative results (1)

      | Reconstruction | Model (embedding) | Misclassification error (%) | # free parameters |
      |---|---|---|---|
      | Without | Basic MLP | 8.31 ± 1.83 | 31.5M |
      | Without | Diet Networks (raw end2end) | 7.62 ± 02 | 227.3k |
      | Without | Diet Networks (histograms) | 6.90 ± 1.60 | 18.0k |
      | With | Basic MLP | 7.76 ± 1.38 | 63M |
      | With | Diet Networks (raw end2end) | 6.85 ± 1.72 | 534.8k |
      | With | Diet Networks (histograms) | 7.01 ± 1.20 | 28.1k |

  21. Quantitative results (2)

      | Embedding | Misclassification error (%) |
      |---|---|
      | Diet Networks (histograms) | 7.01 ± 1.20 |
      | PCA (10 PCs) | 20.56 ± 3.20 |
      | PCA (50 PCs) | 12.29 ± 0.89 |
      | PCA (100 PCs) | 10.52 ± 0.25 |
      | PCA (200 PCs) | 9.33 ± 1.24 |

  22. Quantitative results (3)

  23. What is the network learning? [Diagram labels: Input, MLP Layer 1, MLP Layer 2.]

  24. What is the network learning? [Figure panels: Raw input, Layer 1, Layer 2. Labels: Ethnicities.]

  25. What is the network learning? [Figure panels: Raw input, Layer 1, Layer 2. Labels: Continents.]

  26. Wrap up and future research directions

  27. Wrap up. We demonstrated the potential of deep learning models to tackle genomic-specific tasks. The parameter explosion introduced by high-dimensional genomic data can be mitigated by smart model parameterization, such as Diet Networks.

  28. What comes next… Conducting genetic association studies, with emphasis on population-aware analyses of SNP data in disease cohorts. Identifying the genetic basis of common diseases to achieve better patient risk prediction and improve our overall understanding of disease etiology.

  29. Thank you! @adri_romsor. Code: https://github.com/adri-romsor/DietNetworks
