A deep learning based approach for genetic risk prediction


  1. A deep learning based approach for genetic risk prediction. Raquel Dias, PhD, Senior Staff Scientist, Scripps Research Translational Institute (raqueld@scripps.edu, @RaquelDiasSRTI); Ali Torkamani, PhD (atorkama@scripps.edu, @ATorkamani).

  2. Whole Genome Sequencing vs. Genotype array. [Figure: two genotype matrices side by side. Full data (whole genome sequencing): every position is observed, e.g. 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1. Sparse data (genotype array): most positions are missing (?), e.g. 0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?.]

  3. Whole Genome Sequencing vs. Genotype array. [Figure: full genotype matrix (all positions observed) next to sparse genotype matrix (most positions missing, shown as ?). Full data: ~80M genetic variants. Sparse data: ~4 million genetic variants.]

  4. Genetic imputation problem. [Figure: reference haplotypes (whole genome; HapMap or 1,000 Genomes), e.g. ... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ..., are used to predict the untyped positions (?) in the study's typed genotypes (cases and controls, genotype array), e.g. 0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1.]

  5. A typical imputation approach. [Figure: reference panel (multiethnic Haplotype Reference Consortium, HRC) and study genotypes with missing positions, e.g. 0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1. Steps shown: mapping, linkage disequilibrium (LD r²) structure, prediction.]

  6. A typical imputation approach. [Figure: reference panel (multiethnic Haplotype Reference Consortium, HRC) and study genotypes after imputation, with every missing position filled in, e.g. 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1. Steps shown: mapping, linkage disequilibrium (LD r²) structure, prediction.]

  7. Polygenic Risk Score (PRS)

  8. Polygenic Risk Calculation. Design: 100,000+ subjects, with trait* vs. without trait. Results: millions of known variants. Polygenic Risk Score: cumulative sum (Σ). *Trait can often be heterogeneous, e.g. coronary artery disease = heart attack, stroke, bypass surgery, etc.
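
The cumulative sum on this slide can be sketched as a weighted sum over variants. This is a minimal illustration, assuming each variant carries a per-allele effect size (for example from a prior association study) and each subject has an allele count of 0, 1, or 2 per variant. The function name, effect sizes, and genotypes below are invented for the example and do not come from the presentation.

```python
import numpy as np

def polygenic_risk_score(allele_counts, effect_sizes):
    """Return one score per subject: the sum over variants of count * effect size."""
    allele_counts = np.asarray(allele_counts, dtype=float)  # shape (subjects, variants)
    effect_sizes = np.asarray(effect_sizes, dtype=float)    # shape (variants,)
    return allele_counts @ effect_sizes

# Hypothetical toy data: 2 subjects, 3 variants.
allele_counts = np.array([[0, 1, 2],
                          [2, 0, 1]])
effect_sizes = np.array([0.10, -0.05, 0.20])

scores = polygenic_risk_score(allele_counts, effect_sizes)
```

In practice the sum would run over the millions of known variants mentioned on the slide, which is why imputing the untyped variants matters for the score.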

  9. Objectives 1. More accurate and faster imputation 2. Find important genetic variants 3. Better polygenic risk score calculation

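  10. Our proposed approach. [Figure: autoencoder with an input layer (x), a hidden layer, and an output layer (x′). Encoding maps the input layer to the hidden layer; decoding maps the hidden layer to the output layer. Example genotype values shown: 1 1 1 1 0 2 0 0 2 2.]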
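
The encode/decode structure in this figure can be sketched as a single-hidden-layer autoencoder. This is only an illustration under assumptions: the layer sizes, the sigmoid activation, the random untrained weights, and the scaling of genotypes from {0, 1, 2} to [0, 1] are all choices made for the example, not details given on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyAutoencoder:
    """Input layer -> hidden layer (encoding) -> output layer (decoding)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; a real model would learn these by training.
        self.w_enc = rng.normal(0.0, 0.1, size=(n_inputs, n_hidden))
        self.b_enc = np.zeros(n_hidden)
        self.w_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_inputs))
        self.b_dec = np.zeros(n_inputs)

    def encode(self, x):
        return sigmoid(x @ self.w_enc + self.b_enc)

    def decode(self, h):
        return sigmoid(h @ self.w_dec + self.b_dec)

    def forward(self, x):
        return self.decode(self.encode(x))

# The ten genotype values shown on the slide, scaled to [0, 1] (an assumption).
x = np.array([[1, 1, 1, 1, 0, 2, 0, 0, 2, 2]], dtype=float) / 2.0
model = TinyAutoencoder(n_inputs=10, n_hidden=4)
hidden = model.encode(x)
x_prime = model.forward(x)
```

The hidden layer is narrower than the input here, so the reconstruction x′ has to pass through a compressed representation of x.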

  11. Denoising autoencoder for image restoration. [Figure labels: noise, mask.] Bigdeli, Siavash Arjomand, and Matthias Zwicker. "Image restoration using autoencoding priors." arXiv preprint arXiv:1703.09964 (2017). Wang, Ruxin, and Dacheng Tao. "Non-local auto-encoder with collaborative stabilization for image restoration." IEEE Transactions on Image Processing 25.5 (2016): 2117-2129.

  12. Genotype imputation case study example. [Figure: ground truth (whole genome sequencing), e.g. 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1, next to the masked input (genotype array), e.g. 0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?. The mask marks the positions shown as ?.]

  13. Case study: 9p21.3 region of the genome • Length: 59846 bp • 846 genetic variants in reference panel (whole genome data) • Approx. 200 common variants • Approx. 600 rare variants • Only 17-47 variants in genotype array!!! • Strong association to coronary artery disease (CAD) • Genotyped and sequenced in many studies

  14. Training on the reference panel: Data augmentation strategy. [Figure: reference whole genome haplotypes, e.g. ... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ..., pass through a mask to give the masked input; the autoencoder takes the masked input and produces the reconstructed output.]
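
The masking step in this pipeline can be sketched as follows. This is a minimal illustration, assuming positions are hidden independently at random with a fixed rate and that hidden positions are encoded as -1; the slide does not state the mask rate or the missing-value encoding, so both are placeholders, as is the function name.

```python
import numpy as np

def mask_genotypes(genotypes, mask_rate, rng, missing_value=-1):
    """Hide a random subset of positions; return the masked copy and the mask."""
    genotypes = np.asarray(genotypes)
    hidden = rng.random(genotypes.shape) < mask_rate  # True where a value is hidden
    masked = np.where(hidden, missing_value, genotypes)
    return masked, hidden

# The three reference rows shown on the slide.
reference = np.array([
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1],
])

rng = np.random.default_rng(0)
masked_input, hidden = mask_genotypes(reference, mask_rate=0.9, rng=rng)
```

Drawing a fresh mask each time a reference sample is used gives many different masked inputs for the same ground truth, which is one way to read "data augmentation" here.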

  15. Customized Sparsity Loss Function • Sparsity loss with Kullback-Leibler (KL) / cross entropy element: $D_{KL}(\rho \,\|\, \hat{\rho}) = \rho \log\frac{\rho}{\hat{\rho}} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}}$ • Customized loss adjusted for hidden activation sparsity: $loss = MSE + \beta \sum_{i=1}^{n} D_{KL}(i)$ • Mean Squared Error: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
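
A sparsity-penalized loss of this form (MSE plus a weighted sum of KL terms) can be sketched in NumPy. This is an illustration under assumptions: ρ is treated as a target activation level, ρ̂ as the mean activation of each hidden unit over the batch, and the KL sum runs over hidden units. That is the usual sparse-autoencoder convention, but the slide does not spell it out. The function names and the clipping constant are choices made for the example.

```python
import numpy as np

def kl_sparsity(rho, rho_hat, eps=1e-8):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)  # avoid log(0)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_autoencoder_loss(y, y_hat, hidden, rho, beta):
    """MSE reconstruction error plus beta times the summed KL sparsity penalty."""
    mse = np.mean((y - y_hat) ** 2)
    rho_hat = hidden.mean(axis=0)  # assumed: mean activation per hidden unit
    return mse + beta * np.sum(kl_sparsity(rho, rho_hat))

y = np.array([[0.0, 1.0], [1.0, 0.0]])

# Perfect reconstruction, hidden activations already at the target level.
loss_sparse = sparse_autoencoder_loss(y, y, np.full((2, 3), 0.05), rho=0.05, beta=1.0)

# Perfect reconstruction, but hidden units far more active than the target.
loss_dense = sparse_autoencoder_loss(y, y, np.full((2, 3), 0.5), rho=0.05, beta=1.0)
```

With identical reconstructions, only the KL term separates the two cases, so `loss_dense` is larger than `loss_sparse`.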

  16. Hyperparameters to be optimized • β • ρ • Activation functions • L1/L2 regularizers • Learning rate • Batch size

  17. Parallel Grid Search: hyperparameter optimization approach. [Figure labels: 100 × grid search samples; 10,000 × training grid samples (shown twice); hyperparameter combinations; trained model performance (accuracy, loss, sparsity, MSE).] 9 GPUs available: 7 GTX 1080, 1 Titan V, 1 Titan Xp. 860 hours (sequential run, 100 epochs).
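
The general shape of a parallel grid search can be sketched with the Python standard library. This is a toy illustration, not the presenters' setup: the search space, the scoring function (a stand-in for "train a model and return its accuracy"), and the use of threads in place of GPU workers are all assumptions. The worker count of 9 simply mirrors the number of GPUs listed on the slide.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def make_grid(search_space):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

def toy_score(params):
    # Hypothetical stand-in for training a model and measuring accuracy.
    # It peaks at learning_rate=0.01 and beta=1.0.
    return -(abs(params["learning_rate"] - 0.01) + abs(params["beta"] - 1.0))

def run_grid_search(grid, score_fn, max_workers=9):
    """Score every combination in parallel and return the best one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, grid))
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]

search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "beta": [0.0, 1.0, 5.0],
}
grid = make_grid(search_space)
best_params, best_score = run_grid_search(grid, toy_score)
```

Each combination is scored independently, which is why the work spreads across workers without coordination.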

  18. Grid Search Results: training accuracy

  19. Grid Search results: assessing best hyperparameter values

  20. Effect of hyperparameter values on training accuracy. [Plots: Pearson correlation (r²) against β, ρ, and learning rate.]

  21. Effect of hyperparameter values on training accuracy. [Plots: Pearson correlation (r²) against β, ρ, and learning rate.]

  22. Effect of hyperparameter values on training accuracy. [Plots: Pearson correlation (r²) against β, ρ, and learning rate.]

  23. Effect of hyperparameter values on training accuracy. [Plots: Pearson correlation (r²) against β, ρ, and learning rate.]

  24. Optimizing batch size: training accuracy. [Plots: accuracy vs. learning steps and loss vs. learning steps, for 10, 50, 100, and 1000 batches.]

  25. Optimizing batch size: training run time. [Plot: accuracy and run time (hours) for 10, 50, 100, and 1000 batches.]

  26. Testing on multiple case studies • Atherosclerosis Risk in Communities (ARIC): more than 3000 samples; whole genome sequencing (846 variants, 0% mask, ground truth); Affymetrix 6.0 genotype array (17 variants, 98% mask, input data) • Framingham Heart Study (FHS): more than 500 samples; whole genome sequencing (846 variants, 0% mask, ground truth); Illumina 500K genotype array (47 variants, 95% mask, input data); Illumina 5M (93 variants, 89% mask, input data)
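
The mask percentages on this slide follow from the variant counts: the masked share is the fraction of the 846 reference variants that the array does not type. A short arithmetic sketch, with a helper name chosen for the example:

```python
def mask_percent(typed_variants, total_variants):
    """Percentage of reference variants that are absent from the array."""
    return 100.0 * (1.0 - typed_variants / total_variants)

TOTAL_VARIANTS = 846  # variants in the whole genome sequencing ground truth
typed_by_array = {
    "Affymetrix 6.0": 17,
    "Illumina 500K": 47,
    "Illumina 5M": 93,
}
mask_by_array = {
    name: mask_percent(count, TOTAL_VARIANTS)
    for name, count in typed_by_array.items()
}
```

This gives roughly 98.0%, 94.4%, and 89.0%, close to the rounded figures quoted on the slide.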

  27. Accuracy in additional case studies: proposed approach versus common statistical methodology. [Panels: performance on all variants; performance on rare variants.]

  28. Accuracy in additional case studies: proposed approach versus common statistical methodology. [Panels: performance on all variants; performance on common variants.]

  29. Run time: proposed approach versus common statistical methodology

  30. Linkage disequilibrium structure: ARIC. [Figure: linkage disequilibrium (LD) r² for ground truth vs. prediction; panels for all variants, rare variants, and common variants.]

  31. Linkage disequilibrium structure: FHS. [Figure: linkage disequilibrium (LD) r² for ground truth vs. prediction; panels for all variants, rare variants, and common variants.]
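
A pairwise LD r² matrix like the ones compared on these slides can be sketched as the squared Pearson correlation between variant columns. This is a minimal illustration, assuming haplotypes coded 0/1 with one row per haplotype and one column per variant; the toy data and function name are invented for the example, and the presentation does not say which tool produced its LD plots.

```python
import numpy as np

def ld_r2(haplotypes):
    """Squared Pearson correlation between every pair of variant columns."""
    h = np.asarray(haplotypes, dtype=float)
    # rowvar=False treats columns (variants) as the variables.
    # Note: a variant with no variation would produce NaN entries.
    r = np.corrcoef(h, rowvar=False)
    return r ** 2

# Hypothetical toy data: 4 haplotypes, 3 variants.
# Variants 0 and 1 always co-occur; variant 2 is independent of both.
haplotypes = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])
r2 = ld_r2(haplotypes)
```

Computing this matrix once from the ground truth and once from the imputed genotypes gives two LD structures that can be compared entry by entry.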

  32. Interpretability: identifying representative genetic variants Maximal information criteria
