Large-scale machine learning for genotype / phenotype association
Aidan O’Brien (@aydun1)
Health Data Analytics 2018
Health and Biosecurity
By 2025 it is estimated that 50% of the world population will have been sequenced (Frost & Sullivan).
[Figure: Data acquisition of BigData disciplines in 2025. Disciplines shown: Genomics, YouTube, Astronomy, Twitter. Label: "20 EB Storage / year". Source: Stephens et al., "Big Data: Astronomical or Genomical?" (2015)]
2 | Large-scale Machine Learning for Gen-Phen Association | Aidan O’Brien | @aydun1
Understanding disease and finding biomarkers
https://www.projectmine.com/about/
Finding the disease gene(s)
[Figure: Gene1 and Gene2 compared between cases and controls]
Complex diseases are driven by multiple genes
An approach is needed that captures feature interactions.
[Figure: cases and controls]
Machine learning on 1.7 trillion datapoints
• 80 million features (genomic profile)
• 22,500 samples (individuals)
[Figure labels: Genomic profile, Disease status, Individuals A, B, C, Disease genes]
Machine learning can capture complex features
[Figure: two panels, "Trad. GWAS (logistic regression)" and "Required Solution", each labelled with Genomic profile, Individuals and Predictive variants]
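Why per-variant testing can miss interactions is easy to see on a toy case. The sketch below is illustrative only and not taken from the talk: it swaps logistic regression for a simpler marginal statistic (the difference in case rate between carriers and non-carriers), and the XOR-style phenotype, the three-variant genome and the function names are all assumptions made for the example.

```python
from itertools import product

def marginal_effect(rows, labels, j):
    """Case rate among carriers of variant j minus case rate among non-carriers."""
    carriers = [y for r, y in zip(rows, labels) if r[j] == 1]
    others = [y for r, y in zip(rows, labels) if r[j] == 0]
    return sum(carriers) / len(carriers) - sum(others) / len(others)

def joint_case_rates(rows, labels, j, k):
    """Case rate for every genotype combination of variants j and k."""
    table = {}
    for r, y in zip(rows, labels):
        table.setdefault((r[j], r[k]), []).append(y)
    return {combo: sum(ys) / len(ys) for combo, ys in sorted(table.items())}

# Hypothetical cohort: every combination of three binary variants, once each.
rows = list(product([0, 1], repeat=3))
# Assumed phenotype: disease only when exactly one of variants 0 and 1 is present.
labels = [r[0] ^ r[1] for r in rows]

single_locus = [marginal_effect(rows, labels, j) for j in range(3)]
pairwise = joint_case_rates(rows, labels, 0, 1)
```

In this toy cohort every entry of `single_locus` is zero, so a one-variant-at-a-time scan sees nothing, while `pairwise` separates cases from controls completely. Real interactions are rarely this extreme, but the same effect weakens single-locus signals.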
Random forest – a collection of decision trees
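To make "a collection of decision trees" concrete, here is a minimal pure-Python sketch of the general technique: each tree is grown on a bootstrap sample, considers a random subset of features at every split, and the forest predicts by majority vote. It is not VariantSpark's implementation; the function names, the binary 0/1 features and labels, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import product

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(X, y, rng, max_depth=3, n_try=2):
    """Grow one tree on binary features; a leaf is a label, a node is (feature, left, right)."""
    if max_depth == 0 or len(set(y)) == 1:
        return majority(y)
    best = None
    # Only a random subset of features is considered at each split.
    for f in rng.sample(range(len(X[0])), min(n_try, len(X[0]))):
        left = [i for i, row in enumerate(X) if row[f] == 0]
        right = [i for i, row in enumerate(X) if row[f] != 0]
        if not left or not right:
            continue
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(y)
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:
        return majority(y)
    _, f, left, right = best
    return (f,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, max_depth - 1, n_try),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, max_depth - 1, n_try))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, left, right = tree
        tree = right if row[f] else left
    return tree

def build_forest(X, y, n_trees=25, seed=0, **tree_args):
    """Each tree is trained on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, **tree_args))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(t, row) for t in forest])

# Toy data (assumed): the phenotype needs BOTH variant 0 and variant 1.
X = [list(bits) for bits in product([0, 1], repeat=4)] * 8
y = [row[0] & row[1] for row in X]
forest = build_forest(X, y, n_trees=25, seed=0, max_depth=4, n_try=3)
```

Because a tree splits on one variant and then on another further down the same branch, a combination of variants can be captured without specifying the interaction in advance.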
Population-scale genomic data analysis requires BigData solutions
                            High-performance compute cluster    Hadoop/Spark compute cluster
Focus                       Compute-intensive                   Data-intensive
Fault tolerant              No                                  Yes
Node-bound                  Yes                                 No
Parallelization             100+ CPU                            1000+ CPU
Parallelization procedure   bespoke                             standardized
[Slide annotation: CSIRO solution]
Solution: VariantSpark - “Wide” machine learning for population-scale cohorts
[Figure: Speed (low to high) against Accuracy (low to high), with labels Variant Spark, SparkML, MLlib and Spark Core]
“Analyzes 3000 individuals with 80M features in 30 minutes”
BMC Genomics 2015, 16:1052, PMID: 26651996 (citation=16)
VariantSpark – amplifies the association signal
• Bone Mineral Density (BMD) as the phenotype: 1,936 individuals with 7.2 million variants (imputed from array)
• Replicates known BMD genes identified by traditional GWAS (single-locus regression)
• Amplifies the signal over traditional methods, so smaller cohorts give robust insights
More accurate biomarker discovery
Hipster Index: synthetic dataset
Genome           Hipster?
1 1 0 1 1 0 1    Y
1 1 0 0 1 0 1    Y
0 0 0 0 1 0 1    N
1 1 1 0 1 0 1    Y
0 0 0 1 0 1 1    N
0 0 1 0 1 0 0    N
0 1 1 0 0 0 0    N
1 1 1 0 1 1 1    Y
0 0 0 0 0 0 0    N
HipsterScore = (2 * B6) + (0.2 * B2) + (1.5 * R1) + (0.1 * C2) + (3 * B6 * B2) + (2.5 * R1 * C1) + noise
The single-variant terms are the independent effects; the product terms (B6 * B2 and R1 * C1) are the interacting effects.
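The HipsterScore formula can be turned into a small data generator. This is a sketch under stated assumptions, not the notebook used in the talk: genotypes are taken to be 0/1 as in the slide's table, the noise is assumed Gaussian, and the Y/N threshold, the extra uninformative variants (`N0`, `N1`, ...) and the function names are hypothetical.

```python
import random

# Variant names as they appear in the slide's formula.
VARIANTS = ["B6", "B2", "R1", "C1", "C2"]

def hipster_score(g, noise=0.0):
    """Score for one genome; g maps variant name -> 0/1 genotype (assumed encoding)."""
    independent = 2 * g["B6"] + 0.2 * g["B2"] + 1.5 * g["R1"] + 0.1 * g["C2"]
    interacting = 3 * g["B6"] * g["B2"] + 2.5 * g["R1"] * g["C1"]
    return independent + interacting + noise

def make_dataset(n_samples, n_noise_variants=2, noise_sd=0.5, threshold=4.0, seed=0):
    """Random genomes with a Y/N label; threshold and noise_sd are hypothetical choices."""
    rng = random.Random(seed)
    names = VARIANTS + [f"N{i}" for i in range(n_noise_variants)]
    rows = []
    for _ in range(n_samples):
        g = {name: rng.randint(0, 1) for name in names}
        score = hipster_score(g, rng.gauss(0, noise_sd))
        rows.append((g, score, "Y" if score >= threshold else "N"))
    return rows

data = make_dataset(9)
```

Because the true weights are known, a dataset like this makes it possible to check whether a method ranks B6, B2, R1 and C1 above the uninformative variants, including C1, which contributes only through its interaction with R1.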
Share research notebooks
• Databricks
• AWS EKS
• Try it on your data
https://docs.databricks.com/applications/genomics/variant-spark.html
Understanding relationships can lead to clinical applications
[Figure labels: Finding Disease Genes, Correcting Genomes, Treating Individuals]
CSIRO’s cloud-based solutions
14 | Innovation In Digital Health - Open Floor Forum | Denis C. Bauer | @allPowerde
Three things to remember
• Complex diseases need software to detect gene interactions
• VariantSpark detects gene interactions
• Bringing findings into clinical practice requires new cloud technologies
Let’s build a healthier world together
Team: Denis Bauer, Arash Bayat, Oscar Luo, Laurence Wilson, Aidan O’Brien, Brendan Hosking, Rob Dunne, Piotr Szul, Natalie Twine, …
We are hiring… You? Email Denis.
[Slide labels: PhD, PhD, PhD, PhD, PhD]
Collaborators
Software: Lynn Langit
News: Top 10 Australian IT stories of 2017
Keynote
Aidan O’Brien, CSIRO