amyotrophic lateral sclerosis
play

Amyotrophic lateral sclerosis Dr Natalie Twine | Transformational - PowerPoint PPT Presentation

Using Big Data technologies to uncover genetic causes of Amyotrophic lateral sclerosis Dr Natalie Twine | Transformational Bioinformatics | @nat_twine 11 October 2017 HEATH & BIOSECURITY Genomics will outpace other BigData


  1. Using Big Data technologies to uncover genetic causes of Amyotrophic lateral sclerosis Dr Natalie Twine | Transformational Bioinformatics | @nat_twine 11 October 2017 HEATH & BIOSECURITY

  2. Genomics will outpace other BigData disciplines Astronomy Genomics YouTube Twitter Stephens et al. PLOS Biology 2015 Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  3. Population-scale genomic data analysis requires BigData solutions Desktop compute High-performance Hadoop/Spark compute cluster compute cluster Focus small data Compute-intensive Data-intensive Fault tolerant No No Yes Node-bound Yes Yes No Parallelization 10 CPU 100 CPU 1000 CPU Parallelization bespoke bespoke standardized procedure CSIRO solution Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  4. Amyotrophic lateral sclerosis (ALS) • ALS is a devastating motor neurone disease • Leads to death within 3 years • Affects more than 200,000 people worldwide • Causes largely unknown – genetic component • Project MinE - sequencing 15,000 ALS genomes worldwide Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  5. What is our hypothesis?  Sporadic cases are potentially related but separated over generations • ALS is reported to be 5% familial and 95% sporadic • Familial component is potentially higher than 5% • Australia is a small population and disease is late onset What are our aims ?  Uncover hidden patient relationships to increase detection power  Identify novel disease causing variants • Datasets available: GOI – Exomes (Familial, n=137) – WGS (Sporadic, n=800) – Project MinE WGS (Sporadic, n=15,000) SNP A SNP B SNP C Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  6. Existing methods for measuring relatedness • PLINK ( Chang CC et al. GigaScience, 2015 ) • KING ( Manichaikul A et al. Bioinformatics, 2010 ) • SNPduo, ERSA, GRAB, XIBD (etc.) • Limitations of these tools – Designed to identify and remove relatives as part of GWAS workflow – Identifying more distant relatives is challenging – Tools effective at distant relationship detection are SLOW • We want to expand existing family structures – Identify more distant relationships with confidence Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  7. Relatedness between ALS patients using KING • KING identifies close relationships • 172 Familial and Sporadic ALS Exomes Each blue dot represents a relationship between a pair of ALS patients. Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  8. Relatedness between ALS patients using KING n=172 ( 137 Familial and 35 Sporadic) Degree of relationship Number True positives False positives Unknown Duplicates 6 6 (100%) 0 0 1st degree 33 33 (100%) 0 0 2nd degree 23 23 (100%) 0 3rd degree 27 12 (44%) 9 (33%) 6 (22%) 4th degree 1310 n/a n/a n/a 5th degree 7852 n/a n/a n/a • KING identified 8 novel relationships , at 3 rd degree • 3 we have ruled out as false positives. • 5 are potentially REAL as can’t be classified as FP (no mutation status known). Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  9. How can we improve on this result? • True positive rate - 44% - this needs to improve • Whole Genome Sequencing >> Exome (SNP density) • WGS (n=800) cohort has 42 million variants • High density data - > more informative • BUT - Existing tools struggle/fail with Big Data volumes • We are implementing relatedness testing in VariantSpark • to identify novel relationships in – 800 WGS Sporadic and Familial ALS – 15,000 WGS samples (Project MinE) Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  10. BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)  Bringing BigLearning to genomics applications. z  VariantSpark learns from 3000 individuals and 80 million mutations in under 30 minutes  Association testing Speed  Clustering  Classification Accuracy Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  11. Using VariantSpark to identify relatedness • Data-driven rather than model-driven approach • VariantSpark can handle 80 million variants x 3000 individuals • What is the genetic distance between samples ? (allele sharing) • Euclidean distance • Identity by descent (IBD) (as in PLINK) • Sliding window for IBD segments (as in ERSA) • Include data from 1000 Genomes as controls (family and ancestry known) • Approaches currently being tested for feasibility Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  12. Testing using different distance measures Euclidean distance 1 accuracy distance (IBD) 0.4 4 5 UR 1 5 10 1 2 3 Degree of relationship Degree of relationship Exomes (n= 137 Familial ALS) Euclidean distance performs well until 4 th degree • • Plink (IBD) performs better than other distances Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  13. Effective methods are compute intensive Ramstetter et al., Genetics 2017 • IBD segment based methods most accurate for more distant relatives • BUT – They are also are most compute intensive (a.k.a SLOW) Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  14. Next steps in tool development • Simulate a large pedigree using whole genome data • Calculate genetic distance between each sample • Measure performance of different distance measures  sensitivity and specificity (AUC) • Generate relationship degree metrics from simulated cohort • Implement these methods in VariantSpark  speed and scalability proof of principle cohort Identify novel relationships in Sporadic ALS WGS (n=800) Familial ALS WGS (n=89) Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  15. VariantSpark application – genetic association Bone Mineral Density (BMD) as the phenotype; 1,936 individuals with 7.2 Million variants (imputed from array). • Joint-loci analysis (machine learning - random forest) • Replicate known BMD genes identified by traditional GWAS (single loci regression). • Amplify signal over traditional methods so smaller cohorts give robust insights • Random forests identifies interaction of 2 or more loci • We will use this methodology to identify novel & modulating ALS variants

  16. In summary: BigLearning to understand ALS Identify related Personalised individuals treatment Novel disease- Preventative causing variants measures Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

  17. Transformational Bioinformatics Team Aidan O’Brien Denis Bauer Natalie Twine Rob Dunne Piotr Szul Kaitao Lai Oscar Luo Laurence Wilson Collaborators Software Ian Blair Kelly Williams Emily McCann Jenn Fifita Adrian White Mia Champion News Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

Recommend


More recommend