Speeding Genetic Discovery in Autism through the iHART Information Commons J. Jung, N. Stockham, K. Paskov, A. Gupta, M. Sun, D. Wall Stanford University March 2018
Exponential Growth of Sequenced Genomes Stephens ZD et al., “Big Data: Astronomical or Genomical?”, PLoS Biol 2015, https://doi.org/10.1371/journal.pbio.1002195
Computing Capacity & Network Bandwidth [Chart: Number of Transistors per Core by year, 2004 to 2018, y-axis 0 to 700,000,000; labeled points: Core2 Duo, Xeon 7400, AMD Epyc] - Network bandwidth is still 10 Gbps max, just like 10 years ago - But cloud computing enables us to use as many cores as needed Source: https://en.wikipedia.org/wiki/Transistor_count
Impact of Cloud Computing in Data Sharing - Past: Everything was packed in one bottle - Data, data storage, access tool, and computing facility - Had to use the fixed platform, or - Had to copy everything and build a new one - Now: loosely coupled in the cloud - Can access remote data locally with a group of different tools - Almost no need for on-premises infrastructure
Precision Medicine and Cloud Computing - Genetic profiling needs >5K cases for low-OR (odds ratio) disorders - Larger sample sizes give better precision - Data integration needed for accelerated discovery - Cloud computing provides cost-effective solutions Source: “Quantifying realistic sample size requirements for human genome epidemiology”, Int J Epidemiol 2009
iHART: Autism Research & Technology Initiative - Whole genome sequencing of families with autistic children - Illumina HiSeqX, target coverage 30x - Phase I Dataset: about to submit a main paper - 2,308 individuals in 493 (multiplex) families - 94 MZ twins (89 ASD), 445 female autistic children - 750 TB of BAMs and 1 TB of VCFs - Phase II Dataset - Another 2,254 individuals in 567 (multiplex) families
Genetic Findings in Autism Spectrum Disorder - Many syndromic genes - No strong GWAS signals - Rare CNVs & de novo variants in multiple pathways Source: “Convergence of genes and cellular pathways dysregulated in autism spectrum disorders”, Am J Hum Genet, 2014
Preliminary Findings with iHART Cohort - Phase I Dataset has been analyzed - Novel ASD-risk genes with inherited protein-truncating mutations - Novel structural variants and confirmation of known variants - Phase II Dataset analysis is ongoing
What Data should be Shared? - Annotated VCF for the whole cohort - Primary source for the variants and their annotation - Per-sample gVCF calls - Required to do joint genotyping collectively with other data - Pedigree and phenotype traits - For trio/quad analyses and association studies - Raw read BAM files - For confirming calls, de novo analysis, and realigning
iHART Data Sharing Scheme - Data and access platforms are in the cloud - Accessing variant data and their annotations - Virtual machines and shared data disk with VCF - SQL-like search engine for VCF data access - Data overview with Web UI: Illumina CaseLog - Accessing raw reads - Direct visualization of BAMs with IGV - Traditional data copy via AWS S3 and Globus
iHART Website for Data Access Request - Secure and traceable data access using cloud account
Virtual Machine & Shared Data Disk on Cloud - Tailored Virtual Machine Image (AMI) - Open to users who have been granted access - Maintains popular genomic processing tools - Third-party AMIs and Docker images will be available - Shared Data Snapshot (Volume) - Contains main genome reference files and iHART VCF - Multiple users can use the same data individually - Can be attached as a local disk when an instance is launched
SQL-like Engine on VCF Data: Amazon Athena - Can query on variants, annotation, pedigree, and phenotypes
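The kind of query this slide describes, joining variants with pedigree and phenotype tables, might look like the sketch below. It runs against an in-memory SQLite database as a local stand-in for Athena. The table names, column names, and toy rows (`GENE_A`, `S1`, `F1`, and so on) are all hypothetical and do not reflect the actual iHART schema or data.

```python
import sqlite3

# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (sample_id TEXT, chrom TEXT, pos INTEGER,
                       ref TEXT, alt TEXT, gene TEXT, consequence TEXT);
CREATE TABLE pedigree (sample_id TEXT, family_id TEXT, role TEXT);
CREATE TABLE phenotypes (sample_id TEXT, asd_status TEXT);
""")
conn.executemany("INSERT INTO variants VALUES (?,?,?,?,?,?,?)", [
    ("S1", "chr1", 1000, "C", "T", "GENE_A", "stop_gained"),
    ("S2", "chr1", 1000, "C", "T", "GENE_A", "stop_gained"),
    ("S3", "chr2", 2000, "G", "A", "GENE_B", "missense_variant"),
    ("S4", "chr1", 1500, "A", "G", "GENE_A", "stop_gained"),
])
conn.executemany("INSERT INTO pedigree VALUES (?,?,?)", [
    ("S1", "F1", "child"), ("S2", "F2", "child"),
    ("S3", "F3", "child"), ("S4", "F1", "parent"),
])
conn.executemany("INSERT INTO phenotypes VALUES (?,?)", [
    ("S1", "affected"), ("S2", "affected"),
    ("S3", "affected"), ("S4", "unaffected"),
])

# Count families in which an affected individual carries a
# protein-truncating (stop_gained) variant, grouped by gene.
QUERY = """
SELECT v.gene, COUNT(DISTINCT p.family_id) AS n_families
FROM variants v
JOIN pedigree p ON p.sample_id = v.sample_id
JOIN phenotypes ph ON ph.sample_id = v.sample_id
WHERE v.consequence = 'stop_gained' AND ph.asd_status = 'affected'
GROUP BY v.gene
ORDER BY n_families DESC, v.gene
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The SELECT statement uses only standard joins and aggregation, so a similar statement should carry over to Athena once the table and column names match the real catalog.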
Data Overview with Web UI: Illumina CaseLog
Direct Visualization of BAMs with IGV - With presigned URL of BAMs, IGV can show reads without BAM file downloading
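A minimal sketch of how such presigned URLs could be generated with boto3's `generate_presigned_url`. The helper names, the bucket and key in the usage comment, and the assumption that the index sits next to the BAM as `<key>.bai` are illustrative and not taken from the iHART setup.

```python
def presign_bam(s3_client, bucket, bam_key, expires_s=3600):
    """Return time-limited GET URLs for a BAM and its index.

    Assumes the index is stored alongside the BAM as '<bam_key>.bai'.
    """
    def sign(key):
        return s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_s,  # URL lifetime in seconds
        )

    return {"bam": sign(bam_key), "index": sign(bam_key + ".bai")}


def make_s3_client():
    """Create a real S3 client; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so presign_bam has no hard dependency
    return boto3.client("s3")


# Usage (hypothetical bucket and key):
#   urls = presign_bam(make_s3_client(), "my-bucket", "bams/sample1.bam")
#   print(urls["bam"], urls["index"])
```

Both URLs can then be supplied to IGV when loading a track from a URL, so that reads are fetched by region rather than by downloading the whole file.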
Traditional Data Copy - Amazon S3 - Multiple concurrent downloads possible - Popular files will be on CDN for low latency / high transfer speed - Globus (planned) - More reliable and faster than other traditional methods (e.g., FTP or SCP) - Online monitoring feature available - Works on different platforms, including clouds and HPC clusters