How we built a global search engine for genetic data
Miro Cupak
VP Engineering, DNAstack
13/06/2018
@mirocupak
What and why?
Beacon Network
• https://beacon-network.org/
• largest search and discovery engine of human genetic mutations
• from the Global Alliance for Genomics & Health (GA4GH)
• case study
  • problem
  • standard
  • architecture
  • fun with stats
  • technologies
Background
https://beacon-network.org
Trends
sequencing cost decreasing exponentially (3M times since 2000)
https://www.nature.com/news/technology-the-1-000-genome-1.14901
Trends
genomic data volume increasing exponentially (1M times since 2000)
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Trends
up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
[Figure: Data Volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube, and Genomics]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Problem
• no single institution will have sufficient resources
• still, institutions don't have enough data
  • common diseases
  • rare diseases
• challenge
  • discovering data
• solution
  • traditional approach of data aggregation in a single centralized site is not working
  • federated system capable of executing cross-dataset and cross-institution queries is needed
GA4GH & Beacon Project
• nonprofit standards organization
• a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
• goal: enable responsible sharing of genomic and clinical data
• established in 2013
http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
• experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
• initiative requiring collaboration of many different GA4GH groups
• started in 2014 and quickly gained traction
https://beacon-project.io/
Beacon
Beacon
• simple web service allowing users to query an institution's databases to determine whether they contain a genetic variant of interest
• receives questions of the form "Do you have information about this mutation?"
• responds with "yes" or "no", optionally with additional information about the mutation
• design principles
  • A beacon has to be technically simple.
  • A beacon has to minimize risks associated with genomic data sharing.
  • It has to be possible to make a beacon publicly available.
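The yes/no contract described above can be illustrated with a minimal sketch. It is a toy in-memory model, not the Beacon API: the `Variant` and `Beacon` names, the field names, and the `GRCh37` assembly label are assumptions made for illustration. Only the example position and allele come from the Variants slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single-allele query. All field names are illustrative assumptions."""
    assembly: str    # reference assembly, e.g. "GRCh37" (assumed here)
    chromosome: str  # e.g. "13"
    position: int    # coordinate of the queried base
    allele: str      # e.g. "C"


class Beacon:
    """Toy in-memory beacon that only reveals presence or absence."""

    def __init__(self, known_variants):
        self._known = frozenset(known_variants)

    def query(self, variant: Variant) -> bool:
        # The answer is a bare yes/no; no record-level data leaves the site.
        return variant in self._known


beacon = Beacon([Variant("GRCh37", "13", 32936732, "C")])
hit = beacon.query(Variant("GRCh37", "13", 32936732, "C"))
miss = beacon.query(Variant("GRCh37", "13", 32936732, "T"))
print(hit, miss)
```

Keeping the response to a single boolean is what makes the service both technically simple and low-risk to expose publicly.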
Standard: Before Beacon Network
• no formal specification
• receives questions of the form "Do you have information about this mutation?"
• responds with "yes" or "no"
• 4 public beacons, each with a different API
  • request method
  • supported parameters
  • parameter names
  • chromosome identifiers
  • positional base
  • assembly notation
  • supported alleles
  • dataset support
  • response format
  • data included in the response
Standard: Before Beacon Network
Standard: 0.1
• 2014
• really simple (2 records)
• true/false response
• format: Avro
• not enough traction
• too vague
• issues partially addressed by the Beacon Network
Standard: 0.2
• 2015
• true/false/overlap/null response
• datasets
• simple data use conditions
• self description
• format: Avro
• complex (9 records)
• not well adopted
• not polished enough
Standard: 0.3
• 2016
• simplified 0.2
• based on real needs, successful
• true/false/null response
• data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
• modular and extensible
• tooling
• format: Avro → Proto3
Standard: 0.4
• 2018
• stable and more flexible
• support for more complex mutations
• improved error handling
• improved data use conditions
• various minor improvements
• developer experience
• format: Proto3 → OpenAPI
Beacon Network
Architecture
Data
• access to data stored in a relational database
Service
• communication with other subsystems
• query normalization
• aggregators
• participant resolution
• query distribution
• audit trail
• L1 parallelization
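Query distribution with first-level parallelization can be sketched as a fan-out over a thread pool. This is an illustration only, not the Beacon Network implementation: the `distribute` function, the beacon ids, and the convention of recording a failed beacon as `None` are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(query, beacons, max_workers=8):
    """Send one query to every beacon in parallel and collect the answers.

    `beacons` maps a beacon id to a callable returning True, False, or None.
    A beacon that raises is recorded as None, so one failure does not sink
    the whole search.
    """
    def ask(item):
        beacon_id, call = item
        try:
            return beacon_id, call(query)
        except Exception:
            return beacon_id, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(ask, beacons.items()))


def unreachable(query):
    # Stand-in for a beacon that times out.
    raise TimeoutError("beacon unreachable")


# Hypothetical participants; real beacons would be remote services.
beacons = {
    "alpha": lambda q: q == ("13", 32936732, "C"),
    "beta": lambda q: False,
    "gamma": unreachable,
}
results = distribute(("13", 32936732, "C"), beacons)
print(results)
```

Threads suit this workload because each call mostly waits on the network, so slow beacons do not hold up the responsive ones.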
Processor
• executing a query against a beacon and processing its response
• management of a flexible, dynamic and easily extensible query execution pipeline
• resolution of pipeline stages (CDI and EJB)
• L2 parallelization
• cross-assembly query handling
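One way to picture the query execution pipeline is as four pluggable stages chained per beacon. The sketch below uses plain Python callables where the talk mentions CDI and EJB for stage resolution. The `Pipeline` class, the stage signatures, and the `beacon.example.org` URL are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Four stages, each swappable per beacon (signatures are assumptions)."""
    convert: Callable[[dict], dict]         # query -> beacon-specific parameters
    build_request: Callable[[dict], str]    # parameters -> request URL
    fetch: Callable[[str], str]             # request URL -> raw response
    parse: Callable[[str], Optional[bool]]  # raw response -> True / False / None

    def run(self, query: dict) -> Optional[bool]:
        params = self.convert(query)
        request = self.build_request(params)
        raw = self.fetch(request)
        return self.parse(raw)


# Stub stages so the sketch runs without any network access.
demo = Pipeline(
    convert=lambda q: {"chrom": q["chromosome"], "pos": q["position"], "allele": q["allele"]},
    build_request=lambda p: "https://beacon.example.org/query?chrom={chrom}&pos={pos}&allele={allele}".format(**p),
    fetch=lambda url: "YES" if "pos=32936732" in url else "NO",
    parse=lambda raw: {"YES": True, "NO": False}.get(raw),
)
found = demo.run({"chromosome": "13", "position": 32936732, "allele": "C"})
absent = demo.run({"chromosome": "13", "position": 1, "allele": "C"})
print(found, absent)
```

Because each stage is an independent value, supporting a beacon with an unusual API means swapping one stage rather than writing a new client.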
Converter
• first stage in the query execution pipeline
• translating query parameters
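Parameter translation might look like the following sketch, which covers three of the differences between beacons listed earlier: chromosome identifiers, positional base, and parameter names. The `convert` function and its options are hypothetical, and the normalized query is assumed to use bare chromosome names and 1-based positions.

```python
def convert(query, chromosome_prefix="", zero_based=False, names=None):
    """Translate a normalized query into one beacon's dialect.

    Assumes the normalized query uses bare chromosome names ("13") and
    1-based positions; both conventions are assumptions of this sketch.
    """
    names = names or {}
    position = query["position"] - 1 if zero_based else query["position"]
    converted = {
        "chromosome": chromosome_prefix + query["chromosome"],
        "position": position,
        "allele": query["allele"],
    }
    # Rename parameters to whatever this particular beacon expects.
    return {names.get(key, key): value for key, value in converted.items()}


normalized = {"chromosome": "13", "position": 32936732, "allele": "C"}
dialect = convert(
    normalized,
    chromosome_prefix="chr",
    zero_based=True,
    names={"chromosome": "chrom", "position": "pos"},
)
print(dialect)
```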
Requester
• second stage in the query execution pipeline
• constructing beacon requests based on their URIs and parameters produced by the converters
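Constructing a request from a beacon URI and converted parameters can be sketched with the standard library alone. The `build_request` helper and the `beacon.example.org` URLs are hypothetical. Only GET-style query strings are shown, although beacons also differed in request method.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request(base_uri, params):
    """Attach converted parameters to a beacon's base URI as a query string."""
    parts = urlsplit(base_uri)
    query = urlencode(params)
    if parts.query:
        # Keep any query string already present in the base URI.
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


url = build_request(
    "https://beacon.example.org/query",
    {"chrom": "13", "pos": 32936732, "allele": "C"},
)
url_with_existing_query = build_request(
    "https://beacon.example.org/q?ref=GRCh37", {"chrom": "13"}
)
print(url)
print(url_with_existing_query)
```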
Fetcher
• third stage in the query execution pipeline
• unit actually talking to the API of beacons
• submitting requests over the network and obtaining the raw response
Parser
• last stage in the pipeline
• extracting information of interest from the raw response obtained by a fetcher
• dealing with various formats
• handling metadata, multiple responses, errors
• response normalization
• parallelized
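Response normalization can be sketched as a function that reduces differently shaped raw responses to true, false, or null. The JSON field names `exists`, `response`, and `error` are assumptions for illustration, not the fields of any particular beacon or version of the standard.

```python
import json


def parse(raw):
    """Reduce a raw beacon response to True, False, or None (unknown/error)."""
    text = raw.strip()
    try:
        data = json.loads(text)
    except ValueError:
        data = text  # not JSON: treat the body as a plain-text answer

    if isinstance(data, dict):
        if data.get("error"):
            return None
        # Flat form {"exists": ...} or wrapped form {"response": ...}.
        data = data.get("exists", data.get("response"))
        if isinstance(data, dict):
            # Nested form, e.g. {"response": {"exists": true}}.
            data = data.get("exists")

    if isinstance(data, bool):
        return data
    if isinstance(data, str):
        answers = {"yes": True, "true": True, "no": False, "false": False}
        return answers.get(data.lower())
    return None


samples = ["YES", '{"exists": false}', '{"response": {"exists": true}}',
           '{"error": "unsupported assembly"}', ""]
normalized = [parse(sample) for sample in samples]
print(normalized)
```

Mapping anything unrecognized to `None` instead of raising keeps one malformed response from failing a search that spans many beacons.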
Mapper
• translation between different representations of objects
REST
• handling client requests
• data serialization
Search execution
Stats
Size
• 100 installations
• 40 institutions
• 18 countries
• 6 continents
Users
• 13k users
• 136 countries
Searches
Assemblies
• GRCh37: 83%
• GRCh38: 6%
• Others: 11%
Chromosomes
• Chr. 2: 18%
• Chr. 17: 14%
• Chr. 1: 11%
• Chr. 7, Chr. 13: 7%, 11%
• Others: 39%
Variants
• 84k distinct mutations
• 13:32936732 C (BRCA2): 6%
• 2:38938 C (FAM110C): 6%
• 2:45895 G (FAM110C): 3%
• 22:46546565 A (PPARA): 3%
• 7:140453136 C (BRAF): 2%
• 1:43815163 C (MPL): 2%
• 1:115258747 A (NRAS): 1%
• 14:23894969 A (MYH7): 1%
• 2:29432776 C (ALK): 1%
• 2:212289100 C (ERBB4): 1%
• Others: 74%
Deleteriousness
• SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
• PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
[Figures: number of variants by score (0.00 to 0.98), one histogram per predictor]
Rarity
• 25% rare variants (1,000 Genomes Project)
[Figure: number of variants by allele frequency (0.00 to 0.99)]
Genes
Rank  Symbol   Name
1     FAM110C  Family With Sequence Similarity 110 Member C
2     BRCA1    BRCA1, DNA Repair Associated
3     BRCA2    BRCA2, DNA Repair Associated
4     PPARA    Peroxisome Proliferator Activated Receptor Alpha
5     ERBB4    Erb-B2 Receptor Tyrosine Kinase 4
6     BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase
7     MPL      MPL Proto-Oncogene, Thrombopoietin Receptor
8     MYH7     Myosin Heavy Chain 7
9     KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase
10    RET      Ret Proto-Oncogene
[Figure: share by gene: FAM110C 11%, BRCA1 10%, BRCA2 9%, PPARA 4%, ERBB4 3%, BRAF, MPL, MYH7, KIT and RET 1% to 3% each, Others 53%]
Disorders & clinical abnormalities
Rank  OMIM                                      HPO
1     Pancreatic cancer, susceptibility to, 4   Autosomal dominant inheritance
2     Breast-ovarian cancer, familial, 1        Autosomal recessive inheritance
3     Fanconi anemia, complementation group D1  Scoliosis
4     Prostate cancer                           Short stature
5     Pancreatic cancer 2                       Cognitive impairment
6     Medulloblastoma                           Constipation
7     Glioblastoma 3                            Somatic mutation
8     Breast-ovarian cancer, familial, 2        Cafe-au-lait spot
9     Breast cancer, male, susceptibility to    Failure to thrive
10    Wilms tumor                               Nausea and vomiting
Questions?
https://mirocupak.com