How we built a global search engine for genetic data
Miro Cupak
VP Engineering, DNAstack
13/06/2018
@mirocupak

What and why?
Beacon Network
• https://beacon-network.org/
• largest search and discovery engine of human genetic mutations
• from the Global Alliance for Genomics & Health (GA4GH)
• case study
  • standard
  • problem
  • architecture
  • fun with stats
  • technologies

Background

https://beacon-network.org

Trends
• sequencing cost decreasing exponentially (3M times since 2000)
https://www.nature.com/news/technology-the-1-000-genome-1.14901

Trends
• genomic data volume increasing exponentially (1M times since 2000)
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Trends
• up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
[Figure: Data Volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube and Genomics]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Problem
• no single institution will have sufficient resources
• still, institutions don’t have enough data
  • common diseases
  • rare diseases
• challenge
  • discovering data
• solution
  • traditional approach of data aggregation in a single centralized site is not working
  • federated system capable of executing cross-dataset and cross-institution queries is needed

GA4GH & Beacon Project
• nonprofit standards organization
• a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
• goal: enable responsible sharing of genomic and clinical data
• established in 2013
http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
• experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
• initiative requiring collaboration of many different GA4GH groups
• started in 2014 and quickly gained traction
https://beacon-project.io/

Beacon

Beacon
• simple web service allowing users to query an institution’s databases to determine whether they contain a genetic variant of interest
• receives questions of the form "Do you have information about this mutation?"
• responds with yes or no, optionally with additional information about the mutation
• design principles
  • A beacon has to be technically simple.
  • A beacon has to minimize risks associated with genomic data sharing.
  • It has to be possible to make a beacon publicly available.
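The yes/no contract described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Query` type, the `has_variant` function and the in-memory `KNOWN_VARIANTS` set are hypothetical names, not part of any Beacon specification, and a real beacon would sit behind a web API and query the institution's database.

```python
from typing import NamedTuple


class Query(NamedTuple):
    """A question of the form 'do you have information about this mutation?'"""
    chromosome: str  # e.g. "13"
    position: int    # position on the chromosome
    allele: str      # alternate base, e.g. "C"
    assembly: str    # reference genome build, e.g. "GRCh37"


# Hypothetical data held by one institution (stand-in for a real database).
KNOWN_VARIANTS = {
    Query("13", 32936732, "C", "GRCh37"),
}


def has_variant(query: Query) -> bool:
    """Answer yes (True) or no (False) and reveal nothing else about the data."""
    return query in KNOWN_VARIANTS
```

Returning only a boolean keeps the service technically simple and limits what a caller can learn about the underlying dataset.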

Standard: Before Beacon Network
• no formal specification
• receives questions of the form "Do you have information about this mutation?"
• responds with yes or no
• 4 public beacons, each with a different API
  • request method
  • supported parameters
  • parameter names
  • chromosome identifiers
  • positional base
  • assembly notation
  • supported alleles
  • dataset support
  • response format
  • data included in the response

Standard: Before Beacon Network

Standard: 0.1
• 2014
• really simple (2 records)
• true/false response
• format: Avro
• not enough traction
• too vague
• issues partially addressed by the Beacon Network

Standard: 0.2
• 2015
• true/false/overlap/null response
• datasets
• simple data use conditions
• self description
• format: Avro
• complex (9 records)
• not well adopted
• not polished enough

Standard: 0.3
• 2016
• simplified 0.2
• based on real needs, successful
• true/false/null response
• data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
• modular and extensible
• tooling
• format: Avro → Proto3

Standard: 0.4
• 2018
• stable and more flexible
• support for more complex mutations
• improved error handling
• improved data use conditions
• various minor improvements
• developer experience
• format: Proto3 → OpenAPI

Beacon Network

Architecture

Data
• access to data stored in a relational database

Service
• communication with other subsystems
• query normalization
• aggregators
• participant resolution
• query distribution
• audit trail
• L1 parallelization
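Query normalization, query distribution and the first level of parallelization could look roughly like the sketch below. It assumes each beacon can be modeled as a plain Python callable that returns True, False or None; the names `normalize`, `query_one` and `distribute` are illustrative and do not come from the Beacon Network code base.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# Assumption: a beacon is a function from a query dict to True/False/None (no answer).
Beacon = Callable[[dict], Optional[bool]]


def normalize(query: dict) -> dict:
    """Normalize user input: drop a 'chr' prefix and upper-case the allele."""
    chrom = str(query["chromosome"]).strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return {**query, "chromosome": chrom.upper(), "allele": str(query["allele"]).upper()}


def query_one(beacon: Beacon, query: dict) -> Optional[bool]:
    """Call a single beacon; a failing beacon yields None instead of breaking the search."""
    try:
        return beacon(query)
    except Exception:
        return None


def distribute(query: dict, beacons: Dict[str, Beacon]) -> Dict[str, Optional[bool]]:
    """Send the normalized query to every participating beacon in parallel."""
    normalized = normalize(query)
    with ThreadPoolExecutor(max_workers=max(1, len(beacons))) as pool:
        futures = {
            name: pool.submit(query_one, beacon, normalized)
            for name, beacon in beacons.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Isolating each call in `query_one` means one slow or broken participant only affects its own entry in the result.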

Processor
• executing a query against a beacon and processing its response
• management of a flexible, dynamic and easily extensible query execution pipeline
• resolution of pipeline stages (CDI and EJB)
• L2 parallelization
• cross-assembly query handling
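One way to picture pipeline stage resolution is a set of default stages that individual beacons can override. The slide attributes this to CDI and EJB; the dictionary lookup below is a plain Python stand-in for that mechanism, and every name in it (`DEFAULT_STAGES`, `OVERRIDES`, `legacy-beacon`, the stub stages) is hypothetical.

```python
from typing import Any, Callable, Dict, List

Stage = Callable[[Any], Any]

# Default stage for each pipeline step, listed in execution order (all stubs).
DEFAULT_STAGES: Dict[str, Stage] = {
    "converter": lambda query: dict(query),
    "requester": lambda params: "query?" + "&".join(
        f"{key}={value}" for key, value in sorted(params.items())
    ),
    "fetcher": lambda request: "YES",  # stub instead of a network call
    "parser": lambda raw: raw == "YES",
}

# Per-beacon overrides: a beacon lists only the stages that differ from the defaults.
OVERRIDES: Dict[str, Dict[str, Stage]] = {
    "legacy-beacon": {
        "fetcher": lambda request: "1",
        "parser": lambda raw: raw == "1",
    },
}


def resolve_stages(beacon_id: str) -> List[Stage]:
    """Pick the stage implementations for one beacon, falling back to the defaults."""
    merged = {**DEFAULT_STAGES, **OVERRIDES.get(beacon_id, {})}
    return [merged[step] for step in DEFAULT_STAGES]  # keeps execution order


def execute(beacon_id: str, query: dict) -> Any:
    """Run the query through the resolved stages, feeding each output to the next stage."""
    value: Any = query
    for stage in resolve_stages(beacon_id):
        value = stage(value)
    return value
```

Adding support for a new beacon then means registering only the stages where it deviates from the defaults.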

Converter
• first stage in the query execution pipeline
• translating query parameters
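Parameter translation covers the kinds of differences listed earlier for pre-standard beacons, such as chromosome identifiers, positional base and assembly notation. The sketch below assumes a canonical query with a bare chromosome name and a 0-based position; the `BeaconConventions` type and the `convert` function are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconConventions:
    """How one hypothetical beacon expects its parameters."""
    chromosome_prefix: str = ""  # e.g. "chr" if the beacon wants "chr13"
    zero_based: bool = False     # True if the beacon counts positions from 0
    assembly_names: tuple = ()   # pairs of (canonical name, beacon-specific name)


def convert(query: dict, conv: BeaconConventions) -> dict:
    """Translate a canonical query (bare chromosome, 0-based position) for one beacon."""
    assembly_map = dict(conv.assembly_names)
    return {
        "chromosome": conv.chromosome_prefix + query["chromosome"],
        # Shift to 1-based coordinates for beacons that do not count from 0.
        "position": query["position"] if conv.zero_based else query["position"] + 1,
        "allele": query["allele"],
        "assembly": assembly_map.get(query["assembly"], query["assembly"]),
    }
```

For example, a beacon that expects `chr13`, 1-based positions and `hg19` instead of `GRCh37` would be described as `BeaconConventions("chr", False, (("GRCh37", "hg19"),))`.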

Requester
• second stage in the query execution pipeline
• constructing beacon requests based on their URIs and parameters produced by the converters
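Building a request from a beacon URI and converted parameters can be sketched with the standard library's `urllib.parse`. The function name `build_request_url`, the example host and the parameter names are assumptions made for illustration; real beacons differ in request method and parameter naming.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request_url(base_uri: str, params: dict) -> str:
    """Append converted parameters to a beacon URI, keeping any query string it already has."""
    parts = urlsplit(base_uri)
    query = urlencode(params)  # percent-encodes values and joins them with '&'
    if parts.query:
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
```

A call such as `build_request_url("https://beacon.example.org/query", {"chrom": "13", "pos": 32936732})` produces a GET-style URL that a fetcher could then submit.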

Fetcher
• third stage in the query execution pipeline
• unit actually talking to the API of beacons
• submitting requests over the network and obtaining the raw response

Parser
• last stage in the pipeline
• extracting information of interest from the raw response obtained by a fetcher
• dealing with various formats
• handling metadata, multiple responses, errors
• response normalization
• parallelized
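Response normalization can be illustrated by mapping differently shaped raw responses onto a single true/false/null answer. The sketch assumes beacons reply either with plain text or with JSON; the field names `exists` and `response` are hypothetical examples, not taken from a particular beacon or specification version.

```python
import json
from typing import Optional

_TRUE = {"true", "yes", "y", "1"}
_FALSE = {"false", "no", "n", "0"}


def _to_answer(value) -> Optional[bool]:
    """Map one raw value to True, False or None (unknown)."""
    if isinstance(value, bool):
        return value
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    return None


def parse_response(raw: str) -> Optional[bool]:
    """Normalize a raw beacon response; anything unrecognized becomes None."""
    try:
        body = json.loads(raw)
    except ValueError:
        return _to_answer(raw)  # not JSON: treat it as a plain-text answer
    if isinstance(body, dict):
        for key in ("exists", "response"):  # hypothetical field names
            if key in body:
                return _to_answer(body[key])
        return None  # e.g. an error object without an answer
    return _to_answer(body)
```

Treating unparseable or error responses as None keeps a single misbehaving beacon from being reported as a definite "no".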

Mapper
• translation between different representations of objects

REST
• handling client requests
• data serialization

Search execution

Stats

Size
• 100 installations
• 40 institutions
• 18 countries
• 6 continents

Users
• 13k users
• 136 countries

Searches

Assemblies
• GRCh37: 83%
• GRCh38: 6%
• Others: 11%

Chromosomes
• Chr. 2: 18%
• Chr. 17: 14%
• Chr. 1: 11%
• Chr. 13: 11%
• Chr. 7: 7%
• Others: 39%

Variants
• 84k distinct mutations
• 13 : 32936732 C (BRCA2): 6%
• 2 : 38938 C (FAM110C): 6%
• 2 : 45895 G (FAM110C): 3%
• 22 : 46546565 A (PPARA): 3%
• 7 : 140453136 C (BRAF): 2%
• 1 : 43815163 C (MPL): 2%
• 1 : 115258747 A (NRAS): 1%
• 14 : 23894969 A (MYH7): 1%
• 2 : 29432776 C (ALK): 1%
• 2 : 212289100 C (ERBB4): 1%
• Others: 74%

Deleteriousness
• SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
• PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
[Figures: one histogram per predictor; x-axis: Score (0.00 to 0.98), y-axis: Number of variants]

Rarity
• 25% rare variants (1,000 Genomes Project)
[Figure: histogram; x-axis: Allele frequency (0.00 to 0.99), y-axis: Number of variants]

Genes
1. FAM110C: Family With Sequence Similarity 110 Member C (11%)
2. BRCA1: BRCA1, DNA Repair Associated (10%)
3. BRCA2: BRCA2, DNA Repair Associated (9%)
4. PPARA: Peroxisome Proliferator Activated Receptor Alpha (4%)
5. ERBB4: Erb-B2 Receptor Tyrosine Kinase 4 (3%)
6. BRAF: B-Raf Proto-Oncogene, Serine/Threonine Kinase
7. MPL: MPL Proto-Oncogene, Thrombopoietin Receptor
8. MYH7: Myosin Heavy Chain 7
9. KIT: KIT Proto-Oncogene Receptor Tyrosine Kinase
10. RET: Ret Proto-Oncogene
• BRAF, MPL, MYH7, KIT and RET: between 1% and 3% each
• Others: 53%

Disorders & clinical abnormalities
OMIM
1. Pancreatic cancer, susceptibility to, 4
2. Breast-ovarian cancer, familial, 1
3. Fanconi anemia, complementation group D1
4. Prostate cancer
5. Pancreatic cancer 2
6. Medulloblastoma
7. Glioblastoma 3
8. Breast-ovarian cancer, familial, 2
9. Breast cancer, male, susceptibility to
10. Wilms tumor
HPO
1. Autosomal dominant inheritance
2. Autosomal recessive inheritance
3. Scoliosis
4. Short stature
5. Cognitive impairment
6. Constipation
7. Somatic mutation
8. Cafe-au-lait spot
9. Failure to thrive
10. Nausea and vomiting

Questions?
https://mirocupak.com
