SDSC 2013 Summer Institute
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD, San Diego Supercomputer Center, jpon@sdsc.edu
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Big Data: “the zillionics realm”
In 2012, the amount of data created and replicated was:
• 2,800,000 petabytes (PB)
• 2,800,000,000 terabytes (TB)
• 2,800,000,000,000 gigabytes (GB)
• 2,800,000,000,000,000 megabytes (MB)
• 2,800,000,000,000,000,000 kilobytes (KB)
• 2,800,000,000,000,000,000,000 bytes
• E-mails sent per year: 2,986 PB
• Facebook’s content uploaded per year: 183 PB
• Google’s search index: 98 PB
• Kaiser’s medical records: 31 PB
• YouTube’s video uploaded per year: 15 PB
• Large Hadron Collider's annual data output: 15 PB
• National Climatic Data Center Database: 6 PB
• Library of Congress digital collection: 5 PB
• Nasdaq stock market database: 3 PB
Data from Wired, May 2013
The number of sequences in NCBI GenBank April 2013 GenBank 164,136,731 sequences 594 GB (0.0006 PB)
High-throughput sequencing: the Illumina HiSeq 2500 can sequence a human genome, 20 exomes, or 150 RNA-seq samples per day. The NCBI Sequence Read Archive contains 4.1 PB of sequence reads.
Gene expression data: 0.05 PB
• ArrayExpress database (at EBI): 1,104,037 assays (0.014 PB of archived data)
• NCBI GEO database: 931,577 samples
Metabolic & Signaling pathways
Protein-protein interaction & Transcriptional regulatory networks & Host-pathogen interactions
• KEGG: 245,393 pathways
• REACTOME: 25,000+ pathways, 70,000+ reactions
• PathGuide (survey of 325 pathway resources): 205+ million reactions for 28+ million genes/proteins (0.01 PB)
The image of interactions among interaction databases is from pathguide.org
• E-mails sent per year: 2,986 PB
• Facebook’s content uploaded per year: 183 PB
• Google’s search index: 98 PB
• Kaiser’s medical records: 31 PB
• YouTube’s video uploaded per year: 15 PB
• Large Hadron Collider's annual data output: 15 PB
• National Climatic Data Center Database: 6 PB
• Library of Congress digital collection: 5 PB
• Molecular Biology data in public databases: ~4.5 PB
• Nasdaq stock market database: 3 PB
Data from Wired, May 2013
Data: Sequences, Networks, Variations, Publications, Annotations, Taxonomies, Epigenetic Data, Expression Data, Structures, Biochemical Data
Data (as listed above) are spread across Databases and Web pages (2,000+). How can a molecular biologist embrace such an amount of data in its entirety?
Data (as listed above), Databases and Web pages (2,000+), Data Integration Resources
Data (as listed above), Databases and Web pages (2,000+), Data Integration Resources: this leaves a molecular biologist to work with partial, incomplete data sets!
Data (as listed above), Databases and Web pages (2,000+): Biological Ontologies, the Semantic Web technologies, Data Warehouse
Database web pages Other web pages
Database web pages, other web pages: for each ontological term A and page X, calculate the relevance score of X to A.
Part of the multiple alignment ontology (MAO); Thompson et al., 2005, NAR, PMID: 16043635. Grey boxes represent concepts and colored arrows represent relationships: red, is_a; blue, part_of; green, is_attribute.
Mao.obo
Database web pages, other web pages:
• For each ontological term A and page X, calculate the relevance score of X to A. Calculate a rank of each page.
• Automatically extract data and map them into the internal database schema.
[Architecture diagram: User Community; Web-portal & API (integromeDB.org); Java-application (BiologicalNetworks.org); IntegromeDB; Public Data on the Web; User’s Private Data]
integromedb.org
Integromedb.org visit statistics (7/16/2012 – 7/15/2013)
Real problems of data integration: #1
• Collecting data and maintaining data consistency: it is infeasible to collect data from thousands of data sources, either by downloading them or via web crawling. If a crawler sends a request to a website every 10 seconds, downloading the entire GenBank (163 million sequence records) from one IP address would require about 50 years.
[Diagram: Data Mapping and Integration; Web Crawler; Tables; Texts; Large Databases; Web pages; Databases]
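The 50-year estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one record per request and strictly sequential requests; the function name `crawl_years` is made up for illustration.

```python
# Back-of-the-envelope check of the crawl-time estimate on the slide.
# Assumption: one sequence record is fetched per request, one request at a time.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def crawl_years(num_records: int, seconds_per_request: float) -> float:
    """Return the number of years needed to fetch all records sequentially."""
    return num_records * seconds_per_request / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = crawl_years(163_000_000, 10)
    print(f"{years:.1f} years")
```

The result lands slightly above 50 years, which matches the order of magnitude quoted on the slide.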
Real problems of data integration: #1 (solution)
• Hybrid approach: large databases (100,000+ web pages) are downloaded as SQL, XML, or RDF models, while all other resources on the web are reached via web crawling.
Real problems of data integration: #2
• Data mapping:
– No unified biological ontology (there are conflicts and inconsistencies across ontologies)
– Data are heterogeneous (data integration requires mapping various types of data onto a set of stable gene and protein IDs)
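The hybrid rule can be sketched as a simple decision function. Only the 100,000-page threshold and the three dump formats come from the slide; the `Source` class, its fields, and the example sources below are hypothetical and are not part of IntegromeDB.

```python
from dataclasses import dataclass

# Threshold from the slide: sources with 100,000+ web pages count as "large".
LARGE_SOURCE_PAGES = 100_000
# Dump formats named on the slide.
DUMP_FORMATS = ("SQL", "XML", "RDF")


@dataclass
class Source:
    """Hypothetical description of one data source."""
    name: str
    page_count: int
    available_dumps: tuple = ()  # subset of DUMP_FORMATS, if the source offers dumps


def acquisition_plan(source: Source) -> str:
    """Return 'download:<format>' for large sources with a dump, else 'crawl'."""
    if source.page_count >= LARGE_SOURCE_PAGES:
        for fmt in DUMP_FORMATS:
            if fmt in source.available_dumps:
                return f"download:{fmt}"
    return "crawl"


if __name__ == "__main__":
    # Made-up sources, for illustration only.
    print(acquisition_plan(Source("big-db", 5_000_000, ("XML",))))  # download:XML
    print(acquisition_plan(Source("lab-page", 40)))                 # crawl
```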
Real problems of data integration: #2 (solution) • IntegromeDB Ontology: The Open Biomedical Ontologies (OBO) consortium and the National Center for Biomedical Ontology (NCBO) provide the mapping among different ontologies. This mapping (for 120 ontologies) was used to develop IntegromeDB Ontology.
Real problems of data integration: #2 (solution)
• Each web page is integrated in the database as is, along with two calculated scores: PageRank and the Lucene score. The latter is calculated for each word from the database dictionaries (object names, IDs, synonyms) and IntegromeDB Ontology.
• The downloaded files (SQL, XML, and RDF) are mapped to the IntegromeDB database schema by transforming them into an RDF-compatible format. The mapping includes automatic determination of Node IDs, such as names and synonyms of biological entities, from a dictionary of 70 million gene aliases.
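Resolving a name or synonym to a node ID through an alias dictionary might look like the sketch below. The slides do not describe the actual lookup, so the case-insensitive matching, the function names, and the toy node IDs are all assumptions made for illustration.

```python
def build_alias_index(entries):
    """Build a case-insensitive index from alias to the set of node IDs using it.

    `entries` maps a node ID to its names and synonyms (toy data here).
    """
    index = {}
    for node_id, aliases in entries.items():
        for alias in aliases:
            index.setdefault(alias.casefold(), set()).add(node_id)
    return index


def resolve(index, name):
    """Return the node IDs whose aliases match `name`; empty set if none."""
    return index.get(name.strip().casefold(), set())


if __name__ == "__main__":
    # Hypothetical node IDs; the aliases are only examples.
    entries = {
        "node:1": ["TP53", "p53"],
        "node:2": ["BRCA1"],
    }
    index = build_alias_index(entries)
    print(resolve(index, " P53 "))    # {'node:1'}
    print(resolve(index, "unknown"))  # set()
```

An alias may be shared by several entities, which is why the index stores a set of node IDs rather than a single one.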
• The database is a PostgreSQL database modeled as a node- and edge-typed labeled meta-graph. Nodes are objects (proteins, ligands, molecular complexes, and genes), edges are relationships between objects (up/down regulation, molecular transport, molecular synthesis, enzymatic activity), and the labels are described by their own schema.
IntegromeDB Database
• Currently, the database integrates data from a billion web pages populated from molecular biology databases listed in the NAR depository, with 100 major databases being directly downloaded and manually mapped to the IntegromeDB schema.
• These manual mappings were used to train algorithms that automatically map SQL, XML, and RDF files.
[Architecture diagram: User Community; Web Interface; API; User Data; IntegromeDB RDBMS; Data Mapping and Integration; Web Crawler; Tables; Texts; Large Databases; Web pages; Databases]
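A node- and edge-typed graph of this kind can be illustrated in memory as follows. This is only a sketch: the real system is a PostgreSQL schema that the slides do not show, and the class names, fields, and example IDs here are hypothetical. The node and edge types are the ones listed on the slide.

```python
from dataclasses import dataclass, field

# Node and edge types taken from the slide.
NODE_TYPES = {"protein", "ligand", "molecular complex", "gene"}
EDGE_TYPES = {
    "up regulation", "down regulation", "molecular transport",
    "molecular synthesis", "enzymatic activity",
}


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str

    def __post_init__(self):
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str

    def __post_init__(self):
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")


@dataclass
class MetaGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Both endpoints must already exist in the graph.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def targets(self, node_id: str, edge_type: str = None) -> list:
        """Return IDs reachable from node_id, optionally filtered by edge type."""
        return [e.target for e in self.edges
                if e.source == node_id and (edge_type is None or e.edge_type == edge_type)]


if __name__ == "__main__":
    g = MetaGraph()
    g.add_node(Node("gene:A", "gene"))        # hypothetical IDs
    g.add_node(Node("protein:B", "protein"))
    g.add_edge(Edge("protein:B", "gene:A", "up regulation"))
    print(g.targets("protein:B", "up regulation"))  # ['gene:A']
```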
Real problems of data integration: #3
• User interface: which types of search to allow, and how to represent the search results on the web page?
• Search results for a gene or protein are organized on the web page into a dashboard of relevant attributes, grouped by data source and by pairwise similarity; similarity is defined using the normalized Levenshtein distance and an empirical threshold.
• The web pages are sorted by relevance. For each ontological term A and each page X integrated into the system, the relevance score of X to A is calculated as follows:
RL(X,A) = K1×PO(A) + K2×PP(X) + K3×PT(X) + K4×PR(X)
where K1-K4 are empirical coefficients; PO is the frequency of the term A in the ontology containing A; PP is the frequency of A in X (corresponds to the Lucene score); PT is the frequency of the term A in different HTML tag fields of the page X; PR is the PageRank of X.
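Both ingredients of this slide, the normalized Levenshtein distance and the weighted relevance score, can be sketched briefly. The slides give neither the normalization nor the values of K1-K4, so dividing by the longer string's length and the default coefficients of 1.0 are assumptions, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def relevance(po: float, pp: float, pt: float, pr: float,
              k=(1.0, 1.0, 1.0, 1.0)) -> float:
    """RL(X,A) = K1*PO(A) + K2*PP(X) + K3*PT(X) + K4*PR(X).

    The coefficients in `k` are placeholders, not the empirical values.
    """
    k1, k2, k3, k4 = k
    return k1 * po + k2 * pp + k3 * pt + k4 * pr


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(normalized_levenshtein("kitten", "sitting"))
    print(relevance(0.5, 0.2, 0.1, 0.3, k=(1, 2, 3, 4)))
```

Two attributes would then be grouped together when their normalized distance falls below the empirical threshold, and pages would be sorted by descending `relevance`.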
Real problems of data integration: #4
• Serving user requests instantly: large SQL join queries over a large relational database are time-consuming.
• A NoSQL solution based on the Hadoop architecture is under development. It will be used for two tasks: storing, updating, processing, and indexing crawled data; and fast access to data from the IntegromeDB relational database.