Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations (S100)
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University
What is metadata? • Data that describe data • Crucial for: • Finding experimental datasets online • Understanding how the experiments were performed • Reusing the data to perform new analyses AMIA 2017 | amia.org 2
Poor metadata
Variants of “age” found in metadata:
age; Age; AGE; `Age; age (after birth); age (in years); age (y); age (year); age (years); Age (years); Age (Years); age (yr); age (yr-old); age (yrs); Age (yrs); age [y]; age [year]; age [years]; age in years; age of patient; Age of patient; age of subjects; age(years); Age(years); Age(yrs.); Age, year; age, years; age, yrs; age.year; age_years
Poor metadata
An analysis of metadata from NCBI’s BioSample found many invalid values:
• 73% of “Boolean” values are not valid Booleans (e.g., nonsmoker, former-smoker)
• 26% of “integer” values are not valid integers (e.g., JM52, UVPgt59.4, pig)
• 68% of values that should be ontology terms are not (e.g., presumed normal, wild_type)
Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
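As an illustration of the type mismatches listed on this slide, the following minimal sketch checks whether a value parses as its declared type. The function names and the accepted Boolean spellings are assumptions made for this example; this is not the procedure used in the cited analysis.

```python
def is_boolean(value: str) -> bool:
    """Accept only literal true/false (assumed spelling rule for this sketch)."""
    return value.strip().lower() in {"true", "false"}


def is_integer(value: str) -> bool:
    """Return True when Python can parse the value as an integer."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


# Example values taken from the slide.
boolean_values = ["nonsmoker", "former-smoker"]
integer_values = ["JM52", "UVPgt59.4", "pig"]

invalid_booleans = [v for v in boolean_values if not is_boolean(v)]
invalid_integers = [v for v in integer_values if not is_integer(v)]
print(invalid_booleans)
print(invalid_integers)
```

Every example value from the slide fails its check, which is the kind of anomaly the percentages summarize.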
Metadata authoring is hard
Metadata template
• A computational platform for metadata management
• Goal: Overcome the impediments to creating high-quality metadata
[Figure: three-stage workflow]
DESIGN TEMPLATE → FILL IN METADATA → SUBMIT METADATA
• Actors: template authors (e.g., standards committees); metadata authors (e.g., scientists); public databases (the slide shows LINCS)
• Components: Template Designer; Metadata Editor; Metadata Repository
• Example values in the Metadata Editor screenshot: A sample study; Acute stress disorder; Stanford University; John Doe; Longitudinal
We developed a metadata recommendation system
Metadata recommendation system
[Figure: Metadata Editor, Metadata Repository, and Metadata Recommender connected in three numbered steps]
1. Store metadata (Metadata Editor to Metadata Repository)
2. Analyze existing metadata (Metadata Repository to Metadata Recommender)
3. Generate suggestions (Metadata Recommender to Metadata Editor)
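The store, analyze, and suggest loop can be sketched as follows. This is a simplified illustration, not CEDAR's actual algorithm: it assumes that suggestions for a field are ranked by how often each value appears among stored instances that agree with the fields already filled in. The function `suggest`, the in-memory `repository` list, and the sample values are all hypothetical.

```python
from collections import Counter


def suggest(stored_instances, filled_fields, target_field, top_k=3):
    """Rank values of target_field among stored instances that match filled_fields."""
    matching = [
        inst for inst in stored_instances
        if target_field in inst
        and all(inst.get(field) == value for field, value in filled_fields.items())
    ]
    counts = Counter(inst[target_field] for inst in matching)
    # most_common sorts by descending frequency.
    return [value for value, _ in counts.most_common(top_k)]


# Step 1: metadata already stored (hypothetical repository contents).
repository = [
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "blood", "disease": "lymphoma"},
]

# Steps 2 and 3: analyze stored metadata and suggest values for "disease"
# given that the author has already entered tissue = lung.
suggestions = suggest(repository, {"tissue": "lung"}, "disease")
print(suggestions)  # ['asthma', 'lung cancer']
```

Because the ranking depends on the fields already entered, the suggestions change as the author fills in the form, which is one way to read "context-sensitive".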
Filling in a CEDAR template
Evaluation workflow
[Figure: four-step evaluation pipeline]
• (1) Preprocessing: BioSample gene expression metadata → CEDAR BioSample template instances (≈ 35K)
• (2) Semantic annotation: → annotated BioSample template instances (≈ 35K)
• Each set of instances is split into an 80% training dataset and a 20% test dataset
• (3) Training and ingestion: CEDAR Metadata Repository and Metadata Recommender
• (4) Testing & Analysis: evaluation results
Evaluation workflow: test setup
• Suggestions evaluated for the “disease”, “sex”, and “tissue” fields
• Top 3 suggestions considered
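The 80%/20% division of template instances into training and test datasets can be sketched as below. The slides do not say how instances were assigned to each set, so the seeded random shuffle, the function name `split_instances`, and the placeholder instances are assumptions for illustration.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=42):
    """Shuffle a copy of the instances and split it into (training, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder instances standing in for the roughly 35K template instances.
instances = [{"id": i} for i in range(100)]
train, test = split_instances(instances)
print(len(train), len(test))  # 80 20
```

Keeping the seed fixed means the same instances land in the test set on every run, so results remain comparable across experiments.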
Testing & Analysis
• Compared suggested vs. expected metadata
• Measure: Reciprocal Rank (RR), appropriate for judging systems that return a ranking of suggestions when there is only one relevant result

$$\mathrm{RR} = \frac{1}{K}$$

where $K$ is the position of the expected result in the ranking of suggestions.
How is the RR calculated?

Expected       Suggested                      K    Reciprocal Rank (RR)
asthma         1) asthma                      1    1/1
               2) lung cancer
               3) respiratory disease
lymphoma       1) myeloma                     2    1/2
               2) lymphoma
               3) acute myeloid leukemia
lung cancer    1) respiratory disease         3    1/3
               2) asthma
               3) lung cancer

Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
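The worked example can be reproduced with a short sketch. The function names are illustrative, and scoring a missing expected value as 0 is a common convention assumed here; the slides do not say how misses were scored.

```python
from typing import Sequence, Tuple


def reciprocal_rank(expected: str, suggestions: Sequence[str]) -> float:
    """Return 1/K, where K is the 1-based position of expected in suggestions.

    Returns 0.0 if expected is absent (an assumed convention).
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases: Sequence[Tuple[str, Sequence[str]]]) -> float:
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    scores = [reciprocal_rank(expected, suggested) for expected, suggested in cases]
    return sum(scores) / len(scores)


# The three cases from the slide.
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]
mrr = mean_reciprocal_rank(cases)
print(round(mrr, 2))  # 0.61
```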
Results
[Figure: bar chart of Mean Reciprocal Rank (MRR), from 0 to 1, for the disease, tissue, and sex fields; Baseline vs. Metadata Recommender]
On average, Mean Reciprocal Rank (MRR):
• Metadata Recommender = 0.77
• Baseline (majority vote) = 0.31
Better performance with respect to the baseline for:
• Fields with many different values
• Templates with many correlated fields
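For comparison, a majority-vote baseline can be sketched as follows, assuming that "majority vote" means ranking a field's values by their overall frequency in the training data, regardless of what has been entered in other fields. The function name and the sample tissue values are hypothetical.

```python
from collections import Counter


def majority_vote_suggestions(training_values, top_k=3):
    """Return the top_k most frequent values seen for a field in training data."""
    counts = Counter(training_values)
    return [value for value, _ in counts.most_common(top_k)]


# Hypothetical training values for the "tissue" field.
training_tissue = ["blood", "liver", "blood", "lung", "blood", "liver"]
baseline = majority_vote_suggestions(training_tissue)
print(baseline)  # ['blood', 'liver', 'lung']
```

Such a baseline always returns the same ranking for a field, so it cannot adapt to correlated fields; that is consistent with the slide's observation that the recommender gains most on templates with many correlated fields.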
Summary
• We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
• Generates context-sensitive suggestions in real time
• Incorporates both ontology-based and free-text suggestions
Summary
Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets
• So that the datasets can be found, interpreted, and reused
• Essential to ensure scientific reproducibility
facebook.com/metadatacenter
@metadatacenter
Channel: Metadata Center
github.com/metadatacenter
http://cedar.metadatacenter.org