Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations


  1. Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations (S100) Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A. Stanford University

  2. What is metadata? • Data that describe data • Crucial for: • Finding experimental datasets online • Understanding how the experiments were performed • Reusing the data to perform new analyses

  3. [Image-only slide]

  4. Poor metadata: the same "age" field written dozens of different ways • age • age [y] • Age • age [year] • AGE • age [years] • age in years • age (after birth) • age of patient • age (in years) • Age of patient • age (y) • age of subjects • age (year) • age(years) • age (years) • Age(years) • Age (years) • Age(yrs.) • Age (Years) • Age, year • age (yr) • age, years • age (yr-old) • age, yrs • age (yrs) • age.year • Age (yrs) • age_years
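
All of these strings denote the same thing, which is exactly what makes such metadata hard to search and aggregate. As a rough illustration only (a hypothetical normalization pass, not part of CEDAR or the work presented here), the sketch below shows how much cleanup is needed before a sample of these variants collapses to a single field name:

```python
import re

# Hypothetical cleanup (not part of CEDAR): a sample of the "age" variants above
# only line up after aggressive normalization.
RAW_FIELD_NAMES = [
    "age", "age [y]", "Age", "age [year]", "AGE", "age [years]",
    "age in years", "age (after birth)", "age of patient", "Age (years)",
    "age (yr)", "age, years", "age (yrs)", "age.year", "age_years",
]

def normalize_field_name(name: str) -> str:
    """Lowercase, drop unit suffixes and separators so variants collapse."""
    name = name.lower()
    name = re.sub(r"[\[\(].*?[\]\)]", "", name)                  # drop "(years)", "[y]", ...
    name = re.sub(r"[_.,]", " ", name)                           # unify separators
    name = re.sub(r"\b(in\s+)?(years?|yrs?\.?|y)\b", "", name)   # drop unit words
    name = re.sub(r"\b(of\s+)?(patient|subjects?)\b", "", name)  # drop qualifiers
    return " ".join(name.split())

print({normalize_field_name(n) for n in RAW_FIELD_NAMES})        # {'age'}
```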

  5. Poor metadata: an analysis of metadata from NCBI’s BioSample • 73% of values for "Boolean" attributes are not valid Booleans (e.g., nonsmoker, former-smoker) • 26% of values for "integer" attributes are not integers (e.g., JM52, UVPgt59.4, pig) • 68% of expected ontology terms are not valid ontology terms (e.g., presumed normal, wild_type) Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017, Vienna, Austria.
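
For intuition about what these percentages measure, here is a hypothetical sketch (not the authors' code) of the kind of check that flags such values: values recorded for attributes declared as Boolean or integer are tested against the declared type. The real analysis is described in Gonçalves et al. (2017).

```python
# Hypothetical sketch (not the authors' code) of the kind of check behind these
# numbers: values recorded for attributes declared as Boolean or integer are
# tested against the declared type.
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "1", "0"}

def is_boolean(value: str) -> bool:
    return value.strip().lower() in BOOLEAN_TOKENS

def is_integer(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

values_for_boolean_attrs = ["yes", "nonsmoker", "former-smoker", "no"]
values_for_integer_attrs = ["42", "JM52", "UVPgt59.4", "pig"]

bad_bool = sum(not is_boolean(v) for v in values_for_boolean_attrs)
bad_int  = sum(not is_integer(v) for v in values_for_integer_attrs)
print(f"non-Boolean values: {bad_bool}/{len(values_for_boolean_attrs)}")  # 2/4
print(f"non-integer values: {bad_int}/{len(values_for_integer_attrs)}")   # 3/4
```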

  6. Metadata authoring is hard

  7. CEDAR: a computational platform for metadata management • Goal: overcome the impediments to creating high-quality metadata [Figure: a metadata template]
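
To make the idea concrete, here is a deliberately simplified, hypothetical picture of what a metadata template specifies: the fields a metadata record should contain and, for some of them, that values must come from an ontology. CEDAR's actual template model is considerably richer than this dictionary.

```python
# Hypothetical, highly simplified representation of a metadata template.
# CEDAR's real template model is far richer; this is for illustration only.
study_template = {
    "name": "Sample study template",
    "fields": {
        "study title":  {"type": "text"},
        "disease":      {"type": "ontology_term"},  # value must be a term from a disease ontology
        "sex":          {"type": "ontology_term"},
        "tissue":       {"type": "ontology_term"},
        "study design": {"type": "text"},
    },
}
```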

  8. [Figure: the CEDAR workflow] Design template (Template Designer; template authors, e.g., standards committees such as LINCS) → Fill in metadata (Metadata Editor; metadata authors, e.g., scientists) → Submit metadata (Metadata Repository; public databases)

  9. [Same CEDAR workflow figure as slide 8] We developed a metadata recommendation system

  10. Metadata recommendation system [Figure] (1) The Metadata Editor stores metadata in the Metadata Repository, (2) the Metadata Recommender analyzes the existing metadata, and (3) the Metadata Recommender generates suggestions that are shown in the Metadata Editor
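
A minimal sketch of how steps (2) and (3) could work, assuming metadata records are simple field-value dictionaries and candidate values are ranked by how often they co-occur with the values the author has already entered. This is an illustrative simplification, not CEDAR's actual recommendation algorithm (see the paper for that).

```python
from collections import Counter

# Minimal sketch of steps (2) and (3): learn value co-occurrences from previously
# stored metadata records, then rank suggestions for a target field given the values
# the author has already filled in. Illustrative only; not CEDAR's actual algorithm.
class ValueRecommender:
    def __init__(self, records):
        # records: list of dicts, e.g. {"disease": "asthma", "tissue": "lung", "sex": "female"}
        self.records = records

    def suggest(self, target_field, context, top_n=3):
        """Rank values of target_field by how often they appear in stored records
        whose fields match the already-filled (field, value) pairs in `context`."""
        counts = Counter()
        for rec in self.records:
            if target_field in rec and all(rec.get(f) == v for f, v in context.items()):
                counts[rec[target_field]] += 1
        return [value for value, _ in counts.most_common(top_n)]

# Usage: after the author enters tissue = "lung", suggest likely diseases.
stored = [
    {"disease": "asthma",   "tissue": "lung", "sex": "female"},
    {"disease": "asthma",   "tissue": "lung", "sex": "male"},
    {"disease": "melanoma", "tissue": "skin", "sex": "female"},
]
recommender = ValueRecommender(stored)
print(recommender.suggest("disease", {"tissue": "lung"}))  # ['asthma']
```

The key point this sketch illustrates is that the suggestions are context-sensitive: they change as the author fills in other fields of the same record.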

  11. Filling in a CEDAR template

  12.–15. [Screenshots of filling in a CEDAR template]

  16. Evaluation workflow [Figure] Gene Expression metadata from NCBI BioSample (≈35K BioSample template instances) → (1) Preprocessing and ingestion into the CEDAR Metadata Repository → (2) Semantic annotation (annotated BioSample template instances) → split into an 80% training dataset and a 20% test dataset → (3) Training of the Metadata Recommender → (4) Testing & Analysis → evaluation results

  17. [Same evaluation workflow figure as slide 16]

  18. [Same evaluation workflow figure as slide 16]

  19. Evaluation workflow [Same figure as slide 16] • Suggestions evaluated for the "disease", "sex", and "tissue" fields • Top 3 suggestions considered
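
A rough sketch of this evaluation protocol, under the assumption that each annotated BioSample instance is a plain field-value dictionary and reusing the hypothetical ValueRecommender interface sketched for slide 10. Preprocessing, semantic annotation, and CEDAR ingestion are not shown, and scoring a miss as 0 is an assumption (one common convention), not necessarily what the authors did.

```python
import random

# Rough sketch of the evaluation protocol: hold out 20% of the instances, hide
# the target field in each test record, ask the recommender for its top
# suggestions, and record the reciprocal rank of the expected value.
def evaluate(recommender_factory, instances,
             target_fields=("disease", "sex", "tissue"),
             train_fraction=0.8, top_n=3, seed=42):
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * train_fraction)
    train, test = shuffled[:split], shuffled[split:]
    recommender = recommender_factory(train)

    reciprocal_ranks = []
    for record in test:
        for field in target_fields:
            expected = record.get(field)
            if expected is None:
                continue
            # Hide the target field and use the rest of the record as context.
            context = {f: v for f, v in record.items() if f != field}
            suggestions = recommender.suggest(field, context, top_n=top_n)
            rank = suggestions.index(expected) + 1 if expected in suggestions else None
            reciprocal_ranks.append(1.0 / rank if rank else 0.0)  # miss scored as 0 (assumption)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```

With the earlier sketch, `evaluate(ValueRecommender, annotated_instances)` would return the Mean Reciprocal Rank over the held-out 20%.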

  20. Testing & Analysis Compared suggested vs. expected metadata Measure: Reciprocal Rank (RR) . Appropriate to judge systems that return a ranking of suggestions when there is only a relevant result !"#$%&'#() !(+, (!!) = 1 1 Position of the expected result in the ranking of suggestions AMIA 2017 | amia.org 20

  21. How is the RR calculated?
  Expected: asthma | Suggested: 1) asthma, 2) lung cancer, 3) respiratory disease | k = 1 | RR = 1/1
  Expected: lymphoma | Suggested: 1) myeloma, 2) lymphoma, 3) acute myeloid leukemia | k = 2 | RR = 1/2
  Expected: lung cancer | Suggested: 1) respiratory disease, 2) asthma, 3) lung cancer | k = 3 | RR = 1/3
  Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 ≈ 0.61
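
The same worked example in code, as a small illustrative check using the three suggestion lists above:

```python
# The expected term sits at ranks 1, 2, and 3 of the three suggestion lists,
# so MRR = (1/1 + 1/2 + 1/3) / 3 ≈ 0.61.
cases = [
    ("asthma",      ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma",    ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

def reciprocal_rank(expected, suggestions):
    return 1.0 / (suggestions.index(expected) + 1) if expected in suggestions else 0.0

mrr = sum(reciprocal_rank(expected, suggested) for expected, suggested in cases) / len(cases)
print(round(mrr, 2))  # 0.61
```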

  22. Results [Chart: MRR for "disease", "tissue", and "sex", baseline vs. Metadata Recommender] On average: • Metadata Recommender MRR = 0.77 • Baseline (majority vote) MRR = 0.31. Better performance with respect to the baseline for: • Fields with many different values • Templates with many correlated fields

  23. Summary • We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR • Generates context-sensitive suggestions in real time • Incorporates both ontology-based and free-text suggestions

  24. Summary Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets • So that the datasets can be found, interpreted, and reused • Essential to ensure scientific reproducibility

  25. facebook.com/metadatacenter • @metadatacenter • Channel: Metadata Center • github.com/metadatacenter • http://cedar.metadatacenter.org
