Assisted Curation: Does Text Mining Really Help? (Alex et al. 2008)

Assisted Curation: Does Text Mining Really Help? (Alex et al. 2008), by Benedict Fehringer. Seminar: "Unlocking the Secrets of the Past: Text Mining for Historical Documents". Supervisor: Dr. Caroline Sporleder (and Martin Schreiber).


  1. Assisted Curation: Does Text Mining Really Help? (Alex et al. 2008)
     by Benedict Fehringer
     Seminar: "Unlocking the Secrets of the Past: Text Mining for Historical Documents"
     Supervisor: Dr. Caroline Sporleder (and Martin Schreiber)
     Thursday, 23 February 2012

  2. Outline
     - Introduction
     - Related Work
     - Assisted Curation
     - Text Mining Pipeline
     - Curation Experiments
     - Discussion and Conclusion
     - References

  4. Basic study elements: Content
     - Curation of biomedical literature
     - For example, protein-protein interaction recognition:
       1. Which proteins are there?
       2. If two proteins are named, do they interact?

  5. Example for protein-protein interaction recognition
     "[...] An example is YHR105W, which interacts with one protein involved in vesicular transport, Akr2, and with YGL161C, an uncharacterized protein that interacts with two transport proteins, Yip1 and Pep12. YHR105W also interacts with YPL246C, another uncharacterized protein that interacts with Ypt1 and Vam7, proteins implicated in vesicular transport and membrane fusion, respectively. [...]"
     Source: Schwikowski, Uetz, & Fields (2000, p. 1259)
     1. Which proteins are there?
     2. If two proteins are named, do they interact?

  6. Basic study elements: Research Question
     - Curation of biomedical literature
     - For example, protein-protein interaction recognition:
       1. Which proteins are there?
       2. If two proteins are named, do they interact?
     - The task should be supported by text mining

  7. Related Work
     - Increasing development of information extraction systems (spurred on by the BioCreAtIvE II competition; Krallinger, Leitner, & Valencia, 2007)
     - Studies suggest a reduction in curation time
     - But: a lack of user studies for extrinsic evaluation
     - No validation through curator feedback on how such systems affect their work and how useful they are

  8. Basic study elements: Evaluation
     - Curation of biomedical literature
     - For example, protein-protein interaction recognition:
       1. Which proteins are there?
       2. If two proteins are named, do they interact?
     - The task should be supported by text mining
     - Evaluation by:
       - objective performance metrics (e.g. speed improvement, number of records)
       - user feedback as well

  10. Curation Scenario: General
     - Goal: curators should identify protein-protein interactions (PPIs)
     - Initial step: providing a set of matching papers
     - Middle step: filtering papers into candidates

  11. Curation Scenario: General
     - Goal: curators should identify protein-protein interactions (PPIs)
     - Initial step: providing a set of matching papers
     - Middle step: filtering papers into candidates
     - Question: How can NLP help the curator work?

  12. Curation Scenario: General
     - Goal: curators should identify protein-protein interactions (PPIs)
     - Initial step: providing a set of matching papers
     - Middle step: filtering papers into candidates
     - Basic assumption: information extraction (IE) techniques are likely to be effective in identifying entities and relations
       - More specifically: NLP can propose candidate PPIs

  14. Curation Scenario: Concrete
     [Figure: Information flow in the curation process. Source: Alex et al. (2008, p. 558)]

  20. NLP Engine: Main Components
     Concrete subtasks and the NLP components that address them:
     1. Does a protein name occur in the sentence? (Named Entity Recognition)
     2. Which protein does it name? (Term Identification)
     3. If two proteins are named, do they interact? (Relation Extraction)
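
The three subtasks on this slide can be illustrated with a deliberately simple sketch. It is not the NLP engine of Alex et al. (2008): a hand-made lexicon (`LEXICON`), placeholder identifiers (`ID:...`) and plain sentence-level co-occurrence stand in for named entity recognition, term identification and relation extraction. All function names are hypothetical, and the sentence is a shortened version of the Schwikowski et al. passage quoted earlier.

```python
import re
from itertools import combinations

# Hypothetical mini-lexicon: surface form -> placeholder identifier.
# A real system would map mentions to database identifiers instead.
LEXICON = {
    "YHR105W": "ID:0001",
    "Akr2": "ID:0002",
    "YGL161C": "ID:0003",
    "Yip1": "ID:0004",
    "Pep12": "ID:0005",
}

def recognize_entities(sentence):
    """Subtask 1 (NER): return lexicon names found in the sentence, in text order."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [tok for tok in tokens if tok in LEXICON]

def identify_terms(mentions):
    """Subtask 2 (term identification): pair each mention with its identifier."""
    return [(m, LEXICON[m]) for m in mentions]

def propose_interactions(mentions):
    """Subtask 3 (relation extraction): propose every co-occurring pair as a candidate PPI."""
    unique = list(dict.fromkeys(mentions))  # drop repeated mentions, keep order
    return list(combinations(unique, 2))

sentence = ("An example is YHR105W, which interacts with one protein involved "
            "in vesicular transport, Akr2, and with YGL161C.")
mentions = recognize_entities(sentence)
terms = identify_terms(mentions)
candidates = propose_interactions(mentions)
print(mentions)    # ['YHR105W', 'Akr2', 'YGL161C']
print(candidates)  # three candidate pairs, including ('Akr2', 'YGL161C')
```

Co-occurrence proposes the pair Akr2 and YGL161C even though the sentence does not state that interaction. Candidates of this kind are what a curator would have to reject, which is why the precision of the relation extraction step matters for assisted curation.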

  21. NLP Engine: Creation Details
     - What should the interface design look like?

  22. NLP Engine: Creation Details
     - What should the interface design look like?
     - How should the labour be divided between the human and the software?
       For example: deciding which species is associated with which protein should be quite simple for an expert, but not necessarily for the software.

  23. NLP Engine: Creation Details
     - What should the interface design look like?
     - How should the labour be divided between the human and the software?
     - Which functional characteristics of the NLP engine would be optimal?
       For example: should recall or precision be improved?

  24. NLP Engine: Creation Details
     - What should the interface design look like?
     - How should the labour be divided between the human and the software?
     - Which functional characteristics of the NLP engine would be optimal?
     The focus will be on the third question.

  26. Pipeline Components
     [Diagram with the boxes: Corpus, Pre-processing, Named Entity Recognition, Term Identification, Relation Extraction, Component Performance]

  28. Pipeline Components: Corpus
     - 217 papers
     - Corpus consists of 2 million tokens:
       - TRAIN (66%)
       - DEVTEST (17%)
       - TEST (17%)
     - Annotated with 9 entities, PPI relations and FRAG* relations
     - Were enriched with: Properties, Normalized, Attributes
     - Inter-annotator agreement values shown on the slide: 84.9, 64.8, 88.4, 87.1, 59.6
     *linked fragments and mutants to their parents
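
A minimal sketch of how a 66% / 17% / 17% split could be produced. The slide only gives the proportions; splitting at the level of whole papers, shuffling randomly, and the names `split_corpus` and `paper_000` ... are assumptions made for illustration.

```python
import random

def split_corpus(doc_ids, train=0.66, devtest=0.17, seed=0):
    """Shuffle document ids and cut them into TRAIN / DEVTEST / TEST."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = round(len(ids) * train)
    n_dev = round(len(ids) * devtest)
    # Whatever remains after TRAIN and DEVTEST becomes TEST.
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

papers = [f"paper_{i:03d}" for i in range(217)]  # hypothetical ids for 217 papers
train_set, devtest_set, test_set = split_corpus(papers)
print(len(train_set), len(devtest_set), len(test_set))  # 143 37 37
```

Keeping DEVTEST separate from TEST lets components be tuned on one held-out set while the other stays untouched for the final evaluation.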

  30. Pipeline Components: Pre-processing
     - Sentence boundary detection
     - Tokenization
     - Adding useful linguistic markup
     - Attaching NCBI* taxonomy identifiers
     *National Center for Biotechnology Information
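
The first two pre-processing steps can be sketched with regular expressions. This is a naive illustration under stated assumptions, not the pre-processing used by Alex et al. (2008): the rules in `split_sentences` and `tokenize` are hypothetical, the text is adapted from the example passage quoted earlier, and linguistic markup and taxonomy identifiers are not covered.

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: split after . ! or ? when
    followed by whitespace and an upper-case letter or digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Split into word-like tokens (letters, digits, internal hyphens)
    and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

text = ("YGL161C interacts with two transport proteins, Yip1 and Pep12. "
        "YHR105W also interacts with YPL246C.")
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(len(sentences))  # 2
print(tokens[1])       # ['YHR105W', 'also', 'interacts', 'with', 'YPL246C', '.']
```

Rules this simple break on abbreviations such as "et al." or "Fig. 2", which is one reason biomedical pipelines usually rely on more robust boundary detection.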

  35. Pipeline Components: Component Performance
     Example: classifying items as "entity" or "no entity"

                        pred entity   pred no entity   Sum
     real entity             9              3           12
     real no entity          1             11           12
     Sum                    10             14           24

     Recall: 9/12 = 0.75
     Precision: 9/10 = 0.9
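
The recall and precision values follow directly from the counts in the table, as the short sketch below shows. The F1 score is not on the slide; it is added here as the usual harmonic mean of precision and recall, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted entities that are real
    recall = tp / (tp + fn)     # share of real entities that were predicted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the table: real entities are 9 found + 3 missed,
# and 1 non-entity was wrongly predicted as an entity.
tp, fn, fp = 9, 3, 1
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
# precision=0.90 recall=0.75 f1=0.818
```

The 11 correctly rejected non-entities do not enter any of the three measures, so a component can score well here regardless of how many easy negatives the data contains.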
