
Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup - PowerPoint PPT Presentation



  1. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Alexander S. Yeh, Lynette Hirschman, and Alexander A. Morgan

  2. Introduction
     - The idea behind the ‘challenge cup’ was to present teams with real or realistic training and test data to “create measurable forward progress in [the] field”.
     - They mined papers from the FlyBase database. (FlyBase is a comprehensive database for information on the genetics and molecular biology of Drosophila. It includes data from the Drosophila Genome Projects and data curated from the literature. FlyBase is a joint project with the Berkeley Drosophila Genome Project.)

  3. Methods: Contest Set-up
     - Given: a set of papers (full text) on genetics or molecular biology and, for each paper, a list of the genes mentioned in the paper.
     - Determine whether the paper meets FlyBase gene expression curation criteria and, for each gene, indicate whether the full paper has experimental evidence for gene products (mRNA and/or protein).

  4. What they needed to return
     - A ranked list of the papers in order of probability of the need for curation; papers with experimental evidence should rank higher. (Curated: articles from the literature that have been reviewed by the curation staff at RGD, who have read the article and extracted the specific information of interest to RGD, which was subsequently loaded into the database.)
     - A yes/no decision on whether to curate each paper.
     - For each gene in each paper, an individual decision about whether the paper has evidence for gene products.
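The three outputs described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PaperScore` record, the `build_submission` function, the score fields, and the 0.5 threshold are all hypothetical and are not taken from the contest or from any team's system.

```python
from dataclasses import dataclass, field


@dataclass
class PaperScore:
    """Hypothetical per-paper output of some classifier (not from the contest)."""
    paper_id: str
    curation_score: float  # assumed probability-like score that the paper needs curation
    gene_scores: dict = field(default_factory=dict)  # gene name -> evidence score


def build_submission(papers, threshold=0.5):
    """Derive the three requested outputs from the hypothetical scores."""
    # 1. Ranked list: papers most likely to need curation come first.
    ranked = [
        p.paper_id
        for p in sorted(papers, key=lambda p: p.curation_score, reverse=True)
    ]
    # 2. Yes/no curation decision per paper (threshold is an assumption).
    curate = {p.paper_id: p.curation_score >= threshold for p in papers}
    # 3. Yes/no decision per (paper, gene) pair on evidence for gene products.
    gene_products = {
        (p.paper_id, gene): score >= threshold
        for p in papers
        for gene, score in p.gene_scores.items()
    }
    return ranked, curate, gene_products
```

For example, calling `build_submission` on two papers with scores 0.2 and 0.9 puts the second paper first in the ranked list and marks only that paper for curation.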

  5. Training Data
     - Data consisted of 862 ‘cleaned’ full-text papers
     - Genes renamed to standard convention
     - Matched to FlyBase standards

  6. Results
     - Sub-task: Best / 1st / Med / Low
     - Ranked list: 84% / 81% / 69% / 35%
     - Y/N paper: 78% / 61% / 58% / 32%
     - Y/N products: 67% / 47% / 35% / 8%

  7. Winning strategy
     - Manually constructed rules that were matched against patterns deemed of ‘interest’
     - All teams moved away from the “bag of words” approach common in text classification and did more with domain experts
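A minimal sketch of what matching hand-written rules against text might look like, as opposed to counting words. The patterns, weights, and threshold below are invented for illustration; they are assumptions, not the winning team's actual rules.

```python
import re

# Hypothetical hand-written rules: (pattern, weight). Illustrative only;
# these are NOT the rules used by any contest team.
RULES = [
    (re.compile(r"\bnorthern blot", re.IGNORECASE), 2.0),
    (re.compile(r"\bin situ hybridi[sz]ation", re.IGNORECASE), 2.0),
    (re.compile(r"\bwestern blot", re.IGNORECASE), 2.0),
    (re.compile(r"\bexpress(?:ed|ion)\b", re.IGNORECASE), 0.5),
]


def rule_score(text):
    """Sum the weights of every rule whose pattern occurs in the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def needs_curation(text, threshold=2.0):
    """Flag a passage when its rule score reaches an assumed threshold."""
    return rule_score(text) >= threshold
```

In this sketch a domain expert edits the `RULES` list directly, which is where the biology knowledge mentioned on the slides would enter.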

  8. Lessons Learned
     - PDF format is not suitable for processing; HTML had its own challenges, with too many linked files to map.
     - Mining properly often requires significant understanding of biology and of FlyBase conventions.

  9. Lessons Learned
     - The more high-tech automated weighting techniques produced far fewer ‘correct’ answers than programs written manually to the specific constraints of the task.
     - It was important to know both what and ‘where’ to look for features and patterns.

  10. Third sub-task most difficult
      - Associations and indicators varied for each gene, with different patterns.
      - One way to address this may be to use more extensive linguistic structure indicators. Similarities and better relationship structures would make the system more reliable.

  11. Points of Note
      - Training on the test data is not practical in normal circumstances.
      - Nor is requiring ‘golden’ HTML.
      - The failures on papers with either transcripts or proteins show text mining's over-reliance on simple associations.
      - What about the things that should be left in but are mined out?
