Abbreviation detection for biomedical articles by Sonja Kenari
Agenda: Introduction, Background, Implementation, Results, Further Improvements
Introduction
Full project description: COVID-19 Open Research Dataset Challenge (CORD-19): What do we know about vaccines and therapeutics?
- Abbreviation detection
- Dictionary tagger
- Relationship extraction
- NER
Introduction: Abbreviation Detection
- spaCy: Python library for NLP
- Abbreviation detection makes it easier to:
  - find articles of interest faster
  - keep up with the amount of new abbreviations
Background: Abbreviation Detection
- Pre-trained models by spaCy
- scispaCy: AbbreviationDetector
- Detects abbreviations (short form) and definitions (long form)
- Accuracy?
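To make the short form / long form idea concrete, here is a minimal pure-Python sketch. It is a simplified heuristic written for illustration only, not scispaCy's AbbreviationDetector: it assumes the definition appears as "long form (SF)" and that each character of the short form is the initial of one preceding word. The function name `find_abbreviations` and the sample sentence are hypothetical.

```python
import re


def find_abbreviations(text):
    """Return {short_form: long_form} for simple 'long form (SF)' patterns.

    Simplified illustration: one preceding word per short-form character,
    matched on initials only. Real detectors handle many more cases.
    """
    pairs = {}
    # Candidate short form: 2-10 letters/digits/hyphens in parentheses.
    for match in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", text):
        short = match.group(1)
        letters = [c.lower() for c in short if c.isalnum()]
        # Take as many words before the parenthesis as the short form has characters.
        words = text[:match.start()].split()
        candidate = words[-len(letters):]
        initials_match = len(candidate) == len(letters) and all(
            word[0].lower() == letter for word, letter in zip(candidate, letters)
        )
        if initials_match:
            pairs[short] = " ".join(candidate)
    return pairs


text = (
    "Patients with severe acute respiratory syndrome (SARS) were tested "
    "by polymerase chain reaction (PCR)."
)
pairs = find_abbreviations(text)
for short, long_form in pairs.items():
    print(short, "=", long_form)
```

A heuristic like this misses long forms that contain stop words or that do not map one word to one letter, which is one reason to rely on a dedicated detector.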
Implementation: Generate Pubannotations
- data subset [json]
- pubannotation [json]
- 100 out of 60,000 articles
- metadata file [csv]
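As an illustration of what a pubannotation [json] record might look like, the sketch below wraps a text and one abbreviation pair in a dictionary with `text` and `denotations` keys, where each denotation carries a character span. The helper name `to_pubannotation` and the `obj` labels `short_form` / `long_form` are assumptions made for this example; the slides do not state which labels the project used.

```python
import json


def to_pubannotation(text, short_form, long_form):
    """Build a PubAnnotation-style dict for one abbreviation pair.

    The 'obj' labels are hypothetical; spans are character offsets of the
    first occurrence of each surface string in the text.
    """
    denotations = []
    for label, surface in (("long_form", long_form), ("short_form", short_form)):
        begin = text.find(surface)
        if begin == -1:
            # Surface string not present in this text: skip instead of guessing.
            continue
        denotations.append(
            {
                "id": f"T{len(denotations) + 1}",
                "span": {"begin": begin, "end": begin + len(surface)},
                "obj": label,
            }
        )
    return {"text": text, "denotations": denotations}


sample_text = "polymerase chain reaction (PCR) was used"
record = to_pubannotation(sample_text, "PCR", "polymerase chain reaction")
print(json.dumps(record, indent=2))
```

Because spans are offsets into `text`, a record only stays valid as long as the stored text matches the article it was derived from.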
Implementation: Generating files of abbreviations
- Web scraping: metadata file [csv], url, HTML parser (BeautifulSoup)
- scispaCy: data subset [json], full texts, AbbreviationDetector
- Output file format (both methods): abbreviation csv files
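The HTML-parsing step can be sketched as follows. The project used BeautifulSoup; to keep this example dependency-free it swaps in Python's standard-library `html.parser` instead, so it shows the idea rather than the project's code. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are hypothetical, and fetching the page from its url is left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample_html = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Polymerase chain reaction (PCR)</p>"
    "<script>var x = 1;</script>"
    "<p>was used.</p></body></html>"
)
full_text = html_to_text(sample_html)
print(full_text)
```

The extracted text can then be passed to whichever abbreviation detection step is used, and the resulting pairs written out with the standard `csv` module.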
Implementation: Evaluation
Compare the 2:
- detected abbreviations with spaCy [csv]
- detected abbreviations with web scraping [csv]
Short form hit rate (%) = (Number of unique short forms detected by spaCy) / (Number of short forms detected by web scraping)
Long form hit rate (%) = (Number of unique long forms detected by spaCy) / (Number of long forms detected by web scraping)
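The ratio on this slide can be computed as in the sketch below. It follows the formula literally: unique forms detected by spaCy divided by the number of forms from web scraping, expressed as a percentage. The function name `hit_rate` and the two example lists are made up for illustration and are not the project's data.

```python
def hit_rate(spacy_forms, scraped_forms):
    """Percentage: unique forms detected by spaCy / forms found by web scraping."""
    if not scraped_forms:
        raise ValueError("no forms from web scraping to compare against")
    # Only the spaCy list is de-duplicated, as in the formula on the slide.
    return 100 * len(set(spacy_forms)) / len(scraped_forms)


# Illustrative lists only.
spacy_short = ["PCR", "PCR", "SARS", "ACE2", "RNA", "RNA", "MERS", "ICU", "CT"]
scraped_short = ["PCR", "SARS", "ACE2", "RNA", "MERS", "ICU", "CT", "WHO"]

rate = hit_rate(spacy_short, scraped_short)
print(f"short form hit rate: {rate:.1f}%")
```

Note that this is a ratio of counts. It does not check that the forms in the two lists are the same strings; an overlap-based measure would need a set intersection instead.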
Result: Abbreviation lists
- Short forms hit rate: highest 87.5%, lowest 25%
- Long forms hit rate: highest 52.6%, lowest 0%
- 20 out of 100 notable faults:
  - spaCy weak on long form
  - text in json files not updated to match the url articles
  - faults in denotation extraction
Further Improvements
- spaCy
- Optimize programs
- Improve the results
- Make more time efficient
- Extract from web scraper
- Pubannotations
- Update data
- Instead of full text extraction
Thank you for listening! Questions? Sonja Kenari, nat14sta@student.lu.se, 2020-05-29