A SIMPLE ALGORITHM FOR IDENTIFYING ABBREVIATION DEFINITIONS IN BIOMEDICAL TEXT

ARIEL S. SCHWARTZ
Computer Science Division
University of California, Berkeley
Berkeley, CA 94720
sariel@cs.berkeley.edu

MARTI A. HEARST
SIMS
University of California, Berkeley
Berkeley, CA 94720
hearst@sims.berkeley.edu

Abstract

The volume of biomedical text is growing at a fast rate, creating challenges for humans and computer systems alike. One of these challenges arises from the frequent use of novel abbreviations in these texts, thus requiring that biomedical lexical ontologies be continually updated. In this paper we show that the problem of identifying abbreviations’ definitions can be solved with a much simpler algorithm than those proposed by other research efforts. The algorithm achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches. It also achieves 95% precision and 82% recall on another, larger test set. A notable advantage of the algorithm is that, unlike other approaches, it does not require any training data.

1 Introduction

There has been increased interest recently in techniques to automatically extract information from biomedical text, and particularly from MEDLINE abstracts [3, 4, 7, 15]. The size and growth rate of the biomedical literature create new challenges for researchers who need to keep up to date. One specific issue is the high rate at which new abbreviations are introduced in biomedical texts. Existing databases, ontologies, and dictionaries must be continually updated with new abbreviations and their definitions. In an attempt to help resolve this problem, new techniques have been introduced to automatically extract abbreviations and their definitions from MEDLINE abstracts.

In this paper we propose a new, simple, fast algorithm for extracting abbreviations from biomedical text. The scope of the task addressed here is the same as the one described in Pustejovsky et al. [14]: identify <“short form”, “long form”> pairs where there exists a mapping (of any kind) from characters in the short form to characters in the long form (illustrated by the sketch below).[a]

[a] Throughout the paper we use the terms “short form” and “long form” interchangeably with “abbreviation” and “definition”. We also use the term “short form” to indicate both abbreviations and acronyms, conflating these as previous authors have.
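As an illustration of this task definition only (not the algorithm proposed in this paper), the following sketch checks whether every character of a candidate short form can be mapped, in order, to some character of the candidate long form. The function name, the case-folding, and the decision to skip non-alphanumeric short-form characters are assumptions made for this example.

```python
def has_character_mapping(short_form: str, long_form: str) -> bool:
    """Check whether every character of the short form can be mapped,
    in order, to some character of the long form (case-insensitive).

    A deliberate simplification of the <short form, long form> task
    definition above; it ignores the extra constraints a real extractor
    would impose (e.g. which letters must be word-initial).
    """
    s = short_form.lower()
    l = long_form.lower()
    j = 0  # current position in the long form
    for ch in s:
        if not ch.isalnum():
            continue  # skip punctuation such as hyphens in the short form
        # advance through the long form until the character is found
        while j < len(l) and l[j] != ch:
            j += 1
        if j == len(l):
            return False  # character could not be mapped
        j += 1
    return True


# Examples from the text:
assert has_character_mapping("MMS", "methyl methanesulfonate sulfate")
assert has_character_mapping("GNAT", "Gcn5-related N-acetyltransferase")
```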

Many abbreviations in biomedical text follow a predictable pattern, in which the first letter of each word in the long form corresponds to one letter in the short form, as in methyl methanesulfonate sulfate (MMS). However, there are many cases in which the correct match between the short form and the long form requires words in the long form to be skipped, or requires matching internal letters of long-form words, as in Gcn5-related N-acetyltransferase (GNAT). In this paper, we describe a very simple, fast algorithm for this problem that achieves both high recall and high precision.

2 Related Work

Pustejovsky et al. [13, 14] present a solution for identifying abbreviations based on hand-built regular expressions and syntactic information to identify the boundaries of noun phrases. When a noun phrase is found to precede a short form enclosed in parentheses, each of the characters within the short form is matched in the long form. A score is assigned that corresponds to the number of non-stopwords in the long form divided by the number of characters in the short form. If the result is below a threshold of 1.5, the match is accepted (this scoring step is sketched below). This algorithm achieved 72% recall and 98% precision on “the gold standard,” a small, publicly available evaluation corpus that this group created, working better than a similar algorithm that does not take syntax into account.[b] Pustejovsky et al. [13] also summarize some drawbacks of earlier pattern-based approaches, noting that the results of Taghva et al. [17] look good (98% precision and 93% recall on a different test set) but do not account for abbreviations whose letters may correspond to a character internal to a definition word, a common occurrence in biomedical text. They also find that the Acrophile algorithm of Larkey et al. [8] does not perform well on the gold standard.

Chang et al. [5] present an algorithm that uses linear regression on a pre-selected set of features, achieving 80% precision at a recall level of 83%, and 95% precision at 75% recall, on the same evaluation collection (this increases to 82% recall and 99% precision on a corrected version).[c] Their algorithm uses dynamic programming to find potential alignments between the short and long forms, and uses the results to compute feature vectors for correctly identified definitions. They then use binary logistic regression to train a classifier on 1000 candidate pairs.

[b] There are some errors in the gold standard. The results reported by Pustejovsky et al. are on a variation of the gold standard with some corrections, but the actual corrections made are not reported in the paper. Unfortunately, the corrections to the gold standard have not been standardized.
[c] Personal communication, H. Schuetze.
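The scoring rule attributed to Pustejovsky et al. above can be made concrete with a small sketch. This is not their implementation: the stopword list and the whitespace tokenization are assumptions, and only the score-and-threshold step is shown (the noun-phrase detection and character matching are omitted).

```python
# Hypothetical stopword list, assumed for this sketch only.
STOPWORDS = {"of", "the", "and", "in", "for", "a", "an", "to"}


def accept_candidate(short_form: str, long_form: str, threshold: float = 1.5) -> bool:
    """Apply the scoring rule described above: the number of non-stopwords
    in the long form divided by the number of characters in the short form,
    accepting the pair if the score falls below the threshold."""
    non_stopwords = [word for word in long_form.lower().split()
                     if word not in STOPWORDS]
    score = len(non_stopwords) / len(short_form)
    return score < threshold


# "methyl methanesulfonate sulfate" has 3 non-stopwords and "MMS" has
# 3 characters, so the score is 1.0 and the pair is accepted.
print(accept_candidate("MMS", "methyl methanesulfonate sulfate"))  # True
```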

Yeates et al. [19] examine acronyms in technical text. They address a more difficult problem than some other groups in that their test set includes instances that do not have distinct orthographic markers, such as parentheses, to indicate the proximity of a definition to an abbreviation (they report that only two thirds of their examples take this form). Their algorithm creates a code that indicates the distance of the definition words from the corresponding characters in the acronym, and uses compression to learn the associations. They compile a large test collection consisting of 1080 definitions, train on two thirds and test on the remainder, and report the results as a precision/recall curve.

Park and Byrd [12] present a rule-based algorithm for extracting abbreviation definitions from general text. The algorithm creates rules on the fly that model how the short form can be translated into the long form. They create a set of five translation rules, a set of five rules for determining candidate long forms based on their length, and a set of six heuristics for determining which definition to choose when there are many potential candidates. These heuristics are: syntactic cues, rule priority, distance between definition and abbreviation, capitalization criteria, number of words in the definition, and number of stopwords in the definition. Rule priority is based on how often the rule has been applied in the past. They evaluate their algorithm on 177 abbreviations taken from engineering texts, achieving 98% precision and 95% recall. No mention is made of the size and nature of the training set, or of whether it was distinct from the test set.

Yu et al. [21] present another rule-based algorithm for mapping abbreviations to their full forms in biomedical text. Their algorithm is similar to that of Park and Byrd. For a given short form, the algorithm extracts all candidate long forms that start with the same character as the short form. It then tries to match the candidate long forms to the short form, starting from the shortest long form, by iteratively applying five pattern-matching rules. The rules include heuristics such as prioritizing matches on the first character of a word, allowing the use of internal letters only if the first letter of the word was matched, and so on. The algorithm was evaluated on a small collection of biomedical text containing 62 matching pairs, achieving 95% precision and 70% recall on average.

Adar [1] presents an algorithm that generates a set of paths through the window of text adjacent to an abbreviation (starting from the leftmost character) and scores these paths to find the most likely definition. Scoring rules include “for every abbreviation character that occurs at the start of a definition word, add 1” and “a bonus point is awarded for definitions that are immediately adjacent to the parenthesis” (two of these rules are sketched below). After processing a large set of abbreviation-definition pairs, the results are clustered in order to identify spelling variants among the definitions. N-gram clustering is coupled with lookup into the MeSH hierarchy to further improve the clusters. Performance on a smaller subset of the gold standard yielded 85% recall and 94% precision; the author notes that two definitions identified by his algorithm should have been marked correct in the standard, resulting in a precision of 95%.[d]

[d] Results verified through personal communication with the author.
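The two Adar scoring rules quoted above can be illustrated with a toy scoring function. This sketch applies only those two rules; the full rule set, the path generation over the text window, and the clustering step are not reproduced, and the function signature is an assumption made for this example.

```python
def score_candidate(short_form: str, candidate_words: list[str],
                    adjacent_to_paren: bool) -> int:
    """Toy scoring in the spirit of the two rules quoted above; the
    exact rule set and weights of the original algorithm are not shown."""
    score = 0
    initials = {word[0].lower() for word in candidate_words if word}
    # Rule: for every abbreviation character that occurs at the start
    # of a definition word, add 1.
    for ch in short_form.lower():
        if ch in initials:
            score += 1
    # Rule: a bonus point for definitions immediately adjacent to the parenthesis.
    if adjacent_to_paren:
        score += 1
    return score


# "Gcn5-related N-acetyltransferase (GNAT)": G and N start definition words,
# and the definition is adjacent to the parenthesis, giving a score of 3.
print(score_candidate("GNAT", ["Gcn5-related", "N-acetyltransferase"], True))
```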
