A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text A.S. Schwartz, M.A. Hearst Pacific Symposium on Biocomputing 8:451-462(2003)

A SIMPLE ALGORITHM FOR IDENTIFYING ABBREVIATION DEFINITIONS IN BIOMEDICAL TEXT

ARIEL S. SCHWARTZ
Computer Science Division
University of California, Berkeley
Berkeley, CA 94720
sariel@cs.berkeley.edu

MARTI A. HEARST
SIMS
University of California, Berkeley
Berkeley, CA 94720
hearst@sims.berkeley.edu

Abstract

The volume of biomedical text is growing at a fast rate, creating challenges for humans and computer systems alike. One of these challenges arises from the frequent use of novel abbreviations in these texts, thus requiring that biomedical lexical ontologies be continually updated. In this paper we show that the problem of identifying abbreviations’ definitions can be solved with a much simpler algorithm than that proposed by other research efforts. The algorithm achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches. It also achieves 95% precision and 82% recall on another, larger test set. A notable advantage of the algorithm is that, unlike other approaches, it does not require any training data.

1 Introduction

There has been an increased interest recently in techniques to automatically extract information from biomedical text, and particularly from MEDLINE abstracts. [3, 4, 7, 15] The size and growth rate of biomedical literature creates new challenges for researchers who need to keep up to date. One specific issue is the high rate at which new abbreviations are introduced in biomedical texts. Existing databases, ontologies, and dictionaries must be continually updated with new abbreviations and their definitions. In an attempt to help resolve the problem, new techniques have been introduced to automatically extract abbreviations and their definitions from MEDLINE abstracts.

In this paper we propose a new, simple, fast algorithm for extraction of abbreviations from biomedical text. The scope of the task addressed here is the same as the one described in Pustejovsky et al.: [14] identify <“short form”, “long form”> pairs where there exists a mapping (of any kind) from characters in the short form to characters in the long form. [a]

[a] Throughout the paper we use the terms “short form” and “long form” interchangeably with “abbreviation” and “definition”. We also use the term “short form” to indicate both abbreviations and acronyms, conflating these as have previous authors.

Many abbreviations in biomedical text follow a predictable pattern, in which the first letter of each word in the long form corresponds to one letter in the short form, as in methyl methanesulfonate sulfate (MMS). However, there are many cases in which the correct match between the short form and long form requires words in the long form to be skipped, or matching of internal letters in long form words, as in Gcn5-related N-acetyltransferase (GNAT). In this paper, we describe a very simple, fast algorithm for this problem that achieves both high recall and high precision.

2 Related Work

Pustejovsky et al. [13, 14] present a solution for identifying abbreviations based on hand-built regular expressions and syntactic information to identify boundaries of noun phrases. When a noun phrase is found to precede a short form enclosed in parentheses, each of the characters within the short form is matched in the long form. A score is assigned that corresponds to the number of non-stopwords in the long form divided by the number of characters in the short form. If the result is below a threshold of 1.5, then the match is accepted. This algorithm achieved 72% recall and 98% precision on “the gold standard,” a small, publicly available evaluation corpus that this group created, working better than a similar algorithm that does not take syntax into account. [b]

Pustejovsky et al. [13] also summarize some drawbacks of other earlier pattern-based approaches, noting that the results of Taghva et al. [17] look good (98% precision and 93% recall on a different test set), but do not account for abbreviations whose letters may correspond to a character internal to a definition word, a common occurrence in biomedical text. They also find that the Acrophile algorithm of Larkey et al. [8] does not perform well on the gold standard.

Chang et al. [5] present an algorithm that uses linear regression on a pre-selected set of features, achieving 80% precision at a recall level of 83%, and 95% precision at 75% recall on the same evaluation collection (this increases to 82% recall and 99% precision on a corrected version). [c] Their algorithm uses dynamic programming to find potential alignments between short and long form, and uses the results of this to compute feature vectors for correctly identified definitions. They then use binary logistic regression to train a classifier on 1000 candidate pairs.

Yeates et al. [19] examine acronyms in technical text. They address a more difficult problem than some other groups in that their test set includes instances that do not have distinct orthographic markers such as parentheses to indicate the

[b] There are some errors in the gold standard. The results reported by Pustejovsky et al. [13] are on a variation of the gold standard with some corrections, but the actual corrections made are not reported in the paper. Unfortunately, the corrections needed on the standard are not standardized.

[c] Personal communication, H. Schuetze.
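As an illustration of the acceptance test attributed to Pustejovsky et al. in Section 2 (number of non-stopwords in the long form divided by number of characters in the short form, accepted when below 1.5), the following is a minimal sketch. It is not their implementation: the stopword list, the whitespace tokenization, and the function names `length_ratio` and `accept` are assumptions made only for this example, and the sketch omits the noun-phrase detection and character matching that precede the score in their method.

```python
# Hypothetical stopword list; the list actually used is not given in the text.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Threshold quoted in the text: a candidate is accepted when the score is below 1.5.
THRESHOLD = 1.5


def length_ratio(short_form: str, long_form: str) -> float:
    """Non-stopword count of the long form divided by the short form's length."""
    if not short_form:
        raise ValueError("short form must not be empty")
    # Assumption: words are separated by whitespace, so hyphenated terms count once.
    content_words = [w for w in long_form.split() if w.lower() not in STOPWORDS]
    return len(content_words) / len(short_form)


def accept(short_form: str, long_form: str) -> bool:
    """Apply the threshold test to a candidate <short form, long form> pair."""
    return length_ratio(short_form, long_form) < THRESHOLD


if __name__ == "__main__":
    for sf, lf in [
        ("MMS", "methyl methanesulfonate sulfate"),
        ("GNAT", "Gcn5-related N-acetyltransferase"),
    ]:
        print(sf, length_ratio(sf, lf), accept(sf, lf))
```

Under these assumptions both examples from the Introduction pass the test; a long form with many more content words than the short form has characters would be rejected.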
