IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 7, JULY 2005, p. 1075

Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation

Roberto Navigli and Paola Velardi

Abstract: Word Sense Disambiguation (WSD) is traditionally considered an AI-hard problem. A break-through in this field would have a significant impact on many relevant Web-based applications, such as Web information retrieval, improved access to Web services, information extraction, etc. Early approaches to WSD, based on knowledge representation techniques, have been replaced in the past few years by more robust machine learning and statistical techniques. The results of recent comparative evaluations of WSD systems, however, show that these methods have inherent limitations. On the other hand, the increasing availability of large-scale, rich lexical knowledge resources seems to provide new challenges to knowledge-based approaches. In this paper, we present a method, called structural semantic interconnections (SSI), which creates structural specifications of the possible senses for each word in a context and selects the best hypothesis according to a grammar G, describing relations between sense specifications. Sense specifications are created from several available lexical resources that we integrated in part manually, in part with the help of automatic procedures. The SSI algorithm has been applied to different semantic disambiguation problems, like automatic ontology population, disambiguation of sentences in generic texts, disambiguation of words in glossary definitions. Evaluation experiments have been performed on specific knowledge domains (e.g., tourism, computer networks, enterprise interoperability), as well as on standard disambiguation test sets.

Index Terms: Natural language processing, ontology learning, structural pattern matching, word sense disambiguation.
The authors are with the Dipartimento di Informatica, Università of Roma “La Sapienza,” via Salaria 113, 00198 Roma, Italy. E-mail: {navigli, velardi}@di.uniroma.it.
Manuscript received 2 Jan. 2004; revised 14 Apr. 2005; accepted 14 Apr. 2005; published online 12 May 2005. Recommended for acceptance by M. Basu.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org and reference IEEECS Log Number TPAMISI-0003-0104.
0162-8828/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

Word sense disambiguation (WSD) is perhaps the most critical task in the area of computational linguistics (see [1] for a survey). Early approaches were based on semantic knowledge that was either manually encoded [2], [3] or automatically extracted from existing lexical resources, such as WordNet [4], [5], LDOCE [6], and Roget’s thesaurus [7]. Similarly to other artificial intelligence applications, knowledge-based WSD was faced with the knowledge acquisition bottleneck. Manual acquisition is a heavy and endless task, while online dictionaries provide semantic information in a mostly unstructured way, making it difficult for a computer program to exploit the encoded lexical knowledge.

More recently, the use of machine learning, statistical and algebraic methods ([8], [9]) prevailed on knowledge-based methods, a tendency that clearly emerges in the main Information Retrieval conferences and in comparative system evaluations, such as SIGIR,¹ TREC,² and SensEval.³ These methods are often based on training data (mainly, word cooccurrences) extracted from document archives and from the Web.

The SensEval workshop series are specifically dedicated to the evaluation of WSD algorithms. Systems compete on different tasks (e.g., full WSD on generic texts, disambiguation of dictionary sense definitions, automatic labeling of semantic roles) and in different languages. English All-Words (full WSD on annotated corpora, such as the Wall Street Journal and the Brown Corpus) is among the most attended competitions. At Senseval-3, held in March 2004, 17 supervised and 9 unsupervised systems participated in the task. The best systems were those using a combination of several machine learning methods, trained with data on word cooccurrences and, in few cases, with syntactic features, but nearly no system used semantic information.⁴ The best systems reached about 65 percent precision, 65 percent recall,⁵ a performance considered well below the needs of many real-world applications [10]. Comparing performances and trends with respect to previous SensEval events, the feeling is that supervised machine learning methods have little hope of providing a real break-through, the major problem being the need for high quality training data for all the words to be disambiguated.

1. http://www.acm.org/sigir/.
2. http://trec.nist.gov/.
3. http://www.senseval.org/.
4. One of the systems reported the use of domain labels, e.g., medicine, tourism, etc.
5. A performance sensibly lower than for Senseval-2.

The lack of high-performing methods for sense disambiguation may be considered the major obstacle that prevented an extensive use of natural language processing techniques in many areas of information technology, such as information classification and retrieval, query processing, advanced Web search, document warehousing, etc. On the other hand, new emerging applications, like the so-called Semantic Web [11], foster “an extension of the current web in which information is given well-defined meaning,