A Discriminative Approach to Japanese Abbreviation Extraction

Naoaki Okazaki† (okazaki@is.s.u-tokyo.ac.jp)
Mitsuru Ishizuka† (ishizuka@i.u-tokyo.ac.jp)
Jun’ichi Tsujii†‡ (tsujii@is.s.u-tokyo.ac.jp)

† Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
‡ School of Computer Science, University of Manchester; National Centre for Text Mining (NaCTeM), Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK

Abstract

This paper addresses the difficulties in recognizing Japanese abbreviations through the use of previous approaches, examining actual usages of parenthetical expressions in newspaper articles. In order to bridge the gap between Japanese abbreviations and their full forms, we present a discriminative approach to abbreviation recognition. More specifically, we formalize the abbreviation recognition task as a binary classification problem in which a classifier determines a positive (abbreviation) or negative (non-abbreviation) class, given a candidate abbreviation definition. The proposed method achieved 95.7% accuracy, 90.0% precision, and 87.6% recall on the evaluation corpus containing 7,887 (1,430 abbreviations and 6,457 non-abbreviations) instances of parenthetical expressions.

1 Introduction

Human languages are rich enough to be able to express the same meaning through different diction; we may produce different sentences to convey the same information by choosing alternative words or syntactic structures. Lexical resources such as WordNet (Miller et al., 1990) enhance various NLP applications by recognizing a set of expressions referring to the same entity/concept. For example, text retrieval systems can associate a query with alternative words to find documents where the query is not obviously stated.

Abbreviations are a highly productive type of term variant, substituting shortened term forms for fully expanded terms. Most previous studies aimed at establishing associations between abbreviations and their full forms in English (Park and Byrd, 2001; Pakhomov, 2002; Schwartz and Hearst, 2003; Adar, 2004; Nadeau and Turney, 2005; Chang and Schütze, 2006; Okazaki and Ananiadou, 2006). Although researchers have proposed various approaches to solving abbreviation recognition through methods such as deterministic algorithms, scoring functions, and machine learning, these studies rely on a phenomenon specific to English abbreviations: all letters in an abbreviation appear in its full form.

However, abbreviation phenomena are heavily language-dependent. For example, the term one-segment broadcasting is usually abbreviated as one-seg in Japanese; English speakers may find this peculiar as the term is likely to be abbreviated as 1SB or OSB in English. In Section 2, we show that letters do not provide useful clues for recognizing Japanese abbreviations. Elaborating on the complexity of the generative processes for Japanese abbreviations, Section 3 presents a supervised learning approach to Japanese abbreviations. We then evaluate the proposed method on a test corpus from newspaper articles in Section 4 and conclude this paper.

2 Japanese Abbreviation Survey

Researchers have proposed several approaches to abbreviation recognition for non-alphabetical languages. Hisamitsu and Niwa (2001) compared different statistical measures (e.g., χ² test, log likelihood ratio) to assess the co-occurrence strength between the inner and outer phrases of parenthetical expressions X (Y). Yamamoto (2002) utilized the similarity of local contexts to measure the paraphrase likelihood of two expressions based on the distributional hypothesis (Harris, 1954). Chang and Teng (2006) formalized the generative processes of Chinese abbreviations with a noisy channel model. Sasano et al. (2007) designed rules about letter types and occurrence frequency to collect lexical paraphrases used for coreference resolution.

How effective are these approaches in recognizing Japanese abbreviation definitions? As a preliminary study, we examined abbreviations described in parenthetical expressions in Japanese newspaper articles. We used the 7,887 parenthetical expressions that occurred more than eight times in Japanese articles published by the Mainichi Newspapers and Yomiuri Shimbun in 1998–1999. Table 1 summarizes the usages of parenthetical expressions in four groups. The field ‘para’ indicates whether the inner and outer elements of parenthetical expressions are interchangeable.

Table 1: Parenthetical expressions used in Japanese newspaper articles

The first group, acronym (I), reduces a full form to a shorter form by removing letters. In general, the process of acronym generation is easily interpreted: the left example in Table 1 consists of two Kanji letters taken from the heads of the two words, while the right example consists of the letters at the end of the 1st, 2nd, and 4th words in the full form. Since all letters in an acronym appear in its full form, previous approaches to English abbreviations are also applicable to Japanese acronyms. Unfortunately, in this survey the number of such ‘authentic’ acronyms amounts to as few as 90 (1.2%).

The second group, acronym with translation (II), is characteristic of non-English languages. Full forms are imported from foreign terms (usually in English), but inherit the foreign abbreviations. The third group, alias (III), presents generic paraphrases that cannot be interpreted as abbreviations. For example, Democratic People’s Republic of Korea is known by its alias North Korea. Even though the formal name does not refer to the ‘northern’ part, the alias consists of Korea and the locational modifier North. Although the second and third groups retain their interchangeability, computers cannot associate these abbreviations with their full forms based on letters.

The last group (IV) does not introduce interchangeable expressions, but presents additional information for outer phrases. For example, a location usage of a parenthetical expression X (Y) describes an entity X, followed by its location Y. Inner and outer elements of parenthetical expressions are not interchangeable. Unfortunately, as many as 81.9% of parenthetical expressions fall under this usage. Thus, this study regards acronyms (with and without translation) and alias as Japanese