A Discriminative Approach to Japanese Abbreviation Extraction

Naoaki Okazaki† (okazaki@is.s.u-tokyo.ac.jp), Mitsuru Ishizuka† (ishizuka@i.u-tokyo.ac.jp), Jun'ichi Tsujii†‡ (tsujii@is.s.u-tokyo.ac.jp)

† Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
‡ School of Computer Science, University of Manchester; National Centre for Text Mining (NaCTeM), Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK

Abstract

This paper addresses the difficulties in recognizing Japanese abbreviations through the use of previous approaches, examining actual usages of parenthetical expressions in newspaper articles. In order to bridge the gap between Japanese abbreviations and their full forms, we present a discriminative approach to abbreviation recognition. More specifically, we formalize the abbreviation recognition task as a binary classification problem in which a classifier determines a positive (abbreviation) or negative (non-abbreviation) class, given a candidate abbreviation definition. The proposed method achieved 95.7% accuracy, 90.0% precision, and 87.6% recall on the evaluation corpus containing 7,887 (1,430 abbreviations and 6,457 non-abbreviations) instances of parenthetical expressions.

1 Introduction

Human languages are rich enough to be able to express the same meaning through different diction; we may produce different sentences to convey the same information by choosing alternative words or syntactic structures. Lexical resources such as WordNet (Miller et al., 1990) enhance various NLP applications by recognizing a set of expressions referring to the same entity/concept. For example, text retrieval systems can associate a query with alternative words to find documents where the query is not obviously stated.

Abbreviations are a highly productive type of term variant, substituting fully expanded terms with shortened term-forms. Most previous studies aimed at establishing associations between abbreviations and their full forms in English (Park and Byrd, 2001; Pakhomov, 2002; Schwartz and Hearst, 2003; Adar, 2004; Nadeau and Turney, 2005; Chang and Schütze, 2006; Okazaki and Ananiadou, 2006). Although researchers have proposed various approaches to solving abbreviation recognition through methods such as deterministic algorithms, scoring functions, and machine learning, these studies rely on a phenomenon specific to English abbreviations: all letters in an abbreviation appear in its full form.

However, abbreviation phenomena are heavily dependent on languages. For example, the term one-segment broadcasting is usually abbreviated as one-seg in Japanese; English speakers may find this peculiar as the term is likely to be abbreviated as 1SB or OSB in English. We show that letters do not provide useful clues for recognizing Japanese abbreviations in Section 2. Elaborating on the complexity of the generative processes for Japanese abbreviations, Section 3 presents a supervised learning approach to Japanese abbreviations. We then evaluate the proposed method on a test corpus from newspaper articles in Section 4 and conclude this paper.

2 Japanese Abbreviation Survey

Researchers have proposed several approaches to abbreviation recognition for non-alphabetical languages. Hisamitsu and Niwa (2001) compared different statistical measures (e.g., χ² test, log likelihood ratio) to assess the co-occurrence strength between the inner and outer phrases of parenthetical expressions X (Y). Yamamoto (2002) utilized the similarity of local contexts to measure the paraphrase likelihood of two expressions based on the distributional hypothesis (Harris, 1954). Chang and Teng (2006) formalized the generative processes of Chinese abbreviations with a noisy channel model. Sasano et al. (2007) designed rules about letter types and occurrence frequency to collect lexical paraphrases used for coreference resolution.

How effective are these approaches in recognizing Japanese abbreviation definitions? As a preliminary study, we examined abbreviations described in parenthetical expressions in Japanese newspaper articles. We used the 7,887 parenthetical expressions that occurred more than eight times in Japanese articles published by the Mainichi Newspapers and Yomiuri Shimbun in 1998–1999. Table 1 summarizes the usages of parenthetical expressions in four groups. The field 'para' indicates whether the inner and outer elements of parenthetical expressions are interchangeable.

[Table 1: Parenthetical expressions used in Japanese newspaper articles]

The first group, acronym (I), reduces a full form to a shorter form by removing letters. In general, the process of acronym generation is easily interpreted: the left example in Table 1 consists of two Kanji letters taken from the heads of the two words, while the right example consists of the letters at the end of the 1st, 2nd, and 4th words in the full form. Since all letters in an acronym appear in its full form, previous approaches to English abbreviations are also applicable to Japanese acronyms. Unfortunately, in this survey the number of such 'authentic' acronyms amounts to as few as 90 (1.2%).

The second group, acronym with translation (II), is characteristic of non-English languages. Full forms are imported from foreign terms (usually in English), but inherit the foreign abbreviations. The third group, alias (III), presents generic paraphrases that cannot be interpreted as abbreviations. For example, Democratic People's Republic of Korea is known by its alias North Korea. Even though the formal name does not refer to the 'northern' part, the alias consists of Korea and the locational modifier North. Although the second and third groups retain their interchangeability, computers cannot recognize abbreviations with their full forms based on letters.

The last group (IV) does not introduce interchangeable expressions, but presents additional information for outer phrases. For example, a location usage of a parenthetical expression X (Y) describes an entity X, followed by its location Y. Inner and outer elements of parenthetical expressions are not interchangeable. Regrettably, as many as 81.9% of parenthetical expressions fell under this usage. Thus, this study regards acronyms (with and without translation) and alias as Japanese
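The letter-based assumption behind the English approaches discussed above ("all letters in an abbreviation appear in its full form") can be illustrated with a small check. The following is only a minimal sketch, not the method of any cited system: the function name `is_letter_subsequence` is hypothetical, and matching is simplified to a case-insensitive, in-order subsequence test over alphanumeric characters.

```python
def is_letter_subsequence(abbreviation: str, full_form: str) -> bool:
    """Return True if every letter of `abbreviation` appears, in order, in `full_form`.

    Hypothetical helper for illustration; real systems add alignment and scoring.
    """
    remaining = iter(full_form.lower())
    # `ch in remaining` consumes the iterator up to the first match,
    # so each later letter must be found after the previous one.
    return all(ch in remaining for ch in abbreviation.lower() if ch.isalnum())


# An English-style acronym satisfies the assumption.
print(is_letter_subsequence("OSB", "one-segment broadcasting"))
print(is_letter_subsequence("NaCTeM", "National Centre for Text Mining"))

# An alias (group III) does not: the full form contains no letter "n".
print(is_letter_subsequence("North Korea", "Democratic People's Republic of Korea"))
```

Under these assumptions the first two calls succeed and the alias fails, which is the kind of pair a purely letter-based recognizer cannot associate.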
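To make the idea of a co-occurrence measure for a parenthetical expression X (Y) concrete, here is a rough sketch of the standard Pearson χ² statistic on a 2×2 contingency table. It is not taken from Hisamitsu and Niwa (2001); the function name `chi_square_2x2`, its parameters, and the counts in the usage lines are all hypothetical.

```python
def chi_square_2x2(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pearson chi-square for the co-occurrence of an outer phrase X and an inner phrase Y.

    n_xy:    number of parenthetical expressions of the form X (Y)
    n_x:     number of expressions whose outer phrase is X
    n_y:     number of expressions whose inner phrase is Y
    n_total: total number of parenthetical expressions
    (All four counts are assumed inputs for this illustration.)
    """
    a = n_xy                          # X with Y
    b = n_x - n_xy                    # X without Y
    c = n_y - n_xy                    # Y without X
    d = n_total - n_x - n_y + n_xy    # neither X nor Y
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n_total * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: X and Y always appear together vs. independently.
print(chi_square_2x2(n_xy=50, n_x=50, n_y=50, n_total=1000))
print(chi_square_2x2(n_xy=10, n_x=100, n_y=100, n_total=1000))
```

A high score only indicates that X and Y co-occur more often than chance; it does not by itself tell an abbreviation apart from, say, a location usage, which is the distinction the survey above is concerned with.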
