Language Resource Addition: Dictionary or Corpus? Shinsuke Mori - PowerPoint PPT Presentation

Language Resource Addition: Dictionary or Corpus? (1 / 30)
Shinsuke Mori (Kyoto University), Graham Neubig (NAIST)
2014 May 29

Table of Contents (2 / 30)
◮ Overview
◮ Morphological Analysis
◮ Evaluation
◮ Realistic Cases
◮ Conclusion

NLP for Applications (3 / 30)
◮ Machine learning approach
  1. Annotation standard
  2. Language resource (texts with annotations)
  3. Classifiers
◮ High accuracy in the general domain
  ◮ A sufficiently large annotated corpus is available
◮ Not sufficiently accurate for various texts
  ◮ Achieve high accuracy by all means!!

Language Resource Addition for ML-based NLP (4 / 30)
Language resource addition never betrays!!
◮ As dictionary entries
  ◮ Easy for tool users ··· You just edit the dictionary.
  ◮ Without context ⇒ Improve NLP
◮ As an annotated corpus
  ◮ Not easy for tool users ··· You need re-training.
  ◮ With context ⇒ Improve more?

Task for Experiments (5 / 30)
◮ Japanese morphological analysis = WS + PT
  Word segmentation (WS)
    ex.) 吾輩は猫である (I am a cat)
         ⇓
         吾輩 は 猫 で あ る
  Part-of-speech tagging (PT)
    ex.) 吾輩 は 猫 で あ る
         ⇓
         N P N P V Suf
◮ Most ambiguity lies in WS

Sequence-based Approach (SB) (6 / 30)
◮ MeCab: CRF-based joint method [Kudo 04]
    吾輩 は 猫 で あ る
    N P N P V Suf
  ◮ refers to the word to be tagged w, the word sequences to its left w− and right w+, and their POS
  ◮ requires fully annotated language resources
    ex.) 吾輩/N は/P 猫/N で/P あ/V る/Suf
    Cf. [Tsuboi 08]

Pointwise Approach (PW) (7 / 30)
◮ KyTea: 2-step pointwise method (SVM or other) [Neubig 11]
◮ Word segmentation ⇒ POS tagging
    吾輩は猫である, with the tags 0 1 1 1 1 1 on the six gaps between adjacent characters
  ◮ refers to only the word to be tagged w, and the character sequences to its left c− and right c+
  ◮ never refers to any estimated values!
  ◮ is trainable from partially annotated language resources
    ex.) 吾輩は猫である, with the left and right parts marked "no annot."

Pointwise Approach (PW) (8 / 30)
◮ KyTea: 2-step pointwise method (SVM or other) [Neubig 11]
◮ Word segmentation ⇒ POS tagging
    吾輩は猫である, with the POS tag N estimated for the word being tagged
  ◮ refers to only the word to be tagged w, and the character sequences to its left c− and right c+
  ◮ never refers to any estimated values!
  ◮ is trainable from partially annotated language resources
    ex.) 吾輩は 猫/N である, with 吾輩は and である marked "no annot."
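The boundary tags on the pointwise slides can be illustrated with a minimal Python sketch. It is not KyTea's implementation or data format: the helper names `boundary_tags` and `segment` are hypothetical, the reading "1 = word boundary, 0 = no boundary" is inferred from the example, and the choice of which gaps are annotated in the partial example is an assumption made for illustration.

```python
from typing import List, Optional

def boundary_tags(words: List[str]) -> List[int]:
    """One tag per gap between adjacent characters: 1 = word boundary, 0 = none."""
    tags: List[int] = []
    for i, word in enumerate(words):
        tags.extend([0] * (len(word) - 1))  # gaps inside a word
        if i < len(words) - 1:
            tags.append(1)                  # gap between this word and the next
    return tags

def segment(text: str, tags: List[int]) -> List[str]:
    """Rebuild the word sequence from the raw text and its gap tags."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag == 1:
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words

words = ["吾輩", "は", "猫", "で", "あ", "る"]
tags = boundary_tags(words)  # six gaps for seven characters

# Partial annotation (assumed layout): only the two gaps around 猫 are labeled;
# None means "no annot.", so those gaps yield no training example.
partial: List[Optional[int]] = [None, None, 1, 1, None, None]
training_points = [(i, t) for i, t in enumerate(partial) if t is not None]
```

Because each gap is classified independently of the others, a classifier of this kind can be trained on `training_points` alone, which is why partially annotated text is usable at all.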

Dictionary or Corpus (9 / 30)
Dictionary
  word1/POS1,POS2
  word2/POS2,POS3
  ...
Corpus
  left context word1/POS1 right context
  left context word1/POS2 right context
  left context word2/POS2 right context
  left context word2/POS3 right context
  ...
◮ Unknown words are found in real texts with contexts

Experimental Setting (10 / 30)
1. BCCWJ (Balanced Corpus of Contemporary Written Japanese) [Maekawa 08]

Corpus
  Domain          #words
  General         784k     (Core Data - Yahoo!QA)
  General + Web   898k     (Core Data)
  Web for test    13.0k

Dictionary
  Domain          #words   Coverage (word/POS)
  General         29.7k    96.3%
  General + Web   32.5k    97.9%
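As a rough illustration of how a dictionary relates to an annotated corpus, the sketch below collects word/POS entries from tagged sentences (dropping their contexts) and measures coverage on a test set. The function names are hypothetical, the toy sentences are not from BCCWJ, and the coverage definition used here (the fraction of test tokens whose word/POS pair appears in the dictionary) is an assumption; the slides do not define it.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # a sentence as a list of (word, POS) pairs

def build_dictionary(corpus: List[Sentence]) -> Dict[str, Set[str]]:
    """Collect every POS observed for each word; the context is discarded."""
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for sentence in corpus:
        for word, pos in sentence:
            dictionary[word].add(pos)
    return dict(dictionary)

def coverage(dictionary: Dict[str, Set[str]], test_corpus: List[Sentence]) -> float:
    """Fraction of test tokens whose word/POS pair is in the dictionary (assumed definition)."""
    pairs = [(w, p) for sentence in test_corpus for (w, p) in sentence]
    if not pairs:
        return 0.0
    covered = sum(1 for w, p in pairs if p in dictionary.get(w, set()))
    return covered / len(pairs)

# Toy data for illustration only.
train = [[("吾輩", "N"), ("は", "P"), ("猫", "N"), ("で", "P"), ("あ", "V"), ("る", "Suf")]]
test = [[("猫", "N"), ("は", "P"), ("犬", "N"), ("で", "P")]]

dictionary = build_dictionary(train)
cov = coverage(dictionary, test)  # 犬/N is unknown, so 3 of 4 tokens are covered
```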

MA and method (11 / 30)
◮ Morphological analyzer
  1. MeCab: CRF-based joint method [Kudo 04]
  2. KyTea: 2-step pointwise method [Neubig 11]
◮ Adaptation strategies
  1. No adaptation: Use the corpus and the dictionary in the general domain.
  2. Dictionary addition (no re-training): Add words appearing in the Web training corpus to the dictionary (MeCab only).
  3. Dictionary addition (re-training): + estimate the weights on the general domain training data.
  4. Corpus addition: Add annotated sentences in the Web training corpus and train the parameters.

Accuracy Measurement (12 / 30)
◮ N_REF: the number of word-POS pairs in the correct sentence
◮ N_SYS: the number of word-POS pairs in the system output
◮ N_LCS: the length of the LCS (longest common subsequence)
  \[ \mathrm{Recall} = \frac{N_{LCS}}{N_{REF}}, \qquad \mathrm{Prec.} = \frac{N_{LCS}}{N_{SYS}}. \]
◮ F-measure: the harmonic mean of the Recall and the Prec.
  \[ F = \left\{ \frac{1}{2}\left( R^{-1} + P^{-1} \right) \right\}^{-1} = \frac{2 N_{LCS}}{N_{REF} + N_{SYS}}. \]
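The measure above can be sketched in a few lines of Python. This is an illustrative implementation of the formulas on the slide, not the authors' evaluation script; the function names and the toy system output are hypothetical.

```python
from typing import List, Sequence, Tuple

def lcs_length(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(hyp) + 1)
    for r in ref:
        cur = [0]
        for j, h in enumerate(hyp, start=1):
            if r == h:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def precision_recall_f(ref: List[str], hyp: List[str]) -> Tuple[float, float, float]:
    """Prec. = N_LCS / N_SYS, Recall = N_LCS / N_REF, F = 2 N_LCS / (N_REF + N_SYS)."""
    n_lcs = lcs_length(ref, hyp)
    precision = n_lcs / len(hyp) if hyp else 0.0
    recall = n_lcs / len(ref) if ref else 0.0
    total = len(ref) + len(hyp)
    f_measure = 2 * n_lcs / total if total else 0.0
    return precision, recall, f_measure

# Each item is one word-POS pair; the system output is a made-up error case.
ref = ["吾輩/N", "は/P", "猫/N", "で/P", "あ/V", "る/Suf"]
hyp = ["吾輩/N", "は/P", "猫/N", "である/V"]

n_lcs = lcs_length(ref, hyp)
precision, recall, f_measure = precision_recall_f(ref, hyp)
```

In this toy case N_REF = 6, N_SYS = 4 and N_LCS = 3, so Prec. = 3/4, Recall = 3/6 and F = 6/10.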

Word Segmentation Accuracy (13 / 30)

  Adaptation strategy               MeCab    KyTea
  No adaptation                     95.20%   95.54%
  Dict. addition (no re-training)   96.59%   -
  Dict. addition (re-training)      96.55%   96.75%
  Corpus addition                   96.85%   97.15%

◮ Dictionary addition (without context): +1.35% (MeCab), +1.21% (KyTea) over no adaptation
◮ Corpus addition (with context): a further +0.30% (MeCab), +0.40% (KyTea) over dictionary addition with re-training
◮ 75~80% of the total gain is obtained without context
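The differences quoted on this slide can be recomputed from the table. The sketch below does only that arithmetic; reading "75~80%" as the share of the total gain already obtained by dictionary addition is an interpretation, not something the slide states explicitly, and the key names are hypothetical.

```python
from typing import Dict, Tuple

# Word segmentation accuracy (%) from the table; "dict" is dictionary addition with re-training.
scores: Dict[str, Dict[str, float]] = {
    "MeCab": {"none": 95.20, "dict": 96.55, "corpus": 96.85},
    "KyTea": {"none": 95.54, "dict": 96.75, "corpus": 97.15},
}

def gains(s: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (gain from dictionary, extra gain from corpus, dictionary share of total gain)."""
    dict_gain = s["dict"] - s["none"]
    corpus_gain = s["corpus"] - s["dict"]
    share = dict_gain / (s["corpus"] - s["none"])
    return round(dict_gain, 2), round(corpus_gain, 2), round(share, 3)

for tool, s in scores.items():
    dict_gain, corpus_gain, share = gains(s)
    print(f"{tool}: dictionary +{dict_gain:.2f}, corpus +{corpus_gain:.2f}, share {share:.1%}")
```

Under this reading the dictionary's share comes out at about 82% for MeCab and about 75% for KyTea, broadly in line with the range on the slide.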

Realistic Cases (14 / 30)
◮ The previous experiments are somewhat artificial or in-vitro
  ◮ Full annotation required
    ex.) 吾輩/N は/P 猫/N で/P あ/V る/Suf
◮ Two real adaptation scenarios or in-vivo
  ◮ Partial annotation
    ex.) 吾輩は 猫/N である, with 吾輩は and である marked "no annot."
  ◮ Only KyTea (MeCab does not support such data)
  ◮ focusing on word segmentation where most ambiguity lies

Case 1: Recipe Text Analysis for Procedural Text Understanding (15 / 30)
◮ Recipe flow graph corpus [Mori 14] (05/29 Session: P34 - Corpora and Annotation)
  [Figure: a three-step recipe annotated as a flow graph. Spans are tagged F, T, Ac, St, or D; arcs are labeled d-obj, i-obj, F-part-of, F-eq, T-comp, or other-mod.]
  1. 各 ホットドッグパン/F の 内側/F に、 マヨネーズ/F 、 マスタード/F 、 甘味料/F を 広げ/Ac る。
     (each) (hot dog buns) (of) (inside) (cmi) (mayonnaise) (mustard) (sweet relish) (cmd) (spread) (infl.)
     フランクフルト/F を 入れ/Ac 、 １３×９“/St の オーブン皿/T に 置/Ac く。
     (frankfurter) (cmd) (fill) (13 x 9 “) (of) (baking dish) (cmi) (place) (infl.)
  2. 各 ホットドッグ/F に チリ/F 、 チーズ/F 、 オニオン/F を ふりかけ/Ac る。
     (each) (hot dog) (cmi) (chili) (cheese) (onion) (cmd) (sprinkle) (infl.)
  3. アルミホイル/T で 覆/Ac い、 オーブン/T に 置/Ac く。
     (aluminum foil) (cmc) (cover) (infl.) (oven) (cmi) (place) (infl.)
     そして、 ３５０度/St で ４５分間/D 焼/Ac く。
     (then) (350 degrees) (cmc) (45 minutes) (bake) (infl.)
◮ Specifications
             #Sent.   #NEs     #Words   #Char.
  Training   1,760    13,197   33,088   50,002
  Test       724      –        13,147   19,975
