Building a Tokenizer for Indonesian
  David Moeljadi and Hannah Choi
  Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore
  The 21st International Symposium on Malay/Indonesian Linguistics (ISMIL 21), Langkawi Research Center
  4 May 2017
  Moeljadi & Choi (LMS, NTU) Tokenizer for Indonesian 4 May 2017 1 / 13
Outline
  1. Tokenization
  2. Wordnet Bahasa
  3. Our proposal
Motivation
  There is no good tokenizer for Indonesian
    → we are building a good one (early stage)
  A good tokenizer brings many benefits, especially for natural language processing and corpus work
  We will propose our guidelines
    → open to comments and suggestions
Tokenization
  Tokenization or word segmentation is the task of separating out (tokenizing) words or other meaningful elements (tokens) from running text; the segmentation of text [3]
  Tokens: words, numbers, punctuation marks, parentheses, quotation marks, and similar entities
An Example in English
  Wall Street Journal corpus:
    “Most customers don’t want to sit in a turboprop for 2 1/2 to three hours,” Mr. Lowe said.
  Tokenization result:
    <S> “ Most customers do n’t want to sit in a turboprop for 2 1/2 to three hours , ” Mr. Lowe said . </S>
  Source: Corpus linguistics: an international handbook, volume 1
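The English tokenization on this slide can be approximated with a single regular expression. The sketch below is only an illustration, not the tokenizer used for the Wall Street Journal corpus: the pattern, the short abbreviation list, and the function name `tokenize` are all assumptions made for this example.

```python
import re

# Illustrative pattern (an assumption, not the WSJ tokenizer).
# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(r"""
    \d+\s\d+/\d+             # mixed fraction kept as one token, e.g. 2 1/2
  | \d+(?:[.,]\d+)*          # plain number, e.g. 19.00
  | (?:Mr|Mrs|Dr)\.          # small, hypothetical list of abbreviations
  | \w+(?=n['\u2019]t\b)     # stem in front of the clitic n't (do in don't)
  | n['\u2019]t\b            # the clitic n't itself
  | \w+(?:-\w+)*             # ordinary word, possibly hyphenated
  | \S                       # any other single non-space character
""", re.VERBOSE)


def tokenize(sentence):
    """Split one sentence into tokens and add sentence boundary markers."""
    return ["<S>"] + TOKEN_RE.findall(sentence) + ["</S>"]


sentence = ("“Most customers don’t want to sit in a turboprop "
            "for 2 1/2 to three hours,” Mr. Lowe said.")
print(" ".join(tokenize(sentence)))
```

A pattern like this only looks at surface shape. It has no way of knowing that an Indonesian word such as relawannya should be split into relawan and nya, which is why the proposal later in the talk relies on a lexicon.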
An Example in Indonesian
  KOMPAS.com, “Merespons Pembakaran Bunga, Relawan Ahok-Djarot Nyalakan Lilin”:
    …salah satu relawannya Ahok bilang ‘Kita kumpul di sana jam 19.00 WIB’. …
  Tokenization result:
    <S> salah satu relawan nya Ahok bilang ‘ Kita kumpul di sana jam 19.00 WIB ’ . </S>
Purpose of Tokenization
  Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.
  The list of tokens becomes input for further processing such as parsing (taking an input and producing some sort of linguistic structure for it) or text mining (the process of deriving high-quality information from text).
  Text → tokenization → part-of-speech (POS) tagging → lemmatization
    → syntactic parsing → treebank building, corpus query, lexicography, identification of collocations, determining verb frames, information extraction, term extraction, …
    → sense/semantic tagging → semantic disambiguation → machine translation, information retrieval, sentiment analysis
Current situation
  NLTK tokenizer (http://text-processing.com/demo/tokenize/)
  morphInd (http://septinalarasati.com/work/morphind/)
  http://morphadorner.northwestern.edu/morphadorner/wordtokenizer/example/
  …
Tokenization problems
  Multiword expressions
    e.g. New York, rumah sakit “hospital”, memberi tahu “tell, inform”, dan lain-lain “et cetera”, …
    Problems: orang tua “parent/old person”, kamar kecil “toilet/small room”, kambing hitam “scapegoat/black goat”, …
  Clitics
    e.g. isn’t, he’s, we’ll, kukejar “chased by me”, kaukejar “chased by you”, dikejarnya “chased by him/her”, mengejarmu “chase you”, bukunya “the/his/her book”, …
    Problems: kucek “rub, scrub/checked by me”, rumah bekuku “Gilt-head bream’s house/my frozen house”, keramu “keramu tree/your monkey”, penanya “questioner/his/her/the pen”, …
  Affixes
    e.g. se-Indonesia “whole/entire Indonesia”, seekor “one CL”, …
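The clitic problem cases can be made concrete by listing every segmentation that a lexicon licenses for a single written word. The following is a minimal sketch under stated assumptions: `LEXICON` is a hypothetical toy word list, the clitic inventories contain only forms that appear on this slide, and `candidate_analyses` is a name invented for this example.

```python
# Hypothetical toy lexicon; a real system would consult a full dictionary.
LEXICON = {"kucek", "cek", "pena", "penanya", "buku", "kejar"}

# Clitic forms taken from the examples on the slide.
PROCLITICS = ("ku", "kau")
ENCLITICS = ("ku", "mu", "nya")


def candidate_analyses(word):
    """Return every segmentation of `word` that the toy lexicon allows."""
    analyses = []
    if word in LEXICON:
        analyses.append([word])                 # unsplit reading
    for pro in PROCLITICS:
        base = word[len(pro):]
        if word.startswith(pro) and base in LEXICON:
            analyses.append([pro, base])        # proclitic + base
    for enc in ENCLITICS:
        base = word[:-len(enc)]
        if word.endswith(enc) and base in LEXICON:
            analyses.append([base, enc])        # base + enclitic
    return analyses


for w in ("kucek", "penanya", "bukunya"):
    print(w, candidate_analyses(w))
```

With this toy lexicon, kucek and penanya each receive two analyses while bukunya receives one. A tokenizer needs a policy for the words with more than one analysis, because the written form alone does not decide between them.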
Wordnet
  an open-source, free semantic lexicon
  a resource for the study of lexical semantics
  http://wordnet.princeton.edu
  synset (synonym set): a group of words with closely related meanings
    e.g. the noun “car” has 5 different meanings (senses), thus belongs to multiple synsets. One synset for “car” consists of many members. [2]
Wordnet Bahasa
  The Combined Wordnet Bahasa [1]: open source
    1. Malay Wordnet (Lim & Hussein, 2006)
    2. Indonesian Wordnet (Riza, Budiono & Hakim, 2010)
    3. Open Wordnet Bahasa (Nurril Hirfana, Suerya & Bond, 2011)
  http://wn-msa.sourceforge.net
  Indonesian: 48,689 synsets and 58,541 words
  Malay: 38,736 synsets and 45,664 words
  has been used for sense tagging NTU Multilingual Corpus (NTU-MC) of English, Chinese, Japanese and Indonesian, …
Our proposal
  General rules:
    1. Do not tokenize multiword expressions into words if they are in Wordnet
       e.g. orang tua “parent/old person” → orang tua “parent”
       (orang, tua, and orang tua are in Wordnet)
    2. Split clitics from the bases
       e.g. penanya “questioner/his/her/the pen” → pena nya
       (both pena and nya are in Wordnet)
    3. Split affixes from the stems if the affixes have consistent, predictable meanings
       e.g. seekor “one CL” → se ekor
       (both se and ekor are in Wordnet)
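The three general rules can be sketched as a small lexicon-driven procedure. This is only one possible reading of the guidelines, not the authors' implementation: `WORDNET` is a hypothetical in-memory set standing in for lookups in Wordnet Bahasa, the clitic and prefix lists hold just the forms from the slide, and the function names and the `max_mwe_len` parameter are assumptions made for this example.

```python
# Hypothetical stand-in for Wordnet Bahasa lookups (toy data only).
WORDNET = {"orang", "tua", "orang tua", "pena", "nya", "se", "ekor"}

ENCLITICS = ("nya",)   # rule 2: clitics to split off (slide example only)
PREFIXES = ("se",)     # rule 3: affixes with predictable meaning


def split_word(word):
    """Apply rules 2 and 3 to one orthographic word."""
    for enc in ENCLITICS:
        base = word[:-len(enc)]
        if word.endswith(enc) and base in WORDNET and enc in WORDNET:
            return [base, enc]
    for pre in PREFIXES:
        stem = word[len(pre):]
        if word.startswith(pre) and stem in WORDNET and pre in WORDNET:
            return [pre, stem]
    return [word]


def tokenize(text, max_mwe_len=3):
    """Rule 1 first (longest multiword match), then rules 2 and 3."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate multiword expression starting at i.
        for n in range(min(max_mwe_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in WORDNET:
                tokens.append(candidate)
                i += n
                break
        else:
            # No multiword expression found: fall back to word-level rules.
            tokens.extend(split_word(words[i]))
            i += 1
    return tokens


print(tokenize("orang tua membeli seekor kambing"))
print(tokenize("penanya"))
```

In this sketch the rules are applied in a fixed order, so orang tua always comes out as one token and penanya always comes out as pena plus nya, matching the examples on the slide. Whitespace splitting is used for brevity; punctuation handling would have to be added separately.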
References
  [1] Francis Bond et al. “The combined Wordnet Bahasa”. In: NUSA: Linguistic studies of languages in and around Indonesia 57 (2014), pp. 83–100.
  [2] Christiane Fellbaum. WordNet: an electronic lexical database. Cambridge: MIT Press, 1998. URL: http://wordnet.princeton.edu/man/wninput.5WN.html (visited on 11/24/2014).
  [3] Daniel Jurafsky and James H. Martin. Speech and Language Processing. 2nd ed. New Jersey: Pearson Education, Inc., 2009.