Building an online Indonesian dictionary from Word and Excel files
David Moeljadi
Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore
NIE-ELL Postgraduate Conference (PGC), National Institute of Education (NIE), Singapore
20 April 2017
Moeljadi (LMS, NTU) KBBI V 20 April 2017
Outline
  1. Kamus Besar Bahasa Indonesia (KBBI)
  2. From Word and Excel to Database
  3. Features in the Online KBBI V
Kamus Besar Bahasa Indonesia (KBBI)
  the official dictionary of the Indonesian language
  published by Badan Pengembangan dan Pembinaan Bahasa (The Language Development and Cultivation Agency), or Badan Bahasa, under the Ministry of Education and Culture, Republic of Indonesia
  KBBI Fourth Edition (KBBI IV) [5] had its data in Microsoft Excel and Word files
Dictionary entries in KBBI
Dictionary entries in KBBI: Cross-references
The Online KBBI before October 2016
  data from KBBI III, for simple word search by root (kata dasar)
  the result is in exactly the same format as in the printed dictionary
  the data was not structured; there was no database
From KBBI IV to KBBI V
Word and Excel files
From Word and Excel to Rich Text Format (rtf)
From rtf to HyperText Markup Language (html)
Using Python…
  The data was broken down into lemmas, sublemmas (derived words, compounds, proverbs, and idioms), labels, pronunciations, definitions, examples, scientific names, and chemical formulas using regular expressions, a language for specifying text search strings which requires a pattern that we want to search for and a corpus of texts to search through [4].
Regular expression
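The regular-expression step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an entry has already been flattened to a plain-text line of the form "headword label definition"; the real KBBI html markup, the full label inventory, and the names ENTRY_RE and parse_entry are not given in these slides and are hypothetical. The sample line is one of the candidate entries quoted later in this talk.

```python
import re

# Hypothetical flattened entry format: "<syllabified headword> <POS label> <definition>".
# The label set (n, v, a, adv) is an assumption, not the full KBBI label inventory.
ENTRY_RE = re.compile(
    r"^(?P<syllables>\S+)\s+"   # headword with syllable dots, e.g. si.ha.lu.an
    r"(?P<pos>n|v|a|adv)\s+"    # part-of-speech label
    r"(?P<definition>.+)$"      # the rest of the line is the definition
)


def parse_entry(line):
    """Split one flattened entry line into named fields, or return None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The plain lemma is the syllabified headword without the dots.
    fields["lemma"] = fields["syllables"].replace(".", "")
    return fields


entry = parse_entry("si.ha.lu.an v saling bertemu")
print(entry)
```

Named groups keep each extracted field addressable by name, which makes it straightforward to map the fields onto database columns afterwards.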
KBBI Database
  SQLite (www.sqlite.org)
The current state of the KBBI Database
  Lemmas: 48,140
  Derived words: 26,197
  Compound words: 30,375
  Proverbs: 2,039
  Idioms: 267
  Entries (total): 108,238
  Definition sentences: 126,635
  Examples: 29,251
What can we get from KBBI Database? I
  1 More specific and targeted word lookups, e.g.
    ▶ looking up phrases and MWEs such as compound words, idioms, and proverbs as well as derived words
        SELECT entri, jenis, makna FROM baseview WHERE entri='sedia payung sebelum hujan';
    ▶ looking up entries by their labels (part-of-speech, language, and domain labels)
        SELECT entri, ragam, bahasa, makna FROM baseview WHERE ragam='ark' AND bahasa='Jw';
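The two lookups can be tried from Python's built-in sqlite3 module. The sketch below is only an illustration: it builds an in-memory stand-in for baseview with the columns named in the queries, and every row (the jenis value, the entries entri_a and entri_b, the placeholder definitions) is invented test data rather than KBBI content. The helper names are likewise hypothetical.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory stand-in for baseview (assumed columns; all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE baseview "
        "(entri TEXT, jenis TEXT, ragam TEXT, bahasa TEXT, makna TEXT)"
    )
    rows = [
        ("sedia payung sebelum hujan", "peribahasa", None, None, "(placeholder definition)"),
        ("entri_a", "dasar", "ark", "Jw", "(placeholder definition)"),
        ("entri_b", "dasar", None, "Jw", "(placeholder definition)"),
    ]
    conn.executemany("INSERT INTO baseview VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def lookup_entry(conn, entri):
    """Exact lookup of a headword, phrase, idiom, or proverb."""
    sql = "SELECT entri, jenis, makna FROM baseview WHERE entri = ?"
    return conn.execute(sql, (entri,)).fetchall()


def lookup_by_labels(conn, ragam, bahasa):
    """Lookup of entries carrying both a register label and a language label."""
    sql = (
        "SELECT entri, ragam, bahasa, makna FROM baseview "
        "WHERE ragam = ? AND bahasa = ?"
    )
    return conn.execute(sql, (ragam, bahasa)).fetchall()


conn = build_toy_db()
print(lookup_entry(conn, "sedia payung sebelum hujan"))
print(lookup_by_labels(conn, "ark", "Jw"))
```

Passing values through ? placeholders rather than pasting them into the SQL string avoids quoting problems when a search term itself contains quotation marks.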
What can we get from KBBI Database? II
  2 Lexicography analysis
    ▶ extracting the most frequent words in the definition sentences → can be used as a lexical set for the Indonesian learner’s dictionary
    ▶ extracting the most frequent genus terms in the definition sentences
  [Table (Word, Freq.): most frequent words in the definition sentences. Words shown, alphabetically: atau, dalam, dan, dari, dengan, di, orang, pada, sebagainya, seperti, tentang, tidak, untuk, yang, … Frequencies shown, descending: 43,613; 26,221; 14,414; 12,410; 12,016; 10,312; 8,638; 8,537; 7,756; 7,280; 6,793; 6,110; 4,746; 3,422. The word-frequency pairing is not recoverable here.]
  [Table (Word, Freq.): most frequent genus terms in the definition sentences. Words shown, alphabetically: alat, bagian, hasil, kata, mempunyai, menjadikan, orang, perihal, pohon, proses, sesuatu, tempat, tidak, yang, … Frequencies shown, descending: 2,703; 1,858; 1,595; 1,526; 835; 823; 806; 745; 664; 656; 573; 557; 547; 526. The word-frequency pairing is not recoverable here.]
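A frequency count of this kind can be sketched with collections.Counter. The three definition sentences below are the ones quoted for candidate entries elsewhere in this talk, so the counts say nothing about the full database. Treating the first word of a definition as its genus term is a crude assumption made only for this sketch; the slides do not say how genus terms were actually identified, and the function names are hypothetical.

```python
import re
from collections import Counter

# Sample definition sentences quoted elsewhere in this talk (not the full database).
definitions = [
    "saling bertemu",
    "sepeda atau sepeda motor yang ditambangkan dengan cara "
    "memboncengkan penumpang atau penyewanya",
    "nasi yang diberi air putih ditambah garam atau ikan asin",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(sentences):
    """Count every token across all definition sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def first_word_frequencies(sentences):
    """Crude genus-term heuristic (an assumption): count the first token of each definition."""
    return Counter(tokenize(s)[0] for s in sentences if tokenize(s))


word_counts = word_frequencies(definitions)
genus_counts = first_word_frequencies(definitions)
print(word_counts.most_common(3))
print(genus_counts.most_common(3))
```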
What can we get from KBBI Database? III
  3 Linguistic analysis
    ▶ grouping the derived words based on affixes and patterns of reduplication in Indonesian
      Affix/Redup.   Example        Number   Percentage
      meN-           mengabadi       5,185        21.1%
      meN-...-kan    mengabadikan    2,884        11.7%
      ber-           berabang        2,704        11.0%
      -an            abaian          1,873         7.6%
      peN-...-an     pengabadian     1,780         7.2%
      …              …                   …            …
      Total                         24,587       100.0%
  4 Online and offline applications etc.
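Grouping by affix can be sketched with an ordered list of regular expressions over the five example words in the table. This is a string-only approximation and an assumption of this sketch, not the method used for the figures above: it checks circumfixes before plain prefixes and suffixes, covers only some nasal allomorphs of meN- and peN-, and would misclassify roots that merely look affixed. A database that stores each derived word together with its root allows a more reliable comparison.

```python
import re
from collections import defaultdict

# Ordered rules: circumfixes first, so that "mengabadikan" is not counted as plain meN-.
# The patterns are rough approximations of Indonesian affixation (an assumption).
RULES = [
    ("meN-...-kan", re.compile(r"^me(?:ng|ny|m|n)?\w+kan$")),
    ("peN-...-an", re.compile(r"^pe(?:ng|ny|m|n)?\w+an$")),
    ("meN-", re.compile(r"^me(?:ng|ny|m|n)?\w+$")),
    ("ber-", re.compile(r"^ber\w+$")),
    ("-an", re.compile(r"^\w+an$")),
]


def classify(word):
    """Return the label of the first rule that matches, or 'other'."""
    for label, pattern in RULES:
        if pattern.match(word):
            return label
    return "other"


def group_by_affix(words):
    """Group words under their affix-pattern label."""
    groups = defaultdict(list)
    for word in words:
        groups[classify(word)].append(word)
    return dict(groups)


derived = ["mengabadi", "mengabadikan", "berabang", "abaian", "pengabadian"]
groups = group_by_affix(derived)
for label, members in groups.items():
    print(label, members)
```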
The Online KBBI V
  officially launched on 28 October 2016 [1]; its user interface and system were built using ASP.NET (www.asp.net)
  https://kbbi.kemdikbud.go.id/
  Dictionary Writing System (DWS) [2], which enables lexicographers to compile and edit dictionary text, as well as to facilitate project management, typesetting, and output to printed or electronic media
Some features in the Online KBBI
  Word search
    Before 28 Oct 2016: basic (by roots)
    After 28 Oct 2016: advanced (+by labels etc.)
  Lexicographical workflow
    Before 28 Oct 2016: done within the editorial board in Badan Bahasa
    After 28 Oct 2016: +online public participation to add, edit, and deactivate lemmas, definitions, and examples (crowdsourcing)
  Security system
    Before 28 Oct 2016: data can be easily crawled
    After 28 Oct 2016: customized security system to protect the data from web crawlers
  Print function
    Before 28 Oct 2016: no print function
    After 28 Oct 2016: print function can convert the data in the database to print format
Lexicographical workflow in the Online KBBI
How can a new lemma be included in KBBI?
  1 Having a unique concept
  2 Conforming to the Indonesian spelling rules
  3 Euphonic (being pleasing to the ear)
  4 Having positive connotations
  5 Having a high frequency of use
  NOT OK: si.ha.lu.an v saling bertemu ("to meet each other"; cf. ber.se.mu.ka)
  NOT OK: ojeg n sepeda atau sepeda motor yang ditambangkan dengan cara memboncengkan penumpang atau penyewanya ("a bicycle or motorcycle hired out to carry a passenger or renter on the back"; cf. ojek)
  NOT OK: la.bu.la.bu.wai n nasi yang diberi air putih ditambah garam atau ikan asin ("rice served with plain water plus salt or salted fish")
  (Dora Amalia, p.c.)
Rejected proposal
Accepted proposal
Current situation (as of 20 April 2017)
  Word lookups
    ▶ Total: 2,733,592 (10.93/minute, 653.90/hour, 15,741.62/day)
  Proposals
    ▶ Total: 8,375 (48.23/day)
    ▶ Accepted: 2,681
    ▶ Rejected: 494
    ▶ Being processed: 4,732
  Popularity (according to Alexa Traffic Ranks, www.alexa.com)
    ▶ Global rank: 2,548
    ▶ Rank in Indonesia: 64
Future work
  add etymological information
  connect to corpora
  link to other lexical resources such as Wordnet Bahasa [3]
Acknowledgments
  Thanks to Francis Bond and Luis Morgado da Costa for the precious advice on the database structure
  Thanks to Ivan Lanin for improving the database
  Thanks to Ian Kamajaya for building the Online KBBI
  Thanks to Randy Sugianto for creating the Android application
  Thanks to Jaya Satrio Hendrick for designing the Android and iOS applications
  Thanks to Dora Amalia for the KBBI IV data and her support
  Thanks to Lie Gunawan for creating the iOS application
  Thanks to NTU HSS library support staff: Rashidah Ismail, Raihana Abdul Wahid, and Tan Chuan Ko for allowing me to borrow the KBBI IV paper dictionary for months; and to Wong Oi May, who helped us order the dictionary
References
  [1] Dora Amalia, ed. Kamus Besar Bahasa Indonesia. 5th ed. Jakarta: Badan Pengembangan dan Pembinaan Bahasa, 2016.
  [2] B. T. Sue Atkins and Michael Rundell. The Oxford Guide to Practical Lexicography. Oxford University Press, 2008.
  [3] Francis Bond et al. “The combined Wordnet Bahasa”. In: NUSA: Linguistic studies of languages in and around Indonesia 57 (2014), pp. 83–100.
  [4] Daniel Jurafsky and James H. Martin. Speech and Language Processing. 2nd ed. New Jersey: Pearson Education, Inc., 2009.
  [5] Dendy Sugono, ed. Kamus Besar Bahasa Indonesia Pusat Bahasa. 4th ed. Jakarta: PT Gramedia Pustaka Utama, 2008.