A General Introduction to Lexical Databases
Emmanuel Keuleers
Department of Experimental Psychology, Ghent University
EMLAR 2015 - Utrecht, April 15-17, 2015

• What can you find in a lexical database?
• How can you find it?

Lexical Databases
• Like a dictionary
• Lexical properties of interest to psycholinguists
• Frequency, orthography, phonology, morphology, syntax, …
• Subjective ratings of those words
• Behavioural responses to those words
• No standard: each database has its own format and peculiarities
• Text files, web interfaces, e-mail services, etc.
• In essence, a lexical database is just a list with a bunch of information about words.
• The truth: you'll have to find out where to find something and be prepared to do some processing work.

CELEX: the big and complex lexical database

History
• Centre for Lexical Information
• Founded in Nijmegen in 1986
• Max Planck Institute for Psycholinguistics & Interfaculty Research Unit for Language and Speech of the University of Nijmegen (now CLS)
• Project ended in 2000
• Three large databases with lexical information for Dutch, English, and German

Database   Lemmata   Wordforms   Corpus types
Dutch      124,136   381,292     211,389
English     52,446   160,594     220,271
German      51,728   365,530     290,712

Wordforms, lemmas, and corpus types

Corpus types
• Letter strings, regardless of part of speech
• a walk in the park = to walk slowly = i walk alone = you walk alone

Wordforms
• Letter strings disambiguated for part of speech (and sometimes meaning)
• a walk in the park ≠ to walk slowly ≠ i walk alone ≠ you walk alone
• (walk, noun, singular), (walk, verb, infinitive), (walk, verb, 1p), (walk, verb, 2p)

Lemmas
• Headwords
• (walk, noun): a walk in the park = the long walks
• (walk, verb): I'm walking slowly = i walk alone = he walks too fast
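
The three counting levels can be illustrated with a small sketch. The token list below is hand-made from the example phrases on these slides, and the feature labels "plural", "3p" and "participle" are assumptions added for illustration. It is not the CELEX file format.

```python
from collections import Counter

# Hypothetical hand-annotated tokens: (string, lemma, part of speech, features).
tokens = [
    ("walk", "walk", "noun", "singular"),       # a walk in the park
    ("walk", "walk", "verb", "infinitive"),     # to walk slowly
    ("walk", "walk", "verb", "1p"),             # i walk alone
    ("walk", "walk", "verb", "2p"),             # you walk alone
    ("walks", "walk", "noun", "plural"),        # the long walks
    ("walks", "walk", "verb", "3p"),            # he walks too fast
    ("walking", "walk", "verb", "participle"),  # I'm walking slowly
]

# Corpus types: letter strings only, part of speech ignored.
corpus_types = Counter(string for string, _, _, _ in tokens)

# Wordforms: letter strings disambiguated for part of speech and features.
wordforms = Counter((string, pos, feats) for string, _, pos, feats in tokens)

# Lemmas: headword plus part of speech, pooling all inflected forms.
lemmas = Counter((lemma, pos) for _, lemma, pos, _ in tokens)

print(corpus_types)
print(len(wordforms), "distinct wordforms")
print(lemmas)
```

The string "walk" is a single corpus type with four tokens, but it corresponds to four different wordforms spread over two lemmas.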

CELEX Build-Up
• Information from dictionary sources
• Corpus counts or correlation with existing frequency counts
• Almost completely biased towards written language

Dutch Database Sources
• Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984)
  • 80,000 lemmata
• Word List of the Dutch Language ('Het Groene Boekje') (1954), plus later revisions, including the 1994 spelling reform
  • 65,000 lemmata
• The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL), 42,380,000 words in all
  • 15,000 lemmata

English Database Sources
• Oxford Advanced Learner's Dictionary (1974)
  • 41,000 lemmata
• Longman Dictionary of Contemporary English (1978)
  • 53,000 lemmata

German Database Sources
• Bonnlex, supplied by the Institute for Communication Research and Phonetics in Bonn
• Molex, supplied by the Institute for German Language in Mannheim
• Noetic Circle Services (MIT) German spelling lexicon

Dutch Frequency Sources
• INL Corpus (42 million tokens)
• 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection.

English Frequency Sources
• COBUILD/Birmingham corpus (17.9 million tokens)
• 16.6 million tokens from written texts
• 1.3 million tokens from transcribed dialogue

German Frequency Sources
• Mannheimer Korpus I, Mannheimer Korpus II and Bonner Zeitungskorpus 1
  • 5.4 million tokens
  • written texts like newspapers, fiction and non-fiction
• Freiburger Korpus
  • 600,000 tokens
  • transcribed speech

• Corpus Types
  • Frequency
  • Orthography
• Lemma Lexica
  • Frequency
  • Orthography
  • Phonology
  • Derivational Morphology
  • Grammatical information
• Wordform Lexica
  • Frequency
  • Orthography
  • Phonology
  • Inflectional Morphology

Frequency

Verb      Frequency   Deviation   Freq/Million
accept    3712        0           207.37
accord    2010        12          112.29
achieve   2121        0           118.49
act       2212        430         123.58
add       4190        0           234.08
agree     3424        0           191.28

Lexicon    Form     Frequency   Deviation   Frequency/million
lemma      act      2212        430         123.58
wordform   act      269         1233        15.03
wordform   acted    92          366         5.14
wordform   acting   489         103         27.32
wordform   acts     187         80          10.45
wordform   act      269         1233        15.03
wordform   acted    92          366         5.14
wordform   act      269         1233        15.03
wordform   acted    92          366         5.14
wordform   act      269         1233        15.03
wordform   acted    92          366         5.14
wordform   acted    92          366         5.14

• Lemma frequency
  • Frequency over all wordforms of the lemma
• Wordform frequency
  • Deviation == 0 : exact count
  • Deviation > 0 : result of disambiguation
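
As a quick check of "frequency over all wordforms of the lemma", the wordform rows for "act" can be summed. This is a minimal sketch using only the numbers in the table above. The repeated "act" and "acted" rows are kept as separate entries, on the assumption that they are distinct inflectional wordforms sharing one spelling. The corpus size of 17.9 million is the COBUILD figure given earlier.

```python
from collections import Counter

# (form, frequency) for every wordform row of the lemma "act", in table order.
wordform_entries = [
    ("act", 269), ("acted", 92), ("acting", 489), ("acts", 187),
    ("act", 269), ("acted", 92),
    ("act", 269), ("acted", 92),
    ("act", 269), ("acted", 92),
    ("acted", 92),
]

CORPUS_SIZE_MILLION = 17.9  # COBUILD/Birmingham corpus size in million tokens

# Lemma frequency: sum over all wordform entries of the lemma.
lemma_frequency = sum(freq for _, freq in wordform_entries)

# Frequency per spelling, pooling entries that share the same letter string.
per_form = Counter()
for form, freq in wordform_entries:
    per_form[form] += freq

print(lemma_frequency)                                   # 2212
print(round(lemma_frequency / CORPUS_SIZE_MILLION, 2))   # 123.58
print(dict(per_form))
```

The sum reproduces the lemma row (2212, or 123.58 per million), which shows that the repeated rows are needed to arrive at the lemma frequency.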

• Fewer than 100 tokens
  • Manual disambiguation
• More than 100 tokens
  • Disambiguation on a sample of 100 tokens
  • Frequency ± deviation = 95% confidence interval
• No disambiguation for verbal inflection
  • Frequency divided between forms
  • Deviation can be larger than frequency
• No disambiguation for German
  • Frequency divided between forms
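
The slides do not give the formula CELEX used to derive the deviation, so the following is only a sketch of how a sample-based estimate with a 95% interval could be computed. It assumes a plain normal approximation to the binomial. The function name and the example numbers are hypothetical.

```python
import math


def estimate_from_sample(string_frequency, sample_hits, sample_size=100, z=1.96):
    """Estimate the frequency of one reading of an ambiguous string.

    string_frequency: total corpus count of the letter string
    sample_hits: tokens in the disambiguated sample with the target reading
    Returns (estimated frequency, deviation), where frequency ± deviation
    is an approximate 95% confidence interval (normal approximation).
    """
    p = sample_hits / sample_size
    frequency = round(string_frequency * p)
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    deviation = round(z * standard_error * string_frequency)
    return frequency, deviation


# Hypothetical string occurring 1000 times; 40 of 100 sampled tokens are verbs.
print(estimate_from_sample(1000, 40))   # (400, 96)
# If every sampled token has the target reading, the deviation is 0.
print(estimate_from_sample(1000, 100))  # (1000, 0)
```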

• English and German databases have separate fields for written and spoken frequencies
• Spoken frequencies based on very small corpora
  • 1.3 million for English
  • 0.4 million for German

• What does it mean when an entry in CELEX has a frequency of zero?
  • Many entries in the database sources were not found in the frequency sources
  • A few entries do not come from database sources but are left with a zero frequency after disambiguation
    • These will have a deviation greater than zero
  • Many entries added to CELEX for morphological decomposition of other lemmas have a frequency of zero

Word frequency distributions

word   frequency   rank
the    1 093 546   1
of       540 085   2
and      514 946   3
to       483 428   4
a        422 334   5
in       337 995   6
that     217 376   7
it       199 920   8
i        198 139   9

• Let's plot the rank of each word in the COBUILD corpus against its frequency.
• The word with the highest frequency gets the highest rank (1); the word with the lowest frequency gets the lowest rank (220,270).
• In total there are 17.9 million word tokens in the COBUILD corpus.

• Not very clear.
• Let's plot it again so that the difference between a frequency of 1 and a frequency of 10 is the same as the difference between a frequency of 10 and a frequency of 100.

[Figure: word frequency against rank, with frequency on a logarithmic scale. Axis ticks: 10^6 = 1,000,000; 10^3 = 1,000; 10^0 = 1. Labelled words: the, of, and, to, a, in. Annotation: "all these words have frequency 1".]

• Word frequency lists are composed of very few words with a very high frequency
• Most words (corpus types) occur only once in the corpus!
• The relation between word frequency and rank is log-linear: log frequency falls roughly linearly with log rank.
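
The rank and frequency relation can be explored with a short sketch. It uses only the nine COBUILD counts shown earlier, which is far too few for a serious fit, so the slope is merely illustrative. The helper `least_squares_slope` is a hypothetical name for an ordinary least-squares fit written out by hand.

```python
import math

# The nine most frequent COBUILD words from the table above.
frequencies = {
    "the": 1093546, "of": 540085, "and": 514946, "to": 483428, "a": 422334,
    "in": 337995, "that": 217376, "it": 199920, "i": 198139,
}

# Rank 1 goes to the most frequent word.
ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

# One (log10 rank, log10 frequency) point per word.
points = [
    (math.log10(rank), math.log10(freq))
    for rank, (_, freq) in enumerate(ranked, start=1)
]


def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


slope = least_squares_slope(points)
print([word for word, _ in ranked])
print(slope)  # negative: log frequency decreases as log rank increases
```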

Comparing frequencies
• Word frequencies from different databases cannot be easily compared because of different corpus sizes
• Example: CELEX Dutch ±42 million vs CELEX English ±18 million
• Solution: frequency per million words

Frequency per million

word   frequency   frequency per million   rank
the    1 093 546   60 955.74               1
of       540 085   30 105.07               2
and      514 946   28 703.79               3
to       483 428   26 946.93               4
a        422 334   23 541.47               5
in       337 995   18 840.30               6
that     217 376   12 116.83               7
it       199 920   11 143.81               8
i        198 139   11 044.54               9

Comparing frequencies
• Beware! Some frequency lists contain words with a frequency of 0
• log10(0) is undefined, so a frequency of 0 cannot be log-transformed
• Solution: always add 1 to the raw frequencies when you are transforming to frequencies per million

Formula
\[ \text{Frequency per million (adjusted)} = \frac{\text{Raw frequency} + 1}{\text{Corpus size in millions}} \]
\[ \text{FPM}(\text{'that'}) = \frac{217\,376 + 1}{17.94} = 12116.89 \]
\[ \log_{10}(12116.89) = 4.08 \]
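
The adjusted formula is easy to wrap in a small function. This is a minimal sketch: the function name is hypothetical, and the numbers are the ones from the 'that' example (raw frequency 217 376, corpus size 17.94 million).

```python
import math


def frequency_per_million(raw_frequency, corpus_size_million):
    """Adjusted frequency per million: add 1 to the raw count before scaling."""
    return (raw_frequency + 1) / corpus_size_million


fpm_that = frequency_per_million(217376, 17.94)
print(round(fpm_that, 2))              # 12116.89
print(round(math.log10(fpm_that), 2))  # 4.08

# Thanks to the +1, a word with a raw frequency of 0 can still be log-transformed.
fpm_zero = frequency_per_million(0, 17.94)
print(math.log10(fpm_zero))
```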

Zipf Values
Van Heuven, Mandera, Keuleers, & Brysbaert (2014)

Formula
\[ \text{Frequency per billion} = \frac{\text{Raw frequency} + 1}{\text{Corpus size in billions}} \]
\[ \text{FPB}(\text{'that'}) = \frac{217\,376 + 1}{0.01794} = 12116889.63 \]
\[ \log_{10}(12116889.63) = 7.08 \]
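
A sketch of the Zipf computation, following the formula on this slide. The function name `zipf_value` is hypothetical, and the corpus size is passed in tokens (17.94 million) for convenience. Because a billion is a thousand millions, the Zipf value equals the log10 frequency per million plus 3.

```python
import math


def zipf_value(raw_frequency, corpus_size_tokens):
    """log10 of the adjusted frequency per billion words."""
    corpus_size_billion = corpus_size_tokens / 1e9
    per_billion = (raw_frequency + 1) / corpus_size_billion
    return math.log10(per_billion)


zipf_that = zipf_value(217376, 17_940_000)
print(round(zipf_that, 2))  # 7.08

# Same quantity via frequency per million: log10(fpm) + 3.
log_fpm_that = math.log10((217376 + 1) / 17.94)
print(round(log_fpm_that + 3, 2))  # 7.08
```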

word   frequency   relative frequency   log10(fpm)   zipf
the    1 093 546   0.0602191            4.78         7.78
of       540 085   0.0297413            4.47         7.47
and      514 946   0.0283569            4.45         7.45
to       483 428   0.0266213            4.43         7.43
a        422 334   0.0232570            4.37         7.37
in       337 995   0.0186127            4.27         7.27
that     217 376   0.0119704            4.08         7.08
it       199 920   0.0110092            4.04         7.04
i        198 139   0.0109111            4.04         7.04

Orthography
• Lemma and wordform lexica list orthographic variants with separate frequencies

Status
• Dutch: preferred, non-preferred, informal
  • preferred & non-preferred: in “Groene Boekje”
  • informal: non-standard forms occurring at least once in INL corpus
• English: British, American
  • British: acceptable in British English
  • American: occurs only in American English
• German
  • No orthographic variants

Lemma ID   Form               Status          Frequency
1070       aardappelcroquet   preferred       0
1070       aardappelkroket    non-preferred   0
1138       aardelektrode      preferred       0
1138       aardelectrode      non-preferred   0
1202       aardolieprodukt    preferred       6
1202       aardolieproduct    non-preferred   0
1357       abductie           preferred       0
1357       abduktie           non-preferred   0

Lemma ID   Form           Status     Frequency
1359       anaesthesia    British    12
1359       anesthesia     American   1
1360       anaesthetic    British    47
1360       anesthetic     American   4
1361       anaesthetic    British    8
1361       anesthetic     American   0
1362       anaesthetist   British    16
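
Because variants carry separate frequencies, one possible processing step is to pool them per lemma when a single, spelling-independent frequency is needed. This is a sketch of that choice, not something CELEX prescribes. The tuple layout is an assumption for illustration, and the rows are the complete English lemma pairs from the table above.

```python
from collections import defaultdict

# (lemma ID, form, status, frequency), copied from the English table above.
variants = [
    (1359, "anaesthesia", "British", 12),
    (1359, "anesthesia", "American", 1),
    (1360, "anaesthetic", "British", 47),
    (1360, "anesthetic", "American", 4),
    (1361, "anaesthetic", "British", 8),
    (1361, "anesthetic", "American", 0),
]

# Pool variant frequencies by lemma ID, not by spelling.
totals = defaultdict(int)
for lemma_id, form, status, frequency in variants:
    totals[lemma_id] += frequency

print(dict(totals))  # {1359: 13, 1360: 51, 1361: 8}
```

Grouping on the lemma ID rather than on the form keeps lemmas 1360 and 1361 apart, even though both are spelled "anaesthetic".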