
A General Introduction to Lexical Databases
Emmanuel Keuleers
Department of Experimental Psychology, Ghent University
EMLAR 2015 - Utrecht, April 15-17, 2015

• What can you find in a lexical database?
• How can you find it?

Lexical Databases
• Like a dictionary
• Lexical properties of interest to psycholinguists
  • Frequency, orthography, phonology, morphology, syntax, …
• Subjective ratings of those words
• Behavioural responses to those words

Lexical Databases
• No standard: each database has its own format, peculiarities, ...
• Text files, web interfaces, e-mail services, etc.
• In essence, a lexical database is just a list with a bunch of information about words.
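Because a lexical database is in essence a list of words with information attached, a first processing step is usually to read it into a list of records. The following is a minimal sketch, assuming a hypothetical tab-separated export with a header row. The column names, the sample rows and the helper name `read_lexicon` are invented for illustration and do not describe the layout of any particular database.

```python
import csv
import io

# Hypothetical tab-separated export; real databases each use their own layout.
SAMPLE = "word\tfrequency\nthe\t1093546\nof\t540085\nand\t514946\n"

def read_lexicon(handle, delimiter="\t"):
    """Read a delimited text file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for row in reader:
        row["frequency"] = int(row["frequency"])  # counts arrive as strings
        rows.append(row)
    return rows

lexicon = read_lexicon(io.StringIO(SAMPLE))
by_word = {row["word"]: row for row in lexicon}  # index the list by word
print(by_word["of"]["frequency"])
```

For a file on disk, the same function could be called with an open file handle instead of the in-memory string.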

Lexical Databases
• The truth: you'll have to find out where to find something and be prepared to do some processing work.

CELEX: the big and complex lexical database

History
• Centre for Lexical Information
• Founded in Nijmegen in 1986
• Max Planck Institute for Psycholinguistics & Interfaculty Research Unit for Language and Speech of the University of Nijmegen (now CLS)
• Project ended in 2000
• Three large databases with lexical information for Dutch, English, and German

• Dutch database
  • 124,136 lemmata
  • 381,292 wordforms
  • 211,389 corpus types
• English database
  • 52,446 lemmata
  • 160,594 wordforms
  • 220,271 corpus types
• German database
  • 51,728 lemmata
  • 365,530 wordforms
  • 290,712 corpus types

Wordforms, lemmas, and corpus types

Corpus types
• Letter strings, regardless of part of speech
• a walk in the park = to walk slowly = I walk alone = you walk alone

Wordforms
• Letter strings disambiguated for part of speech (and sometimes meaning)
• a walk in the park ≠ to walk slowly ≠ I walk alone ≠ you walk alone
• (walk, noun, singular), (walk, verb, infinitive), (walk, verb, 1p), (walk, verb, 2p)

Lemmas
• Headwords
• (walk, noun): a walk in the park = the long walks
• (walk, verb): I'm walking slowly = I walk alone = he walks too fast
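The three levels can be illustrated by counting the same four tokens of "walk" in three different ways. This is only a sketch: the tuples are a hand-annotated toy sample based on the example phrases, and using (headword, part of speech) as the lemma key is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# (string, part of speech, inflection) for each token in the example phrases.
tokens = [
    ("walk", "noun", "singular"),    # a walk in the park
    ("walk", "verb", "infinitive"),  # to walk slowly
    ("walk", "verb", "1p"),          # I walk alone
    ("walk", "verb", "2p"),          # you walk alone
]

# Corpus types: letter strings only, part of speech ignored.
corpus_types = Counter(string for string, _, _ in tokens)

# Wordforms: letter strings disambiguated for part of speech and inflection.
wordforms = Counter(tokens)

# Lemmas: headwords, keyed here by (headword, part of speech).
# In this toy sample every string is identical to its headword.
lemmas = defaultdict(int)
for string, pos, _ in tokens:
    lemmas[(string, pos)] += 1

print(len(corpus_types), len(wordforms), len(lemmas))
```

The four tokens collapse into one corpus type, four wordforms, and two lemmas.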

CELEX Build-Up
• Information from dictionary sources
• Corpus counts or correlation with existing frequency counts
• Almost completely biased towards written language

Dutch Database Sources
• Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984)
  • 80,000 lemmata
• Word List of the Dutch Language ('Het Groene Boekje') (1954), plus later revisions, including the 1994 spelling reform
  • 65,000 lemmata
• The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL), 42,380,000 words in all
  • 15,000 lemmata

English Database Sources
• Oxford Advanced Learner's Dictionary (1974)
  • 41,000 lemmata
• Longman Dictionary of Contemporary English (1978)
  • 53,000 lemmata

German Database Sources
• Bonnlex, supplied by the Institute for Communication Research and Phonetics in Bonn
• Molex, supplied by the Institute for German Language in Mannheim
• Noetic Circle Services (MIT) German spelling lexicon

Dutch Frequency Sources
• INL Corpus (42 million tokens)
  • 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection.

English Frequency Sources
• COBUILD/Birmingham corpus (17.9 million tokens)
  • 16.6 million tokens from written texts
  • 1.3 million tokens from transcribed dialogue

German Frequency Sources
• Mannheimer Korpus I, Mannheimer Korpus II and Bonner Zeitungskorpus 1
  • 5.4 million tokens
  • written texts like newspapers, fiction and non-fiction
• Freiburger Korpus
  • 600,000 tokens
  • transcribed speech

• Corpus types
  • Frequency
  • Orthography
• Lemma lexica
  • Frequency
  • Orthography
  • Phonology
  • Derivational morphology
  • Grammatical information
• Wordform lexica
  • Frequency
  • Orthography
  • Phonology
  • Inflectional morphology

Frequency

Verb       Frequency   Deviation   Freq/Million
accept          3712           0         207.37
accord          2010          12         112.29
achieve         2121           0         118.49
act             2212         430         123.58
add             4190           0         234.08
agree           3424           0         191.28

Lexicon    Form     Frequency   Deviation   Frequency/million
lemma      act           2212         430              123.58
wordform   act            269        1233               15.03
wordform   acted           92         366                5.14
wordform   acting         489         103               27.32
wordform   acts           187          80               10.45
wordform   act            269        1233               15.03
wordform   acted           92         366                5.14
wordform   act            269        1233               15.03
wordform   acted           92         366                5.14
wordform   act            269        1233               15.03
wordform   acted           92         366                5.14
wordform   acted           92         366                5.14

• Lemma frequency
  • Frequency over all wordforms of the lemma
• Wordform frequency
  • Deviation == 0: exact count
  • Deviation > 0: result of disambiguation
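The statement that lemma frequency is the frequency over all wordforms can be checked against the rows for "act". The sketch below treats the repeated "act" and "acted" rows as separate wordform entries that share a letter string; that reading is an interpretation of the table, but it is the one under which the wordform counts add up to the lemma count. The 17.9 million token corpus size is the COBUILD figure given earlier.

```python
# Wordform rows for the lemma "act": (form, frequency), one entry per row.
wordform_rows = (
    [("act", 269)] * 4
    + [("acted", 92)] * 5
    + [("acting", 489), ("acts", 187)]
)

# Lemma frequency: the sum over all wordforms of the lemma.
lemma_frequency = sum(freq for _, freq in wordform_rows)

# Frequency per million, using the 17.9 million tokens of the COBUILD corpus.
lemma_per_million = round(lemma_frequency / 17.9, 2)

print(lemma_frequency, lemma_per_million)
```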

• Fewer than 100 tokens
  • Manual disambiguation
• More than 100 tokens
  • Disambiguation on a sample of 100 tokens
  • Frequency ± deviation = 95% confidence interval
• No disambiguation for verbal inflection
  • Frequency divided between forms
  • Frequency deviation can be greater than the frequency
• No disambiguation for German
  • Frequency divided between forms
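How a sample of 100 disambiguated tokens can yield a frequency with a 95% confidence interval is sketched below. This uses the textbook normal approximation for a binomial proportion, scaled up to the full count of the letter string. It is an assumption made for illustration, not necessarily the exact procedure used for CELEX, and the function name and the input numbers are hypothetical.

```python
import math

def estimate_from_sample(string_count, sample_size, hits, z=1.96):
    """Estimate how many of `string_count` tokens belong to one wordform.

    `hits` is the number of sampled tokens that were manually assigned to
    the wordform. Returns (frequency, deviation) such that
    frequency +/- deviation is an approximate 95% confidence interval
    (normal approximation to the binomial; an illustrative assumption).
    """
    p = hits / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    frequency = round(p * string_count)
    deviation = round(z * standard_error * string_count)
    return frequency, deviation

# Hypothetical numbers: a string that occurs 500 times, with 60 of the
# 100 sampled tokens judged to be the verb reading.
print(estimate_from_sample(500, 100, 60))
```

With these made-up inputs the estimate is 300 with a deviation of 48. A sample in which all tokens (or none) belong to the wordform gives a deviation of 0 under this approximation.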

• English and German databases have separate fields for written and spoken frequencies
  • Spoken frequencies based on very small corpora
    • 1.3 million for English
    • 0.4 million for German

• What does it mean when an entry in CELEX has a frequency of zero?
  • Many entries in the database sources were not found in the frequency sources
  • A few entries do not come from database sources but are left with a zero frequency after disambiguation
    • These will have a deviation > zero
  • Many entries added to CELEX for morphological decomposition of other lemmas have a frequency of zero

Word frequency distributions

word   frequency   rank
the    1 093 546      1
of       540 085      2
and      514 946      3
to       483 428      4
a        422 334      5
in       337 995      6
that     217 376      7
it       199 920      8
i        198 139      9

• Let's plot the rank of each word in the COBUILD corpus against its frequency.
• The word with the highest frequency gets the highest rank (1), the word with the lowest frequency gets the lowest rank (220,270).
• In total there are 17.9 million word tokens in the COBUILD corpus.

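• Not very clear.
• Let's plot it again so that the difference between a frequency of 1 and a frequency of 10 is the same as the difference between a frequency of 10 and a frequency of 100 (a logarithmic scale).

[Figure: frequency against rank on logarithmic axes. Axis marks: 10^6 = 1,000,000; 10^3 = 1,000; 10^0 = 1. Labelled points: "the", "of", "and", "to", "a", "in" at the high-frequency end; a long run of words that all have frequency 1 at the other end.]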

• Word frequency lists contain very few words with a very high frequency
• Most words (corpus types) occur only once in the corpus!
• The relation between word frequency and rank is log-linear: approximately a straight line when both are plotted on logarithmic axes.
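The ranking and the logarithmic rescaling described above can be sketched as follows, using the raw counts of the nine most frequent COBUILD words from the table earlier in this section. Only the numbers are computed here; plotting is left out so that the example needs nothing beyond the standard library.

```python
import math

# Raw COBUILD counts for the nine most frequent words.
counts = {
    "the": 1093546, "of": 540085, "and": 514946, "to": 483428, "a": 422334,
    "in": 337995, "that": 217376, "it": 199920, "i": 198139,
}

# Rank 1 goes to the most frequent word.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# One point per word: (word, rank, log10(rank), log10(frequency)).
points = [
    (word, rank, math.log10(rank), math.log10(freq))
    for rank, (word, freq) in enumerate(ranked, start=1)
]

for word, rank, log_rank, log_freq in points:
    print(f"{word:5s} rank={rank} log10(rank)={log_rank:.2f} log10(freq)={log_freq:.2f}")
```

On the log10 scale, a step from frequency 1 to 10 and a step from 10 to 100 both have length 1, which is what makes the second plot readable.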

Comparing frequencies
• Word frequencies from different databases cannot be easily compared because of different corpus sizes
• Example: CELEX Dutch ±42 million vs CELEX English ±18 million
• Solution: frequency per million words

Frequency per million

word   frequency   frequency per million   rank
the    1 093 546               60 955.74      1
of       540 085               30 105.07      2
and      514 946               28 703.79      3
to       483 428               26 946.93      4
a        422 334               23 541.47      5
in       337 995               18 840.30      6
that     217 376               12 116.83      7
it       199 920               11 143.81      8
i        198 139               11 044.54      9
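The conversion behind the table is a single division. A minimal sketch, assuming the corpus size of 17.94 million tokens that reproduces the per-million values listed for COBUILD (the function name is illustrative):

```python
def per_million(raw_frequency, corpus_size_in_millions):
    """Convert a raw corpus count to a frequency per million tokens."""
    return raw_frequency / corpus_size_in_millions

COBUILD_SIZE = 17.94  # million tokens; the value that reproduces the table

fpm_the = round(per_million(1093546, COBUILD_SIZE), 2)
fpm_that = round(per_million(217376, COBUILD_SIZE), 2)
print(fpm_the, fpm_that)
```

Once counts from two corpora are both expressed per million tokens, they are on the same scale and can be compared directly.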

Comparing frequencies
• Beware! Some frequency lists contain words with a frequency of 0
• log10(0) is undefined and cannot be computed
• Solution: always add 1 to the raw frequencies when you are transforming to frequencies per million

Formula

\[ \text{Frequency per million (adjusted)} = \frac{\text{Raw frequency} + 1}{\text{Corpus size in millions}} \]

\[ \text{FPM}(\text{'that'}) = \frac{217\,376 + 1}{17.94} = 12116.89 \]

\[ \log_{10}(12116.89) = 4.08 \]

Zipf Values
Van Heuven, Mandera, Keuleers, & Brysbaert (2014)

Formula

\[ \text{Frequency per billion} = \frac{\text{Raw frequency} + 1}{\text{Corpus size in billions}} \]

\[ \text{FPB}(\text{'that'}) = \frac{217\,376 + 1}{0.01794} = 12116889.63 \]

\[ \log_{10}(12116889.63) = 7.08 \]
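Both formulas can be written as two small functions. This is a sketch of the simplified calculation shown on these slides (add 1 to the raw count, divide by corpus size, take log10); the function names are illustrative. Because a frequency per billion is 1000 times the frequency per million, the Zipf value is the log10 of the adjusted frequency per million plus 3.

```python
import math

def adjusted_fpm(raw_frequency, corpus_size_in_millions):
    """Frequency per million after adding 1 to the raw count."""
    return (raw_frequency + 1) / corpus_size_in_millions

def zipf(raw_frequency, corpus_size_in_millions):
    """log10 of the adjusted frequency per billion."""
    # per billion = per million * 1000, so the logarithm gains 3
    return math.log10(adjusted_fpm(raw_frequency, corpus_size_in_millions)) + 3

COBUILD_SIZE = 17.94  # million tokens

fpm_that = round(adjusted_fpm(217376, COBUILD_SIZE), 2)
zipf_that = round(zipf(217376, COBUILD_SIZE), 2)
zipf_zero = round(zipf(0, COBUILD_SIZE), 2)  # defined even for a count of 0
print(fpm_that, zipf_that, zipf_zero)
```

Thanks to the added 1, a word with a raw frequency of 0 still receives a finite value instead of causing a math error.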

word   frequency   relative frequency   log10(fpm)   zipf
the    1 093 546            0.0602191         4.78   7.78
of       540 085            0.0297413         4.47   7.47
and      514 946            0.0283569         4.45   7.45
to       483 428            0.0266213         4.43   7.43
a        422 334            0.0232570         4.37   7.37
in       337 995            0.0186127         4.27   7.27
that     217 376            0.0119704         4.08   7.08
it       199 920            0.0110092         4.04   7.04
i        198 139            0.0109111         4.04   7.04

Orthography

• Lemma and wordform lexica list orthographic variants with separate frequencies

Status
• Dutch: preferred, non-preferred, informal
  • preferred & non-preferred: in "Groene Boekje"
  • informal: non-standard forms occurring at least once in INL corpus
• English: British, American
  • British: acceptable for British
  • American: occurs only in American
• German
  • No orthographic variants

Lemma ID   Form               Status          Frequency
1070       aardappelcroquet   preferred               0
1070       aardappelkroket    non-preferred           0
1138       aardelektrode      preferred               0
1138       aardelectrode      non-preferred           0
1202       aardolieprodukt    preferred               6
1202       aardolieproduct    non-preferred           0
1357       abductie           preferred               0
1357       abduktie           non-preferred           0

Lemma ID   Form           Status     Frequency
1359       anaesthesia    British           12
1359       anesthesia     American           1
1360       anaesthetic    British           47
1360       anesthetic     American           4
1361       anaesthetic    British            8
1361       anesthetic     American           0
1362       anaesthetist   British           16
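Since spelling variants share a lemma ID but carry separate frequencies, a common processing step is to pool them per lemma. The sketch below does this for the complete English pairs in the table. Whether pooling is appropriate depends on the research question, and the tuple layout is an assumption made for illustration.

```python
from collections import defaultdict

# (lemma ID, form, status, frequency), copied from the English rows.
variants = [
    (1359, "anaesthesia", "British", 12),
    (1359, "anesthesia", "American", 1),
    (1360, "anaesthetic", "British", 47),
    (1360, "anesthetic", "American", 4),
    (1361, "anaesthetic", "British", 8),
    (1361, "anesthetic", "American", 0),
]

# Pool the frequencies of all orthographic variants of each lemma.
totals = defaultdict(int)
for lemma_id, _form, _status, frequency in variants:
    totals[lemma_id] += frequency

for lemma_id in sorted(totals):
    print(lemma_id, totals[lemma_id])
```

Lemma IDs 1360 and 1361 share the same spellings, so the lemma ID rather than the letter string has to be used as the grouping key.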
