Phonological trends in the lexicon
Practicum
Michael Becker
University of Massachusetts Amherst
michael.becker@phonologist.org
EVELIN 2012, MIT / UNICAMP, Campinas, Brazil
Practicum overview
• Lexicon study
  ◦ Building a lexicon
  ◦ Data exploration with regular expressions
  ◦ Regression modeling
• Working with audio materials
  ◦ Recording
  ◦ Praat work
  ◦ Scripting and automation
• Experimental design
  ◦ Formulating a testable hypothesis
  ◦ Online experiments / web interface
  ◦ Regression modeling
• Comparing the lexicon and the experiment
Lexicon study
• Building a lexicon
• Example: Portuguese plurals
• Dealing with text files
• Lexical statistics
Building a lexicon
• List of paradigms
• Word list
• Custom list
• Opportunistic data collection
Building a lexicon
• List of paradigms
  ◦ Turkish: TELL (Inkelas et al. 2000)
  ◦ Hebrew: LLHN (Bolozky & Becker 2006)
  ◦ Russian: Usachev (2004), based on Zaliznjak (1977)
  ◦ Others?
  Not very common, but very useful. Why?
Building a lexicon
• Word list
  ◦ English: CMU ( http://www.speech.cs.cmu.edu/cgi-bin/cmudict ), CELEX (Baayen et al. 1995, not free)
  ◦ French: Lexique ( http://www.lexique.org/ )
  ◦ Portuguese: LABEL-LEX ( http://label.ist.utl.pt/en/labellex_en.php )
  ◦ Many others.
  Googling for, e.g., "Kabardian word list" usually helps. Asking around is a good idea too.
  You can use the word list to prepare a list of stems, and then add the other morphological category manually. It's a lot of work, but it can help generate ideas.
Building a lexicon
• Custom list
  ◦ If available, use a paper dictionary. Scanning + OCR can save a lot of work. Hire research assistants to help.
  ◦ Use corpora and/or search engines to expand your empirical scope.
  In recent years, Google has become less useful for such searches.
Example: Portuguese plurals
How did we get from the word list of LABEL-LEX to a corpus of Portuguese plurals?
• Transform from spelling to IPA
• Extract the [w]-final words
• Collect judgments
• Coding
Example: Portuguese plurals
• Transform from spelling to IPA
  ◦ The original file
  ◦ A series of regular expression substitutions
  ◦ Result: spelling + IPA (mostly)
Example: Portuguese plurals
• Transform from spelling to IPA
  ◦ The original file
      N a
      N aacheniano
      N aal
      N aaleniano
      N aba
      N ababá
      N ababalhamento
      N ababosamento
Example: Portuguese plurals
• Transform from spelling to IPA
  ◦ A series of regular expression substitutions
    For example:
      ([aeiouáéíóúãẽõ])s([aeiouáéíóúãẽõ]) → $1z$2
      ss → s
    Learn more about regular expressions! http://pt.wikipedia.org/wiki/Expressão_regular
    We automated the substitutions with a Perl script.
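The two substitutions on this slide can be sketched in Python (the slide's own script was in Perl and is not shown, so this is an illustration, not the original code). The names `VOWELS`, `RULES`, and `apply_rules` are hypothetical. The sketch assumes the vowel class reads [aeiouáéíóúãẽõ], and it swaps the capture groups for lookarounds so that an s in consecutive syllables (as in a made-up string like "casasa") is still matched after the preceding vowel has been used.

```python
import re
import unicodedata

# Vowel letters, assumed from the slide's character class.
# NFC keeps each accented letter as a single code point inside the class.
VOWELS = unicodedata.normalize("NFC", "aeiouáéíóúãẽõ")

# Ordered (pattern, replacement) pairs. Order matters:
# voicing must run before "ss" is reduced to "s".
RULES = [
    # An s between two vowels becomes z; lookarounds leave the vowels unconsumed.
    (re.compile(f"(?<=[{VOWELS}])s(?=[{VOWELS}])"), "z"),
    # The digraph ss is rewritten as a single s.
    (re.compile("ss"), "s"),
]

def apply_rules(word: str) -> str:
    """Apply every substitution in RULES, in order, to one word."""
    word = unicodedata.normalize("NFC", word)
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word
```

Running the rules in the opposite order would first turn "passo" into "paso" and then voice it, which is why the voicing rule comes first.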
Example: Portuguese plurals
• Transform from spelling to IPA
  ◦ Result: spelling + IPA (mostly)
      N a               ˈa
      N aacheniano      aaʃeniˈano
      N aal             aˈaw
      N aaleniano       aaleniˈano
      N aba             ˈaba
      N ababá           ababˈa
      N ababalhamento   ababaʎamˈẽto
      N ababosamento    ababozamˈẽto
Example: Portuguese plurals
• Extract the [w]-final words
  ◦ Again, a regular expression: w$
  ◦ No need for programming: a text editor with support for regular expressions is good too, e.g. Notepad++ (Windows) or TextWrangler (Mac), plus OpenOffice/LibreOffice.
  ◦ We got a list of 5742 words, mostly nouns and adjectives.
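For readers who do prefer a script, the same `w$` filter might look like this in Python. The function name `w_final` and the (spelling, IPA) pair layout are assumptions made for the sketch; the three sample entries follow the transcriptions shown on the earlier slide.

```python
import re

# The slide's pattern: a w at the very end of the transcription.
W_FINAL = re.compile(r"w$")

def w_final(entries):
    """Keep the (spelling, ipa) pairs whose IPA form ends in w."""
    return [(spelling, ipa) for spelling, ipa in entries if W_FINAL.search(ipa)]

# A few entries in the assumed (spelling, ipa) layout.
entries = [("aal", "aˈaw"), ("aba", "ˈaba"), ("ababá", "ababˈa")]
matches = w_final(entries)
```

Matching against the IPA column rather than the spelling is the point of doing the transcription step first.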
Example: Portuguese plurals
• Collect judgments
  ◦ Do we really need ALL the adjectives that end in [aw], [ew]...?
  ◦ The monosyllables are manageable, so we take all of them.
    We asked three people to supply plurals for them.
  ◦ We want a good sample of polysyllables.
  ◦ Randomize, and choose a sizable portion.
    Excel trick: items in one column, =rand() in a second column, and sort by the random number.
    We asked one person to supply plurals for our sample of polysyllables.
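The Excel trick carries over directly to a script: pair each item with a random number, sort by that number, and keep the top of the list. A minimal Python sketch follows; the function name `random_sample` and the optional `seed` parameter are assumptions added so that a sample can be reproduced later.

```python
import random

def random_sample(items, k, seed=None):
    """Return k items in random order, mimicking the =rand()-and-sort trick."""
    rng = random.Random(seed)
    # Pair every item with a random number, like a second spreadsheet column.
    keyed = [(rng.random(), item) for item in items]
    # Sort by the random number only, then keep the first k items.
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:k]]
```

Python's built-in `random.sample` would do the same job in one call; the longer version is shown only because it mirrors the spreadsheet steps.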
Example: Portuguese plurals
• Coding
  ◦ Some words didn't have a plural → excluded
  ◦ 0 = faithful, 1 = alternating, .5 = optional
  ◦ [malis], [abɾilis] coded as faithful
  ◦ Items with more than one rating → averaged
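The coding scheme can be written down as a small Python sketch. The data layout (a dictionary from item to a list of rating labels, with an empty list for words that have no plural) and the names `CODES` and `code_items` are assumptions for illustration; the numeric values are the ones on the slide.

```python
from statistics import mean

# Numeric codes from the slide.
CODES = {"faithful": 0.0, "alternating": 1.0, "optional": 0.5}

def code_items(ratings):
    """Map each item to its mean numeric code, dropping items with no plural."""
    coded = {}
    for item, labels in ratings.items():
        if not labels:
            # No plural was supplied, so the item is excluded.
            continue
        # A single rating is its own mean; several ratings are averaged.
        coded[item] = mean(CODES[label] for label in labels)
    return coded

# Hypothetical items, not taken from the actual data set.
example = code_items({
    "item_a": ["faithful"],
    "item_b": ["alternating", "optional"],
    "item_c": [],
})
```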
Dealing with text files
Text is the bread and butter of computing.
• Text file vs. binary file
• Plain text editors
• Unicode
• Regular expressions
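One place where Unicode and regular expressions meet: a letter such as ẽ can be stored either as one code point or as e plus a combining tilde, and a character class will only match the form it was written with. The Python sketch below illustrates this under that assumption about the input; the helper name `nfc` is hypothetical.

```python
import unicodedata

composed = "\u1ebd"       # ẽ stored as a single code point
decomposed = "e\u0303"    # e followed by a combining tilde

def nfc(text):
    """Normalize text to NFC so accented letters use their composed form."""
    return unicodedata.normalize("NFC", text)

# Text becomes bytes only when encoded; UTF-8 uses two bytes for ã.
a_tilde_bytes = "ã".encode("utf-8")
```

Normalizing every word once, right after reading the file, keeps later pattern matching predictable.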
Lexical statistics
• Descriptive statistics
• Inferential statistics
• Limits of logistic regressions