
A frequency effect conditioned by phonological grammar
Kie Zuraw, UCLA
kie@ucla.edu
QITL, June 2006


  1. Tagalog (Western Austronesian, Philippines)

     • About 16 million native speakers (Ethnologue), plus many second-language speakers
     • Roman alphabet introduced during Spanish rule; encodes distinctions not represented in older writing and probably not contrastive until recently (d vs. r; u vs. o; i vs. e)

     Tagalog tapping

     Schachter & Otanes 1972 (Tagalog Reference Grammar):
     • Spanish and English loans have introduced a contrast between d and r ([ɾ]):
         disko ‘disc’      risko ‘risk’
         kantod ‘limp’     kantor ‘singer’
         seda ‘silk’       sera ‘wax’
     • But in native words, there’s (near-)complementary distribution, with r between vowels and d elsewhere:
         daliri ‘finger’
         dapat ‘should’    likod ‘back’    isda ‘fish’    kadkad ‘unfurled’

     Tagalog tapping

     d → r / V__V
     d optionally becomes r between vowels:
         lakad ‘walk’      lakar-an ‘to be walked on’

     Overview of talk

     • Tapping is morphologically governed: required in one environment, forbidden in another, optional in a third.
       – I’ll propose a prosodic account of this.
     • In the environment where tapping is variable, it’s tied to word-frequency facts.
       – I’ll argue that this reflects the outcome of lexical access.
     • Grammar needs to be able to refer to outcome of lexical access.
     • Data on variation are taken from a written corpus, compiled from the Web.
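The rule d → r / V__V can be sketched as a small string rewrite. This is only an illustration over orthographic forms: the function name `tap` and the five-letter vowel set are assumptions made here, and the sketch applies the rule everywhere, ignoring the morphological conditioning the talk goes on to describe.

```python
# Illustrative sketch of d -> r / V__V over spelled forms.
# `tap` and VOWELS are hypothetical names, not taken from the slides.
VOWELS = set("aeiou")

def tap(form: str) -> str:
    """Rewrite every d that stands between two vowels as r."""
    chars = list(form)
    for i in range(1, len(chars) - 1):
        if chars[i] == "d" and chars[i - 1] in VOWELS and chars[i + 1] in VOWELS:
            chars[i] = "r"
    return "".join(chars)

print(tap("lakad" + "an"))  # stem-final d becomes intervocalic before -an
print(tap("lakad"))         # word-final d is not between vowels
print(tap("isda"))          # d after a consonant is not between vowels
```

Word-final and post-consonantal d are left alone because the loop only rewrites a d with a vowel on both sides.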

  2. Notes on the corpus

     • “Probably Tagalog” web pages identified by automated queries to Google Web APIs service, then collected.
     • ~100,000 web pages
     • ~20,000,000 Tagalog words
     • Variety of genres, especially blogs, discussion forums, and newspaper articles.

     Notes on the corpus

     • Sources of noise (but the data nevertheless seem fairly clean):
       – second-language users of Tagalog
       – typing errors
       – erroneously identified pages (especially from other Philippine languages)
       – page-internal repetition
     • Do spelling choices reliably reflect writers’ preferred pronunciations? Probably depends on the phenomenon in question; future lab research planned on this.

     Notes on the corpus

     • Thanks to:
       – Rosie Jones for supplying seed corpus, from CorpusBuilder project (Ghani, Jones, & Mladenić 2004), which inspired the construction of this corpus
       – Undergraduate R.A. Ivan Tam for programming
       – Undergraduate R.A. Nikki Foster for data entry from dictionary (English 1987)

     Stem+suffix: obligatory tapping

         lakad ‘walk’      lakar-an ‘be walked on’
         tamad ‘lazy’      tamar-in ‘be lazy about’
     • Histogram: how many suffixed words (of frequency ≥ 10) display each possible rate of tapping?
       [Histogram: x-axis rate of tapping (0 to 1), y-axis count. Almost all tap.]

     Prefix+stem: optional tapping

         dumi ‘dirt’       ma-rumi ‘dirty’
         dahon ‘leaf’      ma-dahon ‘leafy’
       [Histogram: x-axis rate of tapping (0 to 1), y-axis count. Some don’t tap; some do tap.]

     Prefix+stem: optional tapping

       [Histograms of rate of tapping: prefixed words (freq. ≥ 10) that occur in dictionary; prefixed words (freq. ≥ 10) identified by morphological parser.]
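The "rate of tapping" plotted in the histograms can be read as the share of a word's tokens that are spelled with r. The sketch below shows one way such a rate might be computed; it is not the talk's actual pipeline. The function name, the pairing of d-spelled and r-spelled variants, and the toy token counts are all invented for illustration; only the frequency threshold of 10 comes from the slides.

```python
from collections import Counter

def tapping_rates(tokens, variant_pairs, min_freq=10):
    """Map each (d_form, r_form) pair to the proportion of r-spelled tokens.

    Pairs whose combined frequency is below min_freq are dropped,
    mirroring the 'frequency >= 10' inclusion criterion on the slides.
    """
    counts = Counter(tokens)
    rates = {}
    for d_form, r_form in variant_pairs:
        total = counts[d_form] + counts[r_form]
        if total >= min_freq:
            rates[(d_form, r_form)] = counts[r_form] / total
    return rates

# Toy counts, invented for illustration only.
toy_tokens = (["marumi"] * 9 + ["madumi"] * 1
              + ["madahon"] * 12 + ["marahon"] * 3
              + ["lakaran"] * 4)
pairs = [("madumi", "marumi"), ("madahon", "marahon"), ("lakadan", "lakaran")]

rates = tapping_rates(toy_tokens, pairs)
for pair, rate in rates.items():
    print(pair, round(rate, 2))
```

In this toy run the third pair is excluded because it has only four tokens in total, which falls below the threshold.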

  3. Conditioning factors?

     • Stress is not predictive
     • Vowel quality is not predictive
     • But frequency is promising...

     Frequency effect within prefixed words (dictionary words only)

       [Histograms: x-axis rate of tapping (0 to 1), y-axis count.
        Words more frequent than their roots: majority tap.
        Words less frequent than their roots: majority don’t tap.]
     (Effect of relative frequency exists independent of raw frequency.)

     Frequency effect within prefixed words (all words identified by morph. parser)

       [Histograms: words more frequent than their roots: majority tap; words less frequent than their roots: majority don’t tap.]

     Hay’s explanation (of similar effects in English)

     Two routes compete in processing of complex words:
     • decomposed route: un + happy → unhappy
     • direct route: unhappy
     (Hay 2003, Causes and Consequences of Word Structure)

     The faster route wins

     Speed is determined by, among other factors, resting activation level (approximated by frequency).
     • marumi more frequent than dumi: direct route wins
     • madahon less frequent than dahon: decomposed route wins

     Frequency effect: first approximation

     Assume tapping inapplicable if VdV sequence comes from two different accessed lexical units.
     • marumi > dumi → direct route → tapping (marumi)
     • madahon < dahon → indirect route → no tapping (madahon)
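The first approximation above amounts to a frequency comparison followed by a conditional rewrite. The sketch below is a minimal illustration of that logic, not the talk's model. The function names are hypothetical, the frequencies are invented, and sending ties to the decomposed route is an assumption made here because the slides do not address ties.

```python
VOWELS = set("aeiou")

def access_route(word_freq: int, root_freq: int) -> str:
    """Pick the winning route; frequency stands in for resting activation.

    Assumption: ties go to the decomposed route (not specified on the slides).
    """
    return "direct" if word_freq > root_freq else "decomposed"

def predicted_form(prefix: str, stem: str, word_freq: int, root_freq: int) -> str:
    """Tap a stem-initial d only when the whole word is accessed as one unit."""
    route = access_route(word_freq, root_freq)
    vdv = (stem.startswith("d") and len(stem) > 1
           and prefix[-1] in VOWELS and stem[1] in VOWELS)
    if route == "direct" and vdv:
        return prefix + "r" + stem[1:]
    return prefix + stem

# Invented frequencies, chosen only to reproduce the two cases on the slide.
print(predicted_form("ma", "dumi", word_freq=500, root_freq=200))
print(predicted_form("ma", "dahon", word_freq=30, root_freq=900))
```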

  4. Frequency ratio

     Tapped words have higher frequency ratio.

     Word frequency alone

     Tapped words have higher raw frequency.

     Root frequency alone

     No real difference; see Hay & Baayen 2001 for discussion.

     Affix frequency

     Tapped words have less-frequent affixes.

     Affix productivity

     Baayen’s P: if any difference, it’s in the non-predicted direction.

     Affix productivity

     But...
     • Lüdeling & Evert 2003 caution against using P to compare processes with different token frequencies.
     • Lüdeling, Evert & Heid 2000 show that P requires a well-processed corpus and differentiation of homophonous processes.
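For readers unfamiliar with Baayen's P, it is usually defined as the number of hapax legomena formed with an affix divided by the total number of tokens formed with that affix. The sketch below computes it on that definition. The function name is hypothetical and the token list is invented, so the resulting value says nothing about Tagalog ma-.

```python
from collections import Counter

def baayen_p(tokens_with_affix):
    """Productivity P = n1 / N for one affix.

    n1: number of word types occurring exactly once (hapax legomena)
    N:  total number of tokens containing the affix
    """
    counts = Counter(tokens_with_affix)
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        raise ValueError("no tokens for this affix")
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_hapax / n_tokens

# Toy token list, invented for illustration only.
toy = ["marumi"] * 5 + ["madahon"] * 2 + ["mabato", "malungkot"]
p = baayen_p(toy)
print(p)
```

Because N is the denominator, P tends to shrink as an affix's token frequency grows, which is one way to see why comparing affixes with very different token frequencies calls for caution.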

  5. Affix productivity

     I plan to look at some other measures of affix productivity:
     • Hay & Baayen’s parsing line
     • Lüdeling/Evert vocabulary growth curves

     Effect of morphology

     Recall that suffixed words tap no matter what.
     Somehow, suffixed words are treated as a unit regardless of lexical access route.

     Proposed prosodic structures
     (similar to Nespor & Vogel 1986, Peperkamp 1997 for Italian)

     • Alignment constraint: accessed lexical unit initiates prosodic word
     • Otherwise, constraint against recursion prefers simple structure.
     • Tapping applies to VdV sequence iff not interrupted by p-word boundary
     Two choices for prefixed word, depending on outcome of lexical access:
       – direct route: prefix and stem share one p-word (marumi); tapping applies
       – indirect route: stem initiates its own p-word (madahon); tapping doesn’t apply

     Prefix-suffix asymmetries

     Common cross-linguistically for rules to apply more readily across a stem-suffix boundary than a prefix-stem boundary.
     Peperkamp 1997 (Prosodic Words) cites Choctaw, Polish, Hungarian, Indonesian, Japanese, Korean, French (p. 55).

     Proposed prosodic structures

     • Minimality constraint: p-word must dominate at least one foot, and foot must dominate at least two syllables
     • Suffixes are monosyllabic and so can’t head a p-word
     • Adjoining the suffix is no help
     Only one choice for suffixed word, regardless of outcome of lexical access: lakaran; tapping applies.
       [Diagram: candidate p-word structures over stem + suffix.]

     Stem+stem: no tapping

     • Most are d-initial stems: dala-dala ‘load carried’
     • Few are d-final stems: agad-agad ‘at once’
     • [Histogram: x-axis rate of tapping (0 to 1), y-axis count. Nearly all don’t tap.]
     Probably not a reduplicative identity effect:
         ka-agad-agar-an
         ka-raga-dagan
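The prosodic proposal can be paraphrased as a two-step procedure: decide whether a p-word boundary falls at the morpheme junction, then tap VdV only where no such boundary intervenes. The sketch below is a simplification under stated assumptions. It handles two-morpheme forms only, works on spelling, and uses hypothetical names throughout; the stem+stem case simply encodes the observation that such forms do not tap.

```python
VOWELS = set("aeiou")

def boundary_at_junction(kind: str, route: str) -> bool:
    """Does a p-word boundary separate the two morphemes?"""
    if kind == "stem+suffix":
        return False                  # monosyllabic suffix cannot head a p-word
    if kind == "prefix+stem":
        return route == "decomposed"  # separately accessed stem starts a p-word
    if kind == "stem+stem":
        return True                   # each stem is in its own p-word
    raise ValueError(f"unknown morphological type: {kind}")

def surface(left: str, right: str, kind: str, route: str) -> str:
    """Apply d -> r / V__V unless a p-word boundary interrupts the sequence."""
    blocked = boundary_at_junction(kind, route)
    chars = list(left + right)
    for i in range(1, len(chars) - 1):
        if chars[i] == "d" and chars[i - 1] in VOWELS and chars[i + 1] in VOWELS:
            # A d that is morpheme-final or morpheme-initial sits at the junction.
            at_junction = i in (len(left) - 1, len(left))
            if at_junction and blocked:
                continue
            chars[i] = "r"
    return "".join(chars)

print(surface("lakad", "an", "stem+suffix", "decomposed"))
print(surface("ma", "dumi", "prefix+stem", "direct"))
print(surface("ma", "dahon", "prefix+stem", "decomposed"))
print(surface("dala", "dala", "stem+stem", "direct"))
```

Only the prefix+stem branch consults the access route, which is how the sketch confines the frequency effect to prefixed words.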

  6. Proposed prosodic structure

     • Heading constraint: each stem must head a p-word
     Only one choice for compounding reduplication, regardless of outcome of lexical access: dala-dala, with each stem in its own p-word; tapping doesn’t apply.

     Local summary

     • Frequency effect is allowed to surface only in prefixed words.
     • Otherwise, morphology determines outcome.
     • There is a grammar (it’s not all processing), but it can refer to the units accessed during lexical retrieval, not just to syntactic units.

     Online vs. lexicalized

     • Recall the polarized behavior of prefixed words:
       [Histogram of rate of tapping] (frequency threshold of 7, to be consistent with upcoming slide...)

     Online vs. lexicalized

     • This is very different from what Baroni (1998, 2001) found for Northern Italian intervocalic s-voicing, which is otherwise similar to Tagalog tapping.
     • Baroni documented robust variation within item, within speaker.

     Online vs. lexicalized

     • So is the frequency effect active online, or merely lexicalized?
     • For established prefixed words, perhaps they are lexicalized (and unestablished words are too infrequent in the corpus to see if they vary).
     • But in two other realms, there seems to be an online frequency effect...

     Enclitics

     One more environment for tapping:
     • Enclitics daw, din can be raw, rin after vowel-final words (and, less frequently, after consonant-final words)
         ako rin ~ ako din ‘me too’
     • So far, I’ve looked only at bigrams with din/rin (‘too’), not daw/raw (reported speech)
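Counting din/rin bigrams by the final segment of the preceding word could look roughly like the sketch below. It is an illustration only: the function name is hypothetical, the context labels are invented, vowel-finality is judged from spelling, and the toy token sequence is made up rather than drawn from the corpus.

```python
from collections import Counter

VOWELS = set("aeiou")

def enclitic_counts(tokens):
    """Count (context, enclitic) pairs for din/rin, keyed by the host's final letter."""
    counts = Counter()
    for prev, cur in zip(tokens, tokens[1:]):
        if cur in ("din", "rin"):
            context = "V-final" if prev[-1] in VOWELS else "C-final"
            counts[(context, cur)] += 1
    return counts

# Toy token sequence, invented for illustration (not a real sentence).
toy = "ako rin siya din ikaw rin bahay din ako din".split()
counts = enclitic_counts(toy)
for key in sorted(counts):
    print(key, counts[key])
```

From such counts, a rate of rin for each host word can be computed the same way as the rate of tapping for prefixed words.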
