
Modeling stress assignment in English noun-noun compounds: a quantitative perspective

Gero Kunter, Ingo Plag, Sabine Lappe & Maria Braun

Universität Siegen

Conference Quantitative Investigations in Theoretical Linguistics 2, 1-2 June 2005, Osnabrück

The problem

  • compounds in English are stressed on the left-hand member (e.g. bláckboard, wátchmaker)
  • nuclear stress rule vs. compound stress rule (Chomsky and Halle 1968:17)
  • many unexplained exceptions, and cross-variety variation (e.g. BrE vs. AmE): Boston márathon, Penny Láne, summer níght, aluminum fóil, may flówers, silk tíe

In general:

  • claims on compound stress are largely based on anecdotal evidence and introspection
  • no systematic large-scale empirical evidence available yet

Three approaches

  • 1. The structural hypothesis (e.g. Giegerich 2004, Bloomfield 1933, Lees 1963, Marchand 1969 or Payne/Huddleston 2002)
      • modifier-head structures are regularly stressed on the RIGHT constituent (steel brídge)
      • argument-head structures are always LEFT-stressed (ópera singer)
      • left stress on modifier-head structures is due to lexicalization (ópera glasses)
  • 2. The semantic hypothesis (e.g. Fudge 1984, Ladd 1984, Liberman and Sproat 1992, Olsen 2000, 2001)
      • stress assignment according to semantic categories
  • 3. The analogical hypothesis (e.g. Schmerling 1971, Liberman and Sproat 1992, Plag 2006)
      • stress assignment in analogy to similar compounds in the lexicon

Testing the hypotheses

  • Plag (2006, experimental study): all three types of factor interact in compound stress assignment in complex ways.
  • this paper: corpus study testing the three hypotheses more thoroughly
      • many more different word types
      • many more tokens
      • many more semantic relations
      • computational modeling of analogical effects
  • Data
      • Boston University Radio Speech Corpus (Ostendorf et al. 1996) (N = 4410, V = 2476, AmE)
      • CELEX lexical data base (Baayen et al. 1995) (N = 4491, V = N, BrE)

Boston Corpus: Example

The device is attached to a plastic wristband. It looks like a watch. It functions like an electronic probation officer. When a computerized call is made to a former prisoner's home phone, that person answers by plugging in the device. The wristband can be removed only by breaking its clasp, and if that's done the inmate is immediately returned to jail. The description conjures up images of big brother watching. But Jay Ash, deputy superintendent of the Hampton County jail in Springfield, says the surveillance system is not that sinister.

Procedure

Step 1: Measure the mean fundamental frequency (F0) of the main stressed vowel of each of the two members and calculate the difference (left F0 minus right F0, logarithmically transformed into semitones (ST): the 'pitch difference') (cf. Farnetani et al. 1988, Ingram et al. 2003, Plag 2006)

Step 2: Look for statistically significant pitch differences between distinct kinds of compound

  • wrístband: +5.39 ST
  • home phóne: -0.97 ST

Example: Left-headed compounds (such as attorney géneral) should have a significantly smaller pitch difference than right-headed compounds (e.g. wrístband)
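The slides do not spell out the semitone conversion used in Step 1. A minimal sketch, assuming the common definition of 12 semitones per octave (12 · log2 of the frequency ratio), could look like this; the function name and the F0 values are hypothetical and are not measurements from the corpus.

```python
import math


def pitch_difference_st(f0_left_hz: float, f0_right_hz: float) -> float:
    """Left-minus-right F0 difference expressed in semitones (ST).

    Assumes the usual 12 * log2(ratio) conversion; the slides only say
    that the difference is 'logarithmically transformed into semitones'.
    """
    if f0_left_hz <= 0 or f0_right_hz <= 0:
        raise ValueError("F0 values must be positive")
    return 12.0 * math.log2(f0_left_hz / f0_right_hz)


# Hypothetical mean F0 values in Hz for the two stressed vowels
print(round(pitch_difference_st(220.0, 165.0), 2))
```

A positive value means the left member has the higher pitch (pointing towards left stress), a value near zero or below points towards right stress.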


Illustration of measurements

Right-headed vs. left-headed compounds in Boston Corpus

[Figure: pitch difference in semitones for left-headed (e.g. attorney géneral) and right-headed (e.g. wrístband) compounds]

Mean pitch difference: left-headed 0.052 ST, right-headed 3.332 ST

t (4408) = 4.91, p < 0.01, Cohen's d = 0.80
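As an illustration of the two statistics reported on these slides, here is a minimal sketch of a two-sample t statistic and Cohen's d. It assumes the pooled-variance (Student) versions; the slides do not say which variants were used, and the sample values below are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d_and_t(group_a, group_b):
    """Return (Cohen's d, t statistic, degrees of freedom).

    Uses the pooled standard deviation of the two groups, which is an
    assumption of this sketch rather than something stated in the slides.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    diff = mean(group_a) - mean(group_b)
    d = diff / math.sqrt(pooled_var)
    t = diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return d, t, df


# Hypothetical pitch differences in semitones (not corpus data)
right_headed = [5.4, 3.1, 4.0, 2.2]
left_headed = [0.3, -1.0, 1.2, -0.4]
d, t, df = cohens_d_and_t(right_headed, left_headed)
print(f"d = {d:.2f}, t({df}) = {t:.2f}")
```

With samples as large as the ones in the corpus, a t value can be significant even when d is small, which is why the slides report both.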

Boston Corpus: Structural hypothesis

Argument-head vs. modifier-head compounds

  • significant difference, but large overlap between the two groups
  • effect size is very small

[Figure: pitch difference in semitones for modifier-head and argument-head compounds]

Mean pitch difference: modifier-head 3.250 ST, argument-head 3.736 ST

t (4089) = 2.36, p < 0.05, Cohen's d = 0.01

Boston Corpus: Structural hypothesis

A closer look at argument-head vs. modifier-head compounds

morphology of head    argument-head     modifier-head
-er                   law makers        house speaker
-ing                  fundraising       spring training
-ion                  jury selection    health education
conversion            tax increase      litmus test

(also, with low frequency: -age, -al, -ance, …)

[Figure: pitch difference in semitones for Argument-Head vs. Modifier-Head compounds by morphology of head; panel label: con (N=572)]

Boston Corpus: Structural hypothesis

Interaction of structure and morphology of head

F (9, 4062) = 2.89, p < 0.01, R² = 0.015

not significant


Boston Corpus: Lexicalization effect?

Two ways of quantifying lexicalization

  • Frequency: higher frequency should correlate with higher degree of lexicalization
  • Spelling: lexicalized compounds are more prone to one-word spellings

  • only very small tendency for highly frequent compounds to be more left-stressed
  • no difference between argument-head (AH) and modifier-head (MH) compounds: F (1, 4069) < 1

Boston Corpus: Lexicalization effect?

[Figure: Pitch difference by Google frequency]

F (1, 4071) = 15.58, p < 0.001, R² = 0.004

  • relation between pitch difference and Google frequency shows an S-shaped distribution
  • typical of categorical changes
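To see how a highly significant F value can go together with a tiny R², the two can be related directly. The sketch below assumes an ordinary least-squares model with an intercept, where the overall F test satisfies R² = F·df1 / (F·df1 + df2); the function name is hypothetical.

```python
def r_squared_from_f(f_value: float, df_model: int, df_resid: int) -> float:
    """Recover R² from an overall F statistic and its degrees of freedom.

    Holds for the overall F test of an OLS model with an intercept;
    whether the reported F values are of that kind is an assumption.
    """
    return (f_value * df_model) / (f_value * df_model + df_resid)


# Values reported for pitch difference by Google frequency
print(round(r_squared_from_f(15.58, 1, 4071), 3))  # 0.004
```

With more than 4000 residual degrees of freedom, even a predictor that explains under half a percent of the variance yields p < 0.001.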


Boston Corpus: Lexicalization effect?

Spelling and lexicalization

Assumptions:

  • one-word spellings are indicative of lexicalization
  • high frequency is indicative of lexicalization

Prediction: compounds spelled as one word should have higher frequency than those spelled as two words

Results:

  • expected effect
  • large effect size

=> spelling is an indicator of lexicalization

t (3388) = 15.58, p < 0.001, Cohen's d = 0.89

Boston Corpus: Lexicalization effect?

Interaction between structure and spelling

Predictions:

  • Modifier-Head compounds spelled as one word should be more left-stressed than Modifier-Head compounds spelled as two words
  • no effect of that kind with Argument-Head compounds

[Figure: pitch difference in semitones for Argument-Head vs. Modifier-Head compounds]

F (3, 4030) = 12.79, p < 0.001, R² = 0.009

Results:

  • Modifier-Head compounds spelled as one word are indeed more left-stressed
  • spelling of Argument-Head compounds does not interact with stress position

Boston Corpus: Structural hypothesis

A summary

  • only very weak effect
  • significant effect of argument vs. modifier only with a subset of potential compounds (i.e. -er as right-hand head morpheme)
  • a measurable lexicalization effect (based on frequency and spelling)
  • effect sizes are all very small: a lot of the variation is unaccounted for under this hypothesis

The structural hypothesis is not well supported by the data

Boston Corpus: Semantic hypothesis

Methodological problems

  • Semantic categories and semantic relations mentioned in the literature (such as 'N2 is a material', 'N2 is located at N1') are hard to test because they are generally ill-defined
  • Items are often ambiguous (i.e. show more than one relation)
  • The number of potentially relevant semantic categories and relations is unclear

Our methodology

  • We used a set of 18 semantic relations (based mainly on Levi 1978), also widely used in studies on compound interpretation (e.g. Gagné & Shoben 1997, Gagné 2001)
  • Semantic classification was done by two independent raters; only those data are analyzed where the two ratings agreed
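The agreement filter described in the last point can be sketched as follows. The data structures, the function name and the individual ratings are hypothetical; only the compounds and relation labels are taken from the slides.

```python
def keep_agreed(ratings_a: dict, ratings_b: dict) -> dict:
    """Keep only items that both raters assigned to the same relation."""
    return {
        item: relation
        for item, relation in ratings_a.items()
        if ratings_b.get(item) == relation
    }


# Hypothetical ratings by two independent raters
rater_a = {
    "summer vacations": "N2 DURING N1",
    "canvas bags": "N2 IS MADE OF N1",
    "breath test": "N2 USES N1",
}
rater_b = {
    "summer vacations": "N2 DURING N1",
    "canvas bags": "N2 IS MADE OF N1",
    "breath test": "N2 FOR N1",  # disagreement: item is dropped
}

agreed = keep_agreed(rater_a, rater_b)
print(sorted(agreed))
print(f"agreement rate: {len(agreed) / len(rater_a):.2f}")
```

Dropping the disagreements trades sample size for cleaner categories, which may explain why the analyses by semantic relation have fewer degrees of freedom than the others.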

Boston Corpus: Semantic hypothesis

The literature on rightward stress makes use of either

  • categories referring to constituents or the compound as a whole

or

  • categories referring to semantic relation

Boston Corpus: Semantic hypothesis

Categories referring to constituents or the compound as a whole

Rightward stress is predicted if...

  • N1 refers to a period or point in time (morning edition)
  • N2 is a geographical term (Boston area)
  • N2 is a type of thoroughfare (Sesame Street)
  • N1 and N2 form a proper noun (Tufts University)

(e.g. Fudge 1984: 144ff, Liberman & Sproat 1992)


Boston Corpus: Semantic hypothesis

Categories referring to constituents or the compound as a whole

F (7, 4130) = 9.19, p < 0.01, R² = 0.0136

[Figure: pitch difference in semitones by category; panels: N2 is a GEOGRAPHICAL TERM?, N2 is a THOROUGHFARE?, Compound is a PROPER NOUN?, N1 is a PROPER NOUN?]

Boston Corpus: Semantic hypothesis

Categories referring to semantic relation

Rightward stress is predicted if...

  • N2 DURING N1 (summer vacations)
  • N2 IS LOCATED AT N1 (Newton residents)
  • N2 IS MADE OF N1 (canvas bags)
  • N1 MAKES N2 (Weld plan)

(e.g. Fudge 1984: 144ff, Liberman & Sproat 1992)

additional categories (18 in total):

  • N1 HAS N2 (wheel chair)
  • N2 USES N1 (breath test)
  • N2 FOR N1 (adult prisons)
  • N2 CAUSES N1 (AIDS virus)

Boston Corpus: Semantic hypothesis

Categories referring to semantic relation

F (7, 2036) = 20.53, p < 0.01, R² = 0.063

[Figure: pitch difference in semitones by semantic relation; panels: N2 LOCATED AT/IN N1, N1 MAKES N2, N2 IS MADE OF N1, N1 HAS N2, N2 USES N1, N2 FOR N1]

Boston Corpus: Semantic hypothesis

A summary

  • Some predictions are correct
  • Some predictions are wrong (i.e. no effect found)
  • Some effects are found where no prediction is made
  • A lot of the variation is unaccounted for under this hypothesis

The semantic hypothesis is not well supported by the data

Boston Corpus: Analogical hypothesis

Analogical modeling is not yet possible, due to gradient stress measurements

(But see Kunter/Plag (2006) on how this can be done)

CELEX: General overview

Contents

  • Oxford Advanced Learner's Dictionary (1974): 41,000 lemmata
  • Longman Dictionary of Contemporary English (1978): 53,000 lemmata
  • COBUILD corpus (92%): 17.9 million word tokens
  • overall: 52,446 lemmata representing 160,594 wordforms

Position of stress is given for each entry in the data base

Stress position of NN compounds (N = 4491): left 90%, right 10%


CELEX: Structural hypothesis

Argument-head vs. modifier-head compounds

  • significant difference is in the direction predicted by the hypothesis (i.e. more left stress with argument-head compounds)
  • but: vast majority of modifier-head compounds is also left-stressed, which goes against the hypothesis

χ² = 8.55, df = 1, p < 0.01, φ = 0.05

[Figure: stress position (left vs. right) for modifier-head and argument-head compounds]
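For categorical stress (left vs. right), the test above is a χ² test on a 2×2 table, with φ as the effect size. A minimal sketch without continuity correction is given below; the cell counts are hypothetical, since the slides report only the test statistics.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and phi for a 2x2 table.

    table = [[a, b], [c, d]], e.g. rows = structure, columns = stress position.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi = math.sqrt(chi2 / n)
    return chi2, phi


# Hypothetical counts: rows = argument-head / modifier-head, columns = left / right
chi2, phi = chi_square_2x2([[480, 20], [2650, 270]])
print(f"chi2 = {chi2:.2f}, phi = {phi:.3f}")
```

Because φ divides χ² by the sample size, a large sample can produce a significant χ² together with a φ close to zero, as on this slide.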

[Figure: stress position (left vs. right) by structure (argument-head, modifier-head) and morphology of head (con, er, ing, ion)]

CELEX: Structural hypothesis

Interaction of structure and morphology of head

  • same significant interaction as in BURSC (Boston University Radio Speech Corpus)
  • significant effect of argument vs. modifier only with a subset of potential compounds (i.e. -er as right-hand head morpheme)
  • other interactions are not significant

logit regression, null dev. = 396.64, df = 680; residual dev. = 354.23, df = 673
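The deviances reported for the logit regression can be read as follows: the drop from null to residual deviance is the likelihood-ratio statistic, and its degrees of freedom are the difference of the two df values. The sketch below also computes a deviance-based pseudo-R² (1 minus residual over null deviance); that measure is an addition for illustration and is not reported in the slides.

```python
def deviance_summary(null_dev: float, null_df: int, resid_dev: float, resid_df: int):
    """Return (deviance drop, df of the drop, deviance-based pseudo-R²)."""
    drop = null_dev - resid_dev
    df = null_df - resid_df
    pseudo_r2 = 1.0 - resid_dev / null_dev
    return drop, df, pseudo_r2


# Values reported for the structure-by-morphology model
drop, df, pseudo_r2 = deviance_summary(396.64, 680, 354.23, 673)
print(f"deviance drop = {drop:.2f} on {df} df, pseudo-R2 = {pseudo_r2:.3f}")
# prints: deviance drop = 42.41 on 7 df, pseudo-R2 = 0.107
```

The drop would be compared against a χ² distribution with 7 degrees of freedom to obtain a p-value for the model as a whole.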

CELEX: Lexicalization effect?

Frequency and stress position

Assumptions:

  • lexicalized compounds prefer left-stress
  • lexicalized compounds are more frequent

Prediction: left-stressed compounds should have higher frequency than right-stressed compounds

Results:

  • Google log frequencies are not different for left- or right-stressed compounds
  • no interaction of stress position and structure (F (1, 4467) = 2.47, p = 0.12)

⇒ stress position is not related to frequency

t (4470) = 1.097, p = 0.27

CELEX: Lexicalization effect?

Spelling and stress position

  • the more lexicalized (in terms of spelling), the more frequent is left stress
  • no difference between argument-head and modifier-head compounds

⇒ evidence for general lexicalization effect on stress

[Figure: stress position (left vs. right) by spelling (one word, hyphenated, two words)]

χ² = 512.08, df = 2, p < 0.01

CELEX: Structural hypothesis

A summary

  • significant effect of argument vs. modifier only with a subset of potential compounds (i.e. -er as right-hand head morpheme)
  • BUT: the vast majority of compounds do not behave in accordance with the hypothesis
  • measurable general lexicalization effect (only w.r.t. spelling)

The structural hypothesis is not supported by the data

CELEX: Semantic hypothesis

Categories referring to constituents or the compound as a whole

logit regression, null dev. = 2784.0, df = 4125; residual dev. = 2693.1, df = 4120

[Figure: stress position (left vs. right) by category (no vs. yes); panels: N1 refers to POINT OF TIME?, N2 is a GEOGRAPHICAL TERM?, N2 is a THOROUGHFARE?, N1 is a PROPER NOUN?, Compound is a PROPER NOUN?]

CELEX: Semantic hypothesis

Categories referring to semantic relation

[Figure: stress position (left vs. right) by semantic relation (no vs. yes); panels: N2 DURING N1, N2 LOCATED AT/IN N1, N2 IS MADE OF N1, N1 IS N2, N2 FOR N1, N2 IS NAMED AFTER N1]

logit regression, null dev. = 1149.16, df = 1629; residual dev. = 967.41, df = 1614

CELEX: Semantic hypothesis

Summary

  • Some predictions go in the right direction, but leave lots of data unexplained

  • Some predictions are wrong (i.e. no effect found)
  • Some effects are found where no prediction is made
  • A lot of the variation is unaccounted for

The semantic hypothesis is not well supported by the data

CELEX: Analogical hypothesis

Specific hypothesis: Stress in compounds is determined by the stress pattern of the majority of similar instances that are stored in memory.

Example: cárpet beater is assigned left stress because the most similar exemplar stored in memory, éggbeater, also has left stress.

CELEX: Analogical hypothesis

The data

Compounds whose left and right members occur more than once in the corpus, i.e. for which the model has information about constituent families of the members. N = 2643 (Ntotal = 4491)

The model

Memory-based learner (TiMBL 5.1, Daelemans et al. 2004)

How does TiMBL work?

evaluation of input against nearest neighbours

INPUT
  oil, painting, noarg, -ing, semcat1

INSTANCE-BASED MEMORY
  action, painting, noarg, -ing, semcat1, stress: left
  finger, painting, noarg, -ing, semcat1, stress: left
  wall, painting, arg, -ing, semcat1, stress: left
  country, party, noarg, nosuff, semcat2, stress: left
  cottage, hospital, noarg, nosuff, semcat3, stress: right
  life, work, noarg, nosuff, semcat2, stress: right
  ...

SET OF NEAREST NEIGHBOURS
  action, painting, noarg, -ing, semcat1, stress: left
  finger, painting, noarg, -ing, semcat1, stress: left
  stress left: 2x, stress right: 0x

OUTPUT
  stress left (oil painting → óil painting)
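The nearest-neighbour step illustrated above can be sketched in a few lines. This is a simplified stand-in, not TiMBL itself: it assumes a plain, unweighted overlap distance (number of mismatching features) and a majority vote among the closest instances, whereas TiMBL offers feature weighting and other distance metrics that this sketch leaves out. The function names are hypothetical; the instances are the ones shown on the slide.

```python
from collections import Counter

# Features: left member, right member, argument status, suffix, semantic category
MEMORY = [
    (("action", "painting", "noarg", "-ing", "semcat1"), "left"),
    (("finger", "painting", "noarg", "-ing", "semcat1"), "left"),
    (("wall", "painting", "arg", "-ing", "semcat1"), "left"),
    (("country", "party", "noarg", "nosuff", "semcat2"), "left"),
    # label read from the slide, where the extraction is garbled (assumption)
    (("cottage", "hospital", "noarg", "nosuff", "semcat3"), "right"),
    (("life", "work", "noarg", "nosuff", "semcat2"), "right"),
]


def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, memory):
    """Majority stress label among the instances at the smallest distance."""
    scored = [(overlap_distance(instance, feats), stress) for feats, stress in memory]
    nearest = min(dist for dist, _ in scored)
    votes = Counter(stress for dist, stress in scored if dist == nearest)
    return votes.most_common(1)[0][0], votes


label, votes = classify(("oil", "painting", "noarg", "-ing", "semcat1"), MEMORY)
print(label, dict(votes))  # left {'left': 2}
```

Here action painting and finger painting are the nearest neighbours (one mismatch each), so oil painting receives left stress, as on the slide.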

How does TiMBL perform?

  • 94 % overall accuracy
  • predictive accuracy for right stresses: 20-25%
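The gap between the two figures above is what one would expect when one class dominates. The sketch below computes overall and per-class accuracy; it reads "predictive accuracy for right stresses" as the share of right-stressed compounds that are predicted correctly, which is an assumption. The counts are hypothetical and were chosen only so that the totals resemble the reported percentages.

```python
def accuracy_report(gold, predicted):
    """Return overall accuracy and accuracy per gold class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    per_class = {}
    for label in sorted(set(gold)):
        hits = [p == label for g, p in zip(gold, predicted) if g == label]
        per_class[label] = sum(hits) / len(hits)
    return overall, per_class


# Hypothetical data: 935 left-stressed and 65 right-stressed compounds
gold = ["left"] * 935 + ["right"] * 65
predicted = ["left"] * 926 + ["right"] * 9 + ["left"] * 51 + ["right"] * 14

overall, per_class = accuracy_report(gold, predicted)
baseline = gold.count("left") / len(gold)  # always predicting left stress
print(f"overall = {overall:.3f}, per class = {per_class}, baseline = {baseline:.3f}")
```

In this hypothetical setting a model that always predicts left stress would already reach 93.5 %, so overall accuracy alone says little about how well right stress is captured.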

Which features does TiMBL find useful?

Any given set of features performs about as well as any other set of features (about 94 % accuracy): semantic categories, semantic relations, proper-noun status, morphological structure, argument-head status, left and/or right member

⇒ No abstract features needed, left and right member can do the job

Left and right member are in fact better predictors than the other features, because only when we leave out left and/or right member do we find a significant drop in performance

(left member: Yates' χ² = 7.42, p < 0.01; right member: Yates' χ² = 4.83, p < 0.05; left and right member: Yates' χ² = 4.83, p < 0.05)

The results of analogical modeling

Accuracy

  • pretty good overall predictive accuracy
  • better predictive accuracy than under any other hypothesis

Predictors

  • none of the grammatical/semantic features proposed in the literature improves predictive accuracy

Theoretical implication

  • constituent families (and thus analogy) play an important role in compound stress assignment

Summary and Conclusion

  • Corpus data do not confirm the categorical stress assignment rules found in the literature
  • Compound stress is much more variable than previously thought
  • Argument structure effects are restricted to compounds ending in -er
  • There are only small effects of only some of the semantic categories proposed in the literature
  • Assignment of rightward stress is problematic for TiMBL. But TiMBL is better than any other "rule" found in the literature
  • Analogical effects based on constituent families play an important role in compound stress assignment

Acknowledgements

  • Christian Grau, Christina Kellenter, and Taivi Rüüberg for their help with annotating the data
  • Harald Baayen and Mark Pluymaekers for statistical training and support, and for critical comments and suggestions

  • Heinz Giegerich for critical comments and support
  • Deutsche Forschungsgemeinschaft for funding this research (Grant PL151/5-1)

Thank you very much for your attention!

References

Baayen, R. Harald, R. Piepenbrock, and L. Gulikers (1995) The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Bloomfield, Leonard (1933) Language. Chicago: Holt.
Chomsky, Noam and Morris Halle (1968) The Sound Pattern of English. New York: Harper and Row.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch (2004) TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Technical Report 04-02, available from http://ilk.uvt.nl/downloads/pub/papers/ilk0402.pdf.
Farnetani, Edda and Piero Cosi (1988) English compound versus non-compound noun phrases in discourse: An acoustic and perceptual study. Language and Speech 31, 157-180.
Fudge, Eric C. (1984) English word-stress. London: George Allen & Unwin.
Gagné, Christina (2001) Relation and lexical priming during the interpretation of noun-noun combinations. Journal of Experimental Psychology: Learning, Memory and Cognition 27, 236-254.
Gagné, Christina and Edward J. Shoben (1997) Influence of thematic relations on the comprehension of modifier-noun combinations. Journal of Experimental Psychology: Learning, Memory and Cognition 23, 71-87.
Giegerich, Heinz (2004) Compound or phrase? English noun-plus-noun constructions and the stress criterion. English Language and Linguistics 8, 1-24.
Ingram, John, Thi Anh Thu Nguyen and Rob Pensalfini (2003) An acoustic analysis of compound and phrasal stress patterns in Australian English. Submitted for publication.
Kunter, Gero and Ingo Plag (2006) What is compound stress? On the phonetics and phonology of prominence relations in English noun-noun constructs. Paper presented at the University of Edinburgh, May 23, 2006.
Ladd, D. Robert (1984) English compound stress. In Gibbon, Dafydd and Helmut Richter (eds.) Intonation, accent and rhythm. Berlin: Mouton de Gruyter, 253-266.
Lees, Robert B. (1963) The Grammar of English Nominalizations. The Hague: Mouton.
Levi, Judith N. (1978) The syntax and semantics of complex nominals. New York: Academic Press.
Liberman, Mark and Richard Sproat (1992) The stress and structure of modified noun phrases in English. In Sag, Ivan A. and Anna Szabolcsi (eds.) Lexical Matters. Stanford: Center for the Study of Language and Information, 131-181.
Marchand, Hans (1969, 2nd edition) The Categories and Types of Present-day English Word-formation. München: Beck.
Olsen, Susan (2000) Compounding and stress in English: A closer look at the boundary between morphology and syntax. Linguistische Berichte 181, 55-69.
Olsen, Susan (2001) Copulative compounds: a closer look at the interface between syntax and morphology. In Booij, Geert E. and Jaap van Marle (eds.) Yearbook of Morphology 2000. Dordrecht/Boston/London: Kluwer, 279-320.
Ostendorf, Mari, Patti Price, and Stefanie Shattuck-Hufnagel (1996) Boston University Radio Speech Corpus. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Payne, John and Rodney Huddleston (2002) Nouns and noun phrases. In Huddleston, Rodney and Geoffrey K. Pullum, The Cambridge grammar of the English language. Cambridge: Cambridge University Press, 323-524.
Plag, Ingo (2006) The variability of compound stress in English: structural, semantic and analogical factors. English Language and Linguistics 10.1, 143-172.
Schmerling, Susan F. (1971) A stress mess. Studies in the Linguistic Sciences 1, 52-65.