Making Sense of Word Sense


  1. Making Sense of Word Sense
     24 February 2011, Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Göttingen
     Rebecca J. Passonneau
     Nancy Ide
     Vikas Bhardwaj
     Vassar College
     Ansaf Salleb-Aouissi

  2. Outline
     • The word sense conundrum
     • The MASC Project
     • WordNet and sense annotation
     • MASC annotation rounds
     • Round 2: Multiple trained annotators
     • Interannotator agreement and beyond
     • Round 2: Mechanical turkers
     • Machine learning from labels versus features
     • Conclusion

  3. Word sense conundrum
     • Adam Kilgarriff, 2003, I don't believe in word senses
       – Abstractions from corpus clusters
       – Corpus citations . . . are the basic objects in the ontology
     • James Pustejovsky, 1991, The generative lexicon
       – No fixed set of conceptual primitives
       – A fixed number of generative devices
       – Lexical semantics is an interface between commonsense knowledge and linguistic form

  4. Zipf's Law
     An epiphenomenon of . . .
     • Words (types or tokens)
     • Senses
     • Many other phenomena (Newman, M. E. J., 2005): city population, books sold, net worth, . . .

  5. Granularity
     Concepts versus comparisons of experience
     • Infinite divisibility of reality: how fine-grained should a cluster be?
       – WordNet senses for primitive, Adj:
         1. Belonging to an early stage of development
         2. Characteristic of an ancestral type
         3. Preliterate or non-industrial societies
         4. Created by one without formal training
     • Shared experience: the basis of social reality, ways of verbalizing social reality
       – 1-3. Anthropology
       – 4. Art history

  6. Corpus-based Sense Classes
     • Ontological questions: deferred
       – How are clusters used as basic ontological objects?
       – How is commonsense knowledge represented?
     • Identify same/different contexts, within limits:
       – Same?
         . . . a primitive granite boar, carved in prehistoric times,
         . . . has a primitive Easter-island look,
       – Different?
         Bin Laden's training camps were primitive . . .
         . . . one or more of the primitive gluing or ungluing operations

  7. Outline
     • The word sense conundrum
     • The MASC Project
     • WordNet and sense annotation
     • MASC annotation rounds
     • Round 2: Multiple trained annotators
     • Interannotator agreement and beyond
     • Round 2: Mechanical turkers
     • Machine learning from labels versus features
     • Conclusion

  8. American National Corpus
     • 100 million words
     • Completely unrestricted
     • Post-1990 American English
     • Many genres

  9. MASC: Manually Annotated Sub-Corpus
     http://www.anc.org/MASC/
     Participants
     • Nancy Ide (PI, NSF CRI; Vassar College)
     • Collin Baker (ICSI; FrameNet)
     • Christiane Fellbaum (Princeton; WordNet)
     • Rebecca J. Passonneau (Columbia Univ.)
     Size: 500,000 words
     • Manually validated automatic annotations
     • Manual annotations
     Selected annotations
     • Token, Sentence, Lemma (Validated)
     • Named entities (Validated)
     • WordNet (Manual: 1.5 Million Word Sentence Corpus)
     • FrameNet (Manual: 150K Words)

  10. MASC Corpus
      • Three releases
        – MASC I: 82K words, release date May 2010
        – MASC I-II: 142K words, release date March 2011
        – MASC I-III: 500K words, release date July 2011
      • Fourteen types of annotation
        – Manually validated automatic, e.g., NP chunks
        – Manual, e.g., word sense
      • Twenty genres, evenly balanced
      • Freely available from the MASC website, and from:
        – LDC
        – NLTK

  11.  #   Genre                  Words   % of corpus
       1   Court transcript       20817   4
       2   Debate transcript      32325   6
       3   Email                  20470   4
       4   Essay                  25590   5
       5   Fiction                25681   5
       6   Gov't documents        24605   5
       7   Journal                25635   5
       8   Letters                24750   5
       9   Newspaper/newswire     17951   4
       10  Non-fiction            25182   5
       11  Spoken                 25783   5
       12  Technical              25426   5
       13  Travel guides          26708   5
       14  Twitter                24180   5
       15  Blog *                 25000   5
       16  ficlets *              25000   5
       17  movie script           28240   6
       18  poetry *               25000   5
       19  spam *                 25000   5
       20  jokes *                25000   5
           Total                 498343

       * Figures in bold on the original slide. Figures in bold indicate that the texts have not yet been chosen.

  12. Outline
      • The word sense conundrum
      • The MASC Project
      • WordNet and sense annotation
      • MASC annotation rounds
      • Round 2: Multiple trained annotators
      • Interannotator agreement and beyond
      • Round 2: Mechanical turkers
      • Machine learning from labels versus features
      • Conclusion

  13. Word Sense Annotation Goals
      • Freely available word sense corpus
      • Harmonize WordNet and FrameNet
      • Investigate moderately polysemous words (avg. = 7 senses)
      • Large sentence-based corpus
        – 100 words, balanced for part of speech
        – 1000 sentences per word
        – Avg. sentence length in MASC I > 20 words
        – 2 million word corpus, representing 700 senses
      • Provide measures of interannotator agreement
        – Chance-corrected coefficients
        – Krippendorff's Alpha
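
As an illustration of the chance-corrected agreement measure named above, here is a minimal sketch of Krippendorff's Alpha for nominal labels, computed from a coincidence matrix. The function name and the input layout (one list of labels per annotated sentence, `None` for a missing label) are assumptions made for this example; the slides do not show how the MASC project computed its coefficients.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    annotators assigned to one item (None marks a missing label).
    This input layout is an assumption for the sketch.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators, weighted by 1 / (m - 1), where m
    # is the number of labels the item received.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two labels carry no pairs
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (nominal metric: 0 if equal, else 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Two hypothetical annotators labelling four sentences with senses "a"/"b".
    toy = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
    print(krippendorff_alpha_nominal(toy))
```

Alpha is 1 when annotators always agree and near 0 when agreement is at chance level, which is what makes it comparable across words with different numbers of senses.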

  14. WordNet Sense Information
      SENSEID: a unique identifier
      SYNSET: a list of synonymous senses (SENSEIDS)
      DEFINITION: a phrase
      EXAMPLES: list of glosses
      FREQUENCY COUNT: integer
      Nouns have domain, . . . etc.
      Verbs have verb group, . . . etc.
      Adjectives have attributes, . . . etc.
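
The fields listed above can be pictured as a simple record type. The sketch below is a hypothetical Python dataclass, not WordNet's actual storage format or any library's API; the class and field names are invented for illustration, and the sample values are taken from the first sense of time (noun) as shown on the next slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SenseRecord:
    """Hypothetical container mirroring the fields listed on the slide."""
    sense_id: str                       # SENSEID: a unique identifier
    synset: List[str]                   # SYNSET: synonymous senses (sense IDs)
    definition: str                     # DEFINITION: a phrase
    examples: List[str] = field(default_factory=list)  # EXAMPLES: glosses
    frequency_count: Optional[int] = None  # FREQUENCY COUNT: integer


# Sense 1 of "time" (noun). The frequency count is left unset because the
# slides do not give WordNet's own count for this sense.
time_1 = SenseRecord(
    sense_id="time1",
    synset=["time1", "clip2"],
    definition="an instance or single occasion for some event",
    examples=[
        "this time he succeeded",
        "he called four times",
        "he could do ten at a clip",
    ],
)

if __name__ == "__main__":
    print(time_1.sense_id, time_1.synset, len(time_1.examples))
```

Part-of-speech-specific fields (domain for nouns, verb group for verbs, attributes for adjectives) are omitted here because the slide elides their full list.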

  15. WordNet Senses: time (noun)
      8 WordNet senses used
      1. (time1, clip2) (an instance or single occasion for some event) "this time he succeeded"; "he called four times"; "he could do ten at a clip"
      2. (a period of time considered as a resource under your control and sufficient to accomplish something) "take time to smell the roses"; "I didn't have time to finish"; "it took more than half my time"
      3. (an indefinite period (usually marked by specific attributes or activities)) "he waited a long time"; "the time of year for planting"; "he was a great actor in his time"
      4. (a suitable moment) "it is time to go"
      5. (the continuum of experience in which events pass from the future through the present to the past)
      6. (a person's experience on a particular occasion) "he had a time holding back the tears"; "they had a good time together"
      7. (time7, clock_time1) (a reading of a point in time as given by a clock) "do you know what time it is?"; "the time is 10 o'clock"
      8. (time8, fourth_dimension1) (the fourth coordinate that is required (along with three spatial dimensions) to specify a physical event)
      2 WordNet senses not used
      9. (time9, meter4, metre3) (rhythm as given by division into parts of equal duration)
      10. (time10, prison_term1, sentence3) (the period of time a prisoner is imprisoned) "he served a prison term of 15 months"; "his sentence was 5 to 10 years"; "he is doing time in the county jail"

  16. Senses of time-N

      Sense  Num  Definition
      1      171  An instance or single occasion for an event
      2      131  A period of time . . . sufficient to accomplish something
      3      427  An indefinite period
      4       59  A suitable moment
      5       34  The continuum of experience . . . the future . . .
      6       19  A person's experience on a particular occasion
      7       38  A reading of a point in time as given by a clock
      8       47  The fourth coordinate . . . to specify an event
      Total  926
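
To make the skew in these counts easier to see, the following sketch ranks the eight senses by frequency and prints each sense's share of the 926 annotated instances. It only rearranges the numbers from the table; the variable and function names are illustrative, and the script does not reproduce any analysis from the talk.

```python
# Instance counts per sense of "time" (noun), copied from the slide's table.
sense_counts = {1: 171, 2: 131, 3: 427, 4: 59, 5: 34, 6: 19, 7: 38, 8: 47}

total = sum(sense_counts.values())  # matches the slide's total of 926

# Senses ordered from most to least frequent: (sense number, count) pairs.
ranked = sorted(sense_counts.items(), key=lambda item: item[1], reverse=True)


def relative_frequencies(counts):
    """Return each sense's share of all instances as a fraction of 1."""
    overall = sum(counts.values())
    return {sense: count / overall for sense, count in counts.items()}


shares = relative_frequencies(sense_counts)

if __name__ == "__main__":
    for rank, (sense, count) in enumerate(ranked, start=1):
        print(f"rank {rank}: sense {sense}  count {count:3d}  share {shares[sense]:.1%}")
```

Sense 3 alone accounts for roughly 46% of the instances, while the four least frequent senses together account for about 15%, the kind of long-tailed distribution that the Zipf's Law slide alludes to.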

  17. Senses of time-N

      Sense  Num  Definition
      1      171  An instance or single occasion for an event
                  "When the bride and groom came together for the first time"
      2      131  A period of time . . . sufficient to accomplish something
      3      427  An indefinite period
      4       59  A suitable moment
      5       34  The continuum of experience . . . the future . . .
      6       19  A person's experience on a particular occasion
      7       38  A reading of a point in time as given by a clock
      8       47  The fourth coordinate . . . to specify an event
      Total  926

  18. Senses of time-N

      Sense  Num  Definition
      1      171  An instance or single occasion for an event
      2      131  A period of time . . . sufficient to accomplish something
      3      427  An indefinite period
      4       59  A suitable moment
                  "A time for a youngster to enjoy the fun and benefits of camp . . ."
      5       34  The continuum of experience . . . the future . . .
      6       19  A person's experience on a particular occasion
      7       38  A reading of a point in time as given by a clock
      8       47  The fourth coordinate . . . to specify an event
      Total  926
