  1. Synergies in learning words and their referents
     Mark Johnson (1), Katherine Demuth (1), Michael Frank (2) and Bevan Jones (3)
     (1) Macquarie University, (2) Stanford University, (3) University of Edinburgh
     NIPS 2010

  2. Two hypotheses about language acquisition
     1. Pre-programmed, staged acquisition of linguistic components
        – "Semantic bootstrapping": semantics is learnt first, and used to predict syntax (Pinker 1984)
        – "Syntactic bootstrapping": syntax is learnt first, and used to predict semantics (Gleitman 1991)
        – Conventional view of lexical acquisition, e.g. Kuhl (2004): the child first learns the phoneme inventory, which it then uses to learn phonotactic cues for word segmentation, which are used to learn the phonological forms of words in the lexicon, …
     2. Interactive acquisition of all linguistic components together
        – corresponds to joint inference for all components of language
        – stages in language acquisition might instead be due to:
          · the child's input containing more information about some components
          · some components of language being learnable from less data

  3. Synergies: an advantage of interactive learning
     • An interactive learner can take advantage of synergies in acquisition:
       – partial knowledge of component A provides information about component B
       – partial knowledge of component B provides information about component A
     • A staged learner can only take advantage of one of these dependencies
     • An interactive learner can benefit from a positive feedback cycle between A and B
     • This paper investigates whether there are synergies in learning how to segment words and learning the referents of words

  4. Prior work: mapping words to referents
     • Input to learner:
       – word sequence: Is that the pig?
       – objects in the non-linguistic context: the pig object (written PIG here; the original slides show the objects as pictures) and the other objects present
     • Learning objectives:
       – identify the utterance topic: PIG
       – identify the word-topic mapping: pig ↦ PIG

  5. Frank et al (2009) "topic models" as PCFGs
     • Prefix each sentence with a marker listing its possible topics, e.g. PIG|…
     • PCFG rules are designed to choose a topic from the possible-topic marker and propagate it through the sentence (schematic rule templates are sketched after this slide)
     • Each word is generated either from the sentence topic or from the null topic ∅
     • A simple grammar modification requires at most one topical word per sentence
     • Bayesian inference for the PCFG rules and trees corresponds to Bayesian inference for word and sentence topics using the topic model (Johnson 2010)
     [The slide shows an example tree for the input PIG|… is that the pig: Topic_pig is propagated down the sentence, "pig" is generated by Word_pig, and the other words by Word_∅]
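  As a rough illustration, the rule templates below give one way such a grammar can be written, with one instantiation per possible topic t. This is a schematic reconstruction from the tree shown on the slide and Johnson (2010), not necessarily the authors' exact grammar:

      Sentence → Topic_t                 (one rule per possible topic t)
      Topic_t  → Topic_t Word_t          (a word generated from the sentence topic)
      Topic_t  → Topic_t Word_∅          (a word generated from the null topic)
      Topic_t  → t                       (the left-most leaf is the possible-topic marker)
      Word_t   → w      Word_∅ → w       (one rule per vocabulary item w)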

  6. Prior work: segmenting words in speech
     • Running speech does not contain "pauses" between words ⇒ the child needs to learn how to segment utterances into words
     • Elman (1990) and Brent et al (1996) studied segmentation using an artificial corpus
       – child-directed utterance: Is that the pig?
       – broad phonemic representation: ɪz ðæt ðə pɪg
       – input to learner: ɪ △ z △ ð △ æ △ t △ ð △ ə △ p △ ɪ △ g, where each △ marks a potential word boundary (see the sketch below)
     • The learner's task is to identify which potential boundaries correspond to word boundaries
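  A minimal sketch of this input representation, assuming the transcriptions are plain Python strings; the function and example names are illustrative, not from the paper:

      def boundaries(words):
          """Given a segmented utterance as a list of phonemic words, return the
          unsegmented phoneme string presented to the learner and the set of
          between-phoneme positions that are true word boundaries."""
          phones = "".join(words)
          cuts, pos = set(), 0
          for w in words[:-1]:
              pos += len(w)
              cuts.add(pos)
          return phones, cuts

      # boundaries(["ɪz", "ðæt", "ðə", "pɪg"]) -> ("ɪzðætðəpɪg", {2, 5, 7})
      # the learner must pick out {2, 5, 7} from the 9 potential boundary positions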

  7. Brent (1999) unigram model as an adaptor grammar
     • Adaptor grammars (AGs) are CFGs in which a subset of nonterminals are adapted
       – AGs learn the probability of entire subtrees of adapted nonterminals (Johnson et al 2007)
       – AGs are hierarchical Dirichlet or Pitman-Yor processes
       – probability of an adapted subtree ∝ number of times the tree was previously generated + α × PCFG probability of generating the tree (sketched in code below)
     • AG for unigram word segmentation (adapted nonterminals are indicated by underlining on the original slide; here Word is the adapted nonterminal):
         Words → Word | Word Words
         Word  → Phons
         Phons → Phon | Phon Phons
     [The slide shows an example parse of the phoneme string ð ə p ɪ g under this grammar]
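  The subtree-caching behaviour described above (count of previous uses plus α times the base PCFG probability) can be sketched as follows. This is the Dirichlet-process special case with hypothetical names, not the authors' inference code:

      from collections import Counter

      class Adaptor:
          """Minimal sketch of one adapted nonterminal: the probability of a
          subtree is proportional to how often it has been generated before,
          plus alpha times its base (PCFG) probability."""

          def __init__(self, alpha, base_prob):
              self.alpha = alpha          # concentration parameter
              self.base_prob = base_prob  # function: subtree -> PCFG probability
              self.counts = Counter()     # cached subtrees and their reuse counts
              self.total = 0              # total number of cached subtree tokens

          def prob(self, subtree):
              # Probability of generating `subtree` given the current cache.
              return ((self.counts[subtree] + self.alpha * self.base_prob(subtree))
                      / (self.total + self.alpha))

          def observe(self, subtree):
              # Record a generated subtree in the cache.
              self.counts[subtree] += 1
              self.total += 1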

  8. Prior work: collocation AG (Johnson 2008)
     • The unigram model doesn't capture inter-word dependencies ⇒ it tends to undersegment (e.g. ɪz ðæt ðəpɪg)
     • The collocation model "explains away" some inter-word dependencies ⇒ more accurate word segmentation
         Sentence → Colloc+
         Colloc   → Word+
         Word     → Phon+
     • Kleene "+" abbreviates right-branching rules (spelled out below)
     • Unadapted internal nodes are suppressed in the trees
     [The slide shows an example tree: ɪz ðæt ðə pɪg segmented into four words grouped into two collocations]
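  One way to spell out the Kleene "+" abbreviation as right-branching rules, by analogy with the Phons rules of the unigram grammar; the rule names here are illustrative, not necessarily those used in the paper:

      Sentence → Collocs      Collocs → Colloc | Colloc Collocs
      Colloc   → Words        Words   → Word   | Word Words
      Word     → Phons        Phons   → Phon   | Phon Phons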

  9. AGs for joint segmentation and referent-mapping
     • It is easy to combine the topic-model PCFG with the word-segmentation AGs
     • Input consists of unsegmented phonemic forms prefixed with the possible topics: PIG|… ɪ z ð æ t ð ə p ɪ g
     • E.g., the combination of the Frank et al "topic model" and the unigram segmentation model
       – equivalent to Jones et al (2010)
     • It is easy to define other combinations of topic models and segmentation models
     [The slide shows the tree this combined grammar assigns to the example input: the topic marker and the segmented words ɪz, ðæt, ðə and pɪg hang off a chain of Topic_pig nodes, with pɪg generated by Word_pig and the other words by Word_∅]

  10. Collocation topic model AG
     • Collocations are either "topical" or not (schematic rules below)
     • It is easy to modify this grammar so that there is
       – at most one topical word per sentence, or
       – at most one topical word per topical collocation
     [The slide shows an example tree: under a chain of Topic_pig nodes, ɪz ðæt forms a non-topical collocation Colloc_∅ and ðə pɪg forms a topical collocation Colloc_pig, in which pɪg is generated by Word_pig]
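  A schematic reconstruction of the rule templates behind this grammar, read off the tree on the slide (one instantiation per possible topic t; the one-topical-word constraints mentioned above are omitted, so treat this as a sketch rather than the exact grammar):

      Sentence → Topic_t
      Topic_t  → Topic_t Colloc_t | Topic_t Colloc_∅ | t
      Colloc_t → sequence of Word_t and Word_∅ items   ("topical" collocation)
      Colloc_∅ → sequence of Word_∅ items              (non-topical collocation)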

  11. Experimental set-up
     • Input consists of unsegmented phonemic forms prefixed with the possible topics: PIG|… ɪ z ð æ t ð ə p ɪ g
       – child-directed speech corpus collected by Fernald et al (1993)
       – objects in the visual context annotated by Frank et al (2009)
     • Bayesian inference for AGs using MCMC (Johnson et al 2009)
       – uniform prior on the PYP a parameter
       – "sparse" Gamma(100, 0.01) prior on the PYP b parameter
     • For each grammar we ran 8 MCMC chains for 5,000 iterations
       – collected word segmentations and topic assignments at every 10th iteration during the last 2,500 iterations ⇒ 2,000 sample analyses per sentence
       – computed and evaluated the modal (i.e., most frequent) sample analysis of each sentence (a sketch of this step follows)
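  A minimal sketch of the modal-analysis step, assuming the collected samples are stored per sentence as hashable analyses (segmentation plus topic assignment); the names are illustrative:

      from collections import Counter

      def modal_analyses(samples):
          """`samples` maps each sentence id to the list of analyses collected
          across MCMC sweeps; return the most frequent analysis per sentence."""
          return {sent_id: Counter(analyses).most_common(1)[0][0]
                  for sent_id, analyses in samples.items()}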

  12. Does non-linguistic context help segmentation?

         word segmentation    topics                segmentation token f-score
         unigram              not used              0.533
         unigram              any number            0.537
         unigram              one per sentence      0.547
         collocation          not used              0.695
         collocation          any number            0.726
         collocation          one per sentence      0.719
         collocation          one per collocation   0.750

     (segmentation accuracy is word-token f-score; a sketch of its computation follows below)
     • Not much improvement with the unigram model
       – consistent with results from Jones et al (2010)
     • Larger improvement with the collocation model
       – most gain with one topical word per topical collocation (this constraint cannot be imposed on the unigram model)
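  A sketch of how word-token f-score can be computed from gold and predicted segmentations, assuming each utterance is given as a list of word strings over the same phoneme sequence; this is illustrative code, not the evaluation script used in the paper:

      def token_fscore(gold, pred):
          """Word-token f-score: a predicted word token counts as correct only
          if both of its boundaries match the gold segmentation."""
          def spans(words):
              # Convert a word list into (start, end) character spans.
              out, pos = set(), 0
              for w in words:
                  out.add((pos, pos + len(w)))
                  pos += len(w)
              return out

          tp = fp = fn = 0
          for g, p in zip(gold, pred):
              gs, ps = spans(g), spans(p)
              tp += len(gs & ps)
              fp += len(ps - gs)
              fn += len(gs - ps)
          precision = tp / (tp + fp) if tp + fp else 0.0
          recall = tp / (tp + fn) if tp + fn else 0.0
          return (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)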

  13. Does better segmentation help topic identification?
     • Task: identify the object (if any) this sentence is about

         word segmentation    topics                sentence-referent accuracy    sentence-referent f-score
         unigram              not used              0.709                         0
         unigram              any number            0.702                         0.355
         unigram              one per sentence      0.503                         0.495
         collocation          not used              0.709                         0
         collocation          any number            0.728                         0.280
         collocation          one per sentence      0.440                         0.493
         collocation          one per collocation   0.839                         0.747

     • The collocation grammar with one topical word per topical collocation is the only model clearly better than baseline

  14. Does better segmentation help topic identification?
     • Task: identify the head nouns of NPs referring to topical objects (e.g. pɪg ↦ PIG in the input PIG|… ɪ z ð æ t ð ə p ɪ g)

         word segmentation    topics                topical-word f-score
         unigram              not used              0
         unigram              any number            0.149
         unigram              one per sentence      0.147
         collocation          not used              0
         collocation          any number            0.220
         collocation          one per sentence      0.321
         collocation          one per collocation   0.636

     • The collocation grammar with one topical word per topical collocation is best at identifying the head nouns of referring NPs

  15. Conclusions and future work
     • Adaptor grammars can express a variety of useful HDP models
       – generic AG inference code makes it easy to explore models
     • There seem to be synergies a learner could exploit when learning word segmentation and word-object mappings
       – incorporating the word-topic mapping improves segmentation accuracy (at least with collocation grammars)
       – improving segmentation accuracy improves topic detection and the acquisition of topical words
       – caveat: the results seem to depend on details of the model
     • Future work:
       – extend the expressive power of AGs (e.g., phonology, syntax)
       – richer data (e.g., more non-linguistic context)
       – more realistic data (e.g., phonological variation)
