Guiding Unsupervised Grammar Induction Using Contrastive Estimation ∗

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
3400 North Charles Street, Baltimore, MD 21218 USA
{nasmith,jason}@cs.jhu.edu

∗ This work was supported by a Fannie and John Hertz Foundation Fellowship to the first author and NSF ITR grant IIS-0313193 to the second author. The views expressed are not necessarily endorsed by the sponsors. The authors also thank colleagues at CLSP and two anonymous reviewers for comments on this work.

Abstract

We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation (CE) [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class (a “neighborhood”) appropriate to a given task has strong implications, but with a good neighborhood one can target the objective of grammar induction to a specific application.

1 Introduction

Grammars are formal objects with many applications. They become particularly interesting when they allow ambiguity (cf. programming language grammars), introducing the notion that one grammar may be preferable to another for a particular use. Given an induced grammar, a researcher could try to apply it cleverly to her task and then measure its helpfulness on that task. This paper turns that scenario around.

Given a task, our question is how to induce a grammar, from unannotated data, that is especially appropriate for the task. Different grammars are likely to be better for different tasks. In natural language engineering, for example, applications like automatic essay grading, punctuation correction, spelling correction, machine translation, and language modeling pose different challenges and are evaluated differently. We regard traditional natural language grammar induction evaluated against a treebank (also known as unsupervised parsing) as just another task; we call it MATCHLINGUIST. A grammar induced for punctuation restoration or language modeling for speech recognition might look strange to a linguist, yet do better on those tasks. By the same token, traditional treebank-style linguistic annotations may not be the best kind of syntax for language modeling.

But without fully-observed data, how might one tell a learner to focus on one task or another? We propose that this is conveyed in the choice of an objective function that guides a statistical learner toward the right kinds of grammars for the task at hand. We offer a flexible class of “contrastive” objective functions within which something appropriate may be designed for existing and novel tasks.

In this paper, we evaluate our learned models on MATCHLINGUIST, which is a crucial task for natural language engineering. Automatic natural language grammar induction would bridge the gap between resource limitations (annotated treebanks are expensive, domain-specific, and language-specific) and the promise of exploiting syntactic structure in many applications. We argue that MATCHLINGUIST, just like other tasks, requires guidance.

For example, MATCHLINGUIST is decidedly different from the task that is explicitly solved by the Expectation-Maximization algorithm [Dempster et al., 1977]: MAXIMIZELIKELIHOOD. EM tries to fit the numerical parameters of a (fixed) statistical model of hidden structure to the training data. To recover traditional or useful syntactic structure, it is not enough to maximize training data likelihood [Carroll and Charniak, 1992, inter alia], and EM is notorious for mediocre results. Our results suggest that part of the reason EM performs badly is that it offers very little guidance to the learner. The alternative we propose is contrastive estimation. It is within the same statistical modeling paradigm as EM, but generalizes it by defining a notion of learner guidance.

Contrastive estimation makes use of a set of examples that are similar in some way to an observed example (its neighborhood), but mostly perturbed or damaged in a particular way. CE requires the learner to move probability mass to a given example, taking only from the example’s neighborhood. The neighborhood of a particular example is defined by the neighborhood function; different neighborhood functions
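The idea of moving probability mass to an observed example, taking it only from that example's neighborhood, can be illustrated with a small sketch. This is a simplified illustration and not the models used in the paper: it omits hidden syntactic structure entirely, scores sentences with a toy log-linear model over word bigrams, and uses an adjacent-word-swap neighborhood chosen here purely for illustration. All function names (`log_score`, `swap_neighborhood`, `contrastive_objective`) are hypothetical.

```python
import math

def log_score(sentence, weights):
    """Unnormalized log-linear log-score: sum of weights of the bigram features that fire."""
    return sum(weights.get(bigram, 0.0) for bigram in zip(sentence, sentence[1:]))

def swap_neighborhood(sentence):
    """Illustrative neighborhood (an assumption): the sentence itself plus
    every variant obtained by swapping one pair of adjacent words."""
    sentence = tuple(sentence)
    neighbors = {sentence}
    for i in range(len(sentence) - 1):
        swapped = list(sentence)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighbors.add(tuple(swapped))
    return neighbors

def contrastive_objective(corpus, weights, neighborhood=swap_neighborhood):
    """Sum over observed sentences of log p(sentence | its neighborhood).

    The normalizer ranges only over the neighborhood, so raising this value
    shifts probability mass from the perturbed neighbors to the observed sentence.
    """
    total = 0.0
    for sentence in corpus:
        observed = log_score(tuple(sentence), weights)
        normalizer = math.log(sum(math.exp(log_score(n, weights))
                                  for n in neighborhood(sentence)))
        total += observed - normalizer
    return total

corpus = [("the", "dog", "barks")]
uniform = {}                                   # all feature weights zero
tuned = {("the", "dog"): 1.0, ("dog", "barks"): 1.0}
print(contrastive_objective(corpus, uniform))  # observed sentence is no better than its neighbors
print(contrastive_objective(corpus, tuned))    # higher: mass has moved to the observed sentence
```

In the setting the paper describes, both the observed score and the normalizer would additionally sum over hidden structures, and the choice of neighborhood function is what conveys the task to the learner.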