Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution

Sam Wiseman 1, Alexander M. Rush 1,2, Stuart M. Shieber 1, Jason Weston 2
1 School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
{swiseman,srush,shieber}@seas.harvard.edu
2 Facebook AI Research, New York, NY, USA
jase@fb.com

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1416–1426, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics

Abstract

We introduce a simple, non-linear mention-ranking model for coreference resolution that attempts to learn distinct feature representations for anaphoricity detection and antecedent ranking, which we encourage by pre-training on a pair of corresponding subtasks. Although we use only simple, unconjoined features, the model is able to learn useful representations, and we report the best overall score on the CoNLL 2012 English test set to date.

1 Introduction

One of the major challenges associated with resolving coreference is that in typical documents the number of mentions (syntactic units capable of referring or being referred to) that are non-anaphoric – that is, that are not coreferent with any previous mention – far exceeds the number of mentions that are anaphoric (Kummerfeld and Klein, 2013; Durrett and Klein, 2013).

This preponderance of non-anaphoric mentions makes coreference resolution challenging, partly because many basic coreference features, such as those looking at head, number, or gender match, fail to distinguish between truly coreferent pairs and the large number of matching but nonetheless non-coreferent pairs. Indeed, several authors have noted that it is difficult to obtain good performance on the coreference task using simple features (Lee et al., 2011; Fernandes et al., 2012; Durrett and Klein, 2013; Kummerfeld and Klein, 2013; Björkelund and Kuhn, 2014) and, as a result, state-of-the-art systems tend to use linear models with complicated feature conjunction schemes in order to capture more fine-grained interactions. While this approach has shown success, it is not obvious which additional feature conjunctions will lead to improved performance, which is problematic as systems attempt to scale with new data and features.

In this work, we propose a data-driven model for coreference that does not require pre-specifying any feature relationships. Inspired by recent work in learning representations for natural language tasks (Collobert et al., 2011), we explore neural network models which take only raw, unconjoined features as input, and attempt to learn intermediate representations automatically. In particular, the model we describe attempts to create independent feature representations useful for both detecting the anaphoricity of a mention (that is, whether or not a mention is anaphoric) and ranking the potential antecedents of an anaphoric mention. Adequately capturing anaphoricity information has long been thought to be an important aspect of the coreference task (see Ng (2004) and Section 7), since a strong non-anaphoric signal might, for instance, discourage the erroneous prediction of an antecedent for a non-anaphoric mention even in the presence of a misleading head match.

We furthermore attempt to encourage the learning of the desired feature representations by pre-training the model's weights on two corresponding subtasks, namely, anaphoricity detection and antecedent ranking of known anaphoric mentions.

Overall our best model has an absolute gain of almost 2 points in CoNLL score over a similar but linear mention-ranking model on the CoNLL 2012 English test set (Pradhan et al., 2012), and of over 1.5 points over the state-of-the-art coreference system. Moreover, unlike current state-of-the-art systems, our model does only local inference, and is therefore significantly simpler.

1.1 Problem Setting

We consider here the mention-ranking (or "mention-synchronous") approach to coreference
resolution (Denis and Baldridge, 2008; Bengtson and Roth, 2008; Rahman and Ng, 2009), which has been adopted by several recent coreference systems (Durrett and Klein, 2013; Chang et al., 2013). Such systems aim to identify whether a mention is coreferent with an antecedent mention, or whether it is instead non-anaphoric (the first mention in the document referring to a particular entity). This is accomplished by assigning a score to the mention's potential antecedents as well as to the possibility that it is non-anaphoric, and then predicting the greatest scoring option. We furthermore assume the more realistic "system mention" setting, where it is not known a priori which mentions in a document participate in coreference clusters, and so (all) mentions must be automatically extracted, typically with the aid of automatically detected parse trees.

Formally, we denote the set of automatically detected mentions in a document by $\mathcal{X}$. For a mention $x \in \mathcal{X}$, let $\mathcal{A}(x)$ denote the set of mentions appearing before $x$; we refer to this set as $x$'s potential antecedents. Additionally let the symbol $\epsilon$ denote the empty antecedent, to which we will view $x$ as referring when $x$ is non-anaphoric.¹ Denoting the set $\mathcal{A}(x) \cup \{\epsilon\}$ by $\mathcal{Y}(x)$, a mention-ranking model defines a scoring function $s(x, y) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and predicts the antecedent of $x$ to be $y^* = \arg\max_{y \in \mathcal{Y}(x)} s(x, y)$.

It is common to be quite liberal when extracting mentions, taking, essentially, every noun phrase or pronoun to be a candidate mention, so as not to prematurely discard those that might be coreferent (Lee et al., 2011; Fernandes et al., 2012; Chang et al., 2012; Durrett and Klein, 2013). For instance, the Berkeley Coreference System (herein BCS) (Durrett and Klein, 2013), which we use for mention extraction in our experiments, recovers approximately 96.4% of the truly anaphoric mentions in the CoNLL 2012 training set, with an almost 3.5:1 ratio of non-anaphoric mentions to anaphoric mentions among the extracted mentions.

2 Mention Ranking Models

The structural simplicity of the mention-ranking framework puts much of the burden on the scoring function $s(x, y)$. We begin by considering mention-ranking systems using linear scoring functions. In the next section, we will extend these models to operate over learned non-linear representations.

Linear mention-ranking models generally utilize the following scoring function

\[
s_{\mathrm{lin}}(x, y) \triangleq w^{\mathsf{T}} \phi(x, y),
\]

where $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{d}$ is a pairwise feature function defined on a mention and a potential antecedent, and $w$ is a learned parameter vector.

To add additional flexibility to the model, linear mention-ranking models may duplicate individual features in $\phi$, with one version being used when predicting an antecedent for $x$, and another when predicting that $x$ is non-anaphoric (Durrett and Klein, 2013). Such a scheme effectively gives rise to the following piecewise scoring function

\[
s_{\mathrm{lin+}}(x, y) \triangleq
\begin{cases}
u^{\mathsf{T}} \begin{bmatrix} \phi_a(x) \\ \phi_p(x, y) \end{bmatrix} & \text{if } y \neq \epsilon \\
v^{\mathsf{T}} \phi_a(x) & \text{if } y = \epsilon,
\end{cases}
\]

where $\phi_a : \mathcal{X} \to \mathbb{R}^{d_a}$ is a feature function defined on a mention and its context, $\phi_p : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{d_p}$ is a pairwise feature function defined on a mention and a potential antecedent, and parameters $u$ and $v$ replace $w$. Above, we have made an explicit distinction between pairwise features ($\phi_p$) and those strictly on $x$ and its context ($\phi_a$), and moreover assumed that our features need not examine potential antecedents when predicting $y = \epsilon$.

We refer to the basic, unconjoined features used for $\phi_a$ and $\phi_p$ as raw features. Figure 2 shows two versions of these features, a base set BASIC and an extended set BASIC+. The BASIC set are the raw features used in BCS, and BASIC+ includes additional raw features used in other recent coreference systems. For instance, BASIC+ additionally includes features suggested by Recasens et al. (2013) to be useful for anaphoricity, such as the number of a mention, its named entity status, and its animacy, as well as number and gender information. We additionally include bilexical head features, which are used in many well-performing systems (for instance, that of Fernandes et al. (2012)).

2.1 Problems with Raw Features

Many authors have observed that, taken individually, raw features tend to not be particularly predictive for the coreference task. We examine this phenomenon empirically in Figure 1. These

¹ We make this stipulation for modeling convenience; it is not intended to reflect any linguistic fact.
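To make the piecewise scoring function and the arg max prediction rule of Sections 1.1 and 2 concrete, the following Python sketch may help. It is only an illustration under stated assumptions, not the implementation used in the paper or in BCS: the feature functions `phi_a` and `phi_p`, the weight vectors `U` and `V`, the pronoun list, and the toy mentions are all hypothetical, and `None` stands in for the empty antecedent ǫ.

```python
from typing import Callable, List, Optional, Sequence

EPSILON = None  # stands in for the empty antecedent (an assumption of this sketch)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def score_lin_plus(u, v, phi_a, phi_p, x: str, y: Optional[str]) -> float:
    """Piecewise linear score: v . phi_a(x) for the empty antecedent,
    u . [phi_a(x); phi_p(x, y)] for a real candidate antecedent y."""
    if y is EPSILON:
        return dot(v, phi_a(x))
    return dot(u, phi_a(x) + phi_p(x, y))  # list concatenation stacks the features


def predict_antecedent(mentions: List[str], i: int,
                       score: Callable[[str, Optional[str]], float]) -> Optional[int]:
    """Return the index of the highest-scoring earlier mention for mentions[i],
    or None when the empty antecedent scores highest (ties keep the earlier option)."""
    x = mentions[i]
    best_j, best_score = EPSILON, score(x, EPSILON)
    for j in range(i):
        s = score(x, mentions[j])
        if s > best_score:
            best_j, best_score = j, s
    return best_j


# Hypothetical toy features, chosen only for illustration.
PRONOUNS = {"he", "she", "it", "they"}


def phi_a(x: str) -> List[float]:
    """Mention-only features: a bias term and an is-pronoun indicator."""
    return [1.0, 1.0 if x.lower() in PRONOUNS else 0.0]


def phi_p(x: str, y: str) -> List[float]:
    """Pairwise feature: exact string match between mention and candidate."""
    return [1.0 if x.lower() == y.lower() else 0.0]


U = [0.0, 0.5, 2.0]  # hypothetical weights over [phi_a; phi_p]
V = [1.0, -1.0]      # hypothetical weights over phi_a for the empty antecedent
MENTIONS = ["Obama", "the senator", "Obama", "he"]


def toy_score(x: str, y: Optional[str]) -> float:
    return score_lin_plus(U, V, phi_a, phi_p, x, y)


predictions = [predict_antecedent(MENTIONS, i, toy_score)
               for i in range(len(MENTIONS))]
# predictions == [None, None, 0, 0]
```

With these toy weights, the first two mentions are predicted non-anaphoric, the second "Obama" links to the first through the string-match feature, and "he" links to the earliest candidate because the pronoun indicator lowers the score of the empty antecedent. Note that the third candidate for "he" is scored identically to the first, which is one small instance of raw features failing to separate candidates on their own.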