A Rich Morphological Tagger for English: Exploring the Cross-Linguistic Tradeoff Between Morphology and Syntax

Christo Kirov¹, John Sylak-Glassman¹, Rebecca Knowles¹,², Ryan Cotterell¹,², Matt Post¹,²,³
¹Center for Language and Speech Processing
²Department of Computer Science
³Human Language Technology Center of Excellence
Johns Hopkins University
kirov@gmail.com, {jcsg, rknowles, rcotter2}@jhu.edu, post@cs.jhu.edu

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 112–117, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Abstract

A traditional claim in linguistics is that all human languages are equally expressive, that is, able to convey the same wide range of meanings. Morphologically rich languages, such as Czech, rely on overt inflectional and derivational morphology to convey many semantic distinctions. Languages with comparatively limited morphology, such as English, should be able to accomplish the same using a combination of syntactic and contextual cues. We capitalize on this idea by training a tagger for English that uses syntactic features obtained by automatic parsing to recover complex morphological tags projected from Czech. The high accuracy of the resulting model provides quantitative confirmation of the underlying linguistic hypothesis of equal expressivity, and bodes well for future improvements in downstream HLT tasks including machine translation.

1 Introduction

Different languages use different grammatical tools to convey the same meanings. For example, to indicate that a noun functions as a direct object, English (a morphologically poor language) places the noun after the verb, while Czech (a morphologically rich language) uses an accusative case suffix. Consider the following two glossed Czech sentences: ryba jedla ("the fish ate") and oni jedli rybu ("they ate the fish"). The key insight is that the morphology of Czech (i.e., the case ending -u) carries the same semantic content as the syntactic structure of English (i.e., the word order) (Harley, 2015). Theoretically, this common underlying semantics should allow syntactic structure to be transformed into morphological structure and vice versa. We explore the veracity of this claim computationally by asking the following: Can we develop a tagger for English that uses the signal available in English-only syntactic structure to recover the rich semantic distinctions conveyed by morphology in Czech? Can we, for example, accurately detect which English contexts would have a Czech translation that employs the accusative case marker?

Traditionally, morphological analysis and tagging is a task that has been limited to morphologically rich languages (MRLs) (Hajič, 2000; Drábek and Yarowsky, 2005; Müller et al., 2015; Buys and Botha, 2016). In order to build a rich morphological tagger for a morphologically poor language (MPL) like English, we need some way to build a gold standard set of richly tagged English data for training and testing. Our approach is to project the complex morphological tags of Czech words directly onto the English words they align to in a large parallel corpus. After evaluating the validity of these projections, we develop a neural network tagging architecture that takes as input a number of English features derived from off-the-shelf dependency parsing and attempts to recover the projected Czech tags.

A tagger of this sort is interesting in many ways. Whereas the best NLP tools are typically available for English, morphological tagging at this granularity has until now been applied almost exclusively to MRLs. The task is also scientifically interesting, in that it takes semantic properties that are latent in the syntactic structure of English and transforms them into explicit word-level annotations. Finally, such a tool has potential utility in a
range of downstream tasks, such as machine translation into MRLs (Sennrich and Haddow, 2016).

2 Projecting Morphological Tags

Training a system to tag English text with multi-dimensional morphological tags requires a corpus of English text annotated with those tags. Since no such corpora exist, we must construct one. Past work (focused on translating out of English into MRLs) assigned a handful of morphological annotations using manually-developed heuristics (Drábek and Yarowsky, 2005; Avramidis and Koehn, 2008), but this is hard to scale. We therefore instead look to obtain rich morphological tags by projecting them (Yarowsky et al., 2001) from a language (such as Czech) where such rich tags have already been annotated.

We use the Prague Czech–English Dependency Treebank (PCEDT) (Hajič et al., 2012), a complete translation of the Wall Street Journal portion of the Penn Treebank (PTB) (Marcus et al., 1993). Each word on the Czech side of the PCEDT was originally hand-annotated with complex 15-dimensional morphological tags containing positional subtag values for morphological categories specific to Czech.¹ We manually mapped these tags to the UniMorph Schema tagset (Sylak-Glassman et al., 2015), which provides a universal, typologically-informed annotation framework for representing morphological features of inflected words in the world's languages. UniMorph tags are in principle up to 23-dimensional, but tags are not positionally dependent, and not every dimension needs to be specified. Table 1 shows the subset of UniMorph subtags used here. PTB tags have no formal internal subtag structure. See Figure 1 for a comparison of the PCEDT, UniMorph, and PTB tag systems for a Czech word and its aligned English translation.

Subtag      Values
GENDER      FEM, MASC, NEUT
NUMBER      SG, DU, PL
CASE        NOM, GEN, DAT, ACC, VOC, ESS, INS
PERSON      1, 2, 3
TENSE       FUT, PRS, PST
GRADE       CMPR, SPRL
NEGATION    POS, NEG
VOICE       ACT, PASS

Table 1: The subset of the UniMorph Schema used here.

The PCEDT also contains automatically generated word alignments produced by using GIZA++ (Och and Ney, 2003) to align the Czech and English sides of the treebank. We use these alignments to project morphological tags from the Czech words to their English counterparts through the following process. For every English word, if the word is aligned to a single Czech word, take its tag. If the word is mapped to multiple Czech words, take the annotation from the alignment point belonging to the intersection of the two underlying GIZA++ models used to produce the many-many alignment.² If no such alignment point is found, take the leftmost aligned word. Unaligned English words get no annotation.

¹ For our purposes, a morphological tag is a complex, multiclass entity comprising the morphological features that a word bears across many different inflectional categories (e.g., CASE, NUMBER, and so on). We call these features subtags, and each takes one of several values (e.g., PRS 'present' in the TENSE category of the UniMorph Schema).
² This intersection is marked as int.gdfa in the PCEDT.

PTB     Expected UM    Match %
NN      SG             87.8
NNP     SG             73.9
NNS     PL             83.3
NNPS    PL             65.1
JJR     CMPR           89.0
JJS     SPRL           79.3
RBR     CMPR           76.3
RBS     SPRL           68.7
VBZ     SG             91.3
VBZ     3              90.7
VBZ     PRS            89.4
VBG     PRS            55.9
VBP     PRS            87.2
VBD     PST            93.9
VBN     PST            78.7
Average Match %        80.7

Table 2: To evaluate the validity of projecting morphological tags from Czech onto English text, we compare these projected features to features obtained from the original PTB tags (listed on the left). The expected UniMorph (UM) subtag (center) is from a manual 'translation' of PTB tags into UniMorph tags. The match percentage indicates how often the feature projected from a UniMorph 'translation' of the original PCEDT annotation of Czech matches the expected subtag. Note that the core part-of-speech must agree as a precondition for further evaluation.

3 Validating Projections

If we believe that we can project semantic distinctions over bitext, we must ensure that the elements linked by projection in both source and target languages carry roughly the same meaning. This is difficult to automate, and no gold-standard dataset or metric has been developed. Thus, we offer the following approximate evaluation.
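To make the two procedures concrete, the projection rule of Section 2 and the match-percentage check summarized in the Table 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that a tag is represented as an unordered set of subtag values, that the core part of speech is stored as one member of that set, and that alignments are given as (English index, Czech index) pairs. The function names, the example tags for oni jedli rybu, and the example alignment links are all hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence, Set, Tuple

Tag = FrozenSet[str]      # assumption: one tag = an unordered set of subtag values
Link = Tuple[int, int]    # (english_index, czech_index)


def project_tags(n_english: int, czech_tags: Sequence[Tag],
                 links: Set[Link], intersection: Set[Link]) -> List[Optional[Tag]]:
    """Project one Czech tag (or None) onto each English token."""
    projected: List[Optional[Tag]] = []
    for e in range(n_english):
        aligned = sorted(c for (i, c) in links if i == e)
        if not aligned:
            projected.append(None)  # unaligned English words get no annotation
            continue
        # With several links, prefer one that is also in the intersection;
        # otherwise fall back to the leftmost aligned Czech word.
        preferred = [c for c in aligned if (e, c) in intersection]
        choice = preferred[0] if len(aligned) > 1 and preferred else aligned[0]
        projected.append(czech_tags[choice])
    return projected


def match_percentage(tokens: Sequence[Tuple[str, Optional[Tag]]],
                     ptb_tag: str, core_pos: str, expected: str) -> Optional[float]:
    """Share of tokens with `ptb_tag` whose projected tag contains `expected`,
    counting only tokens whose projected core part of speech agrees."""
    eligible = [tag for ptb, tag in tokens
                if ptb == ptb_tag and tag is not None and core_pos in tag]
    if not eligible:
        return None
    return 100.0 * sum(expected in tag for tag in eligible) / len(eligible)


# Hypothetical example: "they ate the fish" aligned to "oni jedli rybu".
czech_tags = [frozenset({"PRO", "NOM", "PL", "3"}),   # oni (illustrative tag)
              frozenset({"V", "PST", "PL"}),          # jedli (illustrative tag)
              frozenset({"N", "ACC", "SG", "FEM"})]   # rybu (illustrative tag)
links = {(0, 0), (1, 0), (1, 1), (3, 2)}              # many-to-many links (invented)
intersection = {(0, 0), (1, 1), (3, 2)}               # high-precision subset (invented)

projected = project_tags(4, czech_tags, links, intersection)
tokens = list(zip(["PRP", "VBD", "DT", "NN"], projected))
print(projected)
print(match_percentage(tokens, "NN", "N", "SG"))
```

In this toy sentence, "ate" has two links and receives the tag of jedli because that link is also in the intersection, while "the" has no link and stays unannotated. The accusative value ACC then reaches the English object "fish" purely through alignment, which is the signal the tagger is later asked to recover from English syntax alone.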