MULTIFLEX - a Formalism and a Tool for the Computational Morphology of Multi-Word Units Agata SAVARY Université François Rabelais Laboratoire d’Informatique IUT de Blois IPI PAN Seminar Warszawa, Jan 5, 2007
Multi-Word Units (MWUs)
• MWUs = hard-to-define and controversial linguistic objects called, in various contexts, compounds, frozen expressions, complex terms, multi-word named entities, burkinostka (pl.), etc.
• Numerous linguistic and pragmatic definitions (Benveniste 1974, Downing 1977, Levi 1978, Bauer 1983, Gross 1990, Anscombre 1990, Silberztein 1993, Cadiot 1992, Gross 1996, Derwojedowa & Rudolf 2003) and applications (Sparck-Jones & Tait 1984, Smadja 1993, Silberztein 1993, Daille 1994, Enguehard & Pantera 1994, Jacquemin 2001, Paumier 2003)
• Major features invoked in the bibliography (based on controversial elementary notions and measures, here in italic):
– Composed of two or more words
– Showing some degree of non-compositionality
– Having unique and constant references
MWUs: pragmatic definition
• MWU = a contiguous sequence of graphical units which, for some application-dependent reasons, has to be listed, described (morphologically, syntactically, semantically, etc.) and analyzed as a unit
• Examples:
– (en.) battle of nerves, Papua New Guinea, -calculus
– (fr.) porte-avions, Windows 3.11, liaison multiple par satellite
– (pl.) pranie mózgu, Rondo De Gaulle’a, trzy po trzy
Inflectional Morphology of MWUs
• What is a MWU’s morphological class (noun, adjective, etc.) and what inflection categories (number, gender, etc.), with fixed or variable values, are relevant to it? E.g. pranie mózgu is a noun, it has neuter gender and it inflects for number and case.
• What are the exceptions to the inflection categories determined above? E.g. wybory powszechne is a noun but does not have a singular form.
• What are the inflectional characteristics (base form, morphological class, inflection paradigm) of each single constituent of the MWU? E.g. porte is an uninflected verb in porte-avions and an inflected noun in porte-fenêtre.
• How do inflected forms of single constituents combine in the inflection process of the whole compound? E.g.
– battle cry → battle cries
– battle royal → battle royals, or battles royal (*battles royals)
– battle of nerves → battles of nerves
State of the Art (1/2)
• Morphological analysis: stemming or lemmatizing of the constituent words
• Problems:
– cross-roads → *cross-road (should be: cross-roads)
– court martials → court ? (martial is not an individual English noun)
– des deux-chevaux → ? (non-standard French nominal construction)
• Morphological generation: grammar-based or bag-of-words approaches
• Problems:
– notary public → notary publics (should also be: notaries public and notaries publics)
– battle cry → battle cries, *battles cry, *battles cries
State of the Art (2/2)
• Xerox – lexical transducers allowing compounding and unification. Elegant and mathematically well-defined model. A comparative study with Multiflex is in progress.
• Greek DELA – all possible combinations of all inflected forms of the single constituents + restriction filters. Drawbacks: a graphical unit has a fixed definition, some restrictions on forms cannot be described, the rules are heterogeneous and non-generic (hard to adapt to a different language), separators are impossible to describe.
• Intex – a MWU’s inflection formalism extends the simple-word morphology with new operators like “go to the end of the first word, add an s”, etc. Drawbacks: redundant description of single words’ morphology.
Example of a Formalism for the Morphology of Single Words
[Figure: inflection graph A72 with four paths: (empty) :ms, x :mp, 2lle :fs, 2lle s :fp]
Dictionary entries using this graph: nouveau,A72 and beau,A72
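The A72 example can be sketched in a few lines of Python. This is only an illustration, assuming that a leading digit in a code such as 2lle means "delete that many final letters, then append the rest"; that reading is not stated on the slide, but it yields nouvelle and belle. The names apply_suffix_code and inflect are hypothetical.

```python
import re

def apply_suffix_code(lemma: str, code: str) -> str:
    """Apply a code such as '2lle' to a lemma.

    Assumption: a leading number is the count of final letters to delete;
    the remaining characters are appended to what is left of the lemma.
    """
    match = re.fullmatch(r"(\d*)(.*)", code)
    strip_count = int(match.group(1) or 0)
    suffix = match.group(2)
    stem = lemma[:len(lemma) - strip_count] if strip_count else lemma
    return stem + suffix

# The four paths of graph A72, keyed by their feature label.
A72 = {"ms": "", "mp": "x", "fs": "2lle", "fp": "2lles"}

def inflect(lemma: str, paradigm: dict) -> dict:
    """Return every inflected form of a lemma for one paradigm."""
    return {features: apply_suffix_code(lemma, code)
            for features, code in paradigm.items()}

if __name__ == "__main__":
    for lemma in ("nouveau", "beau"):
        print(lemma, inflect(lemma, A72))
```

Both dictionary entries share one paradigm, which is the point of the slide: the class A72 is described once and reused.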
My aim
Propose a (“universal”) formalism that allows one to describe explicitly and precisely all inflected forms of a MWU. An inflectional paradigm for a MWU should be independent of the inflectional paradigms of its single constituents.
[Figure: a MWU inflection paradigm CN23, still to be defined (“?”), assigned to the entries battle of nerves,CN23 and man-of-war,CN23]
MULTIFLEX - A Formalism and a Tool
Morphological description on the language level
The alphabet
List of alphabet characters with upper/lowercase equivalences and sorting keys.
French    Polish    Serbian (encoded in ASCII only)
Aa        Aa        Aa
Aà        Ąą        Bb
Àà        Bb        Cc
Aâ        Cc        Dd
Ââ        Ćć        Ee
Aä        Dd        Ff
Ää        Ee        Gg
Bb        Ęę        Hh
Cc        Éé        Ii
Cç        Ëë        Jj
…         …         …
Inflectional classes, categories and values
French
<CATEGORIES>
Nb: s, p
Gen: m, f
<CLASSES>
noun: (Nb,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Gen,<var>)
adv:
…
• Category name: Nb. Possible values for this category: s, p
• Class name: noun. Possible categories for this class (variable or fixed):
– Nb (a noun inflects in number)
– Gen (a noun has a gender but does not inflect in gender (?))
Inflectional classes, categories and values
Multi-character names are admitted for classes and categories
Polish
<CATEGORIES>
Nb:sing,pl
Gen:pers_masc,anim_masc,inanim_masc,fem,neut
Case:Nom,Gen,Dat,Acc,Inst,Loc,Voc
<CLASSES>
noun:(Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Case,<var>),(Gen,<var>)
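A language description of this kind is simple to read programmatically. The sketch below parses the Polish example into two dictionaries. It relies only on the layout visible on the slide (section markers, name:values lines, (category,<var|fixed>) pairs); the actual file format accepted by the Multiflex tool may differ, and parse_morphology is a hypothetical name.

```python
import re

POLISH = """
<CATEGORIES>
Nb:sing,pl
Gen:pers_masc,anim_masc,inanim_masc,fem,neut
Case:Nom,Gen,Dat,Acc,Inst,Loc,Voc
<CLASSES>
noun:(Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Case,<var>),(Gen,<var>)
"""

def parse_morphology(text):
    """Return (categories, classes) read from a slide-style description."""
    categories, classes = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("<CATEGORIES>", "<CLASSES>"):
            section = line
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if section == "<CATEGORIES>":
            # Category name followed by its comma-separated value domain.
            categories[name] = [v.strip() for v in rest.split(",") if v.strip()]
        elif section == "<CLASSES>":
            # Class name followed by (category,<var>) or (category,<fixed>) pairs.
            pairs = re.findall(r"\(\s*(\w+)\s*,\s*<(var|fixed)>\s*\)", rest)
            classes[name] = {category: kind for category, kind in pairs}
    return categories, classes

if __name__ == "__main__":
    categories, classes = parse_morphology(POLISH)
    print(categories)
    print(classes)
```

Keeping the fixed/variable flag per class is what later lets a MWU paradigm decide which categories it may inflect for and which it must inherit.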
Morphological description on the level of a multi-word unit: inflection graphs
Inflection Graphs for MWUs: battle royal
[Figure: inflection graph over battle royal = $1 $2 $3, running from the entry box to the stop box]
• Each path describes one or more inflected forms of a MWU
• Path annotations: the whole MWU is in singular; the 1st constituent remains intact; the 3rd constituent gets inflected into plural
Generated forms:
battle royal → (battle royal, [Nb=s]), (battle royals, [Nb=p]), (battles royal, [Nb=p])
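One way to picture such a graph is as a list of paths, each saying which constituents get inflected and which features the whole MWU receives. The sketch below reproduces the three forms of battle royal. The data layout, the LEXICON table (standing in for a single-word inflection module) and the name generate are assumptions for illustration, not the tool's internal representation.

```python
# Hypothetical stand-in for single-word morphology: lemma -> form by number.
LEXICON = {
    "battle": {"s": "battle", "p": "battles"},
    "royal": {"s": "royal", "p": "royals"},
}

# Each path: ({constituent position: required Nb}, features of the whole MWU).
# A constituent that is not mentioned remains intact.
BATTLE_ROYAL_PATHS = [
    ({}, {"Nb": "s"}),          # $1 $2 $3 unchanged
    ({3: "p"}, {"Nb": "p"}),    # the 3rd constituent goes to plural
    ({1: "p"}, {"Nb": "p"}),    # the 1st constituent goes to plural
]

def generate(units, paths, lexicon):
    """Return one (form, features) pair per path of the graph."""
    results = []
    for changes, mwu_features in paths:
        parts = []
        for position, unit in enumerate(units, start=1):
            if position in changes:
                parts.append(lexicon[unit][changes[position]])
            else:
                parts.append(unit)
        results.append(("".join(parts), mwu_features))
    return results

if __name__ == "__main__":
    # The separator (a space) is itself a constituent, $2.
    for form, features in generate(["battle", " ", "royal"], BATTLE_ROYAL_PATHS, LEXICON):
        print(form, features)
```

Because no path inflects both $1 and $3, the incorrect *battles royals is never produced.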
Inflection Graphs for MWUs: bateau-mouche
[Figure: inflection graph over bateau - mouche = $1 $2 $3]
• Gender category has a fixed value in this class (noun). In this paradigm it is masculine.
• This class (noun) inflects for number.
• Inflection features for a single constituent may be a partial set. Here the gender is not specified: it is the same as this constituent has in the base form of the MWU. This allows the same graph to apply to e.g. homme politique.
Generated forms:
bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
Unification variables
[Figure: one inflection graph with unification variables, applied to homme politique = $1 $2 $3 and bateau - mouche = $1 $2 $3]
• The unification variable $n may be instantiated to any value of its category’s domain (here: Nb: s, p).
• The instantiation of a variable is identical for all its appearances in one path.
• The generated forms are identical to those of the previous graph.
Generated forms:
bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
homme politique → (homme politique, [Gen=m,Nb=s]), (hommes politiques, [Gen=m,Nb=p])
Unification variables and value inheritance
The previous graph is limited to masculine MWUs only. However, bateau-mouche and homme politique inflect in basically the same way as moissonneuse-batteuse, liaison numérique, etc.: they inflect in number only, and in order to get the plural we need to put the 1st and the 3rd constituent into plural.
• The double equal sign (==) means that variable $g has a fixed value in the whole path. It is to be unified with the gender value that the first constituent has in the base form of the MWU (e.g. masculine for bateau, and feminine for liaison). The whole MWU inherits this value.
• The generated forms are identical to those of the previous graph.
Generated forms:
bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
liaison numérique → (liaison numérique, [Gen=f,Nb=s]), (liaisons numériques, [Gen=f,Nb=p])
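The two mechanisms can be imitated in a short Python sketch: a loop over the number domain plays the role of the unification variable $n (one value shared by $1 and $3 within a path), and the gender of the first constituent is read once and copied to the whole MWU, as $g does. The LEXICON contents and the name inflect_mwu are hypothetical; a real system would obtain constituent forms from its single-word morphology.

```python
NB_DOMAIN = ["s", "p"]

# Hypothetical single-word data: base-form gender (None where it is not
# needed in this sketch) and the forms for each number value.
LEXICON = {
    "bateau": {"Gen": "m", "forms": {"s": "bateau", "p": "bateaux"}},
    "mouche": {"Gen": "f", "forms": {"s": "mouche", "p": "mouches"}},
    "liaison": {"Gen": "f", "forms": {"s": "liaison", "p": "liaisons"}},
    "numérique": {"Gen": None, "forms": {"s": "numérique", "p": "numériques"}},
}

def inflect_mwu(units):
    """Inflect a three-unit MWU ($1 $2 $3) in number only.

    $g: gender fixed for the whole path, inherited from $1 in the base form.
    $n: number, instantiated identically for $1 and $3 within one path.
    """
    first, separator, third = units
    gender = LEXICON[first]["Gen"]
    results = []
    for n in NB_DOMAIN:
        form = LEXICON[first]["forms"][n] + separator + LEXICON[third]["forms"][n]
        results.append((form, {"Gen": gender, "Nb": n}))
    return results

if __name__ == "__main__":
    print(inflect_mwu(["bateau", "-", "mouche"]))
    print(inflect_mwu(["liaison", " ", "numérique"]))
```

The same function covers masculine and feminine compounds, which is the gain over a paradigm with a hard-coded gender.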
Graph size reduction via unification variables: pranie mózgu
pranie mózgu (brain washing) = $1 $2 $3
Generated forms:
(pranie mózgu, [Gen=neut,Nb=sing,Case=Nom]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Nom]),
(prania mózgu, [Gen=neut,Nb=sing,Case=Gen]), (prania mózgów, [Gen=neut,Nb=sing,Case=Gen]),
(praniu mózgu, [Gen=neut,Nb=sing,Case=Dat]), (praniu mózgów, [Gen=neut,Nb=sing,Case=Dat]),
(pranie mózgu, [Gen=neut,Nb=sing,Case=Acc]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Acc]),
(praniem mózgu, [Gen=neut,Nb=sing,Case=Inst]), (praniem mózgów, [Gen=neut,Nb=sing,Case=Inst]),
(praniu mózgu, [Gen=neut,Nb=sing,Case=Loc]), (praniu mózgów, [Gen=neut,Nb=sing,Case=Loc]),
(pranie mózgu, [Gen=neut,Nb=sing,Case=Voc]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Voc]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Nom]), (prania mózgów, [Gen=neut,Nb=pl,Case=Nom]),
(prań mózgu, [Gen=neut,Nb=pl,Case=Gen]), (prań mózgów, [Gen=neut,Nb=pl,Case=Gen]),
(praniom mózgu, [Gen=neut,Nb=pl,Case=Dat]), (praniom mózgów, [Gen=neut,Nb=pl,Case=Dat]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Acc]), (prania mózgów, [Gen=neut,Nb=pl,Case=Acc]),
(praniami mózgu, [Gen=neut,Nb=pl,Case=Inst]), (praniami mózgów, [Gen=neut,Nb=pl,Case=Inst]),
(praniach mózgu, [Gen=neut,Nb=pl,Case=Loc]), (praniach mózgów, [Gen=neut,Nb=pl,Case=Loc]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Voc]), (prania mózgów, [Gen=neut,Nb=pl,Case=Voc])
Without unification variables the inflection graph would have to contain 28 different paths.
Graph size reduction via unification variables: pranie mózgu
[Figure: inflection graph with unification variables over pranie mózgu = $1 $2 $3]
• The 1st and the 2nd constituent inflect in number independently of each other
• The whole MWU inherits its gender, number and case from the 1st constituent
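The count of 28 forms follows from three independent choices: number of pranie (2) × case of pranie (7) × number of the genitive mózg (2). A minimal Python sketch, using only the word forms listed for this example and a hypothetical function name, enumerates them with a single loop in place of 28 explicit paths:

```python
from itertools import product

NUMBERS = ["sing", "pl"]
CASES = ["Nom", "Gen", "Dat", "Acc", "Inst", "Loc", "Voc"]

# Forms of the head noun pranie, by number and case.
PRANIE = {
    "sing": dict(zip(CASES, ["pranie", "prania", "praniu", "pranie",
                             "praniem", "praniu", "pranie"])),
    "pl": dict(zip(CASES, ["prania", "prań", "praniom", "prania",
                           "praniami", "praniach", "prania"])),
}
# The modifier stays in the genitive and varies in number only.
MOZG_GENITIVE = {"sing": "mózgu", "pl": "mózgów"}

def inflect_pranie_mozgu():
    """Enumerate all inflected forms of 'pranie mózgu'.

    The head's number and case and the modifier's number vary independently;
    the whole MWU takes its gender (neuter), number and case from the head.
    """
    forms = []
    for head_nb, case, modifier_nb in product(NUMBERS, CASES, NUMBERS):
        text = PRANIE[head_nb][case] + " " + MOZG_GENITIVE[modifier_nb]
        forms.append((text, {"Gen": "neut", "Nb": head_nb, "Case": case}))
    return forms

if __name__ == "__main__":
    forms = inflect_pranie_mozgu()
    print(len(forms))
    for text, features in forms[:4]:
        print(text, features)
```

Each loop variable corresponds to one unification variable in the graph, so a single path with three variables replaces the full enumeration.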