Computational Linguistics
MORPHOLOGY – TRANSDUCERS
Martin Rajman (Martin.Rajman@epfl.ch) and Jean-Cédric Chappelier (Jean-Cedric.Chappelier@epfl.ch)
Artificial Intelligence Laboratory
LIA · M. Rajman, J.-C. Chappelier · Computational Linguistics Course (EPFL-MsCS) · I&C · École Polytechnique Fédérale de Lausanne

Objectives of this lecture
➥ Present morphology, an important part of NLP
➥ Introduce transducers, the tools of computational morphology

Contents
➥ Morphology
➥ Transducers
➥ Operations and Regular Expressions on Transducers

Morphology
Study of the internal structure and the variability of the words in a language:
✏ verb conjugation
✏ plurals
✏ nominalization (enjoy → enjoyment)
➜ inflectional morphology: preserves the grammatical category
   give, given, gave, gives, ...
➜ derivational morphology: changes the category
   process, processing, processable, processor, processability

Morphology (2)
Interest: use a priori knowledge about the structure of a word to decompose it into morphemes and to produce additional syntactic and semantic information (on the current word).
☞ processable → process- + -able (2 morphemes)

                        process-   -able
  meaning:              process    possible
  role:                 root       suffix
  semantic information: main       less

The importance and complexity of morphology vary from language to language.
Some information represented at the morphological level in English may be represented differently in other languages (and vice versa). The paradigmatic/syntagmatic distribution changes from one language to another.
Example in Chinese: "ate" is expressed as "eat yesterday".

Stems – Affixes
Words are decomposed into morphemes: roots (or stems) and affixes.
There are several kinds of affixes:
➊ prefixes: in- + -credible
➋ suffixes: incred- + -ible
➌ infixes:
   Example in Tagalog (Philippines): hingi (to borrow) → h-um-ingi (agent of the action)
   In English slang: "fucking" inserted in the middle of a word, as in Man-fucking-hattan
➍ circumfixes:
   Example in German: sagen (to say) → ge-sag-t (said)

Stems – Affixes (2)
Several affixes may be combined: for example in Turkish, where a word can carry up to 10 (!) affixes.

  uygarlaştıramadıklarımızdanmışsınızcasına

  uygar      laş   tır    ama       dık     lar  ımız   dan   mış    sınız  casına
  civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL  +P1PL  +ABL  +PAST  +2PL   +ASIF

  "as if you are among those whom we could not cause to become civilized"

When only prefixes and suffixes are involved: concatenative morphology.
Some languages are not concatenative:
• infixes
• pattern-based morphology

Example of Semitic languages
Pattern-based morphology
In Hebrew, verb morphology is based on the association of
• a root, often made of 3 consonants, which indicates the main meaning,
• and a vocalic structure (insertion of vowels) that refines the meaning.
Example: LMD (learn or teach)
  LAMAD → he was learning
  LUMAD → he was taught

Computational Morphology
Let us consider inflectional morphology, for instance for verbs and nouns.
Noun inflections: plural
  General rule: +s, but with several exceptions (e.g. foxes, mice)
Verb inflections: conjugations
• tense, mood
• regular/irregular
☞ How to handle inflections computationally?

Computational Morphology (2)
Example:
  surface form: is
  canonical representation at the lexicon level (formalization): be+3+s+Ind+Pres
The objective of computational morphology tools is precisely to go from one to the other:
• Analysis: find the canonical representation corresponding to the surface form
• Generation: produce the surface form described by the canonical representation
Challenge: have a "good" implementation of these two transformations.
Tools: associations of strings → transducers

String associations
  (X₁, X′₁)    e.g. (eaten, eat)
  ...               (processed, process)
  (Xₙ, X′ₙ)         (thought, think)
Example: (abc, ABC)
Easy situation: ∀i, |Xᵢ| = |X′ᵢ|
⇒ represented as a sequence of character transductions
  (abc, ABC) = (a,A)(b,B)(c,C)
☞ strings on a new alphabet: strings of character couples
Not so easy: if ∃i, |Xᵢ| ≠ |X′ᵢ|
⇒ requires the introduction of the empty string ε
Example: (ab, ABC) ≃ (εab, ABC) = (ε,A)(a,B)(b,C)

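The pairing described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `pair_up` is hypothetical, and padding the shorter string on the left is just the convention used in the (εab, ABC) example, not the only possible one.

```python
EPS = ""  # the empty Python string stands for ε


def pair_up(upper, lower):
    """Left-pad the shorter string with ε, then zip both into symbol couples."""
    width = max(len(upper), len(lower))

    def pad(s):
        return [EPS] * (width - len(s)) + list(s)

    return list(zip(pad(upper), pad(lower)))


print(pair_up("abc", "ABC"))  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
print(pair_up("ab", "ABC"))   # [('', 'A'), ('a', 'B'), ('b', 'C')]
```

Each returned tuple is one symbol of the new alphabet of character couples.
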
Dealing with ε
Where to put the ε?
Example: (ab, ABC) ≃ (εab, ABC)
  but also (ab, ABC) ≃ (aεb, ABC)
  or       (ab, ABC) ≃ (abε, ABC)
General case: C(n, m) = n! / (m! (n − m)!) possible placements, for strings of lengths m and n with m < n
Hard problem in general → need for a convention

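To make the count concrete, the following sketch enumerates every way of stretching the shorter string to the length of the longer one by inserting ε, and checks the result against the binomial coefficient. The helper name `paddings` is hypothetical; the sketch assumes ε is only ever inserted into the shorter string.

```python
from itertools import combinations
from math import comb

EPS = "ε"


def paddings(short, n):
    """Yield every way to stretch `short` to length n by inserting ε."""
    m = len(short)
    # Choose which of the n slots keep a real character; the rest hold ε.
    for kept in combinations(range(n), m):
        slots = [EPS] * n
        for pos, ch in zip(kept, short):
            slots[pos] = ch
        yield "".join(slots)


options = list(paddings("ab", 3))
print(options)                      # ['abε', 'aεb', 'εab']
print(len(options) == comb(3, 2))   # True
```

The three strings printed are exactly the three alignments of (ab, ABC) listed on the slide.
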
Transducer (definition)
Let Σ₁ and Σ₂ be two enumerable sets (alphabets), and
  Σ = ((Σ₁ ∪ {ε}) × (Σ₂ ∪ {ε})) \ {(ε, ε)}
A transducer is a DFSA on Σ.
Terminology:
  Σ₁: "left" language, also called upper language or input language
  Σ₂: "right" language, also called lower language or output language

Example
[Figure: transducer with states 0, 1 and 2; the legend marks the initial state and the final state(s). Arc labels shown: a:b, a, b:a, b, b:a, b:ε, a]
Some transductions:
  (bb, b)        states [0,0,2]
  (ababb, baab)  states [0,1,2,0,0,2]

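A transducer of this kind can be stored as an ordinary transition table whose symbols are couples. The sketch below is an assumption-laden reconstruction, not a copy of the original figure: the arc endpoints, the initial state 0 and the final state 2 are inferred from the state sequences and transductions quoted on the slides, and a bare label such as "a" is read as a:a. The names `ARCS` and `run` are hypothetical.

```python
EPS = ""  # stands for ε

# (state, (upper, lower)) -> next state. ASSUMPTION: endpoints reconstructed
# from the transductions listed on the slides, not read off the figure.
ARCS = {
    (0, ("a", "b")): 1,
    (0, ("b", "b")): 0,
    (0, ("b", EPS)): 2,
    (1, ("b", "a")): 2,
    (1, ("a", "a")): 2,
    (2, ("a", "a")): 0,
    (2, ("b", "a")): 1,
}
INITIAL, FINALS = 0, {2}  # ASSUMPTION: every listed path starts in 0 and ends in 2


def run(pairs):
    """Follow a sequence of couples; return the state path, or None if rejected."""
    state, path = INITIAL, [INITIAL]
    for pair in pairs:
        state = ARCS.get((state, pair))
        if state is None:  # no arc carries this couple
            return None
        path.append(state)
    return path if state in FINALS else None


print(run([("b", "b"), ("b", EPS)]))
# [0, 0, 2]
print(run([("a", "b"), ("b", "a"), ("a", "a"), ("b", "b"), ("b", EPS)]))
# [0, 1, 2, 0, 0, 2]
```

Because each (state, couple) key maps to a single target, the table is deterministic over couples, which is all the definition of a transducer as a DFSA on Σ requires.
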
Different usages of a transducer
➊ Association checking: is the couple (abba, baaa), seen as a string of Σ*, accepted?
➋ Generation: string 1 → string 2, e.g. bbab → ?
➌ Analysis: string 2 → string 1, e.g. ? → ba
➊ is easy (= FSA: nothing special).
What about ➋ and ➌?

Transduction
Walk through the FSA following one or the other element of the couple (projections).
☞ Not deterministic in general!
The fact that a transducer is a deterministic (couple-)FSA does not at all imply that the automaton resulting from one projection or the other is also deterministic.
⇒ Non-deterministic evaluation: the projection is not constant time (in general), because of backtracking on "wrong" solutions.
When a transducer is deterministic with respect to one projection or the other, it is called a sequential transducer.
A transducer is not sequential in general. In particular, if one language or the other (upper or lower) is not finite, there is no guarantee that a sequential transducer can be produced.

Transduction (2)
Example: bbab → ?
[Figure: search tree starting in state 0 that follows, level by level, the arcs whose upper symbol matches the next input symbol (b, b, a, b). Arc labels shown: b:b, b:ε, a:b, a:a, b:a]
Results:
  bbab → bbba
  bbab → ba

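The search over the upper projection can be sketched as a depth-first exploration with backtracking. As before, the arc table is an assumption: its endpoints, the initial state 0 and the final state 2 are reconstructed from the transductions quoted on the slides rather than read from the figure, and `generate` is a hypothetical name. The sketch also assumes the transducer has no cycle of ε-input arcs, otherwise the recursion would not terminate.

```python
EPS = ""  # stands for ε

# (source, upper, lower, target). ASSUMPTION: reconstructed arc endpoints.
ARCS = [
    (0, "a", "b", 1), (0, "b", "b", 0), (0, "b", EPS, 2),
    (1, "b", "a", 2), (1, "a", "a", 2),
    (2, "a", "a", 0), (2, "b", "a", 1),
]
INITIAL, FINALS = 0, {2}  # ASSUMPTION: inferred from the listed state paths


def generate(upper):
    """Return every lower string associated with `upper`."""
    results = set()

    def explore(state, pos, out):
        # Accept when the whole input is consumed in a final state.
        if pos == len(upper) and state in FINALS:
            results.add(out)
        for src, up, low, dst in ARCS:
            if src != state:
                continue
            if up == EPS:
                explore(dst, pos, out + low)       # arc consumes no input
            elif pos < len(upper) and upper[pos] == up:
                explore(dst, pos + 1, out + low)   # arc consumes one symbol

    explore(INITIAL, 0, "")
    return results


print(sorted(generate("bbab")))  # ['ba', 'bbba']
```

In state 0 the input symbol b matches two arcs (b:b and b:ε), so the upper projection is not deterministic even though the couple automaton is; branches that end outside a final state are simply abandoned.
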
Transduction (3)
Example: ? → ba
[Figure: search tree starting in state 0 that follows the arcs whose lower symbol matches the next output symbol (b, then a) or is ε. Arc labels shown: b:b, a:b, b:ε, a:a, b:a; two branches are marked FAIL]
Results:
  aa → ba
  ab → ba
  bbab → ba

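Analysis is the mirror image of generation: the search follows the lower projection and collects the upper symbols. The sketch below rests on the same assumptions as any reconstruction of this example: arc endpoints, initial state 0 and final state 2 are inferred from the transductions quoted on the slides, `analyse` is a hypothetical name, and the transducer is assumed to have no cycle of ε-output arcs (otherwise the set of analyses could be infinite and the recursion would not stop).

```python
EPS = ""  # stands for ε

# (source, upper, lower, target). ASSUMPTION: reconstructed arc endpoints.
ARCS = [
    (0, "a", "b", 1), (0, "b", "b", 0), (0, "b", EPS, 2),
    (1, "b", "a", 2), (1, "a", "a", 2),
    (2, "a", "a", 0), (2, "b", "a", 1),
]
INITIAL, FINALS = 0, {2}  # ASSUMPTION: inferred from the listed state paths


def analyse(lower):
    """Return every upper string associated with `lower`."""
    results = set()

    def explore(state, pos, inp):
        # Accept when the whole output is matched in a final state.
        if pos == len(lower) and state in FINALS:
            results.add(inp)
        for src, up, low, dst in ARCS:
            if src != state:
                continue
            if low == EPS:
                explore(dst, pos, inp + up)        # arc emits nothing
            elif pos < len(lower) and lower[pos] == low:
                explore(dst, pos + 1, inp + up)    # arc emits one symbol

    explore(INITIAL, 0, "")
    return results


print(sorted(analyse("ba")))  # ['aa', 'ab', 'bbab']
```

The ε-output arc is what lets an input such as bbab be longer than its output ba; branches that cannot match the next output symbol, or that finish outside a final state, correspond to the FAIL leaves of the search tree.
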