Introduction to Natural Language Processing (CS-431)
MORPHOLOGY – TRANSDUCERS
Martin Rajman (Martin.Rajman@epfl.ch) and Jean-Cédric Chappelier (Jean-Cedric.Chappelier@epfl.ch)
Artificial Intelligence Laboratory (LIA), I&C
Objectives of this lecture
➥ Present morphology, an important part of NLP
➥ Introduce transducers, tools for computational morphology
Contents
➥ Morphology
➥ Transducers
➥ Operations and Regular Expressions on Transducers
Morphology
Study of the internal structure and the variability of the words in a language:
✏ verb conjugation
✏ plurals
✏ nominalization (enjoy → enjoyment)
➜ inflectional morphology: preserves the grammatical category
   give, given, gave, gives, ...
➜ derivational morphology: changes the category
   process, processing, processable, processor, processability
Morphology (2)
Interest: use a priori knowledge about the structure of a word to decompose it into morphemes and produce additional syntactic and semantic information (about the current word).
☞ processable → 2 morphemes:
                         process-    -able
   meaning:              process     possible
   role:                 root        suffix
   semantic information: main        less
The importance and complexity of morphology vary from language to language.
Some information represented at the morphological level in English may be represented differently in other languages (and vice versa). The paradigmatic/syntagmatic distribution changes from one language to another.
Example in Chinese: ate → expressed as "eat yesterday"
Stems – Affixes
Words are decomposed into morphemes: roots (or stems) and affixes.
There are several kinds of affixes:
➊ prefixes: in- -credible
➋ suffixes: incred- -ible
➌ infixes:
   Example in Tagalog (Philippines): hingi (to borrow) → humingi (agent of the action)
   In slang English: "fucking" in the middle of a word, e.g. Man-fucking-hattan
➍ circumfixes:
   Example in German: sagen (to say) → gesagt (said)
Stems – Affixes (2)
Several affixes may be combined: for example in Turkish, where a word can have up to 10 (!) affixes.
uygarlaştıramadıklarımızdanmışsınızcasına
   uygar      laş   tır    ama       dık     lar  ımız   dan   mış    sınız  casına
   civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL  +P1PL  +ABL  +PAST  +2PL   +ASIF
   "as if you are among those whom we could not cause to become civilized"
When only prefixes and suffixes are involved: concatenative morphology.
Some languages are not concatenative:
• infixes
• pattern-based morphology
Example of Semitic languages
Pattern-based morphology
In Hebrew, verb morphology is based on the association of
• a root, often made of 3 consonants, which indicates the main meaning,
• and a vocalic structure (insertion of vowels) that refines the meaning.
Example: LMD (learn or teach)
   LAMAD → he was learning
   LUMAD → he was taught
Computational Morphology
Let us consider inflectional morphology, for instance for verbs and nouns.
Noun inflections: plural
   General rule: +s, but with several exceptions (e.g. foxes, mice)
Verb inflections: conjugations
• tense, mood
• regular/irregular
☞ How to handle inflections computationally?
Computational Morphology
Example:
   surface form: is
   canonical representation at the lexicon level (formalization): be+3+s+Ind+Pres
The objective of computational morphology tools is precisely to go from one to the other:
• Analysis: find the canonical representation corresponding to the surface form
• Generation: produce the surface form described by the canonical representation
Challenge: have a "good" implementation of these two transformations
Tools: associations of strings → transducers
String associations
$(X_1, X'_1), \ldots, (X_n, X'_n)$, for example (eaten, eat), (processed, process), ..., (thought, think)
Easy situation: $\forall i,\ |X_i| = |X'_i|$
   Example: (abc, ABC)
   ⇒ represented as a sequence of character transductions: (abc, ABC) = (a,A)(b,B)(c,C)
   ☞ strings on a new alphabet: strings of character couples
Not so easy: if $\exists i,\ |X_i| \neq |X'_i|$
   ⇒ requires the introduction of the empty string ε
   Example: (ab, ABC) ≃ (εab, ABC) = (ε,A)(a,B)(b,C)
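The character-couple representation can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `to_couples` is hypothetical, ε is encoded as the empty Python string, and padding the shorter string on the left is just one possible convention (it is the one used in the (ab, ABC) example).

```python
EPS = ""  # assumption: the empty Python string stands for ε


def to_couples(upper, lower):
    """Pair two strings character by character.

    The shorter string is padded with ε on the left (an arbitrary
    convention chosen for this sketch).
    """
    n = max(len(upper), len(lower))
    padded_upper = [EPS] * (n - len(upper)) + list(upper)
    padded_lower = [EPS] * (n - len(lower)) + list(lower)
    return list(zip(padded_upper, padded_lower))


# Same length: plain character-by-character pairing.
same_length = to_couples("abc", "ABC")
# Different lengths: ε is inserted in the shorter string.
padded = to_couples("ab", "ABC")
```

Since at most one of the two strings is padded, the couple (ε, ε) can never appear in the result.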
Dealing with ε
Where to put the ε?
Example: (ab, ABC) ≃ (εab, ABC)
   but also (ab, ABC) ≃ (aεb, ABC)
   or (ab, ABC) ≃ (abε, ABC)
General case: $\binom{n}{m}$ possibilities (with $m < n$)
Hard problem in general → need for a convention
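To make the count concrete, the sketch below enumerates every way of padding a string of length m with ε up to length n, and the number of results can be compared with the binomial coefficient. The helper name `epsilon_placements` is hypothetical, and the sketch assumes that only the shorter string receives ε.

```python
from itertools import combinations
from math import comb

EPS = "ε"


def epsilon_placements(short, n):
    """Return all paddings of `short` (length m <= n) to length n using ε."""
    m = len(short)
    results = []
    # Choose which m of the n positions keep a real character, in order.
    for positions in combinations(range(n), m):
        padded = [EPS] * n
        for pos, ch in zip(positions, short):
            padded[pos] = ch
        results.append("".join(padded))
    return results


placements = epsilon_placements("ab", 3)
count_matches = len(placements) == comb(3, 2)
```

For (ab, ABC) this yields exactly the three alignments listed in the example; the count grows quickly with n, which is why a convention is needed.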
Transducer (definition)
Let $\Sigma_1$ and $\Sigma_2$ be two enumerable sets (alphabets), and
$\Sigma = \big( (\Sigma_1 \cup \{\varepsilon\}) \times (\Sigma_2 \cup \{\varepsilon\}) \big) \setminus \{(\varepsilon, \varepsilon)\}$
A transducer is a DFSA (deterministic finite-state automaton) on $\Sigma$.
   $\Sigma_1$: "left" language  = upper language = input language
   $\Sigma_2$: "right" language = lower language = output language
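As a small illustration of the definition, the couple alphabet can be built explicitly when both alphabets are finite. This is only a sketch under that finiteness assumption; the name `couple_alphabet` is hypothetical and ε is encoded as the empty string.

```python
from itertools import product

EPS = ""  # assumption: the empty string stands for ε


def couple_alphabet(sigma1, sigma2):
    """Build ((Σ1 ∪ {ε}) × (Σ2 ∪ {ε})) \\ {(ε, ε)} for finite alphabets."""
    left = set(sigma1) | {EPS}
    right = set(sigma2) | {EPS}
    return set(product(left, right)) - {(EPS, EPS)}


sigma = couple_alphabet({"a", "b"}, {"a", "b"})
# With two symbols on each side there are 3 * 3 - 1 = 8 couples.
```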
Example
[Figure: transition diagram of a transducer with states 0, 1 and 2; the legend marks the initial state and the final state(s); arc labels: a:b, a, b:a, b, b:a, b:ε, a]
Some transductions (with the sequence of states visited):
   (bb, b)         [0,0,2]
   (ababb, baab)   [0,1,2,0,0,2]
Different usages of a transducer
➊ Association checking: (abba, baaa) ∈ Σ* ?
➋ Generation: string 1 → string 2, e.g. bbab → ?
➌ Analysis: string 2 → string 1, e.g. ? → ba
➊ is easy (= FSA: nothing special).
What about ➋ and ➌?
Transduction
Walk through the FSA following one or the other element of the couple (projections).
☞ Not deterministic in general! The fact that a transducer is a deterministic (couple-)FSA does not at all imply that the automaton resulting from one projection or the other is also deterministic.
Non-deterministic evaluation ⇒ the projection is not constant time (in general): backtracking on "wrong" solutions is needed.
When a transducer is deterministic with respect to one projection or the other, it is called a sequential transducer.
A transducer is not sequential in general. In particular, if one language or the other (upper or lower) is not finite, there is no guarantee that a sequential transducer can be produced.
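The non-determinism of a projection can be seen on a small example. The sketch below performs generation (upper string → lower strings) by exploring every matching arc and keeping only the paths that end in a final state. The transition table, the final state and the function name `generate` are assumptions made for illustration; they are not claimed to reproduce the lecture's figure. Arcs with ε on the upper side are not handled in this sketch.

```python
EPS = ""  # assumption: the empty string stands for ε

# Hypothetical transducer: (source state, upper symbol, lower symbol, target state).
# It is deterministic on couples, but state 0 has two arcs reading "b".
TRANSITIONS = [
    (0, "a", "b", 1),
    (0, "b", EPS, 0),
    (0, "b", "b", 2),
    (1, "b", "a", 2),
    (2, "a", "a", 0),
]
INITIAL = 0
FINALS = {2}


def generate(word, state=INITIAL, out=""):
    """Return the set of lower strings associated with the upper string `word`."""
    results = set()
    if not word and state in FINALS:
        results.add(out)
    for src, upper, lower, dst in TRANSITIONS:
        if src != state or upper == EPS:
            continue  # ε-upper arcs are skipped in this sketch
        if word.startswith(upper):
            # Explore this arc; failed branches simply contribute nothing.
            results |= generate(word[len(upper):], dst, out + lower)
    return results


outputs = generate("bab")  # two different paths succeed for this input
```

Each recursive call that returns an empty set corresponds to a "wrong" solution that is backtracked over. Analysis (lower → upper) could be written the same way by swapping the roles of the two symbols, but arcs such as b:ε then consume no symbol, so a loop on such an arc would require a bound on the search.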
Transduction (2)
Example: bbab → ?
[Figure: search tree over the upper string bbab, starting from state 0 and branching on the arcs that read each symbol (e.g. b:ε and b:b from state 0)]
Results:
   bbab → bbba
   bbab → ba
Transduction (3)
Example: ? → ba
[Figure: search tree over the lower string ba, starting from state 0; some branches end in FAIL]
Results:
   aa → ba
   ab → ba
   bbab → ba
Operations and Regular Expressions on Transducers
➮ All FSA regular expressions: concatenation, or, Kleene closure (*), ...
   Example (concatenation): "a:b c:a" recognizes ac and produces ba
➮ Cross-product of regular languages: $E_1 \otimes E_2$ recognizes $L_1 \times L_2$
   Example: a+ ⊗ b+ → $(a^n, b^m)\ \ \forall n \geq 1, m \geq 1$
   !! this is ≠ (a ⊗ b)+, which only contains the pairs $(a^n, b^n)$
➮ Composition of transducers: $T = T_1 \circ T_2$
   $(X_1, X_2) \in T \iff \exists Y : (X_1, Y) \in T_1 \text{ and } (Y, X_2) \in T_2$
➮ Reduction: extraction of the upper or the lower FSA
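The cross-product and composition operations can be illustrated on finite relations, represented here as Python sets of (upper, lower) string pairs. This is a sketch on truncated, finite samples of the languages, not an automaton-level construction; the function names and the sample relations `T1` and `T2` are hypothetical.

```python
from itertools import product


def cross_product(l1, l2):
    """Relation pairing every string of l1 with every string of l2."""
    return set(product(l1, l2))


def compose(t1, t2):
    """(x1, x2) is in the result iff some y has (x1, y) in t1 and (y, x2) in t2."""
    return {(x1, x2) for (x1, y) in t1 for (y2, x2) in t2 if y == y2}


# a+ ⊗ b+ versus (a ⊗ b)+, both truncated to at most 3 repetitions.
a_plus = {"a" * n for n in range(1, 4)}
b_plus = {"b" * m for m in range(1, 4)}
cross = cross_product(a_plus, b_plus)              # all (a^n, b^m)
paired = {("a" * n, "b" * n) for n in range(1, 4)}  # only (a^n, b^n)

# Composition of two small hypothetical relations.
T1 = {("a", "b"), ("aa", "bb")}
T2 = {("b", "c"), ("bb", "cc"), ("x", "y")}
T = compose(T1, T2)
```

The pair ("a", "bb") belongs to `cross` but not to `paired`, which is the difference the warning on the slide points at.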
(Other) examples of applications (morphology)
★ text-to-speech (grapheme-to-phoneme transduction)
★ specific lexicon representation (composition of some access and inverse functions)
★ filters (remove/add/modify marks; e.g. HTML)
★ text segmentation