Accelerated Natural Language Processing (ANLP)
Lecture 2: Morphology
Sharon Goldwater (based on slides by Philipp Koehn)
17 September 2019

Two plots from last time
[Plots not reproduced here.]

How Many Different Words?
10,000 sentences from the Europarl corpus

Language      Different words
English       16k
French        22k
Dutch         24k
Italian       25k
Portuguese    26k
Spanish       26k
Danish        29k
Swedish       30k
German        32k
Greek         33k
Finnish       55k

Why the difference? Morphology.

Today’s Lecture
• What is morphology, how does it differ across languages, and why does it matter for NLP?
• What’s the difference between a stem, lemma, and affix?
• What are the characteristics of derivational and inflectional morphology?
• What is an FSM, and what is the relationship between FSMs and regular languages?
Interlude/reminder: types and tokens
The word "word" is ambiguous.
• Word type: “10k sentences from English Europarl have 16k different words” (unique strings, lexical items)
• Word token: “English Europarl has 54m words” (possibly repeated instances)
Example: "a cat and a brown dog chased a black dog" has 10 tokens, 7 types.

What is morphology?
The study of wordforms and word formation.
• Structured relationships between words:
  play, played, replay, player
  played, walked, jumped
• Units of meaning (morphemes) and their ordering (morphotactics):
  de+salin+ate+ion but not ate+salin+ion+de

Why does morphology matter?
• Information retrieval: return pages with related forms.
• Language modelling: make predictions about unseen words.
• Machine translation and language understanding: morphology signals differences in meaning (which might be expressed using word order in other languages).

Why does morphology matter?
Example (Russian):
  zhenshin+a   devochk+e   dala   knigu
  woman+NOM    girl+DAT    gave   book+ACC
  ‘the woman gave the girl a book’
vs.
  zhenshin+e   devochk+a   dala   knigu
  woman+DAT    girl+NOM    gave   book+ACC
  ‘the girl gave the woman a book’
A noun’s case marking (a kind of morphology) indicates its role in the sentence, where English uses word order and prepositions.
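Returning to the type/token distinction from the interlude above, the counts for the example sentence can be checked with a short sketch. Splitting on whitespace is an assumption made here for simplicity; real corpora such as Europarl need a proper tokenizer, and the variable names are illustrative only.

```python
from collections import Counter

sentence = "a cat and a brown dog chased a black dog"

# Tokens: every running word, repeats included.
# (Whitespace splitting is a simplifying assumption.)
tokens = sentence.split()

# Types: the distinct strings among those tokens, with their frequencies.
type_counts = Counter(tokens)

num_tokens = len(tokens)
num_types = len(type_counts)

print(f"{num_tokens} tokens, {num_types} types")  # 10 tokens, 7 types
```

The repeated words ("a" three times, "dog" twice) are what make the token count larger than the type count.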
Morphemes: Stems and Affixes
• Two types of morphemes
  – stems: small, cat, walk
  – affixes: +ed, un+
• Four types of affixes
  – suffix
  – prefix
  – infix
  – circumfix

Stems vs. Lemmas
• Lemma: the canonical form or dictionary form of a set of words
  – fly, flies, flew and flying all have the lemma fly.
  – walk, walks, walked and walking all have the lemma walk.
  – walker, walkers have the lemma walker.
• Stem: definitions can vary, but often: the part of the word that is common to all its variants
  – stem of produce, production is produc.
  – stem of walk, walks, walked, walking, walker, walkers is walk.
  – Do fly, flies, flew, flying have a common stem fl? Or maybe only fly and flying share a stem: fly.
Decision may depend on application.

Suffix
• Plural of nouns: cat+s
• Comparative and superlative of adjectives: small+er
• Formation of adverbs: great+ly
• Verb tenses: walk+ed
• All inflectional morphology in English uses suffixes
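One way to make the "part of the word common to all its variants" definition of a stem concrete is to take the longest shared initial substring. This is only a sketch of that one definition, not a real stemmer; the function name common_stem is hypothetical, and the approach assumes the variation is all at the end of the word.

```python
import os

def common_stem(wordforms):
    """Longest initial substring shared by all the given wordforms."""
    # os.path.commonprefix compares strings character by character,
    # so it works on any list of strings, not only on paths.
    return os.path.commonprefix(list(wordforms))

print(common_stem(["produce", "production"]))
print(common_stem(["walk", "walks", "walked", "walking", "walker", "walkers"]))
print(common_stem(["fly", "flies", "flew", "flying"]))
print(common_stem(["fly", "flying"]))
```

The last two calls reproduce the dilemma on the slide: including the irregular form flew shrinks the shared part to fl, whereas fly and flying alone share fly.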
Prefix
• In English: these typically change the meaning
• Adjectives: un+friendly, dis+interested
• Verbs: re+consider
• Some languages use prefixing much more widely

Other types of morphology
Mainly in non-English languages; check textbook or online.
• Infixes
• Circumfixes
• Reduplication
• Root and pattern

Not that easy...
• Affixes are not always simply attached
• In writing, some letters may be changed/added/removed
  – walk+ed
  – frame+d
  – emit+ted
  – carr(-y)+ied
• In speaking, some sounds may be changed/added/removed
  – Compare the final sound: cats [s] vs dogs [z] vs foxes [əz]

Irregular Forms
• Some words have irregular forms:
  – is, was, been
  – eat, ate, eaten
  – go, went, gone
• Irregular forms tend to be the most frequent (and vice versa)
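The spelling changes and irregular forms above can be sketched as a small rule cascade for the English past tense. This is a toy illustration under stated assumptions: the two lexicons are hypothetical and tiny, and consonant doubling is listed by hand because it depends on stress, which spelling alone does not reveal.

```python
# Hypothetical mini-lexicons, for illustration only.
IRREGULAR_PAST = {"go": "went", "eat": "ate", "is": "was"}
DOUBLING_VERBS = {"emit"}  # final consonant doubles before +ed

VOWELS = set("aeiou")

def past_tense(verb):
    """Return a past-tense spelling using a few hand-written rules."""
    if verb in IRREGULAR_PAST:          # irregular: look the form up
        return IRREGULAR_PAST[verb]
    if verb in DOUBLING_VERBS:          # emit -> emit+t+ed
        return verb + verb[-1] + "ed"
    if verb.endswith("e"):              # frame -> frame+d
        return verb + "d"
    if len(verb) >= 2 and verb.endswith("y") and verb[-2] not in VOWELS:
        return verb[:-1] + "ied"        # carry -> carr(-y)+ied
    return verb + "ed"                  # walk -> walk+ed

for v in ["walk", "frame", "emit", "carry", "go"]:
    print(v, "->", past_tense(v))
```

The irregular lookup has to come first: without it, go would fall through to the default rule and come out as goed.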
Inflectional Morphology
• In English, we inflect
  – nouns for count (plural: +s) and for possessive case (+’s)
  – verbs for tense (+ed, +ing) and a special 3rd person singular present form (+s)
  – adjectives in comparative (+er) and superlative (+est) forms.
• In German, we inflect
  – nouns for count and case
  – verbs for tense, person, and count
  – adjectives for count, case, gender, and definiteness
  – determiners for count, case and gender

Forms of the German "the"

Case                              Singular             Plural
                                  male  fem.  n.       male  fem.  n.
nominative (subject, s)           der   die   das      die   die   die
genitive (possessive, p)          des   der   des      der   der   der
dative (indirect object, io)      dem   der   dem      den   den   den
accusative (direct object, o)     den   die   das      die   die   die

Phrase/role: [The A]/s put [the B]/o [of the C]/p [on the D]/io
Not only many different forms, but each form is highly ambiguous.

Inflectional vs. Derivational Morphology
• Inflectional morphology typically
  – does not change basic meaning or part of speech
  – expresses grammatical features or relations between words
  – applies to all words of the same part of speech
• Derivational morphology
  – may change the part of speech or meaning of a word
  – is not driven by syntactic relations outside the word
  – may be “picky”: drama+(t)ize but not traged(-y)+ize
  – applies closer to the stem, whereas inflection occurs at word edges: govern+ment+s, centr+al+ize+d

Derivational Morphology
• Changing the part of speech, e.g. noun to verb
  word → wordify
• Is it a real word?
• Consulting Google (a few years ago):
  – 8,840 hits: e.g., wordify mugs, tshirts and magnets
• Google now returns over 75k hits. (Why?)
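The ambiguity of the German article forms in the table above can be made explicit by inverting the table, so that each form maps to every (case, number and gender) cell it fills. The sketch below just re-enters the table by hand; the names FORMS, COLUMNS and readings are illustrative.

```python
from collections import defaultdict

CASES = ["nominative", "genitive", "dative", "accusative"]
COLUMNS = ["sg male", "sg fem.", "sg n.", "pl male", "pl fem.", "pl n."]

# One row per case, in the column order given above (copied from the table).
FORMS = {
    "nominative": "der die das die die die",
    "genitive":   "des der des der der der",
    "dative":     "dem der dem den den den",
    "accusative": "den die das die die die",
}

# Invert the table: form -> list of (case, column) cells it can realise.
readings = defaultdict(list)
for case in CASES:
    for column, form in zip(COLUMNS, FORMS[case].split()):
        readings[form].append((case, column))

# Most ambiguous forms first.
for form, cells in sorted(readings.items(), key=lambda kv: -len(kv[1])):
    print(f"{form}: {len(cells)} cells")
```

Six distinct forms cover 24 cells, so seeing "der" or "die" alone tells a parser little about case, number or gender.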
Derivational Morphology
• Changing the verb back to a noun
  wordify → wordification (8k hits on Google)
• A person/thing who engages in wordification
  wordification → wordificator (was 8 hits, now 21k: another app!)
• A person/thing who wordifies
  wordify → wordifier (1500 hits on Google)
• What is the difference between a wordifier and a wordificator?

Derivational Morphology
• Turning wordification into an ideology:
  wordification → wordificationism (was just 1 hit:)
  “I think you’re confusing the term “Democracy” with “Capitalism”; I think you mean “Has Capitalism failed”? No. It hasn’t.”
  “I agree, Hambone; I’m just trying to correct the wordificationism.”
  “Where in the world did you get the word “wordificationism”? Not in the Merriam-Webster dictionary, not in the Thesaurus...”

Derivational Morphology
• An adherent of wordificationism
  wordificationism → wordificationist
• Used to have 0 hits on Google, now you get these slides!
• We created a new word!

Compounds
• Creating new words by merging multiple words
• (Somewhat) rare in English
  home work → homework
  web site → website
• More common in other languages (like German)
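The wordify chain above can be reproduced by applying derivational suffixes one after another, each with its own small spelling adjustment. This is a toy sketch: the functions attach and derive are hypothetical, and the handful of rules covers only the forms on these slides, not English derivation in general.

```python
def attach(base, suffix):
    """Attach one derivational suffix, with a few hand-written spelling rules."""
    if suffix == "ation" and base.endswith("ify"):
        return base[:-1] + "ication"   # wordify -> wordification
    if suffix == "or" and base.endswith("ation"):
        return base[:-3] + "or"        # wordification -> wordificator
    if suffix == "er" and base.endswith("y"):
        return base[:-1] + "ier"       # wordify -> wordifier
    if suffix == "ist" and base.endswith("ism"):
        return base[:-3] + "ist"       # wordificationism -> wordificationist
    return base + suffix               # default: plain concatenation

def derive(base, suffixes):
    """Apply suffixes in order and return every intermediate form."""
    forms = [base]
    for suffix in suffixes:
        forms.append(attach(forms[-1], suffix))
    return forms

chain = derive("word", ["ify", "ation", "ism", "ist"])
print(" -> ".join(chain))
```

Each step takes the output of the previous one as its base, which is one way to see why derivational affixes sit closer to the stem than inflectional ones.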
Acronyms/Initialisms
• Wikileaks / Guardian, document 2007-081-100110-0444:
  OGA operating in TF Catamount sector moved into Malekshay for operation. LN Shum Khan ran at the sight of the approaching CFA’s. CF utilized the escalation of force doctrine and shouted to stop, fired warning shots and then fired to wound. The LN was hit in the ankle and treated by Element medics on scene. It was determined through discussions with local Elders that the man was a deaf mute that was nervous of the CF operation. Solatia was made in the form of supplies and the Element mission progressed

Morphology differs across languages
• Usually a trade-off between morphology and syntax (word order)
  – Some languages have no verb tenses
    → use explicit time references (yesterday)
  – Case inflection determines roles of noun phrase
    → use fixed word order instead
    → use prepositional phrases instead of cased noun phrases
• Examples from the World Atlas of Language Structures (wals.info)
  – prefixes vs. suffixes
  – cases (zero to more than ten)
  – past tense remoteness distinctions