Lecture 9: Part of Speech
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS6501 Natural Language Processing
This lecture
- Parts of speech (POS)
- POS tagsets
Parts of Speech
- Traditional parts of speech
  - About 8 of them
POS examples
Tag   Category      Examples
N     noun          chair, bandwidth, pacing
V     verb          study, debate, munch
ADJ   adjective     purple, tall, ridiculous
ADV   adverb        unfortunately, slowly
P     preposition   of, by, to
PRO   pronoun       I, me, mine
DET   determiner    the, a, that, those
Parts of Speech
- A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, ...
- Lots of debate within linguistics about the number, nature, and universality of these
POS Tagging
- The process of assigning a part of speech to each word in a collection (sentence).
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
Why is POS Tagging Useful?
- First step of a vast number of practical tasks
- Parsing
  - Need to know if a word is an N or V before you can parse
- Information extraction
  - Finding names, relations, etc.
- Speech synthesis/recognition (stress placement depends on the part of speech)
  - OBject vs. obJECT
  - OVERflow vs. overFLOW
  - DIScount vs. disCOUNT
  - CONtent vs. conTENT
- Machine translation
Open and Closed Classes
- Closed class: a small, fixed membership
  - Prepositions: of, in, by, …
  - Pronouns: I, you, she, mine, his, them, …
  - Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created
  - English has 4: nouns, verbs, adjectives, adverbs
  - Many languages have these 4, but not all!
Open Class Words
- Nouns
  - Proper nouns (Boulder, Granby, Eli Manning)
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don't get counted (snow, salt, communism) (*two snows)
- Verbs
  - In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
- prepositions: on, under, over, …
- particles: up, down, on, off, …
- determiners: a, an, the, …
- pronouns: she, who, I, …
- conjunctions: and, but, or, …
- auxiliary verbs: can, may, should, …
- numerals: one, two, three, third, …
Prepositions from CELEX
- CELEX: online dictionary
- Frequency counts are from the COBUILD 16-million-word corpus
English Particles
Conjunctions
Choosing a Tagset
- Could pick a very coarse tagset
  - N, V, Adj, Adv, Other
- A more commonly used set is finer grained
  - E.g., the “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG
  - Brown corpus, 87 tags
- Prague Dependency Treebank (Czech)
  - 4452 tags
  - AAFP3----3N----: (nejnezajímavějším) Adj Regular Feminine Plural … Superlative [Hajic 2006, VMC tutorial]
Penn TreeBank POS Tagset
Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
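The word/TAG notation above is easy to read programmatically. The following is a minimal sketch, assuming whitespace-separated tokens in which the last slash separates the word from its tag; the helper name `parse_tagged` is hypothetical and not part of any library.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps its slash and only the tag is cut off.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
    "number/NN of/IN other/JJ topics/NNS ./."
)
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]
print(tags)
```

Keeping words and tags as parallel lists makes it straightforward to compare a tagger's output against the gold tags position by position.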
Universal Tag Set
- About 12 different tags
- NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, “.”, X
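A fine-grained tagset can be collapsed onto the universal tags with a lookup table. The sketch below covers only a subset of Penn Treebank tags, and sending every unlisted tag to X is a simplifying assumption made here for illustration; the names `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical.

```python
# Partial Penn Treebank -> universal tag mapping (illustrative subset only).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT",
    ".": ".",
}


def to_universal(penn_tags):
    """Map Penn tags to universal tags; unlisted tags fall back to 'X' (assumption)."""
    return [PENN_TO_UNIVERSAL.get(tag, "X") for tag in penn_tags]


penn = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
universal = to_universal(penn)
print(universal)
```

The mapping is many-to-one, so it loses information: NN and NNS both become NOUN, and the original Penn tags cannot be recovered from the universal ones.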
POS Tagging vs. Word Clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
(Examples from Dekang Lin)
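One way to see why a fixed word-to-class assignment is not enough is to collect the set of tags observed for each word. This small sketch uses only the four "back" examples from the slide; the variable names are illustrative.

```python
from collections import defaultdict

# (context, target word, Penn tag of the target word), taken from the slide.
observations = [
    ("The back door", "back", "JJ"),
    ("On my back", "back", "NN"),
    ("Win the voters back", "back", "RB"),
    ("Promised to back the bill", "back", "VB"),
]

# Collect every tag seen for each word.
tags_by_word = defaultdict(set)
for _context, word, tag in observations:
    tags_by_word[word].add(tag)

# A word is ambiguous if it was seen with more than one tag.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'back': ['JJ', 'NN', 'RB', 'VB']}
```

A clustering that places "back" in a single class would be wrong for at least three of these four contexts, which is why tagging has to look at the surrounding words.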
How Hard is POS Tagging?
POS Tag Sequences
- Some tag sequences are more likely to occur than others
- POS Ngram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
- Existing methods often model POS tagging as a sequence tagging problem
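The claim that some tag sequences are more likely than others can be checked by counting tag bigrams. The sketch below uses the tag sequence of the koala sentence from the POS Tagging slide and estimates P(tag | previous tag) by relative frequency; a single sentence is far too little data for real estimates, and the function name `transition_prob` is hypothetical.

```python
from collections import Counter

# Tags of "the koala put the keys on the table" from the POS Tagging slide.
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]

# Count adjacent tag pairs, and how often each tag appears as the first of a pair.
bigrams = Counter(zip(tags, tags[1:]))
prev_counts = Counter(tags[:-1])


def transition_prob(prev_tag, tag):
    """Relative-frequency estimate of P(tag | prev_tag); 0.0 if prev_tag is unseen."""
    if prev_counts[prev_tag] == 0:
        return 0.0
    return bigrams[(prev_tag, tag)] / prev_counts[prev_tag]


print(transition_prob("DET", "N"))  # 1.0
print(transition_prob("N", "V"))    # 0.5
print(transition_prob("DET", "V"))  # 0.0
```

In this sentence every determiner is followed by a noun, while a noun is followed by a verb only half the time. Sequence models for tagging exploit exactly this kind of regularity.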
Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on the Penn Treebank
  - State of the art: ~97%
  - Trivial baseline (most likely tag): ~94%
  - Human performance: ~97%
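The trivial baseline and the accuracy metric can both be written in a few lines. This is a minimal sketch: the training and test sentences are made-up toy data, tagging unknown words as NN is an assumption chosen for illustration, and the resulting number says nothing about the Penn Treebank figures above.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """For each word, remember the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_words(words, model, default_tag="NN"):
    """Tag each word independently; unknown words get default_tag (assumption)."""
    return [model.get(word.lower(), default_tag) for word in words]


def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy training data (invented for this example).
train = [
    [("the", "DT"), ("koala", "NN"), ("put", "VBD"), ("the", "DT"),
     ("keys", "NNS"), ("on", "IN"), ("the", "DT"), ("table", "NN")],
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("the", "DT"), ("keys", "NNS"), ("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("the", "DT"), ("koala", "NN"), ("on", "IN"), ("my", "PRP$"), ("back", "NN")],
]
model = train_most_frequent_tag(train)

# Toy test sentence (also invented): "back" is an adjective here.
test_words = ["the", "koala", "put", "the", "keys", "on", "the", "back", "door"]
test_gold = ["DT", "NN", "VBD", "DT", "NNS", "IN", "DT", "JJ", "NN"]
predicted = tag_words(test_words, model)
print(predicted)
print(accuracy(test_gold, predicted))
```

The baseline gets 8 of the 9 toy tokens right and misses only "back", which it tags NN because that was the more frequent tag in training. Because it ignores context, it fails on exactly the ambiguous words that sequence models are meant to resolve.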