  1. Statistical Natural Language Processing Part of speech tagging Çağrı Çöltekin University of Tübingen Seminar für Sprachwissenschaft Summer Semester 2017

  2. Part of speech tagging
     • Part of speech (POS or PoS) tags are morphosyntactic classes of words
     • The words belonging to the same POS class share some syntactic and morphological properties
     Example: Time/NOUN flies/VERB like/ADP an/DET arrow/NOUN ./PUNC

  3. Traditional POS tags: what you learn in (primary?) school
     noun          apple, chair, book
     verb          go, read, eat
     adjective     blue, happy, nice
     adverb        well, fast, nicely
     pronoun       I, they, mine
     determiner    a, the, some
     conjunction   and, or, since
     interjection  uh, ouch, hey
     preposition   in, since, past, ago (?)
     With minor differences, this list of categories has been around for a long time.

  4. When we say ‘traditional’ …
     • POS tags have been around for thousands of years
     • POS tags in modern linguistics are based on Greek/Latin linguistic traditions
     • But others, e.g., Sanskrit linguists, also proposed POS tags
     • The choice of POS tags is often language dependent

  5. What are the POS tags good for
     • Linguistic theory
     • Parsing
     • Speech synthesis: pronounce lead, wind, object, insult differently based on their POS tag
     • The same goes for machine translation
     • Information retrieval: if wug is a noun, also search for wugs
     • Text classification: improves many tasks

  6. Open vs. closed class words
     Open class words (e.g., nouns) are productive
     – new words coined are often in these classes
     – we often cannot rely on a fixed lexicon
     – they are typically ‘content’ words
     Closed class words (e.g., determiners) are generally static
     – the lexicon does not grow
     – they are typically ‘function’ words
     • This distinction is often language dependent

  7. Some issues with traditional POS tags
     • Not all POS tags are observed in (or theorized for) all languages
     • Often finer granularity is necessary
     – book, water and Mary are all nouns, but
         The book is here      * The Mary is here
         We have water         * We have book

  8. POS tagsets in practice, example: Penn treebank tagset

  9. POS tagsets in practice, example 2: STTS tagset
     POS     description                         examples
     …       …                                   …
     KOUI    subordinating conjunction           um [zu leben], anstatt [zu fragen]
     KOUS    subordinating conjunction           weil, daß, damit, wenn, ob
     KON     coordinative conjunction            und, oder, aber
     KOKOM   particle of comparison, no clause   als, wie
     NN      noun                                Tisch, Herr, [das] Reisen
     NE      proper noun                         Hans, Hamburg, HSV
     PDS     substituting demonstrative          dieser, jener
     PIS     substituting indefinite pronoun     keiner, viele, man, niemand
     PIAT    attributive indefinite              kein [Mensch], irgendein [Glas]
     PIDAT   attributive indefinite              [ein] wenig [Wasser],
     PPER    irreflexive personal pronoun        ich, er, ihm, mich, dir
     PPOSS   substituting possessive pronoun     meins, deiner
     PPOSAT  attributive possessive pronoun      mein [Buch], deine [Mutter]
     PRELS   substituting relative pronoun       [der Hund,] der
     PRELAT  attributive relative pronoun        [der Mann ,] dessen [Hund]
     …       …                                   …

  10. POS tagset choices
     • The choice of tagset depends on the language and application
     • Example tag set sizes (for English)
     – Brown corpus, 87 tags
     – Penn treebank, 45 tags
     – BNC, 61 tags
     • Differences can be large: for Chinese, the Penn treebank has 34 tags, but tagsets with about 300 tags exist
     • For other languages, the size varies roughly from about 10 to a few hundred

  11. Shift towards more ‘universal’ tag sets
     • The variation in POS tagset choices often makes it difficult to
     – compare alternative approaches
     – use the same tools on different languages or data sets
     • There has been a recent trend for ‘universal’ tag sets
     • The result is a smaller POS tag set (back to the tradition)
     • But often supplemented with morphological features

  12. POS tagsets in recent practice, example: Universal Dependencies tag set
     ADJ    adjective
     ADP    adposition
     ADV    adverb
     AUX    auxiliary
     CCONJ  coordinating conjunction
     DET    determiner
     INTJ   interjection
     NOUN   noun
     NUM    numeral
     PART   particle
     PRON   pronoun
     PROPN  proper noun
     PUNCT  punctuation
     SCONJ  subordinating conjunction
     SYM    symbol
     VERB   verb
     X      other
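One way to make the move from a fine-grained tagset to a universal one concrete is a lookup table. The sketch below is only illustrative: the mapping is a small, partial selection of Penn treebank tags chosen for this example, and the names `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical. Some tags cannot be mapped by a table alone (for example, Penn `IN` covers both adpositions and subordinating conjunctions), so real conversions also need context.

```python
# Partial, illustrative mapping from Penn treebank tags to universal tags.
# The selection is an assumption made for this sketch, not a complete table.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "CC": "CCONJ",
    "CD": "NUM",
    "UH": "INTJ",
}


def to_universal(penn_tags):
    """Map each fine-grained tag to a coarse tag; unknown tags become 'X' (other)."""
    return [PENN_TO_UNIVERSAL.get(tag, "X") for tag in penn_tags]


print(to_universal(["DT", "NN", "VBZ", "JJR"]))
```

Several fine-grained tags collapse into one coarse tag here, which is why universal tag sets are usually supplemented with morphological features.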

  13. Morphological features
     • Annotating words with morphological features has been common in (non-English) NLP
     • Morphological features give additional sub-categorization information for the word
     • For example
     – nouns typically have a number feature
     – adjectives typically have degree
     – verbs typically have tense, aspect, modality, voice features
     • Morphological feature sets change depending on the language (typology)

  14. Morphological features: an example
     Time   NOUN  num=sing
     flies  VERB  num=sing pers=3 tense=pres
     like   ADP
     an     DET   def=ind
     arrow  NOUN  num=sing
     .      PUNC
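A tagged sentence with morphological features can be held in a simple data structure. The following is a minimal sketch, assuming a hypothetical `Token` class with a coarse tag plus a dictionary of feature-value pairs; the `|`-separated string format and the `_` placeholder for "no features" are choices made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A word form with its POS tag and optional morphological features."""
    form: str
    pos: str
    feats: dict = field(default_factory=dict)

    def feats_str(self):
        # Features sorted by name, e.g. "num=sing|pers=3"; "_" if there are none.
        return "|".join(f"{k}={v}" for k, v in sorted(self.feats.items())) or "_"


# The example sentence from the slide.
sentence = [
    Token("Time", "NOUN", {"num": "sing"}),
    Token("flies", "VERB", {"num": "sing", "pers": "3", "tense": "pres"}),
    Token("like", "ADP"),
    Token("an", "DET", {"def": "ind"}),
    Token("arrow", "NOUN", {"num": "sing"}),
    Token(".", "PUNC"),
]

for tok in sentence:
    print(tok.form, tok.pos, tok.feats_str())
```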

  15. POS tags are ambiguous
     Time/NOUN   flies/VERB  like/ADP   an/DET  arrow/NOUN  ./PUNC
     fruit/NOUN  flies/NOUN  like/VERB  an/DET  apple/NOUN  ./PUNC
     POS tagging is essentially an ambiguity resolution problem.

  16. POS tag ambiguity: more examples
     • Some words are highly ambiguous
         ADJ   the back door
         NOUN  on our back
         ADV   take it back
         VERB  we will back them
     • Garden-path sentences are often POS ambiguities
     – The horse raced past the barn fell
     – The complex houses married and single soldiers and their families
     – The old man the boats

  17. POS tagging: strategies
     POS tagging can be approached with a number of different methods
     • Rule-based methods: ‘constraint grammar’ (CG)
     • Transformation based: Brill tagger
     • Machine-learning approaches
     Typical statistical approaches involve sequence learning methods:
     – Hidden Markov models
     – Conditional random fields
     – (Recurrent) neural networks

  18. Rule-based POS tagging: typical approach
     • Using a tag lexicon, start with assigning all possible tags to each word
     • Eliminate tags based on hand-crafted rules
     • Rules typically rely on the words and (potential) tags of the words in the context
     • Result is not always full disambiguation, some ambiguity may remain
     • Some probabilistic constraints may also be applied
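The steps above can be sketched in a few lines. This is a toy illustration, not a constraint grammar implementation: the lexicon, the single rule `no_verb_after_det`, and all function names are hypothetical and chosen only to show candidate assignment followed by elimination, with some ambiguity left over.

```python
# Hypothetical toy tag lexicon: every tag each word form may take.
LEXICON = {
    "the": {"DET"},
    "back": {"ADJ", "NOUN", "ADV", "VERB"},
    "door": {"NOUN"},
}


def assign_candidates(words):
    """Step 1: start with all tags the lexicon allows ('X' for unknown words)."""
    return [set(LEXICON.get(w.lower(), {"X"})) for w in words]


def no_verb_after_det(words, cands, i):
    """Illustrative hand-crafted rule: a word right after an unambiguous
    determiner is not a verb. Returns the set of tags to eliminate."""
    if i > 0 and cands[i - 1] == {"DET"}:
        return {"VERB"}
    return set()


def disambiguate(words, rules):
    """Step 2: apply the rules until nothing changes.
    A rule is never allowed to remove the last remaining candidate."""
    cands = assign_candidates(words)
    changed = True
    while changed:
        changed = False
        for i in range(len(words)):
            for rule in rules:
                to_remove = rule(words, cands, i) & cands[i]
                if to_remove and cands[i] - to_remove:
                    cands[i] -= to_remove
                    changed = True
    return cands


result = disambiguate(["the", "back", "door"], [no_verb_after_det])
print(result)
```

In this run the VERB reading of back is eliminated, but ADJ, NOUN and ADV all survive, which illustrates that rule-based tagging may leave some ambiguity unresolved.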

  19. Rule-based POS tagging: an example
     • Among others, the word that can be
         ADV    it is not that bad
         SCONJ  we know that it is bad
     An example rule for disambiguation (very simplified):
         1  if the next word is ADJ, ADV
         2     and the following word is at the sentence boundary
         3     and the previous word is not a verb like ‘consider’
         4  then eliminate SCONJ
         5  else eliminate ADV
     Line 2 eliminates non-sentence-final ADV
     Line 3 eliminates cases like I consider that funny.
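The simplified rule for that can be written down directly. The sketch below rests on several assumptions that are not in the slide: condition 2 is read as "the position after the next word is the end of the sentence or a final punctuation mark", the set `CONSIDER_LIKE` contains only forms of consider, and only the two candidate tags ADV and SCONJ are modelled. The function names are hypothetical.

```python
SENTENCE_FINAL = {".", "!", "?"}
# Assumption: the only 'consider'-like verb forms known to this sketch.
CONSIDER_LIKE = {"consider", "considers", "considered"}


def at_boundary(words, j):
    """True if position j is past the end or holds sentence-final punctuation."""
    return j >= len(words) or words[j] in SENTENCE_FINAL


def resolve_that(words, candidates, i):
    """Apply the simplified rule to the word 'that' at index i.

    candidates[k] is the set of tags still possible for words[k].
    Returns the tags that remain for 'that'."""
    next_is_adj_or_adv = (
        i + 1 < len(words) and bool(candidates[i + 1] & {"ADJ", "ADV"})
    )                                                        # condition 1
    then_boundary = at_boundary(words, i + 2)                # condition 2
    prev_not_consider = (
        i == 0 or words[i - 1].lower() not in CONSIDER_LIKE
    )                                                        # condition 3
    if next_is_adj_or_adv and then_boundary and prev_not_consider:
        return candidates[i] - {"SCONJ"}                     # line 4
    return candidates[i] - {"ADV"}                           # line 5


THAT = {"ADV", "SCONJ"}

s1 = "it is not that bad .".split()
c1 = [{"PRON"}, {"AUX"}, {"PART"}, set(THAT), {"ADJ"}, {"PUNC"}]
print(resolve_that(s1, c1, 3))

s2 = "we know that it is bad .".split()
c2 = [{"PRON"}, {"VERB"}, set(THAT), {"PRON"}, {"AUX"}, {"ADJ"}, {"PUNC"}]
print(resolve_that(s2, c2, 2))

s3 = "I consider that funny .".split()
c3 = [{"PRON"}, {"VERB"}, set(THAT), {"ADJ"}, {"PUNC"}]
print(resolve_that(s3, c3, 2))
```

For the third sentence, condition 3 blocks the ADV reading even though conditions 1 and 2 hold. A fuller tagset would also offer a pronoun reading for that, which this two-tag sketch leaves out.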

  20. Transformation based tagging
     • The idea:
     – Start with assigning the most probable POS tag to all words
     – Apply a set of rules (similar to CG) from more specific to less specific
     • The rules are learned

  21. Transformation based learning: an example
     • Start with most likely POS tags:
         Time/NOUN  flies/VERB  like/VERB  an/DET  arrow/NOUN  ./PUNC
       ‘like’ is more likely to be a VERB than ADP
     • Apply rule: change VERB to ADP if preceding word is tagged as VERB
         Time/NOUN  flies/VERB  like/ADP   an/DET  arrow/NOUN  ./PUNC
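The two steps of this example can be sketched as follows. The table `MOST_LIKELY` simply hard-codes the initial tags from the slide (a real tagger would estimate them from a tagged corpus), and the rule is passed in by hand rather than learned. The function names are hypothetical, and checking the rule's condition against the tags as they were before the rule fired is a design choice of this sketch.

```python
# Most likely tag per word, hard-coded from the example above (an assumption;
# in practice these would be counted from a tagged corpus).
MOST_LIKELY = {
    "time": "NOUN", "flies": "VERB", "like": "VERB",
    "an": "DET", "arrow": "NOUN", ".": "PUNC",
}


def initial_tags(words):
    """Step 1: give every word its most likely tag ('X' if unknown)."""
    return [MOST_LIKELY.get(w.lower(), "X") for w in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Step 2: change from_tag to to_tag wherever the preceding word
    is tagged prev_tag. Conditions are checked on the input tags."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


words = ["Time", "flies", "like", "an", "arrow", "."]
start = initial_tags(words)
final = apply_rule(start, from_tag="VERB", to_tag="ADP", prev_tag="VERB")
print(list(zip(words, start)))
print(list(zip(words, final)))
```

Only like changes: flies is also a VERB, but its preceding word is tagged NOUN, so the rule does not fire there.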


