statistical natural language processing



  1. Statistical Natural Language Processing
     Part of speech tagging
     Çağrı Çöltekin, Seminar für Sprachwissenschaft, University of Tübingen
     Summer Semester 2019
     POS tags and tagsets | Rule-based and TBL | ML approaches

     Part of speech tagging

       Time  flies  like  an   arrow  .
       NOUN  VERB   ADP   DET  NOUN   PUNC

     Traditional POS tags: what you learn in (primary?) school

       noun          apple, chair, book
       verb          go, read, eat
       adjective     blue, happy, nice
       adverb        well, fast, nicely
       determiner    a, the, some
       preposition   in, since, past, ago (?)
       pronoun       I, they, mine
       conjunction   and, or, since
       interjection  uh, ouch, hey

     With minor differences, this list of categories has been around for a long time.

     When we say ‘traditional’ …
     • Part of speech (POS or PoS) tags are morphosyntactic classes of words
     • The words belonging to the same POS class share some syntactic and morphological properties
     • POS tags in modern linguistics are based on Greek/Latin linguistic traditions
     • But others, e.g., Sanskrit linguists, also proposed POS tags

     What are the POS tags good for
     • Linguistic theory
     • Parsing
     • Speech synthesis: pronounce lead, wind, object, insult differently based on their POS tag
     • The same goes for machine translation
     • Information retrieval: if wug is a noun, also search for wugs
     • Text classification: improves some tasks
     • As a back-off strategy for some language models

     Open vs. closed class words
     • Open class words (e.g., nouns) are productive
       – they are typically ‘content’ words
       – new words coined are often in these classes
       – we often cannot rely on a fixed lexicon
     • Closed class words (e.g., determiners) are generally static
       – they are typically ‘function’ words
       – the lexicon does not grow
     • This distinction is often language dependent

     Some issues with traditional POS tags
     • Not all POS tags are observed in (or theorized for) all languages
     • Often finer granularity is necessary
       – book, water and Mary are all nouns, but
           The book is here      * The Mary is here
           We have water         * We have book

     POS tagsets in practice
     example: Penn treebank tagset
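The speech synthesis and information retrieval uses listed under "What are the POS tags good for" can be sketched in a few lines. This is only a toy illustration: the `STRESS` table, the function names `pronounce` and `expand_query`, and the add-"s" plural rule are assumptions made for the sketch, not something taken from the slides.

```python
# Toy illustration only: the table entries, the function names and the
# add-"s" plural rule are assumptions made for this sketch.

# Stress pattern of a word depends on its POS tag
# (the stressed syllable is written in capitals).
STRESS = {
    ("object", "NOUN"): "OB-ject",
    ("object", "VERB"): "ob-JECT",
    ("insult", "NOUN"): "IN-sult",
    ("insult", "VERB"): "in-SULT",
}


def pronounce(word, tag):
    """Return the POS-specific stress pattern, or the word itself if unknown."""
    return STRESS.get((word.lower(), tag), word)


def expand_query(word, tag):
    """Expand a search term: nouns also get a naive plural form."""
    terms = [word]
    if tag == "NOUN":
        terms.append(word + "s")  # naive rule, ignores irregular plurals
    return terms


if __name__ == "__main__":
    print(pronounce("object", "NOUN"))
    print(pronounce("object", "VERB"))
    print(expand_query("wug", "NOUN"))
```

In both functions the POS tag is the only piece of context used, which is why a tagging error propagates directly into the pronunciation or the search results.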

  2. POS tagsets in practice
     example 2: STTS tagset

       POS     description                          examples
       …       …                                    …
       KOUI    subordinating conjunction            um [zu leben], anstatt [zu fragen]
       KOUS    subordinating conjunction            weil, daß, damit, wenn, ob
       KON     coordinative conjunction             und, oder, aber
       KOKOM   particle of comparison, no clause    als, wie
       NN      noun                                 Tisch, Herr, [das] Reisen
       NE      proper noun                          Hans, Hamburg, HSV
       PDS     substituting demonstrative           dieser, jener
       PIS     substituting indefinite pronoun      keiner, viele, man, niemand
       PIAT    attributive indefinite               kein [Mensch], irgendein [Glas]
       PIDAT   attributive indefinite               [ein] wenig [Wasser]
       PPER    irreflexive personal pronoun         ich, er, ihm, mich, dir
       PPOSS   substituting possessive pronoun      meins, deiner
       PPOSAT  attributive possessive pronoun       mein [Buch], deine [Mutter]
       PRELS   substituting relative pronoun        [der Hund,] der
       PRELAT  attributive relative pronoun         [der Mann,] dessen [Hund]
       …       …                                    …

     POS tagset choices
     • The choice of tagsets depends on the language and application
     • Example tag set sizes (for English)
       – Brown corpus, 87 tags
       – Penn treebank, 45 tags
       – BNC, 61 tags
     • Differences can be large: for Chinese, the Penn treebank has 34 tags, but tagsets with about 300 tags exist
     • For other languages, the choice varies roughly between about 10 and a few hundred

     Shift towards more ‘universal’ tag sets
     • The variation in POS tagset choices often makes it difficult to
       – compare alternative approaches
       – use the same tools on different languages or data sets
     • There has been a recent trend for ‘universal’ tag sets
     • The result is a smaller POS tag set (back to the tradition)
     • But often supplemented with morphological features

     POS tagsets in recent practice
     example: Universal Dependencies tag set

       ADJ    adjective                  PART   particle
       ADP    adposition                 PRON   pronoun
       ADV    adverb                     PROPN  proper noun
       AUX    auxiliary                  PUNCT  punctuation
       CCONJ  coordinating conjunction   SCONJ  subordinating conjunction
       DET    determiner                 SYM    symbol
       INTJ   interjection               VERB   verb
       NOUN   noun                       X      other
       NUM    numeral

     Morphological features
     • Annotating words with morphological features has been common in (non-English) NLP
     • Morphological features give additional sub-categorization information for the word
     • For example
       – nouns typically have number and case features
       – verbs typically have tense, aspect, modality and voice features
       – adjectives typically have degree
     • Morphological feature sets change depending on the language (typology)

     Morphological features
     an example

       Time   NOUN  num=sing
       flies  VERB  num=sing pers=3 tense=pres
       like   ADP
       an     DET   def=ind
       arrow  NOUN  num=sing
       .      PUNC

     POS tags are ambiguous

       Time   flies  like  an   arrow  .
       NOUN   VERB   ADP   DET  NOUN   PUNC

       fruit  flies  like  an   apple  .
       NOUN   NOUN   VERB  DET  NOUN   PUNC

     POS tag ambiguity
     More examples
     • Some words are highly ambiguous
         ADJ   the back door
         NOUN  on our back
         ADV   take it back
         VERB  we will back them
     • The garden-path sentences are often POS ambiguities
       – The old man the boats
       – The complex houses married and single soldiers and their families

     Part of speech tagging is essentially an ambiguity resolution problem.
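To make the "ambiguity resolution" view concrete, the sketch below enumerates every tag sequence that a small lexicon allows for a sentence. The lexicon is a toy assumption: it lists only the readings that appear in the examples above (for instance, "flies" as NOUN or VERB and "like" as ADP or VERB), and the names `LEXICON` and `candidate_taggings` are hypothetical. A real tagger would have to choose one sequence from such a candidate set.

```python
from itertools import product

# Toy lexicon (an assumption for this sketch): possible tags per word,
# restricted to the readings shown in the examples above.
LEXICON = {
    "time": ["NOUN"],
    "fruit": ["NOUN"],
    "flies": ["NOUN", "VERB"],
    "like": ["ADP", "VERB"],
    "an": ["DET"],
    "arrow": ["NOUN"],
    "apple": ["NOUN"],
    ".": ["PUNC"],
    "back": ["ADJ", "NOUN", "ADV", "VERB"],
}


def candidate_taggings(tokens):
    """Enumerate every tag sequence the lexicon allows for the tokens."""
    options = [LEXICON[token.lower()] for token in tokens]
    return [list(zip(tokens, tags)) for tags in product(*options)]


if __name__ == "__main__":
    sentence = ["Time", "flies", "like", "an", "arrow", "."]
    for tagging in candidate_taggings(sentence):
        print(" ".join(f"{word}/{tag}" for word, tag in tagging))
```

With this lexicon the sentence has 1 × 2 × 2 × 1 × 1 × 1 = 4 candidate taggings, only one of which is the intended reading. The number of candidates grows multiplicatively with each ambiguous word, which is why taggers rely on context instead of a lexicon lookup alone.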
