

  1. IN4080 – 2020 FALL NATURAL LANGUAGE PROCESSING, Jan Tore Lønning

  2. Tagging and sequence labeling, Lecture 7, 28 Sept

  3. Today
     - Tagged text and tag sets
     - Tagging as sequence labeling
     - HMM-tagging
     - Discriminative tagging
     - Neural sequence labeling

  4. Tagged text and tagging
     [('They', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('saw', 'NN'), ('.', '.')]
     [('They', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('saw', 'VB'), ('.', '.')]
     [('They', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('log', 'NN')]
     - In tagged text, each token is assigned a “part of speech” (POS) tag
     - A tagger is a program which automatically assigns tags to the words in a text
     - From the context we are (most often) able to determine the tag
     - But some sentences are genuinely ambiguous, and hence so are the tags
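A minimal sketch (not from the slides) of what tagging looks like with NLTK's off-the-shelf tagger; it assumes NLTK is installed and the pre-trained English model has been downloaded:

```python
import nltk

# Assumes the pre-trained English tagger model is available, e.g. via
# nltk.download('averaged_perceptron_tagger').
tokens = ['They', 'saw', 'a', 'saw', '.']
print(nltk.pos_tag(tokens))
# The two occurrences of 'saw' should get different tags (roughly VBD for
# the verb and NN for the noun), since the tag is determined by context.
```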

  5. Various POS tag sets
     - A tagged text is tagged according to a fixed, small set of tags
     - There are various such tag sets
     - Brown tagset:
       - Original: 87 tags
       - Versions with extended tags: <original>-<more>
       - Comes with the Brown corpus in NLTK
     - Penn Treebank tagset: 36 tags + 9 punctuation tags (45 in total)
     - Universal POS tagset: 12 tags
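A short sketch of how the same corpus can be read with different tagsets in NLTK; it assumes the 'brown' corpus and the 'universal_tagset' mapping have been downloaded:

```python
from nltk.corpus import brown

# Original Brown tags vs. the same words mapped to the 12-tag universal set.
print(brown.tagged_words()[:5])
print(brown.tagged_words(tagset='universal')[:5])
```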

  6. Universal POS tag set (NLTK)
     Tag   Meaning              English examples
     ADJ   adjective            new, good, high, special, big, local
     ADP   adposition           on, of, at, with, by, into, under
     ADV   adverb               really, already, still, early, now
     CONJ  conjunction          and, or, but, if, while, although
     DET   determiner, article  the, a, some, most, every, no, which
     NOUN  noun                 year, home, costs, time, Africa
     NUM   numeral              twenty-four, fourth, 1991, 14:24
     PRT   particle             at, on, out, over per, that, up, with
     PRON  pronoun              he, their, her, its, my, I, us
     VERB  verb                 is, say, told, given, playing, would
     .     punctuation marks    . , ; !
     X     other                ersatz, esprit, dunno, gr8, univeristy

  7. Penn Treebank tags

  8. Original Brown tags, part 1

  9. Original Brown tags, part 2

  10. Original Brown tags, part 3

  11. Different tagsets - example
      Words             Brown   Penn Treebank (‘wsj’)   Universal
      he, she           PPS     PRP                     PRON
      I                 PPSS    PRP                     PRON
      me, him, her      PPO     PRP                     PRON
      my, his, her      PP$     PRP$                    DET
      mine, his, hers   PP$$    ?                       PRON
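NLTK ships a mapping between several of these tagsets; a small illustration (not from the slides), assuming the 'universal_tagset' mapping data is downloaded and using 'en-ptb' as NLTK's name for the Penn Treebank (WSJ) set:

```python
from nltk.tag.mapping import map_tag

# Map individual Penn Treebank tags to the universal tagset.
print(map_tag('en-ptb', 'universal', 'NNS'))   # -> 'NOUN'
print(map_tag('en-ptb', 'universal', 'PRP'))   # -> 'PRON'
```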

  12. Ambiguity rate

  13. How ambiguous are tags (J&M, 2nd ed.)
      BUT: Not directly comparable because of different tokenization

  14. Back
      - earnings growth took a back/JJ seat
      - a small building in the back/NN
      - a clear majority of senators back/VBP the bill
      - Dave began to back/VB toward the door
      - enable the country to buy back/RP debt
      - I was twenty-one back/RB then

  15. Today
      - Tagged text and tag sets
      - Tagging as sequence labeling
      - HMM-tagging
      - Discriminative tagging
      - Neural sequence labeling

  16. Tagging as Sequence Classification
      - Classification (earlier):
        - a well-defined set of observations, $O$
        - a given set of classes, $S = \{s_1, s_2, \ldots, s_k\}$
        - Goal: a classifier, $\delta$, a mapping from $O$ to $S$
      - Sequence classification:
        - Goal: a classifier, $\delta$, a mapping from sequences of elements from $O$ to sequences of elements from $S$:
          $\delta(o_1, o_2, \ldots, o_n) = (s_{i_1}, s_{i_2}, \ldots, s_{i_n})$

  17. Baseline tagger
      - In all classification tasks, establish a baseline classifier
      - Compare the performance of the other classifiers you make to the baseline
      - For tagging, a natural baseline is the Most Frequent Class baseline:
        - Assign each word the tag with which it occurred most frequently in the training set
        - For words unseen in the training set, assign the most frequent tag in the training set
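One way to build this baseline with NLTK's standard taggers; a sketch under the assumption that 'NOUN' is the overall most frequent tag in the universal-tagged Brown corpus, and that the corpus data is downloaded:

```python
from nltk.corpus import brown
from nltk.tag import UnigramTagger, DefaultTagger

# Most Frequent Class baseline: known words get their most frequent training
# tag; unknown words fall back to the overall most frequent tag.
tagged_sents = brown.tagged_sents(tagset='universal')
train_sents, test_sents = tagged_sents[:50000], tagged_sents[50000:]

baseline = UnigramTagger(train_sents, backoff=DefaultTagger('NOUN'))
print(baseline.accuracy(test_sents))   # .evaluate(test_sents) in older NLTK versions
```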

  18. Today
      - Tagged text and tag sets
      - Tagging as sequence labeling
      - HMM-tagging
      - Discriminative tagging
      - Neural sequence labeling

  19. Hidden Markov Model (HMM) tagger
      - Can be seen both as an extension of a language model and as an extension of Naive Bayes
      - Two layers:
        - Observed: the sequence of words
        - Hidden: the tags/classes
      - NB assigns a class to each observation
      - An HMM is a sequence classifier: it assigns a sequence of classes to a sequence of words, i.e., each word is assigned a class

  20. HMM is a probabilistic tagger
      - Notation: a sentence is a word sequence $w_1^n = w_1, w_2, \ldots, w_n$; a candidate tag sequence is $t_1^n = t_1, t_2, \ldots, t_n$
      - The goal is to decide: $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$
      - Using Bayes’ theorem: $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}$
      - This simplifies to $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$, because the denominator is the same for all tag sequences

  21. Simplifying assumption 1
      - For the tag sequence, we apply the chain rule:
        $P(t_1^n) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_i \mid t_1^{i-1}) \cdots P(t_n \mid t_1^{n-1})$
      - We then make the Markov (chain) assumption:
        $P(t_1^n) \approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_i \mid t_{i-1}) \cdots P(t_n \mid t_{n-1}) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
      - assuming a special start tag $t_0$, so that $P(t_1) = P(t_1 \mid t_0)$

  22. Simplifying assumption 2
      - Applying the chain rule:
        $P(w_1^n \mid t_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}, t_1^n)$
        i.e., a word depends on all the tags and on all the preceding words
      - We make the simplifying assumption: $P(w_i \mid w_1^{i-1}, t_1^n) \approx P(w_i \mid t_i)$
        i.e., a word depends only on its own tag, and hence
        $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$


  24. Training
      - From a tagged training corpus, we can estimate the probabilities with maximum likelihood (as in language models and Naive Bayes):
        $\hat{P}(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$
        $\hat{P}(w_i \mid t_i) = \frac{C(w_i, t_i)}{C(t_i)}$
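A minimal training sketch, assuming the universal-tagged Brown corpus as training data; it simply counts tag bigrams and word/tag pairs and turns them into maximum likelihood estimates (no smoothing):

```python
from collections import defaultdict
from nltk.corpus import brown

transition = defaultdict(lambda: defaultdict(int))   # C(t_{i-1}, t_i)
emission = defaultdict(lambda: defaultdict(int))     # C(w_i, t_i), indexed by tag
tag_count = defaultdict(int)                         # C(t_i)

for sent in brown.tagged_sents(tagset='universal')[:5000]:
    prev = '<s>'                                     # special start tag t_0
    for word, tag in sent:
        transition[prev][tag] += 1
        emission[tag][word.lower()] += 1
        tag_count[tag] += 1
        prev = tag

def p_transition(prev, tag):
    """MLE estimate of P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    total = sum(transition[prev].values())
    return transition[prev][tag] / total if total else 0.0

def p_emission(tag, word):
    """MLE estimate of P(w_i | t_i) = C(w_i, t_i) / C(t_i)."""
    return emission[tag][word.lower()] / tag_count[tag] if tag_count[tag] else 0.0

print(p_transition('DET', 'NOUN'))   # how often a determiner is followed by a noun
print(p_emission('DET', 'the'))      # how often the tag DET emits the word "the"
```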

  25. Putting it all together
      - From a trained model, it is straightforward to calculate the probability of a sentence together with a tag sequence:
        $P(w_1^n, t_1^n) = P(t_1^n)\, P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \prod_{i=1}^{n} P(w_i \mid t_i) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$
      - To find the best tag sequence, we could – in principle – calculate this for all possible tag sequences and choose the one with the highest score:
        $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$
      - Impossible in practice – there are too many
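A toy illustration of the "in principle" brute-force search, reusing the hypothetical p_transition and p_emission helpers from the training sketch above; it only works here because the tag set is deliberately tiny:

```python
from itertools import product

def sequence_probability(words, tags):
    """P(w_1..w_n, t_1..t_n) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)."""
    p, prev = 1.0, '<s>'
    for word, tag in zip(words, tags):
        p *= p_transition(prev, tag) * p_emission(tag, word)
        prev = tag
    return p

words = ['the', 'jury', 'said']
tiny_tagset = ['DET', 'NOUN', 'VERB', 'ADJ']   # real tag sets are far larger

# Score every possible tag sequence (4**3 = 64 here, m**n in general).
best = max(product(tiny_tagset, repeat=len(words)),
           key=lambda tags: sequence_probability(words, tags))
print(best, sequence_probability(words, best))
```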

  26. Possible tag sequences
      (Trellis figure: one column of the 12 universal tags per word of “Janet will back the bill”)
      - The number of possible tag sequences = the number of paths through the trellis = $m^n$
        - $m$ is the number of tags in the tag set
        - $n$ is the number of tokens in the sentence
      - Here: $12^5 \approx 250{,}000$

  27. Viterbi algorithm (dynamic programming)
      (Same trellis figure over “Janet will back the bill”)
      - Walk through the word sequence
      - For each word, keep track of all the possible tag sequences up to this word and the probability of each sequence
      - If two paths are equal from a point on, then:
        - The one scoring best at this point will also score best at the end
        - Discard the other one
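A compact, illustrative sketch of the idea (not the course's reference implementation), again reusing the hypothetical p_transition and p_emission helpers from the training sketch; for each position and tag it keeps only the best path ending in that tag:

```python
import math

def viterbi(words, tagset, p_transition, p_emission, start='<s>'):
    """Return the best tag sequence and its log probability."""
    def logp(p):
        return math.log(p) if p > 0 else float('-inf')

    best = {start: 0.0}      # log probability of the best path ending in each tag
    paths = {start: []}      # the corresponding tag sequence
    for word in words:
        new_best, new_paths = {}, {}
        for tag in tagset:
            # Choose the best previous tag to extend with `tag`.
            prev = max(best, key=lambda pt: best[pt] + logp(p_transition(pt, tag)))
            new_best[tag] = (best[prev] + logp(p_transition(prev, tag))
                             + logp(p_emission(tag, word)))
            new_paths[tag] = paths[prev] + [tag]
        best, paths = new_best, new_paths
    final = max(best, key=best.get)
    return paths[final], best[final]

tags, logprob = viterbi(['the', 'jury', 'said'],
                        ['DET', 'NOUN', 'VERB', 'ADJ'],
                        p_transition, p_emission)
print(tags, logprob)
```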

  28. Viterbi algorithm
      - A nice example of dynamic programming
      - Skip the details:
        - Viterbi is covered in IN2110
        - We will use preprogrammed tools in this course – not implement them ourselves
        - HMM taggers are not state of the art
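One such preprogrammed tool is NLTK's built-in HMM tagger; a minimal usage sketch, assuming the Brown corpus is available (note that plain MLE training gives unknown words zero emission probability, so it degrades on unseen vocabulary):

```python
from nltk.corpus import brown
from nltk.tag import hmm

# Train a supervised HMM tagger on part of the Brown corpus and tag a sentence.
tagged_sents = brown.tagged_sents(tagset='universal')
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(tagged_sents[:10000])
print(tagger.tag(['The', 'jury', 'said', 'it', 'was', 'fine', '.']))
```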

  29. HMM trigram tagger
      - Take the two preceding tags into consideration:
        $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$
        $P(w_1^n, t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2})$
      - Add two initial special states and one special end state
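NLTK also includes a trigram-based statistical tagger, TnT (Brants 2000); a usage sketch, where the DefaultTagger('NOUN') passed as the unknown-word handler is just a placeholder assumption, not part of the lecture:

```python
from nltk.corpus import brown
from nltk.tag import tnt, DefaultTagger

# TnT conditions on the two preceding tags; `unk` handles unseen words.
tagged_sents = brown.tagged_sents(tagset='universal')
tagger = tnt.TnT(unk=DefaultTagger('NOUN'), Trained=True)
tagger.train(tagged_sents[:10000])
print(tagger.tag(['The', 'jury', 'said', 'it', 'was', 'fine', '.']))
```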

  30. Challenges for the trigram tagger
      - More complex:
        - roughly $(n + 3) \times m^3$ trigram probabilities are involved, where
          - $n$ is the number of words in the sequence (plus the three special boundary states)
          - $m$ is the number of tags in the model
        - Example: 12 tags and 6 words: $9 \times 12^3 = 15{,}552$; with 45 tags: 820,125; with 87 tags: 5,926,527
      - We have probably not seen all tag trigrams during training:
        - We must use back-off or interpolation to lower-order n-grams
        - (this can also be necessary for the bigram tagger)
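A quick check of the slide's back-of-the-envelope numbers (the $(n+3) \times m^3$ formula is read off the three example values, with $n = 6$ words plus the three special boundary states):

```python
# Number of trigram probabilities in play for a 6-word sentence.
n = 6
for m in (12, 45, 87):
    print(m, (n + 3) * m ** 3)
# -> 12 15552, 45 820125, 87 5926527
```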

  31. Challenges for all (n-gram) taggers
      - How do we tag words not seen during training?
        - We could assign them all the most frequent tag (noun)
        - Or use only the tag frequencies: $P(w \mid t) = P(t)$
        - Better: use morphological features
          - These can be added as an extra module to an HMM tagger
      - We will later on consider discriminative taggers, where morphological features may be added without changing the model
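One simple way to bring crude morphological information into play for unknown words, sketched with NLTK's standard taggers (the choice of a 3-character suffix and the 'NOUN' default are illustrative assumptions):

```python
from nltk.corpus import brown
from nltk.tag import AffixTagger, UnigramTagger, DefaultTagger

# Back off from a word-based unigram tagger to a suffix-based guesser,
# and finally to the overall most frequent tag.
tagged_sents = brown.tagged_sents(tagset='universal')
train_sents = tagged_sents[:10000]

suffix_guesser = AffixTagger(train_sents, affix_length=-3,
                             backoff=DefaultTagger('NOUN'))
tagger = UnigramTagger(train_sents, backoff=suffix_guesser)
print(tagger.tag(['The', 'nationalization', 'accelerated', '.']))
```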

  32. Today
      - Tagged text and tag sets
      - Tagging as sequence labeling
      - HMM-tagging
      - Discriminative tagging
      - Neural sequence labeling
