
Natural Language Processing
Info 159/259, Lecture 15: Review (Oct 11, 2018)
David Bamman, UC Berkeley

Big ideas • Classification: Naive Bayes, logistic regression • Language modeling: Markov assumption, featurized, neural


  1. Distributed representations • Low-dimensional, dense word representations are extraordinarily powerful (and are arguably responsible for much of the gains that neural network models have in NLP). • Lets your representation of the input share statistical strength with words that behave similarly in terms of their distributional properties (often synonyms or words that belong to the same class).

  2. Dense vectors from prediction • Learning low-dimensional representations of words by framing a prediction task: using words to predict the words in a surrounding context window • Transform this into a supervised prediction problem; similar to language modeling, but we're ignoring order within the context window
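A minimal sketch of this prediction setup, using gensim's skip-gram implementation (gensim 4.x is an assumption; the lecture doesn't prescribe a library, and the toy corpus below is just for illustration):

```python
# Learn dense word vectors by predicting words within a context window (skip-gram).
from gensim.models import Word2Vec

sentences = [
    ["fruit", "flies", "like", "a", "banana"],
    ["time", "flies", "like", "an", "arrow"],
]

# sg=1 selects skip-gram; window=2 defines the context window; order within
# the window is ignored, unlike in a language model.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["flies"].shape)          # (50,) dense vector
print(model.wv.most_similar("flies"))   # nearest neighbors by cosine similarity
```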

  3. Using dense vectors • In neural models (CNNs, RNNs, LMs), replace the V-dimensional sparse vector with the much smaller K-dimensional dense one. • Can also take the derivative of the loss function with respect to those representations to optimize them for a particular task.
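A sketch of what that replacement looks like in practice, assuming PyTorch (not specified in the lecture): token indices look up rows of a trainable K-dimensional embedding table, and gradients flow back into those rows.

```python
# Sketch (PyTorch assumed): replace a V-dimensional one-hot input with a
# K-dimensional dense embedding that is itself updated by backpropagation.
import torch
import torch.nn as nn

V, K = 10000, 100                       # vocabulary size, embedding dimension
embedding = nn.Embedding(V, K)          # trainable lookup table, V x K

word_ids = torch.tensor([42, 7, 1337])  # token indices instead of one-hot vectors
dense = embedding(word_ids)             # shape (3, K); feeds into a CNN/RNN/LM
dense.sum().backward()                  # gradients reach the embedding rows
print(embedding.weight.grad.shape)      # torch.Size([10000, 100])
```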

  4. Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

  5. Subword models • Rather than learning a single representation for each word type w, learn representations z_g for the set of n-grams 𝒢_w that comprise it [Bojanowski et al. 2017] • The word itself is included among the n-grams (no matter its length). • A word representation is the sum of those n-gram representations: w = Σ_{g ∈ 𝒢_w} z_g

  6. FastText (e(*) = embedding for *):
    e(where) = e(<wh) + e(whe) + e(her) + e(ere) + e(re>)    [3-grams]
             + e(<whe) + e(wher) + e(here) + e(ere>)         [4-grams]
             + e(<wher) + e(where) + e(here>)                [5-grams]
             + e(<where) + e(where>)                         [6-grams]
             + e(<where>)                                    [the word itself]
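A small sketch of the idea above: enumerate the 3- to 6-character n-grams of a word (with < and > boundary markers, plus the marked word itself) and sum their embeddings. The helper names and the random embedding table are mine, purely for illustration.

```python
import numpy as np

def ngram_set(word, n_min=3, n_max=6):
    """All 3- to 6-grams of the boundary-marked word, plus the marked word itself."""
    marked = f"<{word}>"
    grams = {marked}                              # the word itself is included
    for n in range(n_min, n_max + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

rng = np.random.default_rng(0)
K = 50
e = {}                                            # n-gram -> K-dimensional z_g

def word_vector(word):
    grams = ngram_set(word)
    for g in grams:
        e.setdefault(g, rng.normal(size=K))       # placeholder embeddings
    return sum(e[g] for g in grams)               # w = sum of z_g over g in G_w

print(sorted(ngram_set("where")))                 # the 15 n-grams listed on the slide above
print(word_vector("where").shape)                 # (50,)
```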

  7. [Figure: performance as a function of training data size, from 1% (~20M tokens) to 100% (~1B tokens) of the corpus.] • Subword models need less data to get comparable performance.

  8. ELMo • Peters et al. (2018), "Deep Contextualized Word Representations" (NAACL) • Big idea: transform the representation of a word (e.g., from a static word embedding) to be sensitive to its local context in a sentence and optimized for a specific NLP task. • Output = word representations that can be plugged into just about any architecture where a word embedding can be used.

  9. ELMo

  10. Parts of speech • Parts of speech are categories of words defined distributionally by the morphological and syntactic contexts a word appears in. • [Table of morphological tests (-s, -ed, -ing) and a syntactic frame ("Kim saw the elephant before we did ___"): walk → walks, walked, walking; slice → slices, sliced, slicing; believe → believes, believed, believing; of → *ofs, *ofed, *ofing; red → *reds, *redded, *reding]

  11. Open class: Nouns (fax, affluenza, subtweet, bitcoin, cronut, emoji, listicle, mocktail, selfie, skort), Verbs (text, chillax, manspreading, photobomb, unfollow, google), Adjectives (crunk, amazeballs, post-truth, woke), Adverbs (hella, wicked) • Closed class: Determiners, Pronouns, Prepositions, Conjunctions ("English has a new preposition, because internet") [Garber 2013; Pullum 2014]

  12. POS tagging • Labeling each word with the tag that's correct for its context: Fruit flies like a banana / Time flies like an arrow • Each word has several candidate tags (e.g., NN, NNP, VB, VBZ, VBP, JJ, IN, FW, SYM, LS, DT), and those shown are just the tags in evidence within the Penn Treebank; more are possible!
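For a concrete sense of what a tagger produces, here is a quick illustration with an off-the-shelf tagger (NLTK is an assumption, not the lecture's model; it may require nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') first):

```python
import nltk

for sent in ["Fruit flies like a banana", "Time flies like an arrow"]:
    # pos_tag returns one (word, Penn Treebank tag) pair per token
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
```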

  13. Why is part of speech tagging useful?

  14. Sequence labeling • x = {x_1, . . . , x_n}, y = {y_1, . . . , y_n} • For a set of inputs x with n sequential time steps, one corresponding label y_i for each x_i • Model correlations in the labels y.

  15. HMM • P(x_1, . . . , x_n, y_1, . . . , y_n) ≈ ∏_{i=1}^{n+1} P(y_i | y_{i-1}) ∏_{i=1}^{n} P(x_i | y_i)

  16. Hidden Markov Model • P(y) = P(y_1, . . . , y_n) is the prior probability of the label sequence: P(y_1, . . . , y_n) ≈ ∏_{i=1}^{n+1} P(y_i | y_{i-1}) • We'll make a first-order Markov assumption and calculate the joint probability as the product of the individual factors, each conditioned only on the previous tag.

  17. Hidden Markov Model • P(x | y) = P(x_1, . . . , x_n | y_1, . . . , y_n) ≈ ∏_{i=1}^{n} P(x_i | y_i) • Here again we'll make a strong assumption: the probability of the word we see at a given time step depends only on its label.

  18. Parameter estimation • P(y_t | y_{t-1}) = c(y_{t-1}, y_t) / c(y_{t-1}) • P(x_t | y_t) = c(x_t, y_t) / c(y_t) • The MLE for both is just counting (as in Naive Bayes)
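A sketch of that counting, on a tiny assumed tagged corpus (the corpus and the <START>/<END> handling are mine, not from the slides):

```python
from collections import Counter

tagged = [
    [("fruit", "NN"), ("flies", "VBZ"), ("like", "IN"), ("a", "DT"), ("banana", "NN")],
    [("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"), ("arrow", "NN")],
]

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in tagged:
    prev = "<START>"
    tag_count[prev] += 1
    for word, tag in sent:
        trans[(prev, tag)] += 1          # c(y_{t-1}, y_t)
        emit[(tag, word)] += 1           # c(x_t, y_t)
        tag_count[tag] += 1              # c(y)
        prev = tag
    trans[(prev, "<END>")] += 1

def p_trans(tag, prev):                  # P(y_t | y_{t-1})
    return trans[(prev, tag)] / tag_count[prev]

def p_emit(word, tag):                   # P(x_t | y_t)
    return emit[(tag, word)] / tag_count[tag]

print(p_trans("VBZ", "NN"), p_emit("flies", "VBZ"))   # 0.5 1.0 on this toy corpus
```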

  19. Decoding • Greedy: proceed left to right, committing to the best tag for each time step (given the sequence seen so far): Fruit/NN flies/VB like/IN a/DT banana/NN

  20. MEMM (Maximum Entropy Markov Model) • argmax_y P(y | x, β) (general maxent form) • argmax_y ∏_{i=1}^{n} P(y_i | y_{i-1}, x) (maxent with a first-order Markov assumption)

  21. Features • Features f(t_i, t_{i-1}; x_1, . . . , x_n) are scoped over the previous predicted tag and the entire observed input • Examples: x_i = man → 1; t_{i-1} = JJ → 1; i = n (last word of the sentence) → 1; x_i ends in -ly → 0
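A sketch of such a feature function, returning the four example features above as a dict keyed by feature name (the sentence and the helper name are mine):

```python
def features(t_i, t_prev, x, i):
    """Features scoped over the previous tag and the entire observed input x."""
    return {
        f"x_i={x[i]}": 1,                       # current word identity
        f"t_i-1={t_prev}": 1,                   # previous predicted tag
        "i=n": 1 if i == len(x) - 1 else 0,     # last word of the sentence
        "x_i ends in -ly": 1 if x[i].endswith("ly") else 0,
    }

x = ["a", "wise", "man"]
print(features("NN", "JJ", x, 2))
# {'x_i=man': 1, 't_i-1=JJ': 1, 'i=n': 1, 'x_i ends in -ly': 0}
```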

  22. Viterbi decoding • Viterbi for HMM (max joint probability P(x, y) = P(y) P(x | y)): v_t(y) = max_{u ∈ Y} [v_{t-1}(u) × P(y_t = y | y_{t-1} = u) × P(x_t | y_t = y)] • Viterbi for MEMM (max conditional probability P(y | x)): v_t(y) = max_{u ∈ Y} [v_{t-1}(u) × P(y_t = y | y_{t-1} = u, x, β)]
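A sketch of Viterbi decoding for the HMM case, in log space for numerical stability; trans and emit are assumed to be nested dicts of probabilities (e.g., the MLE estimates from slide 18), structured as trans[prev_tag][tag] and emit[tag][word]:

```python
import math

def viterbi(words, tags, trans, emit, start="<START>"):
    """Most probable tag sequence under an HMM, computed in log space."""
    # v[t][y]: best log score of any tag sequence ending in tag y at position t
    v = [{y: math.log(trans[start].get(y, 1e-12))
             + math.log(emit[y].get(words[0], 1e-12)) for y in tags}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for y in tags:
            best_u = max(tags, key=lambda u: v[t - 1][u]
                         + math.log(trans[u].get(y, 1e-12)))
            v[t][y] = (v[t - 1][best_u]
                       + math.log(trans[best_u].get(y, 1e-12))
                       + math.log(emit[y].get(words[t], 1e-12)))
            back[t][y] = best_u
    # follow backpointers from the best final tag
    y = max(tags, key=lambda u: v[-1][u])
    path = [y]
    for t in range(len(words) - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return list(reversed(path))
```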

  23. MEMM training • ∏_{i=1}^{n} P(y_i | y_{i-1}, x, β) • Locally normalized: at each time step, each conditional distribution sums to 1

  24. Label bias • will/NN to/TO fight/VB • Because of this local normalization in ∏_{i=1}^{n} P(y_i | y_{i-1}, x, β), P(TO | context) will always be 1 if x = "to" [Toutanova et al. 2003]

  25. Label bias • will/NN to/TO fight/VB • That means our prediction for "to" can't help us disambiguate "will": we lose the information that MD + TO sequences rarely happen. [Toutanova et al. 2003]

  26. Conditional random fields • We can solve this problem using global normalization (over the entire label sequence) rather than locally normalized factors. • MEMM: P(y | x, β) = ∏_{i=1}^{n} P(y_i | y_{i-1}, x, β) • CRF: P(y | x, β) = exp(Φ(x, y)⊤β) / Σ_{y′ ∈ Y} exp(Φ(x, y′)⊤β)
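To make the contrast with local normalization concrete, here is a brute-force sketch of the CRF probability above: score every candidate label sequence with exp(Φ(x, y)·β) and divide by the sum over all of them. The toy feature set and weights are mine; a real CRF computes the denominator with the forward algorithm rather than by enumeration.

```python
import math
from itertools import product

def phi_dot_beta(x, y, beta):
    """Sum of weighted features over the whole sequence (toy feature set)."""
    score, prev = 0.0, "<START>"
    for word, tag in zip(x, y):
        score += beta.get((word, tag), 0.0)   # emission-style feature
        score += beta.get((prev, tag), 0.0)   # transition-style feature
        prev = tag
    return score

def crf_prob(x, y, beta, tagset):
    numerator = math.exp(phi_dot_beta(x, y, beta))
    Z = sum(math.exp(phi_dot_beta(x, y2, beta))       # global normalization
            for y2 in product(tagset, repeat=len(x)))
    return numerator / Z

# A transition weight like (MD, TO): -3 encodes exactly the kind of information
# ("MD + TO sequences rarely happen") that a locally normalized model cannot use.
beta = {("will", "MD"): 1.0, ("will", "NN"): 0.5, ("to", "TO"): 2.0, ("MD", "TO"): -3.0}
print(crf_prob(["will", "to", "fight"], ("NN", "TO", "VB"), beta, ["NN", "MD", "TO", "VB"]))
```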

  27. Recurrent neural network • For POS tagging, predict each tag from the RNN output at that time step, conditioned on the context: The/DT dog/NN ran/VBD into/IN town/NN

  28. Bidirectional RNN • A powerful alternative is to make predictions conditioning both on the past and the future • Two RNNs: one running left-to-right, one right-to-left • Each produces an output vector at each time step, which we concatenate
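A sketch of a bidirectional RNN tagger, assuming PyTorch (bidirectional=True runs the two RNNs and the per-step outputs are concatenated):

```python
import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                           bidirectional=True)          # two RNNs, one per direction
        self.out = nn.Linear(2 * hidden_dim, n_tags)    # concatenated states -> tags

    def forward(self, word_ids):                        # (batch, seq_len)
        h, _ = self.rnn(self.emb(word_ids))             # (batch, seq_len, 2*hidden)
        return self.out(h)                              # per-token tag scores

tagger = BiRNNTagger(vocab_size=10000, n_tags=45)
scores = tagger(torch.randint(0, 10000, (1, 5)))        # e.g. "The dog ran into town"
print(scores.shape)                                     # torch.Size([1, 5, 45])
```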

  29. Evaluation • A critical part of developing new algorithms and methods and demonstrating that they work

  30. Experiment design • training (80%): training models • development (10%): model selection • testing (10%): evaluation; never look at it until the very end

  31. Accuracy • Accuracy = (1/N) Σ_{i=1}^{N} I[ŷ_i = y_i], where I[x] = 1 if x is true and 0 otherwise • Confusion matrix (rows = true y, columns = predicted ŷ), used on the next few slides:
                 Predicted
                 NN    VBZ    JJ
    True NN     100      2    15
    True VBZ      0    104    30
    True JJ      30     40    70

  32. Precision • Precision(NN) = Σ_{i=1}^{N} I(y_i = ŷ_i = NN) / Σ_{i=1}^{N} I(ŷ_i = NN) • Precision: the proportion of items predicted to be a class that actually are that class (confusion matrix as above)

  33. Recall • Recall(NN) = Σ_{i=1}^{N} I(y_i = ŷ_i = NN) / Σ_{i=1}^{N} I(y_i = NN) • Recall: the proportion of items truly in a class that are predicted to be that class (confusion matrix as above)

  34. F score • F = (2 × precision × recall) / (precision + recall)
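The three metrics computed from the confusion matrix on the preceding slides (a small sketch; the per-class loop is the only addition):

```python
# rows = true labels, columns = predicted labels (values from the slides above)
confusion = {
    "NN":  {"NN": 100, "VBZ": 2,   "JJ": 15},
    "VBZ": {"NN": 0,   "VBZ": 104, "JJ": 30},
    "JJ":  {"NN": 30,  "VBZ": 40,  "JJ": 70},
}
labels = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[l][l] for l in labels) / total

def precision(l):   # correct predictions of l / all predictions of l
    return confusion[l][l] / sum(confusion[t][l] for t in labels)

def recall(l):      # correct predictions of l / all true instances of l
    return confusion[l][l] / sum(confusion[l].values())

for l in labels:
    p, r = precision(l), recall(l)
    print(f"{l}: P={p:.3f} R={r:.3f} F={2 * p * r / (p + r):.3f}")
print(f"accuracy={accuracy:.3f}")
```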

  35. Why is syntax important?

  36. Context-free grammar • A CFG gives a formal way to define what meaningful constituents are and exactly how a constituent is formed out of other constituents (or words). It defines valid structure in a language. • Example rules: NP → Det Nominal, VP → Verb Nominal

  37. Constituents • Every internal node is a phrase: my pajamas; in my pajamas; elephant in my pajamas; an elephant in my pajamas; shot an elephant in my pajamas; I shot an elephant in my pajamas • Each phrase could be replaced by another of the same type of constituent

  38. Evaluation • Parseval (1991): represent each tree as a collection of tuples <l_1, i_1, j_1>, …, <l_n, i_n, j_n> • l_k = label of the kth phrase • i_k = index of the first word in the kth phrase • j_k = index of the last word in the kth phrase [Smith 2017]

  39. Evaluation • I_1 shot_2 an_3 elephant_4 in_5 my_6 pajamas_7 • <S, 1, 7> • <NP, 1, 1> • <VP, 2, 7> • <VP, 2, 4> • <NP, 3, 4> • <Nominal, 4, 4> • <PP, 5, 7> • <NP, 6, 7> [Smith 2017]

  40. Evaluation • I_1 shot_2 an_3 elephant_4 in_5 my_6 pajamas_7 • One parse (PP attached to the VP): <S, 1, 7>, <NP, 1, 1>, <VP, 2, 7>, <VP, 2, 4>, <NP, 3, 4>, <Nominal, 4, 4>, <PP, 5, 7>, <NP, 6, 7> • The other parse (PP attached to the NP): <S, 1, 7>, <NP, 1, 1>, <VP, 2, 7>, <NP, 3, 7>, <Nominal, 4, 7>, <Nominal, 4, 4>, <PP, 5, 7>, <NP, 6, 7> [Smith 2017]

  41. Evaluation Calculate precision, recall, F1 from these collections of tuples • Precision: number of tuples in predicted tree also in gold standard tree, divided by number of tuples in predicted tree • Recall: number of tuples in predicted tree also in gold standard tree, divided by number of tuples in gold standard tree Smith 2017
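A sketch of that computation over the tuple sets from slide 40, treating the first parse as the gold standard and the second as the prediction (the slide doesn't say which is which):

```python
gold = {("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7), ("VP", 2, 4),
        ("NP", 3, 4), ("Nominal", 4, 4), ("PP", 5, 7), ("NP", 6, 7)}
pred = {("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7), ("NP", 3, 7),
        ("Nominal", 4, 7), ("Nominal", 4, 4), ("PP", 5, 7), ("NP", 6, 7)}

correct = len(gold & pred)              # tuples in the predicted tree also in gold
precision = correct / len(pred)
recall = correct / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)            # 0.75 0.75 0.75
```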

  42. Treebanks • Rather than create the rules by hand, we can annotate sentences with their syntactic structure and then extract the rules from the annotations • Treebanks: collections of sentences annotated with syntactic structure

  43. Penn Treebank • Example rules extracted from a single annotated tree: NP → NNP NNP; NP-SBJ → NP , ADJP ,; S → NP-SBJ VP; VP → VB NP PP-CLR NP-TMP

  44. PCFG • Probabilistic context-free grammar: each production is also associated with a probability. • This lets us calculate the probability of a parse for a given sentence; for a parse tree T of sentence S comprised of n rules from R (each A → β): P(T, S) = ∏_{i=1}^{n} P(β_i | A_i)

  45. Estimating PCFGs • P(β | A) = C(A → β) / Σ_γ C(A → γ) • (equivalently) P(β | A) = C(A → β) / C(A)
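A sketch of that estimate by counting rule occurrences (the toy rule list is assumed, standing in for rules read off a treebank):

```python
from collections import Counter

rules = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("PRP",)),
         ("VP", ("VBD", "NP")), ("VP", ("VBD", "NP", "PP"))]

rule_count = Counter(rules)                   # C(A -> beta)
lhs_count = Counter(lhs for lhs, _ in rules)  # C(A)

def p_rule(lhs, rhs):                         # P(beta | A)
    return rule_count[(lhs, rhs)] / lhs_count[lhs]

print(p_rule("NP", ("DT", "NN")))   # 2/3
print(p_rule("VP", ("VBD", "NP")))  # 1/2
```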

  46. CKY chart for I_1 shot_2 an_3 elephant_4 in_5 my_6 pajamas_7. The diagonal holds the part-of-speech cells: NP, PRP [0,1]; VBD [1,2]; DT [2,3]; NP, NN [3,4]; IN [4,5]; PRP$ [5,6]; NNS [6,7]. Does any rule generate PRP VBD?
  47. No, so that cell is ∅. Does any rule generate VBD DT?
  48-50. ∅ again. For the next cell, which spans three words, there are two possible places to look for the split k; no rule applies there either, so it is ∅.
  51. Does any rule generate DT NN?
  52-53. Yes: NP fills cell [2,4]. For cell [1,4] there are two possible places to look for the split k.
  54-57. VP fills cell [1,4]. For cell [0,4] there are three possible places to look for the split k.
  58. S fills cell [0,4].
  59. Completed chart: every remaining cell is ∅, corresponding to the non-constituent spans *in my, *elephant in, *elephant in my, *an elephant in, *an elephant in my, *shot an elephant in, *shot an elephant in my, *I shot an elephant in, *I shot an elephant in my.
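A sketch of the CKY chart-filling walked through above, for a small assumed grammar in Chomsky normal form (the grammar and POS assignments are mine, chosen to reproduce the chart):

```python
from collections import defaultdict

grammar = [  # binary rules, written LHS -> (B, C)
    ("S", ("NP", "VP")), ("VP", ("VBD", "NP")), ("VP", ("VP", "PP")),
    ("NP", ("DT", "NN")), ("NP", ("NP", "PP")), ("NP", ("PRP$", "NNS")),
    ("PP", ("IN", "NP")),
]
pos = {"I": {"NP", "PRP"}, "shot": {"VBD"}, "an": {"DT"},
       "elephant": {"NP", "NN"}, "in": {"IN"}, "my": {"PRP$"}, "pajamas": {"NNS"}}

def cky(words):
    n = len(words)
    cell = defaultdict(set)                  # cell[(i, j)]: labels spanning words i..j-1
    for i, w in enumerate(words):
        cell[(i, i + 1)] = set(pos[w])       # the diagonal: POS tags
    for span in range(2, n + 1):             # fill longer spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # possible split points
                for lhs, (b, c) in grammar:
                    if b in cell[(i, k)] and c in cell[(k, j)]:
                        cell[(i, j)].add(lhs)
    return cell

chart = cky("I shot an elephant in my pajamas".split())
print(chart[(0, 4)], chart[(0, 7)])   # {'S'} {'S'}: "I shot an elephant" and the full sentence
```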
