Lecture 6: Representing Words (Kai-Wei Chang, CS @ UCLA)


  1. Lecture 6: Representing Words
     Kai-Wei Chang, CS @ UCLA
     kw@kwchang.net
     Course webpage: https://uclanlp.github.io/CS269-17/

  2. Bag-of-Words with N-grams
     - N-gram: a contiguous sequence of n tokens from a given piece of text
     http://recognize-speech.com/language-model/n-gram-model/comparison
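As a concrete illustration of the definition above, here is a minimal Python sketch (not part of the slides) that enumerates the n-grams of a tokenized sentence; the helper name `ngrams` and the example sentence are illustrative choices.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token sequences in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a dog is chasing a cat".split()
print(ngrams(tokens, 1))  # unigrams: ('a',), ('dog',), ...
print(ngrams(tokens, 2))  # bigrams:  ('a', 'dog'), ('dog', 'is'), ...
print(ngrams(tokens, 3))  # trigrams: ('a', 'dog', 'is'), ...
```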

  3. Language model
     - Probability distributions over sentences (i.e., word sequences):
       P(W) = P(w_1 w_2 w_3 w_4 ... w_n)
     - Can use them to generate strings:
       P(w_n | w_1 w_2 w_3 ... w_{n-1})
     - Rank possible sentences:
       P("Today is Tuesday") > P("Tuesday Today is")
       P("Today is Tuesday") > P("Today is Los Angeles")

  4. N-Gram Models
     - Unigram model: P(w_1) P(w_2) P(w_3) ... P(w_n)
     - Bigram model: P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})
     - Trigram model: P(w_1) P(w_2|w_1) P(w_3|w_2, w_1) ... P(w_n|w_{n-1}, w_{n-2})
     - N-gram model: P(w_1) P(w_2|w_1) ... P(w_n|w_{n-1} w_{n-2} ... w_{n-N+1})
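The bigram factorization above can be estimated directly from counts. The following sketch (not part of the slides) builds a maximum-likelihood bigram model from a toy two-sentence corpus and uses it to score word orders, mirroring the ranking example on slide 3; the corpus and the `<s>` start symbol are assumptions made for illustration.

```python
from collections import Counter

corpus = [
    "<s> today is tuesday".split(),
    "<s> today is a holiday".split(),
]
unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """MLE estimate: P(w | prev) = #(prev, w) / #(prev)."""
    return bigram[(prev, w)] / unigram[prev]

def sentence_prob(sent):
    """P(w_1 | <s>) * P(w_2 | w_1) * ... * P(w_n | w_{n-1})."""
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p_bigram(w, prev)
    return prob

# Higher probability for the fluent word order, as on slide 3.
print(sentence_prob("<s> today is tuesday".split()))   # 0.5
print(sentence_prob("<s> tuesday today is".split()))   # 0.0 (contains unseen bigrams)
```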

  5. Random language via n-grams
     http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
     Collections of n-grams:
     https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

  6. N-Gram Viewer
     https://books.google.com/ngrams

  7. How to represent words?
     - N-grams cannot capture word similarity
     - Word clusters
       - Brown clustering
       - Part-of-speech tagging
     - Continuous space representations
       - Word embeddings

  8. Brown Clustering
     - Similar to a language model, but the basic unit is word clusters
     - Intuition: similar words appear in similar contexts
     - Recap: bigram language models
       P(w_0, w_1, w_2, ..., w_n) = P(w_1|w_0) P(w_2|w_1) ... P(w_n|w_{n-1})
                                  = ∏_{i=1}^{n} P(w_i | w_{i-1})
       w_0 is a dummy word representing "beginning of a sentence"

  9. Motivation example
     - "a dog is chasing a cat"
       P(w_0, "a", "dog", ..., "cat") = P("a"|w_0) P("dog"|"a") ... P("cat"|"a")
     - Assume every word belongs to a cluster:
       Cluster 3: a, the
       Cluster 46: dog, cat, fox, rabbit, bird, boy
       Cluster 64: is, was
       Cluster 8: chasing, following, biting, ...

  10. Motivation example
      - Assume every word belongs to a cluster (same clusters as above)
      - "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46

  11. Motivation example
      - Assume every word belongs to a cluster
      - "a dog is chasing a cat"
        a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46

  12. Motivation example
      - Assume every word belongs to a cluster
      - "the boy is following a rabbit"
        the/C3 boy/C46 is/C64 following/C8 a/C3 rabbit/C46

  13. Motivation example
      - Assume every word belongs to a cluster
      - "a fox was chasing a bird"
        a/C3 fox/C46 was/C64 chasing/C8 a/C3 bird/C46

  14. Brown Clustering
      - Let C(w) denote the cluster that w belongs to
      - "a dog is chasing a cat"
        a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46
        cluster-to-cluster transitions, e.g. P(C("dog") | C("a"))
        cluster-to-word emissions, e.g. P("cat" | C("cat"))

  15. Brown clustering model
      - P("a dog is chasing a cat")
        = P(C("a")|C(w_0)) P(C("dog")|C("a")) P(C("is")|C("dog")) ...
          × P("a"|C("a")) P("dog"|C("dog")) ...

  16. Brown clustering model
      - P("a dog is chasing a cat")
        = P(C("a")|C(w_0)) P(C("dog")|C("a")) P(C("is")|C("dog")) ...
          × P("a"|C("a")) P("dog"|C("dog")) ...
      - In general:
        P(w_0, w_1, w_2, ..., w_n)
        = P(C(w_1)|C(w_0)) P(C(w_2)|C(w_1)) ... P(C(w_n)|C(w_{n-1}))
          × P(w_1|C(w_1)) P(w_2|C(w_2)) ... P(w_n|C(w_n))
        = ∏_{i=1}^{n} P(C(w_i)|C(w_{i-1})) P(w_i|C(w_i))
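To make the factorization concrete, here is a small sketch (not part of the slides) that scores a sentence under the Brown clustering model with hand-picked cluster assignments and probability tables; all numeric values and the `<s>` start symbol are assumptions for illustration only.

```python
# Cluster assignment C(w), loosely following the clusters in the motivation example.
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "is": 64, "chasing": 8}

# P(c' | c): cluster transition probabilities (assumed values for illustration).
p_trans = {(0, 3): 0.5, (3, 46): 0.6, (46, 64): 0.4, (64, 8): 0.3, (8, 3): 0.7}
# P(w | c): emission probabilities within each cluster (assumed values).
p_emit = {("a", 3): 0.6, ("dog", 46): 0.2, ("is", 64): 0.5,
          ("chasing", 8): 0.3, ("cat", 46): 0.2}

def brown_prob(words):
    """prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i)), with w_0 = <s>."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= p_trans[(C[prev], C[w])] * p_emit[(w, C[w])]
        prev = w
    return prob

print(brown_prob("a dog is chasing a cat".split()))
```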

  17. Model parameters
      P(w_0, w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(C(w_i)|C(w_{i-1})) P(w_i|C(w_i))
      - Parameter set 1: the cluster transition probabilities P(C(w_i)|C(w_{i-1}))
      - Parameter set 2: the emission probabilities P(w_i|C(w_i))
      - Parameter set 3: the clustering C(w_i) itself
      Example: a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46

  18. Model parameters
      P(w_0, w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(C(w_i)|C(w_{i-1})) P(w_i|C(w_i))
      - A vocabulary set V
      - A function C: V → {1, 2, 3, ..., k}
        (a partition of the vocabulary into k classes)
      - Conditional probabilities P(c'|c) for c, c' ∈ {1, ..., k}
      - Conditional probabilities P(w|c) for c ∈ {1, ..., k}, w ∈ c
      θ denotes the set of conditional probability parameters; C denotes the clustering

  19. Log likelihood
      LL(θ, C) = log P(w_0, w_1, w_2, ..., w_n | θ, C)
               = log ∏_{i=1}^{n} P(C(w_i)|C(w_{i-1})) P(w_i|C(w_i))
               = Σ_{i=1}^{n} [log P(C(w_i)|C(w_{i-1})) + log P(w_i|C(w_i))]
      - Maximizing LL(θ, C) can be done by alternately updating θ and C:
        1. max_{θ∈Θ} LL(θ, C)
        2. max_C LL(θ, C)

  20. max_{θ∈Θ} LL(θ, C)
      LL(θ, C) = Σ_{i=1}^{n} [log P(C(w_i)|C(w_{i-1})) + log P(w_i|C(w_i))]
      - With the clustering C fixed, the maximum-likelihood estimates are relative counts:
        P(c'|c) = #(c, c') / #(c)
        P(w|c) = #(w, c) / #(c)
      - This part is the same as training a POS tagging model; see Section 9.2:
        http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf
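The count-based updates above can be computed in one pass over the corpus once the clustering is fixed. The sketch below (not part of the slides) does exactly that on a toy corpus with an assumed cluster assignment.

```python
from collections import Counter

cluster = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "is": 64,
           "was": 64, "chasing": 8, "following": 8}
corpus = [
    "<s> a dog is chasing a cat".split(),
    "<s> the cat was following a dog".split(),
]

c_count, cc_count, wc_count = Counter(), Counter(), Counter()
for sent in corpus:
    for prev, w in zip(sent, sent[1:]):
        cc_count[(cluster[prev], cluster[w])] += 1   # #(c, c')
    for w in sent:
        c_count[cluster[w]] += 1                     # #(c)
        wc_count[(w, cluster[w])] += 1               # #(w, c)

def p_trans(c_next, c):
    """P(c' | c) = #(c, c') / #(c)."""
    return cc_count[(c, c_next)] / c_count[c]

def p_emit(w, c):
    """P(w | c) = #(w, c) / #(c)."""
    return wc_count[(w, c)] / c_count[c]

print(p_trans(46, 3))    # P(cluster 46 | cluster 3): a/the is always followed by a noun here
print(p_emit("dog", 46)) # 0.5 in this toy corpus
```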

  21. max_C LL(θ, C)
      max_C Σ_{i=1}^{n} [log P(C(w_i)|C(w_{i-1})) + log P(w_i|C(w_i))]
        = n Σ_{c=1}^{k} Σ_{c'=1}^{k} q(c, c') log ( q(c, c') / (q(c) q(c')) ) + G,
      where G is a constant and
        q(c, c') = #(c, c') / Σ_{c, c'} #(c, c'),    q(c) = #(c) / Σ_c #(c)
        (c: cluster of w_i, c': cluster of w_{i-1})
      - Σ_c Σ_{c'} q(c, c') log ( q(c, c') / (q(c) q(c')) ) is the mutual information
        between adjacent clusters
      See the class note: http://web.cs.ucla.edu/~kwchang/teaching/NLP16/slides/classnote.pdf
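The sketch below (not part of the slides) evaluates the mutual-information term for a candidate clustering by tabulating adjacent-cluster pairs; the toy corpus is an assumption, and the marginals are taken from the empirical pair distribution, which is one common way to write the same quantity.

```python
import math
from collections import Counter

def adjacent_cluster_mi(corpus, cluster):
    """Mutual information of (C(w_{i-1}), C(w_i)) pairs under the empirical distribution q."""
    pair_counts = Counter()
    for sent in corpus:
        for prev, w in zip(sent, sent[1:]):
            pair_counts[(cluster[prev], cluster[w])] += 1
    total = sum(pair_counts.values())
    q_pair = {pair: n / total for pair, n in pair_counts.items()}
    q_left, q_right = Counter(), Counter()
    for (c, c_next), q in q_pair.items():
        q_left[c] += q
        q_right[c_next] += q
    return sum(q * math.log(q / (q_left[c] * q_right[c_next]))
               for (c, c_next), q in q_pair.items())

corpus = ["a dog is chasing a cat".split(), "the boy was following a rabbit".split()]
good = {"a": 3, "the": 3, "dog": 46, "cat": 46, "boy": 46, "rabbit": 46,
        "is": 64, "was": 64, "chasing": 8, "following": 8}
bad = {w: 1 for w in good}                    # everything in one cluster
print(adjacent_cluster_mi(corpus, good))      # > 0: clusters predict their neighbors
print(adjacent_cluster_mi(corpus, bad))       # 0.0: a single cluster carries no information
```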

  22. Algorithm 1
      - Start with |V| clusters: each word is in its own cluster
      - The goal is to get k clusters
      - We run |V| - k merge steps:
        - Pick 2 clusters and merge them
        - At each step, pick the merge that maximizes LL(θ, C)
      - Cost: O(|V| - k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL
        = O(|V|^5)  (can be improved to O(|V|^3))
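A naive version of Algorithm 1 can be written directly from the description above: repeatedly score every candidate merge and keep the best one. The sketch below (not part of the slides) uses the adjacent-cluster mutual information as the merge score and recomputes it from scratch, so it has the unimproved O(|V|^5) flavor; the toy corpus is an assumption.

```python
import math
from collections import Counter

def objective(corpus, cluster):
    """Mutual information between adjacent clusters (the quantity from slide 21)."""
    pairs = Counter((cluster[a], cluster[b]) for s in corpus for a, b in zip(s, s[1:]))
    total = sum(pairs.values())
    q = {p: n / total for p, n in pairs.items()}
    left, right = Counter(), Counter()
    for (c1, c2), v in q.items():
        left[c1] += v
        right[c2] += v
    return sum(v * math.log(v / (left[c1] * right[c2])) for (c1, c2), v in q.items())

def brown_cluster_naive(corpus, k):
    vocab = sorted({w for s in corpus for w in s})
    cluster = {w: i for i, w in enumerate(vocab)}        # each word in its own cluster
    while len(set(cluster.values())) > k:
        ids = sorted(set(cluster.values()))
        best = None
        for i, a in enumerate(ids):                      # try every pair of clusters
            for b in ids[i + 1:]:
                trial = {w: (a if c == b else c) for w, c in cluster.items()}
                score = objective(corpus, trial)
                if best is None or score > best[0]:
                    best = (score, trial)
        cluster = best[1]                                # keep the best-scoring merge
    return cluster

corpus = ["a dog is chasing a cat".split(), "the boy was following a rabbit".split()]
print(brown_cluster_naive(corpus, 4))
```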

  23. Algorithm 2
      - m: a hyper-parameter; sort words by frequency
      - Take the top m most frequent words and put each of them in its own cluster
        c_1, c_2, c_3, ..., c_m
      - For i = (m+1) ... |V|:
        - Create a new cluster c_{m+1} for the i-th most frequent word (we have m+1 clusters)
        - Choose two clusters from the m+1 clusters based on LL(θ, C) and merge
          ⇒ back to m clusters
      - Carry out (m-1) final merges ⇒ full hierarchy
      - Running time: O(|V| m^2 + n), where n = #words in the corpus
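The control flow of Algorithm 2 (add the next most frequent word as a new cluster, then merge back down to m clusters) can be sketched as follows. This is not from the slides: `merge_cost` is a hypothetical placeholder for the change in the objective (a real implementation would score merges with the mutual-information objective, as in the Algorithm 1 sketch), and the final (m-1) merges that build the full hierarchy are omitted.

```python
from collections import Counter

def brown_cluster_windowed(corpus, m, merge_cost):
    freq = Counter(w for sent in corpus for w in sent)
    words = [w for w, _ in freq.most_common()]           # sort words by frequency
    clusters = [{w} for w in words[:m]]                  # top-m words: one cluster each
    for w in words[m:]:
        clusters.append({w})                             # now m+1 clusters
        # Merge the pair of clusters with the lowest cost -> back to m clusters.
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] |= clusters.pop(j)
    return clusters

corpus = ["a dog is chasing a cat".split(), "the boy was following a rabbit".split()]
# Dummy cost (prefer merging small clusters) just to exercise the control flow.
print(brown_cluster_windowed(corpus, m=4, merge_cost=lambda c1, c2: len(c1) + len(c2)))
```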

  24. Example clusters (Brown+1992)

  25. Example hierarchy (Miller+2004)
