Lecture 6: Representing Words
Kai-Wei Chang, CS @ UCLA, kw@kwchang.net
Course webpage: https://uclanlp.github.io/CS269-17/
Bag-of-Words with N-grams
v N-gram: a contiguous sequence of n tokens from a given piece of text
  http://recognize-speech.com/language-model/n-gram-model/comparison
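(Not from the slides: a minimal Python sketch of extracting n-grams from a tokenized sentence; the whitespace tokenizer and example text are illustrative assumptions.)

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a dog is chasing a cat".split()
print(ngrams(tokens, 1))  # unigrams: [('a',), ('dog',), ('is',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('a', 'dog'), ('dog', 'is'), ...]
print(ngrams(tokens, 3))  # trigrams: [('a', 'dog', 'is'), ...]
```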
Language model
v Probability distributions over sentences (i.e., word sequences): P(W) = P(x_1 x_2 x_3 x_4 … x_n)
v Can use them to generate strings: P(x_n | x_1 x_2 x_3 … x_{n-1})
v Rank possible sentences:
  v P("Today is Tuesday") > P("Tuesday Today is")
  v P("Today is Tuesday") > P("Today is Los Angeles")
N-Gram Models
v Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n)
v Bigram model: P(x_1) P(x_2|x_1) P(x_3|x_2) … P(x_n|x_{n-1})
v Trigram model: P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) … P(x_n|x_{n-1}, x_{n-2})
v N-gram model: P(x_1) P(x_2|x_1) … P(x_n|x_{n-1} x_{n-2} … x_{n-N+1})
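(A toy sketch, not from the slides: scoring a sentence with a bigram model using maximum-likelihood counts. The tiny corpus and the <s> start symbol are illustrative assumptions; a real model would add smoothing and an end-of-sentence symbol.)

```python
from collections import Counter

corpus = [["<s>", "today", "is", "tuesday"],
          ["<s>", "today", "is", "sunday"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))

def bigram_prob(sentence):
    """P(x_1 .. x_n) = prod_i P(x_i | x_{i-1}), with maximum-likelihood estimates."""
    p = 1.0
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]
    return p

print(bigram_prob(["<s>", "today", "is", "tuesday"]))  # 1.0 * 1.0 * 0.5 = 0.5
```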
Random language via n-gram
v http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Collection of n-grams
v https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
N-Gram Viewer
https://books.google.com/ngrams
How to represent words?
v N-gram -- cannot capture word similarity
v Word clusters
  v Brown clustering
  v Part-of-speech tagging
v Continuous space representation
  v Word embedding
Brown Clustering
v Similar to a language model, but the basic unit is "word clusters"
v Intuition: similar words appear in similar contexts
v Recap: bigram language models
  P(x_0, x_1, x_2, …, x_n) = P(x_1|x_0) P(x_2|x_1) … P(x_n|x_{n-1}) = ∏_{i=1}^n P(x_i | x_{i-1})
  x_0 is a dummy word representing the beginning of a sentence
Motivation example
v "a dog is chasing a cat"
v P(x_0, "a", "dog", …, "cat") = P("a"|x_0) P("dog"|"a") … P("cat"|"a")
v Assume every word belongs to a cluster:
  Cluster 46: dog, cat, fox, rabbit, bird, boy
  Cluster 64: chasing, following, biting, …
  Cluster 8: is, was
  Cluster 3: a, the
Motivation example
v Assume every word belongs to a cluster (as on the previous slide)
v "a dog is chasing a cat"
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Motivation example
v Assume every word belongs to a cluster
v "the boy is following a rabbit"
  C3   C46  C8  C64        C3  C46
  the  boy  is  following  a   rabbit
Motivation example
v Assume every word belongs to a cluster
v "a fox was chasing a bird"
  C3  C46  C8   C64      C3  C46
  a   fox  was  chasing  a   bird
Brown Clustering
v Let C(w) denote the cluster that w belongs to
v "a dog is chasing a cat"
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
  e.g., P(C("dog") | C("a")) and P("cat" | C("cat"))
Brown clustering model
v P("a dog is chasing a cat")
  = P(C("a")|C_0) P(C("dog")|C("a")) P(C("is")|C("dog")) …
    × P("a"|C("a")) P("dog"|C("dog")) …
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Brown clustering model
v In general,
  P(x_0, x_1, x_2, …, x_n)
  = P(C(x_1)|C(x_0)) P(C(x_2)|C(x_1)) … P(C(x_n)|C(x_{n-1})) × P(x_1|C(x_1)) P(x_2|C(x_2)) … P(x_n|C(x_n))
  = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
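(A sketch of this factorization, assuming a fixed cluster map C and made-up probability tables; the cluster ids follow the running example, but the numbers are illustrative, not estimated from data.)

```python
def brown_prob(sentence, C, p_trans, p_emit, start="<s>"):
    """P(x_1 .. x_n) = prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    prob = 1.0
    prev_cluster = C[start]
    for w in sentence:
        c = C[w]
        prob *= p_trans[(prev_cluster, c)] * p_emit[(w, c)]
        prev_cluster = c
    return prob

# Toy tables matching the running example (values are made up).
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "is": 8, "was": 8, "chasing": 64}
p_trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 8): 0.7, (8, 64): 0.6, (64, 3): 0.9}
p_emit = {("a", 3): 0.6, ("dog", 46): 0.2, ("is", 8): 0.7, ("chasing", 64): 0.3, ("cat", 46): 0.2}

print(brown_prob(["a", "dog", "is", "chasing", "a", "cat"], C, p_trans, p_emit))
```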
Model parameters
  P(x_0, x_1, x_2, …, x_n) = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
v Parameter set 1: P(C(x_i)|C(x_{i-1}))
v Parameter set 2: P(x_i|C(x_i))
v Parameter set 3: the clustering C(x_i)
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Model parameters
  P(x_0, x_1, x_2, …, x_n) = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
v A vocabulary set V
v A function C: V → {1, 2, 3, …, k}
  v A partition of the vocabulary into k classes
v Conditional probabilities P(c'|c) for c, c' ∈ {1, …, k}
v Conditional probabilities P(w|c) for c ∈ {1, …, k}, w ∈ V
  θ represents the set of conditional probability parameters; C represents the clustering
Log likelihood
  LL(θ, C) = log P(x_0, x_1, x_2, …, x_n | θ, C)
           = log ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
           = Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
v Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ∈Θ} LL(θ, C)
  2. max_C LL(θ, C)
max_{θ∈Θ} LL(θ, C)
  LL(θ, C) = Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
v P(c'|c) = #(c, c') / #(c)
v P(w|c) = #(w, c) / #(c)
  This part is the same as training a POS tagging model.
  See section 9.2: http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf
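(A minimal sketch of these count-based estimates for a fixed clustering; the corpus format and helper name are my assumptions, not from the slides.)

```python
from collections import Counter

def estimate_params(corpus, C):
    """MLE for a fixed clustering C: P(c'|c) = #(c, c') / #(c), P(w|c) = #(w, c) / #(c)."""
    cluster_count = Counter()  # #(c)
    trans_count = Counter()    # #(c, c') over adjacent positions
    emit_count = Counter()     # #(w, c)
    for sent in corpus:
        clusters = [C[w] for w in sent]
        for w, c in zip(sent, clusters):
            cluster_count[c] += 1
            emit_count[(w, c)] += 1
        for c_prev, c_cur in zip(clusters[:-1], clusters[1:]):
            trans_count[(c_prev, c_cur)] += 1
    p_trans = {(c, c2): n / cluster_count[c] for (c, c2), n in trans_count.items()}
    p_emit = {(w, c): n / cluster_count[c] for (w, c), n in emit_count.items()}
    return p_trans, p_emit
```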
max_C LL(θ, C)
  max_C Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
  = n Σ_{c=1}^k Σ_{c'=1}^k p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G,
  where G is a constant
  See class note: http://web.cs.ucla.edu/~kwchang/teaching/NLP16/slides/classnote.pdf
v Here, p(c, c') = #(c, c') / Σ_c Σ_{c'} #(c, c'),  p(c) = #(c) / Σ_c #(c)
  c: cluster of x_i,  c': cluster of x_{i-1}
v Σ_{c,c'} p(c, c') log [ p(c, c') / (p(c) p(c')) ] is the mutual information between the clusters of adjacent words
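(A sketch of this mutual-information quantity, estimated from adjacent-cluster counts following the definitions of p(c, c') and p(c) on this slide; the function name and corpus format are my assumptions.)

```python
import math
from collections import Counter

def cluster_mutual_information(corpus, C):
    """sum_{c,c'} p(c, c') * log[ p(c, c') / (p(c) p(c')) ], with p estimated from counts."""
    pair_count, single_count = Counter(), Counter()
    for sent in corpus:
        clusters = [C[w] for w in sent]
        single_count.update(clusters)
        pair_count.update(zip(clusters[:-1], clusters[1:]))
    n_pairs, n_tokens = sum(pair_count.values()), sum(single_count.values())
    mi = 0.0
    for (c, c2), count in pair_count.items():
        p_pair = count / n_pairs
        p_c, p_c2 = single_count[c] / n_tokens, single_count[c2] / n_tokens
        mi += p_pair * math.log(p_pair / (p_c * p_c2))
    return mi
```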
Algorithm 1
v Start with |V| clusters: each word is in its own cluster
v The goal is to get k clusters
v We run |V|−k merge steps:
  v Pick 2 clusters and merge them
  v Each step, pick the merge maximizing LL(θ, C)
v Cost? O(|V|−k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL = O(|V|^5)
  (can be improved to O(|V|^3))
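(A naive sketch of Algorithm 1, assuming a scoring function such as cluster_mutual_information above; this is the unoptimized version written for clarity, not the improved O(|V|^3) variant.)

```python
def brown_cluster_naive(corpus, k, score):
    """Greedy agglomerative clustering: start with one cluster per word,
    repeatedly merge the pair of clusters that keeps the objective highest."""
    vocab = {w for sent in corpus for w in sent}
    C = {w: i for i, w in enumerate(sorted(vocab))}  # each word in its own cluster
    clusters = set(C.values())
    while len(clusters) > k:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                merged = {w: (a if c == b else c) for w, c in C.items()}  # merge b into a
                s = score(corpus, merged)
                if best is None or s > best[0]:
                    best = (s, merged, b)
        _, C, removed = best
        clusters.discard(removed)
    return C

# e.g. C = brown_cluster_naive(corpus, k=4, score=cluster_mutual_information)
```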
Algorithm 2
v m: a hyper-parameter; sort words by frequency
v Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, … c_m
v For i = (m+1) … |V|:
  v Create a new cluster c_{m+1} for the i-th most frequent word (we now have m+1 clusters)
  v Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them ⇒ back to m clusters
v Carry out (m−1) final merges ⇒ full hierarchy
v Running time: O(|V| m^2 + n), where n = #words in the corpus
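(A rough sketch of Algorithm 2 under the same assumptions: words enter in frequency order, at most m+1 clusters are active, and each insertion is followed by the best merge. The best_merge helper and the filtering of not-yet-clustered words are my simplifications, not the original implementation.)

```python
from collections import Counter

def brown_cluster_windowed(corpus, m, score):
    """Keep at most m+1 active clusters; after each new word gets its own cluster,
    merge the pair of active clusters that keeps the objective highest."""
    freq = Counter(w for sent in corpus for w in sent)
    words = [w for w, _ in freq.most_common()]        # vocabulary sorted by frequency
    C = {w: i for i, w in enumerate(words[:m])}       # top-m words: one cluster each
    next_id = m
    for w in words[m:]:
        C[w] = next_id                                # new cluster c_{m+1}
        next_id += 1
        visible = [[v for v in sent if v in C] for sent in corpus]  # ignore unclustered words
        C = best_merge(C, visible, score)             # back to m clusters
    # (m - 1) further merges over the surviving clusters would yield the full hierarchy
    return C

def best_merge(C, corpus, score):
    """Merge the pair of clusters that maximizes the scoring function."""
    clusters = sorted(set(C.values()))
    best = None
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            merged = {w: (a if c == b else c) for w, c in C.items()}
            s = score(corpus, merged)
            if best is None or s > best[0]:
                best = (s, merged)
    return best[1]
```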
Example clusters (Brown+1992)
Example Hierarchy (Miller+2004)