Lecture 6: Representing Words
Kai-Wei Chang, CS @ UCLA, kw@kwchang.net
Course webpage: https://uclanlp.github.io/CS269-17/
Bag-of-Words with N-grams
v N-gram: a contiguous sequence of n tokens from a given piece of text
  http://recognize-speech.com/language-model/n-gram-model/comparison
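(Not from the slides: a minimal Python sketch of extracting n-grams from a tokenized sentence; the whitespace tokenizer and example text are illustrative assumptions.)

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a dog is chasing a cat".split()
print(ngrams(tokens, 1))  # unigrams: [('a',), ('dog',), ('is',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('a', 'dog'), ('dog', 'is'), ...]
print(ngrams(tokens, 3))  # trigrams: [('a', 'dog', 'is'), ...]
```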
Language model
v Probability distributions over sentences (i.e., word sequences): P(W) = P(x_1 x_2 x_3 x_4 … x_n)
v Can use them to generate strings: P(x_n | x_1 x_2 x_3 … x_{n-1})
v Rank possible sentences:
  v P("Today is Tuesday") > P("Tuesday Today is")
  v P("Today is Tuesday") > P("Today is Los Angeles")
N-Gram Models
v Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n)
v Bigram model: P(x_1) P(x_2|x_1) P(x_3|x_2) … P(x_n|x_{n-1})
v Trigram model: P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) … P(x_n|x_{n-1}, x_{n-2})
v N-gram model: P(x_1) P(x_2|x_1) … P(x_n|x_{n-1} x_{n-2} … x_{n-N+1})
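(A toy sketch, not from the slides: scoring a sentence with a bigram model using maximum-likelihood counts. The tiny corpus and the <s> start symbol are illustrative assumptions; a real model would add smoothing and an end-of-sentence symbol.)

```python
from collections import Counter

corpus = [["<s>", "today", "is", "tuesday"],
          ["<s>", "today", "is", "sunday"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))

def bigram_prob(sentence):
    """P(x_1 .. x_n) = prod_i P(x_i | x_{i-1}), with maximum-likelihood estimates."""
    p = 1.0
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]
    return p

print(bigram_prob(["<s>", "today", "is", "tuesday"]))  # 1.0 * 1.0 * 0.5 = 0.5
```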
Random language via n-gram
v http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Collection of n-grams
v https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
N-Gram Viewer
https://books.google.com/ngrams
How to represent words?
v N-gram -- cannot capture word similarity
v Word clusters
  v Brown clustering
  v Part-of-speech tagging
v Continuous space representation
  v Word embedding
Brown Clustering
v Similar to a language model, but the basic unit is "word clusters"
v Intuition: similar words appear in similar contexts
v Recap: bigram language models
  P(x_0, x_1, x_2, …, x_n) = P(x_1|x_0) P(x_2|x_1) … P(x_n|x_{n-1}) = ∏_{i=1}^n P(x_i | x_{i-1})
  x_0 is a dummy word representing the beginning of a sentence
Motivation example
v "a dog is chasing a cat"
v P(x_0, "a", "dog", …, "cat") = P("a"|x_0) P("dog"|"a") … P("cat"|"a")
v Assume every word belongs to a cluster:
  Cluster 46: dog, cat, fox, rabbit, bird, boy
  Cluster 64: chasing, following, biting, …
  Cluster 8: is, was
  Cluster 3: a, the
Motivation example
v Assume every word belongs to a cluster (as on the previous slide)
v "a dog is chasing a cat"
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Motivation example
v Assume every word belongs to a cluster
v "the boy is following a rabbit"
  C3   C46  C8  C64        C3  C46
  the  boy  is  following  a   rabbit
Motivation example
v Assume every word belongs to a cluster
v "a fox was chasing a bird"
  C3  C46  C8   C64      C3  C46
  a   fox  was  chasing  a   bird
Brown Clustering
v Let C(w) denote the cluster that w belongs to
v "a dog is chasing a cat"
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
  e.g., P(C("dog") | C("a")) and P("cat" | C("cat"))
Brown clustering model
v P("a dog is chasing a cat")
  = P(C("a")|C_0) P(C("dog")|C("a")) P(C("is")|C("dog")) …
    × P("a"|C("a")) P("dog"|C("dog")) …
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Brown clustering model
v In general,
  P(x_0, x_1, x_2, …, x_n)
  = P(C(x_1)|C(x_0)) P(C(x_2)|C(x_1)) … P(C(x_n)|C(x_{n-1})) × P(x_1|C(x_1)) P(x_2|C(x_2)) … P(x_n|C(x_n))
  = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
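(A sketch of this factorization, assuming a fixed cluster map C and made-up probability tables; the cluster ids follow the running example, but the numbers are illustrative, not estimated from data.)

```python
def brown_prob(sentence, C, p_trans, p_emit, start="<s>"):
    """P(x_1 .. x_n) = prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    prob = 1.0
    prev_cluster = C[start]
    for w in sentence:
        c = C[w]
        prob *= p_trans[(prev_cluster, c)] * p_emit[(w, c)]
        prev_cluster = c
    return prob

# Toy tables matching the running example (values are made up).
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "is": 8, "was": 8, "chasing": 64}
p_trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 8): 0.7, (8, 64): 0.6, (64, 3): 0.9}
p_emit = {("a", 3): 0.6, ("dog", 46): 0.2, ("is", 8): 0.7, ("chasing", 64): 0.3, ("cat", 46): 0.2}

print(brown_prob(["a", "dog", "is", "chasing", "a", "cat"], C, p_trans, p_emit))
```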
Model parameters
  P(x_0, x_1, x_2, …, x_n) = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
v Parameter set 1: P(C(x_i)|C(x_{i-1}))
v Parameter set 2: P(x_i|C(x_i))
v Parameter set 3: the clustering C(x_i)
  C3  C46  C8  C64      C3  C46
  a   dog  is  chasing  a   cat
Model parameters
  P(x_0, x_1, x_2, …, x_n) = ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
v A vocabulary set V
v A function C: V → {1, 2, 3, …, k}
  v A partition of the vocabulary into k classes
v Conditional probabilities P(c'|c) for c, c' ∈ {1, …, k}
v Conditional probabilities P(w|c) for c ∈ {1, …, k}, w ∈ V
  θ represents the set of conditional probability parameters; C represents the clustering
Log likelihood
  LL(θ, C) = log P(x_0, x_1, x_2, …, x_n | θ, C)
           = log ∏_{i=1}^n P(C(x_i)|C(x_{i-1})) P(x_i|C(x_i))
           = Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
v Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ∈Θ} LL(θ, C)
  2. max_C LL(θ, C)
max_{θ∈Θ} LL(θ, C)
  LL(θ, C) = Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
v P(c'|c) = #(c, c') / #(c)
v P(w|c) = #(w, c) / #(c)
  This part is the same as training a POS tagging model.
  See section 9.2: http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf
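(A minimal sketch of these count-based estimates for a fixed clustering; the corpus format and helper name are my assumptions, not from the slides.)

```python
from collections import Counter

def estimate_params(corpus, C):
    """MLE for a fixed clustering C: P(c'|c) = #(c, c') / #(c), P(w|c) = #(w, c) / #(c)."""
    cluster_count = Counter()  # #(c)
    trans_count = Counter()    # #(c, c') over adjacent positions
    emit_count = Counter()     # #(w, c)
    for sent in corpus:
        clusters = [C[w] for w in sent]
        for w, c in zip(sent, clusters):
            cluster_count[c] += 1
            emit_count[(w, c)] += 1
        for c_prev, c_cur in zip(clusters[:-1], clusters[1:]):
            trans_count[(c_prev, c_cur)] += 1
    p_trans = {(c, c2): n / cluster_count[c] for (c, c2), n in trans_count.items()}
    p_emit = {(w, c): n / cluster_count[c] for (w, c), n in emit_count.items()}
    return p_trans, p_emit
```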
max_C LL(θ, C)
  max_C Σ_{i=1}^n [ log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)) ]
  = n Σ_{c=1}^k Σ_{c'=1}^k p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G,
  where G is a constant
  See class note: http://web.cs.ucla.edu/~kwchang/teaching/NLP16/slides/classnote.pdf
v Here, p(c, c') = #(c, c') / Σ_c Σ_{c'} #(c, c'),  p(c) = #(c) / Σ_c #(c)
  c: cluster of x_i,  c': cluster of x_{i-1}
v Σ_{c,c'} p(c, c') log [ p(c, c') / (p(c) p(c')) ] is the mutual information between the clusters of adjacent words
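(A sketch of this mutual-information quantity, estimated from adjacent-cluster counts following the definitions of p(c, c') and p(c) on this slide; the function name and corpus format are my assumptions.)

```python
import math
from collections import Counter

def cluster_mutual_information(corpus, C):
    """sum_{c,c'} p(c, c') * log[ p(c, c') / (p(c) p(c')) ], with p estimated from counts."""
    pair_count, single_count = Counter(), Counter()
    for sent in corpus:
        clusters = [C[w] for w in sent]
        single_count.update(clusters)
        pair_count.update(zip(clusters[:-1], clusters[1:]))
    n_pairs, n_tokens = sum(pair_count.values()), sum(single_count.values())
    mi = 0.0
    for (c, c2), count in pair_count.items():
        p_pair = count / n_pairs
        p_c, p_c2 = single_count[c] / n_tokens, single_count[c2] / n_tokens
        mi += p_pair * math.log(p_pair / (p_c * p_c2))
    return mi
```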
Algorithm 1
v Start with |V| clusters: each word is in its own cluster
v The goal is to get k clusters
v We run |V|−k merge steps:
  v Pick 2 clusters and merge them
  v Each step, pick the merge maximizing LL(θ, C)
v Cost? O(|V|−k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL = O(|V|^5)
  (can be improved to O(|V|^3))
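(A naive sketch of Algorithm 1, assuming a scoring function such as cluster_mutual_information above; this is the unoptimized version written for clarity, not the improved O(|V|^3) variant.)

```python
def brown_cluster_naive(corpus, k, score):
    """Greedy agglomerative clustering: start with one cluster per word,
    repeatedly merge the pair of clusters that keeps the objective highest."""
    vocab = {w for sent in corpus for w in sent}
    C = {w: i for i, w in enumerate(sorted(vocab))}  # each word in its own cluster
    clusters = set(C.values())
    while len(clusters) > k:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                merged = {w: (a if c == b else c) for w, c in C.items()}  # merge b into a
                s = score(corpus, merged)
                if best is None or s > best[0]:
                    best = (s, merged, b)
        _, C, removed = best
        clusters.discard(removed)
    return C

# e.g. C = brown_cluster_naive(corpus, k=4, score=cluster_mutual_information)
```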
Algorithm 2
v m: a hyper-parameter; sort words by frequency
v Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, … c_m
v For i = (m+1) … |V|:
  v Create a new cluster c_{m+1} for the i-th most frequent word (we now have m+1 clusters)
  v Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them ⇒ back to m clusters
v Carry out (m−1) final merges ⇒ full hierarchy
v Running time: O(|V| m^2 + n), where n = #words in the corpus
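(A rough sketch of Algorithm 2 under the same assumptions: words enter in frequency order, at most m+1 clusters are active, and each insertion is followed by the best merge. The best_merge helper and the filtering of not-yet-clustered words are my simplifications, not the original implementation.)

```python
from collections import Counter

def brown_cluster_windowed(corpus, m, score):
    """Keep at most m+1 active clusters; after each new word gets its own cluster,
    merge the pair of active clusters that keeps the objective highest."""
    freq = Counter(w for sent in corpus for w in sent)
    words = [w for w, _ in freq.most_common()]        # vocabulary sorted by frequency
    C = {w: i for i, w in enumerate(words[:m])}       # top-m words: one cluster each
    next_id = m
    for w in words[m:]:
        C[w] = next_id                                # new cluster c_{m+1}
        next_id += 1
        visible = [[v for v in sent if v in C] for sent in corpus]  # ignore unclustered words
        C = best_merge(C, visible, score)             # back to m clusters
    # (m - 1) further merges over the surviving clusters would yield the full hierarchy
    return C

def best_merge(C, corpus, score):
    """Merge the pair of clusters that maximizes the scoring function."""
    clusters = sorted(set(C.values()))
    best = None
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            merged = {w: (a if c == b else c) for w, c in C.items()}
            s = score(corpus, merged)
            if best is None or s > best[0]:
                best = (s, merged)
    return best[1]
```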
Example clusters (Brown+1992)
Example Hierarchy (Miller+2004)