Lecture 6: Vector Space Model
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
v How to represent a word, a sentence, or a document?
v How to infer the relationship among words?
v We focus on “semantics”: distributional semantics
v What is the meaning of “life”?
How to represent a word
v Naïve way: represent words as atomic symbols: student, talk, university
  v Used in n-gram language models and logical analysis
v Represent a word as a “one-hot” vector, e.g., 𝑤_university = [ 0 0 0 1 0 … 0 ] with one dimension per vocabulary word (egg, student, talk, university, happy, buy)
v How large is this vector?
  v PTB data: ~50k words; Google 1T data: 13M words
v 𝑤 ⋅ 𝑣 = ?
Issues?
v Dimensionality is large; the vector is sparse
v No similarity:
  𝑤_happy = [0 0 0 1 0 … 0]
  𝑤_sad = [0 0 1 0 0 … 0]
  𝑤_king = [1 0 0 0 0 … 0]
  𝑤_happy ⋅ 𝑤_sad = 𝑤_happy ⋅ 𝑤_king = 0
v Cannot represent new words
v Any idea?
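A minimal sketch (not from the slides) of why one-hot vectors carry no similarity, using the toy vocabulary from the previous slide:

import numpy as np

# Toy vocabulary from the earlier slide; each word gets its own dimension.
vocab = ["egg", "student", "talk", "university", "happy", "buy"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Any two distinct words have dot product 0, so related and unrelated
# word pairs look equally dissimilar.
print(one_hot("happy") @ one_hot("talk"))   # 0.0
print(one_hot("happy") @ one_hot("happy"))  # 1.0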
Idea 1: Taxonomy (Word category)
What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
Requires human labor
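A possible follow-up (not on the slide): WordNet can also return a numeric score via path_similarity, based on the shortest path between synsets in the hypernym hierarchy; the exact values depend on the WordNet version.

>>> right.path_similarity(minke)     # close relatives: higher score
0.25
>>> right.path_similarity(tortoise)  # distant relatives: lower score
0.07692307692307693
>>> right.path_similarity(novel)     # unrelated concepts: near zero
0.043478260869565216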
Taxonomy (Word category)
v Synonym, hypernym (Is-A), hyponym
Idea 2: Similarity = Clustering
Cluster n-gram model
v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information
v See “Implementation of the Brown hierarchical word clustering algorithm” by Percy Liang
Idea 3: Distributional representation
v “A word is characterized by the company it keeps” -- John R. Firth, 1957
v Linguistic items with similar distributions have similar meanings
v i.e., words that occur in the same contexts ⇒ similar meaning
Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meanings are vectors of “basic concepts”
v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance?
  𝑤_king = [0.8 0.9 0.1 0 … ]
  𝑤_queen = [0.8 0.1 0.8 0 … ]
  𝑤_apple = [0.1 0.2 0.1 0.8 … ]
  (dimensions: royalty, masculinity, femininity, eatable)
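A quick sketch (toy weights from this slide) of how such concept-based vectors give graded similarity, unlike one-hot vectors:

import numpy as np

# Dimensions: royalty, masculinity, femininity, eatable (toy weights).
w = {
    "king":  np.array([0.8, 0.9, 0.1, 0.0]),
    "queen": np.array([0.8, 0.1, 0.8, 0.0]),
    "apple": np.array([0.1, 0.2, 0.1, 0.8]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(w["king"], w["queen"]))  # higher: both score high on royalty
print(cosine(w["king"], w["apple"]))  # lower: little overlap in concepts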
An illustration of the vector space model
v Words w_1-w_5 plotted along the axes Royalty, Masculine, and Eatable; |D_2 - D_4| marks the distance between the vectors for w_2 and w_4
Semantic similarity in 2D
v Example: Home Depot products embedded in a 2D semantic space
Capture the structure of words
v Example from GloVe
How to use word vectors?
Pre-trained word vectors
v Google word2vec vectors (Google News): https://code.google.com/archive/p/word2vec
  v 100 billion tokens, 300 dimensions, 3M words
v GloVe project: http://nlp.stanford.edu/projects/glove/
  v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B)
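A hedged sketch of loading pre-trained vectors from a GloVe-style text file (one word followed by its values per line); the file name in the comment is taken from the GloVe download and is only an assumption here:

import numpy as np

def load_glove(path):
    """Read a GloVe text file where each line is 'word v1 v2 ... vd'."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# vectors = load_glove("glove.6B.300d.txt")
# vectors["university"].shape  ->  (300,)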
Distance/similarity
v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity: cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ⋅ ||𝑤||)
  v 𝑣 / ||𝑣|| is a unit vector
  v word vectors are normalized by length
v Euclidean distance: ||𝑣 − 𝑤||_2
v Inner product: 𝑣 ⋅ 𝑤
  v same as cosine similarity if vectors are normalized
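A short sketch comparing the three measures on two arbitrary example vectors:

import numpy as np

v = np.array([0.8, 0.9, 0.1, 0.0])
w = np.array([0.8, 0.1, 0.8, 0.0])

cos = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))   # cosine similarity
eucl = np.linalg.norm(v - w)                            # Euclidean distance
inner = v @ w                                           # inner product
print(cos, eucl, inner)

# After length-normalization, the inner product equals cosine similarity,
# and squared Euclidean distance is a monotone function of it (2 - 2*cos).
v_hat, w_hat = v / np.linalg.norm(v), w / np.linalg.norm(w)
print(np.isclose(v_hat @ w_hat, cos))                               # True
print(np.isclose(np.linalg.norm(v_hat - w_hat) ** 2, 2 - 2 * cos))  # True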
Distance/similarity (cont.)
v Choosing the right similarity metric is important
v See “Linguistic Regularities in Sparse and Explicit Word Representations”, Levy and Goldberg, CoNLL 2014
Word similarity DEMO
v http://msrcstr.cloudapp.net/
Word analogy
v man : woman ∼ uncle : aunt, e.g., 𝑤_uncle − 𝑤_man + 𝑤_woman ∼ 𝑤_aunt
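A sketch of the analogy computation, assuming a dictionary `vectors` of word embeddings (e.g., built by the GloVe loader sketched earlier); the function name is mine:

import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Rank words w by cos(v, w) where v = vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        scores.append((vec @ target / np.linalg.norm(vec), word))
    return sorted(scores, reverse=True)[:topn]

# analogy("uncle", "man", "woman", vectors)  ->  typically "aunt" with good embeddings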
From words to phrases
Neural Language Models
How to “learn” word vectors?
v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance? (e.g., cosine similarity)
Back to distributional representation
v Encode relational data in a matrix
  v Co-occurrence (e.g., from a general corpus)
  v Bag-of-words model: documents (clusters) as the basis for the vector space
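A sketch of the bag-of-words construction on a tiny made-up corpus (the documents below are assumptions, not course data):

import numpy as np
from collections import Counter

docs = [
    "the student will talk at the university",
    "buy an egg and the happy student will cook",
    "the university will buy new buildings",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Rows = words, columns = documents; entry = count of the word in the document.
C = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        C[word_to_idx[word], j] = count

# Each row of C is a document-based vector representation of that word.
print(C[word_to_idx["university"]])   # [1. 0. 1.]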
Back to distributional representation (cont.)
v Skip-grams: co-occurrence with words in a local context window
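A sketch of skip-gram-style co-occurrence counting with a symmetric context window (the window size and sentence are assumptions):

from collections import Counter, defaultdict

sentence = "the student will talk at the university".split()
window = 2
cooc = defaultdict(Counter)   # cooc[word][context_word] = count

for i, word in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[word][sentence[j]] += 1

# Context counts for "talk": the words observed within 2 positions of it.
print(dict(cooc["talk"]))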
Back to distributional representation (cont.)
v From a taxonomy (e.g., WordNet, a thesaurus)
v Input: synonyms from a thesaurus
  Joyfulness: joy, gladden
  Sad: sorrow, sadden

                           joy   gladden   sorrow   sadden   goodwill
  Group 1: “joyfulness”     1       1        0        0         0
  Group 2: “sad”            0       0        1        1         0
  Group 3: “affection”      0       0        0        0         1
Back to distributional representation (cont.)
v Pros and cons of this representation?
v Cosine similarity? (see the sketch below)
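A quick check (a sketch, using the 0/1 columns of the table above) of what cosine similarity gives in this representation:

import numpy as np

# Each word's vector is its column over the three synonym groups.
w = {
    "joy":      np.array([1, 0, 0]),
    "gladden":  np.array([1, 0, 0]),
    "sorrow":   np.array([0, 1, 0]),
    "goodwill": np.array([0, 0, 1]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(w["joy"], w["gladden"]))  # 1.0: same synonym group
print(cosine(w["joy"], w["sorrow"]))   # 0.0: different groups, even though related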
Problems?
v Number of basic concepts is large
v Basis is not orthogonal (the dimensions are correlated)
v Some function words are too frequent (e.g., “the”)
  v Syntax has too much impact
  v E.g., TF-IDF weighting can be applied (see the sketch below)
  v E.g., skip-grams: scale counts by distance to the target word
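A sketch of TF-IDF reweighting applied to a word-by-document count matrix (such as the matrix C built earlier), which damps very frequent function words:

import numpy as np

def tfidf(C):
    """C: words-by-documents count matrix. Returns the TF-IDF weighted matrix."""
    n_docs = C.shape[1]
    df = (C > 0).sum(axis=1)                  # document frequency of each word
    idf = np.log(n_docs / np.maximum(df, 1))  # rare words get a larger weight
    return C * idf[:, None]                   # scale each word's row by its IDF

# A word appearing in every document gets idf = log(1) = 0, so its row is zeroed out.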
Latent Semantic Analysis (LSA)
v Data representation
  v Encode single-relational data in a matrix
  v Co-occurrence (e.g., document-term matrix, skip-gram counts)
  v Synonyms (e.g., from a thesaurus)
v Factorization
  v Apply SVD to the matrix to find latent components
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
v For an m×n matrix A, there exists a factorization A = U Σ V^T
  v U and V are orthogonal matrices
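A small sketch of the factorization with NumPy (the matrix here is random, just to check the stated properties):

import numpy as np

A = np.random.rand(6, 4)                         # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))       # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))           # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))         # columns of V are orthonormal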
Low-rank Approximation
v Idea: store the most important information in a small number of dimensions (e.g., 100-1000)
v SVD can be used to compute the optimal low-rank approximation
  v Set the n − r smallest singular values to zero
v Similar words map to similar locations in the low-dimensional space
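Continuing in the same vein, a self-contained sketch of the rank-r truncation described above:

import numpy as np

A = np.random.rand(6, 4)                 # stand-in for a word-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
s_r = s.copy()
s_r[r:] = 0.0                            # zero out the n - r smallest singular values
A_r = U @ np.diag(s_r) @ Vt              # best rank-r approximation (Frobenius norm)

# Each word (row of A) can equivalently be represented in r dimensions:
word_vectors = U[:, :r] * s[:r]
print(np.linalg.norm(A - A_r))           # approximation error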
Latent Semantic Analysis (LSA)
v Factorization
  v Apply SVD to the matrix to find latent components
LSA example
v Original matrix C
(Example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval)
LSA example
v SVD: C = U Σ V^T
LSA example
v Original matrix C
v Dimension reduction: C ≈ U Σ V^T, keeping only the largest singular values in Σ
LSA example
v Original matrix C vs. reconstructed matrix C_2
v What is the similarity between “ship” and “boat”?