Lecture 6: Vector Space Model
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
v How to represent a word, a sentence, or a document?
v How to infer the relationship among words?
v We focus on “semantics”: distributional semantics
v What is the meaning of “life”?
How to represent a word
v Naïve way: represent words as atomic symbols: student, talk, university
  v Used in n-gram language models and logical analysis
v Represent a word as a “one-hot” vector, e.g., 𝑤_university = [ 0 0 0 1 0 … 0 ] with one dimension per vocabulary word (egg, student, talk, university, happy, buy)
v How large is this vector?
  v PTB data: ~50k words; Google 1T data: 13M words
v 𝑤 ⋅ 𝑣 = ?
Issues?
v Dimensionality is large; the vector is sparse
v No similarity:
  𝑤_happy = [0 0 0 1 0 … 0]
  𝑤_sad = [0 0 1 0 0 … 0]
  𝑤_king = [1 0 0 0 0 … 0]
  𝑤_happy ⋅ 𝑤_sad = 𝑤_happy ⋅ 𝑤_king = 0
v Cannot represent new words
v Any idea?
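A minimal sketch (not from the slides) of why one-hot vectors carry no similarity, using the toy vocabulary from the previous slide:

import numpy as np

# Toy vocabulary from the earlier slide; each word gets its own dimension.
vocab = ["egg", "student", "talk", "university", "happy", "buy"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Any two distinct words have dot product 0, so related and unrelated
# word pairs look equally dissimilar.
print(one_hot("happy") @ one_hot("talk"))   # 0.0
print(one_hot("happy") @ one_hot("happy"))  # 1.0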
Idea 1: Taxonomy (Word category)
What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
Requires human labor
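A possible follow-up (not on the slide): WordNet can also return a numeric score via path_similarity, based on the shortest path between synsets in the hypernym hierarchy; the exact values depend on the WordNet version.

>>> right.path_similarity(minke)     # close relatives: higher score
0.25
>>> right.path_similarity(tortoise)  # distant relatives: lower score
0.07692307692307693
>>> right.path_similarity(novel)     # unrelated concepts: near zero
0.043478260869565216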
Taxonomy (Word category)
v Synonym, hypernym (Is-A), hyponym
Idea 2: Similarity = Clustering
Cluster n-gram model
v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information
v See “Implementation of the Brown hierarchical word clustering algorithm” by Percy Liang
Idea 3: Distributional representation
v “A word is characterized by the company it keeps” -- John R. Firth, 1957
v Linguistic items with similar distributions have similar meanings
v i.e., words that occur in the same contexts ⇒ similar meaning
Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meanings are vectors of “basic concepts”
v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance?
  𝑤_king = [0.8 0.9 0.1 0 … ]
  𝑤_queen = [0.8 0.1 0.8 0 … ]
  𝑤_apple = [0.1 0.2 0.1 0.8 … ]
  (dimensions: royalty, masculinity, femininity, eatable)
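A quick sketch (toy weights from this slide) of how such concept-based vectors give graded similarity, unlike one-hot vectors:

import numpy as np

# Dimensions: royalty, masculinity, femininity, eatable (toy weights).
w = {
    "king":  np.array([0.8, 0.9, 0.1, 0.0]),
    "queen": np.array([0.8, 0.1, 0.8, 0.0]),
    "apple": np.array([0.1, 0.2, 0.1, 0.8]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(w["king"], w["queen"]))  # higher: both score high on royalty
print(cosine(w["king"], w["apple"]))  # lower: little overlap in concepts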
An illustration of the vector space model
v Words w_1-w_5 plotted along the axes Royalty, Masculine, and Eatable; |D_2 - D_4| marks the distance between the vectors for w_2 and w_4
Semantic similarity in 2D
v Example: Home Depot products embedded in a 2D semantic space
Capture the structure of words
v Example from GloVe
How to use word vectors?
Pre-trained word vectors
v Google word2vec vectors (Google News): https://code.google.com/archive/p/word2vec
  v 100 billion tokens, 300 dimensions, 3M words
v GloVe project: http://nlp.stanford.edu/projects/glove/
  v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B)
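A hedged sketch of loading pre-trained vectors from a GloVe-style text file (one word followed by its values per line); the file name in the comment is taken from the GloVe download and is only an assumption here:

import numpy as np

def load_glove(path):
    """Read a GloVe text file where each line is 'word v1 v2 ... vd'."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# vectors = load_glove("glove.6B.300d.txt")
# vectors["university"].shape  ->  (300,)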
Distance/similarity
v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity: cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ⋅ ||𝑤||)
  v 𝑣 / ||𝑣|| is a unit vector
  v word vectors are normalized by length
v Euclidean distance: ||𝑣 − 𝑤||_2
v Inner product: 𝑣 ⋅ 𝑤
  v same as cosine similarity if vectors are normalized
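A short sketch comparing the three measures on two arbitrary example vectors:

import numpy as np

v = np.array([0.8, 0.9, 0.1, 0.0])
w = np.array([0.8, 0.1, 0.8, 0.0])

cos = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))   # cosine similarity
eucl = np.linalg.norm(v - w)                            # Euclidean distance
inner = v @ w                                           # inner product
print(cos, eucl, inner)

# After length-normalization, the inner product equals cosine similarity,
# and squared Euclidean distance is a monotone function of it (2 - 2*cos).
v_hat, w_hat = v / np.linalg.norm(v), w / np.linalg.norm(w)
print(np.isclose(v_hat @ w_hat, cos))                               # True
print(np.isclose(np.linalg.norm(v_hat - w_hat) ** 2, 2 - 2 * cos))  # True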
Distance/similarity (cont.)
v Choosing the right similarity metric is important
v See “Linguistic Regularities in Sparse and Explicit Word Representations”, Levy and Goldberg, CoNLL 2014
Word similarity DEMO
v http://msrcstr.cloudapp.net/
Word analogy
v man : woman ∼ uncle : aunt, e.g., 𝑤_uncle − 𝑤_man + 𝑤_woman ∼ 𝑤_aunt
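A sketch of the analogy computation, assuming a dictionary `vectors` of word embeddings (e.g., built by the GloVe loader sketched earlier); the function name is mine:

import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Rank words w by cos(v, w) where v = vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        scores.append((vec @ target / np.linalg.norm(vec), word))
    return sorted(scores, reverse=True)[:topn]

# analogy("uncle", "man", "woman", vectors)  ->  typically "aunt" with good embeddings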
From words to phrases
Neural Language Models
How to “learn” word vectors?
v What are the “basic concepts”?
v How to assign weights?
v How to define the similarity/distance? (e.g., cosine similarity)
Back to distributional representation
v Encode relational data in a matrix
  v Co-occurrence (e.g., from a general corpus)
  v Bag-of-words model: documents (clusters) as the basis for the vector space
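A sketch of the bag-of-words construction on a tiny made-up corpus (the documents below are assumptions, not course data):

import numpy as np
from collections import Counter

docs = [
    "the student will talk at the university",
    "buy an egg and the happy student will cook",
    "the university will buy new buildings",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Rows = words, columns = documents; entry = count of the word in the document.
C = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        C[word_to_idx[word], j] = count

# Each row of C is a document-based vector representation of that word.
print(C[word_to_idx["university"]])   # [1. 0. 1.]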
Back to distributional representation (cont.)
v Skip-grams: co-occurrence with words in a local context window
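A sketch of skip-gram-style co-occurrence counting with a symmetric context window (the window size and sentence are assumptions):

from collections import Counter, defaultdict

sentence = "the student will talk at the university".split()
window = 2
cooc = defaultdict(Counter)   # cooc[word][context_word] = count

for i, word in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[word][sentence[j]] += 1

# Context counts for "talk": the words observed within 2 positions of it.
print(dict(cooc["talk"]))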
Back to distributional representation (cont.)
v From a taxonomy (e.g., WordNet, a thesaurus)
v Input: synonyms from a thesaurus
  Joyfulness: joy, gladden
  Sad: sorrow, sadden

                           joy   gladden   sorrow   sadden   goodwill
  Group 1: “joyfulness”     1       1        0        0         0
  Group 2: “sad”            0       0        1        1         0
  Group 3: “affection”      0       0        0        0         1
Back to distributional representation (cont.)
v Pros and cons of this representation?
v Cosine similarity? (see the sketch below)
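A quick check (a sketch, using the 0/1 columns of the table above) of what cosine similarity gives in this representation:

import numpy as np

# Each word's vector is its column over the three synonym groups.
w = {
    "joy":      np.array([1, 0, 0]),
    "gladden":  np.array([1, 0, 0]),
    "sorrow":   np.array([0, 1, 0]),
    "goodwill": np.array([0, 0, 1]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(w["joy"], w["gladden"]))  # 1.0: same synonym group
print(cosine(w["joy"], w["sorrow"]))   # 0.0: different groups, even though related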
Problems?
v Number of basic concepts is large
v Basis is not orthogonal (the dimensions are correlated)
v Some function words are too frequent (e.g., “the”)
  v Syntax has too much impact
  v E.g., TF-IDF weighting can be applied (see the sketch below)
  v E.g., skip-grams: scale counts by distance to the target word
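A sketch of TF-IDF reweighting applied to a word-by-document count matrix (such as the matrix C built earlier), which damps very frequent function words:

import numpy as np

def tfidf(C):
    """C: words-by-documents count matrix. Returns the TF-IDF weighted matrix."""
    n_docs = C.shape[1]
    df = (C > 0).sum(axis=1)                  # document frequency of each word
    idf = np.log(n_docs / np.maximum(df, 1))  # rare words get a larger weight
    return C * idf[:, None]                   # scale each word's row by its IDF

# A word appearing in every document gets idf = log(1) = 0, so its row is zeroed out.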
Latent Semantic Analysis (LSA)
v Data representation
  v Encode single-relational data in a matrix
  v Co-occurrence (e.g., document-term matrix, skip-gram counts)
  v Synonyms (e.g., from a thesaurus)
v Factorization
  v Apply SVD to the matrix to find latent components
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
v For an m×n matrix A, there exists a factorization A = U Σ V^T
  v U and V are orthogonal matrices
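A small sketch of the factorization with NumPy (the matrix here is random, just to check the stated properties):

import numpy as np

A = np.random.rand(6, 4)                         # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))       # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))           # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))         # columns of V are orthonormal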
Low-rank Approximation
v Idea: store the most important information in a small number of dimensions (e.g., 100-1000)
v SVD can be used to compute the optimal low-rank approximation
  v Set the n − r smallest singular values to zero
v Similar words map to similar locations in the low-dimensional space
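Continuing in the same vein, a self-contained sketch of the rank-r truncation described above:

import numpy as np

A = np.random.rand(6, 4)                 # stand-in for a word-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
s_r = s.copy()
s_r[r:] = 0.0                            # zero out the n - r smallest singular values
A_r = U @ np.diag(s_r) @ Vt              # best rank-r approximation (Frobenius norm)

# Each word (row of A) can equivalently be represented in r dimensions:
word_vectors = U[:, :r] * s[:r]
print(np.linalg.norm(A - A_r))           # approximation error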
Latent Semantic Analysis (LSA)
v Factorization
  v Apply SVD to the matrix to find latent components
LSA example
v Original matrix C
(Example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval)
LSA example
v SVD: C = U Σ V^T
LSA example
v Original matrix C
v Dimension reduction: C ≈ U Σ V^T, keeping only the largest singular values in Σ
LSA example
v Original matrix C vs. reconstructed matrix C_2
v What is the similarity between “ship” and “boat”?