Lecture 6: Vector Space Model


1. Lecture 6: Vector Space Model. Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

2. This lecture • How to represent a word, a sentence, or a document? • How to infer the relationship among words? • We focus on “semantics”: distributional semantics • What is the meaning of “life”?


4. How to represent a word • Naïve way: represent words as atomic symbols: student, talk, university • N-gram language model, logical analysis • Represent a word as a “one-hot” vector, e.g., [ 0 0 0 1 0 … 0 ] over dimensions (egg, student, talk, university, happy, buy, …) • How large is this vector? • PTB data: ~50k words; Google 1T data: 13M words • 𝑤 ⋅ 𝑣 = ?

5. Issues? • Dimensionality is large; the vector is sparse • No similarity: 𝑤_happy = [0 0 0 1 0 … 0], 𝑤_cat = [0 0 1 0 0 … 0], 𝑤_milk = [1 0 0 0 0 … 0], so 𝑤_happy ⋅ 𝑤_cat = 𝑤_happy ⋅ 𝑤_milk = 0 • Cannot represent new words • Any idea?
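A minimal sketch of one-hot vectors over the small example vocabulary from slide 4, showing that any two distinct words have dot product zero:

# One-hot word vectors over a tiny vocabulary (the slide's example list);
# in real data the vocabulary has ~50k-13M entries, so the vectors are huge and sparse.
import numpy as np

vocab = ["egg", "student", "talk", "university", "happy", "buy"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct words are orthogonal, so one-hot vectors carry no similarity.
print(one_hot("happy") @ one_hot("talk"))   # 0.0
print(one_hot("happy") @ one_hot("happy"))  # 1.0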

6. Idea 1: Taxonomy (Word category)

7. What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

8. Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
• Building such a taxonomy requires human labor
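WordNet can also give a numeric similarity between synsets; a minimal sketch using NLTK's path_similarity over the same synsets (assumes NLTK and the WordNet corpus are installed):

# path_similarity scores two synsets in [0, 1] based on the shortest
# hypernym path connecting them; closer concepts score higher.
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
for name in ['minke_whale.n.01', 'orca.n.01', 'tortoise.n.01', 'novel.n.01']:
    print(name, right.path_similarity(wn.synset(name)))
# Expect the whale synsets to score higher than 'tortoise' or 'novel'.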

9. Taxonomy (Word category) • Synonym, hypernym (Is-A), hyponym

10. Idea 2: Similarity = Clustering

11. Cluster n-gram model • Can be generated from unlabeled corpora • Based on statistics, e.g., mutual information • (Figure: “Implementation of the Brown hierarchical word clustering algorithm”, Percy Liang)

12. Idea 3: Distributional representation • “a word is characterized by the company it keeps” (J. R. Firth, 1957) • Linguistic items with similar distributions have similar meanings • i.e., words that occur in the same contexts ⇒ similar meaning

13. Vector representation (word embeddings) • Discrete ⇒ distributed representations • Word meanings are vectors of “basic concepts” • What are the “basic concepts”? • How to assign weights? • How to define the similarity/distance? • Example, with basic concepts (royalty, masculinity, femininity, eatable): 𝑤_king = [0.8 0.9 0.1 0 … ], 𝑤_queen = [0.8 0.1 0.8 0 … ], 𝑤_apple = [0.1 0.2 0.1 0.8 … ]
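A minimal sketch of this idea, reusing the slide's illustrative concept weights (the numbers are only for illustration):

# Words as weight vectors over basic concepts (royalty, masculinity, femininity, eatable).
import numpy as np

w = {
    "king":  np.array([0.8, 0.9, 0.1, 0.0]),
    "queen": np.array([0.8, 0.1, 0.8, 0.0]),
    "apple": np.array([0.1, 0.2, 0.1, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(w["king"], w["queen"]))   # higher: both score high on royalty
print(cosine(w["king"], w["apple"]))   # lower: little overlap in concepts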

14. An illustration of the vector space model • (Figure: words 𝑤_1 … 𝑤_5 plotted along axes labeled Royalty, Masculine, and Eatable; |𝐷_2 − 𝐷_4| marks the distance between two of the vectors)

15. Semantic similarity in 2D • Example: Home Depot products

16. Capture the structure of words • Example from GloVe

17. How to use word vectors?

18. Pre-trained word vectors • word2vec vectors trained on Google News: https://code.google.com/archive/p/word2vec • 100 billion tokens, 300 dimensions, 3M words and phrases • GloVe project: http://nlp.stanford.edu/projects/glove/ • Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B)
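A minimal sketch for loading and querying such pre-trained vectors, assuming the gensim library and a local copy of the Google News binary file (the file name/path is an assumption):

from gensim.models import KeyedVectors

# Load the 300-dimensional Google News vectors downloaded from the archive above.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["student"].shape)                  # a 300-dimensional vector
print(vectors.most_similar("student", topn=5))   # nearest neighbors by cosine similarity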

19. Distance/similarity • Vector similarity measure ⇒ similarity in meaning • Cosine similarity: cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ||𝑤||); note 𝑣/||𝑣|| is a unit vector, i.e., word vectors are normalized by length • Euclidean distance: ||𝑣 − 𝑤||_2 • Inner product: 𝑣 ⋅ 𝑤 (same as cosine similarity if vectors are normalized)

20. Distance/similarity • Vector similarity measure ⇒ similarity in meaning • Cosine similarity: cos(𝑣, 𝑤) = (𝑣 ⋅ 𝑤) / (||𝑣|| ||𝑤||); word vectors are normalized by length • Euclidean distance: ||𝑣 − 𝑤||_2 • Inner product: 𝑣 ⋅ 𝑤 (same as cosine similarity if vectors are normalized) • Choosing the right similarity metric is important; see “Linguistic Regularities in Sparse and Explicit Word Representations”, Levy & Goldberg, CoNLL 2014
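A minimal NumPy sketch comparing the three measures on two arbitrary example vectors:

import numpy as np

v = np.array([0.8, 0.9, 0.1, 0.0])
w = np.array([0.8, 0.1, 0.8, 0.0])

cosine    = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
euclidean = np.linalg.norm(v - w)   # ||v - w||_2
inner     = v @ w

# After length normalization, the inner product equals cosine similarity.
v_hat, w_hat = v / np.linalg.norm(v), w / np.linalg.norm(w)
assert np.isclose(v_hat @ w_hat, cosine)
print(cosine, euclidean, inner)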

21. Word similarity DEMO • http://msrcstr.cloudapp.net/

22. Word analogy • 𝑤_uncle − 𝑤_man + 𝑤_woman ∼ 𝑤_aunt (man : woman :: uncle : aunt)
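A minimal sketch of this analogy arithmetic using gensim's most_similar, assuming the pre-trained Google News file from slide 18 has been downloaded (the file path is an assumption):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# w_uncle - w_man + w_woman: find the words nearest to this point by cosine similarity.
print(vectors.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))
# 'aunt' is expected to rank near the top.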

23. From words to phrases

24. Neural Language Models

25. How to “learn” word vectors? • What are the “basic concepts”? • How to assign weights? • How to define the similarity/distance? ⇒ Cosine similarity

26. Back to distributional representation • Encode relational data in a matrix • Co-occurrence counts (e.g., from a general corpus) • Bag-of-words model: documents (clusters) as the basis for the vector space (see the sketch below)
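A minimal sketch of the bag-of-words (term-document) matrix over a made-up toy corpus:

# rows = words, columns = documents; entry = how often the word occurs in the document.
from collections import Counter

docs = ["the cat drinks milk", "the dog chases the cat", "students drink coffee"]
vocab = sorted({w for d in docs for w in d.split()})

matrix = {w: [Counter(d.split())[w] for d in docs] for w in vocab}
for w, row in matrix.items():
    print(f"{w:10s}", row)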

27. Back to distributional representation • Encode relational data in a matrix • Co-occurrence counts (e.g., from a general corpus) • Skip-grams
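A minimal sketch of window-based (“skip-gram” style) word-context co-occurrence counts over a made-up toy corpus:

# Count how often each word appears within a 2-word window of another word.
from collections import defaultdict

corpus = "the cat drinks milk and the dog drinks water".split()
window = 2
cooc = defaultdict(int)

for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(word, corpus[j])] += 1

print(cooc[("cat", "drinks")], cooc[("cat", "water")])  # nearby pairs get counts, distant ones stay 0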

28. Back to distributional representation • Encode relational data in a matrix • Co-occurrence counts (e.g., from a general corpus) • Skip-grams • From a taxonomy (e.g., WordNet, a thesaurus). Input: synonyms from a thesaurus, e.g., Joyfulness: joy, gladden; Sad: sorrow, sadden
                      joy  gladden  sorrow  sadden  goodwill
Group 1: “joyfulness”   1        1       0       0         0
Group 2: “sad”          0        0       1       1         0
Group 3: “affection”    0        0       0       0         1

29. Back to distributional representation • Encode relational data in a matrix • Co-occurrence counts (e.g., from a general corpus) • Skip-grams • From a taxonomy (e.g., WordNet, a thesaurus). Input: synonyms from a thesaurus (same matrix as on slide 28) • Pros and cons? What does cosine similarity measure on such a matrix?

30. Problems? • The number of basic concepts is large • The basis is not orthogonal (the basis vectors are correlated) • Some function words are too frequent (e.g., “the”); TF-IDF weighting can be applied (see the sketch below) • Syntax has too much impact; e.g., skip-gram counts can be scaled by distance to the target word
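A minimal TF-IDF sketch over a made-up toy corpus, showing how frequent function words get down-weighted:

import math
from collections import Counter

docs = ["the cat drinks milk", "the dog chases the cat", "students drink coffee"]
tokenized = [d.split() for d in docs]

def tf_idf(word, doc_tokens):
    tf = Counter(doc_tokens)[word]                       # term frequency in this document
    df = sum(1 for d in tokenized if word in d)          # number of documents containing the word
    idf = math.log(len(tokenized) / df) if df else 0.0   # rare words weigh more
    return tf * idf

# "the" appears in most documents, so it gets a lower weight than the rarer "milk".
print(tf_idf("the", tokenized[0]), tf_idf("milk", tokenized[0]))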

31. Latent Semantic Analysis (LSA) • Data representation: encode single-relational data in a matrix • Co-occurrence (e.g., a document-term matrix, skip-grams) • Synonyms (e.g., from a thesaurus) • Factorization: apply SVD to the matrix to find latent components

32. Principal Component Analysis (PCA) • Decompose the similarity space into a set of orthonormal basis vectors

33. Principal Component Analysis (PCA) • Decompose the similarity space into a set of orthonormal basis vectors • For an 𝑛×𝑚 matrix 𝐵, there exists a factorization 𝐵 = 𝑉Σ𝑊ᵀ, where 𝑉 and 𝑊 are orthogonal matrices and Σ is diagonal

34. Low-rank Approximation • Idea: store the most important information in a small number of dimensions (e.g., 100-1000) • SVD can be used to compute the optimal low-rank approximation • Set the smallest 𝑛−𝑟 singular values to zero (see the sketch below) • Similar words map to similar locations in the low-dimensional space
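A minimal NumPy sketch of the rank-𝑟 truncation described above; the matrix B is a random placeholder standing in for a real co-occurrence matrix:

import numpy as np

rng = np.random.default_rng(0)
B = rng.random((8, 6))          # pretend this is an 8-word x 6-document matrix
r = 2                           # keep only the r largest singular values

V, s, Wt = np.linalg.svd(B, full_matrices=False)
s[r:] = 0.0                     # zero out the smallest singular values
B_r = V @ np.diag(s) @ Wt       # optimal rank-r approximation (in Frobenius norm)

print(np.linalg.matrix_rank(B_r))   # 2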

35. Latent Semantic Analysis (LSA) • Factorization: apply SVD to the matrix to find latent components

36. LSA example • Original matrix 𝐶 (example from Christopher Manning and Pandu Nayak, Introduction to Information Retrieval)

37. LSA example • SVD: 𝐶 = 𝑉Σ𝑊ᵀ

38. LSA example • Original matrix 𝐶 • Dimension reduction: 𝐶 ≈ 𝑉Σ𝑊ᵀ with only the largest singular values kept

39. LSA example • Original matrix 𝐶 vs. reconstructed matrix 𝐶_2 • What is the similarity between “ship” and “boat”? (see the sketch below)
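A minimal sketch of this comparison, using a small illustrative term-document matrix (a stand-in, not necessarily the exact matrix 𝐶 from the slides):

import numpy as np

terms = ["ship", "boat", "ocean", "wood", "tree"]
C = np.array([
    [1, 0, 1, 0, 0, 0],   # ship
    [0, 1, 0, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0, 0],   # ocean
    [1, 0, 0, 1, 1, 0],   # wood
    [0, 0, 0, 1, 0, 1],   # tree
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

V, s, Wt = np.linalg.svd(C, full_matrices=False)
s[2:] = 0.0
C2 = V @ np.diag(s) @ Wt   # rank-2 reconstruction

# In the raw counts "ship" and "boat" never co-occur, so their cosine is 0; after the
# rank-2 reduction their similarity is expected to increase, since both co-occur with "ocean".
print(terms[0], terms[1], cosine(C[0], C[1]), cosine(C2[0], C2[1]))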
