How about a radically different approach?
Ludwig Wittgenstein PI #43: "The meaning of a word is its use in the language"
Let's define words by their usages. In particular, words are defined by their environments (the words around them). Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.
What does ongchoi mean? Suppose you see these sentences: • Ong choi is delicious sautéed with garlic. • Ong choi is superb over rice • Ong choi leaves with salty sauces And you've also seen these: • …spinach sautéed with garlic over rice • Chard stems and leaves are delicious • Collard greens and other salty leafy greens Conclusion: ◦ Ongchoi is a leafy green like spinach, chard, or collard greens
Ong choi: Ipomoea aquatica "Water Spinach" Yamaguchi, Wikimedia Commons, public domain
We'll build a new model of meaning focusing on similarity Each word = a vector ◦ Not just "word" or word45 Similar words are "nearby in space" [Figure: a 2-d projection of an embedding space, with negative words (bad, worst, dislike, incredibly bad) clustered in one region and positive words (good, very good, amazing, fantastic, wonderful, terrific, nice) clustered in another]
We define a word as a vector Called an "embedding" because it's embedded into a space The standard way to represent meaning in NLP Fine-grained model of meaning for similarity ◦ NLP tasks like sentiment analysis ◦ With words, requires same word to be in training and test ◦ With embeddings: ok if similar words occurred!!! ◦ Question answering, conversational agents, etc
We'll introduce 2 kinds of embeddings Tf-idf ◦ A common baseline model ◦ Sparse vectors ◦ Words are represented by a simple function of the counts of nearby words Word2vec ◦ Dense vectors ◦ Representation is created by training a classifier to distinguish nearby and far-away words
Review: words, vectors, and co-occurrence matrices
Term-document matrix Each document is represented by a vector of words

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                    1               1               8        15
soldier                   2               2              12        36
fool                     37              58               1         5
clown                     5             117               0         0

Figure 6.2 The term-document matrix for four words in four Shakespeare plays.
Visualizing document vectors [Figure: documents plotted in the fool (x) vs. battle (y) plane: Henry V [5,15], Julius Caesar [1,8], As You Like It [37,1], Twelfth Night [58,1]]
Vectors are the basis of information retrieval

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                    1               1               8        15
soldier                   2               2              12        36
fool                     37              58               1         5
clown                     5             117               0         0

Figure 6.3 The term-document matrix for four words in four Shakespeare plays. Vectors are similar for the two comedies: As You Like It [1,2,37,5], Twelfth Night [1,2,58,117]. Different than the history: Henry V [15,36,5,0]. Comedies have more fools and clowns and fewer soldiers and battles.
Words can be vectors too

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                    1               1               8        15
soldier                   2               2              12        36
fool                     37              58               1         5
clown                     5             117               0         0

Figure 6.2 The term-document matrix for four words in four Shakespeare plays. Battle is "the kind of word that occurs in Julius Caesar and Henry V" Clown is "the kind of word that occurs a lot in Twelfth Night and a bit in As You Like It"
More common: word-word matrix (or "term-context matrix") Two words are similar in meaning if their context vectors are similar

…sugar, a sliced lemon, a tablespoonful of apricot jam, a pinch each of…
…their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened…
…well suited to programming on the digital computer. In finding the optimal R-stage policy from…
…for the purpose of gathering data and information necessary for the study authorized in the…

              aardvark   computer   data   pinch   result   sugar   …
apricot              0          0      0       1        0       1
pineapple            0          0      0       1        0       1
digital              0          2      1       0        1       0
information          0          1      6       0        4       0
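A word-word matrix like the one above can be built by counting, for each target word, how often every other word appears within a fixed window. Below is a minimal sketch (the function name and the tiny example sentence are my own illustration, not from the original corpus):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word: a sparse term-context matrix."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

counts = cooccurrence_counts("a pinch of sugar and a pinch of salt".split(), window=2)
print(counts["pinch"]["of"])  # 2
```

Each row of `counts` is one word's (sparse) context vector; real systems count over millions of tokens rather than one sentence.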
[Figure: word vectors plotted in the data (x) vs. result (y) plane: information [6,4], digital [1,1]]
Reminders from linear algebra

dot product: $\vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N$

vector length: $|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$
Cosine for computing similarity (Sec. 6.3)

$\text{cosine}(\vec{v},\vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

$v_i$ is the count for word v in context i; $w_i$ is the count for word w in context i.

Based on the dot-product identity $\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta$, i.e. $\cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$

Cos(v,w) is the cosine similarity of v and w
Cosine as a similarity metric -1: vectors point in opposite directions +1: vectors point in the same direction 0: vectors are orthogonal Frequency is non-negative, so cosine ranges 0-1
              large   data   computer
apricot           1      0          0
digital           0      1          2
information       1      6          1

$\cos(\vec{v},\vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

Which pair of words is more similar?

$\text{cosine(apricot, information)} = \frac{1+0+0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \frac{1}{\sqrt{38}} = .16$

$\text{cosine(digital, information)} = \frac{0+6+2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58$

$\text{cosine(apricot, digital)} = \frac{0+0+0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0$
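The cosine computations on this slide can be checked with a few lines of code (a minimal sketch using plain lists; the function name is my own):

```python
import math

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w))
    return dot / norm

apricot     = [1, 0, 0]   # counts for the contexts: large, data, computer
digital     = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(cosine(apricot, digital))                # 0.0
```

As expected, digital and information are the most similar pair, and apricot shares no contexts with digital.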
Visualizing cosines (well, angles) [Figure: apricot, digital, and information plotted with dimension 1 = 'large' and dimension 2 = 'data'; the angle between digital and information is small, while the angle between apricot and either is large]
But raw frequency is a bad representation Frequency is clearly useful; if sugar appears a lot near apricot , that's useful information. But overly frequent words like the , it, or they are not very informative about the context Need a function that resolves this frequency paradox!
tf-idf: combine two factors

tf: term frequency. Just the raw frequency count (or possibly log frequency)

idf: inverse document frequency:

$\text{idf}_i = \log\left(\frac{N}{\text{df}_i}\right)$

where N is the total # of docs in the collection and $\text{df}_i$ is the # of docs that contain word i. Words like "the" have very low idf.

tf-idf value for word i in document j: $w_{ij} = \text{tf}_{ij} \times \text{idf}_i$
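A minimal sketch of the weighting (the function name is my own; I use the natural log here, though the base only scales all weights uniformly):

```python
import math

def tf_idf(tf, df, n_docs):
    """tf-idf weight: term frequency times log(total docs / docs containing the term)."""
    return tf * math.log(n_docs / df)

# A word like "the" that appears in every document gets weight 0,
# no matter how frequent it is:
print(tf_idf(tf=100, df=10, n_docs=10))  # 0.0
# A rare word gets its count boosted:
print(tf_idf(tf=2, df=1, n_docs=10) > 2)  # True
```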
Summary: tf-idf Compare two words using tf-idf cosine to see if they are similar Compare two documents ◦ Take the centroid of the vectors of all the words in the document ◦ Centroid document vector: $d = \frac{w_1 + w_2 + \dots + w_k}{k}$
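The centroid is just the elementwise mean of the word vectors; a small sketch (function name my own):

```python
def centroid(vectors):
    """Document vector = elementwise mean of its word vectors."""
    k = len(vectors)
    return [sum(dim) / k for dim in zip(*vectors)]

# Three 2-d word vectors -> one 2-d document vector:
print(centroid([[1, 2], [3, 4], [5, 0]]))  # [3.0, 2.0]
```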
Tf-idf is a sparse representation Tf-idf vectors are ◦ long (length |V|= 20,000 to 50,000) ◦ sparse (most elements are zero)
Alternative: dense vectors vectors which are ◦ short (length 50-1000) ◦ dense (most elements are non-zero)
Sparse versus dense vectors Why dense vectors? ◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune) ◦ Dense vectors may generalize better than storing explicit counts ◦ They may do better at capturing synonymy: ◦ car and automobile are synonyms; but are distinct dimensions ◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't ◦ In practice, they work better
Dense embeddings you can download! Word2vec (Mikolov et al.) https://code.google.com/archive/p/word2vec/ Glove (Pennington, Socher, Manning) http://nlp.stanford.edu/projects/glove/
Word2vec Popular embedding method Very fast to train Code available on the web Idea: predict rather than count
Word2vec ◦Instead of counting how often each word w occurs near " apricot" ◦Train a classifier on a binary prediction task: ◦Is w likely to show up near " apricot" ? ◦We don’t actually care about this task ◦But we'll take the learned classifier weights as the word embeddings
Brilliant insight: Use running text as implicitly supervised training data! • A word c that occurs near apricot • Acts as the gold 'correct answer' to the question • "Is word w likely to show up near apricot?" • No need for hand-labeled supervision • The idea comes from neural language modeling • Bengio et al. (2003) • Collobert et al. (2011)
Word2vec: Skip-Gram Task Word2vec provides a variety of options. Let's do ◦ "skip-gram with negative sampling" (SGNS)
Skip-gram algorithm 1. Treat the target word and a neighboring context word as positive examples. 2. Randomly sample other words in the lexicon to get negative samples 3. Use logistic regression to train a classifier to distinguish those two cases 4. Use the weights as the embeddings 65 8/13/18
Skip-Gram Training Data Training sentence: ... lemon, a tablespoon of apricot jam a pinch ... c1 c2 target c3 c4 Assume context words are those in a +/- 2 word window
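Extracting the (target, context) training pairs from a +/- 2 word window can be sketched as follows (function name my own):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) positive examples from a +/- `window` word window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sent = "lemon a tablespoon of apricot jam a pinch".split()
print([c for t, c in skipgram_pairs(sent) if t == "apricot"])
# ['tablespoon', 'of', 'jam', 'a']
```

Every token gets a turn as the target, so one sentence yields many training pairs.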
Skip-Gram Goal Given a tuple (t,c) = target, context ◦ (apricot, jam) ◦ (apricot, aardvark) Return probability that c is a real context word: P(+|t,c), with P(−|t,c) = 1 − P(+|t,c)
How to compute p(+|t,c)? Intuition: ◦ Words are likely to appear near similar words ◦ Model similarity with dot-product! ◦ Similarity(t,c) ∝ t · c Problem: ◦ Dot product is not a probability! ◦ (Neither is cosine)
Turning dot product into a probability The sigmoid lies between 0 and 1: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Turning dot product into a probability

$P(+|t,c) = \frac{1}{1 + e^{-t \cdot c}}$

$P(-|t,c) = 1 - P(+|t,c) = \frac{e^{-t \cdot c}}{1 + e^{-t \cdot c}}$
For all the context words: Assume all context words are independent

$P(+|t,c_{1:k}) = \prod_{i=1}^{k} \frac{1}{1 + e^{-t \cdot c_i}}$

$\log P(+|t,c_{1:k}) = \sum_{i=1}^{k} \log \frac{1}{1 + e^{-t \cdot c_i}}$
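These two steps — sigmoid of the dot product, then summing logs over the independent context words — can be sketched directly (function names my own, vectors as plain lists):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(t, c):
    """P(+|t,c): sigmoid of the target-context dot product."""
    return sigmoid(sum(ti * ci for ti, ci in zip(t, c)))

def log_p_positive_all(t, contexts):
    """log P(+|t, c_1:k), assuming the context words are independent."""
    return sum(math.log(p_positive(t, c)) for c in contexts)

# With zero vectors the dot product is 0, so each P(+|t,c) is exactly 0.5:
print(p_positive([0.0, 0.0], [1.0, -1.0]))  # 0.5
```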
Skip-Gram Training Data Training sentence: ... lemon, a tablespoon of apricot jam a pinch ... c1 c2 t c3 c4 Training data: input/output pairs centering on apricot Assume a +/- 2 word window
Skip-Gram Training Training sentence: ... lemon, a tablespoon of apricot jam a pinch ... c1 c2 t c3 c4

positive examples +
t        c
apricot  tablespoon
apricot  of
apricot  preserves
apricot  or

• For each positive example, we'll create k negative examples • Using noise words • Any random word that isn't t
Skip-Gram Training Training sentence: ... lemon, a tablespoon of apricot jam a pinch ... c1 c2 t c3 c4 k=2

positive examples +         negative examples -
t        c                  t        c            t        c
apricot  tablespoon         apricot  aardvark     apricot  twelve
apricot  of                 apricot  puddle       apricot  hello
apricot  preserves          apricot  where        apricot  dear
apricot  or                 apricot  coaxial      apricot  forever
Choosing noise words Could pick w according to its unigram frequency P(w) More common: choose according to $P_\alpha(w)$:

$P_\alpha(w) = \frac{\text{count}(w)^\alpha}{\sum_{w'} \text{count}(w')^\alpha}$

α = ¾ works well because it gives rare noise words slightly higher probability To see this, imagine two events p(a)=.99 and p(b)=.01:

$P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97$

$P_\alpha(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03$
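The α-weighted noise distribution, including the .99/.01 example from this slide, can be sketched as (function name my own):

```python
def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) = count(w)**alpha / sum over w' of count(w')**alpha."""
    weighted = {w: c ** alpha for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

# Raising counts to the 0.75 power shifts probability mass toward rare words:
p = noise_distribution({"a": 99, "b": 1})
print(round(p["a"], 2), round(p["b"], 2))  # 0.97 0.03
```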
Setup Let's represent words as vectors of some length (say 300), randomly initialized. So we start with 300 * V random parameters Over the entire training set, we'd like to adjust those word vectors such that we ◦ Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive data ◦ Minimize the similarity of the (t,c) pairs drawn from the negative data.
Learning the classifier Iterative process. We’ll start with 0 or random weights Then adjust the word weights to ◦ make the positive pairs more likely ◦ and the negative pairs less likely over the entire training set:
Objective Criteria We want to maximize…

$\sum_{(t,c)\in +} \log P(+|t,c) \;+\; \sum_{(t,c)\in -} \log P(-|t,c)$

Maximize the + label for the pairs from the positive training data, and the – label for the pairs sampled from the negative data.
Focusing on one target word t:

$L(\theta) = \log P(+|t,c) + \sum_{i=1}^{k} \log P(-|t,n_i)$

$= \log \sigma(c \cdot t) + \sum_{i=1}^{k} \log \sigma(-n_i \cdot t)$

$= \log \frac{1}{1 + e^{-c \cdot t}} + \sum_{i=1}^{k} \log \frac{1}{1 + e^{n_i \cdot t}}$
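The per-target objective can be computed directly from the sigmoid form (function names my own; `noise_words` are the k sampled negatives):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sgns_loss(t, c, noise_words):
    """L(theta) = log sigma(c.t) + sum_i log sigma(-n_i.t), to be maximized."""
    return math.log(sigmoid(dot(c, t))) + sum(
        math.log(sigmoid(-dot(n, t))) for n in noise_words
    )
```

A quick sanity check: the objective grows as the positive pair's dot product grows, e.g. `sgns_loss([1.0], [1.0], []) > sgns_loss([0.0], [0.0], [])`.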
increase similarity(apricot, jam) / decrease similarity(apricot, aardvark) [Figure: the target-embedding matrix W (one d-dimensional row per vocabulary word, e.g. apricot) and the context-embedding matrix C (rows for neighbor words like jam and for noise words like aardvark); from "…apricot jam…", training increases the dot product of apricot's row in W with jam's row in C, and decreases it with aardvark's row]
Train using gradient descent Actually learns two separate embedding matrices W and C Can use W and throw away C, or merge them somehow
Summary: How to learn word2vec (skip-gram) embeddings Start with V random 300-dimensional vectors as initial embeddings Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes ◦ Take a corpus and take pairs of words that co-occur as positive examples ◦ Take pairs of words that don't co-occur as negative examples ◦ Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance ◦ Throw away the classifier code and keep the embeddings.
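The whole recipe fits in a short toy trainer. This is a sketch under simplifying assumptions, not the real word2vec implementation: noise words are drawn uniformly rather than from the α-weighted unigram distribution, and there is no learning-rate decay or subsampling of frequent words. All names are my own:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sgns(tokens, dim=50, window=2, k=2, lr=0.05, epochs=5, seed=0):
    """Toy skip-gram-with-negative-sampling trainer via stochastic gradient descent."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    # Two embedding matrices: W for target words, C for context words.
    W = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    C = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    for _ in range(epochs):
        for i, t in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j == i:
                    continue
                # One positive pair plus k uniformly sampled negative pairs.
                pairs = [(tokens[j], 1.0)] + [(rng.choice(vocab), 0.0) for _ in range(k)]
                for c, label in pairs:
                    d_tc = sum(a * b for a, b in zip(W[t], C[c]))
                    g = sigmoid(d_tc) - label  # gradient of the negative log-likelihood
                    for d in range(dim):
                        wd, cd = W[t][d], C[c][d]
                        W[t][d] -= lr * g * cd
                        C[c][d] -= lr * g * wd
    return W  # keep the target embeddings, throw away C

emb = train_sgns("the quick fox the quick fox".split(), dim=5, epochs=2)
```

At real scale the same loop runs over billions of tokens with a much larger vocabulary, which is why the reference implementations are heavily optimized.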
Evaluating embeddings Compare to human scores on word similarity-type tasks: • WordSim-353 (Finkelstein et al., 2002) • SimLex-999 (Hill et al., 2015) • Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) • TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
Properties of embeddings Similarity depends on window size C C = ±2 The nearest words to Hogwarts: ◦ Sunnydale ◦ Evernight C = ±5 The nearest words to Hogwarts: ◦ Dumbledore ◦ Malfoy ◦ halfblood
Analogy: Embeddings capture relational meaning! vector( ‘king’ ) - vector( ‘man’ ) + vector( ‘woman’ ) ≈ vector(‘queen’) vector( ‘Paris’ ) - vector( ‘France’ ) + vector( ‘Italy’ ) ≈ vector(‘Rome’)
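The parallelogram method behind these analogies is just nearest-neighbor search on b − a + c. A sketch with hypothetical toy vectors (the 2-d vectors below are invented for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(x * x for x in w)))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by nearest neighbor to (b - a + c), excluding a, b, c."""
    target = [bv - av + cv for av, bv, cv in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Toy 2-d space where dimension 0 ~ "royalty" and dimension 1 ~ "gender":
toy = {"king": [1.0, 1.0], "man": [0.0, 1.0],
       "woman": [0.0, 0.0], "queen": [1.0, 0.0], "apple": [0.5, 0.5]}
print(analogy("man", "king", "woman", toy))  # queen
```

Excluding the three input words matters: in practice the nearest neighbor of b − a + c is often b itself.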
Embeddings can help study word history! Train embeddings on old books to study changes in word meaning!! Will Hamilton
Diachronic word embeddings for studying language change! [Figure: separate word-vector spaces trained on text from 1900, 1950, 1990, and 2000; e.g. the "dog" word vector from 1920 compared with the "dog" word vector from 1990]
Visualizing changes Project 300 dimensions down into 2 ~30 million books, 1850-1990, Google Books data
The evolution of sentiment words Negative words change faster than positive words
Embeddings and bias
Embeddings reflect cultural bias Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." In Advances in Neural Information Processing Systems , pp. 4349-4357. 2016. Ask “Paris : France :: Tokyo : x” ◦ x = Japan Ask “father : doctor :: mother : x” ◦ x = nurse Ask “man : computer programmer :: woman : x” ◦ x = homemaker
Embeddings reflect cultural bias Caliskan, Aylin, Joanna J. Bryson and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356:6334, 183-186. Implicit Association Test (Greenwald et al. 1998): How associated are ◦ concepts ( flowers , insects ) & attributes ( pleasantness , unpleasantness )? ◦ Studied by measuring timing latencies for categorization. Psychological findings on US participants: ◦ African-American names are associated with unpleasant words (more than European- American names) ◦ Male names associated more with math, female names with arts ◦ Old people's names with unpleasant words, young people with pleasant words. Caliskan et al. replication with embeddings: ◦ African-American names ( Leroy, Shaniqua ) had a higher GloVe cosine with unpleasant words ( abuse, stink, ugly ) ◦ European American names ( Brad, Greg, Courtney ) had a higher cosine with pleasant words ( love, peace, miracle ) Embeddings reflect and replicate all sorts of pernicious biases.
Directions Debiasing algorithms for embeddings ◦ Bolukbasi, Tolga, Chang, Kai-Wei, Zou, James Y., Saligrama, Venkatesh, and Kalai, Adam T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Infor- mation Processing Systems , pp. 4349–4357. Use embeddings as a historical tool to study bias
Embeddings as a window onto history Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences , 115 (16), E3635–E3644 Use the Hamilton historical embeddings The cosine similarity of embeddings for decade X for occupations (like teacher) to male vs female names ◦ Is correlated with the actual percentage of women teachers in decade X
History of biased framings of women Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences , 115 (16), E3635–E3644 Embeddings for competence adjectives are biased toward men ◦ Smart, wise, brilliant, intelligent, resourceful, thoughtful, logical, etc. This bias is slowly decreasing
Embeddings reflect ethnic stereotypes over time Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences , 115 (16), E3635–E3644 • Princeton trilogy experiments • Attitudes toward ethnic groups (1933, 1951, 1969) scores for adjectives • industrious, superstitious, nationalistic , etc • Cosine of Chinese name embeddings with those adjective embeddings correlates with human ratings.
Change in linguistic framing 1910-1990 Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences , 115 (16), E3635–E3644 Change in association of Chinese names with adjectives framed as "othering" ( barbaric , monstrous , bizarre )
Changes in framing: adjectives associated with Chinese Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences , 115 (16), E3635–E3644 1910 1950 1990 Irresponsible Disorganized Inhibited Envious Outrageous Passive Barbaric Pompous Dissolute Aggressive Unstable Haughty Transparent Effeminate Complacent Monstrous Unprincipled Forceful Hateful Venomous Fixed Cruel Disobedient Active Greedy Predatory Sensitive Bizarre Boisterous Hearty