More Distributional Semantics: New Models & Applications CMSC 723 / LING 723 / INST 725 Marine Carpuat marine@cs.umd.edu
Last week… • Q: what does it mean to understand meaning? • A: meaning is knowing when words are similar or not • Topics – Word similarity – Thesaurus-based methods – Distributional word representations – Dimensionality reduction
Today
• New models for learning word representations
– From "count"-based models (e.g., LSA) to "prediction"-based models (e.g., word2vec)
– … and back
• Beyond semantic similarity
– Learning semantic relations between words
DISTRIBUTIONAL MODELS OF WORD MEANING
Distributional Approaches: Intuition "You shall know a word by the company it keeps!" (Firth, 1957) "Differences of meaning correlate with differences of distribution" (Harris, 1970)
Context Features • Word co-occurrence within a window: • Grammatical relations:
Association Metric
• Commonly-used metric: Pointwise Mutual Information

$\mathrm{assoc}_{\mathrm{PMI}}(w, f) = \log_2 \frac{P(w, f)}{P(w)\,P(f)}$

• Can be used as a feature value or by itself
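As a minimal sketch, PMI can be computed directly from raw co-occurrence counts; the toy pair counts below are made up purely for illustration:

```python
import math

def pmi_scores(pair_counts):
    """Compute PMI(w, f) = log2( P(w,f) / (P(w) P(f)) ) from raw
    co-occurrence counts. `pair_counts` maps (word, feature) -> count."""
    total = sum(pair_counts.values())
    word_counts, feat_counts = {}, {}
    for (w, f), c in pair_counts.items():
        word_counts[w] = word_counts.get(w, 0) + c
        feat_counts[f] = feat_counts.get(f, 0) + c
    scores = {}
    for (w, f), c in pair_counts.items():
        p_wf = c / total
        p_w = word_counts[w] / total
        p_f = feat_counts[f] / total
        scores[(w, f)] = math.log2(p_wf / (p_w * p_f))
    return scores

# Toy counts: "new york" co-occurs more than chance, "new car" less.
pairs = {("new", "york"): 8, ("new", "car"): 2, ("old", "car"): 6}
scores = pmi_scores(pairs)
```

Pairs that co-occur more often than their marginal frequencies predict get positive PMI; pairs that co-occur less often get negative PMI.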
Computing Similarity
• Semantic similarity boils down to computing some measure on context vectors
• Cosine similarity, borrowed from information retrieval:

$\mathrm{sim}_{\mathrm{cosine}}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$
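The formula above translates directly into a few lines of numpy (the two toy vectors are illustrative; parallel vectors score 1.0):

```python
import numpy as np

def cosine_sim(v, w):
    """Cosine similarity between two context vectors:
    dot product divided by the product of the norms."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

v = np.array([1.0, 2.0, 0.0])
w = np.array([2.0, 4.0, 0.0])  # same direction as v, so similarity = 1
```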
Dimensionality Reduction with Latent Semantic Analysis
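A sketch of LSA-style dimensionality reduction: truncated SVD of a toy term-document count matrix (the counts are invented for illustration) yields low-dimensional word vectors:

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
X = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 2., 0.]])

# SVD: X = U S V^T; keep only the top-k singular dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # rank-k "latent" term representations
```

Each row of `word_vectors` is a k-dimensional representation of a term; similarity is then computed in this reduced space rather than over raw counts.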
NEW DIRECTIONS: PREDICT VS. COUNT MODELS
Word vectors as a byproduct of language modeling. A Neural Probabilistic Language Model. Bengio et al., JMLR 2003
Using neural word representations in NLP • word representations from neural LMs – aka distributed word representations – aka word embeddings • How would you use these word vectors? • Turian et al. [2010] – word representations as features consistently improve performance of • Named-Entity Recognition • Text chunking tasks
Word2vec [Mikolov et al. 2013] introduces simpler models https://code.google.com/p/word2vec
Word2vec claims Useful representations for NLP applications Can discover relations between words using vector arithmetic king – male + female = queen Paper+tool received lots of attention even outside the NLP research community try it out at “word2vec playground”: http://deeplearner.fz-qqq.net/
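The vector-arithmetic claim can be sketched with tiny hand-built vectors (these are made-up 3-dimensional embeddings, not trained ones, so the "analogy" is baked in for illustration):

```python
import numpy as np

# Toy, hand-built embeddings purely for illustration (not trained).
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "male":   np.array([0.1, 0.9, 0.0]),
    "female": np.array([0.1, 0.0, 0.9]),
}

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine) to
    emb[a] - emb[b] + emb[c], excluding the three query words."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -2.0
    for word, v in emb.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

With trained embeddings the same nearest-neighbor search over `a - b + c` is what recovers king – male + female ≈ queen.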
Demystifying the skip-gram model [Levy & Goldberg, 2014] Word context word embeddings Learn word vector parameters so as to maximize the probability of training set D Expensive!! http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Toward the training objective for skip-gram Problem: a trivial solution exists when Vc = Vw and Vc · Vw = K for all (Vc, Vw), with K large enough http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Final training objective
• D: word-context pairs observed in data
• D′: word-context pairs not observed in data (artificially generated negative samples)
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
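Written out (following the Goldberg & Levy note linked above), the negative-sampling objective maximizes the score of observed pairs in D and minimizes the score of sampled pairs in D′:

$$\arg\max_{v_w, v_c} \; \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) \;+\; \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

The D′ term is what blocks the trivial solution from the previous slide: making all dot products uniformly large now hurts the objective on the negative pairs.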
Skip-gram model [Mikolov et al. 2013] Predict context words given the current word (i.e., 2(n-1) classifiers for a context window of size n) Use negative samples at each position
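A hedged numpy sketch of a single SGNS gradient step, written from the objective above (the function name, learning rate, and toy dimensions are illustrative, not word2vec's actual implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(v_w, v_c, neg_cs, lr=0.05):
    """One gradient-ascent step of skip-gram with negative sampling:
    pull v_w toward the true context vector v_c, push it away from
    the sampled negative context vectors in neg_cs."""
    # Positive pair: gradient of log sigmoid(v_c . v_w)
    g = 1.0 - sigmoid(v_c @ v_w)
    dw = g * v_c
    v_c += lr * g * v_w
    # Negative pairs: gradient of log sigmoid(-v_n . v_w)
    for v_n in neg_cs:
        gn = -sigmoid(v_n @ v_w)
        dw += gn * v_n
        v_n += lr * gn * v_w
    v_w += lr * dw
    return v_w

# Toy usage: random low-dimensional vectors, five negative samples.
rng = np.random.default_rng(0)
word_vec = rng.normal(scale=0.1, size=50)
ctx_vec = rng.normal(scale=0.1, size=50)
negatives = [rng.normal(scale=0.1, size=50) for _ in range(5)]
word_vec = sgns_step(word_vec, ctx_vec, negatives)
```

Each training position fires one such update per (word, context) pair, with the negatives drawn from a unigram-based noise distribution.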
Don’t count, predict! [Baroni et al. 2014] “This paper has presented the first systematic comparative evaluation of count and predict vectors. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.”
Don’t count, predict! [Baroni et al. 2014] “Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. […] Instead, we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.”
Why does word2vec produce good word representations? Levy & Goldberg, Apr 2014: “Good question. We don’t really know. The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w.v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other […]. This is, however, very hand-wavy .”
Learning skip-gram is almost equivalent to matrix factorization [Levy & Goldberg 2014] http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
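Levy & Goldberg show that SGNS implicitly factorizes a word-context PMI matrix shifted by log k (k = number of negative samples). A hedged "count"-side sketch of that connection: build a PPMI matrix from toy co-occurrence counts and factorize it with SVD (the counts and rank are invented for illustration):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
C = np.array([[4., 0., 1.],
              [1., 3., 0.],
              [0., 1., 2.]])
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total   # word marginals
Pc = C.sum(axis=0, keepdims=True) / total   # context marginals
Pwc = C / total                             # joint probabilities

with np.errstate(divide="ignore"):
    pmi = np.log(Pwc / (Pw * Pc))
ppmi = np.maximum(pmi, 0.0)   # positive PMI: clip negatives and -inf to 0

# Rank-k SVD of the PPMI matrix yields "count"-style word embeddings.
U, s, Vt = np.linalg.svd(ppmi)
k = 2
W = U[:, :k] * np.sqrt(s[:k])
```

In the paper's construction the matrix being factorized is PMI − log k rather than plain PPMI, but the count-then-factorize pipeline is the same.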
New directions: Summary • There are alternative ways to learn distributional representations for word meaning • Understanding >> Magic
BEYOND SIMILARITY: PREDICTING SEMANTIC RELATIONS BETWEEN WORDS Slides credit: Peter Turney
Recognizing Textual Entailment
• Sample problem
– Text: iTunes software has seen strong sales in Europe
– Hypothesis: Strong sales for iTunes in Europe
– Task: Does the Text entail the Hypothesis? Yes or No?
Recognizing Textual Entailment
• Sample problem
– Task: Does the Text entail the Hypothesis? Yes or No?
• Has emerged as a core task for semantic analysis in NLP
– subsumes many tasks: Paraphrase Detection, Question Answering, etc.
– fully text based: does not require committing to a specific semantic representation
[Dagan et al. 2013]
Recognizing lexical entailment • To recognize entailment between sentences, we must first recognize entailment between words • Sample problem – Text George was bitten by a dog – Hypothesis George was attacked by an animal
Lexical entailment & semantic relations • Synonymy: synonyms entail each other firm entails company • is-a relations: hyponyms entail hypernyms automaker entails company • part-whole relations: it depends government entails minister division does not entail company • entailment also covers other relations ocean entails water murder entails death
• We know how to build word vectors that represent word meaning • How can we predict entailment using these vectors?
Approach 1: context inclusion hypothesis
• Hypothesis:
– if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
– then a (the narrower term) tends to entail b (the broader term)
• Inspired by formal logic
• In practice
– Design an asymmetric real-valued metric to compare word vectors
[Kotlerman, Dagan, et al. 2010]
Approach 1: the BalAPinc Metric. A complex, hand-crafted metric!
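BalAPinc itself is too involved to reproduce here, but a much simpler asymmetric measure in the same spirit, Weeds precision, illustrates the context-inclusion idea (the toy context vectors below are made up; a narrow term's contexts sit inside the broad term's):

```python
import numpy as np

def weeds_precision(a, b):
    """Directional (asymmetric) inclusion score: the fraction of a's
    context weight that falls on contexts b also has. A high score
    suggests a's contexts are included in b's, i.e. a may entail b."""
    mask = b > 0
    return float(a[mask].sum() / a.sum())

# Narrow term a occurs in a subset of broad term b's contexts.
a = np.array([2.0, 3.0, 0.0, 0.0])
b = np.array([1.0, 4.0, 2.0, 5.0])
```

Note the asymmetry: `weeds_precision(a, b)` is 1.0 (all of a's contexts are covered by b), while `weeds_precision(b, a)` is well below 1, matching the directional nature of entailment.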
Approach 2: context combination hypothesis
• Hypothesis:
– The tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs, and the contexts in which b occurs
– Some combinations of contexts tend to block entailment, others tend to allow entailment
• In practice
– Binary prediction task
– Supervised learning from labeled word pairs
[Baroni, Bernardini, Do and Shan, 2012]
Approach 3: similarity differences hypothesis • Hypothesis – The tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a,r) – sim(b,r), to a set of reference words r in R – Some differences tend to block entailment, and others tend to allow entailment • In practice – Binary prediction task – Supervised learning from labeled word pairs + reference words [Turney & Mohammad 2015]
Approach 3: similarity differences hypothesis
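A hedged sketch of the feature construction behind approach 3 (the vectors and reference set are toy stand-ins; in the actual work the features feed a supervised classifier trained on labeled word pairs):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_diff_features(a, b, refs):
    """Feature vector of similarity differences sim(a, r) - sim(b, r),
    one feature per reference vector r. A supervised classifier then
    learns which difference patterns allow or block entailment."""
    return np.array([cosine(a, r) - cosine(b, r) for r in refs])

# Toy word vectors and two reference vectors.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
refs = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]
feats = sim_diff_features(a, b, refs)
```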
Evaluation: test set 1/3 (KDSZ)
Evaluation: test set 2/3 (JMTH)
Evaluation: test set 3/3 (BBDS)
Evaluation [Turney & Mohammad 2015]
Lessons from the lexical entailment task
• The distributional hypothesis can be refined and put to use in various ways to detect relations between words, beyond the concept of similarity
• Combining unsupervised similarity with supervised learning is powerful
RECAP
Today
A glimpse into recent research
• New models for learning word representations
– From "count"-based models (e.g., LSA) to "prediction"-based models (e.g., word2vec)
– … and back
• Beyond semantic similarity
– Learning lexical entailment
Next topics
• Multiword expressions & predicate-argument structure
References
• Don't count, predict! [Baroni et al. 2014] http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
• word2vec Explained [Goldberg & Levy 2014] http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
• Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014] http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
• Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015] http://arxiv.org/abs/1401.8269