CS447: Natural Language Processing (Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center)
http://courses.engr.illinois.edu/cs447
Lecture 17: Vector-space semantics (distributional similarities)

Where we're at
We have looked at how to obtain the meaning of sentences from the meaning of their words (represented in predicate logic). Now we will look at how to represent the meaning of words (although this won't be in predicate logic).
We will consider different tasks:
- Computing the semantic similarity of words by representing them in a vector space
- Finding groups of similar words by inducing word clusters
- Identifying different meanings of words by word sense disambiguation

What we're going to cover today
Pointwise mutual information (PMI): a very useful metric to identify events that frequently co-occur, and a way to identify words that "go together".
Distributional (vector-space) semantics: measure the semantic similarity of words in terms of the similarity of the contexts in which the words appear
- The distributional hypothesis
- Representing words as (sparse) vectors
- Computing word similarities
Discrete random variables
A discrete random variable X can take on values {x_1, …, x_n} with probability p(X = x_i).
A note on notation: p(X) refers to the distribution, while p(X = x_i) refers to the probability of the specific value x_i; p(X = x_i) is also written as p(x_i).
In language modeling, the random variables correspond to words W or to sequences of words W(1)…W(n).
Another note on notation: we are often sloppy about making the distinction between the i-th word [token] in a sequence/string and the i-th word [type] in the vocabulary clear.

Mutual information I(X;Y)
Two random variables X, Y are independent iff their joint distribution is equal to the product of their individual distributions: p(X,Y) = p(X)p(Y).
That is, for all outcomes x, y: p(X=x, Y=y) = p(X=x) p(Y=y).
I(X;Y), the mutual information of two random variables X and Y, is defined as
I(X;Y) = \sum_{x,y} p(X=x, Y=y) \log \frac{p(X=x, Y=y)}{p(X=x)\, p(Y=y)}

Pointwise mutual information (PMI)
Recall that two events x, y are independent if their joint probability is equal to the product of their individual probabilities:
- x, y are independent iff p(x,y) = p(x)p(y)
- x, y are independent iff p(x,y) / (p(x)p(y)) = 1
In NLP, we often use the pointwise mutual information (PMI) of two outcomes/events (e.g. words):
PMI(x, y) = \log \frac{p(X=x, Y=y)}{p(X=x)\, p(Y=y)}

Using PMI to find related words
Find pairs of words w_i, w_j that have high pointwise mutual information:
PMI(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}
Different ways of defining p(w_i, w_j) give different answers.
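As a concrete illustration of one such definition (treating w_i and w_j as adjacent, which the next slide calls "sticky pairs"), here is a minimal Python sketch. The tokenized corpus and the maximum-likelihood probability estimates are assumptions made for this example, and a minimum-count threshold is added because rare pairs would otherwise get unreliably high PMI.

```python
import math
from collections import Counter

def high_pmi_pairs(tokens, min_count=5, top_k=20):
    """Rank adjacent word pairs (wi, wj) by PMI(wi, wj) = log p(wi, wj) / (p(wi) p(wj)),
    estimating p(wi, wj) from bigram counts and p(w) from unigram counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:                      # rare pairs yield unreliable estimates
            continue
        p_joint = c / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scored.append((math.log(p_joint / (p1 * p2)), w1, w2))
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical usage on a whitespace-tokenized corpus file:
# tokens = open("corpus.txt").read().lower().split()
# for score, w1, w2 in high_pmi_pairs(tokens):
#     print(f"{w1} {w2}\t{score:.2f}")
```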
Using PMI to find "sticky pairs"
Define p(w_i, w_j) = p("w_i w_j"): the probability that w_i, w_j are adjacent.
High-PMI word pairs under this definition: Humpty Dumpty, Klux Klan, Ku Klux, Tse Tung, avant garde, gizzard shad, Bobby Orr, mutatis mutandis, Taj Mahal, Pontius Pilate, ammonium nitrate, jiggery pokery, anciens combattants, fuddle duddle, helter skelter, mumbo jumbo (and a few more)

Back to lexical semantics…

Different approaches to lexical semantics
Lexicographic tradition:
- Use lexicons, thesauri, ontologies
- Assume words have discrete word senses: bank1 = financial institution; bank2 = river bank, etc.
- May capture explicit relations between word (senses): "dog" is a "mammal", etc.
Distributional tradition:
- Map words to (sparse) vectors that capture corpus statistics
- Contemporary variant: use neural nets to learn dense vector "embeddings" from very large corpora (this is a prerequisite for most neural approaches to NLP)
- This line of work often ignores the fact that words have multiple senses or parts-of-speech

Vector representations of words
"Traditional" distributional similarity approaches represent words as sparse vectors [today's lecture]:
- Each dimension represents one specific context
- Vector entries are based on word-context co-occurrence statistics (counts or PMI values)
Alternative, dense vector representations:
- We can use Singular Value Decomposition to turn these sparse vectors into dense vectors (Latent Semantic Analysis)
- We can also use neural models to explicitly learn a dense vector representation (embedding) (word2vec, Glove, etc.)
Sparse vectors = most entries are zero
Dense vectors = most entries are non-zero
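A sketch of how the sparse vector entries just described might be computed. The choice of a ±2-word context window and of positive PMI (PPMI, i.e. negative values clipped to zero) as the weighting are assumptions made for this example; the slides only say that entries are based on counts or PMI values.

```python
import math
from collections import Counter, defaultdict

def ppmi_vectors(sentences, window=2):
    """Build sparse word vectors: one dimension per context word,
    with entries weighted by positive PMI (PPMI) of word and context."""
    word_ctx = defaultdict(Counter)            # word -> Counter of context words
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    word_ctx[w][sent[j]] += 1

    total = sum(sum(ctx.values()) for ctx in word_ctx.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in word_ctx.items()}
    ctx_totals = Counter()
    for ctx in word_ctx.values():
        ctx_totals.update(ctx)

    vectors = {}
    for w, ctx in word_ctx.items():
        vec = {}
        for c, n in ctx.items():
            pmi = math.log((n / total) /
                           ((word_totals[w] / total) * (ctx_totals[c] / total)))
            if pmi > 0:                        # keep only positive PMI values (PPMI)
                vec[c] = pmi
        vectors[w] = vec
    return vectors
```

In practice these vectors are very high-dimensional and mostly zero, which is exactly the sparse-vector setting described above.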
Distributional Similarities
Measure the semantic similarity of words in terms of the similarity of the contexts in which the words appear.
Represent words as vectors.

Why do we care about word similarity?
Question answering:
Q: "How tall is Mt. Everest?"
Candidate A: "The official height of Mount Everest is 29029 feet"
"tall" is similar to "height"

Why do we care about word similarity?
Plagiarism detection

Why do we care about word contexts?
What is tezgüino?
A bottle of tezgüino is on the table.
Everybody likes tezgüino.
Tezgüino makes you drunk.
We make tezgüino out of corn.
(Lin, 1998; Nida, 1975)
The contexts in which a word appears tell us a lot about what it means.
The Distributional Hypothesis
Zellig Harris (1954):
"oculist and eye-doctor … occur in almost the same environments"
"If A and B have almost identical environments we say that they are synonyms."
John R. Firth (1957):
"You shall know a word by the company it keeps."
The contexts in which a word appears tell us a lot about what it means.
Words that appear in similar contexts have similar meanings.

Exploiting context for semantics
Distributional similarities (vector-space semantics):
Use the set of contexts in which words (= word types) appear to measure their similarity.
Assumption: Words that appear in similar contexts (tea, coffee) have similar meanings.
Word sense disambiguation (future lecture):
Use the context of a particular occurrence of a word (token) to identify which sense it has.
Assumption: If a word has multiple distinct senses (e.g. plant: factory or green plant), each sense will appear in different contexts.

Distributional similarities
Distributional similarities use the set of contexts in which words appear to measure their similarity.
They represent each word w as a vector w = (w_1, …, w_N) ∈ R^N in an N-dimensional vector space.
- Each dimension corresponds to a particular context c_n
- Each element w_n of w captures the degree to which the word w is associated with the context c_n
- w_n depends on the co-occurrence counts of w and c_n
The similarity of words w and u is given by the similarity of their vectors w and u.
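This excerpt does not yet say how the similarity of two vectors is measured; cosine similarity is a common choice and is assumed here. A minimal, self-contained sketch using raw co-occurrence counts as vector entries and a toy corpus invented for the example:

```python
import math
from collections import Counter, defaultdict

def context_count_vectors(sentences, window=2):
    """Represent each word as a sparse vector of context-word co-occurrence counts."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[w][sent[j]] += 1   # dimension = context word, entry = count
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {dimension: value} mappings."""
    dot = sum(val * v[dim] for dim, val in u.items() if dim in v)
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus: "tea" and "coffee" appear in identical contexts.
sentences = [["we", "drink", "tea", "with", "milk"],
             ["we", "drink", "coffee", "with", "milk"],
             ["everybody", "likes", "tezguino"]]
vecs = context_count_vectors(sentences)
print(cosine(vecs["tea"], vecs["coffee"]))     # 1.0: identical contexts
print(cosine(vecs["tea"], vecs["tezguino"]))   # 0.0: no shared contexts
```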