Lexical Semantics: Similarity Measures and Clustering




  1. Beyond Dead Parrots
"This parrot is no more! It has ceased to be! It's expired and gone to meet its maker! This is a late parrot! This... is an EX-PARROT!"

Automatically constructed clusters of semantically similar words (Charniak, 1997):
• Friday Monday Thursday Wednesday Tuesday Saturday Sunday
• People guys folks fellows CEOs commies
• blocks water gas cola liquid acid carbon steam shale
• that the theat
• head body hands eyes voice arm seat eye hair mouth

Today: Semantic Similarity
State-of-the-art methods: closest words for ?
anthropology 0.275881, sociology 0.247909, comparative literature 0.245912, computer science 0.220663, political science 0.219948, zoology 0.210283, biochemistry 0.197723, mechanical engineering 0.191549, biology 0.189167, criminology 0.178423, social science 0.176762, psychology 0.171797, astronomy 0.16531, neuroscience 0.163764, psychiatry 0.163098, geology 0.158567, archaeology 0.157911, mathematics 0.157138

  2. Motivation: Learning Similarity from Corpora
Smoothing for statistical language models
• "You shall know a word by the company it keeps" (Firth, 1957)
• What is tizguino? (Nida, 1975)
  – A bottle of tizguino is on the table.
  – Tizguino makes you drunk.
  – We make tizguino out of corn.
• Two alternative guesses of a speech recognizer:
  – For breakfast, she ate durian.
  – For breakfast, she ate Dorian.
• Our corpus contains neither "ate durian" nor "ate Dorian"
• But our corpus contains "ate orange" and "ate banana"

Aid for question answering and information retrieval
• Task: "Find documents about women astronauts"
• Problem: some documents use a paraphrase of astronaut:
  "In the history of Soviet/Russian space exploration, there have only been three Russian women cosmonauts: Valentina Tereshkova, Svetlana Savitskaya, and Elena Kondakova."
[Figure: CAT, DOG, and PIG each shown with the shared context words cute, smart, dirty]

  3. Outline
• Vector-space representation and similarity computation
  – Similarity-based methods for LM
• Hierarchical clustering
  – Name tagging with word clusters
• Computing semantic similarity using WordNet

Learning Similarity from Corpora
• Select important distributional properties of a word
• Create a vector of length n for each word to be classified
• Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another

Example 1: Next-Word Representation (Brown et al., 1992)
• C(x) denotes the vector of properties of x (the "context" of x)
• Assume an alphabet of size K: w_1, ..., w_K
• C(w) = ⟨#(w_1), #(w_2), ..., #(w_K)⟩, where #(w_i) is the number of times w_i followed w in the corpus

Example 2: Syntax-Based Representation
• The vector C(n) for a noun n is the distribution of verbs for which it served as direct object
• Assume a (verb) alphabet of size K: v_1, ..., v_K
• C(n) = ⟨P(v_1 | n), P(v_2 | n), ..., P(v_K | n)⟩, where P(v_i | n) is the probability that v_i is a verb for which n serves as a direct object
• The representation can be expanded to account for additional syntactic relations (subject, object, indirect object)
(Both representations are sketched in code below.)
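A minimal sketch (not from the slides) of both context-vector representations; the toy corpus, vocabulary, and (verb, direct-object) pairs are illustrative assumptions:

```python
from collections import Counter

def next_word_vector(word, tokens, vocab):
    """Brown-style context vector: counts of each vocabulary word
    appearing immediately after `word` in the token stream."""
    counts = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)
    return [counts[w] for w in vocab]

def verb_object_vector(noun, verb_object_pairs, verbs):
    """Syntax-based context vector: P(v | n) estimated from (verb, direct-object) pairs."""
    counts = Counter(v for v, n in verb_object_pairs if n == noun)
    total = sum(counts.values())
    return [counts[v] / total if total else 0.0 for v in verbs]

# toy data (illustrative only)
tokens = "she ate durian she ate banana she ate banana".split()
vocab = ["ate", "durian", "banana", "she"]
print(next_word_vector("ate", tokens, vocab))       # [0, 1, 2, 0]

pairs = [("eat", "banana"), ("eat", "durian"), ("peel", "banana")]
verbs = ["eat", "peel"]
print(verb_object_vector("banana", pairs, verbs))   # [0.5, 0.5]
```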

  4. Vector Space Model
Each word is represented as a vector x = (x_1, x_2, ..., x_n).
[Figure: words such as man, woman, grape, orange, and apple plotted as points in the vector space]

Example word-context co-occurrence matrix (each word corresponds to a column):

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0        0    1     1
American          0          1        0    1     1
spacewalking      1          1        0    0     0
red               0          0        0    1     1
full              0          0        1    0     0
old               0          0        0    1     1

Similarity Measure: Cosine
cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1..n} x_i y_i / ( sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2) )
cos(cosm, astr) = (1·0 + 0·1 + 1·1 + 0·0 + 0·0 + 0·0) / ( sqrt(1^2+0^2+1^2+0^2+0^2+0^2) · sqrt(0^2+1^2+1^2+0^2+0^2+0^2) ) = 1/2

Similarity Measure: Euclidean
euclidean(x, y) = |x − y| = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )
euclidean(cosm, astr) = sqrt( (1−0)^2 + (0−1)^2 + (1−1)^2 + (0−0)^2 + (0−0)^2 + (0−0)^2 ) = sqrt(2)

(Both measures are computed for this example in the code sketch below.)

Outline (recap)
• Vector-space representation and similarity computation
  – Similarity-based methods for LM
• Hierarchical clustering
  – Name tagging with word clusters
• Computing semantic similarity using WordNet
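A minimal sketch reproducing both measures for the cosmonaut and astronaut columns of the matrix above (plain Python lists, no external libraries):

```python
import math

def cosine(x, y):
    """cos(x, y) = x·y / (|x| |y|); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# columns of the co-occurrence matrix, rows ordered
# (Soviet, American, spacewalking, red, full, old)
cosmonaut = [1, 0, 1, 0, 0, 0]
astronaut = [0, 1, 1, 0, 0, 0]

print(cosine(cosmonaut, astronaut))     # 0.5
print(euclidean(cosmonaut, astronaut))  # 1.414... = sqrt(2)
```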

  5. Smoothing for Language Modeling
• Task: estimate the probability of unseen word pairs
• Possible approaches:
  – Katz back-off scheme: utilize unigram estimates
  – Class-based methods: utilize average co-occurrence probabilities of the classes to which the two words belong
  – Similarity-based methods

Discounting (Katz back-off)
P̂(w_2 | w_1) = P_d(w_2 | w_1)            if c(w_1, w_2) > 0
             = α(w_1) · P_r(w_2 | w_1)    otherwise
• P_d: Good-Turing discounted estimate
• α(w_1): normalization factor
• P_r: the model for probability redistribution among unseen words

Similarity-Based Methods for LM (Dagan, Lee & Pereira, 1997)
• Assumption: if word w'_1 is "similar" to word w_1, then w'_1 can yield information about the probability of unseen word pairs involving w_1
• Implementation requires:
  – a scheme for deciding which word pairs require a similarity-based estimate
  – a method for combining information from similar words
  – a function measuring similarity between words

Combining Evidence
• Idea:
  1. combine estimates for the words most similar to a word w_1
  2. weight the evidence provided by word w'_1 by a function of its similarity to w_1
• S(w_1): the set of words most similar to w_1
• W(w_1, w'_1): similarity function
• P_sim(w_2 | w_1) = Σ_{w'_1 ∈ S(w_1)} ( W(w_1, w'_1) / N(w_1) ) · P(w_2 | w'_1), where N(w_1) = Σ_{w'_1 ∈ S(w_1)} W(w_1, w'_1)
(A small code sketch of this combination follows below.)
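A minimal sketch of the similarity-weighted estimate P_sim defined above; the neighbour set S(w_1), the weights W, and the conditional probabilities are toy values assumed for illustration, not figures from Dagan, Lee & Pereira:

```python
def p_sim(w2, neighbours, weight, p_cond):
    """P_sim(w2 | w1) = sum over w1' in S(w1) of W(w1, w1') / N(w1) * P(w2 | w1'),
    where N(w1) normalizes the similarity weights.

    neighbours: iterable of words w1' in S(w1)
    weight:     dict mapping w1' -> W(w1, w1')
    p_cond:     dict mapping w1' -> {w2: P(w2 | w1')}
    """
    n_w1 = sum(weight[w] for w in neighbours)          # N(w1)
    return sum(weight[w] / n_w1 * p_cond[w].get(w2, 0.0) for w in neighbours)

# toy example: estimate P(durian | ate) from words similar to "ate" (made-up numbers)
neighbours = ["devoured", "consumed"]
weight = {"devoured": 0.6, "consumed": 0.4}
p_cond = {"devoured": {"durian": 0.02}, "consumed": {"durian": 0.01}}
print(p_sim("durian", neighbours, weight, p_cond))  # 0.016
```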

  6. Combining Evidence (cont.)
• How to define S(w_1)? Possible options:
  – S(w_1) = V (the whole vocabulary)
  – S(w_1): the closest k or fewer words w'_1 such that the dissimilarity between w_1 and w'_1 is less than a threshold value t
• Redistribution model: P_r(w_2 | w_1) = P_sim(w_2 | w_1)

Kullback-Leibler Divergence
• Definition: the KL divergence D(p || q) measures how much information is lost if we assume distribution q when the true distribution is p:
  D(p || q) = Σ_i p_i log( p_i / q_i )
• Properties:
  – Non-negative
  – D(p || q) = 0 iff p = q
  – Not symmetric and does not satisfy the triangle inequality
  – If q_i = 0 and p_i > 0, then D(p || q) is infinite

Other Probabilistic Dissimilarity Measures
• Information radius:
  IRad(p, q) = D(p || (p + q)/2) + D(q || (p + q)/2)
  – Symmetric
  – Well-defined if either q_i > 0 or p_i > 0
• L_1 norm:
  L_1(p, q) = Σ_i | p_i − q_i |
  – Symmetric
  – Well-defined for arbitrary p and q
(All three measures are sketched in code below.)

Evaluation Task: Word Disambiguation
• Task: given a noun and two verbs, decide which verb is more likely to have this noun as a direct object
  – P(plans | make) vs. P(plans | take)
  – P(action | make) vs. P(action | take)
• Construction of candidate verb pairs:
  – generate verb-noun pairs from the test set
  – select pairs of verbs with similar frequency
  – remove all the pairs seen in the training set
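A minimal sketch of the three (dis)similarity measures, assuming p and q are equal-length lists representing probability distributions:

```python
import math

def kl(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i); infinite when q_i = 0 but p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")
            total += pi * math.log(pi / qi)
    return total

def irad(p, q):
    """Information radius: D(p || m) + D(q || m) with m = (p + q) / 2. Symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    """L1 norm: sum_i |p_i - q_i|. Symmetric and defined for arbitrary p, q."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.5, 0.0]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p), irad(p, q), l1(p, q))  # finite, inf, finite, 0.4
```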

  7. Evaluation Setup
• Performance metric (computed in the sketch below):
  error rate = ( (# of incorrect choices) + (# of ties) / 2 ) / N
  where N is the size of the test corpus
• Data:
  – 44M words of 1998 AP newswire
  – select the 1000 most frequent nouns and their corresponding verbs
  – Training: 587,833 pairs; Testing: 17,152 pairs
• Baseline: maximum likelihood estimator (MLE)
  – Error rate: 0.5

Performance of Similarity-Based Methods
Method      Error rate
Katz        0.51
MLE         0.50
RandMLE     0.47
L_1 MLE     0.27
IRad MLE    0.26
• RandMLE: randomized combination of weights
• L_1 MLE: similarity function based on the L_1 norm
• IRad MLE: similarity function based on IRad

Automatic Thesaurus Construction
http://www.cs.ualberta.ca/~lindek/demos/depsimdoc.htm
Closest words for president:
leader 0.264431, minister 0.251936, vice president 0.238359, Clinton 0.238222, chairman 0.207511, government 0.206842, Governor 0.193404, official 0.191428, Premier 0.177853, Yeltsin 0.173577, member 0.173468, foreign minister 0.171829, Mayor 0.168488, head of state 0.167166, chief 0.164998, Ambassador 0.162118, Speaker 0.161698, General 0.159422, secretary 0.156158, chief executive 0.15158

Problems with Corpus-Based Similarity Methods
• Low-frequency words skew the results
  – "breast-undergoing", "childhood-psychosis", "outflow-infundibulum"
• Semantic similarity does not imply synonymy
  – "large-small", "heavy-light", "shallow-coastal"
• Distributional information may not be sufficient for true semantic grouping
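A minimal sketch of the disambiguation decision and the error metric above; the probability model is treated as an assumed black-box callable, and the toy numbers are illustrative only:

```python
def decide(noun, verb1, verb2, prob):
    """Pick the verb more likely to take `noun` as direct object; None on a tie."""
    p1, p2 = prob(noun, verb1), prob(noun, verb2)
    if p1 == p2:
        return None                      # tie
    return verb1 if p1 > p2 else verb2

def error_rate(test_items, prob):
    """(# incorrect + # ties / 2) / N over items (noun, v1, v2, correct_verb)."""
    incorrect = ties = 0
    for noun, v1, v2, correct in test_items:
        choice = decide(noun, v1, v2, prob)
        if choice is None:
            ties += 1
        elif choice != correct:
            incorrect += 1
    return (incorrect + ties / 2) / len(test_items)

def toy_prob(noun, verb):
    """Stand-in for a trained model of P(noun | verb); numbers are made up."""
    return {("plans", "make"): 0.7, ("plans", "take"): 0.3}.get((noun, verb), 0.5)

print(error_rate([("plans", "make", "take", "make")], toy_prob))  # 0.0
```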

  8. Outline
• Vector-space representation and similarity computation
  – Similarity-based methods for LM
• Hierarchical clustering
  – Name tagging with word clusters
• Computing semantic similarity using WordNet

Bottom-Up Hierarchical Clustering
Given: a set X = {x_1, ..., x_n} of objects and a similarity function sim

  for i := 1 to n do
      c_i := {x_i}
  C := {c_1, ..., c_n}
  j := n + 1
  while |C| > 1 do
      (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
      c_j := c_n1 ∪ c_n2
      C := (C − {c_n1, c_n2}) ∪ {c_j}
      j := j + 1

Agglomerative Clustering (greedy, bottom-up version)
• Initialization: create a separate cluster for each object
• Each iteration: find the two most similar clusters and merge them
• Termination: all the objects are in the same cluster

Example similarity matrix over objects A to E (with the resulting dendrogram over the leaves C, D, E, A, B):
       B     C     D     E
A     0.1   0.2   0.2   0.8
B           0.1   0.1   0.2
C                 0.7   0.0
D                       0.6

(A runnable version of the algorithm, applied to this example, follows below.)
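A minimal runnable counterpart to the pseudocode above, under two assumptions the slides do not fix: similarities are given as a precomputed pairwise table, and cluster-to-cluster similarity is taken as the average pairwise (average-link) similarity. The toy values mirror the example matrix above:

```python
from itertools import combinations

def avg_link(c1, c2, sim):
    """Average pairwise similarity between two clusters (one of several possible linkages)."""
    return sum(sim[(min(a, b), max(a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(objects, sim):
    """Greedy bottom-up clustering; returns the merge history (the dendrogram)."""
    clusters = [frozenset([x]) for x in objects]   # one singleton cluster per object
    history = []
    while len(clusters) > 1:
        # find the two most similar distinct clusters and merge them
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg_link(*pair, sim))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((set(c1), set(c2)))
    return history

# pairwise similarities over objects A..E (values from the example matrix above)
sim = {("A", "B"): 0.1, ("A", "C"): 0.2, ("A", "D"): 0.2, ("A", "E"): 0.8,
       ("B", "C"): 0.1, ("B", "D"): 0.1, ("B", "E"): 0.2,
       ("C", "D"): 0.7, ("C", "E"): 0.0, ("D", "E"): 0.6}
for merge in agglomerate("ABCDE", sim):
    print(merge)
```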


