MIA - Master on Artificial Intelligence
Advanced Natural Language Processing
Similarity and Clustering
Outline
1 Similarity and Clustering
  Similarity
  Clustering
    Hierarchical Clustering
    Non-hierarchical Clustering
  Evaluation
The Concept of Similarity

Similarity, proximity, affinity, distance, difference, divergence.

We use distance when the metric properties hold:
  d(x, x) = 0
  d(x, y) > 0 when x ≠ y
  d(x, y) = d(y, x)  (symmetry)
  d(x, z) ≤ d(x, y) + d(y, z)  (triangle inequality)

We use similarity in the general case:
  Function: sim : A × B → S (where S is often [0, 1])
  Homogeneous: sim : A × A → S (e.g. word-to-word)
  Heterogeneous: sim : A × B → S (e.g. word-to-document)
  Not necessarily symmetric, nor satisfying the triangle inequality.
The Concept of Similarity

If A is a metric space, the distance in A may be used:
  D_euclidean(x, y) = |x − y| = √( Σ_i (x_i − y_i)² )

Similarity vs. distance:
  sim_D(A, B) = 1 / (1 + D(A, B))
  monotonic: min{ sim(x, y), sim(x, z) } ≤ sim(x, y ∪ z)
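A minimal sketch (assuming NumPy) of the Euclidean distance and the 1/(1 + D) conversion above:

  import numpy as np

  def euclidean(x, y):
      # L2 distance between two vectors
      return np.sqrt(np.sum((x - y) ** 2))

  def sim_from_dist(d):
      # turn a non-negative distance into a similarity in (0, 1]
      return 1.0 / (1.0 + d)

  x = np.array([1.0, 0.0, 2.0])
  y = np.array([0.0, 1.0, 2.0])
  print(euclidean(x, y))                 # ~1.414
  print(sim_from_dist(euclidean(x, y)))  # ~0.414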
Applications

Clustering, case-based reasoning, IR, ...
  Discovering related words - distributional similarity
  Resolving syntactic ambiguity - taxonomic similarity
  Resolving semantic ambiguity - ontological similarity
  Acquiring selectional restrictions/preferences
Relevant Information

Content (information about the compared units)
  Words: form, morphology, PoS, ...
  Senses: synset, topic, domain, ...
  Syntax: parse trees, syntactic roles, ...
  Documents: words, collocations, NEs, ...
Context (information about the situation in which similarity is computed)
  Window-based vs. syntactic-based
External knowledge
  Monolingual/bilingual dictionaries, ontologies, corpora
Vectorial methods (1)

L1 norm, Manhattan distance, taxi-cab distance, city-block distance:
  L1(x, y) = Σ_{i=1..N} |x_i − y_i|

L2 norm, Euclidean distance:
  L2(x, y) = |x − y| = √( Σ_{i=1..N} (x_i − y_i)² )

Cosine:
  cos(x, y) = (x · y) / (|x| · |y|) = Σ_i x_i y_i / ( √(Σ_i x_i²) · √(Σ_i y_i²) )
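A sketch of the three measures with NumPy (assumed here; any vector library works the same way):

  import numpy as np

  def l1(x, y):
      # Manhattan / city-block distance
      return np.sum(np.abs(x - y))

  def l2(x, y):
      # Euclidean distance
      return np.sqrt(np.sum((x - y) ** 2))

  def cosine(x, y):
      # cosine of the angle between x and y (a similarity; 1 - cosine gives a distance)
      return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

  x = np.array([1.0, 2.0, 0.0])
  y = np.array([2.0, 1.0, 1.0])
  print(l1(x, y), l2(x, y), cosine(x, y))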
Vectorial methods (2)

L1 and L2 norms are particular cases of the Minkowski measure:
  D_minkowski(x, y) = L_r(x, y) = ( Σ_{i=1..N} |x_i − y_i|^r )^(1/r)

Canberra distance:
  D_canberra(x, y) = Σ_{i=1..N} |x_i − y_i| / |x_i + y_i|

Chebyshev distance:
  D_chebyshev(x, y) = max_i |x_i − y_i|
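These measures are also available in SciPy; a sketch (assuming scipy is installed):

  import numpy as np
  from scipy.spatial import distance

  x = np.array([1.0, 2.0, 0.0])
  y = np.array([2.0, 1.0, 1.0])

  print(distance.minkowski(x, y, p=3))  # L_r with r = 3
  print(distance.cityblock(x, y))       # L_1 (r = 1)
  print(distance.euclidean(x, y))       # L_2 (r = 2)
  print(distance.canberra(x, y))        # note: SciPy uses |x_i| + |y_i| in the
                                        # denominator; equal to |x_i + y_i| for
                                        # non-negative vectors (e.g. counts)
  print(distance.chebyshev(x, y))       # max_i |x_i - y_i|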
Set-oriented methods (3): Binary-valued vectors seen as sets

  Dice:     S_dice(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|)
  Jaccard:  S_jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
  Overlap:  S_overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|)
  Cosine:   cos(X, Y) = |X ∩ Y| / √(|X| · |Y|)

The above similarities are in [0, 1] and can be used as distances simply by subtracting: D = 1 − S.
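A sketch of the set-based similarities using plain Python sets (the example word sets are illustrative only):

  def dice(X, Y):
      return 2 * len(X & Y) / (len(X) + len(Y))

  def jaccard(X, Y):
      return len(X & Y) / len(X | Y)

  def overlap(X, Y):
      return len(X & Y) / min(len(X), len(Y))

  def set_cosine(X, Y):
      return len(X & Y) / (len(X) * len(Y)) ** 0.5

  X = {"the", "cat", "sat", "on", "mat"}
  Y = {"the", "dog", "sat", "on", "log"}
  print(dice(X, Y), jaccard(X, Y), overlap(X, Y), set_cosine(X, Y))
  print(1 - jaccard(X, Y))   # used as a distance: D = 1 - S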
Set-oriented methods (4): Agreement contingency table

                         Object i
                         1        0
  Object j      1        a        b        a + b
                0        c        d        c + d
                         a + c    b + d    p

  Dice:     S_dice(i, j) = 2a / (2a + b + c)
  Jaccard:  S_jaccard(i, j) = a / (a + b + c)
  Overlap:  S_overlap(i, j) = a / min(a + b, a + c)
  Cosine:   S_cosine(i, j) = a / √((a + b)(a + c))
  Matching coefficient:  S_mc(i, j) = (a + d) / p
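A sketch (assuming NumPy) of how the contingency counts a, b, c, d are obtained from two binary vectors:

  import numpy as np

  def contingency(i, j):
      # a, b, c, d counts for two binary vectors of equal length
      i, j = np.asarray(i, bool), np.asarray(j, bool)
      a = np.sum(i & j)     # i = 1 and j = 1
      b = np.sum(~i & j)    # i = 0 and j = 1
      c = np.sum(i & ~j)    # i = 1 and j = 0
      d = np.sum(~i & ~j)   # i = 0 and j = 0
      return a, b, c, d

  i = [1, 1, 0, 1, 0, 0]
  j = [1, 0, 0, 1, 1, 0]
  a, b, c, d = contingency(i, j)
  p = a + b + c + d
  print(2 * a / (2 * a + b + c))   # Dice
  print(a / (a + b + c))           # Jaccard
  print((a + d) / p)               # Matching coefficient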
Distributional Similarity

Particular case of vectorial representation where attributes are probability distributions:
  xᵀ = [x_1 ... x_N] such that ∀i, 0 ≤ x_i ≤ 1 and Σ_{i=1..N} x_i = 1

Kullback-Leibler divergence (relative entropy), non-symmetrical:
  D(q || r) = Σ_{y ∈ Y} q(y) log ( q(y) / r(y) )

Mutual information:
  I(A, B) = D(h || f · g) = Σ_{a ∈ A} Σ_{b ∈ B} h(a, b) log ( h(a, b) / (f(a) · g(b)) )
  (KL-divergence between the joint and the product distribution)
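A sketch of both quantities with NumPy (base-2 logs chosen here for bits; natural logs are equally common):

  import numpy as np

  def kl_divergence(q, r):
      # D(q || r) = sum_y q(y) log(q(y)/r(y)); assumes r(y) > 0 wherever q(y) > 0
      q, r = np.asarray(q, float), np.asarray(r, float)
      mask = q > 0                      # 0 * log 0 is taken as 0
      return np.sum(q[mask] * np.log2(q[mask] / r[mask]))

  def mutual_information(joint):
      # I(A;B) = D(h || f*g) for a joint distribution h(a,b) given as a 2-D array
      joint = np.asarray(joint, float)
      f = joint.sum(axis=1, keepdims=True)   # marginal of A
      g = joint.sum(axis=0, keepdims=True)   # marginal of B
      return kl_divergence(joint.ravel(), (f * g).ravel())

  print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
  print(mutual_information([[0.3, 0.1],
                            [0.1, 0.5]]))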
Semantic Similarity

Project objects onto a semantic space:
  D_A(x_1, x_2) = D_B(f(x_1), f(x_2))

Semantic spaces: an ontology (WordNet, CYC, SUMO, ...) or a graph-like knowledge base (e.g. Wikipedia).
  Not easy to project words, since the semantic space is composed of concepts, and a word may map to more than one concept.
  Not obvious how to compute distances in the semantic space.
WordNet
[Figure: examples of the WordNet concept hierarchy]
Distances in WordNet

WordNet::Similarity
http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi

Some definitions:
  SLP(s_1, s_2) = shortest path length from concept s_1 to s_2
    (Which subset of arcs is used? antonymy, gloss, ...)
  depth(s) = depth of concept s in the ontology
  MaxDepth = max_{s ∈ WN} depth(s)
  LCS(s_1, s_2) = lowest common subsumer of s_1 and s_2
  IC(s) = −log P(s) = information content of s (given a corpus)
Distances in WordNet

  Shortest path length:  D(s_1, s_2) = SLP(s_1, s_2)
  Leacock & Chodorow:    D(s_1, s_2) = −log ( SLP(s_1, s_2) / (2 · MaxDepth) )
  Wu & Palmer:           D(s_1, s_2) = 2 · depth(LCS(s_1, s_2)) / ( depth(s_1) + depth(s_2) )
  Resnik:                D(s_1, s_2) = IC(LCS(s_1, s_2))
  Jiang & Conrath:       D(s_1, s_2) = IC(s_1) + IC(s_2) − 2 · IC(LCS(s_1, s_2))
  Lin:                   D(s_1, s_2) = 2 · IC(LCS(s_1, s_2)) / ( IC(s_1) + IC(s_2) )
  Gloss overlap: sum of squares of the lengths of word overlaps between glosses
  Gloss vector: cosine of second-order co-occurrence vectors of glosses
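A sketch of these measures using NLTK's WordNet interface (an alternative to the WordNet::Similarity web demo linked above); it assumes the NLTK wordnet and wordnet_ic data packages have been downloaded:

  from nltk.corpus import wordnet as wn, wordnet_ic

  # Concepts are synsets; a word may map to several of them.
  dog = wn.synset('dog.n.01')
  cat = wn.synset('cat.n.01')

  print(dog.path_similarity(cat))   # based on shortest path length
  print(dog.lch_similarity(cat))    # Leacock & Chodorow
  print(dog.wup_similarity(cat))    # Wu & Palmer

  # Information-content measures need corpus counts (here: Brown corpus IC file).
  brown_ic = wordnet_ic.ic('ic-brown.dat')
  print(dog.res_similarity(cat, brown_ic))   # Resnik
  print(dog.jcn_similarity(cat, brown_ic))   # Jiang & Conrath
  print(dog.lin_similarity(cat, brown_ic))   # Lin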
Distances in Wikipedia

  Measures using links, including measures used on WordNet but applied to the Wikipedia graph
    http://www.h-its.org/english/research/nlp/download/wikipediasimilarity.php
  Measures using the content of articles (vector spaces)
  Measures using Wikipedia categories
Clustering

Partition a set of objects into clusters.
  Objects: features and values
  Similarity measure
Utilities:
  Exploratory Data Analysis (EDA)
  Generalization (learning). Ex: having seen "on Monday", "on Sunday", infer "on Friday"
  Supervised vs. unsupervised classification
Object assignment to clusters:
  Hard: one cluster per object
  Soft: distribution P(c_i | x_j), degree of membership
Clustering

Produced structures:
  Hierarchical (set of clusters + relationships)
    Good for detailed data analysis; provides more information
    Less efficient; no single best algorithm
  Flat / non-hierarchical (set of clusters)
    Preferable if efficiency is required or for large data sets
    K-means: simple method, a sufficient starting point
    K-means assumes a Euclidean space; if that is not the case, EM may be used
Cluster representative, the centroid:
  μ = (1/|c|) · Σ_{x ∈ c} x
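A minimal K-means sketch with scikit-learn (an assumption; any implementation of Lloyd's algorithm behaves the same), clustering a few 2-D points into two groups:

  import numpy as np
  from sklearn.cluster import KMeans

  X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],    # one group
                [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])   # another group

  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
  print(km.labels_)           # hard assignment: one cluster per object
  print(km.cluster_centers_)  # centroids: mu = (1/|c|) * sum of members
  # For soft assignments (degrees of membership) an EM-based model such as
  # sklearn.mixture.GaussianMixture can be used instead.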
Dendrogram

[Figure: single-link clustering of 22 frequent English words represented as a dendrogram: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was]
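A sketch of how such a dendrogram can be produced with SciPy, assuming each word has some vector representation (random placeholder vectors here, just to illustrate the API, not the data behind the figure):

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram
  import matplotlib.pyplot as plt

  words = ["be", "not", "he", "I", "it", "this", "the", "his"]
  rng = np.random.default_rng(0)
  vectors = rng.random((len(words), 10))   # placeholder word vectors

  Z = linkage(vectors, method='single')    # single-link agglomerative clustering
  dendrogram(Z, labels=words)
  plt.show()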
Hierarchical Clustering

  Bottom-up (agglomerative clustering)
    Start with individual objects, iteratively group the most similar.
  Top-down (divisive clustering)
    Start with all the objects, iteratively divide them maximizing within-group similarity.
Agglomerative Clustering (Bottom-up)

Input:  a set X = {x_1, ..., x_n} of objects
        a function sim : P(X) × P(X) → R
Output: a cluster hierarchy

  for i := 1 to n do c_i := {x_i} end
  C := {c_1, ..., c_n};  j := n + 1
  while |C| > 1 do
      (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
      c_j := c_n1 ∪ c_n2
      C := C \ {c_n1, c_n2} ∪ {c_j}
      j := j + 1
  end-while
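A direct Python transcription of the pseudo-code above (a sketch; sim here is group-average linkage based on negated Euclidean distance, one of several reasonable choices of cluster similarity):

  import numpy as np

  def sim(c1, c2, X):
      # group-average similarity: negated mean Euclidean distance between clusters
      return -np.mean([np.linalg.norm(X[i] - X[j]) for i in c1 for j in c2])

  def agglomerative(X):
      C = [frozenset([i]) for i in range(len(X))]   # c_i := {x_i}
      merges = []
      while len(C) > 1:
          # (c_n1, c_n2) := argmax over distinct pairs of sim(c_u, c_v)
          pairs = [(u, v) for u in C for v in C if u != v]
          cu, cv = max(pairs, key=lambda p: sim(p[0], p[1], X))
          cj = cu | cv                               # c_j := c_n1 U c_n2
          C = [c for c in C if c not in (cu, cv)] + [cj]
          merges.append(sorted(cj))                  # record the hierarchy
      return merges

  X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
  print(agglomerative(X))   # merges the two close pairs first, then everything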