INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 22/25: Hierarchical Clustering Paul Ginsparg Cornell University, Ithaca, NY 17 Nov 2011 1 / 75
Overview
1. Recap
2. Feature selection
3. Introduction to Hierarchical clustering
4. Single-link/Complete-link
5. Centroid/GAAC
6. Variants
2 / 75
Administrativa Assignment 4 due 2 Dec (extended til 4 Dec). Sample of minHash duplicate detection . . . 3 / 75
Worked example, assignment 4 first part
doc1: "This is a six word sentence" (0 1 2 3)
doc2: "six word sentence this is a" (3 4 5 0)
doc3: "sentence this is a six word" (5 0 1 2)
doc4: "this this this this this this" (6 6 6 6)
3-grams:
0 "this is a"
1 "is a six"
2 "a six word"
3 "six word sentence"
4 "word sentence this"
5 "sentence this is"
6 "this this this"
4 / 75
Documents contain the following 3-grams:
doc1: {0, 1, 2, 3}
doc2: {0, 3, 4, 5}
doc3: {0, 1, 2, 5}
doc4: {6}
J(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2|
J(d_1, d_2) = 2/6, J(d_1, d_3) = 3/5, J(d_2, d_3) = 2/6, J(d_i, d_4) = 0
To estimate, pick three random functions f_i(x) = (ax + b) mod 7:
f_1(x) = (2x + 3) mod 7: (0,1,2,3,4,5,6) → (3,5,0,2,4,6,1)
f_2(x) = (x + 5) mod 7: (0,1,2,3,4,5,6) → (5,6,0,1,2,3,4)
f_3(x) = (4x + 1) mod 7: (0,1,2,3,4,5,6) → (1,5,2,6,3,0,4)
5 / 75
Each sketch component is the minimum hash value of the document's 3-gram IDs under the corresponding f_i:
doc1: f_1: (0,1,2,3) → (3,5,0,2), f_2: (0,1,2,3) → (5,6,0,1), f_3: (0,1,2,3) → (1,5,2,6) ⇒ sketch for doc1 is (0,0,1)
doc2: f_1: (0,3,4,5) → (3,2,4,6), f_2: (0,3,4,5) → (5,1,2,3), f_3: (0,3,4,5) → (1,6,3,0) ⇒ sketch for doc2 is (2,1,0)
doc3: f_1: (0,1,2,5) → (3,5,0,6), f_2: (0,1,2,5) → (5,6,0,3), f_3: (0,1,2,5) → (1,5,2,0) ⇒ sketch for doc3 is (0,0,0)
doc4: f_1: (6) → (1), f_2: (6) → (4), f_3: (6) → (4) ⇒ sketch for doc4 is (1,4,4)
6 / 75
Estimate J as k/n, where k is the number of agreeing sketch components and n = 3 is the sketch length:
doc1=(0,0,1), doc2=(2,1,0): estimate k/n = 0/3, exact J(d_1, d_2) = 1/3
doc1=(0,0,1), doc3=(0,0,0): estimate k/n = 2/3, exact J(d_1, d_3) = 3/5
doc2=(2,1,0), doc3=(0,0,0): estimate k/n = 1/3, exact J(d_2, d_3) = 1/3
doc1=(0,0,1), doc4=(1,4,4): estimate k/n = 0/3, exact J(d_1, d_4) = 0
doc2=(2,1,0), doc4=(1,4,4): estimate k/n = 0/3, exact J(d_2, d_4) = 0
doc3=(0,0,0), doc4=(1,4,4): estimate k/n = 0/3, exact J(d_3, d_4) = 0
7 / 75
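The same computation, as a minimal Python sketch (the 3-gram ID sets and hash functions are copied from the slides above; the helper names are illustrative):

```python
# Minimal sketch of the minHash computation in the worked example above.
# The 3-gram IDs and hash functions follow the slides; names are illustrative.

def jaccard(s1, s2):
    """Exact Jaccard coefficient J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|."""
    return len(s1 & s2) / len(s1 | s2)

# 3-gram ID sets for doc1..doc4, as enumerated on the previous slides.
docs = [{0, 1, 2, 3}, {0, 3, 4, 5}, {0, 1, 2, 5}, {6}]

# The three random hash functions f_i(x) = (a*x + b) mod 7.
hash_funcs = [lambda x: (2 * x + 3) % 7,
              lambda x: (x + 5) % 7,
              lambda x: (4 * x + 1) % 7]

def sketch(gram_ids):
    """minHash sketch: the minimum hash value under each function."""
    return tuple(min(f(g) for g in gram_ids) for f in hash_funcs)

def estimate(sk1, sk2):
    """Estimated Jaccard = fraction of sketch components that agree (k/n)."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

sketches = [sketch(d) for d in docs]   # (0,0,1), (2,1,0), (0,0,0), (1,4,4)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(f"doc{i+1} vs doc{j+1}: estimate {estimate(sketches[i], sketches[j]):.2f}, "
              f"exact {jaccard(docs[i], docs[j]):.2f}")
```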
Outline
1. Recap
2. Feature selection
3. Introduction to Hierarchical clustering
4. Single-link/Complete-link
5. Centroid/GAAC
6. Variants
8 / 75
Major issue in clustering – labeling After a clustering algorithm finds a set of clusters: how can they be useful to the end user? We need a pithy label for each cluster. For example, in search result clustering for "jaguar", the labels of the three clusters could be "animal", "car", and "operating system". Topic of this section: How can we automatically find good labels for clusters? 9 / 75
Exercise Come up with an algorithm for labeling clusters Input: a set of documents, partitioned into K clusters (flat clustering) Output: A label for each cluster Part of the exercise: What types of labels should we consider? Words? 10 / 75
Discriminative labeling To label cluster ω , compare ω with all other clusters Find terms or phrases that distinguish ω from the other clusters We can use any of the feature selection criteria used in text classification to identify discriminating terms: (i) mutual information, (ii) χ 2 , (iii) frequency (though frequency is actually not discriminative) 11 / 75
Non-discriminative labeling Select terms or phrases based solely on information from the cluster itself Terms with high weights in the centroid (if we are using a vector space model) Non-discriminative methods sometimes select frequent terms that do not distinguish clusters. For example, Monday , Tuesday , . . . in newspaper text 12 / 75
Using titles for labeling clusters Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about. Alternative: titles For example, the titles of two or three documents that are closest to the centroid. Titles are easier to scan than a list of phrases. 13 / 75
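A minimal sketch of these two non-discriminative options (top-weighted centroid terms, and titles of the documents closest to the centroid), assuming tf-idf document vectors as a NumPy matrix; all names here are illustrative:

```python
import numpy as np

def label_cluster(doc_vectors, titles, vocab, n_terms=5, n_titles=2):
    """Label a cluster by (a) its highest-weighted centroid terms and
    (b) the titles of the documents closest to the centroid.

    doc_vectors: (num_docs, num_terms) tf-idf matrix for the cluster's docs
    titles:      list of document titles, aligned with the rows of doc_vectors
    vocab:       list mapping term index -> term string
    """
    centroid = doc_vectors.mean(axis=0)

    # (a) terms with the highest centroid weight
    top_terms = [vocab[i] for i in np.argsort(-centroid)[:n_terms]]

    # (b) titles of the documents with the smallest distance to the centroid
    dists = np.linalg.norm(doc_vectors - centroid, axis=1)
    top_titles = [titles[i] for i in np.argsort(dists)[:n_titles]]

    return top_terms, top_titles
```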
Cluster labeling: Example
Three labeling methods: most prominent terms in the centroid, differential labeling using MI, title of the doc closest to the centroid.

Cluster 4 (622 docs)
  centroid: oil plant mexico production crude power 000 refinery gas bpd
  mutual information: plant oil production barrels crude bpd mexico dolly capacity petroleum
  title: MEXICO: Hurricane Dolly heads for Mexico coast

Cluster 9 (1017 docs)
  centroid: police security russian people military peace killed told grozny court
  mutual information: police killed military security peace told troops forces rebels people
  title: RUSSIA: Russia's Lebed meets rebel chief in Chechnya

Cluster 10 (1259 docs)
  centroid: 00 000 tonnes traders futures wheat prices cents september tonne
  mutual information: delivery traders futures tonne tonnes desk wheat prices 000 00
  title: USA: Export Business - Grain/oilseeds complex

All three methods do a pretty good job.
14 / 75
Outline
1. Recap
2. Feature selection
3. Introduction to Hierarchical clustering
4. Single-link/Complete-link
5. Centroid/GAAC
6. Variants
15 / 75
Feature selection In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature Many dimensions correspond to rare words. Rare words can mislead the classifier. Rare misleading features are called noise features. Eliminating noise features from the representation increases efficiency and effectiveness of text classification. Eliminating features is called feature selection. 16 / 75
Example for a noise feature Let’s say we’re doing text classification for the class China . Suppose a rare term, say arachnocentric , has no information about China . . . . . . but all instances of arachnocentric happen to occur in China documents in our training set. Then we may learn a classifier that incorrectly interprets arachnocentric as evidence for the class China . Such an incorrect generalization from an accidental property of the training set is called overfitting. Feature selection reduces overfitting and improves the accuracy of the classifier. 17 / 75
Basic feature selection algorithm
SelectFeatures(D, c, k)
  V ← ExtractVocabulary(D)
  L ← []
  for each t ∈ V
    do A(t, c) ← ComputeFeatureUtility(D, t, c)
       Append(L, ⟨A(t, c), t⟩)
  return FeaturesWithLargestValues(L, k)
How do we compute A, the feature utility?
18 / 75
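A direct Python rendering of this pseudocode, as a sketch: `compute_feature_utility` is a stand-in for whichever utility measure (frequency, MI, χ²) gets plugged in, and the helper names are not from the slides:

```python
import heapq

def select_features(docs, c, k, compute_feature_utility):
    """Select the k terms with the largest utility A(t, c) for class c.

    docs: iterable of tokenized training documents
    compute_feature_utility: callable(docs, term, c) -> utility score
    """
    vocabulary = {t for d in docs for t in d}           # ExtractVocabulary(D)
    scored = [(compute_feature_utility(docs, t, c), t)  # A(t, c) for each t
              for t in vocabulary]
    # FeaturesWithLargestValues(L, k)
    return [t for _, t in heapq.nlargest(k, scored)]
```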
Different feature selection methods A feature selection method is mainly defined by the feature utility measures it employs. Feature utility measures: Frequency – select the most frequent terms Mutual information – select the terms with the highest mutual information (mutual information is also called information gain in this context) χ 2 (Chi-square) 19 / 75
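As an aside, the χ² utility has a closed form on the same 2×2 term/class contingency counts used for MI below; this is the standard formula for a 2×2 table and is not spelled out on the slide:

```python
def chi2_from_counts(n11, n10, n01, n00):
    """Chi-square statistic for the 2x2 term/class contingency table.
    Closed form: N * (N11*N00 - N10*N01)^2 / (N1. * N0. * N.1 * N.0).
    (Standard 2x2 result; stated here as a hedged supplement to the slide.)"""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in the class
    denom = n1_ * n0_ * n_1 * n_0
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom
```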
Information
H[p] = − ∑_{i=1}^{n} p_i log₂ p_i measures information uncertainty (p. 91 in book)
has maximum H = log₂ n for all p_i = 1/n
Consider two probability distributions: p(x) for x ∈ X and p(y) for y ∈ Y
MI: I[X; Y] = H[p(x)] + H[p(y)] − H[p(x, y)]
measures how much information p(x) gives about p(y) (and vice versa)
MI is zero iff p(x, y) = p(x) p(y), i.e., x and y are independent for all x ∈ X and y ∈ Y
can be as large as H[p(x)] or H[p(y)]
I[X; Y] = ∑_{x ∈ X, y ∈ Y} p(x, y) log₂ [ p(x, y) / (p(x) p(y)) ]
20 / 75
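A small numerical sketch of these definitions (the helper names and the example joint distributions are made up for illustration):

```python
import math

def entropy(probs):
    """H[p] = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I[X;Y] = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ).

    joint: dict mapping (x, y) -> p(x, y), summing to 1.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():      # marginals p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Example: a dependent joint distribution vs. an independent one.
dependent   = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(dependent))    # > 0: x tells us something about y
print(mutual_information(independent))  # 0.0: p(x,y) = p(x) p(y)
```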
Mutual information
Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.
MI tells us “how much information” the term contains about the class and vice versa.
For example, if a term’s occurrence is independent of the class (same proportion of docs within/without class contain the term), then MI is 0.
Definition:
I(U; C) = ∑_{e_t ∈ {1,0}} ∑_{e_c ∈ {1,0}} P(U = e_t, C = e_c) log₂ [ P(U = e_t, C = e_c) / (P(U = e_t) P(C = e_c)) ]
        = p(t, c) log₂ [p(t, c) / (p(t) p(c))] + p(t, c̄) log₂ [p(t, c̄) / (p(t) p(c̄))]
        + p(t̄, c) log₂ [p(t̄, c) / (p(t̄) p(c))] + p(t̄, c̄) log₂ [p(t̄, c̄) / (p(t̄) p(c̄))]
21 / 75
How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:
I(U; C) = (N_11/N) log₂ [N N_11 / (N_1. N_.1)] + (N_10/N) log₂ [N N_10 / (N_1. N_.0)]
        + (N_01/N) log₂ [N N_01 / (N_0. N_.1)] + (N_00/N) log₂ [N N_00 / (N_0. N_.0)]   (1)
N_11: # of documents that contain t (e_t = 1) and are in c (e_c = 1)
N_10: # of documents that contain t (e_t = 1) and are not in c (e_c = 0)
N_01: # of documents that don’t contain t (e_t = 0) and are in c (e_c = 1)
N_00: # of documents that don’t contain t (e_t = 0) and are not in c (e_c = 0)
N = N_00 + N_01 + N_10 + N_11
p(t, c) ≈ N_11/N, p(t̄, c) ≈ N_01/N, p(t, c̄) ≈ N_10/N, p(t̄, c̄) ≈ N_00/N
N_1. = N_10 + N_11: # documents that contain t, p(t) ≈ N_1./N
N_.1 = N_01 + N_11: # documents in c, p(c) ≈ N_.1/N
N_0. = N_00 + N_01: # documents that don’t contain t, p(t̄) ≈ N_0./N
N_.0 = N_00 + N_10: # documents not in c, p(c̄) ≈ N_.0/N
22 / 75
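A sketch of equation (1) in Python, together with a helper that tallies the four counts from labeled documents (function and variable names are illustrative, not from the slides):

```python
import math

def mi_from_counts(n11, n10, n01, n00):
    """Expected mutual information I(U;C) from the four document counts,
    following equation (1); 0 * log(0) terms are treated as 0."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in the class
    def term(nij, row, col):
        return 0.0 if nij == 0 else (nij / n) * math.log2(n * nij / (row * col))
    return (term(n11, n1_, n_1) + term(n10, n1_, n_0) +
            term(n01, n0_, n_1) + term(n00, n0_, n_0))

def counts_for_term(docs, labels, t):
    """docs: list of token sets; labels: list of booleans (in class c or not)."""
    n11 = sum(1 for d, y in zip(docs, labels) if t in d and y)
    n10 = sum(1 for d, y in zip(docs, labels) if t in d and not y)
    n01 = sum(1 for d, y in zip(docs, labels) if t not in d and y)
    n00 = sum(1 for d, y in zip(docs, labels) if t not in d and not y)
    return n11, n10, n01, n00
```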