INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 23/26: Hierarchical Clustering & Text Classification Redux Paul Ginsparg Cornell University, Ithaca, NY 19 Nov 2009 1 / 61
Administrativa Assignment 4 will be available by the end of today (Thu 19 Nov), due Fri 4 Dec (extended to Sun 6 Dec). Discussion 7 (Tues 24 Nov): Peter Norvig, "How to Write a Spelling Corrector" http://norvig.com/spell-correct.html See also http://www.facebook.com/video/video.php?v=644326502463 roughly 00:11:00 – 00:19:15 of a one hour video, but whole first half (or more) if you have time... 2 / 61
Overview
1 Recap
2 Single-link/Complete-link
3 Centroid/GAAC
4 Variants
5 Labeling clusters
6 Text classification
3 / 61
Outline
1 Recap
2 Single-link/Complete-link
3 Centroid/GAAC
4 Variants
5 Labeling clusters
6 Text classification
4 / 61
Hierarchical agglomerative clustering (HAC) HAC creates a hierarchy in the form of a binary tree. Assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures. 5 / 61
A dendrogram
[Figure: dendrogram of news-story titles, with the similarity axis running from 1.0 down to 0.0]
The history of mergers can be read off from left to right. The vertical line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.
6 / 61
Naive HAC algorithm
SimpleHAC(d_1, ..., d_N)
 1  for n ← 1 to N
 2    do for i ← 1 to N
 3      do C[n][i] ← Sim(d_n, d_i)
 4    I[n] ← 1   (keeps track of active clusters)
 5  A ← []   (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7    do ⟨i, m⟩ ← arg max_{⟨i,m⟩ : i ≠ m ∧ I[i] = 1 ∧ I[m] = 1} C[i][m]
 8      A.Append(⟨i, m⟩)   (store merge)
 9      for j ← 1 to N
10        do   (use i as representative for ⟨i, m⟩)
11          C[i][j] ← Sim(⟨i, m⟩, j)
12          C[j][i] ← Sim(⟨i, m⟩, j)
13      I[m] ← 0   (deactivate cluster)
14  return A
7 / 61
Computational complexity of the naive algorithm
First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations:
We scan the O(N × N) similarities to find the maximum similarity.
We merge the two clusters with maximum similarity.
We compute the similarity of the new cluster with all other (surviving) clusters.
There are O(N) iterations, each performing an O(N × N) "scan" operation, so the overall complexity is O(N^3). We'll look at more efficient algorithms later.
8 / 61
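The algorithm above can be sketched in a few lines of Python. This is an illustrative, hypothetical implementation (names and the toy similarity function are mine, not from the slides); single-link is used for the matrix update, but any of the four cluster-similarity measures could be plugged in.

```python
# Naive O(N^3) HAC: an O(N^2) scan for the best pair, repeated N-1 times.

def naive_hac(docs, sim):
    """Return the merge sequence A as a list of (i, m) index pairs."""
    N = len(docs)
    # C[n][i]: current similarity between clusters n and i
    C = [[sim(docs[n], docs[i]) for i in range(N)] for n in range(N)]
    I = [True] * N            # active-cluster flags
    A = []                    # merge history
    for _ in range(N - 1):
        # scan all O(N^2) pairs of active clusters for the most similar
        i, m = max(
            ((i, m) for i in range(N) for m in range(N)
             if i != m and I[i] and I[m]),
            key=lambda p: C[p[0]][p[1]],
        )
        A.append((i, m))
        for j in range(N):
            # single-link update: similarity to the merged cluster is
            # the max of the similarities to its two parts
            s = max(C[i][j], C[m][j])
            C[i][j] = s
            C[j][i] = s
        I[m] = False          # deactivate the absorbed cluster
    return A

# Toy run: 1-D "documents", with negative distance as similarity.
merges = naive_hac([0.0, 0.1, 5.0], lambda a, b: -abs(a - b))
```

With these three points the two nearby documents merge first, then the merged cluster absorbs the distant one.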
Key question: How to define cluster similarity
Single-link: maximum similarity, i.e. the maximum similarity of any document pair (one document from each cluster).
Complete-link: minimum similarity, i.e. the minimum similarity of any document pair (one document from each cluster).
Centroid: average "intersimilarity", i.e. the average similarity of all document pairs, excluding pairs of docs in the same cluster. This is equivalent to the similarity of the centroids.
Group-average: average "intrasimilarity", i.e. the average similarity of all document pairs, including pairs of docs in the same cluster.
9 / 61
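The four measures can be written out directly. This sketch uses a toy pairwise similarity (negative 1-D distance), purely for illustration; note that the centroid equivalence mentioned above holds for dot-product similarities, not for this toy stand-in.

```python
from itertools import product

def sim(a, b):
    return -abs(a - b)          # toy pairwise document similarity

def single_link(A, B):
    # maximum similarity of any inter-cluster document pair
    return max(sim(a, b) for a, b in product(A, B))

def complete_link(A, B):
    # minimum similarity of any inter-cluster document pair
    return min(sim(a, b) for a, b in product(A, B))

def avg_intersimilarity(A, B):
    # average similarity over inter-cluster pairs only (centroid measure)
    pairs = list(product(A, B))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def group_average(A, B):
    # average over all pairs in the merged cluster, including
    # pairs drawn from the same original cluster
    M = A + B
    pairs = [(M[i], M[j]) for i in range(len(M))
             for j in range(len(M)) if i != j]
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

A, B = [0.0, 1.0], [3.0]
```

For these two clusters the four measures already disagree, which is why the choice of measure changes the clustering.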
Single-link: Maximum similarity
[Figure: two clusters of points; the most similar inter-cluster pair determines the cluster similarity]
10 / 61
Complete-link: Minimum similarity
[Figure: two clusters of points; the least similar inter-cluster pair determines the cluster similarity]
11 / 61
Centroid: Average intersimilarity
[Figure: two clusters of points; cluster similarity is the average over all inter-cluster pairs]
12 / 61
Group average: Average intrasimilarity
[Figure: two clusters of points; cluster similarity is the average over all pairs, including intra-cluster pairs]
13 / 61
Outline
1 Recap
2 Single-link/Complete-link
3 Centroid/GAAC
4 Variants
5 Labeling clusters
6 Text classification
14 / 61
Single link HAC
The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? This is simple for single link:
sim(ω_i, ω_{k1} ∪ ω_{k2}) = max(sim(ω_i, ω_{k1}), sim(ω_i, ω_{k2}))
15 / 61
This dendrogram was produced by single-link
[Figure: single-link dendrogram of the same news-story titles]
Notice: many small clusters (1 or 2 members) being added to the main cluster. There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.
16 / 61
Complete link HAC
The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster. Once we have merged two clusters, how do we update the similarity matrix? Again, this is simple:
sim(ω_i, ω_{k1} ∪ ω_{k2}) = min(sim(ω_i, ω_{k1}), sim(ω_i, ω_{k2}))
We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.
17 / 61
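Both update rules (max for single link, min for complete link) can be checked against a brute-force recomputation over all document pairs. A small sketch with hypothetical cluster contents and negative 1-D distance as the toy similarity:

```python
from itertools import product

def sim(a, b):
    return -abs(a - b)          # toy pairwise document similarity

def cluster_sim(A, B, combine):
    # combine = max gives single link, combine = min gives complete link
    return combine(sim(a, b) for a, b in product(A, B))

w_i, w_k1, w_k2 = [0.0, 0.5], [2.0], [3.0, 3.5]

# brute force over the merged cluster vs. the cheap two-value update
merged_single = cluster_sim(w_i, w_k1 + w_k2, max)
updated_single = max(cluster_sim(w_i, w_k1, max),
                     cluster_sim(w_i, w_k2, max))
merged_complete = cluster_sim(w_i, w_k1 + w_k2, min)
updated_complete = min(cluster_sim(w_i, w_k1, min),
                       cluster_sim(w_i, w_k2, min))
```

The update never has to revisit individual documents, which is what makes the similarity-matrix bookkeeping in the HAC algorithm cheap.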
Complete-link dendrogram
[Figure: complete-link dendrogram of the same news-story titles]
Notice that this dendrogram is much more balanced than the single-link one. We can create a 2-cluster clustering with two clusters of about the same size.
18 / 61
Exercise: Compute single and complete link clusterings
[Figure: eight points d1–d8, arranged in two rows of four]
19 / 61
Single-link clustering
[Figure: single-link clustering of the eight points]
20 / 61
Complete link clustering
[Figure: complete-link clustering of the eight points]
21 / 61
Single-link vs. Complete link clustering
[Figure: the two clusterings shown side by side]
22 / 61
Single-link: Chaining
[Figure: a chain of closely spaced points merged into one long, elongated cluster]
Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable.
23 / 61
What 2-cluster clustering will complete-link produce?
[Figure: five points d1–d5 on a line]
Coordinates: 1 + 2ε, 4, 5 + 2ε, 6, 7 − ε.
24 / 61
Complete-link: Sensitivity to outliers
[Figure: the same five points d1–d5]
The complete-link clustering of this set splits d2 from its right neighbors – clearly undesirable. The reason is the outlier d1. This shows that a single outlier can negatively affect the outcome of complete-link clustering. Single-link clustering does better in this case.
25 / 61
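This outlier effect can be reproduced with a minimal distance-based HAC sketch. The helper below is hypothetical (not from the slides), and ε = 0.1 is an assumed concrete value for the coordinates 1 + 2ε, 4, 5 + 2ε, 6, 7 − ε.

```python
# Minimal agglomerative clustering over 1-D points, stopping at k
# clusters. linkage = max gives complete link (maximum pairwise
# distance); linkage = min gives single link (minimum pairwise distance).

def hac_flat(points, linkage, k=2):
    clusters = [[i] for i in range(len(points))]   # clusters of indices
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(points[a] - points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)             # merge the closest pair
    return sorted(sorted(c) for c in clusters)

eps = 0.1                                          # assumed value of epsilon
pts = [1 + 2 * eps, 4, 5 + 2 * eps, 6, 7 - eps]    # d1 .. d5

complete = hac_flat(pts, max)   # complete link splits d2 from d3, d4, d5
single = hac_flat(pts, min)     # single link isolates only the outlier d1
```

Complete link groups d2 with the outlier d1 because merging d2 into the right-hand cluster would create a large diameter, while single link happily grows the right-hand cluster point by point.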
Outline
1 Recap
2 Single-link/Complete-link
3 Centroid/GAAC
4 Variants
5 Labeling clusters
6 Text classification
26 / 61