Clustering with Semantic Links
Yin X., Han J. and Yu P. S., 2006, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, Int. Conf. on Very Large Data Bases

Outline
• Linkage-based Clustering
  – Motivation
  – Definitions
  – Applications
• Algorithms
  – SimRank
  – ReCoM
  – LinkClus
• Comparison
• Conclusions

http://www.cs.umd.edu/hcil/InfovisRepository/contest-2004/17/unzip/entry.html

Why Linkage-Based Clustering
• Links contain semantic information
  – We can extract complex relationships based on the topology of individual links
  – Links give additional information about inter- and intra-relationships among objects in a database, beyond the attributes of specific objects
  – Objects of different types can be clustered based on their linkages to other similar and different objects
    • Multi-typed links

More Reasons…
  – Attributes of individual objects may be unavailable
  – Clustering objects based on attributes may offer no insight into the data, or insight of no interest at the time
Definitions – 1/2
• Linkage-based clustering
  – Same general definition: clustering is a process of partitioning a set of objects into a set of meaningful sub-classes.
• Linkage-based cluster
  – Also the same: a collection of data objects that are “similar.”
  – The difference is in measuring “similarity”

Definitions – 2/2
• Link
  – A connection between two objects
• Multi-type link
  – A connection between two objects of differing type
  – Both are essentially pointers.
• Similarity
  – The similarity between two objects is the average similarity between objects linked with them.
  – Two objects are said to be similar if they are linked with similar objects
  – Can cluster in various ways once similarity is calculated (hierarchical, k-means, k-medoids, etc.)

Applications
• Recommender systems
  – Collaborative filtering: similar users and items are grouped based on user preferences
• Web queries
• Bioinformatics
  – Automatic recognition of protein families for phylogenomic analysis
• Social networks
• Other cross-referenced databases

SimRank – 1/3
• Glen Jeh and Jennifer Widom, “SimRank: A measure of structural-context similarity,” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002
• Measures similarity of the structural context in which objects occur, based on their relationships with other objects.
• Based on the definition that two objects are similar if they are related to similar objects
  – Fuzzy version of: “if a=b and b=c, then a=c”
SimRank – 2/3
• Several versions
  – Naïve: computes all pair-wise similarities, $O(Kn^2 d)$
    • $K$ is the number of iterations, $d$ the average number of in-neighbor pairs between any two nodes.
  – Pruning: searches only a limited distance along chains of links, $O(Kn d_r d)$
    • $d_r$ is equivalent to a search radius measured in links
  – Fingerprinting (Fogaras and Racz, 2005)
    • Uses pre-computed random walks through linkages to approximate similarity

SimRank – 3/3
• Convert web graphs to node-pair (directed) graphs for analysis
  – Nodes → objects
  – Edges → relationships
• Similarity for objects $a$ and $b$:

\[
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\[4pt]
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big) & \text{if } a \neq b \\[4pt]
0 & \text{if } a \neq b \text{ and } I(a) = \emptyset \text{ or } I(b) = \emptyset
\end{cases}
\qquad C \in (0,1]
\]

  – $I(a)$: the items linked to $a$ (its in-neighbors); $|I(a)|$ is their number
[Figure: simple webpage hierarchy]

ReCoM – 1/3
• J.D. Wang, H.J. Zeng, Z. Chen, H.J. Lu, L. Tao, W.Y. Ma. “ReCoM: Reinforcement clustering of multi-type interrelated data objects,” SIGIR, 2003
• Similarity of objects is measured by their own attributes as well as their relationships
• Uses inter-relationships to reinforce the clustering process of related objects
  – An iterative process, repeated until the clusters reach a steady state
  – Reinforcement clustering

ReCoM – 2/3
• Reinforcement clustering
  – Propagate the clustering results of one type to all its related types by updating their relationship features.
  – Perform clustering on the updated features
  – Finished when the updated features have no effect on the current clustering results.
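The naïve SimRank iteration described on the SimRank slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `simrank`, the dictionary-based graph format, the default `C=0.8`, and the toy graph are all assumptions made for the example.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive SimRank sketch.

    in_neighbors maps every node to the list of nodes linking to it
    (hypothetical input format chosen for this example).
    """
    nodes = list(in_neighbors)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # No in-neighbors on either side: similarity is 0.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all in-neighbor pairs, scaled by C.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy graph: pages "a" and "b" are both linked from "hub".
graph = {"hub": [], "a": ["hub"], "b": ["hub"]}
scores = simrank(graph, C=0.8, iterations=3)
```

Each iteration visits every node pair and every in-neighbor pair, which is where the quadratic cost of the naïve version comes from.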
ReCoM – 3/3
• Also looked at assigning “importance” to objects of the same type
  – e.g. more authoritative web pages, or authors having more publications
  – This feature was not incorporated in the comparison with LinkClus

LinkClus
• Hierarchical linkage-based clustering technique
  – Other clustering methods can be used on the similarity measures (CLARANS)
  – Similarities are multi-granular: detailed between closely related objects and overall between groups of objects.
• Proposes a hierarchical data structure: SimTree
  – Based on:
    • Hierarchical structures naturally exist for many object types (animal/plant taxonomy, merchandise categories, research communities, etc.)
    • Linkages through these hierarchies tend to follow a power-law distribution (internet topology, human respiratory system, social networks, automobile networks)

Power-law Distribution Among Linkages
• Power-law: $y \propto x^{a}$
  – $x, y$: variables of interest (similarity and proportion of objects)
  – $a$: scaling exponent
• Metric (similarity) measures connectivity of nodes
• SimTrees take advantage of the power-law assumption
[Figure: SimRank similarities for DBLP authors]

SimTree
• Main structure behind LinkClus
• Designed around the power-law assumption
  – Stores significant similarities and compresses insignificant ones
    • There is a high proportion of insignificant similarities
  – Reduces the number of pair-wise similarities to be evaluated
    • Insignificant similarities are aggregated
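As a short worked step for the power-law relation on the Power-law Distribution slide: assuming a proportionality constant $c$ (a symbol the slide does not name), taking logarithms turns the power law into a straight line, so the scaling exponent $a$ can be read off as the slope on a log-log plot.

```latex
y \propto x^{a}
\;\Longrightarrow\;
y = c\,x^{a}
\;\Longrightarrow\;
\log y = a \log x + \log c
```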
SimTree Architecture
• Leaf-nodes – Contain individual objects
• Parent-nodes – Contain groups of leaf-nodes or parent-nodes at level $k-1$
• Construction – bottom-up
• Similarities:
  – Between sibling nodes
  – Between leaves and parents (adjustment factors)
  – Path-based:

\[
s(n_7, n_8) = s(n_7, n_4)\, s(n_4, n_5)\, s(n_5, n_8)
\]
\[
s(n_1, n_k) = \prod_{i=1}^{k-1} s(n_i, n_{i+1})
\]

SimTree Construction
1. Initialization
   a. Leaf-node similarity = 1 to itself, 0 to others
   b. Find tight sets of leaf-nodes and merge them into parent-nodes (frequent pattern mining)
2. Iterative similarity updating
   a. SimTree restructuring using other SimTrees

Initialization
• Start with 2 or more data types linked together
  – Each data type has its own SimTree
  – Each object forms a leaf-node
  – Convert to transactions and use frequent pattern mining to choose parent-nodes
[Figure: SimTree 1 leaf-nodes linked to SimTree 2 leaf-nodes]

Grouping Leaf-nodes
• Tight group – set of nodes co-linked with many objects of other types
• Frequent pattern – a set of items that co-appear in many transactions
• Tightness of a group is equal to the item-set support
• Choose the tightest, non-overlapping groups as parent-nodes
• Constraints on node generation
  – Max children/parent: $c \in [10, 20]$
  – Parent-nodes: $\dfrac{N_l}{c} \le N_p \le \dfrac{\alpha N_l}{c}$, $1 < \alpha \le 2$
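The path-based similarity and the parent-node count constraint from the slides above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the dictionary of edge similarities, the similarity values, and the reading of $N_l$ as the number of nodes at the level being grouped are all hypothetical choices for the example.

```python
import math


def path_similarity(path, edge_sim):
    """Multiply s(n_i, n_{i+1}) over consecutive nodes of `path`."""
    result = 1.0
    for u, v in zip(path, path[1:]):
        result *= edge_sim[(u, v)]
    return result


def parent_count_bounds(num_nodes, c, alpha):
    """Integer bounds satisfying N_l/c <= N_p <= alpha * N_l / c."""
    lower = math.ceil(num_nodes / c)
    upper = math.floor(alpha * num_nodes / c)
    return lower, upper


# Hypothetical similarity values along the path n7 -> n4 -> n5 -> n8.
edge_sim = {(7, 4): 0.5, (4, 5): 0.5, (5, 8): 0.5}
s_7_8 = path_similarity([7, 4, 5, 8], edge_sim)

# With 100 nodes to group, c = 10 and alpha = 2 (assumed values).
bounds = parent_count_bounds(num_nodes=100, c=10, alpha=2)
```

Because only sibling similarities and leaf-to-parent factors are stored, a similarity between two distant leaves is recovered by walking the tree path rather than being stored explicitly.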
Grouping Leaf-nodes Example
[Figure: example of grouping leaf-nodes]

Calculating Similarity – 1/2
• Calculate the similarity between sibling nodes in the SimTrees for each object
• Similarity is calculated as the average similarity between the objects linked with them

\[
s(a, b) = s\big(\{10, 11, 12, 16\}, \{10, 13, 14, 17\}\big)
\]

Calculating Similarity – 2/2
• Average similarity is the aggregated path-based similarity
• Consider $s(\{10,11,12\},\{13,14\})$:

\[
s\big(\{10,11,12\},\{13,14\}\big)
= \underbrace{\frac{1}{3}\sum_{i=10}^{12} s(i,4)}_{\text{aggregated adjustment factor}}
\times \underbrace{s(4,5)}_{\text{sibling similarity}}
\times \underbrace{\frac{1}{2}\sum_{j=13}^{14} s(5,j)}_{\text{aggregated adjustment factor}}
\]

• Simplifies time complexity from quadratic (as in SimRank) to linear

SimTree Restructuring
• Move leaf/child-nodes to the most similar parent-sibling after updating similarities
  – e.g. $s(10,4) < s(10,5)$, $s(13,5) < s(13,4)$
• Do not exceed node constraints
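The aggregation used for $s(\{10,11,12\},\{13,14\})$ can be checked numerically. The sketch below compares the aggregated form (one pass over each group) with the pair-by-pair average; the function names and all similarity values are hypothetical, and leaf-to-parent similarities are assumed symmetric, so $s(5,j)$ is looked up as `leaf_parent_sim[(j, 5)]`.

```python
def aggregated_similarity(group_a, parent_a, group_b, parent_b,
                          leaf_parent_sim, sibling_sim):
    """Linear-time form: average each side once, then multiply."""
    avg_a = sum(leaf_parent_sim[(i, parent_a)] for i in group_a) / len(group_a)
    avg_b = sum(leaf_parent_sim[(j, parent_b)] for j in group_b) / len(group_b)
    return avg_a * sibling_sim[(parent_a, parent_b)] * avg_b


def pairwise_average(group_a, parent_a, group_b, parent_b,
                     leaf_parent_sim, sibling_sim):
    """Quadratic form: average the path-based similarity over all pairs."""
    total = 0.0
    for i in group_a:
        for j in group_b:
            total += (leaf_parent_sim[(i, parent_a)]
                      * sibling_sim[(parent_a, parent_b)]
                      * leaf_parent_sim[(j, parent_b)])
    return total / (len(group_a) * len(group_b))


# Hypothetical values: leaves 10-12 sit under node 4, leaves 13-14 under node 5.
leaf_parent_sim = {(10, 4): 0.9, (11, 4): 0.6, (12, 4): 0.3,
                   (13, 5): 0.8, (14, 5): 0.4}
sibling_sim = {(4, 5): 0.5}

fast = aggregated_similarity([10, 11, 12], 4, [13, 14], 5,
                             leaf_parent_sim, sibling_sim)
slow = pairwise_average([10, 11, 12], 4, [13, 14], 5,
                        leaf_parent_sim, sibling_sim)
```

Both forms give the same value because the shared factor $s(4,5)$ can be pulled out of the double sum, which then splits into a product of two single sums.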
Comparison
• Tested LinkClus against the algorithms discussed
  – SimRank
    • Naïve
    • Pruning (P-SimRank)
    • Fingerprinting (F-SimRank)
  – ReCoM
• Measure clustering accuracy with a modified Jaccard coefficient
  – Assumes two objects are correctly clustered if they share at least one common class label

Databases
• DBLP (dblp.uni-trier.de)
  – Database and logic programming bibliographies
  – Clustered 4170 authors, 2517 proceedings, 154 conferences and 2518 keywords
• Email Dataset
  – 370 emails on conferences, 272 on jobs, 789 spam emails (kept 371), and the 2500 most frequent words
  – Clustered two types of objects: emails and words, with ~141,000 links between them.
• Other Synthetic Sets

DBLP
• Goal was to correctly label an author’s research area and a conference’s subject area
  – Manually labeled 400 authors and all conferences for accuracy calculations
• Grouped all object types into 20 clusters

DBLP Accuracy: no keywords
[Figure: accuracy vs. iteration (1 to 19), two panels (Authors, Conferences), curves for LinkClus, SimRank, ReCoM and F-SimRank]

                      Authors   Conferences   Time/Iteration
LinkClus              0.957     0.723         76.7
LinkClus (CLARANS)    0.953     0.752         107.7
SimRank               0.958     0.760         1020
ReCoM                 0.907     0.457         43.1
F-SimRank             0.908     0.583         83.6
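A pair-counting Jaccard coefficient in the spirit of the accuracy measure on the Comparison slide can be sketched as below. The slides do not give the exact formula, so this is one plausible reading: a pair counts as "truly together" when the two objects share at least one class label, and the score is the overlap between those pairs and the pairs placed in the same cluster. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_jaccard(clusters, labels):
    """Jaccard overlap between same-cluster pairs and label-sharing pairs.

    clusters: object -> cluster id
    labels:   object -> set of class labels (an object may have several)
    """
    same_cluster = set()
    same_label = set()
    for a, b in combinations(sorted(clusters), 2):
        if clusters[a] == clusters[b]:
            same_cluster.add((a, b))
        if labels[a] & labels[b]:  # at least one common class label
            same_label.add((a, b))
    union = same_cluster | same_label
    if not union:
        return 1.0  # no pairs on either side: treat as perfect agreement
    return len(same_cluster & same_label) / len(union)


# Hypothetical toy data: four authors, two clusters, hand-assigned areas.
clusters = {"x": 0, "y": 0, "z": 1, "w": 1}
labels = {"x": {"db"}, "y": {"db", "ml"}, "z": {"ml"}, "w": {"ir"}}
score = pairwise_jaccard(clusters, labels)
```

Here only the pair (x, y) is both clustered together and label-sharing, while (w, z) is clustered together without a common label and (y, z) shares a label across clusters, so one of three pairs agrees.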