Clustering

Community Detection
[Figure: community detection in social media; image: http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811]
Jian Pei: CMPT 741/459 Clustering (1)

Customer Relation Management
• Partitioning customers into groups such that customers within a group are similar in some aspects
• A manager can be assigned to a group
• Customized products and services can be developed

What Is Clustering?
• Group data into clusters
  – Objects in the same cluster are similar to one another
  – Objects in different clusters are dissimilar
  – Unsupervised learning: no predefined classes
[Figure: Cluster 1, Cluster 2, and outliers]

Requirements of Clustering
• Scalability
• Ability to deal with various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters

Data Matrix
• For memory-based clustering
  – Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures)
  – A relational table

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity Matrix
• For memory-based clustering
  – Also called object-by-object structure
  – Proximities of pairs of objects
  – d(i, j): dissimilarity between objects i and j
  – Nonnegative
  – Close to 0: similar

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
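As a rough illustration of the object-by-object structure, the sketch below builds the lower-triangular matrix for a few objects. The names `dissimilarity_matrix`, `manhattan`, and `objects` are made up for this example, and Manhattan distance is only one possible choice of d(i, j).

```python
def manhattan(a, b):
    """Manhattan distance, used here only as an example d(i, j)."""
    return sum(abs(u - v) for u, v in zip(a, b))


def dissimilarity_matrix(data, dist):
    """Return the lower triangle of the dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i - 1) followed by the diagonal entry 0.
    """
    rows = []
    for i, xi in enumerate(data):
        row = [dist(xi, data[j]) for j in range(i)]
        row.append(0.0)  # d(i, i) = 0
        rows.append(row)
    return rows


# Hypothetical data: three objects described by two variables.
objects = [(1.0, 2.0), (4.0, 6.0), (1.0, 3.0)]
D = dissimilarity_matrix(objects, manhattan)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i).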

How Good Is Clustering?
• Dissimilarity/similarity depends on the distance function
  – Different applications have different functions
• Judgment of clustering quality is typically highly subjective

Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types

Interval-valued Variables
• Continuous measurements of a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units in attributes
  – Smaller unit → larger variable range → larger effect on the result
  – Standardization + background knowledge

Standardization
• Calculate the mean absolute deviation

\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\qquad
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
\]

• Calculate the standardized measurement (z-score)

\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]

• Mean absolute deviation is more robust than the standard deviation
  – The effect of outliers is reduced but remains detectable
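The two formulas above can be sketched in a few lines. The function name `standardize` and the sample values are illustrative only; the code handles a single variable f at a time.

```python
def standardize(values):
    """Return z-scores of one variable using the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n                        # m_f: mean of the variable
    s = sum(abs(x - m) for x in values) / n    # s_f: mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m) / s for x in values]


# Hypothetical measurements with one outlier (10.0).
heights = [1.0, 2.0, 3.0, 4.0, 10.0]
z = standardize(heights)
print(z)
```

Because the deviations are not squared, the outlier inflates s_f less than it would inflate the standard deviation, so its z-score stays clearly separated from the others.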

Similarity and Dissimilarity
• Distances are the commonly used measures
• Minkowski distance: a generalization

\[
d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \quad (q > 0)
\]

• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• As q → ∞, d is Chebyshev distance
• Weighted distance

\[
d(i,j) = \sqrt[q]{w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q} \quad (q > 0)
\]
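A minimal sketch of the (weighted) Minkowski distance follows. The function name `minkowski` and its `weights` parameter are assumptions made for this example; the Chebyshev case is handled as the limit q → ∞, where positive weights no longer affect the result.

```python
import math


def minkowski(a, b, q, weights=None):
    """Weighted Minkowski distance between two equal-length vectors (q > 0)."""
    if q <= 0:
        raise ValueError("q must be positive")
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(q):
        # Limit q -> infinity: the largest coordinate difference (Chebyshev).
        return max(diffs)
    if weights is None:
        weights = [1.0] * len(diffs)
    return sum(w * d ** q for w, d in zip(weights, diffs)) ** (1.0 / q)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan_d = minkowski(a, b, 1)          # |3| + |4|
euclidean_d = minkowski(a, b, 2)          # sqrt(9 + 16)
chebyshev_d = minkowski(a, b, math.inf)   # max(3, 4)
print(manhattan_d, euclidean_d, chebyshev_d)
```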

Manhattan and Chebyshev Distance
[Figure: two panels, Manhattan distance and Chebyshev distance. When n = 2, Chebyshev distance is the chess distance. Picture from Wikipedia; http://brainking.com/images/rules/chess/02.gif]

Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
  – d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangular inequality (holds for q ≥ 1)
  – d(i,j) ≤ d(i,k) + d(k,j)
[Figure: triangle with vertices i, j, k]

Binary Variables
• A contingency table for binary data

                    Object j
                    1      0      Sum
    Object i   1    q      r      q+r
               0    s      t      s+t
               Sum  q+s    r+t    p

• Symmetric variable: each state carries the same weight

\[
d(i,j) = \frac{r + s}{q + r + s + t}
\]

  – Invariant similarity
• Asymmetric variable: the positive value carries more weight

\[
d(i,j) = \frac{r + s}{q + r + s}
\]

  – Noninvariant similarity (Jaccard)
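The contingency counts and both dissimilarities can be computed directly from two 0/1 vectors. This is a sketch; the function names and the sample vectors are invented here, and returning 0.0 when q + r + s = 0 is an assumption of this example, since the formula is undefined in that case.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors a (object i) and b (object j)."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)
    t = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    q, r, s, _ = binary_counts(a, b)  # 0/0 matches (t) are ignored
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 0, 1, 0]
print(binary_counts(obj_i, obj_j))
print(symmetric_dissimilarity(obj_i, obj_j))
print(asymmetric_dissimilarity(obj_i, obj_j))
```

The asymmetric measure is larger here because the three shared zeros do not count as agreement.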

Nominal Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching

\[
d(i,j) = \frac{p - m}{p}
\]

  – m: # of matches, p: total # of variables
• Method 2: use a large number of binary variables
  – Creating a new binary variable for each of the M nominal states
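Both methods are short enough to sketch. The names `simple_matching` and `one_hot`, and the example attribute values, are hypothetical.

```python
def simple_matching(a, b):
    """Method 1: d(i, j) = (p - m) / p for two tuples of nominal values."""
    p = len(a)
    m = sum(1 for u, v in zip(a, b) if u == v)  # number of matching variables
    return (p - m) / p


def one_hot(value, states):
    """Method 2: encode one nominal value as M binary variables, one per state."""
    return [1 if value == s else 0 for s in states]


item_i = ("red", "small", "round")
item_j = ("red", "large", "round")
d = simple_matching(item_i, item_j)
encoded = one_hot("blue", ["red", "yellow", "blue", "green"])
print(d, encoded)
```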

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
  – Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]

  – Compute the dissimilarity using methods for interval-scaled variables
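The rank mapping can be sketched as follows. The function name and the medal example are made up; the list `ordered_states` is assumed to give the states of one variable from lowest to highest.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the 1-based rank."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two states to normalize")
    rank = {state: k + 1 for k, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (M - 1) for v in values]


medals = ["gold", "bronze", "silver"]
z_medals = ordinal_to_unit_interval(medals, ["bronze", "silver", "gold"])
print(z_medals)
```

The resulting values can then be passed to any of the interval-scaled distance functions.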

Ratio-scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale
  – E.g., approximately at exponential scale, such as $Ae^{Bt}$
• Treat them like interval-scaled variables?
  – Not a good choice: the scale can be distorted!
• Apply logarithmic transformation, $y_{if} = \log(x_{if})$
• Treat them as continuous ordinal data, treat their rank as interval-scaled
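A small sketch of the logarithmic transformation: values that grow like $Ae^{Bt}$ become evenly spaced after taking logs. The constants `A` and `B` and the function name `log_transform` are chosen arbitrarily for illustration.

```python
import math


def log_transform(values):
    """Apply y = log(x) to strictly positive ratio-scaled measurements."""
    if any(x <= 0 for x in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in values]


A, B = 2.0, 0.5  # hypothetical constants of an exponential scale
raw = [A * math.exp(B * t) for t in range(4)]
y = log_transform(raw)
steps = [y[k + 1] - y[k] for k in range(len(y) - 1)]
print(raw)
print(steps)  # each step is approximately B
```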

Variables of Mixed Types
• A database may contain all six types of variables
  – Symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
• One may use a weighted formula to combine their effects

\[
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
\]

  – $d_{ij}^{(f)}$: dissimilarity of i and j on variable f; $\delta_{ij}^{(f)}$: weight indicating whether variable f contributes to the comparison
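The weighted formula can be sketched under some stated assumptions, none of which come from the slide itself: $\delta_{ij}^{(f)}$ is taken as 0 when either value is missing (`None`) and 1 otherwise; interval variables contribute $|x_{if} - x_{jf}|$ divided by a supplied range; nominal variables contribute 0 on a match and 1 otherwise. All names are illustrative.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Combine per-variable dissimilarities with 0/1 weights (delta)."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if a[f] is None or b[f] is None:
            continue  # assumption: delta = 0 when a value is missing
        if kind == "interval":
            d_f = abs(a[f] - b[f]) / ranges[f]  # scaled to [0, 1] by the range
        elif kind == "nominal":
            d_f = 0.0 if a[f] == b[f] else 1.0
        else:
            raise ValueError("unsupported variable kind: " + kind)
        numerator += d_f      # delta = 1 for this variable
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator


kinds = ("interval", "nominal", "interval")
ranges = (40.0, None, 10.0)          # hypothetical value ranges per variable
person_i = (170.0, "red", None)      # third value missing
person_j = (180.0, "blue", 3.0)
d_mixed = mixed_dissimilarity(person_i, person_j, kinds, ranges)
print(d_mixed)
```

Other variable types (binary, ordinal, ratio) would plug in the same way after the per-type treatment described on the earlier slides.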

Clustering Methods
• K-means and partitioning methods
• Hierarchical clustering
• Density-based clustering
• Grid-based clustering
• Pattern-based clustering
• Other clustering methods

Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – $(k^n - (k-1)^n - \cdots - 1)$ possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: a cluster is represented by the center
  – K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster

K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
  – Update the cluster means, i.e., calculate the mean value of the objects for each cluster
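The loop above can be sketched in plain Python. This is one possible reading of the slide, not a reference implementation: squared Euclidean distance is assumed as the similarity measure, the `seed` and `max_iter` parameters are additions for reproducibility and safety, and an empty cluster simply keeps its previous center.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after running the assign/update loop."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose k objects as centers
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the nearest current center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change: stop
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = mean(members)
    return centers, assignment


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

On less well-separated data, different seeds can lead to different local optima, which is why the initial choice of centers matters.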

K-Means: Example
[Figure: K-means with K = 2 on scatter plots with both axes from 0 to 10. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; k, t << n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters

Variations of the K-means
• Aspects of variations
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use mode instead of mean
    • Mode: the most frequent item(s)
  – A mixture of categorical and numerical data: k-prototype method
• EM (expectation maximization): assign a probability of an object to a cluster (will be discussed later)

A Problem of K-means
• Sensitive to outliers
  – Outlier: objects with extremely large values
    • May substantially distort the distribution of the data
• K-medoids: the most centrally located object in a cluster
[Figure: two scatter plots, both axes from 0 to 10]
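The effect of an outlier on the mean versus the medoid can be seen on a tiny example. This is a sketch: the function names and data are invented, and Manhattan distance is an arbitrary choice for measuring centrality.

```python
def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points, dist):
    """Return the object with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


# Four nearby objects plus one outlier at (100, 100).
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (100.0, 100.0)]
center_mean = mean_point(cluster)
center_medoid = medoid(cluster, manhattan)
print(center_mean)
print(center_medoid)
```

The mean is dragged toward the outlier and lands far from every actual object, while the medoid stays inside the dense group because it must be one of the objects.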
