
Data Mining: Concepts and Techniques
Cluster Analysis
Li Xiong
Slide credits: Jiawei Han and Micheline Kamber; Tan, Steinbach, Kumar
Cluster Analysis: Basic Concepts and Methods


  1. Partitioning Algorithms: Basic Concept  Partitioning method: construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized.  One objective: minimize the sum of squared distances from each object to its cluster centroid, $\sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$.  How to find the optimal partition?

  2. Number of partitionings

  3. Number of partitionings  Stirling partition number: the number of ways to partition n objects into k non-empty subsets.  (n = 5, k = 1, 2, 3, 4, 5): 1, 15, 25, 10, 1.  (n = 10, k = 1, 2, 3, 4, 5, ...): 1, 511, 9330, 34105, 42525, ...  Bell numbers: the number of ways to partition n objects into any number of subsets.  (n = 0, 1, 2, 3, 4, 5, ...): 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, 27644437, 190899322, 1382958545, 10480142147, 82864869804, 682076806159, 5832742205057, ...
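These counts grow so quickly that exhaustive search over all partitions is hopeless. A small sketch (not from the slides) that reproduces them from the standard recurrences S(n, k) = k·S(n−1, k) + S(n−1, k−1) and B(n) = Σ_k S(n, k):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: ways to partition n objects into k non-empty subsets."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n: int) -> int:
    """Bell number: total number of ways to partition n objects."""
    return sum(stirling2(n, k) for k in range(n + 1))

print([stirling2(10, k) for k in range(1, 6)])  # [1, 511, 9330, 34105, 42525]
print([bell(n) for n in range(6)])              # [1, 1, 2, 5, 15, 52]
```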

  4. Partitioning Algorithms: Basic Concept  Partitioning method: construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized.  One objective: minimize the sum of squared distances from each object to its cluster centroid, $\sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$.  Heuristic methods: the k-means and k-medoids algorithms.  k-means (Lloyd'57, MacQueen'67): each cluster is represented by the center of the cluster.  k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster.

  5. K-Means Clustering: Lloyd Algorithm  1. Given k, randomly choose k initial cluster centers.  2. Partition the objects into k non-empty subsets by assigning each object to the cluster with the nearest centroid.  3. Update each centroid, i.e., the mean point of its cluster.  4. Go back to Step 2; stop when there are no new assignments and the centroids do not change.
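A minimal sketch of these four steps with NumPy (not the slides' own code; the names kmeans, max_iter, and seed are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k initial cluster centers from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```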

  6. The K-Means Clustering Method: Example  (Figure, K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again; repeat until the assignments stabilize.)

  7. K-means Clustering – Details  Initial centroids are often chosen randomly.  Example heuristic: pick one point at random, then k−1 other points, each as far away as possible from the previously chosen points.  The centroid is (typically) the mean of the points in the cluster.  'Nearest' is measured by Euclidean distance, cosine similarity, correlation, etc.  Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to 'until relatively few points change clusters'.  Complexity is O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations.
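The "as far away as possible" initialization mentioned above could look like the following sketch (farthest-first seeding; the name farthest_first_init is illustrative):

```python
import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick one point at random, then k-1 more, each as far as possible
    from the centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance from every point to its nearest already-chosen center
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        centers.append(X[d.argmax()])       # take the point farthest from all chosen centers
    return np.array(centers)
```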

  8. Comments on the K-Means Method  Strength:  Simple and works well for "regular", disjoint clusters.  Relatively efficient and scalable (normally k, t << n).  Weakness:  Need to specify k, the number of clusters, in advance.  Depending on the initial centroids, may terminate at a local optimum.  Sensitive to noisy data and outliers.  Not suitable for clusters of different sizes or non-convex shapes.

  9. Getting the k right  How to select k?  Try different values of k, looking at the change in the average distance to centroid (or SSE) as k increases.  The average falls rapidly until the right k, then changes little.  (Figure: average distance to centroid vs. k; the best value of k is where the curve flattens.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
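A hedged sketch of this procedure, reusing the kmeans function sketched earlier on an illustrative three-blob data set:

```python
import numpy as np

def avg_dist_to_centroid(X, labels, centers):
    # average distance of each object to the centroid it was assigned to
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
curve = {}
for k in range(1, 9):
    labels, centers = kmeans(X, k)          # the Lloyd sketch from earlier
    curve[k] = avg_dist_to_centroid(X, labels, centers)
# The values should fall sharply up to k = 3 (the number of blobs) and change little after.
```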

  10. Example: Picking k  (Figure: k too small; many long distances to centroid.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  11. Example: Picking k  (Figure: k just right; distances to centroid are rather short.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  12. Example: Picking k  (Figure: k too large; little improvement in the average distance.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  13. Importance of Choosing Initial Centroids – Case 1  (Figure: iterations 1–6 of k-means from one choice of initial centroids.)

  14. Importance of Choosing Initial Centroids – Case 2  (Figure: iterations 1–5 of k-means from a different choice of initial centroids.)

  15. Limitations of K-means: Differing Sizes  (Figure: original points vs. the 3 clusters found by K-means.)

  16. Limitations of K-means: Non-convex Shapes  (Figure: original points vs. the 2 clusters found by K-means.)

  17. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  18. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  19. Assignment 2  Implement k-means clustering.  Evaluate the results.

  20. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  21. Cluster Evaluation  Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists.  Determine the correct number of clusters.  Evaluate the cohesion and separation of the clustering without external information.  Evaluate how well the clustering results compare to externally known results.  Compare different clustering algorithms/results.

  22. Measures  Unsupervised (internal): used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE).  Supervised (external): used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy.  Relative: used to compare two different clustering results; often an external or internal index is used for this purpose, e.g., SSE or entropy.

  23. Internal Measures: Cohesion and Separation  Cluster Cohesion: how closely related the objects in a cluster are.  Cluster Separation: how distinct or well-separated a cluster is from the other clusters.  Example: squared error.  Cohesion: within-cluster sum of squares (SSE), $\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$.  Separation: between-cluster sum of squares over pairs of cluster means, $\mathrm{BSS} = \sum_{i < j} (m_i - m_j)^2$.
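An illustrative computation of both quantities under these definitions (the pairwise BSS above; another common convention instead weights each cluster's squared distance to the overall mean by the cluster size):

```python
import numpy as np

def cohesion_separation(X, labels):
    groups = [X[labels == c] for c in np.unique(labels)]
    means = [g.mean(axis=0) for g in groups]
    # Cohesion: within-cluster sum of squares (WSS / SSE)
    wss = sum(np.sum((g - m) ** 2) for g, m in zip(groups, means))
    # Separation: between-cluster sum of squares over pairs of cluster means
    bss = sum(np.sum((means[i] - means[j]) ** 2)
              for i in range(len(means)) for j in range(i + 1, len(means)))
    return wss, bss
```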

  24. Cluster Validity: Clusters found in Random Data  (Figure: a set of uniformly random points and the "clusters" found in it by DBSCAN, K-means, and complete link.)

  25. Internal Measures: Cluster Validity  Statistical framework for cluster validity: the more "atypical" a clustering result, the more likely it reflects valid structure in the data.  Use the values obtained on random data as a baseline.  Example: the clustering of the data gives SSE = 0.005, while the SSE of three clusters in 500 sets of random data points spans roughly 0.016 to 0.034, so the observed value is highly atypical.  (Figure: the data set and a histogram of SSE over the random sets.)

  26. Internal Measures: Number of Clusters  Good for comparing two clusterings.  Can also be used to estimate the number of clusters.  Elbow method: use the turning point in the curve of SSE with respect to the number of clusters.  (Figure: an example data set and its SSE curve as a function of K.)

  27. Internal Measures: Number of Clusters  Another example: a more complicated data set with a varying number of clusters.  (Figure: SSE of the clusters found using K-means.)

  28. External Measures  Compare clustering results with "ground truth" or a manual clustering.  Still different from classification measures.  Classification-oriented measures: entropy/purity based, precision and recall based.  Similarity-oriented measures: Jaccard scores.

  29. External Measures: Classification-Oriented Measures  Entropy-based measures: the degree to which each cluster consists of objects of a single class.  Purity: based on the majority class in each cluster.

  30. External Measures: Classification-Oriented Measures  BCubed precision and recall: measure the precision and recall associated with each individual object.  Precision of an object: the proportion of objects in the same cluster that belong to the same category.  Recall of an object: the proportion of objects of the same category that are assigned to the same cluster.  BCubed precision and recall are the average precision and recall over all objects.
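A small sketch of these per-object averages under the usual assumptions (hard clusters, one gold category per object); the function name bcubed is illustrative:

```python
def bcubed(cluster_labels, class_labels):
    n = len(cluster_labels)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if cluster_labels[j] == cluster_labels[i]]
        same_class = [j for j in range(n) if class_labels[j] == class_labels[i]]
        correct = sum(1 for j in same_cluster if class_labels[j] == class_labels[i])
        precision += correct / len(same_cluster)  # in i's cluster, fraction sharing i's category
        recall += correct / len(same_class)       # of i's category, fraction placed in i's cluster
    return precision / n, recall / n

# Example: three objects of class 'a' and one of 'b', clustered as {0,1} and {2,3}
print(bcubed([0, 0, 1, 1], ['a', 'a', 'a', 'b']))  # (0.75, 0.666...)
```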

  31. BCubed precision and recall

  32. External Measure: Similarity-Oriented Measures  Given a reference clustering T and a clustering S:  f00: number of pairs of points belonging to different clusters in both T and S.  f01: number of pairs of points belonging to different clusters in T but the same cluster in S.  f10: number of pairs of points belonging to the same cluster in T but different clusters in S.  f11: number of pairs of points belonging to the same cluster in both T and S.  $\mathrm{Rand} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}$, $\mathrm{Jaccard} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$.
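A direct sketch of the pair counting and the two indices (T and S are label sequences of equal length; the names follow the slide):

```python
from itertools import combinations

def pair_counts(T, S):
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(T)), 2):
        same_T, same_S = T[i] == T[j], S[i] == S[j]
        if same_T and same_S:
            f11 += 1          # same cluster in both T and S
        elif same_T:
            f10 += 1          # same in T, different in S
        elif same_S:
            f01 += 1          # different in T, same in S
        else:
            f00 += 1          # different in both
    return f00, f01, f10, f11

def rand_index(T, S):
    f00, f01, f10, f11 = pair_counts(T, S)
    return (f00 + f11) / (f00 + f01 + f10 + f11)

def jaccard_index(T, S):
    f00, f01, f10, f11 = pair_counts(T, S)
    return f11 / (f01 + f10 + f11)
```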

  33. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  34. Variations of the K-Means Method  A few variants of k-means differ in:  Selection of the initial k means.  Dissimilarity calculations.  Strategies to calculate cluster means.  Handling categorical data: k-modes (Huang'98):  Replacing the means of clusters with modes.  Using new dissimilarity measures to deal with categorical objects.  Using a frequency-based method to update the modes of clusters.  For a mixture of categorical and numerical data: the k-prototype method.

  35. K-Medoids Method  The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the mean of the data.  K-Medoids: instead of using the mean as the cluster representative, use the medoid, the most centrally located object in the cluster.  Possible number of solutions?

  36. The K-Medoids Clustering Method  PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1987):  Arbitrarily select k objects as medoids.  Assign each data object in the given data set to the most similar medoid.  For each non-medoid object O' and medoid object O, compute the total cost S of swapping medoid O with O' (cost as the total sum of absolute error).  If min S < 0, then swap O with O'.  Repeat until there is no change in the medoids.
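A naive sketch of this swap loop: every medoid/non-medoid pair is tried and a swap is accepted whenever it lowers the total cost, which is why each pass costs on the order of k(n−k)^2 distance evaluations (see the complexity slides below). The names pam and total_cost are illustrative:

```python
import numpy as np

def total_cost(X, medoid_idx):
    # sum over all objects of the distance to the nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        current = total_cost(X, np.array(medoids))
        for pos in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[pos] = o            # try swapping this medoid with non-medoid o
                cost = total_cost(X, np.array(candidate))
                if cost < current:            # accept the swap if it reduces total cost
                    medoids, current = candidate, cost
                    improved = True
    # final assignment of each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[np.array(medoids)][None, :, :], axis=2)
    return d.argmin(axis=1), medoids
```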

  37. A Typical K-Medoids Algorithm (PAM)  (Figure, K = 2: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid, total cost = 20; randomly select a non-medoid object O_random; compute the total cost of swapping O and O_random, here 26; swap only if the quality is improved; loop until no change.)

  38. What Is the Problem with PAM?  PAM is more robust than k-means in the presence of noise and outliers.  PAM works efficiently for small data sets but does not scale well to large data sets.  Complexity? (n is the number of data points, k is the number of clusters.)

  39. What Is the Problem with PAM?  PAM is more robust than k-means in the presence of noise and outliers.  PAM works efficiently for small data sets but does not scale well to large data sets.  Complexity? O(k(n-k)^2) per iteration, where n is the number of data points and k is the number of clusters.

  40. CLARA (Clustering Large Applications) (1990)  CLARA (Kaufmann and Rousseeuw, 1990):  Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.

  41. CLARANS ("Randomized" CLARA) (1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94):  The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids.

  42. Search graph

  43. CLARANS ("Randomized" CLARA) (1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94):  The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids.  PAM examines all neighbors to find a local minimum.  CLARA works on subgraphs induced by samples.  CLARANS examines neighbors dynamically:  It limits the number of neighbors to explore (maxneighbor).  If a local optimum is found, it restarts from a new randomly selected node to search for a new local optimum (numlocal).

  44. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  45. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  46. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  47. Hierarchical Clustering  Produces a set of nested clusters.  Can be visualized as a dendrogram, a tree-like diagram:  The y-axis measures closeness.  A clustering is obtained by cutting the dendrogram at the desired level.  Does not require assuming any particular number of clusters.  May correspond to meaningful taxonomies.  (Figure: nested clusters on six points and the corresponding dendrogram.)


  49. Hierarchical Clustering  Two main types of hierarchical clustering:  Agglomerative (AGNES): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.  Divisive (DIANA): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

  50. Agglomerative Clustering Algorithm  1. Compute the proximity matrix.  2. Let each data point be a cluster.  3. Repeat:  4. Merge the two closest clusters.  5. Update the proximity matrix.  6. Until only a single cluster remains.
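A naive sketch of this loop, using single link (minimum pairwise distance) as the inter-cluster proximity; it is O(N^3) and meant only to make the steps concrete (names are illustrative):

```python
import numpy as np

def agglomerative(X, num_clusters=1):
    # Step 1: proximity matrix of pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Steps 3-6: repeatedly merge the two closest clusters
    while len(clusters) > num_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link proximity: minimum distance between members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]              # "update the proximity matrix" implicitly, by recomputing
    return clusters
```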

  51. Starting Situation  Start with clusters of individual points and a proximity matrix.  (Figure: points p1, p2, ..., p12 and the proximity matrix indexed by the individual points.)

  52. Intermediate Situation  (Figure: after several merges, clusters C1–C5 and the proximity matrix indexed by these clusters.)

  53. How to Define Inter-Cluster Similarity  (Figure: the proximity matrix; which value should represent the similarity between two clusters of points?)

  54. Distance between Clusters  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq).  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq).  Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq).  Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j).  Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j), where a medoid is a chosen, centrally located object in the cluster.

  55. Hierarchical Clustering: MIN  (Figure: the nested clusters and dendrogram produced by single-link (MIN) clustering on six points.)

  56. View points/similarities as a graph  Start with clusters of individual points and a proximity matrix.  (Figure: the points as vertices of a graph whose edges are weighted by proximity.)

  57. Single link clustering and MST (Minimum Spanning Tree)  An agglomerative algorithm using minimum distance (single-link clustering) is essentially the same as Kruskal's algorithm for the minimal spanning tree (MST).  MST: a subgraph which is a tree, connects all vertices, and has the minimum total weight.  Kruskal's algorithm: add edges in increasing order of weight, skipping those whose addition would create a cycle.  Prim's algorithm: grow a tree from any root node, repeatedly adding the frontier edge with the smallest weight.
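A sketch of that correspondence: processing edges in increasing weight with a union-find (Kruskal style) performs exactly the single-link merges, and stopping after N − k accepted edges leaves k clusters (names are illustrative):

```python
import numpy as np
from itertools import combinations

def single_link_via_kruskal(X, k):
    parent = list(range(len(X)))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # all edges of the complete graph, sorted by increasing weight
    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i, j in combinations(range(len(X)), 2))
    merges = 0
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # skip edges that would create a cycle
            parent[ri] = rj               # this edge is exactly a single-link merge
            merges += 1
            if merges == len(X) - k:      # N - k merges leave k clusters
                break
    return [find(i) for i in range(len(X))]   # cluster id (root) for every point
```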

  58. Min vs. Max vs. Group Average  (Figure: the nested clusters produced on the same six points by MIN, MAX, and group-average linkage.)

  59. Strength of MIN  (Figure: original points vs. the two clusters found.)  • Can handle clusters of varying sizes  • Can also handle non-elliptical shapes

  60. Limitations of MAX  (Figure: original points vs. the two clusters found.)  • Tends to break large clusters  • Biased towards globular clusters

  61. Limitations of MIN  (Figure: original points vs. the two clusters found.)  • Chaining phenomenon  • Sensitive to noise and outliers

  62. Strength of MAX  (Figure: original points vs. the two clusters found.)  • Less susceptible to noise and outliers

  63. Hierarchical Clustering: Group Average  A compromise between single and complete link.  Strengths:  Less susceptible to noise and outliers.  Limitations:  Biased towards globular clusters.

  64. Hierarchical Clustering: Major Weaknesses  Do not scale well (N: number of points).  Space complexity?  Time complexity?

  65. Hierarchical Clustering: Major Weaknesses  Do not scale well (N: number of points).  Space complexity: O(N^2).  Time complexity: O(N^3); O(N^2 log N) for some cases/approaches.  Cannot undo what was done previously.  Quality varies with the distance measure:  MIN (single link): susceptible to noise/outliers.  MAX/group average: may not work well with non-globular clusters.

  66. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  67. Density-Based Clustering Methods  Clustering based on density.  Major features:  Discovers clusters of arbitrary shape.  Handles noise.  One scan.  Needs density parameters as a termination condition.  Several interesting studies:  DBSCAN: Ester et al. (KDD'96).  OPTICS: Ankerst et al. (SIGMOD'99).  DENCLUE: Hinneburg & Keim (KDD'98).  CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based).

  68. DBSCAN: Basic Concepts  Density = number of points within a specified radius.  Core point: has high density.  Border point: has lower density, but lies in the neighborhood of a core point.  Noise point: neither a core point nor a border point.  (Figure: core, border, and noise points.)

  69. DBScan: Definitions  Two parameters:  Eps: the radius of the neighbourhood.  MinPts: the minimum number of points in an Eps-neighbourhood of a point.  N_Eps(p) = {q in D | dist(p, q) <= Eps}.  Core point: |N_Eps(q)| >= MinPts.  (Example: MinPts = 5, Eps = 1 cm.)

  70. DBScan: Definitions  Directly density-reachable (p from q): p belongs to N_Eps(q) and q is a core point (example: MinPts = 5, Eps = 1 cm).  Density-reachable (p from q): there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.  Density-connected (p and q): there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

  71. DBSCAN: Cluster Definition  A cluster is defined as a maximal set of density-connected points.  (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5.)

  72. DBSCAN: The Algorithm  Arbitrarily select an unvisited point p and retrieve all points density-reachable from p w.r.t. Eps and MinPts.  If p is a core point, a cluster is formed: add all neighbors of p to the cluster, and recursively add their neighbors if they are core points.  Otherwise, mark p as a noise point.  Continue the process until all points have been processed.  Complexity: O(n^2); O(n log n) if a spatial index is used.
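A compact sketch of this procedure without a spatial index (hence O(n^2)); the label -1 marks noise, other integers are cluster ids, and the names dbscan, eps, min_pts are illustrative:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]  # Eps-neighbourhoods
    labels = np.full(n, -1)                 # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                        # not a core point: noise for now (may become border later)
        # p is a core point: grow a new cluster from it
        labels[p] = cluster_id
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id      # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])   # expand only through core points
        cluster_id += 1
    return labels
```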

  73. DBSCAN: Sensitive to Parameters

  74. DBSCAN: Determining Eps and MinPts  Basic idea (given MinPts = k, find Eps):  For points in a cluster, their k-th nearest neighbors are at roughly the same distance.  Noise points have their k-th nearest neighbor at a farther distance.  Plot the sorted distance of every point to its k-th nearest neighbor.
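A sketch of that k-distance heuristic: compute every point's distance to its k-th nearest neighbor, sort the values, and read Eps off near the "knee" of the resulting curve (the function name k_distance is illustrative):

```python
import numpy as np

def k_distance(X, k):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k-th nearest neighbor distance for every point (column 0 is the point itself)
    kth = np.sort(D, axis=1)[:, k]
    return np.sort(kth)

# Usage: pick MinPts = k, plot k_distance(X, k), and set Eps near the knee, e.g.
# import matplotlib.pyplot as plt; plt.plot(k_distance(X, 4)); plt.show()
```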
