
Data Mining Techniques: Partitioning Methods: K-Means Cluster Analysis



Data Mining Techniques: Cluster Analysis
Mirek Riedewald
Many slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Cluster Analysis Overview
• Introduction
• Foundations: Measuring Distance (Similarity)
• Partitioning Methods: K-Means
• Hierarchical Methods
• Density-Based Methods
• Clustering High-Dimensional Data
• Cluster Evaluation

What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• A good clustering minimizes intra-cluster distances and maximizes inter-cluster distances
• Unsupervised learning: usually no training set with known "classes"
• Typical applications
  – As a stand-alone tool to get insight into data properties
  – As a preprocessing step for other algorithms

Rich Applications, Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
• Image Processing
• Data Reduction
• Economic Science
  – Market research
• WWW
  – Document classification
  – Weblogs: discover groups of similar access patterns

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
[Figure: clustering precipitation in Australia]

Quality: What Is Good Clustering?
• Cluster membership should correspond to objects in the same class
• High intra-class similarity, low inter-class similarity
  – Choice of similarity measure is important
• Ability to discover some or all of the hidden patterns
  – Difficult to measure without ground truth

Notion of a Cluster Can Be Ambiguous
[Figure: the same point set partitioned into two, four, or six clusters. How many clusters are there?]

Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
  – Non-exclusive clustering: points may belong to multiple clusters
• Fuzzy versus non-fuzzy
  – Fuzzy clustering: a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
• Partial versus complete
  – Cluster some or all of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, densities

Distance
• Clustering is inherently connected to the question of (dis-)similarity of objects
• How can we define similarity between objects?

Similarity Between Objects
• Usually measured by some notion of distance
• Popular choice: Minkowski distance
  dist(x(i), x(j)) = (|x_1(i) - x_1(j)|^q + |x_2(i) - x_2(j)|^q + ... + |x_d(i) - x_d(j)|^q)^(1/q)
  – q is a positive integer
• q = 1: Manhattan distance
  dist(x(i), x(j)) = |x_1(i) - x_1(j)| + |x_2(i) - x_2(j)| + ... + |x_d(i) - x_d(j)|
• q = 2: Euclidean distance
  dist(x(i), x(j)) = sqrt(|x_1(i) - x_1(j)|^2 + |x_2(i) - x_2(j)|^2 + ... + |x_d(i) - x_d(j)|^2)
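To make the three distance formulas concrete, here is a minimal Python sketch; the function name `minkowski` and the example vectors are illustrative and not from the slides.

```python
# Minkowski distance: (sum_k |x_k - y_k|^q)^(1/q).
# q = 1 yields Manhattan distance, q = 2 yields Euclidean distance.
def minkowski(x, y, q):
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan: 3 + 2 + 0 = 5.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 4 + 0), about 3.61
```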

Metrics
• Properties of a metric
  – d(i,j) ≥ 0
  – d(i,j) = 0 if and only if i = j
  – d(i,j) = d(j,i)
  – d(i,j) ≤ d(i,k) + d(k,j)
• Examples: Euclidean distance, Manhattan distance
• Many other non-metric similarity measures exist

Challenges
• How do we compute a distance for categorical attributes?
• An attribute with a large domain often dominates the overall distance
  – Weight and scale the attributes, as for k-NN
• After selecting the distance function, is it now clear how to compute similarity between objects?
• Curse of dimensionality

Curse of Dimensionality
• Best solution: remove any attribute that is known to be very noisy or not interesting
• Try different subsets of the attributes and determine where good clusters are found

Nominal Attributes
• Method 1: work with original values
  – Difference = 0 if same value, difference = 1 otherwise
• Method 2: transform to binary attributes
  – New binary attribute for each domain value
  – Encode a specific domain value by setting the corresponding binary attribute to 1 and all others to 0

Ordinal Attributes
• Method 1: treat as nominal
  – Problem: loses the ordering information
• Method 2: map to [0,1]
  – Problem: to which values should the original values be mapped?
  – Default: equi-distant mapping to [0,1]

Scaling and Transforming Attributes
• Sometimes it is necessary to transform numerical attributes to [0,1] or to apply another normalizing transformation, possibly a non-linear one (e.g., logarithm)
• Attributes might need to be weighted differently
• Often requires expert knowledge or trial-and-error
• (The encodings and scalings above are sketched in code below.)
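A small Python sketch of the attribute preparations just described: one-hot encoding for nominal attributes (Method 2), equi-distant mapping for ordinal attributes (Method 2), and min-max scaling for numerical ones. The function names and example domains are made up for illustration.

```python
# One-hot encoding for a nominal attribute:
# one binary attribute per domain value, exactly one set to 1.
def one_hot(value, domain):
    return [1 if value == v else 0 for v in domain]

# Equi-distant mapping of an ordinal attribute to [0, 1].
# Assumes the domain is listed in rank order and has at least two values.
def ordinal_to_unit(value, ordered_domain):
    return ordered_domain.index(value) / (len(ordered_domain) - 1)

# Min-max normalization of a numerical attribute to [0, 1].
# Assumes the attribute is not constant (max > min).
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot("red", ["red", "green", "blue"]))              # [1, 0, 0]
print(ordinal_to_unit("medium", ["low", "medium", "high"]))  # 0.5
print(min_max([10.0, 20.0, 40.0]))                           # [0.0, 0.333..., 1.0]
```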

Other Similarity Measures
• Special distance or similarity measures exist for many applications
  – Might be a non-metric function
• Information retrieval
  – Document similarity based on keywords
• Bioinformatics
  – Gene features in micro-arrays

Calculating Cluster Distances
• Single link = smallest distance between an element in one cluster and an element in the other: dist(K_i, K_j) = min{ dist(x_ip, x_jq) }
• Complete link = largest distance between an element in one cluster and an element in the other: dist(K_i, K_j) = max{ dist(x_ip, x_jq) }
• Average distance between an element in one cluster and an element in the other: dist(K_i, K_j) = avg{ dist(x_ip, x_jq) }
• Distance between cluster centroids: dist(K_i, K_j) = dist(m_i, m_j)
• Distance between cluster medoids: dist(K_i, K_j) = dist(x_mi, x_mj)
  – Medoid: one chosen, centrally located object in the cluster
(The link-based distances are sketched in code after this section.)

Cluster Centroid, Radius, and Diameter
• Centroid: the "middle" of a cluster C
  m = (1/|C|) * Σ_{x ∈ C} x
• Radius: square root of the average squared distance from a point of the cluster to its centroid
  R = sqrt( Σ_{x ∈ C} (x - m)^2 / |C| )
• Diameter: square root of the average squared distance between all pairs of points in the cluster
  D = sqrt( Σ_{x ∈ C} Σ_{y ∈ C, y ≠ x} (x - y)^2 / (|C| * (|C| - 1)) )
(These cluster statistics are also sketched in code after this section.)

Partitioning Algorithms: Basic Concept
• Construct a partition of a database D of n objects into a set of K clusters such that the sum of squared distances to the cluster "representatives" m_i is minimized:
  Σ_{i=1}^{K} Σ_{x ∈ C_i} (m_i - x)^2
• Given a K, find the partition into K clusters that optimizes the chosen partitioning criterion
  – Globally optimal: enumerate all partitions
  – Heuristic methods
    • K-means ('67): each cluster is represented by its centroid
    • K-medoids ('87): each cluster is represented by one of the objects in the cluster

K-means Clustering
• Each cluster is associated with a centroid
• Each object is assigned to the cluster with the closest centroid
• Algorithm:
  1. Given K, select K random objects as initial centroids
  2. Repeat until the centroids do not change:
     1. Form K clusters by assigning every object to its nearest centroid
     2. Recompute the centroid of each cluster
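Following up the forward references above: a minimal Python sketch of the three link-based cluster distances, assuming clusters are lists of numeric tuples and using Euclidean distance via `math.dist`; the function names are illustrative.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Single link: smallest pairwise distance between the two clusters.
def single_link(Ki, Kj):
    return min(dist(p, q) for p, q in product(Ki, Kj))

# Complete link: largest pairwise distance between the two clusters.
def complete_link(Ki, Kj):
    return max(dist(p, q) for p, q in product(Ki, Kj))

# Average link: mean of all pairwise distances between the two clusters.
def average_link(Ki, Kj):
    pairs = [dist(p, q) for p, q in product(Ki, Kj)]
    return sum(pairs) / len(pairs)
```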
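And a sketch of the centroid, radius, and diameter formulas, reading (x - m)^2 as the squared Euclidean distance; that reading is my assumption, since the slides do not spell it out.

```python
from math import dist, sqrt

# Centroid: per-dimension mean of the points in cluster C.
def centroid(C):
    n, d = len(C), len(C[0])
    return tuple(sum(x[k] for x in C) / n for k in range(d))

# Radius: sqrt of the average squared distance from each point to the centroid.
def radius(C):
    m = centroid(C)
    return sqrt(sum(dist(x, m) ** 2 for x in C) / len(C))

# Diameter: sqrt of the average squared distance over all ordered pairs
# of distinct points, i.e. |C| * (|C| - 1) pairs.
def diameter(C):
    n = len(C)
    total = sum(dist(C[a], C[b]) ** 2 for a in range(n) for b in range(n) if a != b)
    return sqrt(total / (n * (n - 1)))
```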
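Finally, a minimal sketch of the K-means loop as outlined in the algorithm above (Lloyd's algorithm), assuming points are tuples of floats; the `seed` parameter and the empty-cluster rule (keep the old centroid) are my additions for reproducibility and robustness, not part of the slides.

```python
import random
from math import dist

def k_means(points, K, seed=0):
    # Step 1: select K random objects as the initial centroids.
    centroids = random.Random(seed).sample(points, K)
    while True:
        # Step 2.1: form K clusters by assigning every point to its nearest centroid.
        clusters = [[] for _ in range(K)]
        for p in points:
            nearest = min(range(K), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 2.2: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j, c in enumerate(clusters):
            if c:
                d = len(c[0])
                new_centroids.append(tuple(sum(p[k] for p in c) / len(c) for k in range(d)))
            else:
                new_centroids.append(tuple(centroids[j]))  # empty cluster: keep old centroid
        # Step 2 termination: stop once no centroid changes.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

# Example on four 2-D points that form two obvious clusters.
pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
clusters, centers = k_means(pts, K=2)
print(centers)  # approximately [(1.1, 0.9), (5.1, 4.9)] (order may vary)
```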
