  1. CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2015

  2. Announcements • Homework 1 grades out • Re-grading policy: • If you have doubts about your grading, please submit a regrading form (via email to both TAs, CC'ing the Instructor) indicating clearly why you think it should be regraded • The regrading form must be submitted within one week after you receive your score • We will regrade the whole homework/exam • Homework 3 out tomorrow

  3. Methods to Learn • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation* (graph & network); Neural Network (images) • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN*; Spectral Clustering* (graph & network) • Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data) • Prediction: Linear Regression (matrix data); Autoregression (time series) • Similarity Search: DTW (time series); P-PageRank (graph & network) • Ranking: PageRank (graph & network)

  4. Matrix Data: Clustering: Part 1 • Cluster Analysis: Basic Concepts • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Evaluation of Clustering • Summary

  5. What is Cluster Analysis? • Cluster: A collection of data objects • similar (or related) to one another within the same group • dissimilar (or unrelated) to the objects in other groups • Cluster analysis (or clustering, data segmentation, …) • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

  6. Applications of Cluster Analysis • Data reduction • Summarization: Preprocessing for regression, PCA, classification, and association analysis • Compression: Image processing: vector quantization • Prediction based on groups • Cluster & find characteristics/patterns for each group • Finding K-nearest Neighbors • Localizing search to one or a small number of clusters • Outlier detection: Outliers are often viewed as those “far away” from any cluster

  7. Clustering: Application Examples • Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species • Information retrieval: document clustering • Land use: Identification of areas of similar land use in an earth observation database • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: Observed earthquake epicenters should be clustered along continent faults • Climate: understanding earth climate; finding patterns in atmospheric and ocean data

  8. Basic Steps to Develop a Clustering Task • Feature selection • Select info concerning the task of interest • Minimal information redundancy • Proximity measure • Similarity of two feature vectors • Clustering criterion • Expressed via a cost function or some rules • Clustering algorithms • Choice of algorithms • Validation of the results • Validation test (also, clustering tendency test) • Interpretation of the results • Integration with applications

  9. Requirements and Challenges • Scalability • Clustering all the data instead of only on samples • Ability to deal with different types of attributes • Numerical, binary, categorical, ordinal, linked, and mixtures of these • Constraint-based clustering • User may give inputs on constraints • Use domain knowledge to determine input parameters • Interpretability and usability • Others • Discovery of clusters with arbitrary shape • Ability to deal with noisy data • Incremental clustering and insensitivity to input order • High dimensionality

  10. Matrix Data: Clustering: Part 1 • Cluster Analysis: Basic Concepts • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Evaluation of Clustering • Summary

  11. Partitioning Algorithms: Basic Concept • Partitioning method: Partitioning a dataset D of n objects into a set of k clusters, such that the sum of squared distances $E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$ is minimized, where $c_i$ is the centroid or medoid of cluster $C_i$ (see the sketch below) • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimum: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
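A minimal sketch of the partitioning criterion E in Python (assuming NumPy; the function name `sse` and its argument layout are illustrative, not from the slides): given a data matrix, a label vector, and the current cluster centers, it evaluates E directly from the definition above using squared Euclidean distance.

```python
import numpy as np

def sse(X, labels, centers):
    # E = sum_i sum_{p in C_i} d(p, c_i)^2 with d the Euclidean distance.
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))
```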

  12. The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: • Step 0: Partition objects into k nonempty subsets • Step 1: Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) • Step 2: Assign each object to the cluster with the nearest seed point • Step 3: Go back to Step 1; stop when the assignment does not change

  13. An Example of K-Means Clustering (K=2) • Partition objects into k nonempty subsets (the initial data set, arbitrarily partitioned into k groups) • Repeat: • Update the cluster centroids: compute the centroid (i.e., mean point) of each partition • Reassign objects: assign each object to the cluster of its nearest centroid • Until no change in the assignment (loop if needed) [Figure: scatter plots showing the initial partition, the updated centroids, and the reassigned objects across iterations]
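The loop above translates almost directly into code. Below is a minimal sketch of Lloyd's algorithm (assuming NumPy; the function name `k_means`, the random choice of initial centroids, and the `max_iter` cap are illustrative assumptions, not from the slides):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct objects as the starting centroids.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean point of its partition;
        # keep the old centroid if a cluster becomes empty.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Stop when the centroids (hence the assignment) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For the K=2 example above, `labels, centers = k_means(X, k=2)` would trace the same update/reassign loop, with the usual caveat that the result depends on the initial partition.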

  14. Theory Behind K-Means • Objective function: $J = \sum_{k=1}^{K} \sum_{x_j \in C_k} \|x_j - c_k\|^2$ • Total within-cluster variance • Re-arrange the objective function: $J = \sum_{k=1}^{K} \sum_{j} w_{jk} \|x_j - c_k\|^2$, where $w_{jk} \in \{0, 1\}$: $w_{jk} = 1$ if $x_j$ belongs to cluster $k$, and $w_{jk} = 0$ otherwise • Looking for: • The best assignment $w_{jk}$ • The best centers $c_k$

  15. Solution of K-Means • Minimize $J = \sum_{k=1}^{K} \sum_{j} w_{jk} \|x_j - c_k\|^2$ by alternating iterations • Step 1: Fix the centers $c_k$; find the assignment $w_{jk}$ that minimizes $J$ • => $w_{jk} = 1$ if $\|x_j - c_k\|^2$ is the smallest over all clusters • Step 2: Fix the assignment $w_{jk}$; find the centers that minimize $J$ • => set the first derivative of $J$ to 0: $\partial J / \partial c_k = -2 \sum_j w_{jk} (x_j - c_k) = 0$ • => $c_k = \frac{\sum_j w_{jk} x_j}{\sum_j w_{jk}}$ • Note: $\sum_j w_{jk}$ is the total number of objects in cluster $k$
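The Step 2 update has the closed form derived above: each center is just the mean of its assigned objects. A small sketch of that update (assuming NumPy; `w` is the 0/1 assignment matrix $w_{jk}$ from the slide, here shaped objects × clusters, and the function name is illustrative):

```python
import numpy as np

def update_centers(X, w):
    # From dJ/dc_k = -2 * sum_j w_jk (x_j - c_k) = 0:
    #   c_k = (sum_j w_jk * x_j) / (sum_j w_jk)
    counts = w.sum(axis=0)              # objects per cluster, sum_j w_jk
    return (w.T @ X) / counts[:, None]  # per-cluster sums divided by counts
```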

  16. Comments on the K-Means Method • Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n • Comment: Often terminates at a local optimum • Weakness • Applicable only to objects in a continuous n-dimensional space • Use the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009) • Sensitive to noisy data and outliers • Not suitable for discovering clusters with non-convex shapes

  17. Variations of the K-Means Method • Most variants of k-means differ in • Selection of the initial k means • Dissimilarity calculations • Strategies to calculate cluster means • Handling categorical data: k-modes (sketched below) • Replacing means of clusters with modes • Using new dissimilarity measures to deal with categorical objects • Using a frequency-based method to update modes of clusters • A mixture of categorical and numerical data: k-prototype method
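A sketch of the k-modes variant (assuming NumPy; the matching dissimilarity and the frequency-based mode update follow the bullets above, while the function names, initialization, and iteration cap are illustrative assumptions):

```python
from collections import Counter
import numpy as np

def mismatches(x, modes):
    # Matching dissimilarity for categorical data:
    # the number of attributes on which x and each mode differ.
    return (modes != x).sum(axis=1)

def k_modes(X, k, max_iter=100, seed=0):
    # X: 2-D array of categorical values (e.g., dtype=object strings).
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assign each object to the mode with the fewest mismatched attributes.
        labels = np.array([mismatches(x, modes).argmin() for x in X])
        new_modes = modes.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                # Frequency-based update: most frequent category per attribute.
                new_modes[i] = [Counter(col).most_common(1)[0][0]
                                for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes
```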

  18. What Is the Problem of the K-Means Method? • The k-means algorithm is sensitive to outliers! • Since an object with an extremely large value may substantially distort the distribution of the data • K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster [Figure: two scatter plots contrasting a centroid pulled away by an outlier with a medoid, which remains an actual data object]
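A tiny illustration of the difference (assuming NumPy; the data values are made up): with one extreme value, the mean is dragged far from the bulk of the data, while the medoid stays on an actual, central object.

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier
mean = pts.mean()                            # 22.0, pulled toward the outlier
# Medoid: the data point with the smallest total distance to all others.
medoid = pts[np.abs(pts[:, None] - pts[None, :]).sum(axis=1).argmin()]
print(mean, medoid)                          # 22.0 3.0
```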

  19. PAM: A Typical K-Medoids Algorithm (K=2) • Arbitrarily choose k objects as the initial medoids • Assign each remaining object to its nearest medoid (total cost = 20 in the example) • Do loop: • Randomly select a non-medoid object O_random • Compute the total cost of swapping a medoid O with O_random (total cost = 26 for the swap shown) • If the quality is improved, perform the swap • Until no change [Figure: scatter plots illustrating the initial medoid choice, the assignment, and the candidate swap]
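A compact PAM-style sketch (assuming NumPy; for brevity this version considers every candidate swap instead of sampling a single random non-medoid O_random per round as on the slide, and all names are illustrative):

```python
import numpy as np

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between all objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))  # arbitrary initial medoids

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid.
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:  # loop until no swap improves the quality
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(candidate) < total_cost(medoids):
                    medoids = candidate  # swap m and o: total cost improved
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids
```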
