CISC 4631 Data Mining Lecture 09: Clustering
These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Eamonn Keogh (UC Riverside)
• Raymond Mooney (UT Austin)
What is Clustering? Finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. The goal is that intra-cluster distances are minimized while inter-cluster distances are maximized.
What is a natural grouping among these objects? Clustering is subjective: the same characters can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males.
Similarity is Subjective
Intuitions behind desirable distance measure properties:
D(A,B) = D(B,A) (Symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B (Positivity / Separation). Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality). Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
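To make these properties concrete, here is a minimal sketch (with three hypothetical 2-D points A, B, C) checking that plain Euclidean distance satisfies all four:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three hypothetical 2-D points.
A, B, C = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(A, B) == euclidean(B, A)                    # symmetry
assert euclidean(A, A) == 0.0                                # constancy of self-similarity
assert euclidean(A, B) > 0                                   # positivity: A != B
assert euclidean(A, B) <= euclidean(A, C) + euclidean(B, C)  # triangular inequality
```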
Applications of Cluster Analysis
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations, or customers that have similar buying habits.
– Example: discovered clusters of stocks with similar price movements, and the industry group each corresponds to:
  1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Summarization
– Reduce the size of large data sets (figure: clustering precipitation in Australia).
Notion of a Cluster can be Ambiguous
So tell me, how many clusters do you see? The same points can reasonably be read as two clusters, four clusters, or six clusters.
Types of Clusterings
A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters.
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.
Partitional Clustering (figure: the original points, and a partitional clustering of them)
Hierarchical Clustering (figures: a traditional hierarchical clustering of points p1, p2, p3, p4; the corresponding traditional dendrogram; and a "Simpsonian" dendrogram)
Other Distinctions Between Sets of Clusters Exclusive versus non-exclusive – In non-exclusive clusterings points may belong to multiple clusters – Can represent multiple classes or ‘border’ points Fuzzy versus non-fuzzy – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 – Weights must sum to 1 – Probabilistic clustering has similar characteristics Partial versus complete – In some cases, we only want to cluster some of the data
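As a toy illustration of fuzzy membership, here is a sketch that gives a point a weight for every cluster, with the weights summing to 1. The inverse-distance scheme is a hypothetical simplification for illustration, not the actual fuzzy c-means update rule:

```python
import math

def fuzzy_weights(point, centers):
    """One simple way to get fuzzy memberships: weight each cluster
    inversely to the point's distance from its center, then normalize
    so the weights sum to 1. (Illustrative only; not fuzzy c-means.)"""
    eps = 1e-12  # avoids division by zero if the point sits on a center
    inv = [1.0 / (math.dist(point, c) + eps) for c in centers]
    total = sum(inv)
    return [w / total for w in inv]

# A point near the first of three hypothetical centers gets most of the weight.
print(fuzzy_weights((1.0, 1.0), [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]))
```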
Types of Clusters
Well-separated clusters
Center-based clusters (our main emphasis)
Contiguous clusters
Density-based clusters
Clusters described by an objective function
Types of Clusters: Well-Separated Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster (assuming numerical attributes), or a medoid, the most "representative" point of a cluster (used if there are categorical features).
4 center-based clusters
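A brief sketch contrasting the two kinds of centers on three hypothetical 2-D points: the centroid is the coordinate-wise mean (and need not be an actual data point), while the medoid is the member that minimizes total distance to the rest:

```python
import math

points = [(1.0, 1.0), (2.0, 1.0), (1.5, 3.0)]  # hypothetical cluster members

# Centroid: the coordinate-wise mean; requires numerical attributes
# and need not coincide with any actual data point.
centroid = tuple(sum(c) / len(points) for c in zip(*points))

# Medoid: the actual member minimizing total distance to the others;
# usable whenever pairwise (dis)similarities exist, e.g. categorical data.
medoid = min(points, key=lambda p: sum(math.dist(p, q) for q in points))

print("centroid:", centroid, "medoid:", medoid)
```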
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
– Find clusters that minimize or maximize an objective function.
– In principle, enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters with the given objective function (NP-hard).
– Example: the sum of squared distances to the cluster centers.
Clustering Algorithms K-means and its variants Hierarchical clustering Density-based clustering
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple – K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask user how many clusters they'd like (e.g., k = 3).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
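A minimal sketch of this loop, assuming 2-D points stored as tuples and plain Euclidean distance (the function and parameter names are illustrative, not from the lecture):

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6):
    """Plain k-means on a list of 2-D tuples; returns (centers, labels)."""
    centers = random.sample(points, k)  # step 2: random initial centers
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 3: each point finds the center it is closest to.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        new_centers = []
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an emptied cluster's center
        # Step 6: terminate once the centers have (nearly) stopped moving.
        if all(math.dist(a, b) < tol for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return centers, labels
```

For example, `centers, labels = kmeans(points, 3)` would cluster a list of (x, y) tuples; the following slides illustrate how the outcome depends on the random initialization in step 2.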
K-means Clustering: Steps 1–5 (figures: the centers k1, k2, k3 are placed, points are assigned to their closest center, and each center moves to the centroid of its points, step by step; axes: expression in condition 1 vs. expression in condition 2)
K-means Clustering – Details Initial centroids are often chosen randomly. – Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster ‘Closeness’ is measured by Euclidean distance, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points change clusters’
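The relaxed stopping condition from the last bullet could be expressed as a small helper like this (the 1% threshold is a hypothetical choice):

```python
def few_points_changed(old_labels, new_labels, frac=0.01):
    """Relaxed stopping test: stop once fewer than a fraction `frac`
    of the points changed clusters between consecutive iterations."""
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    return changed < frac * len(new_labels)
```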
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the center of its cluster.
– To get SSE, we square these errors and sum them.
– It can be shown that to minimize SSE, the best update strategy is to use the centroid of each cluster.
– Given two clusterings, we can choose the one with the smallest error.
– One easy way to reduce SSE is to increase K, the number of clusters; yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
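A minimal sketch of the SSE computation described above, matching the tuple-based points and labels of the earlier k-means sketch:

```python
import math

def sse(points, centers, labels):
    """Sum of Squared Error: square each point's distance to the center
    of the cluster it is assigned to, then sum over all points."""
    return sum(math.dist(p, centers[lbl]) ** 2
               for p, lbl in zip(points, labels))
```

Running k-means several times from different random initializations and keeping the run with the smallest SSE is the usual cheap defense against the bad initial centroids discussed on the next slides.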
Two different K-means Clusterings (figure: the same original points partitioned two ways, one an optimal clustering and the other a sub-optimal clustering)
Importance of Choosing Initial Centroids (figure: iterations 1–6 of one run). If you happen to choose good initial centroids, then you will get this after 6 iterations.
Importance of Choosing Initial Centroids (figure: the six iterations shown one by one, ending in a good clustering)