Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012

Lecture 11: 21 May 2012
Unsupervised Learning (cont.)
Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Mixed attributes
- The distance functions we have seen are for data with all numeric attributes, or all nominal attributes, etc.
- In many practical cases the data has attributes of different types, from the following 6:
  - interval-scaled
  - ratio-scaled
  - symmetric binary
  - asymmetric binary
  - nominal
  - ordinal
- Clustering a data set involving mixed attributes is a challenging problem
Convert to a single type
- One common way of dealing with mixed attributes is to:
  1. Choose a dominant attribute type
  2. Convert the other types to this type
- E.g., if most attributes in a data set are interval-scaled
  - we convert ordinal attributes and ratio-scaled attributes to interval-scaled attributes
  - it is also appropriate to treat symmetric binary attributes as interval-scaled attributes

Convert to a single type (cont.)
- It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute
  - but it is frequently done in practice by assigning numbers to them according to some hidden ordering, e.g., the prices of fruits
- Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes
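As a small illustration of the last point, here is a minimal Python sketch of turning a nominal attribute into a set of binary attributes; the data frame and the "fruit" column are invented for the example.

```python
import pandas as pd

# Hypothetical data set with one nominal attribute, "fruit"
df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "cherry"]})

# One binary (0/1) attribute per category; these can then be
# treated as numeric attributes by the distance function
binary = pd.get_dummies(df["fruit"], prefix="fruit").astype(int)
print(binary)
```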
Combining individual distances
- This approach computes the individual attribute distances and then combines them
- A combination formula, proposed by Gower, is

  $$\mathrm{dist}(x_i, x_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}} \qquad (4)$$

  - The distance dist(x_i, x_j) is between 0 and 1
  - r is the number of attributes
  - The indicator is

    $$\delta_{ij}^{f} = \begin{cases} 1 & \text{if } x_{if} \text{ and } x_{jf} \text{ are not missing} \\ 0 & \text{if } x_{if} \text{ or } x_{jf} \text{ is missing} \\ 0 & \text{if attribute } f \text{ is asymmetric and } x_{if} = x_{jf} = 0 \end{cases}$$

  - d_ij^f is the distance contributed by attribute f, in the range [0,1]

Combining individual distances (cont.)
- If f is a binary or nominal attribute

  $$d_{ij}^{f} = \begin{cases} 1 & \text{if } x_{if} \neq x_{jf} \\ 0 & \text{otherwise} \end{cases}$$

  - distance (4) reduces to
    - equation (3) of lecture 10 if all attributes are nominal
    - the simple matching distance (1) of lecture 10 if all attributes are symmetric binary
    - the Jaccard distance (2) of lecture 10 if all attributes are asymmetric
- If f is interval-scaled

  $$d_{ij}^{f} = \frac{|x_{if} - x_{jf}|}{R_f}$$

  - R_f is the value range of f: R_f = max(f) − min(f)
  - If all the attributes are interval-scaled, distance (4) reduces to the Manhattan distance, assuming that all attribute values are standardized
- Ordinal and ratio-scaled attributes are converted to interval-scaled attributes and handled in the same way
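A minimal sketch of Gower's formula (4), assuming the caller supplies each attribute's type and, for interval-scaled attributes, its value range R_f; the function and variable names are illustrative only.

```python
def gower_distance(x_i, x_j, types, ranges):
    """Gower's combined distance, equation (4).

    x_i, x_j : sequences of attribute values (None marks a missing value)
    types    : per-attribute type: "interval", "nominal",
               "symmetric" (binary), or "asymmetric" (binary)
    ranges   : dict {attribute index: value range R_f} for interval attributes
    """
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x_i, x_j)):
        # delta_ij^f = 0: skip missing values and 0/0 matches on asymmetric attributes
        if a is None or b is None:
            continue
        if types[f] == "asymmetric" and a == 0 and b == 0:
            continue
        # d_ij^f: per-attribute distance in [0, 1]
        if types[f] == "interval":
            d = abs(a - b) / ranges[f]
        else:  # binary or nominal attribute: simple mismatch
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den > 0 else 0.0

# Toy mixed record: (age, colour, smoker) with smoker as asymmetric binary
x1 = (35, "red", 1)
x2 = (50, "blue", 0)
print(gower_distance(x1, x2, ["interval", "nominal", "asymmetric"], {0: 60}))  # 0.75
```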
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

How to choose a clustering algorithm
- Clustering research has a long history
  - A vast collection of algorithms is available
  - We only introduced several main algorithms
- Choosing the "best" algorithm is challenging
  - Every algorithm has limitations and works well with certain data distributions
  - It is very hard, if not impossible, to know what distribution the application data follow
    - The data may not fully follow any "ideal" structure or distribution required by the algorithms
  - One also needs to decide how to standardize the data, choose a suitable distance function, and select other parameter values
How to choose a clustering algorithm (cont.)
- Due to these complexities, the common practice is to
  1. run several algorithms using different distance functions and parameter settings
  2. carefully analyze and compare the results
- The interpretation of the results must be based on
  - insight into the meaning of the original data
  - knowledge of the algorithms used
- Clustering is highly application dependent and, to a certain extent, subjective (personal preferences)

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary
Cluster evaluation: a hard problem
- The quality of a clustering is very hard to evaluate because
  - we do not know the correct clusters
- Some methods that are used
  - User inspection
    - A panel of experts inspects the resulting clusters and scores them
    - The final score is the average of the individual scores
  - Study centroids and spreads
  - Examine rules (e.g., from a decision tree) that describe the clusters
  - For text documents, one can inspect by reading
- Manual inspection is labor intensive and time consuming

Cluster evaluation: ground truth
- We use some labeled data (for classification)
  - Assumption: each class is a cluster
- Let the classes in the data D be C = (c_1, c_2, ..., c_k)
  - The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, ..., D_k
- After clustering, a confusion matrix is constructed
  - From the matrix, we compute various measurements: entropy, purity, precision, recall, and F-score
Evaluation measures: entropy
- For each cluster D_i, we measure the entropy as

  $$\mathrm{entropy}(D_i) = -\sum_{j=1}^{k} \Pr\nolimits_i(c_j)\,\log_2 \Pr\nolimits_i(c_j)$$

  - Pr_i(c_j): proportion of class c_j in cluster D_i
- The entropy of the whole clustering is

  $$\mathrm{entropy}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{entropy}(D_i)$$

  - |D_i|/|D| is the weight of cluster D_i, proportional to its size

Evaluation measures: purity
- Purity measures the extent to which a cluster contains only one class of data

  $$\mathrm{purity}(D_i) = \max_j \Pr\nolimits_i(c_j)$$

- The purity of the whole clustering is

  $$\mathrm{purity}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{purity}(D_i)$$

  - |D_i|/|D| is the weight of cluster D_i, proportional to its size
- Precision, recall, and F-measure can be computed as well
  - based on the class that is most frequent in the cluster
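A minimal sketch of the entropy and purity computations above, assuming we have cluster assignments and ground-truth class labels for the same points; the toy clusters and labels are invented for the example.

```python
import numpy as np
from collections import Counter

def entropy_and_purity(cluster_ids, class_labels):
    """Size-weighted total entropy and purity of a clustering."""
    n = len(class_labels)
    total_entropy = total_purity = 0.0
    for c in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, class_labels) if cid == c]
        props = np.array(list(Counter(members).values())) / len(members)
        ent = -np.sum(props * np.log2(props))   # entropy(D_i)
        pur = props.max()                       # purity(D_i) = max_j Pr_i(c_j)
        weight = len(members) / n               # |D_i| / |D|
        total_entropy += weight * ent
        total_purity += weight * pur
    return total_entropy, total_purity

# Toy example: two clusters over six labeled documents
clusters = [1, 1, 1, 2, 2, 2]
labels   = ["Sci", "Sci", "Sport", "Sport", "Sport", "Sci"]
print(entropy_and_purity(clusters, labels))   # (~0.918, ~0.667)
```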
An example
- We can use the total entropy or purity to compare
  - different clustering results from the same algorithm
  - different algorithms
- Precision, recall, and F-measure can be computed as well for each cluster
  - For example, the precision of Science in cluster 1 is 0.89 and the recall is 0.83; the F-measure is thus 0.86

A remark about ground-truth evaluation
- Commonly used to compare different clustering algorithms
- A real-life data set for clustering has no class labels
  - Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand
- The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm
- This evaluation method is said to be based on external data or information
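Relating to the example above: the F-measure is the harmonic mean of precision and recall, so the quoted 0.86 follows from

$$F = \frac{2PR}{P + R} = \frac{2 \times 0.89 \times 0.83}{0.89 + 0.83} \approx 0.86$$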
Evaluation based on internal information
- Intra-cluster cohesion (compactness)
  - Cohesion measures how near the data points in a cluster are to the cluster centroid
  - The sum of squared errors (SSE) is a commonly used measure (a small sketch of SSE and separation follows below)
- Inter-cluster separation (isolation)
  - Separation means that different cluster centroids should be far away from one another
- In most applications, expert judgments are still the key

Indirect evaluation
- In some applications, clustering is not the primary task, but is used to help perform another task
- We can use the performance on the primary task to compare clustering methods
- For instance, in an application the primary task is to provide recommendations on book purchasing to online shoppers
  - If we can cluster shoppers according to their features, we might be able to provide better recommendations
  - We can evaluate different clustering algorithms based on how well they help with the recommendation task
  - Here, we assume that the recommendation can be reliably evaluated
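Relating to the internal-evaluation slide above, a minimal sketch of SSE (cohesion) and centroid separation, assuming the points, cluster assignments, and centroids are already available (e.g., from k-means); the function and variable names are illustrative.

```python
import numpy as np

def sse_and_separation(points, assignments, centroids):
    """Intra-cluster cohesion (SSE) and inter-cluster separation."""
    # Cohesion: squared distance of each point to its own cluster centroid
    sse = sum(np.sum((points[assignments == i] - c) ** 2)
              for i, c in enumerate(centroids))
    # Separation: smallest pairwise distance between centroids
    k = len(centroids)
    sep = min(np.linalg.norm(centroids[i] - centroids[j])
              for i in range(k) for j in range(i + 1, k))
    return sse, sep

# Toy usage: two well-separated 2-D clusters
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
assign = np.array([0, 0, 1, 1])
cents = np.array([pts[assign == 0].mean(axis=0), pts[assign == 1].mean(axis=0)])
print(sse_and_separation(pts, assign, cents))
```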
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Summary
- Clustering has a long history and is still an active research area
  - There are a huge number of clustering algorithms
  - More are still coming every year
- We only introduced several main algorithms; there are many others, e.g.,
  - density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
- Clustering is hard to evaluate, but very useful in practice
  - This partially explains why a large number of clustering algorithms are still being devised every year
- Clustering is highly application dependent and to some extent subjective