Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 10 (14 May 2012): Unsupervised Learning (cont.)
Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Hierarchical Clustering
Produce a nested sequence of clusters, a tree, also called a dendrogram
- Singleton clusters are at the bottom of the tree
- One root cluster covers all the data points
- Sibling clusters partition the data points of their common parent
Types of hierarchical clustering
- Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
  - merges the most similar (or nearest) pair of clusters
  - stops when all the data points are merged into a single cluster (i.e., the root cluster)
- Divisive (top-down) clustering: starts with all data points in one cluster, the root
  - splits the root into a set of child clusters
  - each child cluster is recursively divided further
  - stops when only singleton clusters of individual data points remain

Agglomerative clustering
It is more popular than divisive methods
- At the beginning, each data point forms a cluster (also called a node)
- Merge the nodes/clusters that have the least distance
- Go on merging
- Eventually all nodes belong to one cluster
Agglomerative clustering algorithm
An example: working of the algorithm (see the sketch below)
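The algorithm itself and its worked example appear only as figures in the slides; the following is a minimal Python sketch of the agglomerative procedure described above, assuming the linkage criterion is supplied as a cluster-distance function (all names and the exact structure are illustrative, not from the slides).

```python
def agglomerative_clustering(points, cluster_dist):
    """Merge the two closest clusters until only the root cluster remains.

    points       : the data points (indexable sequence)
    cluster_dist : function(points, cluster_a, cluster_b) -> distance,
                   where each cluster is a list of point indices
    Returns the merge history as a list of (cluster_a, cluster_b) pairs.
    """
    clusters = [[i] for i in range(len(points))]   # one singleton cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the least distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]     # merge cluster j into cluster i
        del clusters[j]
    return merges
```

The different ways of defining cluster_dist are exactly the linkage variants discussed on the following slides.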
Measuring the distance of two clusters
- A few ways to measure the distance between two clusters
  - k-means uses only the distances between centroids
- Different variations of the algorithm
  - Single link
  - Complete link
  - Average link
  - Centroids
  - …

Single link method
- The distance between two clusters is the distance between the two closest data points in the two clusters
  - one data point from each cluster
- It can find arbitrarily shaped clusters, but
  - it may cause the undesirable "chain effect"
- (Figure: the two natural clusters, in red, are not found because of noisy points, in black)
Complete link method
- The distance between two clusters is the distance between the two furthest data points in the two clusters
- It is sensitive to outliers (in black) because they are far away
- It usually produces better clusters than the single-link method

Average link and centroid methods
Average link method
- A compromise between
  - the sensitivity of complete-link clustering to outliers and
  - the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
- The distance between two clusters is the average of all pairwise distances between the data points in the two clusters
Centroid method
- The distance between two clusters is the distance between their centroids
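These four linkage criteria can be written as cluster-distance functions that plug into the agglomerative sketch above; a minimal version assuming numpy and numeric data (function names are illustrative, not from the slides):

```python
import numpy as np

def single_link(points, a, b):
    # Distance between the two closest points, one from each cluster
    return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

def complete_link(points, a, b):
    # Distance between the two furthest points, one from each cluster
    return max(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

def average_link(points, a, b):
    # Average of all pairwise distances between the two clusters
    dists = [np.linalg.norm(points[i] - points[j]) for i in a for j in b]
    return sum(dists) / len(dists)

def centroid_link(points, a, b):
    # Distance between the centroids of the two clusters
    return np.linalg.norm(points[a].mean(axis=0) - points[b].mean(axis=0))

# Usage with the agglomerative sketch from the algorithm slide above
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
for ca, cb in agglomerative_clustering(data, single_link):
    print("merge", ca, "+", cb)
```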
The complexity
- All the hierarchical algorithms are at least O(n^2)
  - n is the number of data points
- Single link can be done in O(n^2)
- Complete and average links can be done in O(n^2 log n)
- Due to the complexity, hierarchical algorithms are hard to use for large data sets
  - Perform hierarchical clustering on a sample of the data points and then assign the remaining points by distance or by supervised learning (see lecture 9, and the sketch after the road map below)
  - Use scale-up methods (e.g., BIRCH) that find many small clusters using an efficient algorithm
    - use these clusters as the starting nodes for the hierarchical clustering

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary
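Before moving on to distance functions, here is a rough sketch of the first scale-up tip from the complexity slide (cluster only a sample, then assign the remaining points by distance); the helper below is an illustrative assumption, not from the slides, and simply assigns each leftover point to the nearest sample-cluster centroid.

```python
import numpy as np

def assign_rest_by_centroid(sample, sample_labels, rest):
    """After clustering only a sample, assign each remaining point to the
    cluster whose centroid is nearest (illustrative helper, not from the slides)."""
    labels = np.unique(sample_labels)
    centroids = np.array([sample[sample_labels == c].mean(axis=0) for c in labels])
    # Distance from every remaining point to every centroid, then pick the closest
    d = np.linalg.norm(rest[:, None, :] - centroids[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

# Example: two sample points per cluster, two leftover points
sample = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
sample_labels = np.array([0, 0, 1, 1])
rest = np.array([[0.1, 0.3], [4.8, 5.2]])
print(assign_rest_by_centroid(sample, sample_labels, rest))   # [0 1]
```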
Distance functions
- Key to clustering
  - "similarity" and "dissimilarity" are other commonly used terms
- There are numerous distance functions for
  - Different types of data
    - Numeric data
    - Nominal data
    - …
  - Different specific applications

Distance functions for numeric attributes
- We denote distance with dist(x_i, x_j), where x_i and x_j are data points (vectors)
- Most commonly used functions are
  - Euclidean distance and
  - Manhattan (city block) distance
- They are special cases of the Minkowski distance

  dist(x_i, x_j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ir - x_jr|^h)^(1/h)

  where h is a positive integer and r is the number of attributes
Euclidean distance and Manhattan distance
- If h = 2, it is the Euclidean distance

  dist(x_i, x_j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ir - x_jr)^2)

- If h = 1, it is the Manhattan distance

  dist(x_i, x_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ir - x_jr|

- Weighted Euclidean distance

  dist(x_i, x_j) = sqrt(w_1(x_i1 - x_j1)^2 + w_2(x_i2 - x_j2)^2 + ... + w_r(x_ir - x_jr)^2)

Squared distance and Chebychev distance
- Squared Euclidean distance: places progressively greater weight on data points that are further apart

  dist(x_i, x_j) = (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ir - x_jr)^2

- Chebychev distance: appropriate when one wants to define two data points as "different" if they differ on any one of the attributes

  dist(x_i, x_j) = max(|x_i1 - x_j1|, |x_i2 - x_j2|, ..., |x_ir - x_jr|)
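For concreteness, the numeric distance functions above can be written in a few lines of Python (function names and the numpy dependency are assumptions of this sketch, not from the slides):

```python
import numpy as np

def minkowski(xi, xj, h):
    # General Minkowski distance; h = 1 gives Manhattan, h = 2 gives Euclidean
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    return np.sum(np.abs(xi - xj))

def weighted_euclidean(xi, xj, w):
    # w is a vector of per-attribute weights
    return np.sqrt(np.sum(w * (xi - xj) ** 2))

def chebychev(xi, xj):
    # Largest per-attribute difference
    return np.max(np.abs(xi - xj))

# Example usage
xi, xj = np.array([0.1, 20.0]), np.array([0.9, 720.0])
print(euclidean(xi, xj))   # ~700.0005
print(manhattan(xi, xj))   # 700.8
print(chebychev(xi, xj))   # 700.0
```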
Distance functions for binary and nominal attributes
- Binary attribute: has two values or states but no ordering relationship
  - E.g., Gender: female and male
  - The 2 values are conventionally represented by 1 and 0
- We use a confusion matrix to introduce the distance functions/measures
- Let the i-th and j-th data points be x_i and x_j (vectors)

Confusion matrix
- a: number of attributes with value 1 in both x_i and x_j
- b: number of attributes with value 1 in x_i and 0 in x_j
- c: number of attributes with value 0 in x_i and 1 in x_j
- d: number of attributes with value 0 in both x_i and x_j
Symmetric binary attributes
- A binary attribute is symmetric if both of its states (0 and 1) have equal importance, e.g., female and male for the attribute Gender
- Distance function: Simple Matching Distance, the proportion of mismatches of their values

  dist(x_i, x_j) = (b + c) / (a + b + c + d)     (1)

- There are variations, adding weights
  - to mismatches:  dist(x_i, x_j) = 2(b + c) / (a + d + 2(b + c))
  - to matches:     dist(x_i, x_j) = (b + c) / (2(a + d) + b + c)

Symmetric binary attributes: example
- x_1 and x_2 are two data points
- Each of the 7 attributes is symmetric binary
- The simple matching distance is

  dist(x_1, x_2) = (b + c) / (a + b + c + d) = (2 + 1) / (2 + 2 + 1 + 2) = 3/7 = 0.429

- If there is a weight on mismatches

  dist(x_1, x_2) = 2(b + c) / (a + 2(b + c) + d) = 2(2 + 1) / (2 + 2(2 + 1) + 2) = 6/10 = 0.6
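A minimal Python sketch of the simple matching distance; the two example vectors are invented for illustration (the slide's actual attribute values are not reproduced here) but are chosen so that a = 2, b = 2, c = 1, d = 2, matching the example above.

```python
def binary_counts(xi, xj):
    """Confusion-matrix counts a, b, c, d for two binary vectors
    (0/1 lists of equal length); an illustrative helper."""
    a = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching_distance(xi, xj):
    # Proportion of mismatching symmetric binary attributes
    a, b, c, d = binary_counts(xi, xj)
    return (b + c) / (a + b + c + d)

# Invented 7-attribute vectors with a = 2, b = 2, c = 1, d = 2
x1 = [1, 1, 1, 1, 0, 0, 0]
x2 = [1, 1, 0, 0, 1, 0, 0]
print(simple_matching_distance(x1, x2))   # 0.428...
```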
Asymmetric binary attributes
- Asymmetric: one of the states is more important or valuable than the other
  - By convention, state 1 represents the more important state, which is typically the rare or infrequent state
  - The Jaccard distance is a popular measure

    dist(x_i, x_j) = (b + c) / (a + b + c)     (2)

  - There are variations, adding weights
    - to mismatches:  dist(x_i, x_j) = 2(b + c) / (a + 2(b + c))
    - to matches of the important state:  dist(x_i, x_j) = (b + c) / (2a + b + c)

Asymmetric binary attributes: example
- x_1 and x_2 are two data points
- Each of the 7 attributes is asymmetric binary
- The Jaccard distance is

  dist(x_1, x_2) = (b + c) / (a + b + c) = (2 + 1) / (2 + 2 + 1) = 3/5 = 0.6

- If there is a weight on matches of the important state

  dist(x_1, x_2) = (b + c) / (2a + b + c) = (2 + 1) / (2*2 + 2 + 1) = 3/7 = 0.429
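A corresponding sketch of the Jaccard distance, reusing the same invented example vectors (a = 2, b = 2, c = 1):

```python
def jaccard_distance(xi, xj):
    """Jaccard distance for asymmetric binary vectors (0/1 lists);
    0-0 matches (d) are ignored because state 0 carries little information."""
    a = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Same invented vectors as before: a = 2, b = 2, c = 1
x1 = [1, 1, 1, 1, 0, 0, 0]
x2 = [1, 1, 0, 0, 1, 0, 0]
print(jaccard_distance(x1, x2))   # 0.6
```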
Nominal attributes
- Nominal attributes: attributes with more than two states or values
  - The commonly used distance measure is also based on the simple matching method
  - Given two data points x_i and x_j, let the number of attributes be r and the number of attributes whose values match in x_i and x_j be q

    dist(x_i, x_j) = (r - q) / r     (3)

    (sketched below, after the road map)

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary
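Picking up the nominal-attribute distance from just before the road map, a one-function Python sketch (the example values are made up for illustration):

```python
def nominal_matching_distance(xi, xj):
    """Simple matching distance for nominal attributes:
    (r - q) / r, where q is the number of matching attribute values."""
    r = len(xi)
    q = sum(1 for u, v in zip(xi, xj) if u == v)
    return (r - q) / r

# Made-up example: only 1 of 3 attribute values matches
print(nominal_matching_distance(["red", "small", "round"],
                                ["red", "large", "square"]))   # 2/3 = 0.667
```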
Data standardization
- In a Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances
- Consider the following pair of data points
  - x_i: (0.1, 20) and x_j: (0.9, 720)

  dist(x_i, x_j) = sqrt((0.9 - 0.1)^2 + (720 - 20)^2) = 700.000457

- The distance is almost completely dominated by (720 - 20) = 700
- Standardize attributes: force the attributes to have a common value range

Interval-scaled attributes
- Their values are real numbers following a linear scale
  - E.g., the difference in Age between 10 and 20 is the same as that between 40 and 50
  - The key idea is that intervals keep the same importance throughout the scale
- Two main approaches to standardize interval-scaled attributes: range and z-score (see the sketch below)
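The range and z-score transformations named in the last bullet can be sketched as follows; the formulas are the standard ones (min-max scaling to [0, 1] and zero mean / unit standard deviation), and the helper names are assumptions of this sketch, not from the slides.

```python
import numpy as np

def range_standardize(X):
    # Rescale each attribute (column) to [0, 1]: (x - min) / (max - min)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def z_score_standardize(X):
    # Rescale each attribute to zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

# With the slide's example points, both attributes now contribute comparably
X = np.array([[0.1, 20.0], [0.9, 720.0]])
Xr = range_standardize(X)
print(np.linalg.norm(Xr[0] - Xr[1]))   # ~1.414 instead of ~700
```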