Clustering Aggregation

Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas
Helsinki Institute for Information Technology, BRU
Department of Computer Science, University of Helsinki, Finland
first.lastname@cs.helsinki.fi

Abstract

We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a meta-clustering method to improve the robustness of clusterings. The problem formulation does not require a priori information about the number of clusters, and it gives a natural way for handling missing values. We give a formal statement of the clustering-aggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.

       C1  C2  C3  C
  v1   1   1   1   1
  v2   1   2   2   2
  v3   2   1   1   1
  v4   2   2   2   2
  v5   3   3   3   3
  v6   3   4   3   3

Figure 1. An example of clustering aggregation.

1 Introduction

Clustering is an important step in the process of data analysis with applications to numerous fields. Informally, clustering is defined as the problem of partitioning data objects into groups (clusters), such that objects in the same group are similar, while objects in different groups are dissimilar. This definition assumes that there is some well-defined quality measure that captures intra-cluster similarity and/or inter-cluster dissimilarity; clustering then becomes the problem of grouping together data objects so that the quality measure is optimized.

In this paper we propose a novel approach to clustering that is based on the concept of aggregation. We assume that given the data set we can obtain some information on how these points should be clustered. This information comes in the form of m clusterings C1, ..., Cm. The objective is to produce a single clustering C that agrees as much as possible with the m clusterings. We define a disagreement between two clusterings C and C' as a pair of objects (v, u) such that C places them in the same cluster, while C' places them in different clusters, or vice versa. If d(C, C') denotes the number of disagreements between C and C', then the task is to find a clustering C that minimizes the sum over i = 1, ..., m of d(Ci, C).

As an example, consider the dataset V = {v1, v2, v3, v4, v5, v6} that consists of six objects, and let C1 = {{v1, v2}, {v3, v4}, {v5, v6}}, C2 = {{v1, v3}, {v2, v4}, {v5}, {v6}}, and C3 = {{v1, v3}, {v2, v4}, {v5, v6}} be three clusterings of V. Figure 1 shows the three clusterings, where each column corresponds to a clustering, and a value i denotes that the tuple in that row belongs in the i-th cluster of the clustering in that column. The rightmost column is the clustering C = {{v1, v3}, {v2, v4}, {v5, v6}} that minimizes the total number of disagreements with the clusterings C1, C2, C3. In this example the total number of disagreements is 5: one with the clustering C2 for the pair (v5, v6), and four with the clustering C1 for the pairs (v1, v2), (v1, v3), (v2, v4), (v3, v4). It is not hard to see that this is the minimum number of disagreements possible for any partition of the dataset V.

We define clustering aggregation as the optimization problem where, given a set of m clusterings, we want to find the clustering that minimizes the total number of disagreements with the m clusterings. Clustering aggregation provides a general framework for dealing with a variety of problems related to clustering: (i) it gives a natural clustering algorithm for categorical data, which allows for a simple treatment of missing values, (ii) it handles heterogeneous data, where tuples are defined over incomparable attributes, (iii) it determines the appropriate number of clusters and it detects outliers, (iv) it provides a method for improving the clustering robustness, by combining the results of many clustering algorithms, (v) it allows for clustering of data that is vertically partitioned in order to preserve privacy. We elaborate on the properties and the applications of clustering aggregation in Section 2.

The algorithms we propose for the problem of clustering aggregation take advantage of a related formulation, which is known as correlation clustering [2]. We map clustering aggregation to correlation clustering by considering the tuples of the dataset as vertices of a graph, and summarizing the information provided by the m input clusterings by weights on the edges of the graph. The weight of the edge (u, v) is the fraction of clusterings that place u and v in different clusters. For example, the correlation clustering instance for the dataset in Figure 1 is shown in Figure 2. Note that if the weight of the edge (u, v) is less than 1/2 then the majority of the clusterings place u and v together, while if the weight is greater than 1/2, the majority places u and v in different clusters. Ideally, we would like to cut all edges with weight more than 1/2, and to cut no edges with weight less than 1/2. The goal in correlation clustering is to find a partition of the vertices of the graph that cuts as few as possible of the edges with low weight (less than 1/2), and as many as possible of the edges with high weight (more than 1/2). In Figure 2, the clustering C = {{v1, v3}, {v2, v4}, {v5, v6}} is the optimal clustering.

Figure 2. Correlation clustering instance for the dataset in Figure 1. Solid edges indicate distances of 1/3, dashed edges indicate distances of 2/3, and dotted edges indicate distances of 1.

Clustering aggregation has been previously considered under a variety of names (consensus clustering, clustering ensemble, clustering combination) in a variety of different areas: machine learning [19, 12], pattern recognition [14], bio-informatics [13], and data mining [21, 5]. The problem of correlation clustering is interesting in its own right, and it has recently attracted a lot of attention in the theoretical computer-science community [2, 6, 8, 10]. We review some of the related literature on both clustering aggregation and correlation clustering in Section 6.

Our contributions can be summarized as follows.

• We formally define the problem of clustering aggregation, and we demonstrate the connection between clustering aggregation and correlation clustering.

• We present a number of algorithms for clustering aggregation and correlation clustering. We also propose a sampling mechanism that allows our algorithms to handle large datasets. The problems we consider are NP-hard, yet we are still able to provide approximation guarantees for many of the algorithms we propose. For the formulation of correlation clustering we consider we give a combinatorial 3-approximation algorithm, which is an improvement over the best known 9-approximation algorithm.

• We present an extensive experimental study, where we demonstrate the benefits of our approach. Furthermore, we show that our sampling technique reduces the running time of the algorithms, without sacrificing the quality of the clustering.

The rest of this paper is structured as follows. In Section 2 we discuss the various applications of the clustering-aggregation framework, which is formally defined in Section 3. In Section 4 we describe in detail the proposed algorithms for clustering aggregation and correlation clustering, and the sampling-based algorithm that allows us to handle large datasets. Our experiments on synthetic and real datasets are presented in Section 5. Finally, Section 6 contains a review of the related work, and Section 7 is a short conclusion.

2 Applications of clustering aggregation

Clustering aggregation can be applied in various settings. We will now present some of the main applications and features of our framework.

Clustering categorical data: An important application of clustering aggregation is that it provides a very natural method for clustering categorical data. Consider a dataset with tuples t1, ..., tn over a set of categorical attributes A1, ..., Am. The idea is to view each attribute Aj as a way of producing a simple clustering of the data: if Aj contains kj distinct values, then Aj partitions the data in kj clusters, one cluster for each value. Then, clustering aggregation considers all those m clusterings produced by the m attributes and tries to find a clustering that agrees as much as possible with all of them.
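The reduction from clustering aggregation to correlation clustering described in Section 1 can be sketched in a few lines: the weight of edge (u, v) is the fraction of input clusterings that separate u and v. A minimal sketch (function name and index-based encoding are ours), reproducing the edge weights of Figure 2:

```python
from itertools import combinations

def correlation_graph(clusterings):
    """Build the correlation clustering instance: for every pair of
    objects, w(u, v) = fraction of clusterings placing u, v in
    different clusters. Clusterings are label lists over n objects."""
    n = len(clusterings[0])
    m = len(clusterings)
    weights = {}
    for u, v in combinations(range(n), 2):
        separated = sum(1 for c in clusterings if c[u] != c[v])
        weights[(u, v)] = separated / m
    return weights

# The three clusterings of Figure 1, objects v1..v6 indexed 0..5.
C1 = [1, 1, 2, 2, 3, 3]
C2 = [1, 2, 1, 2, 3, 4]
C3 = [1, 2, 1, 2, 3, 3]

w = correlation_graph([C1, C2, C3])
print(w[(0, 2)])  # v1, v3: separated only by C1 -> 1/3 (a solid edge in Figure 2)
print(w[(0, 1)])  # v1, v2: separated by C2 and C3 -> 2/3 (a dashed edge)
print(w[(0, 3)])  # v1, v4: separated by all three -> 1 (a dotted edge)
```

Edges with weight below 1/2 are those a majority of clusterings want uncut, and edges above 1/2 are those a majority want cut, which is exactly the trade-off correlation clustering optimizes.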
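The categorical-data application of Section 2 views each attribute Aj as one input clustering with one cluster per distinct value. A minimal sketch of that mapping (the helper and the toy table are illustrative, not from the paper):

```python
def attribute_clusterings(tuples):
    """Turn each categorical attribute into a clustering: two tuples
    fall in the same cluster iff they share the attribute's value.
    Returns one label list per attribute."""
    m = len(tuples[0])
    clusterings = []
    for j in range(m):
        value_to_id = {}
        clustering = []
        for t in tuples:
            # Assign cluster ids 0, 1, ... in order of first appearance.
            clustering.append(value_to_id.setdefault(t[j], len(value_to_id)))
        clusterings.append(clustering)
    return clusterings

# A toy categorical dataset: each row is a tuple over two attributes.
data = [("red", "small"), ("red", "large"), ("blue", "small")]
print(attribute_clusterings(data))
# attribute 1 groups rows {0, 1} | {2}; attribute 2 groups {0, 2} | {1}
```

The resulting m label lists are then fed to the aggregation step, which searches for a single clustering agreeing as much as possible with all of them; a missing value in attribute Aj simply means the tuple contributes no pair constraints for that attribute's clustering.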