
Alternative Clusterings: Current Progress and Open Challenges



  1. Alternative Clusterings: Current Progress and Open Challenges
     James Bailey
     Department of Computer Science and Software Engineering
     The University of Melbourne, Australia

  2. Introduction
     • Cluster analysis: group “similar” objects into clusters
     • No single solution
       – Cluster by pose or by individual?
       – => Equally important, different views or hypotheses regarding the data

  3. Motivations
     • Multiple explanations of the data
       – user doesn’t initially know what they want, needs options
       – different viewpoints of users
       – may be aiming to verify that multiple explanations do not exist (hypothesis verification, or for benchmarking clustering algorithms)
     • Contrast with consensus clustering
     • “Every clustering should be accompanied by at least one alternative clustering”!?

  4. Alternative Clustering: Is it new?
     • From one perspective, alternative clustering is not so new
     • Generation of clusterings often goes like
       – Generate and assess a clustering with 2 clusters
       – Generate and assess a clustering with 3 clusters
       – …
       – Generate and assess a clustering with k clusters
     • We now have k-1 alternative clusterings…
       – But some of them may be very similar

  5. Alternative Clustering Algorithms
     • Growing number of approaches
       – ADFT, CAMI, COALA, Condens, Convolutional EM, Decorrelated k-means, MAXIMUS, Meta clustering, Multiview orthogonal clustering, NACI, Non redundant clustering, …
     • Papers have appeared at
       – KDD10, ICML10, SDM10, KDD09, SDM09, ICDM08, ICDM07, ICDM06, KDD05, ICDM04, …, DMKD, KAIS, …

  6. How do these approaches differ?
     • Task formulation
       – Number of alternatives to generate
       – Sequential or simultaneous generation
     • Mathematical basis
       – Linear algebra
       – Information theory
       – Other objective functions

  7. Sequential Alternative Clustering Generation
     • Task: Given input clusterings {C1, …, Cn}, generate an alternative clustering C’, such that C’ is of high quality and C’ is different from {C1, …, Cn}
     • Important special case: n=1
     [Diagram: existing clusterings C1, C2, …, Cn --generate--> alternative C’]

  8. Simultaneous Alternative Clustering Generation
     • Task: Simultaneously generate n clusterings {C1, …, Cn}, such that each Ci is of high quality and each pair (Ci, Cj) is different from one another
     • Important special case: n=2
     [Diagram: --generate--> alternatives C1, C2, …, Cn]
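
One possible way to write the sequential and simultaneous task statements as optimisation problems is sketched below. The symbols are assumptions introduced here for illustration and are not fixed by the slides: Q is some clustering quality score, Sim is some clustering similarity measure, and λ ≥ 0 is a weight that trades quality against dissimilarity.

```latex
% Sequential: given existing clusterings C_1, ..., C_n, find one alternative C'
C' = \arg\max_{C} \; Q(C) \;-\; \lambda \sum_{i=1}^{n} \mathrm{Sim}(C, C_i)

% Simultaneous: find all n clusterings together
\{C_1, \dots, C_n\} = \arg\max_{C_1, \dots, C_n} \;
    \sum_{i=1}^{n} Q(C_i) \;-\; \lambda \sum_{1 \le i < j \le n} \mathrm{Sim}(C_i, C_j)
```

Individual algorithms differ in how they choose Q and Sim, and some replace the penalty term with a hard constraint.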

  9. Sequential vs. Simultaneous
     • Sequential (greedy)
       – Semi-supervised
       – For i=2 to n
         • {generate the optimal alternative clustering with respect to the previous i-1 clusterings}
       – Locally optimal at each step
     • Simultaneous (non-greedy)
       – Unsupervised
       – In parallel, generate optimal set of n clusterings
       – Globally optimal clustering collection
         • but might miss some strong clusterings which would be generated by a sequential technique
         • More difficult optimisation problem
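
The greedy sequential loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any of the published algorithms: quality is taken to be negative within-cluster sum of squares, similarity is the Rand index, every clustering has exactly two clusters, and the "optimal" alternative is found by brute force, which is feasible only for a handful of points. The names `sse`, `best_alternative`, `sequential_alternatives` and the weight `lam` are hypothetical.

```python
from itertools import product

def sse(points, labels):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - b) ** 2 for p in members for a, b in zip(p, centre))
    return total

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def best_alternative(points, previous, lam):
    """Brute-force the 2-cluster labeling that best trades quality
    against similarity to all previous clusterings (assumed objective)."""
    def score(labels):
        penalty = sum(rand_index(labels, c) for c in previous)
        return -sse(points, labels) - lam * penalty
    candidates = (list(l) for l in product([0, 1], repeat=len(points))
                  if 0 < sum(l) < len(points))  # skip empty clusters
    return max(candidates, key=score)

def sequential_alternatives(points, first, n, lam=0.5):
    """Greedy loop: clustering i is chosen against the previous i-1."""
    clusterings = [first]
    for _ in range(2, n + 1):
        clusterings.append(best_alternative(points, clusterings, lam))
    return clusterings

# Toy data: two points near each corner of the unit square.
points = [[0.0, 0.0], [0.1, 0.1], [0.0, 1.0], [0.1, 0.9],
          [1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [0.9, 0.9]]
left_right = [0, 0, 0, 0, 1, 1, 1, 1]  # the given clustering C1

result = sequential_alternatives(points, left_right, n=2)
print(result[1])
```

On this toy data the search should pick the top/bottom split as the alternative to the left/right clustering. Each step is optimal only with respect to what has already been generated, which is the "locally optimal" behaviour described in the slide.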

  10. Style of Algorithm
      • Projection based
        – Project the data into an orthogonal subspace and then re-cluster
        – Appealing linear algebra formulation
        – Relatively efficient
        – Orthogonality may be too strict
      • More complex objective function
        – Generate the alternative clustering, trading off dissimilarity and quality in the objective function
        – More flexible
        – May require parameter choices
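
The projection-based style can be illustrated with a deliberately simplified sketch. It assumes an existing clustering with exactly two clusters, removes the single direction that separates their means, and re-clusters the projected data with a tiny deterministic 2-means. The helper names `project_out` and `two_means` are hypothetical, and this is not the procedure of any specific paper in the bibliography.

```python
import math

def mean(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def project_out(points, labels):
    """Project the data onto the subspace orthogonal to the direction
    joining the two existing cluster means (labels must be 0 or 1)."""
    m0 = mean([p for p, l in zip(points, labels) if l == 0])
    m1 = mean([p for p, l in zip(points, labels) if l == 1])
    diff = [a - b for a, b in zip(m1, m0)]
    norm = math.sqrt(sum(c * c for c in diff))
    u = [c / norm for c in diff]  # unit vector between the means
    projected = []
    for p in points:
        dot = sum(a * b for a, b in zip(p, u))
        projected.append([a - dot * b for a, b in zip(p, u)])
    return projected

def two_means(points, iters=20):
    """Minimal 2-means; starts from the first point and the point
    farthest from it, so the result is deterministic."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if math.dist(p, c0) <= math.dist(p, c1) else 1
                  for p in points]
        c0 = mean([p for p, l in zip(points, labels) if l == 0])
        c1 = mean([p for p, l in zip(points, labels) if l == 1])
    return labels

# Toy data: two points near each corner of the unit square.
points = [[0.0, 0.0], [0.1, 0.1], [0.0, 1.0], [0.1, 0.9],
          [1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [0.9, 0.9]]
existing = [0, 0, 0, 0, 1, 1, 1, 1]  # left/right split

alternative = two_means(project_out(points, existing))
print(alternative)
```

Here the left/right means differ only along the x axis, so the projection discards x and the re-clustering can only separate top from bottom. The same hard removal of a direction is why orthogonality can be too strict when the interesting alternative is not confined to the orthogonal subspace.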

  11. Simple Example
      Most existing techniques seem to work well (a canonical example)

  12. Circle of Gaussians
      – Techniques which trade off dissimilarity and quality are more likely to produce the second clustering
      – Orthogonal projection doesn’t work so well here

  13. Other issues
      • Evaluation: measuring quality/dissimilarity of alternatives
      • Clustering setting
        – Desired shape of clusters: spherical versus elongated, linear versus non-linear separation
        – low versus high dimensionality data
        – continuous versus discrete features
        – soft versus hard clusters
        – EM versus K-means versus hierarchical versus constraint based
        – Number of clusters desired in each clustering

  14. Alternative Clustering Evaluation
      • Measuring dissimilarity: mathematical measures
        – Rand index, Jaccard index, normalised mutual information, …
      • Measuring quality
        – Internal validation measures: Dunn index, Davies-Bouldin index, silhouette width
        – External validation: synthetic examples
      • Combine dissimilarity and quality into a single number, or present separately?
      • Are these numbers useful?
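
Two of the measures named above, the Rand index and the Jaccard index, can be computed by counting point pairs. The sketch below uses the standard pair-counting definitions; the function names are chosen here for illustration. Both are similarity scores, so a lower value means the two clusterings are more dissimilar.

```python
from itertools import combinations

def pair_counts(a, b):
    """Classify every pair of points by whether labeling a and
    labeling b each place the two points in the same cluster."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(a, b):
    """Share of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)

def jaccard_index(a, b):
    """Like the Rand index, but ignoring pairs separated by both
    clusterings. Undefined (division by zero) if no pair is ever
    grouped together by either clustering."""
    both, only_a, only_b, _ = pair_counts(a, b)
    return both / (both + only_a + only_b)

# Eight points, clustered left/right and top/bottom.
left_right = [0, 0, 0, 0, 1, 1, 1, 1]
top_bottom = [0, 0, 1, 1, 0, 0, 1, 1]

print(rand_index(left_right, top_bottom))     # 12 of 28 pairs agree
print(jaccard_index(left_right, top_bottom))  # 4 of 20 relevant pairs
```

Neither index depends on how the clusters are numbered, so relabelling a clustering does not change the score.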

  15. Where are we?
      • Good existing algorithms for generation of one or two alternatives
        – Sequential generation
        – Simultaneous generation
      • Not yet deployed on very large datasets
      • Validated using assorted benchmark datasets and internal metrics

  16. Open Issues
      • What’s the killer application?
        – Deployment of alternative clusterings
        – Need convincing use cases where consensus clustering is limited
      • Objective function and performance measures
      • How many alternatives is enough?
      • How many clusters should be in an alternative clustering?
        – the same number as the original clustering?

  17. Open Issues cont.
      • How to find alternative subspace clusters (rather than clusterings)?
      • Visualisation of alternative clusterings
      • More focused alternatives
        – “Give me another clustering which is similar in these respects and different in these other respects to the previous clustering”

  18. Moving Forward
      • Central repository of code and canonical examples (synthetic and real)
      • Make alternative clustering algorithms accessible
      • Identify cases in the literature of “missing” alternative clusterings

  19. Bibliography
      • E. Bae, J. Bailey, and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge Discovery.
      • D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of ICML 2010.
      • X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings. Proc. of KDD 2010.
      • X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of SDM 2010.
      • Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc. of KDD 2009.
      • P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Proc. of SDM 2008.
      • I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM 2008.
      • Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc. of ICDM 2007.
      • E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. Proc. of ICDM 2006.
      • R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. Proc. of ICDM 2006.
      • D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD 2005.
      • D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.
