
Less is More: Non-Redundant Subspace Clustering. Ira Assent, Emmanuel Müller, Stephan Günnemann, Ralph Krieger, Thomas Seidl. Aalborg University, Denmark; RWTH Aachen University, Germany. MultiClust Workshop at SIGKDD 2010.


  1. Less is More: Non-Redundant Subspace Clustering
     Ira Assent (◦), Emmanuel Müller (•), Stephan Günnemann (•), Ralph Krieger (•), Thomas Seidl (•)
     ◦ Aalborg University, Denmark; • RWTH Aachen University, Germany
     MultiClust Workshop at SIGKDD 2010, July 25, 2010

  2. Detection of Non-Redundant Subspace Clusters I
     Sections: Effective Models, Efficient Computation, Evaluation and Exploration of Results
     [Figure: clusters C1 to C6 in different attribute combinations; axis labels: # boats in Miami, internet usage, sportive activities, income]
     Hidden clusters are described by different attribute sets
     Each object might be grouped in multiple clusters
     ⇒ Novel challenges for subspace clustering

  3. Detection of Non-Redundant Subspace Clusters II
     [Figure: clusters C1, C3 and C4 in attribute projections; axis labels: # boats in Miami, # horses, freq. flyer miles, # cars, income]
     Subspace cluster: (rich; boat owner; car fan; globetrotter; horse fan)
     Exponentially many projections: (rich), (boat owner), (rich; globetrotter), ...
     Huge amount of redundant clusters
     ⇒ Typically number of clusters ≫ number of objects
     ⇒ Detection of all and only non-redundant subspace clusters
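
To make the "exponentially many projections" remark concrete, the following minimal sketch enumerates every non-empty proper subset of the five attributes named on the slide. It assumes, as the slide suggests, that the cluster also shows up in each lower-dimensional projection; the helper name `proper_projections` is made up for this illustration.

```python
from itertools import combinations

def proper_projections(subspace):
    """Yield every non-empty proper subset of the given attribute tuple."""
    for size in range(1, len(subspace)):
        yield from combinations(subspace, size)

# Attribute set of the example cluster from the slide.
cluster_subspace = ("rich", "boat owner", "car fan", "globetrotter", "horse fan")
projections = list(proper_projections(cluster_subspace))

# A d-dimensional subspace has 2**d - 2 non-empty proper subsets.
assert len(projections) == 2 ** len(cluster_subspace) - 2
print(len(projections))  # prints 30
```

A single 5-dimensional cluster can therefore induce up to 30 lower-dimensional projections, and the count doubles with every additional attribute.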

  4. Overview
     Main question: How can you use/extend non-redundant clustering ...
     In this talk, we present:
       A survey of our contributions so far
       The generality of our techniques
       Our open source initiatives for the community
     Research questions arise in the areas of:
       1. Effective Models
       2. Efficient Computation
       3. Evaluation and Exploration of Results

  5. Notions and Related Work
     Abstract subspace clustering definition:
       Definition of object set O clustered in subspace S:
         C = (O, S) with O ⊆ DB, S ⊆ DIM
       Selection of result set M, a subset of all valid subspace clusters ALL:
         M = { (O_1, S_1), ..., (O_n, S_n) } ⊆ ALL
     [Figure: subspace lattice over dimensions 1 to 4, from the 1-D subspaces {1}, {2}, {3}, {4} up to {1,2,3,4}]
     Related work:
       Subspace clustering: focus on definition of (O, S)
         ⇒ Output all valid subspace clusters, M = ALL (⇒ too many)
       Projected clustering: focus on definition of disjoint clusters in M
         ⇒ Unable to detect objects in multiple clusters (⇒ too few)

  6. Non-Redundant Subspace Clustering Models
     Select M ⊆ ALL: Exclude redundant subspace clusters...
     [Figure: clusters C1, C2 and C3]
     Local (pairwise) redundancy elimination [1][2]:
       (O, S) is non-redundant iff ¬∃ (O′, S′) with O′ ⊆ O ∧ S′ ⊃ S ∧ |O′| ≥ R · |O|
     ⇒ Excludes large number of redundant subspace clusters
     [1] Assent, Krieger, Müller and Seidl: DUSC: Dimensionality Unbiased Subspace Clustering, in ICDM 2007.
     [2] Assent, Krieger, Müller and Seidl: INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy, in ICDM 2008.
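
The pairwise redundancy condition can be written down almost literally. The sketch below is an illustration under assumptions, not the DUSC or INSCY implementation: the class `SubspaceCluster`, the function names and the threshold value `r = 0.5` are chosen here for the example, with `r` standing in for the parameter R from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubspaceCluster:
    objects: frozenset   # O ⊆ DB
    subspace: frozenset  # S ⊆ DIM

def is_redundant(c, others, r):
    """True if some (O', S') has O' ⊆ O, S' ⊃ S and |O'| >= r * |O|."""
    return any(
        o.objects <= c.objects                      # O' ⊆ O
        and o.subspace > c.subspace                 # S' ⊃ S (proper superset)
        and len(o.objects) >= r * len(c.objects)    # |O'| >= R * |O|
        for o in others
    )

def non_redundant(clusters, r):
    """Keep only the clusters that pass the pairwise redundancy test."""
    return [c for c in clusters if not is_redundant(c, clusters, r)]

# Hypothetical toy clusters; object ids and attribute names are made up.
low = SubspaceCluster(frozenset({1, 2, 3, 4, 5}), frozenset({"income"}))
high = SubspaceCluster(frozenset({1, 2, 3, 4}), frozenset({"income", "boats"}))
wide = SubspaceCluster(frozenset(range(1, 11)), frozenset({"cars"}))
narrow = SubspaceCluster(frozenset({1, 2}), frozenset({"cars", "horses"}))

result = non_redundant([low, high, wide, narrow], r=0.5)
```

Here `low` is dropped because `high` covers 4 of its 5 objects in a higher-dimensional subspace, while `wide` is kept because `narrow` covers only 2 of its 10 objects.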

  7. Generalization of Redundancy Elimination
     Relevant subspace clustering model [3]:
       Include the most interesting subspace clusters
       Exclude redundant subspace clusters
     ⇒ Provide most relevant subspace clusters in result set
     ⇒ Extract novel knowledge with each cluster
     [Figure: relevance model selecting the relevant clustering M from all possible clusters ALL, based on interestingness of clusters and redundancy of clusters]
     Given any definition of subspace clusters C = (O, S)
     ⇒ Choose optimal subset M = { C_1, ..., C_n } ⊆ ALL
     Proof: Such an optimization is an NP-hard problem
     [3] Müller, Assent, Günnemann, Krieger and Seidl: Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data, in ICDM 2009.
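
Since choosing the optimal subset is NP-hard, a heuristic is a natural way to illustrate the trade-off between interestingness and redundancy. The sketch below is a generic greedy selection and not the algorithm of [3]. The scoring function (cluster size times dimensionality), the overlap-based redundancy test and the `max_overlap` threshold are all assumptions made for this example.

```python
def greedy_relevant_subset(all_clusters, interestingness, max_overlap):
    """Pick clusters by descending interestingness, skipping any cluster
    whose objects are already covered beyond the max_overlap fraction.
    Clusters are (objects, subspace) pairs with non-empty object sets."""
    selected, covered = [], set()
    for objects, subspace in sorted(all_clusters, key=interestingness, reverse=True):
        overlap = len(covered & objects) / len(objects)
        if overlap <= max_overlap:
            selected.append((objects, subspace))
            covered |= objects
    return selected

def size_times_dims(cluster):
    """Assumed interestingness: number of objects times number of dimensions."""
    objects, subspace = cluster
    return len(objects) * len(subspace)

# Hypothetical toy clusters; object ids and attribute names are made up.
a = (frozenset({1, 2, 3, 4}), frozenset({"income", "boats"}))
b = (frozenset({1, 2, 3}), frozenset({"income"}))
c = (frozenset({5, 6, 7}), frozenset({"cars", "horses"}))

m = greedy_relevant_subset([a, b, c], size_times_dims, max_overlap=0.5)
```

With these inputs the result contains `a` and `c`; `b` is skipped because all of its objects are already covered by `a`. A greedy pass like this gives no optimality guarantee.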

  8. Redundancy Pruning by Depth-First Processing
     Pruning:
       Applicable for local redundancy (simple pairwise model)
       Enables in-process pruning of redundant clusters
     [Figure: two copies of the subspace lattice over dimensions 1 to 4, one traversed depth-first, one breadth-first]
     Step-by-step processing (k-D → (k+1)-D subspace!)
     ⇒ Scalability to high dimensional data?
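
The difference between the two traversal orders can be seen on the 4-dimensional lattice from the slide. The sketch below is one possible way to enumerate the lattice and does not reproduce the INSCY index structure. It assumes a post-order depth-first walk over sorted dimension tuples, which has the property that every superset of a subspace is visited before the subspace itself, so higher-dimensional clusters are known by the time a lower-dimensional candidate is checked for redundancy.

```python
from itertools import combinations

def breadth_first(dims):
    """Enumerate subspaces level by level: all k-D before any (k+1)-D."""
    for k in range(1, len(dims) + 1):
        yield from combinations(dims, k)

def depth_first(dims, prefix=()):
    """Enumerate subspaces in post-order: extend a subspace as far as
    possible first, then report it on the way back."""
    for i, d in enumerate(dims):
        current = prefix + (d,)
        yield from depth_first(dims[i + 1:], current)
        yield current

dims = (1, 2, 3, 4)
bfs = list(breadth_first(dims))
dfs = list(depth_first(dims))

print(bfs[:5])  # starts with the 1-D subspaces
print(dfs[:5])  # starts with the full-dimensional subspace
```

Both orders visit the same 15 subspaces. Breadth-first reaches (1, 2, 3, 4) last, while this depth-first order reaches it first.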

  9. Scalable Subspace Processing
     [Figure: subspace lattice over dimensions 1 to 4 with a direct jump; 2-D grid over Dimension 1 and Dimension 2 with interval 1 and interval 2]
     Key idea: density estimation + steered jumps
       Subspace clusters are represented by many low dimensional projections
       Use 2-D projections to estimate density in higher subspace regions [4]
       Use k-D projections to jump directly to (k+x)-D subspaces [x ≫ 1]
       Best-first search: Intelligent steering to promising subspace regions
     [4] Müller, Assent, Krieger, Günnemann and Seidl: DensEst: Density Estimation for Data Mining in High Dimensional Spaces, in SDM 2009.
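
The intuition behind using 2-D projections can be shown with a simple bound: a point inside a higher-dimensional region lies inside each of its 2-D projections, so the sparsest 2-D projection limits the count in the full region. The sketch below only illustrates this bound and is not the DensEst estimator of [4]. The function names, the interval-based region format and the toy data are assumptions for this example.

```python
from itertools import combinations

def count_in_region(points, region):
    """Count points inside the axis-parallel region {dim: (low, high)}."""
    return sum(
        all(low <= p[d] <= high for d, (low, high) in region.items())
        for p in points
    )

def upper_bound_from_2d(points, region):
    """Bound the count in `region` (at least 2 dimensions) by the
    sparsest of its 2-D projections."""
    return min(
        count_in_region(points, {a: region[a], b: region[b]})
        for a, b in combinations(sorted(region), 2)
    )

# Hypothetical 3-D toy data and one interval per dimension.
points = [
    (0.1, 0.1, 0.1),
    (0.2, 0.2, 0.9),
    (0.1, 0.9, 0.1),
    (0.9, 0.1, 0.2),
    (0.2, 0.1, 0.2),
]
region = {0: (0.0, 0.5), 1: (0.0, 0.5), 2: (0.0, 0.5)}

bound = upper_bound_from_2d(points, region)
exact = count_in_region(points, region)
print(bound, exact)  # prints 3 2
```

A bound of this kind could serve as the priority in a best-first search, so that regions whose 2-D projections are already sparse are explored late or not at all. In practice the 2-D counts would be precomputed rather than rescanned for every region.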

  10. Challenges in Evaluation and Exploration
      General challenge for clustering: No ground truth available for clustering
      ⇒ Subjective evaluation by exploration requires visualization techniques and interactive exploration tools
      ⇒ Objective evaluations are incomparable using different implementations, databases and quality measures
      ⇒ We provide broad evaluation study & interactive exploration framework
      Evaluation Study [5]:
        Characterization of major paradigms
        Providing comparable baseline implementations
        Evaluation based on broad set of data sets, quality measures and parameter settings
      [5] Müller, Günnemann, Assent and Seidl: Evaluating Clustering in Subspace Projections of High Dimensional Data, in VLDB 2009.

  11. Open Source Framework
      OpenSubspace framework:
        Framework for research, education and application [6][7][8][9]
        Baselines for algorithm and evaluation measure development
      [Figure: OpenSubspace with a unified algorithm repository (algo. A, B, C, D) and a unified evaluation repository (eval. 1, 2, 3); labels: re-implementation; rare case: common implementation]
      http://dme.rwth-aachen.de/OpenSubspace/
      [6] Müller, Assent, Krieger, Jansen and Seidl: Morpheus: Interactive Exploration of Subspace Clustering, in KDD 2008.
      [7] Assent, Müller, Krieger, Jansen and Seidl: Pleiades: Subspace Clustering and Evaluation, in PKDD 2008.
      [8] Günnemann, Färber, Kremer and Seidl: CoDA: Interactive Cluster Based Concept Discovery, in VLDB 2010.
      [9] Müller, Schiffer, Gerwert, Hannen, Jansen and Seidl: SOREX: Subspace Outlier Ranking Exploration Toolkit, in PKDD 2010.

  12. Conclusion and Future Work
      Subspace clustering is still an emerging research field...
      Is the basis for a lot of further research:
        Alternative subspace clustering
        Evaluation measures for subspace clustering
        Benchmark databases for subspace clustering
        ...

  13. Thank you for your attention. Questions?
