Clusterability in Model Selection (Johannes Kiesel)


  1. Clusterability in Model Selection. Johannes Kiesel, Bauhaus-Universität Weimar, 28th May, 2014

  2. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it. [Figure: categories Art and Design, Computer Science, Media Studies]

  3. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it (without labels). [Figure: categories marked "?"]

  4. Cluster Analysis: Motivation. Data categorization: given data (a set of comparable entities or objects), find a categorization of it (without labels). [Figure labels: D, R, A, W, E, Y]

  5. Cluster Analysis: In the Beginning was the Data

  6. Cluster Analysis: Modeling. [Figure: Data → Model, with the features Age, Fashion index, XKCD/week, Library (h/day), Sketches/day]

  8. Cluster Analysis: Clustering. [Figure: Data → Model → Clustering, with a clustering algorithm producing the clustering]

  9. Cluster Analysis: Clustering. [Figure: Data → Model → Clustering, with a clustering algorithm producing the clustering, which is read as a data categorization; labels D, R, A, W, E, Y]

  12. Cluster Analysis: Modeling II. [Figure: Data → Model, with the features Age, Fashion index, XKCD/week, Library (h/day), Sketches/day]

  13. Cluster Analysis: Modeling II. [Figure: Data → Model, with the features Age, Nose length (cm), Weight (kg), Height (cm), Student ID]

  14. Cluster Analysis: Modeling II. [Figure: Data → Model → Clustering, with a clustering algorithm producing the clustering, which is read as a data categorization]

  16. Cluster Analysis: Cluster Evaluation. [Figure: Model → Clustering algorithm → Clustering]

  17. Cluster Analysis: Cluster Evaluation. An evaluation index scores the clustering by its separation and cohesiveness. [Figure: Model → Clustering algorithm → Clustering → Evaluation index]

  18. Cluster Analysis: Cluster Evaluation. An evaluation index scores the clustering by its separation and cohesiveness. Example result: Test good (2.0).

  19. Cluster Analysis: Cluster Evaluation. An evaluation index scores the clustering by its separation and cohesiveness. Example result: Test bad (0.0).

  22. Cluster Analysis: Model Evaluation. A clusterability index scores the model itself, before clustering. Example result: Test bad (0.0).

  23. Cluster Analysis: Overview. Components: clustering algorithm(s), clusterability index, evaluation index. [Figure: example test scores 1.2, 1.4, 4.2, 0.6, 2.3, 0.8, 1.3, 2.0, 0.9, 1.0]

  24. Clusterability
      ◮ Task: calculate a score for a model
      ◮ The score has to be comparable at least among similar models (same number of objects)
      ◮ A clusterable model (high score) has a dominant structure of mutually separated parts that are cohesive groups of objects. [Figure: example model, Test (4.2)]

  25. Clusterability I: Salient Clustering
      Idea: model selection by cluster evaluation ("one-step")
      ◮ Cluster the model with different algorithms and/or parameter settings
      ◮ Evaluate all clusterings
      ◮ Choose the best combination of model and clustering
      [Figure: two-step → one-step]

  26. Clusterability I: Dunn Index
      Evaluation family: Dunn index. Variant: minimum spanning tree Dunn index (Dunn MST).
      Dunn MST index = min(smallest dissimilarity of objects from different clusters) / max(largest edge length in the minimum spanning tree of the cluster)
      The optimum clustering is feasibly computable (no other clustering algorithm necessary).
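
The Dunn MST index on this slide can be illustrated with a short Python sketch. It is a minimal sketch under assumptions that are not in the talk: Euclidean distance is used as the dissimilarity, a clustering is a list of point lists, and the names `mst_largest_edge` and `dunn_mst` are hypothetical. It only scores a given clustering; it does not search for the optimum clustering.

```python
from itertools import combinations
from math import dist


def mst_largest_edge(points):
    """Largest edge length in the minimum spanning tree of `points` (Prim's algorithm)."""
    if len(points) < 2:
        return 0.0
    # best[i] = smallest distance from point i to the tree built so far
    best = {i: dist(points[0], points[i]) for i in range(1, len(points))}
    largest = 0.0
    while best:
        nxt = min(best, key=best.get)           # closest point outside the tree
        largest = max(largest, best.pop(nxt))   # the edge that attaches it
        for i in best:
            best[i] = min(best[i], dist(points[nxt], points[i]))
    return largest


def dunn_mst(clusters):
    """Smallest between-cluster dissimilarity divided by the largest
    within-cluster MST edge (assumption: Euclidean dissimilarity)."""
    separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    diameter = max(mst_largest_edge(c) for c in clusters)
    return separation / diameter


clusters = [[(0, 0), (1, 0), (0, 1)], [(10, 0), (11, 0)]]
print(dunn_mst(clusters))  # 9.0
```

In this toy clustering the closest pair from different clusters is 9 apart and every within-cluster MST edge has length 1, so the index is 9. The sketch divides by zero if every cluster is a single point; a real implementation would need to decide how to handle that case.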

  27. Clusterability I: Salient Clustering
      + Needs no additional clusterability index
      + Evaluation indices are better understood
      - Most evaluation indices require local optimization
      - Not all evaluation indices can compare clusterings of different models

  28. Clusterability II: Statistical Tests on Structure
      Idea: use a statistical test for unstructured models
      ◮ Null hypothesis: the model was generated from a model distribution that generates non-clusterable models (e.g., the uniform distribution)
      ◮ Calculate a test statistic with a known distribution under the null hypothesis
      ◮ Use the probability that a similarly large value occurs under the null hypothesis for the clusterability assessment

  29. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) with that of $r$ uniformly sampled points ($x^0$, null hypothesis). [Figure panels: spaced, uniform, clustered]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]

  30. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) with that of $r$ uniformly sampled points ($x^0$, null hypothesis).
      $\psi_{nn}(x)$: dissimilarity of $x$ to its nearest neighbor
      [Figure panels: spaced ($H_r \to 0$), uniform ($H_r \approx 0.5$), clustered ($H_r \to 1$)]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]

  31. Clusterability II: Hopkins and Skellam Statistic
      Compare the distribution of the original objects ($x$) with that of $r$ uniformly sampled points ($x^0$, null hypothesis).
      $$H_r = \frac{\sum_{i=1}^{r} \left(\psi_{nn}(x^0_i)\right)^m}{\sum_{i=1}^{r} \left(\psi_{nn}(x^0_i)\right)^m + \sum_{i=1}^{r} \left(\psi_{nn}(x_{\pi(i)})\right)^m}$$
      $\psi_{nn}(x)$: dissimilarity of $x$ to its nearest neighbor
      $m$: number of dimensions
      [Figure panels: spaced ($H_r \to 0$), uniform ($H_r \approx 0.5$), clustered ($H_r \to 1$)]
      [Hopkins and Skellam. A New Method for Determining the Type of Distribution of Plant Individuals. 1954]
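
A minimal Python sketch of the statistic $H_r$ may help make the formula concrete. Several choices here are assumptions, not statements from the slides: Euclidean distance serves as the dissimilarity, the uniform points are drawn from the axis-aligned bounding box of the data, $\pi$ picks $r$ distinct original objects at random, and the nearest neighbor of an original object is searched among the other original objects. The names `nn_dist` and `hopkins_skellam` are hypothetical.

```python
import random
from math import dist


def nn_dist(p, others):
    """Dissimilarity of p to its nearest neighbor among `others`."""
    return min(dist(p, q) for q in others)


def hopkins_skellam(objects, r, rng=None):
    """Sketch of H_r for a list of m-dimensional points (requires r <= len(objects))."""
    rng = rng or random.Random()
    m = len(objects[0])  # number of dimensions
    lo = [min(o[d] for o in objects) for d in range(m)]
    hi = [max(o[d] for o in objects) for d in range(m)]
    # r uniformly sampled points x^0 (assumption: sampled from the bounding box)
    sampled = [tuple(rng.uniform(lo[d], hi[d]) for d in range(m)) for _ in range(r)]
    u = sum(nn_dist(x0, objects) ** m for x0 in sampled)
    # r randomly selected original objects x_pi(i), each compared to the other objects
    chosen = rng.sample(range(len(objects)), r)
    w = sum(
        nn_dist(objects[i], objects[:i] + objects[i + 1:]) ** m for i in chosen
    )
    return u / (u + w)


rng = random.Random(0)
tight = [(rng.uniform(0, 0.01), rng.uniform(0, 0.01)) for _ in range(50)]
far = [(10 + rng.uniform(0, 0.01), 10 + rng.uniform(0, 0.01)) for _ in range(50)]
print(hopkins_skellam(tight + far, r=10, rng=rng))  # close to 1 for clustered data
```

With two tight, far-apart groups, the sampled points lie far from any object while the objects lie very close to each other, so the statistic approaches 1, matching the "clustered" panel.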

  32. Clusterability II: Statistical Tests on Structure
      + The distribution under the null hypothesis allows for an interpretation of the score
      + Often requires only a sample
      - Depends heavily on the null hypothesis
      - Adjustment of statistics is not trivial
      [Figure: probability density function of the $\beta_{r,r}$-distribution of $H_r$ under the uniform distribution]
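
To illustrate how a known null distribution lets a score be interpreted, the following sketch computes tail probabilities of a Beta($r$, $r$) distribution, which is the distribution the slide's figure names for $H_r$ under the uniform null hypothesis. It assumes an integer $r$ and uses the standard identity that expresses the Beta CDF with integer parameters as a binomial sum; the function names are hypothetical.

```python
from math import comb


def beta_rr_cdf(h, r):
    """P(H <= h) for H ~ Beta(r, r) with integer r, via the binomial-sum identity."""
    n = 2 * r - 1
    return sum(comb(n, j) * h**j * (1 - h) ** (n - j) for j in range(r, n + 1))


def upper_tail(h, r):
    """Probability of a value at least as large as h under the null hypothesis."""
    return 1.0 - beta_rr_cdf(h, r)


print(upper_tail(0.5, 10))  # about 0.5: unremarkable under the null hypothesis
print(upper_tail(0.9, 10))  # very small: strong evidence against uniformity
```

A value of $H_r$ near 0.5 is then unsurprising under the null hypothesis, whereas a value near 1 has a very small upper-tail probability and points towards clustered data.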

  33. Clusterability III: Concentration of Dissimilarities
      Idea: in a clusterable model, most object pairs should be either very dissimilar (different clusters) or very similar (same cluster).
      ◮ Test if relatively few dissimilarities are of average size
      [Figure: similarity histogram over similarity ϕ from 0 to 1, with separation at the low end and cohesiveness at the high end]

  34. Clusterability III: Dash et al. score
      [Figure: similarity histograms over similarity ϕ for spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  35. Clusterability III: Dash et al. score
      Weighting function: $1 - (\varphi \cdot \log_2(\varphi) + (1 - \varphi) \cdot \log_2(1 - \varphi))$
      [Figure: similarity histogram and weighting function over similarity ϕ for spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  36. Clusterability III: Dash et al. score
      Weighting function: $1 - (\varphi \cdot \log_2(\varphi) + (1 - \varphi) \cdot \log_2(1 - \varphi))$
      [Figure: similarity histogram and weighted similarity histogram over similarity ϕ for spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  37. Clusterability III: Dash et al. score
      Weighting function: $1 - (\varphi \cdot \log_2(\varphi) + (1 - \varphi) \cdot \log_2(1 - \varphi))$
      [Figure: weighted similarity histogram and resulting clusterability score for spaced, uniform, and clustered models]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]
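
The histogram-weighting idea can be sketched in Python as follows. This is an illustration under several assumptions that the slides do not confirm: the weight is taken as one minus the binary entropy of ϕ (so similarities near 0 or 1 get weight close to 1 and similarities near 0.5 get weight 0), similarities are derived from Euclidean distances by an exponential decay scaled so that the mean distance maps to 0.5, and the score is the mean weight over all object pairs. The names `weight` and `concentration_score` are hypothetical.

```python
from itertools import combinations
from math import dist, exp, log, log2


def weight(phi):
    """One minus the binary entropy of phi (assumed sign convention)."""
    if phi <= 0.0 or phi >= 1.0:
        return 1.0
    return 1.0 + phi * log2(phi) + (1.0 - phi) * log2(1.0 - phi)


def concentration_score(objects):
    """Mean weight over all pairwise similarities; higher means fewer
    similarities of average size."""
    dists = [dist(a, b) for a, b in combinations(objects, 2)]
    mean = sum(dists) / len(dists)
    alpha = log(2.0) / mean  # assumption: the mean distance maps to similarity 0.5
    sims = [exp(-alpha * d) for d in dists]
    return sum(weight(s) for s in sims) / len(sims)


clustered = [(i * 0.001, 0.0) for i in range(10)] + [(10 + i * 0.001, 10.0) for i in range(10)]
grid = [(float(x), float(y)) for x in range(5) for y in range(4)]
print(concentration_score(clustered) > concentration_score(grid))  # True
```

In the clustered example nearly half of the pairs have similarity close to 1, which the weighting rewards, while the grid spreads its similarities around the middle of the histogram and scores lower.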

  38. Clusterability III: Concentration of Dissimilarities
      + Very general idea
      + Related to the concept of intrinsic dimensionality
      - Not clear when the heuristic used (see figure) applies
      - Lacks the interpretability of statistical tests
      [Figure: similarity histogram over similarity ϕ from 0 to 1, with separation at the low end and cohesiveness at the high end]
      [Dash et al. Dimensionality Reduction for Unsupervised Data. 1997]

  39. Clusterability: Overview
      ◮ A clusterable model has a dominant structure of mutually separated parts that are cohesive groups of objects. [Figure: example model, Test (4.2)]
      ◮ Clusterability is related to various other topics in data analysis:
        ◮ Evaluation indices (Dunn)
        ◮ Tests on model distributions (Hopkins and Skellam)
        ◮ Methods of unsupervised feature selection (Dash et al.)
        ◮ Estimators of intrinsic dimensionality
        ◮ ...?

  40. Experiment: Synthetic Models
      Can the clusterability indices identify clusterable models?
      Experiment setup:
      ◮ 10 model distributions of varying intuitive clusterability
        1 model from the uniform distribution
      ◮ 1 000 models per distribution (results are means)
      ◮ 180 2-dimensional objects per model

  41. Experiment: Synthetic Models. [Figure panels: s = 0, s = 0.1, s = 0.2, s = 0.3]

  42. Experiment: Synthetic Models. [Figure panels: s = 0, s = 0.1, s = 0.2, s = 0.3; label: symbol]

  43. Experiment: Synthetic Models
      [Figure: mean clusterability plotted against s (0 to 0.3) in three panels: Dunn MST [1], Hopkins and Skellam [2], Dash et al.]
      [1] Limited to clusterings with 13 or fewer clusters
      [2] Mean of 1 000 applications per model
