Stability Analysis for Unsupervised Learning
Dr. Derek Greene, Insight @ UCD
April 2014
derek.greene@ucd.ie
Introduction

Cluster validation: a quantitative means of assessing the quality of a clustering.

[Diagram: clustering pipeline with stages labelled Dataset, Pre-Processing, Clustering Algorithm, Model, Clustering, and Cluster Validation, plus a Feedback loop.]

• A common application is model selection, i.e. identifying optimal algorithm parameter values.

Q. How many clusters k are in a given dataset?

[Figure: the same dataset clustered with k = 2 and with k = 3.]

Insight Machine Learning Workshop
Common Validation Strategies

External validation
• Assess agreement between a test clustering and a “gold standard” clustering.
• Approaches: count pairwise agreements; match corresponding clusters; information-theoretic agreement.
• Examples: Jaccard, Rand Index, Normalised Mutual Information
✓ Useful for developing and verifying algorithms.
✗ Not directly applicable in real unsupervised tasks.

Internal validation
• Compare solutions based on the goodness of fit between a clustering and the raw data.
• Approaches: intra-cluster similarity, inter-cluster separation, …
• Examples: Dunn’s index, DB index, Silhouette score
✓ Does not require a gold standard clustering.
✗ Can only make comparisons between clusterings generated using the same model/metric.
✗ Often makes assumptions about cluster structure.
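To make the "count pairwise agreements" approach concrete, the following is a minimal sketch of two pair-counting indices, the Rand index and the Jaccard index. The function names and the toy label lists are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Classify every pair of points by whether each clustering groups it together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same points")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else 1.0

def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignores pairs separated in both clusterings."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0

# Hypothetical toy example: a gold standard and a test clustering of six points.
gold = [0, 0, 0, 1, 1, 1]
test = [0, 0, 1, 1, 1, 1]
print(round(rand_index(gold, test), 3))     # 0.667
print(round(jaccard_index(gold, test), 3))  # 0.444
```

Both indices depend only on which points are grouped together, so relabelling the clusters (for example swapping 0 and 1) leaves the scores unchanged.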
Stability Analysis

Stability: the tendency of a clustering algorithm to produce similar clusterings on data originating from the same source.

• Use an approach analogous to cross-validation…
• Evaluate similarity for each model over multiple runs, and select the model resulting in the highest level of stability.

[Diagram: the original dataset is clustered repeatedly by a clustering model to give a collection of base clusterings, which a stability criterion evaluates to produce a model stability score (0.???).]

• A high level of similarity between the collection of clusterings ⇒ the model is appropriate for the dataset.
Measuring Stability

Stability Analysis Based on Resampling (Levine & Domany, 2001)
• Evaluate the pairwise similarity between a collection of clusterings of resampled data.

For each number of clusters k ∈ [k_min, k_max]:
1. Apply the algorithm to generate clusterings on random samples of the complete dataset.
2. Assess the similarity between each pair of clusterings using an external validation index.
3. Stability(k) = mean pairwise similarity.
➡ Select the model resulting in maximum stability.
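The resampling procedure can be sketched roughly as follows. This is an illustration under stated assumptions rather than the method of Levine & Domany: it uses a plain k-means as the clustering algorithm, the Rand index as the external validation index, and compares each pair of clusterings only on the points that both samples share. The function names, the 80% sampling fraction, and the synthetic two-group dataset are all hypothetical.

```python
import random
from itertools import combinations

def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs)
    return agree / len(pairs)

def resampling_stability(points, k, n_samples=10, fraction=0.8, seed=0):
    """Mean pairwise Rand index between clusterings of random subsamples."""
    rng = random.Random(seed)
    size = int(fraction * len(points))
    runs = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(points)), size)
        labels = kmeans([points[i] for i in idx], k, rng)
        runs.append(dict(zip(idx, labels)))  # point index -> cluster label
    scores = []
    for a, b in combinations(runs, 2):
        shared = sorted(a.keys() & b.keys())  # compare on common points only
        scores.append(rand_index([a[i] for i in shared], [b[i] for i in shared]))
    return sum(scores) / len(scores)

# Hypothetical dataset: two tight, well-separated groups of 20 points each.
points = [(0.1 * i, 0.1 * (i % 5)) for i in range(20)] + \
         [(100 + 0.1 * i, 0.1 * (i % 5)) for i in range(20)]

for k in (2, 3, 4):
    print(k, round(resampling_stability(points, k), 3))
```

On data this cleanly separated, k = 2 recovers the same two groups in every sample, so its stability is 1.0; larger k must split a group arbitrarily and cannot score higher. Note that the plain Rand index is not corrected for chance, so a corrected index may be preferable when comparing across a wide range of k.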
Measuring Stability

Prediction-Based Validation (Tibshirani et al., 2001)
• Assess the degree to which a model allows us to construct a classifier on a training set that will successfully predict a clustering of the test set.

- Randomly generate τ training/test splits.
- For each number of clusters k ∈ [k_min, k_max]:
  ‣ For each split:
    1. Apply the clustering algorithm to the training set.
    2. Predict assignments for the test set.
    3. Apply the clustering algorithm to the test set.
    4. Evaluate classification accuracy.
  ‣ Stability(k) = mean classification accuracy.
➡ Select the model resulting in maximum stability.
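The four steps per split might be sketched as below. This is a simplified illustration, not the prediction strength statistic exactly as defined by Tibshirani et al.: it assumes k-means as the clustering algorithm, a nearest-centroid classifier for prediction, and accuracy computed under the best one-to-one relabelling of clusters (a brute-force search that is only practical for small k). All function names, the 50/50 split, and the synthetic dataset are hypothetical.

```python
import random
from itertools import permutations

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

def matched_accuracy(predicted, clustered, k):
    """Accuracy under the best one-to-one relabelling of the predicted clusters."""
    best = 0
    for perm in permutations(range(k)):
        hits = sum(perm[p] == c for p, c in zip(predicted, clustered))
        best = max(best, hits)
    return best / len(predicted)

def prediction_stability(points, k, n_splits=10, seed=0):
    """Mean matched accuracy over n_splits random training/test splits."""
    rng = random.Random(seed)
    half = len(points) // 2
    scores = []
    for _ in range(n_splits):
        shuffled = rng.sample(points, len(points))
        train, test = shuffled[:half], shuffled[half:]
        centroids, _ = kmeans(train, k, rng)               # 1. cluster the training set
        predicted = [nearest(p, centroids) for p in test]  # 2. predict test assignments
        _, clustered = kmeans(test, k, rng)                # 3. cluster the test set
        scores.append(matched_accuracy(predicted, clustered, k))  # 4. accuracy
    return sum(scores) / len(scores)

# Hypothetical dataset: two tight, well-separated groups of 20 points each.
points = [(0.1 * i, 0.1 * (i % 5)) for i in range(20)] + \
         [(100 + 0.1 * i, 0.1 * (i % 5)) for i in range(20)]

for k in (2, 3):
    print(k, round(prediction_stability(points, k), 3))
```

Here `n_splits` plays the role of τ. Because cluster labels are arbitrary, the predicted and test clusterings have to be aligned before accuracy is meaningful, which is why the sketch searches over label permutations.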
Prediction-Based Validation

Example of applying prediction-based validation to examine the suitability of k = 2 for a small synthetic dataset:

[Figure, with µ1 and µ2 marked on the plots: (a) Full dataset; (b) Training clustering (C_a); (c) Test clustering (C_b); (d) Predicted clustering (P_b).]
Prediction-Based Validation

Example of applying prediction-based validation to examine the suitability of k = 2 and k = 3 for a corpus of newsgroup documents:

[Figure: scatter plots with axes PC1 and PC2: (a) Training (k = 2); (b) Testing (k = 2); (c) Training (k = 3); (d) Testing (k = 3).]
Stability Analysis in Topic Modeling

Q. How many topics are in an unlabelled text corpus?

• Proposal:
  ‣ Generate topics on samples of the corpus.
  ‣ Use stability analysis, but take a term-centric approach to agreement, focusing on the highest-ranked terms for each topic.
  ‣ Higher agreement between terms ➢ a more stable topic model.

Example for k = 2: low agreement between top-ranked terms, so low stability for k = 2.

        Run 1                      Run 2
Rank    Topic 1      Topic 2       Topic 1    Topic 2
1       oil          win           cup        first
2       bank         players       labour     sales
3       election     minister      growth     year
4       policy       party         team       minister
5       government   ireland       senate     firm
6       match        club          minister   match
7       senate       year          ireland    coalition
8       democracy    election      players    team
9       firm         coalition     year       election
10      team         first         economy    policy
Stability Analysis in Topic Modeling

Example for k = 3: high agreement between top-ranked terms, so high stability for k = 3.

        Run 1                               Run 2
Rank    Topic 1    Topic 2    Topic 3       Topic 1    Topic 2    Topic 3
1       growth     game       labour        game       growth     labour
2       company    ireland    election      win        company    election
3       market     win        vote          ireland    market     government
4       economy    cup        party         cup        economy    party
5       bank       goal       government    match      bank       vote
6       year       match      coalition     team       shares     policy
7       firm       team       minister      first      year       minister
8       sales      first      policy        players    firm       democracy
9       shares     club       democracy     club       sales      senate
10      oil        players    first         goal       oil        coalition
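One way to turn the term-centric idea into a number is sketched below, using the top-10 term lists from the k = 2 and k = 3 examples. The slides do not specify the agreement measure, so this is an assumption: it scores each pair of topics by the Jaccard overlap of their top-term sets (ignoring rank order) and averages over the best one-to-one matching of topics between the two runs. The function names are hypothetical.

```python
from itertools import permutations

def jaccard(terms_a, terms_b):
    """Set overlap between two top-term lists (rank order is ignored)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)

def term_agreement(run_a, run_b):
    """Mean Jaccard overlap under the best one-to-one matching of topics."""
    k = len(run_a)
    return max(
        sum(jaccard(run_a[i], run_b[j]) for i, j in enumerate(perm)) / k
        for perm in permutations(range(k))
    )

# Top-10 terms per topic, copied from the k = 2 example.
k2_run1 = [
    "oil bank election policy government match senate democracy firm team".split(),
    "win players minister party ireland club year election coalition first".split(),
]
k2_run2 = [
    "cup labour growth team senate minister ireland players year economy".split(),
    "first sales year minister firm match coalition team election policy".split(),
]

# Top-10 terms per topic, copied from the k = 3 example.
k3_run1 = [
    "growth company market economy bank year firm sales shares oil".split(),
    "game ireland win cup goal match team first club players".split(),
    "labour election vote party government coalition minister policy democracy first".split(),
]
k3_run2 = [
    "game win ireland cup match team first players club goal".split(),
    "growth company market economy bank shares year firm sales oil".split(),
    "labour election government party vote policy minister democracy senate coalition".split(),
]

print(round(term_agreement(k2_run1, k2_run2), 3))  # 0.292
print(round(term_agreement(k3_run1, k3_run2), 3))  # 0.939
```

Under this measure the k = 3 runs agree far more strongly than the k = 2 runs, matching the low and high stability labels on the two examples. A rank-weighted measure would be a natural refinement, since this sketch treats a term at rank 1 and rank 10 alike.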
Summary

• Common strategies for model selection in clustering often fail or exhibit strong biases.
• Analogous to cross-validation in supervised learning, stability analysis can be applied to choose between models for clustering.
  ✓ Does not exhibit the biases of classical measures.
  ✓ Can be used to compare the output of different algorithms run on different representations.
• Drawbacks? Can be computationally expensive…
  ✓ We can measure the stability of “weak” models.
  ✓ Information from multiple runs can subsequently be used to build ensemble models.
References

• Levine, E. & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13.
• Tibshirani, R., Walther, G., Botstein, D. & Brown, P. (2001). Cluster validation by prediction strength. Tech. rep., Dept. Statistics, Stanford University.
• Lange, T., Roth, V., Braun, M.L. & Buhmann, J.M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16.
• Greene, D. & Cunningham, P. (2006). Efficient prediction-based validation for document clustering. In Proc. 17th European Conference on Machine Learning (ECML’06).