Using Local Neighborhoods to Find Subspace Clusters
Emin Aksehirli, with Bart Goethals, Emmanuel Müller and Jilles Vreeken
High Dimensional Data
Problem Setting
• Preserve local neighborhoods
• Combine different views on the data
• Produce explainable results
Transformation
• Cartification: High Dimensional DB → Itemset (Transaction) DB
• Clustering on the High Dimensional DB → Subspace Clusters
• FIM on the Itemset (Transaction) DB → Frequent Patterns
Cartification
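The slides show cartification only as figures, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes that each object contributes one "cart" (transaction) per dimension, holding the ids of its k nearest neighbors along that dimension, itself included. The function name `cartify`, the parameter `k`, and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def cartify(data: Sequence[Sequence[float]], dim: int, k: int) -> List[FrozenSet[int]]:
    """Return one cart per object: the ids of its k nearest neighbors along `dim`.

    Assumption: distance is the absolute difference in a single dimension,
    and an object counts as its own nearest neighbor.
    """
    ids = range(len(data))
    carts = []
    for i in ids:
        # Sort all objects by distance to object i in this dimension;
        # ties are broken by object id so the result is deterministic.
        nearest = sorted(ids, key=lambda j: (abs(data[j][dim] - data[i][dim]), j))[:k]
        carts.append(frozenset(nearest))
    return carts


# Toy data: objects 0, 1, 2 are close in dimension 0; objects 3 and 4 are close to each other.
points = [(0.10, 5.0), (0.20, 1.0), (0.15, 9.0), (0.90, 5.1), (0.95, 0.5)]
carts_dim0 = cartify(points, dim=0, k=3)
for obj_id, cart in enumerate(carts_dim0):
    print(obj_id, sorted(cart))
```

Objects that are close in a dimension end up sharing carts, so the transaction database keeps the local neighborhood structure while discarding the raw coordinates.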
Frequent Itemset Mining [figure labels: Cartified DB, Original DB, FIs]
Cartification
• Frequent Itemset Mining solves our problem.
• It is not scalable.
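To make the two points above concrete, here is a naive level-wise (Apriori-style) frequent itemset miner over a small set of carts. It is an illustrative sketch, not the miner used by the authors. The function name `frequent_itemsets`, the `min_support` threshold, and the example carts are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(carts: List[FrozenSet[int]], min_support: int) -> Dict[FrozenSet[int], int]:
    """Return every itemset contained in at least `min_support` carts, with its support."""
    items = sorted(set().union(*carts))
    frequent: Dict[FrozenSet[int], int] = {}
    candidates = [frozenset([item]) for item in items]
    size = 1
    while candidates:
        # Count how many carts contain each candidate itemset.
        counts = {c: sum(1 for cart in carts if c <= cart) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build candidates of the next size by joining frequent itemsets of this size.
        survivors = list(level)
        candidates = list({a | b for a, b in combinations(survivors, 2) if len(a | b) == size + 1})
        size += 1
    return frequent


# Hypothetical carts: objects 0, 1, 2 co-occur in three carts.
carts = [
    frozenset({0, 1, 2}),
    frozenset({0, 1, 2}),
    frozenset({0, 1, 2}),
    frozenset({1, 3, 4}),
    frozenset({1, 3, 4}),
]
result = frequent_itemsets(carts, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The object set {0, 1, 2} surfaces as a frequent itemset, which is the sense in which mining the carts recovers a cluster. Every subset of a frequent itemset is frequent too, so the number of patterns grows exponentially with cluster size, which is consistent with the scalability concern on the slide.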
Take 2
Uniform vs. Clusters
Running example [figure: dimensions Dim 1 to Dim 4 examined in turn, each marked ✓ or ✗]
Experiments [figure: F1 score (0.0 to 1.0) for Our Method, CartiClus, FIRES, PROCLUS, STATPC and SUBCLU; x-axis ticks 1, 2, 4, 8, 16, 32, 100, 200]
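The experiments report F1 scores, but the slides do not say how found clusters are matched to ground-truth clusters. As a reminder of the underlying formula only, the sketch below computes F1 between one found cluster and one true cluster, both given as sets of object ids. The function name `f1_score` and the example sets are hypothetical.

```python
from typing import Set


def f1_score(found: Set[int], truth: Set[int]) -> float:
    """Harmonic mean of precision and recall between two sets of object ids."""
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)  # share of found objects that are correct
    recall = overlap / len(truth)     # share of true objects that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 of the 4 found objects belong to a true cluster of size 5.
score = f1_score({0, 1, 2, 3}, {0, 1, 2, 4, 5})
print(round(score, 3))
```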
Experiments [figure: run time in seconds (log scale, up to 10000) over S1500, S2500, S3500, S4500, S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC and SUBCLU]
Experiments [figure: run time in seconds (log scale, up to 10000) over D5, D10, D15, D25, D50, D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC and SUBCLU]
Real World – MovieLens
Star Wars: A New Hope (a.k.a. Star Wars) (1977); Star Wars: The Empire Strikes Back (1980); Star Wars: Return of the Jedi (1983); LotR: The Fellowship of the Ring, The (2001); LotR: The Two Towers, The (2002); LotR: The Return of the King, The (2003); Back to the Future (1985); Terminator, The (1984); Terminator 2: Judgment Day (1991); Die Hard (1988); Terminator, The (1984); Terminator 2: Judgment Day (1991); Usual Suspects, The (1995); Pulp Fiction (1994); Silence of the Lambs, The (1991)
Real World – MovieLens
• Star Wars: A New Hope (1977); Star Wars: The Empire Strikes Back (1980); Star Wars: Return of the Jedi (1983); LotR: The Fellowship of the Ring, The (2001); LotR: The Two Towers, The (2002); LotR: The Return of the King, The (2003)
• Brazil (1985); Dr. Strangelove (1964); Clockwork Orange, A (1971); 2001: A Space Odyssey (1968); Blade Runner (1982); Alien (1979)
• Chinatown (1974); Rear Window (1954); North by Northwest (1959); Vertigo (1958); Psycho (1960); Silence of the Lambs, The (1991)
• Third Man, The (1949); Citizen Kane (1941); Godfather: Part II, The (1974); Chinatown (1974); Godfather, The (1972); Taxi Driver (1976)
Conclusion
• Preserves neighborhood information
• Combines different similarity measures gracefully
• Finds relevant features and discards noise
• Fast
• Produces explainable results
→ Code and data are available at our website.
Thank you!
Real World – Gene Expression
                Alon    Nutt
Our method      0.78    0.78
PROCLUS         0.46    0.49
FIRES           0.52    0.55
SUBCLU          0.58    n/a
STATPC          n/a     n/a
CartiClus       n/a     n/a
# of Objects    62      50
# of Dims       2000    1377
More Experiments [figure: F1 score (0 to 1) over S1500, S2500, S3500, S4500, S5500 for Our Method, CartiClus, FIRES, PROCLUS, STATPC and SUBCLU]
More Experiments [figure: F1 score (0 to 1) over D05, D10, D15, D25, D50, D75 for Our Method, CartiClus, FIRES, PROCLUS, STATPC and SUBCLU]
Experiments
• Evaluate:
  - Subspace cluster detection
  - Noise robustness
  - Scalability
• Competitors:
  - Subspace clustering: PROCLUS
  - Clustering: K-Means
  - Dimensionality reduction: PCA and Random Projection
  - Clustering ensemble: CSPA
Results: Quality of the found clusters [figure: F1 and E4SC scores (0 to 1) for Our Method, CSPA, PROCLUS, K-Means, PCA+KM and RP+KM; 10 clusters in 10 dimensions, 200 irrelevant dimensions]
• Very effective at finding relevant dimensions.