Using Local Neighborhoods to Find Subspace Clusters, Emin Aksehirli


  1. Using Local Neighborhoods to Find Subspace Clusters Emin Aksehirli with Bart Goethals, Emmanuel Müller and Jilles Vreeken

  2. High Dimensional Data [figure]

  3. High Dimensional Data [figure]

  4. High Dimensional Data [figure]

  5. High Dimensional Data [figure]

  6. High Dimensional Data [figure]

  7. Problem Setting
     • Preserve local neighborhoods
     • Combine different views on the data
     • Produce explainable results

  8. Transformation [diagram: cartification turns the High Dimensional DB into an Itemset (Transaction) DB; Clustering on the High Dimensional DB yields Subspace Clusters; FIM on the Itemset DB yields Frequent Patterns]
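
A minimal sketch of the cartification step, assuming the usual description of the technique: for each dimension, every object gets one transaction (a "cart") holding the indices of its k nearest neighbours in that dimension alone. The function name `cartify`, the parameter `k`, the tie-breaking rule, and the toy data are illustrative assumptions, not taken from the slides.

```python
from typing import List, Set


def cartify(data: List[List[float]], k: int) -> List[List[Set[int]]]:
    """Return, for every dimension, one cart per object.

    A cart is the set of indices of the k objects closest to the
    object in that single dimension (the object itself included).
    """
    n = len(data)
    n_dims = len(data[0]) if data else 0
    carts_per_dim = []
    for d in range(n_dims):
        carts = []
        for i in range(n):
            # Rank all objects by 1-D distance to object i; ties go to the lower index.
            order = sorted(range(n), key=lambda j: (abs(data[j][d] - data[i][d]), j))
            carts.append(set(order[:k]))
        carts_per_dim.append(carts)
    return carts_per_dim


# Toy data: 5 objects, 2 dimensions (hypothetical values).
points = [
    [0.0, 5.0],
    [1.0, 9.0],
    [2.0, 1.0],
    [50.0, 5.5],
    [51.0, 9.5],
]
carts = cartify(points, k=2)
```

Each dimension yields its own transaction database, so objects that are close in one view but far apart in another end up sharing carts only in the relevant dimension.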

  9. Cartification [figure]

  10. Cartification [figure]

  11. Cartification [figure]

  12. Cartification [figure]

  13. Cartification [figure]

  14. Cartification [figure]

  15. Frequent Itemset Mining [diagram: Original DB, Cartified DB, FIs (frequent itemsets)]

  16. Cartification
      • Frequent Itemset Mining solves our problem.
      • It is not scalable.
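
To make the scalability concern concrete, here is a small level-wise (Apriori-style) frequent itemset miner run on a handful of carts. It is a generic textbook sketch, not the miner used in the talk; `frequent_itemsets`, `min_support`, and the example carts are hypothetical names and data.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(
    transactions: List[Set[int]], min_support: int
) -> Dict[FrozenSet[int], int]:
    """Return every itemset contained in at least min_support transactions."""

    def support(itemset: FrozenSet[int]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result: Dict[FrozenSet[int], int] = {}
    while level:
        counted = {s: support(s) for s in level}
        frequent = {s: c for s, c in counted.items() if c >= min_support}
        result.update(frequent)
        # Join frequent itemsets that differ in exactly one item to form
        # the candidates of the next size.
        keys = list(frequent)
        level = list(
            {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        )
    return result


# Hypothetical carts: objects 0, 1, 2 co-occur often, as do objects 3 and 4.
example_carts = [{0, 1, 2}, {0, 1, 2}, {1, 2, 3}, {3, 4}, {3, 4}]
found = frequent_itemsets(example_carts, min_support=2)
```

Even on five carts the miner reports ten frequent itemsets, because every subset of a frequent itemset is itself frequent. With long carts the number of candidates grows combinatorially, which is one plausible reading of the "not scalable" bullet.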

  17. Take 2 [figure]

  18. Take 2 [figure]

  19. Uniform vs. Clusters [figure]

  20. Running example [figure: Dim 1]

  21. Running example [figure: Dim 1 and Dim 2]

  22. Running example [figure: Dim 1 to Dim 4, each marked ✓ or ✗]

  23. Experiments [plot: F1 Score (0.0 to 1.0) for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU; x-axis ticks 1, 2, 4, 8, 16, 32, 100, 200]
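
For readers unfamiliar with the F1 Score axis in the experiment plots, the sketch below computes a simple object-level F1 between one found cluster and one ground-truth cluster. This is an assumption about the general idea only; the evaluation behind the plots may use a different variant (for example, averaging over a mapping of found clusters to hidden clusters). The name `f1_score` and the example sets are hypothetical.

```python
from typing import Set


def f1_score(found: Set[int], truth: Set[int]) -> float:
    """Harmonic mean of precision and recall over object memberships."""
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)  # share of found objects that are correct
    recall = overlap / len(truth)     # share of true objects that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 4 found objects belong to a 6-object true cluster.
score = f1_score({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
perfect = f1_score({1, 2, 3}, {1, 2, 3})
```

Here precision is 2/4 and recall is 2/6, so the score is 0.4, while an exact match scores 1.0.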

  24. Experiments [plot: Run Time (seconds), axis ticks 0 to 10000, on datasets S1500, S2500, S3500, S4500, S5500, for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  25. Experiments [plot: Run Time (seconds), axis ticks 0 to 10000, on datasets D5, D10, D15, D25, D50, D75, for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  26. Real World – MovieLens
      • Star Wars: A New Hope (a.k.a. Star Wars) (1977); Star Wars: The Empire Strikes Back (1980); Star Wars: Return of the Jedi (1983); LotR: The Fellowship of the Ring, The (2001); LotR: The Two Towers, The (2002); LotR: The Return of the King, The (2003)
      • Back to the Future (1985); Terminator, The (1984); Terminator 2: Judgment Day (1991); Die Hard (1988)
      • Terminator, The (1984); Terminator 2: Judgment Day (1991); Usual Suspects, The (1995); Pulp Fiction (1994); Silence of the Lambs, The (1991)

  27. Real World – MovieLens
      • Star Wars: A New Hope (1977); Star Wars: The Empire Strikes Back (1980); Star Wars: Return of the Jedi (1983); LotR: The Fellowship of the Ring, The (2001); LotR: The Two Towers, The (2002); LotR: The Return of the King, The (2003)
      • Brazil (1985); Dr. Strangelove (1964); Clockwork Orange, A (1971); 2001: A Space Odyssey (1968); Blade Runner (1982); Alien (1979)
      • Chinatown (1974); Rear Window (1954); North by Northwest (1959); Vertigo (1958); Psycho (1960); Silence of the Lambs, The (1991)
      • Third Man, The (1949); Citizen Kane (1941); Godfather: Part II, The (1974); Chinatown (1974); Godfather, The (1972); Taxi Driver (1976)

  28. Conclusion
      • Preserves neighborhood information
      • Combines different similarity measures gracefully
      • Finds relevant features and discards noise
      • Fast
      • Produces explainable results
      → Code and data are available at our website. Thank you!

  29. Real World – Gene Expression

                        Alon    Nutt
      Our method        0.78    0.78
      PROCLUS           0.46    0.49
      FIRES             0.52    0.55
      SUBCLU            0.58    n/a
      STATPC            n/a     n/a
      CartiClus         n/a     n/a
      # of Objects      62      50
      # of Dims         2000    1377

  30. More Experiments [plot: F1 Score (0 to 1) on datasets S1500, S2500, S3500, S4500, S5500, for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  31. More Experiments [plot: F1 Score (0 to 1) on datasets D05, D10, D15, D25, D50, D75, for Our Method, CartiClus, FIRES, PROCLUS, STATPC, SUBCLU]

  32. Experiments
      • Evaluate:
        - Subspace cluster detection
        - Noise robustness
        - Scalability
      • Competitors:
        - Subspace clustering: PROCLUS
        - Clustering: K-Means
        - Dimensionality reduction: PCA and Random Projection
        - Clustering ensemble: CSPA

  33. Results: Quality of the found clusters [bar chart: F1 and E4SC (0 to 1) for Our Method, CSPA, Proclus, K-Means, PCA+KM, RP+KM; 10 clusters in 10 dimensions, 200 irrelevant dimensions]

  34. Results: Quality of the found clusters [bar chart: F1 and E4SC (0 to 1) for Our Method, CSPA, Proclus, K-Means, PCA+KM, RP+KM; 10 clusters in 10 dimensions, 200 irrelevant dimensions]
      • Very effective at finding relevant dimensions.
