GRAVITATIONAL CLUSTERING OF THE SELF-ORGANIZING MAP Nejc Ilc - PowerPoint PPT Presentation


  1. GRAVITATIONAL CLUSTERING OF THE SELF-ORGANIZING MAP
     Nejc Ilc, Andrej Dobnikar
     University of Ljubljana, Faculty of Computer and Information Science
     ICANNGA 2011, Ljubljana

  2. INTRODUCTION
     • Tools needed to deal with
       ◦ data/web mining
       ◦ huge (social) networks
       ◦ gene expression data
       ◦ image segmentation
     ICANNGA, April 2011

  3. INTRODUCTION
     Figure: Visualization of the Internet. Credits: Opte Project

  5. INTRODUCTION
     Figure: Connections between neurons in the human brain. Credits: Van J. Wedeen, M.D., MGH/Harvard U.

  7. INTRODUCTION
     Figure: Heat map of gene expression profile. Credits: Manfred Gessler

  9. INTRODUCTION
     Figure: Image segmentation. Credits: T. Riklin-Raviv, N. Sochen and N. Kiryati

  11. CLUSTERING
      • unsupervised process of organizing data into "natural" groups
      • approaches
        ◦ information theory
        ◦ graphs
        ◦ fuzzy logic
        ◦ …
        ◦ artificial neural networks

  12. CLUSTERING WITH SOM
      • Self-Organizing Map [Kohonen, 1982]
      • Advantages
        ◦ visualization of high-dimensional data
        ◦ preserves topology and density of input data
      • Problem
        ◦ SOM is not a "true" clustering method
        ◦ more neurons than the expected number of clusters
        ◦ How to group neurons into clusters?

  13. CLUSTERING OF SOM
      • K-means, hierarchical [Vesanto & Alhoniemi, 2000]
      • Emergence SOM [Ultsch, 2007]
        ◦ watershed algorithm
        ◦ neurons > 1000
      • Surface flooding [Brugger et al., 2008]
        ◦ automatically finds number of clusters

  14. GSOM – THE IDEA

  15. GSOM – LEVEL ONE
      • train SOM on input data
      • identify winning neurons
      • remove interpolating neurons
      $n_j = [n_{j1}, n_{j2}, \dots, n_{jE}]$
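
The level-one steps listed on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that "interpolating neurons" are the neurons that are never the best-matching unit (BMU) of any input point, and the function names `train_som`, `best_matching_unit`, and `winning_neurons`, the Gaussian neighbourhood, and the linear decay schedules are all hypothetical choices made here.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM with online updates (illustrative only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: prototypes are initialised from randomly chosen data points.
    weights = [list(rng.choice(data)) for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    sigma0 = max(rows, cols) / 2.0
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
            br, bc = grid[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(grid):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    weights[k][i] += lr * h * (x[i] - weights[k][i])
            step += 1
    return weights


def winning_neurons(weights, data):
    """Keep only the neurons that are the BMU of at least one input point."""
    winners = sorted({best_matching_unit(weights, x) for x in data})
    return [weights[k] for k in winners]
```

Calling `winning_neurons(train_som(data, 13, 11), data)` would return the prototypes that survive level one for a 13 x 11 map; these are the points handed to level two.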

  16. GSOM – LEVEL TWO
      • Gravitational clustering [Wright, 1977; Gomez et al., 2003]
      • BMU → mass point (m=1)
      • "Move & merge" steps
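
A rough sketch of the "move & merge" idea follows. It is an assumption-laden illustration rather than the GSOM algorithm itself: every point starts with unit mass as on the slide, but the inverse-square pull, the per-iteration decay of the gravitational constant, the mass-weighted merge, and the parameter names `g`, `g_decay`, and `merge_dist` are hypothetical and are not claimed to match the parameters used in the talk.

```python
import math


def merge_close(particles, merge_dist):
    """Merge step: fuse each particle into the first kept particle closer than merge_dist."""
    merged = []
    for p in particles:
        for q in merged:
            if math.dist(p["pos"], q["pos"]) < merge_dist:
                total = p["mass"] + q["mass"]
                # Assumption: the merged particle sits at the mass-weighted centroid.
                q["pos"] = [(q["mass"] * x + p["mass"] * y) / total
                            for x, y in zip(q["pos"], p["pos"])]
                q["mass"] = total
                q["members"] += p["members"]
                break
        else:
            merged.append(p)
    return merged


def gravitational_clustering(points, g=0.001, g_decay=0.05,
                             merge_dist=0.1, max_iter=200):
    """Return one cluster label per input point; the cluster count is not preset."""
    particles = [{"pos": list(p), "mass": 1.0, "members": [i]}
                 for i, p in enumerate(points)]
    for _ in range(max_iter):
        if len(particles) < 2:
            break
        # Move step: each particle is displaced by the summed pull of the others.
        shifts = []
        for a in particles:
            shift = [0.0] * len(a["pos"])
            for b in particles:
                dist = math.dist(a["pos"], b["pos"])
                if a is b or dist < 1e-12:
                    continue
                # Cap each pairwise step at half the distance to limit overshoot.
                pull = min(g * b["mass"] / dist ** 2, dist / 2.0)
                for i, (x, y) in enumerate(zip(a["pos"], b["pos"])):
                    shift[i] += pull * (y - x) / dist
            shifts.append(shift)
        for a, shift in zip(particles, shifts):
            a["pos"] = [x + s for x, s in zip(a["pos"], shift)]
        particles = merge_close(particles, merge_dist)
        g *= 1.0 - g_decay  # gravity weakens so distant groups stay apart
    labels = [0] * len(points)
    for label, p in enumerate(particles):
        for i in p["members"]:
            labels[i] = label
    return labels
```

The particles that remain when the loop ends define the clusters, which is how a scheme of this kind can report the number of clusters without being told it in advance.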

  17. EXPERIMENT
      • GSOM compared to
        ◦ EM GMM [Dempster et al., 1977]
        ◦ CS [Jenssen et al., 2003]
        ◦ SOMkM [Vesanto & Alhoniemi, 2000]
      • datasets
        ◦ 6 artificial (2D with complex shapes)
        ◦ 3 real from UCI (Iris, Wine, LettersABC)
      • 100 runs of each algorithm; we measure:
        ◦ Clustering Error (CE): minimal, average
        ◦ elapsed time
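
The slides do not define Clustering Error. One common reading, assumed here, is the fraction of points left unmatched under the best one-to-one assignment of detected clusters to true classes. The helper `clustering_error` below is a hypothetical sketch of that reading; it uses brute-force search over assignments, which is only practical for the small cluster counts in these experiments.

```python
from itertools import permutations


def clustering_error(true_labels, pred_labels):
    """Fraction of points not covered by the best one-to-one cluster/class matching."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Contingency counts: how many points of class t fall into cluster p.
    counts = {}
    for t, p in zip(true_labels, pred_labels):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    # Pad with None so surplus clusters can stay unmatched.
    size = max(len(classes), len(clusters))
    padded = classes + [None] * (size - len(classes))
    best = 0
    for assignment in permutations(padded, len(clusters)):
        matched = sum(counts.get((cls, clu), 0)
                      for clu, cls in zip(clusters, assignment))
        best = max(best, matched)
    return 1.0 - best / len(true_labels)
```

Over repeated runs, the minimal and average CE reported in the experiment would then be `min(errors)` and `sum(errors) / len(errors)` for a list `errors` of per-run values.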

  18. RESULTS – GIANT
      • EM GMM: CE = 0.0
      • CS: CE = 0.219
      • SOMkM: CE = 0.352
      • GSOM: CE = 0.0

  19. RESULTS – WAVE
      • EM GMM: CE = 0.280
      • CS: CE = 0.130
      • SOMkM: CE = 0.126
      • GSOM: CE = 0.0

  20. RESULTS – RANKS
      Figure: Mean Rank
      • minimal CE
      • average CE

  21. RESULTS – ELAPSED TIME
      • Hepta N=212
      • LettersABC N=1719

  22. RESULTS – NUMBER OF CLUSTERS
      • number of detected clusters

      dataset      true number   GSOM
      Giant        2             2
      Hepta        7             7
      Ring         2             4
      Wave         2             2
      Moon         4             4
      Flag         3             3
      Iris         3             3
      Wine         3             3
      LettersABC   3             7

  25. GSOM – SUMMARY
      + finds linearly non-separable clusters of complex shapes
      + insensitive to unbalanced density of clusters
      + number of clusters automatically detected
      + usage of topology relations – neighbourhood
      + less computationally intensive
      + intuitive
      - 8 parameters to adjust
      - sometimes unstable behaviour

  26. FUTURE WORK
      • implementing heuristics for setting parameters automatically
      • study of clustering ensembles based on GSOM
        ◦ could the non-deterministic nature of GSOM be an advantage?
      • application of GSOM to clustering of gene expression data

  27. DATASETS PROPERTIES

      dataset      number of points   number of dimensions   number of clusters
      Giant        862                2                      2
      Hepta        212                2                      7
      Ring         800                2                      2
      Wave         293                2                      2
      Moon         514                2                      4
      Flag         640                2                      3
      Iris         150                4                      3
      Wine         178                13                     3
      LettersABC   1719               16                     3

  28. GSOM PARAMETERS SETTING

      dataset      SOM size   SOM grid   𝐇        𝚬𝐇      α      p
      Giant        13 x 11    rect.      0.0008   0.045   0.01   0.1
      Hepta        9 x 8      rect.      0.0008   0.060   0.01   0.1
      Ring         11 x 10    rect.      0.0008   0.045   0.01   0.1
      Wave         14 x 12    rect.      0.0008   0.045   0.01   0.1
      Moon         20 x 10    rect.      0.0008   0.045   0.01   0.0
      Flag         14 x 9     rect.      0.0008   0.045   0.01   0.1
      Iris         12 x 5     rect.      0.0008   0.045   0.01   0.1
      Wine         7 x 5      rect.      0.0008   0.030   0.01   0.1
      LettersABC   12 x 9     rect.      0.0010   0.030   0.01   0.1
