ICANNGA 2011, Ljubljana
GRAVITATIONAL CLUSTERING OF THE SELF-ORGANIZING MAP
Nejc Ilc, Andrej Dobnikar
University of Ljubljana, Faculty of Computer and Information Science
INTRODUCTION
• Tools are needed to deal with:
  • data/web mining
  • huge (social) networks
  • gene expression data
  • image segmentation
• Figures:
  • Visualization of the Internet (Credits: Opte Project)
  • Connections between neurons in the human brain (Credits: Van J. Wedeen, M.D., MGH/Harvard U.)
  • Heat map of a gene expression profile (Credits: Manfred Gessler)
  • Image segmentation (Credits: T. Riklin-Raviv, N. Sochen and N. Kiryati)
ICANNGA, April 2011
CLUSTERING
• unsupervised process of organizing data into "natural" groups
• approaches:
  • information theory
  • graphs
  • fuzzy logic
  • …
  • artificial neural networks
CLUSTERING WITH SOM
• Self-Organizing Map [Kohonen, 1982]
• Advantages:
  • visualization of high-dimensional data
  • preserves topology and density of input data
• Problem:
  • SOM is not a "true" clustering method
  • it has more neurons than the expected number of clusters
  • How to group neurons into clusters?
CLUSTERING OF SOM
• K-means, hierarchical [Vesanto & Alhoniemi, 2000]
• Emergent SOM [Ultsch, 2007]
  • watershed algorithm
  • number of neurons > 1000
• Surface flooding [Brugger et al., 2008]
  • automatically finds the number of clusters
GSOM – THE IDEA
GSOM – LEVEL ONE
• train SOM on input data
• identify winning neurons
• remove interpolating neurons
$n_j = [n_{j1}, n_{j2}, \ldots, n_{jE}]$
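The three steps of level one can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `train_som` and `winning_neurons`, the linear decay of learning rate and neighbourhood width, and all default values are assumptions made for the example. A neuron is treated as "interpolating" here if it is never the best-matching unit (BMU) for any input point.

```python
import numpy as np

def train_som(data, rows, cols, n_epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM (hypothetical helper, assumed schedules)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Grid coordinates of each neuron, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the weights from randomly chosen input points.
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # assumed lower bound
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def winning_neurons(weights, data):
    """Return indices of neurons that win at least once, and each point's BMU."""
    data = np.asarray(data, dtype=float)
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    bmu = dist.argmin(axis=1)
    hits = np.bincount(bmu, minlength=len(weights))
    keep = np.flatnonzero(hits > 0)  # neurons with zero hits are dropped
    return keep, bmu
```

The rows of `weights[keep]` would then serve as the input of level two.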
GSOM – LEVEL TWO
• Gravitational clustering [Wright, 1977; Gomez et al., 2003]
• each BMU becomes a mass point (m=1)
• "Move & merge" steps
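A "move & merge" loop of this kind can be sketched as below. It is a simplified illustration under stated assumptions, not the exact GSOM update rule: every point starts with unit mass, each point is pulled by an inverse-square force from all others, the step length is capped, the gravitational constant decays every iteration, and two points merge when they come closer than a fixed radius. The names `gravitational_clustering` and `merge_close` and all parameter names and defaults are hypothetical.

```python
import numpy as np

def merge_close(pos, mass, members, radius):
    """Repeatedly merge the closest pair while it is nearer than `radius`."""
    while len(pos) > 1:
        dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        i, j = (int(k) for k in np.unravel_index(np.argmin(dist), dist.shape))
        if dist[i, j] >= radius:
            break
        total = mass[i] + mass[j]
        pos[i] = (mass[i] * pos[i] + mass[j] * pos[j]) / total  # centre of mass
        mass[i] = total
        members[i] = members[i] + members[j]
        pos = np.delete(pos, j, axis=0)
        mass = np.delete(mass, j)
        del members[j]
    return pos, mass, members

def gravitational_clustering(points, g=1e-3, g_decay=0.05,
                             merge_radius=0.1, max_step=0.05, n_iter=100):
    """Return a list of clusters, each a list of indices into `points`."""
    pos = np.array(points, dtype=float)
    mass = np.ones(len(pos))                 # every point starts with m = 1
    members = [[i] for i in range(len(pos))]
    for _ in range(n_iter):
        if len(pos) == 1:
            break
        # Move: diff[i, j] is the vector from point i to point j.
        diff = pos[None, :, :] - pos[:, None, :]
        dist = np.maximum(np.linalg.norm(diff, axis=2), 1e-9)
        np.fill_diagonal(dist, np.inf)       # no self-attraction
        pull = g * mass[None, :, None] * diff / dist[:, :, None] ** 3
        step = pull.sum(axis=1)
        norm = np.linalg.norm(step, axis=1, keepdims=True)
        step *= np.minimum(1.0, max_step / np.maximum(norm, 1e-12))  # cap step
        pos = pos + step
        # Merge: fuse points that have come close together.
        pos, mass, members = merge_close(pos, mass, members, merge_radius)
        g *= 1.0 - g_decay                   # gravity weakens over time
    return members
```

Because gravity decays, distant groups stop approaching each other before they merge, so the number of surviving mass points is an output of the procedure and not an input to it.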
EXPERIMENT
• GSOM compared to:
  • EM GMM [Dempster et al., 1977]
  • CS [Jenssen et al., 2003]
  • SOMkM [Vesanto & Alhoniemi, 2000]
• datasets:
  • 6 artificial (2D with complex shapes)
  • 3 real from UCI (Iris, Wine, LettersABC)
• 100 runs of each algorithm; we measure:
  • Clustering Error (CE): minimal, average
  • elapsed time
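The slides do not define CE. One common reading, assumed here, is the fraction of points that remain mislabelled under the best one-to-one matching between detected clusters and true classes. The sketch below follows that assumption; `clustering_error` is a hypothetical helper, and the brute-force search over permutations is only practical for a handful of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_error(true_labels, pred_labels):
    """Fraction of points mislabelled under the best cluster-to-class matching."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)
    # Pad to a square table so surplus clusters or classes match nothing.
    size = max(len(classes), len(clusters))
    table = np.zeros((size, size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            # table[i, j] = number of points in cluster i that belong to class j
            table[i, j] = np.sum((pred_labels == c) & (true_labels == k))
    # Try every one-to-one assignment of clusters to classes.
    best = max(sum(table[i, p[i]] for i in range(size))
               for p in permutations(range(size)))
    return 1.0 - best / len(true_labels)
```

Under this definition, relabelling the clusters does not change the score, and a method that splits a true class into several clusters is penalised for every point outside the best-matching one.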
RESULTS – GIANT
[Figure: clustering of the Giant dataset, one panel per method]
method   CE
EM GMM   0.0
CS       0.219
SOMkM    0.352
GSOM     0.0
RESULTS – WAVE
[Figure: clustering of the Wave dataset, one panel per method]
method   CE
EM GMM   0.280
CS       0.130
SOMkM    0.126
GSOM     0.0
RESULTS – RANKS
[Figure: mean rank of the methods]
• minimal CE
• average CE
RESULTS – ELAPSED TIME
[Figure: elapsed time of the methods]
• Hepta, N=212
• LettersABC, N=1719
RESULTS – NUMBER OF CLUSTERS
• number of detected clusters
dataset      true number   GSOM
Giant        2             2
Hepta        7             7
Ring         2             4
Wave         2             2
Moon         4             4
Flag         3             3
Iris         3             3
Wine         3             3
LettersABC   3             7
GSOM – SUMMARY
+ finds clusters of complex shapes, including linearly non-separable ones
+ insensitive to unbalanced density of clusters
+ number of clusters is detected automatically
+ uses topology relations (neighbourhood)
+ less computationally intensive
+ intuitive
- 8 parameters to adjust
- sometimes unstable behaviour
FUTURE WORK
• implementing heuristics for setting parameters automatically
• study of clustering ensembles based on GSOM
  • could the non-deterministic nature of GSOM be an advantage?
• application of GSOM to clustering of gene expression data
DATASETS PROPERTIES
dataset      number of points   number of dimensions   number of clusters
Giant        862                2                      2
Hepta        212                2                      7
Ring         800                2                      2
Wave         293                2                      2
Moon         514                2                      4
Flag         640                2                      3
Iris         150                4                      3
Wine         178                13                     3
LettersABC   1719               16                     3
GSOM PARAMETERS SETTING
dataset      SOM size   SOM grid   𝐇        𝚬𝐇      α      p
Giant        13 x 11    rect.      0.0008   0.045   0.01   0.1
Hepta        9 x 8      rect.      0.0008   0.060   0.01   0.1
Ring         11 x 10    rect.      0.0008   0.045   0.01   0.1
Wave         14 x 12    rect.      0.0008   0.045   0.01   0.1
Moon         20 x 10    rect.      0.0008   0.045   0.01   0.0
Flag         14 x 9     rect.      0.0008   0.045   0.01   0.1
Iris         12 x 5     rect.      0.0008   0.045   0.01   0.1
Wine         7 x 5      rect.      0.0008   0.030   0.01   0.1
LettersABC   12 x 9     rect.      0.0010   0.030   0.01   0.1