
SD-DP: Sparse Dual of the Density Peaks Algorithm for Cluster Analysis of High-Dimensional Data



  1. SD-DP: Sparse Dual of the Density Peaks Algorithm for Cluster Analysis of High-Dimensional Data
     November 5, 2018
     Dimitris Floros (1), Tiancheng Liu (2), Nikos Pitsianis (1, 2), Xiaobai Sun (2)
     (1) Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki
     (2) Department of Computer Science, Duke University
     The first two authors contributed equally to this work.
     Floros Liu Pitsianis Sun (AUTh|Duke) SD-DP: Sparse Dual of Density Peaks November 5, 2018

  2. Outline
     1. Cluster analysis of high-dimensional data
     2. The Density Peaks (DP) and other influential algorithms
     3. SD-DP: Sparse Dual of the DP algorithm
     4. Experimental evidence
        - Benchmarks
        - Exploratory results


  4. Cluster analysis of high-dimensional data
     Premise: intrinsic heterogeneous group/cluster structures in real-world data of research interest.
     Cluster analysis: uncover cluster structures in data, with noise and uncertainty, with quantified features, governed by certain differentiation criteria
     - massive data of many attributes/features
     - supervised vs. un-supervised
     Fundamental to various research studies (domain-specific analysis: feature description):
     - Molecular dynamics trajectory patterns [1]: kinetic, spectral measurements
     - Classification of astronomical events [2]: Gamma ray measurements
     - Community detection in complex system [3, 4, 5]: link features
     - Image segmentation/denoising [6, 7]: intensity, patch texture
     - Content-based image retrieval [8]: semantic content descriptor
     - Image object recognition [9, 10]: SIFT [11], HOG [12] descriptors
     - Gene expression pattern analysis [13, 14, 15, 16, 17]: gene-expression matrix
     - Thematic categorization of documents [18, 19]: word frequency vector
     - Statistical semantic or sentiment analysis: GloVe [20] word vector
     - Statistical categorization of musical genres [21]: musical surface features
     - Consumer profiling/market segmentation [22]: purchase history
     Figures: Abell 901/902 supercluster [23]; Uber & Taxi demand in NYC [24]; Co-authorship communities [25]; US city lights [26].


  6. DP, other influential algorithms & SD-DP
     Algorithms: K-MEANS [27] (1982), DBSCAN [28] (1996), OPTICS [29] (1999), MEAN SHIFT [30] (2002), GN [3] (2002), COMBO [5] (2014), DP [31] (2014), SD-DP [32] (2018)
     Desirable properties (1), with the number of algorithms marked for each (which algorithms carry the marks is not preserved in this transcript):
     - No prescription of # clusters: 7 of 8
     - No restriction in cluster shape: 7 of 8
     - Free choice of metrics: 6 of 8
     - Agnostic to distribution: 5 of 8
     - Easy or no tuning: 4 of 8
     - Robust in high-dim. space: 1 of 8
     - Accurate in high-dim. space: 1 of 8
     - Low computation cost: 1 of 8
     Checkmarks are based on limited benchmarking experiments.
     (1) Additional properties include low program complexity, stability and more.

  7. DP vs SD-DP: classification accuracy
     60,000 images of handwritten digits (MNIST dataset) [33]
     Figure: two 10 × 10 confusion matrices, DP (2018) [34] and SD-DP. Axes: Estimated Clusters vs. True Classes (digits 0 to 9), with per-class precision, recall and error, and the total accuracy.
     Panel annotations: HOG descriptors (D = 144) and intensity feature vector (D = 28 × 28 = 784); Euclidean distance and tangent distance; unsupervised cluster revision and manual intervention in peak selection and cluster merge.

  8. DP vs SD-DP: classification accuracy
     Comparison in Dice similarity coefficients (DSC), a.k.a. F1 scores and Sørensen-Dice coefficients, on 60,000 images of handwritten digits (MNIST dataset):
     Digit   DP (2018), semi-supervised   SD-DP, un-supervised
     0       0.99                         0.98
     1       0.83                         0.98
     2       0.77                         0.95
     3       0.94                         0.95
     4       0.87                         0.96
     5       0.95                         0.97
     6       0.98                         0.98
     7       0.88                         0.96
     8       0.95                         0.94
     9       0.84                         0.93
     With T the set of true members of a class and P the set of predicted members:
     \[ \mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2\,|T \cap P|}{|T| + |P|} \]
     Figures: all misclassified digit-0 images by SD-DP; subset of misclassified digit-2 images by SD-DP.
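As an illustration of the DSC formula on this slide, the following minimal sketch computes per-class Dice scores from a confusion matrix. The function name `dice_per_class`, the row/column convention (rows are estimated clusters, columns are true classes) and the toy 2 × 2 matrix are assumptions made for this example; they are not taken from the slides.

```python
def dice_per_class(confusion):
    """Per-class Dice similarity coefficient, DSC = 2TP / (2TP + FP + FN).

    Assumed convention (hypothetical, for this sketch only):
    confusion[p][t] = number of samples assigned to cluster p
    whose true class is t.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        # Row k holds everything predicted as k: false positives are the rest.
        fp = sum(confusion[k]) - tp
        # Column k holds everything truly k: false negatives are the rest.
        fn = sum(confusion[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy data, not from the presentation.
toy = [[8, 1],
       [2, 9]]
scores = dice_per_class(toy)
```

For class 0 of the toy matrix, TP = 8, FP = 1 and FN = 2, so |T| = 10 and |P| = 9, and both forms of the formula give 16/19.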


  10. The Density Peaks principle [Rodriguez and Laio, Science, 2014]
      Principle: “Cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities”.
      Local density description: population in a neighborhood of specified radius r,
      \[ \rho_i = \begin{cases} |N_r(x_i)|, & \text{hard cutoff} \\ \sum_j \exp\left(-d_{ij}^2 / r^2\right), & \text{soft cutoff} \end{cases} \]
      Figure: Probability distribution from which point distributions are drawn. The regions with lowest intensity correspond to a background uniform probability of 20%.
      Figure: Point distribution for samples of 4000 points. Points are colored according to the cluster to which they are assigned. Black points belong to the cluster halos.
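The two local density definitions on this slide can be sketched as below. This is a brute-force illustration, not the authors' implementation: the function name `local_density`, the exclusion of the point itself from its own neighborhood, the strict inequality d < r for the hard cutoff, and the toy points are all assumptions made for this example.

```python
import math


def local_density(points, r, soft=False):
    """Local density rho_i of every point for neighborhood radius r.

    hard cutoff: rho_i = number of other points with d_ij < r
    soft cutoff: rho_i = sum over j != i of exp(-d_ij^2 / r^2)
    Assumption of this sketch: the point itself (j == i) is not counted.
    """
    n = len(points)
    rho = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])  # Euclidean distance
            if soft:
                rho[i] += math.exp(-(d * d) / (r * r))
            elif d < r:
                rho[i] += 1.0
    return rho


# Toy data, not from the presentation: three close points and one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
hard = local_density(pts, r=0.5)
soft = local_density(pts, r=0.5, soft=True)
```

The double loop costs O(n²) distance evaluations, which is adequate for a small illustration only.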
