  1. 7th International Conference on Advanced Data Mining and Applications. An Algorithm for Sample and Data Dimensionality Reduction Using Fast Simulated Annealing. Szymon Łukasik, Piotr Kulczycki. Department of Automatic Control and IT, Cracow University of Technology; Systems Research Institute, Polish Academy of Sciences.

  2. Motivation
• It is estimated ("How Much Information?" project, Univ. of California Berkeley) that 1 million terabytes of data is generated annually worldwide, with 99.997% of it available only in digital form.
• It is commonly agreed that our ability to analyze new data grows at a much lower pace than our capacity to collect and store it.
• When examining huge data samples one faces both technical difficulties and the methodological obstacles of high-dimensional data analysis (the so-called "curse of dimensionality").

  3. Curse of dimensionality – example. Source: K. Beyer et al., "When Is 'Nearest Neighbor' Meaningful?", Proc. ICDT, 1999.

  4. Scope of our research
• We have developed a universal unsupervised data dimensionality reduction technique, in some aspects similar to Principal Component Analysis (it is linear) and Multidimensional Scaling (it is distance-preserving). What is more, we try to reduce the data sample length at the same time.
• Establishing the exact form of the transformation matrix is treated as a continuous optimization problem and solved by Parallel Fast Simulated Annealing.
• The algorithm is intended to be used in conjunction with various data mining procedures, e.g. outlier detection, cluster analysis, and classification.

  5. General description of the algorithm
• Data dimensionality reduction is realized via a linear transformation W = A U, where U denotes the initial data set (an n × m matrix holding m elements of dimension n), A the transformation matrix (N × n), and W the transformed data matrix (N × m).
• The transformation matrix is obtained using Parallel FSA. The minimized cost function g(A) is given by the raw Stress:
g(A) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \left( \| w_i(A) - w_j(A) \|_{\mathbb{R}^N} - \| u_i - u_j \|_{\mathbb{R}^n} \right)^2
with A being a solution of the optimization problem, and u_i, u_j, w_i(A), w_j(A) representing data instances in the initial and reduced feature spaces, respectively.
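A minimal sketch of the cost function above, assuming a NumPy representation with data elements stored as columns (the function and variable names are mine, not the authors' code): it applies a candidate transformation matrix A and accumulates the squared differences between pairwise Euclidean distances in the reduced and original spaces.

```python
import numpy as np

def raw_stress(A, U):
    """Raw Stress g(A) for a candidate transformation matrix.

    U : (n, m) array -- m data elements of original dimension n (columns).
    A : (N, n) array -- linear transformation to the reduced dimension N.
    """
    W = A @ U                                        # reduced data, shape (N, m)
    m = U.shape[1]
    g = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            d_red = np.linalg.norm(W[:, i] - W[:, j])   # distance in R^N
            d_org = np.linalg.norm(U[:, i] - U[:, j])   # distance in R^n
            g += (d_red - d_org) ** 2
    return g

# Usage: a random 10-dimensional sample reduced to 2 dimensions
rng = np.random.default_rng(0)
U = rng.normal(size=(10, 50))
A = rng.normal(size=(2, 10))
print(raw_stress(A, U))
```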

  6. FSA neighbor generation strategy. [Figure: two panels of generated neighbors plotted over two solution coordinates, both axes ranging from −20 to 20.]
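Fast Simulated Annealing (Szu & Hartley, ref. 1; Nam et al., ref. 3) generates neighbors from a Cauchy-type distribution, whose heavy tails allow occasional long jumps. The sketch below illustrates only that general idea; scaling the step by the current temperature is my illustrative choice, not necessarily the authors' exact scheme.

```python
import numpy as np

def cauchy_neighbor(A, temperature, rng):
    """Perturb every entry of the current solution with a Cauchy-distributed step.

    The step scale is tied to the temperature here for illustration; the heavy
    Cauchy tails are what distinguish FSA from classical (Gaussian) annealing.
    """
    step = temperature * rng.standard_cauchy(size=A.shape)
    return A + step

# Usage: one neighbor of a zero-initialized 2x10 transformation matrix
rng = np.random.default_rng(1)
A = np.zeros((2, 10))
print(cauchy_neighbor(A, temperature=1.0, rng=rng))
```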

  7. FSA temperature and termination criterion
• The initial temperature T(0) is determined through a set of pilot runs consisting of k_P positive transitions from the starting solution. It is supposed to guarantee a predetermined initial acceptance probability P(0) of worse solutions resulting from the Metropolis rule (a sketch of this initialization follows after this slide).
• The initial solution is obtained using the feature selection algorithm introduced by Pal & Mitra in 2004. It is based on feature space clustering, with similar features forming distinctive clusters. The maximal information compression index is used as the similarity measure. The partition itself is performed using the k-nearest-neighbor rule (here k = n − N is used).
• The termination criterion is either executing an assumed number of iterations or fulfilling a customized condition based on an order-statistics estimator of the global minimum, proposed recently for a class of stochastic random search algorithms by Bartkute and Sakalauskas (2009).
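One common way to realize the first bullet, and the one assumed in this sketch, is to back T(0) out of the Metropolis rule: if the average cost increase observed over k_P pilot "positive transitions" (moves to worse solutions) is Δḡ, then requiring exp(−Δḡ / T(0)) = P(0) gives T(0) = −Δḡ / ln P(0). The authors' exact procedure may differ; the cost and neighbor functions are the sketches above.

```python
import numpy as np

def initial_temperature(cost, neighbor, A0, k_P=50, P0=0.9, rng=None):
    """Estimate T(0) so that the Metropolis rule accepts worse solutions
    with probability roughly P0 at the start of the annealing."""
    if rng is None:
        rng = np.random.default_rng()
    g0 = cost(A0)
    increases = []
    while len(increases) < k_P:
        delta = cost(neighbor(A0, 1.0, rng)) - g0
        if delta > 0:                        # keep only "positive transitions"
            increases.append(delta)
    mean_increase = np.mean(increases)
    return -mean_increase / np.log(P0)       # exp(-mean/T) = P0  =>  T = -mean / ln(P0)
```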

  8. FSA parallelization. [Diagram: starting from the current global solution, Neighbor 1 … Neighbor n_cores are generated and processed by n_cores independent FSA threads, yielding Current 1 … Current n_cores; the global current solution is then set to either the best improving candidate or a random non-improving one.]
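A sketch of one synchronised step of the scheme the diagram describes, written sequentially for clarity (in the real scheme the candidate evaluations run on separate cores). Accepting the random non-improving candidate through the Metropolis rule is my simplification of "random non-improving solution".

```python
import numpy as np

def parallel_fsa_step(A_global, cost, neighbor, T, n_cores, rng):
    """One step of the parallelised FSA loop (simplified, sequential stand-in)."""
    candidates = [neighbor(A_global, T, rng) for _ in range(n_cores)]
    costs = [cost(A) for A in candidates]     # evaluated by n_cores workers in practice
    g_global = cost(A_global)

    best = int(np.argmin(costs))
    if costs[best] < g_global:                # best improving candidate becomes global
        return candidates[best]

    k = rng.integers(n_cores)                 # otherwise pick a random candidate ...
    if rng.random() < np.exp(-(costs[k] - g_global) / T):
        return candidates[k]                  # ... and accept it by the Metropolis rule
    return A_global
```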

  9. Sample size reduction
• For each sample element u_i a positive weight p_i is assigned. It incorporates information about the relative deformation of the element's distances to the other sample points. Data elements with a higher weight can then be treated as more adequate. Weights are normalized to fulfill \sum_i p_i = 1.
• Consequently, the weights can be used to improve the performance of data mining procedures, e.g. by introducing them into the definition of classic data mining algorithms (e.g. k-means or k-nearest neighbors).
• Alternatively, one can use the weights to eliminate some data elements from the sample. This is performed by removing from the sample the elements whose weights fulfill the condition p_i < P, where P ∈ [0, 1], and then re-normalizing the remaining weights. In this way one achieves simultaneous dimensionality and sample length reduction, with P serving as a data compression ratio.
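The slide does not give the exact weight formula, so the deformation measure below (inverse of the summed absolute change of an element's pairwise distances) is only a plausible stand-in of my own; the normalisation and the p_i < P pruning follow the description above.

```python
import numpy as np

def compress_sample(U, W, P):
    """Weight sample elements by how little their pairwise distances were
    deformed by the reduction, then drop elements whose weight falls below P.

    U : (n, m) original data, W : (N, m) reduced data (columns are elements).
    The deformation measure itself is an illustrative assumption.
    """
    D_org = np.linalg.norm(U[:, :, None] - U[:, None, :], axis=0)  # (m, m) distances
    D_red = np.linalg.norm(W[:, :, None] - W[:, None, :], axis=0)
    deformation = np.abs(D_red - D_org).sum(axis=1)
    weights = 1.0 / (1.0 + deformation)            # higher weight = less deformed
    weights /= weights.sum()                       # normalise to sum to 1
    keep = weights >= P                            # p_i < P  =>  element is removed
    weights = weights[keep] / weights[keep].sum()  # re-normalise the survivors
    return U[:, keep], W[:, keep], weights
```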

  10. Experimental evaluation
• We have examined the performance of the algorithm by measuring the accuracy of outlier detection I_o (for artificially generated datasets), clustering I_c, and classification I_k (for selected benchmark instances from the UCI ML repository); see the evaluation sketch below.
• Outlier detection was performed using nonparametric statistical kernel density estimation. By using randomly generated datasets we were able to designate the actual outliers.
• Clustering accuracy was measured by the Rand index (with reference to the class labels); clustering itself was performed with the classic k-means algorithm.
• Classification accuracy (for the nearest-neighbor classifier) was measured by the average classification correctness obtained in a 5-fold cross-validation procedure.
• Each test consisted of 30 runs; we report the mean and standard deviation of the above-mentioned indices. We compared our approach to PCA and to Evolutionary Algorithm-based Feature Selection (by Saxena et al.).
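A sketch of the clustering and classification indices from this protocol using scikit-learn (the use of that library, the function names, and the column-major data layout are my assumptions). In the reported experiments these indices are additionally averaged over the 30 runs of the stochastic reduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def classification_index(W, y):
    """5-fold cross-validated accuracy of a 1-NN classifier on the reduced
    data W (columns are samples), i.e. the classification index I_k."""
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, W.T, y, cv=5).mean()

def clustering_index(W, y, n_clusters):
    """Rand index between k-means labels on the reduced data and the class
    labels, i.e. the clustering index I_c."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(W.T)
    return rand_score(y, labels)
```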

  11. Example: seeds dataset (7D → 2D). [Figure: the reduced 2D data obtained with our approach compared with the PCA projection.]

  12. More details – classification (values: mean ± σ over 30 runs)

| Method | glass 9D→4D | wine 13D→5D | WBC 9D→4D | vehicle 18D→10D | seeds 7D→2D |
|---|---|---|---|---|---|
| Initial space, I_k,INIT | 71.90 ±8.10 | 74.57 ±5.29 | 95.88 ±1.35 | 63.37 ±3.34 | 90.23 ±2.85 |
| Our approach (P=0.1), I_k,RED | 70.48 ±7.02 | 78.00 ±4.86 | 95.95 ±1.43 | 63.96 ±2.66 | 89.76 ±3.18 |
| PCA, I_k,RED | 58.33 ±6.37 | 72.00 ±7.22 | 95.29 ±2.06 | 62.24 ±3.84 | 83.09 ±7.31 |
| EA-based Feature Selection, I_k,RED | 64.80 ±4.43 | 72.82 ±1.02 | 95.10 ±0.80 | 60.86 ±1.51 | not tested |

  13. More details – cluster analysis

| Method | glass 9D→4D | wine 13D→5D | WBC 9D→4D | vehicle 18D→10D | seeds 7D→2D |
|---|---|---|---|---|---|
| Initial space, I_c,INIT | 68.23 | 93.48 | 66.23 | 64.18 | 91.06 |
| Our approach (P=0.2), I_c,RED | 68.43 ±0.62 | 92.81 ±0.76 | 66.29 ±0.62 | 64.62 ±0.24 | 89.59 ±1.57 |
| PCA, I_c,RED | 67.71 | 92.64 | 66.16 | 64.16 | 88.95 |

  14. Conclusion
• The algorithm was tested on numerous instances of outlier detection, cluster analysis and classification problems and was found to offer promising performance. It results in accurate distance preservation, with the possibility of out-of-sample extension at the same time.
• Drawbacks? It is not designed for huge datasets (due to the significant computational cost of evaluating the cost function) and should not be used when only a single data analysis task needs to be performed.
• What can be done in the future? We observed that taking into account the topological deformation of the dataset in the reduced feature space (via the proposed weighting scheme) brings positive results in various data mining procedures. It can easily be extended to other DR techniques! The proposed approach could make algorithms very prone to the "curse of dimensionality" practically usable (we have examined this in the case of KDE).

  15. Thank you for your attention!

  16. Short bibliography
1. H. Szu, R. Hartley: "Fast simulated annealing", Physics Letters A, vol. 122/3-4, 1987.
2. L. Ingber: "Adaptive simulated annealing (ASA): Lessons learned", Control and Cybernetics, vol. 25/1, 1996.
3. D. Nam, J.-S. Lee, C.H. Park: "N-dimensional Cauchy neighbor generation for the fast simulated annealing", IEICE Trans. Information and Systems, vol. E87-D/11, 2004.
4. S.K. Pal, P. Mitra: "Pattern Recognition Algorithms for Data Mining", Chapman and Hall, 2004.
5. V. Bartkute, L. Sakalauskas: "Statistical Inferences for Termination of Markov Type Random Search Algorithms", Journal of Optimization Theory and Applications, vol. 141/3, 2009.
6. P. Kulczycki: "Kernel Estimators in Industrial Applications", in: Soft Computing Applications in Industry, B. Prasad (ed.), Springer-Verlag, 2008.
7. A. Saxena, N.R. Pal, M. Vora: "Evolutionary methods for unsupervised feature selection using Sammon's stress function", Fuzzy Information and Engineering, vol. 2, 2010.
