Detecting multivariate outliers using projection pursuit

Detecting multivariate outliers using projection pursuit with particle swarm optimization. Anne Ruiz-Gazen, Alain Berro, Souad Larabi Marie-Sainte. University of Toulouse 1 - Capitole (TSE - IRIT - IMT). COMPSTAT, Paris, September 2010.


  1. Detecting multivariate outliers using projection pursuit with particle swarm optimization. Anne Ruiz-Gazen, Alain Berro, Souad Larabi Marie-Sainte. University of Toulouse 1 - Capitole (TSE - IRIT - IMT). COMPSTAT, Paris, September 2010. A. Ruiz-Gazen (University of Toulouse) EPP using PSO COMPSTAT 2010 1 / 35

  4. Introduction: What is Exploratory Projection Pursuit (EPP)?
     - Search for “interesting” linear low-dimensional projections of high-dimensional multivariate data.
     - Interesting structures: outliers, clusters, ...
     - Two ingredients: a projection index I that measures how interesting a projection is, and an algorithm that optimizes the index.

  6. Introduction
     - EPP is usually known by statisticians but not used! Well-known statistical software packages do NOT offer PP procedures (only some routines in Fortran, Splus, Matlab and GGobi).
     - Recent applications to anomaly detection in hyperspectral imagery (Achard et al., 2004, Malpica et al., 2008, Smetek and Bauer, 2008).

  7. Introduction: Mathematically
     - Denote by $X$ the $n \times p$ data matrix and by $X_i$ the $p \times 1$ vector of the $i$-th observation.
     - The variables are continuous; the data are centered and scaled (divided by the standard deviation, or made spherical).
     - Consider one-dimensional projections from $\mathbb{R}^p$ to $\mathbb{R}$: $z = X\alpha$, where $\alpha$ is a $p$-dimensional projection vector with $\alpha'\alpha = 1$ and $z$ is the $n$-dimensional vector of coordinates of the projected observations.
     - Define a projection index function $I : \alpha \mapsto I(\alpha)$.
     - Find the projection vectors $\alpha$ that solve $\max_{\{\alpha \in \mathbb{R}^p \mid \alpha'\alpha = 1\}} I(\alpha)$.
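To make the setup on this slide concrete, here is a minimal NumPy sketch of the three steps: sphering the data, projecting onto a unit vector, and maximizing an index. The function names (`sphere`, `project`, `random_search`) are illustrative and not from the talk, and the random search is only a naive baseline standing in for the optimizers discussed later, not the authors' method.

```python
import numpy as np

def sphere(X):
    """Center X and whiten it so the sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # Symmetric inverse square root of the covariance matrix
    # (assumes the covariance has full rank).
    whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return Xc @ whitener

def project(X, alpha):
    """Return z = X alpha after rescaling alpha so that alpha'alpha = 1."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    return X @ alpha

def random_search(X, index, n_trials=2000, seed=0):
    """Naive baseline: keep the best of n_trials random unit directions.

    `index` is any function mapping the projected vector z to a number.
    """
    rng = np.random.default_rng(seed)
    best_alpha, best_value = None, -np.inf
    for _ in range(n_trials):
        alpha = rng.standard_normal(X.shape[1])
        alpha /= np.linalg.norm(alpha)
        value = index(X @ alpha)
        if value > best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value
```

A call such as `random_search(sphere(X), lambda z: np.sum(z**4))` returns a unit direction and the index value it reaches. A pure random search scales poorly with the dimension p.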

  12. Contents
      1. PP indices for multivariate outliers detection: a review
         - First proposals
         - Other proposals
      2. Optimization procedures for PP: a review
         - Strategy for the first proposals
         - Other strategies
      3. New optimization proposals
         - A new strategy
         - Heuristics optimization algorithms
         - Genetic Algorithm
         - Particle Swarm Optimization algorithm
         - Tribes
      4. Illustration with EPP-Lab
      5. Conclusion and perspectives

  13. Section 1. PP indices for multivariate outliers detection: a review

  14. PP indices for multivariate outliers detection: a review. First proposals
      - The definition of an “interesting” projection is discussed in the founding papers on PP (Friedman and Tukey, 1974, Huber, 1985, Jones and Sibson, 1987, and Friedman, 1987).
      - Several arguments support the view that “gaussianity is uninteresting”, so any measure of departure from normality can serve as a PP index.
      - This objective is more general than looking for projections that reveal outlying observations.
      - However, several indices are very sensitive to departures from normality in the tails of the distribution, and so reveal outliers first.

  17. PP indices for multivariate outliers detection: a review. First proposals
      - Friedman-Tukey (1974): $I_{FT}(\alpha) = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{\alpha'(X_i - X_j)}{h}\right)$ with $K(u) = \frac{35}{32}(1 - u^2)^3 I_{\{|u| \le 1\}}$ and $h = 3.12\, N^{-1/6}$ (Klinke, 1997).
      - Friedman (1987): index based on the $L^2$ distance between the projected data distribution and the Gaussian distribution (using expansions based on Legendre polynomials).
      - Kurtosis: $I_{kurt}(\alpha) = \sum_{i=1}^{n} (\alpha' X_i)^4$ (Huber, 1985, Peña and Prieto, 2001).
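The Friedman-Tukey and kurtosis indices above are both simple functions of the projected vector z = X α, and a short NumPy sketch shows how they might be evaluated. Several points are assumptions rather than statements from the slides: the function names are illustrative, N in the bandwidth rule is taken to be the sample size n, and the normalization is read as 1/(n² h²). For fixed n and h that factor is a constant, so it does not change which direction maximizes the index.

```python
import numpy as np

def triweight(u):
    """Triweight kernel K(u) = 35/32 * (1 - u^2)^3 for |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 35.0 / 32.0 * (1.0 - u ** 2) ** 3, 0.0)

def friedman_tukey_index(z, h=None):
    """Friedman-Tukey index of the projected data z = X alpha."""
    z = np.asarray(z, dtype=float)
    n = z.size
    if h is None:
        # Assumption: N in the bandwidth rule is the sample size n.
        h = 3.12 * n ** (-1.0 / 6.0)
    # alpha'(X_i - X_j) / h for all pairs (i, j); uses O(n^2) memory.
    diffs = (z[:, None] - z[None, :]) / h
    # Assumption: normalization read as 1 / (n^2 h^2).
    return triweight(diffs).sum() / (n ** 2 * h ** 2)

def kurtosis_index(z):
    """Kurtosis index: sum of fourth powers of the projected data.

    Assumes the data were centered and scaled beforehand, as in the setup.
    """
    z = np.asarray(z, dtype=float)
    return np.sum(z ** 4)
```

Both functions take the projected coordinates rather than X and α, so the same optimizer can be reused with either index.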

  21. PP indices for multivariate outliers detection: a review. Other proposals. Other indices
      - Measure of outlyingness (Stahel-Donoho): for each observation $i = 1, \ldots, n$, $I_i(\alpha) = \frac{|\alpha' X_i - \mathrm{med}_j(\alpha' X_j)|}{\mathrm{mad}_j(\alpha' X_j)}$, where “med” = median and “mad” = median absolute deviation of the projected data from the median.
      - Dispersion-based indices: a robust dispersion estimator (Li and Chen, 1985, Croux and Ruiz-Gazen, 2005) defines a robust principal component analysis.
      - Juan and Prieto (2001) for concentrated contamination patterns.
      - Hall and Kay (2005): nonparametric atypicality index.
      - Indices adapted to time series, ...
      - Many complementary definitions of indices but ... the main problem with PP: the pursuit is computationally intensive.
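As an illustration of the Stahel-Donoho measure above, the sketch below computes I_i(α) for one direction and then takes, for each observation, the largest value over a set of random unit directions. The slide defines I_i(α) only. Taking the supremum over directions is the usual Stahel-Donoho definition, while approximating it with random directions is an assumption made here for brevity, not the optimization strategy of the talk. The function names are illustrative, and the MAD is left unscaled because the slide mentions no consistency factor.

```python
import numpy as np

def projected_outlyingness(z):
    """I_i(alpha) = |z_i - med(z)| / mad(z) for projected data z = X alpha."""
    z = np.asarray(z, dtype=float)
    med = np.median(z)
    # Unscaled MAD; it must be nonzero for the ratio to be defined.
    mad = np.median(np.abs(z - med))
    return np.abs(z - med) / mad

def stahel_donoho_outlyingness(X, n_directions=500, seed=0):
    """Approximate sup over alpha of I_i(alpha) with random unit directions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = np.zeros(X.shape[0])
    for _ in range(n_directions):
        alpha = rng.standard_normal(X.shape[1])
        alpha /= np.linalg.norm(alpha)
        # Keep, for each observation, the largest outlyingness seen so far.
        out = np.maximum(out, projected_outlyingness(X @ alpha))
    return out
```

Each direction costs two medians over n values, and the number of directions needed grows quickly with p. This is the computational burden that the last bullet on the slide refers to.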
