
Finding Multivariate Outlier: Applied Multivariate Statistics, Spring 2012



  1. Finding Multivariate Outlier Applied Multivariate Statistics – Spring 2012

  2. Goals
     - Concept: Detecting outliers with (robustly) estimated Mahalanobis distance and QQ-plot
     - R: chisq.plot, pcout from package “mvoutlier”

  3. Outlier in one dimension: easy
     - Look at scatterplots
     - Find dimensions of outliers
     - Find extreme samples just in these dimensions
     - Remove outliers

  4. 2d: More tricky [Figure; labels: “No outlier in x or y”, “Outlier”]

  5. Recap: Mahalanobis distance
     - True Mahalanobis distance: $MD(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$
     - Estimated Mahalanobis distance: $MD(x) = \sqrt{(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})}$
     - Squared Mahalanobis distance: $MD^2(x)$ = squared distance from the mean, measured in standard deviations IN THE DIRECTION OF $x$

  6. Mahalanobis distance: Example with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$

  7. Mahalanobis distance: Example with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$. Point (20, 0): MD = 4

  8. Mahalanobis distance: Example with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$. Point (0, 10): MD = 10

  9. Mahalanobis distance: Example with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$. Point (10, 7): MD = 7.3
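
The three example distances can be checked with a few lines of code. The sketch below is written in Python rather than the course's R and uses only the standard library; the helper name `mahalanobis_2d` is made up for illustration and assumes a 2x2 covariance matrix.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2-d point x for mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

mu = (0.0, 0.0)
sigma = ((25.0, 0.0), (0.0, 1.0))

for point in [(20, 0), (0, 10), (10, 7)]:
    print(point, round(mahalanobis_2d(point, mu, sigma), 1))
```

This reproduces the values 4, 10 and 7.3 from the slides; the last one is exactly sqrt(53). The point (0, 10) is closer to the mean than (20, 0) in Euclidean terms but much farther in Mahalanobis terms, because the variance in the second coordinate is only 1.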

  10. Theory of Mahalanobis Distance
      - Assume the data are multivariate normally distributed (d dimensions)
      - Then the squared Mahalanobis distance of the samples follows a Chi-Square distribution with d degrees of freedom
      - (“By definition”: the sum of d squared independent standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
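
The Chi-Square statement refers to the squared distance and follows from standardizing the data. A short derivation, assuming $x \sim N_d(\mu, \Sigma)$ with invertible $\Sigma$ (the symbol $z$ is introduced here only for illustration):

```latex
\begin{align*}
z &= \Sigma^{-1/2}(x - \mu) \sim N_d(0, I_d), \\
MD^2(x) &= (x - \mu)^T \Sigma^{-1} (x - \mu) = z^T z = \sum_{j=1}^{d} z_j^2 \sim \chi^2_d .
\end{align*}
```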

  11. Check for multivariate outlier
      - Are there samples whose estimated (squared) Mahalanobis distance does not fit a Chi-Square distribution at all?
      - Check with a QQ-plot
      - Technical details:
        - the Chi-Square distribution is still a reasonably good approximation for the estimated Mahalanobis distance
        - use robust estimates for $\mu$, $\Sigma$
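
A minimal sketch of the coordinates behind such a QQ-plot, in Python with the standard library only. It assumes d = 2, where the Chi-Square quantile has the closed form -2 ln(1 - p); the squared distances are made up, the function names are hypothetical, and the plotting positions (i - 0.5)/n are one common convention rather than anything prescribed by the slides.

```python
import math

def chisq2_quantile(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    # For d = 2 the cdf is 1 - exp(-q / 2), so it can be inverted in closed form.
    return -2.0 * math.log(1.0 - p)

def qq_pairs(sq_distances):
    """Pair sorted squared distances with chi-square(2) quantiles."""
    n = len(sq_distances)
    ordered = sorted(sq_distances)
    # enumerate starts at 0, so (i + 0.5) / n equals (rank - 0.5) / n.
    return [(chisq2_quantile((i + 0.5) / n), d2) for i, d2 in enumerate(ordered)]

# Hypothetical squared Mahalanobis distances; the last value is a planted outlier.
sq_md = [0.3, 0.9, 1.4, 2.1, 0.6, 3.2, 1.1, 4.0, 2.7, 30.0]

for theoretical, observed in qq_pairs(sq_md):
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

Points that follow the Chi-Square distribution lie near the diagonal of this table; the planted value 30 is paired with a theoretical quantile of about 6 and so sticks out clearly.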

  12. Robust Estimates: Income of 7 people [Figure; labels: “Robust Scatter”, “Std. Dev.”]

  13. [Figure, continued; labels: “Robust”, “Std. Dev.”]

  14. [Figure, continued; labels: “Robust”, “Std. Dev.”]

  15. Robust Estimates for outlier detection
      - If scatter is estimated robustly, outliers “stick out” much more
      - Robust Mahalanobis distance: mean and covariance matrix are estimated robustly
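
A one-dimensional sketch of why robust scatter makes outliers stick out, in Python with the standard library only. The seven incomes are invented (the values behind the slide's figure are not given), and median and MAD are used here as one possible pair of robust estimators; the slides do not say which estimator the figure uses.

```python
import statistics

# Hypothetical incomes of 7 people (in thousands); the last one is extreme.
incomes = [30, 32, 35, 38, 40, 42, 1000]

# Classical location and scale: both are pulled towards the extreme value.
mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)

# Robust location and scale: median and MAD, with the MAD scaled by 1.4826
# so that it estimates the standard deviation for normally distributed data.
median = statistics.median(incomes)
mad = 1.4826 * statistics.median([abs(x - median) for x in incomes])

extreme = incomes[-1]
classical_z = (extreme - mean) / sd
robust_z = (extreme - median) / mad

print(f"classical: {classical_z:.1f} standard deviations from the mean")
print(f"robust:    {robust_z:.1f} robust standard deviations from the median")
```

With the classical estimates the extreme income is only a bit more than two standard deviations from the mean, because it inflates the standard deviation itself. With the robust estimates it is more than a hundred scale units away.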

  16. Example - continued: Outlier easily detected!

  17. Outliers in >2d can be well hidden! No outlier, right?

  18. Outliers in >2d can be well hidden! Wrong!

  19. Outliers in >2d can be well hidden! This outlier can’t be seen in the scatterplot matrix (but it can be seen in a 3d plot)

  20. Method 1: Quantile of Chi-Square distribution
      - Compute for each sample (in d dimensions) the robustly estimated squared Mahalanobis distance $MD^2(x_i)$
      - Compute the 97.5% quantile Q of the Chi-Square distribution with d degrees of freedom
      - All samples with $MD^2(x_i) > Q$ are declared outliers
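
A minimal sketch of the quantile rule, in Python with the standard library only. The comparison is made on the squared-distance scale, since the Chi-Square quantile refers to the squared Mahalanobis distance. Several things are assumptions made for illustration: d = 2 (so the quantile has a closed form), a diagonal covariance, made-up samples, and the known $\mu$ and $\Sigma$ of the earlier example standing in for robust estimates. The helper name is hypothetical.

```python
import math

def sq_mahalanobis_diag(x, mu, variances):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, variances))

# Stand-ins for robust estimates of mean and (diagonal) covariance.
mu = (0.0, 0.0)
variances = (25.0, 1.0)

# 97.5% quantile of the chi-square distribution with d = 2 degrees of freedom.
# For d = 2 the cdf is 1 - exp(-q / 2), which inverts to -2 * ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

samples = [(3.0, 0.5), (-6.0, 1.0), (20.0, 0.0), (0.0, 10.0), (10.0, 7.0)]
outliers = [x for x in samples if sq_mahalanobis_diag(x, mu, variances) > cutoff]

print(round(cutoff, 2), outliers)
```

The cutoff is about 7.38, so the three example points with squared distances 16, 100 and 53 are flagged, while the first two samples are not. For general d the quantile would come from a statistics library rather than a closed form.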

  21. Method 2: Adjusted Quantile
      - Adjusted quantile for outliers: depends on the distance between the cdf of the Chi-Square distribution and the ecdf of the samples in the tails
      - Simulate “normal” deviations in the tails
      - Outliers have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers)

  22. Method 2: Adjusted Quantile [Figure; annotations: “ECDF leaves ‘plausible’ range”, “Defines adaptive cutoff”]

  23. Method 2: Adjusted Quantile. R: function “aq.plot”
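
The idea behind the adjusted quantile can be sketched as follows, in Python with the standard library only. This is not the algorithm implemented in “aq.plot”; it only illustrates comparing the tail deviation of the data with the largest tail deviation seen in 100 outlier-free simulations. Assumptions made here: d = 2, a one-sided deviation (Chi-Square cdf above the ecdf), invented data, and hypothetical function names.

```python
import math
import random

def chisq2_cdf(u):
    """cdf of the chi-square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-u / 2.0)

def tail_deviation(sq_distances, delta):
    """Largest amount by which the chi-square cdf exceeds the ecdf for u >= delta."""
    ordered = sorted(sq_distances)
    n = len(ordered)
    # Deviation at u = delta itself.
    at_delta = sum(1 for d2 in ordered if d2 <= delta) / n
    best = max(0.0, chisq2_cdf(delta) - at_delta)
    # Beyond delta the supremum is approached just below each jump of the ecdf.
    for d2 in ordered:
        if d2 > delta:
            below = sum(1 for other in ordered if other < d2) / n
            best = max(best, chisq2_cdf(d2) - below)
    return best

def simulated_reference(n, delta, n_sim=100, seed=0):
    """Largest tail deviation seen in n_sim outlier-free simulated samples."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_sim):
        # A chi-square(2) variable is exponential with mean 2.
        sample = [rng.expovariate(0.5) for _ in range(n)]
        worst = max(worst, tail_deviation(sample, delta))
    return worst

delta = -2.0 * math.log(0.025)  # 97.5% quantile for d = 2

# Hypothetical data: 50 "clean" squared distances placed at chi-square quantiles,
# plus 10 planted outliers.
clean = [-2.0 * math.log(1.0 - (i + 0.5) / 50) for i in range(50)]
data = clean + [60.0] * 10

observed = tail_deviation(data, delta)
reference = simulated_reference(len(data), delta)
print("abnormally large tail deviation:", observed > reference)
```

When the observed deviation exceeds the simulated reference, the ecdf has left the “plausible” range, and the point where it does so can serve as an adaptive cutoff instead of the fixed 97.5% quantile.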

  24. Method 3: State of the art - pcout
      - Complex method based on robust principal components
      - Pretty involved methodology
      - Very fast; good for high dimensions
      - R: function “pcout” in package “mvoutlier”
        - $wfinal01: 0 marks an outlier
        - $wfinal: smaller values indicate more severe outliers
      - P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694-1711, 2008

  25. Automatic outlier detection
      - It is always better to look at a QQ-plot to find outliers! Just find points “sticking out”; no distributional assumption
      - If you can’t: Automatic outlier detection
        - usually finds too many or too few outliers, depending on parameter settings
        - depends on distribution assumptions (e.g. multivariate normality)
        + good for screening large amounts of data

  26. Concepts to know
      - Find multivariate outliers with robustly estimated Mahalanobis distance
      - Cutoff
        - by eye (best method)
        - quantile of Chi-Square distribution

  27. R commands to know
      - chisq.plot, pcout in package “mvoutlier”

  28. Next week
      - Missing values
