Robust Location and Scatter Estimation

Robust Location and Scatter Estimators for Multivariate Data Analysis
{robustbase}, {rrcov}
Valentin Todorov
valentin.todorov@chello.at
15.06.2006 useR'2006, Vienna: Valentin Todorov

Outline
• Background and Motivation
• Computing the Robust Estimates
  – Definition and computation
    • MCD, OGK, S, M
  – Object model for robust estimation
  – Comparison to other implementations
• Applications
  – Hotelling $T^2$
  – Robust Linear Discriminant Analysis
• Conclusions and future work

Multivariate location and scatter
• Location: coordinate-wise mean
• Scatter: covariance matrix
  – Variances of the variables on the diagonal
  – Covariance of two variables as off-diagonal elements
• Optimally estimated by the sample mean and sample covariance matrix at any multivariate normal model
• Essential to a number of multivariate data analysis methods
• But extremely sensitive to outlying observations

Example
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with $n = 20$, $\mu = (0, 0)$, $S = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
• Sample correlation: 0.81
• Interchange the largest and smallest value in the first coordinate
• The sample correlation becomes 0.05
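The effect described in the Example slide can be reproduced in miniature. The sketch below is only an illustration in plain Python: the five data points are made up for this purpose and are not the maryo data set, and the helper names `pearson` and `swap_extremes` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def swap_extremes(xs):
    """Return a copy of xs with its largest and smallest values interchanged."""
    out = list(xs)
    i_min, i_max = out.index(min(out)), out.index(max(out))
    out[i_min], out[i_max] = out[i_max], out[i_min]
    return out

# Made-up illustration data (NOT the maryo data set from rrcov).
x = [-3.0, -1.0, 0.0, 1.0, 3.0]
y = [-2.5, -1.2, 0.3, 0.8, 2.6]

r_before = pearson(x, y)
r_after = pearson(swap_extremes(x), y)
print(f"before: {r_before:.2f}, after: {r_after:.2f}")
# prints: before: 0.99, after: -0.76
```

Only two values change position, yet the classical correlation changes sign, which is the sensitivity the slide points to.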
Software for robust estimation of multivariate location and scatter
• S-Plus – covRob in the Robust library
• Matlab – mcdcov in the toolbox LIBRA
• SAS/IML – MCD call
• R – cov.rob and cov.mcd in MASS
• R – covMcd in {robustbase}
• R – CovMcd, CovOgk, CovMest in {rrcov}

Motivation
• R 2.3.1: cov.rob (cov.mcd) in MASS, but
  – Implements a C-step similar to the one in Rousseeuw & Van Driessen (1999), but with no partitioning and no nesting -> very slow for larger data sets
  – No small sample corrections
  – No generic functions print/show, summary, plot
  – No graphical and diagnostic tools

rrcov -> robustbase
- Port of the Fortran code for FAST-MCD and FAST-LTS of Rousseeuw and Van Driessen
- Functions covMcd, ltsReg and the corresponding help files
+ Datasets - Rousseeuw and Leroy (1987), Milk - Daudin (1988), etc.
+ Generic functions print and summary for covMcd
+ Graphical and diagnostic tools based on the robust and classical Mahalanobis distances - plot.mcd
+ Formula interface and generic functions print, summary and predict for ltsReg
+ Graphical and diagnostic tools based on the residuals - plot.lts

rrcov
+ Constrained M-estimates of location and covariance - Rocke (1996)
+ Orthogonalized Gnanadesikan-Kettenring (OGK) – Maronna and Zamar (2002)
+ S4 object model
  + CovMcd
  + CovOgk
  + CovMest
rrcov
» CovSest: S estimates - FAST-S, Salibian & Yohai (2005)
» Trellis style graphics
» Hotelling $T^2$
» Robust Linear Discriminant Analysis with option for stepwise selection of variables
» More data sets

Outline
• Background and Motivation
• Computing the Robust Estimates
  – Definition and computation
    • MCD, OGK, S, M
  – Object model for robust estimation
  – Comparison to other implementations
• Applications
  – Hotelling $T^2$
  – Robust Linear Discriminant Analysis
• Conclusions and future work

Minimum Covariance Determinant Estimator
Given a p-dimensional data set $X = \{x_1, \ldots, x_n\}$
– The MCD estimator (Rousseeuw, 1984) is defined by
  • the subset of h observations out of n whose classical covariance matrix has the smallest determinant
  • the MCD location estimator T is defined by the mean of that subset
  • the MCD scatter estimator C is a multiple of its covariance matrix
  • $n/2 \le h < n$; $h = \lfloor (n + p + 1)/2 \rfloor$ yields the maximal breakdown point (BDP)

Computing of MCD: FAST-MCD
• Consists of three phases: basic C-step iteration, partitioning and nesting
• C-step: move from one approximation $(T_1, C_1)$ of the MCD of a data set $X = \{x_1, \ldots, x_n\}$ to a new one $(T_2, C_2)$ with possibly lower determinant, by computing the distances relative to $(T_1, C_1)$ and then computing $(T_2, C_2)$ for the h observations with the smallest distances.
• C-step iteration:
  – Repeat a number of times (say 500) {
    • start from a trial subset of h points and perform several C-steps
    • keep the 10 best solutions
  }
  – From each of these solutions carry out C-steps until convergence and select the best result
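The C-step and its iteration can be sketched in a few lines. This is a minimal plain-Python illustration for bivariate data only, not the Fortran FAST-MCD code used by the packages: the toy data, the function names, the 20 random starts (instead of about 500) and running every start to convergence are all assumptions made to keep the example small.

```python
import random

def mean_cov(points):
    """Mean and sample covariance (sxx, sxy, syy) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def det2(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy

def sq_distance(pt, center, cov):
    """Squared Mahalanobis distance of pt relative to (center, cov)."""
    sxx, sxy, syy = cov
    dx, dy = pt[0] - center[0], pt[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det2(cov)

def subset_det(data, subset):
    """Covariance determinant of the observations indexed by subset."""
    return det2(mean_cov([data[i] for i in subset])[1])

def c_step(data, subset, h):
    """One C-step: keep the h observations closest to the current (T, C)."""
    center, cov = mean_cov([data[i] for i in subset])
    ranked = sorted(range(len(data)),
                    key=lambda i: sq_distance(data[i], center, cov))
    return sorted(ranked[:h])

def run_c_steps(data, start, h, max_steps=50):
    """Iterate C-steps from a trial subset until the subset stops changing."""
    subset = sorted(start)
    dets = [subset_det(data, subset)]
    for _ in range(max_steps):
        new_subset = c_step(data, subset, h)
        if new_subset == subset:
            break
        subset = new_subset
        dets.append(subset_det(data, subset))
    return subset, dets

# Made-up toy data: eight points near the line y = x plus two outliers.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1),
        (6.0, 6.2), (7.0, 6.9), (8.0, 8.1), (2.0, 9.0), (9.0, 1.0)]
n, p = len(data), 2
h = (n + p + 1) // 2              # h = [(n + p + 1) / 2]

rng = random.Random(0)            # fixed seed, for reproducibility only
runs = [run_c_steps(data, rng.sample(range(n), h), h) for _ in range(20)]
best_subset, best_dets = min(runs, key=lambda r: r[1][-1])
print(h, best_subset, best_dets[-1])
```

Each run records the determinant after every C-step, so the "possibly lower determinant" property can be checked directly: the sequence never increases. A single start can stall in a poor local optimum, which is why many trial subsets are used and only the best solutions are kept.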
Computing of MCD: FAST-MCD
• Partitioning: If the data set is large (e.g. > 600) it is partitioned into (five) disjoint subsets
  – Carry out C-step iterations for each of the subsets
  – Use the best (50) solutions as starting points for C-steps on the entire data set and again keep the best 10 solutions
  – Iterate these 10 solutions to convergence
• Nesting: If the data set is larger than, say, 1500
  – draw a random subset and apply the partitioning procedure to it
  – use the 10 best solutions from the partitioning phase for iterations on the entire data set
• The number of solutions used and the number of C-steps performed on the entire data set depend on its size

Compound Estimators
• MVE and MCD - a first stage procedure
• Rousseeuw and Leroy 1987, Rousseeuw and van Zomeren 1991 - one-step re-weighting
• One-step M-estimates using Huber or Hampel function
• Woodruff and Rocke 1993, 1996 - use MCD as a starting point for S-estimation or constrained M-estimation

Using the estimators: Example
Delivery Time Data – Rousseeuw and Leroy (1987), page 155, table 23 (Montgomery and Peck (1982)).
– 25 observations in 3 variables
  • X1 Number of Products
  • X2 Distance
  • Y Delivery time
– The aim is to explain the time required to service a vending machine (Y) by means of the number of products stocked (X1) and the distance walked by the route driver (X2).
– delivery.x – the X-part of the data set

    > library(rrcov)
    Loading required package: robustbase
    Loading required package: MASS
    Scalable Robust Estimators with High Breakdown Point (version 0.3-03)
    > data(delivery)
    > delivery.x <- as.matrix(delivery[, 1:2])
    > mcd <- CovMcd(delivery.x)
    > mcd

    Call:
    CovMcd(x = delivery.x)

    Robust Estimate of Location:
      n.prod  distance
       5.895   268.053

    Robust Estimate of Covariance:
              n.prod  distance
    n.prod     12.30    232.98
    distance  232.98  56158.36
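The one-step re-weighting listed under Compound Estimators can be sketched as follows. This is a simplified plain-Python illustration for bivariate data, not the rrcov implementation: it assumes hard 0/1 weights with the 0.975 chi-square quantile as cutoff, it omits any consistency or small-sample correction factors, and the toy data, the raw estimate and the name `reweight` are hypothetical.

```python
import math

def mean_cov(points):
    """Mean and sample covariance (sxx, sxy, syy) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def sq_distance(pt, center, cov):
    """Squared Mahalanobis distance of pt relative to (center, cov)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = pt[0] - center[0], pt[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def reweight(data, raw_center, raw_cov, prob=0.975):
    """Give weight 1 to points within the cutoff, then re-estimate from them."""
    # For p = 2 the chi-square quantile has the closed form -2 * log(1 - prob).
    cutoff_sq = -2.0 * math.log(1.0 - prob)
    weights = [1 if sq_distance(pt, raw_center, raw_cov) <= cutoff_sq else 0
               for pt in data]
    kept = [pt for pt, w in zip(data, weights) if w == 1]
    center, cov = mean_cov(kept)
    return center, cov, weights

# Made-up toy data: eight points near the line y = x plus two outliers.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1),
        (6.0, 6.2), (7.0, 6.9), (8.0, 8.1), (2.0, 9.0), (9.0, 1.0)]

# Hypothetical raw estimate: mean and covariance of the first six points.
raw_center, raw_cov = mean_cov(data[:6])
center, cov, weights = reweight(data, raw_center, raw_cov)
print(weights)
# prints: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The re-weighted estimate uses eight observations instead of the six behind the raw estimate, which is the point of the second stage: it recovers efficiency while still excluding the flagged points.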
The CovMcd object
• CovMcd() returns an S4 object of class CovMcd

    > data.class(mcd)
    [1] "CovMcd"

• Input parameters used for controlling the estimation algorithm: alpha, quan, method, n.obs, etc.
• Raw MCD estimates: crit, best, raw.center, raw.cov, raw.mah, raw.wt
• Final (re-weighted) estimates – center, cov, mah, wt …

    > summary(mcd)

    Call:
    CovMcd(x = delivery.x)

    Robust Estimate of Location:
      n.prod  distance
       5.895   268.053

    Robust Estimate of Covariance:
              n.prod  distance
    n.prod     12.30    232.98
    distance  232.98  56158.36

    Eigenvalues of covariance matrix:
    [1] 56159.32    11.34

    Robust Distances:
     [1]  1.51872  0.68199  0.99165  0.73930  0.27939  0.13181  1.37029
     [8]  0.21985 57.68290  2.48532  9.30993  1.70046  0.30187  0.71296
    …

The CovMcd object (cont.)
• show(mcd)
• summary(mcd) – additionally prints the eigenvalues of the covariance and the robust distances.
• plot(mcd) - shows the Mahalanobis distances based on the robust and classical estimates of the location and the scatter matrix in different plots.
  – distance plot
  – distance-distance plot
  – chi-square plot
  – tolerance ellipses
  – scree plot

Plot of the Robust Distances
• The Mahalanobis distances based on the robust estimates – the outliers have large $RD_i$
• A line is drawn at $y = \text{cutoff} = \sqrt{\chi^2_{p,\,0.975}}$
• The observations with $RD_i \ge \text{cutoff} = \sqrt{\chi^2_{p,\,0.975}}$ are identified by their subscript
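The cutoff rule from the distance plot can be checked by hand for p = 2, where the chi-square quantile has a closed form. The sketch below is plain Python, with `cutoff_p2` a hypothetical helper name. It applies the rule to the 14 values shown in the summary(mcd) output; since the slides do not state whether the printed values are distances or squared distances, both readings are compared.

```python
import math

def cutoff_p2(prob=0.975):
    """Square root of the chi-square quantile with 2 degrees of freedom.

    Valid for p = 2 only, where the quantile equals -2 * log(1 - prob).
    """
    return math.sqrt(-2.0 * math.log(1.0 - prob))

# First 14 values printed under "Robust Distances" in the summary(mcd) output.
printed = [1.51872, 0.68199, 0.99165, 0.73930, 0.27939, 0.13181, 1.37029,
           0.21985, 57.68290, 2.48532, 9.30993, 1.70046, 0.30187, 0.71296]

cutoff = cutoff_p2()

# Reading 1: the printed values are distances, compared with the cutoff.
flagged = [i for i, d in enumerate(printed, start=1) if d >= cutoff]

# Reading 2: the printed values are squared distances, compared with cutoff^2.
flagged_if_squared = [i for i, d in enumerate(printed, start=1)
                      if d >= cutoff ** 2]

print(round(cutoff, 3), flagged, flagged_if_squared)
# prints: 2.716 [9, 11] [9, 11]
```

Under either reading, observations 9 and 11 are the ones that would be labelled among those shown.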