Robust Statistics using Stata First Belgian Stata Users Meeting Vincenzo Verardi Fnrs, UNamur, ULB September 2016 Vincenzo Verardi (Fnrs, UNamur, ULB) First Belgian Stata Users Meeting 6/09/2016 1 / 77
Outliers do matter and are not always bad
[Figure: August Landmesser]
Structure of the presentation
- Introduction
- Descriptive statistics
- Univariate outlier identification
- Regression models
- Multivariate analysis
- Multivariate outlier identification
- Robust logit
- Conclusion
Outliers do matter and are not necessarily coding errors
[Figure: Star Cluster CYG OB1, Hertzsprung-Russell data. Log of light intensity plotted against log of temperature, with the least squares fit and the robust estimator fit. Source: P. J. Rousseeuw and A. M. Leroy (1987)]
[Figure: Brain and body weights, 65 species of land animal. Log of brain weight plotted against log of body weight. Robust fit: y = 1.98 + 0.75x; LS fit: y = 2.17 + 0.59x. Labelled points: Human, Brachiosaurus, Triceratops, Diplodocus, Water opossum. Source: Weisberg, S. (1985)]
[Figure: Number of international calls from Belgium (Belgian Statistical Survey, Ministry of Economy), plotted by year (50 to 75), with the least squares fit and the robust estimator fit. Source: P. J. Rousseeuw and A. M. Leroy (1987)]
Measuring robustness of an estimator
Sensitivity curve, see inter alia [Maronna et al., 2006]

Let us consider a data set $X_n = \{x_1, \dots, x_n\}$ and the statistic $T_n = T_n(x_1, \dots, x_n)$. To study the impact of a potential outlier on this statistic, we may analyze how the value of the statistic changes when we add an extra data point $x$ and let it move along the whole real line (from $-\infty$ to $+\infty$). The (standardized) sensitivity curve of the statistic $T_n$ for the sample $X_n$ is defined by

$$SC(x; T_n, X_n) = \frac{T_{n+1}(x_1, \dots, x_n, x) - T_n(x_1, \dots, x_n)}{\frac{1}{n+1}};$$

for each value of $x$, we compare the value of the statistic in the "contaminated" sample with its value in the initial sample, and rescale the difference by dividing by $1/(n+1)$, the amount of contamination.
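The definition above can be evaluated directly. The following is a minimal Python sketch (not taken from the slides) that computes the standardized sensitivity curve of the mean and of the median on a simulated N(0,1) sample of size 20. The seed, the sample, the evaluation points, and the helper name `sensitivity_curve` are illustrative assumptions.

```python
import random
import statistics


def sensitivity_curve(stat, sample, x):
    """Standardized sensitivity curve SC(x; T_n, X_n) of `stat` at the point x."""
    n = len(sample)
    contaminated = stat(sample + [x])  # T_{n+1}(x_1, ..., x_n, x)
    original = stat(sample)            # T_n(x_1, ..., x_n)
    # Dividing by the amount of contamination 1/(n+1) is multiplying by (n+1).
    return (contaminated - original) * (n + 1)


rng = random.Random(1234)  # illustrative seed (assumption)
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]

for x in (-100.0, -5.0, 0.0, 5.0, 100.0):
    sc_mean = sensitivity_curve(statistics.mean, sample, x)
    sc_median = sensitivity_curve(statistics.median, sample, x)
    print(f"x = {x:7.1f}   SC(mean) = {sc_mean:9.3f}   SC(median) = {sc_median:7.3f}")
```

For the mean, the curve works out algebraically to $x$ minus the sample mean, so it is unbounded in $x$. For the median, it stops changing once $x$ lies beyond the central order statistics.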
[Figure: Standardized sensitivity curves of the mean and the median, X ~ N(0,1), N = 20.]
Influence function

The influence function (IF) can be considered as an asymptotic version of the sensitivity curve of the statistic $T_n$ when the sample size $n$ grows, that is, when the empirical distribution function $F_n$ tends to the underlying population distribution function $F$:

$$IF(x; T, F) = \lim_{n \to \infty} \frac{T\left(\left(1 - \frac{1}{n+1}\right) F + \frac{1}{n+1} \Delta_x\right) - T(F)}{\frac{1}{n+1}} = \lim_{\varepsilon \to 0} \frac{T\left((1 - \varepsilon) F + \varepsilon \Delta_x\right) - T(F)}{\varepsilon},$$

where $\Delta_x$ denotes the probability distribution putting all its mass at the point $x$. This function measures the effect on $T$ of a perturbation of $F$ obtained by adding a small probability mass at the point $x$.
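The limit in the definition can be approximated numerically by fixing a small $\varepsilon$. The Python sketch below (an illustration, not part of the original slides) does this for the mean and the median when $F$ is the standard normal. The function names, the choice $\varepsilon = 10^{-6}$, and the closed-form expressions for the functionals under the contaminated distribution are assumptions made for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def mean_contaminated(x, eps):
    """Mean of the mixture (1 - eps) * N(0,1) + eps * Delta_x."""
    return (1.0 - eps) * 0.0 + eps * x


def median_contaminated(x, eps):
    """Median of the mixture (1 - eps) * N(0,1) + eps * Delta_x, for eps < 0.5."""
    # Median when the point mass sits above it: (1 - eps) * Phi(m) = 0.5.
    upper = STD_NORMAL.inv_cdf(0.5 / (1.0 - eps))
    # Median when the point mass sits below it: (1 - eps) * Phi(m) + eps = 0.5.
    lower = STD_NORMAL.inv_cdf((0.5 - eps) / (1.0 - eps))
    if x > upper:
        return upper
    if x < lower:
        return lower
    return x  # the point mass itself carries the 50% level


def influence_approx(functional, x, eps=1e-6):
    """Finite-eps approximation of IF(x; T, F) for F = N(0,1)."""
    return (functional(x, eps) - functional(x, 0.0)) / eps


for x in (-3.0, -1.0, 1.0, 3.0):
    print(x, influence_approx(mean_contaminated, x), influence_approx(median_contaminated, x))
```

The approximation for the mean grows linearly with $x$, while for the median it stays close to $\pm\sqrt{\pi/2}$ whatever the size of $x$.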
[Figure: Influence functions of the mean and the median, X ~ N(0,1).]
Gross-error sensitivity

The gross-error sensitivity of $T$ at distribution $F$, defined by

$$\gamma^*(T, F) = \sup_x |IF(x; T, F)|,$$

evaluates the biggest influence that an outlier may have on $T$. From the robustness point of view, it is of course preferable to use an estimator for which $\gamma^*(T, F)$ is finite (i.e. bounded IF).

Local-shift sensitivity

The local-shift sensitivity measures the effect on $T$ of a small perturbation of the value of $x$. It is defined by

$$\lambda^*(T, F) = \sup_{x \neq y} \frac{|IF(y; T, F) - IF(x; T, F)|}{|y - x|}.$$

From the robustness point of view, it is of course preferable to use an estimator for which the IF is smooth everywhere.
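Both quantities can be approximated on a grid once the influence function is known. The Python sketch below is an illustration under stated assumptions: it uses the standard closed forms at $F = N(0,1)$, namely $IF(x) = x$ for the mean and $IF(x) = \operatorname{sign}(x)\sqrt{\pi/2}$ for the median. It restricts the suprema to a finite grid on $[-10, 10]$ and, for the local-shift sensitivity, to neighbouring grid points, so the results are lower bounds only. All function names are hypothetical.

```python
import math

SQRT_HALF_PI = math.sqrt(math.pi / 2.0)


def if_mean(x):
    """Influence function of the mean at N(0,1): x minus the population mean 0."""
    return x


def if_median(x):
    """Influence function of the median at N(0,1): sign(x) * sqrt(pi / 2)."""
    if x == 0.0:
        return 0.0
    return math.copysign(SQRT_HALF_PI, x)


def gross_error_sensitivity(influence, grid):
    """Largest |IF(x)| over the grid (lower bound on the supremum)."""
    return max(abs(influence(x)) for x in grid)


def local_shift_sensitivity(influence, grid):
    """Largest slope |IF(y) - IF(x)| / |y - x| over neighbouring grid points."""
    return max(abs(influence(y) - influence(x)) / abs(y - x)
               for x, y in zip(grid, grid[1:]))


grid = [i / 100.0 for i in range(-1000, 1001)]  # [-10, 10] in steps of 0.01

for name, influence in (("mean", if_mean), ("median", if_median)):
    print(name,
          gross_error_sensitivity(influence, grid),
          local_shift_sensitivity(influence, grid))
```

Widening the grid increases the value reported for the mean without limit (unbounded IF), while refining the grid increases the local-shift value of the median without limit (jump of the IF at 0). This is the usual trade-off between the two criteria.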
Breakdown point

The sensitivity curve shows how an estimator reacts to the introduction of a single outlier. Some estimators have a bounded sensitivity curve (SC) and therefore resist this contamination. However, the number of outliers in a sample may be so large that even estimators with a bounded SC break down. The breakdown point is, roughly, the smallest amount of contamination in the sample that may cause the estimator to take on arbitrary values.

Example: If the $i$th observation among $x_1, \dots, x_n$ goes to infinity, the sample mean $\mu_n$ goes to infinity as well. This means that the finite-sample breakdown point of this statistic is only $1/n$. In contrast, the finite-sample breakdown point of the median $Q_{0.5;n}$ is $\frac{n/2}{n}$ if $n$ is even and $\frac{(n+1)/2}{n}$ if $n$ is odd.
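The example can be checked empirically by replacing observations one at a time with a very large value until the statistic "explodes". The Python sketch below is a heuristic illustration, not a formal computation of the breakdown point. The replacement value `1e12`, the threshold `1e6`, the seed, and the function name are assumptions chosen for the example.

```python
import random
import statistics


def smallest_breaking_count(stat, sample, bad_value=1e12, threshold=1e6):
    """Smallest number k of replaced observations for which |stat| exceeds threshold."""
    n = len(sample)
    for k in range(1, n + 1):
        corrupted = [bad_value] * k + sample[k:]  # replace the first k observations
        if abs(stat(corrupted)) > threshold:
            return k
    return None


rng = random.Random(1234)  # illustrative seed (assumption)
even_sample = [rng.gauss(0.0, 1.0) for _ in range(10)]
odd_sample = [rng.gauss(0.0, 1.0) for _ in range(11)]

for label, sample in (("n = 10", even_sample), ("n = 11", odd_sample)):
    k_mean = smallest_breaking_count(statistics.mean, sample)
    k_median = smallest_breaking_count(statistics.median, sample)
    print(f"{label}: mean breaks at k = {k_mean}, median breaks at k = {k_median}")
```

With these settings the mean breaks at $k = 1$ in both samples, while the median breaks at $k = 5 = n/2$ for $n = 10$ and at $k = 6 = (n+1)/2$ for $n = 11$, in line with the fractions given in the example.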
Choosing a good (robust) estimator
- Fisher-consistent. If the estimator were calculated using the entire population rather than a sample, it should return the true value of the estimated parameter.
- Bounded influence function (low gross-error sensitivity). The biggest influence that an outlier may have on the estimator should be limited.
- Smooth influence function (low local-shift sensitivity). The effect on the estimator of a small perturbation in the data should be limited.
- High breakdown point. The estimator must withstand contamination of a large proportion of the data.
- Highly efficient, with a convergence rate of $\sqrt{n}$.
- Computationally feasible.

Compromises must often be made to achieve good performance.
Descriptive statistics
Location parameters

Several measures of location are available in the literature. We compare i) two "classical" estimators based on (centered) moments of the empirical distribution, ii) an estimator based on quantiles of the distribution, and iii) an estimator based on pairwise comparisons of the observations.

- Classical estimator (mean): $\mu_n = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Classical estimator (trimmed mean): $\mu_n^{\alpha} = \frac{1}{n - 2\lfloor \alpha n \rfloor} \sum_{i = \lfloor \alpha n \rfloor + 1}^{n - \lfloor \alpha n \rfloor} x_{(i)}$, where $x_{(i)}$ denotes the $i$th order statistic
- Quantile-based estimator (median): $Q_{0.5} = \operatorname{med}\{x_i\}$
- Pairwise-based estimator [Hodges and Lehmann, 1963]: $HL_n = \operatorname{med}\left\{\frac{x_i + x_j}{2};\ i < j\right\}$
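The four estimators translate almost line by line into code. The following Python sketch is a plain transcription of the formulas above, using only the standard library. The function names and the five-point data set are illustrative assumptions, and the Hodges-Lehmann function uses the naive enumeration of all pairs, which is quadratic in $n$ and only suitable for small samples.

```python
import math
import statistics
from itertools import combinations


def trimmed_mean(sample, alpha):
    """Mean of the order statistics after dropping floor(alpha * n) from each tail."""
    ordered = sorted(sample)
    n = len(ordered)
    k = math.floor(alpha * n)  # note: alpha * n is subject to floating-point rounding
    kept = ordered[k:n - k]
    return sum(kept) / (n - 2 * k)


def hodges_lehmann(sample):
    """Median of the pairwise means (x_i + x_j) / 2 over all pairs i < j."""
    return statistics.median((a + b) / 2.0 for a, b in combinations(sample, 2))


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier (illustrative data)

print("mean         ", statistics.mean(data))
print("trimmed mean ", trimmed_mean(data, 0.2))
print("median       ", statistics.median(data))
print("HL           ", hodges_lehmann(data))
```

On this toy sample the mean is 22, while the 20% trimmed mean and the median are both 3 and the Hodges-Lehmann estimate is 3.25: the single outlier drags only the mean away from the bulk of the data.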
Location parameters: influence functions
[Figure: Influence functions (IF plotted against x) of $\mu$, $Q_{0.5}$, $HL$, $\mu^{0.25}$ and $\mu^{0.05}$.]
Location parameters: comparing properties of location estimators

| Estimator | ASV(·, Φ) value | Asymptotic breakdown | Computational complexity |
| $\mu_n$ | 1 | 0% | $O(n)$ |
| $\mu_n^{\alpha}$, $\alpha = 0.05$ | 1.0263 | $100\alpha$% | $O(n)$ |
| $\mu_n^{\alpha}$, $\alpha = 0.10$ | 1.0604 | $100\alpha$% | $O(n)$ |
| $\mu_n^{\alpha}$, $\alpha = 0.25$ | 1.1952 | $100\alpha$% | $O(n)$ |
| $Q_{0.5;n}$ | $\pi/2 = 1.5708$ | 50% | $O(n)$ |
| $HL_n$ | $\pi/3 = 1.0472$ | 29% | $O(n \log n)$ |
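The asymptotic variances in the table can be checked roughly by simulation: draw many N(0,1) samples of size $n$, apply each estimator, and multiply the variance of the estimates by $n$. The Python sketch below does this for the mean, the 25% trimmed mean, and the median. The sample size, number of replications, seed, and function names are assumptions, and because $n$ is finite and the simulation is noisy, the results should only be expected to land in the neighbourhood of the tabulated values.

```python
import random
import statistics


def trimmed_mean_25(sample):
    """25% trimmed mean: drop floor(0.25 * n) observations from each tail."""
    ordered = sorted(sample)
    k = len(ordered) // 4
    return statistics.fmean(ordered[k:len(ordered) - k])


def scaled_variance(estimator, n=101, reps=2000, seed=1234):
    """Monte Carlo estimate of n * Var(estimator) over N(0,1) samples of size n."""
    rng = random.Random(seed)  # same seed, so every estimator sees the same samples
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(estimator(sample))
    return n * statistics.pvariance(estimates)


results = {
    "mean": scaled_variance(statistics.fmean),
    "trimmed mean (0.25)": scaled_variance(trimmed_mean_25),
    "median": scaled_variance(statistics.median),
}

for name, value in results.items():
    print(f"{name:20s} n * Var = {value:.3f}")
```

The Hodges-Lehmann estimator is left out here only because the naive pairwise computation would make the simulation slow in pure Python.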
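Location parameters: Stata example

Clean dataset:

    clear
    set seed 1234
    set obs 10000
    drawnorm z
    * x is the uncontaminated variable
    gen x = z
    * classical summary statistics (mean, median, ...)
    sum x, d
    * Hodges-Lehmann estimator (robstat is a user-written command, assumed installed)
    robstat x, stat(hl)

Contaminated dataset:

    clear
    set seed 1234
    set obs 10000
    drawnorm z
    gen x = z
    * shift the first 100 observations by 10; a bare "gen ... in 1/100" would leave all other observations missing
    replace x = z + 10 in 1/100
    sum x, d
    robstat x, stat(hl)

Results:

| Dataset | | $\mu_n$ | $Q_{0.5;n}$ | $HL_n$ |
| Clean | Value | -0.00 | -0.01 | -0.01 |
| Clean | Time | 0.01 | 0.01 | 0.40 |
| Contaminated | Value | 1.00 | 0.00 | 0.01 |
| Contaminated | Time | 0.01 | 0.01 | 0.41 |

Note that shifting 100 of the 10,000 observations by 10 moves the mean by about 0.10. The reported mean of 1.00 would correspond to a larger contaminated share, for example the first 1,000 observations (in 1/1000).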