
Robust Statistics Part 1: Introduction and univariate data

Peter Rousseeuw
LARS-IASC School, May 2019

General references

- Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York.
- Rousseeuw, P.J., Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York.
- Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. John Wiley and Sons, Chichester.
- Hubert, M., Rousseeuw, P.J., Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23, 92–119.

wis.kuleuven.be/stat/robust

Outline of the course

- General notions of robustness
- Robustness for univariate data
- Multivariate location and scatter
- Linear regression
- Principal component analysis
- Advanced topics

General notions of robustness: Outline

1. Introduction: outliers and their effect on classical estimators
2. Measures of robustness: breakdown value, sensitivity curve, influence function, gross-error sensitivity, maxbias curve.

General notions of robustness: Introduction

What is robust statistics?

Real data often contain outliers. Most classical methods are highly influenced by these outliers.

Robust statistical methods try to fit the model imposed by the majority of the data. They aim to find a "robust" fit, which is similar to the fit we would have found without the outliers. This allows for outlier detection: flag those observations deviating from the robust fit.

Two questions arise: What is an outlier? How much is the majority?

Assumptions

We assume that the majority of the observations satisfy a parametric model, and we want to estimate the parameters of this model. E.g.

- $x_i \sim N(\mu, \sigma^2)$
- $x_i \sim N_p(\boldsymbol{\mu}, \Sigma)$
- $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$

Moreover, we assume that some of the observations might not satisfy this model.

- We do NOT model the outlier generating process.
- We do NOT know the proportion of outliers in advance.

Example

The classical methods for estimating the parameters of the model may be affected by outliers.

Example. Location-scale model: $x_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$.

Data: $X_n = \{x_1, \ldots, x_{10}\}$ are the natural logarithms of the annual incomes (in US dollars) of 10 people.

9.52 9.68 10.16 9.96 10.08 9.99 10.47 9.91 9.92 15.21

The income of person 10 is much larger than the other values. Normality cannot be rejected for the remaining ("regular") observations:

[Figure: two normal Q-Q plots (sample quantiles against theoretical quantiles), one of all observations and one of all observations except the largest.]

Classical versus robust estimators

Location:

Classical estimator: arithmetic mean

$$\hat{\mu} = \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Robust estimator: sample median

$$\hat{\mu} = \mathrm{med}(X_n) = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even} \end{cases}$$

with $x_{(1)} \leqslant x_{(2)} \leqslant \ldots \leqslant x_{(n)}$ the ordered observations.

Scale:

Classical estimator: sample standard deviation

$$\hat{\sigma} = \mathrm{Stdev}_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$$

Robust estimator: interquartile range

$$\hat{\sigma} = \mathrm{IQRN}(X_n) = \frac{1}{2\Phi^{-1}(0.75)} \left( x_{(n-[n/4]+1)} - x_{([n/4])} \right)$$

For the data of the example we obtain:

| | the 9 regular observations | all 10 observations |
|---|---|---|
| $\bar{x}_n$ | 9.97 | 10.49 |
| med | 9.96 | 9.98 |
| $\mathrm{Stdev}_n$ | 0.27 | 1.68 |
| IQRN | 0.13 | 0.17 |

1. The classical estimators are highly influenced by the outlier.
2. The robust estimators are less influenced by the outlier.
3. The robust estimate computed from all observations is comparable with the classical estimate applied to the non-outlying data.

Robustness: being less influenced by outliers.

Efficiency: being precise at uncontaminated data.

Robust estimators aim to combine high robustness with high efficiency.
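The comparison in the table can be reproduced with a short Python sketch that uses only the standard library. The names `log_incomes`, `iqrn` and `summary` are illustrative and do not come from the slides. One assumption matters here: the quartiles are computed with linear interpolation (`method="inclusive"`), which appears to match the IQRN values reported in the table; applying the order-statistic formula literally can give slightly different numbers in a sample this small.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

# Natural logarithms of the 10 annual incomes from the example.
log_incomes = [9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21]


def iqrn(values):
    """Interquartile range divided by 2 * Phi^{-1}(0.75), about 1.349.

    Assumption: quartiles use linear interpolation between order statistics.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (2 * NormalDist().inv_cdf(0.75))


def summary(values):
    """Classical and robust estimates of location and scale."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # uses the n - 1 denominator
        "iqrn": iqrn(values),
    }


regular = summary(log_incomes[:-1])  # the 9 regular observations
everyone = summary(log_incomes)      # all 10 observations

for name in regular:
    print(f"{name:>6}: {regular[name]:7.3f} {everyone[name]:7.3f}")
```

Adding the single outlier moves the mean by about half a unit and multiplies the standard deviation by roughly six, while the median and the IQRN change only in the second decimal.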

Outlier detection

The usual standardized values ($z$-scores, standardized residuals) are:

$$r_i = \frac{x_i - \bar{x}_n}{\mathrm{Stdev}_n}$$

Classical rule: if $|r_i| > 3$, then observation $x_i$ is flagged as an outlier.

Here: $|r_{10}| = 2.8$ → ?

Outlier detection based on robust estimates:

$$r_i = \frac{x_i - \mathrm{med}(X_n)}{\mathrm{IQRN}(X_n)}$$

Here: $|r_{10}| = 31.0$ → very pronounced outlier!

MASKING is when actual outliers are not detected.

SWAMPING is when regular observations are flagged as outliers.

Remark

In this example the classical and the robust fits are quite different, and from the robust residuals we see that one of the observations deviates strongly from the others. For the remaining 9 observations a normal model seems appropriate.

It could also be argued that the normal model may not be appropriate itself, and that all 10 observations could have been generated from a single long-tailed or skewed distribution.

We could try to decide which of the two models is more appropriate if we had a much bigger sample. Then we could fit a long-tailed distribution and apply a goodness-of-fit test of that model, and compare it with the goodness-of-fit of the normal model on the non-outlying data.
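The two flagging rules can be compared on the income data with a minimal Python sketch. The function names (`classical_scores`, `robust_scores`, `flag_outliers`) are hypothetical, and the IQRN is again computed with linearly interpolated quartiles, which is an assumption rather than something the slides specify.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

log_incomes = [9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21]


def iqrn(values):
    """Normalized interquartile range (assumes linearly interpolated quartiles)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (2 * NormalDist().inv_cdf(0.75))


def classical_scores(values):
    """Standardize with the mean and the sample standard deviation."""
    center, scale = mean(values), stdev(values)
    return [(x - center) / scale for x in values]


def robust_scores(values):
    """Standardize with the median and the IQRN."""
    center, scale = median(values), iqrn(values)
    return [(x - center) / scale for x in values]


def flag_outliers(scores, cutoff=3.0):
    """Return the 1-based indices of observations with |score| > cutoff."""
    return [i for i, r in enumerate(scores, start=1) if abs(r) > cutoff]


classical = classical_scores(log_incomes)
robust = robust_scores(log_incomes)

print("classical |r_10|:", round(abs(classical[9]), 1))
print("robust    |r_10|:", round(abs(robust[9]), 1))
print("flagged by classical rule:", flag_outliers(classical))
print("flagged by robust rule:   ", flag_outliers(robust))
```

With the classical scores no observation exceeds the cutoff of 3, because the outlier inflates the standard deviation used to judge it; this is the masking effect. The robust scores flag observation 10 only.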

What is an outlier?

An outlier is an observation that deviates from the fit suggested by the majority of the observations.

How much is the majority?

Some estimators (e.g. the median) already work reasonably well when 50% or more of the observations are uncontaminated. They thus allow for almost 50% of outliers.

Other estimators (e.g. the IQRN) require that at least 75% of the observations are uncontaminated. They thus allow for almost 25% of outliers.

This can be measured in general.
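These tolerance levels can be illustrated empirically. The sketch below, in Python, replaces an increasing number of the 9 regular observations by one very large value and records how many replacements each estimator withstands before its value becomes huge. The helper `max_tolerated`, the replacement value `1e6` and the bound `100.0` are arbitrary choices made for this illustration. It tries only one kind of contamination, so it is not a formal computation of a breakdown value, and the quartiles again assume linear interpolation.

```python
from statistics import mean, median, quantiles

# The 9 regular observations from the income example.
regular = [9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92]


def iqr(values):
    """Interquartile range with linearly interpolated quartiles (an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def max_tolerated(estimator, values, bad_value=1e6, bound=100.0):
    """Largest k such that replacing the first k observations by bad_value
    still leaves |estimate| <= bound."""
    tolerated = 0
    for k in range(1, len(values)):
        contaminated = [bad_value] * k + values[k:]
        if abs(estimator(contaminated)) > bound:
            break
        tolerated = k
    return tolerated


for name, estimator in [("mean", mean), ("median", median), ("IQR", iqr)]:
    k = max_tolerated(estimator, regular)
    print(f"{name}: bounded with up to {k} of {len(regular)} observations replaced")
```

In this sample of 9 the mean fails after a single replacement, the median withstands 4 replacements (under 50%) and the interquartile range withstands 2 (under 25%), in line with the percentages stated above.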
