Robust multivariate methods for compositional data

Peter Filzmoser
Department of Statistics and Probability Theory, Vienna University of Technology
Compstat, Paris, France, August 23, 2010

Contents
• Characterization of compositional data
• Examples
• Transformations
• Factor analysis
• Robustness
• Conclusions

Joint work with
• Karel Hron, Univ. Olomouc, Czech Republic
• Clemens Reimann, Geological Survey of Norway
• Robert Garrett, Geological Survey of Canada

Example household expenditures

Household expenditures in former HK$ (Aitchison, 1986)

Person  Housing  Foodstuff  Alcohol  Tobacco  Other goods  Total
1           640        328      147      169          196   1480
2          1800        484      515     2291          912   6002
3          2085        445      725     8373         1732  13360
4           616        331      126      117          149   1339
5           875        368      191      290          275   1999
6           770        364      196      242          236   1808
7           990        415      284      588          420   2697
8           414        305       94       68          112    993
...
18         1195        443      329      974          523   3464
19         2180        521      553     2781         1010   7045
20         1017        410      225      419          345   2416

Characterization of compositional data

Definition: Compositional data consist of real-valued vectors $x = (x_1, \ldots, x_D)^t$ with $D$ strictly positive components that describe the parts of a whole and carry only relative information (Aitchison, 1986; Egozcue, 2009).

Consequences:
• The values $x_1, \ldots, x_D$ as such are not informative; only their ratios are of interest.
• The parts $x_1, \ldots, x_D$ do not need to sum up to 1.
• Compositional data follow the so-called Aitchison geometry on the simplex (and not the Euclidean geometry).

Most important reference: J. Aitchison. The Statistical Analysis of Compositional Data. Chapman and Hall, London, U.K., 1986.

Example Kola data

Kola data: library(StatDA); about 600 samples from 4 soil layers

[Figure: map of the Kola project area with the Barents Sea, Norway, Russia and Finland, the locations Nikel, Zapoljarnij, Murmansk, Monchegorsk, Apatity, Kovdor and Rovaniemi, and a 0 to 100 km scale bar; shown next to the book cover of Clemens Reimann, Peter Filzmoser, Robert Garrett, Rudolf Dutter: Statistical Data Analysis Explained. Applied Environmental Statistics with R]

Example Kola data

Two dominant parts in the C-horizon:

[Figure: two scatterplots. Left: Al2O3 against SiO2 in the C-horizon [wt.-%]. Right: log(Al2O3/TiO2) against log(SiO2/TiO2)]

Example factor analysis (Reimann, Filzmoser, Garrett, 2002, Appl. Geochem.)

Kola moss data: library(StatDA), data(moss); 594 samples, 31 variables

Factor analysis:
• log-transformation
• results presented in biplots

[Figure: biplot of Factor 1 (26.5%) against Factor 2 (16.5%) showing the samples and the loadings of the elements]

⇒ industrial contamination!

BUT: We have compositional data!

Example household expenditures

Two versions: data with and without Tobacco. Data are normalized with the total expenditures.

[Figure: two scatterplots of Foodstuff against Housing, with panel titles "Normalized data without Tobacco" and "Normalized data with Tobacco"; axis labels Housing, Foodstuff and Housing*, Foodstuff*]

Example household expenditures

Solution: consider (log-)ratios

[Figure: two index plots (persons 1 to 20) of log(Housing*/Foodstuff*) and log(Housing/Foodstuff), with panel titles "Normalized data without Tobacco" and "Normalized data with Tobacco"]

Normalization not necessary: same result with original data in HK$
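The point of this slide can be checked numerically. The following is a minimal sketch, using only the eleven persons listed in the table above (the omitted rows are not filled in) and natural logarithms as an assumption; the helper names `close` and `log_ratio` are illustrative and not taken from the talk. It compares the Housing share and the Housing/Foodstuff log-ratio across the raw data and the two normalized versions.

```python
import math

# Persons 1-8 and 18-20 from the table:
# (housing, foodstuff, alcohol, tobacco, other goods) in former HK$.
expenditures = [
    (640, 328, 147, 169, 196),
    (1800, 484, 515, 2291, 912),
    (2085, 445, 725, 8373, 1732),
    (616, 331, 126, 117, 149),
    (875, 368, 191, 290, 275),
    (770, 364, 196, 242, 236),
    (990, 415, 284, 588, 420),
    (414, 305, 94, 68, 112),
    (1195, 443, 329, 974, 523),
    (2180, 521, 553, 2781, 1010),
    (1017, 410, 225, 419, 345),
]


def close(parts):
    """Normalize the parts by their total so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def log_ratio(a, b):
    """Natural log of the ratio of two parts."""
    return math.log(a / b)


share_gaps = []     # change in the Housing share when Tobacco is dropped
logratio_gaps = []  # change in log(Housing/Foodstuff) across all versions

for row in expenditures:
    with_tobacco = close(row)
    without_tobacco = close(row[:3] + row[4:])  # drop the Tobacco part

    share_gaps.append(abs(with_tobacco[0] - without_tobacco[0]))

    raw = log_ratio(row[0], row[1])
    lr_with = log_ratio(with_tobacco[0], with_tobacco[1])
    lr_without = log_ratio(without_tobacco[0], without_tobacco[1])
    logratio_gaps.append(max(abs(raw - lr_with), abs(raw - lr_without)))

print("largest change in Housing share:", max(share_gaps))
print("largest change in log-ratio:    ", max(logratio_gaps))
```

The shares move noticeably when a part is removed, while the log-ratio differs only by floating-point rounding.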

Geometrical properties

Compositional data with only 2 parts

[Figure: two scatterplots of two-part compositions, axes "First part" and "Second part", each running from 0 to 1]

Aitchison distance:

$$d_A(x, \tilde{x}) = \sqrt{\frac{1}{D} \sum_{i=1}^{D-1} \sum_{j=i+1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{\tilde{x}_i}{\tilde{x}_j} \right)^2}$$
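As an illustration of the formula, here is a small sketch that evaluates the Aitchison distance for persons 1 and 2 of the household table. The function name `aitchison_distance` and the rescaling factors are illustrative choices, not part of the talk. Because the distance depends only on ratios, rescaling either composition should leave it unchanged.

```python
import math


def aitchison_distance(x, x_tilde):
    """Aitchison distance: root of (1/D) times the sum over all pairs i < j
    of the squared differences between corresponding log-ratios."""
    D = len(x)
    total = 0.0
    for i in range(D - 1):
        for j in range(i + 1, D):
            diff = math.log(x[i] / x[j]) - math.log(x_tilde[i] / x_tilde[j])
            total += diff ** 2
    return math.sqrt(total / D)


# Persons 1 and 2 from the household expenditures table (former HK$).
person_1 = (640, 328, 147, 169, 196)
person_2 = (1800, 484, 515, 2291, 912)

d_raw = aitchison_distance(person_1, person_2)

# Rescale each composition by an arbitrary positive constant.
person_1_shares = [v / sum(person_1) for v in person_1]
person_2_scaled = [100 * v for v in person_2]
d_scaled = aitchison_distance(person_1_shares, person_2_scaled)

print(d_raw, d_scaled)
```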

Transformations

Special transformations from the simplex to the Euclidean space:

• alr (additive logratio) transformation: divide values by the $j$-th part, $j \in \{1, \ldots, D\}$:

$$x^{(j)} = \left( \ln\frac{x_1}{x_j}, \ldots, \ln\frac{x_{j-1}}{x_j}, \ln\frac{x_{j+1}}{x_j}, \ldots, \ln\frac{x_D}{x_j} \right)^t$$

• clr (centered logratio) transformation: divide values by the geometric mean:

$$y = \left( \ln\frac{x_1}{\sqrt[D]{\prod_{i=1}^{D} x_i}}, \ldots, \ln\frac{x_D}{\sqrt[D]{\prod_{i=1}^{D} x_i}} \right)^t$$

• ilr (isometric logratio) transformation: take an orthonormal basis in the clr-space ⇒ difficult to interpret
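The three transformations can be sketched in a few lines. The alr and clr functions follow the formulas above. For ilr the slide does not fix a basis, so the code assumes one common orthonormal basis of the clr hyperplane (coordinate $i$ contrasts the geometric mean of the first $i$ parts with part $i+1$); other bases are equally valid. All function names are illustrative.

```python
import math


def alr(x, j):
    """Additive logratio: log of every part divided by part j.
    Note that j is a 0-based index here, unlike on the slide."""
    return [math.log(x[i] / x[j]) for i in range(len(x)) if i != j]


def clr(x):
    """Centered logratio: log of every part divided by the geometric mean."""
    logs = [math.log(v) for v in x]
    log_geo_mean = sum(logs) / len(logs)
    return [l - log_geo_mean for l in logs]


def ilr(x):
    """Isometric logratio for ONE assumed orthonormal basis:
    z_i = sqrt(i / (i + 1)) * ln(gm(x_1..x_i) / x_{i+1}), i = 1..D-1."""
    logs = [math.log(v) for v in x]
    z = []
    for i in range(1, len(x)):
        mean_log_first_i = sum(logs[:i]) / i
        z.append(math.sqrt(i / (i + 1)) * (mean_log_first_i - logs[i]))
    return z


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(a * a for a in v))


# Person 1 from the household expenditures table (former HK$).
x = (640, 328, 147, 169, 196)

x_alr = alr(x, 4)   # ratios to the last part, "Other goods"
x_clr = clr(x)      # D coordinates that sum to zero
x_ilr = ilr(x)      # D - 1 coordinates, same Euclidean norm as clr

print(sum(x_clr))
print(norm(x_clr), norm(x_ilr))
```

The clr coordinates sum to zero, which is the source of the singularity discussed for factor analysis, while the ilr coordinates drop one dimension and keep lengths intact.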

Factor analysis for compositional data

Given a $D$-dimensional random variable $y$.

FA model: $y = \Lambda f + e$ with
• $\Lambda$ ... loadings matrix
• $f$ ... "factors" of dimension $k < D$
• $e$ ... error term

With the usual assumptions this results in

$$\mathrm{Cov}(y) = \Lambda \Lambda^t + \Psi$$

with the diagonal matrix $\Psi = \mathrm{Cov}(e)$ (uniquenesses).
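To make the covariance structure concrete, the sketch below builds $\Lambda \Lambda^t + \Psi$ for a toy case with $D = 3$ and $k = 1$. The loadings and uniquenesses are hypothetical numbers chosen for illustration only; they do not come from the talk or from any data set.

```python
# Hypothetical loadings (D = 3 variables, k = 1 factor) and uniquenesses.
loadings = [[0.9], [0.7], [0.5]]
uniquenesses = [0.19, 0.51, 0.75]


def implied_covariance(lam, psi):
    """Return Lambda * Lambda^t + diag(psi) as a nested list."""
    D = len(lam)
    k = len(lam[0])
    sigma = []
    for i in range(D):
        row = []
        for j in range(D):
            common = sum(lam[i][m] * lam[j][m] for m in range(k))
            row.append(common + (psi[i] if i == j else 0.0))
        sigma.append(row)
    return sigma


sigma = implied_covariance(loadings, uniquenesses)
for row in sigma:
    print(row)
```

The off-diagonal entries come from the loadings alone, and each diagonal entry is the squared loading plus the uniqueness of that variable.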

Factor analysis for compositional data (Filzmoser, Hron, Reimann, Garrett, 2009, Comp. & Geosci.)

For an interpretation, FA must be related to the original variables!

⇒ ilr transformation ($z$), covariance estimation ($\mathrm{Cov}(z)$), back-transformation to the clr-space:

$$\mathrm{Cov}(y) = V \, \mathrm{Cov}(z) \, V^t$$

Next problem: $\mathrm{Cov}(y)$ is singular, which is in conflict with $\mathrm{Cov}(y) = \Lambda \Lambda^t + \Psi$ with a diagonal form of $\Psi$.

Solution: Projection of the diagonal matrix $\Psi$ on the hyperplane $y_1 + \ldots + y_D = 0$ formed by the clr-space.

⇒ the resulting $\Psi^*$ is no longer a diagonal matrix
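The singularity of $\mathrm{Cov}(y)$ in clr coordinates can be seen directly. The sketch below, an illustration using only the eleven persons listed in the household table, computes the classical sample covariance of the clr-transformed data and sums each of its rows. Since the clr coordinates of every observation sum to zero, each row of the covariance matrix sums to zero as well, so the vector of ones lies in its null space.

```python
import math

# Persons 1-8 and 18-20 from the household expenditures table (former HK$).
rows = [
    (640, 328, 147, 169, 196),
    (1800, 484, 515, 2291, 912),
    (2085, 445, 725, 8373, 1732),
    (616, 331, 126, 117, 149),
    (875, 368, 191, 290, 275),
    (770, 364, 196, 242, 236),
    (990, 415, 284, 588, 420),
    (414, 305, 94, 68, 112),
    (1195, 443, 329, 974, 523),
    (2180, 521, 553, 2781, 1010),
    (1017, 410, 225, 419, 345),
]


def clr(x):
    """Centered logratio: log of every part divided by the geometric mean."""
    logs = [math.log(v) for v in x]
    log_geo_mean = sum(logs) / len(logs)
    return [l - log_geo_mean for l in logs]


Y = [clr(r) for r in rows]
n, D = len(Y), len(Y[0])
means = [sum(y[j] for y in Y) / n for j in range(D)]

# Classical sample covariance matrix of the clr coordinates.
cov = [
    [sum((y[a] - means[a]) * (y[b] - means[b]) for y in Y) / (n - 1)
     for b in range(D)]
    for a in range(D)
]

row_sums = [sum(r) for r in cov]
print(row_sums)  # all zero up to rounding: Cov(y) times the ones vector is 0
```

A matrix of the form $\Lambda \Lambda^t + \Psi$ with strictly positive diagonal $\Psi$ is positive definite, so it cannot have such a null vector, which is the conflict named on the slide.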

Robust parameter estimation

The basis for parameter estimation in the FA model is the estimation of the covariance matrix. The classical estimation is sensitive with respect to outliers.

⇒ robust estimation of the covariance matrix leads to robust estimation of the parameters for FA (Pison, Rousseeuw, Filzmoser, Croux, 2003, J. Multiv. Anal.)

[Figure: two panels, "Classical estimation" and "Robust estimation"]

Robust covariance estimation

Minimum Covariance Determinant estimator (MCD): Search those 75% of data points having the smallest determinant of their classical covariance matrix.

→ The arithmetic mean of these points is a robust estimator of location.
→ The classical covariance of these points, multiplied by a factor, is a robust covariance estimator.
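The MCD idea can be illustrated by brute force on a tiny made-up data set. This is only a sketch: the eight two-dimensional points are invented for illustration, the exhaustive search over all subsets is feasible only for very small samples (practical implementations use faster approximate searches), and the multiplicative factor mentioned above is left out rather than guessed.

```python
import math
from itertools import combinations

# Made-up 2-D data: six points near (1, 1) and two clear outliers.
points = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
    (1.0, 0.8), (0.8, 1.0), (8.0, 9.0), (9.0, -7.0),
]


def mean_and_cov(pts):
    """Arithmetic mean and classical 2x2 sample covariance matrix."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def det2(c):
    """Determinant of a 2x2 matrix."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]


def mcd_brute_force(pts, fraction=0.75):
    """Try every subset containing `fraction` of the points and keep the one
    whose classical covariance matrix has the smallest determinant."""
    h = math.ceil(fraction * len(pts))
    best = min(
        combinations(range(len(pts)), h),
        key=lambda idx: det2(mean_and_cov([pts[i] for i in idx])[1]),
    )
    location, raw_cov = mean_and_cov([pts[i] for i in best])
    return best, location, raw_cov  # raw_cov is NOT rescaled by any factor


best_subset, robust_location, raw_covariance = mcd_brute_force(points)
classical_location, classical_covariance = mean_and_cov(points)

print("selected points:   ", best_subset)
print("robust location:   ", robust_location)
print("classical location:", classical_location)
```

The selected subset leaves out the two outliers, so the robust location stays with the bulk of the data while the classical mean is pulled away from it.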

Robust FA for compositional data

Kola moss data: library(StatDA), data(moss); 594 samples, 31 variables

Compare:
• classical and robust FA for
• log-transformed and ilr-transformed data

[Figure: map of the Kola project area with the Barents Sea, Norway, Russia and Finland, the locations Nikel, Zapoljarnij, Murmansk, Monchegorsk, Apatity, Kovdor and Rovaniemi, and a 0 to 100 km scale bar]
