
Exploring the multivariate structure of missing values using the R package VIM


  1. Exploring the multivariate structure of missing values using the R package VIM. Matthias Templ (1, 2), Andreas Alfons (1), Peter Filzmoser (1). (1) Department of Statistics and Probability Theory, Vienna University of Technology; (2) Department of Methodology, Statistics Austria. Rennes, July 8, 2009. [Slide footer: Templ, Alfons, Filzmoser (TUW), Exploring the structure of missing values, Rennes, July 8, 2009, slide 1 / 18]

  2. Content: (1) Motivation, (2) Visualization of missing values, (3) Conclusions.

  3. Motivation: Missing values. Real data sets often contain missing values:

     $$X = \begin{pmatrix} x_{11} & \cdots & \cdots & x_{1p} \\ \vdots & \text{NA} & & \vdots \\ \text{NA} & & \ddots & \vdots \\ \vdots & & \text{NA} & \vdots \\ x_{n1} & \cdots & \cdots & x_{np} \end{pmatrix}$$

     with n observations, p variables, and some missing values (NA). Examples: nonresponse in surveys, element concentration below detection limit in chemical analyses.

  4. Motivation: Comments on missing values. Most statistical methods can only be applied to complete data. In order to select an appropriate imputation method (especially for model-based imputation), it is necessary to know the multivariate structure of the missing values beforehand. Visualizing missing values may not only help to detect the missing value mechanisms, but also to gain insight into the quality and various other aspects of the data.

  5. Motivation: Missing value mechanisms. Three important cases (e.g., Little and Rubin 2002):
     MCAR (Missing Completely At Random): $P(X_{miss} \mid X) = P(X_{miss})$
     MAR (Missing At Random): $P(X_{miss} \mid X) = P(X_{miss} \mid X_{obs})$
     MNAR (Missing Not At Random): $P(X_{miss} \mid X) = P(X_{miss} \mid X_{obs}, X_{miss})$
     where $X = (X_{obs}, X_{miss})$ denotes the complete data, and $X_{obs}$ and $X_{miss}$ are the observed and missing parts, respectively.
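
The three mechanisms on slide 5 can be illustrated with a small simulation. The following is a minimal Python sketch using only the standard library; it is not taken from the talk or from the VIM package. The variables `age` and `income`, the deletion probabilities, and the thresholds are all hypothetical and chosen only to make the contrast visible.

```python
import random
import statistics

random.seed(1)

# Hypothetical complete data: (age, income) pairs, with income loosely tied to age.
n = 5000
ages = [random.uniform(20, 80) for _ in range(n)]
complete = [(a, 1000 + 20 * a + random.gauss(0, 300)) for a in ages]

def make_missing(data, prob):
    """Set income to None with a row-specific probability prob(age, income)."""
    return [(a, None if random.random() < prob(a, y) else y) for a, y in data]

# MCAR: the deletion probability ignores all values.
mcar = make_missing(complete, lambda a, y: 0.2)
# MAR: the deletion probability depends only on the always-observed age.
mar = make_missing(complete, lambda a, y: 0.4 if a > 50 else 0.05)
# MNAR: the deletion probability depends on the income value that is deleted.
mnar = make_missing(complete, lambda a, y: 0.4 if y > 2200 else 0.05)

def mean_age_gap(data):
    """Mean age of rows with missing income minus mean age of rows with observed income."""
    miss = [a for a, y in data if y is None]
    obs = [a for a, y in data if y is not None]
    return statistics.mean(miss) - statistics.mean(obs)

gaps = {name: mean_age_gap(data)
        for name, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]}
for name, gap in gaps.items():
    print(name, round(gap, 1))
```

In this toy setup the age gap is close to zero under MCAR and clearly positive under MAR. It is also positive under MNAR, because the hypothetical income is correlated with age, which suggests why the observed data alone may not be enough to tell MAR and MNAR apart.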

  6. Visualization of missing values. Famous books and almost all articles about missing values do not address visualization. Visualization tools for missing values are rarely, if at all, implemented in SAS, SPSS, Stata or even R. Through linking, missing values can be highlighted in GGobi (Cook and Swayne 2007) and Mondrian (Theus 2002). MANET (Unwin et al. 1996, Theus et al. 1997) is quite powerful, but only available for older Apple systems with PowerPC architecture and Mac OS. Visualization tools for missing values need to be available for the R community so that visualization of missing values, imputation and analysis can all be done from within R, without the need for additional software.

  7. Visualization of missing values: Histogram and spinogram. [Figure: histogram and spinogram of age, with each bar split into missing/observed in py010n. Caption: Austrian EU-SILC data from 2004 with missings generated in variable age.]
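
The quantities behind a histogram and spinogram of this kind can be sketched as follows. This is a minimal Python illustration with made-up records, not the VIM implementation: for each bin of one variable it counts how many rows are observed or missing in a second variable (the stacked histogram bars) and the share that is missing (the split within each spinogram bar). The names `records` and `bin_label` and the bin width are assumptions made for the example.

```python
from collections import Counter

# Hypothetical records: (age, income), where None marks a missing income.
records = [(23, 1200.0), (27, None), (34, 1800.0), (38, 2100.0),
           (45, None), (47, None), (52, 2500.0), (58, None), (63, 1900.0)]

def bin_label(age, width=20, start=20):
    """Return the lower edge of the age bin [lower, lower + width)."""
    return start + ((age - start) // width) * width

# Per-bin counts of rows whose income is observed or missing.
observed = Counter(bin_label(a) for a, y in records if y is not None)
missing = Counter(bin_label(a) for a, y in records if y is None)

for lower in sorted(set(observed) | set(missing)):
    total = observed[lower] + missing[lower]
    share = missing[lower] / total
    print(f"age {lower}-{lower + 19}: observed={observed[lower]} "
          f"missing={missing[lower]} share missing={share:.2f}")
```

A histogram stacks the two counts per bin, whereas a spinogram draws the share on the vertical axis and makes the bar width proportional to the bin total, so bins with few rows do not dominate the picture.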

  8. Visualization of missing values: Marginplot. [Figure: marginplot of py130n (vertical axis) against pek_n (horizontal axis). Caption: Austrian EU-SILC data from 2004.]

  9. Visualization of missing values: Scatterplot matrix. [Figure: scatterplot matrix of pek_n, age and py010n. Caption: Austrian EU-SILC data from 2004.]

  10. Visualization of missing values: Matrixplot. [Figure: matrixplot with the observation index on the vertical axis and the variables P001000, r007000, py010n, py035n, py050n, py090n, py100n, pek_n, bundesld and age as columns. Caption: Austrian EU-SILC data from 2004.]

  11. Visualization of missing values: Parallel coordinate plot. [Figure: parallel coordinate plot of the variables sex, pek_g, P033000, P029000, P014000, P001000, age and bundesld. Caption: Austrian EU-SILC data from 2004.]

  12. Visualization of missing values: Parallel boxplots. [Figure: parallel boxplots of pek_n, grouped by observed (obs.) and missing (miss.) values in each of py010n, py035n, py050n, py070n, py080n, py090n, py100n, py110n, py130n and py140n, with the group sizes printed below the boxes. Caption: Austrian EU-SILC data from 2004.]
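
The grouping behind parallel boxplots can be sketched in a few lines. This is a minimal Python illustration with made-up data, not the VIM implementation: one variable is split by whether a second variable is observed or missing, and the group size and five-number summary that a boxplot would draw are computed per group. The variable names `pek` and `py` and all values are hypothetical; `statistics.quantiles` uses its default method, which need not match the quartile rule of any particular plotting package.

```python
import statistics

# Hypothetical records: (pek, py), where None marks a missing py value.
rows = [(1.2, 10.0), (2.8, None), (1.9, 12.5), (3.4, None), (2.1, 9.0),
        (3.9, None), (1.5, 11.0), (2.4, 14.0), (3.1, None), (1.8, 8.5)]

def five_numbers(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

# Split pek by the missingness status of py.
groups = {
    "obs. in py": [x for x, y in rows if y is not None],
    "miss. in py": [x for x, y in rows if y is None],
}

# Group size and five-number summary per group.
summary = {label: (len(vals), five_numbers(vals)) for label, vals in groups.items()}
for label, (size, stats) in summary.items():
    print(label, size, [round(s, 2) for s in stats])
```

If the two boxes for a variable differ markedly, as they do in this made-up example, missingness in that variable appears to be related to the plotted variable, which would speak against MCAR.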

  13. Conclusions: General statements. The detection of missing value mechanisms is quite complex when using models or tests. Statistical methods frequently lead to only vague statements about the missing value mechanisms. Non-robust methods lead to erroneous statements about missing value mechanisms for data containing outliers. Visualization tools are easier to handle and more powerful, but flexible, easy-to-use visualization software is required.
