Selected excerpts
● It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data.
● Data tidying: structuring datasets to facilitate analysis.
● This paper [...] provides a comprehensive "philosophy of data".
● Since most real world datasets are not tidy...
● Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
http://hadley.nz/
https://www.youtube.com/results?search_query=hadley+wicham
http://ggplot2.org/
http://ggplot2.org/resources/2007-past-present-future.pdf
http://ggplot2.org/resources/2007-vanderbilt.pdf
http://docs.ggplot2.org/current/
Data cleanup: tidyr
Data handling: dplyr
Data visualization: ggplot2
Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
"Happy families are all alike; every unhappy family is unhappy in its own way."
The Anna Karenina principle
In other words, success requires several conditions to be met together. A single unmet condition is enough to cause failure.
https://deselection.wordpress.com/2010/11/12/le-principe-danna-karenine/
Aristotle's version
https://en.wikipedia.org/wiki/Anna_Karenina_principle
Much earlier, Aristotle states the same principle in the Nicomachean Ethics (Book 2):
Again, it is possible to fail in many ways (for evil belongs to the class of the unlimited, as the Pythagoreans conjectured, and good to that of the limited), while to succeed is possible only in one way (for which reason also one is easy and the other difficult – to miss the mark easy, to hit it difficult); for these reasons also, then, excess and defect are characteristic of vice, and the mean of virtue; For men are good in but one way, but bad in many.
Logic: universal and existential quantifiers
https://fr.wikipedia.org/wiki/Quantificateur_(logique)
● ∀ x P(x) reads "for all x, P(x)" and means "every object in the domain under consideration has property P"
● ∃ x P(x) means "there exists at least one x such that P(x)" (at least one object in the domain under consideration has property P)
Negation of quantifiers
● The negation of ∀ x P(x) is ¬ ∀ x P(x), that is: ∃ x ¬P(x)
● The negation of ∃ x P(x) is ¬ ∃ x P(x), that is: ∀ x ¬P(x)
Classical logic
https://fr.wikipedia.org/wiki/Logique_classique
● The law of the excluded middle states that for any mathematical proposition, either it or its negation is true: A ∨ ¬A
● Proof by contradiction: ¬¬A ⇒ A
● Contraposition: (¬B ⇒ ¬A) ⇒ (A ⇒ B)
● Material implication: (A ⇒ B) ⇔ (¬A ∨ B)
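The negation rule for ∀ can be checked on a small example. This is only a sketch in base R: the vector `means` is made up, and `all()` / `any()` play the roles of ∀ and ∃ over its elements.

```r
# Hypothetical group means (made-up values, for illustration only)
means <- c(5.0, 5.0, 6.2)

# "For all x, P(x)": every mean equals the first one
all_equal <- all(means == means[1])

# "There exists x such that not P(x)": at least one mean differs from the first
exists_different <- any(means != means[1])

# not (for all x, P(x))  is equivalent to  (there exists x, not P(x))
negation_holds <- (!all_equal) == exists_different

# "Not all equal" does not say how many differ: count them separately
n_different <- sum(means != means[1])
```

Here "not all equal" is true although only one value differs from the others, which is the distinction the quantifier rule makes explicit.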
ANOVA
https://fr.wikipedia.org/wiki/Analyse_de_la_variance
H0: all means are equal
H1: not H0
If H0 is rejected, we know that at least one mean differs from the others, but which one? → post-hoc test
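As an illustration of the ANOVA-then-post-hoc sequence, here is a minimal sketch in base R. The choice of the built-in `PlantGrowth` dataset and of Tukey's HSD as the post-hoc test are assumptions made for the example; the slides do not name a specific test.

```r
# One-way ANOVA on a built-in dataset: plant weight by treatment group
# (PlantGrowth ships with R; it is used here purely as an example)
fit <- aov(weight ~ group, data = PlantGrowth)

# Global test of H0: all group means are equal
print(summary(fit))

# Post-hoc test (Tukey's HSD, one possible choice): all pairwise
# comparisons between groups, to see which means differ
posthoc <- TukeyHSD(fit)
print(posthoc)
```

The ANOVA table only answers "are all means equal?"; the pairwise table is what says which groups differ.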
They are all equal!
They are not all equal! → They are all different
They are not all equal! → Only one differs from the others
messy tidy
Tidy data
The term tidy refers to an optimal (?) way of presenting data for statistical analysis. A messy version may be preferable for readability.
messy | tidy
In a publication, this version ↑, which is more compact, may be preferable. But: it gives the impression of a contingency table → chi-squared test...
Tidy data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Messy data is any other arrangement of the data.
Messy data
Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analyzing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:
● Column headers are values, not variable names.
● Multiple variables are stored in one column.
● Variables are stored in both rows and columns.
● Multiple types of observational units are stored in the same table.
● A single observational unit is stored in multiple tables.
Surprisingly, most messy datasets, including types of messiness not explicitly described above, can be tidied with a small set of tools: melting, string splitting, and casting.
Column headers are values, not variable names
...
3 variables:
- religion
- income
- count
Each column represents a variable; each row, an observation.
Tidying when column headers are values: melting
Columns correspond to RNAseq data from different conditions (B, C, D), each with 3 biological replicates
⇒ The wide dataset is the natural initial format: convenient for summarising the data, but less convenient for modelling or plotting
The melt() function turns columns into rows
⇒ The molten dataset is a convenient format for models across times, for example
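A minimal sketch of melting, assuming the tidyr package is installed. It uses `pivot_longer()`, the current tidyr counterpart of `melt()`; the table `wide`, its gene names and its counts are made up for illustration and only mimic the condition columns B, C and D.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per condition
wide <- data.frame(
  gene = c("g1", "g2"),
  B = c(10, 200),
  C = c(12, 180),
  D = c(30, 90)
)

# Melt: the column headers B, C, D are values of a "condition" variable,
# so they are moved into a column, with the cell contents in "count"
long <- pivot_longer(
  wide,
  cols = c(B, C, D),
  names_to = "condition",
  values_to = "count"
)

print(long)
```

After melting, each row is one (gene, condition) observation, which is the shape that modelling and plotting functions expect.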
Variables are stored in both rows and columns
One variable per column, one observation per row
This column contains a variable name!
Tidying when multiple variables are stored in one column: casting
Casting changes rows into columns (inverse of melting)
Values of the 2 variables tmax and tmin are recorded in the same column but on 2 rows
After casting, the 2 variables are recorded in 2 columns
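A minimal sketch of casting, assuming the tidyr package is installed and using `pivot_wider()` as the casting function. The table `molten`, its dates and its temperatures are hypothetical; only the variable names tmax and tmin come from the slide.

```r
library(tidyr)

# Hypothetical molten table: tmax and tmin share the "value" column,
# and the "element" column holds the variable names
molten <- data.frame(
  date = rep(c("2010-01-30", "2010-02-02"), each = 2),
  element = rep(c("tmax", "tmin"), times = 2),
  value = c(28.0, 14.0, 27.0, 15.0)
)

# Cast: each distinct value of "element" becomes its own column
tidy <- pivot_wider(molten, names_from = element, values_from = value)

print(tidy)
```

The result has one row per date, with tmax and tmin side by side.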
Tidying when …
● Variables are stored in both rows and columns: combination of melting and casting
● Multiple types in one table (e.g. values collected at multiple levels needed in the same table): merging
● One type in multiple tables: the plyr package helps to read a list of files (ldply)
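The "one type in multiple tables" case can be sketched in base R as follows. The file names, their contents and the helper `read_one()` are hypothetical; with plyr installed, `ldply(paths, read_one)` would play the role of the `lapply()` plus `rbind` step.

```r
# Hypothetical setup: write two small CSV files to a temporary directory
demo_dir <- tempfile("tidy_demo_")
dir.create(demo_dir)
write.csv(data.frame(id = 1:2, value = c(3.1, 4.2)),
          file.path(demo_dir, "site_a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(5.3, 6.4)),
          file.path(demo_dir, "site_b.csv"), row.names = FALSE)

# List the files to read
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and record where each row came from
read_one <- function(path) {
  df <- read.csv(path)
  df$source <- basename(path)
  df
}

# Read every file and stack the results into a single table
combined <- do.call(rbind, lapply(paths, read_one))

print(combined)
```

Keeping the file name in a `source` column preserves the information that was previously encoded only in the file system.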
Tidy tools
1) Manipulation
2) Visualization
3) Modelling
Manipulation
● Filter: subsetting or removing observations based on some condition.
● Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
● Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
● Sort: changing the order of observations.
All these operations are made easier when there is a consistent way to refer to variables. Tidy data provides this because each variable resides in its own column.
Ensure input- and output-tidiness
plyr, dplyr packages
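The four operations can be sketched with dplyr on the built-in `mtcars` table, assuming the dplyr package is installed. The particular filter condition and the derived variable `power_to_weight` are arbitrary choices made for the example.

```r
library(dplyr)

result <- mtcars %>%
  # Filter: keep only the cars that do not have 6 cylinders
  filter(cyl != 6) %>%
  # Transform: add a variable computed from two existing ones
  mutate(power_to_weight = hp / wt) %>%
  # Aggregate: one row per number of cylinders
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  # Sort: highest mean fuel economy first
  arrange(desc(mean_mpg))

print(result)
```

Each step takes a tidy table and returns a tidy table, so the steps can be chained in any order.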
Visualization
Tidy visualization tools only need to be input-tidy as their output is visual.
The paper provides a comprehensive "philosophy of data": one that underlies my work in the plyr (Wickham 2011) and ggplot2 (Wickham 2009) packages.
ggplot2 logic: the syntax is designed for tidy input.
ggplot2 package
Hadley Wickham dixit:
Source: http://ggplot2.org/resources/2007-past-present-future.pdf
str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
ggplot2:
ggplot(mtcars, aes(x=mpg, y=hp)) +
  geom_point() +
  geom_point(aes(colour = cyl,
                 size=carb,
                 shape=factor(gear)))

Base graphics:
plot(mtcars$mpg, mtcars$hp,
     col=mtcars$cyl-3,
     pch=mtcars$gear+15,
     cex=mtcars$carb/3)
toto = mtcars
colnames(toto) = NULL
plot(toto[,1], toto[,4],
     col=toto[,2]-3,
     pch=toto[,10]+15,
     cex=toto[,11]/3)
Base graphics:
toto = mtcars
colnames(toto) = NULL
plot(toto[,1], toto[,4],
     col=toto[,2]-3,
     pch=toto[,10]+15,
     cex=toto[,11]/3)

ggplot2:
ggplot(toto,
       aes(x=toto[,1], y=toto[,4]))
geom_point() +
  geom_point(aes(colour = toto[,2],
                 size=toto[,11],
                 shape=factor(toto[,10])))

Error in geom_point() geom_point(aes(colour = toto[, 2], size = toto[, : non-numeric argument to binary operator