An introduction to data visualisation

Christophe

19 February 2013

1 Theory of data graphics - Edward R. Tufte

1.1 Graphical excellence

Graphical excellence is nearly always multivariate and requires telling the truth about the data. Graphical displays should:

1. Show the data
2. Induce the viewer to think
3. Avoid distorting what the data have to say
4. Present many numbers in a small space
5. Make large data sets coherent
6. Encourage the eye to compare different pieces of data
7. Reveal the data at several levels of detail, from a broad overview to the fine structure
8. Serve a reasonably clear purpose
9. Be closely integrated with the statistical description of the dataset

1.2 Telling a story with pictures
Figure 1: Figurative map of the successive losses in men of the French army in the Russian campaign of 1812-1813, Charles Minard (1869)

Figure 2: Train timetable between Paris and Lyon, E.J. Marey (1885)

This method is attributed to the French engineer Ibry, but new evidence suggests that Lt. Sergeev had developed it approximately 30 years earlier in Russia. Source: E. R. Tufte
Figure 3: Evolution of the consumption of energy resources. Source: <http://www.digital-leads.com/2012/11/informationen-gestalten/>

Figure 4: Airline routes in the USA. Source: Aaron Koblin <http://www.aaronkoblin.com/work/flightpatterns/index.html>
Figure 5: Snow levels and precipitation. Source: MétéoFrance <http://www.infoneige.com/nivose-cerdagne-canigou>
Figure 6: Statistical Breviary by William Playfair (1801). Source: E. R. Tufte

1.3 Rules

1. Above all else show the data
2. Maximize the data-ink ratio
3. Erase non-data-ink
4. Erase redundant data-ink
5. Revise and edit

2 Graphical Integrity

Graphical excellence begins with telling the truth about the data, so a lie factor can be constructed to measure the misrepresentation:

\[ \text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} \tag{1} \]

2.1 Examples
Figure 7: Government spending "Skyrocketing". Source: E. R. Tufte, from Playfair (1786)

Figure 8: Government spending "Skyrocketing". Source: E. R. Tufte, from Playfair (1786). In a note, Playfair says that spending is now expressed in real rather than nominal millions!
Another example of a big lie. The real change in the fuel economy of cars is from 18 mpg in 1978 to 27.5 mpg in 1985, an increase of 53% in 7 years. On the graph, the horizontal line is 1.5 cm long in 1978 and 13 cm long in 1985, so the visual change is around 767%, which gives a lie factor of about 14.5!

Figure 9: Fuel economy standards. Source: E. R. Tufte (from NY Times 1978)

Figure 10: Fuel economy standards, another view. Source: E. R. Tufte (from NY Times 1978)
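As a minimal sketch of the arithmetic behind the fuel economy example, the Python snippet below computes the lie factor from the four measurements quoted above (18 and 27.5 mpg, 1.5 and 13 cm). The function names `relative_change` and `lie_factor` are illustrative choices, not taken from Tufte.

```python
def relative_change(first, last):
    """Relative change from first to last, as a fraction of first."""
    return (last - first) / first


def lie_factor(shown_first, shown_last, data_first, data_last):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    shown = relative_change(shown_first, shown_last)
    actual = relative_change(data_first, data_last)
    return shown / actual


data_effect = relative_change(18.0, 27.5)   # fuel economy, in mpg
shown_effect = relative_change(1.5, 13.0)   # length of the drawn line, in cm
factor = lie_factor(1.5, 13.0, 18.0, 27.5)

print(f"effect in data:    {data_effect:.1%}")
print(f"effect in graphic: {shown_effect:.1%}")
print(f"lie factor:        {factor:.1f}")
```

A lie factor of 1 would mean that the graphic shows the effect at its true size.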
2.2 Examples with MS Excel

Figure 11: Public administration debt. State vs. all administrations. Source: INSEE

State spending (green) appears to grow more strongly.
Figure 12: Public administration debt. State vs. all administrations. Source: INSEE

Figure 13: Public administration debt. State vs. all administrations. Source: INSEE
2.3 Data-ink ratio

Ink should present data-information. Data-ink is the non-erasable core of a graphic. Tufte defines the data-ink ratio as:

\[ \text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}} \tag{2} \]

In what follows, we analyse how much of a graphic could be erased...

Figure 14: Public administration debt. State vs. all administrations. Source: INSEE
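To make the ratio concrete, here is a toy sketch in Python. It assumes a graphic reduced to a small character grid in which `*` marks data-ink, `+`, `-` and `|` mark non-data-ink (here a frame) and a space marks blank paper. This encoding and the function name `data_ink_ratio` are assumptions made purely for illustration, not a measurement method proposed by Tufte.

```python
DATA_INK = {"*"}                 # assumed symbol for ink that carries data
NON_DATA_INK = {"+", "-", "|"}   # assumed symbols for frame and grid ink


def data_ink_ratio(graphic):
    """Share of the inked cells that carry data (spaces are blank paper)."""
    cells = [c for row in graphic for c in row]
    data = sum(c in DATA_INK for c in cells)
    total = data + sum(c in NON_DATA_INK for c in cells)
    return data / total


framed = [
    "+------+",
    "|    * |",
    "|  *   |",
    "| *    |",
    "+------+",
]

# Erase the non-data-ink and keep only the data points.
erased = [row.replace("+", " ").replace("-", " ").replace("|", " ")
          for row in framed]

print("with frame:   ", data_ink_ratio(framed))
print("frame erased: ", data_ink_ratio(erased))
```

In this toy case the frame uses far more ink than the three data points, so erasing it raises the ratio sharply.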
3 Boxplots and Co

The boxplot is probably the simplest and most widely used graphic for comparing distributions, for example between groups of individuals. Nothing forbids the use of colours and of horizontal and vertical axes...

[Figure: boxplots of Response (98 to 110) by group, g1 to g5]
3.1 Let's erase stuff...

[Figure: the same boxplots by group, g1 to g5, with most of the non-data-ink erased]

3.2 Let's change the shape...

Box-percentile plots are similar to boxplots, except that they supply more information about the univariate distributions. At any height, the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile; above the 50th percentile, the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percentage of observations that are more extreme in that direction. As in boxplots, the median and the 25th and 75th percentiles are marked with line segments across the box. See http://had.co.nz/stat645/project-03/boxplots.pdf.
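The width rule described above can be sketched in a few lines of Python. The helpers are illustrative: the names `percentile_of` and `box_percentile_width` are hypothetical, and the percentile is taken as the share of observations less than or equal to the given height, which is only one convention among several.

```python
def percentile_of(sample, height):
    """Percentage of observations less than or equal to height."""
    return 100.0 * sum(x <= height for x in sample) / len(sample)


def box_percentile_width(sample, height):
    """Relative width of the box-percentile plot at a given height.

    Equal to the percentile up to the 50th percentile,
    and to 100 minus the percentile above it.
    """
    p = percentile_of(sample, height)
    return p if p <= 50.0 else 100.0 - p


sample = list(range(1, 101))   # toy data: 1, 2, ..., 100
for height in (10, 25, 50, 75, 90):
    print(height, box_percentile_width(sample, height))
```

The width grows up to the median and shrinks symmetrically above it, which is what gives the plot its lozenge-like outline on symmetric data.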
[Figure: Boxplot and Box-Percentile Plot of a Normal and a Uniform sample]

The difference can be seen on two groups drawn at random from the same distribution:

[Figure: Boxplot and Box-Percentile Plot of Group 1 and Group 2]

The boxplot has friends...

The first figure shows the underlying density of the randomly generated data: a normal mixture of two components. Then, from left to right, are plotted variations on the idea of a boxplot.

1. Underlying bimodal density
2. The boxplot itself, which concentrates on the central bulk of the data
3. The HDR boxplot, which looks at the zone of highest density
4. The violin plot, which uses a kernel estimator of the density
5. The box-percentile plot, the same as a boxplot but showing more information about the density

On a "classical", unimodal distribution, the four boxplots cannot be told apart:

[Figure: underlying density, standard boxplot, HDR boxplot, violin plot and Box-Percentile Plot for a unimodal sample]

But if we change the distribution to make it bimodal, only the violin plot and the HDR boxplot capture the bimodality in the dataset. Given that the dataset is truly bimodal, they are, in that case, better than the standard boxplot and the box-percentile plot.
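The following Python sketch illustrates why a density-based display can reveal what a boxplot hides. It is not the code behind the figures: the mixture parameters (components centred at -2 and +2 with standard deviation 0.5), the fixed kernel bandwidth of 0.5 and the use of evenly spaced quantiles instead of random draws are all assumptions chosen to keep the example small and reproducible.

```python
from math import exp, pi, sqrt
from statistics import NormalDist, median, quantiles


def deterministic_sample(mu, sigma, n):
    """n evenly spaced quantiles of N(mu, sigma), used instead of random draws."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def kde(sample, x, bandwidth):
    """Gaussian kernel density estimate at x."""
    norm = len(sample) * bandwidth * sqrt(2 * pi)
    return sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm


# Bimodal mixture: half the points around -2, half around +2 (assumed values).
sample = deterministic_sample(-2.0, 0.5, 50) + deterministic_sample(2.0, 0.5, 50)

# What a standard boxplot draws: the quartiles and the median.
q1, _, q3 = quantiles(sample, n=4, method="inclusive")
med = median(sample)

# What a violin plot draws: a density estimate, scanned here for local maxima.
grid = [i / 10 for i in range(-50, 51)]
density = [kde(sample, x, bandwidth=0.5) for x in grid]
modes = [grid[i] for i in range(1, len(grid) - 1)
         if density[i - 1] < density[i] > density[i + 1]]

print("boxplot summary (q1, median, q3):", round(q1, 2), round(med, 2), round(q3, 2))
print("modes of the density estimate:   ", modes)
```

The boxplot summary places the median in the gap between the two clusters, where there are hardly any observations, while the density estimate shows one peak per component.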
[Figure: underlying density, standard boxplot, HDR boxplot, violin plot and Box-Percentile Plot for a bimodal sample]

Source: http://gallery.r-enthusiasts.com/graph/The_boxplot_friends_102.

McGill, Tukey and Larsen (1978) introduced the variable width boxplot, where width is used to represent the density; this is believed to prevent misinterpretation of certain characteristics of the data, in particular the median.

In the same paper they introduced the notched boxplot, which adds yet another element to the original boxplot by displaying confidence intervals around the medians. Doing so allows one to determine visually whether or not the medians are significantly different between groups.
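The notch comparison can be sketched as follows in Python. This is an illustration under stated assumptions, not the exact procedure of any plotting package: the half-width 1.57 × IQR / √n is the form usually quoted for McGill, Tukey and Larsen's notches (some software uses a slightly different constant), the quartiles use Python's "inclusive" convention, and the names `notch` and `notches_overlap` and the three toy groups are hypothetical.

```python
from math import sqrt
from statistics import median, quantiles

NOTCH_CONSTANT = 1.57  # assumed constant; implementations differ slightly


def notch(sample):
    """Return the (lower, upper) bounds of the notch around the median."""
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    half_width = NOTCH_CONSTANT * (q3 - q1) / sqrt(len(sample))
    m = median(sample)
    return m - half_width, m + half_width


def notches_overlap(a, b):
    """True when the two notch intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


group1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
group2 = [x + 0.5 for x in group1]   # median close to that of group1
group3 = [x + 6.0 for x in group1]   # median far from that of group1

print("group1 vs group2 overlap:", notches_overlap(notch(group1), notch(group2)))
print("group1 vs group3 overlap:", notches_overlap(notch(group1), notch(group3)))
```

Notches that do not overlap are usually read as informal evidence that the two medians differ; overlapping notches give no such evidence.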
[Figure: three panels for Group 1, Group 2 and Group 3: Boxplot; with variable width; with var. width and Notches]
3.3 Context is important!

[Figure: Passengers on Air China. Total Passengers by Year, 2001 to 2004]

[Figure: Passengers on Air China. Total Passengers by Year, 1990 to 2010]
[Figure: Total Passengers by Year, 1990 to 2010, for Air China and British Airways, each with its own vertical axis]
4 Visualising relationships

4.1 Why visualising matters - F. J. Anscombe

Consider the four data sets proposed by F. J. Anscombe: (X1, Y1), (X2, Y2), (X3, Y3) and (X4, Y4).

Variable    n    Min     q1   Median   Mean     q3    Max
X1         11      4   6.50        9      9  11.50     14
X2         11      4   6.50        9      9  11.50     14
X3         11      4   6.50        9      9  11.50     14
X4         11      8   8.00        8      9   8.00     19

Table 1: Summary of the 4 data sets: Xs

Variable    n    Min     q1   Median   Mean     q3    Max
Y1         11   4.26   6.31     7.58   7.50   8.57  10.84
Y2         11   3.10   6.70     8.14   7.50   8.95   9.26
Y3         11   5.39   6.25     7.11   7.50   7.98  12.74
Y4         11   5.25   6.17     7.04   7.50   8.19  12.50

Table 2: Summary of the 4 data sets: Ys

Note that the correlations are cor(X1, Y1) = 0.8164, cor(X2, Y2) = 0.8162, cor(X3, Y3) = 0.8163 and finally cor(X4, Y4) = 0.8165.

Now, let us really look at these data:

[Figure: four scatter plots, panels X1-Y1, X2-Y2, X3-Y3 and X4-Y4; horizontal axes labelled dataAnscombe$x1, dataAnscombe$x2, and so on]
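The near-identical summaries can be checked numerically. The Python sketch below uses the standard values of Anscombe's quartet, which are not listed in this document and are supplied here as an outside input; the function name `pearson` is an illustrative choice.

```python
from math import sqrt
from statistics import mean

# Standard values of Anscombe's quartet (not reproduced in this document).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


for name, (xs, ys) in anscombe.items():
    print(f"set {name}: mean x = {mean(xs):.2f}, mean y = {mean(ys):.2f}, "
          f"cor = {pearson(xs, ys):.4f}")
```

The four lines of output are almost indistinguishable, which is precisely why the scatter plots are needed to tell the data sets apart.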
4.2 Scatterplot with Tufte axes

[Figure: Old Faithful Eruptions (271 samples). Previous duration, in seconds (96 to 306), against time till next eruption, in minutes (43 to 96)]

The 3D version of the density (estimated with the R package np):

[Figure: Joint Density surface over eruptions and waiting]
5 Visualising "other things"

5.1 Visualising networks

Figure 15: Relationships between Mark Twain's characters. Source: Pajek http://pajek.imfm.si/doku.php?id=links
Figure 16: Relationships between the various brands and groups in the agri-food industry (IAA). http://www.convergencealimentaire.info/?attachment_id=238

Figure 17: Relationships between co-authors. http://www.bordalierinstitute.com/target11.html