Letter Value Boxplot
Heike Hofmann, Karen Kafadar, Hadley Wickham
IOWA STATE UNIVERSITY

Outline
• Boxplots: Definition, Strengths & Weaknesses
• Letter Value Statistics
• Letter Value Boxplots
• Examples
• Conclusion

Boxplots
[Figure: boxplot example, axis from 0 to 7]
• Early Version: Tukey 1972 (Snedecor Festschrift, at Iowa State University)
• Most common version in EDA (1977):
  • Median (Center Line), Fourths (Box Edges), adjacent values (ends of whiskers) and extreme values
• All marks correspond to actual data values

Boxplot: Strengths
• Quick summary without overwhelming amount of detail
• Approximate location, spread, shape of distribution
• Outlier identification
• Associations among variables
Boxplots: Weaknesses
• Expected number of labeled outliers approx. 0.4 + 0.007n
• For n = 100000 expect approx. 700 outliers!
[Figure: boxplots of samples from the Exponential Distribution, n = 100, 1000, 10000, 100000]

Modifications
• Notched box-and-whisker (McGill, Larsen, Tukey 1987)
• Nonparametric density estimates
  • Vase plots (Benjamini, 1988)
  • Violin plots (Hintze, Nelson 1998)
  • Box-percentile plots (Esty, Banfield 2003)
Implementations: S routines (David James), package vioplot (Adler, Romain), package HMisc bpplot (Harrell, Banfield), examples at R Graph Gallery

Letter Value Statistics
• Estimate quantiles corresponding to tail areas $2^{-j}$
  • Median (1/2): depth $d_M = (1 + n)/2$
  • Fourths (1/4): depth $d_F = (1 + \lfloor d_M \rfloor)/2$
  • Eighths (1/8): depth $d_E = (1 + \lfloor d_F \rfloor)/2$
• Boxplots show median, fourths
• Large Data Sets: tail quantiles become more reliable, so include LVs beyond Fourths

Letter Value Boxplot
LVboxplot(rnorm(1000))
[Figure: letter value boxplot of the sample, x axis from -3 to 3]
• How many boxes to show?
• Outlier identification?
• All marks are based on actual data values
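The outlier approximation and the depth recursion above can be sketched in a few lines. This is a minimal Python illustration, not the authors' LVboxplot code: the function names are hypothetical, and averaging the two neighbouring order statistics at a half-integer depth is an assumption taken from the usual letter value convention.

```python
import math


def expected_labeled_outliers(n):
    """Approximate number of labeled outliers, using the slide's 0.4 + 0.007n."""
    return 0.4 + 0.007 * n


def letter_value_depths(n, k):
    """Depths of the first k letter values: median, fourths, eighths, ..."""
    depths = []
    d = (1 + n) / 2  # depth of the median
    for _ in range(k):
        depths.append(d)
        d = (1 + math.floor(d)) / 2  # next depth: (1 + floor(previous)) / 2
    return depths


def letter_values(data, k):
    """Return (depth, lower value, upper value) for the first k letter values.

    Assumption: at a half-integer depth the two neighbouring order
    statistics are averaged.
    """
    xs = sorted(data)
    n = len(xs)
    result = []
    for d in letter_value_depths(n, k):
        lo, hi = math.floor(d), math.ceil(d)  # 1-based ranks around the depth
        lower = (xs[lo - 1] + xs[hi - 1]) / 2  # counted from the smallest value
        upper = (xs[n - lo] + xs[n - hi]) / 2  # counted from the largest value
        result.append((d, lower, upper))
    return result


if __name__ == "__main__":
    print(expected_labeled_outliers(100000))
    for depth, lower, upper in letter_values(range(1, 11), 3):
        print(depth, lower, upper)
```

For the ten values 1 to 10 the depths are 5.5, 3 and 2, so the fourths are the 3rd smallest and 3rd largest values and the eighths are the 2nd smallest and 2nd largest.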
Gaussian, Exponential & Uniform
[Figure: letter value boxplots for three samples, panels "Gaussian, n=10000", "Exponential, n=10000", "Uniform, n=10000"]

Stopping Rules & Outliers
• EDA: 5-8 outliers
  $k = \lfloor \log_2 n \rfloor - 4$
• Percentage of data, e.g. 0.5-1%
• Uncertainty in $LV_i$ extends beyond or into $LV_{i-1}$ (i.e. upper limit for $LV_i$ crosses $LV_{i-1}$)
  $k = \lfloor \log_2 n - \log_2 (4 z_{1-\alpha/2}^2) \rfloor + 1$
Rules lead to similar answers ...

Examples
Gene Expression Values
[Figure: letter value boxplots of gene expression values, panels T1-1, T1-2, T1-3, T2-1, T2-2, T2-3, WT-1, WT-2, WT-3]

Conclusion
Letter Value Boxplots
• are appropriate for large numbers of values
• are based on actual data values
• are simple to compute
• reduce the number of labeled outliers shown in conventional boxplots
• do not depend on a smoothing parameter
Download (for now) at http://www.public.iastate.edu/~hofmann
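Two of the stopping rules can be compared numerically. The sketch below is a minimal Python illustration under stated assumptions: the function names and the default alpha = 0.05 are hypothetical, and the second rule is read from the slide as k = floor(log2 n - log2(4 z^2)) + 1 with z the 1 - alpha/2 standard normal quantile. The percentage rule is left out because the slide gives no formula for it.

```python
import math
from statistics import NormalDist


def k_eda(n):
    """EDA rule from the slide: k = floor(log2 n) - 4."""
    return math.floor(math.log2(n)) - 4


def k_uncertainty(n, alpha=0.05):
    """Uncertainty rule as read from the slide (an assumption):
    k = floor(log2 n - log2(4 z^2)) + 1, z = standard normal 1 - alpha/2 quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.floor(math.log2(n) - math.log2(4 * z * z)) + 1


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        print(n, k_eda(n), k_uncertainty(n))
```

For n = 10000 the two functions give 9 and 10, in line with the slide's remark that the rules lead to similar answers.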
Graphical Displays of Large Data Sets
“The greatest value of a picture is when it forces us to notice what we never expected to see” (Tukey 1977)
• Quick summary without overwhelming amount of detail
• Approximate location, spread, shape of distribution
• Outlier identification
• Associations among variables