SUMMARIZING DATA Business Statistics
CONTENTS Data summaries Univariate summaries Bivariate summaries Statistical summaries Further study
DATA SUMMARIES Summarizing data means losing information, so why do it? ▪ to see the essential features at a glance
DATA SUMMARIES Many options, depending on: ▪ nature of variables ▪ numerical vs. categorical ▪ numerical: discrete vs. continuous ▪ categorical: binary (dichotomous) or not ▪ number of variables ▪ univariate ▪ bivariate ▪ multivariate ▪ range of data/number of categories ▪ level of detail and precision ▪ audience
DATA SUMMARIES Summarizing means data reduction, so losing information ▪ sometimes a bit ▪ sometimes a lot
DATA SUMMARIES Often important first step: sorting subjects (rows)
▪ original: $x_1, x_2, \ldots, x_n$
▪ sorted: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$
[Figure: example marking $x_{12}$ and $x_{(12)}$]
DATA SUMMARIES Some definitions on ordered data vectors
▪ median: $M = Q_2 = p_{50} = \begin{cases} x_{((n+1)/2)} & (n \text{ odd}) \\ \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & (n \text{ even}) \end{cases}$
▪ first quartile: $Q_1 = p_{25} \approx x_{(0.25(n+1))}$
▪ third quartile: $Q_3 = p_{75} \approx x_{(0.75(n+1))}$
▪ quintiles, deciles, percentiles
▪ minimum: $x_{\min} = x_{(1)}$
▪ maximum: $x_{\max} = x_{(n)}$
Why ≈ ? Consider the case $n = 4$.
We use a simplified rule for the $k$th percentile:
▪ position is $\frac{k}{100}(n+1)$
▪ precisely between two points: take the average of these two points
▪ otherwise round to the nearest data point
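The simplified percentile rule can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's definition: the function name `simple_percentile`, the sample values, and the clamping of positions that fall outside the data are assumptions made here. It also shows why the quartiles are only approximate for n = 4: the position 0.25(n+1) = 1.25 is not a whole number, so it has to be rounded.

```python
from fractions import Fraction
from math import floor

def simple_percentile(data, k):
    """k-th percentile via the simplified rule: position (k/100)(n+1) in the sorted data."""
    xs = sorted(data)
    n = len(xs)
    pos = Fraction(k * (n + 1), 100)  # exact arithmetic, so "precisely between" is reliable
    lower = floor(pos)
    frac = pos - lower
    if frac == Fraction(1, 2):
        # precisely between two points: average them
        lo, hi = lower, lower + 1
    else:
        # otherwise round to the nearest data point
        lo = hi = lower + (1 if frac > Fraction(1, 2) else 0)
    # assumption: clamp 1-based positions that fall outside 1..n
    lo = min(max(lo, 1), n)
    hi = min(max(hi, 1), n)
    return (xs[lo - 1] + xs[hi - 1]) / 2

values = [10, 2, 7, 4]  # hypothetical data, n = 4
print(simple_percentile(values, 25))  # position 1.25, rounds to the 1st sorted value
print(simple_percentile(values, 50))  # position 2.5, average of the 2nd and 3rd
print(simple_percentile(values, 75))  # position 3.75, rounds to the 4th sorted value
```

Statistical packages typically interpolate between neighbouring points instead of rounding, so their quartiles can differ slightly from this rule.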
EXERCISE 1 Given are the following data: 9, 7, 11, 5, 9, 8. Find a. the mean b. the median c. the first quartile
UNIVARIATE SUMMARIES Two cases: ▪ numerical data ▪ categorical data Two forms: ▪ graphical summaries ▪ statistical summaries
UNIVARIATE SUMMARIES Important concepts – numerical data ▪ measures of centrality ▪ measures of dispersion ▪ measures of range ▪ measures of shape Important concepts – categorical data ▪ measures of frequency ▪ proportion, odds (2 categories only)
UNIVARIATE SUMMARIES Numerical data – graphical summaries ▪ “A picture is worth a thousand words” Note our approximation of a “dot plot” (we misused a bivariate scatterplot, with the index on one axis)
UNIVARIATE SUMMARIES Numerical data – log-transforming variables ▪ to make highly skewed data less skewed ▪ to make patterns more visible ▪ to meet assumptions of inferential statistics [Figure: two plots, each against index]
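A small sketch of the first point, using made-up values that span several orders of magnitude. The variable names and the choice of base-10 logarithms are assumptions for illustration; note that a log transform only works for strictly positive data.

```python
import math
import statistics

incomes = [1, 10, 100, 1_000, 10_000]  # hypothetical, strongly right-skewed values
logged = [math.log10(v) for v in incomes]

# Raw scale: the largest value drags the mean far above the median.
print(statistics.mean(incomes), statistics.median(incomes))
# Log scale: the values are evenly spaced, so mean and median (nearly) coincide.
print(statistics.mean(logged), statistics.median(logged))
```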
UNIVARIATE SUMMARIES Numerical data – statistical summaries ▪ center (central tendency): mean, median, etc. ▪ variability (dispersion): variance, standard deviation, etc. ▪ range: minimum, maximum, etc.
UNIVARIATE SUMMARIES Numerical data – boxplots
▪ agreement on the “box” ($Q_1$, $Q_2$, and $Q_3$)
▪ but different conventions on the “whiskers”
  ▪ $x_{\min}$ and $x_{\max}$
  ▪ $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ (next page)
  ▪ with outliers indicated by symbols (* and/or °)
UNIVARIATE SUMMARIES Numerical data – boxplots ▪ minimum, first quartile, median, third quartile, maximum ▪ individual outliers [Figure: annotated boxplot showing the IQR (see below), whiskers reaching at most 1.5 IQR beyond the box, and mild and extreme outliers] Note: whiskers always end at a data point
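The 1.5 IQR whisker convention can be sketched as follows. The function name, the sample data, and in particular the 3 IQR cut-off separating mild from extreme outliers are assumptions here (the cut-off is a common convention, but it is not stated on the slide). The quartiles are passed in by the caller, so the sketch does not depend on any particular quartile rule.

```python
def boxplot_summary(data, q1, q3):
    """Whisker ends and outliers under the 1.5*IQR convention; q1 and q3 are supplied by the caller."""
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # assumed cut-off for "extreme"
    inside = [x for x in data if inner_lo <= x <= inner_hi]
    mild = [x for x in data if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    extreme = [x for x in data if x < outer_lo or x > outer_hi]
    # whiskers end at the most extreme data points that are still inside the inner fences
    return {"whiskers": (min(inside), max(inside)), "mild": mild, "extreme": extreme}

sample = [1, 8, 10, 11, 12, 12, 13, 13, 14, 16, 30]  # hypothetical data, already sorted
# With n = 11 the simplified rule puts the quartiles at positions 3 and 9.
summary = boxplot_summary(sample, q1=10, q3=14)
print(summary)
```

Because the whiskers are taken from the data themselves, they end at 8 and 16 here rather than at the fence values 4 and 20.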
UNIVARIATE SUMMARIES Numerical data – histograms ▪ frequency distribution Note the effect of bin size on the shape
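To see the effect of bin size without drawing anything, one can simply count observations per bin for two different widths. The helper `bin_counts` and the data are hypothetical; bins are taken as half-open intervals starting at 0, which is an assumption of this sketch.

```python
from collections import Counter

def bin_counts(data, width, start=0):
    """Count observations per bin [start + j*width, start + (j+1)*width), keyed by left edge."""
    counts = Counter(start + width * ((x - start) // width) for x in data)
    return dict(sorted(counts.items()))

ages = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical data
print(bin_counts(ages, width=2))  # narrow bins: the peak and the isolated value at 9 are visible
print(bin_counts(ages, width=5))  # wide bins: only two bars remain
```

Empty bins do not appear in the returned dictionary, so a gap shows up as a missing key.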
UNIVARIATE SUMMARIES Categorical data – graphical summaries ▪ less useful? [Figure: categorical variable plotted against index]
UNIVARIATE SUMMARIES Categorical data – graphical summaries ▪ frequency distribution ▪ as a pie chart ▪ as a bar chart (histogram)
UNIVARIATE SUMMARIES Categorical data – statistical summaries
▪ frequency distribution, expressed in
  ▪ counts (frequencies)
  ▪ proportions
  ▪ percentages
Or preferably using value labels: 0 = non-member, 1 = member
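A frequency distribution for a binary variable takes only a few lines. The observations below are invented for illustration; the value labels follow the slide (0 = non-member, 1 = member). The odds are included because they are defined for exactly two categories.

```python
from collections import Counter

labels = {0: "non-member", 1: "member"}  # value labels
codes = [1, 0, 0, 1, 1, 0, 1, 1]         # hypothetical observations

counts = Counter(labels[c] for c in codes)
n = sum(counts.values())
for label, count in counts.items():
    print(f"{label:<12} count={count}  proportion={count / n:.3f}  percentage={100 * count / n:.1f}%")

# With two categories, the odds compare one count with the other.
odds_member = counts["member"] / counts["non-member"]
print(f"odds of membership = {odds_member:.2f}")
```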
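UNIVARIATE SUMMARIES Numerical data – centrality
▪ mean
  ▪ $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
▪ median
  ▪ sorting of $x$: $x_i \to x_{(i)}$
  ▪ $M = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & n \text{ even} \end{cases}$
[Figure: distribution with mean and median marked]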
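A quick check of both definitions with Python's standard `statistics` module, on made-up numbers. The single large value shows how differently the two measures react to an outlier.

```python
import statistics

values = [2, 3, 5, 7, 100]  # hypothetical data with one large outlier
print(statistics.mean(values))    # uses every value, so it is pulled up by 100
print(statistics.median(values))  # n odd: the middle sorted value

even = [2, 3, 5, 7]               # n even: average of the two middle values
print(statistics.median(even))
```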
UNIVARIATE SUMMARIES Numerical data – centrality (less used)
▪ mode
  ▪ most frequently occurring value
  ▪ not very useful for continuous data
▪ geometric mean
  ▪ only for positive data
  ▪ $\sqrt[n]{x_1 x_2 \cdots x_n} = e^{\frac{1}{n} \sum_{i=1}^{n} \ln x_i}$
▪ midrange
  ▪ $\tfrac{1}{2}\left(\max_i x_i + \min_i x_i\right) = \tfrac{1}{2}\left(x_{(1)} + x_{(n)}\right)$
▪ $k$% trimmed mean
  ▪ mean, skipping the highest $k$% and the lowest $k$%
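The less-used measures can be sketched with the standard library as well. The helpers `midrange` and `trimmed_mean` and all data are hypothetical; how many points a k% trim removes when n·k/100 is not a whole number is not specified above, so the sketch assumes rounding down.

```python
import statistics

def midrange(data):
    """Average of the smallest and the largest value."""
    return (min(data) + max(data)) / 2

def trimmed_mean(data, k):
    """Mean after dropping the lowest and the highest k% of the sorted data.

    Assumption: the number dropped per tail is floor(n * k / 100).
    """
    xs = sorted(data)
    drop = len(xs) * k // 100
    kept = xs[drop:len(xs) - drop]
    return statistics.mean(kept)

print(statistics.multimode([1, 2, 2, 3]))       # most frequent value(s)
print(statistics.geometric_mean([1, 10, 100]))  # positive data only; close to 10
print(midrange([2, 4, 9]))
print(trimmed_mean(list(range(1, 20)) + [1000], k=5))  # drops 1 and 1000
```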
UNIVARIATE SUMMARIES Numerical data – centrality
Statistic | Properties | Example: life expectancy
Mean | much used, employs all information, a bit sensitive to outliers | 64.5 yr
Median | much used, discards some information, insensitive to outliers | 69 yr
Mode | not useful for continuous data, discards some information, sensitive to “binning” | 71.5 yr
Geometric mean | only positive data, more difficult to interpret, useful for index numbers and growth rates | 63.1 yr
Midrange | easy to calculate, discards a lot of information, very sensitive to outliers | 58.2 yr
Trimmed mean (with $k = 5$) | discards some information, insensitive to outliers, depends on value of $k$ | 65.1 yr
UNIVARIATE SUMMARIES Numerical data – dispersion
▪ variance
  ▪ $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
▪ standard deviation
  ▪ $s = \sqrt{s^2}$
▪ interquartile range (width of box in boxplot)
  ▪ $IQR = Q_3 - Q_1$
▪ coefficient of variation
  ▪ $CV = \frac{s}{\bar{x}}$ (provided $\bar{x} \neq 0$)
UNIVARIATE SUMMARIES Numerical data – dispersion (less used)
▪ range
  ▪ $\max_i x_i - \min_i x_i = x_{(n)} - x_{(1)}$
▪ mean absolute deviation
  ▪ $\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
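The dispersion measures can be computed side by side as a rough illustration. The data and helper names are invented; the IQR is left out because it depends on the quartile rule in use. Expressing the same data in different units shows that the coefficient of variation is dimensionless.

```python
import statistics

def mean_absolute_deviation(data):
    """(1/n) * sum of |x_i - mean|."""
    m = statistics.mean(data)
    return sum(abs(x - m) for x in data) / len(data)

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    m = statistics.mean(data)
    if m == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return statistics.stdev(data) / m

weights_kg = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical data
weights_g = [1000 * w for w in weights_kg]  # the same data in grams

print(statistics.variance(weights_kg))      # sample variance, divisor n - 1
print(statistics.stdev(weights_kg))
print(max(weights_kg) - min(weights_kg))    # range
print(mean_absolute_deviation(weights_kg))
print(coefficient_of_variation(weights_kg), coefficient_of_variation(weights_g))
```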
UNIVARIATE SUMMARIES Numerical data – dispersion
Statistic | Properties | Example: life expectancy
Variance | much used, square units, “additive” | 163.7 yr²
Standard deviation | much used | 12.8 yr
Interquartile range | discards some information, insensitive to outliers | 20.5 yr
Range | easy to calculate, discards a lot of information, very sensitive to outliers | 45.4 yr
Mean absolute deviation | easy to interpret | 10.8 yr
Coefficient of variation | much used, dimensionless, problematic for mean close to 0 | 0.20
EXERCISE 2 Given are two data vectors, $\mathbf{x}$ and $\mathbf{y}$, with [values shown on the slide]. Which has a larger coefficient of variation?
UNIVARIATE SUMMARIES Numerical data – shape
▪ skewness
  ▪ a measure of asymmetry
  ▪ approximately $M_3 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^3$ (more complicated formula in book)
  ▪ mainly used as a benchmark for normality or symmetry
[Figure: example of a symmetric distribution]
UNIVARIATE SUMMARIES Numerical data – shape
▪ kurtosis
  ▪ a measure of flatness
  ▪ approximately $M_4 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^4 - 3$ (more complicated formula in book)
  ▪ mainly used as a benchmark for normality
  ▪ also known as excess kurtosis
  ▪ $M_4 = 0$ for “normal” data
  ▪ platykurtic (platypus): $M_4 < 0$
  ▪ leptokurtic (leaping kangaroos): $M_4 > 0$
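Both shape measures follow the same pattern: standardize, raise to a power, and average. The sketch below implements the approximate formulas only, not the more complicated versions from the book, and statistical software may apply small-sample corrections that give slightly different numbers. Function names and data are made up.

```python
import statistics

def standardized_moment(data, power):
    """(1/n) * sum(((x_i - mean) / s) ** power), with s the sample standard deviation."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(((x - m) / s) ** power for x in data) / len(data)

def skewness(data):
    return standardized_moment(data, 3)

def excess_kurtosis(data):
    return standardized_moment(data, 4) - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical data
right_skewed = [1, 1, 1, 2, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))  # about 0, and positive
print(excess_kurtosis(symmetric))                   # negative: flatter than normal
```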
BIVARIATE SUMMARIES ▪ Several data types → different options: ▪ two numerical data vectors ▪ data vectors of “similar” type ▪ data vectors of “different” type ▪ one numerical and one categorical data vector ▪ two categorical data vectors ▪ Numerical summaries and graphical summaries
BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ treat as one numerical difference vector
BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ correlation analysis ▪ scatterplot
BIVARIATE SUMMARIES ▪ Two numerical data vectors of “different” type ▪ correlation analysis ▪ scatterplot
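Correlation analysis for two numerical vectors usually starts with a correlation coefficient. The slides do not name one, so the sketch below assumes Pearson's r, the most common choice; the function name and data are hypothetical.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum scaled by both sums of squares."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]       # hypothetical data
score = [52, 55, 61, 64, 70]  # hypothetical data

print(pearson_r(hours, score))            # strong positive association
print(pearson_r(hours, [5, 4, 3, 2, 1]))  # perfectly decreasing line
```

The coefficient measures linear association only, so the scatterplot remains the necessary companion.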
BIVARIATE SUMMARIES ▪ One numerical and one categorical data vector ▪ split numerical data vector into several data vectors Note: we actually have two (or more) groups/populations
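Splitting a numerical vector by a categorical one amounts to a group-by followed by the usual univariate summaries. The group labels and values below are invented.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (group label, numerical value)
records = [("A", 10), ("B", 30), ("A", 12), ("B", 50), ("A", 14), ("B", 40)]

groups = defaultdict(list)
for label, value in records:
    groups[label].append(value)  # one data vector per category

summary = {g: (statistics.mean(v), statistics.stdev(v)) for g, v in groups.items()}
for g, (mean, sd) in sorted(summary.items()):
    print(f"group {g}: n={len(groups[g])} mean={mean:.1f} sd={sd:.1f}")
```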
BIVARIATE SUMMARIES ▪ Two categorical data vectors ▪ cross tables (contingency tables) ▪ cells contain “counts” (frequencies)
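A cross table is a count of each combination of categories. A minimal sketch with made-up category names:

```python
from collections import Counter

# Hypothetical paired categorical observations: (membership, region)
pairs = [("member", "north"), ("member", "south"), ("non-member", "north"),
         ("member", "north"), ("non-member", "south"), ("non-member", "south")]

cells = Counter(pairs)  # cell counts keyed by (row category, column category)
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

print(f"{'':<12}" + "".join(f"{c:>8}" for c in cols))
for r in rows:
    print(f"{r:<12}" + "".join(f"{cells[(r, c)]:>8}" for c in cols))
```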
EXERCISE 3 Consider [output shown on the slide]. Which group (GATT=0 or GATT=1) has a higher mean and which group has a higher standard deviation?
STATISTICAL SUMMARIES ▪ Many choices ▪ centrality ▪ dispersion ▪ association ▪ What to use depends on ▪ the nature of the problem ▪ the nature of the data ▪ the audience to address
STATISTICAL SUMMARIES
▪ Remember that a summary summarizes ...
▪ ... and that data reduction reduces the amount of information
▪ Example: Anscombe’s quartet:
  ▪ $n_X = 11$, $\bar{X} = 9$, $s_X^2 = 11$
  ▪ $n_Y = 11$, $\bar{Y} = 7.50$, $s_Y^2 = 4.12$
  ▪ $r_{X,Y} = 0.816$
  ▪ OLS regression: $Y = 3.00 + 0.500X$
These summaries are too drastic!
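The point can be checked numerically. The values below are the four data sets as published by Anscombe (1973), typed in here rather than taken from the slides; the helper `summarize` is a hypothetical name. All four data sets give practically the same mean, variance, correlation, and regression line, although their scatterplots look entirely different.

```python
import math
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x-values of sets I to III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, sample variances, correlation, and least-squares line for one data set."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": statistics.variance(xs),
        "mean_y": my, "var_y": statistics.variance(ys),
        "r": sxy / math.sqrt(sxx * syy),
        "intercept": my - slope * mx, "slope": slope,
    }

for name, (xs, ys) in quartet.items():
    s = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in s.items()})
```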
STATISTICAL SUMMARIES Almost all jokes about statistics and statisticians are based on this ▪ but there is some truth in it
FURTHER STUDY Doane & Seward 5/E 3.1-3.9, 4.1-4.3, 4.5-4.6, 4.8 Tutorial exercises week 1 data summaries, box plots and histograms