
Business Statistics: Summarizing Data - PowerPoint PPT Presentation



  1. SUMMARIZING DATA Business Statistics

  2. CONTENTS ▪ Data summaries ▪ Univariate summaries ▪ Bivariate summaries ▪ Statistical summaries ▪ Further study

  3. DATA SUMMARIES Summarizing data means losing information, so why do it? ▪ to see the essential features at a glance

  4. DATA SUMMARIES Many options, depending on:
     ▪ nature of variables
       ▪ numerical vs. categorical
       ▪ numerical: discrete vs. continuous
       ▪ categorical: binary (dichotomous) or not
     ▪ number of variables
       ▪ univariate
       ▪ bivariate
       ▪ multivariate
     ▪ range of data/number of categories
     ▪ level of detail and precision
     ▪ audience

  5. DATA SUMMARIES Summarizing means data reduction, so losing information ▪ sometimes a bit ▪ sometimes a lot

  6. DATA SUMMARIES Often an important first step: sorting subjects (rows)
     ▪ original: $x_1, x_2, \dots, x_n$
     ▪ sorted: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$, such that $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$

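  7. DATA SUMMARIES Some definitions on ordered data vectors $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$
     ▪ median: $M = Q_2 = p_{50} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & (n \text{ odd}) \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & (n \text{ even}) \end{cases}$
     ▪ first quartile: $Q_1 = p_{25} \approx x_{(0.25(n+1))}$
     ▪ third quartile: $Q_3 = p_{75} \approx x_{(0.75(n+1))}$
     ▪ Why ≈? Consider the case $n = 4$.
     ▪ quintiles, deciles, percentiles
     ▪ minimum: $x_{\min} = x_{(1)}$
     ▪ maximum: $x_{\max} = x_{(n)}$
     ▪ We use a simplified rule for the $k$th percentile:
       ▪ position is $\frac{k}{100}(n+1)$
       ▪ precisely between two points: take the average of these two points
       ▪ otherwise: round to the nearest data point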
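
The simplified percentile rule on this slide can be sketched in Python as follows. The function name `simplified_percentile` and the sample data are illustrative assumptions, not part of the slides. Positions that fall outside 1..n are clamped to the minimum or maximum, a choice the slides do not specify.

```python
from fractions import Fraction
from math import floor


def simplified_percentile(data, k):
    """k-th percentile (integer k) using the position rule (k/100)*(n+1)."""
    xs = sorted(data)
    n = len(xs)
    pos = Fraction(k, 100) * (n + 1)  # 1-based position, kept exact
    # Assumption: clamp positions that fall outside the data.
    if pos <= 1:
        return xs[0]
    if pos >= n:
        return xs[-1]
    lower = floor(pos)
    frac = pos - lower
    if frac == Fraction(1, 2):
        # Precisely between two points: average them.
        return (xs[lower - 1] + xs[lower]) / 2
    # Otherwise round to the nearest data point.
    nearest = lower if frac < Fraction(1, 2) else lower + 1
    return xs[nearest - 1]


sample = [2, 4, 7, 10]  # hypothetical data with n = 4
q1 = simplified_percentile(sample, 25)      # position 1.25 -> first sorted value
median = simplified_percentile(sample, 50)  # position 2.5 -> average of 2nd and 3rd
q3 = simplified_percentile(sample, 75)      # position 3.75 -> fourth sorted value
print(q1, median, q3)
```

With n = 4 the first quartile lands on the minimum and the third quartile on the maximum, which illustrates why the quartile formulas are only approximate.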

  8. EXERCISE 1 Consider the following data: 9, 7, 11, 5, 9, 8. Find a. the mean b. the median c. the first quartile

  9. UNIVARIATE SUMMARIES Two cases: ▪ numerical data ▪ categorical data Two forms: ▪ graphical summaries ▪ statistical summaries

  10. UNIVARIATE SUMMARIES Important concepts – numerical data ▪ measures of centrality ▪ measures of dispersion ▪ measures of range ▪ measures of shape Important concepts – categorical data ▪ measures of frequency ▪ proportion, odds (2 categories only)

  11. UNIVARIATE SUMMARIES Numerical data – graphical summaries ▪ “A picture is worth a thousand words” Note our approximation of a “dot plot” (we misused a bivariate scatterplot)

  12. UNIVARIATE SUMMARIES Numerical data – log-transforming variables ▪ to make highly skewed data less skewed ▪ to make patterns more visible ▪ to meet assumptions of inferential statistics
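
As a small illustration of the first point, the sketch below log-transforms a hypothetical right-skewed data vector (the values are made up for this example) and compares mean and median before and after. A large gap between mean and median is one informal sign of skewness.

```python
import math
import statistics

values = [1, 10, 100, 1000]  # hypothetical right-skewed data
logs = [math.log10(v) for v in values]  # base-10 log of each value

raw_mean, raw_median = statistics.mean(values), statistics.median(values)
log_mean, log_median = statistics.mean(logs), statistics.median(logs)

print("raw:", raw_mean, raw_median)  # mean far above the median
print("log:", log_mean, log_median)  # mean and median coincide
```

On the raw scale the mean is pulled far above the median by the largest value. On the log scale the points are evenly spaced and the two measures agree.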

  13. UNIVARIATE SUMMARIES Numerical data – statistical summaries ▪ center (central tendency): mean, median, etc. ▪ variability (dispersion): variance, standard deviation, etc. ▪ range: minimum, maximum, etc.

  14. UNIVARIATE SUMMARIES Numerical data – boxplots
      ▪ agreement on the “box” ($Q_1$, $Q_2$, and $Q_3$)
      ▪ but different conventions on the “whiskers”
        ▪ $x_{\min}$ and $x_{\max}$
        ▪ $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ (next page)
        ▪ with outliers indicated by symbols (* and/or °)

  15. UNIVARIATE SUMMARIES Numerical data – boxplots ▪ minimum, first quartile, median, third quartile, maximum ▪ individual outliers
      [Figure: boxplot annotated with the IQR (see below), whiskers reaching up to 1.5 IQR beyond the box, mild outliers, and extreme outliers.]
      Note: whiskers always end at a data point.
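
The 1.5 IQR whisker convention can be sketched as follows. The function name `boxplot_summary` and the data are hypothetical. Two assumptions are made: quartiles come from Python's `statistics.quantiles`, which interpolates rather than rounding to the nearest data point, and "extreme" outliers are taken to lie more than 3 IQR from the box, a common convention that the slide only hints at.

```python
import statistics


def boxplot_summary(data):
    """Whisker ends and outliers under the 1.5*IQR convention."""
    xs = sorted(data)
    q1, _, q3 = statistics.quantiles(xs, n=4)  # interpolating quartiles
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr  # assumed 3*IQR cut-off
    inside = [x for x in xs if inner_low <= x <= inner_high]
    extreme = [x for x in xs if x < outer_low or x > outer_high]
    mild = [x for x in xs if x not in inside and x not in extreme]
    # Whiskers end at the most extreme data points inside the inner fences.
    return {"whiskers": (inside[0], inside[-1]), "mild": mild, "extreme": extreme}


data = [1, 10, 11, 12, 13, 14, 15, 16, 17, 30, 60]  # hypothetical data
summary = boxplot_summary(data)
print(summary)
```

Here the fences sit at 2 and 26, but the whiskers are drawn at 10 and 17 because those are the outermost observations inside the fences.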

  16. UNIVARIATE SUMMARIES Numerical data – histograms ▪ frequency distribution Note the effect of bin size on the shape
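
To make the bin-size effect concrete, here is a minimal sketch that counts the same hypothetical data with two different bin widths. The helper `histogram_counts` is an illustrative name, not a library function.

```python
def histogram_counts(data, start, width, n_bins):
    """Count observations in equal-width bins [start + j*width, start + (j+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        j = int((x - start) // width)  # index of the bin containing x
        if 0 <= j < n_bins:
            counts[j] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical data
narrow = histogram_counts(data, start=0, width=2, n_bins=5)
wide = histogram_counts(data, start=0, width=5, n_bins=2)
print(narrow)  # five bins of width 2
print(wide)    # two bins of width 5
```

The narrow bins reveal an empty bin and an isolated observation at 9. The wide bins merge everything into two bars and hide that feature.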

  17. UNIVARIATE SUMMARIES Categorical data – graphical summaries ▪ less useful?

  18. UNIVARIATE SUMMARIES Categorical data – graphical summaries ▪ frequency distribution ▪ as a pie chart ▪ as a bar chart (histogram)

  19. UNIVARIATE SUMMARIES Categorical data – statistical summaries
      ▪ frequency distribution, expressed in
        ▪ counts (frequencies)
        ▪ proportions
        ▪ percentages
      Or preferably using value labels: 0 = non-member, 1 = member
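
A frequency distribution with value labels might be computed along these lines. The coded observations are hypothetical, and the odds calculation (meaningful for two categories only) is added for illustration.

```python
from collections import Counter

labels = {0: "non-member", 1: "member"}  # value labels from the slide
membership = [1, 0, 0, 1, 1, 0, 0, 0]    # hypothetical coded observations

counts = Counter(labels[code] for code in membership)
n = len(membership)
proportions = {label: count / n for label, count in counts.items()}
percentages = {label: 100 * p for label, p in proportions.items()}
odds_member = counts["member"] / counts["non-member"]  # odds of being a member

for label in counts:
    print(label, counts[label], proportions[label], percentages[label])
```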

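  20. UNIVARIATE SUMMARIES Numerical data – centrality
      ▪ mean
        ▪ $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
      ▪ median
        ▪ sorting of $x$: $x_i \to x_{(i)}$
        ▪ $M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & n \text{ even} \end{cases}$
      [Figure: distribution with the mean and the median marked.]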

  21. UNIVARIATE SUMMARIES Numerical data – centrality (less used)
      ▪ mode
        ▪ most frequently occurring value
        ▪ not very useful for continuous data
      ▪ geometric mean
        ▪ only for positive data
        ▪ $\sqrt[n]{x_1 x_2 \cdots x_n} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$
      ▪ midrange
        ▪ $\frac{1}{2}\left(\max_i x_i + \min_i x_i\right) = \frac{1}{2}\left(x_{(1)} + x_{(n)}\right)$
      ▪ $k\%$ trimmed mean
        ▪ mean, skipping the highest $k\%$ and the lowest $k\%$
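
The less-used measures of centrality can be sketched as below. The data vector is hypothetical, and the trimmed mean rounds the number of dropped observations down, which is an assumption since the slide does not say how to handle fractions.

```python
import math
import statistics

x = [1, 2, 4, 4, 8, 16]  # hypothetical positive data


def geometric_mean(data):
    """exp of the mean of the logs; only defined for positive data."""
    return math.exp(sum(math.log(v) for v in data) / len(data))


def midrange(data):
    """Average of the largest and smallest value."""
    return (max(data) + min(data)) / 2


def trimmed_mean(data, k):
    """Mean after skipping the lowest k% and the highest k% of observations."""
    xs = sorted(data)
    drop = int(len(xs) * k / 100)  # assumption: round the count down
    kept = xs[drop:len(xs) - drop]
    return sum(kept) / len(kept)


mode = statistics.mode(x)
gm = geometric_mean(x)
mr = midrange(x)
tm = trimmed_mean(x, 20)
print(mode, gm, mr, tm)
```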

  22. UNIVARIATE SUMMARIES Numerical data – centrality

      | Statistic | Properties | Example: life expectancy |
      |---|---|---|
      | Mean | much used, employs all information, a bit sensitive to outliers | 64.5 yr |
      | Median | much used, discards some information, insensitive to outliers | 69 yr |
      | Mode | not useful for continuous data, discards some information, sensitive to “binning” | 71.5 yr |
      | Geometric mean | only positive data, more difficult to interpret, useful for index numbers and growth rates | 63.1 yr |
      | Midrange | easy to calculate, discards a lot of information, very sensitive to outliers | 58.2 yr |
      | Trimmed mean (with $k = 5$) | discards some information, insensitive to outliers, depends on value of $k$ | 65.1 yr |

  23. UNIVARIATE SUMMARIES Numerical data – dispersion
      ▪ variance
        ▪ $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$
      ▪ standard deviation
        ▪ $s = \sqrt{s^2}$
      ▪ interquartile range (width of box in boxplot)
        ▪ $IQR = Q_3 - Q_1$
      ▪ coefficient of variation
        ▪ $CV = \frac{s}{\bar{x}}$ (provided $\bar{x} \ne 0$)

  24. UNIVARIATE SUMMARIES Numerical data – dispersion (less used)
      ▪ range
        ▪ $\max_i x_i - \min_i x_i = x_{(n)} - x_{(1)}$
      ▪ mean absolute deviation
        ▪ $\frac{1}{n}\sum_{i=1}^{n} \left| x_i - \bar{x} \right|$
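
The dispersion measures from the last two slides can be computed with Python's standard library, as in this sketch on a hypothetical data vector. `statistics.variance` and `statistics.stdev` use the n−1 denominator, matching the sample variance defined on the previous slide.

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.mean(x)
variance = statistics.variance(x)  # sample variance, n-1 denominator
std_dev = statistics.stdev(x)      # square root of the sample variance
cv = std_dev / mean                # coefficient of variation, requires mean != 0
data_range = max(x) - min(x)
mad = sum(abs(v - mean) for v in x) / len(x)  # mean absolute deviation

print(variance, std_dev, cv, data_range, mad)
```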

  25. UNIVARIATE SUMMARIES Numerical data – dispersion

      | Statistic | Properties | Example: life expectancy |
      |---|---|---|
      | Variance | much used, square units, “additive” | 163.7 yr² |
      | Standard deviation | much used | 12.8 yr |
      | Interquartile range | discards some information, insensitive to outliers | 20.5 yr |
      | Range | easy to calculate, discards a lot of information, very sensitive to outliers | 45.4 yr |
      | Mean absolute deviation | easy to interpret | 10.8 yr |
      | Coefficient of variation | much used, dimensionless, problematic for mean close to 0 | 0.20 |

  26. EXERCISE 2 Given are two data vectors, $\mathbf{x}$ and $\mathbf{y}$, with [values shown on the slide]. Which has the larger coefficient of variation?

  27. UNIVARIATE SUMMARIES Numerical data – shape
      ▪ skewness
        ▪ a measure of asymmetry
        ▪ approximately $M_3 \approx \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)^3$ (more complicated formula in book)
        ▪ mainly used as a benchmark for normality or symmetry
      [Figure panel: Symmetric]

  28. UNIVARIATE SUMMARIES Numerical data – shape
      ▪ kurtosis
        ▪ a measure of flatness
        ▪ approximately $M_4 \approx \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)^4 - 3$ (more complicated formula in book)
        ▪ mainly used as a benchmark for normality
        ▪ also known as excess kurtosis
        ▪ $M_4 = 0$ for “normal” data
        ▪ platykurtic (platypus): $M_4 < 0$
        ▪ leptokurtic (leaping kangaroos): $M_4 > 0$
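
The approximate skewness and excess-kurtosis formulas from the two shape slides can be sketched as follows. The function names and data are hypothetical, and these are the simplified versions, not the more complicated textbook formulas the slides refer to.

```python
import statistics


def standardized_moment(data, power):
    """(1/n) * sum of ((x_i - mean) / s) ** power, with s the sample std. dev."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(((v - mean) / s) ** power for v in data) / len(data)


def skewness(data):
    return standardized_moment(data, 3)


def excess_kurtosis(data):
    return standardized_moment(data, 4) - 3


symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

The symmetric vector gives a skewness of essentially zero, the long right tail gives a positive value, and the evenly spread vector has negative excess kurtosis (platykurtic).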

  29. BIVARIATE SUMMARIES
      ▪ Several data types → different options:
        ▪ two numerical data vectors
          ▪ data vectors of “similar” type
          ▪ data vectors of “different” type
        ▪ one numerical and one categorical data vector
        ▪ two categorical data vectors
      ▪ Numerical summaries and graphical summaries

  30. BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ treat as one numerical difference vector
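
For two “similar” vectors, such as before and after measurements on the same subjects, the difference vector can be summarized like any single numerical variable. The values below are hypothetical.

```python
import statistics

before = [70, 82, 65, 90]  # hypothetical first measurement
after = [68, 80, 66, 85]   # hypothetical second measurement, same subjects

# One numerical difference vector, summarized with univariate tools.
differences = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)

print(differences, mean_diff, sd_diff)
```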

  31. BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ correlation analysis ▪ scatterplot

  32. BIVARIATE SUMMARIES ▪ Two numerical data vectors of “different” type ▪ correlation analysis ▪ scatterplot
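
The slides do not spell out the correlation formula. The sketch below assumes the usual Pearson correlation coefficient, with a hypothetical helper `pearson_r` and made-up data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equally long numerical vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(r)
```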

  33. BIVARIATE SUMMARIES ▪ One numerical and one categorical data vector ▪ split the numerical data vector into several data vectors, one per category Note: we actually have two (or more) groups/populations
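
Splitting a numerical vector by a categorical one and summarizing each group could look like this. The group codes and values are hypothetical.

```python
import statistics
from collections import defaultdict

group = [0, 0, 0, 1, 1, 1]        # hypothetical binary category
value = [10, 12, 14, 20, 30, 40]  # hypothetical numerical variable

# Split the numerical vector into one vector per category.
by_group = defaultdict(list)
for g, v in zip(group, value):
    by_group[g].append(v)

# Mean and standard deviation within each group.
summary = {g: (statistics.mean(vs), statistics.stdev(vs)) for g, vs in by_group.items()}
print(summary)
```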

  34. BIVARIATE SUMMARIES ▪ Two categorical data vectors ▪ cross tables (contingency tables) ▪ cells contain “counts” (frequencies)
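
A cross table of counts can be built by counting pairs of categories, as in this sketch. The variable names and observations are hypothetical.

```python
from collections import Counter

member = ["yes", "no", "yes", "yes", "no", "no"]                 # hypothetical
region = ["north", "north", "south", "south", "south", "north"]  # hypothetical

# Each cell of the cross table holds the count of one (member, region) pair.
table = Counter(zip(member, region))

for m in ("yes", "no"):
    print(m, [table[(m, r)] for r in ("north", "south")])
```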

  35. EXERCISE 3 Consider [output shown on the slide]. Which group (GATT=0 or GATT=1) has the higher mean, and which group has the higher standard deviation?

  36. STATISTICAL SUMMARIES
      ▪ Many choices
        ▪ centrality
        ▪ dispersion
        ▪ association
      ▪ What to use depends on
        ▪ the nature of the problem
        ▪ the nature of the data
        ▪ the audience to address

  37. STATISTICAL SUMMARIES
      ▪ Remember that a summary summarizes ...
      ▪ ... and that data reduction reduces the amount of information
      ▪ Example: Anscombe’s quartet (the same summaries for all four data sets):
        ▪ $n_X = 11$, $\bar{X} = 9$, $s_X^2 = 11$
        ▪ $n_Y = 11$, $\bar{Y} = 7.50$, $s_Y^2 = 4.12$
        ▪ correlation $r_{X,Y} = 0.816$
        ▪ OLS regression: $Y = 3.00 + 0.500X$
      These summaries are too drastic!
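
The point can be checked numerically. The sketch below uses the standard published values of Anscombe's quartet (they are not reproduced on the slide) and a hypothetical helper `summarize` that computes the summaries listed above for each of the four data sets.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, correlation, and OLS line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x, "var_x": sxx / (n - 1),
        "mean_y": mean_y, "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "intercept": mean_y - slope * mean_x, "slope": slope,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(val, 3) for key, val in s.items()})
```

All four data sets print nearly identical summaries, although their scatterplots look very different.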

  38. STATISTICAL SUMMARIES Almost all jokes about statistics and statisticians are based on this ▪ but there is some truth in it

  39. FURTHER STUDY ▪ Doane & Seward 5/E: 3.1-3.9, 4.1-4.3, 4.5-4.6, 4.8 ▪ Tutorial exercises week 1: data summaries, box plots and histograms
