Business Statistics CONTENTS Data summaries Univariate summaries - PowerPoint PPT Presentation

SUMMARIZING DATA Business Statistics

CONTENTS
▪ Data summaries
▪ Univariate summaries
▪ Bivariate summaries
▪ Statistical summaries
▪ Further study

DATA SUMMARIES
Summarizing data means losing information, so why do it?
▪ to see the essential features at a glance

DATA SUMMARIES
Many options, depending on:
▪ nature of variables
  ▪ numerical vs. categorical
  ▪ numerical: discrete vs. continuous
  ▪ categorical: binary (dichotomous) or not
▪ number of variables
  ▪ univariate
  ▪ bivariate
  ▪ multivariate
▪ range of data/number of categories
▪ level of detail and precision
▪ audience

DATA SUMMARIES Summarizing means data reduction, so losing information ▪ sometimes a bit ▪ sometimes a lot

DATA SUMMARIES
Often an important first step: sorting subjects (rows)
▪ original: $x_1, x_2, \ldots, x_n$
▪ sorted: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$

DATA SUMMARIES
Some definitions on ordered data vectors (with $x_{(i)}$ the $i$-th smallest value)
▪ median: $M = Q_2 = p_{50} = x_{((n+1)/2)}$ for $n$ odd, and $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ for $n$ even
▪ first quartile: $Q_1 = p_{25} \approx x_{(0.25(n+1))}$
▪ third quartile: $Q_3 = p_{75} \approx x_{(0.75(n+1))}$
▪ quintiles, deciles, percentiles
▪ minimum: $x_{\min} = x_{(1)}$
▪ maximum: $x_{\max} = x_{(n)}$
Why ≈? Consider the case $n = 4$: the position $0.25(n+1) = 1.25$ is not a whole number.
We use a simplified rule for the $k$-th percentile: the position is $\frac{k}{100}(n+1)$
▪ precisely between two points: take the average of these two points
▪ otherwise: round to the nearest data point
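The simplified percentile rule can be sketched in Python as follows. This is a minimal illustration, not the textbook's algorithm: the function name `simplified_percentile`, the clamping of positions outside 1..n, and the data values are assumptions made for the example.

```python
from fractions import Fraction
from math import floor

def simplified_percentile(data, k):
    """k-th percentile (k an integer) using position = (k/100) * (n + 1)."""
    xs = sorted(data)
    n = len(xs)
    pos = Fraction(k, 100) * (n + 1)   # exact arithmetic, no float rounding
    # Assumption: positions outside 1..n are clamped to the nearest end.
    pos = min(max(pos, Fraction(1)), Fraction(n))
    lower = floor(pos)
    frac = pos - lower
    if frac == Fraction(1, 2):
        # Precisely between two points: average them.
        return (xs[lower - 1] + xs[lower]) / 2
    # Otherwise round to the nearest data point (positions are 1-based).
    nearest = lower + 1 if frac > Fraction(1, 2) else lower
    return xs[nearest - 1]

data = [12, 15, 11, 20, 18, 14, 16]          # hypothetical values, n = 7
q1 = simplified_percentile(data, 25)         # position 2
median = simplified_percentile(data, 50)     # position 4
q3 = simplified_percentile(data, 75)         # position 6
```

With n = 4 the first-quartile position is 1.25, so the rule rounds down to the first data point, which is why the quartile formulas carry an ≈ sign.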

EXERCISE 1
Given are the following data: 9, 7, 11, 5, 9, 8. Find
a. the mean
b. the median
c. the first quartile

UNIVARIATE SUMMARIES Two cases: ▪ numerical data ▪ categorical data Two forms: ▪ graphical summaries ▪ statistical summaries

UNIVARIATE SUMMARIES Important concepts – numerical data ▪ measures of centrality ▪ measures of dispersion ▪ measures of range ▪ measures of shape Important concepts – categorical data ▪ measures of frequency ▪ proportion, odds (2 categories only)

UNIVARIATE SUMMARIES
Numerical data – graphical summaries
▪ “A picture is worth a thousand words”
Note our approximation of a “dot plot” (we misused a bivariate scatterplot)

UNIVARIATE SUMMARIES
Numerical data – log-transforming variables
▪ to make highly skewed data less skewed
▪ to make patterns more visible
▪ to meet assumptions of inferential statistics
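A small sketch of the first point, assuming hypothetical firm sizes that double from one firm to the next. The helper `skewness` uses the approximate formula (mean of cubed standardized deviations); both the helper name and the data are illustrative assumptions.

```python
import math
import statistics

def skewness(xs):
    """Approximate skewness: mean of cubed standardized deviations."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)          # sample standard deviation (n - 1)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Hypothetical, strongly right-skewed data.
sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
log_sizes = [math.log2(v) for v in sizes]   # becomes 0, 1, ..., 8

skew_raw = skewness(sizes)       # clearly positive: long right tail
skew_log = skewness(log_sizes)   # essentially zero: symmetric after the transform
```

The base of the logarithm only rescales the transformed values, so it does not affect the skewness.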

UNIVARIATE SUMMARIES Numerical data – statistical summaries ▪ center (central tendency): mean, median, etc. ▪ variability (dispersion): variance, standard deviation, etc. ▪ range: minimum, maximum, etc.

UNIVARIATE SUMMARIES
Numerical data – boxplots
▪ agreement on the “box” ($Q_1$, $Q_2$, and $Q_3$)
▪ but different conventions on the “whiskers”
  ▪ $x_{\min}$ and $x_{\max}$
  ▪ $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ (next slide)
  ▪ with outliers indicated by symbols (* and/or ⁰)

UNIVARIATE SUMMARIES
Numerical data – boxplots
▪ minimum, first quartile, median, third quartile, maximum
▪ individual outliers
[Figure: annotated boxplot marking the IQR (see below), a distance of 1.5 IQR on either side of the box, and mild and extreme outliers. Whiskers always end at a data point.]
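The 1.5 IQR convention can be sketched as below. Several things here are assumptions rather than content of the slides: the function name `boxplot_summary`, the 3 IQR boundary between mild and extreme outliers, the data, and the use of `statistics.quantiles`, whose interpolation rule can differ slightly from the simplified percentile rule.

```python
import statistics

def boxplot_summary(data):
    """Box, whiskers and outliers under the 1.5 IQR convention."""
    xs = sorted(data)
    # Quartiles via the standard library (default "exclusive" method).
    q1, q2, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Assumption: points beyond 3 IQR from the box count as extreme outliers.
    outer_lo, outer_hi = q1 - 3 * iqr, q3 + 3 * iqr
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    return {
        "box": (q1, q2, q3),
        # Whiskers end at the most extreme data points inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "mild": [x for x in xs
                 if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi],
        "extreme": [x for x in xs if x < outer_lo or x > outer_hi],
    }

# Hypothetical data with one low and one high outlier.
result = boxplot_summary([1, 10, 11, 12, 13, 14, 15, 16, 17, 24, 40])
```

Here the box is (11, 14, 17), so the fences lie at 2 and 26 and the whiskers stop at the data points 10 and 24.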

UNIVARIATE SUMMARIES
Numerical data – histograms
▪ frequency distribution
Note the effect of bin size on the shape
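The effect of bin size can be seen without plotting by counting the same data in wide and narrow bins. The function `bin_counts`, its left-closed bins, and the ages are assumptions chosen for illustration.

```python
from collections import Counter

def bin_counts(data, width, start=0):
    """Counts per bin [start + j*width, start + (j+1)*width)."""
    counts = Counter(int((x - start) // width) for x in data)
    first, last = min(counts), max(counts)
    # Include empty bins between the first and last occupied bin.
    return {(start + j * width, start + (j + 1) * width): counts[j]
            for j in range(first, last + 1)}

ages = [21, 23, 24, 29, 31, 32, 38, 41, 44, 58]   # hypothetical values

coarse = bin_counts(ages, width=20)   # two bins: one broad lump
fine = bin_counts(ages, width=5)      # eight bins: gaps become visible
```

The coarse version shows a single decreasing shape, while the fine version reveals the empty stretch between 45 and 55.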

UNIVARIATE SUMMARIES
Categorical data – graphical summaries
▪ less useful?

UNIVARIATE SUMMARIES Categorical data – graphical summaries ▪ frequency distribution ▪ as a piechart ▪ as a barchart (histogram)

UNIVARIATE SUMMARIES
Categorical data – statistical summaries
▪ frequency distribution
Expressed in
▪ counts (frequencies)
▪ proportions
▪ percentages
Or preferably using value labels: 0 = non-member, 1 = member
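A frequency distribution for a binary variable might be computed as in this sketch. The value labels follow the slide; the coded observations are hypothetical, and the odds are included because they apply to the two-category case.

```python
from collections import Counter

labels = {0: "non-member", 1: "member"}        # value labels
membership = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]    # hypothetical coded data

n = len(membership)
counts = Counter(labels[v] for v in membership)
proportions = {name: c / n for name, c in counts.items()}
percentages = {name: 100 * c / n for name, c in counts.items()}

# Odds (two categories only): members per non-member.
odds_member = counts["member"] / counts["non-member"]
```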

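UNIVARIATE SUMMARIES
Numerical data – centrality
▪ mean
  ▪ $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
▪ median
  ▪ sorting of $x$: $x_i \to x_{(i)}$
  ▪ $M = x_{((n+1)/2)}$ for $n$ odd, and $M = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ for $n$ even
[Figure: distribution with the mean and the median marked]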
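Both definitions translate directly into code. The sketch below uses hypothetical data and adds one extreme value to show how differently the two statistics react.

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # x_((n+1)/2) in 1-based notation
    return (s[mid - 1] + s[mid]) / 2   # average of x_(n/2) and x_(n/2+1)

values = [3, 1, 4, 1, 5, 9, 2, 6]      # hypothetical data
with_outlier = values + [100]

mean_before, mean_after = mean(values), mean(with_outlier)
median_before, median_after = median(values), median(with_outlier)
```

The outlier moves the mean from 3.875 to about 14.6, while the median only moves from 3.5 to 4.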

UNIVARIATE SUMMARIES
Numerical data – centrality (less used)
▪ mode
  ▪ most frequently occurring value
  ▪ not very useful for continuous data
▪ geometric mean
  ▪ only for positive data
  ▪ $\sqrt[n]{x_1 x_2 \cdots x_n} = e^{\frac{1}{n} \sum_{i=1}^{n} \ln x_i}$
▪ midrange
  ▪ $\frac{1}{2}\left(\max_i x_i + \min_i x_i\right) = \frac{1}{2}\left(x_{(1)} + x_{(n)}\right)$
▪ $k$% trimmed mean
  ▪ mean, skipping the highest $k$% and the lowest $k$%
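The four less-used measures can be sketched as follows. The function names, the tie-breaking in `mode`, and the rounding down of the number of trimmed points are assumptions; the data are hypothetical.

```python
import math
from collections import Counter

def mode(xs):
    """Most frequent value (ties: the value that reaches the top count first)."""
    return Counter(xs).most_common(1)[0][0]

def geometric_mean(xs):
    """exp of the mean of the logs; requires strictly positive data."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def midrange(xs):
    """Halfway between the minimum and the maximum."""
    return (max(xs) + min(xs)) / 2

def trimmed_mean(xs, k):
    """Mean after dropping the lowest k% and highest k% (k < 50)."""
    s = sorted(xs)
    cut = int(len(s) * k / 100)    # assumption: round the cut count down
    kept = s[cut:len(s) - cut]
    return sum(kept) / len(kept)

data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]   # hypothetical, one large value

data_mode = mode(data)
data_midrange = midrange(data)
data_trimmed = trimmed_mean(data, 10)    # drops 2 and 53
data_gmean = geometric_mean(data)
```

The single value 53 pulls the midrange far above the bulk of the data, while the 10% trimmed mean ignores it entirely.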

UNIVARIATE SUMMARIES
Numerical data – centrality

| Statistic | Properties | Example: life expectancy |
|---|---|---|
| Mean | much used, employs all information, a bit sensitive to outliers | 64.5 yr |
| Median | much used, discards some information, insensitive to outliers | 69 yr |
| Mode | not useful for continuous data, discards some information, sensitive to “binning” | 71.5 yr |
| Geometric mean | only positive data, more difficult to interpret, useful for index numbers and growth rates | 63.1 yr |
| Midrange | easy to calculate, discards a lot of information, very sensitive to outliers | 58.2 yr |
| Trimmed mean (with $k = 5$) | discards some information, insensitive to outliers, depends on value of $k$ | 65.1 yr |

UNIVARIATE SUMMARIES
Numerical data – dispersion
▪ variance
  ▪ $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
▪ standard deviation
  ▪ $s = \sqrt{s^2}$
▪ interquartile range (width of box in boxplot)
  ▪ $IQR = Q_3 - Q_1$
▪ coefficient of variation
  ▪ $CV = \frac{s}{\bar{x}}$ (provided $\bar{x} \neq 0$)
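A minimal sketch of the variance, standard deviation and coefficient of variation, using hypothetical data. The function names are illustrative; the variance uses the $n-1$ denominator as in the formula above.

```python
import math

def variance(xs):
    """Sample variance with the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))

def coeff_var(xs):
    """Coefficient of variation s / mean; undefined when the mean is 0."""
    m = sum(xs) / len(xs)
    if m == 0:
        raise ValueError("coefficient of variation needs a non-zero mean")
    return std_dev(xs) / m

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data, mean 5
rescaled = [10 * x for x in data]      # same data in different units

var_data = variance(data)              # 32 / 7
cv_data = coeff_var(data)
cv_rescaled = coeff_var(rescaled)      # unchanged: CV is dimensionless
```

Multiplying all values by 10 multiplies the standard deviation by 10 but leaves the coefficient of variation as it was.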

UNIVARIATE SUMMARIES
Numerical data – dispersion (less used)
▪ range
  ▪ $\max_i x_i - \min_i x_i = x_{(n)} - x_{(1)}$
▪ mean absolute deviation
  ▪ $\frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right|$
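Both measures take only a few lines. The data below are hypothetical and the function names are illustrative.

```python
def value_range(xs):
    """Largest value minus smallest value."""
    return max(xs) - min(xs)

def mean_abs_deviation(xs):
    """Average absolute distance from the mean (denominator n)."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data, mean 5

data_range = value_range(data)         # 9 - 2
data_mad = mean_abs_deviation(data)    # (3+1+1+1+0+0+2+4) / 8
```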

UNIVARIATE SUMMARIES
Numerical data – dispersion

| Statistic | Properties | Example: life expectancy |
|---|---|---|
| Variance | much used, square units, “additive” | 163.7 yr² |
| Standard deviation | much used | 12.8 yr |
| Interquartile range | discards some information, insensitive to outliers | 20.5 yr |
| Range | easy to calculate, discards a lot of information, very sensitive to outliers | 45.4 yr |
| Mean absolute deviation | easy to interpret | 10.8 yr |
| Coefficient of variation | much used, dimensionless, problematic for mean close to 0 | 0.20 |

EXERCISE 2
Given are two data vectors, $\mathbf{x}$ and $\mathbf{y}$, with [values shown on the slide; not recoverable here]. Which has the larger coefficient of variation?

UNIVARIATE SUMMARIES
Numerical data – shape
▪ skewness
  ▪ a measure of asymmetry
  ▪ approximately $M_3 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^3$ (more complicated formula in book)
  ▪ mainly used as a benchmark for normality or symmetry
[Figure: example distributions; one panel labelled “Symmetric”]
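The approximate skewness formula can be sketched as below, assuming $s_x$ is the sample standard deviation. The three small data sets are hypothetical and chosen to give a zero, a positive and a negative value.

```python
import statistics

def skewness(xs):
    """Approximate skewness: mean of cubed standardized deviations."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)      # sample standard deviation (n - 1)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # long tail to the right
left_skewed = [-10, -2, -1, -1, -1]

skew_sym = skewness(symmetric)        # about 0
skew_right = skewness(right_skewed)   # positive
skew_left = skewness(left_skewed)     # negative
```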

UNIVARIATE SUMMARIES
Numerical data – shape
▪ kurtosis
  ▪ a measure of flatness
  ▪ approximately $M_4 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^4 - 3$ (more complicated formula in book)
  ▪ mainly used as a benchmark for normality
  ▪ also known as excess kurtosis
  ▪ $M_4 = 0$ for “normal” data
  ▪ platykurtic (platypus): $M_4 < 0$
  ▪ leptokurtic (leaping kangaroos): $M_4 > 0$
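A sketch of the approximate excess kurtosis, again assuming $s_x$ is the sample standard deviation. The flat and the heavy-tailed data sets are hypothetical.

```python
import statistics

def excess_kurtosis(xs):
    """Approximate excess kurtosis: mean of 4th-power standardized deviations, minus 3."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)      # sample standard deviation (n - 1)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

flat = list(range(1, 11))                     # evenly spread: platykurtic
heavy_tails = [-10] + [0] * 9 + [10]          # peak plus two far values: leptokurtic

kurt_flat = excess_kurtosis(flat)             # negative
kurt_heavy = excess_kurtosis(heavy_tails)     # positive
```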

BIVARIATE SUMMARIES
▪ Several data types → different options:
  ▪ two numerical data vectors
    ▪ data vectors of “similar” type
    ▪ data vectors of “different” type
  ▪ one numerical and one categorical data vector
  ▪ two categorical data vectors
▪ Numerical summaries and graphical summaries

BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ treat as one numerical difference vector
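As a sketch, two paired vectors measured in the same units can be reduced to one difference vector, which is then summarized like any univariate numerical variable. The sales figures and variable names are hypothetical.

```python
import statistics

# Hypothetical sales of the same five stores in two consecutive years.
sales_year1 = [10, 12, 9, 15, 11]
sales_year2 = [12, 13, 9, 18, 10]

# One difference per subject (row); pairing must be preserved.
diff = [b - a for a, b in zip(sales_year1, sales_year2)]

mean_diff = statistics.mean(diff)
sd_diff = statistics.stdev(diff)
```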

BIVARIATE SUMMARIES ▪ Two numerical data vectors of “similar” type ▪ correlation analysis ▪ scatterplot

BIVARIATE SUMMARIES ▪ Two numerical data vectors of “different” type ▪ correlation analysis ▪ scatterplot
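The slides do not give a formula for the correlation. The sketch below assumes the usual Pearson correlation coefficient, and the function name and data are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: advertising budget and two possible responses.
budget = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # perfectly increasing with budget
falling = [10, 8, 6, 4, 2]     # perfectly decreasing with budget

r_up = pearson_r(budget, rising)
r_down = pearson_r(budget, falling)
```

The coefficient is dimensionless, which is why it also works for two vectors of “different” type.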

BIVARIATE SUMMARIES ▪ One numerical and one categorical data vector ▪ split numerical data vector into several data vectors Note: we actually have two (or more) groups/populations
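Splitting a numerical vector by a categorical one might look like this sketch. The group codes 0 and 1 and the values are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (group code, numerical value).
rows = [(0, 4), (1, 10), (0, 6), (1, 14), (0, 5), (1, 12)]

# One data vector per category.
groups = defaultdict(list)
for code, value in rows:
    groups[code].append(value)

# Summarize each group separately: (mean, standard deviation).
summary = {code: (statistics.mean(vals), statistics.stdev(vals))
           for code, vals in groups.items()}
```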

BIVARIATE SUMMARIES ▪ Two categorical data vectors ▪ cross tables (contingency tables) ▪ cells contain “counts” (frequencies)
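A cross table is a count of each combination of categories. The sketch below builds one from hypothetical paired observations; the category names are assumptions.

```python
from collections import Counter

# Hypothetical paired observations: (membership, satisfied?).
pairs = [("member", "yes"), ("member", "yes"), ("member", "no"),
         ("non-member", "no"), ("non-member", "yes"),
         ("non-member", "no"), ("member", "yes")]

cell_counts = Counter(pairs)
row_names = sorted({r for r, _ in pairs})   # ['member', 'non-member']
col_names = sorted({c for _, c in pairs})   # ['no', 'yes']

# Rows = membership, columns = satisfaction, cells = counts.
table = [[cell_counts[(r, c)] for c in col_names] for r in row_names]
row_totals = [sum(row) for row in table]
```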

EXERCISE 3
Consider [figure shown on the slide; not recoverable here]. Which group (GATT=0 or GATT=1) has the higher mean, and which group has the higher standard deviation?

STATISTICAL SUMMARIES ▪ Many choices ▪ centrality ▪ dispersion ▪ association ▪ What to use depends on ▪ the nature of the problem ▪ the nature of the data ▪ the audience to address

STATISTICAL SUMMARIES
▪ Remember that a summary summarizes ...
▪ ... and that data reduction reduces the amount of information
▪ Example: Anscombe’s quartet, four very different data sets that share (almost exactly) the same summaries:
  ▪ $n_X = 11$, $\bar{X} = 9$, $s_X^2 = 11$
  ▪ $n_Y = 11$, $\bar{Y} = 7.50$, $s_Y^2 = 4.12$
  ▪ correlation $r_{X,Y} = 0.816$
  ▪ OLS regression: $Y = 3.00 + 0.500X$
These summaries are too drastic!
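The point can be checked numerically. The sketch below uses the published values of the first two data sets of Anscombe's quartet (set I is roughly linear, set II is a smooth curve); the function `summarize` and its dictionary keys are illustrative.

```python
import math
import statistics

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]     # shared by sets I and II
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Means, sample variances, correlation and OLS line for paired data."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": sxx / (n - 1),
        "mean_y": my, "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "intercept": my - slope * mx, "slope": slope,
    }

set1 = summarize(x, y1)
set2 = summarize(x, y2)
```

Rounded to the precision shown on the slide, the two sets are indistinguishable, although a scatterplot would show a straight-line pattern for one and a curve for the other.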

STATISTICAL SUMMARIES Almost all jokes on statistics and statisticians are based on this ▪ but there is some truth in it

FURTHER STUDY
Doane & Seward 5/E: 3.1-3.9, 4.1-4.3, 4.5-4.6, 4.8
Tutorial exercises week 1: data summaries, box plots and histograms
