DATA Business Statistics
CONTENTS The role of data The data matrix Data types Aspects of data Obtaining data Further study
THE ROLE OF DATA Data refers to observed facts ▪ “there are 82 persons on this train” ▪ “the weight of this pizza is 283 grams” ▪ “this museum hosts paintings by Picasso” Data helps ▪ to suggest theories (“pizzas with a high price are less popular”) ▪ to test hypotheses (“advertising increases sales”) ▪ to calibrate coefficients of theories (“𝑟 = 𝑏 − 𝑐𝑞, but what are 𝑏 and 𝑐?”)
THE DATA MATRIX Columns: variables (may have identifying name like “age”) Rows: subjects/cases (may have identifying name like “John”) Cells: observations Entire table: data matrix [Figure: example data matrix with callouts marking a variable, an observation, and a subject/case]
THE DATA MATRIX [Figure: example data matrix with callouts marking the variable name, the variable unit, the subject name, a missing observation, and columns of binary, nominal, numerical, and ordinal data]
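The layout of a data matrix can be sketched in a few lines of Python. This is only an illustration: the subjects, variables, and values below are made up, and the helper names `column` and `cell` are hypothetical. Each row is a subject, each key is a variable, and `None` stands for a missing observation.

```python
# Hypothetical data matrix: one dict per subject (row), one key per variable (column).
# All values are invented for illustration; None marks a missing observation.
data_matrix = [
    {"name": "John",  "age": 34, "height_cm": 181,  "party": "A", "left_handed": 0},
    {"name": "Mary",  "age": 29, "height_cm": None, "party": "B", "left_handed": 1},
    {"name": "Ahmed", "age": 41, "height_cm": 175,  "party": "A", "left_handed": 0},
]

# The variable names are the column headers.
variables = list(data_matrix[0].keys())


def column(matrix, variable):
    """Return all observations of one variable (one column)."""
    return [row[variable] for row in matrix]


def cell(matrix, subject, variable):
    """Return one observation: the cell for a given subject and variable."""
    for row in matrix:
        if row["name"] == subject:
            return row[variable]
    raise KeyError(subject)


ages = column(data_matrix, "age")
marys_height = cell(data_matrix, "Mary", "height_cm")  # a missing observation
```

Here `age` and `height_cm` are numerical, `party` is nominal, and `left_handed` is binary.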
THE DATA MATRIX Information to extract from a data matrix ▪ One variable ▪ mean age at inauguration ▪ odds of republicans vs. democrats ▪ univariate analysis ▪ Two variables ▪ association between handedness and party ▪ correlation between age and number of terms ▪ bivariate analysis ▪ Many variables ▪ predict terms as a function of height and handedness ▪ multivariate analysis
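As a rough sketch of the univariate and bivariate cases, the snippet below computes a mean, an odds ratio of counts, and a Pearson correlation. The numbers are invented for illustration and are not real presidential data; `pearson_r` is a hypothetical helper written out by hand so the formula is visible.

```python
from statistics import mean

# Made-up observations for five subjects (not real data).
age = [51, 46, 62, 55, 48]
terms = [1, 2, 1, 2, 1]
party = ["R", "D", "R", "R", "D"]

# Univariate: one variable at a time.
mean_age = mean(age)
odds_r_vs_d = party.count("R") / party.count("D")


def pearson_r(x, y):
    """Pearson correlation coefficient between two numerical variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Bivariate: two variables together.
r_age_terms = pearson_r(age, terms)
```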
THE DATA MATRIX The data matrix can represent: ▪ all data (the population) ▪ a list of all US presidents ▪ a non-random selection of data ▪ a list of all US presidents since 1969 ▪ a random selection of data (a sample) ▪ a subset of randomly picked presidents from the full list ▪ descriptive statistics is applicable to all three cases ▪ inferential statistics focuses on how to draw conclusions for a population on the basis of information on a random sample
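The three cases can be illustrated with a small sketch. The population here is just a hypothetical list of 40 subject IDs, and the seed value is arbitrary; it is fixed only so the draw is repeatable.

```python
import random

# Hypothetical population: subject IDs 1..40 (stand-ins for a full list).
population = list(range(1, 41))

# Non-random selection: the last 10 subjects on the list.
non_random_selection = population[-10:]

# Random selection (a sample): 10 subjects drawn without replacement.
rng = random.Random(2024)  # arbitrary seed, for reproducibility only
sample = rng.sample(population, 10)
```

Descriptive statistics can be computed on any of the three lists; inference about `population` is only justified from `sample`.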
EXERCISE 1 You find data on the body size of 5 men and 5 women Organize these data in a data matrix
ASPECTS OF DATA ▪ Type of data ▪ categorical, numerical ▪ Countability ▪ discrete, continuous ▪ Range ▪ restricted, infinite, semi-infinite ▪ Coded ▪ numbers for text ▪ Recoded ▪ text for ranges of numbers (or ranges of texts)
ASPECTS OF DATA Type of data ▪ categorical ▪ e.g., dog, cat, horse ▪ numerical (cardinal) ▪ e.g., 12, 45.29 Has consequences for: ▪ transformations (income per capita vs. car type per capita) ▪ statistical summaries (average income vs. average car type) Special cases ▪ Likert scale (5- or 7-point scale: “strongly agree”, “somewhat agree”, etc.) ▪ binary variable (0/1, yes/no, Dutch/foreign)
ASPECTS OF DATA Countability ▪ discrete ▪ e.g., eggs ▪ (semi-)continuous ▪ e.g., waiting time Has consequences for: ▪ recoding (“binning”) ▪ statistical summaries (modal income vs. median income)
ASPECTS OF DATA Range ▪ (semi-)infinite ▪ e.g., income ▪ restricted ▪ e.g., percentage of satisfied customers Has consequences for: ▪ dealing with outliers (exceptional data points)
ASPECTS OF DATA Coding ▪ replacing nominal categories by numbers ▪ e.g., Ford=1, Audi=2, Volkswagen=3, Opel=4 ▪ replacing ordinal categories by numbers ▪ e.g., tiny=1, small=2, normal=3, big=4, huge=5 Has consequences for: ▪ preventing recording mistakes (e.g., Vlokswgaen) ▪ preparing for statistical calculations (SPSS, Stata, R, etc.)
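A minimal sketch of coding, using the brand codes from the slide. The function name `code_brand` and the list of raw entries are hypothetical; the point is that a fixed code table makes a misspelled entry such as “Vlokswgaen” stand out instead of silently becoming a new category.

```python
# Code table from the slide: nominal categories replaced by numbers.
CAR_CODES = {"Ford": 1, "Audi": 2, "Volkswagen": 3, "Opel": 4}


def code_brand(brand):
    """Return the numeric code of a brand, or raise ValueError if it is unknown."""
    try:
        return CAR_CODES[brand]
    except KeyError:
        raise ValueError(f"unknown brand: {brand!r}") from None


# Hypothetical raw entries, one of them mistyped.
raw_entries = ["Audi", "Ford", "Vlokswgaen", "Opel"]

coded, rejected = [], []
for entry in raw_entries:
    try:
        coded.append(code_brand(entry))
    except ValueError:
        rejected.append(entry)  # flag for manual correction
```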
ASPECTS OF DATA Recoding ▪ grouping categorical data ▪ e.g., “Volkswagen”+“Audi”+“Opel”=“German car” ▪ grouping numerical data ▪ e.g., 𝑦 ∈ [20,000, 25,000] = “middle income” Has consequences for: ▪ statistical summaries (histograms, modal values)
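Both kinds of recoding can be sketched as follows. The “German car” group and the middle-income range come from the slide; the labels “other car”, “low income”, and “high income”, the treatment of the range endpoints as inclusive, and the sample incomes are assumptions made for illustration.

```python
from collections import Counter

GERMAN_BRANDS = {"Volkswagen", "Audi", "Opel"}


def recode_brand(brand):
    """Group car brands into a coarser category."""
    return "German car" if brand in GERMAN_BRANDS else "other car"


def recode_income(y):
    """Bin a numerical income into a text label (endpoints assumed inclusive)."""
    if y < 20_000:
        return "low income"
    if y <= 25_000:
        return "middle income"
    return "high income"


# Made-up incomes; the counts per bin are what a histogram would display.
incomes = [18_500, 21_000, 24_999, 25_000, 31_200]
income_counts = Counter(recode_income(y) for y in incomes)
modal_group = income_counts.most_common(1)[0][0]
```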
ASPECTS OF DATA Coding of categories into numbers
ASPECTS OF DATA Coding of categories into several binary variables ▪ using dummy variables (or dummies for short) ▪ number of dummies = number of categories (redundant!) ▪ number of dummies = number of categories − 1 (with omitted category)
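A small sketch of dummy coding with an omitted category. The brand list is borrowed from the coding slide, the choice of “Ford” as the omitted category is arbitrary, and `make_dummies` is a hypothetical helper.

```python
CATEGORIES = ["Ford", "Audi", "Volkswagen", "Opel"]
OMITTED = "Ford"  # arbitrary choice of reference category


def make_dummies(value, categories=CATEGORIES, omitted=OMITTED):
    """Return one 0/1 dummy per category, leaving out the omitted category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(value == c) for c in categories if c != omitted}


audi_dummies = make_dummies("Audi")
ford_dummies = make_dummies("Ford")  # omitted category: all dummies are 0
```

With four categories, three dummies suffice: a subject with all dummies equal to 0 must belong to the omitted category, which is why a fourth dummy would be redundant.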
ASPECTS OF DATA Some pitfalls: ▪ missing data ▪ blank? 0? 99? ▪ treating coded categories or number-like categories as numbers ▪ e.g., if Volkswagen=1, Audi=2, BMW=3, is the average car in this street 1.92? ▪ units of data ▪ see Math course ▪ decimals ▪ see Math course
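The first two pitfalls can be made concrete with a short sketch. The ages and the list of observed car codes are made up, and 99 is assumed to be the missing-value code; the brand codes are the ones from the slide.

```python
from statistics import mean, mode

# Pitfall 1: a missing-value code (assumed here to be 99) treated as real data.
MISSING = 99
ages_raw = [34, 99, 41, 29]  # made-up ages; the second one is missing
naive_mean = mean(ages_raw)  # distorted by the 99
ages_clean = [a for a in ages_raw if a != MISSING]
clean_mean = mean(ages_clean)

# Pitfall 2: coded categories are labels, not quantities.
CODE_TO_BRAND = {1: "Volkswagen", 2: "Audi", 3: "BMW"}
car_codes = [1, 2, 1, 3, 1, 2]  # made-up observations in one street
meaningless_average = mean(car_codes)  # a number, but not a car
most_common_brand = CODE_TO_BRAND[mode(car_codes)]  # the mode is a valid summary
```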
EXERCISE 2 Describe the appropriate data characteristic (categorical, ordinal, nominal, numerical, continuous, discrete, dummy, etc.) for a. body size (171, 184, etc.) b. pet (cat, dog, rabbit) c. right-handedness (0, 1) d. income group (low, medium, high) e. number of children (0, 1, 2, etc.)
OBTAINING DATA Typing ▪ from books, etc. Downloading ▪ from online databases (like CBS) ▪ from general webpages (like Wikipedia)
OBTAINING DATA Purchasing ▪ commercial databases
OBTAINING DATA Generating ▪ from secondary sources ▪ combining multiple sources ▪ by primary research ▪ doing interviews ▪ doing observations ▪ doing experiments
FURTHER STUDY Doane & Seward 5/E 2.1-2.2 Tutorial exercises week 1 data