Introduction to Descriptive Statistics
    17.871, Spring 2012
Key measures
    Describing data   Moment                          Non-mean-based measure
    Center            Mean                            Mode, median
    Spread            Variance (standard deviation)   Range, interquartile range
    Skew              Skewness                        -
    Peaked            Kurtosis                        -
Key distinction: population vs. sample
    Notation
    Population   Greeks   μ, σ, β
    Sample       Romans   s, b
Mean
    \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Variance, standard deviation of a population
    \[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]
Variance, S.D. of a sample
    \[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
    The divisor n - 1 is the degrees of freedom.
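To make the difference between the two divisors concrete, here is a minimal Python sketch that computes the mean, the population variance (divide by n), and the sample variance (divide by n - 1) for one small data set. The data values and variable names are made up for illustration and do not come from the course.

```python
import math
import statistics

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

mean = sum(x) / n                          # x-bar
ss = sum((xi - mean) ** 2 for xi in x)     # sum of squared deviations from the mean

pop_var = ss / n                           # population formula: divide by n
samp_var = ss / (n - 1)                    # sample formula: divide by n - 1
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library offers both definitions for comparison.
print(pop_var, statistics.pvariance(x))
print(samp_var, statistics.variance(x))
```

With only eight observations the two variances differ noticeably; as n grows, the gap between dividing by n and by n - 1 shrinks.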
Binary data
    \[ \bar{x} = \text{prob}(X = 1) = \text{proportion of time } x = 1 \]
    \[ s_x^2 = \bar{x}(1 - \bar{x}), \qquad s_x = \sqrt{\bar{x}(1 - \bar{x})} \]
Example of this, using today's NBC News/Marist Poll in Michigan
    Candidate            Pct.
    Santorum             35
    Romney               37
    Paul                 13
    Gingrich             8
    [Unaccounted for]    [7]

    gen santorum = 1 if candidate=="Santorum"
    replace santorum = 0 if candidate~="Santorum"

    The command summ santorum produces
    Mean = .35
    Var = .35(1 - .35) = .2275
    S.d. = .4769696
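The arithmetic on this slide can be checked with a short Python sketch. It assumes only the 35 percent Santorum share from the poll table and applies the binary-data formulas for the variance and standard deviation; the variable names are illustrative.

```python
import math

p = 0.35                    # share of respondents coded santorum = 1
variance = p * (1 - p)      # x-bar * (1 - x-bar)
sd = math.sqrt(variance)    # square root of the variance

print(round(variance, 4))   # 0.2275
print(round(sd, 7))         # 0.4769696
```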
Normal distribution example
    Examples: IQ, SAT, height
    "No skew", "zero skew", symmetrical
    Mean = median = mode
    [Figure: histogram of height in inches (46 to 94) against frequency (# people). Image by MIT OpenCourseWare.]
    \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2} \]
Skewness
    Asymmetrical distribution
    Examples: income, contributions to candidates, populations of countries, "residual vote" rates
    "Positive skew", "right skew"
    [Figure: frequency against value.]
Distribution of the average $$ of dividends/tax return (in K's)
    [Figure: density of dividends_pc, 0 to 15. Labeled outlier: Hyde County, SD.]
Fuel economy of cars for sale in the US
    [Figure: density of var1, 0 to 150. Labeled outlier: Mitsubishi iMiEV (which is supposed to be all electric).]
Skewness
    Asymmetrical distribution
    Example: GPA of MIT students
    "Negative skew", "left skew"
    [Figure: frequency against value.]
Placement of Republican Party on 100-point scale
    [Figure: density against place on ideological scale (0 to 100), republican party.]
Skewness
    [Figure: frequency against value.]
Placement of Republican Party on 100-point scale
    [Figure: density against place on ideological scale (0 to 100); x-axis labeled "democratic party".]
    Mean = 26.8; median = 25; mode = 25
Kurtosis
    leptokurtic: k > 3
    mesokurtic: k = 3
    platykurtic: k < 3
    [Figure: frequency against value for the three shapes. Image by MIT OpenCourseWare.]
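As a rough sketch of how skewness and kurtosis can be computed, the Python functions below use the usual moment-based definitions: the third and fourth central moments scaled by powers of the second, so that a normal distribution has skewness 0 and kurtosis 3. The function names and the two data sets are hypothetical, and other software may apply small-sample corrections that change the values slightly.

```python
def central_moment(x, r):
    """r-th central moment: the average of (x_i - mean)^r, dividing by n."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** r for xi in x) / n

def skewness(x):
    """Third central moment divided by the 3/2 power of the second."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the square of the second."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2

symmetric = [1, 2, 3, 4, 5]         # hypothetical, symmetric about 3
right_skewed = [1, 1, 1, 2, 10]     # hypothetical, long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric list has zero skewness and a kurtosis below 3 (platykurtic), while the list with one large value has positive skewness.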
[Figures: density against place on ideological scale (0 to 100) for yourself, the democratic party, and the republican party.]
                      Mean    s.d.    Skew.   Kurt.
    Self-placement    55.1    26.4    0.14    2.21
    Rep. pty.         26.8    21.2    0.87    3.59
    Dem. pty.         74.7    21.8    1.18    4.29
    Source: Cooperative Congressional Election Study, 2008
Normal distribution
    Skewness = 0
    Kurtosis = 3
    [Figure: histogram of height in inches (46 to 94) against frequency (# people). Image by MIT OpenCourseWare.]
    \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2} \]
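More words about the normal curve
    [Figure: normal curve with the x-axis marked from -3σ to 3σ around µ. Image by MIT OpenCourseWare.]
    Between µ and 1σ, on each side: 34.1%
    Between 1σ and 2σ, on each side: 13.6%
    Between 2σ and 3σ, on each side: 2.1%
    Beyond 3σ, on each side: 0.1%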
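The percentages on the normal curve can be reproduced from the standard normal cumulative distribution. The Python sketch below is one way to do it, using the error function from the standard library; the helper names normal_cdf and area_between are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between(a, b):
    """Share of the distribution lying between a and b standard deviations."""
    return normal_cdf(b) - normal_cdf(a)

for a, b in [(0, 1), (1, 2), (2, 3)]:
    print(f"{a} to {b} s.d.: {100 * area_between(a, b):.1f}%")
print(f"beyond 3 s.d.: {100 * (1 - normal_cdf(3)):.1f}%")
```

Because the curve is symmetric, each printed share applies to both the left and the right side of the mean.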
The z-score, or the "standardized score"
    \[ z = \frac{x - \bar{x}}{s_x} \]
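A minimal Python sketch of standardizing a variable follows. It assumes the sample mean and the sample standard deviation (n - 1 divisor) are used; the ages listed are hypothetical.

```python
import statistics

def z_scores(x):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.mean(x)
    sd = statistics.stdev(x)      # sample s.d., n - 1 divisor
    return [(xi - mean) / sd for xi in x]

ages = [18, 34, 48, 63, 91]       # hypothetical values
z = z_scores(ages)
print(z)
```

After standardizing, the scores have mean 0 and standard deviation 1, so a z of 2 always means "two standard deviations above the mean" regardless of the original units.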
Commands in STATA for univariate statistics
    summarize varname
    summarize varname, detail
    histogram varname, bin() start() width() density/fraction/frequency normal
    graph box varnames
    tabulate
Example of Florida voters
    Question: does the age of voters vary by race?
    Combine Florida voter extract files, 2008

    gen new_birth_date=date(birth_date,"MDY")
    gen birth_year=year(new_birth_date)
    gen age=2010-birth_year
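For readers without Stata, the same three steps (parse the date, extract the year, subtract from 2010) can be sketched in Python. The MM/DD/YYYY string layout is an assumption about how birth_date is stored, and the function name age_in_2010 is hypothetical.

```python
from datetime import datetime

def age_in_2010(birth_date):
    """Parse a month/day/year string and return 2010 minus the birth year."""
    # Assumed layout: "MM/DD/YYYY"; adjust the format string if the file differs.
    birth_year = datetime.strptime(birth_date, "%m/%d/%Y").year
    return 2010 - birth_year

print(age_in_2010("07/04/1965"))   # 45
```

Like the Stata version, this ignores the month and day, so the result is the age reached during 2010 rather than the exact age on a given date.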
Look at distribution of birth year
    [Figure: density against birth_year, 1850 to 2000.]
Explore age by voting mode
    . table race if birth_year>1900,c(mean age)

    ----------------------
         race |  mean(age)
    ----------+-----------
            1 |   45.61229
            2 |   42.89916
            3 |    42.6952
            4 |   45.09718
            5 |   52.08628
            6 |   44.77392
            9 |   40.86704
    ----------------------
    3 = Black
    4 = Hispanic
    5 = White
Graph birth year
    . hist age if birth_year>1900
    (bin=71, start=9, width=1.3802817)
    [Figure: density against age, 0 to 100.]
Divide into "bins" so that each bar represents 1 year
    . hist age if birth_year>1900,width(1)
    [Figure: density against age, 0 to 100.]
Add ticks at 10-year intervals
    histogram totalscore, width(1) xlabel(-.2 (.1) 1)
    [Figure: density against age, with x-axis ticks every 10 years from 20 to 100.]
Superimpose the normal curve (with the same mean and s.d. as the empirical distribution)
    hist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal
    [Figure: density against age, 20 to 100, with normal curve overlaid.]
    . summ age if birth_year>1900,det

                                 age
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           18              9
     5%           21             16
    10%           24             16       Obs            12612114
    25%           34             16       Sum of Wgt.    12612114

    50%           48                      Mean           49.47549
                            Largest       Std. Dev.      19.01049
    75%           63            107
    90%           77            107       Variance       361.3986
    95%           83            107       Skewness       .2629496
    99%           91            107       Kurtosis       2.222442
Histograms by race
    hist age if birth_year>1900&race>=3&race<=5,wid(1) xlabel(20 (10) 100) normal by(race)
    3 = Black
    4 = Hispanic
    5 = White
    [Figure: one panel per race (3, 4, 5), density against age, 20 to 100, with normal curve overlaid. Graphs by race.]
Main issues with histograms
    Proper level of aggregation
    Non-regular data categories
Draw the previous graph with a box plot
    graph box age if birth_year>1900
    [Figure: box plot of age, 0 to 100, annotated with upper quartile, median, lower quartile, inter-quartile range (IQR), and 1.5 x IQR.]
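The quantities annotated on the box plot (quartiles, inter-quartile range, and the 1.5 x IQR limits) can be sketched in Python as follows. The function name and the data are hypothetical, and the quartile rule used by Python's statistics.quantiles may differ slightly from the one Stata uses, so treat the exact cut points as illustrative.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, and the 1.5 x IQR fences used to flag outside values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are the ones a box plot draws individually.
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outside": outside}

data = list(range(1, 12)) + [30]    # hypothetical values with one extreme point
summary = box_plot_summary(data)
print(summary)
```

Here the value 30 falls above the upper fence, so it is the only point that would be plotted separately from the whiskers.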
Draw the box plots for the different races
    graph box age if birth_year>1900&race>=3&race<=5,by(race)
    3 = Black
    4 = Hispanic
    5 = White
    [Figure: one box plot panel per race (3, 4, 5), age from 0 to 100. Graphs by race.]
Draw the box plots for the different races using "over" option
    graph box age if birth_year>1900&race>=3&race<=5,over(race)
    3 = Black
    4 = Hispanic
    5 = White
    [Figure: box plots of age (0 to 100) for races 3, 4, and 5 side by side on one axis.]
A note about histograms with unnatural categories
    From the Current Population Survey (2000), Voter and Registration Survey
    How long (have you/has name) lived at this address?
    -9   No Response
    -3   Refused
    -2   Don't know
    -1   Not in universe
     1   Less than 1 month
     2   1-6 months
     3   7-11 months
     4   1-2 years
     5   3-4 years
     6   5 years or longer
Solution, Step 1
    Map artificial category onto "natural" midpoint
    -9   No Response          missing
    -3   Refused              missing
    -2   Don't know           missing
    -1   Not in universe      missing
     1   Less than 1 month    1/24 = 0.042
     2   1-6 months           3.5/12 = 0.29
     3   7-11 months          9/12 = 0.75
     4   1-2 years            1.5
     5   3-4 years            3.5
     6   5 years or longer    10 (arbitrary)

    recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)
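The recoding step amounts to a lookup from category code to midpoint in years. A minimal Python sketch is shown here; it assumes the nonresponse codes are the negative values (so anything not in the table becomes missing), and the names MIDPOINTS, recode, and responses are hypothetical.

```python
# Midpoints in years for the six substantive categories; the value 10 for
# "5 years or longer" is arbitrary, as on the slide.
MIDPOINTS = {1: 0.042, 2: 0.29, 3: 0.75, 4: 1.5, 5: 3.5, 6: 10}

def recode(code):
    """Return the midpoint in years, or None (missing) for any other code."""
    return MIDPOINTS.get(code)

responses = [-9, 1, 4, 6, -1, 3]              # hypothetical survey codes
longevity = [recode(c) for c in responses]
print(longevity)   # [None, 0.042, 1.5, 10, None, 0.75]
```

Using None for missing mirrors Stata's "." so that nonresponse codes are never averaged in as if they were durations.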
Graph of recoded data
    histogram longevity, fraction
    [Figure: fraction against longevity, 0 to 10; the tallest bar reaches .557134.]
Density plot of data
    Total area of last bar = .557
    Width of bar = 11 (arbitrary)
    Solve for: a = wh, or .557 = 11h => h = .051
    [Figure: density against longevity, x-axis 0 to 15.]
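The same area = width x height rule extends to every bar when bins have unequal widths. The Python sketch below illustrates this under stated assumptions: only the last fraction (.557) and its width (11) come from the slide, while the bin edges and the other three fractions are hypothetical values chosen so the fractions sum to 1.

```python
# Bin edges in years; the last bar is assumed to run from 5 to 16 (width 11).
edges = [0, 1, 2, 5, 16]
# Only the last fraction (.557) comes from the slide; the rest are hypothetical.
fractions = [0.200, 0.100, 0.143, 0.557]

widths = [right - left for left, right in zip(edges, edges[1:])]
heights = [f / w for f, w in zip(fractions, widths)]      # h = a / w for each bar
total_area = sum(w * h for w, h in zip(widths, heights))  # should come back to 1

print(round(heights[-1], 3))   # 0.051
print(round(total_area, 6))
```

Dividing each fraction by its bin width keeps the total area at 1, which is what lets a wide catch-all category sit on the same density scale as the narrow ones.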