Introduction to Descriptive Statistics


  1. Introduction to Descriptive Statistics. 17.871, Spring 2012

  2. Key measures

      | Describing data | Moment                        | Non-mean based measure     |
      |-----------------|-------------------------------|----------------------------|
      | Center          | Mean                          | Mode, median               |
      | Spread          | Variance (standard deviation) | Range, interquartile range |
      | Skew            | Skewness                      | -                          |
      | Peaked          | Kurtosis                      | -                          |

  3. Key distinction: population vs. sample
      Notation:
      - Population: Greeks (μ, σ, β)
      - Sample: Romans (s, b)

  4. Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$

  5. Variance, Standard Deviation of a Population
      $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$, $\quad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$

  6. Variance, S.D. of a Sample
      $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, $\quad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
      The denominator $n - 1$ is the degrees of freedom.
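The difference between dividing by n (population) and by n - 1 (sample) can be sketched in a few lines. This is an illustrative Python example rather than the Stata used in the slides; the function names and the data values are hypothetical.

```python
import math

def population_variance(xs):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def sample_variance(xs):
    """Same sum of squares, divided by n - 1 (the degrees of freedom)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, not from the slides

pop_var = population_variance(data)   # treats data as the whole population
samp_var = sample_variance(data)      # treats data as a sample
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

The sample variance is always slightly larger than the population formula applied to the same numbers, and the gap shrinks as n grows.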

  7. Binary data
      $\bar{X} = \mathrm{prob}(X = 1)$ = proportion of time $x = 1$
      $s_x^2 = \bar{x}(1 - \bar{x}) \Rightarrow s_x = \sqrt{\bar{x}(1 - \bar{x})}$

  8. Example of this, using today's NBC News/Marist Poll in Michigan

      | Candidate         | Pct. |
      |-------------------|------|
      | Santorum          | 35   |
      | Romney            | 37   |
      | Paul              | 13   |
      | Gingrich          | 8    |
      | [Unaccounted for] | [7]  |

      - gen santorum = 1 if candidate=="Santorum"
      - replace santorum = 0 if candidate~="Santorum"
      - the command summ santorum produces:
        - Mean = .35
        - Var = .35(1-.35) = .2275
        - S.d. = .4769696
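The arithmetic on this slide can be checked with a short Python sketch (not the Stata used in the deck). The list of 100 respondents is a hypothetical stand-in that simply reproduces the 35% share; the comparison at the end uses the variance definition that divides by n.

```python
import math

# Hypothetical sample of 100 respondents matching the poll's 35% share:
# 1 = supports Santorum, 0 = anyone else.
santorum = [1] * 35 + [0] * 65

n = len(santorum)
mean = sum(santorum) / n      # proportion of 1s
var = mean * (1 - mean)       # shortcut formula for 0/1 data
sd = math.sqrt(var)

# The same quantity from the definition of variance (dividing by n).
var_from_definition = sum((x - mean) ** 2 for x in santorum) / n

print(mean, var, sd)
print(var_from_definition)
```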

  9. Normal distribution example
      - IQ
      - SAT
      - Height
      - "No skew"
      - "Zero skew"
      - Symmetrical
      - Mean = median = mode
      [Figure: histogram of height (inches) against frequency (# people). Image by MIT OpenCourseWare.]
      $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$
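As a rough illustration of the normal density, the Python sketch below (not the deck's Stata) evaluates $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}$ directly. The function name and the height parameters (mean 70 inches, s.d. 4 inches) are assumptions made up for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical height distribution: mean 70 inches, s.d. 4 inches.
mu, sigma = 70.0, 4.0

peak = normal_pdf(mu, mu, sigma)         # the curve is highest at the mean
left = normal_pdf(mu - 4.0, mu, sigma)   # one s.d. below the mean
right = normal_pdf(mu + 4.0, mu, sigma)  # one s.d. above the mean

print(peak, left, right)
```

Because the density depends on x only through the squared deviation from the mean, the values at equal distances on either side are identical, which is the symmetry ("zero skew") the slide describes.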

  10. Skewness: asymmetrical distribution
      - Income
      - Contribution to candidates
      - Populations of countries
      - "Residual vote" rates
      - "Positive skew"
      - "Right skew"
      [Figure: frequency against value.]

  11. Two skewed examples
      [Figure: distribution of the average $$ of dividends/tax return (in K's); density against dividends_pc; labeled outlier: Hyde County, SD.]
      [Figure: fuel economy of cars for sale in the US; density against var1; labeled outlier: Mitsubishi i-MiEV (which is supposed to be all electric).]

  12. Skewness: asymmetrical distribution
      - GPA of MIT students
      - "Negative skew"
      - "Left skew"
      [Figure: frequency against value.]

  13. Placement of Republican Party on 100-point scale
      [Figure: density against place on ideological scale - republican party.]

  14. Skewness
      [Figure: frequency against value.]

  15. Placement of Republican Party on 100-point scale
      [Figure: density against place on ideological scale - democratic party.]
      Mean = 26.8; median = 25; mode = 25

  16. Kurtosis
      - leptokurtic: k > 3
      - mesokurtic: k = 3
      - platykurtic: k < 3
      [Figure: frequency against value for the three shapes. Image by MIT OpenCourseWare.]
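Skewness and kurtosis are the third and fourth standardized moments. The Python sketch below (not the Stata used in the slides) computes them from central moments, using the convention in which a normal distribution has kurtosis 3, as on the slide. The function names and both data sets are hypothetical.

```python
def central_moment(xs, k):
    """Average of (x - mean)^k over the data."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs):
    """Third central moment divided by the variance to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # hypothetical, no skew
right_skewed = [1, 1, 1, 2, 10]    # hypothetical, long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

A positive result from `skewness` corresponds to "right skew" and a negative one to "left skew"; a `kurtosis` above 3 is leptokurtic and below 3 is platykurtic.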

  17. Placement on the ideological scale: summary statistics

      |                | Mean | s.d. | Skew. | Kurt. |
      |----------------|------|------|-------|-------|
      | Self-placement | 55.1 | 26.4 | -0.14 | 2.21  |
      | Rep. pty.      | 26.8 | 21.2 | 0.87  | 3.59  |
      | Dem. pty       | 74.7 | 21.8 | -1.18 | 4.29  |

      [Figures: density against place on ideological scale for yourself, democratic party, and republican party.]
      Source: Cooperative Congressional Election Study, 2008

  18. Normal distribution
      - Skewness = 0
      - Kurtosis = 3
      [Figure: histogram of height (inches) against frequency (# people). Image by MIT OpenCourseWare.]
      $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$

  19. More words about the normal curve
      [Figure: normal curve centered at µ with the x-axis marked from -3σ to 3σ. Area on each side of the mean: 34.1% between µ and 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, 0.1% beyond 3σ. Image by MIT OpenCourseWare.]
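The percentages marked on the normal curve can be reproduced from the standard normal cumulative distribution function. This is a Python sketch using only the standard library's error function; the helper name `normal_cdf` is an assumption of this example, not something from the slides.

```python
import math

def normal_cdf(z):
    """Probability that a standard normal variable is at most z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Areas on one side of the mean, in units of standard deviations.
within_1 = normal_cdf(1) - normal_cdf(0)     # mean to 1 s.d.
between_1_2 = normal_cdf(2) - normal_cdf(1)  # 1 s.d. to 2 s.d.
between_2_3 = normal_cdf(3) - normal_cdf(2)  # 2 s.d. to 3 s.d.
beyond_3 = 1.0 - normal_cdf(3)               # more than 3 s.d. out

for label, area in [("0 to 1", within_1), ("1 to 2", between_1_2),
                    ("2 to 3", between_2_3), ("beyond 3", beyond_3)]:
    print(f"{label}: {area:.1%}")
```

Doubling the first value gives the familiar statement that roughly 68% of a normal distribution lies within one standard deviation of the mean.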

  20. The z-score, or the "standardized score"
      $z = \frac{x - \bar{x}}{s_x}$
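A minimal Python sketch of standardizing a variable, assuming the usual definition of the z-score as the deviation from the mean divided by the standard deviation. The sample standard deviation (n - 1 denominator) is used here as an assumption, and the ages are hypothetical.

```python
import statistics

ages = [20, 30, 40, 50, 60]  # hypothetical values, not from the slides

mean = statistics.mean(ages)
sd = statistics.stdev(ages)  # sample standard deviation (n - 1 denominator)

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = [(x - mean) / sd for x in ages]

print(z_scores)
```

After standardizing, the scores have mean 0 and standard deviation 1, which is what makes variables on different scales comparable.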

  21. Commands in STATA for univariate statistics
      - summarize varname
      - summarize varname, detail
      - histogram varname, bin() start() width() density/fraction/frequency normal
      - graph box varnames
      - tabulate

  22. Example of Florida voters
      - Question: does the age of voters vary by race?
      - Combine Florida voter extract files, 2008
      - gen new_birth_date=date(birth_date,"MDY")
      - gen birth_year=year(new_b)
      - gen age= 2010-birth_year
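The three `gen` steps (parse the date, extract the year, subtract from 2010) can be sketched in Python for readers who do not use Stata. The input format `MM/DD/YYYY` is an assumption, since the slides do not show what `birth_date` looks like in the voter file; the function name is hypothetical too.

```python
from datetime import datetime

def age_in_2010(birth_date):
    """Age in 2010 from a birth date string.

    Assumes birth_date looks like '03/15/1950' (month/day/year); the
    actual layout of the Florida voter file is not shown in the slides.
    """
    birth_year = datetime.strptime(birth_date, "%m/%d/%Y").year
    return 2010 - birth_year

print(age_in_2010("03/15/1950"))
print(age_in_2010("12/31/1992"))
```

Like the Stata version, this ignores whether the birthday has already occurred in 2010, so ages can be off by one year.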

  23. Look at distribution of birth year
      [Figure: density against birth_year, x-axis from 1850 to 2000.]

  24. Explore age by voting mode
      . table race if birth_year>1900,c(mean age)

      ----------------------
           race |  mean(age)
      ----------+-----------
              1 |   45.61229
              2 |   42.89916
              3 |    42.6952
              4 |   45.09718
              5 |   52.08628
              6 |   44.77392
              9 |   40.86704
      ----------------------

      3 = Black, 4 = Hispanic, 5 = White

  25. Graph birth year
      . hist age if birth_year>1900
      (bin=71, start=9, width=1.3802817)
      [Figure: density against age, x-axis from 0 to 100.]

  26. Divide into "bins" so that each bar represents 1 year
      . hist age if birth_year>1900,width(1)
      [Figure: density against age, x-axis from 0 to 100.]
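The default bin width reported by the first histogram (1.3802817) is consistent with dividing the range of the data by the number of bins. The Python sketch below checks that arithmetic using the smallest and largest ages from the `summ ..., det` output in this deck (9 and 107); that Stata derives the width exactly this way is an inference from the numbers, not something the slides state.

```python
# Smallest and largest ages, taken from the summarize output in the deck.
smallest, largest = 9, 107

# Default histogram: 71 bins spread over the full range of the data.
default_bins = 71
default_width = (largest - smallest) / default_bins

# With width(1), each bar covers one year, so the range needs this many bars.
one_year_bins = (largest - smallest) // 1

print(round(default_width, 7))
print(one_year_bins)
```

A width of about 1.38 years means some bars cover one integer age and others cover two, which produces an uneven-looking histogram; forcing a width of 1 avoids that.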

  27. Add ticks at 10-year intervals
      histogram totalscore, width(1) xlabel(-.2 (.1) 1)
      [Figure: density against age, x-axis labeled from 20 to 100 in steps of 10.]

  28. Superimpose the normal curve (with the same mean and s.d. as the empirical distribution)
      hist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal
      [Figure: density against age with normal curve overlaid, x-axis labeled from 20 to 100 in steps of 10.]

  29. . summ age if birth_year>1900,det

                                   age
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           18              9
       5%           21             16
      10%           24             16       Obs            12612114
      25%           34             16       Sum of Wgt.    12612114

      50%           48                      Mean           49.47549
                              Largest       Std. Dev.      19.01049
      75%           63            107
      90%           77            107       Variance       361.3986
      95%           83            107       Skewness       .2629496
      99%           91            107       Kurtosis       2.222442

  30. Histograms by race
      hist age if birth_year>1900&race>=3&race<=5,wid(1) xlabel(20 (10) 100) normal by(race)
      3 = Black, 4 = Hispanic, 5 = White
      [Figure: one density histogram of age per race (3, 4, 5), each with a normal curve overlaid; x-axis labeled from 20 to 100 in steps of 10. Graphs by race.]

  31. Main issues with histograms
      - Proper level of aggregation
      - Non-regular data categories

  32. Draw the previous graph with a box plot
      graph box age if birth_year>1900
      [Figure: box plot of age, y-axis from 0 to 100, annotated with the upper quartile, median, lower quartile, inter-quartile range (IQR), and 1.5 x IQR.]
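The quantities annotated on the box plot (quartiles, inter-quartile range, and the 1.5 x IQR limits) can be computed by hand. This Python sketch uses the standard library's `statistics.quantiles`, whose default interpolation rule may differ slightly from the one Stata uses; the data are hypothetical.

```python
import statistics

# Hypothetical ages, sorted, with one unusually large value.
ages = [30, 32, 34, 35, 36, 38, 40, 41, 42, 44, 95]

# Three cut points that split the data into four equal-sized groups.
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Points beyond 1.5 x IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

The box spans the lower to the upper quartile, and the whiskers reach only as far as the most extreme observations that still fall inside the two fences.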

  33. Draw the box plots for the different races
      graph box age if birth_year>1900&race>=3&race<=5,by(race)
      3 = Black, 4 = Hispanic, 5 = White
      [Figure: one box plot of age per race (3, 4, 5), y-axis from 0 to 100. Graphs by race.]

  34. Draw the box plots for the different races using "over" option
      graph box age if birth_year>1900&race>=3&race<=5,over(race)
      3 = Black, 4 = Hispanic, 5 = White
      [Figure: box plots of age for races 3, 4, and 5 side by side on one axis, y-axis from 0 to 100.]

  35. A note about histograms with unnatural categories
      From the Current Population Survey (2000), Voter and Registration Survey:
      How long (have you/has name) lived at this address?
      -9 No Response
      -3 Refused
      -2 Don't know
      -1 Not in universe
      1 Less than 1 month
      2 1-6 months
      3 7-11 months
      4 1-2 years
      5 3-4 years
      6 5 years or longer

  36. Solution, Step 1: map artificial category onto "natural" midpoint
      -9 No Response -> missing
      -3 Refused -> missing
      -2 Don't know -> missing
      -1 Not in universe -> missing
      1 Less than 1 month -> 1/24 = 0.042
      2 1-6 months -> 3.5/12 = 0.29
      3 7-11 months -> 9/12 = 0.75
      4 1-2 years -> 1.5
      5 3-4 years -> 3.5
      6 5 years or longer -> 10 (arbitrary)
      recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)
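The recode step amounts to a lookup table from category codes to midpoints measured in years. Below is a Python sketch of the same mapping (not the Stata `recode` command itself); the dictionary and function names are hypothetical, and `None` stands in for Stata's missing value.

```python
# Midpoint of each residence-length category, in years (from the slide).
MIDPOINT_YEARS = {
    1: 0.042,  # less than 1 month: 1/24 of a year
    2: 0.29,   # 1-6 months: 3.5/12
    3: 0.75,   # 7-11 months: 9/12
    4: 1.5,    # 1-2 years
    5: 3.5,    # 3-4 years
    6: 10,     # 5 years or longer (arbitrary value)
}

def recode_live_length(code):
    """Map a survey code to years; codes of -1 or below become missing."""
    if code <= -1:
        return None
    # Codes without a rule are left unchanged.
    return MIDPOINT_YEARS.get(code, code)

responses = [-9, 1, 2, 6, -1, 4]  # hypothetical survey codes
print([recode_live_length(c) for c in responses])
```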

  37. Graph of recoded data
      histogram longevity, fraction
      [Figure: fraction against longevity, x-axis from 0 to 10; tallest bar at .557134.]

  38. Density plot of data
      - Total area of last bar = .557
      - Width of bar = 11 (arbitrary)
      - Solve for h: a = w h, or .557 = 11h => h = .051
      [Figure: density against longevity, x-axis from 0 to 15.]
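On a density scale, a bar's area (not its height) equals the share of observations it holds, so the height is the share divided by the width. A small Python sketch of that calculation, using the numbers on the slide; the function name is hypothetical.

```python
def bar_height(area, width):
    """Height of a density-histogram bar whose area equals its share."""
    return area / width

# Last category: 55.7% of respondents, drawn 11 years wide (arbitrary width).
last_bar_height = bar_height(0.557, 11)

print(round(last_bar_height, 3))
```

Choosing a different arbitrary width for the open-ended "5 years or longer" category would change the bar's height but not its area.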
