Descriptive Statistics 17.871 Spring 2015


  1. Introduction to Descriptive Statistics 17.871 Spring 2015

  2. Reasons for paying attention to data description
      • Double-check data acquisition
      • Data exploration
      • Data explanation

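  3. Key measures

      Describing data | Moment parameters             | Non-moment based location
      ----------------+-------------------------------+---------------------------
      Center          | Mean                          | Mode, median
      Spread          | Variance (standard deviation) | Range, interquartile range
      Skew            | Skewness                      | --
      Peaked          | Kurtosis                      | --

  4. Key distinction: Population vs. Sample

      Notation | Population | Sample
      ---------+------------+-------
      Alphabet | Greeks     | Romans
      Symbols  | μ, σ, β    | s, b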

  5. Mean: $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
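As a minimal sketch of the mean formula above, the following Python function sums the observations and divides by their count. The `responses` list is hypothetical illustration data, not CCES data.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)


# Hypothetical survey responses coded 1-4 (an assumption for illustration).
responses = [1, 2, 2, 3, 3, 3, 4, 4]
print(mean(responses))  # 2.75
```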

  6. Guess the Mean [Figure: histogram of institution approval - supreme court; categories: strongly approve, somewhat approve, somewhat disapprove, strongly disapprove. Source: CCES]

  7. Guess the Mean [Figure: histogram of institution approval - supreme court, x-axis 0 to 4. Source: CCES]

  8. Guess the Mean [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; mean marked at 2.8. Source: CCES]

  9. Guess the Mean [Figure: histogram of Number of medals, x-axis 0 to 30]

  10. Guess the Mean [Figure: histogram of Number of medals, x-axis 0 to 30; mean marked at 3.3]

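  11. Variance, Standard Deviation of a Population: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$, $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$

  12. Variance, S.D. of a Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, where the denominator $n - 1$ is the degrees of freedom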
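The only difference between the population and sample formulas is the denominator (n versus n - 1). A small Python sketch makes this concrete; the `data` list and the function name `variance` are illustrative assumptions, not part of the lecture.

```python
import math


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample)
    or by n (population)."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("the sample variance needs at least two observations")
    center = sum(values) / n
    sum_sq = sum((x - center) ** 2 for x in values)
    return sum_sq / (n - 1) if sample else sum_sq / n


# Hypothetical data: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = variance(data, sample=False)   # 32 / 8 = 4.0
samp_var = variance(data, sample=True)   # 32 / 7, slightly larger
print(pop_var, math.sqrt(pop_var))
print(samp_var, math.sqrt(samp_var))
```

With many observations the two versions are nearly identical; the gap matters mostly in small samples.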

  13. Guess: What were the mean and standard deviation of the age of the MIT undergraduate population on Registration Day, Fall 2014? (Ages shown: 18, 19, 20, 21, 22)

  14. Guess: What were the mean and standard deviation of the age of the MIT undergraduate population on Registration Day, Fall 2014? (Ages shown: 18, 19, 20, 21, 22) My guess: mean probably ~19.5 (if everyone is 18, 19, 20, or 21, and they are evenly distributed); s.d. probably ~1
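The guess can be checked under its own stated assumption, that ages are spread evenly over 18, 19, 20, and 21. This sketch uses Python's standard `statistics` module and treats the four equally likely ages as the whole population; the even split is the slide's assumption, not real enrollment data.

```python
import statistics

# One observation per age stands in for an evenly distributed population.
ages = [18, 19, 20, 21]

guess_mean = statistics.mean(ages)   # 19.5
guess_sd = statistics.pstdev(ages)   # population s.d. = sqrt(1.25), about 1.12
print(guess_mean, round(guess_sd, 2))
```

The result, a standard deviation a little above 1, agrees with the rough guess of about 1.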

  15. Guess the Standard Deviation [Figure: histogram of institution approval - supreme court, x-axis 0 to 4. Source: CCES]

  16. Guess the Standard Deviation [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; σ = 0.89. Source: CCES]

  17. Guess the Standard Deviation [Figure: histogram of Number of medals, x-axis 0 to 30; mean marked at 3.3]

  18. Guess the Standard Deviation [Figure: histogram of Number of medals, x-axis 0 to 30; mean marked at 3.3; σ = 7.2]

  19. Binary data: $\bar{X} = \mathrm{prob}(X = 1)$ = proportion of time $x = 1$; $s_x^2 = \bar{x}(1 - \bar{x})$; $s_x = \sqrt{\bar{x}(1 - \bar{x})}$

  20. Example of this, using the most recent Gallup approval rating of Pres. Obama
      • gen o_approve = 1 if gallup=="Approve"
      • replace o_approve = 0 if gallup=="Disapprove"
      • the command summ o_approve produces
      • Mean = 0.51
      • Var = 0.51(1-0.51) = .2499
      • S.d. = .49989999

  21. Therefore, reporting the standard deviation (or variance) of a binary variable is redundant information. Don’t do it for papers written for 17.871.
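The redundancy can be verified numerically. The sketch below builds a hypothetical 0/1 variable with 51 ones out of 100 (chosen to match the 0.51 mean in the example, not actual Gallup microdata) and compares the variance computed from the data with the value implied by the mean alone. It uses the n denominator; with n - 1 the result is larger by a factor of n/(n - 1).

```python
# Hypothetical binary variable: 51 "approve" (1) and 49 "disapprove" (0).
approve = [1] * 51 + [0] * 49

n = len(approve)
p = sum(approve) / n                                    # mean = proportion of ones
var_from_data = sum((x - p) ** 2 for x in approve) / n  # variance, n denominator
var_from_mean = p * (1 - p)                             # implied by the mean alone

print(p, round(var_from_data, 4), round(var_from_mean, 4))
```

Both variance figures come out to 0.2499, so the mean already carries all the information.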

  22. Non-moment-based measures of center or spread
      • Central tendency: mode, median
      • Spread: range, interquartile range

  23. Mode
      • The most common value

  24. Guess the Mode [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; mean marked at 2.8. Source: CCES]

  25. Guess the Mode [Figure: histogram of Number of medals, x-axis 0 to 30; mean marked at 3.3]

  26. Guess the Mode [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100. Caption: Number of years the respondent has lived in his/her current home]

  27. Guess the Mode

                   pew religion |      Freq.     Percent        Cum.
      --------------------------+-----------------------------------
                     protestant |     26,241       47.40       47.40
                 roman catholic |     12,348       22.30       69.70
                         mormon |        931        1.68       71.38
      eastern or greek orthodox |        275        0.50       71.88
                         jewish |      1,678        3.03       74.91
                         muslim |        164        0.30       75.21
                       buddhist |        445        0.80       76.01
                          hindu |         89        0.16       76.17
                       agnostic |      2,885        5.21       81.38
          nothing in particular |      7,641       13.80       95.18
                 something else |      2,667        4.82      100.00
      --------------------------+-----------------------------------
                          Total |     55,364      100.00

  28. The mode is rarely an informative statistic about the central tendency of the data. It’s most useful in describing the “typical” observation of a categorical variable
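For a categorical variable, finding the mode is just a frequency count. A minimal Python sketch follows; the short `religions` list is made-up illustration data, and the helper name `mode` is an assumption.

```python
from collections import Counter


def mode(values):
    """Return the most common value; on ties, the value seen first wins."""
    counts = Counter(values)
    value, _count = counts.most_common(1)[0]
    return value


# Hypothetical categorical responses (not the survey data).
religions = [
    "protestant", "roman catholic", "protestant", "jewish",
    "nothing in particular", "protestant", "roman catholic",
]
print(mode(religions))  # protestant
```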

  29. Median
      • The numerical value separating the upper half of a distribution from the lower half of the distribution
      • If N is odd, there is a unique median
      • If N is even, there is no unique median; the convention is to average the two middle values
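The odd and even cases can be sketched in a few lines of Python. The function below follows the convention stated above (average the two middle values when N is even); the input lists are arbitrary illustration values.

```python
def median(values):
    """Middle value of the sorted data; for even N, the average of the two
    middle values."""
    if not values:
        raise ValueError("the median of an empty sample is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd N: unique middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even N: average the two


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```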

  30. Guess the Median [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; marked values: 2.0, 2.8. Source: CCES]

  31. Guess the Median [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; marked values: 2.0, 2.8, 3.0. Source: CCES]

  32. Guess the Median [Figure: histogram of Number of medals, x-axis 0 to 30; marked value: 3.3]

  33. Guess the Median [Figure: histogram of Number of medals, x-axis 0 to 30; marked values: 0, 3.3]

  34. Guess the Median [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; Mode = 0, Mean = 11.8. Caption: Number of years the respondent has lived in his/her current home]

  35. Guess the Median [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; Mode = 0, Mean = 11.8, Median = 8. Caption: Number of years the respondent has lived in his/her current home]

  36. Guess the Median [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; Mode = 0, Mean = 11.8, Median = 8. Caption: Number of years the respondent has lived in his/her current home] Note: with right-skewed data, mode < median < mean

  37. Median frequently preferred for income data

  38. The (uninformative) graph [Figure: histogram of Income, x-axis 0 to 3.00e+07] Mean = 68,735; Median = 35,000; Mode = 0 (probably)

  39. Spread
      • Range: $\max(x) - \min(x)$
      • Interquartile range (IQR): $Q_3(x) - Q_1(x)$, where $Q_1 = \mathrm{CDF}^{-1}(.25)$ and $Q_3 = \mathrm{CDF}^{-1}(.75)$
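Both spread measures above are easy to compute. This sketch uses `statistics.quantiles` from Python's standard library with `method="inclusive"`. Quartile conventions differ across software, so other tools (Stata included) may give slightly different cut points on small samples. The `years` list is hypothetical.

```python
import statistics


def spread(values):
    """Return the range and the interquartile range (Q3 - Q1) of the data."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"range": max(values) - min(values), "iqr": q3 - q1}


# Hypothetical data: quartiles fall at 3, 5, and 7, so range = 8 and IQR = 4.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9]
result = spread(years)
print(result["range"], result["iqr"])
```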

  40. Guess the IQR [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; σ = 0.89. Source: CCES]

  41. Guess the IQR [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; σ = 0.89, IQR = 2. Source: CCES]

  42. Guess the IQR [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; Mode = 0, Mean = 11.8, Median = 8, σ = 11.7. Caption: Number of years the respondent has lived in his/her current home]

  43. Guess the IQR [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; Mode = 0, Mean = 11.8, Median = 8, σ = 11.7, IQR = 14 (17-3). Caption: Number of years the respondent has lived in his/her current home]

  44. Don’t guess the IQR [Figure: histogram of income, x-axis 0 to 30,000,000] Mean = 68,735; σ = 371,799; Median = 35,000; IQR = 50,000 (62,500-12,500); Mode = 0 (probably)

  45. Lopsidedness and peakedness

  46. Normal distribution
      • Examples: IQ, SAT, height
      • Symmetrical
      • Mean = median = mode
      • Density: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$
      [Figure: bell curve; x-axis: Value, y-axis: Frequency]
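A short Python sketch of the normal density illustrates the symmetry claim: the curve has the same height at equal distances on either side of the mean, and it peaks at the mean. The IQ-style parameters (mean 100, s.d. 15) are a common textbook convention used here as an assumption, not values from the slides.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Assumed IQ-style scale: mean 100, standard deviation 15.
left = normal_pdf(85, mu=100, sigma=15)
right = normal_pdf(115, mu=100, sigma=15)
peak = normal_pdf(100, mu=100, sigma=15)
print(left == right, peak > left)  # True True
```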

  47. Skewness: asymmetrical distribution
      • Examples: income, contribution to candidates, populations of countries, age of MIT undergraduates
      • “Positive skew”
      • “Right skew”
      [Figure: right-skewed curve; x-axis: Value, y-axis: Frequency]

  48. [Two figures. (a) Distribution of the average $$ of dividends/tax return (in K’s); x-axis: dividends_pc; labeled outlier: Hyde County, SD. (b) Fuel economy of cars for sale in the US; x-axis: var1; labeled outlier: Mitsubishi i-MiEV (which is supposed to be all electric)]

  49. Skewness: asymmetrical distribution
      • Examples: GPA of MIT students, age of MIT faculty
      • “Negative skew”
      • “Left skew”
      [Figure: left-skewed curve; x-axis: Value, y-axis: Frequency]

  50. Placement of Republican Party on 100-point scale [Figure: histogram of place on ideological scale - republican party, x-axis 0 to 100]

  51. Skewness [Figure: x-axis: Value, y-axis: Frequency]

  52. Guess the sign of the skew [Figure: histogram of institution approval - supreme court, x-axis 0 to 4. Source: CCES]

  53. Guess the sign of the skew [Figure: histogram of institution approval - supreme court, x-axis 0 to 4; γ = 0.89. Source: CCES]

  54. Guess the sign of the skew [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100. Caption: Number of years the respondent has lived in his/her current home]

  55. Guess the sign of the skew [Figure: histogram of How long lived in current residence - Years, x-axis 0 to 100; γ = 1.5. Caption: Number of years the respondent has lived in his/her current home]

  56. Note: It is really rare to find a naturally occurring variable with a negative skew [Figure: histogram of Life expectancy, x-axis 40 to 80] Mean = 68.3; s.d. = 8.7; Skew: -0.80
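The slides report skewness values without giving a formula. One common definition, assumed here, is the third central moment divided by the second central moment raised to the power 3/2. Under that assumption, a Python sketch shows how the sign tracks the direction of the long tail; both data lists are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (an assumed definition; the
    slides do not state which formula they use)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((x - center) ** 2 for x in values) / n
    m3 = sum((x - center) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_tail = [0, 0, 1, 2, 3, 10]        # long tail to the right
left_tail = [-x for x in right_tail]    # mirror image: long tail to the left

print(skewness(right_tail) > 0)   # True: positive (right) skew
print(skewness(left_tail) < 0)    # True: negative (left) skew
```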

  57. Kurtosis
      • Leptokurtic: k > 3
      • Mesokurtic: k = 3
      • Platykurtic: k < 3
      [Figure: three curves of differing peakedness; x-axis: Value, y-axis: Frequency]
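The thresholds above compare k with 3, the kurtosis of a normal distribution. Assuming the usual moment-based definition (fourth central moment divided by the squared second central moment), a Python sketch classifies two made-up data sets; the function names are illustrative.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (assumed definition; equals 3 for
    a normal distribution)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((x - center) ** 2 for x in values) / n
    m4 = sum((x - center) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify(k):
    """Label k using the cutoffs on the slide."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]              # evenly spread values
peaked = [-5] + [0] * 8 + [5]       # tall center with two extreme values

print(classify(kurtosis(flat)))     # platykurtic
print(classify(kurtosis(peaked)))   # leptokurtic
```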

  58. [Figures: histograms of ideology - yourself, ideology - dem party, ideology - rep party, ideology - tea party movement; x-axis 0 to 8]

                     |   Mean    s.d.   Skew.   Kurt.
      ---------------+-------------------------------
      Self-placement |    4.5     1.9   -0.28     1.9
            Dem. pty |    2.2     1.4    1.1      3.9
            Rep. pty |    5.6     1.4   -0.98     3.7
           Tea party |    6.1     1.3   -2.1      7.5

      Source: CCES, 2012
