
Frequency Distribution and Summary Statistics Dongmei Li - PowerPoint PPT Presentation


  1. Frequency Distribution and Summary Statistics
      Dongmei Li, Department of Public Health Sciences, Office of Public Health Studies, University of Hawai’i at Mānoa

  2. Outline
      1. Stemplot
      2. Frequency table
      3. Summary statistics

  3. 1. Stem-and-leaf plots (stemplots)
      • Always start by looking at the data with graphs and plots
      • Our favorite technique for looking at a single variable is the stemplot
      • A stemplot is a graphical technique that organizes data into a histogram-like display
      “You can observe a lot by looking” – Yogi Berra

  4. Stemplot Illustrative Example
      • Select a simple random sample (SRS) of 10 ages
      • List the data as an ordered array: 05 11 21 24 27 28 30 42 50 52
      • Divide each data point into a stem value and a leaf value
      • In this example the “tens place” is the stem value and the “ones place” is the leaf value; e.g., 21 has a stem value of 2 and a leaf value of 1

  5. Stemplot illustration (cont.)
      • Draw an axis for the stem values:
          0|
          1|
          2|1
          3|
          4|
          5|
          ×10  (axis multiplier; important!)
      • Place leaves next to their stem value
      • Here the value 21 is plotted (stem 2, leaf 1)

  6. Stemplot illustration continued …
      • Plot all data points and rearrange in rank order:
          0|5
          1|1
          2|1478
          3|0
          4|2
          5|02
          ×10
      • Here is the plot horizontally (for demonstration purposes):
                  8
                  7
                  4     2
              5 1 1 0 2 0
              -----------
              0 1 2 3 4 5
              -----------
              Rotated stemplot
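
The stem-and-leaf steps above (order the data, split each value into a tens-place stem and a ones-place leaf, list the leaves beside their stem) can be sketched in Python. This is a minimal illustration, not the JMP procedure used in the slides; the function name `stemplot` is made up for this example, and it assumes non-negative whole numbers.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines; stem = tens place, leaf = ones place.

    Assumes non-negative integers (hypothetical helper, for illustration).
    """
    ordered = sorted(values)
    leaves = defaultdict(list)
    for v in ordered:
        stem, leaf = divmod(v, 10)   # 21 -> stem 2, leaf 1
        leaves[stem].append(leaf)
    # Include every stem between the smallest and largest, even if empty,
    # so gaps in the data remain visible.
    first, last = ordered[0] // 10, ordered[-1] // 10
    return [f"{stem}|{''.join(map(str, leaves[stem]))}"
            for stem in range(first, last + 1)]

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
plot = stemplot(ages)
print("\n".join(plot))
print("×10")
```

Keeping empty stems in the output matters for interpretation: dropping them would hide gaps and outliers.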

  7. Interpreting Stemplots
      • Shape
        ◦ Symmetry
        ◦ Modality (number of peaks)
        ◦ Kurtosis (width of tails)
        ◦ Departures (outliers)
      • Location
        ◦ Gravitational center → mean
        ◦ Middle value → median
      • Spread
        ◦ Range and interquartile range
        ◦ Standard deviation and variance

  8. Shape
      • “Shape” refers to the pattern of the data when plotted
      • Here’s the silhouette of our data:
                  X
                  X
                  X     X
              X X X X X X
              -----------
              0 1 2 3 4 5
              -----------
      • Consider: symmetry, modality, kurtosis

  9. Shape: Idealized Density Curve
      A large dataset is introduced. A density curve is superimposed to better discuss shape.

  10. Symmetrical Shapes

  11. Asymmetrical Shapes

  12. Modality (no. of peaks)

  13. Kurtosis (width of tails)
      • Mesokurtic (medium)
      • Platykurtic (flat) → skinny tails
      • Leptokurtic (steep) → fat tails
      Kurtosis is not easily judged by eye

  14. Stemplot – Second Example
      • Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
      • Stem = ones place
      • Leaves = tenths place
      • Round to keep one digit after the decimal point (e.g., 1.47 → 1.5)
      • Do not plot the decimal point
          |1|5
          |2|14
          |3|4789
          |4|4
          (×1)
      • Shape: asymmetric, skewed to the left, unimodal, no outliers

  15. Draw a stemplot using JMP
      • Open the JMP data set named Stem_and_leaf_plot.jmp
      • Analyze → Distribution → Data → Stem and Leaf

  16. Third Illustrative Example (n = 26)
      • Age data set from 26 subjects: {14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38}
      • Data set: Stem_and_leaf_plot_example2.jmp
      • What is the distribution of the age variable?

  17. 2. Frequency Table
      • Frequency = count
      • Relative frequency = proportion or %
      • Cumulative frequency → % less than or equal to level

           AGE  | Freq  Rel.Freq  Cum.Freq.
          ------+--------------------------
              3 |    2     0.3%      0.3%
              4 |    9     1.4%      1.7%
              5 |   28     4.3%      6.0%
              6 |   37     5.7%     11.6%
              7 |   54     8.3%     19.9%
              8 |   85    13.0%     32.9%
              9 |   94    14.4%     47.2%
             10 |   81    12.4%     59.6%
             11 |   90    13.8%     73.4%
             12 |   57     8.7%     82.1%
             13 |   43     6.6%     88.7%
             14 |   25     3.8%     92.5%
             15 |   19     2.9%     95.4%
             16 |   13     2.0%     97.4%
             17 |    8     1.2%     98.6%
             18 |    6     0.9%     99.5%
             19 |    3     0.5%    100.0%
          ------+--------------------------
          Total |  654   100.0%
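
As a rough check on how the three columns relate, the sketch below recomputes relative and cumulative frequency from the counts in the table above. The dictionary `freq` is simply those counts typed in by hand; the variable names are illustrative only.

```python
from itertools import accumulate

# Frequencies by age, copied from the frequency table on this slide.
freq = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81, 11: 90,
        12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8, 18: 6, 19: 3}

total = sum(freq.values())

# Relative frequency: each count as a percentage of the total.
rel = {age: 100 * f / total for age, f in freq.items()}

# Cumulative frequency: running total of counts, as a percentage.
running = accumulate(freq.values())
cum = {age: 100 * c / total for age, c in zip(freq, running)}

for age in freq:
    print(f"{age:>5} | {freq[age]:>4} {rel[age]:>7.1f}% {cum[age]:>7.1f}%")
print(f"Total | {total:>4}")
```

The last cumulative value is always 100%, which is a quick way to catch a tallying mistake.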

  18. Frequency Table with Class Intervals
      • When data are sparse, group data into class intervals
      • Create 4 to 12 class intervals
      • Classes can be uniform or non-uniform
      • End-point convention: e.g., a first class interval of 0 to 10 includes 0 but excludes 10 (0 to 9.99)
      • Tally frequencies
      • Calculate relative frequency
      • Calculate cumulative frequency

  19. Class Intervals
      Uniform class intervals table (width 10) for data: 05 11 21 24 27 28 30 42 50 52

          Class     Freq   Relative Freq. (%)   Cumulative Freq. (%)
          0-9         1           10                    10
          10-19       1           10                    20
          20-29       4           40                    60
          30-39       1           10                    70
          40-49       1           10                    80
          50-59       2           20                   100
          Total      10          100                    --
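
The tally for uniform class intervals can be sketched as follows. This is an illustrative reconstruction under the end-point convention stated earlier (include the lower bound, exclude the upper bound); the variable names are assumptions, not part of the original slides.

```python
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
width = 10
n = len(ages)

rows = []          # (label, frequency, relative %, cumulative %)
cumulative = 0
for lower in range(0, 60, width):
    upper = lower + width
    # End-point convention: lower bound included, upper bound excluded.
    freq = sum(1 for a in ages if lower <= a < upper)
    cumulative += freq
    rows.append((f"{lower}-{upper - 1}", freq,
                 100 * freq / n, 100 * cumulative / n))

print(f"{'Class':<8}{'Freq':>5}{'Rel %':>8}{'Cum %':>8}")
for label, freq, rel, cum in rows:
    print(f"{label:<8}{freq:>5}{rel:>8.0f}{cum:>8.0f}")
```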

  20. Histogram
      A histogram is a frequency chart for a quantitative measurement. Notice how the bars touch.
      [Figure: histogram of frequency (0 to 5) by age class: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59]

  21. Bar Chart
      A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals.
      [Figure: bar chart of counts (0 to 500) by school level: Pre-, Elem., Middle, High]

  22. 3. Summary Statistics
      • Central location
        ◦ Mean
        ◦ Median
        ◦ Mode
      • Spread
        ◦ Range and interquartile range (IQR)
        ◦ Variance and standard deviation

  23. Location: Mean
      • “Eye-ball method” → visualize where the plot would balance; here, around 25 to 30 (takes practice)
      • Arithmetic method → sum the values and divide by n: mean = 290 / 10 = 29
                  8
                  7
                  4     2
              5 1 1 0 2 0
              -----------
              0 1 2 3 4 5
              -----------
                    ^
              Grav. center

  24. Notation
      • $n$ → sample size
      • $X$ → the variable (e.g., ages of subjects)
      • $x_i$ → the value of individual $i$ for variable $X$
      • $\sum$ → sum all values (capital sigma)
      • Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52
        $n = 10$; $X$ = AGE variable; $x_1 = 21$, $x_2 = 42$, …, $x_{10} = 52$
        $\sum x_i = x_1 + x_2 + \cdots + x_{10} = 21 + 42 + \cdots + 52 = 290$

  25. Central Location: Sample Mean
      • “Arithmetic average”
      • Traditional measure of central location
      • Sum the values and divide by $n$
      • “xbar” refers to the sample mean:
        $\bar{x} = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$

  26. Example: Sample Mean
      Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52
      Note that $n = 10$, $\sum x_i = 21 + 42 + \cdots + 52 = 290$, and
        $\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.0$
      The sample mean is the gravitational center of a distribution.
      [Figure: data plotted on an axis from 0 to 60, balancing at mean = 29]
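
The arithmetic above can be reproduced in a few lines. The sketch also checks the "gravitational center" idea in one common formulation: deviations from the mean sum to zero. Variable names are illustrative.

```python
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)          # sample size
total = sum(ages)      # sum of all x_i
xbar = total / n       # sample mean

# Balance-point property: deviations from the mean cancel out.
deviation_sum = sum(x - xbar for x in ages)

print(f"n = {n}, sum = {total}, mean = {xbar}")
print(f"sum of deviations from the mean = {deviation_sum}")
```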

  27. Uses of the Sample Mean
      The sample mean can be used to predict:
      • The value of an observation drawn at random from the sample
      • The value of an observation drawn at random from the population
      • The population mean

  28. Population Mean
        $\mu = \frac{\sum x_i}{N} = \frac{1}{N}\sum x_i$
      • Same operation as the sample mean except based on the entire population ($N$ ≡ population size)
      • Conceptually important
      • Usually not available in practice
      • Sometimes referred to as the expected value

  29. Central Location: Median
      • Ordered array: 05 11 21 24 27 28 30 42 50 52
      • When n is even, the median is the average of the (n ÷ 2)th and the (n ÷ 2 + 1)th values
      • When n is odd, the median is the ((n + 1) ÷ 2)th value
      • For the illustrative data, n = 10, so the median falls between the 5th and 6th values (27 and 28)
      • Average the adjacent values: M = (27 + 28) ÷ 2 = 27.5

  30. More Examples of Medians
      • Example A: 2 4 6 → Median = 4
      • Example B: 2 4 6 8 → Median = 5 (average of 4 and 6)
      • Example C: 6 2 4 → Median ≠ 2 (values must be ordered first; the median is 4)
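
The even/odd rule for the median, including the "order first" caveat from Example C, can be written as a short function. This is a sketch for illustration; in practice Python's built-in `statistics.median` does the same job.

```python
def median(values):
    """Median by the rule on the slides: sort, then take the middle
    value (n odd) or the average of the two middle values (n even)."""
    ordered = sorted(values)          # values must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6]))        # Example A
print(median([2, 4, 6, 8]))     # Example B
print(median([6, 2, 4]))        # Example C (unordered input)
print(median([5, 11, 21, 24, 27, 28, 30, 42, 50, 52]))  # illustrative ages
```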

  31. The Median is Robust
      The median is more resistant to skews and outliers than the mean; it is more robust.
      This data set has a mean of 1600: 1362 1439 1460 1614 1666 1792 1867
      Here’s the same data set with a data entry error “outlier” (the last value). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867
      The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
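
To see the robustness numerically, the sketch below recomputes both statistics directly from the seven values listed on this slide, with and without the data-entry error. It uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = clean[:-1] + [9867]     # last value mistyped as 9867

for label, data in (("clean", clean), ("with error", with_error)):
    print(f"{label:>10}: mean = {mean(data):.0f}, median = {median(data)}")

# How far each statistic moves when the outlier is introduced.
mean_shift = mean(with_error) - mean(clean)
median_shift = median(with_error) - median(clean)
print(f"mean shifts by {mean_shift:.0f}, median shifts by {median_shift}")
```

A single bad value moves the mean by more than a thousand units while the median does not move at all.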

  32. Mode
      • The mode is the most commonly encountered value in the dataset
      • This data set has a mode of 7: {4, 7, 7, 7, 8, 8, 9}
      • This data set has no mode: {4, 6, 7, 8} (each point appears only once)
      • The mode is useful only in large data sets with repeating values
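
One way to compute the mode while honoring the "no mode" case is sketched below. The helper name `modes` and the choice to return an empty list when every value appears once are assumptions made for this example; it also assumes a non-empty input.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats.

    Hypothetical helper; assumes `values` is non-empty.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                 # every value appears only once
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([4, 7, 7, 7, 8, 8, 9]))   # one mode
print(modes([4, 6, 7, 8]))            # no mode
```

Returning a list also covers data sets with ties, where more than one value shares the highest count.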

  33. Comparison of Mean, Median, Mode
      Note how the mean gets pulled toward the longer tail more than the median does.
      • mean = median → symmetrical distribution
      • mean > median → positive skew
      • mean < median → negative skew
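
The comparison rule can be tried on the data from the second stemplot example (slide 14), which was described there as skewed to the left. The function below is only a rule-of-thumb sketch: the name `skew_direction` is made up, and equality of mean and median suggests but does not prove symmetry.

```python
from statistics import mean, median

def skew_direction(values):
    """Apply the mean-vs-median rule of thumb from the slide."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetric"      # exact equality; rare with real-valued data

second_example = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42]
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

print(skew_direction(second_example))   # mean is below the median
print(skew_direction(ages))             # mean 29 vs median 27.5
```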

  34. Spread: Quartiles
      • Two distributions can be quite different yet have the same mean
      • This data compares particulate matter in air samples (μg/m³) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.

          Site 1 |   | Site 2
          -------------------
              42 | 2 |
               8 | 2 |
               2 | 3 | 234
              86 | 3 | 6689
               2 | 4 | 0
                 | 4 |
                 | 5 |
                 | 5 |
                 | 6 |
               8 | 6 |
                  ×10

  35. Spread: Range
      • Range = maximum – minimum
      • Illustrative example:
        Site 1 range = 68 – 22 = 46
        Site 2 range = 40 – 32 = 8
      • Beware: the sample range will tend to underestimate the population range
      • Always supplement the range with at least one additional measure of spread

          Site 1 |   | Site 2
          -------------------
              42 | 2 |
               8 | 2 |
               2 | 3 | 234
              86 | 3 | 6689
               2 | 4 | 0
                 | 4 |
                 | 5 |
                 | 5 |
                 | 6 |
               8 | 6 |
                  ×10
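
The two sites can be compared numerically. The lists below are the individual values as read off the back-to-back stemplot (Site 1 leaves read outward from the stem), so they are a transcription rather than data supplied separately by the slides.

```python
from statistics import mean

# Values read from the back-to-back stemplot (units: μg/m³).
site1 = [22, 24, 28, 32, 36, 38, 42, 68]
site2 = [32, 33, 34, 36, 36, 38, 39, 40]

def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

for name, data in (("Site 1", site1), ("Site 2", site2)):
    print(f"{name}: mean = {mean(data):.1f}, range = {value_range(data)}")
```

Both means come out at about 36, while the ranges differ by a factor of nearly six, which is the point of the slide: similar centers can hide very different spreads.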
