Frequency Distribution and Summary Statistics. Dongmei Li, Department of Public Health Sciences, Office of Public Health Studies, University of Hawai’i at Mānoa
Outline
1. Stemplot
2. Frequency table
3. Summary statistics
1. Stem-and-leaf plots (stemplots)
Always start by looking at the data with graphs and plots.
Our favorite technique for looking at a single variable is the stemplot.
A stemplot is a graphical technique that organizes data into a histogram-like display.
“You can observe a lot by looking” – Yogi Berra
Stemplot Illustrative Example
Select an SRS (simple random sample) of 10 ages.
List the data as an ordered array: 05 11 21 24 27 28 30 42 50 52
Divide each data point into a stem value and a leaf value.
In this example the “tens place” is the stem value and the “ones place” is the leaf value; e.g., 21 has a stem value of 2 and a leaf value of 1.
Stemplot illustration (cont.)
Draw an axis for the stem values:
0|
1|
2|1
3|
4|
5|
×10 (axis multiplier; important!)
Place leaves next to their stem value (here 21 is plotted).
Stemplot illustration (cont.)
Plot all data points and rearrange in rank order:
0|5
1|1
2|1478
3|0
4|2
5|02
×10
Here is the plot horizontally (for demonstration purposes):
    8
    7
    4     2
5 1 1 0 2 0
------------
0 1 2 3 4 5
------------
Rotated stemplot
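The construction steps above can be sketched in Python. This is only an illustration and not part of the original slides: the function name `stemplot` is hypothetical, and it assumes non-negative integer data with the tens place as the stem and the ones place as the leaf.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot rows (stem|leaves) for non-negative integers.

    Assumption for this sketch: stem = tens place, leaf = ones place,
    so the axis multiplier is x10.
    """
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in rank order
        stem, leaf = divmod(v, 10)    # e.g. 21 -> stem 2, leaf 1
        leaves[stem].append(str(leaf))
    # Include empty stems so gaps in the data stay visible.
    return [f"{s}|{''.join(leaves.get(s, []))}"
            for s in range(min(leaves), max(leaves) + 1)]


ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
rows = stemplot(ages)
print("\n".join(rows))
print("x10")  # axis multiplier
```

Sorting before splitting means the leaves on each stem come out in rank order without a separate rearranging step.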
Interpreting Stemplots
Shape
◦ Symmetry
◦ Modality (number of peaks)
◦ Kurtosis (width of tails)
◦ Departures (outliers)
Location
◦ Gravitational center: mean
◦ Middle value: median
Spread
◦ Range and inter-quartile range
◦ Standard deviation and variance
Shape
“Shape” refers to the pattern when plotted.
Here’s the silhouette of our data:
    X
    X
    X     X
X X X X X X
-----------
0 1 2 3 4 5
-----------
Consider: symmetry, modality, kurtosis
Shape: Idealized Density Curve
A large dataset is introduced.
A density curve is superimposed to better discuss shape.
Symmetrical Shapes
Asymmetrical shapes
Modality (no. of peaks)
Kurtosis (width of tails)
Mesokurtic (medium)
Platykurtic (flat, skinny tails)
Leptokurtic (steep, fat tails)
Kurtosis cannot be easily judged by eye.
Stemplot – Second Example
Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
Stem = ones place
Leaves = tenths place
Round to keep one digit after the decimal point (e.g., 1.47 → 1.5)
Do not plot the decimal point
|1|5
|2|14
|3|4789
|4|4
(×1)
Shape: asymmetric, skewed to the left, unimodal, no outliers
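A rough Python sketch of the rounding step for this second example follows. The helper name `stemplot_tenths` is made up for illustration, and the approach (convert each value to a whole number of tenths, then split) is one possible way to do it, not something the slides prescribe.

```python
def stemplot_tenths(values):
    """Stemplot with ones-place stems and tenths-place leaves (sketch)."""
    # Express each value as a whole number of tenths, which rounds it to
    # one decimal place. Note: Python's round() sends exact .5 ties to the
    # nearest even integer; none of the values below is such a tie.
    tenths = sorted(round(v * 10) for v in values)
    leaves = {}
    for t in tenths:
        stem, leaf = divmod(t, 10)    # e.g. 15 tenths -> stem 1, leaf 5
        leaves.setdefault(stem, []).append(str(leaf))
    return [f"{s}|{''.join(leaves.get(s, []))}"
            for s in range(min(leaves), max(leaves) + 1)]


data = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42]
rows_tenths = stemplot_tenths(data)
print("\n".join(rows_tenths))
print("x1")  # axis multiplier
```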
Draw a stemplot using JMP
Analyze---Distribution---Data---Stem and Leaf
Open the JMP data set named Stem_and_leaf_plot.jmp
Third Illustrative Example (n = 26)
Age data set from 26 subjects: {14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38}
Data set: Stem_and_leaf_plot_example2.jmp
Distribution of the age variable?
2. Frequency Table
AGE   | Freq  Rel.Freq  Cum.Freq.
------+--------------------------
    3 |    2     0.3%      0.3%
    4 |    9     1.4%      1.7%
    5 |   28     4.3%      6.0%
    6 |   37     5.7%     11.6%
    7 |   54     8.3%     19.9%
    8 |   85    13.0%     32.9%
    9 |   94    14.4%     47.2%
   10 |   81    12.4%     59.6%
   11 |   90    13.8%     73.4%
   12 |   57     8.7%     82.1%
   13 |   43     6.6%     88.7%
   14 |   25     3.8%     92.5%
   15 |   19     2.9%     95.4%
   16 |   13     2.0%     97.4%
   17 |    8     1.2%     98.6%
   18 |    6     0.9%     99.5%
   19 |    3     0.5%    100.0%
------+--------------------------
Total |  654   100.0%
Frequency = count
Relative frequency = proportion or %
Cumulative frequency = % less than or equal to level
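As an illustration of how the three columns relate, the following Python sketch recomputes the relative and cumulative frequencies from the counts in the table. The variable names are arbitrary, and the printed layout is only an approximation of the slide's table.

```python
from itertools import accumulate

# Counts copied from the AGE frequency table (ages 3 through 19).
ages = list(range(3, 20))
freq = [2, 9, 28, 37, 54, 85, 94, 81, 90, 57, 43, 25, 19, 13, 8, 6, 3]

n = sum(freq)                                      # total number of observations
rel_pct = [100 * f / n for f in freq]              # relative frequency, in %
cum_pct = [100 * c / n for c in accumulate(freq)]  # cumulative frequency, in %

for age, f, r, c in zip(ages, freq, rel_pct, cum_pct):
    print(f"{age:>5} | {f:>4} {r:>6.1f}% {c:>6.1f}%")
print(f"Total | {n:>4} {100.0:>6.1f}%")
```

Accumulating the raw counts and dividing once avoids the small drift that can appear when already-rounded percentages are added up.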
Frequency Table with Class Intervals
When data are sparse, group data into class intervals
Create 4 to 12 class intervals
Classes can be uniform or non-uniform
End-point convention: e.g., a first class interval of 0 to 10 includes 0 but excludes 10 (0 to 9.99)
Tally frequencies
Calculate relative frequency
Calculate cumulative frequency
Class Intervals
Uniform class intervals table (width 10) for data: 05 11 21 24 27 28 30 42 50 52
Class     Freq  Relative Freq. (%)  Cumulative Freq. (%)
 0 – 9      1          10                   10
10 – 19     1          10                   20
20 – 29     4          40                   60
30 – 39     1          10                   70
40 – 49     1          10                   80
50 – 59     2          20                  100
Total      10         100                   --
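The class-interval table can be reproduced with a short Python sketch. It assumes uniform classes of width 10 starting at 0 and uses integer division to assign each age to a class; the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
width = 10  # uniform class width

# Class index k covers [k*width, (k+1)*width): the lower end point is
# included and the upper end point is excluded.
counts = Counter(a // width for a in ages)
classes = range(min(counts), max(counts) + 1)

n = len(ages)
freq = [counts.get(k, 0) for k in classes]         # tally per class
rel_pct = [100 * f / n for f in freq]              # relative frequency, in %
cum_pct = [100 * c / n for c in accumulate(freq)]  # cumulative frequency, in %

for k, f, r, c in zip(classes, freq, rel_pct, cum_pct):
    label = f"{k * width}-{(k + 1) * width - 1}"
    print(f"{label:<7} {f:>4} {r:>5.0f} {c:>5.0f}")
```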
Histogram
A histogram is a frequency chart for a quantitative measurement. Notice how the bars touch.
[Figure: histogram of frequency (0 to 5) by Age Class: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59]
Bar Chart
A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals.
[Figure: bar chart of counts (0 to 500) by School-level: Pre-, Elem., Middle, High]
3. Summary Statistics
Central location
◦ Mean
◦ Median
◦ Mode
Spread
◦ Range and interquartile range (IQR)
◦ Variance and standard deviation
Location: Mean
“Eye-ball method”: visualize where the plot would balance
Arithmetic method: sum the values and divide by n
    8
    7
    4     2
5 1 1 0 2 0
------------
0 1 2 3 4 5
------------
     ^
 Grav.Center
Eye-ball method: around 25 to 30 (takes practice)
Arithmetic method: mean = 290 / 10 = 29
Notation
$n$ = sample size
$X$ = the variable (e.g., ages of subjects)
$x_i$ = the value of individual $i$ for variable $X$
$\sum$ = sum all values (capital sigma)
Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52
$n = 10$
$X$ = AGE variable
$x_1 = 21,\ x_2 = 42,\ \ldots,\ x_{10} = 52$
$\sum x_i = x_1 + x_2 + \cdots + x_{10} = 21 + 42 + \cdots + 52 = 290$
Central Location: Sample Mean
“Arithmetic average”
Traditional measure of central location
Sum the values and divide by n
“xbar” refers to the sample mean:
$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: Sample Mean
Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52
Note that $n = 10$, $\sum x_i = 21 + 42 + \cdots + 52 = 290$, and
$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.0$
The sample mean is the gravitational center of a distribution.
[Figure: number line from 0 to 60 balancing at Mean = 29]
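The arithmetic in this example can be checked with a few lines of Python. This is a plain sketch using built-in functions; the variable names `total` and `xbar` are chosen here for readability.

```python
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)        # sample size
total = sum(ages)    # sum of all x_i
xbar = total / n     # sample mean: sum divided by n

print(f"n = {n}, sum = {total}, mean = {xbar}")
```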
Uses of the Sample Mean
The sample mean can be used to predict:
◦ The value of an observation drawn at random from the sample
◦ The value of an observation drawn at random from the population
◦ The population mean
Population Mean
$\mu = \frac{\sum x_i}{N} = \frac{1}{N}\sum x_i$
Same operation as the sample mean except based on the entire population ($N \equiv$ population size)
Conceptually important
Usually not available in practice
Sometimes referred to as the expected value
Central Location: Median
Ordered array: 05 11 21 24 27 28 30 42 50 52
When n is even, the median is the average of the (n ÷ 2)th and (n ÷ 2 + 1)th values.
When n is odd, the median is the ((n + 1) ÷ 2)th value.
For the illustrative data: n = 10, so the median falls between the 5th and 6th values, 27 and 28.
Average the adjacent values: M = (27 + 28) ÷ 2 = 27.5
More Examples of Medians
Example A: 2 4 6 Median = 4
Example B: 2 4 6 8 Median = 5 (average of 4 and 6)
Example C: 6 2 4 Median ≠ 2 (values must be ordered first; the median is 4)
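The odd/even rule for the median can be written as a small Python function. This is a sketch that assumes a non-empty list of numbers; the standard library's `statistics.median` does the same job and would normally be preferred.

```python
def median(values):
    """Median by the ordered-array rule (sketch; assumes a non-empty list)."""
    ordered = sorted(values)      # values must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # the ((n + 1) / 2)th value, 0-based index n // 2
    # n is even: average the (n / 2)th and (n / 2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6]))       # Example A
print(median([2, 4, 6, 8]))    # Example B
print(median([6, 2, 4]))       # Example C, unordered input
print(median([5, 11, 21, 24, 27, 28, 30, 42, 50, 52]))  # the ten ages
```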
The Median is Robust
The median is more resistant to skews and outliers than the mean; it is more robust.
This data set has a mean of 1600: 1362 1439 1460 1614 1666 1792 1867
Here’s the same data set with a data entry error “outlier” (the last value). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
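The comparison can be rerun with Python's standard `statistics` module. The sketch below simply computes the mean and median of both versions of the data set; the list names are illustrative.

```python
from statistics import mean, median

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # last value mistyped

# One bad value moves the mean a long way but leaves the middle value alone.
for label, data in [("clean", clean), ("with error", with_error)]:
    print(f"{label:>10}: mean = {mean(data):.0f}, median = {median(data)}")
```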
Mode
The mode is the most commonly encountered value in the dataset.
This data set has a mode of 7: {4, 7, 7, 7, 8, 8, 9}
This data set has no mode: {4, 6, 7, 8} (each point appears only once)
The mode is useful only in large data sets with repeating values.
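A small Python sketch of this definition follows. The function name `modes` is hypothetical; it returns an empty list when every value appears only once, which is a design choice made here to mirror the "no mode" case (the standard library's `statistics.multimode` would instead return every value).

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats.

    Sketch only; assumes a non-empty list.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                  # each point appears only once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([4, 7, 7, 7, 8, 8, 9]))
print(modes([4, 6, 7, 8]))
```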
Comparison of Mean, Median, Mode
Note how the mean gets pulled toward the longer tail more than the median.
mean = median → symmetrical distribution
mean > median → positive skew
mean < median → negative skew
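The three comparisons can be turned into a rough check in Python. This is only a rule-of-thumb sketch: the function name `skew_direction` is made up, and comparing the mean with the median suggests a direction of skew without proving anything about the full shape.

```python
from math import isclose
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing mean and median (rule of thumb only)."""
    m, med = mean(values), median(values)
    if isclose(m, med):           # treat tiny floating-point gaps as equal
        return "symmetrical"
    return "positive skew" if m > med else "negative skew"


ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
print(skew_direction(ages))
```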
Spread: Quartiles
Two distributions can be quite different yet have the same mean.
These data compare particulate matter in air samples (μg/m³) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.
Site 1| |Site 2
---------------
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
   ×10
Spread: Range
Range = maximum – minimum
Illustrative example:
Site 1| |Site 2
---------------
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
   ×10
Site 1 range = 68 – 22 = 46
Site 2 range = 40 – 32 = 8
Beware: the sample range will tend to underestimate the population range.
Always supplement the range with at least one additional measure of spread.
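The range calculation can be checked in Python. The two lists below are read off the back-to-back stemplot (stems in tens, leaves in ones), so they are a transcription made for this sketch; the helper name `sample_range` is hypothetical.

```python
from statistics import mean

# Values transcribed from the back-to-back stemplot (stem x10 + leaf).
site1 = [22, 24, 28, 32, 36, 38, 42, 68]
site2 = [32, 33, 34, 36, 36, 38, 39, 40]


def sample_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


for name, data in [("Site 1", site1), ("Site 2", site2)]:
    print(f"{name}: mean = {mean(data):.1f}, range = {sample_range(data)}")
```

The similar means alongside very different ranges are what the slide's comparison is meant to show.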