Descriptive Statistics
Fall 2017
Instructor: Ajit Rajwade
Topic Overview
Some important terminology
Methods of data representation: frequency tables, graphs, pie-charts, scatter-plots
Data mean, median, mode, quantiles
Chebyshev’s inequality
Correlation coefficient
Terminology
Population: the collection of all elements that we wish to study. Example: data about the occurrence of tuberculosis all over the world. In this case, “population” refers to the set of people in the entire world.
The population is often too large to examine or study, so we study a subset of the population, called a sample.
In an experiment, we collect values for attributes of each member of the sample; each member is also called a sample point.
An example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test.
See http://www.who.int/tb/publications/global_report/en/ for more information.
Terminology
Discrete data: data whose values are restricted to a finite (or countably infinite) set. E.g.: letter grades at IITB, genders, marital status (single, married, divorced), income brackets in India for tax purposes.
Continuous data: data whose values belong to an uncountably infinite set. E.g.: a person’s height, the temperature of a place, the speed of a car at a time instant.
Methods of Data Representation/Visualization
Frequency Tables
For discrete data having a relatively small number of values, one can use a frequency table.
Each row of the table lists a data value followed by the number of sample points with that value (the frequency of that value).
The values need not always be numeric!

The definition of an ideal course (per student perspective) at IITB ;-)

Grade   Number of students
AA      100
AB      0
BB      0
BC      0
CC      0
Frequency Tables
The frequency table can be visualized using a line graph, a bar graph, or a frequency polygon.

Grade   Number of students
AA      5
AB      10
BB      30
BC      35
CC      20

[Figure: bar graph. X axis: Marks (50 to 90). Y axis: Number of students (0 to 35).]
A bar graph plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a thick vertical bar.
[Figure: line diagram of the same grade data (AA 5, AB 10, BB 30, BC 35, CC 20). X axis: Marks (50 to 90). Y axis: Number of students (0 to 35).]
A line diagram plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a vertical line.
[Figure: frequency polygon of the same grade data (AA 5, AB 10, BB 30, BC 35, CC 20). X axis: Marks (50 to 90). Y axis: Number of students (5 to 35).]
A frequency polygon plots the frequency of each data value on the Y axis, and connects consecutive plotted points by means of a line.
Relative frequency tables
Sometimes the actual frequencies are not important.
We may be interested only in the percentage or fraction of those frequencies for each data value, i.e. the relative frequencies.

Grade   Fraction of students
AA      0.05
AB      0.10
BB      0.30
BC      0.35
CC      0.20
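A minimal Python sketch of how the frequency and relative frequency tables above could be built from raw sample points. The list `grades` is made-up data constructed to match the counts in the table (5, 10, 30, 35, 20), and all variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical raw data: one letter grade per student, built to match the table.
counts_in_table = {"AA": 5, "AB": 10, "BB": 30, "BC": 35, "CC": 20}
grades = [g for g, c in counts_in_table.items() for _ in range(c)]

# Frequency table: data value -> number of sample points with that value.
frequency = Counter(grades)

# Relative frequency table: divide each frequency by the sample size N.
n = len(grades)
relative_frequency = {g: frequency[g] / n for g in frequency}

for g in counts_in_table:
    print(f"{g}  {frequency[g]:3d}  {relative_frequency[g]:.2f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any such table.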
Pie charts
For a small number of distinct data values which are non-numerical, one can use a pie-chart (it can also be used for numerical values).
It consists of a circle divided into sectors, one for each data value.
The area of each sector is proportional to the relative frequency of that data value.
[Figure: pie chart of the population of native English speakers. Source: https://en.wikipedia.org/wiki/Pie_chart]
Pie charts can be confusing
A big no-no with too many categories.
[Figure source: http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html]
Dealing with continuous data
Often the data take continuous values (e.g. the temperature of a place at a time instant, the speed of a car at a given time instant, the weight or height of an animal, etc.).
In such cases, the range of data values is divided into intervals called bins. The frequency now refers to the number of sample points falling into each bin.
The bins are often taken to be of equal length, though that is not strictly necessary.
Dealing with continuous data
Let the sample points be $\{x_i\}$, $1 \le i \le N$.
Let there be some $K$ ($K \ll N$) bins, where the $j$-th bin has interval $[a_j, b_j)$.
Thus the frequency $f_j$ for the $j$-th bin is defined as follows:
$$f_j = |\{x_i : a_j \le x_i < b_j,\ 1 \le i \le N\}|$$
Such frequency tables are also called histograms, and they can also be used to store relative frequency instead of frequency.
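The bin-counting definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bins have equal width, the function name `histogram` and its parameters are hypothetical, and the sample values are made up.

```python
import math

def histogram(samples, lo, width, num_bins):
    """Count the samples falling in each bin [lo + j*width, lo + (j+1)*width)."""
    freq = [0] * num_bins
    for x in samples:
        j = math.floor((x - lo) / width)   # index of the bin containing x
        if 0 <= j < num_bins:              # samples outside all bins are ignored
            freq[j] += 1
    return freq

# Made-up sample points (e.g. heights in cm); K = 4 bins of width 10 starting at 150.
heights = [151.0, 158.5, 160.0, 163.2, 169.9, 170.0, 171.4, 188.0]
freq = histogram(heights, lo=150.0, width=10.0, num_bins=4)

# The same histogram storing relative frequency instead of frequency.
rel_freq = [f / len(heights) for f in freq]
print(freq, rel_freq)
```

Because each bin is half-open, a value that sits exactly on a boundary (such as 160.0 or 170.0 here) is counted in the bin that starts at that boundary, never in both.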
Example of a histogram: in image processing
A grayscale image is a 2D array of size (say) H x W.
Each entry of this array is called a pixel and is indexed as (x, y), where x is the column index and y is the row index.
At each pixel, we have an intensity value which tells us how bright the pixel is (smaller values = darker shades, larger values = brighter shades).
Commonly, pixel values in grayscale photographic images are 8-bit (ranging from 0 to 255).
Histograms are widely used in image processing; in fact, a histogram is often used in image retrieval.
[Figure: histogram of the well-known “barbara” image, using bins of length 10.]
This image has values from 0 to 255 and hence there are 26 bins.
Cumulative frequency plot
The cumulative (relative) frequency plot, also called an ogive, tells you the number (or proportion) of sample points whose value is less than or equal to a given data value.
[Figure: the cumulative frequency plot for the histogram on the previous slide.]
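A small Python sketch of how cumulative (relative) frequencies could be computed from a histogram by taking running sums. The bin edges and frequencies below are made up for illustration and are not the values of the plot on the slide.

```python
from itertools import accumulate

# Made-up histogram: upper edge of each bin and the frequency in that bin.
bin_upper_edges = [10, 20, 30, 40, 50]
frequencies = [5, 10, 30, 35, 20]

# Cumulative frequency: number of sample points falling in this bin or any earlier bin.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: the same counts as a proportion of N.
n = cumulative[-1]
cumulative_relative = [c / n for c in cumulative]

for edge, c, cr in zip(bin_upper_edges, cumulative, cumulative_relative):
    print(f"up to {edge}: {c:3d}  {cr:.2f}")
```

The cumulative sequence never decreases, and its last entry equals N (or 1 for the relative version).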
Digression: A curious looking histogram in image processing
Given the image $I(x, y)$, let’s say we compute the x-gradient image in the following manner:
$$\forall x, y,\ 1 \le x \le W,\ 1 \le y \le H: \quad I_x(x, y) = I(x+1, y) - I(x, y)$$
And we plot the histogram of the absolute values of the x-gradient image.
The next slide shows you how these histograms typically look! What do you observe?
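A hedged Python sketch of the computation described above: a forward difference along x followed by a histogram of absolute values. The function names, the bin width of 10, the choice to drop the last column (which has no right-hand neighbour), and the tiny image are all assumptions made for illustration.

```python
def x_gradient(image):
    """Forward difference along x: I_x(x, y) = I(x+1, y) - I(x, y).

    `image` is a list of rows (row index y, column index x). The last column
    has no right-hand neighbour, so each output row has W - 1 entries
    (a boundary-handling assumption, not something stated on the slide).
    """
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]

def abs_gradient_histogram(image, bin_width=10, max_value=255):
    """Histogram of |I_x| using bins [0, w), [w, 2w), ... covering 0..max_value."""
    num_bins = max_value // bin_width + 1
    freq = [0] * num_bins
    for row in x_gradient(image):
        for g in row:
            freq[abs(g) // bin_width] += 1
    return freq

# Tiny made-up 3 x 4 grayscale image with 8-bit intensities.
image = [
    [10, 12, 11, 200],
    [10, 10, 13, 198],
    [9, 11, 12, 201],
]
hist = abs_gradient_histogram(image)
print(hist)
```

Running the same function on a real photograph (loaded as a list of rows of integers) is one way to reproduce the kind of plot the slide refers to.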
[Figure: typical histograms of the absolute values of x-gradient images.]
Summarizing the Data
Summarizing a sample-set
There are some values that can be considered “representative” of the entire sample-set. Such a value is called a “statistic”.
The most common statistic is the sample (arithmetic) mean:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
It is what is commonly regarded as the “average value”.
Summarizing a sample-set
Another common statistic is the sample median, which is the “middle value”.
We sort the data array A from smallest to largest (positions are counted from 1). If N is odd, then the median is the value at position (N+1)/2 in the sorted array.
If N is even, the median can take any value in the interval (A[N/2], A[N/2+1]). Why?
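The two definitions can be sketched in Python as follows. The function names are illustrative, and for even N the sketch returns the midpoint of the two middle values, which is one common convention and an assumption here (any value in the interval would satisfy the definition). Note that Python lists are indexed from 0, whereas the positions above are counted from 1.

```python
def sample_mean(data):
    """Arithmetic mean: the sum of the sample points divided by N."""
    return sum(data) / len(data)

def sample_median(data):
    """Middle value of the sorted data (midpoint of the two middle values if N is even)."""
    a = sorted(data)
    n = len(a)
    mid = n // 2
    if n % 2 == 1:
        return a[mid]                  # 0-based index of the (N+1)/2-th value
    # N even: midpoint convention (an assumption, see the lead-in).
    return (a[mid - 1] + a[mid]) / 2

odd_sample = [7, 1, 5, 3, 9]
even_sample = [7, 1, 5, 3]
print(sample_mean(odd_sample), sample_median(odd_sample))
print(sample_mean(even_sample), sample_median(even_sample))
```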
Properties of the mean and median
Suppose each sample point $x_i$ is replaced by $a x_i + b$ for some constants $a$ and $b$. What happens to the mean? What happens to the median?
Suppose each sample point $x_i$ is replaced by its square. What happens to the mean? What happens to the median?
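One way to explore these questions is to try them numerically before reasoning about them. The sketch below uses Python's standard `statistics` module on a made-up sample (with an odd number of points and some negative values); the sample and the constants `a` and `b` are arbitrary choices for illustration.

```python
from statistics import mean, median

x = [-3, -1, 0, 2, 7]      # made-up sample with N odd
a, b = 2, 5                # arbitrary constants for the transformation a*x + b

affine = [a * xi + b for xi in x]
squared = [xi ** 2 for xi in x]

print("original:", mean(x), median(x))
print("a*x + b :", mean(affine), median(affine),
      "| compare with", a * mean(x) + b, a * median(x) + b)
print("x squared:", mean(squared), median(squared),
      "| compare with", mean(x) ** 2, median(x) ** 2)
```

Changing the sample, for instance to one with only non-negative values, is a useful way to see which of the observed relations hold in general and which depend on the data.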
Properties of the mean and median
Question: Consider a set of sample points $x_1, x_2, \ldots, x_N$. For what value $y$ is the sum total of the squared differences with every sample point the least? That is, what is
$$\arg\min_y \sum_{i=1}^{N} (y - x_i)^2 \ ?$$
(The sum is called the total squared deviation, or total squared loss.)
Answer: the mean (proof done in class).
Question: For what value $y$ is the sum total of the absolute differences with every sample point the least? That is, what is
$$\arg\min_y \sum_{i=1}^{N} |y - x_i| \ ?$$
(The sum is called the total absolute deviation, or total absolute loss.)
Answer: the median (two proofs done in class, with and without calculus).
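For the first question, one standard calculus argument runs as follows. It is offered only as a sketch; the proof given in class may have been organized differently.

```latex
\begin{aligned}
f(y) &= \sum_{i=1}^{N} (y - x_i)^2, \\
f'(y) &= 2\sum_{i=1}^{N} (y - x_i) = 2\Big(Ny - \sum_{i=1}^{N} x_i\Big) = 0
\;\Longrightarrow\; y = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}, \\
f''(y) &= 2N > 0 \quad \text{(so this stationary point is the minimum).}
\end{aligned}
```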
Properties of the mean and median
The mean need not be a member of the original sample-set.
The median is always a member of the original sample-set if N is odd.
The median is not unique if N is even, and will not be a member of the set (assuming the two middle values are distinct).
Properties of the mean and median
Consider a set of sample points $x_1, x_2, \ldots, x_N$. Let us say that some of these values get grossly corrupted. What happens to the mean? What happens to the median?
Example
Let A = {1, 2, 3, 4, 6}. Then mean(A) = 3.2 and median(A) = 3.
Now consider A = {1, 2, 3, 4, 20}. Then mean(A) = 6 and median(A) = 3.
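The example can be checked, and pushed a little further, with a short Python sketch using the standard `statistics` module. The first two corrupted values (6 and 20) come from the example; the larger ones (200 and 2000) are added here purely for illustration.

```python
from statistics import mean, median

base = [1, 2, 3, 4]
results = []
for last in [6, 20, 200, 2000]:          # increasingly corrupted last value
    sample = base + [last]
    results.append((last, mean(sample), median(sample)))

for last, m, med in results:
    print(f"last value {last:5d}: mean = {m}, median = {med}")
```

As the corrupted value grows, the mean grows with it while the median of this sample stays put.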
Concept of quantiles
The sample $100p$ percentile ($0 \le p \le 1$) is, loosely, the data value $y$ such that $100p\%$ of the data have a value less than or equal to $y$, and $100(1-p)\%$ of the data have a larger value.
More precisely, for a data set with $n$ sample points, the sample $100p$ percentile is a value such that at least $np$ of the values are less than or equal to it, and at least $n(1-p)$ of the values are greater than or equal to it.
Concept of quantiles
The sample 25th percentile = first quartile.
The sample 50th percentile = second quartile.
The sample 75th percentile = third quartile.
Quantiles can be inferred from the cumulative relative frequency plot (how?), or by sorting the data values (how?).
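One possible answer to the "by sorting" question, sketched in Python. The function name `sample_percentile` and its integer `percent` argument (so that p = percent/100 is handled without floating-point rounding) are assumptions made for this illustration. When np is not an integer the sketch returns the smallest sorted value with at least np values less than or equal to it; when np is an integer it returns the midpoint of the two candidate values, which is one common convention among several.

```python
def sample_percentile(data, percent):
    """Sample percentile for an integer `percent` in 0..100 (p = percent / 100)."""
    a = sorted(data)
    n = len(a)
    k, remainder = divmod(n * percent, 100)   # n*p = k + remainder/100
    if remainder != 0:
        # n*p is not an integer: take the ceil(n*p)-th smallest value
        # (1-based), which is index k in a 0-based list.
        return a[k]
    # n*p is an integer: any value between the k-th and (k+1)-th smallest
    # values qualifies; return their midpoint (clamped at the ends).
    lo = a[max(k - 1, 0)]
    hi = a[min(k, n - 1)]
    return (lo + hi) / 2

data = [3, 1, 4, 1, 5, 9, 2, 6]          # made-up sample, n = 8
quartiles = [sample_percentile(data, q) for q in (25, 50, 75)]
print(quartiles)
print(sample_percentile([1, 2, 3, 4, 20], 50))
```

Statistical libraries implement several different interpolation rules for quantiles, so their results can differ slightly from this sketch on small samples.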
[Figure: cumulative relative frequency plot (Y axis from 0 to 1, X axis from -3 to 4) with the 1st, 2nd and 3rd quartiles marked.]
Concept of mode
The value that occurs with the highest frequency is called the mode.
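A minimal Python sketch of finding the mode from a frequency count. Since several values can tie for the highest frequency, the hypothetical helper `sample_modes` returns all of them; the sample data are made up.

```python
from collections import Counter

def sample_modes(data):
    """Return every value that attains the highest frequency (ties included)."""
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, c in counts.items() if c == top]

grades = ["BB", "BC", "BC", "AA", "BC", "BB"]
print(sample_modes(grades))
print(sample_modes([1, 1, 2, 2, 3]))
```

Like the frequency table itself, the mode works for non-numeric data such as letter grades.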