descriptive statistics

Descriptive Statistics, Fall 2017, Instructor: Ajit Rajwade



  1. Descriptive Statistics, Fall 2017. Instructor: Ajit Rajwade

  2. Topic Overview
     • Some important terminology
     • Methods of data representation: frequency tables, graphs, pie-charts, scatter-plots
     • Data mean, median, mode, quantiles
     • Chebyshev’s inequality
     • Correlation coefficient

  3. Terminology
     • Population: the collection of all elements which we wish to study. Example: data about the occurrence of tuberculosis all over the world.
     • In this case, “population” refers to the set of people in the entire world.
     • The population is often too large to examine or study.
     • So we study a subset of the population, called a sample.
     • In an experiment, we collect values for attributes of each member of the sample (each member is also called a sample point).
     • An example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test.
     • See http://www.who.int/tb/publications/global_report/en/ for more information.

  4. Terminology
     • Discrete data: data whose values are restricted to a finite (or, more generally, countable) set. E.g.: letter grades at IITB, genders, marital status (single, married, divorced), income brackets in India for tax purposes.
     • Continuous data: data whose values belong to an uncountably infinite set. E.g.: a person’s height, the temperature of a place, the speed of a car at a time instant.

  5. Methods of Data Representation/Visualization

  6. Frequency Tables
     • For discrete data having a relatively small number of values, one can use a frequency table.
     • Each row of the table lists a data value followed by the number of sample points with that value (the frequency of that value).
     • The values need not always be numeric!

     The definition of an ideal course (per student perspective) at IITB ;-)

     Grade | Number of students
     AA    | 100
     AB    | 0
     BB    | 0
     BC    | 0
     CC    | 0

  7. Frequency Tables
     • The frequency table can be visualized using a line graph, a bar graph, or a frequency polygon.

     Grade | Number of students
     AA    | 5
     AB    | 10
     BB    | 30
     BC    | 35
     CC    | 20

     [Figure: bar graph; X axis: Marks, Y axis: Number of students]
     A bar graph plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a thick vertical bar!

  8. [Figure: line diagram for the grade table of slide 7; X axis: Marks, Y axis: Number of students]
     A line diagram plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a vertical line!

  9. [Figure: frequency polygon for the grade table of slide 7; X axis: Marks, Y axis: Number of students]
     A frequency polygon plots the frequency of each data value on the Y axis, and connects consecutive plotted points by means of a line.

  10. Relative frequency tables
      • Sometimes the actual frequencies are not important.
      • We may be interested only in the percentage or fraction of sample points having each data value, i.e. the relative frequencies.

      Grade | Fraction of students
      AA    | 0.05
      AB    | 0.10
      BB    | 0.30
      BC    | 0.35
      CC    | 0.20
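
The step from frequencies to relative frequencies can be sketched in a few lines of Python. This is only an illustration: the `grades` list is rebuilt from the counts on slide 7, and the function names are hypothetical, not part of the course material.

```python
from collections import Counter

# Grades rebuilt from the frequency table on slide 7 (100 students in total).
grades = ["AA"] * 5 + ["AB"] * 10 + ["BB"] * 30 + ["BC"] * 35 + ["CC"] * 20


def frequency_table(samples):
    """Map each distinct value to the number of sample points with that value."""
    return dict(Counter(samples))


def relative_frequency_table(samples):
    """Map each distinct value to its fraction of the whole sample."""
    n = len(samples)
    return {value: count / n for value, count in Counter(samples).items()}


freq = frequency_table(grades)
rel_freq = relative_frequency_table(grades)
```

The relative frequencies always sum to 1, which is a quick sanity check on any such table.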

  11. Pie charts
      • For a small number of distinct data values which are non-numerical, one can use a pie-chart (it can also be used for numerical values).
      • It consists of a circle divided into sectors, one for each data value.
      • The area of each sector is proportional to the relative frequency of that data value.

      [Figure: pie chart of the population of native English speakers. Source: https://en.wikipedia.org/wiki/Pie_chart]

  12. Pie charts can be confusing
      • A big no-no with too many categories.
      • Source: http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html

  13. Dealing with continuous data
      • Often the data take continuous values (e.g.: temperature of a place at a time instant, speed of a car at a given time instant, weight or height of an animal, etc.).
      • In such cases, the data values are divided into intervals called bins.
      • The frequency now refers to the number of sample points falling into each bin.
      • The bins are often taken to be of equal length, though that is not strictly necessary.

  14. Dealing with continuous data
      • Let the sample points be {x_i}, 1 <= i <= N.
      • Let there be some K (K << N) bins, where the j-th bin has interval [a_j, b_j).
      • Thus the frequency f_j for the j-th bin is defined as follows:
        $$f_j = |\{ x_i : a_j \le x_i < b_j,\ 1 \le i \le N \}|$$
      • Such frequency tables are also called histograms, and they can also be used to store relative frequency instead of frequency.
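
The bin-counting definition above can be written out as a minimal Python sketch. The `histogram` function, the `edges` list (consecutive bin boundaries, so bin j is [edges[j], edges[j+1])) and the temperature readings are all assumptions made for illustration.

```python
def histogram(samples, edges):
    """Count the samples falling in each half-open bin [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break  # bins are disjoint, so a sample lands in at most one bin
    return counts


# Hypothetical temperature readings (degrees Celsius), for illustration only.
temperatures = [21.5, 22.0, 23.9, 24.0, 24.1, 27.3, 29.9]
edges = [20, 22, 24, 26, 28, 30]  # K = 5 bins of equal length 2

counts = histogram(temperatures, edges)
relative = [c / len(temperatures) for c in counts]
```

Because the bins are half-open, a value that sits exactly on a boundary (such as 22.0 or 24.0 here) is counted in the bin that starts at that boundary, and any sample outside all bins is silently ignored.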

  15. Example of a histogram: in image processing
      • A grayscale image is a 2D array of size (say) H x W.
      • Each entry of this array is called a pixel and is indexed as (x, y), where x is the column index and y is the row index.
      • At each pixel, we have an intensity value which tells us how bright the pixel is (smaller values = darker shades, larger values = brighter shades).
      • Commonly, pixel values in grayscale photographs are 8-bit (ranging from 0 to 255).
      • Histograms are widely used in image processing; in fact, a histogram is often used in image retrieval.

  16. Example: histogram of the well-known “barbara” image, using bins of length 10. This image has values from 0 to 255 and hence there are 26 bins.
      [Figure: the image and its histogram]

  17. Cumulative frequency plot
      • The cumulative (relative) frequency plot (also called an ogive) tells you the number (or proportion) of sample points whose value is less than or equal to a given data value.

      [Figure: the cumulative frequency plot for the histogram on the previous slide!]
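
A cumulative frequency table is just a running sum over the per-bin frequencies, taken in increasing order of data value. The sketch below assumes hypothetical bin counts and uses `itertools.accumulate` from the standard library.

```python
from itertools import accumulate

# Hypothetical per-bin frequencies, listed in increasing order of data value.
counts = [5, 10, 30, 35, 20]

# cumulative[j] = number of sample points in bins 0..j
cumulative = list(accumulate(counts))

# Dividing by the total gives the cumulative relative frequency (the ogive).
cumulative_relative = [c / cumulative[-1] for c in cumulative]
```

The last cumulative value equals the sample size N, so the last cumulative relative value is always 1.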

  18. Digression: A curious looking histogram in image processing
      • Given the image I(x, y), let’s say we compute the x-gradient image in the following manner:
        $$\forall x, y,\ 1 \le x \le W,\ 1 \le y \le H: \quad I_x(x, y) = I(x+1, y) - I(x, y)$$
      • And we plot the histogram of the absolute values of the x-gradient image.
      • The next slide shows you how these histograms typically look! What do you observe?
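
As a rough sketch of this computation, the code below forms the forward difference along each row and tallies the absolute gradient values. The tiny `image` is made up for illustration, and skipping the last column (where I(x+1, y) does not exist) is an assumption of this sketch; the slides do not say how the boundary is handled.

```python
from collections import Counter


def x_gradient(image):
    """Forward difference along each row: I(x+1, y) - I(x, y).

    `image` is a list of rows, indexed as image[y][x]. The last column is
    skipped (an assumption of this sketch) because I(x+1, y) is undefined there.
    """
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]


# Tiny hypothetical 3 x 4 grayscale image (values in 0..255), for illustration.
image = [
    [10, 10, 12, 200],
    [11, 11, 11, 198],
    [10, 12, 12, 199],
]

grad = x_gradient(image)

# Histogram (one bin per distinct value) of the absolute gradient values.
abs_hist = Counter(abs(g) for row in grad for g in row)
```

For a real photograph one would use wider bins, as in the earlier histogram slides, rather than one bin per distinct value.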

  19. [Figure slide; no extractable text]

  20. [Figure slide; no extractable text]

  21. Summarizing the Data

  22. Summarizing a sample-set
      • There are some values that can be considered “representative” of the entire sample-set. Such a value is called a “statistic”.
      • The most common statistic is the sample (arithmetic) mean:
        $$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
      • It is basically what is commonly regarded as the “average value”.

  23. Summarizing a sample-set
      • Another common statistic is the sample median, which is the “middle value”.
      • We sort the data array A from smallest to largest. If N is odd, then the median is the value at position (N+1)/2 in the sorted array.
      • If N is even, the median can take any value in the interval (A[N/2], A[N/2+1]). Why?
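
A minimal Python sketch of the two statistics follows. The function names are hypothetical, and for even N the code returns the midpoint of the two middle values, which is one common convention and an assumption here; the slide allows any value between them.

```python
def sample_mean(xs):
    """Arithmetic mean: the sum of the sample points divided by N."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle value of the sorted sample.

    For even N this returns the midpoint of the two middle values
    (a convention assumed for this sketch).
    """
    a = sorted(xs)
    n = len(a)
    mid = n // 2
    if n % 2 == 1:
        return a[mid]  # 0-based index of the (N+1)/2-th smallest value
    return (a[mid - 1] + a[mid]) / 2


odd_sample = [4, 1, 6, 2, 3]
even_sample = [4, 1, 2, 3]
```

Sorting a copy with `sorted` leaves the caller's data untouched, which matters if the original order carries meaning.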

  24. Properties of the mean and median
      • Suppose each sample point x_i were replaced by a x_i + b for some constants a and b.
      • What happens to the mean? What happens to the median?
      • Suppose each sample point x_i were replaced by its square.
      • What happens to the mean? What happens to the median?
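
These questions can be explored numerically before proving anything. The sketch below uses `mean` and `median` from Python's `statistics` module on made-up samples with odd N; the constants `a` and `b` are arbitrary. It is an experiment on one example, not a proof.

```python
from statistics import mean, median

xs = [1, 2, 3, 4, 20]   # hypothetical sample with odd N
a, b = 2, 5             # hypothetical constants

# Affine map x -> a*x + b
ys = [a * x + b for x in xs]
mean_is_affine = mean(ys) == a * mean(xs) + b
median_is_affine = median(ys) == a * median(xs) + b

# Squaring non-negative data keeps the ordering, so the median is squared too,
# but the mean of the squares differs from the square of the mean.
squares = [x * x for x in xs]
mean_of_squares = mean(squares)
square_of_mean = mean(xs) ** 2

# With mixed signs, squaring reorders the data and the median no longer follows.
signed = [-3, -1, 0, 1, 2]
signed_squares = [x * x for x in signed]
```

On this example both statistics follow the affine map, while squaring breaks the relationship for the mean always and for the median once negative values are present.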

  25. Properties of the mean and median
      • Question: Consider a set of sample points x_1, x_2, …, x_N. For what value y is the sum total of the squared differences with every sample point the least? That is, what is
        $$\arg\min_y \sum_{i=1}^{N} (y - x_i)^2 \ ?$$
        (total squared deviation, or total squared loss)
        Answer: mean (proof done in class).
      • Question: For what value y is the sum total of the absolute differences with every sample point the least? That is, what is
        $$\arg\min_y \sum_{i=1}^{N} |y - x_i| \ ?$$
        (total absolute deviation, or total absolute loss)
        Answer: median (two proofs done in class, with and without calculus).
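
Both answers can be checked with a short calculus sketch. This is one standard argument and not necessarily the proof given in class; the names E_2 and E_1 are introduced only for this sketch.

```latex
E_2(y) = \sum_{i=1}^{N} (y - x_i)^2, \qquad
\frac{dE_2}{dy} = 2\sum_{i=1}^{N} (y - x_i)
               = 2\Big(Ny - \sum_{i=1}^{N} x_i\Big) = 0
\;\Longrightarrow\; y = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x},
\qquad \frac{d^2E_2}{dy^2} = 2N > 0.

E_1(y) = \sum_{i=1}^{N} |y - x_i|, \qquad
\frac{dE_1}{dy} = \sum_{i=1}^{N} \operatorname{sign}(y - x_i)
               = \#\{i : x_i < y\} - \#\{i : x_i > y\}
\quad (y \neq x_i \text{ for all } i).
```

The second derivative of E_2 is positive, so the stationary point at the mean is a minimum. The derivative of E_1 is negative while fewer points lie below y than above it and positive in the opposite case, so E_1 is smallest where the two counts balance, which is at a median.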

  26. Properties of the mean and median
      • The mean need not be a member of the original sample-set.
      • The median is always a member of the original sample-set if N is odd.
      • If N is even, the median is not unique (unless the two middle values coincide) and, taken strictly between the two middle values, is not a member of the set.

  27. Properties of the mean and median
      • Consider a set of sample points x_1, x_2, …, x_N. Let us say that some of these values get grossly corrupted.
      • What happens to the mean?
      • What happens to the median?

  28. Example
      • Let A = {1, 2, 3, 4, 6}.
      • Mean(A) = 3.2, median(A) = 3.
      • Now consider A = {1, 2, 3, 4, 20}.
      • Mean(A) = 6, median(A) = 3.
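
The numbers in this example can be reproduced with Python's `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 6]
corrupted = [1, 2, 3, 4, 20]  # the last value is grossly corrupted

# The mean is dragged towards the corrupted value...
mean_shift = mean(corrupted) - mean(clean)

# ...while the median does not move at all in this example.
median_shift = median(corrupted) - median(clean)
```

A single corrupted value changes the mean by (20 - 6)/5 here, and making that value larger moves the mean further, whereas the median stays at 3 however large the corrupted value becomes.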

  29. Concept of quantiles
      • The sample 100p percentile (0 ≤ p ≤ 1) is defined as the data value y such that 100p% of the data have a value less than or equal to y, and 100(1-p)% of the data have a value greater than or equal to y.
      • For a data set with n sample points, the sample 100p percentile is that value such that at least np of the values are less than or equal to it, and at least n(1-p) of the values are greater than or equal to it.

  30. Concept of quantiles
      • The sample 25th percentile = first quartile.
      • The sample 50th percentile = second quartile (the median).
      • The sample 75th percentile = third quartile.
      • Quantiles can be inferred from the cumulative relative frequency plot (how?).
      • Or by sorting the data values (how?).
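
One way to obtain a percentile by sorting is sketched below. It assumes a non-empty sample, takes the percentile as an integer `percent` between 0 and 100 to avoid floating-point trouble when testing whether n*p is an integer, and, when n*p is an integer, returns the midpoint of the two candidate values. That midpoint rule is a convention assumed here; other conventions exist and give slightly different answers.

```python
def sample_percentile(samples, percent):
    """Sample percentile for an integer `percent` in 0..100 (non-empty sample).

    Returns a value with at least n*p samples <= it and at least n*(1-p)
    samples >= it, where p = percent / 100.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    a = sorted(samples)
    n = len(a)
    k, remainder = divmod(n * percent, 100)  # n*p = k + remainder/100
    if remainder != 0:
        # n*p is not an integer: take the ceil(n*p)-th smallest value,
        # which sits at 0-based index k.
        return a[k]
    if k == 0:
        return a[0]    # p = 0: the smallest value
    if k == n:
        return a[-1]   # p = 1: the largest value
    # n*p is an integer: any value between the k-th and (k+1)-th smallest
    # qualifies; return their midpoint (a convention assumed here).
    return (a[k - 1] + a[k]) / 2


five = [6, 1, 4, 2, 3]
eight = [1, 2, 3, 4, 5, 6, 7, 8]
quartiles = [sample_percentile(eight, q) for q in (25, 50, 75)]
```

With this convention the 50th percentile coincides with the midpoint median used for even N.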

  31. [Figure: plot (vertical axis from 0 to 1) with the 1st, 2nd and 3rd quartiles marked]

  32. Concept of mode
      • The value that occurs with the highest frequency is called the mode.
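
Finding the mode amounts to building a frequency table and picking the value with the largest count. The sketch below returns every value tied for the highest frequency, since the slide does not say how ties are resolved; the function name and sample data are illustrative.

```python
from collections import Counter


def modes(samples):
    """Return all values that occur with the highest frequency (non-empty input)."""
    counts = Counter(samples)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


grades = ["BB", "BC", "BC", "AA"]
tied = [1, 1, 2, 2, 3]
```

Unlike the mean and median, the mode is defined for non-numeric data such as letter grades.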
