

  1. Introduction to statistics
     Frédéric Schütz (Frederic.Schutz@isb-sib.ch)
     19 January 2009, EMBnet course
     Swiss Institute of Bioinformatics
     http://bcf.isb-sib.ch/Services.html

  2. Other courses
     - Microarrays, Lausanne, 30 March – 1 April (3 days)
       – Lower-level analysis
       – Normalization
       – Finding differentially-expressed genes
       – Introduction to GSEA and other group-level analysis methods
       – Classification
     - Advanced statistics course: planned!
     - Other courses: depending on your needs!

     [Figure: consumption of meat per person per year in Switzerland (in kg). Source: Elly Tzogalis and Michel Jeanneret (15 August 2008), Le Matin, http://www.lematin.ch/fr/actu/suisse/va-t-on-bientot-tuer-le-steak_9-220826]

  3. What is statistics good for?
     - Descriptive statistics: summarizing datasets by a few numbers
     - Exploratory data analysis and visualization: finding patterns and constructing hypotheses
     - Significance testing: do the data support the existence of a significant trend, or is it just noise?
     - Clustering: finding patterns in the noise
     - Regression: can you explain the behaviour of a variable as a function of the others?
     - Classification: putting objects into the right drawers
     - Not a complete list!

  4. Exploratory data analysis
     - Also called descriptive statistics.
     - The process of looking at the data prior to formal analysis.
     - Data are examined in two ways:
       – Numerical summaries of the data (mean, standard deviation, five-number summary, etc.)
       – Graphical summaries: viewing your data in graphs to detect errors, unusual values, trends and patterns.
     - Particularly relevant to large datasets.
     - Remember: summarising means losing some information!
       – See "The Median Isn't the Message" by Stephen Jay Gould, http://www.edwardtufte.com/tufte/gould

     Measures of location: mean
     - "Arithmetic mean": the sum of the values divided by the number of values.
     - All observations are treated equally.
     - Suitable for symmetrical distributions.
     - Sensitive to the presence of outliers ("unusual values").
     - Trimmed mean:
       – "Olympic scoring": remove extreme values (e.g. 10%) on each side before calculating the mean.
     - In R (see the sketch below):
       > mean(data)
       > mean(data, trim=0.1)
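     A minimal R sketch (not from the slides; the data and values are made up for illustration) of how a single outlier pulls the mean but barely moves the trimmed mean:

       # Made-up measurements clustered around 4
       x <- c(4.1, 4.3, 3.9, 4.0, 4.2, 4.1, 3.8, 4.0, 4.4, 3.7, 4.2, 4.0, 3.9, 4.1)
       mean(x)                 # ordinary arithmetic mean, about 4.05

       x_out <- c(x, 40)       # add a single gross outlier
       mean(x_out)             # pulled up to about 6.4
       mean(x_out, trim=0.1)   # 10% trimmed on each side: about 4.1, barely affected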

  5. Mean: (lack of) robustness
     [Figure: positions of the mean, the trimmed mean, and the trimmed mean with trim=0.3 on data containing outliers.]

  6. Side note: removing data
     - In the past, data were removed if they "looked" incorrect:
       – Gregor Mendel's peas (results too good to be true)
       – Albert Michelson's data on the speed of light
       – Johannes Kepler on planet orbits
     - Outliers (unusual observations, far away from the rest of the data) do occur naturally.
     - Data points can be removed (e.g. trimmed mean)
       – if the decision is made before looking at the data; or
       – if the discrepancies can be explained.
     - Otherwise, this is akin to data snooping.
     - There are statistical methods (called robust methods) which can handle outliers.

  7. In R:
       > library(MASS)
       > data(phones)
       > ?phones

     Measures of location: median
     [Figure: the median splits the data in two, with 50% of the data on each side.]
     - More appropriate for skewed distributions
     - Mean = median if the distribution is symmetrical
     - Not sensitive to the presence of outliers, since it "ignores" almost all the values
     - In R (see the sketch below):
       > median(data)
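     A minimal sketch using the Belgian phone-calls data loaded above (MASS::phones), which contains gross outliers:

       library(MASS)
       data(phones)
       mean(phones$calls)     # strongly pulled up by the outlying years
       median(phones$calls)   # much closer to the bulk of the data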

  8. Quartiles and percentiles
     [Figure: the 1st quartile has 25% of the data below it and the 3rd quartile has 25% above it, with 50% of the data in between; the x-th percentile has x% of the data below it.]
     - In R:
       > quantile(data, 0.25)
       > quantile(data, 0.5)   # Same as median(data)
       > quantile(data, x)

     Median: resistance to outliers
     [Figure: positions of the median and the mean on data with outliers, illustrating the median's resistance.]
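     A minimal sketch (made-up data) of the quartile and percentile functions shown above:

       x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 20)
       quantile(x, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile
       quantile(x, 0.9)                  # 90th percentile: about 90% of the data lie below it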

  9. Mode
     - For discrete data, the mode is the most common value in the data.
     - For continuous-valued data, the mode is an infinitesimal concept: it is defined as the maximum of the density.
     - There is no simple finite-sample estimator of the mode; all estimators depend on some sort of smoothing.
     [Figure: a symmetric unimodal density, for which mean = median = mode.]

     Bimodal and multimodal data
     - Most often, we are not interested in "the" mode of the data.
     - Of interest is whether the distribution has several prominent "peaks" (local maxima of the density), in which case it is bimodal or multimodal.
     - Bimodality often indicates that the data are not homogeneous and are in fact made of two sub-populations.
     - Most (if not all) of the numerical summaries that we discuss here will break down if the data are bimodal! (See the sketch below.)
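     A minimal sketch (simulated data, not from the slides) of why single-number summaries break down when the data are bimodal:

       set.seed(1)
       x <- c(rnorm(200, mean = 0), rnorm(200, mean = 5))   # two sub-populations
       plot(density(x))    # the density estimate shows two clear peaks
       mean(x); median(x)  # both fall around 2.5, between the peaks,
                           # describing neither sub-population well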

  10. Spread
     [Figure: two distributions with the same mean but different spreads, one narrower and one wider.]

     Standard deviation
     - The standard deviation (SD, σ) of a variable is the square root of the average of the squared deviations from the mean:
       σ = sqrt( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) )
     - Used in conjunction with the mean.
     - Same unit as the data.
     - In R (see the sketch below):
       > sd(data)
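     A minimal sketch (made-up data) checking sd() against the formula written out by hand:

       x <- c(2, 4, 4, 4, 5, 5, 7, 9)
       n <- length(x)
       sqrt(sum((x - mean(x))^2) / (n - 1))   # about 2.14
       sd(x)                                  # same value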

  11. Interquartile range (IQR)
     - IQR = 3rd quartile − 1st quartile
     [Figure: 25% of the data below the 1st quartile, 50% between the two quartiles, 25% above the 3rd quartile.]
     - Used in conjunction with the median
     - In R:
       > IQR(data)

     Histograms
     - Histograms are an intuitive way to represent a large number of data points:
       – The range of the data is divided into a number of intervals ("bins"), usually of the same width.
       – The number of observations which fall into each bin is counted and plotted as a bar.
       – Alternatively, a density scale can be used (the area of each bar represents the proportion of observations in the interval).
     - Helps visualize the distribution of values for a numerical variable.
     - Main complication: the choice of bin width / number of bins.
     - Most statistical programs do a good job at choosing a reasonable bin width, but a manual override is sometimes necessary (see the sketch below).
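     A minimal sketch (simulated data) of the default binning versus a manual override:

       set.seed(1)
       x <- rnorm(500, mean = 15, sd = 2)
       hist(x)                                  # bin width chosen automatically by R
       hist(x, breaks = seq(5, 25, by = 0.5))   # manual choice: one bin per 0.5 units
       hist(x, freq = FALSE)                    # density scale: the bar areas sum to 1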

  12. [Figure: three histograms of the same data. The area of one bar represents the proportion of observations between 16 and 17. Panels: R default parameters (here, 1 bin per 5 units), user choice (1 bin per 0.5 units), user choice (1 bin per 0.006 units).]

  13. Density
     - The density describes the theoretical probability distribution of a variable.
     - Conceptually, it is obtained in the limit of infinitely many data points.
     - When we estimate it from a finite set of data, we usually assume that the density is a smooth function.
     - You can think of it as a "smoothed histogram" (but to actually compute it, there are much better methods! See the sketch below.)

     Density for the normal distribution and the SD
     [Figure: normal density; the shaded area within 1 SD of the mean indicates the probability that a random observation will fall into this range.]
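     A minimal sketch (simulated data) of a kernel density estimate drawn over a histogram:

       set.seed(1)
       x <- rnorm(1000)
       hist(x, freq = FALSE)        # histogram on the density scale
       lines(density(x), lwd = 2)   # smooth density estimate overlaid on top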

  14. Representing data: some bad practices
     Estimating an illegal phenomenon (unauthorized copying of computer programs) is hard, and the methodology is heavily contested. The estimates probably carry a large uncertainty, which is not indicated, making comparisons between percentages very difficult. Calculation of actual losses is even more contentious!
     Source: Fourth Annual BSA and IDC Global Software Piracy Study, May 2007. More information: http://en.wikipedia.org/wiki/Business_Software_Alliance, version as of 16:15, 18 February 2008.

     Scientists seem to do better: a «random» sample

  15. Representing data: the « bar + error » plot
     [Figure: bar chart of the mean of each group (A and B), with error bars extending from the mean to mean + SD; the difference is marked ** (p < 0.01).]
     Legend: mean of measurement for groups A (25 subjects) and B (18 subjects); error bars indicate the standard deviation in each group; two-sided two-sample t-test.
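     A minimal sketch (simulated data, not the study shown on the slide; group sizes and means are made up to roughly match the legend) of a bar + error-bar plot and the corresponding test:

       set.seed(1)
       a <- rnorm(25, mean = 2.0, sd = 0.7)   # group A, 25 subjects
       b <- rnorm(18, mean = 2.7, sd = 0.7)   # group B, 18 subjects
       m <- c(A = mean(a), B = mean(b))
       s <- c(sd(a), sd(b))
       bp <- barplot(m, ylim = c(0, max(m + s) * 1.2))
       arrows(bp, m, bp, m + s, angle = 90, length = 0.1)   # error bars: mean + SD
       t.test(a, b)   # two-sided two-sample t-test (Welch by default)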

  16. Boxplot
     [Figure: anatomy of a boxplot. The box contains 50% of the data (the interquartile range), with the median drawn inside it; 25% of the data lies below the box and 25% above it; whiskers extend from the box, and points beyond them are marked as outliers.]
     - Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range.
     - The whiskers extend to the last point which is not an outlier.
     - A boxplot is a graphical representation of the five-number summary: minimum, first quartile, median, third quartile, maximum (see the sketch below).

     Boxplots: example
     - If there are only a few data points in the boxplot, it can be "degenerate" (i.e. not all features are present).
     - From Moritz et al., Anal. Chem. 2004 Aug 15; 76(16):4811-24.
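     A minimal sketch (made-up data) of a boxplot and the five-number summary it represents:

       x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 25)   # one clear outlier
       fivenum(x)    # minimum, 1st quartile, median, 3rd quartile, maximum
       boxplot(x)    # the point at 25 lies beyond the whisker and is drawn as an outlier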

  17. Boxplot: a different example
     - With this definition, almost all datasets will produce outliers (20% of all points are "outliers").
     - In this case, the plots are made of several thousand data points; a boxplot with outliers would not be very relevant, because there would be too many of them.

     Comparison of some graphs
     - In the next 4 slides, we are going to compare different methods for graphing univariate data.
     - Four methods are shown in each case (a rough R sketch is given after this slide):
       – Individual data points on the x-axis; some random displacement (jitter) is added on the y-axis to avoid superimposing too many points
       – Histogram with density superimposed
       – Mean +/- standard deviation
       – Boxplot
     - Other examples are given in the exercises.
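     A rough R sketch (simulated data) of the four displays compared on the next slides:

       set.seed(1)
       x <- rnorm(100, mean = 10, sd = 2)
       par(mfrow = c(1, 4))
       plot(x, jitter(rep(0, length(x)), amount = 0.2),   # individual points, with
            ylab = "", yaxt = "n")                         # jitter on the y-axis
       hist(x, freq = FALSE)                               # histogram...
       lines(density(x))                                   # ...with density superimposed
       plot(1, mean(x), ylim = range(x), xaxt = "n", xlab = "", ylab = "x")
       arrows(1, mean(x) - sd(x), 1, mean(x) + sd(x),
              angle = 90, code = 3, length = 0.1)          # mean +/- SD
       boxplot(x)                                          # boxplot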

  18. Dataset 1 (500 points) and Dataset 2 (37 points)
     [Figure: for each dataset, the four displays: individual points with jitter on the y-axis, histogram and density, mean +/- SD, boxplot.]

  19. Dataset 3 (100 points; courtesy Nadine Zangger) and Dataset 4 (4 points)
     [Figure: for each dataset, the four displays: individual points with jitter on the y-axis, histogram and density, mean +/- SD, boxplot.]
