
219323 Probability and Statistics for Software and Knowledge Engineers, Lecture 7: Descriptive Statistics (Monchai Sopitkamon, Ph.D., 1/4/2007)


  1. 1/4/2007. 219323 Probability and Statistics for Software and Knowledge Engineers. Lecture 7: Descriptive Statistics. Monchai Sopitkamon, Ph.D.
     Outline
     • Experimentation (6.1)
     • Data Presentation (6.2)
     • Sample Statistics (6.3)
     • Examples (6.4)

  2. Experimentation I (6.1)
     [Figure: the relationship between probability theory and statistical inference]
     Experimentation: Samples I (6.1.1)
     • Populations and samples
       – Population: all possible observations available from a particular probability distribution.
       – Sample: a particular subset of the population that an experimenter measures and uses to investigate the unknown probability distribution.
       – Random sample: a sample where the elements of the sample are chosen at random from the population to ensure that the sample is representative of the population.
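The difference between a population and a random sample can be sketched in Python. This is a minimal illustration, assuming a hypothetical finite population of 1,000 simulated measurements (not data from the lecture). `random.Random.sample` draws without replacement, so every element is equally likely to be chosen.

```python
import random

# Hypothetical population: 1,000 simulated measurements (an assumption, not lecture data).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.gauss(50.0, 10.0) for _ in range(1000)]


def draw_random_sample(population, n, rng):
    """Draw n distinct elements at random, each equally likely to be chosen."""
    return rng.sample(population, n)


sample = draw_random_sample(population, 80, rng)

# The sample mean is used to learn about the (normally unknown) population mean.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

In practice the population is not available in full, which is why only the sample is measured.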

  3. Experimentation: Samples I (6.1.1)
     • Data types
       – Data observations ($x_1, \ldots, x_n$) can be divided into two major types: categorical (or nominal) and numerical.
     Experimentation: Examples I (6.1.2)
     [Figure: categorical observations, data set of machine breakdowns]

  4. Experimentation: Examples II (6.1.2)
     [Figure: numerical observations of the number of defective computer chips in each of 80 randomly sampled boxes, with a data summarization]
     Outline
     • Experimentation (6.1)
     • Data Presentation (6.2)
     • Sample Statistics (6.3)
     • Examples (6.4)

  5. Data Presentation (6.2)
     • After data is gathered, how do we present it in a more informative way than with tables of numbers alone?
     • By using graphs or charts.
     Data Presentation: Bar Charts and Pareto Charts I (6.2.1)
     • Bar charts are generally suitable for illustrating categorical data sets.
     [Figure: bar chart of the machine breakdowns data set]

  6. Data Presentation: Bar Charts and Pareto Charts II (6.2.1)
     • Pareto charts are bar charts used in quality control, where the categories are sorted in order of decreasing frequency.
     [Figure: data set and Pareto chart of customer complaints for an Internet company (Excel spreadsheet)]
     Data Presentation: Histograms I (6.2.3)
     • Histograms look similar to bar charts, but are used to present numerical data instead of categorical data.
     [Figure: data set and histogram of the computer chips data set]
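The ordering behind a Pareto chart can be sketched in a few lines of Python. The complaint records below are hypothetical (the lecture's customer-complaint data set is not reproduced here), and the text bars only stand in for a real chart.

```python
from collections import Counter

# Hypothetical complaint records (illustrative only, not the lecture's data set).
complaints = [
    "slow connection", "billing", "slow connection", "dropped connection",
    "billing", "slow connection", "email", "slow connection",
    "dropped connection", "billing",
]


def pareto_order(observations):
    """Return (category, count) pairs sorted by decreasing frequency."""
    return Counter(observations).most_common()


# Print a crude text version of the chart: one bar per category.
for category, count in pareto_order(complaints):
    print(f"{category:20s} {'#' * count} ({count})")
```

A plain bar chart would use the same counts, only without the sort by decreasing frequency.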

  7. Data Presentation: Histograms II (6.2.3)
     [Figure: a histogram with positive skewness and a histogram with negative skewness]
     Data Presentation: Histograms III (6.2.3)
     [Figure: a histogram for a bimodal distribution]

  8. Data Presentation: Outliers I (6.2.4)
     • Outliers are data points that appear to be separate from the rest of the data set.
     • They usually should be removed from the data set before applying statistical inference techniques.
     • Often, outliers are misrecorded data observations, which can be corrected.
     • Important issue: whether the outlier represents true variation or whether it is caused by an outside influence.
     Data Presentation: Outliers II (6.2.4)
     [Figure: histogram of a data set with a possible outlier]

  9. Outline
     • Experimentation (6.1)
     • Data Presentation (6.2)
     • Sample Statistics (6.3)
     • Examples (6.4)
     Sample Statistics (6.3)
     Data set          Probability distribution
     Sample mean       Expectation
     Sample median     Median
     Sample SD         SD

  10. Sample Statistics: Sample Mean I (6.3.1)
      • The sample mean is the arithmetic average of the data observations.
      • If a data set consists of n observations $x_1, \ldots, x_n$, then the sample mean is
        $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
      Sample Statistics: Sample Mean II (6.3.1)
      [Figure: illustrative data set]
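As a small check of the sample mean formula, here is a sketch in Python. The data values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def sample_mean(observations):
    """Arithmetic average: the sum of the observations divided by n."""
    n = len(observations)
    if n == 0:
        raise ValueError("sample mean is undefined for an empty data set")
    return sum(observations) / n


data = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical observations, n = 5
print(sample_mean(data))  # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
```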

  11. Sample Statistics: Sample Median (6.3.2)
      • The sample median is the value in the “middle” of the sorted data points.
      • For an odd number n of observations, the sample median is the $\lceil n/2 \rceil$-th sorted value.
      • For an even number n of observations, the sample median is the average of the two middle values.
      • A symmetric sample has a sample mean approximately equal to its sample median.
      Sample Statistics: Sample Trimmed Mean I (6.3.3)
      • A trimmed mean is obtained by deleting some of the largest and smallest data observations and taking the mean of the remaining observations.
      • For example, a 10% trimmed mean of the sorted observations $x_1, \ldots, x_{50}$ is
        $\frac{\sum_{i=6}^{45} x_i}{40}$
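The median and trimmed-mean rules above can be sketched as follows. This is an illustrative implementation, not the lecture's spreadsheet: the function names are made up for this example, `percent` is the percentage removed from each end (as in the 10% example with 50 observations), and the data set is hypothetical, with one deliberately extreme value.

```python
def sample_median(observations):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    xs = sorted(observations)
    n = len(xs)
    if n == 0:
        raise ValueError("sample median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]  # the ceil(n/2)-th sorted value, counting from 1
    return (xs[mid - 1] + xs[mid]) / 2


def trimmed_mean(observations, percent):
    """Mean after deleting `percent`% of the observations from each end of the sorted data."""
    xs = sorted(observations)
    n = len(xs)
    k = n * percent // 100  # number of observations dropped from each end
    kept = xs[k:n - k]
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)


# Hypothetical data: the values 1..49 plus one extreme observation.
data = list(range(1, 50)) + [1000]
print(sum(data) / len(data))   # sample mean: 44.5, pulled up by the extreme value
print(sample_median(data))     # 25.5
print(trimmed_mean(data, 10))  # mean of sorted observations 6..45: 25.5
```

The extreme value shifts the sample mean noticeably, while the median and the trimmed mean are unaffected by it.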

  12. Sample Statistics: Sample Trimmed Mean II (6.3.3)
      [Figure: relationship between the sample mean, median, and trimmed mean for positively and negatively skewed data sets]
      Sample Statistics: Sample Variance (6.3.5)
      • The sample variance of a set of data observations $x_1, \ldots, x_n$ is defined as
        $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
      • Alternate formulas for the sample variance $s^2$ are
        $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$
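The three expressions for the sample variance can be checked against each other numerically. The sketch below uses a small hypothetical data set, and the function names are chosen for this example only. The two shortcut forms are algebraically identical to the definition, but they subtract two large numbers and can lose floating-point precision on some data sets.

```python
import math
import statistics


def variance_definition(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_mean_shortcut(xs):
    """s^2 = (sum(x_i^2) - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


def variance_sum_shortcut(xs):
    """s^2 = (sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical observations, mean = 5
s2 = variance_definition(data)      # (9 + 1 + 1 + 0 + 25) / 4 = 9.0
print(s2, math.sqrt(s2))            # sample variance and sample SD
print(variance_mean_shortcut(data), variance_sum_shortcut(data))
print(statistics.variance(data))    # standard-library value for comparison
```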

  13. Sample Statistics: Sample Quantiles (6.3.6)
      • The p-th sample quantile is a value that has a proportion p of the sample taking values smaller than it and a proportion 1 – p taking values larger than it.
      • Sample median = 50th percentile of the sample.
      • Upper and lower sample quartiles = 75th percentile and 25th percentile of the sample.
      • Sample interquartile range = 75th percentile – 25th percentile of the sample.
      (Excel spreadsheet)
      Sample Statistics: Boxplots I (6.3.7)
      • A boxplot is a schematic presentation of the sample median, the upper and lower sample quartiles, and the largest and smallest data observations.
      [Figure: boxplot, with the box labeled "Half of observations"]
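The quartiles, interquartile range, and the five values a boxplot displays can be computed with Python's standard library, as sketched below on a hypothetical data set. Note that software packages interpolate between data points in different ways, so the default method of `statistics.quantiles` (Python 3.8+) may give slightly different quartiles than Excel or the textbook on other data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical observations

# n=4 splits the data into four equal-probability groups and returns the
# three cut points: lower quartile, median, and upper quartile.
lower_quartile, median, upper_quartile = statistics.quantiles(data, n=4)
interquartile_range = upper_quartile - lower_quartile

# The five values a boxplot displays.
five_number_summary = (min(data), lower_quartile, median, upper_quartile, max(data))

print(five_number_summary)
print(interquartile_range)
```

The box of a boxplot spans the lower to the upper quartile, so its length is the interquartile range.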

  14. Sample Statistics: Boxplots II (6.3.7)
      [Figure: boxplot for the data set in Figure 6.22]
      Sample mean = 3.725
      Sample median = 4.25
      Lower sample quartile = 2.65
      Upper sample quartile = 4.675
      Sample Statistics: Coefficient of Variation I (6.3.8)
      • The coefficient of variation measures the spread of the data relative to the middle value:
        $CV = \frac{s}{\bar{x}} = \frac{\text{sample standard deviation}}{\text{sample mean}}$
      • Large values of CV imply that the variability is large relative to the sample average.
      • Small values indicate that the variability is small relative to the sample average.

  15. Sample Statistics: Coefficient of Variation II (6.3.8)
      • Ex. 42, pg. 283: African elephants have an average sample weight of $\bar{x}_e = 4550$ kg and a sample SD of $s_e = 150$ kg, while mice have an average sample weight of $\bar{x}_m = 30$ g and a sample SD of $s_m = 1.67$ g.
        $CV_e = \frac{s_e}{\bar{x}_e} = \frac{150}{4550} = 0.033$
        $CV_m = \frac{s_m}{\bar{x}_m} = \frac{1.67}{30} = 0.056$
        ∴ Mice have more variability in their weights than the elephants, relative to their respective average weights.
      Outline
      • Experimentation (6.1)
      • Data Presentation (6.2)
      • Sample Statistics (6.3)
      • Examples (6.4)
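The elephant and mouse comparison can be reproduced with a short Python sketch. The function name is chosen for this example. Because the CV divides a standard deviation by a mean in the same units, it is unitless, which is why weights in kilograms can be compared with weights in grams.

```python
def coefficient_of_variation(sample_sd, sample_mean):
    """CV = sample standard deviation / sample mean (unitless)."""
    if sample_mean == 0:
        raise ValueError("CV is undefined when the sample mean is zero")
    return sample_sd / sample_mean


cv_elephants = coefficient_of_variation(150, 4550)  # both in kg
cv_mice = coefficient_of_variation(1.67, 30)        # both in g

print(round(cv_elephants, 3))  # 0.033
print(round(cv_mice, 3))       # 0.056
print(cv_mice > cv_elephants)  # True: mice vary more relative to their average weight
```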

  16. Examples (6.4)
      • Ex. 44, pg. 286: Excel spreadsheet, outliers
