

  1. Computing for engineering simulation: Data analysis I, II and Experimental Thinking. Jin Yoon, Statistical Consulting Unit, The Australian National University, May 2020

  2. What is "Statistics"? Definitions
     ◮ In everyday usage, the term statistics means numerical facts or data, e.g., an unemployment rate of 9.2% or an average smartphone price of $1000.
     ◮ As a field of study, statistics is more complex: collecting, summarising, analysing and interpreting data.
     ◮ Both senses rely on the word 'data'.

  3. Data: Types of statistical data
     In statistics, data can be considered to be one of two types:
     ◮ categorical data: generally non-numeric or qualitative, e.g., color, gender, religion, etc.
        ◮ subdivided into two types: nominal and ordinal
     ◮ numerical data: quantitative and generally measurements, e.g., age, income, height, etc.
        ◮ subdivided into two types: discrete and continuous
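
In R, the language used later in these slides, the four subtypes map roughly onto unordered factors, ordered factors, integer vectors and double vectors. A minimal sketch; all variable names and values below are made up for illustration.

```r
# Hypothetical values, for illustration only
colour <- factor(c("red", "blue", "red"))                   # nominal
size   <- factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)                             # ordinal
n_defects <- c(0L, 2L, 1L)                                   # discrete
height_cm <- c(171.2, 165.8, 180.4)                          # continuous

is.factor(colour)      # TRUE
is.ordered(size)       # TRUE
size[1] < size[2]      # TRUE: "small" comes before "large"
is.numeric(height_cm)  # TRUE
```

Only the ordered factor supports comparisons such as `<`; that is the practical difference between nominal and ordinal data.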

  4. Data: Types of statistical data. "Methods for viewing and summarizing data depend on which type of data it is."

  5. Data: Types of statistical data
     Data can also be classified by the number of variables:
     ◮ univariate data: the dataset consists of a single variable, described by graphical and numeric summaries
     ◮ bivariate data: there are two variables in the dataset
     ◮ multivariate data: there are two or more variables in the dataset

  6. Data: Population vs. sample
     ◮ 'Data' can refer either to the population or just to a sample selected from that population.
     ◮ It is very important to distinguish between numerical measures of a population and numerical measures of a sample.
     ◮ A parameter is a numerical measure of a population; a statistic is a numerical measure of a sample.
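
The distinction can be illustrated with a simulated population. This is only a sketch: the population below is generated with `rnorm()` under assumed values (mean 2, standard deviation 0.1) and the seed is arbitrary.

```r
set.seed(1)                                      # arbitrary seed, for repeatability
population <- rnorm(10000, mean = 2, sd = 0.1)   # simulated, hypothetical population
samp <- sample(population, size = 50)            # a sample drawn from it

mu   <- mean(population)   # parameter: numerical measure of the population
xbar <- mean(samp)         # statistic: numerical measure of the sample
c(parameter = mu, statistic = xbar)
```

The two values are close but not equal, and a different sample would give a different statistic while the parameter stays fixed.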

  7. Data: Summarising data
     ◮ Summarise the (sample) data in order to present them in a more meaningful or more easily interpreted form.
     ◮ Descriptive statistics and graphical summaries are methods for summarising, describing and displaying data.
     ◮ Using descriptive or summary measures to learn about characteristics of the population is statistical inference.

  8. Summarising Data: Example (machine breakdowns)
     The engineer in charge of maintaining a machine keeps records of the causes of its breakdowns over a period of a year. Altogether there are 46 breakdowns, of which 9 have "electrical causes", 24 have "mechanical causes", and 13 are due to "operator misuse". How should this data be summarised?

  9. Summarising Data: Example (machine breakdowns)
     Is this categorical data or numerical data? How many variables are there?
     It is a single categorical variable: the data actually consist of 46 categorical observations, $x_1, \ldots, x_{46}$, with each observation taking one of the values {electrical, mechanical, misuse}.

  11. Summarising Data: categorical data - table
      ◮ A table is the most common way to summarise categorical data:

          Breakdown cause   Frequency
          Electrical                9
          Mechanical               24
          Misuse                   13
          Total                    46

      ◮ Categorical data can also be summarised graphically with barplots, dot charts and pie charts.
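
A minimal sketch of how this table could be produced in R. The raw vector `cause` is reconstructed from the counts on the slide, since the original 46 records are not given.

```r
# Rebuild the 46 categorical observations from the slide's counts
cause <- rep(c("Electrical", "Mechanical", "Misuse"), times = c(9, 24, 13))
length(cause)        # 46 observations

freq <- table(cause) # frequency table, one count per category
freq
addmargins(freq)     # same table with the total appended as "Sum"
prop.table(freq)     # relative frequencies
```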

  12. Summarising Data: categorical data - barplot
      [Figure: barplot of Frequency (y-axis, 0 to 25) by Breakdown cause (x-axis: Electrical, Mechanical, Misuse)]
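
A plot like this one can be sketched with base R graphics. The axis labels and limits are taken from the figure; the exact call used for the slide is not known, so the arguments below are assumptions.

```r
# Breakdown counts from the example, as a named vector
freq <- c(Electrical = 9, Mechanical = 24, Misuse = 13)

# Barplot with a zero-based frequency axis
barplot(freq, xlab = "Breakdown cause", ylab = "Frequency", ylim = c(0, 25))

# The same counts as a dot chart, an alternative display
dotchart(freq, xlab = "Frequency")
```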

  13. Summarising Data: categorical data - misleading plots
      When we look at a barplot, we compare the heights of the bars and assume that the scale starts at 0. The reader can be deliberately misled by a graphic whose scale does not start at 0. Such misleading barplots appear frequently in the media.

  14. Summarising Data: categorical data - misleading plots
      [Figure: two barplots titled "Web Preference" for IE, Firefox and Chrome, y-axis in Thousands; one axis runs from 42 to 46, the other from 0 to 40]
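
The effect can be reproduced with made-up numbers. The counts below are hypothetical, not the values behind the slide, and the axis limits are chosen only to show the contrast.

```r
# Hypothetical counts in thousands (the slide's actual values are not recoverable)
pref <- c(IE = 43.5, Firefox = 44.2, Chrome = 45.1)

op <- par(mfrow = c(1, 2))                       # two panels side by side
barplot(pref, ylim = c(42, 46), xpd = FALSE,     # truncated axis: bars start at 42
        ylab = "Thousands", main = "Truncated axis")
barplot(pref, ylim = c(0, 50),                   # zero-based axis
        ylab = "Thousands", main = "Zero-based axis")
par(op)                                          # restore graphics settings

# Ratio of each bar to the shortest one: as drawn on the truncated axis vs. in the data
apparent <- (pref - 42) / min(pref - 42)
actual   <- pref / min(pref)
rbind(apparent, actual)
```

On the truncated axis the Chrome bar is drawn about twice as tall as the IE bar, although the underlying values differ by under 4%.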

  15. Summarising Data: categorical data - others, pie chart
      Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A barchart or dotchart is a preferable way of displaying this type of data.

  16. Summarising Data: numerical data
      Now consider numerical data. What we want to understand is the distribution of the data:
      ◮ what is the range of the data?
      ◮ what is the central tendency?
      ◮ how spread out are the values?
      These questions can be answered graphically or numerically.

  17. Summarising Data: measure of central tendency
      A measure of central tendency is a numerical measure that locates the centre of a distribution of measurements or describes a 'typical value'.
      ◮ Most common measures of centre: 3M (Mode, Median, Mean)
      ◮ These measures not only simplify the description of the data but also allow different datasets to be compared quantitatively.

  18. Summarising Data: measure of central tendency - 3M
      The most common measures of centre, 3M:
      ◮ Mode: the observation in the dataset that occurs most often (i.e., has the highest frequency of occurrence).
      ◮ Median: the middle number in an ordered dataset (the average of the two middle numbers when the number of observations is even).
      ◮ Mean: the arithmetic average of all the measurements in the dataset.

  19. Summarising Data: measure of central tendency - example
      Find the 3M of the following sample: $X = \{18, 19, 18, 20, 18, 18, 20, 21, 37, 18\}$
      ◮ Mode: 18
      ◮ Median: 18.5
      ◮ Mean: 20.7
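
These three values can be checked in R. Base R has no function for the statistical mode (`mode()` reports the storage type instead), so the sketch below takes it from a frequency table; `mode_x` is a name introduced here.

```r
x <- c(18, 19, 18, 20, 18, 18, 20, 21, 37, 18)

tab <- table(x)                                    # frequency of each distinct value
mode_x <- as.numeric(names(tab)[which.max(tab)])   # most frequent value
mode_x      # 18
median(x)   # 18.5, the average of the 5th and 6th sorted values (18 and 19)
mean(x)     # 20.7
```

The single large observation, 37, pulls the mean above the median, while the mode and median are unaffected by it.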

  20. Summarising Data: measures of variability - how spread out
      A measure of variability is a single value that measures the internal variation of the data, i.e., how much data items vary from one another or from a central point.
      ◮ Three of the more commonly used measures of variability: range, variance, standard deviation

  21. Summarising Data: measure of variability
      ◮ Range: the difference between the largest and the smallest values in the data (the simplest one)
      ◮ Variance: a single value obtained by summing the squares of the deviations from the mean and dividing this sum by $(n - 1)$, where $n$ is the sample size
      ◮ Standard deviation: the square root of the variance

  22. Summarising Data: measure of variability - variance and standard deviation
      How to calculate the variance of the sample data $x$ with sample size $n$:
      ◮ First, calculate the sample mean, $\bar{x}$.
      ◮ Calculate the deviation from the mean, or residual, $x - \bar{x}$, then take the square, $(x - \bar{x})^2$.
      ◮ Sum the squared residuals and divide by $(n - 1)$.
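
The steps above can be written as formulas. The symbols $s^2$ and $s$ for the sample variance and standard deviation are an assumption here, since the slides do not name them:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```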

  23. Summarising Data: measures of variability - variance and standard deviation
      Consider the following sample of data: $x = \{10, 12, 15, 17, 21\}$
      ◮ The sample mean is $\bar{x} = (10 + 12 + \cdots + 21)/5 = 15$
      ◮ Deviations and squared deviations:

                                                  Sum
          $x$                   10   12   15   17   21     75
          $x - \bar{x}$         -5   -3    0    2    6      0
          $(x - \bar{x})^2$     25    9    0    4   36     74

      ◮ The variance of $x$ is $74/(5 - 1) = 18.5$ and the standard deviation is $\sqrt{18.5} = 4.30$

  24. Summarising Data: how to get the mean and variance (standard deviation) in R

          > x = c(10,12,15,17,21)   # input data as a vector
          > xbar = mean(x)          # calculate the mean directly
          > x - xbar
          [1] -5 -3  0  2  6
          > (x-xbar)^2
          [1] 25  9  0  4 36
          > sum((x-xbar)^2)
          [1] 74
          > sum((x-xbar)^2) / (length(x) - 1)
          [1] 18.5
          > var(x)                  # calculate the variance directly
          [1] 18.5
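
The session above stops at the variance. A minimal sketch of the remaining step, the standard deviation, using base R's `sqrt()` and `sd()`; the name `s` is introduced here.

```r
x <- c(10, 12, 15, 17, 21)

s <- sqrt(var(x))   # square root of the sample variance, 18.5
s                   # about 4.30
sd(x)               # the same value, calculated directly
```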

  25. Summarising Data: five number summary
      The five number summary gives a rough idea of what the dataset looks like and includes 5 values (items):
      ◮ The minimum (min) and the maximum (max)
      ◮ The first quartile (25%), the median (50%), and the third quartile (75%)

  26. Summarising Data: five number summary

          > x = c(10,12,15,17,21)
          > length(x)
          [1] 5
          > sum(sort(x)[3])   # the median the hard way
          [1] 15
          > median(x)         # easy way
          [1] 15
          > quantile(x, c(0.25,0.5,0.75))
          25% 50% 75%
           12  15  17
          > summary(x)
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
               10      12      15      15      17      21
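
Base R also provides `fivenum()`, which returns the five numbers in one call, and `boxplot()`, which draws them. A sketch on the same vector; the name `five` is introduced here.

```r
x <- c(10, 12, 15, 17, 21)

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
five                 # 10 12 15 17 21

boxplot(x, horizontal = TRUE)   # graphical version of the same summary
```

`fivenum()` uses Tukey's hinges for the quartiles, so on other datasets its result can differ slightly from `quantile()`; for this vector the two agree.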

  27. Graphical summaries of the data: categorical data and numerical data
      In general, a table of numbers is not very informative, whereas a picture or graphical representation can be quite informative:
      ◮ Categorical (or qualitative) data: pie charts, bar charts and dotplots, which let the reader grasp the distribution of the data quickly
      ◮ Numerical (or quantitative) data: boxplot, histogram and density curve

  28. Graphical summaries of the data: numerical (quantitative) data
      Numerical data are summarised graphically with a histogram, boxplot, or density curve.
      ◮ Histogram: constructed by binning the data and counting the number of observations in each bin
      ◮ Density plot: can be thought of as a smoothed histogram
      ◮ Boxplot: a visualisation of the five number summary (shown above)

  29. Graphical summaries of the data: numerical data - example: histogram and density plot in R
      A random sample of 50 milk containers is selected and their milk contents are weighed. The results are shown below:

          1.958 1.951 2.107 2.092 1.955 2.162 2.168 2.134 1.971 2.072
          2.049 2.017 2.117 1.977 2.034 2.062 2.110 1.974 1.992 2.018
          2.135 2.107 2.084 2.169 2.085 2.018 1.977 2.116 1.988 2.066
          2.126 2.167 1.969 2.198 2.078 2.119 2.088 2.172 2.133 2.112
          2.066 2.128 2.142 2.042 2.050 2.102 2.000 2.188 1.960 2.128
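
A minimal sketch of how these values could be plotted in R. The vector name `milkdata` follows the figure titles; the exact calls used for the slides are not shown, so the plotting arguments and the overlaid density curve are assumptions.

```r
milkdata <- c(1.958, 1.951, 2.107, 2.092, 1.955, 2.162, 2.168, 2.134, 1.971, 2.072,
              2.049, 2.017, 2.117, 1.977, 2.034, 2.062, 2.110, 1.974, 1.992, 2.018,
              2.135, 2.107, 2.084, 2.169, 2.085, 2.018, 1.977, 2.116, 1.988, 2.066,
              2.126, 2.167, 1.969, 2.198, 2.078, 2.119, 2.088, 2.172, 2.133, 2.112,
              2.066, 2.128, 2.142, 2.042, 2.050, 2.102, 2.000, 2.188, 1.960, 2.128)
length(milkdata)               # 50 observations

h <- hist(milkdata)            # histogram of counts per bin
hist(milkdata, freq = FALSE)   # same bins on the density scale
lines(density(milkdata))       # kernel density estimate drawn over the histogram
```

With `freq = FALSE` the bar areas sum to 1, which puts the histogram on the same scale as the density curve.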

  30. Graphical summaries of the data: numerical data - example: histogram and density plot in R
      [Figure: two panels titled "Histogram of milkdata", x-axis milkdata from 1.95 to 2.20; one panel has Frequency on the y-axis (0 to 15), the other Density (0 to 8)]
