welcome to this second in the series of lectures on
play

Welcome to this second in the series of lectures on Managerial - PDF document

Welcome to this second in the series of lectures on Managerial statistics. In this lecture we will look at the first topic which is Classification, Summarization and Presentation of Data In the previous lecture we looked at the Overview of


  1. Welcome to this second in the series of lectures on Managerial statistics. In this lecture we will look at the first topic which is Classification, Summarization and Presentation of Data

  2. In the previous lecture we looked at the Overview of Statistical Analysis. In this lecture we will look at Classification, Summarization and processing of data followed by the presentation and communication of the data. The techniques for gathering data will be covered later as it is closely linked to research methodology and design. We will assume that data has been collected to address a research question and learn the basics of data classification, summarization and presentation.

  3. Lets assume that you are trying to establish a central fuel storage facility for a given geographical area because of the difficulties in sustaining many small ones. We would need to determine the optimal size for this large central fuel storage facility. Let us assume that you have collected daily off take of fuel in KL from the area for the past 5 financial years. Let us see how this data can be used to determine the optimal size for the central storage facility that needs to be built.

  4. First Look at the data. The top row has days numbered from 1 to 31. These are the days of the month. The first column has the year, starting from the current year at the bottom, annotated as CY and working its way up till CY -5 and CY -5. CY-5 is not visible only because of the scroll state of the data sheet otherwise be assured it is there. Questions that would naturally occur to you are what is the maximum offtake, what does the average look like, How many days have there been heavy offtakes? Etc. The first thing to do is to use excel to answer these questions.

  5. The maximum and the average are fairly straightforward to obtain using the functions max and average already built into excel. Just provide the complete data range and you should straightaway get the maximum and average values of this data set. We still have a question pending How many days have there been heavy offtakes? This question is answered by a frequency table and we will see how to construct it now.

  6. A frequency table achieves condensation of data by classification. By classification of data, for example in the fuel off take case, we mean calculating the number of days the fuel off take was less than 100 KL, between 100-200 KL and so on. Each such slot is called a class and the processes of condensing the data into these classes is called data classification. (P) How many classes should we have? It turns out that the ideal number of classes' that a data set should be divided into, depends on the number of data points in the data set and is give by the expression k = 1+3.3*log(n) where k is the number of classes and n is the number of data points. This is a bit counter intuitive since one would expect it to depend on the range of the data. But trust me there is formal proof for this although it is beyond the scope of this course. (P) Generally the number of classes for typical data sets range between 5 and 15 classes

  7. Now lets look at some simple terms related to Classes. (P) Lets Assume this straight line is the total range of the data. (P) The minimum and maximum are as indicated. (P) Let us assume that this data has been equally divided into 5 classes using the Strugess approximation. These classes are now shown divided using the triangular markers. (P) The upper and lower class limits for Class 2 is now shown on the figure. (P) The upper and lower class limits of all the five classes are now annotated. Notice other than the maximum and the minimum value each limit has a definition for both the upper and the lower class. For example the upper class limit of class 2 is also the lower class limit of class 3 and so on. (P) The class interval which is the difference between two class limits is now annotated. (P) The mid class mark or MCM is the mid point of the class interval and is obtained as the average of the upper and lower class limits. The MCM is now annotated as a mark in the center of each class. This mark in data is used for plotting purposes which we will see shortly.

  8. Earlier we showed how to determine the maximum and average of the entire data set using the max and average built in functions in excel. In this slide note that the range $c$5:$AG $79 shown in the formulae is the entire range of the data. This could vary in your worksheet if you had inserted or altered the format and position of your worksheet. We know that in order to calculate the class interval we need to determine the range of the data and divide it with the ideal number of classes that this data set should be divided into. (P) So first we will determine the number of data points (n) which is required to calculate the ideal number of classes using the count function. This data set has 2282 data points. (P) Next we will calculate the ideal number of classes using the strugess approximation 1+ 3.3 * Log (n) where n is the number of data points that we determined to be 2282. The result is 12 classes. (P) We next determine the range of the data by obtaining the difference between the maximum value in the data set and the minimum value in the data set which turns out to be 998. (P) Dividing this range by the ideal number of classes we get the class interval as 83. Since this is a sort of odd figure we could select 100 KL as our class interval. We are now ready to look at the anatomy of a frequency distribution table.

  9. Now we can look at a frequency distribution table for classifying data. The table typically has seven columns. (P) The first two are the Lower Class Limits and the Upper Class limits of each class. (P) The next column is the Mid Class Mark for each class. These definitions we have already seen previously. (P) The next column is frequency of occurrence in each class. This is fairly self explanatory although the mechanics of extracting the frequency of occurrence in each class needs some explanation. The next three columns i.e. Relative Frequency, Cumulative Frequency and Cumulative Relative frequencies are all different ways of presenting the frequency data and are used for plotting purposes. (P) The relative frequency is the percentage of the frequency in the current class to the total of all the frequencies in the data set. The cumulative frequency is the addition of all relative frequencies including and below a certain class. The cumulative relative frequency is the proportion of the relative frequency as a percentage of the relative frequency in the current class to the total of all the relative frequencies in the data set.

  10. Let us no build the frequency distribution table step by step and if you pay attention you will learn the trick straight away. (P) As we have seen earlier the first two columns of the frequency distribution table are the lower and upper class limits of each class. Remember we had chosen a class interval of 100 Kilo Liters. (P) The next comes the mid class mark which is essentially the average of the lower and upper class limits. (P) The next comes the column which has the most action, i.e. the frequency column which is the basic purpose of this table. Fortunately for us excel has a built in function called frequency. Simple select the complete column and start typing the frequency formula. It will automatically start typing in the top of the selected column. The frequency function, as you would expect requires two basic parameters which are the complete range of the data on the work sheet and the upper class limits. Feed these two ranges and hold the control and shift keys down while you press enter. This action will fill in the frequencies in the complete column. Viola you have already determined the frequencies. I will repeat this just in case it was a bit quick. Simple select the complete column and start typing the frequency formula. It will automatically start typing in the top of the selected column. The frequency function, as you would expect requires two basic parameters which are the complete range of the data on the work sheet and the upper class limits. Feed these two ranges and hold the control and shift keys down while you press enter. This action will fill in the frequencies in the complete column. (P) The next action is to simply total the frequencies to check that the total is same as the count figure that we got earlier. You can see now that they are the same. (P) The next action is to get the relative frequency in each class which is essentially the frequency of the class as a percentage of the total frequencies. (P) The cumulative frequency is then obtained by simply adding up all frequencies below the UCL of the given class. (P) The cumulative relative frequency is obtained by simply adding up all the relative frequencies below the UCL of a given class. We now have the completed the table and are ready to do some data presentation.

  11. There are three basic graphs which are used to present the data summarized in a frequency distribution table. They are histograms, Frequency Polygons and Ogives. We will now look at them one by one.

  12. A histogram is a typical bar graph (with no spacing between the bars) showing the classes on the x-axis and the occurrences on the y axis.

Recommend


More recommend