Experimental Software Engineering – Correlation analysis –
Fernando Brito e Abreu (fba@di.fct.unl.pt)
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

Abstract
- Correlation analysis vs. experimentation
- Relations between variables
- Sample size problem
- Correlation
- Parametric coefficients
- Non-parametric coefficients

12-May-08 Experimental Software Engineering / Fernando Brito e Abreu

Relations between variables
- The ultimate goal of every research or scientific analysis is finding relations between variables
- The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities
  - Either way involves relations between variables
- Thus, the advancement of Science must always involve finding and evaluating new relations between variables
- Isn’t that what correlation is about?
- Why care about experimentation, then?

Correlation analysis vs. experimentation
- Correlation analysis
  - We do not influence any variables; we only measure them and look for relations (correlations) within some set of variables
  - Those correlations are quantified as coefficients whose magnitude lies in [0%, 100%]
  - Example: practitioners’ expertise and defects found
- Experimentation
  - We manipulate some variables and then measure the effects of this manipulation on other variables
  - Example: a researcher increases design complexity and then records defects found, keeping all other variables constant
  - Beware of the learning effect when subjects are humans

Correlation analysis vs. experimentation
- Experimentation can conclusively demonstrate causal relations between variables
  - If we find that whenever we change variable A, variable B changes, then we can conclude that "A influences B"
- Correlation analysis cannot conclusively prove causality
  - We can find "high" correlation values between variables such as average literacy and life expectancy, but there is no proven causality between them
  - Question: why, then, can we observe that correlation when analyzing data from countries worldwide?

Correlation analysis vs. experimentation
- If experimental data may potentially provide qualitatively better information than correlational data, why care about correlation analysis at all?
- Correlation analysis only allows us to measure the association between variables, not their interdependence. Formally speaking:
  1. interdependence ⇒ association
  2. ¬ (association ⇒ interdependence)

Why care about correlation then?
- Correlation analysis can be useful for:
  - Reducing the size of the set of explanatory variables
    - Highly correlated ones may be measuring the same attribute
  - Performing a preliminary assessment of the feasibility of a hypothesis
    - A very low correlation (association) between a dependent and an independent variable may lead us to discard the hypothesis of causality
- Most statistical tools allow us to produce cross-correlation tables (symmetric matrices holding the pairwise correlation values among the variables considered)
  - The main diagonal is obviously filled with 1’s (100% correlation)

Association between variables: properties
- Magnitude or size
  - This property pertains to the strength of the association
  - Several correlation coefficients (e.g. Pearson, Spearman) allow us to quantify this magnitude
- Sign of the association
  - Positive: when a variable increases, the other increases as well
  - Negative: when a variable increases, the other decreases
- Significance, reliability or truthfulness
  - This property pertains to the representativeness, for the entire population, of the result found in our specific sample
  - It says how probable it is that a similar relation would be found if the experiment were replicated with other samples from the same population
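
The cross-correlation table mentioned above can be sketched in a few lines of Python. This is only an illustration, not the output of any particular statistical tool: the variable names and the data values are invented, and the Pearson coefficient is computed by hand so that the sketch needs nothing beyond the standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_table(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical project measurements, invented purely for illustration.
data = {
    "kloc":    [12, 25, 31, 47, 58, 66],
    "classes": [30, 61, 80, 115, 140, 170],
    "defects": [5, 9, 14, 13, 22, 25],
}

table = correlation_table(data)
for name, row in table.items():
    # One row per variable; the diagonal entry is the variable against itself.
    print(f"{name:8s}", [round(row[other], 2) for other in data])
```

In this made-up data set, "kloc" and "classes" are almost proportional, so they would be candidates for being treated as measures of the same attribute when reducing the set of explanatory variables.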

Magnitude vs. reliability of relations
- Magnitude and reliability are different properties, but they are not totally independent!
- Usually, for a given sample size, the larger the magnitude of the relation between variables, the more reliable the relation
  - Assuming that there is no relation between the respective variables in the population (null magnitude), the most likely outcome would be also finding no relation between those variables in the research sample
  - Thus, the stronger the relation found in the sample (greater magnitude), the less likely it is that there is no corresponding relation in the population
- Depending on sample size, a relation of a given strength can be either highly significant or not significant at all

Sample size problem
- The smaller the sample size, the more likely it is that we will obtain results that deviate from the population parameters
  - The error would be to assume the existence of a relation between two variables obtained from a population in which such a relation does not exist
- Technically speaking, the probability of a random deviation of a particular size (from the population mean) decreases as the sample size increases
- Conclusion: a smaller sample size implies lower reliability of associations
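
The sample size problem can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions: both variables are drawn independently from a standard normal distribution (so the true association in the population is null), and the sample sizes, the number of trials and the seed are arbitrary choices. It reports how large the sample correlation can get by chance alone.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spurious_r_spread(sample_size, trials=2000, seed=1):
    """Median and 95th percentile of |r| between two UNRELATED variables."""
    rng = random.Random(seed)
    magnitudes = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        magnitudes.append(abs(pearson_r(xs, ys)))
    magnitudes.sort()
    return magnitudes[len(magnitudes) // 2], magnitudes[int(0.95 * len(magnitudes))]

for n in (5, 20, 100):
    median, p95 = spurious_r_spread(n)
    print(f"n={n:3d}  median |r|={median:.2f}  95th percentile |r|={p95:.2f}")
```

With very small samples, fairly large coefficients appear by chance even though no relation exists in the population; as the sample grows, chance coefficients shrink towards zero. This is why the same coefficient can be highly significant in a large sample and not significant in a small one.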

Wrap-up
- If the true association (in the population) between variables is:
  - very small, then there is no way to identify such an association in a study unless the research sample is correspondingly large
  - very large, then it can be found to be highly significant even in a study based on a very small sample
- Conclusion: the smaller the association between variables, the larger the sample size required to prove it significant

Correlation
- Correlation is the extent to which values of two variables are "proportional" to each other
  - Proportional means linearly related
- Correlation is high if the relation can be approximated by a straight line
  - The line slopes upwards or downwards, depending on the sign of the association
  - That line is called the regression line or least squares line because it is determined such that the sum of the squared distances of all the data points from the line is the minimum
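
As a minimal sketch of the least squares line described above, the following Python fragment computes the slope and intercept that minimise the sum of squared vertical distances, together with the correlation coefficient. The function name and the size/effort figures are hypothetical, chosen only to illustrate that the slope and the coefficient share the same sign.

```python
import math

def least_squares_line(xs, ys):
    """Return (slope, intercept, r) of the least squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # minimises the sum of squared residuals
    intercept = mean_y - slope * mean_x    # the line passes through the point of means
    r = sxy / math.sqrt(sxx * syy)         # same numerator, hence same sign as the slope
    return slope, intercept, r

# Hypothetical size/effort observations, invented for illustration.
size = [10, 20, 30, 40, 50]
effort = [14, 22, 35, 41, 53]

slope, intercept, r = least_squares_line(size, effort)
print(f"effort = {slope:.2f} * size + {intercept:.2f}   (r = {r:.3f})")
```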

Correlation coefficients
- The magnitude of the correlation can be expressed by a correlation coefficient
  - Several coefficients are proposed in the literature
  - Some are parametric and others non-parametric
- The correlation coefficient does not depend on the specific measurement units used
  - For example, the correlation between Size and Effort will be identical regardless of whether Function Points and Man-Years, or KLOC and Man-Months, are used as measurement units

Parametric correlation coefficients
- The most widely used type of correlation coefficient is Pearson r (Pearson, 1896)
  - It is also called linear or product-moment correlation
- Assumptions
  - Each pair of variables is bivariate normal
  - The two variables are measured on at least interval scales
- SPSS: Analyze / Correlate / Bivariate
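
The unit independence of Pearson r can be checked numerically. The sketch below assumes that the two sets of units are related by a positive linear conversion: 12 months per year is exact, whereas the factor of 0.05 KLOC per function point is a purely hypothetical figure, as are the data values.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical projects: size in function points, effort in person-years.
function_points = [120, 250, 310, 480, 560]
person_years = [1.1, 2.0, 2.9, 4.2, 4.6]

KLOC_PER_FP = 0.05     # assumed conversion factor, for illustration only
MONTHS_PER_YEAR = 12

kloc = [fp * KLOC_PER_FP for fp in function_points]
person_months = [py * MONTHS_PER_YEAR for py in person_years]

r_original = pearson_r(function_points, person_years)
r_converted = pearson_r(kloc, person_months)
print(f"FP vs person-years:     r = {r_original:.6f}")
print(f"KLOC vs person-months:  r = {r_converted:.6f}")
```

The scale factors multiply the covariance and the product of the standard deviations by the same amount, so they cancel in the ratio and the two coefficients agree up to floating-point rounding.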

Nonparametric correlation coefficients
- These statistics do not require that variables are normally distributed
- Chi-square
  - Assumptions: nominal scales
- Spearman R, Kendall Tau, Gamma
  - Assumptions: at least ordinal scales (ranks)
- For ordinal scales, if ranks are represented by literal enumerations, you have to recode them into integers

Spearman R correlation coefficient
- Spearman R is similar to the Pearson coefficient, except that it is computed from ranks
- SPSS: Analyze / Correlate / Bivariate
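
The two points above (recoding literal ordinal values into integers, and computing Spearman R from ranks) can be sketched as follows. This is a minimal illustration, not what SPSS does internally: the expertise labels, their integer codes and the defect counts are invented, and ties receive the average of the ranks they span.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman R: the Pearson coefficient applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical ordinal scale recoded into integers (order matters, spacing does not).
EXPERTISE_CODE = {"junior": 1, "intermediate": 2, "senior": 3}

expertise = ["junior", "senior", "intermediate", "junior", "senior", "intermediate"]
defects_found = [3, 11, 6, 4, 9, 7]

coded = [EXPERTISE_CODE[label] for label in expertise]
print(f"Spearman R = {spearman_r(coded, defects_found):.3f}")
```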