ACMS 20340 Statistics for Life Sciences Chapter 3: Scatterplots and Correlation


  1. ACMS 20340 Statistics for Life Sciences Chapter 3: Scatterplots and Correlation

  2. Exploratory Data Analysis Recall that exploratory data analysis has two guiding principles. 1. First examine each variable by itself. Then study the relationships between the variables. 2. Represent the data with graphs. Then add numerical summaries of aspects of the data. Now we’ll start to look at the relationships between variables.

  3. Relationships Between Variables Examples: ◮ Lung capacity decreases with number of cigarettes smoked in a day. ◮ The DMV warns that alcohol consumption slows reflexes, and the effect becomes larger as more alcohol is consumed.

  4. Relationships Between Variables Statistical relationships are overall tendencies. They are not ironclad rules. Two variables can have a statistical relationship, even if some exceptions exist in the data. To compare two variables, always measure them on the same individuals. Examples: ◮ Smoking influences lung capacity. ◮ Blood alcohol content explains variations in reflex time. In a statistical relationship, one variable explains or influences the other.

  5. Explanatory and Response Variables A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a response variable. Response and explanatory variables are sometimes referred to as dependent and independent variables, respectively. ◮ A response variable “depends on” an explanatory variable. Studies often try to show that changes in one variable cause changes in another. Many statistical relationships do not involve direct causation.

  6. Explanatory and Response Variables How to identify each type? Case 1: Values of one variable are set to see how it affects another. Case 2: Two variables are observed. This situation may or may not have explanatory/response variables. It depends on how the data is used.

  7. Analyzing Statistical Relationships Analyzing two-variable data expands on what we know: ◮ Plot the data. ◮ Look for overall patterns and any deviations from that pattern. ◮ Then obtain numerical summaries based on the data.

  8. Scatterplots A scatterplot is a common and useful graph to show the relationship between two quantitative variables. Values of one variable (the explanatory variable, if applicable) go on the horizontal axis, and values of the other variable (the response) go on the vertical axis. Each individual in the data appears as the point in the plot corresponding to its values of the two variables.

  9. Interpreting Scatterplots When you make a graph, ask yourself “What do I see?” ◮ Deja Vu? ◮ Look for the overall pattern. ◮ Describe direction, form, and strength of the relationship. ◮ Check for any striking deviations, such as outliers.

  10. Interpreting Scatterplots “Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.” ◮ What? ◮ Think “upward trend”. Two variables are negatively associated when larger values of one variable tend to accompany smaller values of the other.

  11. Example Let’s look at the influence of the number of powerboats registered on manatee deaths from collisions with powerboats.

  12. Powerboats and Manatees Does the number of powerboats help explain yearly manatee deaths? What are the explanatory and response variables (if any)? Let’s take a look at the data.

  13. Scatterplots ◮ Scatterplots show the relationship between two quantitative variables. ◮ They are such a fundamental tool that many variations have been developed. ◮ One variation displays a third categorical variable by varying the dot style.

  14. Iris Data

  15. Iris Data The Iris Data from before. For three species of irises the petal and sepal lengths and widths were measured.

      Species      Petal Width   Petal Length   Sepal Width   Sepal Length
      Setosa       0.2           1.4            3.5           5.1
      Setosa       0.2           1.4            3             4.9
      Versicolor   1.3           4.1            2.8           5.7
      Virginica    2.5           6              3.3           6.3
      Virginica    1.9           5.1            2.7           5.8
      ...

  16. Petal Width by Sepal Width

  17. Petal Width by Sepal Width, with Species

  18. Petal Width by Sepal Width, with Species

  19. Running Speed vs. Energy Expenditure This plot is easier to understand when the different inclines are indicated.

  20. Linear Relationships Left: Vehicle horsepower vs. weight (100 lbs) Right: Powerboat registrations (thousands) vs. manatee deaths

  21. Linear Relationships ◮ While our eyes find it easy to see strong linear relationships, weak relationships are more difficult to see. ◮ The correlation between a pair of variables is a number measuring the strength of the linear relationship between them. ◮ It is denoted by the symbol r.

  22. Calculating Correlation The data is $x_1, x_2, \ldots, x_n$ for one variable and $y_1, y_2, \ldots, y_n$ for the other. The data is paired by individuals, so $x_1, y_1$ are observations from the same individual. $\bar{x}, s_x$ are the mean and standard deviation of the x data. $\bar{y}, s_y$ are the mean and standard deviation of the y data. $$r = \frac{1}{n-1} \sum_i \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
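The formula on this slide translates almost directly into code. The following is a minimal sketch using only Python's standard library; the function name `correlation` and the small test lists are illustrative assumptions, not part of the slides.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations (xs[i], ys[i])."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must be paired (same length)")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample s.d., which divides by n - 1
    # Sum the products of the standardized x and y values, then divide by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data lying exactly on a line, so r should be very close to 1 or -1.
r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Note that the sketch does not guard against a variable with zero standard deviation, for which r is undefined.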

  23. Deconstructing the Correlation Formula $$r = \frac{1}{n-1} \sum_i \underbrace{\left(\frac{x_i - \bar{x}}{s_x}\right)}_{\text{Normalize } x} \underbrace{\left(\frac{y_i - \bar{y}}{s_y}\right)}_{\text{Normalize } y}$$ We calculate the distance of each value from the mean, and then divide by the standard deviation. This has the effect of rescaling the observations to be in terms of standard deviations from the mean. Standardizing turns r into a unitless measurement.
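One way to see that r is unitless is to change the units of both variables and check that r does not move. The sketch below assumes made-up height and weight values (they are not from the course data) and hypothetical helper names `standardize` and `correlation`.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to 'standard deviations from the mean'."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """r as the sum of products of standardized values, divided by n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical measurements, invented for illustration only.
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0]

# The same individuals, measured in different units.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]

r_metric = correlation(heights_cm, weights_kg)
r_imperial = correlation(heights_in, weights_lb)

# The two values should agree up to floating-point rounding.
print(r_metric, r_imperial)
```

Because each value is divided by a standard deviation carrying the same unit, the unit cancels before the products are formed.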

  24. Correlation is symmetric r treats both explanatory and response variables symmetrically. Change in non-exercise activity (Calories) and Fat gain (kg) Strong negative association. r = −0.78.

  25. Correlation is symmetric r treats both explanatory and response variables symmetrically. Change in non-exercise activity (Calories) and Fat gain (kg) Strong negative association. r = −0.78.
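The symmetry can be checked numerically by swapping the roles of the two variables. This is a small sketch with invented numbers (not the activity and fat-gain data from the slide); the names `correlation`, `activity_change`, and `fat_gain` are assumptions for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical paired values with a downward trend, invented for illustration.
activity_change = [-90, -50, 140, 350, 390, 540, 690]
fat_gain = [4.2, 3.5, 3.0, 2.4, 1.6, 1.3, 0.6]

r_xy = correlation(activity_change, fat_gain)  # activity as explanatory
r_yx = correlation(fat_gain, activity_change)  # roles swapped

print(r_xy, r_yx)  # the two values should match
```

The formula multiplies the standardized x and y values, and multiplication does not care which factor comes first, so the choice of explanatory variable has no effect on r.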

  26. A Small Example

      x     y
      2.0   4.6
      1.7   4.4
      2.3   4.5

      $\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

      First calculate the mean and s.d. of x and y.

  27. A Small Example

      x     y     $x - \bar{x}$   $y - \bar{y}$
      2.0   4.6    0               0.1
      1.7   4.4   -0.3            -0.1
      2.3   4.5    0.3             0

      $\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

      Find distance from mean for both x and y.

  28. A Small Example

      x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$
      2.0   4.6    0                     1
      1.7   4.4   -1                    -1
      2.3   4.5    1                     0

      $\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

      Normalize by dividing by the corresponding s.d.

  29. A Small Example

      x     y     $(x - \bar{x})/s_x$   $(y - \bar{y})/s_y$   product
      2.0   4.6    0                     1                     0
      1.7   4.4   -1                    -1                     1
      2.3   4.5    1                     0                     0

      $\bar{x} = 2$, $\bar{y} = 4.5$, $s_x = 0.3$, $s_y = 0.1$

      Find product of normalized x and y. The sum of the products is 1, so $r = \frac{1}{3-1} \cdot 1 = 0.5$.
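The four steps of this small example can be replayed in code. The sketch below uses the three (x, y) pairs from the slide and only the standard library; variable names such as `z_x` and `products` are illustrative. Because of floating-point rounding, intermediate values may differ from the slide's in the last digits, but r should come out at about 0.5.

```python
from statistics import mean, stdev

# The three paired observations from the slide.
xs = [2.0, 1.7, 2.3]
ys = [4.6, 4.4, 4.5]

# Step 1: mean and standard deviation of each variable.
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Steps 2 and 3: distance from the mean, divided by the s.d.
z_x = [(x - x_bar) / s_x for x in xs]
z_y = [(y - y_bar) / s_y for y in ys]

# Step 4: multiply the normalized values, sum, and divide by n - 1.
products = [a * b for a, b in zip(z_x, z_y)]
r = sum(products) / (len(xs) - 1)

for x, y, a, b, p in zip(xs, ys, z_x, z_y, products):
    print(f"{x:4.1f} {y:4.1f} {a:6.2f} {b:6.2f} {p:6.2f}")
print(f"r = {r:.2f}")
```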

  30. Properties of Correlation ◮ r is always between −1 and 1. ◮ If r is close to 0 then there is little or no linear relationship between the variables. ◮ If r > 0 then it indicates a positive relationship, with the relationship being stronger the closer r is to 1. ◮ If r < 0 then it indicates a negative relationship, with the relationship being stronger the closer r is to −1. ◮ Correlation is not a resistant measure. Just as with the mean and standard deviation, outliers will affect the value of r.
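The word "linear" in the second property matters: r near 0 rules out a linear trend, not every kind of relationship. As a hedged illustration with invented data, the sketch below builds a y that is completely determined by x through a symmetric curve, yet r is essentially 0. The function name `correlation` is an assumption for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical data: y depends on x exactly, but along a U-shaped curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = correlation(xs, ys)

# The positive and negative products cancel, so r is 0 up to rounding.
print(f"{abs(r_curve):.6f}")
```

This is why the slides recommend plotting the data first: a scatterplot would reveal the curve immediately, while r alone would suggest no relationship.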

  31. Correlation varies from −1 to +1

  32. Manatee Deaths r = 0.95

  33. Horsepower vs. MPG r = −0.79

  34. Weight vs. MPG r = −0.9

  35. Cabin Volume vs. MPG r = −0.37

  36. Iris Species r = −0.36

  37. Linear Relationship? r = 0.18

  38. Linear Relationship? r = −0.043

  39. r is not resistant Some variables with a very strong linear relationship. r = −0.99

  40. r is not resistant Changing an extreme value keeps the same linear relationship but now r = −0.78.
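The effect of a single extreme value can be reproduced with a few lines of code. The data below are invented for illustration and are not the data behind the slides' plots, so the resulting values of r will differ from −0.99 and −0.78; the point is only that moving one observation weakens r noticeably.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r of paired observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical data with a very strong negative linear relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [16.1, 13.9, 12.0, 10.2, 7.9, 6.1, 3.8, 2.0]
r_before = correlation(xs, ys)

# Change only the y value of the most extreme x; the other seven
# points still follow the same line.
ys_changed = ys[:-1] + [12.0]
r_after = correlation(xs, ys_changed)

print(f"before: r = {r_before:.2f}")
print(f"after:  r = {r_after:.2f}")
```

Since r is built from means and standard deviations, which are themselves not resistant, one unusual point is enough to pull it toward 0.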
