bivariate data

Bivariate Data Marc H. Mehlman marcmehlman@yahoo.com University of - PowerPoint PPT Presentation


  1. Bivariate Data Marc H. Mehlman marcmehlman@yahoo.com University of New Haven Marc Mehlman Marc Mehlman Marc Mehlman (University of New Haven) Bivariate Data 1 / 36

  2. Table of Contents
     1 Bivariate Data
     2 Scatterplots
     3 Correlation
     4 Two–Way Tables
     5 Chapter #2 R Assignment

  3. Bivariate Data

  4. Bivariate Data
     Bivariate data comes from measuring two aspects of the same item or individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. Each pair gives the height in inches and the weight in pounds of one of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.
     Can one variable be used to predict the other? Do tall people tend to weigh more?
     Definition: A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
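The four-student sample above can be held in an R data frame with one row per individual. This is a minimal sketch; the names `students`, `height_in`, and `weight_lb` are illustrative and do not come from the slides.

```r
# Height (inches) and weight (pounds) of the four sampled students
students <- data.frame(
  height_in = c(70, 72, 74, 68),
  weight_lb = c(178, 192, 184, 181)
)

# The third student sampled
students[3, ]

# Treat weight as the response and height as the explanatory variable
plot(weight_lb ~ height_in, data = students)
```

Putting weight on the left of the formula reflects the question on the slide (does height predict weight?); the roles could be swapped for a different question.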

  5. Scatterplots

  6. Scatterplots: Bivariate data
     For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?
     Here we have two quantitative variables recorded for each of 16 students:
     1. how many beers they drank
     2. their resulting blood alcohol content (BAC)

     Student ID   Number of Beers   Blood Alcohol Content
     1            5                 0.1
     2            2                 0.03
     3            9                 0.19
     6            7                 0.095
     7            3                 0.07
     9            3                 0.02
     11           4                 0.07
     13           5                 0.085
     4            8                 0.12
     5            3                 0.04
     8            5                 0.06
     10           5                 0.05
     12           6                 0.1
     14           7                 0.09
     15           1                 0.01
     16           4                 0.05

  7. Scatterplots
     A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.

     Student   Beers   BAC
     1         5       0.1
     2         2       0.03
     3         9       0.19
     6         7       0.095
     7         3       0.07
     9         3       0.02
     11        4       0.07
     13        5       0.085
     4         8       0.12
     5         3       0.04
     8         5       0.06
     10        5       0.05
     12        6       0.1
     14        7       0.09
     15        1       0.01
     16        4       0.05
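The beers and BAC table can be drawn as a scatterplot directly in R. This is a sketch with the values typed in from the table in the order listed; the vector names `beers` and `bac`, the axis labels, and the title are illustrative choices.

```r
# Beers and BAC for the 16 students, in the order listed in the table
beers <- c(5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4)
bac   <- c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
           0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05)

# Explanatory variable on the x-axis, response on the y-axis
plot(bac ~ beers,
     xlab = "Number of beers", ylab = "Blood alcohol content",
     main = "BAC vs beers")
```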

  8. Scatterplots
     > plot(trees$Girth~trees$Height,main="girth vs height")
     [Figure: scatterplot titled "girth vs height", with trees$Height on the horizontal axis (about 65 to 85) and trees$Girth on the vertical axis (about 8 to 20).]

  9. Scatterplots: How to scale a scatterplot
     [Figure: four plots of the same data, scaled differently.]
     Both variables should be given a similar amount of space:
     - The plot is roughly square.
     - Points should occupy all the plot space (no blank space).
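One way to follow the scaling advice above in base R graphics is to request a square plotting region and let the axis limits track the data. This is a sketch using the built-in `trees` data from the previous slide; R's default limits already follow the data range, so `xlim` and `ylim` are set explicitly only to make the choice visible.

```r
# Ask for a square plotting region, remembering the old setting
old_par <- par(pty = "s")

# Axis limits set to the data ranges so the points fill the plot
plot(Girth ~ Height, data = trees,
     xlim = range(trees$Height), ylim = range(trees$Girth),
     main = "girth vs height")

# Restore the previous graphics setting
par(old_par)
```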

  10. Scatterplots: Interpreting scatterplots
      After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
      - Form: linear, curved, clusters, no pattern
      - Direction: positive, negative, no direction
      - Strength: how closely the points fit the "form"
      and for clear deviations from that pattern:
      - Outliers of the relationship

  11. Scatterplots: Form
      [Figure: three example scatterplots, labeled Linear, No relationship, and Nonlinear.]

  12. Scatterplots: Direction
      Positive association: High values of one variable tend to occur together with high values of the other variable.
      Negative association: High values of one variable tend to occur together with low values of the other variable.

  13. Scatterplots: Strength
      The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

  14. Scatterplots: Outliers
      An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

  15. Scatterplots: Adding categorical variables to scatterplots
      Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.
      The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.
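The fruit fly data are not bundled with base R, so the sketch below substitutes the built-in `iris` data set to show the same idea: one scatterplot, with color and plotting symbol chosen by a categorical variable. The data set, the colors, and the name `species_cols` are stand-ins, not part of the slides.

```r
# iris ships with R; Species is the categorical variable
species_cols <- c("darkgreen", "purple", "orange")

# Color and plotting symbol are picked by the species of each flower
plot(Petal.Width ~ Petal.Length, data = iris,
     col = species_cols[as.integer(iris$Species)],
     pch = as.integer(iris$Species),
     main = "Petal width vs petal length by species")

# Legend mapping each symbol and color back to its group
legend("topleft", legend = levels(iris$Species),
       col = species_cols, pch = 1:3)
```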

  16. Correlation

  17. Correlation
      Definition: Given the bivariate data $(x_1, y_1), \cdots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is
      $$r \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} \left(\frac{x_j - \bar{x}}{s_x}\right) \left(\frac{y_j - \bar{y}}{s_y}\right).$$
      The population correlation coefficient is denoted as
      $$\rho \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} \left(\frac{x_j - \mu_X}{\sigma_X}\right) \left(\frac{y_j - \mu_Y}{\sigma_Y}\right),$$
      where the sum runs over the entire population of size $N$. One thinks of $r$ as an estimator of $\rho$.
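The definition of the sample correlation coefficient can be checked numerically against R's built-in `cor()`. This is a sketch on the `trees` data used elsewhere in these slides; the variable name `r_def` is illustrative.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Standardize each variable with its sample mean and sample standard
# deviation, multiply pairwise, sum, and divide by n - 1
r_def <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

r_def
all.equal(r_def, cor(x, y))
```

R's `sd()` uses the n - 1 denominator, which matches the sample standard deviations in the definition.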

  18. Correlation
      One can also use the formula
      $$r = \frac{n \left(\sum_{j=1}^{n} x_j y_j\right) - \left(\sum_{j=1}^{n} x_j\right) \left(\sum_{j=1}^{n} y_j\right)}{\sqrt{n \left(\sum_{j=1}^{n} x_j^2\right) - \left(\sum_{j=1}^{n} x_j\right)^2} \, \sqrt{n \left(\sum_{j=1}^{n} y_j^2\right) - \left(\sum_{j=1}^{n} y_j\right)^2}}.$$
      R command:
      > cor(trees$Girth,trees$Height)
      [1] 0.5192801
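The computational formula can be evaluated term by term and compared with `cor()`. This is a sketch on the same `trees` data; `numerator`, `denominator`, and `r_formula` are illustrative names.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Numerator: n * sum(x*y) - sum(x) * sum(y)
numerator <- n * sum(x * y) - sum(x) * sum(y)

# Denominator: product of the two square-root terms
denominator <- sqrt(n * sum(x^2) - sum(x)^2) *
               sqrt(n * sum(y^2) - sum(y)^2)

r_formula <- numerator / denominator
all.equal(r_formula, cor(x, y))
```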

  20. Correlation
      The correlation coefficient measures the strength of any linear relationship between X and Y.
      Properties of Correlation:
      - cor(X, Y) = cor(Y, X).
      - −1 ≤ r ≤ 1, and r is scale invariant.
      - If r is positive, there is a positive linear relationship between the two variables.
      - If r is negative, there is a negative linear relationship between the two variables.
      - The closer |r| is to one, the stronger the linear relationship between the two variables.
      - If |r| = 1 (i.e., r = 1 or −1), all the data points lie on a straight line.
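Several of these properties can be checked numerically. This is a sketch on the `trees` data (girth recorded in inches, height in feet); the unit conversions and the straight lines `3 + 2 * x` and `3 - 2 * x` are arbitrary illustrations.

```r
x <- trees$Girth
y <- trees$Height

# Symmetry: swapping the arguments does not change r
all.equal(cor(x, y), cor(y, x))

# Scale invariance: converting inches to centimetres and feet to
# metres leaves r unchanged
all.equal(cor(2.54 * x, 0.3048 * y), cor(x, y))

# Points lying exactly on a straight line give |r| = 1;
# the sign of r follows the sign of the slope
cor(x, 3 + 2 * x)
cor(x, 3 - 2 * x)
```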

