
Data Visualization in R, May 15, 2017

  1. Data Visualization in R, May 15, 2017

  2. Jumping In

     Let’s get started right away by loading the ggplot2 package and reading in our dataset.

         ### Install packages if you don't have them yet
         ### Typical install:
         # install.packages('ggplot2')
         # install.packages('dplyr')
         ### Install personal copy (no admin rights)
         # install.packages('ggplot2', lib="/path/to/myfolder")
         # install.packages('dplyr', lib="/path/to/myfolder")

         ### Load packages
         library(ggplot2)
         library(dplyr)
         # Load personal copy
         # library(ggplot2, lib.loc="/path/to/myfolder")
         # library(dplyr, lib.loc="/path/to/myfolder")

         ### Read in data
         auto.data <- read.csv("AutoData.csv", header = TRUE)
         # tbl_df() isn't necessary here
         # It helps to display the data more clearly
         # (newer dplyr releases deprecate tbl_df() in favor of as_tibble())
         auto.data <- tbl_df(auto.data)

  3. Glimpse at the data

     Run the following to get a quick glimpse of the data:

         # Find the dimensions
         dim(auto.data)
         # Look at the structure
         str(auto.data)
         # Examine the top
         head(auto.data)
         # Find out about a function
         ?str

  4. Data Exploration

     When looking at a new data set, exploration is key. What types of variables do we have? What types of relationships do you expect to see between variables? Does your intuition check out? If not, why not? Do we observe anomalous behavior?
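
A first pass at the "what types of variables" question can be done with a few base R calls. AutoData.csv is not reproduced here, so this minimal sketch uses R's built-in mtcars data set as a stand-in (an assumption for illustration); the same calls should work on auto.data.

```r
# Stand-in data: mtcars ships with base R. It is NOT the workshop's AutoData.csv.
demo.data <- mtcars

# Class of every column: shows which variables are numeric, factor, or character.
col.types <- sapply(demo.data, class)
print(col.types)

# Numeric codes such as the cylinder count are often better treated as categorical.
demo.data$cyl <- factor(demo.data$cyl)
print(table(demo.data$cyl))

# Quick summaries help spot odd ranges or anomalous values.
print(summary(demo.data[, c("mpg", "wt", "hp")]))
```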

  5. Scatter Plots

     One of the simpler plots we can make is a scatter plot between two continuous variables. Try it out:

         # qplot is a convenient front end for the more powerful,
         # but slightly more complicated, ggplot() function.
         qplot(curb.weight, price, data=auto.data)

  6. [Figure: scatter plot of price (y-axis) against curb.weight (x-axis)]

  7. qplot is an easy-to-use front end to the main ggplot function. It has several reasonable defaults for plotting data. As we just saw, two continuous inputs give us back a scatter plot.
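
For comparison, here is a sketch of the same kind of scatter plot written with the full ggplot() function. Note that qplot() has been deprecated since ggplot2 3.4.0, so the explicit form may be the safer habit. The built-in mtcars data set stands in for auto.data here (an assumption), with wt playing the role of curb.weight and mpg the role of price.

```r
library(ggplot2)

# Stand-in data: built-in mtcars, not the workshop's AutoData.csv.
# aes() maps variables to the x and y positions; geom_point() draws the points.
p.scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

print(p.scatter)
```

With the workshop data loaded, the analogous call would presumably be ggplot(auto.data, aes(x = curb.weight, y = price)) + geom_point().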

  8. The true power of ggplot comes from its ability to easily visualize relationships between many variables. The main ingredients we’ll be using are:

     1. aesthetics
     2. facets
     3. geoms

  9. Aesthetics

     Aesthetics control many of the plot’s visual properties. Importantly, these visual properties may be mapped directly to variables.

  10. Aesthetics Example

      Try the following:

          # map color to factor/categorical variable
          qplot(curb.weight, price, data=auto.data, color=num.of.cylinders)
          # map color to continuous variable
          qplot(curb.weight, price, data=auto.data, color=bore)

  11. [Figure: two scatter plots of price against curb.weight, one with points colored by num.of.cylinders (eight, five, four, six, three, twelve) and one with points colored by bore on a continuous scale]
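
The difference between a categorical and a continuous color mapping can be sketched with ggplot() directly. This is an illustration on the built-in mtcars data set (a stand-in, since AutoData.csv is not assumed available): factor(cyl) plays the categorical role and hp the continuous one.

```r
library(ggplot2)

# Categorical mapping: factor() makes the numeric cylinder count discrete,
# so each level gets its own color and a discrete legend.
p.cat <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Continuous mapping: a numeric variable yields a color gradient instead.
p.cont <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point()

print(p.cat)
print(p.cont)
```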

  12. There are many other aesthetics besides color. Some we’ll encounter are:

      1. color
      2. size
      3. shape
      4. fill

      Not all aesthetics work with both categorical and continuous variables (as color did). Also, only a certain subset of aesthetics will be available for each plot type (geom).

  13. Try It Out

      See how the following aesthetics behave with the scatter plot. Feel free to change the variables in the scatter plot.

          qplot(curb.weight, price, data=auto.data, size=horsepower)
          qplot(curb.weight, price, data=auto.data, shape=drive.wheels)
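
As a rough illustration of why some aesthetics only accept certain variable types, the sketch below maps size to a continuous variable and shape to a categorical one, then tries shape with a continuous variable. It uses the built-in mtcars data set as a stand-in for auto.data (an assumption), and the failing case is wrapped in tryCatch() so the script keeps running.

```r
library(ggplot2)

# size works naturally with a continuous variable.
p.size <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point()

# shape needs a categorical variable, so the gear count is wrapped in factor().
p.shape <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear))) +
  geom_point()

print(p.size)
print(p.shape)

# Mapping shape to a continuous variable is rejected when the plot is built.
p.bad <- ggplot(mtcars, aes(x = wt, y = mpg, shape = hp)) +
  geom_point()
shape.result <- tryCatch(ggplot_build(p.bad), error = function(e) e)
print(inherits(shape.result, "error"))
```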

  14. Facets

      Facets represent another way of visualizing the effect of factor/categorical variables. Facets enable us to get a separate plot for each level/category.

  15. Facet Example

      Let’s try out a faceting example:

          qplot(curb.weight, price, data=auto.data) + facet_wrap(~aspiration)

  16. [Figure: scatter plots of price against curb.weight in two panels, std and turbo]

  17. Note that facet_wrap gives a separate plot for each category. Also note how we incorporated the behavior of facet_wrap: via the + operator. This is one of the main strengths of ggplot: plots are built up in intuitive layers.
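
The layering idea can be made concrete by storing a plot in a variable and adding to it afterwards. This is a minimal sketch on the built-in mtcars data set (a stand-in for auto.data, which is not assumed available), with am taking the place of a two-level variable such as aspiration.

```r
library(ggplot2)

# A plot is an ordinary R object, so it can be saved and extended later.
base.plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Adding facet_wrap() with + leaves base.plot untouched and returns a new plot.
faceted.plot <- base.plot + facet_wrap(~am)

print(base.plot)     # one panel
print(faceted.plot)  # one panel per level of am
```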

  18. Also available is facet_grid for examining the interaction between two categorical variables:

          qplot(curb.weight, price, data=auto.data) + facet_grid(drive.wheels~num.of.doors)

  19. [Figure: grid of scatter plots of price against curb.weight, with rows for drive.wheels (4wd, fwd, rwd) and columns for num.of.doors (four, two)]

  20. Try out the following:

          qplot(curb.weight, price, data=auto.data) + facet_grid(.~drive.wheels)
          qplot(curb.weight, price, data=auto.data) + facet_grid(drive.wheels~.)
          qplot(curb.weight, price, data=auto.data, color=num.of.doors) + facet_grid(drive.wheels~.)
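
In the facet_grid formula, the left side names the row variable and the right side the column variable, with a dot meaning "no variable". The sketch below shows the three layouts on the built-in mtcars data set, used here as a stand-in for auto.data (an assumption).

```r
library(ggplot2)

base.plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# rows ~ columns: one row per cyl level, one column per am level.
p.grid <- base.plot + facet_grid(cyl ~ am)

# . ~ am: a single row, with one column per am level.
p.cols <- base.plot + facet_grid(. ~ am)

# cyl ~ .: a single column, with one row per cyl level.
p.rows <- base.plot + facet_grid(cyl ~ .)

print(p.grid)
print(p.cols)
print(p.rows)
```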

  21. geoms

      The final way we’ll look at to control ggplots is via geoms. The geom controls the type of plot that is displayed. We’ve already looked at one: geom_point. We could rewrite our scatter plot code more explicitly as:

          qplot(curb.weight, price, data=auto.data, geom='point')

  22. Let’s check out another geom: geom_histogram

          # geom_histogram operates with a single continuous variable.
          # Let's look at price
          qplot(price, data=auto.data, geom='histogram')
          # or via qplot's defaults
          qplot(price, data=auto.data)

  23. [Figure: histogram of price (x-axis) with count on the y-axis]

  24. Note the warning concerning binwidth. The binwidth chosen can dramatically impact how we visually interpret the distribution. It’s best to experiment with values to get a feel for the data. We can alter the binwidth by passing the option to qplot:

          qplot(price, data=auto.data, geom='histogram', binwidth=20000)
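
To see how strongly binwidth shapes the picture, one can build the same histogram twice and compare the number of bars. This sketch uses the built-in mtcars data set rather than auto.data (an assumption, since AutoData.csv is not available here), and the two binwidth values are arbitrary choices for illustration.

```r
library(ggplot2)

base.hist <- ggplot(mtcars, aes(x = mpg))

# Narrow bins show fine detail but can look noisy.
p.narrow <- base.hist + geom_histogram(binwidth = 1)

# Wide bins smooth the shape but can hide structure.
p.wide <- base.hist + geom_histogram(binwidth = 10)

# layer_data() returns the computed bins; every observation is counted once.
n.narrow <- nrow(layer_data(p.narrow))
n.wide <- nrow(layer_data(p.wide))
print(c(narrow = n.narrow, wide = n.wide))

print(p.narrow)
print(p.wide)
```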

  25. [Figure: histogram of price with binwidth=20000]

      This tells a very different story than the original!

  26. Note that our price distribution is a bit skewed. Perhaps we are not interested in higher-priced (≥ 20,000, say) cars. We can restrict our plot to lower-priced cars by setting limits:

          qplot(price, data=auto.data, geom='histogram', binwidth=450) + xlim(4000,20000)
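
One detail worth knowing is that xlim() drops observations outside the limits before the bins are counted, whereas coord_cartesian() only zooms the view. The sketch below contrasts the two on the built-in mtcars data set (a stand-in for auto.data); coord_cartesian() is not used in the slides and appears here only for comparison.

```r
library(ggplot2)

base.hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)

# xlim() sets scale limits: values outside 15-25 are removed (with a warning).
p.clipped <- base.hist + xlim(15, 25)

# coord_cartesian() zooms in without discarding any observations.
p.zoomed <- base.hist + coord_cartesian(xlim = c(15, 25))

# Compare how many observations end up in the bins of each plot.
n.clipped <- sum(suppressWarnings(layer_data(p.clipped))$count, na.rm = TRUE)
n.zoomed <- sum(layer_data(p.zoomed)$count, na.rm = TRUE)
print(c(clipped = n.clipped, zoomed = n.zoomed))
```

For a histogram the two usually look alike, but the distinction matters for statistics computed from the data, such as smoothers or boxplots.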

  27. [Figure: histogram of price restricted to the range 4000 to 20000, with binwidth=450]

  28. Just like our point geom, histogram too has aesthetics. Try the following:

          qplot(price, data=auto.data, color=drive.wheels)
          qplot(price, data=auto.data, fill=drive.wheels)
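
The two calls differ because, for bars, color is applied to the outline while fill is applied to the interior. A sketch of the same contrast in ggplot() form follows, using the built-in mtcars data set as a stand-in for auto.data (an assumption) and factor(cyl) in place of drive.wheels.

```r
library(ggplot2)

# color: each group gets a differently colored bar outline; interiors stay the default.
p.outline <- ggplot(mtcars, aes(x = mpg, color = factor(cyl))) +
  geom_histogram(binwidth = 2)

# fill: each group gets a differently colored bar interior.
# By default the groups are stacked on top of each other.
p.fill <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(binwidth = 2)

print(p.outline)
print(p.fill)
```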

  29. [Figure: two histograms of price grouped by drive.wheels (4wd, fwd, rwd), one using the color aesthetic and one using fill]

      Which one do you like best? Do you like either? How might we make it better?

  30. The colors help, but the figure is a bit busy. We can try faceting instead:

          qplot(price, data=auto.data) + facet_wrap(~drive.wheels)

  31. [Figure: histograms of price in three panels (4wd, fwd, rwd) sharing a common count axis]

  32. This makes it much easier to separate out the categories. Note that the counts vary quite a bit among the different classes, yet the count axis is the same for all. We can change this by modifying the facet_wrap call:

          qplot(price, data=auto.data) + facet_wrap(~drive.wheels, scales = 'free_y')

  33. [Figure: histograms of price in three panels (4wd, fwd, rwd), each with its own count axis]

  34. [Figure: the same three histograms (4wd, fwd, rwd) arranged in three rows, each with its own count axis]

      See ?facet_wrap for more useful options. For example, nrow=3 in the plot above.
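
Both facet_wrap options can be combined in one call. The sketch below is an illustration in ggplot() form on the built-in mtcars data set (a stand-in, since AutoData.csv is not assumed available), with cyl playing the role of drive.wheels.

```r
library(ggplot2)

# scales = "free_y" gives each panel its own count axis;
# nrow = 3 stacks the three panels vertically.
p.free <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~cyl, scales = "free_y", nrow = 3)

print(p.free)
```

Free scales make small groups readable, at the cost of making bar heights harder to compare across panels.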

  35. Try it out

      Take some time to explore the data using various geoms, aesthetics, and facets. How are the other variables related to price? Are some of the relationships stronger than others?

  36. More geoms

      There are many other geoms besides point and histogram. Try ??geom to see a list. Different geoms operate with different (combinations of) data types (i.e. categorical or continuous). As is characteristic of ggplot, geoms can be layered to create plots of increasing detail/complexity.
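
As one possible illustration of layering geoms, the sketch below draws a scatter plot and then adds a fitted straight line on top. It uses the built-in mtcars data set as a stand-in for auto.data (an assumption), and geom_smooth is just one example of a second geom; it does not appear in the slides above.

```r
library(ggplot2)

# Layer 1: the raw points. Layer 2: a linear fit drawn over them.
p.layered <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p.layered)

# Each layer keeps its own computed data: 32 points, plus the fitted line.
points.data <- layer_data(p.layered, 1)
line.data <- layer_data(p.layered, 2)
```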
