Graphics using ggplot2
Steve Bagley
somgen223.stanford.edu

Setup data
    library(tidyverse)  # provides read_csv, str_c, gather, %>%, and ggplot

    data_dir <- "https://web.stanford.edu/class/somgen223/data/"
    gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv"))
    # Reshape to tall format: one row per gene and condition
    gene_tall <- gather(gene_exp1, condition, expression_level,
                        control:treatment)

ggplot2: a package for the grammar of graphics
• The grammar of graphics is the idea that graphs are composed of known elements combined in specific ways, in the same way that types of words are assembled through the rules of English syntax to form sentences.
• ggplot2 is a package for doing this in R.
• There are other plotting packages in R, but they do not follow this conceptual model.

ggplot: the main function
• ggplot (in the package ggplot2) is the main function for constructing a graph.
• Nearly every aspect of the graph can be changed. Usually, the defaults are pretty good.
• You combine the graph produced by ggplot with plot specifications to add to or modify the graph.

How to call ggplot
    ggplot(data = BOD, mapping = aes(x = Time, y = demand)) + geom_point()
• ggplot is the main plotting function.
• data = BOD: names the data frame that contains the data to be plotted.
• mapping = aes(x = Time, y = demand): put the Time column on the x-axis and the demand column on the y-axis.
• geom_point(): plot the data as points.
• Note that you can use positional instead of named arguments to make this expression shorter:
    ggplot(BOD, aes(Time, demand)) + geom_point()
• The use of "+" to glue these operations together will be explained later.

Use geom_point for scatterplot
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point()
[Figure: scatterplot of demand vs. Time]
• aes specifies the aesthetic mapping from the data (columns) to some aspect of the graph (x, y position).
• There are 6 rows in the BOD data frame.
• There are 6 points in the graph, one for each row.
• This one-to-one correspondence between rows and points holds through much of ggplot2.
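
The row-to-point correspondence can be checked directly. The sketch below assumes ggplot2 is installed and uses its layer_data() helper to inspect what the point layer will draw; the variable name p is only for illustration.

```r
library(ggplot2)

# Build the scatterplot from the slide and keep it in a variable
p <- ggplot(BOD, aes(Time, demand)) + geom_point()

# Number of rows in the source data frame
nrow(BOD)

# Number of rows ggplot2 computed for the point layer (one per point drawn)
nrow(layer_data(p))
```

Both counts should agree, since geom_point draws one point per row.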

Example: geom_line
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_line()
[Figure: line plot of demand vs. Time]

Example: points and lines
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point() +
      geom_line()
[Figure: demand vs. Time, drawn with both points and a connecting line]

Giving arguments to plot specifications: change the size
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point(size = 5)
[Figure: scatterplot of demand vs. Time with larger points]

Giving arguments to plot specifications: change the color
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point(color = "red")
[Figure: scatterplot of demand vs. Time with red points]

Exercise: orange trees
Using the data in Orange:
1. Pull out the data for tree 2 only.
2. Plot circumference versus age for those data.

Answer: orange trees
    Orange %>%
      filter(Tree == 2) %>%
      ggplot(aes(age, circumference)) +
      geom_point()
[Figure: scatterplot of circumference vs. age for tree 2]

ggplot and the + operator
• ggplot(...) + geom_point() is a strange expression: it uses the + operator to add things (plots and plot specifications) that are not numbers.
• This uses a feature called generic functions: the types of the arguments to + determine which piece of code, called a method, to run.
• ggplot2 relies on this feature heavily.
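
To see how + can pick a method based on argument type, here is a toy sketch in base R. The class name "money" and the constructor money() are invented for this illustration; ggplot2's own methods are more elaborate, and this only shows the dispatch idea.

```r
# Hypothetical toy class: a list with an amount, tagged with class "money"
money <- function(amount) structure(list(amount = amount), class = "money")

# The method R runs when + is applied to objects of class "money"
"+.money" <- function(e1, e2) money(e1$amount + e2$amount)

# + now works on these non-numeric objects
total <- money(5) + money(7)
total$amount
class(total)
```

The same + symbol still adds ordinary numbers; R chooses the method from the class of the arguments.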

Using + in ggplot2
    plot1 <- ggplot(BOD, aes(Time, demand))
    spec1 <- geom_point()
    spec2 <- geom_line(color = "blue")
    plot1 + spec1
    plot1 + spec2
• Note you can save parts of the graph specification and then add them together.

Adding a smoother
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point() +
      geom_smooth(method = "lm")
[Figure: scatterplot of demand vs. Time with a fitted straight line]
• lm means linear model (best fit, least-squares regression).
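
For reference, the same kind of straight-line fit can be computed directly with lm() from base R. This is a sketch of the underlying model, assuming the smoother uses its default formula y ~ x; the variable name fit is only for illustration.

```r
# Fit demand as a linear function of Time by least squares
fit <- lm(demand ~ Time, data = BOD)

# Intercept and slope of the fitted line
coef(fit)
```

By default geom_smooth also draws a confidence band around the line.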

Making the plot specification depend on the data
    ## fixed size, default
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point()

    ## fixed size, given as size argument
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point(size = 5)

    ## size of each point depends on value of Time column for that point
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point(aes(size = Time))

    ## THIS CAUSES AN ERROR!
    BOD %>%
      ggplot(aes(Time, demand)) +
      geom_point(size = Time)
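
Why the last form fails: outside aes(), Time is evaluated as an ordinary R variable, and no such variable exists in the workspace. Inside aes(), the name is captured and looked up in the data frame later. The sketch below assumes no object called Time is defined in your session; result is a name chosen for illustration.

```r
library(ggplot2)

# Try the failing call and capture the error message instead of stopping
result <- tryCatch(
  {
    ggplot(BOD, aes(Time, demand)) + geom_point(size = Time)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Shows the message R produced for the unknown variable
result
```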

Combining numbers and text in a graph
    gene_exp1
    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

Plotting treatment vs control
    gene_exp1 %>%
      ggplot(aes(control, treatment)) +
      geom_point()
[Figure: scatterplot of treatment vs. control]

Plotting treatment vs control with gene labels
    gene_exp1 %>%
      ggplot(aes(control, treatment)) +
      geom_point() +
      geom_text(aes(label = gene))
[Figure: scatterplot of treatment vs. control, each point labeled with its gene name]

Control placement of text
    gene_exp1 %>%
      ggplot(aes(control, treatment)) +
      geom_point() +
      geom_text(aes(label = gene), hjust = "left", vjust = "bottom")
[Figure: scatterplot of treatment vs. control, gene labels placed above and to the right of each point]

Grouping
    gene_exp1
    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13
• Let's graph the control and treatment values separately for each gene.
• We'll need the data in tall format.
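
As a sketch of what "tall format" means here, the code below rebuilds the small table locally and reshapes it. The values are read off the gene_exp1 table shown in these slides and should be treated as an assumption; the real data come from the course CSV file. pivot_longer() is shown as the newer tidyr equivalent of gather(); gene_tall2 is a name invented for the comparison.

```r
library(tidyr)

# Local copy of the table (values assumed from the slide, not read from the CSV)
gene_exp1 <- tibble::tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# Tall format: one row per gene and condition
gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)
print(gene_tall)

# Same reshaping with the newer pivot_longer() interface (row order differs)
gene_tall2 <- pivot_longer(gene_exp1, cols = control:treatment,
                           names_to = "condition",
                           values_to = "expression_level")
```

The wide table has 3 rows; the tall one has a row for each of the 3 genes under each of the 2 conditions.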

Grouping in a graph
    gene_tall %>%
      ggplot(aes(gene, expression_level)) +
      geom_point()
[Figure: expression_level vs. gene, drawn as points]
• It would be nice if the data for each condition were grouped together (color? line?).

Use the mapping to assign color to the grouping variable
    gene_tall %>%
      ggplot(aes(gene, expression_level)) +
      geom_point(aes(color = condition))
[Figure: expression_level vs. gene, points colored by condition (control, treatment)]

Use group to form groups for geom_line
    gene_tall %>%
      ggplot(aes(gene, expression_level)) +
      geom_line(aes(group = condition, color = condition))
[Figure: expression_level vs. gene, one colored line per condition (control, treatment)]

When to use group?
• You need to include group when the number of graphical objects is not the same as the number of observations to graph.
• With geom_line, there are n endpoints, but only n-1 lines between them.
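
The effect of group can be inspected without looking at the picture. This sketch uses the built-in Orange data and ggplot2's layer_data() to count how many separate lines each version draws; the names one_path and per_tree are only for illustration.

```r
library(ggplot2)

# No grouping: all observations are joined into a single line
one_path <- ggplot(Orange, aes(age, circumference)) + geom_line()

# Grouped by Tree: one line per tree
per_tree <- ggplot(Orange, aes(age, circumference)) +
  geom_line(aes(group = Tree))

# Count the distinct groups ggplot2 assigned in each line layer
length(unique(layer_data(one_path)$group))
length(unique(layer_data(per_tree)$group))
```

The first count is one group for all the data; the second is one group per tree.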

Facets
• Most explorations of data involve making comparisons to highlight an important difference between subsets.
• One way to do this visually is to put the data for each condition in a separate graph, called a "facet".
• ggplot can do this, making sure that the facet axes are nicely lined up.

Facet example
    gene_tall %>%
      ggplot(aes(gene, expression_level)) +
      geom_point() +
      facet_wrap(vars(condition))
[Figure: expression_level vs. gene in two facets, control and treatment]

Exercise: Orange trees
    ggplot(Orange, aes(age, circumference)) +
      geom_point()
[Figure: scatterplot of circumference vs. age for all trees]
• It would be better if we visually distinguished each tree's data.
• What is the visual equivalent of group_by?

Answer: Orange trees, using facets
    ggplot(Orange, aes(age, circumference)) +
      geom_point() +
      facet_wrap(vars(Tree))
[Figure: circumference vs. age, one facet per tree (panels 3, 1, 5, 2, 4)]

Answer: Orange trees, using grouping
    ggplot(Orange, aes(age, circumference)) +
      geom_point(aes(color = Tree)) +
      geom_line(aes(color = Tree, group = Tree))
[Figure: circumference vs. age, points and lines colored by Tree]

Labeling the graph
    ggplot(Orange, aes(age, circumference)) +
      geom_point(aes(color = Tree)) +
      geom_line(aes(color = Tree, group = Tree)) +
      labs(x = "Age (days)",
           y = "Circumference (mm)",
           title = "Circumference vs. age for orange trees",
           subtitle = "Data from built-in data frame Orange")
[Figure: the grouped Orange plot with title, subtitle, and axis labels "Age (days)" and "Circumference (mm)"]

Including the origin (0, 0)
    ggplot(Orange, aes(age, circumference, group = Tree)) +
      geom_point(aes(color = Tree)) +
      geom_line(aes(color = Tree, group = Tree)) +
      expand_limits(x = 0, y = 0)
[Figure: the grouped Orange plot with both axes extended to include 0]