working with pipes

Working with pipes: Computational Pipelines. R.W. Oldford

Pipes: French surrealist painter René Magritte's 1929 painting The Treachery of Images. "The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, ..."


  1. magrittr - compound assignment
      Or, perhaps most perversely, a different pipe connector could be used: the so-called compound assignment pipe operator %<>%. N.B. this has the side effect of changing the original data. For illustration, first copy mtcars:
          new_mtcars <- mtcars   # make a copy
          head(new_mtcars, 2)
          ##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
          ## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
          ## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
      Now use the compound assignment %<>% on new_mtcars:
          new_mtcars %<>%   # use compound assignment
            subset(am == 0) %>%
            aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
            transform(kpl = mpg %>% multiply_by(0.4251))
          # what has happened to new_mtcars?
          head(new_mtcars, 2)
          ##   cyl   mpg   disp     hp drat   wt  qsec vs am gear carb      kpl
          ## 1   4 22.90 135.87  84.67 3.77 2.94 20.97  1  0 3.67 1.67 9.734790
          ## 2   6 19.12 204.55 115.25 3.42 3.39 19.21  1  0 3.50 2.50 8.127912
      It's as if the data got passed through the pipe and bounced back at the end of the flow!

  2. magrittr - simplified program control with pipes
      %>% works with any function, provided it accepts the output from the pipe as its first argument. For example, we could also create a plot using pipes:
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            with(plot(x = wt, y = lp100k, col = "firebrick",
                      xlab = "Weight", ylab = "litres per 100 km."))
      [Figure: scatterplot of litres per 100 km. against Weight]
      Note that data.frame() here appended the column lp100k to its input data (e.g. try data.frame(mtcars, mtcars)).

  3. magrittr - pipes and with()
      Note the use of with() to move things along through the pipe. For example, will the following work?
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            with(plot(x = wt, y = lp100k, col = "firebrick",
                      xlab = "Weight", ylab = "litres per 100 km.")) %>%
            with(lines(x = range(wt), y = range(lp100k),
                       col = "steelblue", lwd = 2))
      It does NOT work. The penultimate piece of pipe passes NULL, the output of with(plot(...)), on to the final with(lines(...)), so lines() cannot determine the values of wt and of lp100k.

  4. magrittr - pipes and with()
      Note the use of with() to move things along through the pipe. For example, will the following work?
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            with(plot(x = wt, y = lp100k, col = "firebrick",
                      xlab = "Weight", ylab = "litres per 100 km.")) %>%
            axis(side = 3)
      It works. The penultimate piece of pipe passes on NULL, the output of with(plot(...)), and axis() happily accepts (and ignores) it: because the argument side was named and specified, the NULL is matched to the next argument, at, whose default is NULL anyway. Had side not been named, the NULL would have been taken as side and the call would have failed.

  5. magrittr - pipes and with()
      Note the use of with() to move things along through the pipe. For example, will the following work?
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            with({
              plot(x = wt, y = lp100k, col = "firebrick",
                   xlab = "Weight", ylab = "litres per 100 km.")
              lines(x = range(wt), y = range(lp100k),
                    col = "steelblue", lwd = 2)
            }) %>%
            axis(side = 3)
      The with({...}) piece passes NULL (the value of its last statement, lines()) on to axis(). It works only because axis(side = 3) is called on the active graphics device, wherever that might be, which is pretty bad programming style.

  6. magrittr - pipes and with()
      The preferred use of with():
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            with({
              plot(x = wt, y = lp100k, col = "firebrick",
                   xlab = "Weight", ylab = "litres per 100 km.")
              lines(x = range(wt), y = range(lp100k),
                    col = "steelblue", lwd = 2)
              axis(side = 3, col = "blue", col.ticks = "forestgreen")
            })
      This time the value of the with(...) is that of axis(), which returns the locations of the axis tick marks. You really need to know what each piece of a pipe passes on to the next!
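One way to find out what a piece of pipe passes on is to capture its value and inspect it. The following is a minimal sketch, not taken from the slides: it assumes a null PDF device (pdf(NULL)) so that it can run without opening a window, and the names plot_value and axis_value are hypothetical.

```r
library(magrittr)

pdf(NULL)  # assumption: draw on a null device so no window or file is needed

# plot() returns NULL (invisibly), so NULL is what the next piece would receive
plot_value <- mtcars %>% with(plot(wt, mpg))
is.null(plot_value)

# the value of a braced expression is the value of its last statement,
# here axis(), which returns the tick locations it drew
axis_value <- mtcars %>% with({
  plot(wt, mpg)
  axis(side = 3)
})
is.numeric(axis_value)

dev.off()
```

Capturing the value like this, before adding the next pipe, is a cheap check that the downstream function will receive what it expects.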

  7. magrittr - the exposition pipe %$%
      Instead of using with(), an exposition pipe %$% will do nearly the same. It exposes the names of the data on the LHS for use in the RHS expression. For example, the following allows plot() to refer to the names wt and lp100k of its input data.frame.
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %$%   # exposition pipe
            plot(wt, lp100k, col = "firebrick",
                 xlab = "Weight", ylab = "litres per 100 km.")
      [Figure: scatterplot of litres per 100 km. against Weight]

  8. magrittr - the exposition pipe %$%
      Note, however, that the pipe has ended with the plot (no further piping). We cannot, for example, now pipe to lines() or to axis() and expect to continue to refer to the names of the dataset.
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %$%   # exposition pipe
            plot(wt, lp100k, col = "firebrick",
                 xlab = "Weight", ylab = "litres per 100 km.") %>%
            lines(x = range(mtcars$wt),
                  # mtcars itself has no lp100k column, so it is recomputed here
                  y = range(100 / (mtcars$mpg * 0.4251)),
                  col = "steelblue", lwd = 2) %>%
            axis(side = 3, col = "blue", col.ticks = "forestgreen")
      [Figure: scatterplot of litres per 100 km. against Weight, with an added axis along the top]
      Instead, as above, we had to reintroduce the data mtcars, which breaks the pipe metaphor.

  9. magrittr - the exposition pipe %$%
      Fortunately, this problem is easily resolved using braces {}.
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %$%   # exposition pipe
            {
              plot(x = wt, lp100k, col = "firebrick",
                   xlab = "Weight", ylab = "litres per 100 km.")
              lines(x = range(wt), y = range(lp100k),
                    col = "steelblue", lwd = 2)
              axis(side = 3, col = "blue", col.ticks = "forestgreen")
            }
      [Figure: scatterplot of litres per 100 km. against Weight, with a line and an added axis along the top]

  10. magrittr - the exposition pipe %$% and braces {}
      Can the piping continue? Of course. Just be mindful of the last output; it might not be what you need. Remember that the value of a braced expression is the value of its last statement.
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %$%   # exposition pipe
            {
              plot(x = wt, lp100k, col = "firebrick",
                   xlab = "Weight", ylab = "litres per 100 km.")
              lines(x = range(wt), y = range(lp100k),
                    col = "steelblue", lwd = 2)
              axis(side = 3, col = "blue", col.ticks = "forestgreen")
            } %>%
            print
          ## [1] 2.5 3.0 3.5 4.0 4.5 5.0 5.5
      [Figure: scatterplot of litres per 100 km. against Weight, with a line and an added axis along the top]
      The printed values are just the numeric tick locations from axis().

  11. magrittr - the tee pipe %T>%
      The tee pipe %T>% is like the usual pipe %>% except that it returns the value of the LHS instead of the value of the RHS. This can be very handy. For example,
          library(knitr)
          mtcars %$%                # exposition pipe
            lm(mpg ~ wt) %T>%       # tee pipe, fit is passed on through the next piece
            {
              plot(x = .$model$wt, y = .$model$mpg, col = "firebrick",
                   main = "1974 Motor Trend magazine",
                   xlab = "Weight", ylab = "miles per US gallon")
              abline(.$coef, col = "steelblue")
            } %>%                   # standard pipe
            coefficients %>%        # standard pipe
            round %>%               # standard pipe
            kable
      [Figure: "1974 Motor Trend magazine", miles per US gallon against Weight, with the fitted line]
      The resulting table of rounded coefficients (column x): (Intercept) 37, wt -5.

  12. magrittr - the tee pipe %T>% and loon
      The tee pipe %T>% is especially handy with loon plots. For example,
          set.seed(314159)
          library(loon)
          mtcars %$%     # exposition pipe
            l_plot(x = wt, y = mpg, color = cyl,
                   glyph = c("ocircle", "ccircle")[am + 1],
                   size = hp / 5, itemLabel = rownames(.),
                   title = "1974 Motor Trend magazine",
                   xlabel = "Weight (1000s of lbs)",
                   ylabel = "miles per US gallon") %>%   # standard pipe, plot passed on
            l_configure("selected" = sample(c(TRUE, rep(FALSE, 5)),
                                            length(.["x"]), replace = TRUE)) %T>%
            # tee pipe, because l_scaleto_selected returns NULL and plot needs to be passed on
            l_scaleto_selected %>%   # standard pipe since l_configure returns plot
            l_configure("showGuides" = TRUE, "showScales" = TRUE) %>%   # standard pipe
            l_configure("showItemLabels" = TRUE) %T>%
            # tee pipe because a layer would be returned by l_layer_line
            l_layer_line(x = sort(.["x"]),
                         y = predict(lm(mpg ~ wt,
                                        data = data.frame(wt = .["x"], mpg = .["y"])),
                                     newdata = data.frame(wt = sort(.["x"]))),
                         color = l_getOption("select-color"),
                         linewidth = 4) -> p   # forward assignment of plot to p

  13. magrittr - the tee pipe %T>% and loon
      which produces
          plot(p)
      [Figure: loon plot "1974 Motor Trend magazine", miles per US gallon against Weight (1000s of lbs), scaled to the selected points]

  14. magrittr - lots of ways to write the code
      For example,
          set.seed(314159)
          library(loon)
          mtcars %$% {   # exposition pipe to several statements
            l_hist(disp, xlabel = "Displacement (cubic inches)",
                   linkingGroup = "mtcars") ->> h   # global assignment
            l_plot(x = wt, y = mpg, linkingGroup = "mtcars",
                   color = cyl,
                   glyph = c("ocircle", "ccircle")[am + 1],
                   size = hp / 5, itemLabel = rownames(.),
                   title = "1974 Motor Trend magazine",
                   xlabel = "Weight (1000s of lbs)",
                   ylabel = "miles per US gallon",
                   showGuides = TRUE, showScales = TRUE,
                   showItemLabels = TRUE,
                   selected = disp < median(disp))
          } %T>%               # tee pipe because next expression returns NULL
            l_scaleto_selected %T>%   # tee pipe again ... why?
            l_layer_line(x = sort(.["x"][.["selected"]]),
                         y = predict(loess(mpg ~ wt,
                                           data = data.frame(wt = .["x"], mpg = .["y"]),
                                           # fit only the selected observations
                                           subset = .["selected"]),
                                     newdata = data.frame(wt = sort(.["x"][.["selected"]]))),
                         color = l_getOption("select-color"),
                         linewidth = 4, dash = c(10, 4),
                         index = "end"   # bottommost layer
                         ) -> p          # forward assignment

  15. magrittr - the tee pipe %T>% and loon
      From the previous plots we can still make adjustments and then export the results to a grid graphic.
          h["showStackedColors"] <- TRUE
          l_scaleto_world(p)
          # Get the grobs necessary for grid.arrange
          gh <- loonGrob(h)
          gp <- loonGrob(p)
          library(gridExtra)   # contains some friendly grid functions like
          grid.arrange(gh, gp, nrow = 1)
      [Figure: side by side, a histogram of Frequency against Displacement (cubic inches) and the plot "1974 Motor Trend magazine" of miles per US gallon against Weight (1000s of lbs)]

  16. magrittr - same approach (sort of) with base graphics
      For example,
          set.seed(314159)
          mtcars %$% {   # exposition pipe to several statements
            savePar <- par(mfrow = c(1, 2))
            hist(disp, xlab = "Displacement (cubic inches)")
            plot(x = wt, y = mpg, col = cyl,
                 pch = c(19, 21)[am + 1],
                 cex = hp / 50,   # divide by 50 now
                 main = "1974 Motor Trend magazine",
                 xlab = "Weight (1000s of lbs)",
                 ylab = "miles per US gallon")
            orderX <- order(wt)
            lines(x = sort(wt),
                  # fit all observations
                  y = predict(loess(mpg ~ wt,
                                    data = data.frame(wt = wt[orderX],
                                                      mpg = mpg[orderX]))),
                  col = "grey", lwd = 4, lty = 2)
            par(savePar)
          }   # no assignment, no tee pipe
      [Figure: side by side, "Histogram of disp" (Frequency against Displacement (cubic inches)) and "1974 Motor Trend magazine" (miles per US gallon against Weight (1000s of lbs))]

  17. On using magrittr pipes
      1. Pipes connect a left hand side expression, LHS, to a right hand side expression, RHS, as in
          LHS %pipe% RHS
         where the result of the LHS expression is passed as the first argument to the RHS expression. The result of the LHS can be referenced as a dot . in the RHS.
      2. Pipelines are a series of connected pipes:
          expr_1 %pipe% expr_2 %pipe% expr_3 %pipe% ... %pipe% expr_k
         These are evaluated left to right in pairs, as in
          ((...((expr_1 %pipe% expr_2) %pipe% expr_3) %pipe% ...) %pipe% expr_k)
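The two rules above can be checked on small inputs. This is a minimal sketch using the standard pipe %>%; the variable names are hypothetical and the inputs were chosen only so that the results are easy to verify by hand.

```r
library(magrittr)

# A pipeline is evaluated left to right, so these two are the same computation
piped  <- mtcars %>% subset(am == 0) %>% nrow
nested <- nrow(subset(mtcars, am == 0))
identical(piped, nested)

# By default the LHS becomes the first argument of the RHS ...
first_arg <- c(4, 9, 16) %>% sqrt %>% sum   # same as sum(sqrt(c(4, 9, 16)))

# ... unless a dot appears as an argument, in which case the LHS goes there instead
dot_arg <- 3 %>% rep("a", times = .)        # same as rep("a", times = 3)
```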

  18. On using magrittr pipes
      3. There are four different pipes: %>%, %T>%, %$%, and %<>%.
        - Standard: LHS %>% RHS
          - result of RHS is passed on
        - Tee: LHS %T>% RHS
          - result of LHS is passed on (the result of RHS is discarded)
        - Exposition: LHS %$% RHS
          - names of result of LHS are exposed to RHS
          - result of RHS is passed on
        - Compound assignment: LHS %<>% RHS
          - result of RHS is passed on
          - result of pipeline is finally assigned to LHS
          - changes LHS by side effect (e.g. try iris[,1:4] %<>% sqrt)
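As a compact illustration of what each of the four pipes passes on, here is a small sketch on toy inputs. The vector x, the tiny data frame, and the result names are all hypothetical, introduced only for this example.

```r
library(magrittr)

x <- c(16, 9, 4)

# Standard pipe: the value of the RHS is passed on
standard <- x %>% sqrt

# Tee pipe: the RHS is evaluated, but the value of the LHS is passed on
tee <- x %T>% sqrt

# Exposition pipe: the names a and b of the LHS are visible in the RHS
exposed <- data.frame(a = 1:3, b = 4:6) %$% sum(a * b)

# Compound assignment pipe: the result is assigned back to the LHS
y <- x
y %<>% sqrt
```

After running this, standard and y both hold the square roots, tee is still the original x, and exposed is a single number computed from the two columns.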

  19. On using magrittr pipes
      4. Pipelines are most easily understood when it is essentially the same object being pushed through the pipes.
        - Example: data construction pipeline
          mtcars %>%
            subset(am == 0) %>%
            transform(lp100k = 100 / (mpg * 0.4251)) -> autoTransData
        - Example: model pipeline
          autoTransData %$%
            lm(lp100k ~ wt) %>%
            predict(interval = "prediction") -> autoTransFit

  20. On using magrittr pipes
      4. Continued
        - Example: plot pipeline (mainly implicit pipeline)
          data.frame(autoTransData, autoTransFit)[order(autoTransData[, "wt"]), ] %T>%
            with({
              plot(wt, lp100k, ylim = extendrange(c(lwr, fit, upr)),
                   xlab = "Weight (1000s of lbs)",
                   ylab = "Litres per 100 kilometres",
                   main = "Cars with automatic transmissions",
                   col = "firebrick", pch = 19)
              lines(wt, fit, col = "steelblue")
              lines(wt, lwr, col = "firebrick", lty = 2)
              lines(wt, upr, col = "firebrick", lty = 2)
            }) -> autoTransDataFit
      [Figure: "Cars with automatic transmissions", Litres per 100 kilometres against Weight (1000s of lbs), with fitted line and prediction limits]

  21. On using magrittr pipes
      4. Continued
        - Example: loon plot pipeline (more obvious pipeline)
          data.frame(autoTransData, autoTransFit)[order(autoTransData[, "wt"]), ] %>%
            with({
              l_plot(x = wt, y = lp100k,
                     linkingGroup = "automatic transmissions",
                     xlabel = "Weight (1000s of lbs)",
                     ylabel = "Litres per 100 kilometres",
                     title = "Cars with automatic transmissions",
                     color = "firebrick", glyph = "circle") %T>%
                l_layer_line(x = wt, y = fit, color = "steelblue",
                             index = "end") %T>%
                l_layer_line(x = wt, y = lwr, dash = c(5, 5),
                             color = "firebrick", index = "end") %T>%
                l_layer_line(x = wt, y = upr, dash = c(5, 5),
                             color = "firebrick", index = "end") %T>%
                l_scaleto_world
            }) -> p
          plot(p)
      [Figure: loon plot "Cars with automatic transmissions", Litres per 100 kilometres against Weight (1000s of lbs)]

  22. On using magrittr pipes
      When should you use pipelines?
        - Not as a general programming style.
        - More for data analysis, data wrangling, ...
          - track your analysis in easy to read pieces
          - use many sets of small pipelines
          - helps you understand and identify chunks of analysis
          - interrupt the pipe anywhere to make sure you are getting what you intended
          - edit and re-run (true for any commands in a file)
          - provides reusable chunks that might be adapted and applied to different data and analyses
          - if the pipeline becomes too difficult to read, it probably needs to be separated into different pieces
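To make the points about reusable chunks and interrupting the pipe concrete, here is a small sketch. It relies on the magrittr feature, already used earlier in FUN = . %>% mean %>% round(2), that a pipeline beginning with a dot defines a function. The names to_lp100k and fourCyl are hypothetical.

```r
library(magrittr)

# A pipeline that starts with a dot is a function of its input
to_lp100k <- . %>%
  subset(am == 0) %>%
  transform(lp100k = 100 / (mpg * 0.4251))

# Use the chunk as one piece of a pipeline
autoTransData <- mtcars %>% to_lp100k

# Interrupt the pipe after any piece to check what is being passed on
mtcars %>% subset(am == 0) %>% head(3)

# The same chunk applied to different data (here, only the 4 cylinder cars)
fourCyl <- mtcars %>% subset(cyl == 4) %>% to_lp100k
```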

  23. A pipeline model for statistical graphics Lee Wilkinson’s monumental The Grammar of Graphics begins with a pipeline model for constructing statistical graphics: Each step in the pipeline transforms its input to produce output for the next step. The order of steps is essential, though not all need be there for every plot. Because the pipeline consists of separate components, the final graphic that is rendered can be simply and sometimes dramatically changed by making changes to a single component in the pipeline.

  24. ggplot2 – a grammar of graphics for R
          library(ggplot2)
      Inspired by Wilkinson's "Grammar of Graphics", ggplot2 is a "layered" grammar of graphics. Much like Wilkinson's original grammar, ggplot2 uses a pipeline model for its graphics construction in that a plot is built in an ordered series of steps, where each step operates on the output of its immediate predecessor in the line. Departing from the grammar, ggplot2 slightly mixes metaphors in that each step in the pipeline can (typically) be thought of as adding a layer to all that preceded it.
      From the ggplot2 book: "The layered grammar of graphics (Wickham 2009) builds on Wilkinson's grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic."
      Notationally, the components of the pipeline appear in sequence, connected one to the next via an intervening + sign, thus emphasizing each as an addition of a layer (or of some further processing of the plot). Note that the + sign mixes the concepts of layer and of pipeline.

  25. Example - South African heart disease
      Consider the SAheart data from the package ElemStatLearn. This is a sample from a retrospective study of heart disease in males from a high-risk region of the Western Cape, South Africa. There are 462 cases and 10 variates. The first few observations (cases) are shown below.
          sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
          160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
          144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
          118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
          170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
          134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
          132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0
      For example, sbp denotes "systolic blood pressure", ldl "low density lipoprotein cholesterol", famhist "family history of heart disease", age "age at onset" (in years), and chd indicates whether the patient has coronary heart disease or not (a response). (See help(SAheart, package = "ElemStatLearn") for details.)

  26. Constructing a plot - the pipeline
      In the grammar of graphics, a plot processes each component in turn.
          ggplot(data = SAheart)
      First the data.

  27. Constructing a plot - pipeline
      In the grammar of graphics, a plot processes each component in turn.
          ggplot(data = SAheart) +
            aes(x = age, y = chd)
      [Figure: plot with axes age (x) and chd (y)]
      Then the mapping of the data to plot "aesthetics".

  28. Constructing a plot - pipeline
      In the grammar of graphics, a plot processes each component in turn.
          ggplot(data = SAheart) +
            aes(x = age, y = chd) +
            geom_point()
      [Figure: chd against age, points]
      Then the geometry.

  29. Constructing a plot - pipeline
      In the grammar of graphics, a plot processes each component in turn.
          ggplot(data = SAheart) +
            aes(x = age, y = chd) +
            geom_point() +
            geom_smooth()
      [Figure: chd against age, points with a smooth]
      Which can have several further steps in the pipeline.

  30. Constructing a plot
      Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.
          ggplot(data = SAheart, mapping = aes(x = age, y = chd))
      [Figure: plot with axes age (x) and chd (y)]
      The base display with mapping.

  31. Constructing a plot
      Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.
          ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
            geom_point()
      [Figure: chd against age, points]
      Here the + is adding layers.

  32. Constructing a plot
      Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.
          ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
            geom_point() +
            geom_smooth()
      [Figure: chd against age, points with a smooth]
      Here the + is adding layers.

  33. Constructing a plot - separate mappings
      Alternatively, we could deliberately associate only the data with the plot, forcing the mapping of the data to aesthetics within each individual component layer:
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points]
      The mapping is explicit for each layer.

  34. Constructing a plot - separate mappings
      What would the following plot look like?
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth()

  35. Constructing a plot - separate mappings
      What would the following plot look like?
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth()
      It fails ... why? How could it be fixed?

  36. Constructing a plot - separate mappings
      What would the following plot look like?
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth()
      It fails ... why? How could it be fixed?
      Cautionary note: the ggplot2 grammar mixes the two metaphors of "layers" and "pipes". Just because an aesthetic precedes a component in the pipeline does not mean that it is available for use.

  37. Constructing a plot - separate mappings
      Solution 1: explicitly give the mapping for each layer:
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points with a smooth]

  38. Constructing a plot - separate mappings
      Solution 2: provide aesthetics upstream:
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd)) +
            aes(x = age, y = chd) +
            geom_smooth()
      [Figure: chd against age, points with a smooth]
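Where a mapping lives can be inspected directly on the plot object. The sketch below is an illustration under stated assumptions: mtcars stands in for SAheart so that the code is self-contained, and it reads the mapping and layers components of the plot object with $, which may need a different accessor in other ggplot2 versions.

```r
library(ggplot2)

# mtcars stands in for SAheart so the sketch is self-contained (assumption)
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

length(p$mapping)              # plot-level mapping is empty: nothing to inherit
names(p$layers[[1]]$mapping)   # the x and y mapping lives only in the point layer

# Adding aes() to the sum updates the plot-level mapping, which every
# layer without its own x and y (such as the smooth) can then inherit
q <- p + aes(x = wt, y = mpg) + geom_smooth(method = "loess")
names(q$mapping)
```

This suggests why a bare geom_smooth() has nothing to work with in the first plot: a layer inherits from the plot-level mapping, not from the layers that precede it.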

  39. Constructing a plot - separate mappings
          ggplot(data = SAheart) +
            geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
            geom_smooth(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points coloured by famhist (Absent, Present), with a single smooth]

  40. Constructing a plot - shared and separate mappings
          ggplot(data = SAheart) +
            aes(group = famhist) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points with a smooth for each famhist group]

  41. Constructing a plot - shared and separate mappings
          ggplot(data = SAheart, mapping = aes(group = famhist)) +
            geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
            geom_smooth(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points coloured by famhist (Absent, Present), with a smooth for each famhist group]

  42. Constructing a plot - shared and separate mappings
          ggplot(data = SAheart, mapping = aes(group = famhist, col = famhist)) +
            geom_point(mapping = aes(x = age, y = chd)) +
            geom_smooth(mapping = aes(x = age, y = chd))
      [Figure: chd against age, points and smooths both coloured by famhist (Absent, Present)]

  43. Constructing a plot
      Alternatively, we could split the plot into two pieces by facetting:
          ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
            geom_point(col = "steelblue", size = 3, alpha = 0.4) +
            geom_smooth(method = "loess", col = "steelblue") +
            facet_wrap(~ famhist)
      [Figure: chd against age in two panels, Absent and Present]

  44. Components of the layered grammar
      In the grammar of ggplot2, a plot is a structured combination of:
        - a dataset,
        - a set of mappings from variates to aesthetics,
        - one or more layers, each composed of
          - a geometric object,
          - a statistical transformation,
          - a position adjustment, and
          - (optionally) its own dataset and aesthetic mappings
        - a scale for each aesthetic mapping,
        - a coordinate system,
        - a facetting specification

  45. Geometric objects
      There are a variety of geometric objects that can be added to a plot:
        - geom_abline(), geom_hline(), geom_vline(), geom_curve(), geom_segment(), geom_step()
        - geom_label(), geom_text()
        - geom_point(), geom_smooth(), geom_crossbar(), geom_errorbar(), geom_errorbarh(), geom_linerange(), geom_pointrange()
        - geom_rect(), geom_raster(), geom_area(), geom_ribbon(), geom_tile()
        - geom_bar(), geom_col()
        - geom_dotplot(), geom_boxplot(), geom_histogram(), geom_freqpoly(), geom_density(), geom_violin(), geom_quantile(), geom_qq()
        - geom_bin2d(), geom_density2d(), geom_hex()
        - geom_contour()
        - geom_map(), geom_polygon()
      Each of these has its own arguments, including mapping, data, stat, et cetera.

  46. Geometric objects - adding to plots
      Beginning with a plot, different geometric objects may be added. For example:
          p <- ggplot(data = SAheart, mapping = aes(x = tobacco, y = sbp))
          p
      [Figure: plot with axes tobacco (x) and sbp (y)]

  47. Geometric objects - points and density
      Beginning with a plot, different geometric objects may be added. For example:
          p + geom_point() +
            geom_density_2d(lwd = 1.5, col = "steelblue")
      [Figure: sbp against tobacco, points with 2d density contours]

  48. Geometric objects - histogram
          h <- ggplot(data = SAheart, mapping = aes(x = adiposity))
          h + geom_histogram(bins = 10, fill = "steelblue",
                             col = "black", alpha = 0.5)
      [Figure: histogram of count against adiposity]
      Note that had we tried to layer the histogram on top of p, it would have inherited from p a y aesthetic (namely y = sbp), which does not make sense for a histogram.

  49. Geometric objects - histogram
          h + geom_histogram(mapping = aes(y = ..density..),
                             bins = 10, fill = "steelblue",
                             col = "black", alpha = 0.5)
      [Figure: histogram of density against adiposity]
      A y aesthetic that does make sense for a histogram is ..density.., which forces the scaling of the vertical axis so that the histogram has unit area. Note that the x aesthetic was inherited from h.
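In ggplot2 version 3.3.0 and later, the same computed variable can be written as after_stat(density) rather than ..density..; the sketch below uses that form. It is an illustration only: mtcars stands in for SAheart so that the code is self-contained, and the names h_demo, bars and total_area are hypothetical. It checks the unit-area claim by summing bar height times bar width.

```r
library(ggplot2)

# mtcars stands in for SAheart so the sketch is self-contained (assumption)
h_demo <- ggplot(data = mtcars, mapping = aes(x = disp)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 10)

# layer_data() returns the computed values behind the first layer
bars <- layer_data(h_demo, 1)

# area of the histogram: sum of (bar height) x (bar width)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```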

  50. Geometric objects - density scale histogram
      Provided we supply a y aesthetic mapping, a histogram could therefore be added to p as well.
          p + geom_histogram(mapping = aes(x = adiposity, y = ..density..),
                             bins = 10, fill = "steelblue",
                             col = "black", alpha = 0.5)
      [Figure: density scale histogram of adiposity, with the axes still labelled tobacco (x) and sbp (y)]
      Note:
        - the change in vertical scale matches the histogram
        - the axes labels do not match the aesthetics of the histogram (though the tick marks and values happen to)
      Because this is only a grammar, it is as easy to make silly visualizations as it is to make silly sentences.

  51. Geometric objects - layering effect
      The order of layering (on top of h now) matters:
          h + geom_histogram(mapping = aes(y = ..density..),
                             bins = 10, fill = "steelblue",
                             col = "black", alpha = 0.5) +
            geom_density(mapping = aes(y = ..density..),
                         fill = "grey", alpha = 0.5)
      [Figure: density against adiposity, density estimate drawn over the histogram]
      Note that the y aesthetic had to be repeated here ...

  52. Geometric objects - layering effect
      Switch the order of addition:
          h + geom_density(mapping = aes(y = ..density..),
                           fill = "grey", alpha = 0.5) +
            geom_histogram(mapping = aes(y = ..density..),
                           bins = 10, fill = "steelblue",
                           col = "black", alpha = 0.5)
      [Figure: density against adiposity, histogram drawn over the density estimate]
      Note that the aesthetics need to be repeated here ...

  53. Geometric objects - bar charts
          ggplot(SAheart) +
            geom_bar(mapping = aes(x = factor(chd), fill = famhist)) +
            labs(x = "chd", title = "South African heart disease") +
            coord_flip()
      [Figure: "South African heart disease", horizontal bars of count for chd 0 and 1, filled by famhist (Absent, Present)]
      Which makes you wonder how the data were collected.

  54. Geometric objects
      A different scatterplot:
          p2 <- ggplot(data = SAheart, mapping = aes(x = sqrt(age), y = sbp))
          p2 + geom_point()
      [Figure: sbp against sqrt(age), points]

  55. Geometric objects
      Note that each geometric object has its own arguments and properties that can be set.
          p2 + geom_point(col = "red", size = 3, pch = 21,
                          fill = "yellow", alpha = 0.5) +
            geom_smooth(method = "loess", col = "steelblue",
                        lty = 2, lwd = 1.5, alpha = 0.2)
      [Figure: sbp against sqrt(age), points with a loess smooth]

  56. Geometric objects
      Aesthetics apply to every point individually.
          p2 + geom_point(mapping = aes(size = obesity),
                          fill = "steelblue", col = "black",
                          pch = 21, alpha = 0.4) +
            geom_smooth(method = "loess", col = "yellow",
                        lty = 2, lwd = 1.5, alpha = 0.2)
      [Figure: sbp against sqrt(age), point size given by obesity, with a loess smooth]

  57. Geometric objects
      Aesthetics apply to every point individually.
          p2 + geom_point(mapping = aes(size = obesity, fill = tobacco),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_smooth(method = "loess", col = "yellow",
                        lty = 2, lwd = 1.5, alpha = 0.2)
      [Figure: sbp against sqrt(age), point size given by obesity and fill by tobacco, with a loess smooth]

  58. Geometric objects
      The data may change with each layer.
          heartAttack <- SAheart[, "chd"] == 1
          hAplot <- p2 +
            geom_point(data = SAheart[heartAttack, ],
                       mapping = aes(size = obesity),
                       alpha = 0.4, col = "black",
                       pch = 21, fill = "steelblue")
          hAplot
      [Figure: sbp against sqrt(age) for the chd == 1 cases only, point size given by obesity]

  59. Geometric objects
      The data may change with each layer.
          qboth <- hAplot +
            geom_point(data = SAheart[!heartAttack, ],   # not heartAttack
                       mapping = aes(size = obesity),
                       alpha = 0.4, col = "black",
                       pch = 21, fill = "firebrick")
          qboth
      [Figure: sbp against sqrt(age), both groups of cases, point size given by obesity]

  60. Geometric objects
      The data may change with each layer.
          qboth +
            geom_smooth(data = SAheart[heartAttack, ],
                        method = "loess", col = "steelblue", alpha = 0.4) +
            geom_smooth(data = SAheart[!heartAttack, ],
                        method = "loess", col = "firebrick", alpha = 0.4)
      [Figure: sbp against sqrt(age), point size given by obesity, with a separate loess smooth for each group]

  61. Geometric objects
      The data may change with each layer.
          qboth + geom_smooth(method = "loess")
      [Figure: sbp against sqrt(age), point size given by obesity, with a single loess smooth]
      Note that the smooth is using all of the data here.

  62. Geometric objects
      The data may change with each layer.
          qboth + geom_smooth(mapping = aes(color = factor(chd)),
                              method = "loess")
      [Figure: sbp against sqrt(age), point size given by obesity, with loess smooths coloured by factor(chd)]
      Here the smooth is separate for each colour given by chd as a factor. Note ggplot's default colours.

  63. Geometric objects
      The colours can be coordinated by relying on the original data and using chd as a factor:
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_smooth(mapping = aes(col = factor(chd)),
                        method = "loess", lwd = 1.5, alpha = 0.2)
      [Figure: sbp against sqrt(age), point size given by obesity, fill and smooth colour given by factor(chd)]
      Here the smooth is separate for each colour given by chd as a factor.

  64. Scales
      A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, ...).
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_smooth(mapping = aes(col = factor(chd)),
                        method = "loess", lwd = 1.5, alpha = 0.2) +
            scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
            scale_color_manual("chd", values = c("steelblue", "firebrick"))
      [Figure: sbp against sqrt(age), point size given by obesity, fill and smooth colour given by chd]
      ... gets your own "scale" values for colour and for fill.

  65. Scales
      A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, ...).
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_smooth(mapping = aes(col = factor(chd)),
                        method = "loess", lwd = 1.5, alpha = 0.2) +
            scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
            scale_color_manual("chd", values = c("steelblue", "firebrick")) +
            scale_size("obesity", breaks = seq(0, 100, 5))
      [Figure: sbp against sqrt(age), with the obesity size legend now showing breaks every 5 units]
      ... additionally gets your own "scale" values for point size (which is proportional to area).

  66. Scales
      A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, ...).
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_smooth(mapping = aes(col = factor(chd)),
                        method = "loess", lwd = 1.5, alpha = 0.2) +
            scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
            scale_color_manual("chd", values = c("steelblue", "firebrick")) +
            scale_size_area("obesity", breaks = seq(0, 100, 5))
      [Figure: sbp against sqrt(age), with the obesity size legend showing breaks every 5 units]
      ... Now a zero value gives a zero area.

  67. Position scales
      There are two position scales: horizontal (x) and vertical (y).
          p + geom_point(alpha = 0.5, size = 1) +
            scale_x_continuous(limits = c(0, 40)) +
            scale_y_continuous(limits = c(75, 225))
      [Figure: sbp against tobacco, points, with the axis limits extended]

  68. Position scales
      There are two position scales: horizontal (x) and vertical (y).
          p + geom_point(alpha = 0.5, size = 1) +
            xlim(0, 40) + ylim(75, 225)
      [Figure: sbp against tobacco.]

  69. Position scales
      There are two position scales: horizontal (x) and vertical (y).
          p + aes(x = tobacco + 1) +
            geom_point(alpha = 0.5, size = 1) +
            scale_x_log10()
      [Figure: sbp against tobacco + 1, with a log10 x axis.]
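
The shift to tobacco + 1 in the log-scale example is presumably there because a log axis has no place for a zero, and the earlier plots of tobacco start at 0. A minimal base R sketch of the arithmetic, using made-up values rather than the SAheart data (the vector below is an assumption for illustration only):

```r
# Hypothetical tobacco values (not the SAheart data); note the zero.
tobacco <- c(0, 0.5, 2, 9, 29)

# Positions a log10 axis would need without and with the shift.
unshifted <- log10(tobacco)      # first value is -Inf: zero cannot be placed
shifted   <- log10(tobacco + 1)  # first value is 0: every point is finite

unshifted
shifted
```

With the shift, an observation of zero lands at the left edge of the axis instead of being dropped from the plot.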

  70. Coordinates
      This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip(), which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():
          ggplot(SAheart, aes(obesity, adiposity)) +
            geom_point() +
            coord_fixed(ratio = 1)
      [Figure: adiposity against obesity.]
      Here the aspect ratio is fixed so that one unit on the y axis is drawn the same length as one unit on the x axis.

  71. Coordinates
      This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip(), which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():
          ggplot(SAheart, aes(obesity, adiposity)) +
            geom_point() +
            coord_fixed(ratio = 0.5)
      [Figure: adiposity against obesity.]
      Here the aspect ratio (y / x) is fixed so that one unit on the y axis is drawn only half as long as one unit on the x axis.

  72. Coordinates
      One available coordinate system is coord_polar() which, contrary to what its name suggests, does not calculate a polar coordinate transformation but rather treats one of the two positions as defining an angle and the other as defining the radius.
          ggplot(SAheart, aes(obesity, adiposity)) +
            geom_point() +
            geom_smooth() +
            coord_polar(theta = "x")
      [Figure: adiposity (radius) against obesity (angle).]
      This is, arguably, a pretty weird plot, but it does demonstrate how coordinate systems are abstracted out as part of the grammar. coord_polar() should be used with considerable caution.

  73. Coordinates
      Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). First the bar chart:
          barChart <- ggplot(SAheart, aes(x = factor(1), fill = famhist)) +
            geom_bar(width = 1) +
            xlab("")
          barChart
      [Figure: a single stacked bar of counts, filled by famhist (Absent, Present).]

  74. Coordinates
      Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). Now the pie chart:
          barChart + coord_polar(theta = "y")
      [Figure: pie chart of counts, filled by famhist (Absent, Present).]
      Be careful with coord_polar() and bar charts; it is easy to produce some very silly, pointless charts.

  75. Positions
      A bar chart with two variates. The default position is “stack”:
          barChart2 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
            geom_bar(position = "stack") +
            xlab("chd")
          barChart2
      [Figure: counts by chd (0, 1), filled by famhist (Absent, Present).]
      This stacks the two colours in the same bar.

  76. Positions
      To place the colours beside each other rather than stack them, change the position to “dodge”:
          barChart3 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
            geom_bar(position = "dodge") +
            xlab("chd")
          barChart3
      [Figure: counts by chd (0, 1), with side-by-side bars for famhist (Absent, Present).]

  77. Statistical transformations - stat
      These transformations often summarize data in some manner (e.g. by counting, by averaging, etc.). Some statistical functions operate “behind the scenes”:
      ◮ stat_count() in geom_bar()
      ◮ stat_bin() in geom_freqpoly(), geom_histogram()
      ◮ stat_bin2d() in geom_bin2d()
      ◮ stat_bindot() in geom_dotplot()
      ◮ stat_binhex() in geom_hex()
      ◮ stat_boxplot() in geom_boxplot()
      ◮ stat_contour() in geom_contour()
      ◮ stat_quantile() in geom_quantile()
      ◮ stat_smooth() in geom_smooth()
      ◮ stat_sum() in geom_count()
      but may also be called directly (outside the geom_).

  78. Statistical transformations - stat
      Other stats have no corresponding geom_ function:
      ◮ stat_ecdf(): compute an empirical cumulative distribution plot.
      ◮ stat_function(): compute y values from a function of x values.
      ◮ stat_summary(): summarise y values at distinct x values.
      ◮ stat_summary2d(), stat_summary_hex(): summarise binned values.
      ◮ stat_qq(): perform calculations for a quantile-quantile plot.
      ◮ stat_spoke(): convert angle and radius to position.
      ◮ stat_unique(): remove duplicated rows.

  79. Statistical transformations - example
      Adding some statistical summary information to the scatterplot p2:
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            stat_summary(geom = "point", fun = "median", orientation = "x",
                         col = "yellow", size = 2, pch = 19)
      [Figure: sbp against sqrt(age); legends for factor(chd) (0, 1) and obesity.]
      Adds the median of the ys at each observed x.

  80. Statistical transformations - example
      Alternatively use stat = "summary" in geom_point(). Also add connecting lines to the scatterplot p2:
          p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                          col = "black", pch = 21, alpha = 0.4) +
            geom_point(stat = "summary", fun = "median", orientation = "x",
                       col = "yellow", size = 2, pch = 19) +
            stat_summary(geom = "line", fun = "median", orientation = "x",
                         col = "yellow", size = 1, pch = 19)
      [Figure: sbp against sqrt(age); legends for factor(chd) (0, 1) and obesity.]
      Adds the median of the ys at each observed x and joins the medians with a line.
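
To make "the median of the ys at each observed x" concrete, the same summary can be computed by hand in base R. This is only a sketch on a made-up data frame (the names toy, x and y are assumptions standing in for the data behind p2), not the computation ggplot2 performs internally:

```r
# Hypothetical stand-in for the plotted data: x plays sqrt(age), y plays sbp.
toy <- data.frame(
  x = c(4, 4, 4, 5, 5, 6),
  y = c(120, 130, 150, 118, 142, 160)
)

# One median of y for each distinct value of x; these are the kind of
# points a stat_summary(fun = "median", orientation = "x") layer draws.
medians <- aggregate(y ~ x, data = toy, FUN = median)
medians
```

Each row of medians corresponds to one yellow point in the plots above; the line layer simply joins those rows in order of x.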

  81. Miscellaneous
      ◮ Can also facet in a matrix grid using facet_grid().
      ◮ position can also be “jitter” (best for scatterplots).
      ◮ There is a function called theme(), which is how the appearance of all non-data plot components is changed.
      ◮ E.g. it is possible to turn that grey background grid off via theme() (though it seems a lot of work).
      ◮ There is a function qplot() or quickplot() which is more like a base graphics plot() call and so may be easier to use than following the ggplot2 grammar via ggplot() + ...
      ◮ ggsave() will save the most recent ggplot.
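
Several of these items can be combined in one short sketch. It assumes ggplot2 is installed and uses mtcars (which ships with R) rather than SAheart; the choice of variables, the output file name and the plot dimensions are all illustrative assumptions.

```r
library(ggplot2)

# mtcars is used only because it is always available; any data frame would do.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(position = "jitter") +            # jittered scatterplot
  facet_grid(am ~ cyl) +                       # matrix of panels: rows am, columns cyl
  theme(panel.background = element_blank(),    # remove the grey background
        panel.grid = element_blank())          # remove the grid lines

# ggsave() defaults to the last plot displayed; passing the plot
# explicitly avoids relying on that. The file goes to a temporary directory.
out_file <- file.path(tempdir(), "misc-example.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Passing plot = p is a design choice: it keeps the saved file independent of whatever happened to be printed last.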

  82. Miscellaneous
      Note: to plot time series (objects of class ts) you need the ggfortify package and then use autoplot().
          library(ggfortify)
          autoplot(sunspots)
      [Figure: the sunspots time series, roughly 1750 to 1950.]
      Similarly, library(ggmap) for raster maps from get_map().

  83. Working with magrittr pipes
      The grammar model of ggplot has + behaving much like a pipe in magrittr and can be used with the pipes of magrittr.
          library(magrittr)
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            ggplot(aes(x = wt, y = lp100k)) +
            geom_point(mapping = aes(col = vs)) +
            ylab("Litres per 100 kilometres") +
            ggtitle("Gas usage") -> p
          p
      [Figure: “Gas usage”; Litres per 100 kilometres against wt, coloured by vs.]

  84. Working with magrittr pipes
      Note that unlike the base graphics plots, but like grid and loon plots, ggplots are structures that can be passed on with the pipes.
          library(magrittr)
          mtcars %>%
            subset(am == 0) %>%
            transform(kpl = mpg %>% multiply_by(0.4251)) %>%
            data.frame(lp100k = 100 / .$kpl) %>%
            ggplot(aes(x = wt, y = lp100k)) +
            geom_point(mapping = aes(col = vs)) +
            ylab("Litres per 100 kilometres") +
            ggtitle("Gas usage") %>%
            print
          ## $title
          ## [1] "Gas usage"
          ##
          ## attr(,"class")
          ## [1] "labels"
      [Figure: “Gas usage”; Litres per 100 kilometres against wt, coloured by vs.]
      Note that typically a ggplot data structure is not completely constructed until it has been printed (or forced).
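
The printed output above shows only the $title labels, which suggests that print received just the result of ggtitle() and not the whole plot. That is consistent with R's operator precedence: operators of the form %any%, including %>%, bind more tightly than +. The sketch below checks the parse with quote(), so nothing is evaluated and neither magrittr nor ggplot2 needs to be loaded; a, b and f are placeholder names, not objects from the slides.

```r
# quote() captures the expression unevaluated, so only the parse matters.
e <- quote(a + b %>% f)

e[[1]]  # the outermost call is `+` ...
e[[3]]  # ... and its right-hand operand is the whole pipe, b %>% f

# Parentheses make the pipe apply to the complete sum instead.
e2 <- quote((a + b) %>% f)
e2[[1]]  # the outermost call is now `%>%`
```

By the same reasoning, wrapping everything from mtcars through ggtitle("Gas usage") in parentheses before %>% print would pass the complete ggplot to print.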
