slide-1
SLIDE 1

Manage your analyses workflows with the drake R package

Grenoble R Users Group Xavier Laviron

December 6, 2018

slide-2
SLIDE 2

A data analyst’s job

1/23

slide-6
SLIDE 6

A data analyst’s job

Two options:
◮ Run everything from scratch (simple, but can take too long...)
◮ Track the dependencies between your objects (boring, a perfect job for a pipeline toolkit...)

5/23

slide-8
SLIDE 8

The drake package is here to help you

Why use drake?
◮ Keeps track of dependencies in your code
◮ Keeps track of changes in your code
◮ Runs only what needs to be run, and skips the rest
◮ It has a cool name :-)

In other words, drake can save a lot of time!*

*more time for coffee breaks

6/23

slide-9
SLIDE 9

drake tracks changes in functions

Encapsulate your code in functions:

library(ggplot2)

# Process the data: keep rows with a sepal length above 5
process_data <- function(raw.data) {
  raw.data[raw.data$Sepal.Length > 5, ]
}

# Fit a model
fit_model <- function(data) {
  lm(Sepal.Length ~ Petal.Width + Species, data = data)
}

# Create plots
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram()
}
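Before wiring such functions into a plan, it can help to check that they run on their own. The sketch below assumes the built-in iris data set stands in for the raw data (the slides do not say so, but the functions use its column names), and it leaves out the plotting function to avoid the ggplot2 dependency.

```r
# Assumption: iris plays the role of the raw data.
process_data <- function(raw.data) {
  raw.data[raw.data$Sepal.Length > 5, ]
}

fit_model <- function(data) {
  lm(Sepal.Length ~ Petal.Width + Species, data = data)
}

proc.data <- process_data(iris)
model <- fit_model(proc.data)

nrow(proc.data)     # number of rows kept by the filter
names(coef(model))  # intercept, Petal.Width, and the Species contrasts
```

Because each step is a plain function of its inputs, the same calls can later be used unchanged as commands in the plan.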

7/23

slide-11
SLIDE 11

The plan

The central piece of drake: the workflow plan

The plan is a simple data.frame with two columns:
◮ target: the objects you want to build
◮ command: the functions to build them

Different ways to create the plan:
◮ Like any data.frame: data.frame(), expand.grid(), ...
◮ With one of drake’s helper functions: drake_plan(), evaluate_plan(), ...
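To make the "plan is just a data.frame" idea concrete, here is a minimal sketch that builds the two columns by hand in base R. The name manual.plan is only for illustration, and whether a given drake version accepts text commands exactly as written here is an assumption; drake_plan() is the safer route.

```r
# A plan built by hand: one row per target, commands stored as text.
manual.plan <- data.frame(
  target = c("proc.data", "model"),
  command = c("process_data(raw.data)", "fit_model(proc.data)"),
  stringsAsFactors = FALSE
)

print(manual.plan)
```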

8/23

slide-13
SLIDE 13

The drake_plan() function

Usage:

drake_plan(target1 = command1, target2 = command2, ...)

my.plan <- drake_plan(
  raw.data = read.csv(file_in("data/raw_data.csv")),
  proc.data = process_data(raw.data),
  plot = create_plot(proc.data),
  model = fit_model(proc.data),
  report = render(input = knitr_in("report.Rmd"),
                  output_file = file_out("report.pdf"),
                  quiet = TRUE)
)

9/23

slide-14
SLIDE 14

The drake_plan() function

print(my.plan)

## # A tibble: 5 x 2
##   target    command
## * <chr>     <chr>
## 1 raw.data  read.csv(file_in('data/raw_data.csv'))
## 2 proc.data process_data(raw.data)
## 3 plot      create_plot(proc.data)
## 4 model     fit_model(proc.data)
## 5 report    "render(input = knitr_in('report.Rmd'), output_file = file_ou~

10/23

slide-15
SLIDE 15

File dependencies

drake does not track files automatically; you have to declare them explicitly as dependencies:
◮ file_in("some_data.csv"): an input file
◮ file_out("some_data.Rds"): an output file
◮ knitr_in("report.Rmd"): an R Markdown file, which drake scans to find its dependencies
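As a small illustration of these markers, the sketch below declares one input file and one output file in a plan. The file paths and the target names are hypothetical; nothing is read or written here, because drake_plan() only records the commands.

```r
library(drake)

# Hypothetical paths, used only to show where the markers go.
file.plan <- drake_plan(
  raw.data = read.csv(file_in("data/raw_data.csv")),
  saved = saveRDS(raw.data, file_out("data/raw_data.Rds"))
)

print(file.plan)
```

With the files declared this way, a change to the input file should mark raw.data, and the targets that depend on it, as outdated.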

11/23

slide-16
SLIDE 16

The dependency graph

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependency graph

12/23

slide-18
SLIDE 18

The make() command

The central command of drake: it runs everything that needs to run.

make(my.plan)
vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependency graph
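The effect of make() can be checked on a reduced, self-contained workflow. The sketch below is not the plan from the slides: it assumes iris as the data, drops the file and report targets, and uses the hypothetical name small.plan. It uses the same call forms as the slides; newer drake releases may prefer different ones. Note that make() writes a .drake/ cache in the working directory.

```r
library(drake)

process_data <- function(raw.data) {
  raw.data[raw.data$Sepal.Length > 5, ]
}
fit_model <- function(data) {
  lm(Sepal.Length ~ Petal.Width + Species, data = data)
}

small.plan <- drake_plan(
  proc.data = process_data(iris),
  model = fit_model(proc.data)
)

make(small.plan)  # first run: builds both targets
make(small.plan)  # second run: nothing changed, so nothing is rebuilt

# Targets that would be rebuilt by the next make()
still.outdated <- outdated(drake_config(small.plan))
still.outdated
```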

13/23

slide-20
SLIDE 20

Accessing the objects

All objects are stored in a hidden cache (.drake/). To access them:

loadd(model)
model <- readd(model)
print(readd(model))

##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width + Species, data = data)
##
## Coefficients:
##       (Intercept)        Petal.Width  Speciesversicolor
##           5.13118            0.65802           -0.01955
##  Speciesvirginica
##           0.15373
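The difference between the two accessors is easier to see side by side. This is a minimal sketch with a hypothetical one-target plan (demo.plan, big.rows) built on iris; it creates a .drake/ cache in the working directory.

```r
library(drake)

demo.plan <- drake_plan(big.rows = iris[iris$Sepal.Length > 5, ])
make(demo.plan)

cached()                          # names of the objects stored in the cache

big.rows.copy <- readd(big.rows)  # returns the value; assign it to any name
loadd(big.rows)                   # creates an object called big.rows

identical(big.rows, big.rows.copy)
```

readd() suits one-off inspection or renaming, while loadd() is convenient for pulling several targets into the session under their own names.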

14/23

slide-22
SLIDE 22

An update in the code!

What happens if we modify a function?

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, y = Sepal.Width, fill = Species)) +
    geom_point()
}

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependency graph
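The same behaviour can be observed without a graph by asking drake which targets are outdated after a function is redefined. The sketch below uses a hypothetical function (summarise_data) and plan (change.plan) on iris rather than the plotting code above, so that it runs without ggplot2.

```r
library(drake)

summarise_data <- function(data) mean(data$Sepal.Length)

change.plan <- drake_plan(avg = summarise_data(iris))
make(change.plan)

# Redefine the function: drake notices that its body has changed.
summarise_data <- function(data) median(data$Sepal.Length)

stale <- outdated(drake_config(change.plan))
stale              # the target that depends on summarise_data

make(change.plan)  # rebuilds only what is outdated
```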

15/23

slide-23
SLIDE 23

Other advantages

Reproducibility

You have proof that the results are in sync with the code and data:

make(my.plan)

## All targets are already up to date.

16/23

slide-24
SLIDE 24

Other advantages

Independent replication is made easy

◮ Your code is separated into functions: more readability and maintainability
◮ The plan allows an independent user to easily understand the analyses
◮ Restart everything from scratch easily:

outdated(drake_config(my.plan))

## character(0)

clean()

outdated(drake_config(my.plan))

## [1] "model" "plot" "proc.data" "raw.data" "report"
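clean() can also be given target names to invalidate only part of the workflow. The sketch below uses a hypothetical two-target plan (clean.plan) to show this; as with the other examples, it writes to the .drake/ cache in the working directory.

```r
library(drake)

clean.plan <- drake_plan(
  a = 1 + 1,
  b = a * 2
)
make(clean.plan)

clean(a)  # remove only target a from the cache

# a is reported as outdated, along with anything that depends on it
after.clean <- outdated(drake_config(clean.plan))
after.clean
```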

17/23

slide-26
SLIDE 26

Parallelization

◮ drake can manage multi-core computing (on a local machine or an HPC cluster)
◮ Simply change the jobs argument of make():

make(my.plan, jobs = 2)

◮ drake will automatically know which targets can be run in parallel and which cannot
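One way to pick a value for jobs is to derive it from the number of available cores. The rule below (leave one core free) is only a common heuristic, not a drake recommendation, and depending on the drake version a parallel backend may also have to be selected for jobs to take effect.

```r
# Leave one core free; fall back to 1 if the core count is unknown.
n.cores <- parallel::detectCores()
jobs <- if (is.na(n.cores)) 1L else max(1L, n.cores - 1L)
jobs

# make(my.plan, jobs = jobs)  # my.plan as defined on the earlier slides
```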

18/23

slide-27
SLIDE 27

Resources

To go further

https://github.com/ropensci/drake
◮ Online documentation
◮ Cheatsheet
◮ FAQ

The package is in active development and has many other features.

19/23

slide-28
SLIDE 28

Exercises

drake ships with a number of built-in examples. You can list them with:

drake_examples()

And then load one with:

drake_example("example_name")

This creates a directory with all the necessary files, which you can open in the IDE of your choice (RStudio, vim, ...).

20/23

slide-29
SLIDE 29

Exercise: The basic example

The most accessible example for beginners

drake_example("main")

21/23

slide-30
SLIDE 30

Exercise: The mtcars example

This example is a walkthrough of drake’s main functionality based on the mtcars data. It sets up the project and runs it repeatedly to demonstrate the most important features.

drake_example("mtcars")

22/23

slide-31
SLIDE 31

Exercise: An analysis of R package download trends

This example explores R package download trends using the cranlogs package, and it shows how drake’s custom triggers can help with workflows with remote data sources.

drake_example("packages")

23/23