
Finding packages, project organization (Steve Bagley)



  1. Finding packages, project organization. Steve Bagley, somgen223.stanford.edu

  2. How to find R packages
     • There are over 15,000 packages available for R.
     • That’s great, but how do you find what you want?
     • Task Views (https://cran.r-project.org/web/views/): human-curated lists of packages for a given area.
     • METACRAN (https://r-pkg.org/): provides some more organization to CRAN.

  3. What is stored where
     • R servers (usually CRAN, but also Bioconductor): contain the packages. Some packages also live on the developer’s website.
     • Your computer: contains the packages you have installed.
     • Your (project) directory: contains script (program) files, data files, and output files.
     • The workspace: contains the current variable and function bindings, and the packages that have been loaded since starting R.
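
As a rough illustration of the four locations above, the following sketch uses only base R functions to inspect each one. The variable `x` is a made-up example, and the output will differ from machine to machine.

```r
# Where R looks for installed packages on this computer.
.libPaths()

# Names of the packages installed in those libraries.
installed <- rownames(installed.packages())
head(installed)

# The current (project) directory and the files in it.
getwd()
list.files()

# The workspace: variables and functions currently defined.
x <- 1:10   # hypothetical variable, only here so ls() has something to show
ls()

# Packages attached in this session (the search path).
search()
```

Packages that live on the servers only appear in `installed.packages()` after they have been installed, for example with `install.packages()`.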

  4. Where to put files for your project
     • The most natural organization of a project uses the tree-like structure of a hierarchical file system.
     • For each project, put all the scripts/code in one directory or sub-directory.
     • R (and RStudio) have a notion of the current working directory, which can be set through the graphical user interface, or using R commands (setwd, getwd).
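
A minimal sketch of `getwd` and `setwd` in use. A temporary directory stands in for a real project directory here (an assumption made so the example does not touch your files), and `notes.txt` is a made-up file name.

```r
# Remember where we started so we can go back.
old_dir <- getwd()

# A temporary directory stands in for a real project directory (assumption).
project_dir <- tempdir()

# Change the working directory; relative paths are now resolved from here.
setwd(project_dir)
writeLines("hello", "notes.txt")   # written inside project_dir
file.exists(file.path(project_dir, "notes.txt"))

# Restore the original working directory.
setwd(old_dir)
```

Restoring the old directory at the end keeps the rest of a session from silently reading and writing in the wrong place.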

  5. The workspace
     • The workspace contains all the functions and variables that you have defined in it (but not deleted from it).
     • You can save the workspace contents, close R, and then restart it, restoring all of the workspace contents.
     • When you restart, all of your data will be there. But you will still need to reload all the packages you use.
     • You might not want to rely on saving the workspace, for two reasons:
       • It may be easier to start over with a fresh workspace than to try undoing some complicated error.
       • You want a written record of reproducible commands (scripts) to create the state, not just the state itself.
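
The save-and-restore cycle described above can be sketched with base R's `save`, `rm`, and `load`. The variable `dose` and the temporary file are assumptions for the demonstration; `save.image()` would save the whole workspace rather than one object.

```r
# Define something in the workspace (hypothetical variable).
dose <- c(0.5, 1, 2)

# Save it to a file. A temporary file is used here for the demo.
ws_file <- tempfile(fileext = ".RData")
save(dose, file = ws_file)

# Remove the variable, as if R had been restarted.
rm(dose)
exists("dose", inherits = FALSE)

# Restore the saved objects. Packages are not restored;
# library() must be called again for each one.
load(ws_file)
dose
```

A script that recreates `dose` from the raw data would serve the same purpose and also document how the value was produced.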

  6. Project organization: RStudio
     • Use the Project menu (upper right corner) to create/open a project.

  7. Project organization: directories
     1. Make a project directory: .../dolphin/
     2. Make a subdirectory for the input data files: .../dolphin/data
     3. Make a subdirectory for your code/script files: .../dolphin/src
     4. Make a subdirectory for the output files: .../dolphin/output
     5. Make a subdirectory for all pdfs: .../dolphin/figures
     6. Make a subdirectory for any papers: .../dolphin/papers
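
The directory layout above could also be created from R. This is a sketch only: the project root is placed under a temporary directory (an assumption, so nothing real is modified), and you would substitute your own path.

```r
# Hypothetical project root under a temporary location.
# Replace with your own project path.
project <- file.path(tempdir(), "dolphin")

# Sub-directories named in the list above.
subdirs <- c("data", "src", "output", "figures", "papers")

# Create the project directory and each sub-directory.
for (d in file.path(project, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

list.files(project)
```

Using `file.path` rather than pasting strings keeps the code independent of the operating system's path separator.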

  8. How to approach a new dataset
     1. Whenever you get a new dataset, record when you got it and where you got it from.
     2. Read the raw data from a file or URL.
     3. Fix column names to make all subsequent manipulation easier.
     4. Figure out the meaning of the data in each column. You may have received a description of the data (“metadata”), or something called a “data dictionary”. If not, you may need to apply your knowledge of the domain and some common sense.
     5. Start testing your assumptions about the data (and about the metadata, which can be wrong). Look for illegal values (completely out of bounds), outliers (possible, but unlikely), missing values, typos, coding errors, and inconsistencies.
     6. In general, try to fix the problems by writing a sequence of R expressions (script or R Markdown). This makes your work reproducible: you can rerun the script, or use it on the next version of the data. Try never to modify by hand the source files containing the original data.
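
Steps 2, 3, 5, and 6 above might look like the following in base R. Everything in the data is invented for illustration, including the column names and the plausible age range of 0 to 120, which is an assumption rather than a rule.

```r
# Made-up raw data standing in for a file; every value here is invented.
raw_text <- "Subject ID,Age (years),Sex
1,34,F
2,-5,M
3,,f
4,210,M"

# Step 2: read the raw data (a file path or URL would replace text =).
raw <- read.csv(text = raw_text, check.names = FALSE, stringsAsFactors = FALSE)

# Step 3: fix column names, working on a copy so raw stays untouched.
clean <- raw
names(clean) <- c("subject_id", "age", "sex")

# Step 5: test assumptions about the data.
sum(is.na(clean$age))                        # missing values
bad_age <- which(clean$age < 0 | clean$age > 120)
clean[bad_age, ]                             # out-of-bounds ages (assumed range)
table(clean$sex, useNA = "ifany")            # inconsistent coding

# Step 6: fix the problems in code, not by hand in the source file.
clean$sex <- toupper(clean$sex)
clean$age[bad_age] <- NA
clean
```

Because the fixes are expressions rather than manual edits, the same script can be rerun when the next version of the data arrives.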

  9. How to approach data visualization
     • Compared to what: decide how to make a meaningful comparison.
       • This could be: treatment vs control, compared to baseline, compared to some simple null model, trend over time, trend over space.
     • Then display the data to make this comparison visually salient.
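
One possible treatment-vs-control display, sketched with ggplot2 and the `PlantGrowth` dataset that ships with R (a control group and two treatment groups). The choice of dataset and of a boxplot is an assumption made for illustration, not something the slides prescribe.

```r
library(ggplot2)

# PlantGrowth: plant weights for a control group (ctrl)
# and two treatment groups (trt1, trt2).
p <- ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot() +                          # one box per group, side by side
  geom_jitter(width = 0.1, alpha = 0.6) +   # the individual observations
  labs(x = "Group", y = "Dried plant weight")

print(p)
```

Placing the groups on a shared axis puts the comparison itself, rather than any single group, at the center of the figure.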

  10. How to start the exploration
      • Make some assumptions, even very simple or straightforward ones, about the data. Sometimes these are explicitly stated by whoever gives you the data. (They might be wrong.)
      • See if those assumptions hold true.
      • Iterate, trying to build up an explanation (model) in your head.
      • Focus on understanding first; make the graph pretty later.
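
A small sketch of this assume-then-check loop, using the built-in `iris` data. The three assumptions are examples chosen for illustration; for real data they would come from the metadata or from whoever supplied the file.

```r
# Assumption 1: every species has the same number of rows.
table(iris$Species)

# Assumption 2: there are no missing values.
colSums(is.na(iris))

# Assumption 3: petals are always longer than they are wide.
all(iris$Petal.Length > iris$Petal.Width)

# Next iteration: does petal size differ by species?
aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = mean)
```

Each check is a single expression, so it is cheap to add more as the mental model of the data develops.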

  11. Saving figures
      • Use R Markdown to make a computational lab notebook. This will show your entire analysis workflow, and can include data frame tables and figures.
      • You can write a figure out to a file.

  12. Saving figures

          library(ggplot2)
          ## This opens a pdf file for writing.
          pdf("../figures/fig27.pdf")
          ## This plot is sent to the file.
          ggplot(iris, aes(Petal.Width, Petal.Length)) +
            geom_point(aes(color = Species))
          ## This closes the file.
          dev.off()

      • ../ is the parent directory of the current directory.
      • ../figures/ is a sibling directory, assuming the working directory is src.
      • pdf writes pdf files.
      • postscript writes postscript files.
      • png writes png files.
      • jpeg writes jpeg files.
      • tiff writes tiff files.
      • svg writes svg files.
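
As an alternative to opening and closing a device by hand, ggplot2 provides `ggsave()`. The sketch below is not from the slides; it writes to a temporary file (an assumption for the demo), and the width and height values are arbitrary.

```r
library(ggplot2)

# Keep the plot in a variable so it can be saved more than once.
p <- ggplot(iris, aes(Petal.Width, Petal.Length)) +
  geom_point(aes(color = Species))

# Temporary file used for the demo (assumption); in a project this
# might be a path such as "../figures/fig27.pdf".
out_file <- tempfile(fileext = ".pdf")

# ggsave() picks the device from the file extension and closes it afterwards.
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Changing the extension to `.png` or `.svg` would select a different device, provided that device is available in your R installation.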

  13. Kinds of data frames

  14. data.frame vs tibble vs data.table

      Type        Package     Notes
      data.frame  built-in    slow for big data, some odd defaults
      tibble      tidyverse   used throughout tidyverse, fast enough
      data.table  data.table  very fast, syntax is powerful/complex
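
To make the comparison concrete, here is a small sketch that builds the same made-up table in all three forms and shows one behavioral difference and one syntax difference. It assumes the tibble and data.table packages are installed; the column names and values are invented.

```r
library(tibble)
library(data.table)

# The same hypothetical data in each of the three types.
df <- data.frame(id = 1:3, score = c(90, 85, 77))
tb <- tibble(id = 1:3, score = c(90, 85, 77))
dt <- data.table(id = 1:3, score = c(90, 85, 77))

# Selecting one column with [ , ]:
class(df[, "score"])    # data.frame drops to a plain vector
class(tb[, "score"])    # tibble stays a tibble
class(dt[, .(score)])   # data.table: .() keeps a data.table

# Filtering rows where score > 80:
df[df$score > 80, ]
tb[tb$score > 80, ]
dt[score > 80]          # column names can be used directly
```

The single-column drop to a vector is one example of the "odd defaults" noted for data.frame; passing `drop = FALSE` avoids it.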
