Introduction to data frames


  1. Introduction to data frames. Steve Bagley, somgen223.stanford.edu

  2. Using packages from the tidyverse

  3. Need to install the tidyverse set of packages

     • Type this:

         install.packages("tidyverse")

     • “tidyverse” is a coherent set of packages for operating on a kind of data called the “data frame.”
     • It is not built in, so you need to install it (once), then load it each time you restart R.
     • Put library(tidyverse) or library("tidyverse") at the top of every script file.
     • More about packages later.
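
The install-once, load-every-session pattern can be sketched as follows. The `requireNamespace` guard is an addition for illustration, not something the slides prescribe; it simply skips the download when the package is already installed.

```r
# Install the tidyverse only if it is missing (needed once per R installation).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it at the top of every script, or after every restart of R.
library(tidyverse)
```

Calling `install.packages("tidyverse")` directly, as on the slide, works just as well; the guard only avoids reinstalling on every run.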

  4. Data frame: a two-dimensional data structure

     A data frame is one of the most powerful features in R.

     • It is a rectangular data structure that can contain different types of data, similar to an Excel spreadsheet.
     • Typically, each row in a data frame contains information about one instance of some (real-world) object.
     • Each column can be thought of as a variable, containing the values for the corresponding instances.
     • All the values in one column should be of the same type, but different columns can be of different types.
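
As a minimal sketch of these points, the code below builds a small tibble by hand. The data and the names `subjects`, `id`, `age`, and `enrolled` are invented for illustration and do not come from the course.

```r
library(tidyverse)

# Hypothetical example data: one row per subject, one column per variable.
subjects <- tibble(
  id       = c("s1", "s2", "s3"),   # character column
  age      = c(34, 51, 28),         # numeric (double) column
  enrolled = c(TRUE, FALSE, TRUE)   # logical column
)

subjects        # prints the tibble, with the type of each column under its name
nrow(subjects)  # number of rows (instances)
ncol(subjects)  # number of columns (variables)
```

Each column holds a single type, while the three columns differ in type from one another.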

  5. Data frame example

         bod <- as_tibble(BOD)
         bod
         # A tibble: 6 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19
         4     4   16
         5     5   15.6
         6     7   19.8

     • This data set contains data on biological oxygen demand.
     • A tibble is a kind of data frame. This one has 6 rows and 2 columns.
     • Across the top is the name of each column. The next row shows the type of data in the column. <dbl> means double-precision floating point number.
     • The row numbers are added for printing. They are not part of the data frame.

  6. Data frame example

         bod
         # A tibble: 6 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19
         4     4   16
         5     5   15.6
         6     7   19.8

     Check claim on earlier slide:

     • Typically, each row in a data frame describes an instance of some (real-world) object. (Yes: one row for each subject, time, or measurement.)
     • Each column contains the values of a variable for the corresponding instance. (Yes: one column for each variable.)

  7. Data frame functions

     • The rest of this section shows data frame functions (“verbs”).
     • Each function takes a data frame and produces a new data frame.

  8. Data frame function: filtering rows

         filter(bod, Time < 4)
         # A tibble: 3 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19

     • This produces (and prints out) a new tibble, which contains all the rows where the Time value in that row is less than 4.
     • There are only 3 rows in this data frame.
     • There are still 2 columns.
     • filter does not modify bod, which still has 6 rows and 2 columns.
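
The last point, that `filter` leaves its input untouched, can be checked directly. This is a small sketch; the variable name `early` is chosen here for illustration.

```r
library(tidyverse)

bod <- as_tibble(BOD)

# filter() returns a new tibble; assign it to a name to keep the result.
early <- filter(bod, Time < 4)

nrow(early)  # 3 rows satisfy Time < 4
nrow(bod)    # 6 rows: the original tibble is unchanged
```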

  9. Combining constraints in filter

         filter(bod, Time < 4, demand <= 16)
         # A tibble: 2 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3

     • This filters by the conjunction of the two constraints: both must be satisfied.
     • Constraints appear as second (and third, …) arguments, separated by commas.

  10. Disjunction with filter

          filter(bod, Time < 4 | demand <= 16)
          # A tibble: 5 x 2
             Time demand
            <dbl>  <dbl>
          1     1    8.3
          2     2   10.3
          3     3   19
          4     4   16
          5     5   15.6

      • This filters by the disjunction of the two constraints: a row is kept if at least one of them is satisfied.
      • More about logical operators later.
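
The two ways of combining constraints can be compared side by side. In this sketch the names `both`, `both_amp`, and `either` are illustrative; the `&` form is shown only to make explicit that comma-separated constraints behave like a logical "and".

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Conjunction: comma-separated constraints must all hold.
both     <- filter(bod, Time < 4, demand <= 16)
# The same conjunction written with the & operator.
both_amp <- filter(bod, Time < 4 & demand <= 16)
# Disjunction: at least one constraint must hold.
either   <- filter(bod, Time < 4 | demand <= 16)

identical(both$Time, both_amp$Time)  # TRUE: same rows either way
nrow(both)                           # 2
nrow(either)                         # 5
```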

  11. Filtering out all rows

          filter(bod, Time > 10)
          # A tibble: 0 x 2
          # ... with 2 variables: Time <dbl>, demand <dbl>

      • If the constraint is too severe, then you will produce a tibble with zero rows.
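
An empty result is still a valid tibble with the same columns, which a script can test for. A minimal sketch, with `none` as an illustrative name:

```r
library(tidyverse)

bod <- as_tibble(BOD)

# No Time value exceeds 10, so every row is filtered out.
none <- filter(bod, Time > 10)

nrow(none) == 0  # TRUE: no rows survived
names(none)      # the columns Time and demand are still present
```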

  12. Data frame function: select columns

          select(bod, demand)
          # A tibble: 6 x 1
            demand
             <dbl>
          1    8.3
          2   10.3
          3   19
          4   16
          5   15.6
          6   19.8

      • The select function will return a subset of the tibble, using only the requested columns in the order specified.

  13. Use - to leave out a column

          select(bod, -Time)
          # A tibble: 6 x 1
            demand
             <dbl>
          1    8.3
          2   10.3
          3   19
          4   16
          5   15.6
          6   19.8

      • The - operator can be used to leave out a column.
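
Both uses of `select` can be tried on `bod`. This sketch also shows that the order of the requested columns is respected; the names `swapped` and `no_demand` are illustrative.

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Request both columns, in the opposite order from the original.
swapped <- select(bod, demand, Time)
names(swapped)    # "demand" "Time"

# Leave out one column with the - operator.
no_demand <- select(bod, -demand)
names(no_demand)  # "Time"
```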

  14. Data frame function: arrange to sort rows

          arrange(bod, demand)
          # A tibble: 6 x 2
             Time demand
            <dbl>  <dbl>
          1     1    8.3
          2     2   10.3
          3     5   15.6
          4     4   16
          5     3   19
          6     7   19.8

      • arrange takes a data frame and a column, and sorts the rows by the values in that column in ascending order.

  15. Use desc to sort from high to low

          arrange(bod, desc(demand))
          # A tibble: 6 x 2
             Time demand
            <dbl>  <dbl>
          1     7   19.8
          2     3   19
          3     4   16
          4     5   15.6
          5     2   10.3
          6     1    8.3
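
The two sort orders can be compared by looking at how the Time column is rearranged. A short sketch, with `ascending` and `descending` as illustrative names:

```r
library(tidyverse)

bod <- as_tibble(BOD)

ascending  <- arrange(bod, demand)        # lowest demand first
descending <- arrange(bod, desc(demand))  # highest demand first

ascending$Time   # 1 2 5 4 3 7
descending$Time  # 7 3 4 5 2 1
bod$Time         # 1 2 3 4 5 7: arrange did not modify bod
```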

  16. Interlude

  17. Arguments by position

          1:5
          [1] 1 2 3 4 5
          5:1
          [1] 5 4 3 2 1
          seq(1, 5)
          [1] 1 2 3 4 5
          seq(5, 1)
          [1] 5 4 3 2 1

      • seq is the function equivalent of the colon operator.
      • Arguments can be specified by position, with one supplied argument for each name in the function parameter list, and in the same order.
      • Arguments are separated by commas.

  18. Arguments by name

          seq(from = 1, to = 5)
          [1] 1 2 3 4 5
          seq(to = 5, from = 1)
          [1] 1 2 3 4 5

      • Arguments can be supplied by name using the syntax variable = value.
      • When using names, the order of the named arguments does not matter.
      • You can mix positional and named arguments (carefully).
      • Do not use <- in place of = when specifying named arguments.
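
The last two points can be illustrated with `seq` itself. This is a sketch: the `by` and `length.out` arguments are real `seq` arguments that the slides have not introduced yet, and the final call is included only to show why `<-` is the wrong operator inside a call.

```r
# Mixing styles: the first two arguments by position, the third by name.
stepped <- seq(1, 10, by = 3)         # 1 4 7 10
spaced  <- seq(1, 5, length.out = 3)  # 1 3 5

# Pitfall: <- does not name an argument. It assigns variables called
# `to` and `from` in your workspace and passes the values by position,
# so the sequence runs from 5 down to 1.
wrong <- seq(to <- 5, from <- 1)      # 5 4 3 2 1

stepped
spaced
wrong
```

With `=`, the call `seq(to = 5, from = 1)` counts up from 1 to 5; with `<-`, the names are ignored for matching and the order of the values decides the result.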

  19. Using the correct argument name

          seq(1, 5)
          [1] 1 2 3 4 5
          seq(from = 1, to = 5)
          [1] 1 2 3 4 5
          seq(begin = 1, end = 5)
          Warning: In seq.default(begin = 1, end = 5) :
            extra arguments 'begin', 'end' will be disregarded
          [1] 1

      • You have to use the correct argument name.

  20. How to find the names of a function’s arguments

          ## Try this:
          ?seq

      • How can you figure out the names of seq’s arguments?
      • Answer: the arguments are listed in the R documentation of the function.
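
Besides reading the help page, the argument names can be inspected from the console. This sketch uses the base R functions `args` and `formals`; it looks at `seq.default`, the method named in the warning message on the previous slide, because `seq` itself is a generic whose only formal argument is `...`.

```r
# Print the argument list of the default seq method.
args(seq.default)

# Extract just the argument names as a character vector.
arg_names <- names(formals(seq.default))
arg_names

# Check whether a candidate name is a real argument.
"from" %in% arg_names   # TRUE
"begin" %in% arg_names  # FALSE
```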

  21. How to assign and evaluate in one line

          x <- 10
          x
          [1] 10
          (x <- 10)
          [1] 10

      • When you are typing to R, it is very common to want to assign a variable and return the value of the assignment.
      • But ordinarily, R does not return a visible result from an assignment expression.
      • You can force it to do so in a single expression by putting parentheses, ( ), around the assignment.
      • This is used throughout the R documentation and some of the slides for this course, so you should be familiar with this bit of syntax.

  22. Back to data frame functions

  23. Data frame function: mutate to compute new column

          (bod2 <- mutate(bod, inv_demand = 1 / demand))
          # A tibble: 6 x 3
             Time demand inv_demand
            <dbl>  <dbl>      <dbl>
          1     1    8.3     0.120
          2     2   10.3     0.0971
          3     3   19       0.0526
          4     4   16       0.0625
          5     5   15.6     0.0641
          6     7   19.8     0.0505

      • This uses mutate to add a new column to bod which is the reciprocal of demand.
      • Note use of = to assign to new column. Do not use <- here!
      • The result of this is a new data frame with the new column. You must assign to a new (or old) name to save the result.
      • Note use of ( ) to assign and evaluate on a single line.
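
The need to assign the result can be seen by calling `mutate` with and without an assignment. A minimal sketch using the same names as the slide:

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Without assignment the new tibble is printed and then discarded.
mutate(bod, inv_demand = 1 / demand)
names(bod)   # "Time" "demand": bod has no new column

# Assign the result to keep it.
bod2 <- mutate(bod, inv_demand = 1 / demand)
names(bod2)  # "Time" "demand" "inv_demand"
```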

  24. Exercise: change units in orange trees

          orange <- as_tibble(Orange)
          head(orange)
          # A tibble: 6 x 3
            Tree    age circumference
            <ord> <dbl>         <dbl>
          1 1       118            30
          2 1       484            58
          3 1       664            87
          4 1      1004           115
          5 1      1231           120
          6 1      1372           142

      • Add a new column to orange called circum_in which is the circumference in inches, not in millimeters.
      • Hint: 1 in = 2.54 cm

  25. Answer: change units in orange trees

          orange <- mutate(orange, circum_in = circumference / (10 * 2.54))
          orange
          # A tibble: 35 x 4
             Tree    age circumference circum_in
             <ord> <dbl>         <dbl>     <dbl>
           1 1       118            30      1.18
           2 1       484            58      2.28
           3 1       664            87      3.43
           4 1      1004           115      4.53
           5 1      1231           120      4.72
           6 1      1372           142      5.59
           7 1      1582           145      5.71
           8 2       118            33      1.30
           9 2       484            69      2.72
          10 2       664           111      4.37
          # ... with 25 more rows
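
One way to sanity-check a unit conversion like this is to convert back and compare with the original column. In this sketch the helper name `mm_per_inch` is introduced for illustration; it holds the same factor, 10 * 2.54, used in the answer.

```r
library(tidyverse)

orange <- as_tibble(Orange)

# 1 in = 2.54 cm = 25.4 mm
mm_per_inch <- 10 * 2.54

orange <- mutate(orange, circum_in = circumference / mm_per_inch)

# Converting back to millimeters should recover the original values.
all.equal(orange$circum_in * mm_per_inch, orange$circumference)
```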

  26. Use read_csv to read csv file (from file or url)

          cw <- read_csv("https://somgen223.stanford.edu/data/cw.csv")

      • We will need that directory many times in this course, so use str_c to construct the location for read_csv:

          data_dir <- "https://somgen223.stanford.edu/data/"
          cw <- read_csv(str_c(data_dir, "cw.csv"))

      • data_dir holds the first part of the url as a string.
      • str_c glues all of its arguments together into a single string, like so:

          str_c("abc", "def")
          [1] "abcdef"
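
The string-building step can be tried on its own, without downloading anything. In this sketch `cw_url` and `urls` are illustrative names, and "example.csv" is a hypothetical file name used only to show that `str_c` works on vectors; it is not a file the course is known to provide.

```r
library(tidyverse)

data_dir <- "https://somgen223.stanford.edu/data/"

# Build the full location of one file.
cw_url <- str_c(data_dir, "cw.csv")
cw_url  # "https://somgen223.stanford.edu/data/cw.csv"

# str_c is vectorized: one prefix, several file names.
# "example.csv" is hypothetical, for illustration only.
urls <- str_c(data_dir, c("cw.csv", "example.csv"))
urls
```

Passing `cw_url` to `read_csv` is then equivalent to the second form shown on the slide.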

  27. Reading

      • Read: 5 Data transformation | R for Data Science (sections 5.1 to 5.5)
