ifelse summarize mutate cummulative functions lead lag
play

ifelse, summarize/mutate, cummulative functions, lead/lag Steve - PowerPoint PPT Presentation

ifelse, summarize/mutate, cummulative functions, lead/lag Steve Bagley somgen223.stanford.edu 1 2 b 3 c # A tibble: 3 x 2 x label < int > < chr > 1 1 a 2 (new_df <- tibble (x = 1 : 3, label = c ("a", "b",


  1. ifelse, summarize/mutate, cummulative functions, lead/lag Steve Bagley somgen223.stanford.edu 1

  2. 2 b 3 c # A tibble: 3 x 2 x label < int > < chr > 1 1 a 2 (new_df <- tibble (x = 1 : 3, label = c ("a", "b", "c"))) 3 How to create a new tibble from scratch • Although you will most often create new data frames using read_csv , you can create one from scratch by providing arguments to the tibble function in the format name = vector . • Use this function to create small test examples. somgen223.stanford.edu 2

  3. ifelse somgen223.stanford.edu 3

  4. z <- c (1, 2, -999, 4) Replace some indicator value with NA • Suppose that the value -999 has been used to represent a missing value. (Most computer languages do not have the equivalent of R’s NA , so out-of-bounds values are used instead). somgen223.stanford.edu 4

  5. z [1] 1 2 -999 4 ifelse (z == -999, NA, z) [1] 1 2 NA 4 Replace -999 with NA somgen223.stanford.edu 5

  6. NA -999 2 NA FALSE 1 NA FALSE z NA flag z 4 FALSE NA 4 2 NA 1 [1] ifelse (flag, NA, z) TRUE FALSE [1] FALSE FALSE (flag <- z == -999) 4 2 -999 1 [1] TRUE How ifelse works • ifelse takes three arguments: a test vector, here, flag , and two other vectors, here, NA and z . • It returns a vector with elements from either NA or z , depending on whether the corresponding element of flag is TRUE or FALSE . somgen223.stanford.edu 6

  7. Using ifelse to set the color of a point Trebi Wisconsin No. 38 Grand Rapids No. 457 Glabron Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Duluth Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi University Farm Wisconsin No. 38 No. 457 Glabron Peatland Variety of barley Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Morris Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Crookston Glabron Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Waseca Peatland Velvet No. 475 Manchuria No. 462 Svansota -20 -10 0 10 20 Difference in yield (1932 vs 1931) somgen223.stanford.edu 7

  8. barley2 <- barley %>% spread (year, yield) %>% mutate (yield_diff = `1932` - `1931`) ggplot (barley2, aes (x = yield_diff, y = variety, color = factor ( ifelse (yield_diff >= 0, "+", "-"), levels = c ("-", "+")))) + geom_point () + scale_color_manual (values = c ("red", "blue")) + xlab ("Difference in yield (1932 vs 1931)") + ylab ("Variety of barley") + facet_grid (rows = vars (site)) + theme (legend.position = "none") + theme (text = element_text (size = 9)) Code for the barley plot somgen223.stanford.edu 8

  9. More about recoding/replacing values somgen223.stanford.edu 9

  10. v <- c (1, 2, -999, 4) na_if (v, -999) [1] 1 2 NA 4 How to replace value with NA somgen223.stanford.edu 10

  11. v2 <- c (1, 2, 3, 4, NA) replace_na (v2, -999) [1] 1 2 3 4 -999 How to replace NA with another value somgen223.stanford.edu 11

  12. 5 a 0.861 < NA > 6 6 0.640 b 7 7 NA 8 5 8 0.233 b 9 9 0.666 a 10 (missing_df <- read_csv ( str_c (data_dir, "missing_df.csv"))) b 0.514 b 0.114 a # A tibble: 10 x 3 id weight group < dbl > < dbl > < chr > 1 1 2 4 NA 2 0.622 b 3 3 0.609 a 4 10 Remove all rows containing NA anywhere somgen223.stanford.edu 12

  13. 0.609 a 3 7 0.666 a 9 6 0.233 b 8 5 0.640 b 6 4 missing_df %>% 3 0.514 b 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 7 x 3 na.omit () 10 Remove all rows containing NA anywhere somgen223.stanford.edu 13

  14. 4 missing_df %>% 8 0.666 a 9 7 0.233 b 8 6 0.640 b 6 5 0.861 < NA > 5 0.609 a 0.514 b 3 3 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 8 x 3 filter ( complete.cases (weight)) 10 Remove all rows containing NA in specified columns somgen223.stanford.edu 14

  15. 0.609 a 3 7 0.666 a 9 6 0.233 b 8 5 0.640 b 6 4 missing_df %>% 3 0.514 b 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 7 x 3 filter ( complete.cases (weight, group)) 10 Another way somgen223.stanford.edu 15

  16. summarize vs mutate on grouped data frames somgen223.stanford.edu 16

  17. summarize 2 135. 4 4 143. 3 3 103. 2 123. cw <- read_csv ( str_c (data_dir, "cw.csv")) 1 1 < dbl > < dbl > diet mean_weight # A tibble: 4 x 2 summarize (mean_weight = mean (weight)) group_by (diet) %>% cw %>% somgen223.stanford.edu 17

  18. 8 8 2 103. 6 93 10 1 2 103. 7 106 12 1 2 103. 125 cw %>% group_by (diet) %>% mutate (mean_weight = mean (weight)) 2 2 1 18 171 10 103. 1 14 16 149 9 103. 2 1 1 76 # ... with 568 more rows < dbl > 103. 2 1 0 42 1 < dbl > < dbl > < dbl > < dbl > 51 diet mean_weight time chick weight diet [4] # Groups: # A tibble: 578 x 5 2 2 5 103. 103. 2 1 6 64 4 2 1 1 4 59 3 103. 2 103. mutate on a grouped data frame somgen223.stanford.edu 18

  19. mutate on a grouped data frame • mutate on a grouped data frame will add a new column (or columns), with the values computed over the groups. • The result will have the same number of rows as the original data frame. • This idiom is very useful for finding members of each group that meet some condition. summarize computes properties of the group, but collapses the data of the individuals in the group. somgen223.stanford.edu 19

  20. Exercise: largest difference from the mean weight • Find the chick with the largest difference from its mean weight. somgen223.stanford.edu 20

  21. < dbl > < dbl > < dbl > < dbl > < dbl > 3 35 21 373 1 cw %>% < dbl > diet mean_weight weight_diff 180. time chick weight # A tibble: 1 x 6 filter (weight_diff == max (weight_diff)) ungroup () %>% weight_diff = abs (mean_weight - weight)) %>% mutate (mean_weight = mean (weight), group_by (chick) %>% 193. Answer: largest difference from the mean weight • ungroup is the opposite of group_by , and removes the grouping from a data frame. We need to do this so that filter does not operates on groups. (Try leaving out the ungroup .) somgen223.stanford.edu 21

  22. Exercise: best diet • Which diet produced the largest growth for some chick? somgen223.stanford.edu 22

  23. 5 264 5 182 6 6 119 7 7 8 118 8 92 9 9 58 10 10 cw %>% 4 # ... with 40 more rows 1 group_by (chick) %>% ## summarize does not keep the diet summarize (weight_gain = max (weight) - min (weight)) # A tibble: 50 x 2 chick weight_gain < dbl > < dbl > 1 4 163 2 2 175 3 3 163 83 First try at answer: somgen223.stanford.edu 23

  24. 2 2 5 182 2 6 6 119 2 7 7 264 8 cw %>% 8 92 2 9 9 58 2 10 10 83 5 118 # ... with 40 more rows 1 group_by (chick) %>% summarize (weight_gain = max (weight) - min (weight), ## Remember the first value of diet for each chick. ## Note that each chick only gets one diet. diet = first (diet)) # A tibble: 50 x 3 chick weight_gain diet < dbl > < dbl > < dbl > 1 4 163 2 2 2 175 2 3 3 163 2 4 2 Partial answer: somgen223.stanford.edu 24

  25. diet 3 group_by (chick) %>% summarize (weight_gain = max (weight) - min (weight), diet = first (diet)) %>% filter (weight_gain == max (weight_gain)) # A tibble: 1 x 3 chick weight_gain cw %>% < dbl > < dbl > < dbl > 1 35 332 Complete answer: somgen223.stanford.edu 25

  26. diet chick weight_gain 332 group_by (diet, chick) %>% ## This will summarize by chick. Summarize uses the last ## variable in the group_by summarize (weight_gain = max (weight) - min (weight)) %>% ungroup () %>% filter (weight_gain == max (weight_gain)) # A tibble: 1 x 3 cw %>% < dbl > < dbl > < dbl > 1 3 35 Another way, using multiple groups • This will be explained in greater detail later. somgen223.stanford.edu 26

  27. Cumulative functions somgen223.stanford.edu 27

  28. 1 1 cumsum (delta_x) [1] 1 1 0 1 2 delta_x <- c (1, 0, -1, 1, 1, -1, -1, -1, 1, 1) 0 -1 0 cumsum and similar • cumsum returns a vector with the cumulative sums: the sum of all the numbers up to and including that position. • In this example, we compute the location given the change in x at each step. • This is sometimes called the running sum. somgen223.stanford.edu 28

  29. [1] 1 TRUE TRUE TRUE TRUE TRUE cumsum (delta_x) cumall ( cumsum (delta_x) >= 0) 0 TRUE FALSE FALSE FALSE 0 -1 1 2 1 0 1 1 [1] TRUE Other cumulative functions • This marks with TRUE all the positions where we have not yet moved to the left of the origin. • Other functions in this family: cumprod , cummin , cummax , cumany , cummean . somgen223.stanford.edu 29

  30. lead and lag : using data from the previous or next line somgen223.stanford.edu 30

  31. 1 9 # A tibble: 3 x 2 time x < int > < dbl > 1 1 (dist <- tibble (time = 1 : 3, x = (1 : 3) ^ 2)) 2 2 4 3 3 Set up example somgen223.stanford.edu 31

Recommend


More recommend