ifelse, summarize/mutate, cummulative functions, lead/lag Steve Bagley somgen223.stanford.edu 1
2 b 3 c # A tibble: 3 x 2 x label < int > < chr > 1 1 a 2 (new_df <- tibble (x = 1 : 3, label = c ("a", "b", "c"))) 3 How to create a new tibble from scratch • Although you will most often create new data frames using read_csv , you can create one from scratch by providing arguments to the tibble function in the format name = vector . • Use this function to create small test examples. somgen223.stanford.edu 2
ifelse somgen223.stanford.edu 3
z <- c (1, 2, -999, 4) Replace some indicator value with NA • Suppose that the value -999 has been used to represent a missing value. (Most computer languages do not have the equivalent of R’s NA , so out-of-bounds values are used instead). somgen223.stanford.edu 4
z [1] 1 2 -999 4 ifelse (z == -999, NA, z) [1] 1 2 NA 4 Replace -999 with NA somgen223.stanford.edu 5
NA -999 2 NA FALSE 1 NA FALSE z NA flag z 4 FALSE NA 4 2 NA 1 [1] ifelse (flag, NA, z) TRUE FALSE [1] FALSE FALSE (flag <- z == -999) 4 2 -999 1 [1] TRUE How ifelse works • ifelse takes three arguments: a test vector, here, flag , and two other vectors, here, NA and z . • It returns a vector with elements from either NA or z , depending on whether the corresponding element of flag is TRUE or FALSE . somgen223.stanford.edu 6
Using ifelse to set the color of a point Trebi Wisconsin No. 38 Grand Rapids No. 457 Glabron Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Duluth Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi University Farm Wisconsin No. 38 No. 457 Glabron Peatland Variety of barley Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Morris Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Crookston Glabron Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Waseca Peatland Velvet No. 475 Manchuria No. 462 Svansota -20 -10 0 10 20 Difference in yield (1932 vs 1931) somgen223.stanford.edu 7
barley2 <- barley %>% spread (year, yield) %>% mutate (yield_diff = `1932` - `1931`) ggplot (barley2, aes (x = yield_diff, y = variety, color = factor ( ifelse (yield_diff >= 0, "+", "-"), levels = c ("-", "+")))) + geom_point () + scale_color_manual (values = c ("red", "blue")) + xlab ("Difference in yield (1932 vs 1931)") + ylab ("Variety of barley") + facet_grid (rows = vars (site)) + theme (legend.position = "none") + theme (text = element_text (size = 9)) Code for the barley plot somgen223.stanford.edu 8
More about recoding/replacing values somgen223.stanford.edu 9
v <- c (1, 2, -999, 4) na_if (v, -999) [1] 1 2 NA 4 How to replace value with NA somgen223.stanford.edu 10
v2 <- c (1, 2, 3, 4, NA) replace_na (v2, -999) [1] 1 2 3 4 -999 How to replace NA with another value somgen223.stanford.edu 11
5 a 0.861 < NA > 6 6 0.640 b 7 7 NA 8 5 8 0.233 b 9 9 0.666 a 10 (missing_df <- read_csv ( str_c (data_dir, "missing_df.csv"))) b 0.514 b 0.114 a # A tibble: 10 x 3 id weight group < dbl > < dbl > < chr > 1 1 2 4 NA 2 0.622 b 3 3 0.609 a 4 10 Remove all rows containing NA anywhere somgen223.stanford.edu 12
0.609 a 3 7 0.666 a 9 6 0.233 b 8 5 0.640 b 6 4 missing_df %>% 3 0.514 b 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 7 x 3 na.omit () 10 Remove all rows containing NA anywhere somgen223.stanford.edu 13
4 missing_df %>% 8 0.666 a 9 7 0.233 b 8 6 0.640 b 6 5 0.861 < NA > 5 0.609 a 0.514 b 3 3 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 8 x 3 filter ( complete.cases (weight)) 10 Remove all rows containing NA in specified columns somgen223.stanford.edu 14
0.609 a 3 7 0.666 a 9 6 0.233 b 8 5 0.640 b 6 4 missing_df %>% 3 0.514 b 0.622 b 2 2 0.114 a 1 1 < dbl > < chr > < dbl > id weight group # A tibble: 7 x 3 filter ( complete.cases (weight, group)) 10 Another way somgen223.stanford.edu 15
summarize vs mutate on grouped data frames somgen223.stanford.edu 16
summarize 2 135. 4 4 143. 3 3 103. 2 123. cw <- read_csv ( str_c (data_dir, "cw.csv")) 1 1 < dbl > < dbl > diet mean_weight # A tibble: 4 x 2 summarize (mean_weight = mean (weight)) group_by (diet) %>% cw %>% somgen223.stanford.edu 17
8 8 2 103. 6 93 10 1 2 103. 7 106 12 1 2 103. 125 cw %>% group_by (diet) %>% mutate (mean_weight = mean (weight)) 2 2 1 18 171 10 103. 1 14 16 149 9 103. 2 1 1 76 # ... with 568 more rows < dbl > 103. 2 1 0 42 1 < dbl > < dbl > < dbl > < dbl > 51 diet mean_weight time chick weight diet [4] # Groups: # A tibble: 578 x 5 2 2 5 103. 103. 2 1 6 64 4 2 1 1 4 59 3 103. 2 103. mutate on a grouped data frame somgen223.stanford.edu 18
mutate on a grouped data frame • mutate on a grouped data frame will add a new column (or columns), with the values computed over the groups. • The result will have the same number of rows as the original data frame. • This idiom is very useful for finding members of each group that meet some condition. summarize computes properties of the group, but collapses the data of the individuals in the group. somgen223.stanford.edu 19
Exercise: largest difference from the mean weight • Find the chick with the largest difference from its mean weight. somgen223.stanford.edu 20
< dbl > < dbl > < dbl > < dbl > < dbl > 3 35 21 373 1 cw %>% < dbl > diet mean_weight weight_diff 180. time chick weight # A tibble: 1 x 6 filter (weight_diff == max (weight_diff)) ungroup () %>% weight_diff = abs (mean_weight - weight)) %>% mutate (mean_weight = mean (weight), group_by (chick) %>% 193. Answer: largest difference from the mean weight • ungroup is the opposite of group_by , and removes the grouping from a data frame. We need to do this so that filter does not operates on groups. (Try leaving out the ungroup .) somgen223.stanford.edu 21
Exercise: best diet • Which diet produced the largest growth for some chick? somgen223.stanford.edu 22
5 264 5 182 6 6 119 7 7 8 118 8 92 9 9 58 10 10 cw %>% 4 # ... with 40 more rows 1 group_by (chick) %>% ## summarize does not keep the diet summarize (weight_gain = max (weight) - min (weight)) # A tibble: 50 x 2 chick weight_gain < dbl > < dbl > 1 4 163 2 2 175 3 3 163 83 First try at answer: somgen223.stanford.edu 23
2 2 5 182 2 6 6 119 2 7 7 264 8 cw %>% 8 92 2 9 9 58 2 10 10 83 5 118 # ... with 40 more rows 1 group_by (chick) %>% summarize (weight_gain = max (weight) - min (weight), ## Remember the first value of diet for each chick. ## Note that each chick only gets one diet. diet = first (diet)) # A tibble: 50 x 3 chick weight_gain diet < dbl > < dbl > < dbl > 1 4 163 2 2 2 175 2 3 3 163 2 4 2 Partial answer: somgen223.stanford.edu 24
diet 3 group_by (chick) %>% summarize (weight_gain = max (weight) - min (weight), diet = first (diet)) %>% filter (weight_gain == max (weight_gain)) # A tibble: 1 x 3 chick weight_gain cw %>% < dbl > < dbl > < dbl > 1 35 332 Complete answer: somgen223.stanford.edu 25
diet chick weight_gain 332 group_by (diet, chick) %>% ## This will summarize by chick. Summarize uses the last ## variable in the group_by summarize (weight_gain = max (weight) - min (weight)) %>% ungroup () %>% filter (weight_gain == max (weight_gain)) # A tibble: 1 x 3 cw %>% < dbl > < dbl > < dbl > 1 3 35 Another way, using multiple groups • This will be explained in greater detail later. somgen223.stanford.edu 26
Cumulative functions somgen223.stanford.edu 27
1 1 cumsum (delta_x) [1] 1 1 0 1 2 delta_x <- c (1, 0, -1, 1, 1, -1, -1, -1, 1, 1) 0 -1 0 cumsum and similar • cumsum returns a vector with the cumulative sums: the sum of all the numbers up to and including that position. • In this example, we compute the location given the change in x at each step. • This is sometimes called the running sum. somgen223.stanford.edu 28
[1] 1 TRUE TRUE TRUE TRUE TRUE cumsum (delta_x) cumall ( cumsum (delta_x) >= 0) 0 TRUE FALSE FALSE FALSE 0 -1 1 2 1 0 1 1 [1] TRUE Other cumulative functions • This marks with TRUE all the positions where we have not yet moved to the left of the origin. • Other functions in this family: cumprod , cummin , cummax , cumany , cummean . somgen223.stanford.edu 29
lead and lag : using data from the previous or next line somgen223.stanford.edu 30
1 9 # A tibble: 3 x 2 time x < int > < dbl > 1 1 (dist <- tibble (time = 1 : 3, x = (1 : 3) ^ 2)) 2 2 4 3 3 Set up example somgen223.stanford.edu 31
Recommend
More recommend