ifelse, summarize/mutate, cumulative functions, lead/lag
Steve Bagley
somgen223.stanford.edu

How to create a new tibble from scratch

    (new_df <- tibble(x = 1:3, label = c("a", "b", "c")))
    # A tibble: 3 x 2
          x label
      <int> <chr>
    1     1 a
    2     2 b
    3     3 c

• Although you will most often create new data frames using read_csv, you can create one from scratch by providing arguments to the tibble function in the format name = vector.
• Use this function to create small test examples.

ifelse

Replace some indicator value with NA

    z <- c(1, 2, -999, 4)

• Suppose that the value -999 has been used to represent a missing value. (Most computer languages do not have an equivalent of R's NA, so out-of-bounds values are used instead.)

Replace -999 with NA

    z
    [1]    1    2 -999    4
    ifelse(z == -999, NA, z)
    [1]  1  2 NA  4

How ifelse works

    (flag <- z == -999)
    [1] FALSE FALSE  TRUE FALSE
    ifelse(flag, NA, z)
    [1]  1  2 NA  4

    flag     NA      z    result
    FALSE    NA      1         1
    FALSE    NA      2         2
    TRUE     NA   -999        NA
    FALSE    NA      4         4

• ifelse takes three arguments: a test vector (here, flag) and two other vectors (here, NA and z).
• It returns a vector with elements from either NA or z, depending on whether the corresponding element of flag is TRUE or FALSE.
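The element-by-element selection described above can be sketched with base R alone. The vector `readings` and its values are hypothetical and are not part of the original slides.

```r
# Hypothetical readings in which -999 marks a missing measurement.
readings <- c(12.5, -999, 14.1, -999, 13.3)

# The test is evaluated for every element, giving a logical vector.
is_missing <- readings == -999

# For each position, ifelse picks NA where the test is TRUE and the
# original reading where it is FALSE. The single NA is recycled to
# the length of the test.
cleaned <- ifelse(is_missing, NA, readings)

print(cleaned)

# Once the indicator value is NA, summary functions can skip it.
print(mean(cleaned, na.rm = TRUE))
```

Because the result has the same length as the test vector, the same pattern can be used to build a new column inside mutate.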

Using ifelse to set the color of a point

[Figure: dot plot of barley varieties (Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota), one panel per site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca). x-axis: Difference in yield (1932 vs 1931), from -20 to 20. y-axis: Variety of barley.]

Code for the barley plot

    barley2 <- barley %>%
      spread(year, yield) %>%
      mutate(yield_diff = `1932` - `1931`)

    ggplot(barley2,
           aes(x = yield_diff,
               y = variety,
               color = factor(ifelse(yield_diff >= 0, "+", "-"),
                              levels = c("-", "+")))) +
      geom_point() +
      scale_color_manual(values = c("red", "blue")) +
      xlab("Difference in yield (1932 vs 1931)") +
      ylab("Variety of barley") +
      facet_grid(rows = vars(site)) +
      theme(legend.position = "none") +
      theme(text = element_text(size = 9))

More about recoding/replacing values

How to replace value with NA

    v <- c(1, 2, -999, 4)
    na_if(v, -999)
    [1]  1  2 NA  4

How to replace NA with another value

    v2 <- c(1, 2, 3, 4, NA)
    replace_na(v2, -999)
    [1]    1    2    3    4 -999
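As a small sketch of how the two functions relate, na_if (from dplyr) and replace_na (from tidyr) can be chained to show that they undo each other for a given indicator value. The vector name `raw` is hypothetical, and the example assumes both packages are installed.

```r
library(dplyr)  # provides na_if()
library(tidyr)  # provides replace_na()

# Hypothetical vector in which -999 is the missing-value code.
raw <- c(1, 2, -999, 4)

# Turn the indicator value into NA ...
with_na <- na_if(raw, -999)

# ... and turn NA back into the indicator value.
round_trip <- replace_na(with_na, -999)

print(with_na)
print(identical(round_trip, raw))
```

In this sketch na_if plays the same role as ifelse(raw == -999, NA, raw), with the intent stated more directly.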

Remove all rows containing NA anywhere

    (missing_df <- read_csv(str_c(data_dir, "missing_df.csv")))
    # A tibble: 10 x 3
          id weight group
       <dbl>  <dbl> <chr>
     1     1  0.114 a
     2     2  0.622 b
     3     3  0.609 a
     4     4 NA     a
     5     5  0.861 <NA>
     6     6  0.640 b
     7     7 NA     b
     8     8  0.233 b
     9     9  0.666 a
    10    10  0.514 b

Remove all rows containing NA anywhere

    missing_df %>%
      na.omit()
    # A tibble: 7 x 3
         id weight group
      <dbl>  <dbl> <chr>
    1     1  0.114 a
    2     2  0.622 b
    3     3  0.609 a
    4     6  0.640 b
    5     8  0.233 b
    6     9  0.666 a
    7    10  0.514 b

Remove all rows containing NA in specified columns

    missing_df %>%
      filter(complete.cases(weight))
    # A tibble: 8 x 3
         id weight group
      <dbl>  <dbl> <chr>
    1     1  0.114 a
    2     2  0.622 b
    3     3  0.609 a
    4     5  0.861 <NA>
    5     6  0.640 b
    6     8  0.233 b
    7     9  0.666 a
    8    10  0.514 b

Another way

    missing_df %>%
      filter(complete.cases(weight, group))
    # A tibble: 7 x 3
         id weight group
      <dbl>  <dbl> <chr>
    1     1  0.114 a
    2     2  0.622 b
    3     3  0.609 a
    4     6  0.640 b
    5     8  0.233 b
    6     9  0.666 a
    7    10  0.514 b
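Since missing_df.csv is not available outside the course, the same row-dropping logic can be tried on a data frame built from scratch. The following is a minimal base R sketch; the data frame `df` and its values are hypothetical.

```r
# Hypothetical data frame, not the course's missing_df.csv.
df <- data.frame(
  id     = 1:5,
  weight = c(0.2, NA, 0.5, 0.7, NA),
  group  = c("a", "b", NA, "b", "a")
)

# Keep rows with no NA in any column. This selects the same rows
# as na.omit(df).
all_complete <- df[complete.cases(df), ]

# Keep rows with no NA in the weight column only.
weight_complete <- df[!is.na(df$weight), ]

print(all_complete$id)
print(weight_complete$id)
```

complete.cases returns one logical value per row, which is why it can be used directly as a row filter.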

summarize vs mutate on grouped data frames

summarize

    cw <- read_csv(str_c(data_dir, "cw.csv"))
    cw %>%
      group_by(diet) %>%
      summarize(mean_weight = mean(weight))
    # A tibble: 4 x 2
       diet mean_weight
      <dbl>       <dbl>
    1     1        123.
    2     2        103.
    3     3        143.
    4     4        135.

mutate on a grouped data frame

    cw %>%
      group_by(diet) %>%
      mutate(mean_weight = mean(weight))
    # A tibble: 578 x 5
    # Groups:   diet [4]
       weight  time chick  diet mean_weight
        <dbl> <dbl> <dbl> <dbl>       <dbl>
     1     42     0     1     2        103.
     2     51     2     1     2        103.
     3     59     4     1     2        103.
     4     64     6     1     2        103.
     5     76     8     1     2        103.
     6     93    10     1     2        103.
     7    106    12     1     2        103.
     8    125    14     1     2        103.
     9    149    16     1     2        103.
    10    171    18     1     2        103.
    # ... with 568 more rows

mutate on a grouped data frame

• mutate on a grouped data frame will add a new column (or columns), with the values computed over the groups.
• The result will have the same number of rows as the original data frame.
• This idiom is very useful for finding members of each group that meet some condition. By contrast, summarize computes properties of the group but collapses the individual rows into one row per group.
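The difference in row counts, and the "members that meet some condition" idiom, can be seen on a very small example. This is a sketch with a hypothetical tibble `toy`; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: two groups with three measurements each.
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 6, 10, 20, 30)
)

# summarize: the rows of each group collapse into one row.
per_group <- toy %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))

# mutate: every original row is kept, and the group mean is
# repeated on each row of its group.
per_row <- toy %>%
  group_by(group) %>%
  mutate(mean_value = mean(value)) %>%
  ungroup()

print(nrow(per_group))
print(nrow(per_row))

# With the group mean available on each row, members that exceed
# their own group's mean can be selected with a plain filter.
above_mean <- filter(per_row, value > mean_value)
print(above_mean)
```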

Exercise: largest difference from the mean weight

• Find the chick that has a measurement with the largest difference from its own mean weight.

Answer: largest difference from the mean weight

    cw %>%
      group_by(chick) %>%
      mutate(mean_weight = mean(weight),
             weight_diff = abs(mean_weight - weight)) %>%
      ungroup() %>%
      filter(weight_diff == max(weight_diff))
    # A tibble: 1 x 6
      weight  time chick  diet mean_weight weight_diff
       <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>
    1    373    21    35     3        180.        193.

• ungroup is the opposite of group_by, and removes the grouping from a data frame. We need to do this so that filter does not operate on groups: max(weight_diff) should be taken over the whole data frame, not within each chick. (Try leaving out the ungroup.)
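What happens when the ungroup is left out can be checked on a few rows. This is a minimal sketch with a hypothetical tibble `scores`; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: two teams with two scores each.
scores <- tibble(
  team  = c("x", "x", "y", "y"),
  score = c(3, 8, 5, 1)
)

# Still grouped: max(score) is computed within each team, so the
# best row of every team survives the filter.
grouped_max <- scores %>%
  group_by(team) %>%
  filter(score == max(score))

# Ungrouped: max(score) is computed over the whole table, so only
# the overall best row survives.
overall_max <- scores %>%
  group_by(team) %>%
  ungroup() %>%
  filter(score == max(score))

print(nrow(grouped_max))
print(nrow(overall_max))
```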

Exercise: best diet

• Which diet produced the largest growth for some chick?

First try at answer:

    cw %>%
      group_by(chick) %>%
      ## summarize does not keep the diet
      summarize(weight_gain = max(weight) - min(weight))
    # A tibble: 50 x 2
       chick weight_gain
       <dbl>       <dbl>
     1     1         163
     2     2         175
     3     3         163
     4     4         118
     5     5         182
     6     6         119
     7     7         264
     8     8          92
     9     9          58
    10    10          83
    # ... with 40 more rows

Partial answer:

    cw %>%
      group_by(chick) %>%
      summarize(weight_gain = max(weight) - min(weight),
                ## Remember the first value of diet for each chick.
                ## Note that each chick only gets one diet.
                diet = first(diet))
    # A tibble: 50 x 3
       chick weight_gain  diet
       <dbl>       <dbl> <dbl>
     1     1         163     2
     2     2         175     2
     3     3         163     2
     4     4         118     2
     5     5         182     2
     6     6         119     2
     7     7         264     2
     8     8          92     2
     9     9          58     2
    10    10          83     2
    # ... with 40 more rows

Complete answer:

    cw %>%
      group_by(chick) %>%
      summarize(weight_gain = max(weight) - min(weight),
                diet = first(diet)) %>%
      filter(weight_gain == max(weight_gain))
    # A tibble: 1 x 3
      chick weight_gain  diet
      <dbl>       <dbl> <dbl>
    1    35         332     3

Another way, using multiple groups

    cw %>%
      group_by(diet, chick) %>%
      ## This computes one row per (diet, chick) combination.
      ## summarize then drops the last variable in the group_by
      ## (chick), so the result is still grouped by diet.
      summarize(weight_gain = max(weight) - min(weight)) %>%
      ungroup() %>%
      filter(weight_gain == max(weight_gain))
    # A tibble: 1 x 3
       diet chick weight_gain
      <dbl> <dbl>       <dbl>
    1     3    35         332

• This will be explained in greater detail later.
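The way summarize treats several grouping variables can be inspected with group_vars. The sketch below uses a hypothetical tibble `growth` and assumes dplyr 1.0 or later, where the `.groups` argument is available; "drop_last" is written out only to make the default behavior explicit.

```r
library(dplyr)

# Hypothetical data: three chicks on two diets, two weighings each.
growth <- tibble(
  diet   = c(1, 1, 1, 1, 2, 2),
  chick  = c(1, 1, 2, 2, 3, 3),
  weight = c(40, 100, 42, 90, 41, 150)
)

# One row per (diet, chick) combination. The last grouping
# variable (chick) is dropped, so the result stays grouped by diet.
by_chick <- growth %>%
  group_by(diet, chick) %>%
  summarize(weight_gain = max(weight) - min(weight),
            .groups = "drop_last")

print(group_vars(by_chick))

# Still grouped by diet: the best chick of each diet is kept.
best_per_diet <- by_chick %>%
  filter(weight_gain == max(weight_gain))

# After ungroup: only the best chick overall is kept.
best_overall <- by_chick %>%
  ungroup() %>%
  filter(weight_gain == max(weight_gain))

print(nrow(best_per_diet))
print(best_overall)
```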

Cumulative functions

cumsum and similar

    delta_x <- c(1, 0, -1, 1, 1, -1, -1, -1, 1, 1)
    cumsum(delta_x)
     [1]  1  1  0  1  2  1  0 -1  0  1

• cumsum returns a vector with the cumulative sums: the sum of all the numbers up to and including that position.
• In this example, we compute the location x after each step, given the change in x at that step.
• This is sometimes called the running sum.

Other cumulative functions

    cumsum(delta_x)
     [1]  1  1  0  1  2  1  0 -1  0  1
    cumall(cumsum(delta_x) >= 0)
     [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

• This marks with TRUE all the positions where we have not yet moved to the left of the origin.
• Other functions in this family: cumprod, cummin, cummax, cumany, cummean.
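The other members of the family can be compared side by side on one short vector. In this sketch the vector `steps` is hypothetical; cumsum, cummax and cummin are base R, while cumall, cumany and cummean come from dplyr, which is assumed to be installed.

```r
library(dplyr)  # provides cumall(), cumany(), cummean()

# Hypothetical step sizes along the x axis.
steps <- c(2, -1, 3, -5, 4)

# Running position after each step.
position <- cumsum(steps)          # 2 1 4 -1 3

# Furthest right and furthest left reached so far.
furthest_right <- cummax(position) # 2 2 4 4 4
furthest_left  <- cummin(position) # 2 1 1 -1 -1

# Mean step size so far.
running_mean <- cummean(steps)

# TRUE until the first time the position goes below zero.
never_negative <- cumall(position >= 0)

# FALSE until the first time the position goes below zero.
ever_negative <- cumany(position < 0)

print(position)
print(never_negative)
print(ever_negative)
```

In this example cumall and cumany are complements of each other, because the condition given to one is the negation of the condition given to the other.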

lead and lag: using data from the next or previous row

Set up example

    (dist <- tibble(time = 1:3, x = (1:3)^2))
    # A tibble: 3 x 2
       time     x
      <int> <dbl>
    1     1     1
    2     2     4
    3     3     9
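The slides that follow this setup are not included here, so the following is only a sketch of how dplyr's lag and lead might be applied to this tibble. The column names x_prev, x_next and step are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Same setup as above: position x at times 1, 2 and 3.
dist <- tibble(time = 1:3, x = (1:3)^2)

shifted <- dist %>%
  mutate(
    x_prev = lag(x),      # value from the previous row: NA 1 4
    x_next = lead(x),     # value from the next row:     4 9 NA
    step   = x - lag(x)   # change since the previous row: NA 3 5
  )

print(shifted)
```

The first row has no previous row and the last row has no next row, so lag and lead fill those positions with NA.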
