group_by , summarize , factors Steve Bagley somgen223.stanford.edu 1
Data frame manipulation: group_by , summarize somgen223.stanford.edu 2
3.4 1 3 2 5 3.3 2 2 2 4 1.1 2 1 2 3 data_dir <- "https://somgen223.stanford.edu/data/" 2 6.6 1 2 1.6 1 1 1 1 < dbl > < dbl > < dbl > < dbl > diet weight time chick # A tibble: 5 x 4 (cw1 <- read_csv ( str_c (data_dir, "cw1.csv"))) 2 Set up cw1 somgen223.stanford.edu 3
cw1 %>% distinct (diet) # A tibble: 2 x 1 diet < dbl > 1 1 2 2 Computing over groups • There are two different diets. • What is the mean weight of all the chicks on each diet? somgen223.stanford.edu 4
cw1 %>% group_by (diet) %>% summarize (mean_weight = mean (weight)) # A tibble: 2 x 2 diet mean_weight < dbl > < dbl > 1 1 2.5 2 2 3.67 Computing the mean weight of each diet somgen223.stanford.edu 5
group_by 2 3 2 1 2 1.1 4 2 1 2 3.3 5 2 3 2 6.6 cw1 %>% 3.4 2 diet weight group_by (diet) # A tibble: 5 x 4 # Groups: diet [2] chick 1 time < dbl > < dbl > < dbl > < dbl > 1 1 1 1 1.6 2 • This looks like the original data frame, except for the additional comment line: # Groups: ... , which is a record of the variables used to form groups. No analysis has happened yet. somgen223.stanford.edu 6
summarize cw1 %>% group_by (diet) %>% summarize (mean_weight = mean (weight)) • summarize takes a grouped data frame and performs the specified operation separately for all the values in each group. • In this case, mean will get called 2 times, once on each subset of rows corresponding to each value of diet . • The results for each group are then combined into a single data frame with the final result. • Note that the result has one row for each group value. somgen223.stanford.edu 7
cw1 %>% summarize (mean_weight = mean (weight)) # A tibble: 1 x 1 mean_weight < dbl > 1 3.20 summarize on an ungrouped data frame • Note also that you can use summarize on an ungrouped data frame: you’ll get one row of results. In this case, it will contain the overall mean weight (of all chicks). somgen223.stanford.edu 8
1 6.6 group_by (diet) %>% summarize (mean_weight = mean (weight), max_weight = max (weight)) # A tibble: 2 x 3 diet mean_weight max_weight < dbl > < dbl > < dbl > 1 cw1 %>% 2.5 3.4 2 2 3.67 Computing more than one summary at the same time • max(weight) will return the maximum value of the weight column. • Do not use max on the entire data frame: max(cw1) ! somgen223.stanford.edu 9
Exercise: the range of weights • For each diet, compute the range of weights (max - min), and sort the result by the range. somgen223.stanford.edu 10
< dbl > 5.5 group_by (diet) %>% summarize (weight_range = max (weight) - min (weight)) %>% arrange (weight_range) # A tibble: 2 x 2 diet weight_range < dbl > cw1 %>% 1 1 1.8 2 2 Answer: the range of weights somgen223.stanford.edu 11
Exercise: max weight of each chick • For each chick, compute its maximum weight somgen223.stanford.edu 12
cw1 %>% group_by (chick) %>% summarize (max_weight = max (weight)) # A tibble: 2 x 2 chick max_weight < dbl > < dbl > 1 1 3.4 2 2 6.6 Answer: max weight of each chick somgen223.stanford.edu 13
1 3 group_by (diet) %>% summarize (n_diet = n ()) # A tibble: 2 x 2 diet n_diet < dbl > < int > 1 cw1 %>% 2 2 2 How many chicks are on each diet? • The function n() returns the number of rows in a group. • group_by/summarize computes the number of rows in each group. somgen223.stanford.edu 14
Exercise: How many measurements for each chick? • Compute the number of measurements (rows) for each chick. somgen223.stanford.edu 15
cw1 %>% group_by (chick) %>% summarize (n_measurements = n ()) # A tibble: 2 x 2 chick n_measurements < dbl > < int > 1 1 2 2 2 3 Answer: How many measurements for each chick? somgen223.stanford.edu 16
Factors somgen223.stanford.edu 17
Defining factors • Factors are a powerful, but sometimes perplexing, way to work with discrete-valued data. • The possible values of a factor are drawn from a finite set of alternatives or categories. Factors are often used in graphics and analysis for grouping. • Example: encoding the sex of a human subject as either M or F and grouping by sex. • Example: encoding the names of the fifty US states and grouping by state. • Note that many measured values are better represented not as factors but as either integers (such as for counting) or floating-point (real-valued) numbers. Example: number of subjects, weight. • We will return to factors later in the course. somgen223.stanford.edu 18
Reading • Read: 5 Data transformation | R for Data Science (sections 5.6 to 5.7) • Watch at least part of this video: Tidy Tuesday screencast: analyzing malaria incidence in R - YouTube (or another video from the same channel). somgen223.stanford.edu 19
Recommend
More recommend