Advanced column-oriented methods: _all, _at, _if
Steve Bagley
somgen223.stanford.edu

Different ways to select columns

• It is easy to use filter to select rows: the filter expressions can use the values in the columns, which are specified by writing the column names.
• To use select, we provide the column names.
• What if we want to select columns based on some aspect of the column names?
• What if we want to select columns based on the values in those columns, such as all columns that contain at least one NA value?
• Somehow, we need to compute the identity of the desired columns.

select_at: when you can compute the names of the columns

    gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv"))

    ## select by using the exact names
    gene_exp2 %>% select(d1_g1, d1_g2)
    # A tibble: 4 x 2
      d1_g1 d1_g2
      <dbl> <dbl>
    1     1     3
    2     2     5
    3     6     6
    4     3     4

    ## select by computing which columns match a pattern
    gene_exp2 %>% select_at(vars(starts_with("d1")))
    # A tibble: 4 x 2
      d1_g1 d1_g2
      <dbl> <dbl>
    1     1     3
    2     2     5
    3     6     6
    4     3     4

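The two calls above can be tried without the course CSV. The following is a minimal sketch that assumes the values of gene_exp2 as they appear in the printed output on these slides; the tibble is rebuilt by hand here rather than read from gene_exp2.csv.

```r
library(dplyr)

# Assumption: values copied from the printed output on the slides,
# standing in for the contents of gene_exp2.csv.
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 2, 6, 3),
  d1_g2 = c(3, 5, 6, 4),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# Select by exact names, then by a computed pattern.
by_name    <- gene_exp2 %>% select(d1_g1, d1_g2)
by_pattern <- gene_exp2 %>% select_at(vars(starts_with("d1")))

# Both approaches should pick the same columns.
identical(names(by_name), names(by_pattern))
```

The pattern version keeps working if more columns starting with "d1" are added later, while the exact-name version would have to be edited.
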
Functions you can use with select_at

From the documentation:

    Function      Notes
    starts_with   Starts with a prefix
    ends_with     Ends with a suffix
    contains      Contains a literal string
    matches       Matches a regular expression
    num_range     Matches a numerical range like x01, x02, x03
    one_of        Matches variable names in a character vector
    everything    Matches all variables
    last_col      Select last variable, possibly with an offset

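Some of these helpers do not appear in the examples that follow. As a rough sketch, here is how matches, num_range, and one_of might be used; the tibble d and its column names are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical tibble whose column names follow a numbered pattern.
d <- tibble(id = 1:2, x01 = 3:4, x02 = 5:6, x03 = 7:8, note = c("a", "b"))

# num_range: prefix "x" plus the numbers 1 and 2, zero-padded to width 2.
by_range <- d %>% select_at(vars(num_range("x", 1:2, width = 2)))

# matches: a regular expression applied to the column names.
by_regex <- d %>% select_at(vars(matches("^x0[13]$")))

# one_of: names supplied as a character vector.
by_vector <- d %>% select_at(vars(one_of(c("id", "note"))))

names(by_range)
names(by_regex)
names(by_vector)
```
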
Example of select_at

    gene_exp2 %>% select_at(vars(contains("_")))
    # A tibble: 4 x 4
      d1_g1 d1_g2 d2_g1 d2_g2
      <dbl> <dbl> <dbl> <dbl>
    1     1     3    10     1
    2     2     5    20     1
    3     6     6    30     1
    4     3     4    40 12121

Example of select_at

    gene_exp2 %>% select_at(vars(-contains("_"), last_col()))
    # A tibble: 4 x 2
      gene   d2_g2
      <chr>  <dbl>
    1 ABC123     1
    2 ABC123     1
    3 DEF234     1
    4 DEF234 12121

• vars accepts multiple specifications.

Example of select_at

    gene_exp2 %>% select_at(vars("gene", starts_with("d1")))
    # A tibble: 4 x 3
      gene   d1_g1 d1_g2
      <chr>  <dbl> <dbl>
    1 ABC123     1     3
    2 ABC123     2     5
    3 DEF234     6     6
    4 DEF234     3     4

• Can use exact name of column, as a string.

Example of select_at

    gene_exp2 %>% select_at(vars(1, ends_with("g2")))
    # A tibble: 4 x 3
      gene   d1_g2 d2_g2
      <chr>  <dbl> <dbl>
    1 ABC123     3     1
    2 ABC123     5     1
    3 DEF234     6     1
    4 DEF234     4 12121

• Can use the number of a column.

Example of select_at

    gene_exp2 %>% select_at(vars(seq(from = 1, to = 5, by = 2)))
    # A tibble: 4 x 3
      gene   d1_g2 d2_g2
      <chr>  <dbl> <dbl>
    1 ABC123     3     1
    2 ABC123     5     1
    3 DEF234     6     1
    4 DEF234     4 12121

• This is useful if there is a regular pattern to the columns you want to keep.

select_if: when you use the contents of the columns

    gene_exp2 %>% select_if(function(x) any(x > 10))
    # A tibble: 4 x 3
      gene   d2_g1 d2_g2
      <chr>  <dbl> <dbl>
    1 ABC123    10     1
    2 ABC123    20     1
    3 DEF234    30     1
    4 DEF234    40 12121

• The function is applied to the vector containing the contents of the column and returns TRUE to select that column.

The anonymous function

    gene_exp2$d2_g2
    [1]     1     1     1 12121
    gene_exp2$d2_g2 > 10
    [1] FALSE FALSE FALSE  TRUE
    any(gene_exp2$d2_g2 > 10)
    [1] TRUE

select_if (repeated)

    gene_exp2 %>% select_if(function(x) any(x > 10))
    # A tibble: 4 x 3
      gene   d2_g1 d2_g2
      <chr>  <dbl> <dbl>
    1 ABC123    10     1
    2 ABC123    20     1
    3 DEF234    30     1
    4 DEF234    40 12121

• Why is the gene column selected? Hint: Is "ABC123" > "10"?

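One way to explore the hint: when a character vector is compared with a number, R converts the number to a string and compares strings, so the test can pass for a column like gene. The sketch below assumes the gene_exp2 values printed on these slides and shows one possible guard, checking is.numeric before comparing; it is an illustration, not necessarily the fix intended by the course.

```r
library(dplyr)

# Assumption: values copied from the printed output on the slides.
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 2, 6, 3),
  d1_g2 = c(3, 5, 6, 4),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

# A string comparison, not a numeric one: 10 is converted to "10".
"ABC123" > 10

# Guarded predicate: only numeric columns are compared with 10.
# && stops at is.numeric(x) for the character column.
numeric_over_10 <- gene_exp2 %>%
  select_if(function(x) is.numeric(x) && any(x > 10))

names(numeric_over_10)
```
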
select_if: alternative function syntax

    gene_exp2 %>% select_if(~any(. > 10))
    # A tibble: 4 x 3
      gene   d2_g1 d2_g2
      <chr>  <dbl> <dbl>
    1 ABC123    10     1
    2 ABC123    20     1
    3 DEF234    30     1
    4 DEF234    40 12121

• Inside a ~ function, . refers to the argument passed in, in this case, each column in succession.
• This syntax is often a bit shorter than using function(...) ...

Same idea for mutate

mutate_at

    mutate_at(gene_exp2, vars(ends_with("g2")), function(x) -x)
    # A tibble: 4 x 5
      gene   d1_g1 d1_g2 d2_g1  d2_g2
      <chr>  <dbl> <dbl> <dbl>  <dbl>
    1 ABC123     1    -3    10     -1
    2 ABC123     2    -5    20     -1
    3 DEF234     6    -6    30     -1
    4 DEF234     3    -4    40 -12121

• This will negate the values in all columns whose name ends with the string "g2".

mutate_if

    mutate_if(gene_exp2, is.numeric, function(x) -x)
    # A tibble: 4 x 5
      gene   d1_g1 d1_g2 d2_g1  d2_g2
      <chr>  <dbl> <dbl> <dbl>  <dbl>
    1 ABC123    -1    -3   -10     -1
    2 ABC123    -2    -5   -20     -1
    3 DEF234    -6    -6   -30     -1
    4 DEF234    -3    -4   -40 -12121

• This will negate the values in all columns whose contents are numeric.

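In dplyr 1.0.0 and later, the _at, _if, and _all variants are marked as superseded in favor of across(). The sketch below, which assumes dplyr 1.0.0 or newer and the gene_exp2 values printed on these slides, compares the two spellings of the mutate examples above.

```r
library(dplyr)

# Assumption: values copied from the printed output on the slides.
gene_exp2 <- tibble(
  gene  = c("ABC123", "ABC123", "DEF234", "DEF234"),
  d1_g1 = c(1, 2, 6, 3),
  d1_g2 = c(3, 5, 6, 4),
  d2_g1 = c(10, 20, 30, 40),
  d2_g2 = c(1, 1, 1, 12121)
)

negate <- function(x) -x

# Scoped variants, as used on the slides.
old_at <- mutate_at(gene_exp2, vars(ends_with("g2")), negate)
old_if <- mutate_if(gene_exp2, is.numeric, negate)

# across() spellings (assumes dplyr >= 1.0.0).
new_at <- mutate(gene_exp2, across(ends_with("g2"), negate))
new_if <- mutate(gene_exp2, across(where(is.numeric), negate))

# Compare the results as plain data frames.
all.equal(as.data.frame(old_at), as.data.frame(new_at))
all.equal(as.data.frame(old_if), as.data.frame(new_if))
```
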
Same idea for rename

Replace spaces in column names

    tibble(`this is a col name` = 3:4) %>%
      rename_all(~str_replace_all(., " ", "_"))
    # A tibble: 2 x 1
      this_is_a_col_name
                   <int>
    1                  3
    2                  4

Use all lower case in column names

    tibble(Col1 = 1:2, Col2 = 3:4) %>%
      rename_all(~str_to_lower(.))
    # A tibble: 2 x 2
       col1  col2
      <int> <int>
    1     1     3
    2     2     4

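The two renaming steps above can be combined, and rename_at restricts the renaming to selected columns in the same way select_at does. A small sketch follows; the tibble d and its column names are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical tibble with awkward column names.
d <- tibble(`Sample ID` = 1:2, `Day 1 Value` = c(0.5, 0.7))

# Replace spaces and lower-case every column name in one pass.
cleaned <- d %>%
  rename_all(~ str_to_lower(str_replace_all(., " ", "_")))

# Rename only the columns whose names start with "Day".
partly_cleaned <- d %>%
  rename_at(vars(starts_with("Day")), ~ str_replace_all(., " ", "_"))

names(cleaned)
names(partly_cleaned)
```
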
Same idea for summarize

Count number of NA values in each column

    (m <- read_csv(str_c(data_dir, "missing_df.csv")))
    # A tibble: 10 x 3
          id weight group
       <dbl>  <dbl> <chr>
     1     1  0.114 a
     2     2  0.622 b
     3     3  0.609 a
     4     4 NA     b
     5     5  0.861 <NA>
     6     6  0.640 b
     7     7 NA     a
     8     8  0.233 b
     9     9  0.666 a
    10    10  0.514 b

    m %>% summarize_all(~sum(is.na(.)))
    # A tibble: 1 x 3
         id weight group
      <int>  <int> <int>
    1     0      2     1

Summarize with mean

    m %>% summarize_if(is.numeric, ~mean(., na.rm = TRUE))
    # A tibble: 1 x 2
         id weight
      <dbl>  <dbl>
    1   5.5  0.532

• Summarize by computing the mean of all numeric columns, ignoring NAs.

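The summaries above can be reproduced without missing_df.csv. The sketch below assumes the values of m as printed on the slides (the tibble is typed in by hand) and adds a summarize_at call, which is not on the slides, to apply two summaries to a single named column.

```r
library(dplyr)

# Assumption: values copied from the printed output on the slides,
# standing in for the contents of missing_df.csv.
m <- tibble(
  id     = as.numeric(1:10),
  weight = c(0.114, 0.622, 0.609, NA, 0.861, 0.640, NA, 0.233, 0.666, 0.514),
  group  = c("a", "b", "a", "b", NA, "b", "a", "b", "a", "b")
)

# Count NA values in every column.
na_counts <- m %>% summarize_all(~ sum(is.na(.)))

# Mean of every numeric column, ignoring NAs.
numeric_means <- m %>% summarize_if(is.numeric, ~ mean(., na.rm = TRUE))

# Several summaries of one selected column.
weight_summary <- m %>%
  summarize_at(
    vars(weight),
    list(mean = ~ mean(., na.rm = TRUE), n_missing = ~ sum(is.na(.)))
  )

na_counts
numeric_means
weight_summary
```
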
Setting column names

    (d1 <- tibble(a = 1:2, b = 3:4))
    # A tibble: 2 x 2
          a     b
      <int> <int>
    1     1     3
    2     2     4

    (d2 <- set_names(d1, c("new_a", "new_b")))
    # A tibble: 2 x 2
      new_a new_b
      <int> <int>
    1     1     3
    2     2     4

• set_names can assign all the columns new names.
• Remember to save the new frame.

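Because set_names takes an ordinary character vector, the new names can also be computed from the old ones. A brief sketch, assuming set_names comes from purrr (loaded as part of the tidyverse):

```r
library(dplyr)
library(purrr)

d1 <- tibble(a = 1:2, b = 3:4)

# Build the new names from the existing ones instead of typing them out.
d2 <- set_names(d1, paste0("new_", names(d1)))

names(d1)  # d1 itself is unchanged
names(d2)
```
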
Grouping over multiple columns

Get example dataset

    (group_by_example <- read_csv(str_c(data_dir, "group_by_example.csv")))
    # A tibble: 4 x 4
      Genotype Treatment gene     expression_value
      <chr>    <chr>     <chr>               <dbl>
    1 Control  Memantine DYRK1A_N            0.504
    2 Control  Saline    DYRK1A_N            0.590
    3 Control  Memantine DYRK1A_N            0.515
    4 Control  Saline    DYRK1A_N            0.592

• This is part of the intermediate result from data_challenge_mouse_protein_expression.

Summarize by Treatment

    group_by_example %>%
      group_by(Treatment) %>%
      summarize(mean_expression = mean(expression_value))
    # A tibble: 2 x 2
      Treatment mean_expression
      <chr>               <dbl>
    1 Memantine           0.509
    2 Saline              0.591

Summarize by gene

    group_by_example %>%
      group_by(gene) %>%
      summarize(mean_expression = mean(expression_value))
    # A tibble: 1 x 2
      gene     mean_expression
      <chr>              <dbl>
    1 DYRK1A_N           0.550

Summarize Treatment, gene

    group_by_example %>%
      group_by(Treatment, gene) %>%
      summarize(mean_expression = mean(expression_value))
    # A tibble: 2 x 3
    # Groups:   Treatment [2]
      Treatment gene     mean_expression
      <chr>     <chr>              <dbl>
    1 Memantine DYRK1A_N           0.509
    2 Saline    DYRK1A_N           0.591

• Note the result is grouped by Treatment.
• If you summarize a grouped data frame, the last grouping variable is removed.

Summarize gene, Treatment

    group_by_example %>%
      group_by(gene, Treatment) %>%
      summarize(mean_expression = mean(expression_value))
    # A tibble: 2 x 3
    # Groups:   gene [1]
      gene     Treatment mean_expression
      <chr>    <chr>               <dbl>
    1 DYRK1A_N Memantine           0.509
    2 DYRK1A_N Saline              0.591

• Note the result is grouped by gene.

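The leftover grouping can be inspected with group_vars and removed with ungroup. The sketch below assumes the group_by_example values printed on these slides (typed in by hand, with the row order taken from that reconstruction) and also shows the .groups argument of summarize, which assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Assumption: values copied from the printed output on the slides,
# standing in for the contents of group_by_example.csv.
group_by_example <- tibble(
  Genotype         = "Control",
  Treatment        = c("Memantine", "Saline", "Memantine", "Saline"),
  gene             = "DYRK1A_N",
  expression_value = c(0.504, 0.590, 0.515, 0.592)
)

by_both <- group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value))

# Grouping variables that remain after summarize.
group_vars(by_both)

# Remove the remaining grouping explicitly ...
no_groups <- ungroup(by_both)
group_vars(no_groups)

# ... or ask summarize to drop all grouping (assumes dplyr >= 1.0.0).
dropped <- group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value), .groups = "drop")
group_vars(dropped)
```

Dropping the grouping explicitly avoids surprises when later steps such as mutate or filter would otherwise be applied per group.
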