Advanced column-oriented methods: _all, _at, _if
Steve Bagley


  1. Advanced column-oriented methods: _all, _at, _if
     Steve Bagley
     somgen223.stanford.edu

  2. Different ways to select columns
     • It is easy to use filter to select rows: the filter expressions can use the values in the columns that are specified by writing the column names.
     • To use select, we provide the column names.
     • What if we want to select columns based on some aspect of the column names?
     • What if we want to select columns based on the values in those columns, such as all columns that contain at least one NA value?
     • Somehow, we need to compute the identity of the desired columns.
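
As a minimal sketch of the "at least one NA value" question above, the following selects columns by their contents. The tibble `df` and its column names are hypothetical and are not part of the course data.

```r
library(dplyr)

# Hypothetical data (not from the slides): two columns contain a missing value.
df <- tibble(
  id     = 1:3,
  height = c(1.2, NA, 1.5),
  label  = c("a", "b", NA)
)

# Keep only the columns that contain at least one NA.
# The predicate is applied to each column in turn.
with_na <- df %>% select_if(~ any(is.na(.)))
names(with_na)
```

The predicate sees whole columns, not names, which is what distinguishes select_if from select.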

  3. select_at: when you can compute the names of the columns

         gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv"))

         ## select by using the exact names
         gene_exp2 %>% select(d1_g1, d1_g2)
         # A tibble: 4 x 2
           d1_g1 d1_g2
           <dbl> <dbl>
         1     1     3
         2     3     4
         3     6     6
         4     2     5

         ## select by computing which columns match a pattern
         gene_exp2 %>% select_at(vars(starts_with("d1")))
         # A tibble: 4 x 2
           d1_g1 d1_g2
           <dbl> <dbl>
         1     1     3
         2     3     4
         3     6     6
         4     2     5

  4. Functions you can use with select_at
     From the documentation:

     Function     Notes
     starts_with  Starts with a prefix
     ends_with    Ends with a suffix
     contains     Contains a literal string
     matches      Matches a regular expression
     num_range    Matches a numerical range like x01, x02, x03
     one_of       Matches variable names in a character vector
     everything   Matches all variables
     last_col     Select last variable, possibly with an offset
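
A short sketch of a few of these helpers that the later slides do not demonstrate. The tibble `df` and its column names (x01, x02, x03, note_a) are made up for illustration.

```r
library(dplyr)

# Hypothetical data (not from the slides).
df <- tibble(id = 1:2, x01 = 3:4, x02 = 5:6, x03 = 7:8, note_a = c("p", "q"))

# num_range: prefix plus zero-padded numbers (width = 2 gives x01, x02).
by_range <- df %>% select_at(vars(num_range("x", 1:2, width = 2)))

# matches: a regular expression on the column names.
by_regex <- df %>% select_at(vars(matches("^x0[13]$")))

# one_of: names supplied as a character vector.
by_names <- df %>% select_at(vars(one_of(c("id", "note_a"))))

# last_col: the final column.
last_one <- df %>% select_at(vars(last_col()))

names(by_range)
names(by_regex)
names(by_names)
names(last_one)
```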

  5. Example of select_at

         gene_exp2 %>% select_at(vars(contains("_")))
         # A tibble: 4 x 4
           d1_g1 d1_g2 d2_g1 d2_g2
           <dbl> <dbl> <dbl> <dbl>
         1     1     3    10     1
         2     3     4    20     1
         3     6     6    30     1
         4     2     5    40 12121

  6. Example of select_at

         gene_exp2 %>% select_at(vars(-contains("_"), last_col()))
         # A tibble: 4 x 2
           gene   d2_g2
           <chr>  <dbl>
         1 ABC123     1
         2 ABC123     1
         3 DEF234     1
         4 DEF234 12121

     • vars accepts multiple specifications.

  7. Example of select_at

         gene_exp2 %>% select_at(vars("gene", starts_with("d1")))
         # A tibble: 4 x 3
           gene   d1_g1 d1_g2
           <chr>  <dbl> <dbl>
         1 ABC123     1     3
         2 ABC123     3     4
         3 DEF234     6     6
         4 DEF234     2     5

     • Can use the exact name of a column, as a string.

  8. Example of select_at

         gene_exp2 %>% select_at(vars(1, ends_with("g2")))
         # A tibble: 4 x 3
           gene   d1_g2 d2_g2
           <chr>  <dbl> <dbl>
         1 ABC123     3     1
         2 ABC123     4     1
         3 DEF234     6     1
         4 DEF234     5 12121

     • Can use the number of a column.

  9. Example of select_at

         gene_exp2 %>% select_at(vars(seq(from = 1, to = 5, by = 2)))
         # A tibble: 4 x 3
           gene   d1_g2 d2_g2
           <chr>  <dbl> <dbl>
         1 ABC123     3     1
         2 ABC123     4     1
         3 DEF234     6     1
         4 DEF234     5 12121

     • This is useful if there is a regular pattern to the columns you want to keep.

  10. select_if: when you use the contents of the columns

          gene_exp2 %>% select_if(function(x) any(x > 10))
          # A tibble: 4 x 3
            gene   d2_g1 d2_g2
            <chr>  <dbl> <dbl>
          1 ABC123    10     1
          2 ABC123    20     1
          3 DEF234    30     1
          4 DEF234    40 12121

      • The function is applied to the vector containing the contents of the column and returns TRUE to select that column.

  11. The anonymous function

          gene_exp2$d2_g2
          [1]     1     1     1 12121
          gene_exp2$d2_g2 > 10
          [1] FALSE FALSE FALSE  TRUE
          any(gene_exp2$d2_g2 > 10)
          [1] TRUE

  12. select_if (repeated)

          gene_exp2 %>% select_if(function(x) any(x > 10))
          # A tibble: 4 x 3
            gene   d2_g1 d2_g2
            <chr>  <dbl> <dbl>
          1 ABC123    10     1
          2 ABC123    20     1
          3 DEF234    30     1
          4 DEF234    40 12121

      • Why is the gene column selected? Hint: Is "ABC123" > "10"?
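
One way to see what the hint is pointing at, and one possible way to avoid it. The tibble `gene_exp` is a cut-down stand-in for gene_exp2, and the extra is.numeric guard is a suggestion rather than something the slides prescribe. String ordering depends on the locale, although letters sort after digits in the usual ones.

```r
library(dplyr)

# Comparing a character vector with a number coerces the number to a string,
# so this is really "ABC123" > "10", compared character by character.
string_check <- "ABC123" > 10
string_check

# Cut-down stand-in for gene_exp2 (two rows only).
gene_exp <- tibble(
  gene  = c("ABC123", "DEF234"),
  d2_g1 = c(10, 40),
  d2_g2 = c(1, 12121)
)

# Guard the predicate so that only numeric columns are ever compared with 10.
numeric_over_10 <- gene_exp %>%
  select_if(function(x) is.numeric(x) && any(x > 10))
names(numeric_over_10)
```

Because && short-circuits, the comparison never runs on the character column.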

  13. select_if: alternative function syntax

          gene_exp2 %>% select_if(~any(. > 10))
          # A tibble: 4 x 3
            gene   d2_g1 d2_g2
            <chr>  <dbl> <dbl>
          1 ABC123    10     1
          2 ABC123    20     1
          3 DEF234    30     1
          4 DEF234    40 12121

      • Inside a ~ function, . refers to the argument passed in, in this case, each column in succession.
      • This syntax is often a bit shorter than using function(...) ...
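
A quick check that the two spellings are interchangeable, using a hypothetical all-numeric tibble `df` (not from the slides).

```r
library(dplyr)

# Hypothetical data (not from the slides).
df <- tibble(x = c(1, 5), y = c(2, 50), z = c(100, 3))

# The same predicate written as an ordinary function and as a ~ formula.
by_function <- df %>% select_if(function(col) any(col > 10))
by_formula  <- df %>% select_if(~ any(. > 10))

names(by_function)
same_result <- isTRUE(all.equal(by_function, by_formula))
same_result
```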

  14. Same idea for mutate

  15. mutate_at

          mutate_at(gene_exp2, vars(ends_with("g2")), function(x) -x)
          # A tibble: 4 x 5
            gene   d1_g1 d1_g2 d2_g1  d2_g2
            <chr>  <dbl> <dbl> <dbl>  <dbl>
          1 ABC123     1    -3    10     -1
          2 ABC123     3    -4    20     -1
          3 DEF234     6    -6    30     -1
          4 DEF234     2    -5    40 -12121

      • This will negate the values in all columns whose names end with the string "g2".

  16. mutate_if

          mutate_if(gene_exp2, is.numeric, function(x) -x)
          # A tibble: 4 x 5
            gene   d1_g1 d1_g2 d2_g1  d2_g2
            <chr>  <dbl> <dbl> <dbl>  <dbl>
          1 ABC123    -1    -3   -10     -1
          2 ABC123    -3    -4   -20     -1
          3 DEF234    -6    -6   -30     -1
          4 DEF234    -2    -5   -40 -12121

      • This will negate the values in all columns whose contents are numeric.
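
For readers on dplyr 1.0.0 or later, where the _at and _if verbs are marked as superseded, the same two operations can be sketched with across(). This assumes such a dplyr version is installed; the tibble `df` is a cut-down stand-in for gene_exp2.

```r
library(dplyr)

# Cut-down stand-in for gene_exp2 (two rows, three columns).
df <- tibble(gene = c("ABC123", "DEF234"), d1_g1 = c(1, 2), d1_g2 = c(3, 5))

# mutate_at and its across() counterpart: choose columns by name.
old_at <- mutate_at(df, vars(ends_with("g2")), function(x) -x)
new_at <- mutate(df, across(ends_with("g2"), ~ -.x))

# mutate_if and its across() counterpart: choose columns by content.
old_if <- mutate_if(df, is.numeric, function(x) -x)
new_if <- mutate(df, across(where(is.numeric), ~ -.x))

isTRUE(all.equal(old_at, new_at))
isTRUE(all.equal(old_if, new_if))
```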

  17. Same idea for rename

  18. Replace spaces in column names

          tibble(`this is a col name` = 3:4) %>%
            rename_all(~str_replace_all(., " ", "_"))
          # A tibble: 2 x 1
            this_is_a_col_name
                         <int>
          1                  3
          2                  4

  19. Use all lower case in column names

          tibble(Col1 = 1:2, Col2 = 3:4) %>%
            rename_all(~str_to_lower(.))
          # A tibble: 2 x 2
             col1  col2
            <int> <int>
          1     1     3
          2     2     4
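
The same pattern extends to rename_at when only some columns should be renamed. A small sketch with a hypothetical tibble `df` whose column names are invented for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical data (not from the slides).
df <- tibble(`Gene Name` = c("ABC123", "DEF234"), `Day 1` = 1:2, `Day 2` = 3:4)

# Rename only the columns whose names start with "Day":
# replace spaces with underscores, then convert to lower case.
renamed <- df %>%
  rename_at(vars(starts_with("Day")), ~ str_to_lower(str_replace_all(., " ", "_")))
names(renamed)
```

The first column keeps its original name because it was not matched by vars.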

  20. Same idea for summarize

  21. Count number of NA values in each column

          (m <- read_csv(str_c(data_dir, "missing_df.csv")))
          # A tibble: 10 x 3
                id weight group
             <dbl>  <dbl> <chr>
           1     1  0.114 a
           2     2  0.622 b
           3     3  0.609 a
           4     4 NA     b
           5     5  0.861 <NA>
           6     6  0.640 b
           7     7 NA     a
           8     8  0.233 b
           9     9  0.666 a
          10    10  0.514 b

          m %>% summarize_all(~sum(is.na(.)))
          # A tibble: 1 x 3
               id weight group
            <int>  <int> <int>
          1     0      2     1

  22. Summarize with mean

          m %>% summarize_if(is.numeric, ~mean(., na.rm = TRUE))
          # A tibble: 1 x 2
               id weight
            <dbl>  <dbl>
          1   5.5  0.532

      • Summarize by computing the mean of all numeric columns, ignoring NAs.
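
A sketch of two variations on the summaries above: summarize_at for columns chosen by name, and a named list of functions to compute several summaries at once. The tibble `m_small` is hypothetical and much smaller than missing_df.csv.

```r
library(dplyr)

# Hypothetical data (not the course file).
m_small <- tibble(
  id     = 1:4,
  weight = c(0.5, NA, 1.5, 1.0),
  group  = c("a", "b", "a", NA)
)

# summarize_at: choose the columns by name instead of by content.
weight_mean <- m_small %>% summarize_at(vars(weight), ~ mean(., na.rm = TRUE))

# A named list of functions gives one output column per (column, function) pair,
# named like weight_mean and weight_n_missing.
numeric_summary <- m_small %>%
  summarize_if(is.numeric, list(mean      = ~ mean(., na.rm = TRUE),
                                n_missing = ~ sum(is.na(.))))

weight_mean
numeric_summary
```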

  23. Setting column names

          (d1 <- tibble(a = 1:2, b = 3:4))
          # A tibble: 2 x 2
                a     b
            <int> <int>
          1     1     3
          2     2     4

          (d2 <- set_names(d1, c("new_a", "new_b")))
          # A tibble: 2 x 2
            new_a new_b
            <int> <int>
          1     1     3
          2     2     4

      • set_names can assign all the columns new names.
      • Remember to save the new frame.
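
A brief illustration of why the result has to be saved: set_names returns a renamed copy and leaves its input untouched. The sketch assumes set_names comes from purrr, as it would in a tidyverse session; the function form shown last is a feature of that set_names rather than something used in the slides.

```r
library(dplyr)
library(purrr)

d1 <- tibble(a = 1:2, b = 3:4)

# Calling set_names without assigning the result changes nothing in d1.
set_names(d1, c("new_a", "new_b"))
names(d1)

# Saving the result gives a frame with the new names.
d2 <- set_names(d1, c("new_a", "new_b"))
names(d2)

# set_names also accepts a function that transforms the existing names.
d3 <- set_names(d1, toupper)
names(d3)
```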

  24. Grouping over multiple columns

  25. Get example dataset

          (group_by_example <- read_csv(str_c(data_dir, "group_by_example.csv")))
          # A tibble: 4 x 4
            Genotype Treatment gene     expression_value
            <chr>    <chr>     <chr>               <dbl>
          1 Control  Memantine DYRK1A_N            0.504
          2 Control  Saline    DYRK1A_N            0.590
          3 Control  Memantine DYRK1A_N            0.515
          4 Control  Saline    DYRK1A_N            0.592

      • This is part of the intermediate result from data_challenge_mouse_protein_expression.

  26. Summarize by Treatment

          group_by_example %>%
            group_by(Treatment) %>%
            summarize(mean_expression = mean(expression_value))
          # A tibble: 2 x 2
            Treatment mean_expression
            <chr>               <dbl>
          1 Memantine           0.509
          2 Saline              0.591

  27. Summarize by gene

          group_by_example %>%
            group_by(gene) %>%
            summarize(mean_expression = mean(expression_value))
          # A tibble: 1 x 2
            gene     mean_expression
            <chr>              <dbl>
          1 DYRK1A_N           0.550

  28. Summarize Treatment, gene

          group_by_example %>%
            group_by(Treatment, gene) %>%
            summarize(mean_expression = mean(expression_value))
          # A tibble: 2 x 3
          # Groups:   Treatment [2]
            Treatment gene     mean_expression
            <chr>     <chr>              <dbl>
          1 Memantine DYRK1A_N           0.509
          2 Saline    DYRK1A_N           0.591

      • Note the result is grouped by Treatment.
      • If you summarize a grouped data frame, the last grouping variable (here, gene) is removed from the grouping of the result.

  29. Summarize gene, Treatment

          group_by_example %>%
            group_by(gene, Treatment) %>%
            summarize(mean_expression = mean(expression_value))
          # A tibble: 2 x 3
          # Groups:   gene [1]
            gene     Treatment mean_expression
            <chr>    <chr>               <dbl>
          1 DYRK1A_N Memantine           0.509
          2 DYRK1A_N Saline              0.591

      • Note the result is grouped by gene.
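
The leftover grouping can be inspected directly with group_vars. In this sketch the tibble is typed in by hand from the values shown in the example dataset instead of being read from the CSV file, and the row order is an assumption.

```r
library(dplyr)

# Values from the example dataset, entered by hand (row order assumed).
group_by_example <- tibble(
  Genotype         = "Control",
  Treatment        = c("Memantine", "Saline", "Memantine", "Saline"),
  gene             = "DYRK1A_N",
  expression_value = c(0.504, 0.590, 0.515, 0.592)
)

by_treatment_gene <- group_by_example %>%
  group_by(Treatment, gene) %>%
  summarize(mean_expression = mean(expression_value))

by_gene_treatment <- group_by_example %>%
  group_by(gene, Treatment) %>%
  summarize(mean_expression = mean(expression_value))

# summarize drops only the last grouping variable, so the order in group_by matters.
group_vars(by_treatment_gene)
group_vars(by_gene_treatment)

# ungroup removes whatever grouping is left.
group_vars(ungroup(by_treatment_gene))
```

If later steps should not be affected by the leftover grouping, calling ungroup after summarize is the simplest safeguard.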
