Reshaping a data frame
Steve Bagley
somgen223.stanford.edu
Reshaping data
• Sometimes data are organized in a way that makes it difficult to compute in a vector-oriented way.
• Sometimes data elements are included in the column names.
• The tidyr package (part of tidyverse) allows you to change the organization of the data while keeping the content the same.
Reshaping example

    (gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

• This is in wide format: the column names contain data (the condition).
• Adding another condition, such as treatment2, would create a new column.
• In R, it is sometimes useful to set up the data frame so that new data are added as rows (tall or long format).
Wide vs tall

    gene_exp1

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

    gene_tall

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• Convince yourself that the same information appears in these two data frames.
• In gene_tall, the column names describe the data; they don’t contain any data.
Using gather

    (gene_tall <- gather(gene_exp1, condition, expression_level,
                         control:treatment))

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• The arguments to gather:
  1. The data frame, here gene_exp1.
  2. The key, which is the name of the new column for the values from the old column names, here condition.
  3. The value, which is the name of the new column for the data values, here expression_level.
  4. The columns from which to gather the data. Here we use the : operator to name a range of columns.
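In tidyr 1.0.0 and later, gather is marked as superseded by pivot_longer. The sketch below shows what the same reshaping might look like with the newer function. It rebuilds gene_exp1 inline with tribble instead of reading the CSV (an assumption made only so the example is self-contained; the values are those of the gene_exp1 printout), and the names gene_tall_gather and gene_tall_pivot are hypothetical.

```r
library(tidyr)
library(tibble)

# Assumption: gene_exp1 is rebuilt inline rather than read from gene_exp1.csv.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# The call from the slide: key = condition, value = expression_level.
gene_tall_gather <- gather(gene_exp1, condition, expression_level,
                           control:treatment)

# A possible pivot_longer equivalent: the arguments are named, and the
# new column names are given as strings.
gene_tall_pivot <- pivot_longer(
  gene_exp1,
  cols = control:treatment,
  names_to = "condition",
  values_to = "expression_level"
)
```

Both results hold the same six observations. The row order differs: gather stacks one source column after another, while pivot_longer keeps the rows for each original row together.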
Exercise: using tidy data
• Filter the rows of gene_tall for gene ABC123 only.
• Separately, filter to get only the control condition.
Answer: using tidy data

    filter(gene_tall, gene == "ABC123")

    # A tibble: 2 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 ABC123 treatment                1

    filter(gene_tall, condition == "control")

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
Exercise: compute change in gene expression
• Compute the change (treatment - control) for each gene.
• Hint: think about what shape the data should have to enable this computation.
Answer: compute change in gene expression

    mutate(gene_exp1, change = treatment - control)

    # A tibble: 3 x 4
      gene   control treatment change
      <chr>    <dbl>     <dbl>  <dbl>
    1 ABC123       0         1      1
    2 DEF234      10         3     -7
    3 GKK7        12        13      1

• We use the data in the original (wide) format because then we can subtract the columns directly.
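For contrast, here is a rough sketch of the same computation starting from the tall format. It is not part of the original answer; it only illustrates why the wide shape is more convenient for this task. The data frame is rebuilt inline (an assumption for self-containment), and change_from_tall is a hypothetical name.

```r
library(dplyr)
library(tibble)

# Assumption: gene_tall is rebuilt inline from the values shown in the slides.
gene_tall <- tribble(
  ~gene,    ~condition,  ~expression_level,
  "ABC123", "control",                   0,
  "DEF234", "control",                  10,
  "GKK7",   "control",                  12,
  "ABC123", "treatment",                 1,
  "DEF234", "treatment",                 3,
  "GKK7",   "treatment",                13
)

# Within each gene, pick out the two values by condition and subtract.
# This relies on every gene having exactly one row per condition.
change_from_tall <- gene_tall %>%
  group_by(gene) %>%
  summarize(
    change = expression_level[condition == "treatment"] -
      expression_level[condition == "control"]
  )
```

The wide-format version needs one subtraction of two columns; the tall version has to look up each value by its condition label first.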
Exercise: filtering the gene expression data
• Produce a data frame that includes all data where the control or treatment expression value is above 5.
Answer: filtering the gene expression data

    gene_tall %>%
      filter(expression_level > 5)

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 DEF234 control                 10
    2 GKK7   control                 12
    3 GKK7   treatment               13

• Note that this computation is very easy to do using the tall data format. In this format, it is a single comparison that works for both treatment and control groups.
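To see what the tall format buys here, the following sketch tries the same question on the wide data. It is an illustration only; gene_exp1 is rebuilt inline (an assumption) and wide_above_5 is a hypothetical name.

```r
library(dplyr)
library(tibble)

# Assumption: gene_exp1 is rebuilt inline from the values shown in the slides.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

# In wide format, each condition column needs its own comparison,
# and the comparisons have to be combined with |.
wide_above_5 <- filter(gene_exp1, control > 5 | treatment > 5)
```

This keeps whole rows, so the DEF234 row comes back with its treatment value of 3 still attached. Adding a third condition column would also mean editing the filter expression, whereas the tall-format filter stays the same.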
Example: what is the minimum expression level for each gene?

    gene_tall %>%
      group_by(gene) %>%
      summarize(min_level = min(expression_level))

    # A tibble: 3 x 2
      gene   min_level
      <chr>      <dbl>
    1 ABC123         0
    2 DEF234         3
    3 GKK7          12

• This drops the condition column, which we might want to retain.
Keep the entire row with the minimum

    gene_tall %>%
      group_by(gene) %>%
      arrange(expression_level) %>%
      slice(1) %>%
      ungroup()

    # A tibble: 3 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 treatment                3
    3 GKK7   control                 12

• slice(1) returns the first row in each group.
• ungroup removes the grouping attribute that was added by group_by.
Even better: which.min

    gene_tall %>%
      group_by(gene) %>%
      slice(which.min(expression_level))

    # A tibble: 3 x 3
    # Groups:   gene [3]
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 treatment                3
    3 GKK7   control                 12

• which.min returns the index of the vector element with the first occurrence of the minimum value.
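As a possible alternative, dplyr 1.0.0 and later provide slice_min, which combines the sort-and-slice idea into one call. The sketch below assumes that dplyr version is installed; gene_tall is rebuilt inline (an assumption) and min_rows is a hypothetical name.

```r
library(dplyr)
library(tibble)

# Assumption: gene_tall is rebuilt inline from the values shown in the slides.
gene_tall <- tribble(
  ~gene,    ~condition,  ~expression_level,
  "ABC123", "control",                   0,
  "DEF234", "control",                  10,
  "GKK7",   "control",                  12,
  "ABC123", "treatment",                 1,
  "DEF234", "treatment",                 3,
  "GKK7",   "treatment",                13
)

# with_ties = FALSE keeps exactly one row per group, matching the
# "first occurrence" behaviour of which.min.
min_rows <- gene_tall %>%
  group_by(gene) %>%
  slice_min(expression_level, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Without with_ties = FALSE, slice_min returns every row tied for the minimum, which may or may not be what you want.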
The opposite of gather

    gene_tall

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 DEF234 control                 10
    3 GKK7   control                 12
    4 ABC123 treatment                1
    5 DEF234 treatment                3
    6 GKK7   treatment               13

• Suppose you started with the data in tall format.
• How would you convert it to the wide format?
spread is the opposite of gather

    spread(gene_tall, condition, expression_level)

    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

• spread constructs wide data frames.
• The second argument defines the column in the tall format to be used to make new column names.
• The third argument defines the column in the tall format to be used as the source of data for those new columns.
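Like gather, spread is marked as superseded in tidyr 1.0.0 and later, with pivot_wider as the replacement. A minimal sketch of the corresponding call, with gene_tall rebuilt inline (an assumption) and gene_wide as a hypothetical name:

```r
library(tidyr)
library(tibble)

# Assumption: gene_tall is rebuilt inline from the values shown in the slides.
gene_tall <- tribble(
  ~gene,    ~condition,  ~expression_level,
  "ABC123", "control",                   0,
  "DEF234", "control",                  10,
  "GKK7",   "control",                  12,
  "ABC123", "treatment",                 1,
  "DEF234", "treatment",                 3,
  "GKK7",   "treatment",                13
)

# names_from plays the role of spread's second argument (the key),
# values_from the role of its third argument (the value).
gene_wide <- pivot_wider(
  gene_tall,
  names_from = condition,
  values_from = expression_level
)
```

For this data the result has the same three rows and the same gene, control, and treatment columns as the spread call.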
How to get information out of the column names
Example of getting information out of column names

    (gene_exp2 <- read_csv(str_c(data_dir, "gene_exp2.csv")))

    # A tibble: 4 x 5
      gene   d1_g1 d1_g2 d2_g1 d2_g2
      <chr>  <dbl> <dbl> <dbl> <dbl>
    [4 rows: ABC123, ABC123, DEF234, DEF234; cell values scrambled in extraction]

• This data frame has useful information encoded in the column names: the number of the day and the group.
• We want to move those data down into the contents of the data frame.
Convert from wide to tall format

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      head()

    # A tibble: 6 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 d1_g1                    1
    2 ABC123 d1_g1                    3
    3 DEF234 d1_g1                    6
    4 DEF234 d1_g1                    2
    5 ABC123 d1_g2                    3
    6 ABC123 d1_g2                    4
Getting data out of the condition column

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      head()

    # A tibble: 6 x 4
      gene   day   group expression_level
      <chr>  <chr> <chr>            <dbl>
    1 ABC123 d1    g1                   1
    2 ABC123 d1    g1                   3
    3 DEF234 d1    g1                   6
    4 DEF234 d1    g1                   2
    5 ABC123 d1    g2                   3
    6 ABC123 d1    g2                   4

• The condition column has the compressed format for the values: d1_g1 means “day 1, group 1”. We need to split the string apart at the "_" character using separate.
Clean up the data: strings

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      mutate(day = str_remove(day, "d"),
             group = str_remove(group, "g")) %>%
      head()

    # A tibble: 6 x 4
      gene   day   group expression_level
      <chr>  <chr> <chr>            <dbl>
    1 ABC123 1     1                    1
    2 ABC123 1     1                    3
    3 DEF234 1     1                    6
    4 DEF234 1     1                    2
    5 ABC123 1     2                    3
    6 ABC123 1     2                    4

• If we want to get rid of the "d" and "g" prefixes, we need to do some string manipulation.
str_remove: remove the first occurrence of a pattern in a string

    str_remove(c("d1", "d2", "ddddd3", "dxy"), "d")

    [1] "1"     "2"     "dddd3" "xy"
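The "ddddd3" case shows that only the first match is removed. As a small sketch of the alternatives in stringr, str_remove_all removes every match, and anchoring the pattern with ^ restricts removal to a leading prefix. The variable names and the "grade1" string are hypothetical, chosen only to show the difference.

```r
library(stringr)

x <- c("d1", "d2", "ddddd3", "dxy")

# Removes only the first "d" in each string (the call from the slide).
first_only <- str_remove(x, "d")       # "1" "2" "dddd3" "xy"

# Removes every "d" in each string.
all_matches <- str_remove_all(x, "d")  # "1" "2" "3" "xy"

# Hypothetical string where "d" is not at the start:
# the unanchored pattern removes it, the anchored one does not.
unanchored <- str_remove("grade1", "d")   # "grae1"
anchored   <- str_remove("grade1", "^d")  # "grade1"
```

For column names like d1 and g2, where the letter occurs exactly once at the start, all three variants give the same result; the anchored pattern is the most explicit about intent.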
Clean up the data: numbers

    gene_exp2 %>%
      gather(condition, expression_level, d1_g1:d2_g2) %>%
      separate(condition, into = c("day", "group"), sep = "_") %>%
      mutate(day = str_remove(day, "d"),
             group = str_remove(group, "g")) %>%
      mutate(day = as.integer(day),
             group = as.integer(group)) %>%
      head()

    # A tibble: 6 x 4
      gene     day group expression_level
      <chr>  <int> <int>            <dbl>
    1 ABC123     1     1                1
    2 ABC123     1     1                3
    3 DEF234     1     1                6
    4 DEF234     1     1                2
    5 ABC123     1     2                3
    6 ABC123     1     2                4

• The values in the day and group columns are characters, not numbers, so coerce them to the desired type.
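The gather, separate, str_remove, and as.integer steps can also be folded into a single pivot_longer call, assuming tidyr 1.1.0 or later (for names_transform). This is a sketch of that alternative, not the approach used in the slides. The data frame gene_exp2_demo is a made-up stand-in with the same column layout as gene_exp2; its values are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for gene_exp2: same column names, made-up values.
gene_exp2_demo <- tribble(
  ~gene,    ~d1_g1, ~d1_g2, ~d2_g1, ~d2_g2,
  "ABC123",      1,      3,     10,      5,
  "DEF234",      6,      2,     30,      4
)

gene_tidy <- pivot_longer(
  gene_exp2_demo,
  cols = d1_g1:d2_g2,
  # Two new columns are filled from each old column name.
  names_to = c("day", "group"),
  # The capture groups pick out the digits after "d" and after "g".
  names_pattern = "d(\\d+)_g(\\d+)",
  # Convert the captured strings to integers.
  names_transform = list(day = as.integer, group = as.integer),
  values_to = "expression_level"
)
```

The result has one row per gene, day, and group, with day and group already stored as integers. The step-by-step pipeline is easier to inspect while learning; the single call is more compact once the pattern is familiar.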
Reading
• Read: 12 Tidy data | R for Data Science (sections 12.1 to 12.4)
• Read: Tidy data
• tidyr