Joins, and dates/times Steve Bagley somgen223.stanford.edu 1
Joining data frames
• It is common to have related data in two or more data frames.
• It may be more convenient to have all the data in a single data frame for analysis and plotting.
• Merging data this way is called “joining.”
Getting some data to join

    (gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))
    # A tibble: 3 x 3
      gene   control treatment
      <chr>    <dbl>     <dbl>
    1 ABC123       0         1
    2 DEF234      10         3
    3 GKK7        12        13

    (gene_length <- read_csv(str_c(data_dir, "gene_length.csv")))
    # A tibble: 3 x 2
      gene   length
      <chr>   <dbl>
    1 XYZ3       13
    2 ABC123    100
    3 GKK7     2001
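For readers who do not have the CSV files or `data_dir` available, the two data frames can be approximated inline. This is a minimal sketch that assumes the files contain exactly the values printed on the slide; `tribble()` is used here only for convenience and is not part of the original slides.

```r
library(tibble)

# Assumed inline copies of gene_exp1.csv and gene_length.csv,
# using the values printed on the slide (the real files may differ).
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)

gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",     100,
  "GKK7",      2001
)

# Keys shared by both data frames: only these can survive an inner join.
intersect(gene_exp1$gene, gene_length$gene)
```

Note that DEF234 appears only in gene_exp1 and XYZ3 appears only in gene_length, which is what makes the different join types give different results.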
inner_join

    inner_join(gene_exp1, gene_length, by = "gene")
    # A tibble: 2 x 4
      gene   control treatment length
      <chr>    <dbl>     <dbl>  <dbl>
    1 ABC123       0         1    100
    2 GKK7        12        13   2001

• by specifies the “key”: which columns to use to control the join.
• The rows in both data frames are aligned using the by column.
• A row is included in the inner join only if its key appears in both data frames. Note: this might throw away a lot of rows.
• The join result includes every column that appears in either data frame.
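Because an inner join silently drops unmatched rows, it can help to check how many rows are lost and which keys are responsible. The following is a rough sketch, assuming the data frames hold the values printed on the slides (they are rebuilt inline here so the example runs on its own).

```r
library(dplyr)
library(tibble)

# Assumed inline copies of the slide data.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",     100,
  "GKK7",      2001
)

joined <- inner_join(gene_exp1, gene_length, by = "gene")

# Number of gene_exp1 rows that did not survive the join.
rows_lost <- nrow(gene_exp1) - nrow(joined)

# Keys present in one data frame but not the other.
only_in_exp1   <- setdiff(gene_exp1$gene, gene_length$gene)
only_in_length <- setdiff(gene_length$gene, gene_exp1$gene)

rows_lost
only_in_exp1
only_in_length
```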
Exercise: explain this result

    gene_tall <- gene_exp1 %>%
      gather(condition, expression_level, control:treatment)
    inner_join(gene_tall, gene_length, by = "gene")
    # A tibble: 4 x 4
      gene   condition expression_level length
      <chr>  <chr>                <dbl>  <dbl>
    1 ABC123 control                  0    100
    2 GKK7   control                 12   2001
    3 ABC123 treatment                1    100
    4 GKK7   treatment               13   2001
Answer: explain this result
• Each gene appears twice in gene_tall.
• The join operation aligns each copy with the matching row in gene_length, duplicating the information in gene_length.
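The duplication can be checked directly by counting rows per gene before and after the join. This is a small sketch under the assumption that the data match the values printed on the slides; the `count()` calls are an illustration added here, not part of the original exercise.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumed inline copies of the slide data.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",     100,
  "GKK7",      2001
)

gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

# Each gene occurs once per condition, so twice in gene_tall.
count(gene_tall, gene)

joined <- inner_join(gene_tall, gene_length, by = "gene")

# Each surviving gene keeps its two rows, and its single length value
# is repeated on both of them.
count(joined, gene, length)
```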
full_join example

    full_join(gene_tall, gene_length, by = "gene")
    # A tibble: 7 x 4
      gene   condition expression_level length
      <chr>  <chr>                <dbl>  <dbl>
    1 ABC123 control                  0    100
    2 DEF234 control                 10     NA
    3 GKK7   control                 12   2001
    4 ABC123 treatment                1    100
    5 DEF234 treatment                3     NA
    6 GKK7   treatment               13   2001
    7 XYZ3   <NA>                    NA     13
full_join explained
• A key appears in the result if it appears in either data frame.
• All the data from both data frames are included.
• If data are missing from one data frame, then NA values are inserted.
• Make sure you understand why the result on the previous slide has 7 rows.
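One way to reason about the row count is to count rows per key in the full join result. The sketch below assumes the data match the values printed on the slides: three genes contribute two rows each from gene_tall, and one key that exists only in gene_length contributes a single row.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumed inline copies of the slide data.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",     100,
  "GKK7",      2001
)
gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

# Every key that appears in either data frame.
all_keys <- union(gene_tall$gene, gene_length$gene)

full <- full_join(gene_tall, gene_length, by = "gene")

# Rows contributed by each key; the counts sum to nrow(full).
rows_per_key <- count(full, gene)
rows_per_key

# Missing values mark where one side had no matching row.
sum(is.na(full$length))     # rows whose gene has no length
sum(is.na(full$condition))  # rows whose gene has no expression data
```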
semi_join

    semi_join(gene_tall, gene_length, by = "gene")
    # A tibble: 4 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 ABC123 control                  0
    2 GKK7   control                 12
    3 ABC123 treatment                1
    4 GKK7   treatment               13

• The result includes all rows of gene_tall that have a key in gene_length.
anti_join

    anti_join(gene_tall, gene_length, by = "gene")
    # A tibble: 2 x 3
      gene   condition expression_level
      <chr>  <chr>                <dbl>
    1 DEF234 control                 10
    2 DEF234 treatment                3

• The result includes all rows of gene_tall that do not have a key in gene_length.
Filtering joins
• semi_join and anti_join are filtering joins: they filter rows (of the first argument).
• They do not include any new columns.
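The two properties of filtering joins can be checked on the example data. This is a minimal sketch, assuming the data match the values printed on the slides: the column names are unchanged, and semi_join and anti_join together split gene_tall into two non-overlapping pieces.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumed inline copies of the slide data.
gene_exp1 <- tribble(
  ~gene,    ~control, ~treatment,
  "ABC123",        0,          1,
  "DEF234",       10,          3,
  "GKK7",         12,         13
)
gene_length <- tribble(
  ~gene,    ~length,
  "XYZ3",        13,
  "ABC123",     100,
  "GKK7",      2001
)
gene_tall <- gene_exp1 %>%
  gather(condition, expression_level, control:treatment)

kept    <- semi_join(gene_tall, gene_length, by = "gene")
dropped <- anti_join(gene_tall, gene_length, by = "gene")

# Filtering joins keep the columns of the first argument only.
identical(names(kept), names(gene_tall))

# Every row of gene_tall lands in exactly one of the two results.
nrow(kept) + nrow(dropped) == nrow(gene_tall)

# In contrast, a mutating join such as inner_join adds the length column.
"length" %in% names(inner_join(gene_tall, gene_length, by = "gene"))
```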
Dates and times
• Dates and times are complicated: leap years, month/day/year vs. day/month/year vs. other orderings, time zones, daylight saving time, leap seconds, 12-hour vs. 24-hour format, and more.
Examples

    parse_date("2015-11-10")
    [1] "2015-11-10"
    parse_date("10/11/2015", format = "%m/%d/%Y")
    [1] "2015-10-11"

• Dates and times come in many different formats.
• parse_date takes a format argument that uses a special pattern code for identifying what is expected, and in what order.
• See ?parse_date for details.
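The order of the pattern codes matters: the same string can be read as two different dates. The sketch below is an added illustration, not from the original slides; the variable names are hypothetical.

```r
library(readr)

# Same text, two interpretations, depending on the format string.
month_first <- parse_date("10/11/2015", format = "%m/%d/%Y")  # October 11
day_first   <- parse_date("10/11/2015", format = "%d/%m/%Y")  # November 10

month_first
day_first

# A string that does not fit the format is not guessed at: readr reports
# a parsing failure with a warning and returns NA.
no_match <- suppressWarnings(parse_date("31/12/2015", format = "%m/%d/%Y"))
is.na(no_match)
```

Stating the format explicitly is usually safer than relying on a default when the day and month could be confused.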
Examples

    parse_datetime("10/11/2015", format = "%m/%d/%Y")
    [1] "2015-10-11 UTC"
    parse_datetime("2015-11-10")
    [1] "2015-11-10 UTC"
    parse_datetime("10/11/2015 13:45:09", format = "%m/%d/%Y %H:%M:%S")
    [1] "2015-10-11 13:45:09 UTC"
    parse_datetime("10/11/2015 13:45:09 America/Los_Angeles", format = "%m/%d/%Y %H:%M:%S %Z")
    [1] "2015-10-11 20:45:09 UTC"

• parse_datetime works on dates with times (and time zones).
• See ?parse_datetime for details.
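When the time zone is not part of the string itself, it can be supplied through readr's locale instead. The following is a hedged sketch, not from the original slides: it assumes every value in the input is local time in America/Los_Angeles, and the variable names are hypothetical.

```r
library(readr)

# Time zone given inside the string (as on the slide).
zone_in_text <- parse_datetime(
  "10/11/2015 13:45:09 America/Los_Angeles",
  format = "%m/%d/%Y %H:%M:%S %Z"
)

# Time zone supplied separately via locale(); the string has no zone.
zone_in_locale <- parse_datetime(
  "10/11/2015 13:45:09",
  format = "%m/%d/%Y %H:%M:%S",
  locale = locale(tz = "America/Los_Angeles")
)

# Both describe the same instant, even if they print differently.
zone_in_text == zone_in_locale

# Show the locale-based result in UTC for comparison.
format(zone_in_locale, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```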
Reading
• Read: 13 Relational data | R for Data Science
• Read: 16 Dates and times | R for Data Science