Data analysis pipelines: Reading and tidying tables
R.W. Oldford
readr - importing “rectangular” data

Often data are stored in a “rectangular” format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a number of functions which facilitate reading such data into R and turning them into data frames (tibbles):

◮ read_csv() for comma separated files,
◮ read_csv2() for semicolon separated files,
◮ read_tsv() for tab separated files,
◮ read_delim() for files with an arbitrary delimiter separating entries,
◮ read_fwf() for fixed width files,
◮ read_table() for fixed width files with white space separating columns, and
◮ read_log() for Apache style log files (see also the webreadr package for reading log files).

Why use these?
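Before answering, a minimal sketch of several of these readers in action. readr treats a string containing a newline as literal inline data (newer readr versions prefer wrapping such strings in I()), which makes for quick demonstrations:

library(readr)

read_csv("a,b\n1,2")                  # comma separated
read_csv2("a;b\n1,5;2")               # semicolon separated; comma is the decimal mark
read_tsv("a\tb\n1\t2")                # tab separated
read_delim("a|b\n1|2", delim = "|")   # arbitrary delimiter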
readr - importing “rectangular” data

Why read_csv() instead of, say, read.csv()? read_csv():

◮ has a lot in common with read.csv()
◮ is flexible about metadata, comments, NA specification, . . .
◮ is clever
  ◮ about numbers (e.g. 3.14159 versus 3,14159)
  ◮ tries to infer data types (e.g. from the first 1,000 lines)
  ◮ identifies problems encountered
◮ is much, much faster
◮ always returns a tibble
◮ usually yields more portable code
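The cleverness about numbers comes from readr’s locale machinery: the parsers can be told, for instance, that the decimal mark is a comma. A small sketch:

library(readr)

# 3,14159 is a perfectly good double in many European locales
parse_double("3,14159", locale = locale(decimal_mark = ","))

## [1] 3.14159

# the type inference can also be queried directly
guess_parser("3.14159")

## [1] "double"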
readr - a little cleverer

Suppose we have the following files, each located in some directory dataDirectory:

test1.csv
2,4,6
3.1, 14, 59

test2.csv
a,b
1,2,3
4,5,6

test3.csv
x,y,
10, 9, 8
3.1,14,59

Note: each of the above files ends with a newline.
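If you want to follow along, the files are easy to create from R. A sketch, using base R’s file.path() in place of the path_concat() helper used in these notes; the location "./data" is an assumption (it matches the file paths shown in later output), so adjust as needed:

dataDirectory <- "./data"  # assumed location; adjust to taste
dir.create(dataDirectory, showWarnings = FALSE)

# writeLines() terminates every line, so each file ends with a newline
writeLines(c("2,4,6", "3.1, 14, 59"),          file.path(dataDirectory, "test1.csv"))
writeLines(c("a,b", "1,2,3", "4,5,6"),         file.path(dataDirectory, "test2.csv"))
writeLines(c("x,y,", "10, 9, 8", "3.1,14,59"), file.path(dataDirectory, "test3.csv"))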
test1.csv - read.csv versus read_csv

test1.csv
2,4,6
3.1, 14, 59

# a data.frame
(x <- read.csv(path_concat(dataDirectory, "test1.csv")))

##    X2 X4 X6
## 1 3.1 14 59

(x <- read.csv(path_concat(dataDirectory, "test1.csv"), header = FALSE))

##    V1 V2 V3
## 1 2.0  4  6
## 2 3.1 14 59
test1.csv - read.csv versus read_csv

test1.csv
2,4,6
3.1, 14, 59

# a tibble
(y <- read_csv(path_concat(dataDirectory, "test1.csv")))

## # A tibble: 1 x 3
##     `2`   `4`   `6`
##   <dbl> <dbl> <dbl>
## 1   3.1    14    59

(y <- read_csv(path_concat(dataDirectory, "test1.csv"),
               col_names = FALSE))  # can also supply names

## # A tibble: 2 x 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1   2       4     6
## 2   3.1    14    59
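As the comment notes, column names can also be supplied explicitly through col_names. A sketch (the names x, y, z are arbitrary choices here):

(y <- read_csv(path_concat(dataDirectory, "test1.csv"),
               col_names = c("x", "y", "z")))

## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1   2       4     6
## 2   3.1    14    59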
test2.csv - first using read.csv

test2.csv
a,b
1,2,3
4,5,6

# a data.frame
(x <- read.csv(path_concat(dataDirectory, "test2.csv")))

##   a b
## 1 2 3
## 4 5 6

row.names(x)

## [1] "1" "4"

Because there is one more data column than header entries, read.csv() has silently treated the first column as row names.

(x <- read.csv(path_concat(dataDirectory, "test2.csv"), header = FALSE))

##   V1 V2 V3
## 1  a  b NA
## 2  1  2  3
## 3  4  5  6

class(x$V1); class(x$V3)

## [1] "factor"
## [1] "integer"
test2.csv - now using read_csv

test2.csv
a,b
1,2,3
4,5,6

# a tibble
(y <- read_csv(path_concat(dataDirectory, "test2.csv")))

## # A tibble: 2 x 2
##       a     b
##   <dbl> <dbl>
## 1     1     2
## 2     4     5

problems(y)

## # A tibble: 2 x 5
##     row col   expected  actual    file
##   <int> <chr> <chr>     <chr>     <chr>
## 1     1 <NA>  2 columns 3 columns './data/test2.csv'
## 2     2 <NA>  2 columns 3 columns './data/test2.csv'
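If a damaged file should stop the pipeline rather than limp along, readr provides stop_for_problems(). A minimal sketch:

# abort instead of continuing with a damaged tibble
y <- read_csv(path_concat(dataDirectory, "test2.csv"))
stop_for_problems(y)  # throws an error here, since problems(y) is non-empty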
test2.csv - again with read_csv

test2.csv
a,b
1,2,3
4,5,6

(y <- read_csv(path_concat(dataDirectory, "test2.csv"), col_names = FALSE))

## # A tibble: 3 x 2
##   X1    X2
##   <chr> <chr>
## 1 a     b
## 2 1     2
## 3 4     5

problems(y)

## # A tibble: 2 x 5
##     row col   expected  actual    file
##   <int> <chr> <chr>     <chr>     <chr>
## 1     2 <NA>  2 columns 3 columns './data/test2.csv'
## 2     3 <NA>  2 columns 3 columns './data/test2.csv'
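Since we know the short header line is the problem, one way to recover all three data columns is to skip it and name the columns ourselves. A sketch (the names a, b, c are arbitrary; test2.csv supplies no name for the third column):

(y <- read_csv(path_concat(dataDirectory, "test2.csv"),
               col_names = c("a", "b", "c"), skip = 1))

## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6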
test3.csv - read.csv versus read_csv

test3.csv
x,y,
10, 9, 8
3.1,14,59

(x <- read.csv(path_concat(dataDirectory, "test3.csv")))

##      x  y  X
## 1 10.0  9  8
## 2  3.1 14 59

class(x$x)

## [1] "numeric"

(x <- read.csv(path_concat(dataDirectory, "test3.csv"), header = FALSE))

##    V1 V2 V3
## 1   x  y NA
## 2  10  9  8
## 3 3.1 14 59

class(x$V3)

## [1] "integer"
test3.csv - read.csv versus read_csv

test3.csv
x,y,
10, 9, 8
3.1,14,59

(y <- read_csv(path_concat(dataDirectory, "test3.csv")))

## # A tibble: 2 x 3
##       x     y    X3
##   <dbl> <dbl> <dbl>
## 1  10       9     8
## 2   3.1    14    59

spec(y); is.null(problems(y))

## cols(
##   x = col_double(),
##   y = col_double(),
##   X3 = col_double()
## )
## [1] FALSE
test3.csv - read.csv versus read_csv

test3.csv
x,y,
10, 9, 8
3.1,14,59

(y <- read_csv(path_concat(dataDirectory, "test3.csv"), col_names = FALSE))

## # A tibble: 3 x 3
##   X1    X2       X3
##   <chr> <chr> <dbl>
## 1 x     y        NA
## 2 10    9         8
## 3 3.1   14       59

problems(y)

## [1] row      col      expected actual
## <0 rows> (or 0-length row.names)
readr - parsing files

All of this depends on how the readr package parses a vector. From vignette("readr"):

"The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column is the appropriate part. Parsing takes place in three basic stages:

1. The flat file is parsed into a rectangular matrix of strings.
2. The type of each column is determined.
3. Each column of strings is parsed into a vector of a more specific type.

It's easiest to learn how this works in the opposite order. Below, you'll learn how the:

1. Vector parsers turn a character vector into a more specific type.
2. Column specification describes the type of each column and the strategy readr uses to guess types so you don't need to supply them all.
3. Rectangular parsers turn a flat file into a matrix of rows and columns.

Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble."

You can override the defaults, as sketched below.
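Each parse_*() turns a character vector into a specific type, and the matching col_*() functions can be supplied via col_types to override what read_csv() would otherwise guess. A sketch, revisiting test3.csv (the integer choices here are just to show the override in action; X3 is the name readr gave the unnamed third column):

library(readr)

parse_integer(c("1", "2", "3"))

## [1] 1 2 3

parse_number("$1,234.56")  # drops the currency prefix and grouping mark

## [1] 1234.56

read_csv(path_concat(dataDirectory, "test3.csv"),
         col_types = cols(x  = col_double(),
                          y  = col_integer(),
                          X3 = col_integer()))

## # A tibble: 2 x 3
##       x     y    X3
##   <dbl> <int> <int>
## 1  10       9     8
## 2   3.1    14    59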
Tidy data - tidyr

So everything is a “table” of some sort, a rectangular tibble. tidyr allows messy data to be reshaped into “tidier” data that are easier to analyse.

For example, the tidyr package has a small table, table1, containing the number of cases of tuberculosis (TB) in Afghanistan, Brazil, and China in the years 1999 and 2000. It also records the population of those countries in each year. The data are

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Think of how we might have actually collected these data. Imagine visiting each country (its records at least) for each year and recording the number of cases of TB and its population.
Tidy data - tidyr

Here are the same data stored in a different tabular form.

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

How do you imagine these data were collected? Does each

◮ row feel like an observation measured on several variates?
◮ column feel like a variate collected on several individuals?
◮ cell feel like a measurement on a single variate for a single case?

Rows, columns, and cells ideally all “feel right” in that one can imagine collecting the data in this way. table2 does not feel as natural as table1 in the context of this problem.
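The two tables carry exactly the same information; one tidyr reshaping verb takes table1 to table2. A sketch using pivot_longer() (older versions of tidyr spelled this gather()); this assumes the tidyverse is attached as in the surrounding examples:

library(tidyr)

# stack the cases and population columns into type/count pairs
table1 %>%
  pivot_longer(c(cases, population),
               names_to = "type", values_to = "count")

The result is table2: each row of table1 becomes two rows, one for cases and one for population.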
Tidy data - tidyr

Natural questions for these data:

◮ How does the rate of TB per 100,000 people compare between countries?
◮ How does the rate compare from one year to the next for each country?

To compute the rate from table1:

table1 %>%
  mutate(rate = (cases / population) * 100000)

## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071  3.73
## 2 Afghanistan  2000   2666   20595360 12.9
## 3 Brazil       1999  37737  172006362 21.9
## 4 Brazil       2000  80488  174504898 46.1
## 5 China        1999 212258 1272915272 16.7
## 6 China        2000 213766 1280428583 16.7
Tidy data - tidyr

To compare the rates using table1:

table1 %>%
  mutate(rate = (cases / population) * 100000) %>%
  ggplot(aes(x = year, y = rate)) +
  ggtitle("Tuberculosis rates per 100,000") +
  geom_line(aes(group = country, colour = country), lwd = 1.5) +
  geom_point(aes(colour = country), size = 5) +
  scale_x_continuous(breaks = c(1999, 2000))

[Figure: "Tuberculosis rates per 100,000" - rate versus year (1999, 2000), one line per country (Afghanistan, Brazil, China)]

How about beginning with table2?
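Beginning with table2, the counts must first be spread back into separate cases and population columns before the rate can be computed. A sketch with pivot_wider() (spread() in older tidyr):

library(tidyr)

table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = (cases / population) * 100000)

This recovers the table1 layout and then proceeds exactly as before; the extra reshaping step is the price of starting from the less natural form.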