1 Tidying and exploring data Workshop 5
2 Objectives
By doing this workshop and carrying out the independent study the successful student will be able to:
- Explain what is meant by ‘tidy data’
- Devise reproducible strategies to tidy imported data
A short talk will outline some of the possibilities, followed by opportunities for you to apply and combine ideas.
Remember to apply what you know about reproducibility.
3 Outline
Owes much to Hadley Wickham:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:http://dx.doi.org/10.18637/jss.v059.i10
“tidy datasets are all alike but every messy dataset is messy in its own way”
Difficult to be comprehensive!
4 Cleaning, tidying, exploring = 80-85%
Iterative
Cleaning - Content: NAs, factor levels, variable names
Tidying - Organisation: key concept - variables in columns, cases in rows
Clean and tidy datasets are easy to work with
Exploring will help you see if it’s clean and tidy
5 What is tidy?
Each variable is in a named column
Each row is an observation
Easy to explore, plot, model, report
Easy way to think about data
Several powerful packages exist
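As a minimal sketch of why this layout is easy to work with, the made-up data frame below has one variable per named column and one observation per row. The names `growth`, `treatment` and `height` and all the values are illustrative assumptions, not workshop data.

```r
# Hypothetical tidy data: one variable per column, one observation per row
growth <- data.frame(
  treatment = c("control", "control", "fertilised", "fertilised"),
  height    = c(10.2, 11.0, 14.5, 15.1)
)

# The grouping variable and the response each sit in a single column,
# so a per-group mean needs only one call
group_means <- tapply(growth$height, growth$treatment, mean)
group_means
```

The same two columns could be passed straight to a plotting or modelling function without further rearrangement.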
6 Tidy buoy #44025 data
One example, indicative of the types of manipulation possible - the sophisticated way

    # read the first line in; the result is a single string
    vars <- readLines(file, n = 1)
    vars
    "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"

    # we can split the line into separate strings using strsplit.
    # Here we want to split on any number of white spaces.
    # We also use unlist to store the result as a character vector instead of a list
    coln <- unlist(strsplit(vars, split = "\\s+", fixed = FALSE))
    coln
     [1] "#YY" "MM" "DD" "hh" "mm" "WDIR" "WSPD" "GST" "WVHT" "DPD" "APD" "MWD" "PRES" "ATMP"
    [15] "WTMP" "DEWP" "VIS" "TIDE"

    names(mydata) <- coln
    str(mydata)
    'data.frame': 6358 obs. of 18 variables:
     $ #YY : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
     $ MM  : int 12 1 1 1 1 1 1 1 1 1 ...
     $ DD  : int 31 1 1 1 1 1 1 1 1 1 ...
     $ hh  : int 23 0 1 2 3 4 5 6 7 8 ...

Key point: almost all things are scriptable ... Google!
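The header-splitting approach can be tried end to end on a small stand-in file. This is a sketch under assumptions: the file contents below are made up, only seven of the columns are included, and the `sub()` step that strips the leading `#` is an optional extra that is not on the slide.

```r
# Write a tiny stand-in for the buoy file (values are made up for illustration)
file <- tempfile(fileext = ".txt")
writeLines(c(
  "#YY  MM DD hh mm WDIR WSPD",
  "2010 12 31 23 50 222  7.2",
  "2011 01 01 00 50 233  6.0"
), file)

# Read only the header line and split it on runs of white space
vars <- readLines(file, n = 1)
coln <- unlist(strsplit(vars, split = "\\s+"))

# Optional extra step (an assumption, not on the slide):
# drop the leading "#" so the first name is easier to type
coln <- sub("^#", "", coln)

# Read the data rows, skipping the header line, then apply the names
mydata <- read.table(file, skip = 1, header = FALSE)
names(mydata) <- coln
str(mydata)
```

Without the `sub()` step the first column is called `#YY`, which has to be written with backticks (``mydata$`#YY` ``) whenever it is used.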
7 Tidy buoy #44025 data
Key point: almost all things are scriptable ... even if you have to fudge it a bit ... Be creative

    # less sophisticated alternative
    names(mydata) <- c("YY", "MM", "DD", "hh", "mm", "WDIR", "WSPD", "GST",
                       "WVHT", "DPD", "APD", "MWD", "PRES", "ATMP",
                       "WTMP", "DEWP", "VIS", "TIDE")
8 Useful tidying packages
Untidy: one variable in several columns; multiple obs in a row
Tidy:

    library(tidyr)
    biomass2 <- gather(data = biomass, fertiliser, mass)

    library(reshape2)
    biomass2 <- melt(biomass, measure.vars = 1:6)

Key point: TMTOWTDI (there's more than one way to do it)
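In the same spirit of having several routes to one result, base R's `stack()` does the same wide-to-long rearrangement without loading a package. This is a swapped-in alternative, not the method on the slide, and the three-column `biomass` below is a made-up stand-in for the workshop data.

```r
# Hypothetical wide data: one column of masses per fertiliser treatment
biomass <- data.frame(
  A = c(159.1, 146.0, 116.3),
  B = c(150.1, 154.4, 69.5),
  C = c(80.0, 266.4, 161.2)
)

# stack() puts every value in one column ("values") and records the
# column it came from in a second, factor column ("ind")
biomass2 <- stack(biomass)

# Rename to match the names used with gather() on the slide
names(biomass2) <- c("mass", "fertiliser")
str(biomass2)
```

In current versions of tidyr, `gather()` is superseded by `pivot_longer()`, so new code may prefer that function.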
9 Untidy: data in rows and columns; one obs per cell
Tidy:

    library(reshape2)
    fungi2 <- melt(fungi, id.vars = "Temperature",
                   measure.vars = c("A", "B", "C", "D"))
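A base R route to the same shape is `reshape()` with `direction = "long"`. The sketch below is an alternative to `melt()`, not the slide's method. It assumes a made-up `fungi` table in which each Temperature appears once, and it borrows `melt()`'s default output names, `variable` and `value`.

```r
# Hypothetical stand-in for the fungi data: one row per temperature,
# one column per group A-D (values are made up)
fungi <- data.frame(
  Temperature = c(10, 20, 30),
  A = c(1.1, 2.4, 3.0),
  B = c(0.9, 2.0, 2.8),
  C = c(1.3, 2.6, 3.3),
  D = c(1.0, 2.2, 2.9)
)
groups <- c("A", "B", "C", "D")

fungi2 <- reshape(
  fungi,
  direction = "long",
  varying   = list(groups),   # the columns that all hold the same variable
  v.names   = "value",        # name for the stacked measurements
  timevar   = "variable",     # name for the column recording the source column
  times     = groups,         # values written into that column
  idvar     = "Temperature"   # must identify each original row uniquely
)
rownames(fungi2) <- NULL      # discard the generated row names
str(fungi2)
```

`reshape()` needs more arguments than `melt()`, which is one reason the dedicated packages are popular.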
10 Untidy: data are not factors
Example from the Importing data slides

    > library(haven)  # provides read_sav() and as_factor()
    > mydata <- read_sav("../data/prac9a.sav")
    > str(mydata)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
     $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
      ..- attr(*, "label")= chr "Territory size (Ha)"
      ..- attr(*, "format.spss")= chr "F8.3"
     $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
      .. ..- attr(*, "label")= chr "Country"
      .. ..- attr(*, "format.spss")= chr "F8.0"
      .. ..- attr(*, "labels")= Named num [1:3] 1 2 3
      .. .. ..- attr(*, "names")= chr [1:3] "U.K" "France" "Germany"
     $ woodtype:Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
      .. ..- attr(*, "label")= chr "Wood Type"
      .. ..- attr(*, "format.spss")= chr "F8.0"
      .. ..- attr(*, "labels")= Named num [1:2] 1 2
      .. .. ..- attr(*, "names")= chr [1:2] "Deciduous" "Mixed"
    .... .... ....

Tidier

    > mydata$country <- as_factor(mydata$country)
    > mydata$woodtype <- as_factor(mydata$woodtype)
    > str(mydata)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
     $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
      ..- attr(*, "label")= chr "Territory size (Ha)"
      ..- attr(*, "format.spss")= chr "F8.3"
     $ country : Factor w/ 3 levels "U.K","France",..: 1 1 1 1 1 1 1 1 1 1 ...
      ..- attr(*, "label")= chr "Country"
     $ woodtype: Factor w/ 2 levels "Deciduous","Mixed": 1 1 1 1 1 1 1 1 1 1 ...
      ..- attr(*, "label")= chr "Wood Type"
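When coded values arrive as plain numbers with no label attributes attached (for example from a text file), the same result can be reached with base R's `factor()`. This is a sketch with a made-up vector `country_code`; only the three labels are taken from the output above.

```r
# Hypothetical stand-in for an imported column of numeric codes
country_code <- c(1, 1, 2, 3, 2)

# Attach the labels explicitly; levels and labels are matched by position,
# so code 1 becomes "U.K", 2 becomes "France" and 3 becomes "Germany"
country <- factor(country_code,
                  levels = c(1, 2, 3),
                  labels = c("U.K", "France", "Germany"))
table(country)
```

Writing the code-to-label mapping in the script keeps the recoding reproducible.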
11 Adding and deleting rows and columns
To delete use selection

    str(biomass)
    'data.frame': 10 obs. of 6 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ A           : num 159 146 116 135 137 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...

    # delete the second column
    biomass3 <- biomass[, c(1, 3:6)]
    # or
    biomass3 <- biomass[, -2]
    str(biomass3)
    'data.frame': 10 obs. of 5 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...

    # delete the 2nd row and 5th row
    biomass4 <- biomass[c(1, 3, 4, 6:10), ]
    # or
    biomass4 <- biomass[c(-2, -5), ]
    str(biomass4)
    'data.frame': 8 obs. of 6 variables:
     $ WaterControl: num 350 359 255 326 295 ...
     $ A           : num 159.1 116.3 135.2 81.8 115.7 ...
     $ B           : num 150.1 69.5 150.7 144 149.8 ...
     $ C           : num 80 161 161 184 176 ...
     $ D           : num 267 221 160 270 224 ...
     $ E           : num 350 359 255 326 295 ...

    # adding a column
    biomass$addedcol <- 1
    # or
    biomass$addedcol2 <- biomass$WaterControl - biomass$A
    # or commonly on a conditional statement
    str(biomass)
    'data.frame': 10 obs. of 8 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ A           : num 159 146 116 135 137 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...
     $ addedcol    : num 1 1 1 1 1 1 1 1 1 1
     $ addedcol2   : num 190.7 178.2 242.2 120.1 71.9 ...

TMTOWTDI
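The slide mentions working "on a conditional statement" without showing code. The sketch below is one possible reading, applied to both rows and columns. The cut-off of 130, the names `biomass5` and `highA`, and the five-row `biomass` stand-in are all assumptions for illustration.

```r
# Hypothetical five-row stand-in for the biomass data
biomass <- data.frame(
  WaterControl = c(350, 324, 359, 255, 208),
  A            = c(159, 146, 116, 135, 137)
)

# Rows: keep only those that satisfy a condition (here A of at least 130),
# which deletes every row where the condition is FALSE
biomass5 <- biomass[biomass$A >= 130, ]

# Columns: add a column whose value depends on a condition
biomass$highA <- ifelse(biomass$A >= 130, "high", "low")
str(biomass)
```

If the column used in the condition contains `NA`, wrapping the condition in `which()` avoids rows of `NA` appearing in the result.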
12 Additional useful functions
droplevels {base}: used to drop unused levels from a factor
is.na {base}: indicates which elements are missing
complete.cases {stats}: returns a logical vector indicating which cases are complete, i.e., have no missing values
Reordering factor levels:

    seek1$hiqual <- factor(seek1$hiqual, levels = levels(seek1$hiqual)[c(5, 4, 1, 6, 2, 3)])
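A small sketch shows how these functions might be combined. The `survey` data frame and its values are made up for illustration.

```r
# Hypothetical data with one missing count
survey <- data.frame(
  site  = factor(c("north", "south", "east", "north")),
  count = c(12, NA, 7, 9)
)

# is.na() flags the missing elements
missing_count <- is.na(survey$count)

# complete.cases() keeps only rows with no missing values
complete <- survey[complete.cases(survey), ]

# "south" no longer occurs in the data but is still a level of the factor
levels_before <- levels(complete$site)

# droplevels() removes the unused level
complete$site <- droplevels(complete$site)
levels_after <- levels(complete$site)

# Reorder the remaining levels by indexing the current ones
complete$site <- factor(complete$site,
                        levels = levels(complete$site)[c(2, 1)])
levels(complete$site)
```

Dropping unused levels before plotting or tabulating avoids empty categories in the output.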