-1- Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table of contents 1 Data manipulation 1 2 Sorting data 2 3 Manipulating factors 7 4 Filtering 10 5 Reshaping data 15 6 Merging data 19 7 VLOOKUP 24 8 Aggregating data (pivot tables) 26 9 Transformations 36 1. Data manipulation 1.1. Important data manipulation libraries Task Function Package Sorting base order() plyr arrange() Re-ordering factor levels base factor(,levels=) gdata reorder(,new.order=) Re-labelling base factor(,lab=) car recode() plyr revalue(,replace=) Re-naming columns base colnames() dplyr rename(,replace=) Filtering/Subsetting indexing base base subset(,subset=,select=) select(,...) dplyr Reshaping data melt() , cast() reshape2 1.2. Important data manipulation libraries Task Function Package Merging/joining base merge() plyr join() Aggregating base tapply()
-2- Task Function Package ddply() plyr reshape2 cast() doBy summaryBy() Transformations transform() , within() base plyr mutate() Adding columns base within() mutate() plyr 1.3. The grammar of data manipulation 1.3.1. Verbs • arrange() - sorting data • select() - subset columns • filter() - subset rows • mutate() - adding colums • summarise() - aggregate 1.4. Data files > source("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.R") > #OR > load(url("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.RData")) Plot Cond Between Temp LAT LONG P1 H A1 15.74 17.26 146.2 P2 H A1 23.84 14.07 144.9 P3 H A1 13.64 20.75 144.7 P4 H A2 37.95 18.41 142.1 P1 M A2 25.3 18.47 144 P2 M A2 13.8 20.39 145.8 P3 M A3 26.87 20.14 147.7 P4 M A3 29.38 19.69 144.8 P1 L A3 27.76 20.34 145.8 P2 L A4 18.95 20.06 144.9 P3 L A4 37.12 18.65 142.2 P4 L A4 25.9 14.52 144.2 2. Sorting data
-3- 2.1. Sorting data Sorting by LAT > order(data.1$LAT) [1] 2 12 1 4 5 11 8 10 7 9 6 3 > data.1[order(data.1$LAT),] Plot Cond Between Temp LAT 2 P2 H A1 23.83643 14.07060 12 P4 L A4 25.89843 14.52130 1 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 11 P3 L A4 37.11781 18.64913 8 P4 M A3 29.38325 19.68780 10 P2 L A4 18.94612 20.06427 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 LONG 2 144.8877 12 144.1700 1 146.2397 4 142.0585 5 144.0437 11 142.2459 8 144.7944 10 144.8924 7 147.7174 9 145.7753 6 145.8359 3 144.6884 2.2. Sorting data Sorting by LAT > #OR > library(plyr) #OR library(dplyr) > arrange(data.1, LAT) Plot Cond Between Temp LAT 1 P2 H A1 23.83643 14.07060 2 P4 L A4 25.89843 14.52130 3 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 6 P3 L A4 37.11781 18.64913 7 P4 M A3 29.38325 19.68780 8 P2 L A4 18.94612 20.06427 9 P3 M A3 26.87429 20.14244 10 P1 L A3 27.75781 20.33795
-4- 11 P2 M A2 13.79532 20.38767 12 P3 H A1 13.64371 20.74986 LONG 1 144.8877 2 144.1700 3 146.2397 4 142.0585 5 144.0437 6 142.2459 7 144.7944 8 144.8924 9 147.7174 10 145.7753 11 145.8359 12 144.6884 2.3. Sorting data Sorting by Cond and then TEMP > library(plyr) #OR library(dplyr) > arrange(data.1, Cond,Temp) Plot Cond Between Temp LAT 1 P3 H A1 13.64371 20.74986 2 P1 H A1 15.73546 17.25752 3 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 L A3 27.75781 20.33795 8 P3 L A4 37.11781 18.64913 9 P2 M A2 13.79532 20.38767 10 P1 M A2 25.29508 18.46762 11 P3 M A3 26.87429 20.14244 12 P4 M A3 29.38325 19.68780 LONG 1 144.6884 2 146.2397 3 144.8877 4 142.0585 5 144.8924 6 144.1700 7 145.7753 8 142.2459 9 145.8359 10 144.0437 11 147.7174 12 144.7944 > #OR > data.1[order(data.1$Cond,data.1$Temp),] Plot Cond Between Temp LAT 3 P3 H A1 13.64371 20.74986 1 P1 H A1 15.73546 17.25752
-5- 2 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 9 P1 L A3 27.75781 20.33795 11 P3 L A4 37.11781 18.64913 6 P2 M A2 13.79532 20.38767 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 8 P4 M A3 29.38325 19.68780 LONG 3 144.6884 1 146.2397 2 144.8877 4 142.0585 10 144.8924 12 144.1700 9 145.7753 11 142.2459 6 145.8359 5 144.0437 7 147.7174 8 144.7944 2.4. Sorting data Sort by the sum of Temp and LAT > library(plyr) > arrange(data.1,Temp+LAT) Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 2 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 4 P2 H A1 23.83643 14.07060 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 M A2 25.29508 18.46762 8 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 10 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 12 P4 H A2 37.95281 18.41013 LONG 1 146.2397 2 145.8359 3 144.6884 4 144.8877 5 144.8924 6 144.1700 7 144.0437 8 147.7174 9 145.7753 10 144.7944 11 142.2459 12 142.0585
-6- > #OR > data.1[order(data.1$Temp+data.1$LAT),] Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 2 P2 H A1 23.83643 14.07060 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 8 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 4 P4 H A2 37.95281 18.41013 LONG 1 146.2397 6 145.8359 3 144.6884 2 144.8877 10 144.8924 12 144.1700 5 144.0437 7 147.7174 9 145.7753 8 144.7944 11 142.2459 4 142.0585 2.5. Your turn • sort by Condition and then the ratio of Temp to LAT 2.6. Your turn • sort by Condition and then the ratio of Temp to LAT > library(plyr) > arrange(data.1,Cond,Temp/LAT) Plot Cond Between Temp LAT LONG 1 P3 H A1 13.64371 20.74986 144.6884 2 P1 H A1 15.73546 17.25752 146.2397 3 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 5 P2 L A4 18.94612 20.06427 144.8924 6 P1 L A3 27.75781 20.33795 145.7753 7 P4 L A4 25.89843 14.52130 144.1700 8 P3 L A4 37.11781 18.64913 142.2459 9 P2 M A2 13.79532 20.38767 145.8359 10 P3 M A3 26.87429 20.14244 147.7174 11 P1 M A2 25.29508 18.46762 144.0437 12 P4 M A3 29.38325 19.68780 144.7944
-7- > #OR > data.1[order(data.1$Cond, data.1$Temp/data.1$LAT),] Plot Cond Between Temp LAT LONG 3 P3 H A1 13.64371 20.74986 144.6884 1 P1 H A1 15.73546 17.25752 146.2397 2 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 10 P2 L A4 18.94612 20.06427 144.8924 9 P1 L A3 27.75781 20.33795 145.7753 12 P4 L A4 25.89843 14.52130 144.1700 11 P3 L A4 37.11781 18.64913 142.2459 6 P2 M A2 13.79532 20.38767 145.8359 7 P3 M A3 26.87429 20.14244 147.7174 5 P1 M A2 25.29508 18.46762 144.0437 8 P4 M A3 29.38325 19.68780 144.7944 3. Manipulating factors 3.1. Re-levelling (sorting) factors (data.2) Plot Cond Between Temp P1 H A1 15.74 P2 H A1 23.84 P3 H A1 13.64 P4 H A2 37.95 P1 M A2 25.3 P2 M A2 13.8 P3 M A3 26.87 P4 M A3 29.38 P1 L A3 27.76 P2 L A4 18.95 P3 L A4 37.12 P4 L A4 25.9 > levels(data.2$Cond) [1] "H" "L" "M" 3.2. Re-levelling (sorting) factors > data.2$Cond <- factor(data.2$Cond, levels=c("L","M","H")) > #OR > library(gdata) > data.2$Cond <- reorder(data.2$Cond, new.order=c("L","M","H")) > > levels(data.2$Cond)
-8- [1] "L" "M" "H" 3.3. Re-labelling factors > #create a copy of data2 and call it data.3 > data.3 <- data.2 > #reorder AND rename the factor levels > data.3$Cond <- factor(data.3$Cond, levels=c("L","M","H"), + lab=c("Low","Medium","High")) > data.3 Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843 3.4. Re-labelling factors > #restore data.3 from the original data.2 again > data.3 <- data.2 > library(plyr) > data.3$Cond <- revalue(data.3$Cond,c("L"="Low", + "M"="Medium","H"="High")) > data.3 Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843 3.5. Re-naming columns (vectors) 3.5.1. By position > data.3 <- data.2 > colnames(data.3) <- c("Plot","Condition","Treatment", + "Temperature") > data.3 Plot Condition Treatment Temperature
Recommend
More recommend