Data Manipulation in R
Introduction to dplyr
May 15, 2017
Introduction to dplyr
dplyr is Hadley Wickham's package for data manipulation.
dplyr provides abstractions, called verbs, for basic data manipulation operations.
Verbs can be combined to achieve complicated data manipulation results using a series of simple steps.
The approach is familiar to those who use UNIX/Linux and the "DOTADIW" philosophy: Do One Thing and Do It Well.
dplyr's Verbs
The verbs are: filter, arrange, select, distinct, mutate, summarise.
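As a minimal sketch of how verbs combine into a series of simple steps, the example below chains filter and arrange. The data frame flights.toy and its values are invented for illustration and are not part of the Houston data set.

```r
library(dplyr)

# Toy data, invented for illustration (not the Houston data set)
flights.toy <- data.frame(
  Origin   = c("HOU", "IAH", "HOU", "IAH"),
  DepDelay = c(12, -3, 45, 7),
  stringsAsFactors = FALSE
)

# Step 1: keep only the flights that departed late
delayed <- filter(flights.toy, DepDelay > 0)

# Step 2: sort the remaining rows by delay, smallest to largest
delayed.sorted <- arrange(delayed, DepDelay)
delayed.sorted
```

Each step does one thing, and the output of one verb is the input of the next.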
Let's load in some data and examine each verb.
Houston Flight Delay Data
We'll look at airline flight delay data. First, read it in:

    # install.packages('dplyr')
    library(dplyr)
    delay.dat.houston <- read.csv("HoustonAirline.csv",
                                  header = TRUE,
                                  stringsAsFactors = FALSE)
    # tbl_df allows for nice printing
    # (newer dplyr versions deprecate tbl_df() in favor of as_tibble())
    delay.dat.houston <- tbl_df(delay.dat.houston)
Take a look

    delay.dat.houston
    ## # A tibble: 241,105 x 29
    ##     Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
    ##    <int> <int>      <int>     <int>   <int>      <int>   <int>      <int>
    ##  1  2008     1          4         5    1910       1910    2025       2025
    ##  2  2008     1          4         5    1345       1345    1453       1500
    ##  3  2008     1          4         5     736        735     839        850
    ##  4  2008     1          4         5    1603       1550    1647       1635
    ##  5  2008     1          4         5    2105       2105    2151       2150
    ##  6  2008     1          4         5     635        635     716        720
    ##  7  2008     1          4         5    1331       1330    1411       1415
    ##  8  2008     1          4         5    1850       1850    1936       1935
    ##  9  2008     1          4         5     956       1000    1038       1045
    ## 10  2008     1          4         5     823        805     906        850
    ## # ... with 241,095 more rows, and 21 more variables: UniqueCarrier <chr>,
    ## #   FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
    ## #   CRSElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
    ## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
    ## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>,
    ## #   CarrierDelay <int>, WeatherDelay <int>, NASDelay <int>,
    ## #   SecurityDelay <int>, LateAircraftDelay <int>
Variable Description
Airport Data

    # Airport information
    airport.dat <- read.table("airports.csv", header = TRUE,
                              sep = ",", stringsAsFactors = FALSE)
    airport.dat <- tbl_df(airport.dat)
Airport Data cont.

    airport.dat
    ## # A tibble: 3,376 x 7
    ##    iata  airport              city             state country      lat
    ##    <chr> <chr>                <chr>            <chr> <chr>      <dbl>
    ##  1 00M   Thigpen              Bay Springs      MS    USA     31.95376
    ##  2 00R   Livingston Municipal Livingston       TX    USA     30.68586
    ##  3 00V   Meadow Lake          Colorado Springs CO    USA     38.94575
    ##  4 01G   Perry-Warsaw         Perry            NY    USA     42.74135
    ##  5 01J   Hilliard Airpark     Hilliard         FL    USA     30.68801
    ##  6 01M   Tishomingo County    Belmont          MS    USA     34.49167
    ##  7 02A   Gragg-Wade           Clanton          AL    USA     32.85049
    ##  8 02C   Capitol              Brookfield       WI    USA     43.08751
    ##  9 02G   Columbiana County    East Liverpool   OH    USA     40.67331
    ## 10 03D   Memphis Memorial     Memphis          MO    USA     40.44726
    ## # ... with 3,366 more rows, and 1 more variables: long <dbl>
filter
filter is probably the most familiar verb.
filter is dplyr's version of R's subset() function.
filter returns all rows (observations) for which a logical condition holds.
filter - Inputs and Outputs
Inputs: a data.frame and one or more logical expressions.
Output: a data.frame.
All dplyr verbs behave similarly: a data.frame goes in, and a data.frame comes out.
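The input/output contract can be checked on a small example. This is a sketch on a toy data frame whose values are invented for illustration; it also compares filter with base R's subset().

```r
library(dplyr)

# Toy data, invented for illustration (not the Houston data set)
toy <- data.frame(Month    = c(1, 1, 2, 3),
                  ArrDelay = c(5, -2, 30, NA))

# Input: a data.frame plus a logical expression on its columns
late <- filter(toy, ArrDelay > 0)

# Output: still a data.frame
is.data.frame(late)

# Base R counterpart; both drop rows where the condition is NA
late.base <- subset(toy, ArrDelay > 0)
```

Rows where the condition evaluates to NA are dropped rather than kept, which matters for delay columns with missing values.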
filter - Example 1

    # Find all flights which occurred in January
    filter(delay.dat.houston, Month == 1)
    ## # A tibble: 20,349 x 29
    ##     Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
    ##    <int> <int>      <int>     <int>   <int>      <int>   <int>      <int>
    ##  1  2008     1          4         5    1910       1910    2025       2025
    ##  2  2008     1          4         5    1345       1345    1453       1500
    ##  3  2008     1          4         5     736        735     839        850
    ##  4  2008     1          4         5    1603       1550    1647       1635
    ##  5  2008     1          4         5    2105       2105    2151       2150
    ##  6  2008     1          4         5     635        635     716        720
    ##  7  2008     1          4         5    1331       1330    1411       1415
    ##  8  2008     1          4         5    1850       1850    1936       1935
    ##  9  2008     1          4         5     956       1000    1038       1045
    ## 10  2008     1          4         5     823        805     906        850
    ## # ... with 20,339 more rows, and 21 more variables: UniqueCarrier <chr>,
    ## #   FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
    ## #   CRSElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
    ## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
    ## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>,
    ## #   CarrierDelay <int>, WeatherDelay <int>, NASDelay <int>,
    ## #   SecurityDelay <int>, LateAircraftDelay <int>

    # we could of course save this too
    # delay.dat.houston.jan <- filter(delay.dat.houston, Month == 1)
filter - Example 2

    # Using airport data, find a list of iata abbreviations for Houston, Texas airports
    filter(airport.dat, state == 'TX', city == 'Houston')
    ## # A tibble: 8 x 7
    ##   iata  airport                      city    state country      lat
    ##   <chr> <chr>                        <chr>   <chr> <chr>      <dbl>
    ## 1 DWH   David Wayne Hooks Memorial   Houston TX    USA     30.06186
    ## 2 EFD   Ellington                    Houston TX    USA     29.60733
    ## 3 HOU   William P Hobby              Houston TX    USA     29.64542
    ## 4 IAH   George Bush Intercontinental Houston TX    USA     29.98047
    ## 5 IWS   West Houston                 Houston TX    USA     29.81819
    ## 6 LVJ   Clover                       Houston TX    USA     29.52131
    ## 7 SGR   Sugar Land Municipal/Hull    Houston TX    USA     29.62225
    ## 8 SPX   Houston-Gulf                 Houston TX    USA     29.50836
    ## # ... with 1 more variables: long <dbl>
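Example 2 passes two conditions separated by a comma. filter combines comma-separated conditions with a logical AND, so the call is equivalent to joining them with &. The sketch below checks this on a toy table; the rows of airports.toy, including the "XXX" entry, are invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration (the "XXX" row is made up)
airports.toy <- data.frame(
  iata  = c("HOU", "IAH", "DAL", "XXX"),
  city  = c("Houston", "Houston", "Dallas", "Houston"),
  state = c("TX", "TX", "TX", "MO"),
  stringsAsFactors = FALSE
)

# Comma-separated conditions: every condition must hold
with.comma <- filter(airports.toy, state == 'TX', city == 'Houston')

# The same request written with an explicit AND
with.and <- filter(airports.toy, state == 'TX' & city == 'Houston')

with.comma$iata
```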
filter - Try it out
Find the subset of flights departing from Hobby for which the Actual Elapsed Time was greater than the CRS Elapsed Time.
Find the subset of flights departing on the weekend.
Try it out cont.

    filter(delay.dat.houston,
           Origin == 'HOU',  # iata code for Hobby
           ActualElapsedTime > CRSElapsedTime)
Try it out cont.

    filter(delay.dat.houston, DayOfWeek == 6 | DayOfWeek == 7)
    # another alternative
    filter(delay.dat.houston, DayOfWeek %in% c(6, 7))
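A common slip is to write DayOfWeek == c(6, 7) instead of using %in%. The sketch below, on a toy data frame invented for illustration, shows why the two differ: == recycles the right-hand vector and compares position by position, while %in% tests membership.

```r
library(dplyr)

# Toy data, invented for illustration (not the Houston data set)
toy <- data.frame(FlightNum = c(101, 102, 103, 104),
                  DayOfWeek = c(7, 6, 1, 6))

# %in% asks, for each row, whether DayOfWeek is one of 6 or 7
weekend.in <- filter(toy, DayOfWeek %in% c(6, 7))

# == recycles c(6, 7) to 6, 7, 6, 7 and compares row by row,
# so weekend rows are kept only if they happen to line up
weekend.eq <- filter(toy, DayOfWeek == c(6, 7))

nrow(weekend.in)
nrow(weekend.eq)
```

With this toy data the == version silently misses every weekend flight, so %in% (or the explicit | form) is the safer choice.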
arrange
arrange, like filter, operates on data.frame rows.
arrange sorts data.frame rows by one or more columns.
arrange Example 1

    # sort by DayofMonth, smallest to largest
    arrange(delay.dat.houston, DayofMonth)
    ## # A tibble: 241,105 x 29
    ##     Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
    ##    <int> <int>      <int>     <int>   <int>      <int>   <int>      <int>
    ##  1  2008     1          1         2    1531       1525    1626       1622
    ##  2  2008     1          1         2    1848       1850    2022       2025
    ##  3  2008     1          1         2    1024       1025    1353       1352
    ##  4  2008     1          1         2     707        705     818        822
    ##  5  2008     1          1         2    1047       1045    1423       1415
    ##  6  2008     1          1         2    1110       1110    1237       1240
    ##  7  2008     1          1         2    1653       1655    2038       2058
    ##  8  2008     1          1         2    2013       1950    2335       2319
    ##  9  2008     1          1         2    1212       1220    1454       1512
    ## 10  2008     1          1         2    1021       1020    1136       1132
    ## # ... with 241,095 more rows, and 21 more variables: UniqueCarrier <chr>,
    ## #   FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
    ## #   CRSElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
    ## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
    ## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>,
    ## #   CarrierDelay <int>, WeatherDelay <int>, NASDelay <int>,
    ## #   SecurityDelay <int>, LateAircraftDelay <int>
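Sorting by more than one column, or from largest to smallest, follows the same pattern. The sketch below uses a toy data frame invented for illustration; desc() is dplyr's helper for descending order.

```r
library(dplyr)

# Toy data, invented for illustration (not the Houston data set)
toy <- data.frame(Month      = c(2, 1, 1, 2),
                  DayofMonth = c(3, 15, 2, 1))

# Sort by Month first, then by DayofMonth within each month
by.date <- arrange(toy, Month, DayofMonth)

# Sort by DayofMonth, largest to smallest
by.day.desc <- arrange(toy, desc(DayofMonth))

by.date
by.day.desc
```

Later columns only break ties in earlier ones, so the order of the arguments matters.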