Hands-on data exploration using R Sebastian Sauer last update: 2018-11-12 1 / 97
Setup 2 / 97
Overview Setup Tidyverse 101 Data diagrams 101 Case study 3 / 97
whoami system("whoami") R enthusiast Data analyst/scientist Professor at FOM Hochschule 4 / 97
The lights are on 5 / 97
Upfront preparation
Please install the following software upfront:
  R
  RStudio Desktop
Starting RStudio will start R automatically.
Please also make sure that:
  Your OS is up to date
  You have internet access during the course
  You can reach a power socket (perhaps bring a power cable)
6 / 97
You, after this workshop Well, kinda off... 7 / 97
Learning goals Understanding basic tidyverse goals Applying tidyverse tools Visualizing data Basic modeling 8 / 97
We'll use the following R packages
    pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate",
               "viridis", "GGally", "ggmap", "pacman", "sjmisc",
               "leaflet", "knitr", "tidyverse")
Please install them prior to the workshop from within R. Install each missing package like this:
    install.packages("nycflights13")
Load the packages after each start of RStudio:
    library(pacman)
    p_load(pckgs, character.only = TRUE)
9 / 97
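As a minimal sketch of the install step, one could first check which of the listed packages are missing. The name `missing_pckgs` is introduced here for illustration only, and the install call is left commented out so that the snippet has no side effects.

```r
# Packages listed on the slide
pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate",
           "viridis", "GGally", "ggmap", "pacman", "sjmisc",
           "leaflet", "knitr", "tidyverse")

# Names of all packages installed in the current library paths
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
# (missing_pckgs is a hypothetical name, not part of the slides)
missing_pckgs <- setdiff(pckgs, installed)
print(missing_pckgs)

# Uncomment to install only what is missing:
# if (length(missing_pckgs) > 0) install.packages(missing_pckgs)
```

Note that `p_load()` from pacman also tries to install missing packages by default, so this check is mostly useful for seeing upfront what still needs to be downloaded.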
Data we'll use: mtcars
mtcars is a toy dataset built into R (no need to install anything). The data come from a 1974 motorsports magazine and describe some automobiles. Columns: e.g., horsepower, weight, fuel consumption.
Load the dataset:
    data(mtcars)
Get help:
    ?mtcars
10 / 97
Data we'll use: flights
flights is a dataset from the R package nycflights13 (the package must be installed). The data come from flights leaving the NYC airports in 2013. Columns: e.g., delay, air time, carrier name.
Load the dataset:
    data(flights, package = "nycflights13")
Get help:
    ?flights
Load the data each time you open RStudio (during this workshop).
11 / 97
RStudio running 12 / 97
The tidyverse 13 / 97
14 / 97
The data analysis (science) pipeline 15 / 97
Get the power of the uni tidyverse 16 / 97
But I love the old way ... 17 / 97
Nice data 18 / 97
Tidy data More Details 19 / 97
Dataset mtcars
    glimpse(mtcars)
    #> Observations: 32
    #> Variables: 11
    #> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8,
    #> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4,
    #> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.
    #> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 1
    #> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92,
    #> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.19
    #> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.0
    #> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
    #> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
    #> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4,
    #> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1,
20 / 97
Data wrangling 21 / 97
Two tidyverse principles Knock-down principle Pipe principle 22 / 97
23 / 97
Atoms of the knock-down principle filter() select() mutate() group_by() ... 24 / 97
Filtering rows with filter()
Extract rows that meet logical criteria. Filter table mtcars such that only rows remain where cyl equals 6.
    filter(mtcars, cyl == 6)
    #> mpg cyl disp hp drat
    #> 1 21.0 6 160.0 110 3.90 2.6
    #> 2 21.0 6 160.0 110 3.90 2.8
    #> 3 21.4 6 258.0 110 3.08 3.2
    #> 4 18.1 6 225.0 105 2.76 3.4
    #> 5 19.2 6 167.6 123 3.92 3.4
    #> 6 17.8 6 167.6 123 3.92 3.4
    #> 7 19.7 6 145.0 175 3.62 2.7
25 / 97
filter() - exercises
Filter the automatic cars.
Filter the automatic cars with more than 4 cylinders.
Filter cars with either low consumption or the super-thirsty ones.
26 / 97
filter() - solutions to exercises
    data(mtcars)  # only if dataset is not yet loaded
    filter(mtcars, am == 0)  # see ?mtcars: 0 = automatic, 1 = manual
    filter(mtcars, am == 0, cyl > 4)
    filter(mtcars, mpg > 30 | mpg < 12)
27 / 97
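The following sketch shows how conditions can be combined in `filter()`. It relies on the coding documented in `?mtcars` (am: 0 = automatic, 1 = manual); the object names `automatic`, `automatic_big` and `extremes` are chosen here for illustration.

```r
library(dplyr)
data(mtcars)

# Per ?mtcars: am == 0 is automatic, am == 1 is manual
automatic <- filter(mtcars, am == 0)

# Conditions separated by commas are combined with AND
automatic_big <- filter(mtcars, am == 0, cyl > 4)

# Use | for OR: very economical or very thirsty cars
# (the cut-offs 30 and 12 mpg are the ones used on the slide)
extremes <- filter(mtcars, mpg > 30 | mpg < 12)

# Number of rows that survive each filter
nrow(automatic)
nrow(automatic_big)
nrow(extremes)
```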
Select columns with select()
Extract columns by name. Select the columns cyl and hp. Discard the rest.
    select(mtcars, cyl, hp)
    #> cyl hp
    #> Mazda RX4 6 110
    #> Mazda RX4 Wag 6 110
    #> Datsun 710 4 93
    #> Hornet 4 Drive 6 110
    #> Hornet Sportabout 8 175
    #> Valiant 6 105
28 / 97
select() - exercises Select the first three columns. Select the first and third column. Select all columns containing the letter "c". 29 / 97
select() - solutions to exercises
    select(mtcars, 1:3)
    select(mtcars, 1, disp)
    select(mtcars, contains("c"))  # literal match; use matches() for regex
30 / 97
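To illustrate the difference between the two selection helpers, here is a small sketch. `contains()` looks for a literal substring, whereas `matches()` takes a regular expression; the object names `with_c` and `starts_c` are made up for this example.

```r
library(dplyr)
data(mtcars)

# contains() matches a literal substring (case-insensitive by default)
with_c <- select(mtcars, contains("c"))
names(with_c)

# matches() takes a regular expression: columns starting with "c"
starts_c <- select(mtcars, matches("^c"))
names(starts_c)
```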
Add or change a column with mutate()
Apply vectorized functions to columns to create new columns. Define weight in kg for each car.
    mtcars <- mutate(mtcars, weight_kg = wt
    head(select(mtcars, wt, weight_k
    #> wt weight_kg
    #> 1 2.620 5.24
    #> 2 2.875 5.75
    #> 3 2.320 4.64
    #> 4 3.215 6.43
    #> 5 3.440 6.88
    #> 6 3.460 6.92
31 / 97
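The call on the slide is clipped, so the exact expression is not visible. As a hedged sketch of the same idea, assuming `wt` is recorded in units of 1000 lbs (as `?mtcars` states) and using the names `lbs_to_kg` and `cars_kg` purely for illustration:

```r
library(dplyr)
data(mtcars)

# Assumption: wt is given in 1000 lbs (see ?mtcars); 1 lb = 0.45359237 kg
lbs_to_kg <- 0.45359237

# mutate() adds the new column and keeps all existing ones
cars_kg <- mutate(mtcars, weight_kg = wt * 1000 * lbs_to_kg)

head(select(cars_kg, wt, weight_kg))
```

Storing the result under a new name keeps the built-in `mtcars` unchanged, which makes it easier to rerun the examples.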
mutate() - exercises
Compute a variable for consumption (liters per 100 km).
Compute two variables in one mutate call.
32 / 97
mutate() - solutions to exercises
    mtcars <- mutate(mtcars, consumption = (1/mpg) * 100 * 3.8 / 1.6)
    mtcars <- mutate(mtcars,
                     consumption_g_per_m = (1/mpg),
                     consumption_l_per_100_k = consumption_g_per_m * 3.8
33 / 97
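The second call is clipped on the slide. A complete version might look like the sketch below, assuming it uses the same rounded conversion factors as the first solution (3.8 liters per gallon, 1.6 km per mile). The names `liters_per_gallon`, `km_per_mile` and `cars_consumption` are introduced here for illustration.

```r
library(dplyr)
data(mtcars)

# Assumed conversion factors, rounded as in the first solution on the slide
liters_per_gallon <- 3.8
km_per_mile <- 1.6

cars_consumption <- mutate(
  mtcars,
  # gallons needed per mile
  consumption_g_per_m = 1 / mpg,
  # liters per 100 km; later arguments may use columns created earlier
  consumption_l_per_100_k =
    consumption_g_per_m * liters_per_gallon / km_per_mile * 100
)

head(select(cars_consumption, mpg, consumption_g_per_m, consumption_l_per_100_k))
```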
Summarise a column with summarise()
Apply function to summarise column to single value. Summarise the values to their mean.
    summarise(mtcars, mean_hp = mean(hp))
    #> mean_hp
    #> 1 146.6875
34 / 97
summarise() - exercises Compute the median of consumption. Compute multiple statistics at once. 35 / 97
summarise() - solutions to exercises
    summarise(mtcars, median(consumption))
    summarise(mtcars,
              consumption_md = median(consumption),
              consumption_avg = mean(consumption)
              )
36 / 97
Group with group_by()
Create "grouped" copy of table. dplyr functions will manipulate each group separately and then combine the results. Group cars by am (automatic vs. manual). Then summarise to mean in each group.
    mtcars_grouped <- group_by(mtcar
    summarise(mtcars_grouped, mean_h
    #> # A tibble: 2 x 2
    #> am mean_hp
    #> <dbl> <dbl>
    #> 1 0 160.
    #> 2 1 127.
37 / 97
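Both calls are clipped on the slide. Judging from the description and the printed column names, a complete version could look like this sketch; the result name `hp_by_am` is an assumption added for illustration.

```r
library(dplyr)
data(mtcars)

# Grouped copy of the table: one group per value of am
mtcars_grouped <- group_by(mtcars, am)

# summarise() is now applied within each group, giving one row per group
hp_by_am <- summarise(mtcars_grouped, mean_hp = mean(hp))
hp_by_am
```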
group_by() - exercises Compute the median consumption, grouped by cylinder. Compute the median consumption, grouped by cylinder and am . 38 / 97
group_by() - solutions to exercises
    #> # A tibble: 3 x 2
    #> cyl mean_hp
    #> <dbl> <dbl>
    #> 1 4 9.14
    #> 2 6 12.1
    #> 3 8 16.2
39 / 97
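The slide shows only output, not the code. One way to approach the two exercises is sketched below, assuming `consumption` is defined as in the mutate() solutions. The names `cars_consumption`, `by_cyl` and `by_cyl_am` are hypothetical, and no claim is made that this reproduces the numbers printed on the slide.

```r
library(dplyr)
data(mtcars)

# Consumption in liters per 100 km, as in the mutate() solutions
cars_consumption <- mutate(mtcars, consumption = (1 / mpg) * 100 * 3.8 / 1.6)

# Median consumption per cylinder group
by_cyl <- summarise(
  group_by(cars_consumption, cyl),
  consumption_md = median(consumption)
)

# Median consumption per combination of cylinder and transmission
by_cyl_am <- summarise(
  group_by(cars_consumption, cyl, am),
  consumption_md = median(consumption),
  .groups = "drop"  # return an ungrouped table
)

by_cyl
by_cyl_am
```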
Enter the pipe 40 / 97
41 / 97
Life without the pipe operator
    summarise(
      raise_to_power(
        compute_differences(data, mean),
        2
      ),
      mean
    )
42 / 97
Life with the pipe operator
    data %>%
      compute_differences(mean) %>%
      raise_to_power(2) %>%
      summarise(mean)
43 / 97
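The functions on these two slides are placeholders. To make the comparison runnable, the sketch below defines hypothetical helpers with the same names (plus `summarise_with()` as a stand-in for the slide's `summarise()`, to avoid clashing with dplyr) and checks that the nested and the piped version give the same result.

```r
library(magrittr)  # provides the pipe operator %>%

# Hypothetical helpers mirroring the names on the slide
compute_differences <- function(x, fun) x - fun(x)
raise_to_power <- function(x, power) x^power
summarise_with <- function(x, fun) fun(x)  # stand-in for the slide's summarise()

data(mtcars)
hp <- mtcars$hp

# Nested call: has to be read from the inside out
nested <- summarise_with(raise_to_power(compute_differences(hp, mean), 2), mean)

# Piped call: reads top to bottom, each result feeds the next step
piped <- hp %>%
  compute_differences(mean) %>%
  raise_to_power(2) %>%
  summarise_with(mean)

all.equal(nested, piped)
```

Both versions compute the mean squared deviation from the mean; the pipe only changes how the steps are written, not what they do.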
Data diagrams 44 / 97
Why we need diagrams 45 / 97
Anatomy of a diagram 46 / 97
First plot with ggplot
    mtcars %>%
      ggplot() +                  # initialize plot
      aes(x = hp, y = mpg) +      # define axes etc.
      geom_point() +              # draw points
      geom_smooth()               # draw smoothing line
Notice the + in contrast to the pipe %>%.
47 / 97
Groups and colors
    mtcars %>%
      ggplot() +
      aes(x = hp, y = mpg, color = am) +
      geom_point() +
      geom_smooth() +
      scale_color_viridis_c() +
      theme_bw()
48 / 97
Diagrams - exercises Plot the mean and the median for each cylinder group (dataset mtcars ). 49 / 97
Diagrams - solutions to exercises
    mtcars_summarized %>%
      ggplot() +
      aes(x = cyl, y = mean_hp,
          color = factor(am),
          shape = factor(am)) +
      geom_point(size = 5)
50 / 97
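The solution uses `mtcars_summarized`, which is not defined on the slides. One plausible construction, assuming it holds the mean and median of `hp` per combination of `cyl` and `am`, is sketched here; the column `median_hp` and the object `p` are assumptions added for this example.

```r
library(dplyr)
library(ggplot2)
data(mtcars)

# Assumed construction of mtcars_summarized (not shown on the slides)
mtcars_summarized <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_hp = mean(hp),
            median_hp = median(hp),
            .groups = "drop")

# Same plot as on the slide, stored in p
p <- mtcars_summarized %>%
  ggplot() +
  aes(x = cyl, y = mean_hp,
      color = factor(am),
      shape = factor(am)) +
  geom_point(size = 5)

# print(p) draws the plot
```

Replacing `mean_hp` by `median_hp` in `aes()` would show the medians instead.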
Case study Why are flights delayed? 51 / 97
Know thy data
Don't forget to load it from the package via:
    data(flights)
A look at the help page:
    ?flights
52 / 97
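A first look at the data might proceed along these lines. This is only a sketch and assumes the package nycflights13 is installed; the check for missing departure delays is an illustrative addition, not part of the slides.

```r
library(dplyr)

# Load the dataset without attaching the whole package
data(flights, package = "nycflights13")

# Number of rows and columns
dim(flights)

# Column names, types and first values
glimpse(flights)

# Share of flights without a recorded departure delay
mean(is.na(flights$dep_delay))
```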