Hands-on data exploration using R
Sebastian Sauer
last update: 2018-11-12 1 / 97

Setup 2 / 97

Overview Setup Tidyverse 101 Data diagrams 101 Case study 3 / 97

whoami system("whoami") R enthusiast Data analyst/scientist Professor at FOM Hochschule 4 / 97

The lights are on [interactive Leaflet map] 5 / 97

Upfront preparation
Please install the following software upfront: R and RStudio Desktop. Starting RStudio will start R automatically.
Please also make sure:
Your OS is up to date
You have internet access during the course
You can reach a power socket (better bring a power cable)
6 / 97

You, after this workshop Well, kind of ... 7 / 97

Learning goals Understanding basic tidyverse goals Applying tidyverse tools Visualizing data Basic modeling 8 / 97

We'll use the following R packages
    pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate",
               "viridis", "GGally", "ggmap", "pacman", "sjmisc",
               "leaflet", "knitr", "tidyverse")
Please install them prior to the workshop from within R. Install each missing package like this:
    install.packages("nycflights13")
Load the packages after each start of RStudio:
    library(pacman)
    p_load(pckgs, character.only = TRUE)
9 / 97
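To see which of the listed packages are still missing before installing anything, a check along the following lines may help. This is a minimal sketch using base R only; the variable names `installed` and `missing_pckgs` are made up for illustration, and the install call is left commented out because it needs internet access.

```r
# Package list from the slide above
pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate",
           "viridis", "GGally", "ggmap", "pacman", "sjmisc",
           "leaflet", "knitr", "tidyverse")

# Names of all packages currently installed in this R library
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_pckgs <- setdiff(pckgs, installed)
print(missing_pckgs)

# Uncomment to install only the missing ones (needs internet access)
# if (length(missing_pckgs) > 0) install.packages(missing_pckgs)
```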

Data we'll use: mtcars
mtcars is a toy dataset built into R (no need for installing). Data come from a 1974 motorsports magazine (Motor Trend US) describing some automobiles.
Columns: e.g., horsepower, weight, fuel consumption
Load the dataset: data(mtcars)
Get help: ?mtcars
10 / 97

Data we'll use: flights
flights is a dataset from the R package nycflights13 (package must be installed). Data come from flights leaving the NYC airports in 2013.
Columns: e.g., delay, air time, carrier name
Load the dataset: data(flights, package = "nycflights13")
Get help: ?flights
Load the data each time you open RStudio (during this workshop).
11 / 97

RStudio running 12 / 97

The tidyverse 13 / 97

14 / 97

The data analysis (science) pipeline 15 / 97

Get the power of the uni tidyverse 16 / 97

But I love the old way ... 17 / 97

Nice data 18 / 97

Tidy data More Details 19 / 97

Dataset mtcars
    glimpse(mtcars)
    #> Observations: 32
    #> Variables: 11
    #> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8,
    #> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4,
    #> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.
    #> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 1
    #> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92,
    #> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.19
    #> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.0
    #> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
    #> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
    #> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4,
    #> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1,
20 / 97

Data wrangling 21 / 97

Two tidyverse principles Knock-down principle Pipe principle 22 / 97

23 / 97

Atoms of the knock-down principle filter() select() mutate() group_by() ... 24 / 97

Filtering rows with filter()
Extract rows that meet logical criteria.
Filter table mtcars such that only rows remain where cyl equals 6.
    filter(mtcars, cyl == 6)
    #>    mpg cyl  disp  hp drat
    #> 1 21.0   6 160.0 110 3.90 2.6
    #> 2 21.0   6 160.0 110 3.90 2.8
    #> 3 21.4   6 258.0 110 3.08 3.2
    #> 4 18.1   6 225.0 105 2.76 3.4
    #> 5 19.2   6 167.6 123 3.92 3.4
    #> 6 17.8   6 167.6 123 3.92 3.4
    #> 7 19.7   6 145.0 175 3.62 2.7
25 / 97

filter() - exercises Filter the automatic cars. Filter the automatic cars with more than 4 cylinders. Filter cars with either low consumption or the super-thirsty ones. 26 / 97

filter() - solutions to exercises
    data(mtcars)  # only if dataset is not yet loaded
    # per ?mtcars, am is coded 0 = automatic, 1 = manual
    filter(mtcars, am == 0)
    filter(mtcars, am == 0, cyl > 4)
    filter(mtcars, mpg > 30 | mpg < 12)
27 / 97
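How several conditions combine inside filter() can be checked directly. The following is a small sketch, assuming dplyr is installed and relying on the coding documented in ?mtcars (am: 0 = automatic, 1 = manual); the object names are made up for illustration.

```r
library(dplyr)
data(mtcars)

# Comma-separated conditions are combined with a logical AND,
# so these two calls select the same rows
auto_big_1 <- filter(mtcars, am == 0, cyl > 4)
auto_big_2 <- filter(mtcars, am == 0 & cyl > 4)
print(identical(auto_big_1, auto_big_2))

# %in% is a compact way to write several OR-ed equality checks
four_or_six <- filter(mtcars, cyl %in% c(4, 6))
print(nrow(four_or_six))
```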

Select columns with select()
Extract columns by name.
Select the columns cyl and hp. Discard the rest.
    select(mtcars, cyl, hp)
    #>                   cyl  hp
    #> Mazda RX4           6 110
    #> Mazda RX4 Wag       6 110
    #> Datsun 710          4  93
    #> Hornet 4 Drive      6 110
    #> Hornet Sportabout   8 175
    #> Valiant             6 105
28 / 97

select() - exercises Select the first three columns. Select the first and third column. Select all columns containing the letter "c". 29 / 97

select() - solutions to exercises
    select(mtcars, 1:3)
    select(mtcars, 1, disp)
    select(mtcars, contains("c"))  # literal match; use matches() for regular expressions
30 / 97
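The selection helpers differ in how they match column names. A brief sketch, assuming dplyr is installed; the result names (`cols_contains` and so on) are invented for this example.

```r
library(dplyr)
data(mtcars)

# contains() looks for a literal substring in the column names
cols_contains <- names(select(mtcars, contains("c")))

# starts_with() matches a literal prefix
cols_starts <- names(select(mtcars, starts_with("d")))

# matches() takes a regular expression
cols_regex <- names(select(mtcars, matches("^[dq]")))

# A minus sign drops columns instead of keeping them
cols_without_mpg <- names(select(mtcars, -mpg))

print(cols_contains)
print(cols_starts)
print(cols_regex)
```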

Add or change a column with mutate()
Apply vectorized functions to columns to create new columns.
Define weight in kg for each car.
    mtcars <- mutate(mtcars, weight_kg = wt … [truncated on the slide]
    head(select(mtcars, wt, weight_k… [truncated on the slide]
    #>      wt weight_kg
    #> 1 2.620      5.24
    #> 2 2.875      5.75
    #> 3 2.320      4.64
    #> 4 3.215      6.43
    #> 5 3.440      6.88
    #> 6 3.460      6.92
31 / 97
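Since the mutate() call is cut off on the slide, here is a self-contained sketch of the same idea. It is not the slide's code: it assumes that wt is given in units of 1000 lbs (as documented in ?mtcars) and uses the exact pound-to-kilogram factor, so its numbers differ from the table above. The names `lbs_to_kg`, `cars_kg` and `weight_kg_sketch` are hypothetical.

```r
library(dplyr)
data(mtcars)

# Assumption: wt is in units of 1000 lbs (see ?mtcars); 1 lb = 0.45359237 kg
lbs_to_kg <- 0.45359237

# mutate() works on whole columns at once and returns a new data frame;
# the original mtcars is left untouched here
cars_kg <- mutate(mtcars, weight_kg_sketch = wt * 1000 * lbs_to_kg)

head(select(cars_kg, wt, weight_kg_sketch))
```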

mutate() - exercises Compute a variable for consumption (e.g., liters per 100 km). Compute two variables in one mutate call. 32 / 97

mutate() - solutions to exercises
    # liters per 100 km, using 3.8 l per gallon and 1.6 km per mile
    mtcars <- mutate(mtcars, consumption = (1/mpg) * 100 * 3.8 / 1.6)
    mtcars <- mutate(mtcars, consumption_g_per_m = (1/mpg), consumption_l_per_100_k = consumption_g_per_m * 3.8 … [truncated on the slide]
33 / 97

Summarise a column with summarise()
Apply a function to summarise a column to a single value.
Summarise the values to their mean.
    summarise(mtcars, mean_hp = mean(hp))
    #>    mean_hp
    #> 1 146.6875
34 / 97

summarise() - exercises Compute the median of consumption. Compute multiple statistics at once. 35 / 97

summarise() - solutions to exercises
    summarise(mtcars, median(consumption))
    summarise(mtcars,
              consumption_md = median(consumption),
              consumption_avg = mean(consumption))
36 / 97
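summarise() accepts any function that reduces a column to one value, plus the helper n() for the row count. A minimal sketch on the hp column, assuming dplyr is installed; the result name `hp_stats` is made up.

```r
library(dplyr)
data(mtcars)

# Several summary statistics in one call; each becomes one column
hp_stats <- summarise(mtcars,
                      n       = n(),        # number of rows
                      hp_mean = mean(hp),
                      hp_sd   = sd(hp),
                      hp_min  = min(hp),
                      hp_max  = max(hp))
print(hp_stats)
```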

Group with group_by()
Create a "grouped" copy of the table. dplyr functions will manipulate each group separately and then combine the results.
Group cars by am (automatic vs. manual). Then summarise to the mean in each group.
    mtcars_grouped <- group_by(mtcar… [truncated on the slide]
    summarise(mtcars_grouped, mean_h… [truncated on the slide]
    #> # A tibble: 2 x 2
    #>      am mean_hp
    #>   <dbl>   <dbl>
    #> 1     0    160.
    #> 2     1    127.
37 / 97

group_by() - exercises Compute the median consumption, grouped by cylinder. Compute the median consumption, grouped by cylinder and am . 38 / 97

group_by() - exercises
    #> # A tibble: 3 x 2
    #>     cyl mean_hp
    #>   <dbl>   <dbl>
    #> 1     4    9.14
    #> 2     6   12.1
    #> 3     8   16.2
39 / 97
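One possible way to approach the two group_by() exercises is sketched below. It is not taken from the slides: it recomputes consumption with the formula from the mutate() solutions, uses made-up object names, and passes `.groups = "drop"` (available in dplyr 1.0.0 and later) to return an ungrouped result. The numbers it prints need not match the table above.

```r
library(dplyr)
data(mtcars)

# Consumption as in the mutate() solutions (3.8 l per gallon, 1.6 km per mile)
cars_cons <- mutate(mtcars, consumption = (1 / mpg) * 100 * 3.8 / 1.6)

# Median consumption per cylinder group
consumption_by_cyl <- cars_cons %>%
  group_by(cyl) %>%
  summarise(consumption_md = median(consumption))
print(consumption_by_cyl)

# Median consumption per combination of cylinder and transmission type
consumption_by_cyl_am <- cars_cons %>%
  group_by(cyl, am) %>%
  summarise(consumption_md = median(consumption), .groups = "drop")
print(consumption_by_cyl_am)
```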

Enter the pipe 40 / 97

41 / 97

Life without the pipe operator
    summarise(
      raise_to_power(
        compute_differences(data, mean),
        2
      ),
      mean
    )
42 / 97

Life with the pipe operator
    data %>%
      compute_differences(mean) %>%
      raise_to_power(2) %>%
      summarise(mean)
43 / 97
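The two slides above use placeholder functions. A runnable counterpart, as a sketch only, could compute the mean squared deviation of hp with real dplyr verbs, once nested and once piped. The column names `sq_diff` and `msd` are invented for this example.

```r
library(dplyr)
data(mtcars)

# Nested calls: must be read from the inside out
nested <- summarise(
  mutate(mtcars, sq_diff = (hp - mean(hp))^2),
  msd = mean(sq_diff)
)

# Piped calls: read top to bottom, one step per line
piped <- mtcars %>%
  mutate(sq_diff = (hp - mean(hp))^2) %>%
  summarise(msd = mean(sq_diff))

# Both spellings give the same result
print(all.equal(nested, piped))
```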

Data diagrams 44 / 97

Why we need diagrams 45 / 97

Anatomy of a diagram 46 / 97

First plot with ggplot
    mtcars %>%
      ggplot() +                 # initialize plot
      aes(x = hp, y = mpg) +     # define axes etc.
      geom_point() +             # draw points
      geom_smooth()              # draw smoothing line
Notice the + in contrast to the pipe %>%.
47 / 97

Groups and colors
    mtcars %>%
      ggplot() +
      aes(x = hp, y = mpg, color = am) +
      geom_point() +
      geom_smooth() +
      scale_color_viridis_c() +
      theme_bw()
48 / 97

Diagrams - exercises Plot the mean and the median for each cylinder group (dataset mtcars ). 49 / 97

Diagrams - solutions to exercises
    mtcars_summarized %>%
      ggplot() +
      aes(x = cyl, y = mean_hp, color = factor(am), shape = factor(am)) +
      geom_point(size = 5)
50 / 97
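The solution above plots a table called mtcars_summarized that is not built on the slides shown here. One plausible way to create it, offered as an assumption rather than the author's code, is to group by cyl and am and summarise hp. The column `median_hp` and the object `p` are additions of this sketch.

```r
library(dplyr)
library(ggplot2)
data(mtcars)

# Assumed construction of the summary table used in the solution
mtcars_summarized <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_hp = mean(hp),
            median_hp = median(hp),
            .groups = "drop")

# Same plot as in the solution, stored in an object before printing
p <- mtcars_summarized %>%
  ggplot() +
  aes(x = cyl, y = mean_hp, color = factor(am), shape = factor(am)) +
  geom_point(size = 5)
print(p)
```

Swapping `mean_hp` for `median_hp` in aes() gives the corresponding plot of medians.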

Case study Why are flights delayed? 51 / 97

Know thy data
Don't forget to load it from the package via: data(flights)
Take a look at the help page: ?flights
52 / 97
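A first look at the data could go as follows. This is a sketch that assumes the packages dplyr and nycflights13 are installed; the summary columns and the object name `delay_overview` are chosen for illustration only.

```r
library(dplyr)
data(flights, package = "nycflights13")

# Column names, types and first values
glimpse(flights)

# Size of the table, average departure delay, and share of missing delays
delay_overview <- flights %>%
  summarise(n_flights     = n(),
            dep_delay_avg = mean(dep_delay, na.rm = TRUE),
            share_missing = mean(is.na(dep_delay)))
print(delay_overview)
```

Note the `na.rm = TRUE`: without it, any missing delay value would turn the mean into NA.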
