Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Data Science Tools

Overview
- In-memory analytics
- Python and R
- More on visualization
- The road to big data
- Notebooks and development environments
- A word on file formats
- A word on packaging and versioning systems
- Model deployment

In-memory Analytics

The landscape is incredibly complex
- Heard about Hadoop? Spark? H2O?
- Many vendors with their “big data and analytics” stack: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC
- “My data lake versus yours”
- There’s always “roll your own”
- Open source, or walled garden?
- Support, features, speed of upgrades?
- The situation has stabilized a bit (the champions have settled), but does it matter?

Two sides emerge
- Infrastructure: “Big Data”, “Integration”, “Architecture”, NoSQL and NewSQL, “Streaming”
- Analytics: “Data Science”, “Machine Learning”, “AI”, but also still BI and Visualization

There’s a difference

In-memory analytics
- Your data set fits in memory
- This is the assumption of many tools: SAS, SPSS, MATLAB, R, Python, Julia
- Is this really a problem?
  - Servers with 512GB of RAM have become relatively cheap
  - Cheaper than an HDFS cluster (especially in today’s cloud environment)
  - Implementation makes a difference (representation of the data set in memory)
  - If your task is unsupervised or supervised modeling, you can apply sampling
  - Some algorithms can work in online / batch mode

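
The three workarounds on the slide above (memory representation, sampling, online / batch processing) can be illustrated with a minimal base R sketch. The data is randomly generated and purely hypothetical, and the chunk loop only stands in for reading pieces of a larger file from disk.

```r
set.seed(42)
n <- 1e6

# Representation matters: the same whole numbers stored as integers
# (4 bytes each) take roughly half the memory of doubles (8 bytes each).
ints <- sample.int(100L, n, replace = TRUE)
dbls <- as.numeric(ints)
size_int <- as.numeric(object.size(ints))
size_dbl <- as.numeric(object.size(dbls))

# Sampling: estimate a statistic from a random subset of the rows.
values <- runif(n)                      # hypothetical measurements
idx <- sample.int(n, size = 10000)
sample_mean <- mean(values[idx])

# Online / batch mode: update a running mean chunk by chunk, so that
# only one chunk would have to be in memory at a time.
chunk_size <- 1e5
running_n <- 0
running_sum <- 0
for (start in seq(1, n, by = chunk_size)) {
  # Stand-in for reading the next chunk from disk
  chunk <- values[start:min(start + chunk_size - 1, n)]
  running_n <- running_n + length(chunk)
  running_sum <- running_sum + sum(chunk)
}
online_mean <- running_sum / running_n
```

The running mean is exact because a mean only needs a count and a sum; many models need a dedicated incremental algorithm instead.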
Python and R

The big two
- The “big two” in modern data science: Python and R
- Both have their advantages
- Others are interesting too (e.g. Julia), but still less adopted
- Their popularity is not (really) due to the languages themselves, but to their huge ecosystems: many packages for data science are available
- “Python is the second best language for everything”
- Vendors such as SAS and SPSS remain as well, but bleeding-edge algorithms and techniques appear in open source first

Analytics with R
- Native concept of a “data frame”: a table in which each column contains measurements on one variable, and each row contains one case
- Unlike a matrix, the data you store in the columns of a data frame can be of various types, i.e. one column might be a numeric variable, another might be a factor, and a third might be a character variable
- All columns have to be the same length (contain the same number of data items, although some of those data items may be missing values)
- Fun read: Is a Dataframe Just a Table?, Yifan Wu, 2019

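
A small base R sketch of the data frame properties described above. The column names and values are made up for illustration.

```r
# A data frame with one numeric, one factor, and one character column.
# The missing age is stored as NA, so all columns keep the same length.
patients <- data.frame(
  age   = c(34, 51, NA, 28),
  group = factor(c("control", "treated", "treated", "control")),
  name  = c("An", "Bram", "Chloe", "Daan"),
  stringsAsFactors = FALSE
)

column_types <- sapply(patients, class)   # one type per column
missing_per_column <- colSums(is.na(patients))

# A matrix, by contrast, holds a single type: mixing numbers and
# strings silently coerces everything to character.
m <- cbind(age = c(34, 51), name = c("An", "Bram"))
matrix_type <- typeof(m)

# Columns of different lengths that cannot be recycled are rejected.
length_error <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
```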
Analytics with R
- R is great thanks to its ecosystem
- Hadley Wickham: Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
- Data science “tidyverse” (https://www.tidyverse.org/):
  - ggplot2 for visualising data
  - dplyr for manipulating data
  - tidyr for tidying data
  - stringr for working with strings
  - lubridate for working with date/times
- Data import:
  - readr for reading .csv and fwf files
  - readxl for reading .xls and .xlsx files
  - haven for SAS, SPSS, and Stata files (also: “foreign” package)
  - httr for talking to web APIs
  - rvest for scraping websites
  - xml2 for importing XML files
- Concept of “tidy” data and operations

Modern R
- Learning R today? Make sure to use “modern R” principles
- tidyverse should be the first package you install, especially thanks to dplyr, tidyr, stringr, and lubridate
- dplyr implements a verb-based data manipulation language
- Works on normal data frames but can also work with database connections (already a simple way to solve the mid-to-big sized data issue)
- Verbs can be piped together, similar to the Unix pipe operator

    library(dplyr)
    library(nycflights13)  # assumed source of the flights data set

    flights %>%
      select(year, month, day) %>%
      arrange(desc(year)) %>%
      head()

Modern R

    library(ggplot2)

    # Per-plane summary: number of flights, mean distance, mean arrival delay
    delay <- flights %>%
      group_by(tailnum) %>%
      summarise(count = n(),
                dist = mean(distance, na.rm = TRUE),
                delay = mean(arr_delay, na.rm = TRUE))

    delay %>%
      filter(count > 20, dist < 2000) %>%
      ggplot(aes(dist, delay)) +
      geom_point(aes(size = count), alpha = 1/2) +
      geom_smooth() +
      scale_size_area()

Also see: https://www.rstudio.com/resources/cheatsheets/

Modeling with R
- Virtually any unsupervised or supervised algorithm is implemented in R as a package
- The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
  - Data splitting
  - Pre-processing
  - Feature selection
  - Model tuning using resampling
  - Variable importance estimation
- Caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface
- You can just use the original package as well if you know what you want
- Still widely used

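
As an illustration of the first two tools in the list above, a minimal sketch of data splitting and pre-processing with caret. The built-in iris data set is only a stand-in, and the 80/20 split is an arbitrary choice.

```r
library(caret)

set.seed(1)

# Data splitting: stratified 80/20 split on the outcome variable
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]

# Pre-processing: estimate centering and scaling on the training
# predictors only (columns 1 to 4 are the numeric measurements)
pp <- preProcess(train_df[, 1:4], method = c("center", "scale"))

# Apply the same transformation to both sets
train_x <- predict(pp, train_df[, 1:4])
test_x  <- predict(pp, test_df[, 1:4])
```

Estimating the transformation on the training rows and reusing it on the test rows keeps information from the test set out of the model.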
Modeling with R

    library(caret)
    library(ggplot2)
    library(randomForest)

    # Treat both "NA" and empty strings as missing values
    training <- read.csv("train.csv", na.strings = c("NA", ""))
    test <- read.csv("test.csv", na.strings = c("NA", ""))

    # Invoke caret with random forest and 5-fold cross validation
    rf_model <- train(TARGET ~ ., data = training,
                      method = "rf",
                      trControl = trainControl(method = "cv", number = 5),
                      ntree = 500)  # Other parameters can be passed here

    print(rf_model)
    ## Random Forest
    ##
    ## 5889 samples
    ##   53 predictors
    ##    5 classes: 'A', 'B', 'C', 'D', 'E'
    ##
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold)
    ##
    ## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
    ##
    ## Resampling results across tuning parameters:
    ##
    ##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
    ##    2    1         1      0.006        0.008
    ##   27    1         1      0.005        0.006
    ##   53    1         1      0.006        0.007
    ##
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final value used for the model was mtry = 27.

Modeling with R

    print(rf_model$finalModel)
    ##
    ## Call:
    ##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
    ##               allowParallel = TRUE)
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 27
    ##
    ##         OOB estimate of error rate: 0.88%
    ## Confusion matrix:
    ##      A    B    C   D    E class.error
    ## A 1674    0    0   0    0     0.00000
    ## B   11 1119    9   1    0     0.01842
    ## C    0   11 1015   1    0     0.01168
    ## D    0    2   10 952    1     0.01347
    ## E    0    1    0   5 1077     0.00554

Modeling with R
- The mlr package is an alternative to caret
- R does not define a standardized interface for all its machine learning algorithms; the mlr package provides infrastructure so that you can focus on your experiments
- The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering
- The package is connected to the OpenML R package and its online platform, which aims at supporting collaborative machine learning online and makes it easy to share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research
- mlr3: https://mlr3.mlr-org.com/
  - Newer, though gaining uptake

Modeling with R

    library(mlr3)
    set.seed(1)

    task_iris = TaskClassif$new(id = "iris", backend = iris, target = "Species")
    learner = lrn("classif.rpart", cp = 0.01)

    # 80/20 split of the row ids into a training and a test set
    train_set = sample(task_iris$nrow, 0.8 * task_iris$nrow)
    test_set = setdiff(seq_len(task_iris$nrow), train_set)

    # train the model
    learner$train(task_iris, row_ids = train_set)

    # predict data
    prediction = learner$predict(task_iris, row_ids = test_set)

    # calculate performance
    prediction$confusion
    ##             truth
    ## response     setosa versicolor virginica
    ##   setosa         11          0         0
    ##   versicolor      0         12         1
    ##   virginica       0          0         6

    measure = msr("classif.acc")
    prediction$score(measure)
    ## classif.acc
    ##   0.9666667

Modeling with R
- The modelr package provides functions that help you create elegant pipelines when modelling
- By Hadley Wickham
- Mainly for simple regression models
- More information: http://r4ds.had.co.nz/
  - Modern R approach
  - Starts simple: linear and visual models
  - Good introduction

Visualizations with R
- ggplot2 reigns supreme
  - By Hadley Wickham
  - Uses a “grammar of graphics” approach
  - A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
  - An abstraction which makes thinking, reasoning and communicating graphics easier
  - Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics
  - Original idea: Wilkinson (2006)
- ggvis: similar in spirit to ggplot2 and built on top of Vega (a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs)
  - Also declaratively describes data graphics
  - Different render targets
  - Interactivity: interact in browser, phone, …

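
To make the “grammar of graphics” idea concrete, a short sketch that builds one ggplot2 graphic from separate components instead of asking for a named chart type. The built-in mtcars data set and the chosen variables are only an example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +     # data and aesthetic mappings
  geom_point(aes(colour = factor(cyl))) +       # geometric layer: points
  geom_smooth(method = "lm", se = FALSE) +      # statistical layer: linear fit
  facet_wrap(~ am) +                            # facets: one panel per transmission type
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders")

# Printing the object renders the plot, e.g. print(p)
```

Swapping or removing a single component (for example the facets or the smoother) changes the graphic without touching the rest of the description.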
Visualizations with R
- shiny: a web application framework for R
- Construct interactive dashboards

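
A minimal sketch of what such an interactive dashboard could look like in shiny. The input and output ids ("bins", "hist") and the use of the built-in faithful data set are arbitrary choices for illustration.

```r
library(shiny)

# User interface: one slider and one plot
ui <- fluidPage(
  titlePanel("Old Faithful waiting times"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server logic: the plot is redrawn whenever the slider value changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = NULL, xlab = "Waiting time (minutes)")
  })
}

app <- shinyApp(ui = ui, server = server)

# Only launch the app in an interactive session
if (interactive()) runApp(app)
```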