SLIDE 1

Data Science 101: Using R Language to get Big Insights

Satnam Singh, Senior Chief Engineer, Samsung Research India - Bangalore (Twitter: @satnam74s)
India Software Developers Conference, Bangalore, March 16, 2013

SLIDE 2

Motivation: Using Data to get Business Insights

[Figure: Databases & Clusters -> Insights?]

SLIDE 3

Data Science Programming Languages

Ref. [kaggle.com]

Why R?

Popular, Free
Open source
Multi-platform
Vectorization
Many statistical packages
Large support base
Object-oriented programming language

Ref [http://www.r-project.org]
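The "Vectorization" point can be illustrated with a small sketch. The vector and variable names below are illustrative assumptions, not taken from the slides; the point is only that an arithmetic operator applies to every element of a vector without an explicit loop.

```r
# Illustrative data (not from the slides)
x <- c(1, 2, 3, 4)

# Loop version: square each element one at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operator is applied to the whole vector at once
squares_vec <- x^2

# Vectorized arithmetic and a built-in summary function
scaled <- x * 2 + 1
avg <- mean(x)

print(squares_vec)
print(scaled)
print(avg)
```

Both versions give the same result; the vectorized form is shorter and is the usual idiom in R.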

SLIDE 4

R Language Basics

Simple Operations

> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233

Vector Operations, Function Calls

> y <- c(1,2,3,4)
> y
[1] 1 2 3 4

SLIDE 5

R Language: Data Structures Examples

Data frame
Matrix
List
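The slide's own examples for these data structures did not survive extraction, so the following is a stand-in sketch. All values and names (`activity`, `yavg`, `lst`) are made up for illustration.

```r
# Matrix: one data type, two dimensions (filled column by column)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: a table whose columns may have different types
# (values are illustrative, not measurements from the case study)
df <- data.frame(
  activity = c("Walking", "Jogging", "Sitting"),
  yavg     = c(9.1, 5.2, 1.3),
  stringsAsFactors = FALSE
)

# List: an ordered collection of arbitrary objects
lst <- list(name = "sample", values = c(1, 2, 3), flag = TRUE)

print(m[2, 3])                               # element in row 2, column 3
print(df$yavg[df$activity == "Jogging"])     # filter one column by another
print(lst$values)                            # access a list element by name
```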

SLIDE 6

Case Study: Activity Recognition

Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.

[Figure: Smartphone's accelerometer sensor; example of accelerometer data]

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar

SLIDE 7

Data Analysis - Steps

Time Series Data: 200 samples (10 sec)

Feature Extraction: 43 Features
Mean for each acc. axis (3)
Std. dev. for each acc. axis (3)
Avg. abs. diff. from mean for each acc. axis (3)
Avg. resultant acc. (1)
Histogram (30)

Classifiers
CART: Decision Tree
RF: Random Forest

Classify the Activity

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
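A minimal sketch of the feature extraction step for one window, assuming the window is a data frame with columns `x`, `y`, `z`. The helper `extract_features`, the simulated data, and the choice of 10 histogram bins per axis (giving the 30 histogram features) are assumptions for illustration; the slides do not show the actual implementation. Only the features named on the slide are computed.

```r
# Hypothetical helper: summarise one window of tri-axial accelerometer samples
extract_features <- function(window, n_bins = 10) {
  axes <- c("x", "y", "z")

  # Mean and standard deviation for each axis (3 + 3 features)
  avg     <- sapply(axes, function(a) mean(window[[a]]))
  std_dev <- sapply(axes, function(a) sd(window[[a]]))

  # Average absolute difference from the mean for each axis (3 features)
  abs_dev <- sapply(axes, function(a) mean(abs(window[[a]] - mean(window[[a]]))))

  # Average resultant acceleration: mean of sqrt(x^2 + y^2 + z^2) (1 feature)
  resultant <- mean(sqrt(window$x^2 + window$y^2 + window$z^2))

  # Histogram: fraction of samples in each of n_bins equal-width bins per axis
  bins <- unlist(lapply(axes, function(a) {
    v <- window[[a]]
    as.numeric(table(cut(v, breaks = n_bins))) / length(v)
  }))

  c(avg = avg, std_dev = std_dev, abs_dev = abs_dev,
    resultant = resultant, bin = bins)
}

# Simulated window: 200 samples over 10 seconds, i.e. 20 samples per second
set.seed(1)
window <- data.frame(x = rnorm(200), y = rnorm(200, mean = 9.8), z = rnorm(200))
features <- extract_features(window)
print(length(features))
```

Each window becomes one row of the table that the classifiers are trained on.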

SLIDE 8

Data Visualization – Activity (Class Variable)

Bar Plot and Dot Plot

[Ref] Rattle R Data Mining Tool

# barplot2() comes from gplots, rainbow_hcl() from colorspace
library(gplots)
library(colorspace)

# Row 1: class counts over all records; rows 2-7: counts within each class
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))

# Order the classes by overall frequency
ord <- order(ds[1,], decreasing=TRUE)

# Bar plot
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class", ylim=c(0, 2497), col=rainbow_hcl(7))

# Dot plot
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="", xlab="Frequency", ylab="class", pch=c(1:6, 19))

SLIDE 9

Data Visualization Example – Variable Yavg.

# rainbow_hcl() comes from colorspace
library(colorspace)

# YAVG values for all records and for each activity class
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))

# Box plot of YAVG per group
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)

# Overlay the group means
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)

# Histogram of YAVG over all records
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency", col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)

[Ref] Rattle R Data Mining Tool

SLIDE 10

Correlation Plot

Easy to interpret
Blue: Positive correlation
Red: Negative correlation

[Ref] Rattle R Data Mining Tool

require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
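The colour lookup in the plotcorr call, `[5*crs$cor + 6]`, may not be obvious at first glance. A small sketch of the arithmetic, using base R only and a few hand-picked correlation values (the vector `r` is illustrative): a correlation in [-1, 1] is mapped linearly onto the indices 1 to 11 of an 11-step red-white-blue palette.

```r
# 11-step palette running from red through white to blue
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)

# Illustrative correlation values
r <- c(-1, -0.5, 0, 0.5, 1)

# Linear map from [-1, 1] to [1, 11]; fractional indices are truncated
# toward zero when used for subsetting (3.5 selects element 3)
idx <- 5 * r + 6
colours <- palette11[idx]

print(idx)
print(colours)
```

So r = -1 selects the first (red) entry, r = 0 the middle (white) entry, and r = 1 the last (blue) entry, which matches the colour legend on the slide.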

SLIDE 11

Data Science R Packages

Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Classifiers  apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
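A short sketch of how the three `stats` functions in the table are typically called. The built-in `iris` data set stands in for the smartphone data, which is not available here; the choice of 3 clusters and of the predictor columns is an assumption for illustration.

```r
data(iris)
features <- iris[, 1:4]

# kmeans: partition the rows into 3 clusters
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 10)

# hclust: hierarchical clustering on a distance matrix, then cut into 3 groups
hc <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# glm: logistic regression for a two-class target (virginica vs. the rest)
iris$is_virginica <- as.integer(iris$Species == "virginica")
fit <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)

print(table(km$cluster))
print(table(hc_groups))
print(coef(fit))
```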

SLIDE 12

Decision Tree - Visualization

[Ref] Rattle R Data Mining Tool

SLIDE 13

Decision Tree

rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

Decision Tree Model Results:

n= 3792

1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)

Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG

Root node error: 2364/3792 = 0.62342
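A runnable sketch of the same rpart call, with the built-in `iris` data standing in for `smartphone_data` (an assumption, since the case-study data is not included here). It also reproduces the root node error arithmetic from the output above: 2364 of the 3792 records are not in the majority class "Walking".

```r
library(rpart)
data(iris)

# Same settings as on the slide, applied to a stand-in data set
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
print(fit)

# Error on the training data (an optimistic estimate of real performance)
pred <- predict(fit, iris, type = "class")
train_error <- mean(pred != iris$Species)
print(train_error)

# Root node error from the slide: misclassified at the root / total records
root_error <- 2364 / 3792
print(round(root_error, 5))
```

In the printed tree, each node line reads: node number, split condition, number of records, number misclassified, predicted class, and class proportions; a trailing `*` marks a leaf.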

SLIDE 14

Random Forest: Ensemble of Trees

[Figure: Tree 1, Tree 2, …, Tree n combined (Σ) into a Random Forest]

[Ref] Rattle R Data Mining Tool
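The ensemble idea in the diagram can be sketched with bagged rpart trees and a majority vote. This is a simplification, not the randomForest algorithm itself: a random forest additionally samples a random subset of variables at each split. The `iris` data, the 25 trees, and all variable names are assumptions for illustration.

```r
library(rpart)
data(iris)
set.seed(1)

# Train each tree on a bootstrap sample of the rows
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# One column of predicted class labels per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})

# Combine the trees: the most frequent label in each row wins
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(majority == iris$Species)
print(accuracy)
```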

SLIDE 15

Random Forest Package in R

randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)

Random Forest Model Results:

Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%

Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064
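The class.error column and the OOB error rate follow directly from the confusion matrix counts. The sketch below re-enters the counts from the output above (rows are actual classes, columns are predicted classes) and recomputes both; the variable names are illustrative.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix above
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that falls off the diagonal
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: share of all records that fall off the diagonal
overall_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * overall_error, 2))
```

The rows sum to 3792 records. Downstairs and Upstairs have by far the highest error, and the matrix shows they are mostly confused with each other and with Walking.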

SLIDE 16

Summary

Fusion of data science and domain knowledge enables big insights from the data
The R language provides a platform to rapidly build prototypes and test ideas
Getting data insights is an outcome of intense team effort between various stakeholders

SLIDE 17

References

R Project: http://www.r-project.org
Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
"Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
R wiki: http://rwiki.sciviews.org/doku.php
R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
Rattle, R Data Mining Tool: http://rattle.togaware.com/
Sensor Platforms: http://www.sensorplatforms.com/context-aware/
Movea: http://www.movea.com/
Alohar: https://www.alohar.com
