Tutorial: An Introduction to SparkR
  Hao Lin (Purdue University)
  Morgantown, WV
  Jun 12, 2015
  Part of the slides are modified from Shivaram's slides

Outline
  Download VMware Player:
    https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
  Download VM image:
    http://web.ics.purdue.edu/~lin116/sparkr-ubuntu-1204-interface15.zip
  Examples & Datasets:
    http://web.ics.purdue.edu/~lin116/examples.tar.gz
    http://web.ics.purdue.edu/~lin116/data.tar.gz

Outline
  Overview and Environment Setup
  SparkR APIs
  New DataFrame APIs
  Basic Examples
  More Examples
  Q & A

Overview
  Fast, Statistical, DataFrame, Scalable, Packages, Flexible, Plots

Overview
  Data Science Interface: SparkR
  Data Processing Engine: Spark
  Cluster Management: Mesos / YARN
  Data Storage: HDFS / HBase / Cassandra ...

Environment Setup
  Install R & RStudio: http://cran.r-project.org/, http://www.rstudio.com/products/RStudio/
  Java (Scala) & Maven: Java 6+, Maven 3.0.4+
  Install Spark
    For the latest Spark 1.4, SparkR is in the release: https://github.com/apache/spark
      Build SparkR with Maven, following https://github.com/apache/spark/tree/master/R
    For Spark 1.3 or maybe before: download Spark from https://spark.apache.org/downloads.html
      Also install the SparkR package from https://github.com/amplab-extras/SparkR-pkg
  Other packages like HDFS if necessary

Environment Setup
  Docker: https://registry.hub.docker.com/u/beniyama/sparkr-docker
  Amazon EC2: https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-on-EC2
  Multiple node (Distributed mode) with HDFS
    Install Hadoop: http://goo.gl/OXt1mC
    Spark Standalone: https://spark.apache.org/docs/latest/spark-standalone.html
    YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

Quick Start
  Start RStudio
    Desktop version: start the RStudio application
    Server version: open a web browser at <your host>:8787
  Spark Context Initialization
    Set up SPARK_HOME (alternatively, store it in ~/.Renviron so that it does not have to be set every time):
      # Set this to where Spark is installed
      Sys.setenv(SPARK_HOME="/home/sparkr/workspace/spark")
      # This line loads SparkR from the installed directory
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    Load the SparkR package:
      library(SparkR)
    Init the Spark Context:
      sc <- sparkR.init(master="local[n]")

Quick Start
  Also, you can always use the terminal
    For Spark 1.4:
      cd $SPARK_HOME; ./bin/sparkR --master "local[n]"
    For Spark 1.3 with SparkR-pkg:
      cd SparkR-pkg; ./sparkR --master "local[n]"
  A Spark Context is created automatically and is available as sc

Spark Distributed Dataset (RDD) APIs

SparkR R-RDD APIs
  Considered as a distributed version of an R list
  Generation functions: textFile, parallelize, ...
  Transformation functions: lapply, filter, sampleRDD, ...
  Persistence functions: cache, persist, ...
  Action functions: reduce, collect, ...
  Paired Value Shuffle functions: groupByKey, reduceByKey, ...
  Binary functions: unionRDD, cogroup, ...
  Output functions: saveAsTextFile, saveAsObjectFile, ...

Data Frames
  More structured data in tables
    Data sources like CSV, JSON, JDBC, ...
  Want to use your favorite package "dplyr"? DataFrame type in SparkR
  Embedded SQL in R

DataFrame APIs
  Filter -- Select some rows
    filter(df, df$col1 > 0)
  Project -- Select some columns
    df$col1 or df["col"]

DataFrame APIs
  Aggregate -- Group and summarize data
    groupDF <- groupBy(df, df$col1)
    # Columns are referenced through df; the grouped object itself has no $ accessor
    agg(groupDF, sum(df$col2), max(df$col3))
  Sort -- Sort data by a particular column
    sortDF(df, asc(df$col1))
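To try the four DataFrame operations without a Spark installation, here is a rough local analogue on an ordinary R data.frame. It is only a sketch and does not use the SparkR API: the data frame df and its values are made up for illustration, and base R indexing, aggregate, merge and order stand in for SparkR's filter, groupBy/agg and sortDF.

```r
# Hypothetical local data; column names follow the slide's col1/col2/col3.
df <- data.frame(
  col1 = c(1, 2, 1, 0, 2),
  col2 = c(10, 20, 30, 40, 50),
  col3 = c(5, 3, 9, 1, 7)
)

# Filter: keep rows where col1 > 0 (SparkR: filter(df, df$col1 > 0))
filtered <- df[df$col1 > 0, ]

# Project: keep a single column (SparkR: df$col1)
projected <- df["col1"]

# Aggregate: per-group sum of col2 and max of col3
# (SparkR: groupBy followed by agg)
sums <- aggregate(col2 ~ col1, data = df, FUN = sum)
maxs <- aggregate(col3 ~ col1, data = df, FUN = max)
grouped <- merge(sums, maxs, by = "col1")

# Sort: ascending by col1 (SparkR: sortDF(df, asc(df$col1)))
sorted <- df[order(df$col1), ]

print(grouped)
```

The local version computes everything in one R process; the point of the SparkR DataFrame is that the same kind of expression is evaluated by Spark over distributed data.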

Column Average using RDD
  peopleRDD <- textFile(sc, "people.txt")
  lines <- flatMap(peopleRDD,
                   function(line) {
                     strsplit(line, ", ")
                   })
  ageInt <- lapply(lines,
                   function(line) {
                     as.numeric(line[2])
                   })
  sum <- reduce(ageInt, function(x, y) {x+y})
  avg <- sum / count(peopleRDD)

Column Average using DataFrame
  # JSON file contains two columns: age, name
  # jsonFile takes a SQLContext as its first argument (Spark 1.4 SparkR API)
  sqlContext <- sparkRSQL.init(sc)
  df <- jsonFile(sqlContext, "people.json")
  avg <- select(df, avg(df$age))

Pi Example
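Only the title of this slide survives. A common form of the example estimates pi by Monte Carlo sampling: draw random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle. The sketch below is not the code from the talk. It uses plain R's lapply and Reduce as stand-ins for the distributed map and reduce steps, and the names slices, n_per_slice and count_inside are illustrative assumptions.

```r
# Monte Carlo estimate of pi, written in a map/reduce style.
set.seed(42)            # fixed seed so the run is repeatable
slices <- 4             # hypothetical number of partitions
n_per_slice <- 25000    # hypothetical number of samples per partition

# "Map" step: each slice counts random points that land inside the unit circle.
count_inside <- function(slice_id) {
  x <- runif(n_per_slice, min = -1, max = 1)
  y <- runif(n_per_slice, min = -1, max = 1)
  sum(x * x + y * y < 1)
}
counts <- lapply(seq_len(slices), count_inside)

# "Reduce" step: add up the per-slice counts.
inside <- Reduce(`+`, counts)

# The circle covers pi/4 of the square, so scale the hit ratio by 4.
pi_estimate <- 4 * inside / (slices * n_per_slice)
print(pi_estimate)
```

With an RDD, the list of slices would be created with parallelize and the counting function would run on the workers; the structure of the computation stays the same.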

Logistic Regression

Predicting Customer Behavior
  Demo from Chris Freeman from Alteryx
  3 datasets:
    Transactions
    Demographic Info Per Customer
    DM Treatment Sample
  How do we decide who to send the offer to?

Predicting Customer Behavior
  Demo from Chris Freeman from Alteryx
  Use the DataFrame API to load, prepare and combine all 3 datasets and create training and estimation sets.
  Use R's glm method to train a logistic regression model on the treatment sample
  Profit!
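The glm step can be sketched as follows, assuming the prepared treatment sample is already a local R data.frame. The data here is synthetic and the column names spend, age and responded are hypothetical; the real demo uses its own datasets and features.

```r
# Synthetic stand-in for the treatment sample (all columns are made up).
set.seed(1)
n <- 500
treatment <- data.frame(
  spend = runif(n, 0, 100),
  age   = runif(n, 18, 80)
)
# Simulated response: the chance of responding rises with past spend.
p <- plogis(-3 + 0.06 * treatment$spend)
treatment$responded <- rbinom(n, size = 1, prob = p)

# Logistic regression: binomial family with the default logit link.
model <- glm(responded ~ spend + age, data = treatment, family = binomial())

# Score a hypothetical estimation set; higher scores mark better offer targets.
estimation <- data.frame(spend = c(10, 90), age = c(40, 40))
estimation$score <- predict(model, newdata = estimation, type = "response")
print(estimation)
```

Customers in the estimation set can then be ranked by score, and the offer sent to those above a chosen cutoff.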

Reference and Guide
  Starter & RDD: https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-Quick-Start
  Data Frame: http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html
  Chris's demo: https://github.com/cafreeman/Demo_SparkR

Questions?
  https://github.com/apache/spark
  https://github.com/amplab-extras/SparkR-pkg
