  1. ScootR: Scaling R Dataframes on Dataflow Systems. Andreas Kunft 1, Lukas Stadler 2, Daniele Bonetta 2, Cosmin Basca 2, Jens Meiners 1, Sebastian Breß 1, Tilmann Rabl 1, Juan Fumero 3, Volker Markl 1. Technische Universität Berlin 1, Oracle Labs 2, University of Manchester 3

  2. R has gained increasing traction • Dynamically typed, open-source language • Rich support for analytics & statistics

  3. R has gained increasing traction • Dynamically typed, open-source language • Rich support for analytics & statistics But • Standalone R is not well suited for out-of-core workloads

  4. Analytics pipelines often work on large amounts of raw data • Dataflow engines (DF), e.g., Apache Flink and Spark, scale out • Provide rich support for user-defined functions (UDFs)

  5. Analytics pipelines often work on large amounts of raw data • Dataflow engines (DF), e.g., Apache Flink and Spark, scale out • Provide rich support for user-defined functions (UDFs) But • R users are often unfamiliar with DF APIs and concepts

  6. Combine the usability of R with the scalability of dataflow engines - Goals - From function calls to an operator graph - Approaches to execute R UDFs - Our Approach: ScootR - Evaluation

  7. GOALS 1. Provide a data.frame API with a natural feel • df <- select(df, count = flights, distance) • df$km <- df$miles * 1.6 • df <- apply(df, func)

  8. GOALS 1. Provide a data.frame API with a natural feel • df <- select(df, count = flights, distance) • df$km <- df$miles * 1.6 • df <- apply(df, func) 2. Achieve performance comparable to the native dataflow API

  9. From function calls to an operator graph

  10. MAPPING DATA TYPES • An R data.frame(T1, T2, …, TN) with N columns and a fixed element type per column maps to a Flink DataSet<TupleN<T1, T2, …, TN>>, a tuple with arity N and one typed field per column • E.g., data.frame(integer, character) maps to DataSet<Tuple2<Integer, String>>
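A minimal sketch of this mapping from the R side, reusing the flink.readdf call from the ScootR overview later in the deck; the source path and the two-column schema are illustrative assumptions, not taken from the slides:

     # Two columns with fixed element types: an integer id and a character name.
     df <- flink.readdf(SOURCE,
                        list("id", "name"),           # column names (illustrative)
                        list(integer, character))     # per-column element types
     # On the Flink side this dataframe corresponds to a
     # DataSet<Tuple2<Integer, String>>: arity 2, one typed field per column.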

  11. MAPPING R FUNCTIONS TO OPERATORS • Functions on data.frames lazily build an operator graph

  12. MAPPING R FUNCTIONS TO OPERATORS • Functions on data.frames lazily build an operator graph 1. Functions w/o UDFs are handled before execution, e.g., a select function is mapped to a project operator: select(df$id, df$arrival) becomes ds.project(1, 3)

  13. MAPPING R FUNCTIONS TO OPERATORS • Functions on data.frames lazily build an operator graph 1. Functions w/o UDFs are handled before execution 2. Functions w/ UDFs call R functions during execution

  14. Approaches to execute R UDFs

  15. INTER PROCESS COMMUNICATION (IPC) [Diagram: the client submits the job to the driver; on each worker, every task exchanges data with a separate external R process]

  16. INTER PROCESS COMMUNICATION (IPC) [Diagram: on a worker, the UDF filter <- function(df) { df$language == "english" } runs in an external R process next to the JVM task] 1. Communication + serialization (R <> Java) 2. JVM and R compete for memory

  17. SOURCE-TO-SOURCE TRANSLATION (STS) • Translate a restricted set of functions to the native dataflow API • Constant translation overhead, but native execution performance

  18. SOURCE-TO-SOURCE TRANSLATION (STS) • E.g., STS translation in SparkR to Spark's Scala DataFrame API:
     df <- filter(df, df$language == "english")   becomes   val df = df.filter($"language" === "english")
     df$km <- df$miles * 1.6                      becomes   val df = df.withColumn("km", $"miles" * 1.6)

  19. Inter-process communication vs. source-to-source translation
     IPC: + Execute arbitrary R code; - Data serialization; - Data exchange; - Java and R process compete for memory
     STS: + Native performance; - Restricted to a language subset, or requires building a full-fledged compiler
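To make the STS restriction concrete, a hedged example: a UDF that calls arbitrary R functions, such as the strsplit-based word count used later in the ScootR overview, has no direct counterpart in the native dataflow API and thus cannot be handled by a simple source-to-source translator, while it runs fine under IPC (or in ScootR):

     # Arbitrary R code inside the UDF: outside the subset that STS covers.
     words <- function(df) {
       len <- length(strsplit(df$body, " ")[[1]])
       list(df$id, df$body, len)
     }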

  20. A common runtime for R and Java

  21. BACKGROUND: TRUFFLE/GRAAL [Diagram: Java bytecode executed by the HotSpot VM with its JIT compiler]

  22. BACKGROUND: TRUFFLE/GRAAL [Diagram: the Graal compiler replaces HotSpot's JIT compiler]

  23. BACKGROUND: TRUFFLE/GRAAL [Diagram: the Truffle framework runs on top of Graal and HotSpot, forming GraalVM]

  24. BACKGROUND: TRUFFLE/GRAAL [Diagram: *.java source code is compiled to bytecode by javac, while *.R and *.js source code is executed by Truffle AST interpreters (TruffleR/fastR, TruffleJS); both paths run on GraalVM, which combines the HotSpot runtime (interpreter, GC, …) with the Graal compiler] Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices, Vol. 51, No. 2. ACM, 2015.

  25. SCOOTR: FASTR + FLINK

  26. SCOOTR OVERVIEW
     flink.init(SERVER, PORT)
     flink.parallelism(DOP)
     df <- flink.readdf(SOURCE, list("id", "body", …), list(character, character, …))
     words <- function(df) {
       len <- length(strsplit(df$body, " ")[[1]])
       list(df$id, df$body, len)
     }
     df <- flink.apply(df, words)
     flink.writeAsText(df, SINK)
     flink.execute()

  28. Efficient data access in R UDFs

  29. function(df) {
       len <- length(strsplit(df$body, " ")[[1]])
       list(df$id, df$body, len)
     }

  30. Before:
     function(df) {
       len <- length(strsplit(df$body, " ")[[1]])
       list(df$id, df$body, len)
     }
     After:
     function(tuple) {
       len <- length(strsplit(tuple[[2]], " ")[[1]])
       list(tuple[[1]], tuple[[2]], len)
     }
     1. The dataframe proxy keeps track of columns and provides efficient access

  31. Before:
     function(df) {
       len <- length(strsplit(df$body, " ")[[1]])
       list(df$id, df$body, len)
     }
     After:
     function(tuple) {
       len <- length(strsplit(tuple[[2]], " ")[[1]])
       flink.tuple(tuple[[1]], tuple[[2]], len)
     }
     1. The dataframe proxy keeps track of columns and provides efficient access
     2. Rewrite to directly instantiate a Flink tuple instead of an R list

  32. IMPACT OF DIRECT TYPE ACCESS • From list(...) to flink.tuple(...) • Avoids an additional pass over the R list to create the Flink tuple • Up to 1.75x performance improvement [Plots: output with arity 2 and output with arity 19; purple is function execution, pink (hatched) is the conversion from list to tuple]

  33. Evaluation

  34. APPLY FUNCTION MICROBENCHMARK • Airline On-Time Performance Dataset (2005–2016), CSV, 19 columns, 9.5 GB • UDF: df$km <- df$miles * 1.6

  35. APPLY FUNCTION MICROBENCHMARK • Airline On-Time Performance Dataset (2005–2016), CSV, 19 columns, 9.5 GB • UDF: df$km <- df$miles * 1.6 • ScootR and SparkR (STS) achieve near-native performance

  36. APPLY FUNCTION MICROBENCHMARK • Airline On-Time Performance Dataset (2005–2016), CSV, 19 columns, 9.5 GB • UDF: df$km <- df$miles * 1.6 • ScootR and SparkR (STS) achieve near-native performance • Both heavily outperform GNU R and fastR
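One plausible way to express this microbenchmark with the ScootR API from the overview slide (a sketch under stated assumptions: the placeholders SERVER, PORT, DOP, SOURCE, SINK and the single-column schema are illustrative, and the GOALS slide suggests the column derivation can be written directly on the dataframe):

     flink.init(SERVER, PORT)
     flink.parallelism(DOP)
     df <- flink.readdf(SOURCE, list("miles"), list(numeric))
     # The microbenchmark UDF: derive kilometers from miles, row by row.
     df$km <- df$miles * 1.6
     flink.writeAsText(df, SINK)
     flink.execute()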

  37. APPLY FUNCTION MICROBENCHMARK: SCALABILITY

  38. MIXED PIPELINE W/ PREPROCESSING AND ML Pipeline: - (Distributed) preprocessing of the dataset - Data is collected locally and a generalized linear model is trained • The majority of the time is spent in preprocessing • ScootR is up to 11x faster than GNU R and fastR
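A minimal sketch of such a mixed pipeline, assuming the ScootR flink.* API from the overview slide plus a hypothetical flink.collect that materializes the distributed dataframe as a local R data.frame; the column names, derived feature, and model formula are illustrative, not the evaluated pipeline:

     flink.init(SERVER, PORT)
     df <- flink.readdf(SOURCE, list("miles", "dep_delay", "arr_delay"),
                        list(numeric, numeric, numeric))
     # Distributed preprocessing on the workers: derive a feature column.
     df$km <- df$miles * 1.6
     # Collect the preprocessed data into the local R process (hypothetical call).
     local_df <- flink.collect(df)
     # Train a generalized linear model locally with base R.
     model <- glm(arr_delay ~ km + dep_delay, data = local_df)
     summary(model)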

  39. RECAP • ScootR provides a data.frame API in R for Apache Flink • R and Flink run within the same runtime • Avoids serialization and data exchange • Avoids type conversion → Achieves near-native performance for a rich set of operators
