Stateful Distributed Dataflow Graphs: Imperative Big Data - PowerPoint PPT Presentation

Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses
Peter R. Pietzuch, prp@doc.ic.ac.uk
Large-Scale Distributed Systems Group, Department of Computing, Imperial College London
http://lsds.doc.ic.ac.uk
EIT Digital Summer School on Cloud and Big Data 2015, Stockholm, Sweden

Growth of Big Data Analytics
• Big Data Analytics: gaining value from data
  – Web analytics, fraud detection, system management, network monitoring, business dashboards, …
Need to enable more users to perform data analytics

Programming Language Popularity
(figure: chart of programming language popularity)

Programming Models For Big Data?
• Distributed dataflow frameworks tend to favour functional, declarative programming models
  – MapReduce, SQL, PIG, DryadLINQ, Spark, …
  – This makes consistency and fault tolerance easier to handle
• Domain experts tend to write imperative programs
  – Java, Matlab, C++, R, Python, Fortran, …

Example: Recommender Systems
• Recommendations based on past user behaviour through collaborative filtering (cf. Netflix, Amazon, …)
(figure: customer activity on website, e.g. User A, Item: “iPhone”, with ratings 3 and 5 shown, feeds up-to-date recommendations, e.g. User A, Recommend: “Apple Watch”)
• Distributed dataflow graph (e.g. MapReduce, Hadoop, Spark, Dryad, Naiad, …)
• Exploits data-parallelism on a cluster of machines

Collaborative Filtering in Java

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    // Update with new ratings
    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    // Multiply for recommendation
    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

User-Item matrix (UI):
              Item-A  Item-B
    User-A      4       5
    User-B      0       5

Co-Occurrence matrix (CO):
              Item-A  Item-B
    Item-A      1       1
    Item-B      1       2

(figure: the CO matrix is multiplied by the User-B row of UI to obtain the recommendation)
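The slide's code relies on Matrix, Vector and updateCoOccurrence, which are not shown. As a minimal single-node sketch, assuming dense int arrays in place of Matrix and Vector and assuming that a co-occurrence is counted once per pair of items rated by the same user (the class name and the counting rule are illustrative assumptions, not the presenter's implementation):

```java
import java.util.Arrays;

/** Hypothetical stand-in for the slide's program; arrays replace Matrix and Vector. */
final class CollaborativeFilteringSketch {
    private final int[][] userItem; // user-item ratings, 0 means "not rated"
    private final int[][] coOcc;    // item-item co-occurrence counts

    CollaborativeFilteringSketch(int users, int items) {
        userItem = new int[users][items];
        coOcc = new int[items][items];
    }

    /** Stores the rating and, the first time a user rates an item, updates the counts. */
    void addRating(int user, int item, int rating) {
        boolean firstRating = userItem[user][item] == 0 && rating != 0;
        userItem[user][item] = rating;
        if (firstRating) {
            updateCoOccurrence(user, item);
        }
    }

    /** Assumed rule: count the new item against every item this user has rated. */
    private void updateCoOccurrence(int user, int item) {
        for (int other = 0; other < coOcc.length; other++) {
            if (userItem[user][other] != 0) {
                coOcc[item][other]++;
                if (other != item) {
                    coOcc[other][item]++; // keep the matrix symmetric
                }
            }
        }
    }

    /** Recommendation vector: co-occurrence matrix times the user's rating row. */
    int[] getRec(int user) {
        int[] rec = new int[coOcc.length];
        for (int i = 0; i < coOcc.length; i++) {
            for (int j = 0; j < coOcc.length; j++) {
                rec[i] += coOcc[i][j] * userItem[user][j];
            }
        }
        return rec;
    }

    int coOccurrence(int a, int b) {
        return coOcc[a][b];
    }

    public static void main(String[] args) {
        CollaborativeFilteringSketch cf = new CollaborativeFilteringSketch(2, 2);
        cf.addRating(0, 0, 4); // User-A rates Item-A
        cf.addRating(0, 1, 5); // User-A rates Item-B
        cf.addRating(1, 1, 5); // User-B rates Item-B
        System.out.println(Arrays.toString(cf.getRec(1))); // prints [5, 10]
    }
}
```

With the slide's ratings, this counting rule reproduces the CO matrix shown above (1, 1 / 1, 2). Both matrices live in one process here, which is exactly what stops working once the state outgrows a single node.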

Collaborative Filtering in Spark (Java)

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.recommendation.ALS;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;
    import scala.Tuple2;

    // Fragment: `ratings` is a JavaRDD<Rating> built earlier in the program.

    // Build the recommendation model using ALS
    int rank = 10;
    int numIterations = 20;
    MatrixFactorizationModel model =
        ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

    // Evaluate the model on rating data
    JavaRDD<Tuple2<Object, Object>> userProducts = ratings.map(
        new Function<Rating, Tuple2<Object, Object>>() {
            public Tuple2<Object, Object> call(Rating r) {
                return new Tuple2<Object, Object>(r.user(), r.product());
            }
        }
    );
    JavaPairRDD<Tuple2<Integer, Integer>, Double> predictions = JavaPairRDD.fromJavaRDD(
        model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(
            new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
                public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
                    return new Tuple2<Tuple2<Integer, Integer>, Double>(
                        new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
                }
            }
        ));
    JavaRDD<Tuple2<Double, Double>> ratesAndPreds = JavaPairRDD.fromJavaRDD(
        ratings.map(
            new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
                public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r) {
                    return new Tuple2<Tuple2<Integer, Integer>, Double>(
                        new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
                }
            }
        )).join(predictions).values();

Collaborative Filtering in Spark (Scala)

    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 20
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Evaluate the model on rating data
    val usersProducts = ratings.map {
      case Rating(user, product, rate) => (user, product)
    }
    val predictions = model.predict(usersProducts).map {
      case Rating(user, product, rate) => ((user, product), rate)
    }
    val ratesAndPreds = ratings.map {
      case Rating(user, product, rate) => ((user, product), rate)
    }.join(predictions)

• All data immutable
• No fine-grained model updates

Stateless MapReduce Model
• Data model: (key, value) pairs
• Two processing functions:
    map(k1, v1) → list(k2, v2)
    reduce(k2, list(v2)) → list(v3)
• Benefits:
  – Simple programming model
  – Transparent parallelisation
  – Fault-tolerant processing
(figure: map tasks (M) read partitioned data on a distributed file system; a shuffle feeds the reduce tasks (R))
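The two function signatures can be made concrete with a small in-memory sketch. It is only an illustration of the model under stated assumptions: it runs in a single thread, the class and method names (MapReduceSketch, run, wordCount) are hypothetical, and word count is a stock example rather than anything taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

/** Single-process illustration of map, shuffle and reduce over (key, value) pairs. */
final class MapReduceSketch {

    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase plus shuffle: group every emitted value under its intermediate key.
        Map<K2, List<V2>> shuffled = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> out : map.apply(record.getKey(), record.getValue())) {
                shuffled.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                        .add(out.getValue());
            }
        }
        // Reduce phase: each key is handled independently; nothing is kept between calls.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Word count: map emits (word, 1), reduce sums the ones. */
    static Map<String, List<Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(Map.entry(i, lines.get(i)));
        }
        return MapReduceSketch.<Integer, String, String, Integer, Integer>run(
                input,
                (lineNo, line) -> {
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            out.add(Map.entry(word, 1));
                        }
                    }
                    return out;
                },
                (word, ones) -> List.of(ones.stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=[2], b=[2], c=[1]}
    }
}
```

Neither function can read or update anything outside its arguments, which is what makes parallelisation and re-execution after a failure straightforward, and also why a model such as the co-occurrence matrix cannot be updated in place.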

Big Data Programming for the Masses
• Our goals:
  – Imperative Java programming model for big data apps
  – High throughput through data-parallel execution on cluster
  – Fault tolerance against node failures

    System      Mutable State   Large State   Low Latency   Iteration
    MapReduce   No              n/a           No            No
    Spark       No              n/a           No            Yes
    Storm       No              n/a           Yes           No
    Naiad       Yes             No            Yes           Yes
    SDG         Yes             Yes           Yes           Yes

Stateful Dataflow Graphs (SDGs)
(outline figure)
• Annotated Java program: Program.java (@Partitioned, @Partial, @Global, …)
• Static program analysis produces a Stateful Dataflow Graph (SDG)
• SEEP distributed dataflow framework: data-parallel cluster, dynamic scale out & checkpoint-based fault tolerance
• Experimental evaluation results

State as First Class Citizen
(figure: tasks process data; dataflows represent data; State Elements (SEs) represent state, illustrated by a small user-item matrix for User A and User B over Item 1 and Item 2)
• Tasks have access to arbitrary state
• State elements (SEs) represent in-memory data structures
  – SEs are mutable
  – Tasks have local access to SEs
  – SEs can be shared between tasks

Challenges with Large State
• Mutable state leads to concise algorithms but complicates scaling and fault tolerance

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

  Big Data problem: matrices become large
• State will not fit into a single node
• Challenge: how to handle distributed state?

Distributed Mutable State
• State Elements support two abstractions for distributed mutable state:
  – Partitioned SEs: tasks access partitioned state by key
  – Partial SEs: tasks can access replicated state

(I) Partitioned State Elements
• Partitioned SE split into disjoint partitions
  – Key space [0-N] is split into [0-k] and [(k+1)-N]
  – State partitioned according to partitioning key
  – Dataflow routed according to hash function: access by key, hash(msg.id)

User-Item matrix (UI):
              Item-A  Item-B
    User-A      4       5
    User-B      0       5
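Routing by key can be sketched as follows. This is a hedged illustration, not SEEP code: the partitions are plain in-memory maps inside one process, the hash is simply the user id modulo the number of partitions, and the names (PartitionedStateSketch, partitionFor) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** User-item state split into disjoint partitions, each user living in exactly one. */
final class PartitionedStateSketch {
    // partitions.get(p) maps user -> (item -> rating) for the users routed to p
    private final List<Map<Integer, Map<Integer, Integer>>> partitions = new ArrayList<>();

    PartitionedStateSketch(int numPartitions) {
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Assumed routing function: hash of the key modulo the partition count. */
    int partitionFor(int user) {
        return Math.floorMod(Integer.hashCode(user), partitions.size());
    }

    /** The message is routed to the single partition owning this user's row. */
    void addRating(int user, int item, int rating) {
        partitions.get(partitionFor(user))
                  .computeIfAbsent(user, u -> new HashMap<>())
                  .put(item, rating);
    }

    /** Reads also go to exactly one partition, so no cross-partition coordination. */
    Map<Integer, Integer> getRow(int user) {
        return partitions.get(partitionFor(user)).getOrDefault(user, Map.of());
    }

    int usersIn(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedStateSketch ui = new PartitionedStateSketch(4);
        ui.addRating(5, 0, 4);
        ui.addRating(8, 1, 5);
        System.out.println("user 5 -> partition " + ui.partitionFor(5));
        System.out.println("user 8 -> partition " + ui.partitionFor(8));
        System.out.println("row of user 5: " + ui.getRow(5));
    }
}
```

Because the state is partitioned by the same key that routes the dataflow, every access is local to one partition. State that has no such key, like the co-occurrence matrix, needs the second abstraction.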

(II) Partial State Elements
• Partial SEs are replicated (when partitioning is impossible)
  – Tasks have local access
• Access to partial SEs is either local or global
  – Local access: data sent to one instance
  – Global access: data sent to all instances

State Synchronisation with Partial SEs
• Reading all partial SE instances results in a set of partial values (multiple partial values)
• Requires application-specific merge logic
  – Merge task reconciles state and updates partial SEs
• Barrier collects partial state (the merge logic collects the partial values)
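A minimal sketch of local access, global access and merge, assuming the partial SE is a co-occurrence matrix with one instance per replica and assuming the merge is an element-wise sum (valid here because matrix-vector multiplication is linear). The class name, the round-robin choice of instance and the single-threaded loop standing in for the barrier are all illustrative assumptions; the sketch merges read results only and does not write reconciled state back to the instances.

```java
import java.util.Arrays;

/** Partial state: several independent instances of one logical co-occurrence matrix. */
final class PartialStateSketch {
    private final int[][][] replicas; // replicas[r] is one partial instance
    private int next = 0;             // round-robin pointer for local updates

    PartialStateSketch(int numReplicas, int items) {
        replicas = new int[numReplicas][items][items];
    }

    /** Local access: the update is applied to one partial instance only. */
    void addCoOccurrence(int itemA, int itemB) {
        int[][] local = replicas[next];
        next = (next + 1) % replicas.length;
        local[itemA][itemB]++;
        if (itemA != itemB) {
            local[itemB][itemA]++;
        }
    }

    /** Global access: every instance is read, giving one partial vector per instance. */
    int[][] multiplyAll(int[] userRow) {
        int[][] partials = new int[replicas.length][];
        for (int r = 0; r < replicas.length; r++) {
            partials[r] = multiply(replicas[r], userRow);
        }
        return partials; // the loop finishing plays the role of the barrier
    }

    private static int[] multiply(int[][] matrix, int[] vector) {
        int[] out = new int[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < vector.length; j++) {
                out[i] += matrix[i][j] * vector[j];
            }
        }
        return out;
    }

    /** Application-specific merge logic; here an element-wise sum of partial vectors. */
    static int[] merge(int[][] partials) {
        int[] merged = new int[partials[0].length];
        for (int[] partial : partials) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        PartialStateSketch coOcc = new PartialStateSketch(2, 2);
        coOcc.addCoOccurrence(0, 0); // goes to instance 0
        coOcc.addCoOccurrence(0, 1); // goes to instance 1
        coOcc.addCoOccurrence(1, 1); // goes to instance 0
        coOcc.addCoOccurrence(1, 1); // goes to instance 1
        int[][] partials = coOcc.multiplyAll(new int[]{0, 5});
        System.out.println(Arrays.deepToString(partials));       // prints [[0, 5], [5, 5]]
        System.out.println(Arrays.toString(merge(partials)));    // prints [5, 10]
    }
}
```

Summing is only one possible merge; other state, such as model weights, would need a different reconciliation, which is why the merge logic is left to the application.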

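SDG for Collaborative Filtering
(figure: SDG deployed on nodes n1, n2 and n3, with a legend for State Element (SE), Task Element (TE) and dataflow)
• Task elements: updateUserItem, updateCoOcc, getUserVec, getRecVec, merge
• State elements: userItem, coOcc
• Dataflows: new rating → updateUserItem → updateCoOcc; rec request → getUserVec → getRecVec → merge → rec result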

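SDG for Logistic Regression
(figure: task elements train, merge and classify around a weights state element; inputs are items for training and an item to classify; output is the result)
• Requires support for iteration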

Stateful Dataflow Graphs (SDGs)
(outline figure, repeated)
• Annotated Java program: Program.java (@Partitioned, @Partial, @Global, …)
• Static program analysis produces a Stateful Dataflow Graph (SDG)
• SEEP distributed dataflow framework: data-parallel cluster, dynamic scale out & checkpoint-based fault tolerance

Partitioned State Annotation
• @Partitioned field annotation indicates partitioned state

    @Partitioned Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

(callout on the user argument of the userItem accesses: hash(msg.id))

Partial State and Global Annotations

    @Partitioned Matrix userItem = new Matrix();
    @Partial Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(@Global coOcc, userItem);
    }

• @Partial field annotation indicates partial state
• @Global annotates a variable to indicate access to all partial instances

Partial and Collection Annotation

    @Partitioned Matrix userItem = new Matrix();
    @Partial Matrix coOcc = new Matrix();

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        @Partial Vector puRec = @Global coOcc.multiply(userRow);
        Vector userRec = merge(puRec);
        return userRec;
    }

    Vector merge(@Collection Vector[] v) { /*…*/ }

• @Collection annotation indicates merge logic
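To show how field annotations like these can be declared and discovered, here is a small sketch. It is an assumption-heavy illustration: the annotation names come from the slides, but their retention and targets, the AnnotatedProgram and StateClassifier classes, and the use of runtime reflection are all hypothetical (the talk describes static code analysis instead). The sketch covers field annotations only and does not attempt the expression-level @Global use shown on the slides.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Assumed declarations; the real definitions are not shown in the slides.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Partitioned {}

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.LOCAL_VARIABLE})
@interface Partial {}

/** Hypothetical annotated program; arrays stand in for the slide's Matrix type. */
final class AnnotatedProgram {
    @Partitioned int[][] userItem = new int[2][2];
    @Partial int[][] coOcc = new int[2][2];
    int[] scratch = new int[2];
}

/** Classifies each field of a program by the kind of state its annotation declares. */
final class StateClassifier {
    static Map<String, String> classify(Class<?> program) {
        Map<String, String> kinds = new TreeMap<>();
        for (Field field : program.getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue; // skip compiler- or tool-generated fields
            }
            if (field.isAnnotationPresent(Partitioned.class)) {
                kinds.put(field.getName(), "partitioned");
            } else if (field.isAnnotationPresent(Partial.class)) {
                kinds.put(field.getName(), "partial");
            } else {
                kinds.put(field.getName(), "unannotated");
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        // prints {coOcc=partial, scratch=unannotated, userItem=partitioned}
        System.out.println(classify(AnnotatedProgram.class));
    }
}
```

A classification like this is the kind of information a translator needs before it can decide which state to split by key and which to replicate.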

Java2SDG: Translation Process
(figure: pipeline from annotated Program.java to a SEEP runnable)
• Input: annotated Program.java
• SOOT Framework: extract state and state access patterns through static code analysis
  – Live variable analysis
  – Extract TEs, SEs and accesses
• Javassist: assembly of TE and SE access code; generation of runnable code using TE and SE connections
• Output: SEEP runnable
