Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein
Presented by Maciej Biskupiak for R212
Motivation
Abstractions of parallel computation are necessary. Existing models such as MapReduce, Dryad, or Pregel are either too restrictive or too inefficient for many machine learning and data mining workloads.
GraphLab Abstraction
GraphLab is:
● Asynchronous: parameter values are not necessarily updated at the same time
● Dynamic: parameters are not all updated equally often
● Serialisable: every parallel execution has an equivalent serial execution (no data races)
It was originally developed for the shared-memory multicore setting.
GraphLab Abstraction
GraphLab consists of three main parts:
● The Data Graph
● The Update Function
● The Sync Function
Data Graph
● Computation can be expressed over an arbitrary graph
● Data is associated either with vertices or edges
● The data itself is mutable, but the structure of the graph is not
[Figure: a six-vertex data graph with data D attached to every vertex and edge]
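As a concrete illustration, here is a minimal sketch of such a data graph in C++. The type and field names are my own, not GraphLab's API: mutable user data hangs off vertices and edges, while the adjacency structure is fixed once the graph is built.

```cpp
#include <cstddef>
#include <vector>

struct VertexData { double value = 0.0; };   // mutable data on a vertex
struct EdgeData   { double weight = 1.0; };  // mutable data on an edge

struct Edge {
  std::size_t source, target;  // fixed endpoints: structure is immutable
  EdgeData data;               // only the data may change during execution
};

struct DataGraph {
  std::vector<VertexData> vertices;  // indexed by vertex id
  std::vector<Edge> edges;           // adjacency is fixed after construction
};
```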
Update Function
Takes a vertex v and its surrounding context S_v.
Returns the new values of its context S_v and a list T of vertices that will eventually be updated.
[Figure: the vertex to be updated and its surrounding context in the data graph]
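A hedged sketch of what an update function looks like, using a PageRank-style computation as the example. The Scope type and its fields below are simplified stand-ins for the context S_v, not GraphLab's real interface.

```cpp
#include <cmath>
#include <vector>

// Simplified stand-in for the context S_v handed to an update function.
struct Scope {
  double& vertex_value;                    // mutable data of the vertex v
  std::vector<double> in_neighbor_values;  // data read from the context
  std::vector<int> out_neighbors;          // vertices we may reschedule
};

// Returns T: the vertices that should eventually be updated next.
std::vector<int> pagerank_update(Scope& scope) {
  double sum = 0.0;
  for (double x : scope.in_neighbor_values) sum += x;
  double new_value = 0.15 + 0.85 * sum;

  std::vector<int> to_schedule;
  // Dynamic computation: only reschedule neighbours on a large change,
  // so converged parts of the graph stop being updated.
  if (std::abs(new_value - scope.vertex_value) > 1e-5)
    to_schedule = scope.out_neighbors;

  scope.vertex_value = new_value;
  return to_schedule;
}
```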
Sync Function
The sync function provides a way to track global state.
Each vertex v can publish a value, and the sync function performs an associative sum (a fold) over all of these values to maintain a global aggregate.
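In code, the sync operation amounts to an associative reduction over per-vertex contributions. The sketch below is one illustration of that idea, not GraphLab's actual sync API (which also supports periodic execution and a finalization step).

```cpp
#include <numeric>
#include <vector>

// Each vertex publishes one contribution; sync reduces them with an
// associative operator (here, addition) into a single global value.
double sync_sum(const std::vector<double>& published) {
  return std::accumulate(published.begin(), published.end(), 0.0);
}
```

Because the operator is associative, each machine can reduce its own vertices locally and the partial results can then be combined.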
Distributed GraphLab
In order to bring GraphLab to the distributed setting, we need solutions for the following:
● Distributing the graph data and balancing the computation
● Maintaining consistency across nodes
● Achieving fault tolerance
Distributing Graph Data
We partition the graph into a set of k atoms, where k is much greater than the number of servers.
Each atom is stored as a separate file and contains information about 'ghosts': the vertices and edges adjacent to the atom's boundary.
[Figure: an atom and its ghost vertices at the partition boundary]
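A rough sketch of what one atom holds; the field names are illustrative (the paper actually stores atoms as binary journals of graph-construction commands, not reproduced here). Because k is much larger than the number of machines, whole atoms can be reassigned between machines to rebalance load.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Atom {
  std::vector<std::size_t> owned_vertices;  // vertices assigned to this atom
  std::vector<std::size_t> ghost_vertices;  // boundary vertices owned by
                                            // neighbouring atoms (cached copies)
  std::vector<std::pair<std::size_t, std::size_t>> edges;  // internal and
                                            // boundary edges touching the atom
};
```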
Maintaining Consistency
Data races are possible if the contexts of update functions overlap.
GraphLab provides two means of dealing with this:
● A chromatic engine based on graph coloring (sketched below)
● A distributed reader/writer lock system
[Figure: two overlapping update contexts on the data graph]
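A minimal sketch of how the chromatic engine schedules updates, assuming a valid vertex coloring is already available (no two adjacent vertices share a color). The function and parameter names are mine, not GraphLab's.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// by_color[c] lists the vertices with color c. Vertices of the same
// color never share an edge, so each color class can be updated in
// parallel, and a barrier between classes yields a serialisable
// execution under edge consistency.
void chromatic_sweep(const std::vector<std::vector<std::size_t>>& by_color,
                     const std::function<void(std::size_t)>& update) {
  for (const auto& color_class : by_color) {
    for (std::size_t v : color_class) update(v);  // parallelise within a class
    // barrier: all updates of this color finish before the next begins
  }
}
```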
Levels of Consistency
Distributed locking supports various levels of consistency (sketched below):
● Vertex consistency: obtains a write lock on the central vertex only
● Edge consistency: obtains a write lock on the central vertex and its adjacent edges, and read locks on its adjacent vertices
● Full (total) consistency: obtains write locks on the central vertex and all of its adjacent edges and vertices
The weaker levels give greater performance, as some problems (e.g. PageRank) do not require full consistency.
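As an illustration of the reader/writer scheme, here is a hedged sketch of acquiring edge consistency with one shared_mutex per vertex, taking locks in a canonical (sorted) order so that concurrent acquisitions cannot deadlock. This simplifies away the pipelined, non-blocking lock protocol the paper actually uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <shared_mutex>
#include <vector>

// Acquire edge consistency for `center`: a write lock on the vertex
// (covering its adjacent edges) and read locks on its neighbours,
// all taken in sorted id order to avoid deadlock.
void lock_edge_consistency(std::vector<std::shared_mutex>& locks,
                           std::size_t center,
                           std::vector<std::size_t> neighbors) {
  neighbors.push_back(center);
  std::sort(neighbors.begin(), neighbors.end());
  for (std::size_t v : neighbors) {
    if (v == center) locks[v].lock();         // exclusive: vertex + its edges
    else             locks[v].lock_shared();  // shared: neighbouring vertices
  }
}
```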
Fault Tolerance
In the event of a failure, the system recovers from a snapshot taken at a previous point in time.
The snapshot mechanism has to be asynchronous in order to avoid suspending execution.
GraphLab implements the Chandy-Lamport algorithm to achieve this (sketched below).
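The paper expresses the snapshot itself as an ordinary update function. The sketch below captures that idea under simplified assumptions; all names are illustrative, and the logging of edge data still "in flight", which full Chandy-Lamport also requires, is omitted.

```cpp
#include <vector>

struct SnapshotVertex {
  bool snapshotted = false;
  double state = 0.0;        // the vertex's live data
  double saved_state = 0.0;  // where the snapshot is recorded
  std::vector<int> neighbors;
};

// Run as an update function: save local state exactly once, then
// propagate the snapshot "marker" by scheduling all neighbours.
std::vector<int> snapshot_update(SnapshotVertex& v) {
  if (v.snapshotted) return {};  // marker already seen; nothing to do
  v.saved_state = v.state;
  v.snapshotted = true;
  return v.neighbors;            // neighbours will snapshot next
}
```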
Performance
● Achieves a 20-60x improvement over Hadoop
● Competitive with tailored MPI implementations
● With dynamic computation, error can converge almost twice as fast as with non-dynamic computation
Performance
[Figure: asynchronous vs. synchronous performance of PageRank]
[Figure: scalability on Named Entity Recognition (first) and Netflix collaborative filtering (second)]
Conclusion
A powerful abstraction of parallel computation, brought to the distributed setting.
It provides more flexibility than other models, constrained only by the inability to modify the graph structure.