Unleash Data Science
Danny Bickson, Co-Founder
GraphLab Project History: GraphLab (2009) → GraphChi (2011) → GraphLab Create (2014)
GraphLab Open Source (2009)
Graphs are Everywhere
Graphs are Essential to Data Mining and Machine Learning
• Identify influential information
• Reason about latent properties
• Model complex data dependencies
Examples of Graphs in Machine Learning
Predicting User Behavior
[Figure: a social graph of users (labeled Liberal or Conservative, most unknown) and their posts; label propagation spreads the known labels across the graph to predict the rest.]
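A minimal label-propagation sketch of the idea above, assuming a tiny hypothetical friendship graph with two seed users (+1 = liberal, -1 = conservative); unlabeled users repeatedly take the average of their neighbors' scores.

```python
def label_propagation(edges, seeds, iterations=20):
    """Propagate {-1, +1} labels from seed users to the rest of the
    graph by iterated neighborhood averaging (Jacobi-style)."""
    nodes = {v for e in edges for v in e}
    neighbors = {v: [] for v in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    scores = {v: seeds.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        new_scores = {}
        for v in nodes:
            if v in seeds:
                new_scores[v] = seeds[v]          # labeled users keep their label
            else:                                  # others take the neighbor average
                new_scores[v] = sum(scores[u] for u in neighbors[v]) / len(neighbors[v])
        scores = new_scores
    return scores

# Hypothetical chain: ann - bob - carol - dan, with ann and dan labeled
edges = [("ann", "bob"), ("bob", "carol"), ("carol", "dan")]
seeds = {"ann": +1.0, "dan": -1.0}
scores = label_propagation(edges, seeds)
```

Unlabeled users end up pulled toward the nearer seed: bob leans positive, carol negative.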
Finding Communities
Count the triangles passing through each vertex: this measures the "cohesiveness" of the local community.
More triangles → stronger community; fewer triangles → weaker community.
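Per-vertex triangle counting can be sketched directly from the definition: for each vertex, count the connected pairs among its neighbors. The toy graph below is illustrative.

```python
from itertools import combinations

def per_vertex_triangles(edges):
    """Count triangles through each vertex: for every pair of a
    vertex's neighbors, check whether that pair is itself an edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {
        v: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for v, nbrs in adj.items()
    }

# Two triangles sharing vertices 1 and 3: (1,2,3) and (1,3,4)
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
counts = per_vertex_triangles(edges)
```

Each triangle is counted once at each of its three vertices, so the global triangle count is the sum of the per-vertex counts divided by 3.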
Recommending Products
[Figure: a bipartite graph of users and items, with ratings on the edges.]
Many More Applications
• Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
• Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
• Semi-supervised ML: Graph SSL, CoEM
• Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
• Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
• Classification: Neural Networks
The Graph-Parallel Pattern: the model / algorithm state lives on a graph, and each vertex's computation depends only on its neighbors.
The GraphLab Framework. Data Model: Property Graph. Computation: Vertex Programs.
Machine Learning Pipeline: Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value
• face images → face similarity graph → belief propagation → face labels
• docs → graph of docs linked by shared words → LDA → important words, topics
• movie ratings + side info → graph of users and rated movies → collaborative filtering → movie recommendations
Parallelizing Machine Learning: in the same pipeline, feature extraction and graph ingress are mostly data-parallel, while the graph-structured computation is graph-parallel.
ML Tasks Beyond Data-Parallelism
Data-Parallel: Map Reduce; Feature Extraction; Cross Validation; Computing Sufficient Statistics
Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.); Semi-Supervised Learning (Label Propagation, CoEM); Collaborative Filtering (Tensor Factorization); Graph Analysis (PageRank, Triangle Counting)
Example of a Graph-Parallel Algorithm
Flashback to 1998. Why? The first Google advantage was a graph algorithm, and a system to support it!
PageRank: Identifying Leaders
What's the rank of this user? It depends on the rank of those who follow her, which in turn depends on the rank of those who follow them… Loops in the graph mean we must iterate!
PageRank Iteration
Iterate until convergence: "my rank is the weighted average of my friends' ranks"

    R[i] = α + (1 - α) * Σ_{j in N[i]} w_ji * R[j]

- α is the random reset probability
- w_ji is the probability of transitioning (similarity) from j to i
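The iteration can be written as a few lines of Python. This is a generic power-iteration sketch assuming uniform transition weights w_ji = 1/outdegree(j) and a hypothetical toy graph; it is not GraphLab code.

```python
def pagerank(out_links, alpha=0.15, iters=50):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j],
    with w_ji = 1 / outdegree(j)."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    rank = {v: 1.0 for v in nodes}
    for _ in range(iters):
        incoming = {v: 0.0 for v in nodes}
        for j, targets in out_links.items():
            for i in targets:                 # j spreads its rank evenly
                incoming[i] += rank[j] / len(targets)
        rank = {v: alpha + (1 - alpha) * incoming[v] for v in nodes}
    return rank

# Toy graph: a, b, c all link to "hub"; hub links back to a
out_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
rank = pagerank(out_links)
```

The hub, followed by everyone, ends up with the highest rank; b and c, with no followers, bottom out at the reset probability α.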
Properties of Graph-Parallel Algorithms: a dependency graph, local updates (my rank depends on my friends' ranks), and iterative computation.
Data-Parallel vs. Graph-Parallel: data-parallel computation (MapReduce) maps independent table rows to a result; graph-parallel computation follows a dependency graph.
Addressing Graph-Parallel ML
Data-Parallel (Map Reduce): Feature Extraction; Cross Validation; Computing Sufficient Statistics
Graph-Parallel (Map Reduce? No: a graph-parallel abstraction): Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.); Semi-Supervised Learning (Label Propagation, CoEM); Collaborative Filtering (Tensor Factorization); Data-Mining (PageRank, Triangle Counting)
Data Graph: data associated with vertices and edges
Graph: • Social network
Vertex Data: • User profile text • Current interest estimates
Edge Data: • Similarity weights
How do we program graph computation? "Think like a Vertex." - Malewicz et al. [SIGMOD '10]
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex. Update functions are applied (asynchronously) in parallel until convergence, and many schedulers are available to prioritize computation (dynamic computation).

    pagerank(i, scope) {
      // Get neighborhood data (R[i], w_ji, R[j]) from scope
      // Update the vertex data
      R[i] := α + (1 - α) * Σ_{j in N[i]} w_ji * R[j];
      // Reschedule neighbors if needed
      if R[i] changes then reschedule_neighbors_of(i);
    }
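The dynamic-scheduling behavior (update a vertex, then reschedule its neighbors only when its value actually changed) can be sketched in plain Python. The single-threaded queue, tolerance, and toy two-vertex graph are assumptions for illustration, not the GraphLab scheduler itself.

```python
from collections import deque

def run_dynamic(rank, in_nbrs, out_nbrs, weights, alpha=0.15, tol=1e-4):
    """Process vertices from a work queue; a vertex's neighbors are
    rescheduled only when its rank changes by more than tol."""
    queue = deque(rank)              # initially schedule every vertex
    scheduled = set(rank)
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        new = alpha + (1 - alpha) * sum(weights[j, i] * rank[j] for j in in_nbrs[i])
        changed = abs(new - rank[i]) > tol
        rank[i] = new
        if changed:                  # value moved: wake up the neighbors
            for k in out_nbrs[i]:
                if k not in scheduled:
                    scheduled.add(k)
                    queue.append(k)
    return rank

# Toy two-vertex cycle a <-> b with unit edge weights
nbrs = {"a": ["b"], "b": ["a"]}
weights = {("a", "b"): 1.0, ("b", "a"): 1.0}
rank = run_dynamic({"a": 0.0, "b": 0.0}, nbrs, nbrs, weights)
```

The loop terminates once no vertex changes by more than the tolerance; for this symmetric cycle both ranks converge to the fixed point R = α + (1 - α)R = 1.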
Ensuring Race-Free Code How much can computation overlap ?
Need for Consistency? Running with no consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.
The GraphLab Framework: graph-based data representation; update functions (user computation); scheduler; consistency model.
Never Ending Learner Project (CoEM)
• Hadoop: 95 cores, 7.5 hrs
• Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)
Two orders of magnitude faster, two orders of magnitude cheaper.
Thus far… GraphLab 1 provided exciting scaling performance. But we couldn't scale up to the 2002 AltaVista web graph: 1.4B vertices, 6.7B edges.
Natural Graphs [Image from WikiCommons]
Achilles Heel: the idealized graph assumption. We assumed graphs with small degree that are easy to partition; but natural graphs have many high-degree vertices (power-law degree distribution) and are very hard to partition.
Power-Law Degree Distribution
[Log-log plot: number of vertices vs. degree for the AltaVista web graph, 1.4B vertices, 6.6B edges.]
High-degree vertices: 1% of vertices are adjacent to 50% of the edges.
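The concentration claim can be reproduced qualitatively with a quick simulation; the sample size, the unit-exponent Pareto tail, and the use of `random.paretovariate` are illustrative assumptions, not AltaVista measurements.

```python
import random

# Sample a heavy-tailed, Zipf-like degree sequence and measure how
# much of the total degree the top 1% of vertices hold.
random.seed(0)
n = 100_000
degrees = sorted((random.paretovariate(1.0) for _ in range(n)), reverse=True)
top_share = sum(degrees[: n // 100]) / sum(degrees)
print(f"top 1% of vertices hold {top_share:.0%} of total degree")
```

With a tail this heavy, a small fraction of vertices dominates the edge mass, which is exactly what makes naive hash partitioning of natural graphs so lopsided.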
High Degree Vertices are Common: popular movies among Netflix users, celebrities ("Obama") in social networks, common words across documents, and hyperparameters shared by every document in LDA-style topic models.
GraphLab 2 Solution: split high-degree vertices across machines (e.g. across Machine 1 and Machine 2). Program for this; run on that. A new abstraction leads to this split-vertex strategy.
GAS Decomposition
• Gather (Reduce): accumulate information about the neighborhood with a parallel "sum" (Σ = Y + Y + … + Y)
• Apply: apply the accumulated value to the center vertex (Y → Y')
• Scatter: update adjacent edges and vertices
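A minimal sketch of the gather/apply split for PageRank, in plain Python. Because the gather is a commutative-associative sum, each partial sum could run on a different machine holding a replica of a split vertex; the synchronous loop and tiny toy graph here are simplifying assumptions, not the GraphLab 2 API.

```python
def gather(rank, in_nbrs, weights, i):
    # Gather (reduce): sum weighted ranks of in-neighbors. Each term is
    # independent, so partial sums can be computed on separate replicas.
    return sum(weights[j, i] * rank[j] for j in in_nbrs[i])

def apply_step(acc, alpha=0.15):
    # Apply: fold the accumulated value into the center vertex.
    return alpha + (1 - alpha) * acc

def gas_pagerank(in_nbrs, weights, iters=50):
    rank = {v: 1.0 for v in in_nbrs}
    for _ in range(iters):
        # Scatter would update adjacent edges / signal neighbors;
        # this synchronous sketch simply recomputes every vertex.
        rank = {i: apply_step(gather(rank, in_nbrs, weights, i))
                for i in in_nbrs}
    return rank

# Toy graph: b -> a only; b has no in-edges
in_nbrs = {"a": ["b"], "b": []}
weights = {("b", "a"): 1.0}
rank = gas_pagerank(in_nbrs, weights)
```

Vertex b, with an empty gather, settles at α = 0.15; vertex a settles at α + (1 - α) · 0.15 = 0.2775.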
PageRank on the LiveJournal Graph (runtime in seconds, PageRank for 10 iterations): Mahout/Hadoop 1340, Spark 354, GraphLab 22. GraphLab is 60x faster than Hadoop and 16x faster than Spark.
Topic Modeling (LDA)
[Chart: throughput in million tokens per second. Yahoo!: 100 machines, specifically engineered for this task. GraphLab: 64 cc2.8xlarge EC2 nodes, 200 lines of code & 4 human hours.]
English language Wikipedia: 2.6M documents, 8.3M words, 500M tokens. Computationally intensive.
GraphLab vs. Giraph Source: SC13 paper
GraphChi (2011)
GraphChi: going small with GraphLab. Can we solve huge problems on small or embedded devices? Key: exploit non-volatile storage (starting with SSDs and hard disks).
GraphChi, a disk-based GraphLab. Challenge: random accesses. Novel GraphChi solution: the parallel sliding windows method minimizes the number of random accesses.
Triangle Counting on the Twitter Graph (40M users, 1.2B edges; total 34.8 billion triangles)
• Hadoop: 1636 machines, 423 minutes
• GraphChi: 59 minutes, on 1 Mac Mini!
• GraphLab 2: 64 machines (1024 cores), 1.5 minutes
Hadoop results from [Suri & Vassilvitskii, WWW '11]
Netflix Collaborative Filtering
• Alternating Least Squares matrix factorization
• Model: 0.5 million nodes, 99 million edges
[Chart: runtime in seconds (log scale, 10^1 to 10^3) vs. number of nodes (4 to 64) for Hadoop, MPI, and GraphLab.]
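For intuition, here is a rank-1 Alternating Least Squares sketch in pure Python: fix the item factors and solve each user factor in closed form, then swap. Real systems use higher rank, vector factors, and distributed execution; the tiny, exactly rank-1 ratings table is fabricated for illustration.

```python
def als_rank1(ratings, n_users, n_items, iters=30, reg=0.01):
    """Fit ratings r_ij ~= u[i] * v[j] by alternating closed-form
    least-squares solves for the user and item factors."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    by_user = {i: [] for i in range(n_users)}
    by_item = {j: [] for j in range(n_items)}
    for i, j, r in ratings:
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    for _ in range(iters):
        for i in range(n_users):      # fix v, solve each u[i]
            num = sum(v[j] * r for j, r in by_user[i])
            den = sum(v[j] ** 2 for j, _ in by_user[i]) + reg
            u[i] = num / den
        for j in range(n_items):      # fix u, solve each v[j]
            num = sum(u[i] * r for i, r in by_item[j])
            den = sum(u[i] ** 2 for i, _ in by_item[j]) + reg
            v[j] = num / den
    return u, v

# A 3x3 rank-1 ratings table: r_ij = (i + 1) * (j + 1)
ratings = [(i, j, (i + 1) * (j + 1)) for i in range(3) for j in range(3)]
u, v = als_rank1(ratings, 3, 3)
pred = u[1] * v[2]   # should approximate the held pattern 2 * 3 = 6
```

Each per-user (and per-item) solve touches only that vertex's edges, which is exactly why ALS maps naturally onto the user-item rating graph.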
Intel Labs Report on GraphLab Data source: Nezih Yigitbasi, Intel Labs
The Cost of Hadoop
[Log-log chart: cost in $ (10^-1 to 10^2) vs. runtime in seconds (10^1 to 10^4) for Hadoop and GraphLab.]
Growing User Community and Adoption
GraphLab Conferences 2012 2013
Growing community contribution
Unleash Data Science Power + Simplicity
Real-World Pipelines Combine Graphs & Tables
• Wikipedia XML → raw text table (Title, Body) → hyperlinks graph → PageRank → top 20 pages table (Title, PR)
• Raw text → term-doc graph → topic model (LDA) → word topics table (Word, Topic)
GraphLab Create: Blend Graphs & Tables. Enabling users to easily and efficiently express entire graph analytics pipelines within a simple Python API.
Machine Learning is a powerful tool, but…
• even basic applications can be challenging
• it takes 6 months from R/Matlab to production (at best)
• state-of-the-art algorithms are trapped in research papers
Goal of GraphLab: make large-scale machine learning accessible to all!
Now with GraphLab: Learn / Prototype / Deploy
• Even the basics of scalable ML can be challenging → learn ML with the GraphLab Notebook
• 6 months from R/Matlab to production, at best → pip install graphlab, then deploy on EC2
• State-of-the-art ML algorithms trapped in research papers → fully integrated via GraphLab Toolkits