Unleash Data Science
Danny Bickson, Co-Founder
GraphLab Project History: GraphLab (2009) → GraphChi (2011) → GraphLab Create (2014)
GraphLab Open Source (2009)
Graphs are Everywhere
Graphs are Essential to Data Mining and Machine Learning
• Identify influential information
• Reason about latent properties
• Model complex data dependencies
Examples of Graphs in Machine Learning
Predicting User Behavior
[Figure: a social graph of users (labeled Liberal or Conservative, most unknown) and their posts; label propagation spreads the known labels across the graph to predict the rest.]
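A minimal label-propagation sketch of the idea above, assuming a tiny hypothetical friendship graph with two seed users (+1 = liberal, -1 = conservative); unlabeled users repeatedly take the average of their neighbors' scores.

```python
def label_propagation(edges, seeds, iterations=20):
    """Propagate {-1, +1} labels from seed users to the rest of the
    graph by iterated neighborhood averaging (Jacobi-style)."""
    nodes = {v for e in edges for v in e}
    neighbors = {v: [] for v in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    scores = {v: seeds.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        new_scores = {}
        for v in nodes:
            if v in seeds:
                new_scores[v] = seeds[v]          # labeled users keep their label
            else:                                  # others take the neighbor average
                new_scores[v] = sum(scores[u] for u in neighbors[v]) / len(neighbors[v])
        scores = new_scores
    return scores

# Hypothetical chain: ann - bob - carol - dan, with ann and dan labeled
edges = [("ann", "bob"), ("bob", "carol"), ("carol", "dan")]
seeds = {"ann": +1.0, "dan": -1.0}
scores = label_propagation(edges, seeds)
```

Unlabeled users end up pulled toward the nearer seed: bob leans positive, carol negative.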
Finding Communities
Count the triangles passing through each vertex: this measures the "cohesiveness" of the local community.
More triangles → stronger community; fewer triangles → weaker community.
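Per-vertex triangle counting can be sketched directly from the definition: for each vertex, count the connected pairs among its neighbors. The toy graph below is illustrative.

```python
from itertools import combinations

def per_vertex_triangles(edges):
    """Count triangles through each vertex: for every pair of a
    vertex's neighbors, check whether that pair is itself an edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {
        v: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for v, nbrs in adj.items()
    }

# Two triangles sharing vertices 1 and 3: (1,2,3) and (1,3,4)
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
counts = per_vertex_triangles(edges)
```

Each triangle is counted once at each of its three vertices, so the global triangle count is the sum of the per-vertex counts divided by 3.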
Recommending Products
[Figure: a bipartite graph of users and items, with ratings on the edges.]
Many More Applications
• Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
• Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
• Semi-supervised ML: Graph SSL, CoEM
• Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
• Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
• Classification: Neural Networks
The Graph-Parallel Pattern: the model / algorithm state lives on a graph, and each vertex's computation depends only on its neighbors.
The GraphLab Framework. Data Model: Property Graph. Computation: Vertex Programs.
Machine Learning Pipeline: Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value
• face images → face similarity graph → belief propagation → face labels
• docs → graph of docs linked by shared words → LDA → important words, topics
• movie ratings + side info → graph of users and rated movies → collaborative filtering → movie recommendations
Parallelizing Machine Learning: in the same pipeline, feature extraction and graph ingress are mostly data-parallel, while the graph-structured computation is graph-parallel.
ML Tasks Beyond Data-Parallelism
Data-Parallel: Map Reduce; Feature Extraction; Cross Validation; Computing Sufficient Statistics
Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.); Semi-Supervised Learning (Label Propagation, CoEM); Collaborative Filtering (Tensor Factorization); Graph Analysis (PageRank, Triangle Counting)
Example of a Graph-Parallel Algorithm
Flashback to 1998. Why? The first Google advantage was a graph algorithm, and a system to support it!
PageRank: Identifying Leaders
What's the rank of this user? It depends on the rank of those who follow her, which in turn depends on the rank of those who follow them… Loops in the graph mean we must iterate!
PageRank Iteration
Iterate until convergence: "my rank is the weighted average of my friends' ranks"

    R[i] = α + (1 - α) * Σ_{j in N[i]} w_ji * R[j]

- α is the random reset probability
- w_ji is the probability of transitioning (similarity) from j to i
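The iteration can be written as a few lines of Python. This is a generic power-iteration sketch assuming uniform transition weights w_ji = 1/outdegree(j) and a hypothetical toy graph; it is not GraphLab code.

```python
def pagerank(out_links, alpha=0.15, iters=50):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j],
    with w_ji = 1 / outdegree(j)."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    rank = {v: 1.0 for v in nodes}
    for _ in range(iters):
        incoming = {v: 0.0 for v in nodes}
        for j, targets in out_links.items():
            for i in targets:                 # j spreads its rank evenly
                incoming[i] += rank[j] / len(targets)
        rank = {v: alpha + (1 - alpha) * incoming[v] for v in nodes}
    return rank

# Toy graph: a, b, c all link to "hub"; hub links back to a
out_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
rank = pagerank(out_links)
```

The hub, followed by everyone, ends up with the highest rank; b and c, with no followers, bottom out at the reset probability α.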
Properties of Graph-Parallel Algorithms: a dependency graph, local updates (my rank depends on my friends' ranks), and iterative computation.
Data-Parallel vs. Graph-Parallel: data-parallel computation (MapReduce) maps independent table rows to a result; graph-parallel computation follows a dependency graph.
Addressing Graph-Parallel ML
Data-Parallel (Map Reduce): Feature Extraction; Cross Validation; Computing Sufficient Statistics
Graph-Parallel (Map Reduce? No: a graph-parallel abstraction): Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.); Semi-Supervised Learning (Label Propagation, CoEM); Collaborative Filtering (Tensor Factorization); Data-Mining (PageRank, Triangle Counting)
Data Graph: data associated with vertices and edges
Graph: • Social network
Vertex Data: • User profile text • Current interest estimates
Edge Data: • Similarity weights
How do we program graph computation? "Think like a Vertex." - Malewicz et al. [SIGMOD '10]
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex. Update functions are applied (asynchronously) in parallel until convergence, and many schedulers are available to prioritize computation (dynamic computation).

    pagerank(i, scope) {
      // Get neighborhood data (R[i], w_ji, R[j]) from scope
      // Update the vertex data
      R[i] := α + (1 - α) * Σ_{j in N[i]} w_ji * R[j];
      // Reschedule neighbors if needed
      if R[i] changes then reschedule_neighbors_of(i);
    }
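The dynamic-scheduling behavior (update a vertex, then reschedule its neighbors only when its value actually changed) can be sketched in plain Python. The single-threaded queue, tolerance, and toy two-vertex graph are assumptions for illustration, not the GraphLab scheduler itself.

```python
from collections import deque

def run_dynamic(rank, in_nbrs, out_nbrs, weights, alpha=0.15, tol=1e-4):
    """Process vertices from a work queue; a vertex's neighbors are
    rescheduled only when its rank changes by more than tol."""
    queue = deque(rank)              # initially schedule every vertex
    scheduled = set(rank)
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        new = alpha + (1 - alpha) * sum(weights[j, i] * rank[j] for j in in_nbrs[i])
        changed = abs(new - rank[i]) > tol
        rank[i] = new
        if changed:                  # value moved: wake up the neighbors
            for k in out_nbrs[i]:
                if k not in scheduled:
                    scheduled.add(k)
                    queue.append(k)
    return rank

# Toy two-vertex cycle a <-> b with unit edge weights
nbrs = {"a": ["b"], "b": ["a"]}
weights = {("a", "b"): 1.0, ("b", "a"): 1.0}
rank = run_dynamic({"a": 0.0, "b": 0.0}, nbrs, nbrs, weights)
```

The loop terminates once no vertex changes by more than the tolerance; for this symmetric cycle both ranks converge to the fixed point R = α + (1 - α)R = 1.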
Ensuring Race-Free Code How much can computation overlap ?
Need for Consistency? Running with no consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.
The GraphLab Framework: graph-based data representation; update functions (user computation); scheduler; consistency model.
Never Ending Learner Project (CoEM)
• Hadoop: 95 cores, 7.5 hrs
• Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)
Two orders of magnitude faster, two orders of magnitude cheaper.
Thus far… GraphLab 1 provided exciting scaling performance. But we couldn't scale up to the 2002 AltaVista web graph: 1.4B vertices, 6.7B edges.
Natural Graphs [Image from WikiCommons]
Achilles Heel: the idealized graph assumption. We assumed graphs with small degree that are easy to partition; but natural graphs have many high-degree vertices (power-law degree distribution) and are very hard to partition.
Power-Law Degree Distribution
[Log-log plot: number of vertices vs. degree for the AltaVista web graph, 1.4B vertices, 6.6B edges.]
High-degree vertices: 1% of vertices are adjacent to 50% of the edges.
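The concentration claim can be reproduced qualitatively with a quick simulation; the sample size, the unit-exponent Pareto tail, and the use of `random.paretovariate` are illustrative assumptions, not AltaVista measurements.

```python
import random

# Sample a heavy-tailed, Zipf-like degree sequence and measure how
# much of the total degree the top 1% of vertices hold.
random.seed(0)
n = 100_000
degrees = sorted((random.paretovariate(1.0) for _ in range(n)), reverse=True)
top_share = sum(degrees[: n // 100]) / sum(degrees)
print(f"top 1% of vertices hold {top_share:.0%} of total degree")
```

With a tail this heavy, a small fraction of vertices dominates the edge mass, which is exactly what makes naive hash partitioning of natural graphs so lopsided.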
High Degree Vertices are Common: popular movies among Netflix users, celebrities ("Obama") in social networks, common words across documents, and hyperparameters shared by every document in LDA-style topic models.
GraphLab 2 Solution: split high-degree vertices across machines (e.g. across Machine 1 and Machine 2). Program for this; run on that. A new abstraction leads to this split-vertex strategy.
GAS Decomposition
• Gather (Reduce): accumulate information about the neighborhood with a parallel "sum" (Σ = Y + Y + … + Y)
• Apply: apply the accumulated value to the center vertex (Y → Y')
• Scatter: update adjacent edges and vertices
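A minimal sketch of the gather/apply split for PageRank, in plain Python. Because the gather is a commutative-associative sum, each partial sum could run on a different machine holding a replica of a split vertex; the synchronous loop and tiny toy graph here are simplifying assumptions, not the GraphLab 2 API.

```python
def gather(rank, in_nbrs, weights, i):
    # Gather (reduce): sum weighted ranks of in-neighbors. Each term is
    # independent, so partial sums can be computed on separate replicas.
    return sum(weights[j, i] * rank[j] for j in in_nbrs[i])

def apply_step(acc, alpha=0.15):
    # Apply: fold the accumulated value into the center vertex.
    return alpha + (1 - alpha) * acc

def gas_pagerank(in_nbrs, weights, iters=50):
    rank = {v: 1.0 for v in in_nbrs}
    for _ in range(iters):
        # Scatter would update adjacent edges / signal neighbors;
        # this synchronous sketch simply recomputes every vertex.
        rank = {i: apply_step(gather(rank, in_nbrs, weights, i))
                for i in in_nbrs}
    return rank

# Toy graph: b -> a only; b has no in-edges
in_nbrs = {"a": ["b"], "b": []}
weights = {("b", "a"): 1.0}
rank = gas_pagerank(in_nbrs, weights)
```

Vertex b, with an empty gather, settles at α = 0.15; vertex a settles at α + (1 - α) · 0.15 = 0.2775.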
PageRank on the LiveJournal Graph (runtime in seconds, PageRank for 10 iterations): Mahout/Hadoop 1340, Spark 354, GraphLab 22. GraphLab is 60x faster than Hadoop and 16x faster than Spark.
Topic Modeling (LDA)
[Chart: throughput in million tokens per second. Yahoo!: 100 machines, specifically engineered for this task. GraphLab: 64 cc2.8xlarge EC2 nodes, 200 lines of code & 4 human hours.]
English language Wikipedia: 2.6M documents, 8.3M words, 500M tokens. Computationally intensive.
GraphLab vs. Giraph Source: SC13 paper
GraphChi (2011)
GraphChi: going small with GraphLab. Can we solve huge problems on small or embedded devices? Key: exploit non-volatile storage (starting with SSDs and hard disks).
GraphChi, a disk-based GraphLab. Challenge: random accesses. Novel GraphChi solution: the parallel sliding windows method minimizes the number of random accesses.
Triangle Counting on the Twitter Graph (40M users, 1.2B edges; total 34.8 billion triangles)
• Hadoop: 1636 machines, 423 minutes
• GraphChi: 59 minutes, on 1 Mac Mini!
• GraphLab 2: 64 machines (1024 cores), 1.5 minutes
Hadoop results from [Suri & Vassilvitskii, WWW '11]
Netflix Collaborative Filtering
• Alternating Least Squares matrix factorization
• Model: 0.5 million nodes, 99 million edges
[Chart: runtime in seconds (log scale, 10^1 to 10^3) vs. number of nodes (4 to 64) for Hadoop, MPI, and GraphLab.]
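For intuition, here is a rank-1 Alternating Least Squares sketch in pure Python: fix the item factors and solve each user factor in closed form, then swap. Real systems use higher rank, vector factors, and distributed execution; the tiny, exactly rank-1 ratings table is fabricated for illustration.

```python
def als_rank1(ratings, n_users, n_items, iters=30, reg=0.01):
    """Fit ratings r_ij ~= u[i] * v[j] by alternating closed-form
    least-squares solves for the user and item factors."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    by_user = {i: [] for i in range(n_users)}
    by_item = {j: [] for j in range(n_items)}
    for i, j, r in ratings:
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    for _ in range(iters):
        for i in range(n_users):      # fix v, solve each u[i]
            num = sum(v[j] * r for j, r in by_user[i])
            den = sum(v[j] ** 2 for j, _ in by_user[i]) + reg
            u[i] = num / den
        for j in range(n_items):      # fix u, solve each v[j]
            num = sum(u[i] * r for i, r in by_item[j])
            den = sum(u[i] ** 2 for i, _ in by_item[j]) + reg
            v[j] = num / den
    return u, v

# A 3x3 rank-1 ratings table: r_ij = (i + 1) * (j + 1)
ratings = [(i, j, (i + 1) * (j + 1)) for i in range(3) for j in range(3)]
u, v = als_rank1(ratings, 3, 3)
pred = u[1] * v[2]   # should approximate the held pattern 2 * 3 = 6
```

Each per-user (and per-item) solve touches only that vertex's edges, which is exactly why ALS maps naturally onto the user-item rating graph.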
Intel Labs Report on GraphLab Data source: Nezih Yigitbasi, Intel Labs
The Cost of Hadoop
[Log-log chart: cost in $ (10^-1 to 10^2) vs. runtime in seconds (10^1 to 10^4) for Hadoop and GraphLab.]
Growing User Community and Adoption
GraphLab Conferences 2012 2013
Growing community contribution
Unleash Data Science Power + Simplicity
Real-World Pipelines Combine Graphs & Tables
• Wikipedia XML → raw text table (Title, Body) → hyperlinks graph → PageRank → top 20 pages table (Title, PR)
• Raw text → term-doc graph → topic model (LDA) → word topics table (Word, Topic)
GraphLab Create: Blend Graphs & Tables. Enabling users to easily and efficiently express entire graph analytics pipelines within a simple Python API.
Machine Learning is a powerful tool, but…
• even basic applications can be challenging
• it takes 6 months from R/Matlab to production (at best)
• state-of-the-art algorithms are trapped in research papers
Goal of GraphLab: make large-scale machine learning accessible to all!
Now with GraphLab: Learn / Prototype / Deploy
• Even the basics of scalable ML can be challenging → learn ML with the GraphLab Notebook
• 6 months from R/Matlab to production, at best → pip install graphlab, then deploy on EC2
• State-of-the-art ML algorithms trapped in research papers → fully integrated via GraphLab Toolkits