Giraph: Production-grade graph processing infrastructure for trillion edge graphs 6/22/2014 GRADES Avery Ching
Motivation
Apache Giraph
• Inspired by Google's Pregel, but runs on Hadoop
• "Think like a vertex"
• Maximum value vertex example
[Figure: maximum value example. Vertices spread across Processor 1 and Processor 2 exchange their values each superstep until every vertex converges to the maximum value, 5.]
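A minimal sketch of the maximum value example in code, assuming the same Giraph 1.1 BasicComputation API used on the PageRank slide; the class name and the NullWritable edge type are illustrative.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Each vertex keeps the largest value it has seen and only speaks up
// when its value changes; the job halts once values stop changing.
public class MaxValueComputation extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) {
    boolean changed = getSuperstep() == 0;   // everyone announces initially
    for (DoubleWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.getValue().set(message.get());
        changed = true;
      }
    }
    if (changed) {
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();
  }
}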
Giraph on Hadoop / YARN
Giraph runs on top of MapReduce or YARN, across Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x.
[Figure: Giraph layered over MapReduce and YARN on the corresponding Hadoop releases]
Page rank in Giraph

public class PageRankComputation extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    if (getSuperstep() >= 1) {
      // Calculate updated page rank value from neighbors
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set((0.15d / getTotalNumVertices()) + 0.85d * sum);
    }
    if (getSuperstep() < 30) {
      // Send page rank value to neighbors for 30 iterations
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
Apache Giraph data flow
Three phases: loading the graph, compute/iterate, and storing the graph.
[Figure: workers load input splits through the input format into in-memory graph partitions; each superstep the workers compute on their partitions and send messages while the master coordinates ("send stats / iterate!"); finally each worker writes its partitions through the output format.]
Pipelined computation
Master "computes"
• Sets the computation, in/out message types, and combiner for the next superstep
• Can set/modify aggregator values
[Figure: timeline of the master and two workers moving through phases 1a, 1b, 2, and 3]
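A minimal sketch of such a master computation, assuming the Giraph 1.1 MasterCompute API and the PageRankComputation class from the earlier slide (assumed to be in the same package); the aggregator name is an assumption.

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

// Runs once per superstep on the master, before the workers compute.
public class PipelinedMasterCompute extends DefaultMasterCompute {
  // Hypothetical aggregator that workers bump when a vertex changes.
  private static final String CHANGED_AGG = "changed.count";

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(CHANGED_AGG, LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Choose the computation for the next superstep; a multi-phase
    // pipeline would switch among different Computation classes here.
    setComputation(PageRankComputation.class);

    // Aggregator values can be read and modified between supersteps.
    LongWritable changed = getAggregatedValue(CHANGED_AGG);
    setAggregatedValue(CHANGED_AGG, new LongWritable(0));
  }
}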
Use case
Affinity propagation
Frey and Dueck, "Clustering by passing messages between data points," Science 2007
Organically discovers exemplars based on similarity
[Figure: clustering progress at initialization, an intermediate stage, and convergence]
3 stages
Responsibility r(i,k)
• How well suited is k to be an exemplar for i?
Availability a(i,k)
• How appropriate is it for point i to choose point k as an exemplar, given all of k's responsibilities?
Update exemplars
• Based on the known responsibilities/availabilities, which vertex should be my exemplar?
* Responsibilities and availabilities are dampened between iterations
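For reference, the standard update rules from Frey and Dueck, which the slides do not spell out; s(i,k) is the input similarity and λ the damping factor.

\begin{align*}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k} \big[ a(i,k') + s(i,k') \big] \\
a(i,k) &\leftarrow \min\Big(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\big(0, r(i',k)\big)\Big), \quad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\big(0, r(i',k)\big) \\
\text{exemplar}(i) &= \arg\max_{k} \big[ a(i,k) + r(i,k) \big]
\end{align*}

Damping mixes each new message with the previous value: $m \leftarrow \lambda\, m_{\text{old}} + (1-\lambda)\, m_{\text{new}}$.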
Responsibility
Every vertex i with an edge to k maintains the responsibility of k for i
Sends the responsibility to k in a ResponsibilityMessage (sender id, responsibility(i,k))
[Figure: B, C, and D send r(b,a), r(c,a), r(d,a) to A, and B sends r(b,d) to D]
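A minimal sketch of what such a message type can look like as a Hadoop Writable; the field names are assumptions, since the slides only show (sender id, responsibility) pairs.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical message carrying (sender id, responsibility(i,k)).
public class ResponsibilityMessage implements Writable {
  private long senderId;
  private double responsibility;

  public ResponsibilityMessage() { }   // no-arg constructor required for deserialization

  public ResponsibilityMessage(long senderId, double responsibility) {
    this.senderId = senderId;
    this.responsibility = responsibility;
  }

  public long getSenderId() { return senderId; }
  public double getResponsibility() { return responsibility; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(senderId);
    out.writeDouble(responsibility);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    senderId = in.readLong();
    responsibility = in.readDouble();
  }
}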
Availability
Vertex k sums the positive responsibility messages
Sends the availability to i in an AvailabilityMessage (sender id, availability(i,k))
[Figure: A sends a(b,a), a(c,a), a(d,a) to B, C, and D, and D sends a(b,d) to B]
Update exemplars
Dampens availabilities and scans its edges to find its exemplar k
Updates its self-exemplar
[Figure: each vertex updates its chosen exemplar, e.g., most vertices converge on a as their exemplar]
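A sketch of the exemplar selection step: pick the candidate k that maximizes a(i,k) + r(i,k), following the Frey and Dueck formulation. The per-candidate maps of dampened availabilities and responsibilities are assumptions about how a vertex might keep this state.

import java.util.Map;

// Illustrative exemplar choice for vertex i. The maps (candidate
// exemplar id -> dampened availability / responsibility) are assumed
// to be maintained by the vertex across supersteps.
public final class ExemplarUpdate {
  private ExemplarUpdate() { }

  public static long chooseExemplar(long selfId,
                                    Map<Long, Double> availability,
                                    Map<Long, Double> responsibility) {
    long exemplar = selfId;
    double best = Double.NEGATIVE_INFINITY;
    for (Map.Entry<Long, Double> entry : availability.entrySet()) {
      Double r = responsibility.get(entry.getKey());
      double score = entry.getValue() + (r == null ? 0.0 : r);
      if (score > best) {
        best = score;
        exemplar = entry.getKey();
      }
    }
    return exemplar;
  }
}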
Master logic
[Figure: state machine: initial state → calculate responsibility → calculate availability → update exemplars → back to calculate responsibility, or halt]
if (exemplars agree they are exemplars && changed exemplars < ∆) then halt, otherwise continue
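A sketch of the halting check on the master, assuming two aggregators filled by the workers during the update-exemplars phase (a BooleanAndAggregator for "exemplars agree they are exemplars" and a LongSumAggregator for the number of changed exemplars); the aggregator names, the ∆ value, and the phase bookkeeping are assumptions.

import org.apache.giraph.aggregators.BooleanAndAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;

// Sketch of the affinity propagation master loop's halting decision.
public class AffinityPropagationMaster extends DefaultMasterCompute {
  private static final String EXEMPLARS_AGREE = "ap.exemplars.agree";     // hypothetical
  private static final String CHANGED_EXEMPLARS = "ap.changed.exemplars"; // hypothetical
  private static final long DELTA = 1000;                                 // hypothetical threshold

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(EXEMPLARS_AGREE, BooleanAndAggregator.class);
    registerAggregator(CHANGED_EXEMPLARS, LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Workers fill these aggregators during the update-exemplars phase.
    BooleanWritable agree = getAggregatedValue(EXEMPLARS_AGREE);
    LongWritable changed = getAggregatedValue(CHANGED_EXEMPLARS);
    if (getSuperstep() > 0 && agree.get() && changed.get() < DELTA) {
      haltComputation();
    }
    // Otherwise the master would install the computation for the next
    // phase (responsibility -> availability -> update exemplars) here.
  }
}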
Performance & Scalability
Example graph sizes
[Chart: graphs used in research publications (Clueweb 09, Twitter dataset, Friendster, Yahoo! web) are in the single-digit billions of edges, versus rough social network scale (estimated Twitter and Facebook) in the tens to hundreds of billions]
Twitter: 255M MAU (https://about.twitter.com/company), 208 average followers (Beevolve 2012) → estimated >53B edges
Facebook: 1.28B MAU (Q1/2014 report), 200+ average friends (2011 S1) → estimated >256B edges
Faster than Hive?
Application                     Graph size     CPU time speedup   Elapsed time speedup
Page rank (single iteration)    400B+ edges    26x                120x
Friends of friends score        71B+ edges     12.5x              48x
Apache Giraph scalability
[Charts: scalability of workers at 200B edges (seconds per iteration vs. 50-300 workers) and scalability of edges with 50 workers (seconds vs. 1E+09 to 2E+11 edges), Giraph plotted against ideal]
Trillion social edges page rank
Improvements
• GIRAPH-840 - Netty 4 upgrade
• G1 collector / tuning
[Chart: minutes per iteration, 6/30/2013 vs. 6/2/2014]
Graph partitioning
Why balanced partitioning
Random partitioning == good balance, BUT ignores entity affinity
[Figure: 12 vertices assigned to partitions at random, splitting up tightly connected vertices]
Balanced partitioning application
Results from one service: cache hit rate grew from 70% to 85%, bandwidth cut in half
[Figure: the same 12 vertices grouped into balanced partitions of related vertices]
Balanced label propagation results * Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13
Leveraging partitioning
Explicit remapping
Native remapping
• Transparent
• Embedded
Explicit remapping
[Figure: compute shortest paths from San Jose. The original graph (San Jose, Chicago, New York with weighted edges) is joined with a partitioning mapping (San Jose→0, Chicago→1, New York→2) to produce a remapped graph on ids 0-2; Giraph computes on the remapped graph; the compute output is then joined with the reverse mapping to produce the final output on the original ids (San Jose 0, Chicago 4, New York 6).]
Native transparent remapping
[Figure: same shortest-paths example. The original graph is loaded together with group information from the partitioning mapping (San Jose→0, Chicago→1, New York→2); Giraph uses the groups internally for placement while the application still sees the original ids, so the final compute output is already in terms of San Jose, Chicago, and New York with no extra joins.]
Native embedded remapping
[Figure: same shortest-paths example, but the partitioning mapping (id → machine) is embedded directly into the vertex id: the top bits encode the machine, the low bits the original id.]
Not all graphs can leverage this technique; Facebook can, since its ids are longs with unused bits.
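A minimal sketch of the bit-embedding idea, assuming 16 unused top bits; the 16/48 split and the helper names are illustrative, not Facebook's actual scheme.

// Illustrative id embedding: pack a machine/partition hint into the
// unused top bits of a 64-bit id.
public final class EmbeddedId {
  private static final int MACHINE_BITS = 16;
  private static final int ID_BITS = Long.SIZE - MACHINE_BITS;   // 48
  private static final long ID_MASK = (1L << ID_BITS) - 1;

  private EmbeddedId() { }

  public static long embed(long originalId, int machine) {
    return ((long) machine << ID_BITS) | (originalId & ID_MASK);
  }

  public static int machineOf(long embeddedId) {
    return (int) (embeddedId >>> ID_BITS);
  }

  public static long originalIdOf(long embeddedId) {
    return embeddedId & ID_MASK;
  }
}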
Remapping comparison
Explicit
• Pros: can also add id compression
• Cons: application aware of id remapping; workflow complexity; pre- and post-join overhead
Native transparent
• Pros: no application change, just additional input parameters
• Cons: additional memory usage on input; group information uses more memory
Native embedded
• Pros: utilizes unused bits
• Cons: application changes to id type
Partitioning experiments
345B edge page rank
[Chart: seconds per iteration for random partitioning vs. 47% local vs. 60% local]
Message explosion
Avoiding out-of-core
Example: mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect with your own friend list
[Figure: vertices A-E exchange friend lists and intersect them, e.g., A receives C:{D}, D:{C}, E:{} from B]
1.23B people (as of 1/2014), 200+ average friends (2011 S1), 8-byte ids (longs)
= 394 TB of messages / 100 GB machines
= 3,940 machines (not including the graph)
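A rough back-of-the-envelope consistent with the slide's numbers, assuming each of roughly 1.23B people sends each of ~200 friends a list of ~200 8-byte ids:

$1.23\times10^{9} \times 200 \times 200 \times 8\,\text{B} \approx 3.9\times10^{14}\,\text{B} \approx 394\,\text{TB}$
$394\,\text{TB} \,/\, 100\,\text{GB per machine} \approx 3{,}940\,\text{machines (not counting the graph itself)}$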
Superstep splitting
Process only a subset of the source/destination edges in each superstep
* Currently manual - future work: automatic!
[Figure: four supersteps cover all combinations of source and destination subsets: sources A on / B off then A off / B on, crossed with destinations A on / B off then A off / B on]
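A hypothetical sketch of how the manual splitting can be expressed: use one id bit to pick the source group and one to pick the destination group, and enable one (source group, destination group) pair per superstep so only about a quarter of the messages are in flight at a time. The helper names and the bit-based grouping are illustrative, not Giraph API; a compute() implementation would send only when both predicates hold for the current superstep.

// Illustrative helper for superstep splitting with 2 source groups and
// 2 destination groups: supersteps 0..3 each handle one combination.
public final class SuperstepSplitting {
  private SuperstepSplitting() { }

  private static int groupOf(long id) {
    return (int) (id & 1);   // 2 groups, split by the low id bit
  }

  public static boolean sourceEnabled(long sourceId, long superstep) {
    return groupOf(sourceId) == (int) ((superstep / 2) % 2);
  }

  public static boolean destinationEnabled(long targetId, long superstep) {
    return groupOf(targetId) == (int) (superstep % 2);
  }
}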
Giraph in production
• Over 1.5 years in production
• Hundreds of production Giraph jobs processed a week
  • Lots of untracked experiments
• 30+ applications in our internal application repository
• Sample production job: 700B+ edges
• Job times range from minutes to hours
GiraphicJam demo
Giraph related projects Graft: The distributed Giraph debugger
Giraph roadmap 2/12 - 0.1 5/13 - 1.0 6/14 - 1.1
The future
Scheduling jobs
Snapshot automatically after a time period and restart at the end of the queue
[Figure: two job timelines illustrating snapshot and restart]