Efficient Diameter Approximation for Large Graphs in MapReduce
Geppino Pucci, Università di Padova, Italy
Based on joint works ([SPAA15], [IPDPS16]) with: Matteo Ceccarello, Andrea Pietracaprina (U. Padova), Eli Upfal (Brown U.)
Outline
1. Context
2. Computational model
3. Previous work
4. Diameter approximation algorithm
5. Experiments
6. Conclusions
Context

Scenario
◮ Large graph analytics: major discovery tool for diverse application domains (e.g., social/road/biological network analysis, cybersecurity, NLP, cognitive computing)
◮ (Commodity) computer clusters: cheap, widespread platforms with relatively high communication/synchronization costs

Focus
◮ Approximation of the graph diameter
◮ Very large, undirected, weighted (sparse) graphs
◮ Linear space, few parallel rounds, practical efficiency
Computational Model

MR model [PPRSU12]
◮ Abstraction of popular programming frameworks (MapReduce/Hadoop, Spark)
◮ Builds upon and simplifies [Karloff+'10], [Goodrich'11]
◮ Underlying platform: unspecified number of interconnected commodity machines
◮ Algorithm: sequence of rounds
◮ Two parameters: max local space M_L, max aggregate space M_A

MR(M_L, M_A) round: transforms a multiset X of key-value pairs into a new multiset Y of key-value pairs by applying a given reduce function to all input pairs with the same key.
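For concreteness, a minimal sequential emulation of one MR(M_L, M_A) round (a sketch of my own, not code from the paper or from any framework; it only illustrates the group-by-key-then-reduce semantics and a simplified local-space check):

```python
from collections import defaultdict

def mr_round(pairs, reduce_fn, max_local_space=None):
    """Emulate one MR(M_L, M_A) round: group the key-value pairs by key,
    then apply the reduce function to each group independently.
    In a real deployment each group must fit in M_L and the whole multiset
    in M_A; here we only check a simplified local-space constraint."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        if max_local_space is not None and len(values) > max_local_space:
            raise MemoryError(f"group {key!r} exceeds local space M_L")
        out.extend(reduce_fn(key, values))   # the reducer emits new key-value pairs
    return out

# Example: one round summing the values of each key
pairs = [("a", 1), ("b", 2), ("a", 3)]
print(mr_round(pairs, lambda k, vs: [(k, sum(vs))]))   # [('a', 4), ('b', 2)]
```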
Previous Work

Sequential setting
◮ APSP (Johnson's algorithm): O(n·m + n² log n) time
◮ Roditty et al. (STOC'13, SODA'14): 3/2-approximation in O(min{m^{3/2}, m·n^{2/3}}) = o(n·m) time
◮ Empirically: very few SSSPs guarantee accurate estimates ([MLH09, CGHLM13, C+12, C+13, C+15])

Parallel setting
◮ Exact diameter through matrix multiplication: O(log n) rounds but Ω(n²) space
◮ Cohen (JACM'00): (1 + ε)-approximation in O(poly(log n)) time and superlinear space; not easy to implement
Previous Work (cont'd)

2-approximation achievable through SSSP PRAM algorithms

∆-stepping (Meyer and Sanders, JoA'03)
◮ Parallel time-work tradeoff by staggering edge relaxations (d_j ← min{d_j, d_i + w_ij})
◮ At iteration i, compute distances in [(i−1)∆, i∆]
◮ Small ∆: ≃ Dijkstra. Large ∆: ≃ Bellman-Ford
◮ Round complexity: Ω(ℓ_{Φ(G)}), where ℓ_{Φ(G)} edges are required to connect any two nodes at distance Φ(G)

Our aim: diameter approximation in linear space and o(ℓ_{Φ(G)}) rounds
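As a reference point, a compact sequential sketch of ∆-stepping (hedged: a simplified single-threaded variant that does not separate light and heavy edges; it only illustrates the bucketed relaxations that the parallel algorithm staggers across rounds):

```python
import math
from collections import defaultdict

def delta_stepping(adj, source, delta):
    """Sequential sketch of Delta-stepping SSSP.
    adj: {u: [(v, w), ...]}; bucket i holds tentative distances in [i*delta, (i+1)*delta)."""
    dist = defaultdict(lambda: math.inf)
    dist[source] = 0.0
    buckets = defaultdict(set)
    buckets[0].add(source)
    while buckets:
        i = min(buckets)                # process the lowest non-empty bucket
        frontier = buckets.pop(i)
        while frontier:                 # relax until bucket i stops changing
            next_frontier = set()
            for u in frontier:
                for v, w in adj[u]:
                    nd = dist[u] + w    # relaxation: d_v <- min(d_v, d_u + w_uv)
                    if nd < dist[v]:
                        dist[v] = nd
                        b = int(nd // delta)
                        (next_frontier if b == i else buckets[b]).add(v)
            frontier = next_frontier
    return dict(dist)

# usage: dist = delta_stepping({0: [(1, 2.0)], 1: [(0, 2.0)]}, source=0, delta=4.0)
```

Small ∆ approaches Dijkstra (one node per bucket pass), large ∆ approaches Bellman-Ford (everything in one bucket), which is the time-work tradeoff mentioned above.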
Diameter approximation: high-level strategy

Based on shallow-depth clustering:
1. Compute a decomposition C of G into clusters of small radius
2. Estimate the diameter Φ(G) from the diameter Φ(G_C), where G_C is a suitable quotient graph derived from C (a minimal sketch of this step follows below)

Remarks
◮ Previous decompositions ([MPX13, Mey08]) do not guarantee small (unweighted + weighted) radius
◮ Cluster granularity is chosen so that G_C fits into local memory
◮ Small radius → low round complexity, better approximation
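To make step 2 concrete, here is an in-memory sketch of my own (a simplified formulation, not necessarily the paper's exact construction): each quotient edge is weighted by the cheapest center-to-center route through a single inter-cluster edge, Φ(G_C) is computed locally since G_C fits in memory, and the two largest cluster radii are added, which gives a valid but slightly looser bound than the worked example later in the talk.

```python
import heapq
from collections import defaultdict

def quotient_diameter_bound(adj, cluster_of, dist_to_center, radius):
    """adj: {u: [(v, w), ...]} (undirected, connected, >= 2 clusters);
    cluster_of[u]: cluster id of u; dist_to_center[u]: weighted distance of u
    from its cluster center; radius[c]: weighted radius of cluster c.
    Builds a quotient graph G_C and returns an upper bound on Phi(G)."""
    # Quotient edge (c1, c2): cheapest center-to-center route that uses
    # exactly one inter-cluster edge of G.
    qedges = {}
    for u, nbrs in adj.items():
        for v, w in nbrs:
            c1, c2 = cluster_of[u], cluster_of[v]
            if c1 == c2:
                continue
            key = (min(c1, c2), max(c1, c2))
            cand = dist_to_center[u] + w + dist_to_center[v]
            qedges[key] = min(qedges.get(key, float("inf")), cand)

    qadj = defaultdict(list)
    for (c1, c2), w in qedges.items():
        qadj[c1].append((c2, w))
        qadj[c2].append((c1, w))

    def dijkstra(src):                 # G_C fits in local memory: plain Dijkstra
        dist = {src: 0.0}
        pq = [(0.0, src)]
        while pq:
            d, x = heapq.heappop(pq)
            if d > dist[x]:
                continue
            for y, w in qadj[x]:
                if d + w < dist.get(y, float("inf")):
                    dist[y] = d + w
                    heapq.heappush(pq, (d + w, y))
        return dist

    clusters = set(cluster_of.values())
    phi_qc = max(max(dijkstra(c).values()) for c in clusters)
    # For any u, v: dist(u, v) <= r(C(u)) + dist_{G_C}(C(u), C(v)) + r(C(v)),
    # so Phi(G_C) plus the two largest radii upper-bounds Phi(G).
    largest = sorted(radius.values())[-2:]
    return phi_qc + sum(largest)
```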
Decomposition C: algorithm cluster(τ)

Challenges
Cluster centers are sampled at random. In order to attain small (unweighted + weighted) cluster radius we must:
1. Ensure higher sampling density in remote regions of the graph
2. Avoid heavy edges during cluster growth

Key ingredients
1. Progressive clustering strategy
2. ∆-stepping approach to cluster growing
Decomposition C: a pathological example [figure]
Decomposition C: algorithm cluster(τ)

Progressive clustering [CPPU15]
1. Select a random batch of τ centers from the uncovered nodes
2. Grow both old and new clusters until half of the uncovered nodes are covered
3. Repeat steps 1-2 until complete coverage

∆-stepping-like cluster growth [CPPU16]
◮ ∆ ← guess on the clusters' minimum weighted radius
◮ In each iteration of progressive clustering (steps 1-2):
  ◮ use only light edges (weight < ∆) and stop at radius ∆
  ◮ if the desired coverage cannot be obtained, then ∆ ← 2∆

(A sequential sketch of the combined strategy follows below.)
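A sequential, in-memory sketch of the combined strategy (hedged: the real algorithm stages the growth across MR rounds with ∆-stepping; here grow_clusters is a plain Dijkstra-style growth over light edges truncated at radius ∆, regrown from scratch whenever ∆ doubles, and the graph is assumed connected):

```python
import heapq
import math
import random

def cluster(adj, tau, delta0=1.0):
    """Sketch of cluster(tau) on a connected weighted graph adj: {u: [(v, w), ...]}.
    Returns (cluster_of, dist_to_center) mapping every node to a center."""
    cluster_of, dist_to_center = {}, {}
    uncovered, centers, delta = set(adj), [], delta0
    while uncovered:
        # Step 1: select a random batch of tau new centers among the uncovered nodes.
        centers.extend(random.sample(list(uncovered), min(tau, len(uncovered))))
        target = max(1, len(uncovered) // 2)    # Step 2: cover half of the uncovered nodes
        while True:
            covered = grow_clusters(adj, centers, delta, cluster_of, dist_to_center)
            still_uncovered = set(adj) - covered
            if len(uncovered) - len(still_uncovered) >= target or not still_uncovered:
                uncovered = still_uncovered
                break
            delta *= 2                          # coverage too small: double the radius guess
    return cluster_of, dist_to_center

def grow_clusters(adj, centers, delta, cluster_of, dist_to_center):
    """Grow all current clusters up to weighted radius delta, using only
    light edges (weight < delta): Dijkstra from all centers at once."""
    dist = {c: 0.0 for c in centers}
    owner = {c: c for c in centers}
    pq = [(0.0, c, c) for c in centers]
    heapq.heapify(pq)
    while pq:
        d, u, c = heapq.heappop(pq)
        if d > dist.get(u, math.inf):           # stale queue entry
            continue
        for v, w in adj[u]:
            if w >= delta:                      # skip heavy edges
                continue
            nd = d + w
            if nd <= delta and nd < dist.get(v, math.inf):
                dist[v], owner[v] = nd, c
                heapq.heappush(pq, (nd, v, c))
    cluster_of.update(owner)
    dist_to_center.update(dist)
    return set(owner)
```

The (cluster_of, dist_to_center) output matches the inputs assumed by the quotient-graph sketch shown earlier; the per-cluster radii it needs are simply the maxima of dist_to_center within each cluster.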
Algorithm cluster(τ): example (τ = 1, ∆ = 4)
[Figure: a weighted graph G on nodes labeled A-S; successive frames show the clusters grown from the 1st, 2nd, and 3rd batches of τ centers]
Decomposition C: algorithm cluster(τ)

Theorem. W.h.p., cluster(τ) computes a decomposition C of G into O(τ log² n) clusters, with
◮ max cluster radius O(R(G,τ) log n)
◮ round complexity O(min{n/τ, ℓ_{R(G,τ)}} · log n) on MR(n^ε, m), for any constant ε ∈ (0,1),
where
◮ R(G,τ) is the minimum max radius over all τ-clusterings of G
◮ ℓ_X is the max number of edges in a min-weight path of weight X
Diameter approximation: example
[Figure: a weighted graph G with weighted diameter Φ(G) = 16, decomposed into three clusters of radius 3, 4, and 2; the resulting quotient graph G_C has Φ(G_C) = 12, giving the estimate Φ(G) ≤ 12 + 4 + 2 = 18 (vs. the true 16)]
Diameter approximation: main result

Theorem. For a given weighted graph G, w.h.p. we can compute an upper bound on Φ(G) with
◮ approximation ratio O(log³ n)
◮ round complexity O(min{n/τ, ℓ_{R(G,τ)} · log n} · log n) on MR(n^ε, m), for any constant ε ∈ (0,1)

Remarks
◮ The round complexity becomes o(ℓ_{Φ(G)}/n^δ) on graphs of bounded doubling dimension
◮ Practical implementation: on real-world graphs, approximation ratio < 1.3
◮ Byproduct: linear-space, low-round k-center clustering in MR
Proof idea
◮ Two-phase decomposition strategy:
  ◮ Phase 1: compute an estimate R of R(G,τ) through progressive sampling
  ◮ Phase 2: perform log n iterations of cluster-growing steps of fixed radius R, from batches of centers selected with geometrically increasing probability
◮ O(log³ n) approximation: w.h.p. the nodes of each shortest-path segment of length R belong to O(log² n) clusters of radius O(R log n)
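A hedged back-of-the-envelope version of the O(log³ n) bound (the precise charging argument is in the paper): a shortest path of weight Φ(G) splits into ⌈Φ(G)/R⌉ segments of weight ≤ R; w.h.p. each segment meets O(log² n) clusters of weighted radius O(R log n), so the corresponding path in the quotient graph G_C has weight at most ⌈Φ(G)/R⌉ · O(log² n) · O(R log n) = O(Φ(G) log³ n). The returned estimate, Φ(G_C) plus the extremal cluster radii, is therefore at most O(log³ n) · Φ(G), while by construction it never underestimates Φ(G).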
Diameter approximation: experiments

Experimental setup
◮ In-house cluster with 16 machines
◮ 18 GB RAM / Intel i7 Nehalem 4-core processor each
◮ Spark MapReduce platform

Datasets
Graph         n               m                Φ(G)
roads-USA     23,947,347      29,166,673       55,859,820
roads-CAL     1,890,815       2,328,872        16,425,258
livejournal   3,997,962       32,681,189       9.41
twitter       41,652,230      1,468,365,182    9.07
mesh(S)       S²              2S(S−1)          †
R-MAT(S)      2^S             16·2^S           †
roads(S)      ≈ S·2.3·10^7    ≈ S·5.3·10^7     †
† the diameter depends on the size of the graph, controlled by S > 1.

[Scalability plot: running time (s) vs. number of machines (2¹-2⁴) for R-MAT(26) and roads(3)]
Diameter approximation: experiments

We compare our algorithm (CLUSTER) with ∆-stepping.

[Bar plots, log scale: number of rounds and running time (s) of CLUSTER vs. ∆-stepping on the benchmark graphs (roads-USA, roads-CAL, mesh, livejournal, twitter, R-MAT)]
Diameter approximation: experiments

[Bar plots: approximation ratio and total work of CLUSTER vs. ∆-stepping on the benchmark graphs]

The approximation quality does not depend on the granularity of the clustering.
Conclusions

Summary: an MR algorithm for an O(log³ n) approximation of the diameter of a large, undirected, weighted graph G
◮ o(ℓ_{Φ(G)}) rounds, linear global space, sublinear local space
◮ Good performance/approximation on real-world graphs

Ongoing and future work
◮ Tighter analysis of the approximation factor
◮ Clustering + constant doubling dimension yields a (1 + ε) (unweighted) diameter approximation in O((m + n)/ε) sequential time
◮ Clustering for approximate centrality computations

Software: GRADIAS, crono.dei.unipd.it/gradias