Partial Re-streaming Approach for Massive Graph Partitioning. Ghizlane ECHBARTHI, Hamamache KHEDDOUCI. Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon. http://liris.cnrs.fr
Introduction Current graph datasets are huge! (World Wide Web, Facebook, Twitter, biological networks, ...) Usual computations over these graphs become challenging!
Application Graph partitioning is an essential preprocessing step for distributed graph computations. Random graph partitioning is widely applied in distributed graph computation systems (Pregel, GraphLab, Horton, …) in order to run parallel algorithms. Instead of random partitioning, can we think of a more sophisticated partitioning strategy?
Outline 1. Related work 2. Proposed approach 3. Evaluation results 4. Conclusion
Related work: Streaming graph partitioning Streaming GP was first introduced in [1]. Each vertex arrives with its adjacency list. The partitioner is a heuristic that decides on which machine the current vertex will be placed. The total capacity K*C of the K machines must be large enough to hold the whole graph, where C is the capacity of each machine. [1]: Stanton and Kliot, Streaming graph partitioning for large distributed graphs, 2012.
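The one-pass streaming model described on this slide can be sketched as follows. This is a minimal illustration, not the implementation of [1]: the names `stream_partition` and `neighbor_count` are hypothetical, capacity is assumed to be counted in vertices, and the scoring function is a plain "most neighbors already there" greedy stand-in for a real heuristic.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A stream yields (vertex, adjacency list) pairs, as on the slide.
Stream = Iterable[Tuple[int, List[int]]]
Score = Callable[[Set[int], List[int]], float]


def neighbor_count(part: Set[int], neighbors: List[int]) -> float:
    """Assumed scoring rule: number of neighbors already in this part."""
    return float(sum(1 for u in neighbors if u in part))


def stream_partition(stream: Stream, k: int, capacity: int,
                     score: Score) -> Dict[int, int]:
    """Place each arriving vertex once, on the best-scoring non-full part."""
    parts: List[Set[int]] = [set() for _ in range(k)]
    assignment: Dict[int, int] = {}
    for v, neighbors in stream:
        open_parts = [i for i in range(k) if len(parts[i]) < capacity]
        if not open_parts:
            raise ValueError("k * capacity is too small to hold the graph")
        # Ties are broken in favor of the least loaded part.
        best = max(open_parts,
                   key=lambda i: (score(parts[i], neighbors), -len(parts[i])))
        parts[best].add(v)
        assignment[v] = best
    return assignment
```

The heuristic is passed in as a function so that a different scoring rule can be swapped in without touching the streaming loop.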
Related work: partitioning strategies LDG [1]: Fennel [2]: Restreaming [3]: a strategy that streams the graph dataset several times in order to improve the partition quality (ReLDG and ReFENNEL). [1] Stanton et al. 2012. [2] Charalampos et al. 2012. [3] Nishimura et al. 2013
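The scoring rules of LDG and Fennel do not survive in this transcript. The sketch below uses the forms commonly given in the cited papers, stated here as an assumption: LDG scores a part by its shared neighbors weighted by the remaining capacity, and Fennel subtracts a size penalty `alpha * gamma * size**(gamma - 1)`. The function `restream_ldg` is a hypothetical, simplified restreaming loop in which a neighbor not yet seen in the current pass keeps its placement from the previous pass.

```python
from typing import Dict, List


def ldg_score(shared: int, size: int, capacity: float) -> float:
    """LDG: shared neighbors, discounted by how full the part is."""
    return shared * (1.0 - size / capacity)


def fennel_score(shared: int, size: int, alpha: float,
                 gamma: float = 1.5) -> float:
    """Fennel: shared neighbors minus a size penalty (gamma assumed 1.5)."""
    return shared - alpha * gamma * size ** (gamma - 1.0)


def restream_ldg(adj: Dict[int, List[int]], k: int, capacity: int,
                 passes: int) -> Dict[int, int]:
    """Stream the whole graph `passes` times with the LDG rule."""
    previous: Dict[int, int] = {}
    for _ in range(passes):
        current: Dict[int, int] = {}
        sizes = [0] * k  # part sizes are rebuilt on every pass
        for v, neighbors in adj.items():
            shared = [0] * k
            for u in neighbors:
                # Prefer this pass's placement, else last pass's.
                p = current.get(u, previous.get(u))
                if p is not None:
                    shared[p] += 1
            open_parts = [i for i in range(k) if sizes[i] < capacity]
            if not open_parts:
                raise ValueError("k * capacity is too small")
            best = max(open_parts,
                       key=lambda i: (ldg_score(shared[i], sizes[i], capacity),
                                      -sizes[i]))
            current[v] = best
            sizes[best] += 1
        previous = current
    return previous
```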
Related work summary (timeline): since the 90's; in 2012; in 2013.
Proposed approach: Partial Restreaming (PR) The PR method consists in re-streaming only a portion of the graph dataset. Advantages: less information to store; less time to run. There are two versions of the PR method: Simple Partial Restreaming partitioning and Selective Partial Restreaming partitioning. N.B.: The partitioning heuristics used are Linear Deterministic Greedy (LDG) [1] and Fennel [2]. [1] Stanton et al. 2012. [2] Charalampos et al. 2012.
Proposed approach: Simple Partial Restreaming The simple partial re-streaming method consists of two major phases: Phase 1: The first loaded portion of the graph dataset, of size S, is re-streamed several times. Phase 2: The rest of the graph dataset is streamed once.
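The two phases above can be sketched as follows, under stated assumptions. The names `simple_partial_restream` and `_place` are hypothetical, the heuristic is an LDG-style rule, capacity is counted in vertices, and the part sizes reached at the end of Phase 1 are assumed to carry over into Phase 2. The slides do not specify these details.

```python
from typing import Dict, List, Tuple

Stream = List[Tuple[int, List[int]]]  # (vertex, adjacency list) in arrival order


def _place(v: int, neighbors: List[int], k: int, capacity: int,
           sizes: List[int], current: Dict[int, int],
           previous: Dict[int, int]) -> None:
    """LDG-style placement of one vertex (assumed heuristic)."""
    shared = [0] * k
    for u in neighbors:
        p = current.get(u, previous.get(u))
        if p is not None:
            shared[p] += 1
    open_parts = [i for i in range(k) if sizes[i] < capacity]
    if not open_parts:
        raise ValueError("k * capacity is too small")
    best = max(open_parts,
               key=lambda i: (shared[i] * (1.0 - sizes[i] / capacity),
                              -sizes[i]))
    current[v] = best
    sizes[best] += 1


def simple_partial_restream(stream: Stream, k: int, capacity: int,
                            portion_size: int, passes: int) -> Dict[int, int]:
    head, tail = stream[:portion_size], stream[portion_size:]
    previous: Dict[int, int] = {}
    sizes = [0] * k
    # Phase 1: re-stream the first portion of size S several times.
    for _ in range(passes):
        current: Dict[int, int] = {}
        sizes = [0] * k
        for v, neighbors in head:
            _place(v, neighbors, k, capacity, sizes, current, previous)
        previous = current
    # Phase 2: stream the rest of the dataset once.
    assignment = dict(previous)
    for v, neighbors in tail:
        _place(v, neighbors, k, capacity, sizes, assignment, {})
    return assignment
```

Only the placements of the first `portion_size` vertices need to be kept between passes, which is where the reduced storage comes from in this sketch.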
Proposed approach: Selective Partial Restreaming The selective partial re-streaming method consists of two major phases: Phase 1: Select a portion with a high average degree and density, and re-stream it several times. Phase 2: The rest of the graph dataset is streamed once.
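The slide does not say how the high-degree, dense portion is chosen. As one possible illustration, the sketch below simply ranks vertices by degree and keeps the top S, then reports the average degree and the density of the induced subgraph. The names `select_portion` and `portion_stats` and the ranking criterion are assumptions, not the authors' selection procedure.

```python
from typing import Dict, List, Tuple


def select_portion(adj: Dict[int, List[int]],
                   portion_size: int) -> Tuple[List[int], List[int]]:
    """Split vertices into (selected, rest) by descending degree (assumed rule)."""
    ranked = sorted(adj, key=lambda v: (-len(adj[v]), v))
    selected = ranked[:portion_size]
    chosen = set(selected)
    rest = [v for v in adj if v not in chosen]
    return selected, rest


def portion_stats(adj: Dict[int, List[int]],
                  portion: List[int]) -> Tuple[float, float]:
    """Return (average degree, density of the induced subgraph)."""
    members = set(portion)
    n = len(members)
    if n == 0:
        return 0.0, 0.0
    avg_degree = sum(len(adj[v]) for v in members) / n
    # Each internal edge is seen from both endpoints, hence the halving.
    internal = sum(1 for v in members for u in adj[v] if u in members) // 2
    density = 2.0 * internal / (n * (n - 1)) if n > 1 else 0.0
    return avg_degree, density
```

The selected vertices would then be re-streamed several times (Phase 1) and the remaining ones streamed once (Phase 2). The extra selection step is consistent with the higher runtime reported for the selective variant.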
Evaluation setup Datasets used: all datasets are obtained from the SNAP repository [1]. Parameters: k = 40 parts; s = 10 streams; half of the graph dataset is re-streamed. [1]: http://snap.stanford.edu/data/
Evaluation results Comparing Partial Restreaming and Full Restreaming methods [Figure annotations: Difference = 5.9%; Difference = 2.3%]
Evaluation results Comparing Simple Partial Restreaming and Selective Partial Restreaming
Evaluation results Computing the run time gain The run time gain is approximately 50% when re-streaming half of the graph.
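A back-of-the-envelope estimate is in line with this figure. Assuming, as a simplification, that run time is proportional to the number of vertex placements, full restreaming of n vertices over s passes costs s*n placements, while partial restreaming of a portion of size S costs s*S + (n - S). The function names below are hypothetical.

```python
def placements_partial(n: int, portion: int, passes: int) -> int:
    """Placements for partial restreaming: portion re-streamed, rest once."""
    return passes * portion + (n - portion)


def estimated_gain(n: int, portion: int, passes: int) -> float:
    """Fraction of placements saved relative to full restreaming."""
    full = passes * n
    return 1.0 - placements_partial(n, portion, passes) / full


# With s = 10 passes and half of the graph re-streamed, the estimate is
# about 0.45, of the same order as the measured gain of roughly 50%.
gain = estimated_gain(n=1000, portion=500, passes=10)
```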
Conclusion The simple PR method reduces the runtime while delivering partitions of a quality comparable to that of full restreaming. The selective PR method improves the partition quality compared to the simple PR method. However, selective PR is costlier than simple PR in terms of runtime.
Thanks for your attention!