Processing Massive Sized Graphs using Sector/Sphere

Yunhong Gu, Li Lu: University of Illinois at Chicago
Robert Grossman: University of Chicago and Open Data Group
Andy Yoo: Lawrence Livermore National Laboratory
Background

- Processing very large graphs (billions of vertices) is important in many real-world applications (e.g., social networks)
- Traditional systems are often complicated to use and/or expensive to build
- Processing graphs in a distributed setting requires shared data access or complicated data movement
- This paper investigates how to support large graph processing with a "cloud"-style compute system: a data-centric model with a simplified API (e.g., MapReduce)
Overview

- Sector/Sphere: an in-storage data processing framework
- Graph breadth-first search
- Experimental results
- Conclusion
Sector/Sphere

- Sector: distributed file system
  - Runs on clusters of commodity computers
  - Software fault tolerance with replication
  - Topology aware
  - Application aware
- Sphere: parallel data processing framework
  - In-storage processing
  - User-defined functions (UDFs) applied to data segments in parallel
  - Load balancing and fault tolerance
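The slides do not reproduce Sphere's actual C++ API, so the following is a minimal self-contained sketch of the in-storage processing idea only: a user-defined function is applied, in parallel, to each stored data segment and writes its results into output buckets. The names udf_t and process_segments are illustrative, not Sphere's real interface; Sphere itself would also run each UDF on the node that stores the segment rather than in local threads.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    // Illustrative UDF type (not Sphere's real signature): consumes one
    // data segment and appends records to a set of output buckets.
    using udf_t = std::function<void(const std::string& segment,
                                     std::vector<std::string>& buckets)>;

    // Apply the UDF to every segment in parallel, one thread per segment.
    // In Sphere itself each segment is processed on the node that stores
    // it (input locality); local threads merely stand in for that here.
    void process_segments(const std::vector<std::string>& segments,
                          const udf_t& udf, std::size_t num_buckets,
                          std::vector<std::vector<std::string>>& outputs) {
        outputs.assign(segments.size(), std::vector<std::string>(num_buckets));
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < segments.size(); ++i)
            workers.emplace_back([&, i] { udf(segments[i], outputs[i]); });
        for (std::thread& w : workers) w.join();
    }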
Parallel Data Processing Framework

[Architecture diagram: three layers. Data Storage: a locality-aware distributed file system holding input segments (Seg. x, y, z) on disks. Data Processing: user-defined functions applied to the segments in parallel (the MapReduce analogue). Data Exchanging: bucket writers hash the UDF outputs into output segments (Seg. 1 ... Seg. n) on disks, analogous to the reduce step.]
Key Performance Factors

- Input locality: data is processed on the node where it resides, or on a nearby node
- Output locality: output data can be placed at locations chosen so that data movement is reduced in later processing stages
- In-memory objects: frequently accessed data may be kept in memory
Output Locality: An Example

Join two datasets: scan each one independently (UDF 1 over DataSet 1, UDF 2 over DataSet 2) and write their results into a shared set of buckets so that matching records land together; then merge each result bucket with a UDF-Join.
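A minimal sketch of this pattern, with hypothetical names (Record, scan_into_buckets, join_bucket): both scan UDFs hash records on the join key into the same bucket numbering, so the join UDF can process each bucket pair locally without touching any other bucket. This is the output-locality benefit the slide describes.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Record { int key; std::string value; };

    // Scan UDF: hash each record on its join key, so records with the
    // same key, from either dataset, land in the same-numbered bucket.
    void scan_into_buckets(const std::vector<Record>& dataset,
                           std::vector<std::vector<Record>>& buckets) {
        for (const Record& r : dataset)
            buckets[static_cast<std::size_t>(r.key) % buckets.size()].push_back(r);
    }

    // Join UDF: runs per bucket pair; thanks to the shared hashing it
    // never needs to read any other bucket.
    void join_bucket(const std::vector<Record>& b1,
                     const std::vector<Record>& b2) {
        std::unordered_multimap<int, const Record*> index;
        for (const Record& r : b1) index.emplace(r.key, &r);
        for (const Record& r : b2) {
            auto range = index.equal_range(r.key);
            for (auto it = range.first; it != range.second; ++it)
                std::printf("%d: %s | %s\n", r.key,
                            it->second->value.c_str(), r.value.c_str());
        }
    }

    int main() {
        const std::size_t kBuckets = 4;
        std::vector<Record> d1 = {{1, "a"}, {2, "b"}, {5, "c"}};
        std::vector<Record> d2 = {{1, "x"}, {5, "y"}, {7, "z"}};
        std::vector<std::vector<Record>> b1(kBuckets), b2(kBuckets);
        scan_into_buckets(d1, b1);                  // UDF 1 over DataSet 1
        scan_into_buckets(d2, b2);                  // UDF 2 over DataSet 2
        for (std::size_t i = 0; i < kBuckets; ++i)
            join_bucket(b1[i], b2[i]);              // UDF-Join per bucket
        return 0;
    }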
Graph BFS

[Diagram: an example graph illustrating a breadth-first search between two vertices a and b.]
Data Segmentation

- The graph is stored as adjacency lists
- Each segment contains approximately the same number of edges
- The edges belonging to the adjacency list of one vertex will not be split across two segments
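A minimal sketch of this segmentation rule, with hypothetical names (AdjList, segment_by_edges): adjacency lists are packed greedily into segments with a target edge budget, and a vertex's list is always kept whole even if that slightly overshoots the budget.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // One adjacency list: a vertex ID and its neighbor IDs.
    using AdjList = std::pair<long long, std::vector<long long>>;
    using Segment = std::vector<AdjList>;

    // Pack adjacency lists into segments of roughly equal edge count.
    // A single vertex's list is never split across two segments, so a
    // segment may slightly exceed the target before it is closed.
    std::vector<Segment> segment_by_edges(const std::vector<AdjList>& graph,
                                          std::size_t target_edges) {
        std::vector<Segment> segments(1);
        std::size_t edges_in_current = 0;
        for (const AdjList& adj : graph) {
            if (edges_in_current > 0 &&
                edges_in_current + adj.second.size() > target_edges) {
                segments.emplace_back();  // close segment, start a new one
                edges_in_current = 0;
            }
            segments.back().push_back(adj);
            edges_in_current += adj.second.size();
        }
        return segments;
    }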
Sphere UDF for Graph BFS

Basic idea: scan each data segment, find the neighbors of the vertices in the current level, and generate the next level as the union of those neighbors; repeat until the destination is found.

Sphere UDF for unidirectional BFS:
- Input: graph data segment x and current-level segment l_x
- Invariant: if a vertex appears in level segment l_x, then its adjacency list must reside in graph data segment x
- For each vertex in level segment l_x, find its neighbor vertices in data segment x, and label each neighbor with a bucket ID chosen so that the invariant above holds for the next level
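A minimal sketch of the per-segment step, with hypothetical names (GraphSegment, owner_bucket, bfs_udf): the UDF expands every current-level vertex whose adjacency list lives in this segment and routes each discovered neighbor to the bucket of the segment that owns its adjacency list, which is exactly what preserves the invariant above between iterations. Filtering of already-visited vertices is omitted for brevity.

    #include <cstddef>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // One graph segment: vertex -> adjacency list (only vertices whose
    // lists are stored in this segment).
    using GraphSegment = std::unordered_map<long long, std::vector<long long>>;

    // Hypothetical ownership function: maps a vertex to the ID of the
    // segment (and hence bucket) that stores its adjacency list.
    std::size_t owner_bucket(long long vertex, std::size_t num_segments) {
        return static_cast<std::size_t>(vertex) % num_segments;
    }

    // BFS UDF for one segment: expand the local part of the current
    // level; the union of all buckets with the same ID, across all
    // segments, forms the next level at the segment that needs it.
    void bfs_udf(const GraphSegment& segment,
                 const std::vector<long long>& current_level,  // local part
                 std::size_t num_segments,
                 std::vector<std::vector<long long>>& next_level_buckets) {
        next_level_buckets.assign(num_segments, {});
        std::unordered_set<long long> emitted;  // de-duplicate neighbors
        for (long long v : current_level) {
            auto it = segment.find(v);  // by the invariant, v's list is here
            if (it == segment.end()) continue;
            for (long long n : it->second)
                if (emitted.insert(n).second)
                    next_level_buckets[owner_bucket(n, num_segments)].push_back(n);
        }
    }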
Experimental Setup

- Data
  - PubMed: 28 million vertices, 542 million edges, 6 GB of data
  - PubMedEx: 5 billion vertices, 118 billion edges, 1.3 TB of data
- Testbed
  - Open Cloud Testbed: JHU, UIC, StarLight, Calit2
  - 4 racks, 120 nodes, 10 GbE interconnect
Average Time Cost (seconds) on PubMed using 20 Servers

Path Length   Count   Percent   Avg Time (Uni-BFS)   Avg Time (Bi-BFS)
2             28      10.8      21                   25
3             85      32.7      26                   29
4             88      33.8      38                   33
5             34      13.1      70                   42
6             13      5.0       69                   42
7             7       2.7       88                   51
8             5       1.9       84                   54
Total         260     100.0     40 (avg)             33 (avg)
Performance Impact of Various Components in Sphere

Component Change                                 Time Cost Change
Without in-memory objects                        117%
Without bucket location optimization             146%
With bucket combiner                             106%
With bucket fault tolerance                      110%
Data segmentation by equal number of vertices    118%
Average Time Cost (seconds) on PubMedEx using 60 Servers

Path Length   Count   Percent   Avg Time
2             11      4.2       56
3             1       0.4       82
4             60      23.2      79
5             141     54.2      197
6             45      17.3      144
7             2       0.7       201
Total         260     100.0     156 (avg)
Average Time Cost (seconds) on PubMedEx on 19, 41, 59, and 83 Servers

Servers   Len 3 (n=1)   Len 4 (n=24)   Len 5 (n=58)   Len 6 (n=16)   Len 7 (n=1)   Avg
19        112           257            274            327            152           275
41        153           174            174            165            280           174
59        184           150            157            140            124           153
83        214           145            146            138            192           147
Conclusion

- Very large graphs can be processed with a "cloud" compute model such as Sphere
- Performance is comparable to traditional systems, while the development effort is much smaller (less than 1,000 lines of code for BFS)
- A BFS-type query completes in a few minutes
- Future work: concurrent queries