

  1. Processing Massive Sized Graphs using Sector/Sphere
     Yunhong Gu, Li Lu: University of Illinois at Chicago
     Robert Grossman: University of Chicago and Open Data Group
     Andy Yoo: Lawrence Livermore National Laboratory

  2. Background
- Very large graph processing (billions of vertices) is important in many real-world applications (e.g., social networks)
- Traditional systems are often complicated to use and/or expensive to build
- Processing graphs in a distributed setting requires shared data access or complicated data movement
- This paper investigates how to support large graph processing with a "cloud"-style compute system
  - Data-centric model, simplified API
  - E.g., MapReduce

  3. Overview
- Sector/Sphere
- In-Storage Data Processing Framework
- Graph Breadth-First Search
- Experimental Results
- Conclusion

  4. Sector/Sphere
- Sector: distributed file system
  - Runs on clusters of commodity computers
  - Software fault tolerance with replication
  - Topology aware
  - Application aware
- Sphere: parallel data processing framework
  - In-storage processing
  - User-defined functions (UDFs) on data segments in parallel
  - Load balancing and fault tolerance

  5. Parallel Data Processing Framework
- Data Storage: locality-aware distributed file system
- Data Processing: MapReduce, user-defined functions (sketched after this slide)
- Data Exchanging: hash, reduce
[Diagram: input segments on disk are processed by UDFs; results are hashed to bucket writers, which write output segments back to disk]
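The following is a minimal, self-contained Python sketch of the processing pattern this slide describes: a user-defined function is applied to each locally stored input segment, and its results are hashed into buckets that become the output segments of the stage. The function names and calling convention are illustrative assumptions, not the real Sphere C++ API.

```python
# Sketch of the UDF -> hash -> bucket-writer pipeline (illustrative only).
from collections import defaultdict

def run_stage(segments, udf, num_buckets, bucket_id):
    """Apply `udf` to every record of every segment; route each result
    record to a bucket chosen by `bucket_id` (e.g., a hash of the key)."""
    buckets = defaultdict(list)
    for segment in segments:          # in the real system, one node per segment
        for record in segment:
            for result in udf(record):
                buckets[bucket_id(result) % num_buckets].append(result)
    return [buckets[b] for b in range(num_buckets)]

# Example: count word occurrences, partitioned into 4 output segments.
segments = [["a b a"], ["b c"]]
udf = lambda line: [(w, 1) for w in line.split()]
outputs = run_stage(segments, udf, 4, bucket_id=lambda kv: hash(kv[0]))
```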

  6. Key Performance Factors
- Input locality
  - Data is processed on the node where it resides, or on the nearest nodes (a scheduling sketch follows this slide)
- Output locality
  - Output data can be placed so that data movement is reduced in further processing
- In-memory objects
  - Frequently accessed data may be stored in memory
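To make the input-locality idea concrete, here is an illustrative Python sketch of locality-preferring assignment: run a UDF on a node that already holds a replica of the segment when possible, otherwise fall back to the least-loaded node. This is an assumption about the general technique, not the actual Sphere scheduler.

```python
def pick_node(segment_replicas, node_load):
    """segment_replicas: set of nodes holding a copy of the segment.
    node_load: dict node -> number of segments already assigned."""
    local = [n for n in segment_replicas if n in node_load]
    candidates = local if local else list(node_load)
    chosen = min(candidates, key=lambda n: node_load[n])  # least-loaded candidate
    node_load[chosen] += 1
    return chosen

load = {"node-A": 0, "node-B": 0, "node-C": 0}
print(pick_node({"node-B", "node-C"}, load))  # prefers a node that stores the data
```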

  7. Output Locality: An Example
- Join two datasets
- Scan each one independently and put their results into the same set of buckets
- Merge the result buckets (sketched after this slide)
[Diagram: UDF 1 scans DataSet 1 and UDF 2 scans DataSet 2; co-located result buckets are merged by a join UDF]
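Here is an illustrative Python sketch of this bucket join: both datasets are scanned independently, records are hashed by join key into the same set of buckets, and each bucket is then joined locally. The helper names and bucket count are assumptions for the example.

```python
from collections import defaultdict

NUM_BUCKETS = 4

def scan_to_buckets(dataset, tag, buckets):
    # UDF 1 / UDF 2: scan one dataset, route records by hash of the join key.
    for key, value in dataset:
        buckets[hash(key) % NUM_BUCKETS].append((key, tag, value))

def join_bucket(bucket):
    # UDF-Join: merge one bucket that holds records from both datasets.
    left, right = defaultdict(list), defaultdict(list)
    for key, tag, value in bucket:
        (left if tag == "L" else right)[key].append(value)
    return [(k, lv, rv) for k in left if k in right
            for lv in left[k] for rv in right[k]]

buckets = defaultdict(list)
scan_to_buckets([("x", 1), ("y", 2)], "L", buckets)      # DataSet 1
scan_to_buckets([("x", "a"), ("z", "b")], "R", buckets)  # DataSet 2
joined = [row for b in buckets.values() for row in join_bucket(b)]
print(joined)  # [('x', 1, 'a')]
```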

  8. Graph BFS
[Diagram: example graph illustrating breadth-first search between vertices a and b]

  9. Data Segmentation
- Adjacency list representation
- Each segment contains approximately the same number of edges
- Edges belonging to the adjacency list of one vertex will not be split into two segments (see the sketch after this slide)
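An illustrative Python sketch of this segmentation rule: pack adjacency lists into segments of roughly equal edge counts, never splitting one vertex's adjacency list across two segments. The function and parameter names are assumptions, not the actual Sector code.

```python
def segment_adjacency_lists(adj, target_edges):
    """adj: dict vertex -> list of neighbor vertices.
    Returns a list of segments, each a dict of whole adjacency lists."""
    segments, current, count = [], {}, 0
    for vertex, neighbors in adj.items():
        if current and count + len(neighbors) > target_edges:
            segments.append(current)        # close the segment when it is "full"
            current, count = {}, 0
        current[vertex] = neighbors         # a vertex's list is never split
        count += len(neighbors)
    if current:
        segments.append(current)
    return segments

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
print(segment_adjacency_lists(graph, target_edges=3))
```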

  10. Sphere UDF for Graph BFS
- Basic idea: scan each data segment, find the neighbors of the current level, and generate the next level, which is the union of the neighbors of all vertices in the current level. Repeat until the destination is found. (A sketch follows this slide.)
- Sphere UDF for unidirectional BFS
  - Input: graph data segment x and current-level segment l_x. If a vertex appears in level segment l_x, then it must exist in graph data segment x
  - For each vertex in level segment l_x, find its neighbor vertices in data segment x, and label each neighbor vertex with a bucket ID so that the above relationship is preserved
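Below is a minimal Python sketch of this level-synchronous BFS over segments, reusing the segmentation idea from the previous slide. Each "UDF call" scans one graph segment together with its current-level vertices and emits next-level vertices, which are bucketed to the segments that own them. This is an algorithmic illustration under those assumptions, not the real Sphere UDF interface.

```python
def bfs_udf(segment, current_level):
    """One UDF invocation: neighbors (within this segment) of current-level vertices."""
    next_level = set()
    for vertex in current_level:
        next_level.update(segment.get(vertex, []))
    return next_level

def distributed_bfs(segments, source, destination):
    # Bucket ID for a vertex = index of the segment that stores its adjacency list.
    owner = {v: i for i, seg in enumerate(segments) for v in seg}
    visited, level, hops = {source}, {source}, 0
    while level and destination not in level:
        # Route current-level vertices to the segments that own them.
        per_segment = [set() for _ in segments]
        for v in level:
            per_segment[owner[v]].add(v)
        # One UDF call per segment (run in parallel in the real system).
        frontier = set()
        for seg, cur in zip(segments, per_segment):
            frontier |= bfs_udf(seg, cur)
        level = frontier - visited
        visited |= level
        hops += 1
    return hops if destination in level or destination == source else None

segments = [{"a": ["b"], "b": ["a", "c"]}, {"c": ["b", "d"], "d": ["c"]}]
print(distributed_bfs(segments, "a", "d"))  # 3
```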

  11. Experiment Setup
- Data
  - PubMed: 28M vertices, 542M edges, 6 GB of data
  - PubMedEx: 5B vertices, 118B edges, 1.3 TB of data
- Testbed
  - Open Cloud Testbed: JHU, UIC, StarLight, Calit2
  - 4 racks, 120 nodes, 10 GE interconnection

  12. Average Time Cost (seconds) on PubMed using 20 Servers

  Length   Count   Percent   Avg Time Uni-BFS   Avg Time Bi-BFS
  2        28      10.8      21                 25
  3        85      32.7      26                 29
  4        88      33.8      38                 33
  5        34      13.1      70                 42
  6        13      5         69                 42
  7        7       2.7       88                 51
  8        5       1.9       84                 54
  Total    260               40                 33

  13. Performance Impact of Various Components in Sphere

  Component Change                                    Time Cost Change
  Without in-memory objects                           117%
  Without bucket location optimization                146%
  With bucket combiner                                106%
  With bucket fault tolerance                         110%
  Data segmentation by the same number of vertices    118%

  14. The Average Time Cost (seconds) on PubMedEx using 60 Servers

  Length   Count   Percent   Avg Time
  2        11      4.2       56
  3        1       0.4       82
  4        60      23.2      79
  5        141     54.2      197
  6        45      17.3      144
  7        2       0.7       201
  Total    260               156

  15. The Average Time Cost (seconds) on PubMedEx on 19, 41, 61, 83 Servers

  Group        3     4     5     6     7
  Count        1     24    58    16    1

  Servers #                                  AVG
  19           112   257   274   327   152   275
  41           153   174   174   165   280   174
  59           184   150   157   140   124   153
  83           214   145   146   138   192   147

  16. Conclusion
- We can process very large graphs with a "cloud" compute model such as Sphere
- Performance is comparable to traditional systems, while development effort is modest (less than 1,000 lines of code for BFS)
- A BFS-type query can be completed in a few minutes
- Future work: concurrent queries
