Evaluating Use of Data Flow Systems for Large Graph Analysis
Andy Yoo and Ian Kaplan
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Graph mining techniques have been widely used in many important applications in recent years
• Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)
So-called “scale-free” graphs can carry rich information
Graph Mining Applications: Web Search
• Google’s PageRank uses a web graph to rank web pages for given queries
• Related applications
  – Personalized web search
  – People search
  – Eigenvalue/eigenvector
  – Random walk with restart
Graph Mining Applications: Social Network Analysis
Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002)
Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005)
Figure: Zachary’s Karate Club, 1977, divided into two groups centered around two individuals, 1 and 34
Graph Mining Applications: Protein Clustering
Proteins with similar functions can be discovered by clustering protein modules in protein-protein interaction graphs.
Figure: Protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 1021, 2006)
Graph Mining Applications: National Security
• Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. R. Ullmann, 1976)
• Other related applications
  – Exact and inexact pattern discovery
  – Fraud detection
  – Cyber security
  – Behavioral prediction
T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, ACM, 2004
Challenges
• High complexity of graph mining algorithms
  – Common graph mining algorithms have high-order computational complexity
    • High-order algorithms (O(N^2) or higher): PageRank, community finding, path traversal
    • NP-Complete algorithms: maximal cliques, subgraph pattern matching
• Large data size requires out-of-core approaches
  – Graphs with 10^9 or more nodes and edges are increasingly common
  – Intermediate results grow exponentially in many cases
Traditional relational databases have been used in large graph analysis
• Due to their prevalence and ease of use, conventional database systems have been used in graph analysis
• Designed for transaction processing
• Poor performance and scalability
Distribution of Response Time for 100 Bi-directional Searches
  10+ minutes       5%
  5 - 10 minutes    2%
  2 - 5 minutes    30%
  1 - 2 minutes    37%
  < 1 minute       26%
300B node graph search on Netezza on 700-node NPS (SC’06)
120B node graph search on 60-node MSSG (Cluster ’06)
The many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce
• Map/Reduce is a popular many-tasks model being used for a wide range of applications
• Map/Reduce model
  – An M/R program consists of many map and reduce tasks
  – Each task works independently
  – Data passes between mappers and reducers via intermediate files
  – Processes lists of (key, value) pairs
• Is Map/Reduce for everything?
Figure: Map/Reduce model
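The Map/Reduce model described on this slide can be illustrated with a minimal single-process Python sketch. It is not taken from the talk: the function names (`map_task`, `shuffle`, `reduce_task`, `run_map_reduce`) and the vertex-degree example are assumptions chosen for illustration, and the in-memory `shuffle` step merely stands in for the intermediate files that a real Map/Reduce system writes between mappers and reducers.

```python
from collections import defaultdict


def map_task(record):
    """Map: emit (key, value) pairs. Here: one (vertex, 1) per edge endpoint."""
    src, dst = record
    yield src, 1
    yield dst, 1


def shuffle(pairs):
    """Group values by key. Stands in for the intermediate files (assumption)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values seen for one key. Here: sum to get the degree."""
    return key, sum(values)


def run_map_reduce(records):
    """Run every map task, group the intermediate pairs, then run every reduce task."""
    intermediate = [pair for record in records for pair in map_task(record)]
    groups = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in groups.items())


# Hypothetical toy edge list; each record is handled independently by a mapper.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = run_map_reduce(edges)
print(degrees)
```

Each map and reduce call touches only its own record or key, which is why the tasks can run independently; the only coupling is the grouped intermediate data.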
Map/Reduce model is too limited for large complex graph analysis
• Map/Reduce has been used successfully for some applications
  – Inverted index construction
  – Distributed sort
  – Term-vector calculation
  – PageRank
• Drawbacks
  – Model limited to embarrassingly parallel applications
  – Poor performance and scalability (due to poor handling of intermediate results)
BFS Search Results
  System       Platform                Time (Sec)
  Map/Reduce   20-node Fenix Cluster   1068
  SGRACE       64-node Tuson Cluster   221
Annotation: 333.75 Sec/64 Nodes (consistent with scaling the Map/Reduce time linearly from 20 to 64 nodes: 1068 × 20 / 64)
The full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used
Dataflow model is a promising alternative to address these issues
• More flexible and complex than Map/Reduce (Map/Reduce on steroids!!)
• Many independent tasks accessing external data in parallel, realizing data parallelism
  – Tasks triggered by the availability of data
  – No flow of control
  – Data parallel and independent
• We evaluated the use of the dataflow model for large graph analysis in this work
Figure: Dryad dataflow diagram
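The idea that tasks are triggered by the availability of data, with no explicit flow of control, can be sketched in a few lines of Python. This is only an illustrative toy, not the Dryad or DAS execution engine: `run_dataflow`, the node names, and the tiny edge list are all assumptions introduced here. The node that depends on the other two is deliberately declared first to show that declaration order does not decide when a task fires.

```python
def run_dataflow(nodes, inputs):
    """Fire each node as soon as all of its inputs are available.

    nodes:  name -> (function, [names of the values it consumes])
    inputs: name -> externally supplied value
    Returns the computed values and the order in which nodes fired.
    """
    values = dict(inputs)
    pending = dict(nodes)
    fired = []
    progress = True
    while pending and progress:
        progress = False
        for name in list(pending):
            func, deps = pending[name]
            if all(dep in values for dep in deps):  # data available: task fires
                values[name] = func(*(values[dep] for dep in deps))
                fired.append(name)
                del pending[name]
                progress = True
    if pending:
        raise ValueError(f"inputs never became available for: {sorted(pending)}")
    return values, fired


def distinct_sources(edges):
    return sorted({src for src, _ in edges})


def out_degree(sorted_edges, sources):
    return {v: sum(1 for src, _ in sorted_edges if src == v) for v in sources}


# Hypothetical graph and dataflow; "out_degree" is listed first on purpose.
edges = [(2, 3), (1, 2), (1, 3)]
nodes = {
    "out_degree": (out_degree, ["sorted_edges", "sources"]),
    "sorted_edges": (sorted, ["edges"]),
    "sources": (distinct_sources, ["edges"]),
}
values, fired = run_dataflow(nodes, {"edges": edges})
print(fired)
print(values["out_degree"])
```

In this sketch `sorted_edges` and `sources` are independent of each other, so a real engine could run them in parallel; the loop here simply evaluates them one after another.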
We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer
DAS
• Parallel dataflow engine on commodity clusters
• Specialized high-performance library
  – Streaming data pipelined for maximum in-memory processing
  – Sequentialized disk accesses
  – Optimized for SORT and JOIN operations
• Offers great flexibility for optimization
VS.
RDBMS
• Sequential or parallel relational database systems on commodity HW
• Optimized for transaction processing
• Ubiquitous
• Relatively easy to use
• Relies on SQL compiler for optimization
DAS programming and execution environment
• Uses ECL, a proprietary dataflow language
• Built-in ECL data manipulation constructs are implemented in a highly optimized library
  – JOIN, SORT, MERGE, etc.
• Unlike SQL, these low-level constructs are suitable for complex graph operations
Figure labels: ECL Code, ECL Compiler, ECL Library, C++ Code, Executable, CE CE CE …
An example ECL code
[ECL listing not legible in this copy]
We evaluated some of the most commonly used applications in our experiments
Applications evaluated on DAS System
  Path Traversal       Uni- and Bi-directional BFS
  Pattern Matching     Find subgraphs that match a given template
  TeraByte (TB) Sort   Jim Gray’s SORT Benchmark
  PageRank             Eigenvector using power method
  Disambiguation       Binning-based coreference resolution
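For the PageRank entry, "eigenvector using power method" means repeatedly applying the link matrix to a rank vector until it stops changing. The Python sketch below is a generic textbook version under stated assumptions, not the ECL implementation evaluated in the talk: the damping factor 0.85, the uniform redistribution of rank from vertices without out-links, and the function name `pagerank` are all choices made here for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on vertices 0..num_nodes-1.

    damping=0.85 and the dangling-vertex handling are assumptions,
    not values taken from the slides.
    """
    out_links = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        # Rank held by vertices with no out-links is spread uniformly.
        dangling = sum(rank[v] for v in range(num_nodes) if not out_links[v])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for v in range(num_nodes):
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # converged: the vector is (numerically) a fixed point
            break
    return rank


# Hypothetical 3-vertex graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], num_nodes=3)
print(ranks)
```

Each iteration is essentially a join of the edge list with the current rank vector followed by a grouped sum, which is why the algorithm maps naturally onto JOIN-oriented dataflow operators.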
Real-world graphs are used in our performance experiments
                  PubMed Sm   PubMed Lg
  |V|             1M          29M
  |E|             2M          270M
  Raw data size   400 MB      127 GB
Figure: PubMed graph schema
  Node types: Grant Agency, Grant, Author, Article, Journal, Journal Issue, Chemical, Keyword, ContactInfo, MeshHeading
  Edge types: IssuedGrant, FundedBy, IsAuthorOf, PublishedIn, IsIssueOf, HasChemical, HasKeyword, HasContactInfo, HasMeshHeading
Path Traversal: Breadth-first search (BFS) on DAS
Improved performance by constructing an adjacency list via denormalization, which reduces the number of rows to join
Figure: edge list (Source, Destination) converted to an adjacency list
                   Edge List   Adjacency List (Denormalized)
  Unidirectional   287.926     120.359
  Bidirectional    204.90      56.431
Used large PubMed data
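The denormalization step can be illustrated with a short Python sketch: the edge list has one row per edge, while the adjacency list has one row per source vertex, so each BFS step looks up fewer rows. This is an in-memory illustration only, not the ECL code used on DAS; `build_adjacency`, `bfs_levels`, and the sample edges are assumptions, and only the unidirectional search is shown.

```python
from collections import defaultdict, deque


def build_adjacency(edge_list):
    """Denormalize (source, destination) rows into one row per source vertex."""
    adjacency = defaultdict(list)
    for src, dst in edge_list:
        adjacency[src].append(dst)
    return adjacency


def bfs_levels(adjacency, start):
    """Unidirectional BFS; returns the hop distance of each reachable vertex."""
    level = {start: 0}
    frontier = deque([start])
    while frontier:
        v = frontier.popleft()
        for w in adjacency.get(v, ()):  # one lookup per vertex, not per edge
            if w not in level:
                level[w] = level[v] + 1
                frontier.append(w)
    return level


# Hypothetical toy graph: 5 edge rows collapse into 4 adjacency rows.
edge_list = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
adjacency = build_adjacency(edge_list)
levels = bfs_levels(adjacency, start=1)
print(len(edge_list), len(adjacency))
print(levels)
```

The row reduction grows with average out-degree, so the benefit on a real graph depends on its degree distribution.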
DAS system is ideal for handling complex subgraph pattern queries on large data sets
[Query listings on this slide are not legible in this copy]