Towards Efficient Query Processing on Massive Time-Evolving Graphs
Arash Fard, Amir Abdolrashidi, Lakshmish Ramaswamy, John A. Miller
Department of Computer Science, The University of Georgia
International Workshop on Collaborative Big Data (C-Big 2012)
Introduction
The number of nodes and edges is dynamic in many emerging applications, for example:
- Hyperlink structure of the World Wide Web
- Relationship structures in online social networks
- Connectivity structures of the Internet and overlays
- Communication flow networks among individuals
Time-Evolving Graph (TEG): a sequence of snapshots of a graph as it evolves over time.
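To make the snapshot-sequence definition concrete, here is a minimal data-structure sketch; it is not from the talk, and the class and field names are hypothetical.

```python
# Minimal sketch of a Time-Evolving Graph as a sequence of snapshots.
# The class and field names are illustrative assumptions, not the paper's API.

class Snapshot:
    def __init__(self, timestamp, edges):
        self.timestamp = timestamp          # time at which this snapshot was taken
        self.adj = {}                       # adjacency list: vertex -> set of out-neighbors
        for u, v in edges:
            self.adj.setdefault(u, set()).add(v)
            self.adj.setdefault(v, set())   # make sure isolated targets are present

class TimeEvolvingGraph:
    def __init__(self):
        self.snapshots = []                 # snapshots ordered by timestamp

    def add_snapshot(self, timestamp, edges):
        self.snapshots.append(Snapshot(timestamp, edges))

# Usage: three snapshots of a graph whose edge set changes over time.
teg = TimeEvolvingGraph()
teg.add_snapshot(0, [("a", "b"), ("b", "c")])
teg.add_snapshot(1, [("a", "b"), ("b", "c"), ("c", "a")])
teg.add_snapshot(2, [("a", "b"), ("c", "a")])
```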
Need for New Approaches for TEGs
In contrast to medium-sized, static graphs:
1. Huge size in many modern domains (e.g., Facebook has about 800 million vertices and 104 billion edges)
2. An additional dimension, namely time
3. The additional temporal dimension causes the data size to increase by multiple orders of magnitude
We study three important problems for TEGs:
- Distribution on cluster computers
- Reachability queries
- Pattern matching
BSP Model and Vertex-Centric Graph Processing
BSP (Bulk Synchronous Parallel) model:
- Computation proceeds in a sequence of supersteps separated by communication and synchronization barriers.
Vertex-centric graph processing:
- Each vertex of the data graph is a computing unit.
- Each vertex initially knows only its own label and its outgoing edges.
- Systems: Pregel, Apache Giraph, GPS
M. Felice Pace, "BSP vs MapReduce," Proceedings of the 12th International Conference on Computational Science (ICCS '12)
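As an illustration of the vertex-centric BSP model, the sketch below emulates a superstep loop in which every vertex runs a compute function and messages become visible only after the synchronization barrier. The driver loop, the Vertex class, and the example computation (propagating the minimum id among vertices that can reach each vertex) are illustrative assumptions, not the actual Pregel/Giraph API.

```python
# Minimal sketch of vertex-centric BSP execution (Pregel-style).
# The driver, Vertex class, and message passing are illustrative assumptions.

class Vertex:
    def __init__(self, vid, out_edges):
        self.vid = vid
        self.out_edges = out_edges      # ids of out-neighbors
        self.value = vid                # converges to the min id among vertices that can reach this one
        self.active = True

    def compute(self, messages, outbox):
        # Keep the smallest id seen so far; forward it when it improves
        # (or in the first superstep, to seed the propagation).
        new_value = min([self.value] + messages)
        if new_value < self.value or messages == []:
            self.value = new_value
            for nbr in self.out_edges:
                outbox.setdefault(nbr, []).append(self.value)
        self.active = False             # halt; reactivated when a message arrives

def run_bsp(vertices):
    inbox = {v.vid: [] for v in vertices}
    while True:
        outbox = {}
        # one superstep: every active vertex (or one with pending messages) computes
        for v in vertices:
            if v.active or inbox[v.vid]:
                v.active = True
                v.compute(inbox[v.vid], outbox)
        # synchronization barrier: messages become visible in the next superstep
        inbox = {v.vid: outbox.get(v.vid, []) for v in vertices}
        if not any(inbox.values()):
            break

# Usage on a tiny directed graph: 3 -> 1 -> 2 -> 3, plus 4 -> 2.
vs = [Vertex(3, [1]), Vertex(1, [2]), Vertex(2, [3]), Vertex(4, [2])]
run_bsp(vs)
print({v.vid: v.value for v in vs})
```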
TEG Distribution on Clusters
Two contradictory goals:
- Minimizing communication cost among the nodes of the cluster
- Maximizing node utilization
A trade-off between two extremes:
- Assigning the vertices randomly
- Partitioning the graph into connected components
[Figure: random assignment of graph vertices vs. assignment of vertices based on a partitioning pattern (partitions P1-P4)]
TEG Distribution on Clusters (continued)
- Create more partitions than the number of compute nodes (see the sketch after this slide).
- Dynamically repartition sub-graphs when changes pass a certain threshold related to the connectivity and structure of the sub-graphs.
- Incrementally reallocate a node in order to reduce the communication cost.
[Figure: incremental reallocation example with vertices labeled a, b, c]
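A minimal sketch of the first two points, assuming a hash-based assignment of vertices to logical partitions and a cut-ratio threshold for triggering repartitioning; both the metric and the threshold are placeholders rather than the scheme studied in the talk.

```python
# Sketch: hash vertices into more logical partitions than compute nodes,
# then map partitions to workers; flag a partition for repartitioning when
# the fraction of its edges that cross partitions exceeds a threshold.
# The metric and threshold are illustrative assumptions.

NUM_PARTITIONS = 16      # logical partitions
NUM_WORKERS = 4          # physical compute nodes
CUT_THRESHOLD = 0.6      # repartition when >60% of a partition's edges are cut

def partition_of(vertex):
    return hash(vertex) % NUM_PARTITIONS

def worker_of(partition):
    return partition % NUM_WORKERS

def partitions_to_repartition(edges):
    """edges: iterable of (u, v) pairs from the current snapshot."""
    total = [0] * NUM_PARTITIONS   # edges counted on their source partition
    cut = [0] * NUM_PARTITIONS     # of those, edges whose endpoints are in different partitions
    for u, v in edges:
        pu, pv = partition_of(u), partition_of(v)
        total[pu] += 1
        if pu != pv:
            cut[pu] += 1
    return [p for p in range(NUM_PARTITIONS)
            if total[p] > 0 and cut[p] / total[p] > CUT_THRESHOLD]

# Usage on a tiny edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "e")]
print(partitions_to_repartition(edges))
```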
Pattern Matching
There are different paradigms for pattern matching:
- Sub-graph isomorphism (NP-complete)
- Graph simulation (quadratic)
- Dual simulation (cubic)
- Strong simulation (cubic)
[Figure: a pattern graph and a data graph over people named Mary, John, Bob, Ann, Alice, Sara; PM: Product Manager, SD: Software Developer, Bio: Biologist]
Graph Simulation
[Figure: graph simulation match of a pattern graph in a data graph; PM: Product Manager, SD: Software Developer, Bio: Biologist, DM: Data Mining specialist, AI: Artificial Intelligence specialist, HR: Human Resources]
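Graph simulation only requires the child (out-edge) relationships of the pattern to be preserved. A minimal sequential fixed-point sketch is shown below; the graph representation and function names are assumptions, and this is not the distributed version discussed later.

```python
# Minimal sketch of (sequential) graph simulation:
# sim[u] keeps the data vertices that can simulate pattern vertex u, i.e. they
# carry the same label and, for every pattern edge (u, u2), have at least one
# child in sim[u2]. Representation and names are illustrative.

def graph_simulation(q_labels, q_adj, g_labels, g_adj):
    """q_labels/g_labels: vertex -> label; q_adj/g_adj: vertex -> set of out-neighbors."""
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for u2 in q_adj.get(u, set()):
                # keep only data vertices with at least one child that simulates u2
                keep = {v for v in sim[u] if g_adj.get(v, set()) & sim[u2]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    # the query matches iff no pattern vertex has an empty match set
    return sim if all(sim[u] for u in q_labels) else None

# Usage on a tiny example: pattern PM -> SD, data graph with one PM and one SD.
q_labels = {1: "PM", 2: "SD"}
q_adj = {1: {2}, 2: set()}
g_labels = {"x": "PM", "y": "SD", "z": "Bio"}
g_adj = {"x": {"y"}, "y": {"z"}, "z": set()}
print(graph_simulation(q_labels, q_adj, g_labels, g_adj))
```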
Graph Dual Simulation
[Figure: dual simulation match of the same pattern graph in the same data graph; legend as above]
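Dual simulation additionally requires the parent (in-edge) relationships of the pattern to be preserved, which prunes matches that plain graph simulation keeps. The sketch below extends the previous one with that parent condition; it is again a sequential illustration with assumed names, not the authors' implementation.

```python
# Minimal sketch of (sequential) dual simulation: in addition to the child
# condition of graph simulation, every pattern edge (u2, u) requires each
# candidate for u to have a parent that simulates u2. Names are illustrative.

def dual_simulation(q_labels, q_adj, g_labels, g_adj):
    """q_adj/g_adj: vertex -> set of out-neighbors."""
    # reverse adjacency lists for the parent condition
    q_radj = {u: set() for u in q_labels}
    for u, nbrs in q_adj.items():
        for u2 in nbrs:
            q_radj[u2].add(u)
    g_radj = {v: set() for v in g_labels}
    for v, nbrs in g_adj.items():
        for w in nbrs:
            g_radj[w].add(v)

    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            keep = set(sim[u])
            for u2 in q_adj.get(u, set()):     # child condition
                keep = {v for v in keep if g_adj.get(v, set()) & sim[u2]}
            for u2 in q_radj[u]:               # parent condition
                keep = {v for v in keep if g_radj.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim if all(sim[u] for u in q_labels) else None
```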
Graph Strong Simulation
[Figure: strong simulation match of the same pattern graph in the same data graph; legend as above]
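Strong simulation (Ma et al., "Capturing topology in graph pattern matching") further restricts dual simulation to balls of the data graph whose radius equals the diameter of the pattern graph, which restores locality. The helper below sketches only the ball-extraction step, an undirected BFS to bounded depth; running the dual-simulation sketch above inside each ball would complete the picture. Names and representation are assumptions.

```python
# Sketch of the ball-extraction step used by strong simulation: for a data
# vertex `center`, collect all vertices within undirected distance d
# (d = diameter of the pattern graph). Dual simulation is then run inside
# each ball (see the previous sketch). Names are illustrative assumptions.

from collections import deque

def ball(center, g_adj, g_radj, d):
    """Vertices within undirected distance d of `center`.
    g_adj: vertex -> out-neighbors, g_radj: vertex -> in-neighbors."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        v, dist = frontier.popleft()
        if dist == d:
            continue
        for w in g_adj.get(v, set()) | g_radj.get(v, set()):
            if w not in seen:
                seen.add(w)
                frontier.append((w, dist + 1))
    return seen

# Usage: ball of radius 1 around "y" in the chain x -> y -> z -> t.
g_adj = {"x": {"y"}, "y": {"z"}, "z": {"t"}, "t": set()}
g_radj = {"x": set(), "y": {"x"}, "z": {"y"}, "t": {"z"}}
print(ball("y", g_adj, g_radj, 1))   # {'x', 'y', 'z'}
```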
Distributed Graph Simulation: Initial Distributed Graph
[Figure: a query graph and a labeled data graph whose vertices are distributed across compute nodes]
Distributed Graph Simulation: The First Superstep
[Figure: the query graph and the distributed data graph after the first superstep]
Distributed Graph Simulation: The Second Superstep
[Figure: the query graph and the distributed data graph after the second superstep]
Distributed Graph Simulation: The Third Superstep
[Figure: the query graph and the distributed data graph after the third superstep]
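The supersteps above can be read as follows: each data vertex keeps the set of query vertices it may still match and, whenever that set shrinks, reports the new set to its in-neighbors, which re-check their own candidates in the next superstep. The sketch below is a simplified single-process emulation of that message flow (child condition only), not the authors' implementation; it also assumes each vertex initially knows its children's candidate sets, which in a real deployment would need an extra setup superstep.

```python
# Simplified emulation of distributed graph simulation in the vertex-centric
# style (child condition only). match[v] is the set of query vertices that data
# vertex v may still simulate; shrinking sets are propagated to parents.
# The emulation and all names are illustrative assumptions.

def distributed_graph_simulation(q_labels, q_adj, g_labels, g_adj, g_radj):
    match = {v: {u for u in q_labels if q_labels[u] == g_labels[v]} for v in g_labels}
    # child_view[v]: latest match sets reported by v's children
    child_view = {v: {c: match[c] for c in g_adj.get(v, set())} for v in g_labels}
    active = set(g_labels)                      # vertices that must re-check their candidates
    while active:
        outbox = {}                             # parent -> {child: new match set}
        for v in active:
            keep = set()
            for u in match[v]:
                # u survives only if every required query child u2 is still
                # matchable by some data child of v
                if all(any(u2 in child_view[v][c] for c in child_view[v])
                       for u2 in q_adj.get(u, set())):
                    keep.add(u)
            if keep != match[v]:
                match[v] = keep
                for p in g_radj.get(v, set()):  # tell parents in the next superstep
                    outbox.setdefault(p, {})[v] = keep
        # synchronization barrier: deliver messages, activate their receivers
        for p, updates in outbox.items():
            child_view[p].update(updates)
        active = set(outbox)
    return match

# Usage: same tiny example as the sequential sketch.
g_radj = {"x": set(), "y": {"x"}, "z": {"y"}}
print(distributed_graph_simulation({1: "PM", 2: "SD"}, {1: {2}, 2: set()},
                                   {"x": "PM", "y": "SD", "z": "Bio"},
                                   {"x": {"y"}, "y": {"z"}, "z": set()}, g_radj))
```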
Preliminary Results
- Source of the data graphs: http://snap.stanford.edu/data/
- Graph synthesizer: http://projects.skewed.de/graph-tool/
- Number of vertices in the pattern: 20
Pattern Matching in TEGs
We borrow the idea of result graphs from [1]:
- Lists of insert and delete requests, with time stamps marking snapshots of the graph.
- Delete commands can only diminish the result graph.
- Insert commands can only expand the previous result graph.
- Result graphs are saved for some of the snapshots of the graph (see the sketch after this slide).
[Figure: saved result graphs RG1, RG2, RG3 linked by Diff(G2, G1) and Diff(G3, G2), each a list of inserts/deletes]
[1] W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, "Incremental graph pattern matching," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11), New York, NY, USA: ACM, 2011, pp. 925-936.
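A minimal sketch of the diff bookkeeping described above, assuming each snapshot is stored as a timestamped list of insert and delete edge requests applied to the previous snapshot; the class and function names are hypothetical, and the incremental maintenance of the result graph itself follows [1] and is not reproduced here.

```python
# Sketch: snapshots of a TEG stored as timestamped diffs (insert/delete edge
# requests) relative to the previous snapshot. Applying a diff to a stored
# edge set reconstructs the next snapshot; maintaining the matched result
# graph incrementally follows [1]. Names are illustrative assumptions.

class Diff:
    def __init__(self, timestamp, inserts, deletes):
        self.timestamp = timestamp
        self.inserts = set(inserts)     # edges added since the previous snapshot
        self.deletes = set(deletes)     # edges removed since the previous snapshot

def apply_diff(edges, diff):
    """Return the edge set of the next snapshot."""
    return (set(edges) - diff.deletes) | diff.inserts

# Usage: reconstruct G2 and G3 from G1 and two diffs.
g1 = {("a", "b"), ("b", "c")}
d21 = Diff(1, inserts={("c", "a")}, deletes=set())
d32 = Diff(2, inserts=set(), deletes={("b", "c")})
g2 = apply_diff(g1, d21)
g3 = apply_diff(g2, d32)
print(g2, g3)
```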
References
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), 2010.
- "Apache Giraph website," http://giraph.apache.org/.
- S. Salihoglu and J. Widom, "GPS: a graph processing system," Stanford University, Technical Report, 2012.
- S. Ma, Y. Cao, J. Huai, and T. Wo, "Distributed graph pattern matching," in Proceedings of the 21st International Conference on World Wide Web (WWW '12), 2012.
- W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, "Incremental graph pattern matching," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11), 2011, pp. 925-936.
- S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, "Capturing topology in graph pattern matching," Proc. VLDB Endow., vol. 5, no. 4, pp. 310-321, Dec. 2011.
- M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, "Computing simulations on finite and infinite graphs," in Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS '95), 1995.