Distributed Graph Storage
Veronika Molnár, UZH

Overview
- Graphs and Social Networks
- Criteria for Graph Processing Systems
- Current Systems
  - Storage
  - Computation
  - Large scale systems
- Comparison / Best systems
- Questions

Graphs and Social Networks 1
Graph = collection of nodes + edges connecting nodes to each other
Social Network = collection of individuals and social relations
A Social Network is also a Graph! (node = person, edge = relation)
Figure: Social Network graph (image source: thenextweb.com)

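The node/edge definition above can be sketched in a few lines of Python. This is only an illustration: the names `build_graph`, `friendships`, and `social_graph`, and the people in the example, are hypothetical and not taken from any of the systems discussed later.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)  # a relation connects both people,
        graph[b].add(a)  # so store it in both directions
    return dict(graph)


# Hypothetical friendships: each pair is one edge (relation)
# between two nodes (people).
friendships = [
    ("Anna", "Ben"),
    ("Ben", "Carla"),
    ("Anna", "Carla"),
    ("Carla", "Dan"),
]
social_graph = build_graph(friendships)

for person in sorted(social_graph):
    print(person, "->", sorted(social_graph[person]))
```

An adjacency-set layout like this keeps neighbour lookups cheap, which is why many in-memory graph tools use something similar; distributed systems additionally have to decide how to split such a structure across machines.
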
Graphs and Social Networks 2
- Social Network graph properties (SNA = Social Network Analysis)
  - Limited number of connections at each node (person), e.g. Facebook: max. 5,000 friends
  - Distribution of connections is not uniform
    - Most people: an average number of connections
    - But: a few people have a lot of connections (power-law distribution)
  - Small degree of separation = “Small World” (shortest paths between people are short)
  - Centrality
  - Constantly changing and very large graph! (7 billion people = 7 billion nodes)

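Two of the properties above, the uneven distribution of connections and the small degree of separation, can be measured directly. The following is a minimal sketch on a hypothetical toy network (one "hub" with many connections, everyone else with few); the function names are illustrative, and a breadth-first search is assumed as the way to find shortest path lengths in an unweighted graph.

```python
from collections import Counter, deque

# Hypothetical toy network: "hub" has many connections, the rest only a few.
graph = {
    "hub": {"a", "b", "c", "d", "e"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub", "e"},
    "e": {"hub", "d"},
}


def degree_distribution(graph):
    """Return a Counter mapping degree -> number of nodes with that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def separation(graph, source, target):
    """Shortest path length between two nodes (BFS), or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


print(degree_distribution(graph))
print(separation(graph, "c", "e"))
```

In this toy network every pair of people is at most two steps apart because all paths can go through the hub, which is the "Small World" effect in miniature.
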
Graphs and Social Networks 3
Figure: Shortest Path and Centrality (Betweenness, Closeness, PageRank, Degree); labels in the figure: VM, BP

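As a rough illustration of two of the centrality measures named on this slide, the sketch below computes degree and closeness centrality by hand. It assumes an undirected, connected graph with at least two nodes and uses the common normalisations (degree divided by n - 1; closeness as n - 1 divided by the sum of shortest path lengths). The function names and the three-node example are hypothetical; betweenness and PageRank are left out to keep the example short.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph):
    """(n - 1) / sum of distances to all other nodes (assumes connectivity)."""
    n = len(graph)
    result = {}
    for v in graph:
        total = sum(bfs_distances(graph, v).values())
        result[v] = (n - 1) / total if total else 0.0
    return result


# Hypothetical path graph: a - b - c. The middle node is the most central.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(path))
print(closeness_centrality(path))
```

Both measures rank the middle node `b` highest here; on real social networks they can disagree, which is one reason systems are compared on which SNA metrics they provide.
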
Graphs and Social Networks 4
- A Social Network can be…
  - Facebook
  - Emails
  - Mailing lists
  - Academic networks

Criteria for Graph Processing Systems 1
- Modes:
  - Distributed processing
  - Research and industry use
  - Interactive and noninteractive modes
  - Storage of static and dynamic information
Figure: E-mail connectivity graph (image source: research.microsoft.com)

Criteria for Graph Processing Systems 2
- Properties:
  - Scalability (social networks are large!)
  - Speed
- Features:
  - SNA (Social Network Analysis) metrics: PageRank, Centrality, Shortest paths, ...
  - Extensibility
Figure: E-mail connectivity graph (image source: research.microsoft.com)

Current Systems 1
Storage:
- Apache Hive (and Hadoop)
- Titan Graph Database
- Neo4j

Current Systems - Storage 2
Apache Hive (and Hadoop)
- Hadoop: MapReduce architecture
- Hive: High-level operations on large data sets
  - HiveQL (similar to SQL)
  - Converted to MapReduce jobs
  - Not graph-specific
  - Supports custom data formats
  - Can be used as a backend for other systems

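To make the MapReduce model behind Hive more concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce), used to count each person's connections from an edge list. It does not use Hive or Hadoop; the function names and the edge list are hypothetical, and a real job would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(edge):
    """Emit (key, value) pairs: each endpoint of an edge counts one connection."""
    a, b = edge
    return [(a, 1), (b, 1)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical edge list (one relation per pair).
edges = [("Anna", "Ben"), ("Ben", "Carla"), ("Anna", "Carla"), ("Carla", "Dan")]

emitted = [pair for edge in edges for pair in map_phase(edge)]
degrees = dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())
print(degrees)
```

Because the model knows nothing about graphs, anything beyond per-key aggregation (for example multi-hop paths) has to be expressed as a chain of such jobs, which matches the "not graph-specific" point above.
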
Current Systems - Storage 3
Titan Graph Database
- Store and query large graphs
- Graph schemas
  - Edge and vertex labels
- Gremlin query language
  - Transactional query model
  - High-level operations
- Two backends: Cassandra and HBase

Current Systems - Storage 4
Neo4j
- Cost: €12K for startups (more for large companies), free for personal use
- Graph Database Management
- ACID compliant (Atomicity, Consistency, Isolation, Durability)
- Graphs are stored as Edges, Nodes, Attributes
- Focus on finding and querying data
- Graph analytics with igraph or GraphX
- Community!

Neo4j

Current Systems 5
Computation:
- igraph
- Spark GraphX
- GraphLab

Current Systems - Computation 6
igraph
- Network analysis / network research
- Portable and efficient
- Python, R, C, C++
- Built-in, optimized SNA metrics (centrality, diameter, connected components)
- Stand-alone or Grid
- Extensible, 3-layer API

Current Systems - Computation 7
Spark GraphX
- Graphs and parallel graph computations
- User-defined parallel operations
- Stored in-memory for faster processing
- Very good end-to-end performance
- Graphs are immutable; all operations create a new graph
- Prebuilt graph algorithms, e.g. PageRank

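PageRank, mentioned above as a prebuilt algorithm, can be sketched as a simple power iteration. This is plain Python and not the GraphX API; the function name, the damping factor of 0.85, the fixed iteration count, and the example links are all assumptions made for illustration. In the spirit of the "immutable graph" point, each iteration builds a new rank table instead of updating the old one in place.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    out_links maps every node to the list of nodes it links to; every
    link target is assumed to appear as a key as well.
    """
    nodes = list(out_links)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Start each round from the "random jump" share, in a fresh dict.
        new_ranks = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * ranks[v] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * ranks[v] / n
                for t in nodes:
                    new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Hypothetical directed graph: "c" is linked to by three of the four nodes.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for node, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(rank, 3))
```

Each round only needs a node's current rank and its outgoing links, which is what makes the algorithm a natural fit for parallel, in-memory systems.
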
Current Systems - Computation 8
GraphLab
- Cost: $4,000/machine/year, or free 1-year student subscription
- Graph computations: processing & analytics
- Visualization (GraphLab Canvas)
- Machine learning
- Common graph algorithms + API

GraphLab

Current Systems 9
Used by Facebook/Google:
- Pregel/Pregelix
- Apache Giraph

Current Systems - Large Scale 10
Pregel/Pregelix
- Pregel: Google-only, Pregelix: open-source
- BSP (Bulk Synchronous Parallel) model
- User-defined edge, vertex, message types
- Supersteps
- Extremely large graphs
- In-memory/out-of-core operation models
- Vertex-based API, libraries with graph algorithms

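The superstep idea can be illustrated with a small single-machine simulation of vertex-centric shortest paths. This is a sketch of the BSP pattern only, not the Pregel or Pregelix API: the function name, the message format (a candidate distance), and the example graph are hypothetical. In every superstep each vertex with incoming messages keeps the smallest distance it has seen and, if that improved, sends new candidate distances to its neighbours; messages only become visible in the next superstep.

```python
def pregel_sssp(edges, source):
    """Single-source shortest paths in superstep style.

    edges maps each vertex to a list of (neighbour, weight) pairs.
    Returns (distances, number of supersteps executed).
    """
    distance = {v: float("inf") for v in edges}
    inbox = {source: [0]}  # messages delivered at the start of superstep 0
    supersteps = 0
    while inbox:  # halt once no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox  # barrier: new messages are read in the next superstep
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted, directed graph.
edges = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
distances, steps = pregel_sssp(edges, "a")
print(distances, steps)
```

Because each vertex only reads its own state and its inbox, the per-vertex work inside a superstep can be distributed across machines, with the barrier between supersteps as the only global synchronisation point.
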
Current Systems - Large Scale 11
Apache Giraph
- BSP model
- Graph-wide metrics via global operations
- Built on Hadoop, 5-26 times faster than Hive
- Highly parallel, keeps all data in memory
- Scales linearly with number of edges, can make efficient use of large clusters
- Used for PageRank, popularity rank, shortest paths
- No built-in graph metrics

Comparison

System      Focus                    Scalability  SNA  Extensibility    Used for
Hive        parallel computations    any size     no   Java             generic
Titan       storage                  ~100 B       no   Python, Java     graph queries
Neo4j       transactional DB         ~1 B         yes  Java, Python, R  recommender systems
igraph      efficiency, portability  ~1 M         yes  R, Python, C++   research
GraphX      parallel computations    ~1 B         yes  Java, Python, R  graph processing
GraphLab    processing, analytics    ~1 B         yes  C++              recommender systems
Giraph      large scale, BSP         any size     no   Java, Python     Facebook
Pregel(ix)  large scale, BSP         any size     yes  Java             Google

Which is the best?
Depends on the network and intended use.
- Very large Social Networks:
  - High-performance, customizable systems, such as Pregelix
- Research:
  - igraph and GraphX support R and Python integration
- Analysis and Visualisation of Social Networks:
  - GraphLab with built-in interactive analysis and plotting features
  - Neo4j has a large body of community resources for these tasks
- Custom use cases...
  - Existing systems might not support these
  - Instead: use Hadoop/Hive and write the rest yourself!

Thank You! And stay for some questions.

Questions 1
- Why do we analyse social data?
- What are the possible uses of analysing social data?

Questions 2
- Can visualisation help to understand graphs? (connections can be viewed, subset of graph can be analysed, …)

Questions 3
- Have you ever used such a system? Which one?

Questions 4
- What are the advantages and disadvantages of distributed graph processing?
- What is the value of graph processing?

Questions 5
- How can social metric calculations deal with fake accounts?

The End