Graphs in Big Data: Challenges and Opportunities (Yinglong Xia)


  1. Graphs in Big Data: Challenges and Opportunities
 Yinglong Xia, 05/16/2016
 Mission-Critical Big Data Analytics (MCBDA’2016)

  2. Graph is the way we remember, we associate, and we understand.

  3. [Figure]

  4. Background; Graph Analytics and Systems; Challenges; Breakthrough & Opportunities; Mini Hands-on

  5. Classic Graph Theory
 • In 1736, the Seven Bridges of Königsberg problem, resolved by Euler, laid the foundations of graph theory.
 • In 1878, Sylvester discussed graph theory in Nature.
 • The first textbook on graph theory was written by Dénes Kőnig in 1936, followed by another by Frank Harary in 1969.
 Figure labels: Seven Bridges of Königsberg; Graph isomorphism; Goldman; Chinese Postman; Max bipartite match

  6. Brief History
 Timeline figure, up to 2016 (V = vertices, E = edges):
 • 61.6 million V, 1.47 billion E
 • 40 million V, 300 million E
 • Neuronal network @ Human Brain Project: 89 billion V & 100 trillion E
 Source: N.T. Bliss, Confronting the Challenges of Graphs and Networks, Lincoln Laboratory Journal, 2013

  7. Diversity in Graph Technology
 • Dynamic graph helps analyze the spatial and temporal influence over the entities in the network
 • RDF graph enables knowledge inference over linked data
 • Streaming graph monitors sentiment propagation over time and how the graph structure can impact it
 • Property graph is widely used as a data storage model to manage the properties of entities as well as the interconnections
 • Graphical models leverage statistics to infer latent factors in a complex system
 • Graph technology leads to rich analytic abilities
 Figure labels: Vertex ID; Edge label; Edge property

  8. Graphs in Big Data
 • Social network is a scale-free graph with small-world effect
 • CDR graph: call detail records can form a graph by linking the numbers that called each other
 • Graphical models can be used to find latent variables from noisy data
 • Some recommender systems, such as collaborative filtering, can be constructed on a bipartite graph
 Figures: from IBM Big Data Webpage

  9. Graph Analytics

  10. Complex Network Analysis
 Real-world complex networks include the WWW, social networks, biological networks, citation networks, power grids, food webs, metabolic networks, etc.
 Important properties/metrics:
 - Small-world effect
 - Betweenness
 - Eccentricity/Centrality
 - Transitivity
 - Resilience
 - Community structure
 - Clustering coefficient
 - Matching index
 Complex network models:
 - Poisson random graph: degree ~ Poisson; small-world effect
 - Watts and Strogatz graph: transitivity; small-world effect
 - Barabási and Albert graph: small world; power law
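
As an illustration of one metric listed on this slide, the local clustering coefficient can be sketched in a few lines. This is a minimal sketch, not taken from the talk: the function name `local_clustering` and the toy graph are assumptions, and the graph is taken to be undirected and stored as a dict of neighbour sets.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to close into a triangle
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coeffs = {v: local_clustering(adj, v) for v in adj}
```

In this toy graph, vertices 0 and 1 sit entirely inside the triangle, while vertex 2 has one connected neighbour pair out of three.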

  11. Information Propagation

  12. Knowledge Graph
 RDF (pictured: Tim Berners-Lee)
 ● RDF is a key part of the Semantic Web, making the WWW into an information exchange medium
 • Represent relationships among entities using links with properties
 • W3C/DAWG standards
 • SPARQL is the standard language to query graph data represented as RDF triples
 Example triples:
 Subject     Predicate   Object
 Yinglong    work_in     Futurewei
 Yinglong    born        1980
 Futurewei   has_HQ      Shenzhen
 Other figure labels: Software; Hardware
 RDF Graph = a collection of triples, linking the description of resources
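
The triple model on this slide can be sketched without any RDF library. The snippet below is an illustrative assumption, not SPARQL itself: it stores the slide's three example triples as Python tuples and uses a hypothetical helper `match` in which `None` plays the role of a query variable.

```python
# Triples copied from the slide's example table.
triples = {
    ("Yinglong", "work_in", "Futurewei"),
    ("Yinglong", "born", "1980"),
    ("Futurewei", "has_HQ", "Shenzhen"),
}

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Two-hop query: where is the headquarters of Yinglong's employer?
employers = [o for _, _, o in match(triples, s="Yinglong", p="work_in")]
hq = [o for e in employers for _, _, o in match(triples, s=e, p="has_HQ")]
```

Chaining two pattern matches like this is roughly what a SPARQL query with two triple patterns sharing a variable would express.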

  13. Graphical Model for Probabilistic Inference
 Graphical Model (pictured: Judea Pearl)
 ● Represent the joint distribution of random variables compactly, using the conditional independence among factors
 ● Components:
   ● node → random variable
   ● edge → probabilistic dependence
 ● Examples:
   ● Bayesian Network
   ● Markov Field
   ● Factor Graph
   ● Boltzmann Machine
 Bayesian Network 101
 ● Probabilistic inference: inferring the status of the unobservable random variables using what can be observed (a.k.a. evidence)
 ● Ex: given wet grass (G=true), the chance of rain (R=true) is P(R=true | G=true)
 ● Use case: computer vision, image processing
 Figure labels: Dependence; Random variable; CPT; Joint Distribution; Latent
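
The wet-grass example can be worked through by brute-force enumeration. This is a sketch under stated assumptions: the slide names only G and R, so the sprinkler variable S, the network shape R → S, (S, R) → G, and every number in the conditional probability tables (CPTs) below are illustrative and do not come from the talk.

```python
from itertools import product

# Assumed CPTs (illustrative numbers only).
P_R = {True: 0.2, False: 0.8}                     # P(R)
P_S_given_R = {True: 0.01, False: 0.4}            # P(S=true | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}  # P(G=true | S, R)

def joint(r, s, g):
    """Joint probability factorised along the network: P(R) P(S|R) P(G|S,R)."""
    p_s = P_S_given_R[r] if s else 1.0 - P_S_given_R[r]
    p_g = P_G_given_SR[(s, r)] if g else 1.0 - P_G_given_SR[(s, r)]
    return P_R[r] * p_s * p_g

def posterior_rain_given_wet():
    """P(R=true | G=true), summing out the unobserved sprinkler S."""
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return num / den

p = posterior_rain_given_wet()
```

With these assumed tables the posterior comes out at roughly 0.358. Enumeration is exponential in the number of hidden variables, which is why larger models need smarter inference algorithms.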

  14. Property Graph and Data Management
 ● Property graph is a data representation model with strong expressiveness
 ● Property graph is supported by most graph databases (NoSQL) and also forms the foundation of graph analysis
 • Vertices
   ■ Unique ID for each
   ■ A set of (directed) edges
   ■ Property: a set of key-value pairs
 • Edges
   ■ Unique ID for each
   ■ Two end vertices
   ■ With at least a label
   ■ Property: a set of key-value pairs
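
The vertex and edge structure listed on this slide maps naturally onto two record types. The following is a minimal in-memory sketch, assuming hypothetical class names (`Vertex`, `Edge`, `PropertyGraph`) and made-up sample data. It does not reflect how any particular graph database stores its data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Vertex:
    vid: int                                            # unique ID
    properties: Dict[str, Any] = field(default_factory=dict)
    out_edges: List[int] = field(default_factory=list)  # IDs of outgoing edges

@dataclass
class Edge:
    eid: int                                            # unique ID
    src: int                                            # two end vertices
    dst: int
    label: str                                          # at least one label
    properties: Dict[str, Any] = field(default_factory=dict)

class PropertyGraph:
    def __init__(self):
        self.vertices: Dict[int, Vertex] = {}
        self.edges: Dict[int, Edge] = {}

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, dict(props))
        return self.vertices[vid]

    def add_edge(self, eid, src, dst, label, **props):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both end vertices must exist")
        self.edges[eid] = Edge(eid, src, dst, label, dict(props))
        self.vertices[src].out_edges.append(eid)
        return self.edges[eid]

# Hypothetical sample data.
g = PropertyGraph()
g.add_vertex(1, name="A")
g.add_vertex(2, name="B")
g.add_edge(10, 1, 2, "knows", weight=0.5)
```

Keeping edge IDs on the source vertex gives each vertex its "set of (directed) edges" while edge properties live in one place.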

  15. Property Graph Implementation
 ● Adjacency list
   ● Similar to CSR, with improvements
   ● Utilized by ScaleGraph etc.
 ● Adjacency matrix
   ● Graph → sparse matrix
   ● Suited to some algorithms (e.g. PageRank)
   ● Utilized by IBM GPI
 ● Vertex property list and edge property list
   ● Utilized by Spark/GraphX
   ● Straightforward and effective data organization
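
For readers unfamiliar with CSR (compressed sparse row), the layout the slide compares adjacency lists to, here is a minimal sketch. The function names and the toy edge list are assumptions. This is plain CSR, without the unspecified "improvements" the slide attributes to systems such as ScaleGraph.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from directed (src, dst) pairs."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1            # count the out-degree of each source
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]     # prefix sum turns counts into offsets
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()         # next free slot for each vertex
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Out-neighbours of v are one contiguous slice of col_idx."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]

# Hypothetical toy graph with 4 vertices.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
row_ptr, col_idx = build_csr(4, edges)
```

Because each vertex's neighbours are contiguous, scanning them is cache-friendly. The trade-off is that inserting an edge requires shifting or rebuilding the arrays.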

  16. Basic Operators in Property Graph
 ● Traversal
   ● Def: visit/modify vertices following the edges
   ● Implementation: BFS, DFS
   ● Application: SSSP, CF, Loopy Bayesian Inference
 ● Graph Editing
   ● Def: add/delete/modify vertices, edges, or the property
   ● Implementation: local update (graphDB), new graphs (Spark/GraphX)
   ● Application: Finance Surveillance, Hypergraph construction
 Figure labels: Unvisited; Graph Traversal; Remove vertices; Add edges; Gaussian Elimination
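
The traversal operator can be illustrated with BFS used for single-source shortest paths (SSSP) on an unweighted graph. This is a minimal sketch with an assumed function name and a made-up toy graph. Weighted SSSP would need a different algorithm, such as Dijkstra's.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable vertex (unweighted SSSP)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:            # first visit gives the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical directed toy graph as adjacency lists.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
dist = bfs_distances(adj, 0)
```

Vertices absent from the returned dict are unreachable from the source, matching the "Unvisited" state in the slide's figure.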

  17. Graph Systems

  18. Some Existing Products
 Figure labels: Visualization; ScaleGraph; Analytics Frameworks; Flink/Gelly; Storage

  19. Neo4j System Architecture and Storage Format
 Architecture (figure labels):
 • Traversals; Core API; Cypher (Neo’s declarative query language)
 • Vertex/Edge Cache: LFU-protocol
 • Thread local diffs: changes in a TX
 • Graph structure and data buffers
 • FS Cache, i.e. mmap
 • HA: High Availability based on TX
 • Transaction log: for TX roll-back
 • Record files; Disk
 Notes:
 • Edges incident to a vertex are linked using the relationship data structure, imposing some performance issues when handling celebrities (very high-degree vertices) in power-law graphs, e.g. social networks
 • Easy to implement horizontal partitioning in FS

  20. Titan System Architecture and Storage Format
 Figure labels: Store Manager; Transaction store; Relations; Index Store

  21. OrientDB System Architecture
 • Supports distributed platforms, offering key-value store, docDB, and graphDB in one system
 • DocDB-based storage
 Figure labels: Graph; JDBC

  22. IBM System G Graph Data Organization
 • Keeping a chunk of graph data in memory for efficient data retrieval
 • On-demand loading loads data only when the vertices and/or edges are accessed
 • Stitching graph data together in memory → increase data locality
 • Behaving as an in-memory database
 • Contiguously store adjacency edges for vertex K: load the entire list, likely in a single disk access
 • Contiguously store properties for vertex K
 • Cache the latest set of pointers in the Key table to reduce disk access. Furthermore, cache the Key table in memory when there is enough memory.
 Figure (multiversioning, reducing disk access latency): a Key table with rows 0, 1, …, K, each pointing to the latest TS; TS table records (TS N-1, TS N) with prev/next links and properties prop1 … propM; Edges; Property Files; Reference counter
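
One way to read the multiversioning figure is as a key table whose entries point at the newest timestamped record, with each record linking back to its predecessor. The sketch below is an interpretation under that assumption. The class names, the integer timestamps, and the `as_of` lookup are hypothetical and are not IBM System G's actual API or on-disk format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Version:
    ts: int                              # timestamp of this version
    props: Dict[str, Any]                # prop1 ... propM
    prev: Optional["Version"] = None     # link to the previous version

class VersionedStore:
    def __init__(self):
        # Key table: vertex ID -> pointer to the latest version.
        self.key_table: Dict[int, Version] = {}

    def put(self, vid, ts, **props):
        latest = self.key_table.get(vid)
        if latest is not None and ts <= latest.ts:
            raise ValueError("timestamps must increase per vertex")
        self.key_table[vid] = Version(ts, dict(props), prev=latest)

    def get(self, vid, as_of=None):
        """Latest properties, or the newest version with ts <= as_of."""
        v = self.key_table.get(vid)
        while v is not None and as_of is not None and v.ts > as_of:
            v = v.prev
        return None if v is None else v.props

# Hypothetical sample data.
store = VersionedStore()
store.put(7, ts=1, color="red")
store.put(7, ts=2, color="blue")
```

Reading the latest version touches only the key table entry, which is the access path the slide suggests caching. Older versions cost one pointer hop each.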

  23. Glance at Graph Computing Engines
 Figure labels: Spark/GraphX; GraphChi

  24. Issues within Existing Systems
 ● Separation of data management and analytics layers results in unnecessary data duplication, hurting the overall performance
   • GraphLab, GraphX → no data management available
   • Titan → no clear model for data computing/analytics
   Figure: Analytics and GraphDB layers, linked by "Data copy & transform/map" and "Data wipe & rewrite"
 ● Limited consideration of scale-up, relying instead on scale-out for performance improvement, which is inherently different
   • GraphX and Titan cannot use the low-cost sync.
   • Irregularity in graph data access brings high cost to IO, slowing down the overall graph data processing
   Figure: running time vs. #machine, with curves for overall time, comm. time, and comp. time

  25. Issues within Existing Systems - 2
 Every coin has two sides
 ● JVM constraints
   • Productive and amenable to open source: Java and Scala run on JVMs
   • Irregular data access in graphs creates pressure
     ■ Poor data locality leads to increased workload in GC
     ■ E.g. importing 200M edges into Neo4j in one shot on a server with 1TB results in an out-of-memory issue; tuning the transaction size in Titan is also quite challenging
   • JVM abstraction makes it difficult to use low-level features such as NUMA-awareness, GPU devices, etc.
 ● Impact of the constraints of RDD
   • Spark gives up JVM-based GC, which may help improve the performance of GraphX
   • Due to the characteristics of RDD, dynamic graphs can result in a lot of data copy, rather than in-place data update

  26. Challenges
