Graphs in Big Data: Challenges and Opportunities, Yinglong Xia

Graphs in Big Data: Challenges and Opportunities
Yinglong Xia
05/16/2016
Mission-Critical Big Data Analytics (MCBDA’2016)

Graph is the way we remember, we associate, and we understand.

Outline
• Background
• Graph Analytics and Systems
• Challenges
• Breakthrough & Opportunities
• Mini Hands-on

Classic Graph Theory
• In 1736, the Seven Bridges of Königsberg problem was posed in mathematics, laying the foundations of graph theory.
• In 1878, graph theory was discussed by Sylvester in Nature.
• The first textbook on graph theory was written by Dénes Kőnig in 1936, followed by another one by Frank Harary in 1969.
[Figure labels: Seven Bridges of Königsberg, Graph isomorphism, Goldman, Chinese Postman, Max bipartite match]

Brief History
[Figure: graph sizes up to 2016 (V = vertices, E = edges). Labels: 40 million V, 300 million E; 61.6 million V, 1.47 billion E; Neuronal network @ Human Brain Project, 89 billion V & 100 trillion E]
Source: N.T. Bliss, Confronting the Challenges of Graphs and Networks, Lincoln Laboratory Journal, 2013

Diversity in Graph Technology
Graph technology leads to rich analytic abilities:
• Dynamic graph helps analyze the spatial and temporal influence over the entities in the network
• RDF graph enables knowledge inference over linked data
• Streaming graph monitors sentiment propagation over time and how the graph structure can affect it
• Property graph is widely used as a data storage model to manage the properties of entities as well as the interconnections
• Graphical models leverage statistics to infer latent factors in a complex system
[Figure labels: Vertex ID, Edge label, Edge property]

Graphs in Big Data
• Some recommender systems, such as collaborative filtering, can be constructed on a bipartite graph
• Graphical models can be used to find latent variables from noisy data
• CDR graph: call detail records can form a graph by linking the numbers that called each other
• Social network is a scale-free graph with small-world effect
From IBM Big Data Webpage

Graph Analytics

Complex Network Analysis
Real world complex networks include WWW, Social Network, Biological network, Citation Network, Power Grid, Food Web, Metabolic network, etc.
Important properties/metrics:
- Small-world effect
- Betweenness
- Eccentricity/Centrality
- Transitivity
- Resilience
- Community structure
- Clustering coefficient
- Matching index
Complex network models:
- Poisson random graph
  - degree~Poisson
  - Small world effect
- Watts and Strogatz graph
  - Transitivity
  - Small world effect
- Barabasi and Albert graph
  - Small world
  - Power law
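Two of the metrics listed above, the clustering coefficient and the degree distribution, can be computed directly from an adjacency structure. The following is a minimal sketch, assuming an undirected graph stored as a dict of neighbour sets; the toy graph and the function names are illustrative and not taken from the slides.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are directly connected."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to close a triangle
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def degree_histogram(adj):
    """Map each degree value to the number of vertices having that degree."""
    hist = {}
    for nbrs in adj.values():
        hist[len(nbrs)] = hist.get(len(nbrs), 0) + 1
    return hist


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(local_clustering(adj, "a"))
print(degree_histogram(adj))
```

On a power-law graph, the histogram would show a long tail of a few very high-degree vertices, which is the property the Barabasi and Albert model reproduces.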

Information Propagation

Knowledge Graph
● RDF is a key part of the semantic network (Tim Berners-Lee), making the WWW into an information exchange medium
 • Represent relationships among entities using links with properties
 • W3C/DAWG Standards
 • SPARQL is the standard language to query graph data represented as RDF triples

Subject     Predicate   Object
Yinglong    work_in     Futurewei
Yinglong    born        1980
Futurewei   has_HQ      Shenzhen

RDF Graph = A collection of triples, linking the description of resources
[Figure labels: Yinglong, Futurewei, Software, Hardware]
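The triple table above can be queried by pattern matching, which is the idea behind a SPARQL basic graph pattern. The sketch below is a plain-Python stand-in, not a SPARQL engine; the `match` and `hq_of_employer` helpers are hypothetical names introduced for illustration, and only the three triples come from the slide.

```python
# The three triples from the slide, stored as (subject, predicate, object).
TRIPLES = [
    ("Yinglong", "work_in", "Futurewei"),
    ("Yinglong", "born", "1980"),
    ("Futurewei", "has_HQ", "Shenzhen"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def hq_of_employer(triples, person):
    """Two-hop join: person -work_in-> ?org -has_HQ-> ?city."""
    return [
        city
        for (_, _, org) in match(triples, s=person, p="work_in")
        for (_, _, city) in match(triples, s=org, p="has_HQ")
    ]


print(match(TRIPLES, s="Yinglong"))
print(hq_of_employer(TRIPLES, "Yinglong"))
```

The two-hop helper follows the shared variable `org` from one triple pattern to the next, which is how linked descriptions of resources are joined in an RDF graph.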

Graphical Model for Probabilistic Inference
Bayesian Network 101 (Judea Pearl)
● Graphical Model
 ● Represent joint distribution of r.v. compactly using the conditional independence among factors
● Components:
 ● node → random variable
 ● edge → prob dependence
● Examples
 ● Bayesian Network
 ● Markov Field
 ● Factor Graph
 ● Boltzmann Machine
● Probabilistic Inference: inferring the status of the unobservable random variables using what can be observed (a.k.a. Evidence)
● Ex: given wet grass (G=true), infer the chance of rain (R=true)
● Use case: computer vision, image processing
[Figure labels: Dependence, Random variable, CPT, Joint Distribution, Latent]
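The wet-grass example can be worked through by enumeration over the joint distribution. This is a minimal sketch under loud assumptions: the slide names only rain (R) and wet grass (G), so the sprinkler variable and every probability below are hypothetical textbook-style values, not numbers from the talk.

```python
# ASSUMED network: Rain -> Sprinkler, (Sprinkler, Rain) -> WetGrass.
# All CPT values are illustrative assumptions.
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
P_WET_GIVEN = {  # keyed by (sprinkler, rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def bernoulli(p_true, value):
    """Probability that a binary variable with P(true)=p_true equals value."""
    return p_true if value else 1.0 - p_true


def joint(rain, sprinkler, wet):
    """Joint probability as the product of the CPT entries along the edges."""
    return (
        bernoulli(P_RAIN, rain)
        * bernoulli(P_SPRINKLER_GIVEN_RAIN[rain], sprinkler)
        * bernoulli(P_WET_GIVEN[(sprinkler, rain)], wet)
    )


def posterior_rain_given_wet():
    """P(R=true | G=true), summing out the unobserved sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(
        joint(r, s, True) for r in (True, False) for s in (True, False)
    )
    return numerator / evidence


print(round(posterior_rain_given_wet(), 4))
```

Enumeration is exponential in the number of unobserved variables, which is why larger networks rely on message passing or sampling instead.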

Property Graph and Data Management
● Property graph is a data representation model with strong expressiveness
● Property graph is supported by most graph databases (NoSQL) and also forms the foundation of graph analysis.
 • Vertices
  ■ Unique ID for each
  ■ A set of (directed) edges
  ■ Property: a set of key-value pairs
 • Edges
  ■ Unique ID for each
  ■ Two end vertices
  ■ With at least a label
  ■ Property: a set of key-value pairs
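The vertex and edge definitions above map naturally onto two small record types. The following is a minimal in-memory sketch, assuming integer IDs and Python dicts for the key-value properties; the class and method names are hypothetical and do not correspond to any particular graph database API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    eid: int                       # unique ID
    src: int                       # two end vertices
    dst: int
    label: str                     # at least one label
    props: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Vertex:
    vid: int                       # unique ID
    props: Dict[str, Any] = field(default_factory=dict)
    out_edges: List[int] = field(default_factory=list)  # directed edge IDs


class PropertyGraph:
    def __init__(self):
        self.vertices: Dict[int, Vertex] = {}
        self.edges: Dict[int, Edge] = {}

    def add_vertex(self, vid, **props):
        if vid in self.vertices:
            raise ValueError(f"duplicate vertex id {vid}")
        self.vertices[vid] = Vertex(vid, dict(props))

    def add_edge(self, eid, src, dst, label, **props):
        if eid in self.edges:
            raise ValueError(f"duplicate edge id {eid}")
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both end vertices must exist")
        if not label:
            raise ValueError("an edge needs a label")
        self.edges[eid] = Edge(eid, src, dst, label, dict(props))
        self.vertices[src].out_edges.append(eid)

    def neighbours(self, vid, label=None):
        """Destination vertices of vid's out-edges, optionally by label."""
        return [
            self.edges[e].dst
            for e in self.vertices[vid].out_edges
            if label is None or self.edges[e].label == label
        ]


g = PropertyGraph()
g.add_vertex(1, name="Yinglong")
g.add_vertex(2, name="Futurewei")
g.add_edge(10, 1, 2, "work_in", weight=1.0)  # weight is a made-up property
print(g.neighbours(1, label="work_in"))
```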

Property Graph Implementation
● Adjacency list
 ● Similar to CSR, with improvements
 ● Utilized by ScaleGraph etc.
● Adjacency matrix
 ● Graph → Sparse Matrix
 ● Suited to some algorithms (e.g. PageRank)
 ● Utilized by IBM GPI
● Vertex property list, edge property list
 ● Utilized by Spark/GraphX
 ● Straightforward and effective data organization
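Since the adjacency-list layout is described as similar to CSR (compressed sparse row), a small CSR builder may help make the layout concrete. This is a sketch of plain CSR only, assuming vertices numbered 0..n-1 and a directed edge list; it does not attempt the unspecified improvements mentioned on the slide.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) pairs."""
    # Count the out-degree of every vertex.
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1

    # Prefix sums: row_ptr[v] is where v's neighbours start in col_idx.
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + counts[v]

    # Scatter destinations into their rows, keeping input order per vertex.
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1]  # slicing copies, so row_ptr is left untouched
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def out_neighbours(row_ptr, col_idx, v):
    """Neighbours of v occupy one contiguous slice of col_idx."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]


# Hypothetical toy edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
row_ptr, col_idx = build_csr(4, edges)
print(row_ptr, col_idx)
print(out_neighbours(row_ptr, col_idx, 2))
```

The contiguous neighbour slice is what gives CSR good locality for traversal; the cost is that inserting an edge requires shifting or rebuilding the arrays.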

Basic Operators in Property Graph
● Traversal
 ● Def: visit/modify vertices following the edges
 ● Implementation: BFS, DFS
 ● Application: SSSP, CF, Loopy Bayesian Inference
● Graph Editing
 ● Def: add/delete/modify vertices, edges, or the property
 ● Implementation: local update (graphDB), new graphs (Spark/GraphX)
 ● Application: Finance Surveillance, Hypergraph construction
[Figure labels: Graph Traversal, Unvisited, Remove vertices, Add edges, Gaussian Elimination]
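The two operators can be illustrated together: a BFS traversal that yields hop distances (SSSP on an unweighted graph), followed by an in-place edit in the "local update" style. This is a minimal sketch assuming a dict-of-lists adjacency structure; the toy graph and helper names are illustrative.

```python
from collections import deque


def bfs_levels(adj, source):
    """Breadth-first traversal returning the hop distance of each reached vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # unvisited vertex
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def add_edge(adj, src, dst):
    """Graph editing as a local, in-place update of the adjacency lists."""
    adj.setdefault(src, []).append(dst)
    adj.setdefault(dst, [])


# Hypothetical directed toy graph.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
before = bfs_levels(adj, 0)
add_edge(adj, 0, 4)                    # shortcut edge changes the distances
after = bfs_levels(adj, 0)
print(before[4], after[4])
```

A system that builds new graphs on every edit, as the slide notes for Spark/GraphX, would return a fresh adjacency structure instead of mutating `adj`.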

Graph Systems

Some Existing Products
[Figure: product logos grouped into Visualization, Analytics Frameworks, and Storage; named products include ScaleGraph and Flink/Gelly]

Neo4j System Architecture and Storage Format
[Architecture diagram, top to bottom:
 Traversals, Core API, Cypher (Neo’s declarative query language)
 Vertex/Edge Cache (LFU-protocol; graph structure and data buffers), Thread local diffs (changes in a TX)
 FS Cache (i.e. mmap), HA (High Availability based on TX)
 Record files, Transaction log (for TX roll-back)
 Disk]
● Links the edges incident to a vertex using the relationship data structure, imposing some performance issues for handling celebrities in power-law graphs, e.g. social networks
● Easy to implement horizontal partitioning in FS

Titan System Architecture and Storage Format
[Diagram labels: Store Manager, Transaction store, Relations, Index Store]

OrientDB System Architecture
● Supports distributed platforms, offering key-value store, docDB, and graphDB in one system
● DocDB based storage
[Diagram labels: Graph, JDBC]

IBM System G Graph Data Organization
• Keeping a chunk of graph data in memory for efficient data retrieval
• On-demand loading loads data only when the vertices and/or edges are accessed
• Stitching graph data together in memory → increase data locality
• Behaving as an in-memory database
[Diagram: a Key table (entries 0, 1, ..., K, each holding Latest TS pointers) points into a TS table, an Edges file, and Property files. Versions TS N-1 and TS N are chained by prev/next pointers and each carries prop1 … propM (multiversioning). Annotations: contiguously store adjacency edges for vertex K; load entire list likely in a single disk access; contiguously store properties for vertex K; reference counter; reduce disk access latency]
Cache the latest set of pointers in Key table to reduce disk access. Furthermore, cache the Key table in memory when there is enough memory.
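The key-table idea, where each key points at its latest timestamped version and older versions are reached through prev pointers, can be sketched in a few lines. This is an in-memory illustration of the concept only, not System G's actual file layout or API; the class names, the merge-on-write behaviour, and the sample values are all assumptions.

```python
class Version:
    """One timestamped snapshot of a vertex's properties."""

    def __init__(self, ts, props, prev=None):
        self.ts = ts
        self.props = props
        self.prev = prev               # link to the previous (older) version


class VersionedProperties:
    def __init__(self):
        # Key table: vertex id -> latest Version, so the newest data is
        # reachable without walking the version chain.
        self.key_table = {}

    def write(self, vid, ts, **props):
        latest = self.key_table.get(vid)
        if latest is not None and ts <= latest.ts:
            raise ValueError("timestamps must increase per vertex")
        merged = dict(latest.props) if latest else {}
        merged.update(props)           # ASSUMED: new version inherits old props
        self.key_table[vid] = Version(ts, merged, prev=latest)

    def read(self, vid, ts=None):
        """Latest properties, or the newest version not later than ts."""
        version = self.key_table.get(vid)
        while version is not None and ts is not None and version.ts > ts:
            version = version.prev
        return None if version is None else version.props


store = VersionedProperties()
store.write(7, 1, color="red")         # made-up sample properties
store.write(7, 2, color="blue", size=3)
print(store.read(7))
print(store.read(7, ts=1))
```

Reading the latest version costs one lookup, while historical reads walk the prev chain, which mirrors the slide's point about caching the latest set of pointers.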

Glance at Graph Computing Engines
[Figure labels: Spark/GraphX, GraphChi]

Issues within Existing Systems
● Separation of data management and analytics layers results in unnecessary data duplication, adversely hurting the overall performance
 • GraphLab, GraphX → No data management available
 • Titan → No clear model for data computing/analytics
 [Diagram: Analytics and GraphDB layers, linked by "Data copy & transform/map" and "Data wipe & rewrite"]
● Limited consideration of scale-up, relying instead on scale-out for performance improvement, which is inherently different
 • GraphX and Titan cannot use the low-cost sync.
 • Irregularity in graph data access brings high cost to IO, slowing down the overall graph data processing time
 [Plot: running time vs. #machine, with curves for overall time, comm. time, and comp. time]

Issues within Existing Systems - 2
Every coin has two sides
● JVM constraints
 • Productivity and open-source amenable. Java and Scala run on JVMs
 • Irregular data access in graph forms pressure
  ■ Poor data locality leads to increased workload in GC
  ■ E.g. importing 200M edges into Neo4j in one shot on a server with 1TB results in an out-of-memory issue; tuning the transaction size in Titan is also quite challenging.
 • JVM abstraction makes it difficult to use low level features, such as NUMA-awareness, GPU devices, etc.
● Impact by the constraints of RDD
 • Spark gives up JVM based GC, which may help improve the performance of GraphX
 • Due to the characteristics of RDD, dynamic graphs can result in a lot of data copy, rather than in-place data update

Challenges
