Multiplex graph analysis with GraphBLAS
Gábor Szárnyas
With contributions from Petra Várhegyi and Lehel Boér
Fault Tolerant Systems Research Group
MTA-BME Lendület Cyber-Physical Systems Research Group
Budapest University of Technology and Economics, Department of Measurement and Information Systems
PARADISE PAPERS
Paradise papers
Data set leaked to investigative journalists
Similar to Panama Papers, but larger
Mid-sized data set: 2M nodes, 3M edges
Paradise papers schema
[Figure: schema of the data set, showing its node labels and edge types]
GRAPH DATA MODELS
Example
[Figure: example graph with the nodes Bob, Carol, Dan, Eve and Fred]
Simple graph
  • “Textbook graph”
  • Untyped graph
  • Homogeneous network
  • Monoplex graph
Example with edge types
[Figure: the example graph (Bob, Carol, Dan, Eve, Fred) with two edge types: friends and business partners]
Multiplex graph
  • Edge-labelled or edge-typed graph
  • Heterogeneous or multidimensional network
Expressive power: between untyped and property graphs.
MULTIPLEX GRAPH ANALYSIS
Local clustering coefficient metric
[Figure: a wedge closes into a triangle]
$\mathrm{LCC}(w) = \dfrac{\text{No. of triangles in } w}{\text{No. of wedges in } w}$
K. Faust, S. Wasserman (1994): Social Network Analysis
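The ratio above can be checked by brute force on a small graph. The sketch below is only an illustration: it assumes the edge list of the five-node example graph (B, C, D, E, F) as read off the untyped adjacency matrix shown later in the slides, and the helper names `neighbours` and `lcc` are hypothetical.

```python
from itertools import combinations

# Assumed edge list of the example graph, edge types ignored.
EDGES = [("B", "C"), ("B", "D"), ("B", "E"), ("C", "D"),
         ("C", "F"), ("D", "E"), ("D", "F"), ("E", "F")]


def neighbours(edges):
    """Build an undirected neighbour map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def lcc(adj, w):
    """Triangles through w divided by wedges centred on w."""
    # A wedge is an unordered pair of neighbours of w.
    wedges = list(combinations(sorted(adj[w]), 2))
    if not wedges:
        return 0.0
    # A wedge closes into a triangle if its two endpoints are adjacent.
    triangles = sum(1 for a, b in wedges if b in adj[a])
    return triangles / len(wedges)


adj = neighbours(EDGES)
lcc_values = {w: lcc(adj, w) for w in sorted(adj)}
for node, value in lcc_values.items():
    print(node, round(value, 3))
```

This enumeration is quadratic in the degree of each node, which is fine for an illustration but is the kind of naive approach the later slides replace with matrix operations.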
Typed clustering coefficient metric
[Figure: a typed wedge (two edges of the same type) closes into a typed triangle (two edges of the same type and an edge with a different type)]
$\mathrm{TCC}(w) = \dfrac{\text{No. of typed triangles in } w}{\text{No. of typed wedges in } w}$
F. Battiston et al. @ Physical Review E 2014: Structural measures for multiplex networks
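A brute-force reading of this definition, as a minimal sketch: a typed wedge is a pair of same-typed edges at w, and it closes into a typed triangle if its endpoints are connected by an edge of a different type. The typed edge lists are assumed from the friends and business adjacency matrices of the example graph, and the function names are hypothetical. With more than two types, each wedge is counted once per possible closing type, which is one way to normalise and is an assumption here.

```python
from itertools import combinations

# Assumed typed edge lists of the example graph (C-D appears in both layers).
TYPED_EDGES = {
    "friends": [("B", "C"), ("B", "E"), ("C", "D"), ("C", "F")],
    "business": [("B", "D"), ("C", "D"), ("D", "E"), ("D", "F"), ("E", "F")],
}


def typed_neighbours(typed_edges):
    """Build one undirected neighbour map per edge type."""
    adj = {}
    for edge_type, edges in typed_edges.items():
        layer = adj.setdefault(edge_type, {})
        for u, v in edges:
            layer.setdefault(u, set()).add(v)
            layer.setdefault(v, set()).add(u)
    return adj


def tcc(adj, w):
    """Typed triangles at w divided by typed wedges at w."""
    wedges = 0
    triangles = 0
    for wedge_type, layer in adj.items():
        # Typed wedge: two neighbours of w reached over the same edge type.
        for a, b in combinations(sorted(layer.get(w, ())), 2):
            for closing_type, closing_layer in adj.items():
                if closing_type == wedge_type:
                    continue
                wedges += 1
                # Typed triangle: the closing edge has a different type.
                if b in closing_layer.get(a, ()):
                    triangles += 1
    return triangles / wedges if wedges else 0.0


adj = typed_neighbours(TYPED_EDGES)
tcc_values = {w: tcc(adj, w) for w in "BCDEF"}
print(tcc_values)
```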
Typed clusteredness example
[Figure: the example graph annotated with the LCC and TCC value of each node]
  node   LCC   TCC
  B      2/3   0
  C      2/3   2/3
  D      2/3   1/2
  E      2/3   0
  F      2/3   0
Multiplex analysis on Paradise papers
Previous research
  o Characterization of HW/SW/building models
  o 100k – 1M nodes/edges
  o Naïve Java implementation using edge lists
Ran for Paradise papers data set
  o Clustering coefficient metrics did not complete in days
What’s going on?
  o Implementation and algorithmic aspects need to be tuned
G. Szárnyas et al. @ MODELS 2016: Towards the characterization of realistic models: evaluation of multidisciplinary graph metrics
GRAPH PROCESSING WORKLOADS
Graph processing landscape
[Figure: workloads plotted by amount of data accessed against expected execution time]
Graph analytics: graph analysis (structure only)
  o Giraph, Spark GraphX, Flink Gelly, Neo4j Graph algorithms lib
  o LDBC: Linked Data Benchmark Council
Multiplex graph analysis (structure and types)
  o No off-the-shelf solutions?
OLAP and OLTP queries: graph queries (structure, types, and props)
  o Cypher and SPARQL engines (Neo4j, Virtuoso, Stardog …)
Graph analysis frameworks
Many Apache frameworks can be adapted
  o Hama Graph
  o Giraph on Hadoop
  o Spark GraphX
  o Flink Gelly
But most seem abandoned. Commit counts per path for two date ranges:
$ git rev-list --count --all --no-merges [date options] [path]

  path                          --since="Feb 2 2015" --before="Feb 2 2017"   --since="Feb 2 2017"
  hama/graph                     63                                             0
  giraph                        154                                            67
  spark/graphx                  362                                           120
  flink/flink-libraries/gelly   139                                           112
Arabesque
Open-source graph analytical framework over Hadoop
Frequent subgraph mining: find subgraphs with at least a minimum number of matches
Optimized for distributed execution
arabesque$ git rev-list... 125 91
C.H.C. Teixeira et al. @ SOSP 2015: Arabesque: A system for distributed graph mining
G. Siganos @ FOSDEM 2016: Arabesque: A distributed graph mining platform
Qatar-Computing-Research-Institute/Arabesque
The complexity of graph computations
Graph computation is difficult
  o The “curse of connectedness”
  o Computer architectures are suited for hierarchical data
Graph analytics: vertex-centric programming model
  o “Think like a vertex”
  o Pregel, scatter-gather, gather-apply-scatter
The majority of distributed graph engines use this
Going distributed for a 2M node graph?
V. Kalavri et al. @ TKDE 2018: High-level programming abstractions for distributed graph processing
LINEAR ALGEBRA-BASED CLUSTERING COEFFICIENTS
Adjacency matrices
[Figure: the example graph with nodes B, C, D, E, F]

$B_{\text{friends}}$ =
       B  C  D  E  F
   B   0  1  0  1  0
   C   1  0  1  0  1
   D   0  1  0  0  0
   E   1  0  0  0  0
   F   0  1  0  0  0

$B_{\text{business}}$ =
       B  C  D  E  F
   B   0  0  1  0  0
   C   0  0  1  0  0
   D   1  1  0  1  1
   E   0  0  1  0  1
   F   0  0  1  1  0

$B$ (all edges, types ignored) =
       B  C  D  E  F
   B   0  1  1  1  0
   C   1  0  1  0  1
   D   1  1  0  1  1
   E   1  0  1  0  1
   F   0  1  1  1  0

Optimization #1
Key idea: multiplication of adjacency matrices = 2-hop, 3-hop, etc. paths
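The key idea can be tried out directly: entry (i, j) of the squared adjacency matrix counts the 2-hop paths from i to j, and the cube counts the 3-hop paths. A minimal sketch on the untyped matrix of the example graph, using plain lists rather than a sparse matrix library; `matmul` and the variable names are hypothetical helpers.

```python
# Nodes of the example graph, in the row/column order of the matrices.
NODES = ["B", "C", "D", "E", "F"]

# Untyped adjacency matrix (all edges, types ignored).
ADJ = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
]


def matmul(x, y):
    """Dense matrix product: entry (i, j) sums x[i][k] * y[k][j] over k."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


two_hop = matmul(ADJ, ADJ)        # two_hop[i][j]: number of 2-hop paths i -> j
three_hop = matmul(two_hop, ADJ)  # three_hop[i][j]: number of 3-hop paths i -> j

b, d = NODES.index("B"), NODES.index("D")
print("2-hop paths B -> D:", two_hop[b][d])    # via C and via E
print("3-hop paths B -> B:", three_hop[b][b])  # closed paths through B
```

Note that these products count every sequence of hops, including ones that revisit a node, so the diagonal of the squared matrix equals the node degrees.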
Triangles – untyped
$\mathrm{diag}^{-1}(B \cdot B \cdot B)$
[Figure: the example graph]

        B             B·B            B·B·B         diagonal
                      (2-hop paths)  (3-hop paths)
   B   0 1 1 1 0     3 1 2 1 3      4 8 8 8 4     4
   C   1 0 1 0 1     1 3 2 3 1      8 4 8 4 8     4
   D   1 1 0 1 1     2 2 4 2 2      8 8 8 8 8     8
   E   1 0 1 0 1     1 3 2 3 1      8 4 8 4 8     4
   F   0 1 1 1 0     3 1 2 1 3      4 8 8 8 4     4

The diagonal gives the number of triangles for each node, with every triangle counted twice (once per direction).
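As a sketch of how the diagonal turns into clustering coefficients: halve the diagonal of the cubed matrix to get triangles per node, then divide by the number of wedges, d(d-1)/2 for a node of degree d. The helper names are hypothetical and the dense list-based product stands in for a real sparse implementation.

```python
# Untyped adjacency matrix of the example graph, node order B, C, D, E, F.
ADJ = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
]


def matmul(x, y):
    """Dense matrix product on lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def diagonal(m):
    """Extract the diagonal of a square matrix as a vector."""
    return [m[i][i] for i in range(len(m))]


# Closed 3-hop paths per node: each triangle is traversed in two directions.
closed_paths = diagonal(matmul(matmul(ADJ, ADJ), ADJ))
triangles = [c // 2 for c in closed_paths]

# Wedges per node: unordered pairs of neighbours.
degrees = [sum(row) for row in ADJ]
wedges = [d * (d - 1) // 2 for d in degrees]

lcc = [t / w if w else 0.0 for t, w in zip(triangles, wedges)]
print(closed_paths, triangles, wedges, lcc)
```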
Triangles – typed
$\mathrm{diag}^{-1}(B_j \cdot B_k \cdot B_j)$
[Figure: the example graph]
Example with $B_j = B_{\text{business}}$ and $B_k = B_{\text{friends}}$:

        B_j           B_k           B_j·B_k       B_j·B_k·B_j   diagonal
   B   0 0 1 0 0     0 1 0 1 0     0 1 0 0 0     0 0 1 0 0     0
   C   0 0 1 0 0     1 0 1 0 1     0 1 0 0 0     0 0 1 0 0     0
   D   1 1 0 1 1     0 1 0 0 0     2 2 1 1 1     1 1 6 2 2     6
   E   0 0 1 0 1     1 0 0 0 0     0 2 0 0 0     0 0 2 0 0     0
   F   0 0 1 1 0     0 1 0 0 0     1 1 0 0 0     0 0 2 0 0     0

Matrices get more dense
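Optimization: element-wise multiplication
$B \cdot B \cdot B$ enumerates all 3-hop paths
  o Matrices get more dense, but only the diagonal is used
Idea: use element-wise multiplication (Optimization #2)
Triangle term of $\mathrm{LCC}(w)$: $\mathrm{diag}^{-1}(B \cdot B \cdot B) = (B \cdot B \odot B) \cdot \mathbf{1}$
Typed clustering: $\mathcal{O}(u^2)$ matrix multiplications
$\mathrm{TCC}(w) = \dfrac{\sum_{j \neq k} (B_j \cdot B_k \odot B_j) \cdot \mathbf{1}}{(o-1) \cdot \sum_{j} (B_j \cdot \mathbf{1}) \odot (B_j \cdot \mathbf{1} - \mathbf{1})}$, with the division taken element-wise and read at the entry of node $w$
$\mathcal{O}(u^2)$ matrix multiplications for each node: embarrassingly parallel –> opt. #3
P. Várhegyi – Master’s thesis (2018): Multidimensional graph analytics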
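A minimal sketch of the element-wise formulation on the example graph, using dense lists in place of sparse matrices. It assumes that the normalising factor in the TCC denominator is the number of edge types minus one, following the normalisation of Battiston et al.; with two types this factor is 1. All function names are hypothetical.

```python
# Typed adjacency matrices of the example graph, node order B, C, D, E, F.
FRIENDS = [
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
BUSINESS = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
]
LAYERS = {"friends": FRIENDS, "business": BUSINESS}


def matmul(x, y):
    """Dense matrix product on lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def hadamard(x, y):
    """Element-wise product of two equally sized matrices."""
    return [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def row_sums(x):
    """Multiplication by the all-ones vector."""
    return [sum(row) for row in x]


def closed_paths(x, y):
    """(x . y (.) x) . 1, which equals the diagonal of x . y . x for symmetric x."""
    return row_sums(hadamard(matmul(x, y), x))


def tcc(layers):
    """Typed clustering coefficient per node (assumed normalisation: types - 1)."""
    n = len(next(iter(layers.values())))
    other_types = len(layers) - 1
    numerator = [0] * n
    denominator = [0] * n
    for j, b_j in layers.items():
        # Ordered typed wedges: deg * (deg - 1) for each layer.
        for w, deg in enumerate(row_sums(b_j)):
            denominator[w] += other_types * deg * (deg - 1)
        for k, b_k in layers.items():
            if k != j:
                for w, count in enumerate(closed_paths(b_j, b_k)):
                    numerator[w] += count
    return [num / den if den else 0.0 for num, den in zip(numerator, denominator)]


business_friends = closed_paths(BUSINESS, FRIENDS)
friends_business = closed_paths(FRIENDS, BUSINESS)
tcc_values = tcc(LAYERS)
print(business_friends, friends_business, tcc_values)
```

Each (j, k) pair of types is independent of the others, which is what makes the outer loops straightforward to run in parallel.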
The example using the optimization
$(B_j \cdot B_k \odot B_j) \cdot \mathbf{1}$ with $B_j = B_{\text{business}}$ and $B_k = B_{\text{friends}}$
[Figure: the example graph]

        B_j           B_k           B_j·B_k       B_j·B_k ⊙ B_j   ·1
   B   0 0 1 0 0     0 1 0 1 0     0 1 0 0 0     0 0 0 0 0       0
   C   0 0 1 0 0     1 0 1 0 1     0 1 0 0 0     0 0 0 0 0       0
   D   1 1 0 1 1     0 1 0 0 0     2 2 1 1 1     2 2 0 1 1       6
   E   0 0 1 0 1     1 0 0 0 0     0 2 0 0 0     0 0 0 0 0       0
   F   0 0 1 1 0     0 1 0 0 0     1 1 0 0 0     0 0 0 0 0       0

More sparse matrix
IMPLEMENTATION 1: Java
Question on SoftwareRecs
Efficient Java Matrix Library (2014 – ongoing)
TCC on Paradise papers subsets
[Figure: TCC run times on subsets of the data set]
Less than 1 min.
Only EJML scales for the whole graph
Graph analyzer library
Java implementation
  o Linear algebra-based implementations
  o Uses EJML library with CSC compression
  o Single-threaded
  o Runs on top of Neo4j/EMF/CSV graphs
A number of graph metrics:
  o For nodes, types, node pairs, node-type pairs, …
  o TCC variants, typed degree distribution, degree entropy
  o Pairwise multiplexity, multiplex participation coefficient
P. Várhegyi – Master’s thesis (2018): Multidimensional graph analytics
ftsrg/graph-analyzer
IMPLEMENTATION 2: Julia
Julia language
A high-performance, high-level dynamic language
v1.0 last August, v1.1 just out
Julia implementation
Preliminary TCC implementation: ~25 lines
Written in a few days (incl. learning the language)
$(B \cdot B \odot B) \cdot \mathbf{1}$ in Julia, with A as the adjacency matrix and n as the number of nodes:
    A * A .* A * ones(n)
Performance on Paradise papers: similar to the best Java implementation, but easier to write and extend
IMPLEMENTATION 3: GraphBLAS
Approach for defining graph algorithms
GraphBLAS is an effort to define standard building blocks for graph algorithms in the language of linear algebra
Idea: BLAS (Basic Linear Algebra Subprograms), since 1979

  Numerical applications     Graph analytic applications
  LINPACK/LAPACK             Graph algorithms
  ---------------- separation of concerns ----------------
  BLAS                       GraphBLAS
  Hardware arch.             Hardware arch.

S. McMillan @ SEI Research Review (CMU): Graph algorithms on future architectures
Graph operations with semirings
Many graph algorithms can be captured over arbitrary semirings –> requires overloading of ⨁.⨂
Examples:

  operators   semiring            use case
  +.⋅         real numbers        clustering
  min.+       tropical semiring   shortest path
  ⋁.⋀         boolean semiring    traversal

Old idea, but very few libraries support this.
Aho, Hopcroft, Ullman (1974): The Design and Analysis of Computer Algorithms
Cormen, Leiserson, Rivest (1990): Introduction to Algorithms [only in the 1st edition]
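To make the overloading concrete, here is a minimal sketch of a matrix product parameterised by the two semiring operators. It is not the GraphBLAS API; `semiring_matmul` and the small three-node matrices are hypothetical and chosen only for illustration.

```python
import math
import operator


def semiring_matmul(x, y, add, mul, zero):
    """Matrix product where `add` replaces + and `mul` replaces *.

    `zero` is the identity of `add` and the starting value of each entry.
    """
    result = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            acc = zero
            for k in range(len(y)):
                acc = add(acc, mul(x[i][k], y[k][j]))
            row.append(acc)
        result.append(row)
    return result


INF = math.inf

# Directed weighted graph on three nodes; INF marks a missing edge.
WEIGHTS = [
    [0, 4, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
# Tropical (min.+) semiring: shortest paths using at most two edges.
shortest_two_hops = semiring_matmul(WEIGHTS, WEIGHTS, min, operator.add, INF)

# Directed unweighted chain 0 -> 1 -> 2 as a boolean matrix.
REACH = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
# Boolean (or.and) semiring: which nodes are exactly two hops away.
two_hop_reach = semiring_matmul(REACH, REACH, operator.or_, operator.and_, False)

# Conventional (+.*) semiring: number of 2-hop paths.
COUNTS = [[int(v) for v in row] for row in REACH]
two_hop_counts = semiring_matmul(COUNTS, COUNTS, operator.add, operator.mul, 0)

print(shortest_two_hops)
print(two_hop_reach)
print(two_hop_counts)
```

The loop structure stays the same in all three cases; only the operator pair and its identity element change.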
Graph operations with semirings
SuiteSparse:GraphBLAS
C API and single-threaded implementation
Steep learning curve
Good performance even with a single thread
  o Benchmark work in progress for LCC/TCC
sergiud/SuiteSparse/
R. Lipman, T. Davis @ redisconf18: Graph algebra: graph operations in the language of linear algebra
J. Kepner, J. Gilbert: Graph algorithms in the language of linear algebra. SIAM, 2011
H. Jananthan, J. Kepner: Mathematics of Big Data. MIT Press 2018
RESULTS OF THE ANALYSIS
Results on the Paradise papers
[Figure: two variants of the typed clustering coefficient, with outliers marked]
Note. Further analysis needs domain-specific expertise.
ALGORITHMIC OPTIMIZATION