Multiplex graph analysis with GraphBLAS



  1. Multiplex graph analysis with GraphBLAS. Gábor Szárnyas, with contributions from Petra Várhegyi and Lehel Boér. Fault Tolerant Systems Research Group; MTA-BME Lendület Cyber-Physical Systems Research Group; Budapest University of Technology and Economics, Department of Measurement and Information Systems

  2. PARADISE PAPERS

  3. Paradise papers
     • Data set leaked to investigative journalists
     • Similar to Panama Papers, but larger
     • Mid-sized data set: 2M nodes, 3M edges

  4. Paradise papers schema (figure: node labels and edge types)

  5. GRAPH DATA MODELS

  6. Example: simple graph (figure: nodes Bob, Carol, Dan, Eve, Fred). Other names for the same model:
     • “Textbook graph”
     • Untyped graph
     • Homogeneous network
     • Monoplex graph

  7. Example with edge types: multiplex graph (figure: nodes Bob, Carol, Dan, Eve, Fred; edge types Friends and Business partners). Other names for the same model:
     • Edge-labelled or edge-typed graph
     • Heterogeneous or multidimensional network
     Expressive power: between untyped and property graphs.

  8. MULTIPLEX GRAPH ANALYSIS

  9. Local clustering coefficient metric (figure: a wedge and a triangle at node v)
     • A wedge closes into a triangle
     • $\mathrm{LCC}(v) = \dfrac{\text{No. of triangles in } v}{\text{No. of wedges in } v}$
     K. Faust, S. Wasserman (1994): Social Network Analysis
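The wedge and triangle counts can be sketched directly from neighbour sets. This is a minimal illustration, not the deck's implementation; the edge list is an assumption read off the untyped adjacency matrix of the five-node example graph (B, C, D, E, F), and the helper names `neighbours` and `lcc` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Undirected edges of the example graph (assumed from the deck's adjacency matrix).
EDGES = [("B", "C"), ("B", "D"), ("B", "E"), ("C", "D"),
         ("C", "F"), ("D", "E"), ("D", "F"), ("E", "F")]


def neighbours(edges):
    """Build a node -> set-of-neighbours map for an undirected edge list."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def lcc(adj, v):
    """Fraction of wedges centred on v that are closed into a triangle."""
    wedges = list(combinations(sorted(adj[v]), 2))
    if not wedges:
        return Fraction(0)  # convention chosen here for nodes without wedges
    triangles = sum(1 for u, w in wedges if w in adj[u])
    return Fraction(triangles, len(wedges))


ADJ = neighbours(EDGES)
LCC = {v: lcc(ADJ, v) for v in sorted(ADJ)}
```

On this graph every node has two of its wedges closed out of three (four out of six for D), so each LCC value reduces to 2/3.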

  10. Typed clustering coefficient metric (figure: a typed wedge and a typed triangle at node v)
      • Typed wedge: two edges of the same type
      • Typed triangle: two edges of the same type, closed by an edge with a different type
      • $\mathrm{TCC}(v) = \dfrac{\text{No. of typed triangles in } v}{\text{No. of typed wedges in } v}$
      F. Battiston et al. @ Physical Review E 2014: Structural measures for multiplex networks
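A counting sketch of the typed metric follows. It is hedged in two ways: the typed edge lists are assumptions read off the deck's friends and business adjacency matrices, and the division by the number of edge types minus one is an assumed normalisation (it keeps the value at most 1 when more than two types exist, and it is 1 for this two-type example). The names `neighbours` and `tcc` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Typed undirected edges of the example graph (assumed from the deck's matrices).
TYPED_EDGES = {
    "friends": [("B", "C"), ("B", "E"), ("C", "D"), ("C", "F")],
    "business": [("B", "D"), ("C", "D"), ("D", "E"), ("D", "F"), ("E", "F")],
}


def neighbours(edges):
    """Build a node -> set-of-neighbours map for an undirected edge list."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def tcc(typed_adj, v):
    """Typed wedges at v that are closed by an edge of a different type."""
    types = list(typed_adj)
    wedges = 0
    triangles = 0
    for t in types:
        same_type_neighbours = sorted(typed_adj[t].get(v, ()))
        for u, w in combinations(same_type_neighbours, 2):
            wedges += 1
            # Count closing edges u-w whose type differs from the wedge type.
            triangles += sum(1 for other in types
                             if other != t and w in typed_adj[other].get(u, ()))
    if wedges == 0 or len(types) < 2:
        return Fraction(0)
    return Fraction(triangles, wedges * (len(types) - 1))


TYPED_ADJ = {t: neighbours(edges) for t, edges in TYPED_EDGES.items()}
TCC = {v: tcc(TYPED_ADJ, v) for v in "BCDEF"}
```

For example, C has three friends wedges, two of which (B-D and D-F) are closed by business edges, giving 2/3.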

  11. Typed clusteredness example (figure: the example graph drawn twice, with the nodes B, C, D, E, F annotated with their LCC values in one copy and their TCC values in the other)

  12. Multiplex analysis on Paradise papers
      • Previous research
        o Characterization of HW/SW/building models
        o 100k – 1M nodes/edges
        o Naïve Java implementation using edge lists
      • Ran for the Paradise papers data set
        o Clustering coefficient metrics did not complete in days
      • What’s going on?
        o Implementation and algorithmic aspects need to be tuned
      G. Szárnyas et al. @ MODELS 2016: Towards the characterization of realistic models: evaluation of multidisciplinary graph metrics

  13. GRAPH PROCESSING WORKLOADS

  14. Graph processing landscape (figure: workloads placed by amount of data accessed and expected execution time)
      • Graph analytics: graph analysis (structure only). Giraph, Spark GraphX, Flink Gelly, Neo4j Graph algorithms lib
      • Multiplex graph analysis (structure and types). No off-the-shelf solutions?
      • OLAP queries and OLTP queries: graph queries (structure, types, and props). Cypher and SPARQL engines (Neo4j, Virtuoso, Stardog …)
      • LDBC: Linked Data Benchmark Council

  15. Graph analysis frameworks. Many Apache frameworks can be adapted:
      • Hama Graph
      • Giraph on Hadoop
      • Spark GraphX
      • Flink Gelly
      But most seem abandoned. Commit counts from $ git rev-list --count --all --no-merges, restricted to two date ranges:

      path                           --since="Feb 2 2015" --before="Feb 2 2017"    --since="Feb 2 2017"
      hama/graph                      63                                              0
      giraph                         154                                             67
      spark/graphx                   362                                            120
      flink/flink-libraries/gelly    139                                            112

  16. Arabesque
      • Open-source graph analytical framework over Hadoop
      • Frequent subgraph mining: find subgraphs with a minimum number of matches
      • Optimized for distributed execution
      arabesque$ git rev-list... 125 91
      C.H.C. Teixeira et al. @ SOSP 2015: Arabesque: A system for distributed graph mining
      G. Siganos @ FOSDEM 2016: Arabesque: A distributed graph mining platform
      Qatar-Computing-Research-Institute/Arabesque

  17. The complexity of graph computations
      • Graph computation is difficult
        o The “curse of connectedness”
        o Computer architectures are suited for hierarchical data
      • Graph analytics: vertex-centric programming model
        o “Think like a vertex”
        o Pregel, scatter-gather, gather-apply-scatter
      • The majority of distributed graph engines use this
      • Going distributed for a 2M node graph?
      V. Kalavri et al. @ TKDE 2018: High-level programming abstractions for distributed graph processing

  18. LINEAR ALGEBRA-BASED CLUSTERING COEFFICIENTS

  19. Adjacency matrices (rows and columns ordered B, C, D, E, F)
      $A_{\mathrm{business}} = \begin{pmatrix} 0&0&1&0&0 \\ 0&0&1&0&0 \\ 1&1&0&1&1 \\ 0&0&1&0&1 \\ 0&0&1&1&0 \end{pmatrix}$
      $A_{\mathrm{friends}} = \begin{pmatrix} 0&1&0&1&0 \\ 1&0&1&0&1 \\ 0&1&0&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \end{pmatrix}$
      $A = \begin{pmatrix} 0&1&1&1&0 \\ 1&0&1&0&1 \\ 1&1&0&1&1 \\ 1&0&1&0&1 \\ 0&1&1&1&0 \end{pmatrix}$
      Optimization #1. Key idea: multiplication of adjacency matrices = 2-hop, 3-hop, etc. paths
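As a rough sketch of the key idea, the matrices can be built from typed edge lists and multiplied with a plain triple loop. The edge lists are assumptions read off the slide's matrices, the untyped matrix is taken as the element-wise maximum of the two typed ones, and the helper names `adjacency` and `matmul` are hypothetical. A real implementation would use sparse matrices rather than nested lists.

```python
NODES = ["B", "C", "D", "E", "F"]

# Typed undirected edges (assumed from the slide's adjacency matrices).
FRIENDS = [("B", "C"), ("B", "E"), ("C", "D"), ("C", "F")]
BUSINESS = [("B", "D"), ("C", "D"), ("D", "E"), ("D", "F"), ("E", "F")]


def adjacency(edges, nodes=NODES):
    """Dense symmetric 0/1 adjacency matrix for an undirected edge list."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, w in edges:
        matrix[index[u]][index[w]] = 1
        matrix[index[w]][index[u]] = 1
    return matrix


def matmul(x, y):
    """Plain matrix product of two square nested-list matrices."""
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


A_FRIENDS = adjacency(FRIENDS)
A_BUSINESS = adjacency(BUSINESS)
# Untyped matrix: an edge exists if it exists in at least one type.
A = [[max(f, b) for f, b in zip(fr, br)] for fr, br in zip(A_FRIENDS, A_BUSINESS)]

# Entry [r][c] of A*A counts the 2-hop paths from node r to node c.
TWO_HOP = matmul(A, A)
```

For instance, `TWO_HOP[0][4]` is 3 because B reaches F in two hops via C, via D, and via E.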

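  20. Triangles – untyped: $\operatorname{diag}^{-1}(A \cdot A \cdot A)$ (rows and columns ordered B, C, D, E, F)
      $A = \begin{pmatrix} 0&1&1&1&0 \\ 1&0&1&0&1 \\ 1&1&0&1&1 \\ 1&0&1&0&1 \\ 0&1&1&1&0 \end{pmatrix}$
      2-hop paths: $A \cdot A = \begin{pmatrix} 3&1&2&1&3 \\ 1&3&2&3&1 \\ 2&2&4&2&2 \\ 1&3&2&3&1 \\ 3&1&2&1&3 \end{pmatrix}$
      3-hop paths: $A \cdot A \cdot A = \begin{pmatrix} 4&8&8&8&4 \\ 8&4&8&4&8 \\ 8&8&8&8&8 \\ 8&4&8&4&8 \\ 4&8&8&8&4 \end{pmatrix}$
      Number of triangles for each node: $\operatorname{diag}^{-1}(A \cdot A \cdot A) = (4, 4, 8, 4, 4)^\top$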
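The diagonal extraction can be checked with a few lines of plain Python. This is a sketch under the assumption that A is the symmetric 0/1 matrix of the example graph; `matmul` is a hypothetical helper. One detail worth noting: each diagonal entry counts closed 3-hop walks, so every triangle through a node is counted once per direction, and halving the diagonal gives the number of distinct triangles.

```python
# Untyped adjacency matrix of the example graph, node order B, C, D, E, F.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
]


def matmul(x, y):
    """Plain matrix product of two square nested-list matrices."""
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


A2 = matmul(A, A)    # 2-hop paths
A3 = matmul(A2, A)   # 3-hop paths

# Closed 3-hop walks per node: the diagonal of A^3.
CLOSED_WALKS = [A3[v][v] for v in range(len(A))]

# Each triangle is walked in both directions, so halve to get distinct triangles.
TRIANGLES = [walks // 2 for walks in CLOSED_WALKS]
```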

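  21. Triangles – typed: $\operatorname{diag}^{-1}(A_i \cdot A_j \cdot A_i)$, shown for $A_i = A_{\mathrm{business}}$ and $A_j = A_{\mathrm{friends}}$ (rows and columns ordered B, C, D, E, F)
      $A_{\mathrm{friends}} = \begin{pmatrix} 0&1&0&1&0 \\ 1&0&1&0&1 \\ 0&1&0&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \end{pmatrix}$, $A_{\mathrm{business}} = \begin{pmatrix} 0&0&1&0&0 \\ 0&0&1&0&0 \\ 1&1&0&1&1 \\ 0&0&1&0&1 \\ 0&0&1&1&0 \end{pmatrix}$
      $A_i \cdot A_j = \begin{pmatrix} 0&1&0&0&0 \\ 0&1&0&0&0 \\ 2&2&1&1&1 \\ 0&2&0&0&0 \\ 1&1&0&0&0 \end{pmatrix}$
      $A_i \cdot A_j \cdot A_i = \begin{pmatrix} 0&0&1&0&0 \\ 0&0&1&0&0 \\ 1&1&6&2&2 \\ 0&0&2&0&0 \\ 0&0&2&0&0 \end{pmatrix}$
      $\operatorname{diag}^{-1}(A_i \cdot A_j \cdot A_i) = (0, 0, 6, 0, 0)^\top$
      Matrices get more dense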

  22. Optimization: element-wise multiplication
      • $A \cdot A \cdot A$ enumerates all 3-hop paths
        o Matrices get more dense, but only the diagonal is used
      • Idea: use element-wise multiplication (Optimization #2)
        $\mathrm{LCC}(v) = \operatorname{diag}^{-1}(A \cdot A \cdot A) = (A \cdot A \odot A) \cdot \vec{1}$
      • Typed clustering: $O(t^2)$ matrix multiplications, giving one value for each node
        $\mathrm{TCC}(v) = \dfrac{\sum_{i \neq j} (A_i \cdot A_j \odot A_i) \cdot \vec{1}}{(n-1) \cdot \sum_i (A_i \cdot \vec{1}) \odot (A_i \cdot \vec{1} - 1)}$ (the division is element-wise)
      • Embarrassingly parallel -> opt. #3
      P. Várhegyi – Master’s thesis (2018): Multidimensional graph analytics
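Both formulas can be tried out on the example graph with nested lists. The sketch below makes several assumptions: the matrices are symmetric 0/1 matrices in node order B, C, D, E, F; the factor written n − 1 on the slide is taken to be the number of edge types minus one; nodes without typed wedges get the value 0; and the helper names (`matmul`, `hadamard`, `row_sums`, `tcc`) are hypothetical. It is not the Java, Julia, or GraphBLAS code the deck refers to.

```python
from fractions import Fraction

A_FRIENDS = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 0, 0],
             [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
A_BUSINESS = [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 1, 1],
              [0, 0, 1, 0, 1], [0, 0, 1, 1, 0]]
LAYERS = [A_FRIENDS, A_BUSINESS]
# Untyped matrix: element-wise maximum of the typed matrices.
A = [[max(f, b) for f, b in zip(fr, br)] for fr, br in zip(A_FRIENDS, A_BUSINESS)]


def matmul(x, y):
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def hadamard(x, y):
    """Element-wise product."""
    return [[a * b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]


def row_sums(x):
    """Multiplication by the all-ones vector."""
    return [sum(row) for row in x]


# Untyped: (A*A (.) A) * 1 equals the diagonal of A*A*A for symmetric A.
MASKED = row_sums(hadamard(matmul(A, A), A))
A3 = matmul(matmul(A, A), A)
DIAGONAL = [A3[v][v] for v in range(len(A))]


def tcc(layers):
    """Typed clustering coefficient per node, following the slide's formula."""
    n = len(layers[0])
    numerator = [0] * n
    denominator = [0] * n
    for i, a_i in enumerate(layers):
        degree = row_sums(a_i)
        for v in range(n):
            denominator[v] += degree[v] * (degree[v] - 1)
        for j, a_j in enumerate(layers):
            if i != j:
                closed = row_sums(hadamard(matmul(a_i, a_j), a_i))
                for v in range(n):
                    numerator[v] += closed[v]
    factor = len(layers) - 1  # assumed meaning of the (n - 1) term
    return [Fraction(numerator[v], factor * denominator[v])
            if factor * denominator[v] else Fraction(0)
            for v in range(n)]


TCC = tcc(LAYERS)
```

The nested loop over ordered pairs of distinct types is where the quadratic number of matrix multiplications comes from; each pair is independent of the others, which is what makes the computation easy to parallelise.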

  23. The example using the optimization, for $A_i = A_{\mathrm{business}}$ and $A_j = A_{\mathrm{friends}}$ (rows and columns ordered B, C, D, E, F)
      $A_i \cdot A_j = \begin{pmatrix} 0&1&0&0&0 \\ 0&1&0&0&0 \\ 2&2&1&1&1 \\ 0&2&0&0&0 \\ 1&1&0&0&0 \end{pmatrix}$
      $A_i \cdot A_j \odot A_i = \begin{pmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 2&2&0&1&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix}$
      $(A_i \cdot A_j \odot A_i) \cdot \vec{1} = (0, 0, 6, 0, 0)^\top$
      More sparse matrix
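The sparsity difference can be made concrete by counting nonzero entries in both variants. This is a small sketch on dense nested lists, assuming the business and friends matrices of the example graph; the helper names are hypothetical. Note that this toy version still materialises the full product before masking it, whereas a sparse library with masked multiplication can avoid computing the masked-out entries at all.

```python
A_FRIENDS = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 0, 0],
             [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
A_BUSINESS = [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 1, 1],
              [0, 0, 1, 0, 1], [0, 0, 1, 1, 0]]


def matmul(x, y):
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def hadamard(x, y):
    """Element-wise product."""
    return [[a * b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]


def nnz(x):
    """Number of nonzero entries."""
    return sum(1 for row in x for value in row if value != 0)


PRODUCT = matmul(A_BUSINESS, A_FRIENDS)

# Variant 1: second matrix multiplication, then read the diagonal.
FULL = matmul(PRODUCT, A_BUSINESS)
DIAGONAL = [FULL[v][v] for v in range(len(FULL))]

# Variant 2: element-wise multiplication, then sum each row.
MASKED = hadamard(PRODUCT, A_BUSINESS)
ROW_SUMS = [sum(row) for row in MASKED]

NNZ_FULL = nnz(FULL)
NNZ_MASKED = nnz(MASKED)
```

Both variants yield the same per-node vector, but the masked matrix can only have nonzeros where the business matrix has them.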

  24. IMPLEMENTATION 1: Java

  25. Question on SoftwareRecs (figure). Efficient Java Matrix Library (2014 – ongoing)

  26. TCC on Paradise papers subsets (figure). Less than 1 min. Only EJML scales for the whole graph

  27. Graph analyzer library
      • Java implementation
        o Linear algebra-based implementations
        o Uses EJML library with CSC compression
        o Single-threaded
        o Runs on top of Neo4j/EMF/CSV graphs
      • Number of graph metrics:
        o For nodes, types, node pairs, node-type pairs, …
        o TCC variants, typed degree distribution, degree entropy
        o Pairwise multiplexity, multiplex participation coefficient
      P. Várhegyi – Master’s thesis (2018): Multidimensional graph analytics
      ftsrg/graph-analyzer
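For readers unfamiliar with CSC (compressed sparse column) storage, the layout can be sketched in a few lines. This illustrates the format only and does not use or mirror the EJML API; the function names `to_csc` and `column` are hypothetical, and the input is the friends matrix of the example graph.

```python
def to_csc(dense):
    """Convert a dense nested-list matrix to (col_ptr, row_idx, values)."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    col_ptr = [0]
    row_idx = []
    values = []
    for c in range(n_cols):
        for r in range(n_rows):
            if dense[r][c] != 0:
                row_idx.append(r)
                values.append(dense[r][c])
        # col_ptr[c + 1] marks where column c ends in row_idx/values.
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, values


def column(csc, c):
    """Return the (row, value) pairs stored for column c."""
    col_ptr, row_idx, values = csc
    return [(row_idx[k], values[k]) for k in range(col_ptr[c], col_ptr[c + 1])]


# Friends matrix of the example graph, node order B, C, D, E, F.
A_FRIENDS = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 0, 0],
             [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
CSC = to_csc(A_FRIENDS)
```

Only the eight nonzero entries are stored, plus one pointer per column, which is why the format suits adjacency matrices of sparse graphs.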

  28. IMPLEMENTATION 2: Julia

  29. Julia language
      • A high-performance, high-level dynamic language
      • v1.0 last August, v1.1 just out

  30. Julia implementation
      • Preliminary TCC implementation: ~25 lines
      • Written in a few days (incl. learning the language)
      • $(A \cdot A \odot A) \cdot \vec{1}$ is written as A * A .* A * ones(n)
      Performance on Paradise papers:
      • similar to the best Java implementation
      • but easier to write and extend

  31. IMPLEMENTATION 3: GraphBLAS

  32. Approach for defining graph algorithms
      • GraphBLAS is an effort to define standard building blocks for graph algorithms in the language of linear algebra
      • Idea: BLAS (Basic Linear Algebra Subprograms), since 1979
      Two layered stacks, listed top to bottom, with separation of concerns between the layers:
      • Numerical applications, LINPACK/LAPACK, BLAS, Hardware arch.
      • Graph analytic applications, Graph algorithms, GraphBLAS, Hardware arch.
      S. McMillan @ SEI Research Review (CMU): Graph algorithms on future architectures

  33. Graph operations with semirings. Many graph algorithms can be captured over arbitrary semirings -> requires overloading of ⨁.⨂
      Examples:
      • real numbers, +.⋅ : clustering
      • tropical semiring, min.+ : shortest path
      • boolean semiring, ⋁.⋀ : traversal
      Old idea, but very few libraries support this.
      Aho, Hopcroft, Ullman (1974): The Design and Analysis of Computer Algorithms
      Cormen, Leiserson, Rivest (1990): Introduction to Algorithms [only in the 1st edition]
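The overloading idea can be sketched with a matrix-vector product that takes its two operators as parameters. Everything here is illustrative and assumed: the `Semiring` class, the `mxv` function, and the three-node directed graph (edges 0→1 with weight 4, 0→2 with weight 1, 2→1 with weight 2) are not taken from the deck or from the GraphBLAS API. Matrix entry [i][j] describes the edge from node j to node i.

```python
import math
import operator
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # the "plus" operator
    mul: Callable[[Any, Any], Any]   # the "times" operator
    zero: Any                        # identity element of add


PLUS_TIMES = Semiring(operator.add, operator.mul, 0)
MIN_PLUS = Semiring(min, operator.add, math.inf)
OR_AND = Semiring(operator.or_, operator.and_, False)


def mxv(semiring, matrix, vector):
    """Matrix-vector product where + and * are replaced by the semiring ops."""
    result = []
    for row in matrix:
        acc = semiring.zero
        for a, x in zip(row, vector):
            acc = semiring.add(acc, semiring.mul(a, x))
        result.append(acc)
    return result


INF = math.inf
# Min-plus: the diagonal 0 keeps a node's current distance.
WEIGHTS = [[0, INF, INF], [4, 0, 2], [1, INF, 0]]
# Boolean and counting versions of the same edges.
REACH = [[False, False, False], [True, False, True], [True, False, False]]
COUNT = [[0, 0, 0], [1, 0, 1], [1, 0, 0]]

DIST_1 = mxv(MIN_PLUS, WEIGHTS, [0, INF, INF])   # one relaxation step from node 0
DIST_2 = mxv(MIN_PLUS, WEIGHTS, DIST_1)          # second relaxation step
FRONTIER = mxv(OR_AND, REACH, [True, False, False])                # one traversal step
WALKS_2 = mxv(PLUS_TIMES, COUNT, mxv(PLUS_TIMES, COUNT, [1, 0, 0]))  # 2-step walks from node 0
```

The same loop serves all three workloads; only the operator pair and its identity element change, which is the separation that a semiring-aware library exposes.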

  34. Graph operations with semirings
      • SuiteSparse:GraphBLAS
      • C API and single-threaded implementation
      • Steep learning curve
      • Good performance even with a single thread
        o Benchmark work in progress for LCC/TCC
      sergiud/SuiteSparse/
      R. Lipman, T. Davis @ redisconf18: Graph algebra: graph operations in the language of linear algebra
      J. Kepner, J. Gilbert: Graph algorithms in the language of linear algebra. SIAM, 2011
      H. Jananthan, J. Kepner: Mathematics of Big Data. MIT Press 2018

  35. RESULTS OF THE ANALYSIS

  36. Results on the Paradise papers (figure: two variants of the typed clustering coefficient, with outliers marked). Note: further analysis needs domain-specific expertise.

  37. ALGORITHMIC OPTIMIZATION
