
KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs

Safaa Diab, Mhd Ghaith Olabi, Izzat El Hajj. American University of Beirut. HPEC Graph Challenge, September 23, 2020.


  1. KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs
     Safaa Diab, Mhd Ghaith Olabi, Izzat El Hajj, American University of Beirut
     HPEC Graph Challenge, September 23, 2020

  2. Overview
     KTrussExplorer is a highly parameterized framework for exploring different combinations of k-truss decomposition optimizations on GPUs.
     Supported features:
     • Edge-centric parallelization
     • Undirected or directed graphs
     • Directed by index or by degree
     • Tiling the adjacency matrix
     • Parallelizing intersections
     • Removing or marking weak edges
     • Recomputing for all or affected edges
     Contributions:
     • A survey of optimizations
     • A framework for exploring the design space (github.com/ielhajj/ktruss-explorer)
     • A view of the design space
     • Unexplored combinations faster than prior champions
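For readers new to the problem, the sketch below shows what a k-truss computation does in its simplest form: repeatedly delete "weak" edges whose support (the number of triangles containing the edge) is below k - 2, until none remain. It is a sequential Python illustration written for this transcript, not KTrussExplorer's CUDA code; the function name `ktruss` and the adjacency-set representation are assumptions.

```python
def ktruss(edges, k):
    """Return the edges of the k-truss of a simple undirected graph.

    Illustrative only: a sequential peel, not the GPU implementation.
    """
    # Adjacency sets for an undirected simple graph (no self-loops assumed).
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    alive = {frozenset(e) for e in edges}

    changed = True
    while changed:
        changed = False
        # Support of an edge = number of common neighbours of its endpoints.
        # All supports are evaluated before any edge is deleted in this round.
        weak = [e for e in alive if len(adj[min(e)] & adj[max(e)]) < k - 2]
        for e in weak:
            u, v = tuple(e)
            adj[u].discard(v)
            adj[v].discard(u)
            alive.discard(e)
            changed = True
    return sorted(tuple(sorted(e)) for e in alive)
```

For example, on a 4-clique with one extra pendant edge, `ktruss(edges, 3)` drops only the pendant edge, and `ktruss(edges, 5)` returns an empty list.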

  3. Methodology
     • Software: KTrussExplorer kernels are implemented in CUDA
     • System: Evaluation is on one Volta V100 GPU with 16GB of memory
     • Datasets: We evaluate with all graphs in the Graph Challenge collection, except Friendster, graph500-scale24-ef16, and graph500-scale25-ef16, which are excluded due to limited device memory capacity
     • Search space: The design space is searched exhaustively, except for very large graphs

  4. Graph Directedness
     [Illustration: a triangle on vertices 0, 1, 2. Undirected graph: entries {0,1}, {0,2}, {1,0}, {1,2}, {2,0}, {2,1}, each labelled support +1. Directed graph: entries {0,1}, {0,2}, {1,2}, each labelled support +1.]
     Advantages listed on the slide:
     • Less redundancy
     • Less synchronization (no atomics)
     • Stop counting early
     [Plot: speedup of directed over undirected (log scale, 0.25 to 64; above 1 directed is faster, below 1 undirected is faster) versus average number of triangles per edge (10^-6 to 10^2), k = 3.]
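The two counting styles contrasted on this slide can be sketched as follows. This is a sequential Python illustration under my own assumptions (function names, dictionary-based storage, directing edges from lower to higher index); the mapping of the slide's three advantages to the two variants is my reading of the flattened text, not something the transcript states. In the undirected variant each edge counts only its own triangles, so no other edge's counter is touched. In the directed variant each triangle is discovered once and three counters are updated, which on a GPU would typically require atomic additions.

```python
def support_undirected(edges):
    """Each edge intersects the full neighbour sets of its endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is visited once per edge it contains (redundant work),
    # but each edge writes only its own counter.
    return {(min(u, v), max(u, v)): len(adj[u] & adj[v]) for u, v in edges}


def support_directed(edges):
    """Edges point from lower to higher index; each triangle is found once."""
    out = {}
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        out.setdefault(a, set()).add(b)
        out.setdefault(b, set())
    support = {(min(u, v), max(u, v)): 0 for u, v in edges}
    for (a, b) in support:
        for c in out[a] & out[b]:  # c > b > a, so triangle (a, b, c) is seen once
            # One discovery updates three edges; a parallel version would
            # need atomic increments here.
            support[(a, b)] += 1
            support[(a, c)] += 1
            support[(b, c)] += 1
    return support
```

Both functions return the same supports; they differ only in how much work is repeated and in who writes which counter.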

  5. Directing Edges by Degree
     Directed by index:
     • Keep edges from vertex with lower index to vertex with higher index
     Directed by degree:
     • Keep edges from vertex with lower degree to vertex with higher degree
     • Advantage: shrink large adjacency lists to reduce load imbalance
     [Plot: speedup of directed by degree over directed by index (log scale, 0.5 to 8; above 1 directed by degree is faster, below 1 directed by index is faster) versus maximum vertex degree (10^0 to 10^6), k = 3.]
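A small sketch of the two orientation rules, assuming an edge list as input. The function names and the tie-break by index when two vertices have equal degree are my assumptions; the slide does not say how ties are handled.

```python
def orient(edges, by="index"):
    """Return out-neighbour lists after directing each undirected edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    if by == "index":
        rank = lambda x: x
    else:
        # Lower degree first; equal degrees fall back to index (assumption).
        rank = lambda x: (deg[x], x)
    out = {x: [] for x in deg}
    for u, v in edges:
        src, dst = (u, v) if rank(u) < rank(v) else (v, u)
        out[src].append(dst)
    return out


def max_out_degree(out):
    """Length of the longest adjacency list after orientation."""
    return max(len(nbrs) for nbrs in out.values())
```

On a star graph whose hub has the lowest index, directing by index leaves the hub with every edge, while directing by degree gives each leaf a single edge and the hub none, which is the load-balancing effect the slide describes.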

  6. Tiling
     [Illustration: example graph on vertices 0 to 7, its logical adjacency list, and its CSR representation without tiling and with tiling.]
     CSR representation without tiling:
     srcPtr: 0 4 7 10 12 14 18 20 24
     dstIdx: 1 5 6 7 0 3 5 4 5 7 1 5 2 7 0 1 2 3 0 7 0 2 4 6
     Tiled CSR representation with tiling:
     srcPtr: 0 1 3 3 4 7 8 11 12 13 17 18 20 21 21 22 24
     dstIdx: 1 0 3 1 5 6 7 5 4 5 7 5 2 0 1 2 3 0 0 2 7 7 4 6
     Benefits:
     • Better locality
     • Partitioning intersections into smaller sub-intersections
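The following sketch builds both layouts for the slide's 8-vertex example. The edge list is my reconstruction from the slide's dstIdx array, and the tile-major ordering (tile row, then tile column, then row within the tile, with 4x4 tiles) is inferred from the slide's tiled srcPtr values rather than stated in the transcript. Function names are hypothetical.

```python
def pack(segments):
    """Concatenate sorted segments into (src_ptr, dst_idx) arrays."""
    src_ptr, dst_idx = [0], []
    for seg in segments:
        dst_idx.extend(sorted(seg))
        src_ptr.append(len(dst_idx))
    return src_ptr, dst_idx


def build_csr(n, edges):
    """Plain CSR: one neighbour segment per row of the adjacency matrix."""
    rows = [[] for _ in range(n)]
    for u, v in edges:
        rows[u].append(v)
        rows[v].append(u)
    return pack(rows)


def build_tiled_csr(n, edges, tile):
    """Tiled CSR: one segment per (tile row, tile column, row within tile)."""
    tiles = (n + tile - 1) // tile  # tiles per dimension
    segs = [[] for _ in range(tiles * tiles * tile)]

    def seg(u, v):
        # Position of entry (u, v) in tile-major order (inferred layout).
        return ((u // tile) * tiles + v // tile) * tile + u % tile

    for u, v in edges:
        segs[seg(u, v)].append(v)
        segs[seg(v, u)].append(u)
    return pack(segs)


# Example graph as read from the slide (reconstruction, see lead-in).
EDGES = [(0, 1), (0, 5), (0, 6), (0, 7), (1, 3), (1, 5),
         (2, 4), (2, 5), (2, 7), (3, 5), (4, 7), (6, 7)]
```

With tiling, a row's neighbours are split into one short segment per column tile, so intersecting two rows becomes a set of smaller per-tile intersections, and any tile in which either row has an empty segment can be skipped outright.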

  7. Benefits of Tiling
     [Illustration: 8x8 adjacency matrices comparing the memory access pattern without tiling (bad locality) and with tiling (good locality).]

  8. Benefits of Tiling
     [Illustration: 8x8 adjacency matrices comparing an intersection without tiling (one full intersection) and with tiling (sub-intersection 1 and sub-intersection 2, with "(trivially empty)" labels).]

  9. Benefits of Tiling
     [Illustration: intersection with tiling, split into sub-intersection 1 and sub-intersection 2, with "(trivially empty)" labels.]
     [Plot: speedup of tiling (0.8 to 1.8; above 1 tiling is faster, below 1 no tiling is faster) versus average vertex degree (1 to 32), k = 3.]

  10. Parallelizing Intersections
     [Illustration: tiled adjacency matrix with sub-intersection 1 and sub-intersection 2.]
     [Plot: speedup of parallelizing intersections (0.6 to 1.3; above 1 parallelization is faster, below 1 no parallelization is faster) versus number of edges (10^1 to 10^9), k = 3.]

  11. Removing Deleted Edges Intermediately
     [Illustration: srcPtr and dstIdx arrays with weak edges highlighted, shown before deletion, with deleted entries marked (x), and after removal.]
     Mark deleted edges:
     • No overhead to remove edges
     Remove deleted edges (for select iterations):
     • Shorter intersections
     [Plot: speedup of removing deleted edges intermediately (0.4 to 2.2; above 1 removing is faster, below 1 not removing is faster) versus number of edges (10^1 to 10^9), k = k max.]
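The two options on this slide can be sketched on a CSR pair as below. The sentinel value `DELETED = -1` and the function names are assumptions for illustration; the slide only shows deleted entries drawn as "x".

```python
DELETED = -1  # hypothetical sentinel standing in for the slide's "x" marks


def mark_deleted(dst_idx, positions):
    """Option 1: overwrite deleted entries in place; array sizes are unchanged."""
    for p in positions:
        dst_idx[p] = DELETED


def compact(src_ptr, dst_idx):
    """Option 2: rebuild the arrays without the marked entries."""
    new_ptr, new_idx = [0], []
    for row in range(len(src_ptr) - 1):
        segment = dst_idx[src_ptr[row]:src_ptr[row + 1]]
        new_idx.extend(v for v in segment if v != DELETED)
        new_ptr.append(len(new_idx))
    return new_ptr, new_idx
```

Marking is cheap but leaves dead entries that later intersections must step over; compacting costs a pass over the arrays but shortens every subsequent intersection, which is why the slide applies it only on select iterations.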

  12. Recomputing Support for All or Affected Edges
     Legend:
     • Edges that are not affected and whose threads do not need to recount
     • Edges that are not affected but whose threads need to recount on behalf of affected edges
     • Edges that are affected and whose threads need to recount
     • Weak edges that were deleted
     [Illustration: the four edge categories shown on an undirected graph and a directed graph, each on vertices 0 to 5.]
     Graphs performing better with only affected edges reprocessed:
     • graph500-scale20-ef16
     • graph500-scale21-ef16
     • graph500-scale23-ef16
     For further investigation:
     • Recomputing for affected edges on select iterations (later iterations)

  13. Marking Affected Edges
     Legend:
     • Edges that are not affected and whose threads do not need to recount
     • Edges that are not affected but whose threads need to recount on behalf of affected edges
     • Edges that are affected and whose threads need to recount
     • Weak edges that were deleted
     [Illustration: example graph on vertices 0 to 5.]
     Pseudocode for Marking Affected Edges:
       parallel for e = {u, v} ∈ E do
         if e is deleted then
           mark u as affected, mark v as affected

  14. Marking Affected Edges
     Legend:
     • Edges that are not affected and whose threads do not need to recount
     • Edges that are not affected but whose threads need to recount on behalf of affected edges
     • Edges that are affected and whose threads need to recount
     • Weak edges that were deleted
     [Illustration: example graph on vertices 0 to 5.]
     Pseudocode for Marking Affected Edges:
       parallel for e = {u, v} ∈ E do
         if e is deleted then
           mark u as affected, mark v as affected
       parallel for e = {u, v} ∈ E do
         if e is not deleted and (u is affected or v is affected) then
           mark e as affected
           if u is not affected then mark u as needs to recount
           else if v is not affected then mark v as needs to recount

  15. Marking Affected Edges
     Legend:
     • Edges that are not affected and whose threads do not need to recount
     • Edges that are not affected but whose threads need to recount on behalf of affected edges
     • Edges that are affected and whose threads need to recount
     • Weak edges that were deleted
     [Illustration: example graph on vertices 0 to 5.]
     Pseudocode for Marking Affected Edges:
       parallel for e = {u, v} ∈ E do
         if e is deleted then
           mark u as affected, mark v as affected
       parallel for e = {u, v} ∈ E do
         if e is not deleted and (u is affected or v is affected) then
           mark e as affected
           if u is not affected then mark u as needs to recount
           else if v is not affected then mark v as needs to recount
       parallel for e = {u, v} ∈ E do
         if e is not deleted and e is not affected then
           if u needs to recount or v needs to recount then
             mark e as needs to recount
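The three passes of the pseudocode translate directly into the sequential Python sketch below. The function name, the use of Python sets for the flags, and the choice to pass deleted edges as a set of edge indices are assumptions; each `for` loop corresponds to one `parallel for` in the slide, which on the GPU would run with one thread per edge.

```python
def mark_affected(edges, deleted):
    """Classify surviving edges after a round of deletions.

    edges:   list of (u, v) pairs
    deleted: set of indices into `edges` that were deleted (assumed input form)
    Returns (edge_affected, edge_recount) as sets of edge indices.
    """
    # Pass 1: endpoints of deleted edges become affected vertices.
    vertex_affected = set()
    for i, (u, v) in enumerate(edges):
        if i in deleted:
            vertex_affected.update((u, v))

    # Pass 2: surviving edges touching an affected vertex are affected;
    # an unaffected endpoint of such an edge is flagged to recount.
    edge_affected, vertex_recount = set(), set()
    for i, (u, v) in enumerate(edges):
        if i not in deleted and (u in vertex_affected or v in vertex_affected):
            edge_affected.add(i)
            if u not in vertex_affected:
                vertex_recount.add(u)
            elif v not in vertex_affected:
                vertex_recount.add(v)

    # Pass 3: unaffected edges touching a flagged vertex must recount
    # on behalf of the affected edges.
    edge_recount = set()
    for i, (u, v) in enumerate(edges):
        if i not in deleted and i not in edge_affected:
            if u in vertex_recount or v in vertex_recount:
                edge_recount.add(i)
    return edge_affected, edge_recount
```

On the path 0-1-2-3-4 with the first edge deleted, edge (1, 2) is affected, edge (2, 3) must recount on its behalf, and edge (3, 4) is left alone.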

  16. Comparison with Prior Champions
     [Plot: speedup over 2018 champions (Bisson & Fatica) (log scale, 0.25 to 8) versus number of edges (10^1 to 10^9), k = 3.]

  17. KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs
     Safaa Diab, Mhd Ghaith Olabi, Izzat El Hajj, American University of Beirut
     github.com/ielhajj/ktruss-explorer
