Accelerating Approximate Weighted Matching on GPUs



  1. Accelerating Approximate Weighted Matching on GPUs
     MD NAIM*, FREDRIK MANNE*, MAHANTESH HALAPPANAVAR+, ANTONINO TUMEO+, JOHANNES LANGGUTH#
     * University of Bergen, Bergen, Norway
     + Pacific Northwest National Laboratory, Richland, WA, USA
     # Simula Research Laboratory, Oslo, Norway
     April 14, 2016

  2. The Edge Weighted Matching Problem
     - Given an edge-weighted graph G(V, E), select a set M of pairwise non-incident edges of maximum total weight.
     - [Figure: example graph with edge weights 7, 2, 6, 6, 1, 8, 10, 5, 9; W(opt) = 22.]
     - The best known exact algorithm runs in $O(|V||E| + |V|^2 \log |V|)$ time [Gabow 90].
     - This is often too expensive for real applications.
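To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not from the deck: the function names are hypothetical, and the five-vertex graph is the one read off the Graph Data Structures slide (edge weights 4, 5, 7, 9, 10), not the nine-edge figure on this slide. The enumeration is exponential and only meant for tiny inputs.

```python
from itertools import combinations

# Assumed example graph (taken from the deck's compressed edge list slide).
EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]


def is_matching(edges):
    """True if no two edges in `edges` share an endpoint."""
    seen = set()
    for u, v, _ in edges:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True


def brute_force_matching(edges):
    """Try every subset of edges and keep the heaviest valid matching."""
    best, best_weight = (), 0
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            if is_matching(subset):
                weight = sum(w for _, _, w in subset)
                if weight > best_weight:
                    best, best_weight = subset, weight
    return list(best), best_weight


matching, weight = brute_force_matching(EDGES)
print(matching, weight)
```

On this graph the heaviest matching consists of the edges (a, c) and (d, e).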

  3. Greedy Algorithm
     - Add the heaviest remaining edge (x, y) to the solution.
     - Remove (x, y) and all edges incident on x or y.
     - [Figure: the example graph from the previous slide, edge weights 7, 2, 6, 6, 1, 8, 10, 5, 9.]
     - Running time: $O(|E| \log |V|)$, dominated by sorting the edges.
     - Guarantees a 1/2-approximation, but in practice the result is often much better.
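The two greedy steps can be sketched in a few lines of Python. The function name and the three-edge path used as input are illustrative assumptions, chosen because Greedy is not optimal on it.

```python
def greedy_matching(edges):
    """Scan edges from heaviest to lightest, keeping those whose endpoints are still free."""
    matched = set()
    matching = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Skipping an edge with a matched endpoint is equivalent to having removed it earlier.
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


# Hypothetical path 1 - 2 - 3 - 4 with weights 2, 3, 2.
PATH = [(1, 2, 2), (2, 3, 3), (3, 4, 2)]
greedy = greedy_matching(PATH)
greedy_weight = sum(w for _, _, w in greedy)
optimal_weight = 4  # the two outer edges, 2 + 2
print(greedy, greedy_weight, optimal_weight)
```

Greedy takes the middle edge of weight 3 and blocks both outer edges, so it reaches 3 against an optimum of 4, which is within the 1/2 guarantee.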

  4. Suitor Algorithm (Manne and Halappanavar 2015)
     - P(x) = v, the heaviest neighbor of x that does not already have a better offer than w(x, v).
     - [Figure: vertices x and y; edge (x, y) is in M.]
     - Lemma: If P(x) is set correctly for every x, then P() defines the same solution as Greedy.
     - Outline of the Suitor algorithm:
       - Process the vertices in a linear fashion to find the best candidate for each vertex.
       - Set the Suitor value of the candidate.
       - Whenever a vertex is dislodged, it must be re-processed before moving on to the next vertex.
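The outline above can be sketched sequentially in Python. This is an illustration under stated assumptions, not the authors' code: the names `suitor_matching`, `suitor`, and `offer` are hypothetical, edge weights are assumed to be distinct and positive so that no tie-breaking is needed, and the adjacency lists are read off the deck's Graph Data Structures slide.

```python
# Assumed example graph: a-b 4, a-c 5, c-d 7, c-e 9, d-e 10.
ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}


def suitor_matching(adj):
    suitor = {v: None for v in adj}  # who currently proposes to v
    offer = {v: 0 for v in adj}      # weight of that proposal
    for start in adj:
        current = start
        while current is not None:
            # Heaviest neighbor that does not already hold a better offer.
            partner, heaviest = None, 0
            for v, w in adj[current]:
                if w > heaviest and w > offer[v]:
                    partner, heaviest = v, w
            if partner is None:
                break  # no candidate left; current stays unmatched for now
            dislodged = suitor[partner]
            suitor[partner], offer[partner] = current, heaviest
            current = dislodged  # re-process the dislodged vertex, if any
    # An edge is matched when both endpoints are each other's suitor.
    return sorted((u, v) for u, v in suitor.items()
                  if v is not None and suitor[v] == u and u < v)


print(suitor_matching(ADJ))
```

On this graph the result is the pair of edges (a, c) and (d, e), which is also what Greedy selects, in line with the lemma.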

  5. Graph Data Structures: Compressed Edge List
     [Figure: example graph on vertices a, b, c, d, e with edge weights 4, 5, 7, 9, 10.]
     Vertex   a   b   c   d   e   END
     Start    0   2   3   6   8   10
     Idx      0   1   2   3   4   5   6   7   8   9
     E        b   c   a   a   d   e   c   e   c   d
     W        4   5   4   5   7   9   7   10  9   10
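A short Python sketch of how such a compressed edge list is traversed. The array contents are my reading of this slide and should be treated as an assumption where the extraction is ambiguous; the helper name `neighbors` is hypothetical.

```python
VERTICES = ["a", "b", "c", "d", "e"]
START = [0, 2, 3, 6, 8, 10]  # START[i]..START[i+1] delimits vertex i's neighbors; last entry is END
E = ["b", "c", "a", "a", "d", "e", "c", "e", "c", "d"]  # neighbor ids
W = [4, 5, 4, 5, 7, 9, 7, 10, 9, 10]                    # matching edge weights


def neighbors(v):
    """Return the (neighbor, weight) pairs of vertex v."""
    i = VERTICES.index(v)
    return [(E[k], W[k]) for k in range(START[i], START[i + 1])]


for v in VERTICES:
    print(v, neighbors(v))
```

Each undirected edge is stored twice, once per endpoint, so the five edges occupy ten slots in `E` and `W`.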

  6-19. Suitor Algorithm: Example
     [Animation on the example graph with vertices a, b, c, d, e and edge weights 4, 5, 7, 9, 10. Legend: Unprocessed, Current, Candidate, Processed. Each vertex carries a Suitor : Offer label, initially empty.]
     Label updates, in frame order:
     - The label at c becomes a : 5.
     - The label at a becomes b : 4.
     - One of the two labels at d and e becomes c : 9 and is then overwritten with d : 9.
     - The other label at d and e becomes c : 7 and is then overwritten with e : 10.
     - The label at b stays empty in every frame.

  20. Suitor Algorithm: Recap
     - Search for the best possible candidate for each vertex v: $O(d_v)$.
     - In the worst case, the search must be repeated $d_v$ times.
     - Set the Suitor and the corresponding Offer: $O(1)$.
     - Running time: $O(\sum_{v \in V} d_v^2) \le O(\Delta |E|)$.
     - With sorted neighbor lists: $O(\sum_{v \in V} d_v \log d_v + |V| + |E|) \le O(|E| \log \Delta + |V| + |E|)$.
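Both upper bounds follow from bounding one factor by the maximum degree and using the fact that the degrees sum to twice the number of edges. A short derivation, assuming $\Delta = \max_{v \in V} d_v$ and reading the sorted-list term as $\sum_{v \in V} d_v \log d_v$:

```latex
\sum_{v \in V} d_v^2 \;\le\; \Delta \sum_{v \in V} d_v \;=\; 2\,\Delta\,|E| \;=\; O(\Delta |E|)

\sum_{v \in V} d_v \log d_v \;\le\; \log \Delta \sum_{v \in V} d_v \;=\; 2\,|E| \log \Delta \;=\; O(|E| \log \Delta)
```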

  21. Observations
     - The order in which vertices are processed has no influence on the final matching.
     - The vertices can be processed independently.
     - The Suitor values must be protected, because multiple vertices may try to become the suitor of the same vertex at the same time.
     - [Figure: vertices u and v both targeting x; mutual exclusion is required.]
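A minimal sketch of the mutual-exclusion point, using CPU threads and one lock per vertex in Python. This is an assumption made for illustration and not the deck's GPU code: the names are hypothetical, weights are assumed distinct and positive, and the graph is the one from the Graph Data Structures slide.

```python
import threading

ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}


def parallel_suitor(adj, num_threads=2):
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    locks = {v: threading.Lock() for v in adj}  # one lock per Suitor value

    def process(chunk):
        for start in chunk:
            current = start
            while current is not None:
                partner, heaviest = None, 0
                for v, w in adj[current]:
                    if w > heaviest and w > offer[v]:
                        partner, heaviest = v, w
                if partner is None:
                    break
                with locks[partner]:
                    # Re-check under the lock: another thread may have made a better offer.
                    if heaviest > offer[partner]:
                        dislodged = suitor[partner]
                        suitor[partner], offer[partner] = current, heaviest
                        current = dislodged
                    # Otherwise keep `current` and search again.

    vertices = list(adj)
    chunks = [vertices[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=process, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted((u, v) for u, v in suitor.items()
                  if v is not None and suitor[v] == u and u < v)


print(parallel_suitor(ADJ, num_threads=3))
```

Because offers only ever increase, an unlocked read can at worst be stale and low; the check inside the lock catches that case and the vertex simply searches again. The result does not depend on how the threads interleave.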

  22. Suitor Algorithm on GPU
     - Assign a chunk of vertices to each warp.
     - Each warp processes its chunk independently.
       - Each warp starts with its assigned chunk in shared memory.
       - The warp processes its vertices in a linear fashion.
     - Search for the best possible candidate for each v ∈ chunk.
       - The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
     - [Figure: the W, E, and Start arrays, with the vertex indices split into chunks for warp 0, warp 1, ..., warp L-1; each chunk is stored in the shared memory of its warp.]

  23. Suitor Algorithm on GPU
     [Figure: interleaved read of a neighbor list with |N(v)| = 64. In the first time step, threads 0 to 31 read weights w_0 to w_31; in the second, the same threads read w_32 to w_63. Thread t keeps T_best_t = max{w_t, w_{t+32}, w_{t+64}, ...}.]
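The interleaved read can be simulated on the CPU as follows. This is a sketch with hypothetical names; the loop over lanes stands in for the 32 threads of a warp, which on a GPU would execute each round together.

```python
import math

WARP_SIZE = 32


def warp_scan(weights):
    """Simulate one warp scanning a neighbor list; returns per-lane maxima and round count."""
    rounds = math.ceil(len(weights) / WARP_SIZE)
    best = [float("-inf")] * WARP_SIZE  # per-lane local variable
    for r in range(rounds):
        for lane in range(WARP_SIZE):
            k = r * WARP_SIZE + lane  # lane t reads w_t, w_{t+32}, w_{t+64}, ...
            if k < len(weights):
                best[lane] = max(best[lane], weights[k])
    return best, rounds


# Hypothetical neighbor list with |N(v)| = 64 and distinct weights.
weights = [(7 * k) % 64 for k in range(64)]
best, rounds = warp_scan(weights)
print(rounds, best[:4])
```

Consecutive lanes touch consecutive array entries in every round, which is the access pattern the slide depicts.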

  24. Suitor Algorithm on GPU
     - Assign a chunk of vertices to each warp.
     - Each warp processes its chunk independently.
       - Each warp starts with its assigned chunk in shared memory.
       - The warp processes its vertices in a linear fashion.
     - Search for the best possible candidate for each v ∈ chunk.
       - The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
       - Each thread saves the best candidate found so far in a local variable.
       - Reduction at the end: $O(\log_2 32)$.

  25. Suitor Algorithm on GPU: Butterfly Reduction
     [Figure: butterfly reduction over the 32 per-thread values T_0, T_1, ..., T_31. The first step combines pairs (T_01, T_23, ...), the second combines groups of four (T_0123, ...), and so on for log2(32) = 5 steps, after which T_best = max{T_best_0, ..., T_best_31}.]
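A CPU simulation of the butterfly pattern, as a sketch with a hypothetical function name. On a GPU the exchange with lane `i ^ offset` would typically be a warp shuffle; here it is a plain list lookup.

```python
def butterfly_max(values):
    """All-to-all max over a power-of-two number of lanes; returns final lane values and step count."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    vals = list(values)
    steps = 0
    offset = 1
    while offset < n:
        # Every lane combines its value with that of the lane whose index differs in one bit.
        vals = [max(vals[i], vals[i ^ offset]) for i in range(n)]
        offset *= 2
        steps += 1
    return vals, steps


lanes = [(5 * t + 3) % 32 for t in range(32)]  # hypothetical per-lane maxima
result, steps = butterfly_max(lanes)
print(steps, result[0])
```

After the last step every lane holds the same maximum, so no broadcast is needed before the Suitor update.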

  26. Suitor Algorithm on GPU
     - Assign a chunk of vertices to each warp.
     - Each warp processes its chunk independently.
       - Each warp starts with its assigned chunk in shared memory.
       - The warp processes its vertices in a linear fashion.
     - Search for the best possible candidate for each v ∈ chunk.
       - The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
       - Each thread saves the best candidate found so far in a local variable.
       - Reduction at the end: $O(\log_2 32)$.
     - Set the Suitor and corresponding Offer for the candidate: $O(1)$.
       - Can be buffered in registers.
     - Save failed and dislodged vertices in shared memory.
       - Processed by the same warp in the next round.
       - Redistributed across the warps of the same block.

  27. Load Balancing
     - Each warp stores failed and dislodged vertices in the programmable cache.
     - These vertices can be redistributed across the warps of the same block.
     - Redistribution requires synchronization among the warps of the same block.
     - Synchronization can become a bottleneck.

  28. Experimental Evaluation
     - Intel CPUs:
       - 2 sockets, 64 GB of memory, bandwidth of 51.2 GB/s.
       - Hyper-threaded 8-core Intel Xeon E5 @ 3.10 GHz.
       - GCC 4.9.2, GOMP_CPU_AFFINITY, numactl.
     - NVIDIA GPU:
       - Tesla K40 with GK110B GPU, 15 SMXes, 2880 cores @ 745 MHz.
       - 12 GB of GDDR5 at 3 GHz, bandwidth of 288 GB/s.
       - nvcc (CUDA 7.0).
     - Dataset: Florida Sparse Matrix Collection, 269 problems ranging from a million to a billion edges.
     - Alternatives:
       - Locally Dominant (LD): shared-memory and GPU implementations.
       - Suitor: shared-memory implementation.

  29. Problem Instances and Runtimes

  30. Speedup vs. LD and OMP-Suitor
     [Figure: speedup of GPU-Suitor relative to GPU-LD and OMP-LD (left), and relative to OMP-Suitor (right).]

  31. Design Space Exploration

  32. Conclusion
     - Presented an implementation of the Suitor matching algorithm for GPUs.
     - It is faster than the multicore and previous GPU implementations.
     - Future work: extension to multiple GPUs.

  33. Thank you! Questions?
     - Md Naim: naim.md@gmail.com
     - Fredrik Manne: ManneFredrik.Manne@ii.uib.no
     - Mahantesh Halappanavar: mahantesh.halappanavar@pnnl.gov
     - Antonino Tumeo: antonino.tumeo@pnnl.gov
     - Johannes Langguth: langguth@simula.no
