Accelerating Approximate Weighted Matching on GPUs
Md Naim*, Fredrik Manne*, Mahantesh Halappanavar+, Antonino Tumeo+, Johannes Langguth#
*University of Bergen, Bergen, Norway
+Pacific Northwest National Laboratory, Richland, WA, USA
#Simula Research Laboratory, Oslo, Norway
April 14, 2016
The Edge Weighted Matching Problem
- Given an edge-weighted graph G(V, E), select a set M of edges of maximum total weight such that no two edges in M share an endpoint.
- [Figure: example weighted graph; the optimal matching has weight W(opt) = 22.]
- The best known algorithm has running time $O(|V||E| + |V|^2 \log |V|)$ [Gabow 90].
- This is often too expensive for real applications.
Greedy Algorithm
- Add the heaviest remaining edge (x, y) to the solution.
- Remove (x, y) and all edges incident on x or y.
- [Figure: the example weighted graph.]
- Running time: $O(|E| \log |V|)$, due to sorting.
- Guarantees a 1/2-approximation, but in practice the result is often much better.
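
The two greedy steps can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `greedy_matching` and the edge list `EDGES` are assumptions made for the example.

```python
def greedy_matching(edges):
    """Greedy 1/2-approximation for maximum weight matching.

    edges: iterable of (u, v, weight) tuples (hypothetical input format).
    Returns (list of chosen edges, total weight).
    """
    matched = set()      # vertices already covered by a chosen edge
    matching = []
    total = 0
    # Sorting by decreasing weight dominates the running time.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Skipping an edge with a covered endpoint is equivalent to having
        # removed all edges incident on previously matched vertices.
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
            total += w
    return matching, total


# Illustrative five-vertex graph (an assumption for this sketch).
EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]

if __name__ == "__main__":
    print(greedy_matching(EDGES))
```

On this graph the sketch picks the edge of weight 10 first, which blocks the edges of weight 9 and 7, and then picks the edge of weight 5.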
Suitor Algorithm (Manne and Halappanavar 2015)
- P(x) = v: the heaviest neighbor of x that does not already have a better offer than w(x, v).
- [Figure: vertices x and y; (x, y) in M.]
- Lemma: If P(x) is set correctly for every x, then P() defines the same solution as Greedy.
Outline of the Suitor algorithm:
- Process the vertices in a linear fashion to find the best candidate for each vertex.
- Set the Suitor value of the candidate.
- Whenever a vertex is dislodged, it must be re-processed before moving on to the next vertex.
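
The outline can be sketched sequentially as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `suitor_matching` and `ADJ` are hypothetical, edge weights are assumed to be positive and distinct, and the graph is a small five-vertex example chosen for the sketch.

```python
def suitor_matching(adj):
    """Sequential Suitor sketch.

    adj: dict mapping vertex -> list of (neighbor, weight) pairs.
    Assumes positive, distinct weights (an assumption of this sketch).
    suitor[v] is the vertex currently proposing to v; offer[v] is its weight.
    """
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    for start in adj:                      # process vertices in a linear fashion
        current = start
        while current is not None:
            # Heaviest neighbor that does not already hold a better offer.
            partner, best = None, 0
            for v, w in adj[current]:
                if w > best and w > offer[v]:
                    partner, best = v, w
            if partner is None:
                break                      # no eligible candidate for current
            dislodged = suitor[partner]    # previous suitor of partner, if any
            suitor[partner] = current
            offer[partner] = best
            current = dislodged            # re-process the dislodged vertex now
    # An edge is matched when its endpoints are each other's suitor.
    matching = {frozenset((u, s)) for u, s in suitor.items()
                if s is not None and suitor[s] == u}
    return suitor, offer, matching


# Illustrative graph (an assumption for this sketch).
ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}

if __name__ == "__main__":
    print(suitor_matching(ADJ))
```

On this graph, c is dislodged twice before it settles on a, and the final matching consists of the edges of weight 10 and 5, which is what Greedy selects on the same input.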
Graph Data Structures
[Figure: example graph on vertices a, b, c, d, e with edge weights 4, 5, 7, 9, 10.]
Compressed Edge List
Vertex:  a  b  c  d  e  END
Idx:     0  2  3  6  8  10

Pos:  0  1  2  3  4  5  6  7  8  9
E:    b  c  a  a  d  e  c  e  c  d
W:    4  5  4  5  7  9  7 10  9 10
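
A compressed edge list of this kind can be built from a plain edge list in two passes. The sketch below is one possible construction, not necessarily the one used by the authors. The helper names `build_csr` and `neighbors` and the input edge order are assumptions.

```python
def build_csr(vertices, edges):
    """Build Idx / E / W arrays for an undirected graph.

    vertices: list of vertex labels; edges: list of (u, v, weight).
    Each undirected edge is stored twice, once per endpoint.
    """
    pos = {v: i for i, v in enumerate(vertices)}
    degree = [0] * len(vertices)
    for u, v, _ in edges:                  # first pass: count degrees
        degree[pos[u]] += 1
        degree[pos[v]] += 1
    idx = [0]
    for d in degree:                       # prefix sum gives start offsets
        idx.append(idx[-1] + d)
    nbr = [None] * idx[-1]
    wgt = [None] * idx[-1]
    nxt = idx[:-1]                         # copy: next free slot per vertex
    for u, v, w in edges:                  # second pass: fill both directions
        for src, dst in ((u, v), (v, u)):
            p = nxt[pos[src]]
            nbr[p], wgt[p] = dst, w
            nxt[pos[src]] += 1
    return idx, nbr, wgt


def neighbors(idx, nbr, wgt, i):
    """Return the (neighbor, weight) pairs of the i-th vertex."""
    return list(zip(nbr[idx[i]:idx[i + 1]], wgt[idx[i]:idx[i + 1]]))


if __name__ == "__main__":
    V = ["a", "b", "c", "d", "e"]
    EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]
    idx, E, W = build_csr(V, EDGES)
    print(idx, E, W)
    print(neighbors(idx, E, W, 2))
```

The neighbor list of vertex i is the contiguous slice from Idx[i] up to, but not including, Idx[i+1], which is what makes the warp-wide interleaved reads described later in the slides possible.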
Suitor Algorithm: Example
[Figure: step-by-step animation on the example graph with vertices a, b, c, d, e and edge weights 4, 5, 7, 9, 10. Legend: Unprocessed, Current, Candidate, Processed. Each vertex carries a "Suitor : Offer" label that starts empty (__:__). Over the frames, the label next to c becomes a : 5, the label next to a becomes b : 4, and the two labels next to d and e pass through c : 9, d : 9, c : 7 and e : 10.]
Suitor Algorithm: Recap
- Searching for the best possible candidate for a vertex v takes $O(d_v)$ time.
- In the worst case, the search must be repeated $d_v$ times.
- Setting the Suitor and the corresponding Offer takes $O(1)$ time.
- Running time: $O\left(\sum_{v \in V} d_v^2\right) \le O(\Delta |E|)$.
- With sorted neighbor lists: $O\left(\sum_{v \in V} d_v \log d_v + |V| + |E|\right) \le O(|E| \log \Delta + |V| + |E|)$.
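
Both upper bounds follow from $d_v \le \Delta$ together with the handshake identity $\sum_{v \in V} d_v = 2|E|$. The short derivation below spells this out; reading the logarithmic term as the cost of sorting each neighbor list is an interpretation made here.

```latex
\begin{align*}
\sum_{v \in V} d_v^2
  &\le \Delta \sum_{v \in V} d_v = 2\Delta |E| = O(\Delta |E|), \\
\sum_{v \in V} d_v \log d_v
  &\le \log \Delta \sum_{v \in V} d_v = 2|E| \log \Delta = O(|E| \log \Delta).
\end{align*}
```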
Observations
- The order in which vertices are processed has no influence on the final matching.
- The vertices can be processed independently.
- The Suitor values must be protected, because multiple vertices may try to become the suitor of the same vertex at the same time.
- [Figure: vertices u and v on either side of x, labeled "Mutual Exclusion".]
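
These observations suggest a shared-memory parallel variant in which each Suitor entry is guarded by its own lock. The sketch below uses Python threads purely to illustrate the locking pattern, not to show performance. The name `parallel_suitor`, the per-vertex `threading.Lock` objects, and the strided work split are assumptions, and weights are assumed to be positive and distinct.

```python
import threading


def parallel_suitor(adj, num_threads=4):
    """Lock-per-vertex Suitor sketch (illustrative, not the authors' code)."""
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    locks = {v: threading.Lock() for v in adj}
    vertices = list(adj)

    def process(start):
        current = start
        while current is not None:
            # Unlocked scan; the chosen candidate is re-checked under its lock.
            partner, best = None, 0
            for v, w in adj[current]:
                if w > best and w > offer[v]:
                    partner, best = v, w
            if partner is None:
                return
            with locks[partner]:           # mutual exclusion on partner's entry
                if best > offer[partner]:
                    dislodged = suitor[partner]
                    suitor[partner] = current
                    offer[partner] = best
                    current = dislodged    # continue with the dislodged vertex
                # Otherwise another vertex made a better offer in the
                # meantime, so the loop repeats the search for current.

    def worker(tid):
        for v in vertices[tid::num_threads]:
            process(v)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return suitor, offer


if __name__ == "__main__":
    ADJ = {
        "a": [("b", 4), ("c", 5)],
        "b": [("a", 4)],
        "c": [("a", 5), ("d", 7), ("e", 9)],
        "d": [("c", 7), ("e", 10)],
        "e": [("c", 9), ("d", 10)],
    }
    print(parallel_suitor(ADJ))
```

Because offers only ever increase, a candidate that was skipped during the unlocked scan can never become eligible later, so only the chosen candidate needs to be re-checked under the lock.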
Suitor Algorithm on GPU
- Assign a chunk of vertices to each warp.
- Each warp processes its chunk independently.
- Each warp starts with its assigned chunk in shared memory.
- The warp processes its vertices in a linear fashion.
- Search for the best possible candidate for each v ∈ chunk.
- The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
- [Figure: the W and E arrays and the Start indices (0, 9, ...) partitioned among warp 0, warp 1, ..., warp L-1; each warp's indices are stored in the shared memory of the warp.]
Suitor Algorithm on GPU
[Figure: interleaved reading of a neighbor list with |N(v)| = 64. In the first time step, threads 0 to 31 of the warp read weights $w_0, \dots, w_{31}$; in the second, the same threads read $w_{32}, \dots, w_{63}$. Thread t keeps $T_{best\_t} = \max\{w_t, w_{t+32}, w_{t+64}, \dots\}$.]
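
The interleaved access pattern can be simulated on the CPU by letting each of the 32 lanes walk the neighbor list with a stride of 32. This sketch only models the index arithmetic: the names `lane_best` and `warp_partial_bests` are hypothetical, and the check against the neighbor's current offer is left out for brevity.

```python
import math

WARP_SIZE = 32


def lane_best(weights, lane, warp_size=WARP_SIZE):
    """Best (weight, position) seen by one lane reading w[lane], w[lane+32], ..."""
    best = (float("-inf"), -1)             # sentinel: this lane saw nothing
    for i in range(lane, len(weights), warp_size):
        best = max(best, (weights[i], i))
    return best


def warp_partial_bests(weights, warp_size=WARP_SIZE):
    """Per-lane partial maxima; on a GPU the lanes would run in lockstep."""
    return [lane_best(weights, t, warp_size) for t in range(warp_size)]


def read_steps(num_neighbors, warp_size=WARP_SIZE):
    """Number of lockstep read steps: ceil(|N(v)| / 32)."""
    return math.ceil(num_neighbors / warp_size)


if __name__ == "__main__":
    w = list(range(64))                    # |N(v)| = 64
    print(read_steps(len(w)))
    print(warp_partial_bests(w)[:3])
```

With this layout, consecutive lanes touch consecutive array positions in each step, which is the access pattern the slide's figure depicts.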
Suitor Algorithm on GPU
- Assign a chunk of vertices to each warp.
- Each warp processes its chunk independently.
- Each warp starts with its assigned chunk in shared memory.
- The warp processes its vertices in a linear fashion.
- Search for the best possible candidate for each v ∈ chunk.
- The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
- Each thread saves the best candidate found so far in a local variable.
- Reduction at the end: $O(\log_2 32)$.
Suitor Algorithm on GPU: Butterfly Reduction
[Figure: butterfly reduction over the 32 per-thread values $T_0, T_1, \dots, T_{31}$. Values are first combined in pairs ($T_{01}$, $T_{23}$, ...), then in groups of four ($T_{0123}$, ...), and so on. After $\log_2 32 = 5$ steps, $T_{best} = \max\{T_{best\_0}, \dots, T_{best\_31}\}$.]
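
A butterfly reduction can be simulated by having lane i combine its value with that of lane i XOR mask, doubling the mask each step. The sketch below is a CPU model of that exchange pattern, with the hypothetical name `butterfly_max`. It does not claim to reproduce the intrinsics or memory operations used in the actual kernel.

```python
def butterfly_max(values):
    """Simulate a butterfly (XOR-partner) max reduction across lanes.

    values: one value per lane; the lane count must be a power of two.
    Returns (per-lane results, number of exchange steps).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of lanes must be a power of two")
    lanes = list(values)
    steps = 0
    mask = 1
    while mask < n:
        # Every lane reads its partner's old value, then keeps the larger one.
        lanes = [max(lanes[i], lanes[i ^ mask]) for i in range(n)]
        mask <<= 1
        steps += 1
    return lanes, steps


if __name__ == "__main__":
    per_lane = [(t * 7) % 32 for t in range(32)]   # arbitrary per-lane values
    result, steps = butterfly_max(per_lane)
    print(steps, result[0], result[31])
```

Unlike a tree reduction that leaves the result in a single lane, the butterfly pattern leaves the maximum in every lane, so all threads of the warp agree on the candidate without an extra broadcast.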
Suitor Algorithm on GPU
- Assign a chunk of vertices to each warp.
- Each warp processes its chunk independently.
- Each warp starts with its assigned chunk in shared memory.
- The warp processes its vertices in a linear fashion.
- Search for the best possible candidate for each v ∈ chunk.
- The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
- Each thread saves the best candidate found so far in a local variable.
- Reduction at the end: $O(\log_2 32)$.
- Set the Suitor and the corresponding Offer for the candidate: $O(1)$.
- Can be buffered in registers.
- Save failed and dislodged vertices in shared memory.
- Processed by the same warp in the next round.
- Redistributed across the warps of the same block.
Load Balancing
- Each warp stores failed and dislodged vertices in programmable cache.
- These vertices can be redistributed across the warps of the same block.
- This requires synchronization among the warps of the same block.
- Synchronization can become a bottleneck.
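
One simple way to picture the redistribution is to pool the pending vertices of all warps in a block and deal them out round-robin. The sketch below is only a CPU illustration of that idea: the function `redistribute` and the round-robin policy are assumptions, and the slides do not specify the scheme used on the GPU.

```python
def redistribute(per_warp_queues):
    """Pool pending vertices of a block and deal them out round-robin.

    per_warp_queues: one list of failed/dislodged vertices per warp.
    On a GPU, pooling would need a block-level barrier before and after.
    """
    pooled = [v for queue in per_warp_queues for v in queue]
    n = len(per_warp_queues)
    return [pooled[i::n] for i in range(n)]


if __name__ == "__main__":
    before = [[1, 2, 3, 4, 5], [], [6], []]    # warp 0 is overloaded
    after = redistribute(before)
    print(after)
    print(max(map(len, before)), "->", max(map(len, after)))
```

The trade-off named on the slide shows up directly: balancing shortens the longest queue, but every round of pooling forces all warps of the block to wait for each other.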
Experimental Evaluation
Intel CPUs:
- 2 sockets, 64 GB of memory, bandwidth of 51.2 GB/s
- Hyper-threaded 8-core Intel Xeon E5 @ 3.10 GHz
- GCC 4.9.2, GOMP_CPU_AFFINITY, numactl
NVIDIA GPU:
- Tesla K40 with GK110B GPU, 15 SMXes, 2880 cores @ 745 MHz
- 12 GB of GDDR5 at 3 GHz, bandwidth of 288 GB/s
- nvcc (CUDA 7.0)
Dataset:
- Florida Sparse Matrix Collection: 269 problems ranging from a million to a billion edges
Alternatives:
- Locally Dominant (LD): shared-memory and GPU implementations
- Suitor: shared-memory implementation
Problem Instances and Runtimes
Speedup vs. LD and OMP-Suitor
[Figure: speedup of GPU-Suitor relative to GPU-LD and OMP-LD (left), and OMP-Suitor (right).]
Design Space Exploration
Conclusion
- Presented an implementation of the Suitor matching algorithm for GPUs.
- It is faster than multicore and previous GPU implementations.
- Future work: extension to multiple GPUs.
Thank you! Questions?
Md Naim: naim.md@gmail.com
Fredrik Manne: Fredrik.Manne@ii.uib.no
Mahantesh Halappanavar: mahantesh.halappanavar@pnnl.gov
Antonino Tumeo: antonino.tumeo@pnnl.gov
Johannes Langguth: langguth@simula.no