
Accelerating Approximate Weighted Matching on GPUs
Md Naim*, Fredrik Manne*, Mahantesh Halappanavar+, Antonino Tumeo+, Johannes Langguth#
* University of Bergen, Bergen, Norway
+ Pacific Northwest National Laboratory, Richland, WA, USA
# Simula Research Laboratory, Oslo, Norway
April 14, 2016

The Edge Weighted Matching Problem
Given an edge-weighted graph G(V, E), select a set M of non-incident edges of maximum weight.
[Figure: example graph with weighted edges; the optimal matching has weight W(opt) = 22.]
The best known algorithm has running time $O(|V||E| + |V|^2 \log |V|)$ [Gabow 90].
This is often too expensive for real applications.

Greedy Algorithm
Add the heaviest remaining edge (x, y) to the solution.
Remove (x, y) and all edges incident on x or y.
[Figure: the example graph with weighted edges.]
Running time: $O(|E| \log |V|)$ (due to sorting).
Guarantees a 1/2-approximation, but in practice the result is often much better.
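The two Greedy steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The function name `greedy_matching` and the edge-list format are assumptions. The five-vertex graph is the one read off the arrays on the Graph Data Structures slide of this transcript.

```python
def greedy_matching(edges):
    """Return matched edges (u, v, w), picking the heaviest edge first.

    `edges` is a list of (u, v, weight) tuples (a hypothetical input format).
    """
    matched = set()      # vertices already covered by a chosen edge
    matching = []
    # Sorting dominates the running time.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Skipping an edge with a covered endpoint has the same effect as
        # removing all edges incident on x or y after (x, y) is chosen.
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]
result = greedy_matching(EDGES)
total = sum(w for _, _, w in result)
print(result, total)
```

On the path a-b (2), b-c (3), c-d (2), the same function picks only b-c with weight 3, while the optimum takes the two outer edges with weight 4. That is the kind of gap the 1/2 bound allows.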

Suitor Algorithm (Manne and Halappanavar 2015)
P(x) = v, the heaviest neighbor of x that does not already have a better offer than w(x, v).
[Figure: x and y point to each other, so (x, y) is in M.]
Lemma: If P(x) is set correctly for every x, then P() defines the same solution as Greedy.
Outline of the Suitor algorithm:
  Process the vertices in a linear fashion to find the best candidate for each vertex.
  Set the Suitor value of the candidate.
  Whenever a vertex is dislodged, it must be re-processed before moving to the next vertex.
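The outline can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes strictly positive, distinct edge weights, so no tie-breaking rule is needed. The names `suitor_matching`, `suitor`, `offer` and the adjacency-dictionary format are hypothetical.

```python
def suitor_matching(adj):
    """adj maps each vertex to a list of (neighbor, weight) pairs."""
    suitor = {v: None for v in adj}   # best proposer each vertex has seen
    offer = {v: 0 for v in adj}       # weight of that proposal

    for start in adj:
        current = start
        while current is not None:
            # Heaviest neighbor that does not already hold a better offer.
            partner, best = None, 0
            for v, w in adj[current]:
                if w > best and w > offer[v]:
                    partner, best = v, w
            if partner is None:
                break                  # no eligible neighbor remains
            dislodged = suitor[partner]
            suitor[partner], offer[partner] = current, best
            current = dislodged        # re-process the dislodged vertex, if any

    # (x, y) is in the matching when x and y are each other's suitor.
    return {frozenset((x, y)) for x, y in suitor.items()
            if y is not None and suitor[y] == x}


ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}
matching = suitor_matching(ADJ)
print(sorted(sorted(pair) for pair in matching))
```

On this graph the result is the pair of edges a-c and d-e, the same edges Greedy selects, as the lemma states.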

Graph Data Structures
[Figure: example graph on vertices a, b, c, d, e with weighted edges.]
Compressed Edge List

Vertex  a  b  c  d  e  END
Start   0  2  3  6  8  10

Idx  0  1  2  3  4  5  6  7   8  9
E    b  c  a  a  d  e  c  e   c  d
W    4  5  4  5  7  9  7  10  9  10
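The compressed edge list stores every undirected edge twice, once per endpoint, with a Start offset per vertex. A minimal sketch of how such arrays could be built follows. The function names and the choice to sort each neighbor list by vertex name are assumptions made to reproduce the Start, E and W values on this slide.

```python
def build_compressed_edge_list(vertices, edges):
    """Return (start, E, W) for an undirected weighted graph."""
    index = {v: i for i, v in enumerate(vertices)}
    neighbors = [[] for _ in vertices]
    for u, v, w in edges:
        neighbors[index[u]].append((v, w))   # store each edge in both directions
        neighbors[index[v]].append((u, w))

    start, E, W = [0], [], []
    for nbrs in neighbors:
        for v, w in sorted(nbrs):            # assumption: neighbors ordered by name
            E.append(v)
            W.append(w)
        start.append(len(E))                 # last entry plays the role of END
    return start, E, W


def neighbors_of(i, start, E, W):
    """Neighbors and weights of the i-th vertex as parallel slices."""
    return E[start[i]:start[i + 1]], W[start[i]:start[i + 1]]


VERTICES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b", 4), ("a", "c", 5), ("c", "d", 7), ("c", "e", 9), ("d", "e", 10)]
start, E, W = build_compressed_edge_list(VERTICES, EDGES)
print(start, E, W)
```

The neighbors of vertex c (index 2) are then the slice from Start[2] = 3 up to Start[3] = 6 in both E and W.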

Suitor Algorithm: Example
[Figure: animation frames stepping through the Suitor algorithm on the five-vertex example graph (vertices a to e). Legend: Unprocessed, Current, Candidate, Processed. Each vertex is labeled with its Suitor : Offer pair, initially empty. In the first frames, c receives the offer a : 5 and a then receives the offer b : 4.]

Suitor Algorithm: Recap
Search for the best possible candidate for each vertex: $O(d_v)$.
  In the worst case, the search must be repeated $d_v$ times.
Set the Suitor and the corresponding Offer: $O(1)$.
Running time: $O(\sum_{v \in V} d_v^2) \le O(\Delta |E|)$.
With sorted neighbor lists: $O(\sum_{v \in V} d_v \log d_v + |V| + |E|) \le O(|E| \log \Delta + |V| + |E|)$.
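Both upper bounds follow from $d_v \le \Delta$ and the identity $\sum_{v \in V} d_v = 2|E|$. Spelled out as a short derivation, with constant factors absorbed by the O-notation:

```latex
\begin{align*}
\sum_{v \in V} d_v^2 &\le \Delta \sum_{v \in V} d_v = 2\Delta |E| = O(\Delta |E|), \\
\sum_{v \in V} d_v \log d_v &\le \log \Delta \sum_{v \in V} d_v = 2|E| \log \Delta = O(|E| \log \Delta).
\end{align*}
```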

Observations
The order in which vertices are processed has no influence on the final matching.
  The vertices can be processed independently.
The Suitor values must be protected, since multiple vertices may try to become the suitor of the same vertex at the same time.
[Figure: vertices u and v both propose to x; updating x's Suitor value requires mutual exclusion.]
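One way to picture the mutual exclusion requirement is a per-vertex lock around each Suitor update. The sketch below uses Python threads purely as an illustration. It is not the GPU implementation, and the slides do not say which synchronization primitive is used there. It assumes positive, distinct weights, and the names `parallel_suitor` and `locks` are hypothetical.

```python
import threading


def parallel_suitor(adj, num_threads=4):
    """adj maps each vertex to a list of (neighbor, weight) pairs."""
    suitor = {v: None for v in adj}
    offer = {v: 0 for v in adj}
    locks = {v: threading.Lock() for v in adj}   # one lock per Suitor value

    def process(start):
        current = start
        while current is not None:
            partner, best = None, 0
            for v, w in adj[current]:            # unlocked read; may be stale
                if w > best and w > offer[v]:
                    partner, best = v, w
            if partner is None:
                return
            with locks[partner]:
                # Re-check under the lock: another thread may have made a
                # better offer to `partner` since the search above.
                if best > offer[partner]:
                    dislodged = suitor[partner]
                    suitor[partner], offer[partner] = current, best
                    current = dislodged
            # If the re-check failed, the loop searches again for `current`.

    def worker(chunk):
        for v in chunk:
            process(v)

    vertices = list(adj)
    threads = [threading.Thread(target=worker, args=(vertices[i::num_threads],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {frozenset((x, y)) for x, y in suitor.items()
            if y is not None and suitor[y] == x}


ADJ = {
    "a": [("b", 4), ("c", 5)],
    "b": [("a", 4)],
    "c": [("a", 5), ("d", 7), ("e", 9)],
    "d": [("c", 7), ("e", 10)],
    "e": [("c", 9), ("d", 10)],
}
parallel_result = parallel_suitor(ADJ)
print(sorted(sorted(pair) for pair in parallel_result))
```

Because offers only ever increase, a failed re-check excludes that partner from the next search, so the retry loop terminates.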

Suitor Algorithm on GPU
Assign a chunk of vertices to each warp.
  Each warp processes its chunk independently.
  Each warp starts with the assigned chunk in shared memory.
  The warp processes its vertices in a linear fashion.
Search for the best possible candidate for each v ∈ chunk.
  The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
[Figure: the W and E arrays, with Start indices for warp 0, warp 1, ..., warp L-1; the indices are stored in the shared memory of the warp.]

Suitor Algorithm on GPU
[Figure: interleaved read of a neighbor list with |N(v)| = 64. In the first time step, threads 0 to 31 read weights w_0 to w_31; in the second time step, the same threads read w_32 to w_63.]
Each thread t keeps its own running maximum: $T_{best\_t} = \max\{w_t, w_{t+32}, w_{t+64}, \ldots\}$.
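The interleaved read can be simulated on the CPU to check the step count. This is a sequential sketch, not GPU code. On a GPU the inner loop over lanes would run in lockstep as the 32 threads of a warp. `WARP_SIZE` and `per_lane_best` are hypothetical names.

```python
WARP_SIZE = 32


def per_lane_best(weights, warp_size=WARP_SIZE):
    """Lane t scans weights[t], weights[t + 32], ... and keeps its maximum.

    Returns the per-lane maxima and the number of time steps taken.
    """
    best = [float("-inf")] * warp_size
    steps = 0
    for base in range(0, len(weights), warp_size):   # one step per 32 entries
        steps += 1
        for lane in range(warp_size):                # lanes act together on a GPU
            i = base + lane
            if i < len(weights):                     # the last step may be partial
                best[lane] = max(best[lane], weights[i])
    return best, steps


weights = list(range(64))                # |N(v)| = 64, as in the figure
best, steps = per_lane_best(weights)
print(steps, best[:4])
```

With 64 neighbors the scan takes two steps, matching the ceil(|N(v)|/32) bound, and leaves 32 partial maxima that still have to be combined.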

Suitor Algorithm on GPU
Assign a chunk of vertices to each warp.
  Each warp processes its chunk independently.
  Each warp starts with the assigned chunk in shared memory.
  The warp processes its vertices in a linear fashion.
Search for the best possible candidate for each v ∈ chunk.
  The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
  Each thread saves the best candidate found so far in a local variable.
  Reduction at the end: $O(\log_2 32)$.

Suitor Algorithm on GPU: Butterfly Reduction
[Figure: the 32 per-thread values T_0, ..., T_31 are combined pairwise (T_01, T_23, ...), then in groups of four (T_0123, ...), and so on, over log_2(32) = 5 steps.]
After 5 steps: $T_{best} = \max\{T_{best\_0}, \ldots, T_{best\_31}\}$.
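A butterfly reduction pairs each lane with the lane whose index differs in one bit, doubling the distance every step. The CPU sketch below simulates that exchange pattern for a maximum. On CUDA hardware the partner read is the kind of exchange a warp shuffle provides, but the slides do not say which mechanism the implementation uses. `butterfly_max` is a hypothetical name.

```python
def butterfly_max(values):
    """All-reduce (max) over a power-of-two number of lanes.

    Returns the final per-lane values and the number of exchange steps.
    """
    vals = list(values)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    steps = 0
    offset = 1
    while offset < n:
        # Every lane reads its partner's old value, then keeps the larger one.
        vals = [max(vals[lane], vals[lane ^ offset]) for lane in range(n)]
        offset *= 2
        steps += 1
    return vals, steps


lane_best = [(7 * i) % 32 for i in range(32)]   # a permutation of 0..31
reduced, steps = butterfly_max(lane_best)
print(steps, reduced[0])
```

After the last step every lane holds the overall maximum, so no broadcast is needed before the warp sets the Suitor value.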

Suitor Algorithm on GPU
Assign a chunk of vertices to each warp.
  Each warp processes its chunk independently.
  Each warp starts with the assigned chunk in shared memory.
  The warp processes its vertices in a linear fashion.
Search for the best possible candidate for each v ∈ chunk.
  The threads in the warp read the neighbor list of v in an interleaved fashion: $O(\lceil |N(v)|/32 \rceil)$.
  Each thread saves the best candidate found so far in a local variable.
  Reduction at the end: $O(\log_2 32)$.
Set the Suitor and the corresponding Offer for the candidate: $O(1)$.
  Can be buffered in registers.
Save failed and dislodged vertices in shared memory.
  Processed by the same warp in the next round.
  Redistributed across the warps of the same block.

Load Balancing
Each warp stores failed and dislodged vertices in programmable cache.
  These can be redistributed across the warps of the same block.
  This needs synchronization among the warps of the same block.
  Synchronization can become a bottleneck.
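The redistribution step can be pictured as pooling the per-warp queues of a block and dealing the vertices back out evenly. The sketch below is a CPU illustration only. The round-robin policy and the name `redistribute` are assumptions, since the slides do not specify how the work is divided.

```python
def redistribute(per_warp_queues):
    """Pool the leftover vertices of all warps in a block and deal them out.

    On a GPU, pooling would require a block-wide barrier so that every warp
    has finished writing its queue before any warp reads the pooled list.
    """
    pooled = [v for queue in per_warp_queues for v in queue]
    k = len(per_warp_queues)
    return [pooled[i::k] for i in range(k)]   # round-robin assignment


queues = [[1, 2, 3, 4, 5], [], [6], []]       # unbalanced leftovers of 4 warps
balanced = redistribute(queues)
print(balanced)
```

After redistribution no warp holds more than two vertices, at the price of the barrier that the slide identifies as a potential bottleneck.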

Experimental Evaluation
Intel CPUs:
  2 sockets, hyper-threaded 8-core Intel Xeon E5 @ 3.10 GHz
  64 GB of memory, bandwidth of 51.2 GB/s
  GCC 4.9.2, GOMP_CPU_AFFINITY, numactl
NVIDIA GPU:
  Tesla K40 with GK110B GPU, 15 SMXes, 2880 cores @ 745 MHz
  12 GB of GDDR5 at 3 GHz, bandwidth of 288 GB/s
  nvcc (CUDA 7.0)
Dataset:
  Florida Sparse Matrix Collection: 269 problems ranging from a million to a billion edges
Alternatives:
  Locally Dominant (LD): shared-memory and GPU implementations
  Suitor: shared-memory implementation

Problem Instances and Runtimes

Speedup vs. LD and OMP-Suitor
[Figure: speedup of GPU-Suitor relative to GPU-LD and OMP-LD (left), and OMP-Suitor (right).]

Design Space Exploration

Conclusion
Presented an implementation of the Suitor matching algorithm for GPUs.
It is faster than multicore and previous GPU implementations.
Future work: extension to multiple GPUs.

Thank you! Questions?
Md Naim - naim.md@gmail.com
Fredrik Manne - ManneFredrik.Manne@ii.uib.no
Mahantesh Halappanavar - mahantesh.halappanavar@pnnl.gov
Antonino Tumeo - antonino.tumeo@pnnl.gov
Johannes Langguth - langguth@simula.no
