Early Experiences on Accelerating Dijkstra's Algorithm Using Transactional Memory

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr

MTAAP'09, May 31, 2009
Outline
1. Dijkstra's Basics
2. Straightforward Parallelization Scheme
3. Helper-Threading Scheme
4. Experimental Evaluation
5. Conclusions
The Basics of Dijkstra's Algorithm

SSSP Problem
Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀ v ∈ V: compute δ(v) = min{ w(p) : p is a path s ⇝ v }

Shortest path estimate d(v) gradually converges to δ(v) through relaxations
relax(v, w): d(w) = min{ d(w), d(v) + w(v, w) }
◮ can we find a better path s ⇝ w by going through v?

Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) > δ(v) and d(v) ≠ ∞
Unreached: d(v) = ∞
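As a concrete illustration, a relaxation in C might look as follows (a minimal sketch; all names are illustrative, not the paper's code):

    #include <limits.h>

    /* dist[] holds the tentative distances d(v), pred[] the
     * predecessor array pi; NVERT is an assumed graph size. */
    enum { NVERT = 1024 };
    int dist[NVERT];
    int pred[NVERT];

    /* relax(v, w): can the path to w improve by going through v? */
    void relax(int v, int w, int weight_vw)
    {
        if (dist[v] != INT_MAX && dist[v] + weight_vw < dist[w]) {
            dist[w] = dist[v] + weight_vw;  /* d(w) = min{d(w), d(v)+w(v,w)} */
            pred[w] = v;                    /* remember how we reached w */
        }
    }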
The Basics of Dijkstra's Algorithm

Serial algorithm
Input:  G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π

foreach v ∈ V do
    d[v] ← ∞
    π[v] ← nil
    Insert(Q, v)
end
d[s] ← 0
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        end
    end
end

[Figure: example graph with source S, vertices A-E, and weighted directed edges]
The Basics of Dijkstra's Algorithm

[Figure: array-based binary min-heap; a DecreaseKey on element i proceeds as a sequence of parent-child swaps toward the root]

Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀ i: d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin's and DecreaseKey's
◮ O((|E| + |V|) log |V|) time complexity
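A minimal sketch of DecreaseKey on an array-based binary min-heap (illustrative, not the paper's implementation): heap[i] holds a vertex id, key[v] its tentative distance, and pos[v] maps a vertex back to its heap slot so it can be located in O(1).

    enum { NVERT = 1024 };
    int heap[NVERT], key[NVERT], pos[NVERT];

    static void swap_nodes(int i, int j)
    {
        int tmp = heap[i]; heap[i] = heap[j]; heap[j] = tmp;
        pos[heap[i]] = i;
        pos[heap[j]] = j;
    }

    /* Sift v up: the sequence of parent-child swaps shown in the figure. */
    void decrease_key(int v, int newkey)
    {
        int i = pos[v];
        key[v] = newkey;
        while (i > 0 && key[heap[(i - 1) / 2]] > key[heap[i]]) {
            swap_nodes(i, (i - 1) / 2);   /* restore the min-heap property */
            i = (i - 1) / 2;              /* parent(i) = (i-1)/2, 0-based  */
        }
    }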
Straightforward Parallelization

Fine-grain parallelization at the inner loop level

Fine-Grain Multi-Threaded
/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q)
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum)
            End-Atomic
            d[v] ← sum
            π[v] ← u
        end
    end
end

Issues:
speedup bounded by average out-degree
concurrent heap updates due to DecreaseKey's
barrier synchronization overhead
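One step of this scheme could be written with Pthreads roughly as below (a sketch, not the authors' code; extract_min(), decrease_key() as sketched earlier, and out_deg(u)/nbr(u,k)/wgt(u,k) for u's adjacency are assumed helpers):

    #include <pthread.h>

    extern int d[], pi[], nthreads;
    extern int extract_min(void);
    extern void decrease_key(int v, int newkey);
    extern int out_deg(int u), nbr(int u, int k), wgt(int u, int k);

    pthread_barrier_t bar;                      /* initialized elsewhere */
    pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
    static int u;                               /* vertex settled this step */

    void fgmt_step(int tid)
    {
        pthread_barrier_wait(&bar);
        if (tid == 0)
            u = extract_min();                  /* only thread 0 extracts */
        pthread_barrier_wait(&bar);             /* all threads see the new u */

        /* u's out-edges are split cyclically among the threads */
        for (int k = tid; k < out_deg(u); k += nthreads) {
            int v = nbr(u, k), sum = d[u] + wgt(u, k);
            if (d[v] > sum) {
                pthread_mutex_lock(&heap_lock); /* one lock for the whole
                                                   heap: cgs-lock below */
                decrease_key(v, sum);
                pthread_mutex_unlock(&heap_lock);
                d[v] = sum;
                pi[v] = u;
            }
        }
    }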
Concurrent Heap Updates with Locks

[Figure: two concurrent DecreaseKey's, on elements i and k, each sifting up through the heap as a sequence of parent-child swaps]

Coarse-grain synchronization (cgs-lock)
◮ enforces atomicity at the level of a DecreaseKey operation
◮ one lock for the entire heap
◮ serializes DecreaseKey's

Fine-grain synchronization (fgs-lock)
◮ enforces atomicity at the level of a single swap
◮ allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
◮ separate locks for each parent-child pair
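A sketch of the fgs-lock idea, assuming lock[i] guards heap slot i so that each swap takes the two locks of a parent-child pair (consistent with the "2 locks + 2 unlocks per swap" overhead quoted on the next slide). Simplified for illustration; a production version must also re-validate positions after reacquiring locks.

    #include <pthread.h>

    enum { NVERT = 1024 };
    extern int heap[], key[], pos[];
    extern void swap_nodes(int i, int j);  /* as in the earlier heap sketch */
    pthread_mutex_t lock[NVERT];           /* initialized elsewhere */

    void decrease_key_fgs(int v, int newkey)
    {
        key[v] = newkey;
        int i = pos[v];
        while (i > 0) {
            int p = (i - 1) / 2;
            pthread_mutex_lock(&lock[p]);  /* parent first: a global lock */
            pthread_mutex_lock(&lock[i]);  /* order prevents deadlock     */
            if (key[heap[p]] <= key[heap[i]]) {
                pthread_mutex_unlock(&lock[i]);
                pthread_mutex_unlock(&lock[p]);
                break;                     /* min-heap property holds */
            }
            swap_nodes(i, p);              /* the single swap this pair of
                                              locks makes atomic */
            pthread_mutex_unlock(&lock[i]);
            pthread_mutex_unlock(&lock[p]);
            i = p;
        }
    }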
Performance of FGMT with Locks

[Figure: multithreaded speedup (y-axis, 0 to 1.3) vs. number of threads (x-axis, 2 to 16) for cgs-lock, perfbar+cgs-lock and perfbar+fgs-lock]

Software barriers dominate total execution time
◮ 72% with 2 threads, 88% with 8
◮ replace with idealized (simulated) zero-latency barriers

Fgs-lock scheme more scalable, but still fails to outperform serial
◮ locking overhead (2 locks + 2 unlocks per swap)
Concurrent Heap Updates with TM

[Figure: the same two concurrent DecreaseKey's on elements i and k, now executed as transactions]

TM-based coarse-grain synchronization (cgs-tm)
◮ enclose DecreaseKey within a transaction
◮ allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
◮ conflicting transaction stalls and retries or aborts

Fine-grain synchronization (fgs-tm)
◮ enclose each swap operation within a transaction
◮ atomicity as in fgs-lock
◮ shorter but more transactions
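The paper evaluates hardware TM (LogTM) in a simulator, but the two structures can be illustrated with GCC's software TM extension (compile with -fgnu-tm); decrease_key() and swap_nodes() are the heap helpers sketched earlier, here assumed to be annotated transaction_safe:

    extern int heap[], key[], pos[];
    extern void decrease_key(int v, int newkey)
        __attribute__((transaction_safe));
    extern void swap_nodes(int i, int j)
        __attribute__((transaction_safe));

    /* cgs-tm: the entire DecreaseKey is a single transaction */
    void decrease_key_cgs_tm(int v, int newkey)
    {
        __transaction_atomic {
            decrease_key(v, newkey);        /* full sift-up, all-or-none */
        }
    }

    /* fgs-tm: each parent-child swap is its own, shorter transaction */
    void decrease_key_fgs_tm(int v, int newkey)
    {
        int i, more = 1;
        __transaction_atomic { key[v] = newkey; i = pos[v]; }
        while (more) {
            __transaction_atomic {
                int p = (i - 1) / 2;
                if (i > 0 && key[heap[p]] > key[heap[i]]) {
                    swap_nodes(i, p);       /* one atomic swap */
                    i = p;
                } else {
                    more = 0;               /* heap property restored */
                }
            }
        }
    }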
Performance of FGMT with TM

[Figure: multithreaded speedup (y-axis, 0.5 to 1.2) vs. number of threads (x-axis, 2 to 16) for perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm and perfbar+fgs-tm]

TM-based schemes offer speedup up to ∼1.1
less overhead for cgs-tm, yet equally able to exploit available concurrency
Helper-Threading Scheme

Motivation
expose more parallelism to each thread
eliminate costly barrier synchronization

Rationale
in serial, relaxations are performed only from the extracted (settled) vertex
allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
◮ main thread operates as in the serial algorithm
◮ assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
◮ helper thread k relaxes all out-edges of vertex xk
speculation on the status of d(xk)
◮ if already optimal, main thread will be offloaded
◮ if not optimal, any suboptimal relaxations will be corrected eventually by main thread

[Figure: example graph; the main thread settles vertex i−1 at step i while helpers relax out-edges of queued vertices]
Execution Pattern

[Figure: execution timelines for the serial, FGMT (threads 1-4) and helper-threading (main + helpers 1-3) schemes across steps k, k+1, k+2; in the helper-threading scheme the main thread runs extract-min and relaxes edges, each helper reads the tid-th min and relaxes its out-edges, and the main thread "kills" all helpers at the end of each step]

the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, as with mis-speculated distances
Helper-Threading Scheme

Main thread
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    done ← 0
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        End-Xact
    end
    Begin-Xact
    done ← 1
    End-Xact
end

Helper thread
while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
        else
            stop ← 1
        End-Xact
    end
end

for a single neighbour, the check for relaxation, updates to the heap, and updates to the d, π arrays are enclosed within a transaction
◮ performed "all-or-none"
◮ on a conflict, only one thread commits
interruption of helper threads implemented through TM as well
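A sketch of one helper thread's loop in the same GCC-TM notation (the paper uses hardware TM): read_min(tid) is an assumed helper returning the (tid+1)-th smallest queued vertex, as ReadMin does in the pseudocode, and done is the shared flag written transactionally by the main thread.

    extern int done;
    extern int d[], pi[];
    extern int queue_empty(void);
    extern int read_min(int tid);
    extern int out_deg(int u), nbr(int u, int k), wgt(int u, int k);
    extern void decrease_key(int v, int newkey)
        __attribute__((transaction_safe));

    void helper_loop(int tid)
    {
        while (!queue_empty()) {
            int wait;                     /* spin until the main thread  */
            do {                          /* starts its next iteration   */
                __transaction_atomic { wait = (done == 1); }
            } while (wait);

            int x = read_min(tid);        /* speculate: x may not be settled */
            int stop = 0;
            for (int k = 0; k < out_deg(x) && !stop; k++) {
                int y = nbr(x, k), wxy = wgt(x, k);
                /* relaxation check + heap update + d/pi updates:
                 * performed all-or-none */
                __transaction_atomic {
                    if (done == 0) {
                        if (d[y] > d[x] + wxy) {
                            decrease_key(y, d[x] + wxy);
                            d[y] = d[x] + wxy;
                            pi[y] = x;
                        }
                    } else {
                        stop = 1;         /* "killed" by the main thread */
                    }
                }
            }
        }
    }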
Helper-Threading Scheme

(main thread and helper thread pseudocode as on the previous slide)

Why with TM?
composable
◮ all dependent atomic sub-operations composed into one large atomic operation, without limiting concurrency
optimistic
easily programmable
Experimental Setup

Full-system simulation
Simics 3.0.31 in conjunction with the GEMS toolset 2.1
boots unmodified Solaris 10 (UltraSPARC III Cu)
LogTM ("Signature Edition")
◮ eager version management
◮ eager conflict detection: on a conflict, a transaction stalls and either retries or aborts
◮ HYBRID conflict resolution policy: favors older transactions

Hardware platform
single CMP system (configurations up to 32 cores)
private L1 caches (64KB), shared L2 cache (2MB)

Software
Pthreads for threading and synchronization
Simics "magic" instructions to simulate idealized barriers
Sun Studio 12 C compiler (-xO3)