Early Experiences on Accelerating Dijkstra's Algorithm Using Transactional Memory

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr

MTAAP'09, May 31, 2009
Outline
1. Dijkstra's Basics
2. Straightforward Parallelization Scheme
3. Helper-Threading Scheme
4. Experimental Evaluation
5. Conclusions
The Basics of Dijkstra's Algorithm

SSSP Problem
Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀ v ∈ V: compute δ(v) = min{ w(p) : p is a path s ⇝ v }

Shortest path estimate d(v) gradually converges to δ(v) through relaxations
relax(v, w): d(w) = min{ d(w), d(v) + w(v, w) }
◮ can we find a better path s ⇝ w by going through v?

Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) > δ(v) and d(v) ≠ ∞
Unreached: d(v) = ∞
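As a concrete illustration, a relaxation in C might look as follows (a minimal sketch; all names are illustrative, not the paper's code):

    #include <limits.h>

    /* dist[] holds the tentative distances d(v), pred[] the
     * predecessor array pi; NVERT is an assumed graph size. */
    enum { NVERT = 1024 };
    int dist[NVERT];
    int pred[NVERT];

    /* relax(v, w): can the path to w improve by going through v? */
    void relax(int v, int w, int weight_vw)
    {
        if (dist[v] != INT_MAX && dist[v] + weight_vw < dist[w]) {
            dist[w] = dist[v] + weight_vw;  /* d(w) = min{d(w), d(v)+w(v,w)} */
            pred[w] = v;                    /* remember how we reached w */
        }
    }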
The Basics of Dijkstra's Algorithm

Serial algorithm
Input:  G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π

foreach v ∈ V do
    d[v] ← ∞
    π[v] ← nil
    Insert(Q, v)
end
d[s] ← 0
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        end
    end
end

[Figure: example graph with source S, vertices A-E, and weighted directed edges]
The Basics of Dijkstra's Algorithm

[Figure: array-based binary min-heap; a DecreaseKey on element i proceeds as a sequence of parent-child swaps toward the root]

Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀ i: d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin's and DecreaseKey's
◮ O((|E| + |V|) log |V|) time complexity
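A minimal sketch of DecreaseKey on an array-based binary min-heap (illustrative, not the paper's implementation): heap[i] holds a vertex id, key[v] its tentative distance, and pos[v] maps a vertex back to its heap slot so it can be located in O(1).

    enum { NVERT = 1024 };
    int heap[NVERT], key[NVERT], pos[NVERT];

    static void swap_nodes(int i, int j)
    {
        int tmp = heap[i]; heap[i] = heap[j]; heap[j] = tmp;
        pos[heap[i]] = i;
        pos[heap[j]] = j;
    }

    /* Sift v up: the sequence of parent-child swaps shown in the figure. */
    void decrease_key(int v, int newkey)
    {
        int i = pos[v];
        key[v] = newkey;
        while (i > 0 && key[heap[(i - 1) / 2]] > key[heap[i]]) {
            swap_nodes(i, (i - 1) / 2);   /* restore the min-heap property */
            i = (i - 1) / 2;              /* parent(i) = (i-1)/2, 0-based  */
        }
    }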
Straightforward Parallelization

Fine-grain parallelization at the inner loop level

Fine-Grain Multi-Threaded
/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q)
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum)
            End-Atomic
            d[v] ← sum
            π[v] ← u
        end
    end
end

Issues:
speedup bounded by average out-degree
concurrent heap updates due to DecreaseKey's
barrier synchronization overhead
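One step of this scheme could be written with Pthreads roughly as below (a sketch, not the authors' code; extract_min(), decrease_key() as sketched earlier, and out_deg(u)/nbr(u,k)/wgt(u,k) for u's adjacency are assumed helpers):

    #include <pthread.h>

    extern int d[], pi[], nthreads;
    extern int extract_min(void);
    extern void decrease_key(int v, int newkey);
    extern int out_deg(int u), nbr(int u, int k), wgt(int u, int k);

    pthread_barrier_t bar;                      /* initialized elsewhere */
    pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
    static int u;                               /* vertex settled this step */

    void fgmt_step(int tid)
    {
        pthread_barrier_wait(&bar);
        if (tid == 0)
            u = extract_min();                  /* only thread 0 extracts */
        pthread_barrier_wait(&bar);             /* all threads see the new u */

        /* u's out-edges are split cyclically among the threads */
        for (int k = tid; k < out_deg(u); k += nthreads) {
            int v = nbr(u, k), sum = d[u] + wgt(u, k);
            if (d[v] > sum) {
                pthread_mutex_lock(&heap_lock); /* one lock for the whole
                                                   heap: cgs-lock below */
                decrease_key(v, sum);
                pthread_mutex_unlock(&heap_lock);
                d[v] = sum;
                pi[v] = u;
            }
        }
    }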
Concurrent Heap Updates with Locks

[Figure: two concurrent DecreaseKey's, on elements i and k, each sifting up through the heap as a sequence of parent-child swaps]

Coarse-grain synchronization (cgs-lock)
◮ enforces atomicity at the level of a DecreaseKey operation
◮ one lock for the entire heap
◮ serializes DecreaseKey's

Fine-grain synchronization (fgs-lock)
◮ enforces atomicity at the level of a single swap
◮ allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
◮ separate locks for each parent-child pair
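A sketch of the fgs-lock idea, assuming lock[i] guards heap slot i so that each swap takes the two locks of a parent-child pair (consistent with the "2 locks + 2 unlocks per swap" overhead quoted on the next slide). Simplified for illustration; a production version must also re-validate positions after reacquiring locks.

    #include <pthread.h>

    enum { NVERT = 1024 };
    extern int heap[], key[], pos[];
    extern void swap_nodes(int i, int j);  /* as in the earlier heap sketch */
    pthread_mutex_t lock[NVERT];           /* initialized elsewhere */

    void decrease_key_fgs(int v, int newkey)
    {
        key[v] = newkey;
        int i = pos[v];
        while (i > 0) {
            int p = (i - 1) / 2;
            pthread_mutex_lock(&lock[p]);  /* parent first: a global lock */
            pthread_mutex_lock(&lock[i]);  /* order prevents deadlock     */
            if (key[heap[p]] <= key[heap[i]]) {
                pthread_mutex_unlock(&lock[i]);
                pthread_mutex_unlock(&lock[p]);
                break;                     /* min-heap property holds */
            }
            swap_nodes(i, p);              /* the single swap this pair of
                                              locks makes atomic */
            pthread_mutex_unlock(&lock[i]);
            pthread_mutex_unlock(&lock[p]);
            i = p;
        }
    }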
Performance of FGMT with Locks

[Figure: multithreaded speedup (y-axis, 0 to 1.3) vs. number of threads (x-axis, 2 to 16) for cgs-lock, perfbar+cgs-lock and perfbar+fgs-lock]

Software barriers dominate total execution time
◮ 72% with 2 threads, 88% with 8
◮ replace with idealized (simulated) zero-latency barriers

Fgs-lock scheme more scalable, but still fails to outperform serial
◮ locking overhead (2 locks + 2 unlocks per swap)
Concurrent Heap Updates with TM

[Figure: the same two concurrent DecreaseKey's on elements i and k, now executed as transactions]

TM-based coarse-grain synchronization (cgs-tm)
◮ enclose DecreaseKey within a transaction
◮ allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
◮ conflicting transaction stalls and retries or aborts

Fine-grain synchronization (fgs-tm)
◮ enclose each swap operation within a transaction
◮ atomicity as in fgs-lock
◮ shorter but more transactions
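The paper evaluates hardware TM (LogTM) in a simulator, but the two structures can be illustrated with GCC's software TM extension (compile with -fgnu-tm); decrease_key() and swap_nodes() are the heap helpers sketched earlier, here assumed to be annotated transaction_safe:

    extern int heap[], key[], pos[];
    extern void decrease_key(int v, int newkey)
        __attribute__((transaction_safe));
    extern void swap_nodes(int i, int j)
        __attribute__((transaction_safe));

    /* cgs-tm: the entire DecreaseKey is a single transaction */
    void decrease_key_cgs_tm(int v, int newkey)
    {
        __transaction_atomic {
            decrease_key(v, newkey);        /* full sift-up, all-or-none */
        }
    }

    /* fgs-tm: each parent-child swap is its own, shorter transaction */
    void decrease_key_fgs_tm(int v, int newkey)
    {
        int i, more = 1;
        __transaction_atomic { key[v] = newkey; i = pos[v]; }
        while (more) {
            __transaction_atomic {
                int p = (i - 1) / 2;
                if (i > 0 && key[heap[p]] > key[heap[i]]) {
                    swap_nodes(i, p);       /* one atomic swap */
                    i = p;
                } else {
                    more = 0;               /* heap property restored */
                }
            }
        }
    }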
Performance of FGMT with TM

[Figure: multithreaded speedup (y-axis, 0.5 to 1.2) vs. number of threads (x-axis, 2 to 16) for perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm and perfbar+fgs-tm]

TM-based schemes offer speedup up to ∼1.1
less overhead for cgs-tm, yet equally able to exploit available concurrency
Helper-Threading Scheme

Motivation
expose more parallelism to each thread
eliminate costly barrier synchronization

Rationale
in serial, relaxations are performed only from the extracted (settled) vertex
allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
◮ main thread operates as in the serial algorithm
◮ assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
◮ helper thread k relaxes all out-edges of vertex xk
speculation on the status of d(xk)
◮ if already optimal, main thread will be offloaded
◮ if not optimal, any suboptimal relaxations will be corrected eventually by main thread

[Figure: example graph; the main thread settles vertex i−1 at step i while helpers relax out-edges of queued vertices]
Execution Pattern

[Figure: execution timelines for the serial, FGMT (threads 1-4) and helper-threading (main + helpers 1-3) schemes across steps k, k+1, k+2; in the helper-threading scheme the main thread runs extract-min and relaxes edges, each helper reads the tid-th min and relaxes its out-edges, and the main thread "kills" all helpers at the end of each step]

the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, as with mis-speculated distances
Helper-Threading Scheme

Main thread
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    done ← 0
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        End-Xact
    end
    Begin-Xact
    done ← 1
    End-Xact
end

Helper thread
while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
        else
            stop ← 1
        End-Xact
    end
end

for a single neighbour, the check for relaxation, updates to the heap, and updates to the d, π arrays are enclosed within a transaction
◮ performed "all-or-none"
◮ on a conflict, only one thread commits
interruption of helper threads implemented through TM as well
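A sketch of one helper thread's loop in the same GCC-TM notation (the paper uses hardware TM): read_min(tid) is an assumed helper returning the (tid+1)-th smallest queued vertex, as ReadMin does in the pseudocode, and done is the shared flag written transactionally by the main thread.

    extern int done;
    extern int d[], pi[];
    extern int queue_empty(void);
    extern int read_min(int tid);
    extern int out_deg(int u), nbr(int u, int k), wgt(int u, int k);
    extern void decrease_key(int v, int newkey)
        __attribute__((transaction_safe));

    void helper_loop(int tid)
    {
        while (!queue_empty()) {
            int wait;                     /* spin until the main thread  */
            do {                          /* starts its next iteration   */
                __transaction_atomic { wait = (done == 1); }
            } while (wait);

            int x = read_min(tid);        /* speculate: x may not be settled */
            int stop = 0;
            for (int k = 0; k < out_deg(x) && !stop; k++) {
                int y = nbr(x, k), wxy = wgt(x, k);
                /* relaxation check + heap update + d/pi updates:
                 * performed all-or-none */
                __transaction_atomic {
                    if (done == 0) {
                        if (d[y] > d[x] + wxy) {
                            decrease_key(y, d[x] + wxy);
                            d[y] = d[x] + wxy;
                            pi[y] = x;
                        }
                    } else {
                        stop = 1;         /* "killed" by the main thread */
                    }
                }
            }
        }
    }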
Helper-Threading Scheme

(main thread and helper thread pseudocode as on the previous slide)

Why with TM?
composable
◮ all dependent atomic sub-operations composed into one large atomic operation, without limiting concurrency
optimistic
easily programmable
Experimental Setup

Full-system simulation
Simics 3.0.31 in conjunction with the GEMS toolset 2.1
boots unmodified Solaris 10 (UltraSPARC III Cu)
LogTM ("Signature Edition")
◮ eager version management
◮ eager conflict detection: on a conflict, a transaction stalls and either retries or aborts
◮ HYBRID conflict resolution policy: favors older transactions

Hardware platform
single CMP system (configurations up to 32 cores)
private L1 caches (64KB), shared L2 cache (2MB)

Software
Pthreads for threading and synchronization
Simics "magic" instructions to simulate idealized barriers
Sun Studio 12 C compiler (-xO3)