Scheduling Parallel Programs by Work Stealing with Private Deques

Umut Acar (Carnegie Mellon University), Arthur Charguéraud (INRIA), Mike Rainey (Max Planck Institute for Software Systems)
PPoPP, 25.2.2013
Friday, July 5, 13

Scheduling parallel tasks
[Figure: a pool of tasks to be distributed over a set of cores]
• Goal: dynamic load balancing
• A centralized approach: does not scale up
• Popular approach: work stealing
• Our work: study implementations of work stealing

Work stealing
[Figure: one deque per core; each core pushes and pops tasks on its own deque, and a core can steal a task from another core's deque]

Concurrent deques
• Deques are shared.
• Two sources of race:
  • between thieves
  • between owner and thief
• The Chase-Lev data structure resolves these races using atomic compare&swap and memory fences.
[Figure: a deque with steals at the top end (top) and the owner's pop and push at the bottom end (bot)]
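To make the two races concrete, here is a minimal sketch of a Chase-Lev-style deque. It is not the authors' implementation: the class name `ChaseLevSketch`, the fixed `Capacity`, the `int` task type, and the use of sequentially consistent atomics everywhere (instead of tuned fences) are all simplifying assumptions made for illustration.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Simplified Chase-Lev-style deque (illustrative only): fixed capacity,
// int tasks, sequentially consistent atomics, no buffer resizing.
template <std::size_t Capacity>
class ChaseLevSketch {
  std::atomic<long> top_{0};  // index thieves steal from
  std::atomic<long> bot_{0};  // index the owner pushes to
  std::array<std::atomic<int>, Capacity> buf_{};

  std::atomic<int>& slot(long i) {
    return buf_[static_cast<std::size_t>(i) % Capacity];
  }

 public:
  // Owner only. Assumes fewer than Capacity tasks are stored at once.
  void push(int task) {
    long b = bot_.load();
    slot(b).store(task);
    bot_.store(b + 1);
  }

  // Owner only. The seq_cst store to bot_ followed by the load of top_
  // plays the role of the memory fence in pop.
  std::optional<int> pop() {
    long b = bot_.load() - 1;
    bot_.store(b);
    long t = top_.load();
    if (t > b) {  // deque was empty: restore bot_
      bot_.store(t);
      return std::nullopt;
    }
    int task = slot(b).load();
    if (t < b) return task;  // more than one task: no race with thieves
    // Exactly one task left: owner and thieves race, CAS decides.
    bool won = top_.compare_exchange_strong(t, t + 1);
    bot_.store(b + 1);
    if (!won) return std::nullopt;
    return task;
  }

  // Any thief. Thieves race with each other and with the owner via CAS.
  std::optional<int> steal() {
    long t = top_.load();
    long b = bot_.load();
    if (t >= b) return std::nullopt;
    int task = slot(t).load();
    if (!top_.compare_exchange_strong(t, t + 1)) return std::nullopt;
    return task;
  }
};
```

The CAS on `top_` is the only point where thieves compete with each other, and the single-task case in `pop` is the only point where the owner competes with a thief.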

Concurrent deques
• Well studied: shown to perform well both in theory and in practice
... however, researchers identified two main limitations:
• Runtime overhead: In a relaxed memory model, pop must use a memory fence.
• Lack of flexibility: Simple extensions (e.g., steal half) involve major challenges.

Previous studies of private deques
• Feeley 1992: Multilisp
• Hendler & Shavit 2002: C
• Umatani 2003: Java
• Hiraishi et al. 2009: C
• Sanchez et al. 2010: C
• Fluet et al. 2011: Parallel ML

Private deques
• Each core has exclusive access to its own deque.
• An idle core obtains a task by making a steal request.
• A busy core regularly checks for incoming requests.
[Figure: an idle core sends a steal request to a busy core; the busy core pushes and pops on its own deque, and answers a request with a pop & send]

Private deques
Addresses the main limitations of concurrent deques:
• no need for memory fence
• flexible deques (any data structure can be used)
but
• new cost associated with regular polling
• additional delay associated with steals
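The polling cost mentioned above can be pictured with a minimal sketch of a busy core that interleaves user work with periodic checks for requests. The names `run_with_polling` and `PollStats`, and the choice to measure the period in abstract work steps rather than in time, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Counters returned by the sketch so the polling overhead is visible.
struct PollStats {
  std::uint64_t work_steps = 0;
  std::uint64_t polls = 0;
};

// Runs `steps` units of work and calls `poll` once every `period` units.
// `period` is a hypothetical stand-in for the polling interval.
PollStats run_with_polling(std::uint64_t steps, std::uint64_t period,
                           const std::function<void()>& poll) {
  PollStats stats;
  std::uint64_t since_poll = 0;
  for (std::uint64_t i = 0; i < steps; ++i) {
    ++stats.work_steps;  // stand-in for one unit of user work
    if (++since_poll == period) {
      poll();  // check for incoming steal requests
      ++stats.polls;
      since_poll = 0;
    }
  }
  return stats;
}
```

A longer period means fewer polls per unit of work but a longer wait before a steal request is noticed, which is the trade-off between the two costs in the list.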

Unknowns of private deques
• What is the best way to implement work stealing with private deques?
  We give a receiver-initiated and a sender-initiated algorithm.
• How does it compare with concurrent deques on state-of-the-art benchmarks?
  We evaluate on a collection of benchmarks.
• Can we establish tight bounds on the run time?
  We prove a theorem w.r.t. delay and polling overhead.

Receiver initiated
[Figure, animation: four cores, each with a request cell initially holding -1. An idle core (core 2 in the figure) uses CAS to write its identifier into the request cell of a busy core. Once the request has been handled, the cell returns to -1.]
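The request-cell protocol pictured above can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the names `ReceiverCore`, `send_request`, and `poll`, the separate `transfer` cell used for the reply, the `int` task type, and the choice to hand over the oldest task are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <vector>

constexpr int kNoRequest = -1;  // request cell value when no thief is waiting
constexpr int kWaiting = -1;    // transfer cell: no reply yet
constexpr int kDeclined = -2;   // transfer cell: victim had nothing to give

struct ReceiverCore {
  std::deque<int> tasks;                  // private: only the owner touches it
  std::atomic<int> request{kNoRequest};   // written by idle cores via CAS
  std::atomic<int> transfer{kWaiting};    // written by the victim that replies
};

// Idle core `thief` asks `victim` for work. Fails if another request
// is already pending in the victim's cell.
bool send_request(std::vector<ReceiverCore>& cores, int thief, int victim) {
  cores[thief].transfer.store(kWaiting);
  int expected = kNoRequest;
  return cores[victim].request.compare_exchange_strong(expected, thief);
}

// Busy core `self` polls its request cell and answers a pending request.
void poll(std::vector<ReceiverCore>& cores, int self) {
  int thief = cores[self].request.load();
  if (thief == kNoRequest) return;
  int reply = kDeclined;
  if (!cores[self].tasks.empty()) {
    reply = cores[self].tasks.front();  // hand over the oldest task
    cores[self].tasks.pop_front();
  }
  cores[thief].transfer.store(reply);
  cores[self].request.store(kNoRequest);  // accept new requests again
}
```

Only the request cell needs a CAS, because several idle cores may target the same busy core; the deque itself is never accessed by another core.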

From receiver to sender initiated
• Receiver initiated: each idle core targets one busy core at random
• Sender initiated: each busy core targets one core at random
• The sender-initiated idea is adapted from distributed computing.
• Sender initiated is simpler to implement.

Sender initiated
[Figure, animation: four cores, each with a communication cell. One core's cell is set to 0; a busy core then uses CAS on that cell, after which the cell no longer reads 0.]
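A minimal sketch of the sender-initiated idea, under stated assumptions: each core has one communication cell, an idle core marks its cell as waiting, and a busy core tries to CAS a task into the cell of the core it targets. The names `SenderCore`, `become_idle`, `offer`, and `receive`, the sentinel values, and the positive `int` task identifiers are hypothetical and not taken from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <deque>

constexpr int kBusy = -1;    // cell value: owner has work, do not deliver
constexpr int kWaiting = 0;  // cell value: owner is idle and wants a task

struct SenderCore {
  std::deque<int> tasks;          // private: only the owner touches it
  std::atomic<int> cell{kBusy};   // tasks are positive ints in this sketch
};

// Called by a core that ran out of work.
void become_idle(SenderCore& self) { self.cell.store(kWaiting); }

// Busy core `self` offers its oldest task to `target`. The CAS succeeds
// only if the target is idle and no other sender got there first.
bool offer(SenderCore& self, SenderCore& target) {
  if (self.tasks.empty()) return false;
  int task = self.tasks.front();
  int expected = kWaiting;
  if (!target.cell.compare_exchange_strong(expected, task)) return false;
  self.tasks.pop_front();
  return true;
}

// Idle core checks whether a task has been delivered to its cell.
bool receive(SenderCore& self) {
  int value = self.cell.load();
  if (value == kWaiting || value == kBusy) return false;
  self.tasks.push_back(value);
  self.cell.store(kBusy);
  return true;
}
```

In this sketch the busy core removes the task from its own deque only after the CAS succeeds, so a failed offer costs nothing beyond the attempt.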

Performance study
• We implemented in our own C++ library:
  • our receiver-initiated algorithm
  • our sender-initiated algorithm
  • our Chase-Lev implementation
• We compare all of those implementations against Cilk Plus.

Benchmarks
• Classic Cilk benchmarks and Problem Based Benchmark Suite (Blelloch et al. 2012)
• Problem areas: merge sort, sample sort, maximal independent set, maximal matching, convex hull, Fibonacci, and dense matrix multiply.

Performance results
[Figure: bar chart of normalized run time on an Intel Xeon with 30 cores, polling period = 30 µsec. Series: concurrent (shared) deques, receiver-initiated, sender-initiated, Cilk Plus. Vertical axis: normalized execution time, from 0.0 to 1.4. Benchmarks: matmul, cilksort(exptintseq), cilksort(randintseq), fib, matching(eggrid2d), matching(egrlg), matching(egrmat), MIS(grid2d), MIS(rlg), MIS(rmat), hull(plummer2d), hull(uniform2d).]

Analytical model
• P: number of cores
• T_1: serial run time
• T_∞: minimal run time with infinite cores
• T_P: parallel run time with P cores
• δ: polling interval
• F: maximal number of forks in a path

Our main analytical result
Bound for greedy schedulers:
\[ T_P \le \frac{T_1}{P} + \frac{P-1}{P}\, T_\infty \]
Bound for concurrent deques (ignoring cost of fences):
\[ E[T_P] \le \frac{T_1}{P} + \frac{P-1}{P}\, T_\infty + O(F) \]
Here the O(F) term is the cost of steals.
Bound for our two algorithms:
\[ E[T_P] \le \left(1 + \frac{O(1)}{\delta}\right) \cdot \left( \frac{T_1}{P} + \frac{P-1}{P}\, T_\infty + O(\delta F) \right) \]
Here the O(δF) term is the cost of steals and the factor 1 + O(1)/δ is the polling overhead.
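To get a feel for how the terms interact, the bounds can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and the constants `c_poll` and `c_steal` stand in for the unspecified constants hidden in the O(1) and O(δF) terms, so the second function shows the shape of the bound rather than a value the talk states.

```cpp
#include <cassert>

// Greedy-scheduler bound: T_1/P + ((P-1)/P) * T_inf.
double greedy_bound(double t1, double tinf, double p) {
  return t1 / p + (p - 1.0) / p * tinf;
}

// Shape of the private-deque bound with explicit, assumed constants:
// (1 + c_poll/delta) * (T_1/P + ((P-1)/P) * T_inf + c_steal * delta * F).
double private_deque_bound(double t1, double tinf, double p, double delta,
                           double f, double c_poll, double c_steal) {
  double polling_factor = 1.0 + c_poll / delta;
  double steal_cost = c_steal * delta * f;
  return polling_factor * (greedy_bound(t1, tinf, p) + steal_cost);
}
```

For example, with T_1 = 100, T_∞ = 10, and P = 4, the greedy bound is 25 + 0.75 · 10 = 32.5. Raising δ shrinks the polling factor but grows the steal term, so the two costs pull in opposite directions.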

Conclusion
• We presented two new private-deques algorithms, evaluated them, and proved analytical results.
• In the paper, we demonstrated the flexibility of private deques by implementing the steal half policy.
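Because a private deque is only ever touched by its owner, a policy such as steal half reduces to ordinary sequential code. The sketch below is one possible reading, not the paper's implementation: the function name `steal_half`, the `int` task type, taking the oldest tasks, and rounding down are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Removes the oldest half (rounded down) of the owner's tasks and
// returns them, ready to be sent to the requesting core.
// Called by the owner only, so no atomics or fences are needed.
std::vector<int> steal_half(std::deque<int>& tasks) {
  std::ptrdiff_t n = static_cast<std::ptrdiff_t>(tasks.size() / 2);
  std::vector<int> stolen(tasks.begin(), tasks.begin() + n);
  tasks.erase(tasks.begin(), tasks.begin() + n);
  return stolen;
}
```

With a concurrent deque the same operation would have to move several tasks atomically with respect to the owner and other thieves, which is where the difficulty noted earlier comes from.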
