Malleable task-graph scheduling with a practical speed-up model
Loris Marchal (1), Bertrand Simon (1), Oliver Sinnen (2), Frédéric Vivien (1)
1: CNRS, INRIA, ENS Lyon and Univ. Lyon, FR. 2: Univ. Auckland, NZ.
New Challenges in Scheduling Theory, Aussois, March 2016
L. Marchal, B. Simon , O. Sinnen, F. Vivien Malleable task-graph scheduling with a practical speed-up model 1 / 22

Motivation
Context:
• Optimize the time performance of multifrontal sparse solvers (e.g., MUMPS or QR-MUMPS)
• Computations well described by a tree of tasks
• Generalization to series-parallel (SP) graphs, built from single tasks T by series composition (G1 ; G2) and parallel composition (G1 ∥ G2)
• Purpose: find a schedule achieving the lowest makespan
[Figure: example SP-graph with six tasks, numbered 1 to 6]
Objectives:
• Provide theoretical guarantees on widely used scheduling algorithms
• Design algorithms that achieve a smaller makespan

Application modeling
Coarse-grain picture: tree of tasks (or SP task graph)
• Each task: partial factorization, itself a graph of smaller sub-tasks
[Figure: sub-task graph of one task, made of POTRF, TRSM, SYRK and GEMM kernels]
Expand all tasks and schedule the resulting graph?
• Scheduling trees is simpler than scheduling general graphs (forget sub-tasks)
• Behavior of coarse-grain tasks: parallel and malleable
• Speed-up model → trade-off between:
  Accuracy: fits the data well
  Tractability: amenable to performance analysis and guaranteed algorithms

General speed-up models
Literature: studies with few assumptions
  speed-up(p) = time(1 proc.) / time(p proc.)
  work(p) = p · time(p proc.)
• Non-decreasing speed-up and work
• Independent tasks: theoretical FPTAS and practical 2-approximations [Jansen 2004, Fan et al. 2012]
• SP-graphs: ≈ 2.6-approximation [Lepère et al. 2001]; with concave speed-up: (2 + ε)-approximation of unspecified complexity [Makarychev et al. 2014]
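
As a small illustration of the two definitions on this slide, the sketch below computes speed-up and work from a table of measured running times. The timing values and the function names are hypothetical, chosen only for the example.

```python
def speedup(times, p):
    """speed-up(p) = time(1 proc.) / time(p proc.)."""
    return times[1] / times[p]


def work(times, p):
    """work(p) = p * time(p proc.)."""
    return p * times[p]


# Hypothetical running times (seconds) of one task on 1, 2 and 4 processors.
times = {1: 12.0, 2: 6.5, 4: 4.0}

for p in sorted(times):
    print(p, speedup(times, p), work(times, p))
```

With these made-up timings, both quantities grow with p: adding processors shortens the task but costs more total processor-time.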

Previous work (Europar 2015, with A. Guermouche)
Prasanna & Musicus model [PM 1996]: speed-up(p) = p^α
  α = 1: perfect parallelism
  0 < α < 1: sublinear speed-up
  α = 0: no parallelism
[Figure: speed-up versus number of processors for α = 1, 0 < α < 1 and α = 0]
Conclusions:
• Average accuracy
• No guarantees for distributed platforms
• Rational numbers of processors
• Task finish times complex to compute
• Optimal algorithm for SP-graphs
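
To make the Prasanna & Musicus model concrete, here is a minimal sketch of the running time it implies, assuming a task of sequential time w takes w / p^α on p processors. The name `pm_time` and the sample values are illustrative and do not come from the talk.

```python
def pm_time(w, p, alpha):
    """Running time under speed-up(p) = p**alpha; p may be fractional."""
    return w / p ** alpha


w = 8.0  # hypothetical sequential time of one task
print(pm_time(w, 4, 1.0))  # alpha = 1: perfect parallelism
print(pm_time(w, 4, 0.5))  # 0 < alpha < 1: sublinear speed-up
print(pm_time(w, 4, 0.0))  # alpha = 0: no parallelism
```

Because p enters through a power, fractional processor shares produce irrational durations in general, which matches the remark that task finish times are complex to compute.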

Today: simpler model
Simple and reasonable model of a parallel malleable task T_i
• Perfect parallelism up to a threshold δ_i: time = w_i / min(p, δ_i)
• Rational allocation for free (McNaughton's wrap-around rule)
[Figure: speed-up versus number of processors: slope 1 up to δ_i, constant beyond]
Related studies
• 2-approximation [Balmin et al. 13] that we will discuss
• [Kell et al. 2015]: time = w_i / p + (p − 1) c; 2-approximation for p = 3, open for p ≥ 4
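
The two bullets of the model can be sketched as follows. `task_time` is the threshold formula from the slide; `wrap_around` is one possible reading of McNaughton's wrap-around rule, assuming every task holds a constant rational share of processors during a common interval [0, T). Both function names and the sample shares are assumptions made for illustration.

```python
from fractions import Fraction


def task_time(w, p, delta):
    """time = w / min(p, delta): perfect parallelism up to the threshold."""
    return Fraction(w) / min(p, delta)


def wrap_around(shares, T, p):
    """Turn rational shares held during [0, T) into integer-processor pieces.

    Processors are filled one after the other; a task that does not fit in
    the time left on the current processor wraps onto the next one.
    Returns a list of (task, processor, start, end).
    """
    assert sum(shares) <= p, "shares must fit on p processors"
    pieces = []
    proc, t = 0, Fraction(0)
    for task, share in enumerate(shares):
        remaining = Fraction(share) * T  # processor-time owed to this task
        while remaining > 0:
            chunk = min(remaining, T - t)
            pieces.append((task, proc, t, t + chunk))
            remaining -= chunk
            t += chunk
            if t == T:  # current processor is full, move to the next one
                proc, t = proc + 1, Fraction(0)
    return pieces


# Hypothetical example: two tasks sharing 2 processors during [0, 2).
pieces = wrap_around([Fraction(3, 2), Fraction(1, 2)], T=2, p=2)
```

In this example task 0 fills processor 0 and the first half of processor 1, and task 1 takes the second half of processor 1, so no processor runs two tasks at once. A task with share s never uses more than ⌈s⌉ processors simultaneously, so an integer threshold δ_i that bounds the share also bounds the processors actually used.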

Outline
1 Problem complexity
2 Analysis of PROPORTIONAL MAPPING [Pothen et al. 1993]
3 Design of a greedy strategy
4 Experimental comparison
5 Conclusion

Overview of the problem
Given an SP-graph and p processors: compute the optimal makespan
• Problem known as P | sp-graph, any, spdp-lin, δ_i | C_max
• Malleability + perfect parallelism ⇒ polynomial (P)
• ... + thresholds ⇒ NP-complete
• Existing proof in [Drozdowski and Kubiak 1999]: arguably complex
Contribution
• New NP-completeness proof
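
The gap between the two cases can be illustrated with two standard lower bounds on the makespan: total work divided by p, and the critical path in which each task runs at its threshold. The sketch below computes them on a recursively defined SP-graph. It is not an algorithm from the talk; the class and function names are assumptions, and with thresholds the optimal makespan may be strictly larger than this bound.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Union


@dataclass
class Task:
    w: int      # work of the task
    delta: int  # threshold on the number of useful processors


@dataclass
class Series:
    left: "SP"
    right: "SP"


@dataclass
class Parallel:
    left: "SP"
    right: "SP"


SP = Union[Task, Series, Parallel]


def total_work(g):
    if isinstance(g, Task):
        return Fraction(g.w)
    return total_work(g.left) + total_work(g.right)


def critical_path(g, p):
    """Longest chain when every task runs on min(p, delta) processors."""
    if isinstance(g, Task):
        return Fraction(g.w) / min(p, g.delta)
    a, b = critical_path(g.left, p), critical_path(g.right, p)
    return a + b if isinstance(g, Series) else max(a, b)


def makespan_lower_bound(g, p):
    return max(total_work(g) / p, critical_path(g, p))


# Hypothetical graph: (T1 || T2) ; T3, once with tight thresholds and once
# with thresholds equal to p (perfect parallelism).
tight = Series(Parallel(Task(4, 2), Task(6, 1)), Task(8, 4))
loose = Series(Parallel(Task(4, 4), Task(6, 4)), Task(8, 4))
```

When every threshold is at least p, the bound equals total work / p, and it is reached by running the tasks one after the other on all p processors in any order compatible with the precedence constraints. With tight thresholds the critical path can dominate, and deciding how to share processors among parallel branches is where the difficulty lies.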