Background on traditional scheduling

HEFT: Heterogeneous Earliest Finish Time

1 Priority level:
◮ rank(T_i) = w_i + max_{T_j ∈ Succ(T_i)} ( com_{ij} + rank(T_j) ), where Succ(T) is the set of successors of T
◮ Recursive computation by bottom-up traversal of the graph

2 Allocation
◮ For current task T_i, determine best processor P_q: minimize σ(T_i) + w_{iq}
◮ Enforce constraints related to communication costs
◮ Insertion scheduling: look for t = σ(T_i) s.t. P_q is available during interval [t, t + w_{iq}[

3 Complexity: same as MCP without/with insertion

Yves Robert, Scheduling for Heterogeneous Platforms
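The recursive rank can be computed by a memoized bottom-up traversal. A minimal sketch, with a hypothetical four-task DAG and illustrative weights (none of these values come from the slides):

```python
from functools import lru_cache

# Hypothetical DAG: succ[t] lists the successors of task t.
succ = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}
w = {"T1": 2, "T2": 3, "T3": 4, "T4": 1}            # computation weights w_i
com = {("T1", "T2"): 1, ("T1", "T3"): 2,
       ("T2", "T4"): 3, ("T3", "T4"): 1}            # communication costs com_ij

@lru_cache(maxsize=None)
def rank(t):
    # rank(T_i) = w_i + max over T_j in Succ(T_i) of (com_ij + rank(T_j));
    # memoization realizes the bottom-up traversal.
    return w[t] + max((com[(t, s)] + rank(s) for s in succ[t]), default=0)

priorities = sorted(w, key=rank, reverse=True)      # highest rank scheduled first
```

The exit task T4 gets rank w_4 = 1, and ranks propagate upward, so the entry task always comes first in the priority list.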
Background on traditional scheduling

Bibliography – Traditional scheduling

Introductory book: Distributed and parallel computing, H. El-Rewini and T.G. Lewis, Manning (1997)
FCP: On the complexity of list scheduling algorithms for distributed-memory systems, A. Radulescu and A.J.C. van Gemund, 13th ACM Int. Conf. Supercomputing (1999), 68-75
HEFT: Performance-effective and low-complexity task scheduling for heterogeneous computing, H. Topcuoglu, S. Hariri and M.-Y. Wu, IEEE TPDS 13, 3 (2002), 260-274
Background on traditional scheduling

What's wrong?
- Nothing (we still may need to map a DAG onto a platform!)
- Unrealistic communication model: complicated (many parameters to instantiate), yet not realistic (clique + no contention)
- Wrong metric: need to relax the makespan minimization objective
Packet routing

Outline
1 Background on traditional scheduling
2 Packet routing
3 Master-worker on heterogeneous platforms
4 Broadcast
5 Limitations
6 Putting all together
7 Conclusion
Packet routing

Problem
(Figure: platform graph with nodes A-H)
- Routing sets of messages from sources to destinations
- Paths not fixed a priori
- Packets of the same message may follow different paths
Packet routing

Hypotheses
(Figure: platform graph with nodes A-H)
- A packet crosses an edge within one time-step
- At any time-step, at most one packet crosses an edge
- Scheduling: for each time-step, decide which packet crosses any given edge
Packet routing

Notation
- n^{k,l}: total number of packets to be routed from k to l
- n^{k,l}_{i,j}: total number of packets routed from k to l and crossing edge (i, j)
Packet routing

Lower bound
- Congestion of edge (i, j): total number of packets that cross (i, j),
  C_{i,j} = Σ_{(k,l) | n^{k,l} > 0} n^{k,l}_{i,j}
- C_max = max_{i,j} C_{i,j}
- C_max is a lower bound on the schedule makespan: C* ≥ C_max
- ⇒ "Fluidified" solution in C_max?
Packet routing

Equations (1/2)
(Figure: platform graph with nodes A-H)
1 Initialization (packets leave node k): Σ_{j | (k,j) ∈ A} n^{k,l}_{k,j} = n^{k,l}
2 Reception (packets reach node l): Σ_{i | (i,l) ∈ A} n^{k,l}_{i,l} = n^{k,l}
3 Conservation law (crossing intermediate node j): ∀ (k, l), ∀ j ≠ k, j ≠ l:
  Σ_{i | (i,j) ∈ A} n^{k,l}_{i,j} = Σ_{i | (j,i) ∈ A} n^{k,l}_{j,i}
Packet routing

Equations (2/2)
4 Congestion: C_{i,j} = Σ_{(k,l) | n^{k,l} > 0} n^{k,l}_{i,j}
5 Objective function: minimize C_max, subject to C_max ≥ C_{i,j} for all (i, j)

Linear program in rational numbers: polynomial-time solution. In practice, use Maple or MuPAD
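Any rational LP solver works here, not only Maple or MuPAD. As an illustration, a toy instance (one commodity, 10 packets from A to D over two hypothetical disjoint paths A→B→D and A→C→D) solved with SciPy; the graph and values are invented for the example:

```python
from scipy.optimize import linprog

# Variables, in order: x_AB, x_AC, x_BD, x_CD, Cmax
c = [0, 0, 0, 0, 1]                # objective: minimize Cmax
A_eq = [[1, 1, 0, 0, 0],           # initialization: packets leave A
        [0, 0, 1, 1, 0],           # reception: packets reach D
        [1, 0, -1, 0, 0],          # conservation at intermediate node B
        [0, 1, 0, -1, 0]]          # conservation at intermediate node C
b_eq = [10, 10, 0, 0]
A_ub = [[1, 0, 0, 0, -1],          # congestion: x_e - Cmax <= 0 for each edge
        [0, 1, 0, 0, -1],
        [0, 0, 1, 0, -1],
        [0, 0, 0, 1, -1]]
b_ub = [0, 0, 0, 0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
# Optimum splits the traffic 5/5 over the two paths, giving Cmax = 5
```

With two identical disjoint paths, the fluidified optimum halves the congestion of single-path routing.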
Packet routing

Routing algorithm
1 Compute the optimal solution C_max, n^{k,l}_{i,j} of the previous linear program
2 Periodic schedule:
◮ Define Ω = ⌈√C_max⌉
◮ Use ⌈C_max / Ω⌉ periods of length Ω
◮ During each period, edge (i, j) forwards (at most) m^{k,l}_{i,j} = ⌊ n^{k,l}_{i,j} Ω / C_max ⌋ packets that go from k to l
3 Clean-up: sequentially process residual packets inside the network
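The period length and per-period quotas above are straightforward to compute once the LP is solved. A small sketch (the edge and commodity values in the example call are hypothetical):

```python
import math

def periodic_schedule(Cmax, n_ij):
    """Period parameters for one fixed edge (i, j).
    n_ij maps a commodity (k, l) to the LP value n^{k,l}_{i,j}."""
    omega = math.ceil(math.sqrt(Cmax))           # Omega = ceil(sqrt(Cmax))
    periods = math.ceil(Cmax / omega)            # number of periods
    quotas = {kl: math.floor(n * omega / Cmax)   # m = floor(n * Omega / Cmax)
              for kl, n in n_ij.items()}
    return omega, periods, quotas

# Hypothetical edge carrying two commodities, Cmax = 100
omega, periods, quotas = periodic_schedule(100, {("A", "D"): 40, ("B", "D"): 60})
```

Rounding down keeps each period feasible; the packets lost to rounding are exactly what the clean-up phase processes.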
Packet routing

Performance
- The schedule is feasible
- The schedule is asymptotically optimal: C_max ≤ C* ≤ C_max + O(√C_max)
Packet routing

Why does it work?
- Relaxation of the objective function
- Rational number of packets in the LP formulation
- Periods long enough that rounding down to integer numbers has negligible impact
- Periods numerous enough that the loss in the first and last periods has negligible impact
- Periodic schedule, described in compact form
Packet routing

Bibliography – Packet routing

Survey of results: Introduction to parallel algorithms and architectures: arrays, trees, hypercubes, F.T. Leighton, Morgan Kaufmann (1992)
NP-completeness, approximation algorithm: A constant-factor approximation algorithm for packet routing and balancing local vs. global criteria, A. Srinivasan and C.-P. Teo, SIAM J. Comput. 30, 6 (2000), 2051-2068
Steady-state: Asymptotically optimal algorithms for job shop scheduling and packet routing, D. Bertsimas and D. Gamarnik, Journal of Algorithms 33, 2 (1999), 296-318
Master-worker on heterogeneous platforms

Outline
1 Background on traditional scheduling
2 Packet routing
3 Master-worker on heterogeneous platforms
4 Broadcast
5 Limitations
6 Putting all together
7 Conclusion
Master-worker on heterogeneous platforms

Master-worker tasking: framework
- Heterogeneous resources: processors of different speeds, communication links with various bandwidths
- Large number of independent tasks to process; tasks are atomic and have the same size
- Single data repository: one master initially holds the data for all tasks
- Several workers arranged along a star, a tree or a general graph
Master-worker on heterogeneous platforms

Application examples
- Monte Carlo methods
- SETI@home
- Factoring large numbers
- Searching for Mersenne primes
- Particle detection at CERN (LHC@home)
- ... and many others: see BOINC at http://boinc.berkeley.edu
Master-worker on heterogeneous platforms

Makespan vs. steady state: two different problems
- Makespan: maximize the total number of tasks processed within a time bound
- Steady state: determine a periodic task allocation which maximizes the total throughput
Master-worker on heterogeneous platforms

Example
(Figure: tree platform)
- A is the root of the tree; all tasks start at A
- Edge labels: time for sending one task (e.g. from A to B)
- Node labels: time for computing one task (e.g. in C)
Master-worker on heterogeneous platforms

Example
(Gantt chart: compute, send and receive activities of nodes A, B, C, D over time-steps 1-3)
Master-worker on heterogeneous platforms

Example
(Gantt chart: startup pattern, repeated pattern, clean-up)
Steady state: 7 tasks every 6 time units
Master-worker on heterogeneous platforms

Solution for star-shaped platforms
(Figure: star platform)
- Communication links between master and workers have different bandwidths
- Workers have different computing power
Master-worker on heterogeneous platforms

Rule of the game
(Figure: star platform, master M linked to workers P_1, ..., P_p with link costs c_1, ..., c_p and compute costs w_1, ..., w_p)
- The master sends tasks to workers sequentially, and without preemption
- Full computation/communication overlap for each worker
- Worker P_i receives a task in c_i time-units
- Worker P_i processes a task in w_i time-units
Master-worker on heterogeneous platforms

Equations
(Figure: star platform as on the previous slide)
- Worker P_i executes α_i tasks per time-unit
- Computations: α_i w_i ≤ 1
- Communications: Σ_i α_i c_i ≤ 1
- Objective: maximize the throughput ρ = Σ_i α_i
Master-worker on heterogeneous platforms

Solution
- Faster-communicating workers first: c_1 ≤ c_2 ≤ ...
- Make full use of the first q workers, where q is the largest index s.t. Σ_{i=1}^{q} c_i / w_i ≤ 1
- Make partial use of the next worker P_{q+1}
- Discard the other workers

Bandwidth-centric strategy
- Delegate work to the fastest-communicating workers
- It does not matter if these workers compute slowly
- Of course, slow workers will not contribute much to the overall throughput
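The closed-form solution above is easy to implement with exact rational arithmetic. A sketch, using the c and w values from the example that follows (my reading of the figure: link costs 1, 2, 3, 10, 20 and compute costs 3, 6, 1, 1, 1):

```python
from fractions import Fraction

def bandwidth_centric(c, w):
    """Optimal steady-state throughput on a star.
    c[i], w[i]: communication / computation time per task for worker i."""
    workers = sorted(range(len(c)), key=lambda i: c[i])  # fastest links first
    comm_left = Fraction(1)        # fraction of the master's sending time left
    alpha = [Fraction(0)] * len(c)
    for i in workers:
        # Saturating worker i's CPU (alpha_i = 1/w_i) costs c_i/w_i of the
        # master's time; use it fully if possible, else partially, else not.
        full_cost = Fraction(c[i], w[i])
        if full_cost <= comm_left:
            alpha[i] = Fraction(1, w[i])
            comm_left -= full_cost
        else:
            alpha[i] = comm_left / c[i]   # partial use of worker P_{q+1}
            comm_left = Fraction(0)
    return alpha, sum(alpha)

alpha, rho = bandwidth_centric([1, 2, 3, 10, 20], [3, 6, 1, 1, 1])
# rho = 11/18: P_1 and P_2 fully used, P_3 partially, P_4 and P_5 discarded
```

Once the master's communication capacity is exhausted, every remaining worker gets α = 0, which is exactly the "discard" rule.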
Master-worker on heterogeneous platforms

Example
(Figure: star platform; the two slowest links are discarded, the others are fully active)

Tasks          | Communication | Computation
6 tasks to P_1 | 6 c_1 = 6     | 6 w_1 = 18
3 tasks to P_2 | 3 c_2 = 6     | 3 w_2 = 18
2 tasks to P_3 | 2 c_3 = 6     | 2 w_3 = 2

11 tasks every 18 time-units (ρ = 11/18 ≈ 0.6)
Master-worker on heterogeneous platforms

Example
- Compare to a purely greedy (demand-driven) strategy: 5 tasks every 36 time-units (ρ = 5/36 ≈ 0.14)
- Even if resources are cheap and abundant, resource selection is key to performance
Master-worker on heterogeneous platforms

Extension to trees
(Figure: tree platform with fully used, partially used and idle nodes)
Resource selection based on local information (children)
Master-worker on heterogeneous platforms

Does this really work?
- Can we deal with arbitrary platforms (including cycles)? Yes
- Can we deal with return messages? Yes
- In fact, can we deal with more complex applications (arbitrary collections of DAGs)? Yes, I mean, almost!
Master-worker on heterogeneous platforms

LP formulation still works well ...
(Figure: processors P_i, P_j, P_k with link costs c_{ji}, c_{ik}; tasks T_m, T_n linked by file e_{mn}; compute cost w_i)

Conservation law: ∀ m, n:
Σ_j sent(P_j → P_i, e_{mn}) + executed(P_i, T_m) = Σ_k sent(P_i → P_k, e_{mn}) + executed(P_i, T_n)

Computations:
Σ_m executed(P_i, T_m) × flops(T_m) × w_i ≤ 1

Outgoing communications:
Σ_{m,n} Σ_j sent(P_i → P_j, e_{mn}) × bytes(e_{mn}) × c_{ij} ≤ 1
Master-worker on heterogeneous platforms

... but schedule reconstruction is harder
(Figure: cyclic schedule of applications A_1-A_5 on processors P_1-P_4, time 0-160)
- The actual (cyclic) schedule is obtained in polynomial time
- Asymptotic optimality
- A couple of practical problems (large period, number of buffers)
- No local scheduling policy
Master-worker on heterogeneous platforms

The beauty of steady-state scheduling

Rationale: maximize throughput (total load executed per period)

Simplicity
- Relaxation of the makespan minimization problem
- Ignore initialization and clean-up phases
- Precise ordering/allocation of tasks/messages not needed
- Characterize resource activity during each time-unit:
◮ which (rational) fraction of time is spent computing for which application?
◮ which (rational) fraction of time is spent receiving or sending to which neighbor?

Efficiency
- Optimal throughput ⇒ optimal schedule (up to a constant number of tasks)
- Periodic schedule, described in compact form ⇒ compiling a loop instead of a DAG!
Master-worker on heterogeneous platforms

Bibliography – Master-worker tasking

Steady-state scheduling: Scheduling strategies for master-worker tasking on heterogeneous processor platforms, C. Banino et al., IEEE TPDS 15, 4 (2004), 319-330
With bounded multi-port model: Distributed adaptive task allocation in heterogeneous computing environments to maximize throughput, B. Hong and V.K. Prasanna, IEEE IPDPS (2004), 52b
With several applications: Centralized versus distributed schedulers for multiple bag-of-task applications, presented yesterday!
Broadcast

Outline
1 Background on traditional scheduling
2 Packet routing
3 Master-worker on heterogeneous platforms
4 Broadcast
5 Limitations
6 Putting all together
7 Conclusion
Broadcast

Broadcasting data
- Key collective communication operation
- Start: one processor has the data; end: all processors own a copy
- Vast literature about broadcast, MPI_Bcast
- Standard approach: use a spanning tree
- Finding the best spanning tree: NP-complete problem (even in the telephone model)
Broadcast

Heuristic: earliest completing edge first (ECEF)
(Figure: platform graph with edge costs; the source is ready at time 0)
Next node to add: pick P_j ∉ T minimizing R_i + c_{ij}, where R_i is the ready time of a node P_i already in the tree T
Broadcast

Heuristic: earliest completing edge first (ECEF)
(Figure: final ECEF broadcast tree, each node labeled with its broadcast finishing time t)
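The greedy rule can be sketched generically; since the slide's graph is a figure whose topology is not fully recoverable here, the example call uses a small hypothetical graph under the one-port model (a node's ready time advances after each send):

```python
def ecef(source, cost):
    """Earliest-completing-edge-first broadcast tree (one-port model).
    cost[(i, j)]: time for node i to send the message to node j."""
    nodes = {v for edge in cost for v in edge}
    R = {source: 0}                  # R[i]: time at which node i is next ready
    tree = []
    while len(R) < len(nodes):
        # Pick the edge (i, j), i in the tree, j outside, that completes first.
        i, j = min(((a, b) for (a, b) in cost if a in R and b not in R),
                   key=lambda e: R[e[0]] + cost[e])
        R[j] = R[i] + cost[(i, j)]   # j has received the message at this time
        R[i] = R[j]                  # sequential sends: i is busy until then
        tree.append((i, j))
    return tree, R

# Hypothetical 3-node example: relaying through A beats a direct slow send
tree, R = ecef("S", {("S", "A"): 1, ("S", "B"): 5, ("A", "B"): 2})
```

In the example, B is reached at time 3 via A rather than at time 5 directly, which is exactly the behavior the heuristic rewards.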
Broadcast

Heuristic: look-ahead (LA)
(Figure: platform graph with tentative values)
Next node to add: pick P_j ∉ T minimizing R_i + c_{ij} + min_{P_k ∉ T} c_{jk}
Broadcast

Heuristic: look-ahead (LA)
(Figure: final LA broadcast tree, each node labeled with its broadcast finishing time t)
Broadcast

Broadcasting longer messages
- Message size goes from L to, say, 10L
- Communication costs scale from c_{ij} to 10 c_{ij}
- ECEF heuristic: broadcast time becomes 90
- LA heuristic: broadcast time becomes 70

Eh wait! What about PIPELINING?!
Broadcast

Broadcasting longer messages
(Figure: message of size 10L split into 10 packets, pipelined through the platform graph)
- Search for a spanning tree
- Objective: minimize the pipelined execution time
Broadcast

Broadcasting longer messages
(Figure: message of size 10L split into 10 packets, pipelined through the platform graph)
- Delay = inverse of throughput
- Node delay = Σ_{children of node} communication times
- Tree delay = maximum node delay
- Pipelined execution time: (# edges in longest path + # packets) × tree delay
- Objective: minimize the tree delay
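The delay computation above can be sketched in a few lines; the chain example and its costs are hypothetical, not taken from the slides:

```python
def pipelined_time(root, children, cost, num_packets):
    """Pipelined broadcast time of a spanning tree under the one-port model.
    children[i]: list of i's children; cost[(i, j)]: per-packet send time."""
    # Node delay = sum of communication times to the node's children.
    node_delay = {i: sum(cost[(i, j)] for j in ch) for i, ch in children.items()}
    tree_delay = max(node_delay.values())        # tree delay = max node delay

    def depth(i):                                # edges on the longest path below i
        return 1 + max(map(depth, children[i])) if children[i] else 0

    # (# edges in longest path + # packets) * tree delay
    return (depth(root) + num_packets) * tree_delay

# Hypothetical chain A -> B -> C, unit costs, 10 packets: (2 + 10) * 1
t = pipelined_time("A", {"A": ["B"], "B": ["C"], "C": []},
                   {("A", "B"): 1, ("B", "C"): 1}, num_packets=10)
```

Note how, for many packets, the tree delay dominates the depth term, which is why the objective switches from finishing time to delay.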
Broadcast

Back to the example
(Figure: ECEF tree with delay 3, LA tree with delay 7)
- The ECEF tree turns out to have minimum delay (maximal throughput)
- Can we always find a tree with optimal throughput? The problem is NP-complete
- Still, we can design simple heuristics, e.g. SDIEF: smallest-delay-increase edge first
Broadcast

Assessing a broadcast strategy
- Finding the optimal set of spanning trees is polynomial: use the LP formulation!
- Schedule reconstruction and packet management are harder with several trees
- Suggested approach:
◮ Compute the optimal throughput (several trees) with the LP formulation
◮ Run your preferred heuristic to generate one or several "good" spanning trees
◮ Stop refining when performance is "reasonably" close to the upper bound
- Will outperform the MPI binomial spanning tree!
Broadcast

Bibliography – Broadcast

Complexity: On broadcasting in heterogeneous networks, S. Khuller and Y.A. Kim, 15th ACM SODA (2004), 1011-1020
Heuristics: Efficient collective communication in distributed heterogeneous systems, P.B. Bhat, C.S. Raghavendra and V.K. Prasanna, JPDC 63 (2003), 251-263
Steady-state: Pipelining broadcasts on heterogeneous platforms, O. Beaumont et al., IEEE TPDS 16, 4 (2005), 300-313
Limitations

Outline
1 Background on traditional scheduling
2 Packet routing
3 Master-worker on heterogeneous platforms
4 Broadcast
5 Limitations
◮ Parameters
◮ Communication model
◮ Topology hierarchy
6 Putting all together
7 Conclusion
Limitations: Parameters

Good news and bad news
- One-port model: a first step towards designing realistic scheduling heuristics
- Steady state circumvents the complexity of scheduling problems ... while deriving efficient (often asymptotically optimal) scheduling algorithms
- Need to acquire a good knowledge of the platform graph
- Need to run extensive experiments or simulations