Mappings with replications
tel-00637362, version 1 - 1 Nov 2011

Round-robin distribution of replicated tasks

[Figure: pipeline T1 → T2 → T3 → T4 with files F1,2, F2,3, F3,4, mapped onto processors P1 to P7; T2 and T3 are replicated on several processors.]

13/64
Mappings with replications

Round-robin distribution of replicated tasks

[Figure: periodic schedule of period τ on processors P1 to P7: T1 and T4 run on one processor each, T2 on two processors and T3 on three, each replica handling every second or third data set.]

13/64
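The round-robin rule illustrated above can be sketched in a few lines of Python (a sketch only; processor names are illustrative, not the exact mapping of the figure):

```python
# Sketch of round-robin distribution: consecutive data set instances of a
# replicated task are dealt cyclically among its replicas, which preserves
# the initial order of data sets without reordering buffers.
# Processor names below are illustrative.

def round_robin(replicas, num_datasets):
    """Return the processor handling each data set instance."""
    return [replicas[n % len(replicas)] for n in range(num_datasets)]

# T2 replicated on two processors, T3 on three (m2 = 2, m3 = 3)
t2_schedule = round_robin(["P2", "P3"], 6)
t3_schedule = round_robin(["P4", "P5", "P6"], 6)
```

Replica k of a task with replication count m thus processes data sets k, k+m, k+2m, ..., which is the closed form used to determine processors.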
A short comparison of the three methods 1/3

Mono-allocation steady-state schedules
◮ Easy to implement ✓
◮ Handle stateful nodes ✓
◮ Smaller buffers ✓
◮ Less efficient schedules (stronger constraints) ✗

14/64
A short comparison of the three methods 2/3

General multi-allocation steady-state solution
◮ Optimal throughput ✓
◮ Polynomial computation time in almost all cases ✓
◮ Very long periods ✗
◮ Huge response time ✗
◮ Complex allocation schemes ✗
◮ Never fully implemented ✗

15/64
A short comparison of the three methods 3/3

Replication with Round-Robin distribution
◮ Natural extension to a mono-allocation solution ✓
◮ No buffer required to keep the initial order of data sets ✓
◮ Fully implemented solution (DataCutter) ✓
◮ Simple control, with a closed form to determine processors ✓
◮ Less efficient schedules (stronger constraints) ✗
◮ May not fully exploit each resource ✗
◮ Hard to determine throughput ✗✗

16/64
Communication model

◮ Trade-off between realism and tractability
◮ Many different models
◮ One-Port model:
  ◮ Strict One-Port: a processor can either compute or perform a single communication
  ◮ Full-Duplex One-Port: a processor can either compute or simultaneously send and receive data
  ◮ One-Port with overlap of computation by communications
◮ Bounded Multiport model: several concurrent communications, respecting resource bandwidths
◮ Linear cost model: communication time proportional to data size

17/64
Communication model

Which model to choose? It depends on:
◮ Computing resources (single processors vs. multi-core processors or dedicated co-processors, ...)
◮ Network (homogeneous network vs. a server with large bandwidth connected to many light clients)
◮ Applications and software resources (blocking communications vs. multithreaded libraries)
◮ Previously studied models
◮ Algorithmic complexity! (non-trivial results with simpler models vs. realism with complex models)

18/64
Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

19/64
Mono-allocation steady-state scheduling

Main idea: all instances of a task are processed by the same resource

◮ Less efficient schedules (stronger constraints) ✗
◮ A single allocation ✓
◮ Bounded Multiport model:
  ◮ limited incoming bandwidth bw_q^in
  ◮ limited outgoing bandwidth bw_q^out
  ◮ limited bandwidth per link bw_{q,r}
◮ Period τ, throughput ρ = 1/τ, where τ is the cycle-time of the critical resource

20/64
Complexity

Problem DAG-Single-Alloc: given a directed acyclic application graph, a platform graph, and a bound B, is there an allocation with throughput ρ ≥ B?

Theorem: DAG-Single-Alloc is NP-complete.

21/64
Variables and constraints due to the application

◮ α_q^k = 1 if task T_k is processed on processor P_q, and α_q^k = 0 otherwise
◮ Each task is processed exactly once: ∀ T_k, ∑_q α_q^k = 1
◮ β_{q,r}^{k,l} = 1 if file F_{k,l} is transferred along the path P_q ⇝ P_r, and β_{q,r}^{k,l} = 0 otherwise
◮ A file transfer must originate from where the file was produced: β_{q,r}^{k,l} ≤ α_q^k

22/64
Constraints on computations

◮ The processor computing a task must hold all necessary input data, i.e., it has either received or computed every required input:
  α_r^k + ∑_{P_q ⇝ P_r} β_{q,r}^{k,l} ≥ α_r^l
◮ The computing time of a processor is not larger than τ:
  ∑_{T_k} α_q^k × w_{q,k} ≤ τ

23/64
Constraints on communications

◮ The amount of data carried by the link P_q → P_r is:
  d_{q,r} = ∑_{F_{k,l}} ∑_{P_s ⇝ P_t such that P_q → P_r ∈ P_s ⇝ P_t} β_{s,t}^{k,l} × data_{k,l}
◮ The link bandwidth must not be exceeded: d_{q,r} ≤ τ bw_{q,r}
◮ The output bandwidth of a processor P_q must not be exceeded:
  ∑_{P_q → P_r ∈ E_P} d_{q,r} ≤ τ bw_q^out
◮ The input bandwidth of a processor P_r must not be exceeded:
  ∑_{P_q → P_r ∈ E_P} d_{q,r} ≤ τ bw_r^in

24/64
Objective

Minimize the maximum time τ spent by all resources.

Theorem: an optimal solution of the above linear program describes an allocation with maximal throughput.

Summary:
◮ Solutions based on mixed-linear programs
◮ NP-complete problem ✗
◮ Need for clever heuristics

25/64
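Since the problem is NP-complete, solving the mixed-linear program requires a solver such as Cplex; on a toy instance, the assignment and computation-time constraints alone can be checked by exhaustive search over the α variables (a sketch: the instance below is invented and file transfers are ignored for brevity):

```python
# Exhaustive search over all single allocations, mirroring the constraints
# "each task is processed exactly once" and "computing time <= tau".
# Task weights w are hypothetical; communications are ignored here.
from itertools import product

tasks = ["T1", "T2", "T3"]
procs = ["P1", "P2"]
w = {("T1", "P1"): 2, ("T1", "P2"): 3,
     ("T2", "P1"): 4, ("T2", "P2"): 2,
     ("T3", "P1"): 3, ("T3", "P2"): 3}

def period(alloc):
    """tau for one allocation: the busiest processor's computing time."""
    load = {p: 0 for p in procs}
    for t, p in zip(tasks, alloc):
        load[p] += w[(t, p)]
    return max(load.values())

# best allocation = the one minimizing tau (maximizing throughput rho = 1/tau)
best = min(product(procs, repeat=len(tasks)), key=period)
```

Enumerating all |P|^|T| allocations is only feasible for tiny instances, which is exactly why the heuristics of the next slides are needed.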
Greedy mapping strategies

◮ Simple greedy:
  ◮ assign the “largest” task to the best processor
  ◮ continue with the second “largest” task, assigning it to the processor that decreases the throughput the least
  ◮ ...
◮ Refined greedy:
  ◮ take communication times into account when sorting tasks
  ◮ when mapping a task, select the processor such that the maximum occupation time over all resources (processors and links) is minimized

26/64
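A minimal sketch of the simple greedy, with computation times only (the refined variant would also fold communication times into the sort key and the occupation estimate); task weights and processor speeds are invented:

```python
# Simple greedy: sort tasks by decreasing weight, then assign each task to
# the processor whose resulting occupation time stays lowest.
def simple_greedy(task_weights, proc_speeds):
    load = {p: 0.0 for p in proc_speeds}
    mapping = {}
    for t, wt in sorted(task_weights.items(), key=lambda kv: -kv[1]):
        # pick the processor minimizing its occupation after adding this task
        p = min(load, key=lambda q: load[q] + wt / proc_speeds[q])
        load[p] += wt / proc_speeds[p]
        mapping[t] = p
    return mapping

mapping = simple_greedy({"T1": 4, "T2": 3, "T3": 2}, {"P1": 1.0, "P2": 1.0})
```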
Rounding the linear program

1. Solve the linear program over the rationals
2. Based on the rational solution, select an integer variable α_i^k:
   RLP-max:
   ◮ select the α_i^k with maximum value
   ◮ set this α_i^k to 1
   RLP-rand:
   ◮ select a task T_k not yet mapped
   ◮ randomly choose a processor P_i with probability α_i^k
   ◮ set this α_i^k to 1
3. Go to step 1 until all variables are set

27/64
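One RLP-max step can be sketched as follows, on an invented fractional solution (the real heuristic then re-solves the relaxed linear program before the next step):

```python
# RLP-max: among the not-yet-fixed alpha variables of the rational solution,
# pick the one with maximum value, set it to 1, and drop the other variables
# of that task before the next solve. The fractional values are made up.
def rlp_max_step(frac_alpha, fixed):
    """frac_alpha: {(task, proc): value in [0, 1]} for unfixed tasks."""
    (task, proc) = max(frac_alpha, key=frac_alpha.get)
    fixed[task] = proc  # alpha_{task,proc} is set to 1
    return {tp: v for tp, v in frac_alpha.items() if tp[0] != task}

frac = {("T1", "P1"): 0.7, ("T1", "P2"): 0.3,
        ("T2", "P1"): 0.4, ("T2", "P2"): 0.6}
fixed = {}
frac = rlp_max_step(frac, fixed)
```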
Delegating computations

◮ Start from a solution where all tasks are processed by a single processor
◮ Try to move a (connected) subset of tasks to another processor to increase the throughput
◮ Repeat this process until no more improvement is found

Several issues to overcome:
◮ Finding interesting groups of tasks to move:
  ◮ for all tasks, we test all possible immediate neighborhoods, and then try to grow the group along chains
◮ Hard to find a good evaluation metric: some moves do not directly increase the throughput, but are still interesting:
  ◮ for a given mapping, we sort all resource occupation times in lexicographical order, and use the ordered list instead of the throughput in comparisons

28/64
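The evaluation metric can be sketched directly: two candidate mappings are compared by their sorted lists of resource occupation times, so a move that relieves a non-bottleneck resource still counts as progress (occupation times below are invented):

```python
# Lexicographic comparison of resource occupation times: sort all occupation
# times in non-increasing order; a lexicographically smaller list is a
# better mapping even when the bottleneck (first entry) is unchanged.
def signature(occupation_times):
    return sorted(occupation_times, reverse=True)

before = signature([10, 7, 7, 3])   # occupation times of an initial mapping
after = signature([10, 7, 5, 4])    # after moving a group of tasks
improved = after < before           # better, despite an equal bottleneck
```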
Neighborhood-centric strategy

◮ First, evaluate the cost of each task and its immediate neighbors on an idle platform
  ◮ cost of a task: maximum occupation time over all resources
◮ Consider each task T_k in order of non-increasing cost:
  ◮ evaluate the mapping of T_k and its neighbors on each processor
  ◮ definitively assign T_k to the best processor
◮ Same evaluation metric as Delegate

29/64
Performance evaluation – methodology

◮ Reference heuristics: HEFT, Data-Parallel, Clustering
◮ LP and MLP solved with Cplex 11
◮ Simulations done using SimGrid
◮ Platforms: actual grids, from the SimGrid repository (only a subset of processors is available for computation)
◮ Applications: random task graphs + one real application
  ◮ “small problems”: 8–12 tasks
  ◮ “large problems”: up to 47 tasks (MLP not used)
◮ For each application, we compute a CCR = communications / computations
◮ We try to cover a large CCR range

30/64
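The CCR of an instance is simply the ratio of the total communication amount to the total computation amount; a tiny sketch with invented file sizes and task weights:

```python
# Communication-to-computation ratio (CCR) of an application: total size of
# the transferred files over total task work. Numbers below are made up.
def ccr(file_sizes, task_weights):
    return sum(file_sizes) / sum(task_weights)

small_comm = ccr(file_sizes=[1, 1], task_weights=[50, 50, 100])
large_comm = ccr(file_sizes=[500, 500], task_weights=[50, 50])
```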
Performance evaluation – results on small problems

[Plots: normalized throughput vs. log(CCR), from small communications (CCR = 0.01) to large communications (CCR = 10): (1) Delegate against the upper bound and the optimal; (2) Data-Parallel, HEFT and Delegate against the optimal; (3) Clustering, Simple Greedy and Refined Greedy against Delegate and the optimal; (4) Neighborhood, RLP-max and RLP-rand against Delegate and the optimal.]

31/64
Summary

◮ Mono-allocation strategies are close to multi-allocation ones
◮ They outperform HEFT in most cases
◮ The optimal MLP solution is restricted to small problems
◮ Efficient heuristics handle larger problems

32/64
Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

33/64
A short introduction to the Cell

◮ Joint work of Sony, Toshiba and IBM
◮ Non-standard architecture

34/64
A theoretical vision of the Cell

[Figure: the PPE (P0), the eight SPEs (P1 to P8) and the memory, all connected through the EIB.]

◮ One Power core (PPE) P0: standard processor, direct access to memory and L1/L2 cache
◮ Eight Synergistic Processing Elements (SPEs) P1, ..., P8: 256-kB Local Stores, dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB): 200 GB/s → no contention
◮ Bidirectional communication link between an element and the EIB: bandwidth bw = 25 GB/s
◮ Limited DMA stack:
  ◮ at most 16 simultaneous incoming communications for each SPE
  ◮ at most 8 simultaneous communications between an SPE and the PPE

35/64
Application model

◮ Previously-used task graph model: G_A = (V_A, E_A)
◮ Many data sets
◮ Some enhancements:
  ◮ read_k: data to read before executing T_k
  ◮ write_k: data to write after the execution of T_k
  ◮ peek_k: number of next data sets to receive before executing T_k
◮ Two computation times for T_k: w_PPE(T_k) and w_SPE(T_k)

36/64
Preprocessing of the schedule

◮ Objective: compute minimal buffer sizes
◮ min_period_l = max_{m ∈ prec(l)} (min_period_m) + peek_l + 2
◮ buffer_{i,l} = min_period_l − min_period_i

[Figure: example DAG where T_i (peek 0, min_period 1) feeds T_j (peek 1, min_period 4), T_k (peek 3, min_period 6) and T_l; T_j and T_k also feed T_l (peek 2, min_period 10). Resulting buffer sizes: buffer_{i,j} = 3, buffer_{i,k} = 5, buffer_{i,l} = 9, buffer_{j,l} = 6, buffer_{k,l} = 4.]

37/64
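The two formulas can be checked on the example DAG of this slide (a sketch; task names and values reproduced from the figure):

```python
# Preprocessing sketch: min_period_l = max over predecessors m of
# min_period_m, plus peek_l + 2; buffer_{i,l} = min_period_l - min_period_i.
def min_periods(preds, peek):
    mp = {}
    def visit(t):
        if t not in mp:
            # default -1 makes a source task with peek 0 start at period 1,
            # matching the example in the slides
            mp[t] = max((visit(m) for m in preds[t]), default=-1) + peek[t] + 2
        return mp[t]
    for t in preds:
        visit(t)
    return mp

# Example DAG: T_i feeds T_j, T_k and T_l; T_j and T_k also feed T_l.
preds = {"Ti": [], "Tj": ["Ti"], "Tk": ["Ti"], "Tl": ["Ti", "Tj", "Tk"]}
peek = {"Ti": 0, "Tj": 1, "Tk": 3, "Tl": 2}
mp = min_periods(preds, peek)
buffers = {(i, l): mp[l] - mp[i] for l in preds for i in preds[l]}
```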
Objective

◮ Maximize throughput
◮ Obtain a periodic schedule
◮ Use a single allocation: code size is critical
◮ Simplification: all communications within a period are simultaneous

38/64
Constraints 1/2

On the application structure:
◮ Each task is mapped on a processor: ∀ T_k, ∑_i α_i^k = 1
◮ Given a dependence T_k → T_l, the processor computing T_l must receive the corresponding file: ∀ (k,l) ∈ E, ∀ P_j, ∑_i β_{i,j}^{k,l} ≥ α_j^l
◮ Given a dependence T_k → T_l, only the processor computing T_k can send the corresponding file: ∀ (k,l) ∈ E, ∀ P_i, ∑_j β_{i,j}^{k,l} ≤ α_i^k

39/64
Constraints 2/2

◮ On a given processor, all tasks must be completed within τ:
  ∀ P_i, ∑_k α_i^k × t_i(k) ≤ τ
◮ All incoming communications must be completed within τ:
  ∀ P_j, (1/bw) × ( ∑_k α_j^k × read_k + ∑_{k,l} ∑_i β_{i,j}^{k,l} × data_{k,l} ) ≤ τ
◮ All outgoing communications must be completed within τ:
  ∀ P_i, (1/bw) × ( ∑_k α_i^k × write_k + ∑_{k,l} ∑_j β_{i,j}^{k,l} × data_{k,l} ) ≤ τ

+ constraints on the number of incoming/outgoing communications to respect DMA requirements
+ constraints on the available memory on each SPE

40/64
Optimal mapping

◮ Constraints form a linear program
◮ Binary variables: exponential solving time ✗
◮ Can we do better?
  ◮ NP-complete problem (reduction from 2-Partition) ✗
  ◮ Reasonable running times in practice (small number of cores) ✓

41/64
Experiments

Hardware:
◮ Sony PlayStation 3
◮ Single Cell processor
◮ Only 6 available SPEs
◮ 256-MB memory

Software:
◮ New dedicated scheduling framework
◮ Requires a mono-allocation schedule
◮ Vocoder application (141 tasks) and random graphs
◮ Linear programs solved by Cplex (using a 0.05-approximation)
◮ Greedy memory-aware heuristic as reference

42/64
The Vocoder application

[Figure: StreamIt graph of the Vocoder application. A stateful StepSource feeds, through IntToFloat and a stateful Delay, a duplicate splitter over 15 stateful DFTFilter branches (each peeking 28 ahead); the branches join into RectangularToPolar and split again into per-branch PhaseUnwrapper, FirstDifference, ConstMultiplier and Accumulator stages, with FIRSmoothingFilter and Deconvolve on the magnitude side; the results flow through LinearInterpolator, Decimator, Multiplier, PolarToRectangular, Adder, Subtractor, InvDelay (peeking 13 ahead), FloatToShort and finally a FileWriter, with weighted round-robin splitters and joiners in between.]

43/64
Throughput variation according to the number of SPEs

[Plot: throughput (data sets/sec, over 10,000 data sets) vs. number of SPEs (0 to 6), ranging from about 100 to 450 data sets/sec, comparing the theoretical and experimental throughput of the LP solution and of the greedy heuristic.]

44/64
Time to reach steady-state

[Plot: experimental vs. theoretical throughput (data sets/sec) as a function of the number of data sets, from 0 to 40,000; the experimental throughput approaches the theoretical one as the number of data sets grows.]

45/64
Summary

◮ Heterogeneity is difficult to handle
◮ Innovative processor, but with strong hardware constraints
◮ Optimal solution to the steady-state mono-allocation scheduling problem
◮ New framework dedicated to mono-allocation schedules
◮ Outperforms the greedy memory-aware heuristic

46/64
Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

47/64
Application and platform

◮ A linear workflow with many data sets: T1 → T2 → T3 → T4 with files F1, F2, F3
◮ Fully connected platform
◮ Heterogeneous processors and communication links
◮ Mapping is given
◮ Objective: determine the throughput

Communication models:
◮ Strict One-Port
◮ Overlap One-Port

48/64
Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of T_i: m_i
◮ Round-Robin distribution of each task

[Figure: T1 → T2 → T3 → T4 with files F1, F2, F3, mapped on processors P1 to P7 with replication counts m1 = 1, m2 = 2, m3 = 3, m4 = 1.]

49/64
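Ignoring communications, round-robin replication already yields a simple computation bound: each of the m_i replicas of T_i receives only every m_i-th data set, so it may spend up to m_i periods on one instance, and the slowest replica of a stage limits the stage. A sketch with hypothetical per-data-set computation times (the one-port models constrain the real throughput further):

```python
# Upper bound on throughput with round-robin replication, communications
# ignored: replica r of T_i must satisfy t_r <= m_i * tau, hence
# rho = 1/tau <= m_i / max_r t_r for every stage. Times are hypothetical.
def throughput_bound(replica_times):
    """replica_times[i]: per-data-set times of the replicas of stage i."""
    return min(len(times) / max(times) for times in replica_times)

# T1..T4 with replication counts 1, 2, 3, 1 (as in the figure)
rho = throughput_bound([[1.0], [2.0, 3.0], [6.0, 5.0, 6.0], [1.0]])
```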