Scheduling on Multi-Cores with GPU

Safia Kedad-Sidhoum (1), Florence Monna (1), Grégory Mounié (2), Denis Trystram (2,3)

(1) Laboratoire d'Informatique de Paris 6, 4 Place Jussieu, 75005 Paris, France
(2) Grenoble Institute of Technology, 51 avenue Kuntzmann, 38330 Montbonnot Saint Martin, France
(3) Institut Universitaire de France

August 26, 2013
Scheduling with GPU

Most computers today include a multi-core CPU and high-performance parallel accelerators: GPGPUs (General Purpose Graphical Processing Units).

Examples:
- Laptop/Tablet/Smartphone (Intel Core i7, Nvidia Tegra 4)
- Game console (PS4, Xbox One)
- Titan (at the top of the Top500 list of supercomputers)

Each of these machines contains vectorial coprocessors with very high computing throughput, an interesting asset for High Performance Computing (HPC).
GPU programming example

Vector addition, element by element. Compute Y = alpha + X, Y and X being two vectors of 1024 floats.

    prog = create_program([<<EOF
    __kernel void addition(float alpha,
                           __global const float *x,
                           __global float *y) {
        size_t ig = get_global_id(0);
        y[ig] = alpha + x[ig];
    }
    EOF
    ])
    create_kernel("addition", prog)

    input  = OpenCL::VArray::new(FLOAT, 1024)
    output = OpenCL::VArray::new(FLOAT, 1024)
    input_gpu  = create_buffer(1024 * 4)
    output_gpu = create_buffer(1024 * 4)
GPU programming example

Sequence of commands:
- line 1: copy the input buffer from CPU memory to GPU memory
- lines 2-4: launch the kernel with its arguments, over a global range of 1024 floats split into work-groups of size 64
- line 5: copy the output buffer from GPU memory back to CPU memory

    1 enqueue_write_buffer(1024 * 4, input, input_gpu)
    2 args = set_args([OpenCL::Float::new(5.0),
    3                  input_gpu, output_gpu])
    4 enqueue_NDrange_kernel(prog, args, [1024], [64])
    5 enqueue_read_buffer(1024 * 4, output_gpu, output)
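For readers more familiar with Python, a minimal sketch of the same example using the pyopencl bindings (host code of my own, not the one from the talk; variable names mirror the listing above):

    import numpy as np
    import pyopencl as cl

    src = """
    __kernel void addition(float alpha,
                           __global const float *x,
                           __global float *y) {
        size_t ig = get_global_id(0);
        y[ig] = alpha + x[ig];
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, src).build()

    x = np.random.rand(1024).astype(np.float32)
    y = np.empty_like(x)

    mf = cl.mem_flags
    x_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)  # line 1: CPU -> GPU copy
    y_gpu = cl.Buffer(ctx, mf.WRITE_ONLY, y.nbytes)

    # lines 2-4: launch over a global range of 1024, work-groups of 64
    prog.addition(queue, (1024,), (64,), np.float32(5.0), x_gpu, y_gpu)
    cl.enqueue_copy(queue, y, y_gpu)                                    # line 5: GPU -> CPU copy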
Contribution

The tasks assigned to the GPUs must be carefully chosen. We propose a generic method to compute this assignment for High Performance Computing systems. Since there is no previous model, we start with a simplified problem, without communication costs or precedence relations.
Description of the Problem - Complexity

(Pm, Pk) || C_max: n independent sequential tasks T_1, ..., T_n, to be scheduled on m identical CPUs and k identical GPGPUs. Each task T_j has processing time p_j on a CPU and p̄_j on a GPU; C_max^CPU and C_max^GPU denote the makespans on the CPU side and the GPU side.

Objective: minimize the makespan C_max of the schedule.

If p̄_j = p_j for all tasks, (Pm, P1) || C_max ⇔ P || C_max, which is NP-hard ⟹ the problem of scheduling with GPUs is also NP-hard.
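To fix the notation, a small illustrative sketch (Python, made-up numbers) of an instance and of the makespan of a given assignment:

    # p[j]: processing time of T_j on a CPU; pbar[j]: on a GPU.
    p    = [4.0, 3.0, 2.0, 2.0, 1.0]
    pbar = [1.0, 1.5, 2.0, 0.5, 1.0]
    m, k = 2, 1                     # m identical CPUs, k identical GPUs

    def makespan(assign):
        """assign[j] = ('CPU', i) or ('GPU', i).  Tasks are independent,
        so the makespan is the largest total load over all processors."""
        load = {}
        for j, proc in enumerate(assign):
            t = p[j] if proc[0] == 'CPU' else pbar[j]
            load[proc] = load.get(proc, 0.0) + t
        return max(load.values())

    print(makespan([('GPU', 0), ('GPU', 0), ('CPU', 0),
                    ('CPU', 0), ('CPU', 1)]))   # -> 4.0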
List based scheduling

Lemma: For (P1, P1) || C_max, a list scheduling algorithm has a ratio larger than the maximum speedup ratio of a task.

[Figure: Gantt charts of the two-task instance T1, T2 on the CPU and on the GPU, with time marks 0, 1 and x.]
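A sketch illustrating the lemma, under one plausible reading of the figure: both tasks take time 1 on the CPU and 1/x on the GPU (speedup x), and the list scheduler greedily assigns each task to the first available processor, whatever its speed there. The ratio then grows linearly with the speedup:

    def list_schedule(p, pbar, order):
        """Graham-style list scheduling on 1 CPU + 1 GPU: each task is
        assigned to the processor that becomes available first."""
        cpu = gpu = 0.0
        for j in order:
            if cpu <= gpu:
                cpu += p[j]
            else:
                gpu += pbar[j]
        return max(cpu, gpu)

    x = 10.0                               # GPU speedup of each task
    p, pbar = [1.0, 1.0], [1.0 / x, 1.0 / x]
    lst = list_schedule(p, pbar, [0, 1])   # T1 -> CPU, T2 -> GPU: makespan 1.0
    opt = 2.0 / x                          # optimum: both tasks on the GPU
    print(lst / opt)                       # -> x/2 = 5.0, grows with x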
Dual approximation technique

Use of the dual approximation technique [Hochbaum & Shmoys, 1988]: for a ratio g, take a guess λ; the algorithm either delivers a schedule of makespan at most gλ, or answers that there exists no schedule of length at most λ.

At each step of the dual approximation, a dynamic programming algorithm is used.

Case k = 1: performance ratio g = 4/3, in time O(n²m²).
Case k ≥ 2: performance ratio g = 4/3 + 1/(3k), in time O(n²m²k³).
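A brute-force sketch of the contract behind one dual-approximation step (the real oracle is the dynamic program of the next slides; this exhaustive stand-in works on tiny instances only, for k = 1):

    from itertools import product

    def feasibility_oracle(p, pbar, m, lam):
        """One dual-approximation step for k = 1: return an assignment of
        makespan <= lam (in particular <= (4/3)*lam), or None to certify
        that no schedule of makespan <= lam exists.  Exhaustive stand-in
        for the paper's dynamic program, so tiny instances only."""
        n = len(p)
        for choice in product(range(m + 1), repeat=n):  # i < m: CPU i; i == m: the GPU
            loads = [0.0] * (m + 1)
            for j, i in enumerate(choice):
                loads[i] += pbar[j] if i == m else p[j]
            if max(loads) <= lam:
                return choice
        return None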
The Shelves' Idea

For k = 1, assume a schedule of length at most λ exists. The idea: partition the set of tasks on the CPUs into two sets, each consisting of two shelves:
- a first set with one shelf of length λ and another of length λ/3,
- a second set with two shelves of length 2λ/3.
The Shelves' Idea

The partition ensures that the makespan on the GPU is lower than 4λ/3. Since the tasks are independent, the scheduling is straightforward once the assignment of the tasks has been determined. The main problem is to assign the tasks of each shelf to the CPUs or to the GPU in order to obtain a feasible solution.
Structure of an Optimal Schedule for k = 1

Suppose there exists a schedule of length at most λ. [Figure: the two-shelf structure on the CPUs, with shelves of lengths λ, λ/3, 2λ/3 and 2λ/3.]

Property (1): For each task T_j, p_j ≤ λ, and ∑_{π(j)∈C} p_j ≤ mλ.

Property (2): Let T_i, T_j be two successive tasks on a CPU. If p_i > 2λ/3, then p_j ≤ λ/3.

Property (3): Two tasks T_i, T_j with λ/3 < p_l ≤ 2λ/3 (l = i, j) can be executed successively on the same CPU within a time 4λ/3.

The remaining tasks (those with a processing time of at most λ/3) fit in the remaining space in front of S_1 and between all the other shelves; otherwise the schedule would not satisfy Property (1).
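A small sketch of the packing argument behind Properties (2) and (3), assuming every CPU task satisfies p_j ≤ λ as in Property (1); small tasks (≤ λ/3) are left out, since they only fill the remaining gaps:

    def pack_cpu_shelves(p_cpu, m, lam):
        """Pack the CPU-assigned tasks into the shelf structure and check
        the 4*lam/3 bound.  Assumes p <= lam for every task (Property (1))."""
        big    = [t for t in p_cpu if t > 2 * lam / 3]             # one per CPU
        medium = [t for t in p_cpu if lam / 3 < t <= 2 * lam / 3]  # two per CPU
        slots = len(big) + (len(medium) + 1) // 2
        if slots > m:
            return None                     # does not fit on m CPUs
        loads = [0.0] * m
        for i, t in enumerate(big):
            loads[i] = t                    # Property (2): only small tasks may follow
        for i in range(0, len(medium), 2):
            loads[len(big) + i // 2] = sum(medium[i:i + 2])  # Property (3): <= 4*lam/3
        assert all(l <= 4 * lam / 3 for l in loads)
        return loads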
Partitioning the Tasks into Shelves

We solve the assignment problem with a dynamic program that captures the previous constraints. Here, we take g = 4/3.

For task T_j, a binary variable: x_j = 1 if T_j is assigned to a CPU, x_j = 0 if it is assigned to the GPU.

W_C* = min ∑_{j=1}^{n} p_j x_j   (1)

s.t.

∑_{p_j > 2λ/3} x_j + (1/2) ∑_{2λ/3 ≥ p_j > λ/3} x_j ≤ m   (2)

∑_{j=1}^{n} p̄_j (1 − x_j) ≤ 4λ/3   (3)

x_j ∈ {0, 1}   (4)
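A brute-force sketch of the program (1)-(4) on tiny instances (the actual algorithm solves it by dynamic programming; `pbar` stands for p̄):

    from itertools import product

    def shelf_assignment(p, pbar, m, lam):
        """Enumerate x in {0,1}^n and keep the smallest CPU workload W_C
        subject to constraints (2)-(4).  Exponential; illustration only."""
        n = len(p)
        best_x, best_w = None, float('inf')
        for x in product([0, 1], repeat=n):
            big    = sum(x[j] for j in range(n) if p[j] > 2 * lam / 3)
            medium = sum(x[j] for j in range(n)
                         if lam / 3 < p[j] <= 2 * lam / 3)
            gpu_load = sum(pbar[j] for j in range(n) if x[j] == 0)
            if big + medium / 2 <= m and gpu_load <= 4 * lam / 3:  # (2) and (3)
                w = sum(p[j] * x[j] for j in range(n))             # objective (1)
                if w < best_w:
                    best_x, best_w = x, w
        return best_x, best_w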
Partitioning the Tasks into Shelves

A dynamic programming algorithm solves the previous problem in O(n²m²). To reduce the number of states on the GPU side:
- Split the time on the GPU into intervals of length λ/(3n); for a task T_j executed on the GPU, ν_j = ⌊ p̄_j / (λ/(3n)) ⌋.
- N = ∑_{π(j)∈G} ν_j is the total number of these intervals on the GPU.
- The error on the processing time of each task is ε_j = p̄_j − ν_j · λ/(3n) ≤ λ/(3n).
- If all the tasks are assigned to the GPU, the total error is at most n · λ/(3n) = λ/3.
- Constraint (3) becomes N = ∑_{π(j)∈G} ν_j ≤ 3n.
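A short sketch of the discretization, assuming the rounding-down suggested by the error bound:

    import math

    def discretize_gpu(pbar_gpu, lam, n):
        """Round each GPU processing time down to a multiple of lam/(3n).
        The per-task error is below lam/(3n), hence at most lam/3 overall,
        and the DP only needs the single integer N <= 3n as its GPU state."""
        unit = lam / (3 * n)
        nu = [math.floor(t / unit) for t in pbar_gpu]
        N = sum(nu)
        errors = [t - v * unit for t, v in zip(pbar_gpu, nu)]
        assert all(0 <= e < unit for e in errors)
        return nu, N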
Binary Search - Cost Analysis

If the optimum W_C* = min_{0 ≤ µ ≤ m, 0 ≤ µ′ ≤ 2(m−µ), 0 ≤ N ≤ 3n} W_C(n, µ, µ′, N) > mλ, then no solution with makespan ≤ λ exists and the algorithm answers "NO".

Otherwise, we construct a feasible solution with makespan ≤ 4λ/3, with shelves on the CPUs given by the optimal values µ*, µ′* and N*.

This is one step of the dual-approximation algorithm, for a fixed guess. The binary search takes log(B_max − B_min) steps. At each step, 1 ≤ j ≤ n, 1 ≤ µ ≤ m, 1 ≤ µ′ ≤ 2(m−µ), and 0 ≤ N ≤ 3n, so the time complexity of each step is O(n²m²).
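A sketch of the surrounding binary search, reusing feasibility_oracle from above and assuming bounds B_min ≤ OPT ≤ B_max (e.g. the average-load lower bound and a greedy makespan):

    def dual_approx(p, pbar, m, b_min, b_max, eps=1e-3):
        """Binary search on the guess lam.  Needs b_max feasible and
        b_min <= OPT.  With the brute-force oracle the result is within
        eps of OPT; with the paper's dynamic program the same loop yields
        the 4/3 guarantee in log(B_max - B_min) steps."""
        best = feasibility_oracle(p, pbar, m, b_max)
        while b_max - b_min > eps:
            lam = (b_min + b_max) / 2.0
            sched = feasibility_oracle(p, pbar, m, lam)
            if sched is None:
                b_min = lam                 # certified: OPT > lam
            else:
                best, b_max = sched, lam    # keep the improved schedule
        return best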
Extension

The algorithm can be extended to (Pm, Pk) || C_max with k ≥ 2, with a symmetric shelf constraint on the GPU side (p̄_j is the processing time of T_j on a GPU):

W_C* = min ∑_{j=1}^{n} p_j x_j   (5)

s.t.

∑_{p_j > 2λ/3} x_j + (1/2) ∑_{2λ/3 ≥ p_j > λ/3} x_j ≤ m   (6)

∑_{p̄_j > 2λ/3} (1 − x_j) + (1/2) ∑_{2λ/3 ≥ p̄_j > λ/3} (1 − x_j) ≤ k   (7)

N = ∑_{π(j)∈G} ν_j ≤ 3kn   (8)

x_j ∈ {0, 1}   (9)
Extension

The approximation algorithm can be extended to the problem with k ≥ 2 GPUs, with a performance guarantee of 4/3 + 1/(3k).

To solve each step of the binary search, O(n²k³m²) states are considered, since 1 ≤ j ≤ n, 1 ≤ µ ≤ m, 1 ≤ µ′ ≤ 2(m−µ), 1 ≤ κ ≤ k, 1 ≤ κ′ ≤ 2(k−κ), and 0 ≤ N ≤ 3kn.

⟹ Time complexity in O(n²k³m²) for each step of the binary search.
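As a check, multiplying the ranges of the state variables gives n · m · 2m · k · 2k · 3kn = 12 n²m²k³, i.e. O(n²m²k³) states per step of the binary search.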