Scheduling on Multi-Cores with GPU

Safia Kedad-Sidhoum (1), Florence Monna (1), Grégory Mounié (2), Denis Trystram (2,3)

(1) Laboratoire d'Informatique de Paris 6, 4 Place Jussieu, 75005 Paris, France
(2) Grenoble Institute of Technology, 51 avenue Kuntzmann, 38330 Montbonnot Saint Martin, France
(3) Institut Universitaire de France

August 26, 2013
Scheduling with GPU

Most computers today include a multi-core CPU and high-performance parallel accelerators: GPGPUs (General Purpose Graphical Processing Units).

Examples:
- Laptop/Tablet/Smartphone (Intel Core i7, Nvidia Tegra 4)
- Game console (PS4, Xbox One)
- Titan (at the top of the Top500 list of supercomputers)

Each of these machines contains vectorial coprocessors with very high computing throughput, an interesting asset for High Performance Computing (HPC).
GPU programming example

Vector addition, element by element. Compute Y = alpha + X, Y and X being two vectors of 1024 floats.

    prog = create_program([<<EOF
    __kernel void addition(float alpha,
                           __global const float *x,
                           __global float *y) {
        size_t ig = get_global_id(0);
        y[ig] = alpha + x[ig];
    }
    EOF
    ])
    create_kernel("addition", prog)

    input  = OpenCL::VArray::new(FLOAT, 1024)
    output = OpenCL::VArray::new(FLOAT, 1024)
    input_gpu  = create_buffer(1024 * 4)
    output_gpu = create_buffer(1024 * 4)
GPU programming example

Sequence of commands:
- line 1: copy the input buffer from CPU memory to GPU memory
- lines 2-4: launch the kernel with its arguments, over a global range of 1024 floats split into work-groups of size 64
- line 5: copy the output buffer from GPU memory back to CPU memory

    1 enqueue_write_buffer(1024 * 4, input, input_gpu)
    2 args = set_args([OpenCL::Float::new(5.0),
    3                  input_gpu, output_gpu])
    4 enqueue_NDrange_kernel(prog, args, [1024], [64])
    5 enqueue_read_buffer(1024 * 4, output_gpu, output)
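For readers more familiar with Python, a minimal sketch of the same example using the pyopencl bindings (host code of my own, not the one from the talk; variable names mirror the listing above):

    import numpy as np
    import pyopencl as cl

    src = """
    __kernel void addition(float alpha,
                           __global const float *x,
                           __global float *y) {
        size_t ig = get_global_id(0);
        y[ig] = alpha + x[ig];
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, src).build()

    x = np.random.rand(1024).astype(np.float32)
    y = np.empty_like(x)

    mf = cl.mem_flags
    x_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)  # line 1: CPU -> GPU copy
    y_gpu = cl.Buffer(ctx, mf.WRITE_ONLY, y.nbytes)

    # lines 2-4: launch over a global range of 1024, work-groups of 64
    prog.addition(queue, (1024,), (64,), np.float32(5.0), x_gpu, y_gpu)
    cl.enqueue_copy(queue, y, y_gpu)                                    # line 5: GPU -> CPU copy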
Contribution

The tasks assigned to the GPUs must be carefully chosen. We propose a generic method to compute this assignment for High Performance Computing systems. Since there is no previous model, we start with a simplified problem, without communication costs or precedence relations.
Description of the Problem - Complexity

(Pm, Pk) || C_max: n independent sequential tasks T_1, ..., T_n, to be scheduled on m identical CPUs and k identical GPGPUs. Each task T_j has processing time p_j on a CPU and p̄_j on a GPU; C_max^CPU and C_max^GPU denote the makespans on the CPU side and the GPU side.

Objective: minimize the makespan C_max of the schedule.

If p̄_j = p_j for all tasks, (Pm, P1) || C_max ⇔ P || C_max, which is NP-hard ⟹ the problem of scheduling with GPUs is also NP-hard.
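To fix the notation, a small illustrative sketch (Python, made-up numbers) of an instance and of the makespan of a given assignment:

    # p[j]: processing time of T_j on a CPU; pbar[j]: on a GPU.
    p    = [4.0, 3.0, 2.0, 2.0, 1.0]
    pbar = [1.0, 1.5, 2.0, 0.5, 1.0]
    m, k = 2, 1                     # m identical CPUs, k identical GPUs

    def makespan(assign):
        """assign[j] = ('CPU', i) or ('GPU', i).  Tasks are independent,
        so the makespan is the largest total load over all processors."""
        load = {}
        for j, proc in enumerate(assign):
            t = p[j] if proc[0] == 'CPU' else pbar[j]
            load[proc] = load.get(proc, 0.0) + t
        return max(load.values())

    print(makespan([('GPU', 0), ('GPU', 0), ('CPU', 0),
                    ('CPU', 0), ('CPU', 1)]))   # -> 4.0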
List based scheduling

Lemma: For (P1, P1) || C_max, a list scheduling algorithm has a ratio larger than the maximum speedup ratio of a task.

[Figure: Gantt charts of the two-task instance T1, T2 on the CPU and on the GPU, with time marks 0, 1 and x.]
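A sketch illustrating the lemma, under one plausible reading of the figure: both tasks take time 1 on the CPU and 1/x on the GPU (speedup x), and the list scheduler greedily assigns each task to the first available processor, whatever its speed there. The ratio then grows linearly with the speedup:

    def list_schedule(p, pbar, order):
        """Graham-style list scheduling on 1 CPU + 1 GPU: each task is
        assigned to the processor that becomes available first."""
        cpu = gpu = 0.0
        for j in order:
            if cpu <= gpu:
                cpu += p[j]
            else:
                gpu += pbar[j]
        return max(cpu, gpu)

    x = 10.0                               # GPU speedup of each task
    p, pbar = [1.0, 1.0], [1.0 / x, 1.0 / x]
    lst = list_schedule(p, pbar, [0, 1])   # T1 -> CPU, T2 -> GPU: makespan 1.0
    opt = 2.0 / x                          # optimum: both tasks on the GPU
    print(lst / opt)                       # -> x/2 = 5.0, grows with x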
Dual approximation technique

Use of the dual approximation technique [Hochbaum & Shmoys, 1988]: for a ratio g, take a guess λ; the algorithm either delivers a schedule of makespan at most gλ, or answers that there exists no schedule of length at most λ.

At each step of the dual approximation, a dynamic programming algorithm is used.

Case k = 1: performance ratio g = 4/3, in time O(n²m²).
Case k ≥ 2: performance ratio g = 4/3 + 1/(3k), in time O(n²m²k³).
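A brute-force sketch of the contract behind one dual-approximation step (the real oracle is the dynamic program of the next slides; this exhaustive stand-in works on tiny instances only, for k = 1):

    from itertools import product

    def feasibility_oracle(p, pbar, m, lam):
        """One dual-approximation step for k = 1: return an assignment of
        makespan <= lam (in particular <= (4/3)*lam), or None to certify
        that no schedule of makespan <= lam exists.  Exhaustive stand-in
        for the paper's dynamic program, so tiny instances only."""
        n = len(p)
        for choice in product(range(m + 1), repeat=n):  # i < m: CPU i; i == m: the GPU
            loads = [0.0] * (m + 1)
            for j, i in enumerate(choice):
                loads[i] += pbar[j] if i == m else p[j]
            if max(loads) <= lam:
                return choice
        return None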
The Shelves' Idea

For k = 1, assume a schedule of length at most λ exists. The idea: partition the set of tasks on the CPUs into two sets, each consisting of two shelves:
- a first set with one shelf of length λ and another of length λ/3,
- a second set with two shelves of length 2λ/3.
The Shelves' Idea

The partition ensures that the makespan on the GPU is lower than 4λ/3. Since the tasks are independent, the scheduling is straightforward once the assignment of the tasks has been determined. The main problem is to assign the tasks of each shelf to the CPUs or to the GPU in order to obtain a feasible solution.
Structure of an Optimal Schedule for k = 1

Suppose there exists a schedule of length at most λ. [Figure: the two-shelf structure on the CPUs, with shelves of lengths λ, λ/3, 2λ/3 and 2λ/3.]

Property (1): For each task T_j, p_j ≤ λ, and ∑_{π(j)∈C} p_j ≤ mλ.

Property (2): Let T_i, T_j be two successive tasks on a CPU. If p_i > 2λ/3, then p_j ≤ λ/3.

Property (3): Two tasks T_i, T_j with λ/3 < p_l ≤ 2λ/3 (l = i, j) can be executed successively on the same CPU within a time 4λ/3.

The remaining tasks (those with a processing time of at most λ/3) fit in the remaining space in front of S_1 and between all the other shelves; otherwise the schedule would not satisfy Property (1).
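A small sketch of the packing argument behind Properties (2) and (3), assuming every CPU task satisfies p_j ≤ λ as in Property (1); small tasks (≤ λ/3) are left out, since they only fill the remaining gaps:

    def pack_cpu_shelves(p_cpu, m, lam):
        """Pack the CPU-assigned tasks into the shelf structure and check
        the 4*lam/3 bound.  Assumes p <= lam for every task (Property (1))."""
        big    = [t for t in p_cpu if t > 2 * lam / 3]             # one per CPU
        medium = [t for t in p_cpu if lam / 3 < t <= 2 * lam / 3]  # two per CPU
        slots = len(big) + (len(medium) + 1) // 2
        if slots > m:
            return None                     # does not fit on m CPUs
        loads = [0.0] * m
        for i, t in enumerate(big):
            loads[i] = t                    # Property (2): only small tasks may follow
        for i in range(0, len(medium), 2):
            loads[len(big) + i // 2] = sum(medium[i:i + 2])  # Property (3): <= 4*lam/3
        assert all(l <= 4 * lam / 3 for l in loads)
        return loads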
Partitioning the Tasks into Shelves

We solve the assignment problem with a dynamic program that captures the previous constraints. Here, we take g = 4/3.

For task T_j, a binary variable: x_j = 1 if T_j is assigned to a CPU, x_j = 0 if it is assigned to the GPU.

W_C* = min ∑_{j=1}^{n} p_j x_j   (1)

s.t.

∑_{p_j > 2λ/3} x_j + (1/2) ∑_{2λ/3 ≥ p_j > λ/3} x_j ≤ m   (2)

∑_{j=1}^{n} p̄_j (1 − x_j) ≤ 4λ/3   (3)

x_j ∈ {0, 1}   (4)
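A brute-force sketch of the program (1)-(4) on tiny instances (the actual algorithm solves it by dynamic programming; `pbar` stands for p̄):

    from itertools import product

    def shelf_assignment(p, pbar, m, lam):
        """Enumerate x in {0,1}^n and keep the smallest CPU workload W_C
        subject to constraints (2)-(4).  Exponential; illustration only."""
        n = len(p)
        best_x, best_w = None, float('inf')
        for x in product([0, 1], repeat=n):
            big    = sum(x[j] for j in range(n) if p[j] > 2 * lam / 3)
            medium = sum(x[j] for j in range(n)
                         if lam / 3 < p[j] <= 2 * lam / 3)
            gpu_load = sum(pbar[j] for j in range(n) if x[j] == 0)
            if big + medium / 2 <= m and gpu_load <= 4 * lam / 3:  # (2) and (3)
                w = sum(p[j] * x[j] for j in range(n))             # objective (1)
                if w < best_w:
                    best_x, best_w = x, w
        return best_x, best_w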
Partitioning the Tasks into Shelves

A dynamic programming algorithm solves the previous problem in O(n²m²). To reduce the number of states on the GPU side:
- Split the time on the GPU into intervals of length λ/(3n); for a task T_j executed on the GPU, ν_j = ⌊ p̄_j / (λ/(3n)) ⌋.
- N = ∑_{π(j)∈G} ν_j is the total number of these intervals on the GPU.
- The error on the processing time of each task is ε_j = p̄_j − ν_j · λ/(3n) ≤ λ/(3n).
- If all the tasks are assigned to the GPU, the total error is at most n · λ/(3n) = λ/3.
- Constraint (3) becomes N = ∑_{π(j)∈G} ν_j ≤ 3n.
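A short sketch of the discretization, assuming the rounding-down suggested by the error bound:

    import math

    def discretize_gpu(pbar_gpu, lam, n):
        """Round each GPU processing time down to a multiple of lam/(3n).
        The per-task error is below lam/(3n), hence at most lam/3 overall,
        and the DP only needs the single integer N <= 3n as its GPU state."""
        unit = lam / (3 * n)
        nu = [math.floor(t / unit) for t in pbar_gpu]
        N = sum(nu)
        errors = [t - v * unit for t, v in zip(pbar_gpu, nu)]
        assert all(0 <= e < unit for e in errors)
        return nu, N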
Binary Search - Cost Analysis

If the optimum W_C* = min_{0 ≤ µ ≤ m, 0 ≤ µ′ ≤ 2(m−µ), 0 ≤ N ≤ 3n} W_C(n, µ, µ′, N) > mλ, then no solution with makespan ≤ λ exists and the algorithm answers "NO".

Otherwise, we construct a feasible solution with makespan ≤ 4λ/3, with shelves on the CPUs given by the optimal values µ*, µ′* and N*.

This is one step of the dual-approximation algorithm, for a fixed guess. The binary search takes log(B_max − B_min) steps. At each step, 1 ≤ j ≤ n, 1 ≤ µ ≤ m, 1 ≤ µ′ ≤ 2(m−µ), and 0 ≤ N ≤ 3n, so the time complexity of each step is O(n²m²).
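A sketch of the surrounding binary search, reusing feasibility_oracle from above and assuming bounds B_min ≤ OPT ≤ B_max (e.g. the average-load lower bound and a greedy makespan):

    def dual_approx(p, pbar, m, b_min, b_max, eps=1e-3):
        """Binary search on the guess lam.  Needs b_max feasible and
        b_min <= OPT.  With the brute-force oracle the result is within
        eps of OPT; with the paper's dynamic program the same loop yields
        the 4/3 guarantee in log(B_max - B_min) steps."""
        best = feasibility_oracle(p, pbar, m, b_max)
        while b_max - b_min > eps:
            lam = (b_min + b_max) / 2.0
            sched = feasibility_oracle(p, pbar, m, lam)
            if sched is None:
                b_min = lam                 # certified: OPT > lam
            else:
                best, b_max = sched, lam    # keep the improved schedule
        return best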
Extension

The algorithm can be extended to (Pm, Pk) || C_max with k ≥ 2, with a symmetric shelf constraint on the GPU side (p̄_j is the processing time of T_j on a GPU):

W_C* = min ∑_{j=1}^{n} p_j x_j   (5)

s.t.

∑_{p_j > 2λ/3} x_j + (1/2) ∑_{2λ/3 ≥ p_j > λ/3} x_j ≤ m   (6)

∑_{p̄_j > 2λ/3} (1 − x_j) + (1/2) ∑_{2λ/3 ≥ p̄_j > λ/3} (1 − x_j) ≤ k   (7)

N = ∑_{π(j)∈G} ν_j ≤ 3kn   (8)

x_j ∈ {0, 1}   (9)
Extension

The approximation algorithm can be extended to the problem with k ≥ 2 GPUs, with a performance guarantee of 4/3 + 1/(3k).

To solve each step of the binary search, O(n²k³m²) states are considered, since 1 ≤ j ≤ n, 1 ≤ µ ≤ m, 1 ≤ µ′ ≤ 2(m−µ), 1 ≤ κ ≤ k, 1 ≤ κ′ ≤ 2(k−κ), and 0 ≤ N ≤ 3kn.

⟹ Time complexity in O(n²k³m²) for each step of the binary search.
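As a check, multiplying the ranges of the state variables gives n · m · 2m · k · 2k · 3kn = 12 n²m²k³, i.e. O(n²m²k³) states per step of the binary search.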