Speculative Scheduling for Stochastic HPC Applications
Ana Gainaru¹, Guillaume Pallez², Hongyang Sun¹, Padma Raghavan¹
1. Vanderbilt University; 2. Inria & Univ Bordeaux
ICPP, August 2019

HPC schedulers
Reservation-based batch schedulers:
◮ Rely on (reasonably) accurate runtime estimates from the user/application.
◮ Two queues: (i) large (main) jobs; (ii) small jobs used for backfilling.
◮ Cost to users: pay for what you use.
→ The user needs to guarantee that the time asked for is sufficient.
◮ Job killed, need to resubmit: additional cost to the user, and a waste of system resources.
◮ Job completed early (?): may waste system resources (if no backfilling is possible).

Motivational examples
Sysadmin: “I want to sell all the compute slots on my platform.”
User: “I don’t want to pay if I don’t use.”
Sysadmin: “Sure, then you only pay what you use.”

User has one job J_1 whose execution time is exactly 50h.
- What does User do?
- Is Sysadmin happy?

User has one job J_2 whose execution time is between 46h and 54h.
- What does User do?
- Is Sysadmin happy?

User has one job J_3 whose execution time is between 2h and 98h.
- What does User do?
- Is Sysadmin happy?

Anecdotal?
Study of application data from Intrepid (2009 ANL system); data from the Parallel Workload Archive.

Average job size:                 880 nodes / 3089 node hours
Average small-job size:           48.6 nodes / 31 node hours
Over-estimated submissions:       82.2%
Under-estimated submissions:      17.7%
Average over-estimation space:    2132 node hours
Percentage of small jobs:         30.8%
⇒ Unused backfilling space: 2.8 hours/day

factor = (estimate − walltime) / walltime

Stochastic applications
“Second generation” of HPC applications (BigData, ML) with heterogeneous, dynamic and data-intensive properties.
◮ Execution time is input dependent
◮ Large variations
◮ Unpredictable even for same input size

Contributions
◮ Demonstrate the efficiency of using a multi-request type algorithm for HPC schedulers
  ◮ Idea: override the requested time of every job at submission
◮ Demonstrate the efficiency of speculative backfilling
  ◮ Idea: override the requested time temporarily during backfilling

Model
◮ A system with P identical processors and two queues.
◮ Long queue: J = {J_1, J_2, ..., J_M} of large stochastic jobs
  ◮ processor allocation p_j
  ◮ each walltime follows a given probability distribution (random variable)
◮ Short queue: a stream B of small jobs
  ◮ arrival rate λ
  ◮ average execution time ε, much smaller than that of the large jobs
  ◮ Continuous approximation: modeled as a stream of work arriving continuously in the queue with a rate Z = λε

Optimization objective
◮ System utilization: Useful Work / (P · Total Time)
◮ System response time: average time between submission and completion.
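The quantities on this slide can be written down directly. The sketch below is only an illustration of the three formulas (Z = λε, utilization, response time); the function and parameter names are hypothetical and do not come from the talk.

```python
def small_job_work_rate(arrival_rate, mean_exec_time):
    """Continuous approximation of the short queue: Z = lambda * epsilon."""
    return arrival_rate * mean_exec_time


def system_utilization(useful_work, processors, total_time):
    """Useful Work / (P * Total Time)."""
    return useful_work / (processors * total_time)


def mean_response_time(jobs):
    """Average time between submission and completion.

    jobs is an iterable of (submission_time, completion_time) pairs.
    """
    jobs = list(jobs)
    return sum(done - submitted for submitted, done in jobs) / len(jobs)
```

For instance, `system_utilization(600.0, 10, 100.0)` describes 600 processor-hours of useful work on 10 processors over 100 hours.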

Reservation-based Approach
Given a job J of duration t (unknown). The user makes a reservation of time t_1. Two cases:
◮ t ≤ t_1: the reservation is enough and the job succeeds.
◮ t > t_1: the reservation is not enough. The job fails. The user needs to ask for another reservation t_2 > t_1.
A strategy is a sequence of such reservations.

For J_3 (exec 2h to 98h):
• Strategy: t_1 = 5h, t_2 = 40h, t_3 = 60h, t_4 = 98h.
If the job is 33h:
1. We run the 5h reservation; it fails.
2. Then we run the 40h reservation; it succeeds.
Is the sysadmin happy? Is the user happy?
Util: 33/45 (33h of work over 5h + 40h reserved) instead of 33/98.
Cost: 38 (5h + 33h used) instead of 33.
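The arithmetic of the J_3 example can be replayed with a few lines of code. This is a minimal sketch, assuming the pay-what-you-use rule from the motivational example (a failed reservation is used in full, the successful one only for the job's runtime); `run_strategy` is a hypothetical helper name.

```python
def run_strategy(reservations, runtime):
    """Replay a reservation strategy for a job of known runtime.

    Returns (reserved_time, paid_time, attempts):
    reserved_time sums every reservation that was started,
    paid_time sums the time actually used in each of them.
    """
    reserved = 0.0
    paid = 0.0
    for attempt, t in enumerate(reservations, start=1):
        reserved += t
        if runtime <= t:
            # The reservation is long enough: the job completes.
            paid += runtime
            return reserved, paid, attempt
        # The reservation is too short: it is consumed entirely, then we retry.
        paid += t
    raise ValueError("no reservation in the strategy is long enough")


reserved, paid, attempts = run_strategy([5, 40, 60, 98], 33)
utilization = 33 / reserved  # 33/45
```

With the strategy above and a 33h job, the function returns 45h reserved, 38h paid and 2 attempts, matching the slide.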

Two-phase scheduling algorithm
Truthfully I do not know how to maximize the expected utilization. Writing the problem is already painful.
Instead we’ll go naive with a two-phase algorithm based on intuition:
◮ First phase: compute a reservation strategy for each job J_i: {t_{i,1}, t_{i,2}, ...}.
◮ Second phase: reservation scheduling

Phase 1: Reservation strategy
Idea: Use the reservation strategy that minimizes the expected makespan (TOptimal) as if job J_i was alone in the system ∗
◮ It is optimal for utilization if job J_i is the only large job in the system.
◮ We extended it (ATOptimal) to take into account backfilling: we define for J_i its backfilling rate:
  ζ_i = Z · p_i / P = λ ε p_i / P

Algorithm               Sequence of requests (in hours)
TOptimal                10.8, 13.4, 15.4, 17.1, 18.7, 20.0
ATOptimal (ζ = 0.1)     10.86, 13.91, 18.69, 20.0
ATOptimal (ζ = 0.5)     13.04, 20.0
ATOptimal (ζ = 0.9)     17.39, 20.0
ATOptimal (ζ = 1)       20.0

Example of strategies depending on the backfilling rate ζ. Distribution is Truncated Normal on 0 to 20 hours, µ = 8h, σ = 2h.

∗ See our paper at IPDPS’19 if you like maths.
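To get a feel for why a sequence of increasing requests can beat a single 20h request, one can evaluate a sequence against the truncated normal distribution of the example. The sketch below is not the TOptimal or ATOptimal computation from the paper; it assumes a simplified cost model in which every reservation that is started counts in full, and all function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def truncated_normal_cdf(t, mu, sigma, low, high):
    """CDF of a normal distribution truncated to [low, high]."""
    if t <= low:
        return 0.0
    if t >= high:
        return 1.0
    base = normal_cdf(low, mu, sigma)
    return (normal_cdf(t, mu, sigma) - base) / (normal_cdf(high, mu, sigma) - base)


def expected_reserved_time(requests, mu, sigma, low, high):
    """Expected total reserved time under the assumed cost model.

    Request k is only needed if the job is longer than request k-1,
    so it is weighted by P(T > t_{k-1}).
    """
    total = 0.0
    prob_needed = 1.0  # the first request is always submitted
    for t in requests:
        total += t * prob_needed
        prob_needed = 1.0 - truncated_normal_cdf(t, mu, sigma, low, high)
    return total


def backfilling_rate(work_rate, job_processors, total_processors):
    """zeta_i = Z * p_i / P."""
    return work_rate * job_processors / total_processors


toptimal = [10.8, 13.4, 15.4, 17.1, 18.7, 20.0]
multi = expected_reserved_time(toptimal, mu=8.0, sigma=2.0, low=0.0, high=20.0)
single = expected_reserved_time([20.0], mu=8.0, sigma=2.0, low=0.0, high=20.0)
```

Comparing `multi` with `single` shows how much reserved time the multi-request sequence saves on average under this simplified model.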

Phase 2: Job scheduling
We follow a batch scheduler model. We want to execute a batch of jobs from the long queue (typically 100 jobs).
1. For all jobs of the batch, submit to the scheduler their smallest reservation (∀i, t_{i,1}).
2. Let the scheduler compute its schedule the usual way.
3. If t_{i,1} is not enough, J_i is resubmitted with t_{i,2}.
4. The scheduler computes a new schedule with all resubmitted t_{i,2}, and so on.
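The resubmission loop in the four steps above can be sketched as follows. The scheduler itself is abstracted away: the sketch only tracks which reservation each job submits in each round, given runtimes that are known here for illustration (in practice they are only discovered when a reservation fails). `resubmission_rounds` and the job names are hypothetical.

```python
def resubmission_rounds(strategies, runtimes):
    """List, round by round, the reservations submitted to the scheduler.

    strategies maps a job name to its increasing reservation sequence;
    runtimes maps a job name to its actual (normally unknown) runtime.
    """
    next_index = {job: 0 for job in strategies}
    pending = list(strategies)
    rounds = []
    while pending:
        # Steps 1 and 4: submit the current reservation of every pending job.
        submitted = {job: strategies[job][next_index[job]] for job in pending}
        rounds.append(submitted)
        # Step 2 (the scheduler computing its schedule) is not modeled here.
        still_pending = []
        for job in pending:
            if runtimes[job] > submitted[job]:
                # Step 3: the reservation was too short, move to the next one.
                next_index[job] += 1
                if next_index[job] >= len(strategies[job]):
                    raise ValueError(f"strategy of {job} is exhausted")
                still_pending.append(job)
        pending = still_pending
    return rounds


rounds = resubmission_rounds(
    {"J1": [5, 40, 60, 98], "J2": [10, 20]},
    {"J1": 33, "J2": 8},
)
```

In this example the first round submits the smallest reservation of both jobs, and the second round contains only J1 with its 40h reservation.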
