Facilitating HPC job debugging through job scripts archival
Andy Georges
2 February 2020
FOSDEM 2020 - HPC, Big Data & Data Science devroom
About
• I am an HPC sysadmin at Ghent University
• Only doing user support very occasionally
  • When something is sent my way
• But . . . I am responsible for logging things
  • And for the scheduler
Motivation
• HPC clusters run a gazillion jobs over their lifetime
• These jobs sit in the queue after submission
  • For a while . . .
• Some jobs die unexpectedly
  • Then the user wants to know why
  • Probably to avoid it happening again
  • And because it cannot be their fault, obviously
The key problem
Figure out what was running in the job, and under which environment
Surely we can ask the user to provide the job script
• They no longer have it
• They may have changed it (and not under version control) to be used in another job
• They may not recall which version was submitted
• They may claim to know exactly what was submitted and provide you with the wrong script
• In all of the above they would have been acting in good faith
The user is not the only actor
• The scheduler may have changed the script
  • Or its settings, like the requested cores, memory, . . .
  • Through a submit filter
• But . . . it does keep a copy
  • Or does it?
Surely the scheduler can provide the required information when we ask it
• The script is saved
  • In the spool directory (see the sketch below)
  • Once the job is queued
  • Only until the job finishes or crashes
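As an illustration, a minimal sketch of fishing the saved script out of Slurm's state save location while the job is still alive. The /var/spool/slurmctld path and the hash.N/job.<id>/script layout are assumptions based on a typical slurmctld configuration; check StateSaveLocation on your own site before relying on this.

use std::fs;
use std::path::Path;

/// Read the job script that slurmctld saved for a queued or running job.
/// Layout assumption: <state_save>/hash.<job_id % 10>/job.<job_id>/script
fn read_job_script(state_save: &str, job_id: u64) -> std::io::Result<String> {
    let path = Path::new(state_save)
        .join(format!("hash.{}", job_id % 10))
        .join(format!("job.{}", job_id))
        .join("script");
    fs::read_to_string(path)
}

fn main() {
    match read_job_script("/var/spool/slurmctld", 123456) {
        Ok(script) => println!("{}", script),
        // Once the job has completed or crashed, the copy is gone.
        Err(e) => eprintln!("job script no longer available: {}", e),
    }
}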
Should we patch the scheduler?
• Yes, but no, but yes, but no, but maybe, but no
• If the scheduler is FOSS
  • Write a patch
    • To save the exact job script in a secondary location
    • And forget about it, to avoid deletion upon job completion
  • Maintain said patch forever
    • Unless you can get it upstream
    • But why should it be accepted?
      • Saving a duplicate copy is not the scheduler's task
      • It makes for more work to be done on each job submission
  • You may need to adjust, test, . . . in the next release
Complications
• Your site may be running multiple schedulers
• Depending on the vendor
  • You may need to pay just to get a duplicate copy of the job scripts
  • And other sites might too (hey, it's free money)
• So even if your current scheduler is FOSS and got patched, the next one may be different
Takeaway
The scheduler may not be the best place to obtain job script backups
Enter SArchive
• FOSS (duh), written in Rust
• Separates the front end (finding job scripts for the scheduler) from the back end (archival of said job scripts), as sketched below
• Started out as a tool for Slurm, but also supports Torque
• Should be trivial to add support for schedulers that also drop job scripts in a spool directory, e.g. Univa Grid Engine, LSF, PBS Pro, . . .
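A minimal sketch of how that front end / back end split can be expressed in Rust. The trait and type names (JobScriptSource, Archiver, JobInfo) are illustrative, not SArchive's actual API: the point is only that scheduler knowledge and archival knowledge stay on opposite sides of a small interface.

use std::path::{Path, PathBuf};

/// What a front end hands over for every job it picks up.
pub struct JobInfo {
    pub job_id: String,
    pub script: String,
    pub extra_files: Vec<(PathBuf, String)>, // e.g. the saved environment
}

/// Scheduler-savvy front end: knows where and how a scheduler spools its jobs.
pub trait JobScriptSource {
    fn pick_up(&self, changed_path: &Path) -> Option<JobInfo>;
}

/// Back end: only knows how to persist job information somewhere.
pub trait Archiver {
    fn archive(&self, job: &JobInfo) -> std::io::Result<()>;
}

/// A trivial back end that just prints, to show the two sides stay decoupled.
struct StdoutArchiver;

impl Archiver for StdoutArchiver {
    fn archive(&self, job: &JobInfo) -> std::io::Result<()> {
        println!("job {} -> {} bytes of script", job.job_id, job.script.len());
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let job = JobInfo {
        job_id: "123456".to_string(),
        script: "#!/bin/bash\nsleep 60\n".to_string(),
        extra_files: Vec::new(),
    };
    StdoutArchiver.archive(&job)
}

Adding a new scheduler then means writing one more JobScriptSource; adding a new archival target means one more Archiver, without touching the other side.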
What it does
• Monitor the spool directory (or directories)
• Upon receiving a desired change notification, tell the . . .
• . . . scheduler-savvy front end code to pick up the data as it knows how to
• The resulting job information is pushed onto a FIFO queue for further processing
  • To allow fast handling of the notifications, since jobs can suddenly enter the system in large quantities
• The back end takes the items out of the FIFO queue and archives the information (a producer/consumer sketch follows below)
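A self-contained sketch of that producer/consumer flow, using only the standard library. SArchive itself reacts to filesystem change notifications; here the front-end side is faked with a few hard-coded job ids so the example runs anywhere. The struct and channel setup are illustrative, not SArchive's internals.

use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
struct JobInfo {
    job_id: u64,
    script: String,
}

fn main() {
    // The FIFO queue decoupling the front end from the back end.
    let (tx, rx) = mpsc::channel::<JobInfo>();

    // "Front end": on every change notification it would parse the spool
    // entry and push the result onto the queue as quickly as possible.
    let producer = thread::spawn(move || {
        for job_id in [1001u64, 1002, 1003] {
            tx.send(JobInfo {
                job_id,
                script: format!("#!/bin/bash\necho job {}", job_id),
            })
            .expect("back end hung up");
        }
        // tx is dropped here, which closes the queue.
    });

    // "Back end": drains the queue and archives each item, at its own pace.
    let consumer = thread::spawn(move || {
        for job in rx {
            println!("archiving script of job {}:\n{}", job.job_id, job.script);
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}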
Supported back ends
• Saving to a file hierarchy with YYYY[MM[DD]] sub-directories (sketched below)
• Sending a JSON structure with the job script information to Elasticsearch
• Producing a JSON structure with the job script information to Kafka
• Note: I only implemented the features that we need/use for ES/Kafka (which is fairly limited)
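A minimal sketch of the file-hierarchy idea: drop each script under date-based sub-directories. It uses the chrono crate for the timestamp; the function name, the fixed YYYY/MM/DD nesting and the /var/lib/sarchive root are illustrative assumptions, not SArchive's exact on-disk format.

use chrono::Utc;
use std::fs;
use std::path::{Path, PathBuf};

/// Archive a job script under <root>/YYYY/MM/DD/job_<id>.script
fn archive_to_file(archive_root: &Path, job_id: u64, script: &str) -> std::io::Result<PathBuf> {
    let dir = archive_root.join(Utc::now().format("%Y/%m/%d").to_string());
    fs::create_dir_all(&dir)?;
    let target = dir.join(format!("job_{}.script", job_id));
    fs::write(&target, script)?;
    Ok(target)
}

fn main() -> std::io::Result<()> {
    let path = archive_to_file(Path::new("/var/lib/sarchive"), 123456, "#!/bin/bash\nsleep 60\n")?;
    println!("archived to {}", path.display());
    Ok(())
}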
Figure: 24 hours of job scripts injected into ES through Kafka (6 Ghent University clusters)
Resources
• https://github.com/itkovian/sarchive
• https://crates.io/crates/sarchive (may be behind master, depending on dependencies)
• Fork it, add to it, and open a PR :)
• Or open an issue if you want or need a feature