Collecting Application-Level Job Completion Statistics
CUG 2010, Edinburgh
Matthew Ezell, HPC Systems Administrator
National Institute for Computational Sciences, University of Tennessee

• NICS is the latest NSF HPC center
• Kraken: #3 on the Top500
  – 1.030 petaflops peak; 831.7 teraflops Linpack
  – The first academic petaflop system
• Athena: #30 on the Top500
  – 166 teraflops peak; 125 teraflops Linpack
Motivation and Goals
• Need for statistics on the frequency and nature of job failures
• XT systems produce massive amounts of log data
  – Some job-level error messages are only written to the job's standard output or standard error
• Should be able to explain "cryptic" error messages to users
• Should not increase job walltime or modify the user experience
Design: apwrap Data Flow
[Figure: data flow through the apwrap aprun wrapper – the wrapper intercepts the user's aprun call, launches the real aprun, and records job completion information in a central database]
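A minimal sketch of the wrapper idea shown in the figure: run the real aprun unchanged, record timing and the exit status, then hand the results to a logging step. The renamed aprun path and the log file location are assumptions for illustration, not the actual apwrap configuration.

    #!/usr/bin/env perl
    # Sketch only: wrap aprun, preserve its behavior, and record completion data.
    use strict;
    use warnings;

    my $real_aprun = '/usr/bin/aprun.real';   # assumed location of the renamed aprun
    my @args       = @ARGV;

    my $start = time();
    system($real_aprun, @args);               # user sees normal aprun behavior
    my $exit_code = $? >> 8;
    my $end = time();

    # Placeholder for the database step; the real tool records many more fields.
    open my $log, '>>', '/tmp/apwrap.log' or die "apwrap: cannot open log: $!\n";
    print {$log} join('|', $start, $end, $exit_code, "@args"), "\n";
    close $log;

    exit $exit_code;                          # preserve aprun's exit status for the batch script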
Design: Prologues and Epilogues
• Allow arbitrary, system-defined programs to run before and after aprun execution (sketched below)
• Should be able to send messages to the user and/or prevent the application from being launched
• Can be integrated with other tools, such as the Automatic Library Tracking Database (ALTD) at NICS
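One way prologue handling could look inside the wrapper, assuming a site-defined hook directory and a simple "non-zero exit vetoes the launch" convention; both are assumptions, not the actual apwrap interface.

    #!/usr/bin/env perl
    # Sketch: run each site prologue before launching; a failing hook blocks the launch.
    use strict;
    use warnings;

    my @hooks = sort glob('/sw/apwrap/prologue.d/*');   # assumed site hook directory
    for my $hook (@hooks) {
        next unless -x $hook;
        system($hook, @ARGV);          # each hook sees the aprun arguments
        if ($? != 0) {
            # The hook's own output (already on the job's stdout/stderr) tells the user why.
            warn "apwrap: launch blocked by prologue $hook\n";
            exit 1;                    # application is never started
        }
    }
    # ...launch the real aprun here, then run epilogue hooks the same way...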
Design: Example Rules

rules => [{
    name    => 'NODEFAIL',
    pattern => '^\[NID \d+\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} Apid \d+ killed. Received node failed or halted event for nid (\d+)',
    message => 'A compute node had a hardware failure. Please resubmit your job.'
},{
    name    => 'SEGFAULT',
    pattern => '^_pmii_daemon\(SIGCHLD\): PE \d+ exit signal Segmentation fault',
    message => 'A node experienced a segmentation fault. This happens when the code attempts to access a memory location that it is not allowed to.'
}]
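To show how such a rule is consumed, here is a small sketch that matches captured stderr against the NODEFAIL rule above (reduced to its essentials) and prints the friendly message; the classify helper and the sample text handling are illustrative, not the actual apwrap code.

    #!/usr/bin/env perl
    # Sketch: classify captured stderr by the first matching rule.
    use strict;
    use warnings;

    my @rules = (
        {
            name    => 'NODEFAIL',
            pattern => qr/^\[NID \d+\] .* Received node failed or halted event for nid (\d+)/m,
            message => 'A compute node had a hardware failure. Please resubmit your job.',
        },
    );

    sub classify {
        my ($stderr_text) = @_;
        for my $rule (@rules) {
            return $rule if $stderr_text =~ $rule->{pattern};   # first match wins
        }
        return undef;    # no match: the error stays unclassified in the database
    }

    my $sample = "[NID 15050] 2010-04-04 03:42:45 Apid 1290954 killed. "
               . "Received node failed or halted event for nid 15051\n";
    if (my $hit = classify($sample)) {
        print "$hit->{name}: $hit->{message}\n";   # explanation shown to the user
    }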
Sample Database Entry id | 189 user_binary | /lustre/scratch/ user1 / username | user1 binary system | athena mpmd | f pbsserver | nid00004 pid | 18367 batchid | 68122.nid00004 start_time | 1270358965 batchidnum | 68122 exit_time | 1270366985 apid | 1290954 Duration | 8020 batch_node | aprun3 exit_code | 1 pwd | /lustre/scratch/ user1 error_name | NODEFAIL arguments | -n 4096 -N 1 -d 4 error_string | [NID 15050] binary 2010-04-04 03:42:45 pes | 4096 Apid 1290954 killed. Received node pes_per_node | 1 failed or halted event for nid 15051 depth | 4 CUG 2010 7
Successful Completion Rate
• Completed successfully: 87%
• Exited non-zero: 13%
Types of Errors Experienced
• MPI_ABORT: 61%
• KILLED: 16%
• NID_UNKNOWN: 10%
• EXE_NOTFOUND: 7%
• OOM: 2%
• SEGFAULT: 2%
• EXCEEDS_ALLOC: 1%
• FLOAT_EXCEPTION: 1%
• NODEFAIL: 0%
• APRUN_ARGS: 0%
MPI_ABORT (61%)
• The code purposely calls this function
• May occur if
  – an input file could not be found
  – the algorithm becomes numerically unstable
  – a call to malloc() returns a NULL pointer
  – etc.
• Usually not a system problem
KILLED (16%)
• Two causes
  – The job runs out of walltime and the batch system kills it
  – The user chooses to kill the job or application
• Extended walltime may be due to a system problem, but it is difficult to tell
NID_UNKNOWN (10%)
• Usually code-specific

  The last 50 lines from stderr follow:
  wks.c: Error in opngks_(): Could not open "./20100517-gmeta/comref-2010051700_spg40-24h.gmeta"
  FORTRAN STOP
  [NID 00078] 2010-05-17 11:57:19 Apid 1409935: initiated application termination
Conclusions
• Most errors experienced by users are (most likely) due to user error
• System-level errors are rarer and require administrator involvement to debug
Questions?
Contact me at ezell@nics.utk.edu