ATLAS SCALE-UP TEST ON PIZ DAINT
Gianfranco Sciacca
AEC - Laboratory for High Energy Physics, University of Bern, Switzerland
LHConCray WG - 7 November 2017
TEST SETUP

▸ ARC setup: 1 ARC CE + 1 data stager (both doing staging) - maxdelivery="100"
  ▸ no ARC caching
  ▸ 2 LRMS queues: wlcg (production queue), atltest (added for this test)
▸ Preliminary setup: SLURM reservation with 11 Piz Daint nodes: 72 HT slots, 64 GB RAM
  ▸ originally decided to use 64 out of the 72 slots
  ▸ 16-core jobs: 44 jobs to fill the system (704 slots)
▸ Scale-up setup: SLURM reservation with 384 Piz Daint nodes: 72 HT slots, 64 GB RAM (see the reservation sketch below)
  ▸ decided to use all of the 72 slots
  ▸ 18-core jobs: 1536 jobs to fill the system (27648 slots)
▸ Job setup: validation task https://bigpanda.cern.ch/task/12491843/
  ▸ 4M events, 40k jobs, 40k input files, up to 148 MB/file (mostly 115 MB)
  ▸ jobs tuned to ~1h duration (maxEvents=100)
  ▸ ramCount=900 MBPerCore
  ▸ expected output: ~70 MB/job
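A minimal sketch of how a SLURM reservation of this size could be created and targeted. The reservation name, user, duration and flags are assumptions for illustration only; the actual Piz Daint reservation details are not given in the slides.

    # Hypothetical reservation for the scale-up run: 384 nodes held for the test window.
    # ReservationName, Users, Duration and Flags are assumptions, not the real CSCS settings.
    scontrol create reservation ReservationName=atlas_scaleup \
        StartTime=now Duration=12:00:00 NodeCnt=384 \
        Users=atlas01 Flags=IGNORE_JOBS

    sinfo --reservation          # verify the reservation is active

    # Jobs submitted by the ARC CE would then target it, e.g.:
    #   sbatch --partition=atltest --reservation=atlas_scaleup ...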
PREPARATION

Started 02 Nov, 4 PM
▸ Started submitting jobs on 2 Nov at 4 PM
▸ A load spike on the data stager broke GPFS
  ▸ set maxdelivery="30"
  ▸ we also had:
      # 300 at the end means that it won't cancel/submit more than 300 jobs at the same time
      maxjobs="40000 20000 8000 80000 300"
▸ Jobs started running
▸ Settled eventually on:
      [grid-manager]
      maxjobs="40000 20000 8000 80000 800"
      [data-staging]
      maxdelivery="30"
▸ The ARC CE reports 0 running jobs, but only for the "atltest" partition; the "wlcg" partition seems to be reported correctly
  ▸ This prevented the aCT from submitting continuously (a minimal bash reproduction of the failure follows below)
      [root@arc04 arc]# tail /var/spool/nordugrid/jobstatus/job.helper.errors
      /usr/share/arc/scan-SLURM-job: line 226: [: -ne: unary operator expected
      /usr/share/arc/scan-SLURM-job: line 228: [: ExitCode: integer expression expected
      date: invalid date 'Start'
      date: invalid date 'End'
      /usr/share/arc/scan-SLURM-job: line 287: - : syntax error: operand expected (error token is "- ")
      /usr/share/arc/scan-SLURM-job: line 226: [: -ne: unary operator expected
      /usr/share/arc/scan-SLURM-job: line 228: [: ExitCode: integer expression expected
      date: invalid date 'Start'
      date: invalid date 'End'
      /usr/share/arc/scan-SLURM-job: line 287: - : syntax error: operand expected (error token is "- ")
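The "[: -ne: unary operator expected" and "integer expression expected" lines above are classic bash symptoms of testing an empty or non-numeric variable. The snippet below is an illustrative reproduction of that pattern only, not the actual scan-SLURM-job code; the job id and the sacct call are assumptions.

    # Minimal reproduction of the bash failure mode behind the scan-SLURM-job errors.
    jobid=1234567
    exitcode=$(sacct -j "$jobid" -n -o ExitCode | head -n1 | cut -d: -f1 | tr -d ' ')

    # Unguarded test: if sacct returned nothing, this expands to '[ -ne 0 ]' and bash
    # prints "[: -ne: unary operator expected", as seen in job.helper.errors.
    if [ $exitcode -ne 0 ]; then
        echo "job $jobid failed"
    fi

    # Defensive variant: quote the variable and fall back to a numeric default.
    if [ "${exitcode:-0}" -ne 0 ]; then
        echo "job $jobid failed"
    fi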
PREPARATION

[Plot: job pattern with bad infosys]
PREPARATION

Bad output from scan-SLURM-job
▸ What is the issue?
  ▸ At times `sacct` returns nothing for a specific jobid, while `scontrol` does
  ▸ In such cases the script seems to die miserably (a fallback sketch follows below)
▸ ARC seems capable of producing the correct value of nordugrid-cluster-usedcpus for one queue only
  ▸ It seems to query SLURM only for the first queue that is defined?
▸ We decided to move to a dedicated ARC CE (now a 10GbE VM, no staging) and do all the staging via the data stager only
▸ Jobs started to flow from the aCT and run in stable conditions
▸ Unfortunately, we did NOT realise that nordugrid-arc-arex-5.3.0-1.el7.centos.x86_64 was the version running
  ▸ Only realised it after the scale-up run had started
  ▸ We considered upgrading on the fly vs. babysitting the old version
  ▸ Decided it was too risky to upgrade (the admin was not comfortable doing that)
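An illustrative sketch, not the actual scan-SLURM-job logic, of the kind of fallback that would tolerate the empty sacct responses seen here by asking scontrol instead, plus the package query that would have flagged the outdated A-REX release before the run. The job id is hypothetical.

    # Fall back to scontrol when sacct returns nothing for a job id.
    jobid=1234567
    state=$(sacct -j "$jobid" -n -o State | head -n1 | tr -d ' ')
    if [ -z "$state" ]; then
        # scontrol still knew about the jobs for which sacct came back empty
        state=$(scontrol show job "$jobid" 2>/dev/null | grep -o 'JobState=[A-Z]*' | cut -d= -f2)
    fi
    echo "job $jobid state: ${state:-UNKNOWN}"

    # A simple package query shows which A-REX release is actually installed:
    rpm -q nordugrid-arc-arex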
SCALING UP

Started 06 Nov, 8 AM
▸ Reached 1420 jobs (25560 cores) in ~1h
▸ a-rex died straight away and needed restarting by hand:
      [2017-11-06 08:53:13] [Arc.Daemon] [ERROR] [78862/28075008] Watchdog detected application timeout - killing process
      [2017-11-06 08:53:13] [Arc.A-REX] [INFO] [78864/28075008] Shutting down job processing
      [2017-11-06 08:53:13] [Arc.A-REX] [INFO] [78864/28075008] Shutting down data staging threads
▸ Otherwise the ramp-up was fairly linear at 27 jobs/min (486 slots/min; quick arithmetic check below)
  ▸ seemingly dominated by SLURM, not by aCT/ARC or GPFS
▸ Gazillions of messages like:
      (arched:61671): GLib-WARNING **: GChildWatchSource: Exit status of a child process was requested but ECHILD was received by waitpid(). Most likely the process is ignoring SIGCHLD, or some other thread is invoking waitpid() with a nonpositive first argument; either behavior can break applications that use g_child_watch_add()/g_spawn_sync() either directly or indirectly.
  ▸ These seem to be harmless. Then why are they there?
▸ Increased maxqueued on the aCT to have a large enough buffer and avoid draining between restarts
▸ Stable running for 3h from 11 AM
▸ We disabled the watchdog; still, several manual a-rex restarts were needed
▸ Stopped submission at 2 PM
▸ Killed all running jobs from the aCT at 2:45 PM
▸ System clean at 3 PM
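A back-of-the-envelope check of the ramp-up figures quoted above; all input numbers are the ones on this slide, the script itself is only an illustrative calculation.

    jobs_per_min=27
    cores_per_job=18
    echo "ramp rate: $(( jobs_per_min * cores_per_job )) cores/min"          # 486
    echo "minutes to fill 1536 job slots: ~$(( 1536 / jobs_per_min ))"       # ~56, i.e. roughly 1h
    echo "cores at the 1420-job peak: $(( 1420 * cores_per_job ))"           # 25560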
TEST SUMMARY

[Summary plot]
TEST SUMMARY

Task profile plot: https://bigpanda.cern.ch/taskprofileplot/?jeditaskid=12491843
TEST SUMMARY

Dashboard: http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=resourceutil&sites%5B%5D=CSCS-LCG2&sitesCat%5B%5D=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=null&end=null&timerange=last48&granularity=Hourly&generic=0&sortby=16&series=30&activities%5B%5D=all

▸ 1M events processed (25% of the total): 10162 jobs (out of 11785)
▸ Total input size: 1 TB (no ARC caching); output size: 0.7 TB (staged to a SE in Spain)
▸ Max running jobs reached 1432 (25774/27648 cores - 93.22%; some nodes were down; arithmetic check below)
▸ Unfortunately we could not test the latest stable ARC version
▸ My feeling is that ARC can easily become a bottleneck (if unstable, etc.)
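A quick cross-check of the summary percentages quoted above; the input numbers are taken from this slide, the script is only an illustrative check.

    echo "fraction of events processed: $(( 100 * 1000000 / 4000000 ))%"            # 25%
    echo "peak core utilisation: $(awk 'BEGIN{printf "%.2f%%", 100*25774/27648}')"  # 93.22%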