monitoring and analysis at alcf

Monitoring and Analysis at ALCF, Kevin Harms, harms@alcf.anl.gov



  1. Monitoring and Analysis at ALCF
     - Kevin Harms – harms@alcf.anl.gov
     - ALCF Operations: Mark Fahey, Eric Pershey, Doug Waldron, Ben Allen, ...

  2. ALCF Philosophy
     - Collect all data and store (ETL) into central location (data warehouse)
       - Sometimes data is reduced or summarized
       - Raw data is retained
       - Not all data that is collected has an associated analysis or monitor
       - Approach facilitates ad-hoc analysis by any staff after the fact
         - Discover useful analyses as we go
     - Focus primarily on monitoring for internal staff
       - Augmenting and improving this is a continuous goal
     - Limited data provided for users
       - Improving this is a long-term goal

     [Figure labels: All The Things; Analysis1, Analysis2, Analysis3]
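
The collect-everything approach on this slide can be illustrated with a minimal sketch. It is not ALCF's actual warehouse: the table names, the sample events, and the use of an in-memory SQLite database are all assumptions made for illustration. It shows raw records being retained next to a reduced summary, so that a question nobody planned for can still be answered later from the raw data.

```python
import sqlite3

# Hypothetical raw events: (timestamp_s, source, node, value).
# These values are invented for illustration only.
RAW_EVENTS = [
    (0, "coolant_temp", "node-a", 18.0),
    (60, "coolant_temp", "node-a", 19.0),
    (0, "coolant_temp", "node-b", 20.0),
    (60, "coolant_temp", "node-b", 24.0),
]


def load_warehouse(events):
    """Load raw events plus a per-node summary into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE raw_events (ts INTEGER, source TEXT, node TEXT, value REAL)"
    )
    db.executemany("INSERT INTO raw_events VALUES (?, ?, ?, ?)", events)
    # Reduced view: one row per (source, node). The raw rows are kept as well.
    db.execute(
        "CREATE TABLE summary AS "
        "SELECT source, node, COUNT(*) AS n, AVG(value) AS mean_value "
        "FROM raw_events GROUP BY source, node"
    )
    db.commit()
    return db


db = load_warehouse(RAW_EVENTS)
summary = db.execute(
    "SELECT node, n, mean_value FROM summary ORDER BY node"
).fetchall()
# An ad-hoc question asked after the fact, answered from the retained raw data.
peak = db.execute(
    "SELECT node, value FROM raw_events ORDER BY value DESC LIMIT 1"
).fetchone()
raw_count = db.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

The summary table alone could not answer the peak-value query; keeping the raw rows is what leaves that option open.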

  3. Major Analysis
     - Job Failure Analysis (JFA)
       - Establish root cause for any job failure (non-zero exit code)
       - Combination of automatic processing of “business” logic with human-in-the-loop to look at outliers
       - Generates “failure” records
       - Find areas for improvement at a system level
     - Operational Data Processing System (ODPS)
       - Produces various system-wide metrics
       - Availability, MTTI, utilization, etc.
     - Machine Time Overlay (MTO)
       - Graph node allocations over time on 2-d grid with annotations of events or other temporal-spatial information
     - Darshan (I/O monitoring)
     - XALT (library tracking)
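
The JFA idea of automatic rules plus a human looking at outliers might be sketched as follows. This is a hypothetical illustration, not the JFA implementation: the `JobRecord` fields, the rule strings, and the cause labels are all invented here. Jobs with a non-zero exit code are matched against simple log rules, and anything that matches no rule is flagged for human review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical job record; the field names are assumptions."""
    job_id: int
    exit_code: int
    log_text: str


# Hypothetical "business" rules: (substring in the job log, root-cause label).
RULES = [
    ("out of memory", "application: out of memory"),
    ("walltime exceeded", "scheduler: walltime limit"),
    ("node down", "system: hardware failure"),
]

NEEDS_REVIEW = "needs human review"


def classify(job: JobRecord) -> Optional[str]:
    """Return a root-cause label, or None when the job did not fail."""
    if job.exit_code == 0:
        return None
    text = job.log_text.lower()
    for needle, cause in RULES:
        if needle in text:
            return cause
    # No rule matched: an outlier for a human to look at.
    return NEEDS_REVIEW


def failure_records(jobs):
    """Build one failure record per failed job."""
    records = []
    for job in jobs:
        cause = classify(job)
        if cause is not None:
            records.append({"job_id": job.job_id, "cause": cause})
    return records


jobs = [
    JobRecord(1, 0, "completed normally"),
    JobRecord(2, 137, "Killed: Out of memory"),
    JobRecord(3, 1, "segmentation fault in user code"),
]
records = failure_records(jobs)
```

Counting records by cause is then one way to look for system-level areas for improvement.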

  4. Total Knowledge of I/O (TOKIO) Framework
     - Transforms monitoring data from across the data center into answers to "why is my I/O slow?"
     - https://www.nersc.gov/research-and-development/tokio/
     - https://www.github.com/nersc/pytokio

     [Figure label: Elastic]

  5. Major Data Sources
     - BG/Q Control System
     - ALPS Logs
     - HSSdb (Cray hardware supervisory system)
     - Scheduler logs (job, reservations)
     - File system logs (GPFS, Lustre)
     - Accounting (sbank)
     - Job Logs (standard output, error, info)
     - Theta control system logs (boot log, etc.)
     - Job instrumentation (Darshan, AutoPerf, ..)
     - Future
       - LDMS?

  6. The (complicated) Big Picture…

  7. Our standard availability report

  8. Machine Time Overlay…
     - Y axis is the allocable chunks of the machine
     - X axis is time
     - Analyze scheduling performance and behavior
     - Any information with data, location, and time can be displayed this way
       - coolant temperature, power consumption, etc.
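
The layout described on this slide (allocable chunks on the Y axis, time on the X axis) can be sketched as a small 2-D grid. This is only an illustration, not the MTO tool: the allocation tuples, the chunk and step counts, and the text rendering are assumptions made here.

```python
def build_overlay(allocations, n_chunks, n_steps):
    """Return a grid[chunk][time_step] holding a job id or None (idle)."""
    grid = [[None] * n_steps for _ in range(n_chunks)]
    for job_id, first_chunk, last_chunk, start, end in allocations:
        for chunk in range(first_chunk, last_chunk + 1):
            for step in range(start, end):  # end is exclusive
                grid[chunk][step] = job_id
    return grid


def render(grid):
    """Render one text row per chunk, with '.' marking idle cells."""
    return "\n".join(
        "".join("." if cell is None else str(cell) for cell in row)
        for row in grid
    )


def utilization(grid):
    """Fraction of (chunk, time_step) cells that are allocated."""
    cells = [cell for row in grid for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)


# Hypothetical allocations: (job id, first chunk, last chunk, start, end).
allocations = [
    ("A", 0, 1, 0, 3),
    ("B", 2, 3, 1, 4),
    ("C", 0, 0, 3, 5),
]
grid = build_overlay(allocations, n_chunks=4, n_steps=5)
picture = render(grid)
used = utilization(grid)
```

Any other quantity that has a location and a time, such as a temperature reading per chunk, could be stored in the same grid shape in place of the job id.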

  9. Theta usage

  10. Theta – library usage (XALT)

  11. Theta – I/O usage (Darshan)

  12. Acknowledgements
      - ALCF Operations Staff!
      - This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
