Monitoring and Analysis at ALCF
Kevin Harms – harms@alcf.anl.gov
ALCF Operations
Mark Fahey, Eric Pershey, Doug Waldron, Ben Allen ...
ALCF Philosophy
- Collect all data and store it (ETL) in a central location (data warehouse)
  - Sometimes data is reduced or summarized
  - Raw data is retained
  - Not all data that is collected has an associated analysis or monitor
  - Approach facilitates ad-hoc analysis by any staff after the fact
    - Discover useful analyses as we go
- Focus primarily on monitoring for internal staff
  - Augmenting and improving this is a continuous goal
- Limited data provided for users
  - Improving this is a long-term goal
[Figure: "All The Things" with Analysis1, Analysis2, Analysis3]
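The "retain raw data, summarize separately" idea can be illustrated with a minimal sketch. It assumes an SQLite database standing in for the data warehouse; the table names (`raw_events`, `event_summary`), the record layout, and the sample records are hypothetical and not taken from the ALCF system.

```python
import sqlite3

def load_events(conn, events):
    """Store every raw record, then rebuild a per-source summary table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (source TEXT, ts REAL, payload TEXT)"
    )
    # Raw records are always kept, even if no analysis uses them yet.
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)
    # The summary is derived data and can be rebuilt from the raw table at any time.
    conn.execute("DROP TABLE IF EXISTS event_summary")
    conn.execute(
        "CREATE TABLE event_summary AS "
        "SELECT source, COUNT(*) AS n_events, MIN(ts) AS first_ts, MAX(ts) AS last_ts "
        "FROM raw_events GROUP BY source"
    )
    conn.commit()

def summary(conn):
    """Return (source, n_events) rows ordered by source name."""
    return conn.execute(
        "SELECT source, n_events FROM event_summary ORDER BY source"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load_events(conn, [
    ("scheduler", 10.0, "job 1 start"),   # hypothetical sample records
    ("scheduler", 20.0, "job 1 end"),
    ("filesystem", 15.0, "slow op"),
])
```

Because the raw table is never reduced, a new ad-hoc query can be written against it later without re-collecting anything.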
Major Analysis
- Job Failure Analysis (JFA)
  - Establish root cause for any job failure (non-zero exit code)
  - Combination of automatic processing of "business" logic with human-in-the-loop to look at outliers
  - Generates "failure" records
  - Find areas for improvement at a system level
- Operational Data Processing System (ODPS)
  - Produces various system-wide metrics
  - Availability, MTTI, utilization, etc.
- Machine Time Overlay (MTO)
  - Graph node allocations over time on a 2-D grid with annotations of events or other temporal-spatial information
- Darshan (I/O monitoring)
- XALT (library tracking)
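The JFA split between automatic rules and human review can be sketched roughly as follows. This is an illustration only: the `Job` fields, the rule patterns, and the cause labels are hypothetical, not the actual JFA business logic.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    exit_code: int
    log_text: str

# Hypothetical rules: (substring to look for in the job log, assigned cause).
RULES = [
    ("out of memory", "user: memory exhausted"),
    ("walltime exceeded", "user: walltime limit"),
    ("node failure", "system: hardware"),
]

def triage(jobs):
    """Return (failure_records, review_queue) for jobs with non-zero exit codes."""
    records, review = [], []
    for job in jobs:
        if job.exit_code == 0:
            continue  # successful jobs produce no failure record
        text = job.log_text.lower()
        cause = next((c for pattern, c in RULES if pattern in text), None)
        if cause is None:
            # No rule matched: an outlier for a person to examine.
            review.append(job.job_id)
        else:
            records.append(
                {"job_id": job.job_id, "exit_code": job.exit_code, "cause": cause}
            )
    return records, review
```

Causes assigned by a reviewer could later be turned into new rules, which gradually shrinks the review queue.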
Total Knowledge of I/O (TOKIO) Framework
Transforms monitoring data from across the data center into answers to the question "why is my I/O slow?"
https://www.nersc.gov/research-and-development/tokio/
https://www.github.com/nersc/pytokio
Major Data Sources
- BG/Q Control System
- ALPS Logs
- HSSdb (Cray hardware supervisory system)
- Scheduler logs (job, reservations)
- File system logs (GPFS, Lustre)
- Accounting (sbank)
- Job Logs (standard output, error, info)
- Theta control system logs (boot log, etc.)
- Job instrumentation (Darshan, AutoPerf, ..)
- Future
  - LDMS?
The (complicated) Big Picture…
Our standard availability report
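The slides do not define how the system-wide metrics (availability, MTTI, utilization) are computed. As a rough sketch, the commonly used definitions could be written as below; the formulas and parameter names are assumptions and may differ from what ODPS actually reports.

```python
def availability(scheduled_hours, downtime_hours):
    """Fraction of scheduled time the machine was usable (assumed definition)."""
    return (scheduled_hours - downtime_hours) / scheduled_hours

def mtti(uptime_hours, interrupts):
    """Mean time to interrupt: uptime divided by the number of interrupts."""
    # With no interrupts in the period the value is unbounded.
    return uptime_hours / interrupts if interrupts else float("inf")

def utilization(used_node_hours, nodes, available_hours):
    """Node-hours consumed by jobs over node-hours that were available."""
    return used_node_hours / (nodes * available_hours)
```

For example, a 100-hour scheduled window with 5 hours of downtime and 5 interrupts gives `availability(100, 5)` and `mtti(95, 5)` as the two headline numbers for that period.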
Machine Time Overlay…
- Y axis shows the allocable chunks of the machine
- X axis is time
- Analyze scheduling performance and behavior
- Any information that has a location and a time can be displayed this way
  - Coolant temperature, power consumption, etc.
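The grid behind such an overlay can be sketched in a few lines. This is a minimal illustration, assuming integer timestamps and allocations given as hypothetical tuples `(job_id, first_chunk, last_chunk, start, end)`; it is not the MTO implementation.

```python
def build_overlay(allocations, n_chunks, t_start, t_end, step):
    """Return grid[chunk][time_bin] = job id occupying that chunk, or None."""
    n_bins = (t_end - t_start + step - 1) // step  # ceiling division
    grid = [[None] * n_bins for _ in range(n_chunks)]
    for job_id, first_chunk, last_chunk, start, end in allocations:
        # Clamp the job's time span to the plotted window.
        lo = max(0, (start - t_start) // step)
        hi = min(n_bins, (end - t_start + step - 1) // step)
        for chunk in range(first_chunk, last_chunk + 1):
            for b in range(lo, hi):
                grid[chunk][b] = job_id
    return grid

def render(grid):
    """Text rendering: one row per chunk (Y axis), one column per time bin (X axis)."""
    return "\n".join(
        "".join("." if cell is None else str(cell) for cell in row) for row in grid
    )

grid = build_overlay(
    [("A", 0, 1, 0, 20), ("B", 2, 2, 10, 40)],  # hypothetical allocations
    n_chunks=3, t_start=0, t_end=40, step=10,
)
print(render(grid))
```

Other spatial-temporal data, such as a per-chunk temperature reading, could be stored in the same grid shape and drawn as an additional layer.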
Theta usage
Theta – library usage (XALT)
Theta – I/O usage (Darshan)
Acknowledgements
ALCF Operations Staff!
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.