Birds-of-Feather: ECP Center and Application Monitoring - Working Group
Day: Thursday, February 6, 2020
Time: 1:30pm – 3:00pm CT
Room: Founders I
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Recent Advances Are Disrupting The Status Quo…
[Figure: diagram with labels "Homogeneous Memories" and "Homogeneous Processors"]
Where Data-Driven Decisions Make A Difference…
• Improved feedback to Application Developers on how their jobs performed (e.g., others with your application signature have used the XXXX library)
• Improved feedback to CS Researchers on how to improve the software environment (e.g., link compiler and memory data)
• Improved Job Scheduler to decide what runs on our machines
• Improved feedback to Operations (e.g., Chiller Management)
• Improved feedback to Planners on the characteristics of our workload (e.g., prefer 5% faster memory over 12% faster interconnect)
• Improved feedback to Vendors on how we use systems (e.g., 22% of jobs use GPUs in an XXXXX manner)
• Security & better quantification of our outputs: better ways to identify applications (avoid inappropriate usage like malware or bitcoin mining) and answer questions about science hours, utilization, and so on
• Plus many, many other uses
But There Are Barriers to Overcome…
• A tremendous amount of data is currently collected within our computer center, but it is artificially separated by knowledge domains
– Sysadmin data
– Data available from tools and performance counters
– Resilience & health of system data
– Workload characteristics & resource usage details for future procurements
– And so on…
• The artificially separated domains form a high barrier to understanding & a wealth of information is currently largely untapped
• Currently users are not aware of most of the data & do not have adequate access
– Data available from various environment tools is underutilized
– System designers don’t have adequate access to how the systems are used
What We Have… SEPARATE STOVEPIPE SYSTEMS
[Figure: separate stovepipes labeled Ops, CS Research, Planning, Users, Security]
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” – Sherlock Holmes (Sir Arthur Conan Doyle)
ECP’s Interest: Better Data Mitigates Identified Risks…
(Risks taken from ECP Risk Register v6.1)
| Risk ID | WBS | Mitigation (best) | Mitigation (worst) | Risk | Adding Data Component To Support Solution |
| 10000 | 2.2 | $1.0M | $1.5M | If Aurora has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization |
| 10001 | 2.2 | $1.0M | $1.5M | If Frontier has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization of apps |
| 10007 | 2.3.1 | $2.0M | $3.0M | If MPICH does not meet the other ECP subproject performance needs, for example, in interactions between CPU and GPU, directly between GPUs, in latency hiding, etc., then this will be a significant impact to the overall project as vendor implementations of MPI often rely heavily on MPICH. | Monitor MPI performance |
| 10008 | 2.3.1 | $1.6M | $2.4M | If OpenMP 4.5/5.0 does not meet the other ECP subproject needs, there will be a significant impact to the overall project as OpenMP is a widely used mechanism for achieving good node-level performance. | Track OpenMP utilization in applications & plugins to tools API |
| 10009 | 2.3.3 | $1.6M | $2.4M | If sparse linear solvers fail to perform well at scale and on multi-node architectures or otherwise don't meet ECP application needs, several ECP applications will be at risk of not being able to meet their KPPs. | Track libraries utilization and how they scale |
| 10010 | 2.3.3 | $0.8M | $1.2M | If dense linear algebra kernels fail to perform well on multi-node architectures, several ECP applications will be at risk of not being able to meet their KPPs. | Track libraries implementation |
| 10012 | 2.3.5 | | | If ST products are perceived as, or in fact are, inferior or overly complex, AD performance could suffer and ST products will not be adopted. | Track utilization of ST products |
| 10016 | 2.2 | | | Aurora or Frontier HW or SW has defects (e.g., bugs). | Monitor error bugs and aggregate them until a threshold is met |
| 10018 | 2.2 | $0.6M | $0.9M | Language features used by applications perform poorly or are not fully supported on Frontier and/or Aurora. | Compiler tracking of language standard utilization and possibly performance |
| 10019 | 2.3.2 | | | If vendor software does not provide required functionality or performance, Applications and/or ST products may not perform as required. | Track Application utilization |
| 10022 | 2.3.2 | $1.0M | $1.5M | If we do not have a Fortran compiler on Aurora that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran utilization |
| 10023 | 2.3.2 | $1.0M | $1.5M | If we do not have a Fortran compiler on Frontier that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran + OpenMP utilization |
| 10025 | 2.3 | | | If vendors produce new high-performance programming models for next generation architectures, e.g. HIP or SYCL, instead of ST-supported models, ST products or functionality may be underused. | Track SYCL, PM utilization |
| 10032 | 2.3 | | | If ST products do not function, meet performance targets, or support key system capabilities at full system size, then dependent AD and ST codes will not meet goals. Because there are no effective proxy systems, these issues are revealed late in the ECP project. | Track how ST products are used |
| 10046 | 2.4 | | | If the Facilities do not provide reliable, timely access to the systems for integration of ECP ST, ECP AD products, and/or resources in support of ECP efforts, then this will delay demonstration of KPPs. | Track utilization for information sharing with Facilities |
| Total | | $10.6M | $15.9M | | |
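As a quick arithmetic check on the slide's totals, the sketch below sums the best-case and worst-case mitigation figures for the nine risks that carry dollar amounts. The list name `MITIGATION_COSTS` and the helper `totals` are illustrative only; the figures are copied from the risk entries above.

```python
from decimal import Decimal

# (risk ID, best-case, worst-case) mitigation costs in $M, copied from the
# risk entries above. Risks without a dollar figure on the slide are omitted.
MITIGATION_COSTS = [
    (10000, "1.0", "1.5"),
    (10001, "1.0", "1.5"),
    (10007, "2.0", "3.0"),
    (10008, "1.6", "2.4"),
    (10009, "1.6", "2.4"),
    (10010, "0.8", "1.2"),
    (10018, "0.6", "0.9"),
    (10022, "1.0", "1.5"),
    (10023, "1.0", "1.5"),
]


def totals(rows):
    """Return (best_total, worst_total) in $M, using exact decimal arithmetic."""
    best = sum(Decimal(b) for _, b, _ in rows)
    worst = sum(Decimal(w) for _, _, w in rows)
    return best, worst


best_total, worst_total = totals(MITIGATION_COSTS)
print(f"best ${best_total}M, worst ${worst_total}M")
```

The sums come to $10.6M and $15.9M, matching the totals shown on the slide.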
The ECP Center and Application Monitoring - Working Group
Goals for the BoF
• Capture the current state of center-wide monitoring at ECP institutions
• Make a determination of need
• Produce a white paper on the identified need
Notes: https://confluence.exascaleproject.org/display/HISD/Annual+Meeting+Notes+-+BoF
BoF Agenda
| Item | Time | Speaker(s) | Description |
| Introduction | 5 mins | Jones/Montoya | Some challenges (work toward white paper) |
| ECP Viewpoint | 5 mins | Heroux/Quinn | What ECP Level 2’s and Level 3’s want to see |
| ECP Use Cases | 15 mins | Panel 1 | What does workflow need? What do apps teams need? |
| Current State | 25 mins | Panel 2 | A brief overview of the monitoring activities at several institutions (LANL/OLCF/ALCF/SNL/LLNL/NERSC/Cray) |
| Open Discussion | 35 mins | Full Audience | Audience participation; also cover next steps |