UMAMI: A Recipe for Generating Meaningful Metrics through Holistic I/O Performance Analysis
Glenn K. Lockwood, Shane Snyder, Wucherl Yoo, Kevin Harms, Zachary Nault, Suren Byna, Philip Carns, Nicholas J. Wright
October 27, 2017
Understanding I/O today is hard
[Figure: compute nodes, IO nodes and burst buffer (BB) nodes, and storage servers, each monitored in its own format (custom binary formats, .txt, ES, HDF5)]
• The storage hierarchy is getting more complicated
• Monitoring each component separately is currently standard practice
• Tying the separate views together relies on expert knowledge
I/O expert (Phil Carns) from ATPESC: https://insidehpc.com/2017/10/hpc-io-computational-scientists/
Total Knowledge of I/O with holistic analysis
• Can we augment expert knowledge using existing tools?
• Combine, index, and normalize their metrics
• Provide a holistic view: Total Knowledge of I/O (TOKIO)
What is possible with holistic I/O analysis?
• Run four different I/O workloads every day for a month
  – Jobs scaled to achieve > 80% of peak file system performance
  – Exercise file-per-process, shared-file, and big and small transfers
• Run on ALCF Mira (IBM BG/Q) and NERSC Edison (Cray XC)
  – One GPFS file system on Mira (gpfs-mira)
  – Two Lustre file systems on Edison (lustre-reg and lustre-bigio)
• Use data from production monitoring tools at ALCF and NERSC
  – Darshan for application-level I/O profiling
  – GPFS- and Lustre-specific server-side monitoring tools
Defining performance variation
• "Fraction of Peak Performance" is relative to the maximum performance observed for that application on that file system
• This normalizes out the effects of application I/O patterns and of peak file system performance
[Figure: distributions of fraction of peak performance, gpfs (Mira)]
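The normalization on this slide can be sketched in a few lines. `fraction_of_peak` is a hypothetical helper written for illustration (it is not part of any TOKIO tool), and the bandwidth numbers are made up:

```python
def fraction_of_peak(bandwidths):
    """Normalize each measured bandwidth (e.g. GiB/s) by the best value
    observed for the same application on the same file system.

    A value of 1.0 marks that application's best run; lower values
    quantify how far a given run fell short of its demonstrated peak,
    independent of the application's I/O pattern or the file system's
    absolute peak bandwidth.
    """
    peak = max(bandwidths)
    return [b / peak for b in bandwidths]

# Example: one workload's daily write bandwidths (invented numbers)
daily_gibs = [350.0, 700.0, 525.0]
print(fraction_of_peak(daily_gibs))  # [0.5, 1.0, 0.75]
```

Because each application is normalized against its own best run, fractions are comparable across workloads and file systems.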
Variation due to application I/O pattern
• "Bad I/O patterns" can cause
  – bad performance
  – bad performance variation
• Some application patterns are more susceptible to high amounts of variation!
[Figure: per-application distributions, gpfs (Mira)]
Variation across file system architectures
[Figure: gpfs (Mira) vs. lustre-bigio (Edison)]
Application I/O patterns are not the only contributor to performance variation
Variation between Lustre configurations
[Figure: lustre-bigio (Edison) vs. lustre-reg (Edison)]
Significant differences appear even between similar Lustre file systems; other factors (configuration, workload) also matter!
What does this tell us about variation?
[Figure: gpfs (Mira), lustre-bigio (Edison), lustre-reg (Edison)]
Performance variation is a function of
• application I/O patterns (cf. HACC, VPIC)
• architecture (cf. gpfs, lustre-bigio)
• other factors (cf. lustre-bigio, lustre-reg)
What does this tell us about variation?
File systems have their own "I/O climate" (like Berkeley vs. Argonne)
Understanding these "other factors" (climate) holistically is essential to understanding performance variability!
Exploring I/O weather and climate
Let's look at a few cases of bad performance on lustre-reg (Edison) using a Unified Monitoring and Metrics Interface (UMAMI)
What can a holistic view (climate) tell us about performance (weather)?
Case Study #1: HACC write performance on lustre-reg
• Is this a snowy day at Argonne or a snowy day at Berkeley?
• Quantitatively define "bad" based on quartiles
• Use UMAMI to determine which aspects of the weather were "bad"
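A quartile-based definition of "bad" could be sketched as below. The exact cutoff (below the lower quartile of the historical record) is an assumption for illustration; the talk says only that "bad" is defined quantitatively via quartiles:

```python
import statistics

def classify_bad(measurement, history):
    """Flag a fraction-of-peak measurement as 'bad' when it falls below
    the lower quartile (Q1) of the historical record for this workload
    on this file system.

    The choice of Q1 as the threshold is an assumption made for this
    sketch, not the authors' documented rule.
    """
    q1, _q2, _q3 = statistics.quantiles(history, n=4)
    return measurement < q1
```

The same history can be reused to flag which *metrics* (not just performance) were statistically unusual on a bad day.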
Case Study #1: First guess: blame someone else
Coverage Factor = how much of the global bandwidth was consumed by my job?
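The coverage-factor idea reduces to a ratio of application-side to server-side traffic counters. A minimal sketch, assuming the job's bytes come from Darshan and the totals from server-side monitoring (the function name is invented for illustration):

```python
def coverage_factor(job_bytes, total_bytes):
    """CF_bw: the fraction of all bytes moved through the file system
    (server-side counters) that were moved by this job (application-side
    Darshan counters) over the same interval.

    CF_bw near 1.0 means the job effectively had the file system's
    bandwidth to itself; a low CF_bw suggests bandwidth contention
    from competing workloads.
    """
    return job_bytes / total_bytes

# Example: the job wrote 40 TiB while the servers moved 50 TiB in total
print(coverage_factor(40, 50))  # 0.8 -> 20% of traffic came from elsewhere
```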
Case Study #1: Add Coverage Factor to UMAMI
Most jobs get exclusive access to Lustre bandwidth (CF_bw ≈ 1.0)
Case Study #1: Add Coverage Factor to UMAMI
Bad performance coincided with low CF: the performance variation was caused by bandwidth contention
Case Study #2: VPIC/GPFS: when bandwidth contention isn't the issue
Bad performance did not coincide with low CF
Either use expert knowledge or statistical analysis to add more metrics
Case Study #2: VPIC/GPFS: when bandwidth contention isn't the issue
Statistically "bad" levels of contention for metadata IOPS
The performance loss is affected by the file system implementation
Case Study #3: HACC/lustre-bigio: effects of "I/O climate change"
Abnormally good performance revealed a long-term bad I/O climate
Bandwidth contention was not the culprit
Case Study #3: HACC/lustre-bigio: effects of "I/O climate change"
• Moderate negative correlation with OSS CPU load
• Strong negative correlation with file system fullness
• A result of Lustre block allocation behavior at >90% fullness
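The correlations above are standard Pearson coefficients between the performance series and a candidate metric series. A self-contained sketch (written out rather than using a library, purely for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. daily fraction-of-peak performance vs. file system fullness.

    Returns a value in [-1, 1]; a strong negative value (near -1) is
    the kind of signal that implicated fullness in this case study.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Computing this coefficient for every metric in the UMAMI view is one way to rank which aspects of the climate track performance, without an expert in the loop.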
Conclusions
• Performance variability is a function of file system climate:
  – file system architecture
  – overall system workload
  – file system configuration (default striping, etc.) and health
• No single metric predicts variation universally; many factors can affect I/O weather:
  – bandwidth contention
  – metadata op contention (GPFS)
  – file system fullness (Lustre)
• A holistic view of the storage subsystem is essential to understanding performance on complex I/O architectures
Closer to Total Knowledge
• Incorporate machine learning
  – Cluster similar I/O motifs to define I/O climates
  – Infer critical metrics to remove the expert from the loop
• Join the TOKIO effort!
  – Open source; development contributions welcome!
  – https://github.com/nersc/pytokio/
  – Support for new component-level tools being added regularly
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contracts DE-AC02-05CH11231 and DE-AC02-06CH11357 (Project: A Framework for Holistic I/O Workload Characterization, Program manager: Dr. Lucy Nowell). This research used resources and data generated from resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.