IOPin: Runtime Profiling of Parallel I/O in HPC Systems

Seong Jo (Shawn) Kim*, Seung Woo Son+, Wei-keng Liao+, Mahmut Kandemir*, Rajeev Thakur#, and Alok Choudhary+
* : Pennsylvania State University
+ : Northwestern University
# : Argonne National Laboratory

Outline
- Motivation
- Overview
- Background: IOPin
- Technical Details
- Evaluation
- Conclusion & Future Work
Parallel Data Storage Workshop 12

Motivation
- Users of HPC systems frequently find that the storage system, not the CPU, memory, or network, limits application performance.
- I/O behavior is a key factor in determining overall performance.
- Many I/O-intensive scientific applications use a parallel I/O software stack to access files at high performance.
- Understanding how the parallel I/O system operates, and the issues involved, is critically important.
- Understand I/O behavior!

Motivation (cont'd)
- Manual instrumentation for understanding I/O behavior is extremely difficult and error-prone.
- Most parallel scientific applications are expected to run on large-scale systems with 100,000 or more processors to achieve better resolution.
- Collecting and analyzing trace data from such runs is challenging and burdensome.

Our Approach
- IOPin: a dynamic performance and visualization tool
- We leverage lightweight binary instrumentation using probe mode in Pin.
  - Language-independent instrumentation for scientific applications written in C/C++ and Fortran
  - Neither source code modification nor recompilation of the application or the I/O stack components
- IOPin provides a hierarchical view of parallel I/O:
  - It associates each MPI I/O call issued by the application with its sub-calls in the PVFS layer below.
- It provides detailed I/O performance metrics for each I/O call: I/O latency at each layer, number of disk accesses, and disk throughput.
- Low overhead: ~7%

Background: Pin
- Pin is a software system that performs runtime binary instrumentation.
- Pin supports two modes of instrumentation: JIT mode and probe mode.
- JIT mode uses a just-in-time compiler to recompile the program code and insert instrumentation, while probe mode uses code trampolines (jumps) for instrumentation.
- In JIT mode, the incurred overhead ranges from 38.7% to 78% of the total execution time with 32, 64, 128, and 256 processes.
- In probe mode, the overhead is about 7%.

Overview: IOPin
- The Pin process instruments the target process.
- The Pin process on the client creates two trace log records, one for the MPI library and one for the PVFS client.
  - Fields: rank, mpi_call_id, pvfs_call_id, I/O type (write/read), latency
- The Pin process on the server produces a trace log record with server_id, latency, processed bytes, number of disk accesses, and disk throughput.
- Each log record is sent to the log manager, and the log manager identifies the process that has the maximum latency.
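
The log manager's selection step can be pictured with a minimal sketch. The record fields follow the slide; the class name `ClientLog`, the function `slowest`, the latency unit, and the sample values are illustrative assumptions, not IOPin's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientLog:
    """Client-side trace record; field names follow the slide."""
    rank: int
    mpi_call_id: int
    pvfs_call_id: int
    io_type: str      # "write" or "read"
    latency: float    # seconds (unit is an assumption)


def slowest(logs):
    """Return the record with the maximum latency (hypothetical helper)."""
    return max(logs, key=lambda rec: rec.latency)


# Made-up records for one collective write issued by three ranks.
logs = [
    ClientLog(rank=0, mpi_call_id=1, pvfs_call_id=1, io_type="write", latency=0.42),
    ClientLog(rank=1, mpi_call_id=1, pvfs_call_id=1, io_type="write", latency=0.97),
    ClientLog(rank=2, mpi_call_id=1, pvfs_call_id=1, io_type="write", latency=0.55),
]

worst = slowest(logs)
print(f"rank {worst.rank} has the maximum latency: {worst.latency} s")
```

In this sketch the manager would single out rank 1 as the process to examine further.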

High-level Technical Details
(Text recovered from a diagram of the original call flow and the Pin call flow.)

Client side (application or I/O library, MPI-IO library, PVFS client, client-side Pin process, client log manager):
- The application or I/O library calls MPI_File_write_all in the MPI-IO library.
- Pin generates trace info for MPI_File_write_all(): rank, mpi_call_id, pvfs_call_id.
- Original call flow:
    #define PVFS_sys_write(ref,req,off,buf,mem_req,creds,resp) PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp,PVFS_IO_WRITE,PVFS_HINT_NULL)
- Pin call flow: pack the trace info (rank, mpi_call_id, pvfs_call_id) into PVFS_hints and replace PVFS_HINT_NULL with PVFS_hints:
    PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp,PVFS_IO_WRITE,PVFS_hints)
- Client starting and ending points: PVFS_sys_io(…, hints)
- The client Pin sends a log to the client log manager.
- The client log manager returns a record that has the maximum latency for the I/O.
- Pin instruments the corresponding MPI process selectively.

Server side (PVFS server, server-side Pin process, server log manager):
- Server starting and ending points: io_start_flow(*smcb, …), flow_callback(*flow_d, …)
- Disk starting/ending point: trove_write_callback_fn(*user_ptr, …)
- The server Pin searches for the hints in *smcb passed from the traced process, extracts the trace info, generates a log, and sends it to the server log manager.
- The server log manager identifies and instruments the I/O server that has the maximum latency.
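
The idea of carrying trace identifiers down the stack in the hints argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the PVFS or Pin API: every Python function here (`pvfs_sys_io`, `pvfs_sys_write`, `instrumented_write`, `server_extract`) is a hypothetical stand-in, and the hints are modeled as a plain dictionary.

```python
PVFS_HINT_NULL = None  # stand-in for the real constant


def pvfs_sys_io(ref, payload, io_type, hints):
    """Stand-in for the lower layer: returns the request the server would see."""
    return {"ref": ref, "io_type": io_type, "size": len(payload), "hints": hints}


def pvfs_sys_write(ref, payload):
    """Original call flow: the write wrapper passes a null hint."""
    return pvfs_sys_io(ref, payload, "write", PVFS_HINT_NULL)


def instrumented_write(ref, payload, rank, mpi_call_id, pvfs_call_id):
    """Instrumented call flow: the null hint is replaced by packed trace info."""
    hints = {"rank": rank, "mpi_call_id": mpi_call_id, "pvfs_call_id": pvfs_call_id}
    return pvfs_sys_io(ref, payload, "write", hints)


def server_extract(request):
    """Server side: recover the trace info from the hints, if any were attached."""
    hints = request["hints"]
    if hints is None:
        return None
    return (hints["rank"], hints["mpi_call_id"], hints["pvfs_call_id"])


plain = pvfs_sys_write("file0", b"abcd")
traced = instrumented_write("file0", b"abcd", rank=3, mpi_call_id=7, pvfs_call_id=2)
print(server_extract(plain))
print(server_extract(traced))
```

Because the identifiers travel with the request itself, a server-side log can be matched to the MPI call that caused it without a separate lookup channel.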

Computation Methodology: Latency and Throughput
- For each I/O operation:
  - The I/O latency computed at each layer is the maximum of the I/O latencies from the layers below.
  - The I/O throughput computed at any layer is the sum of the I/O throughputs from the layers below.
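
The two aggregation rules above can be written out directly. The function names and the per-server numbers are made up for illustration; only the max and sum rules come from the slide.

```python
def layer_latency(latencies_below):
    """Latency at a layer: the maximum of the latencies reported below it."""
    return max(latencies_below)


def layer_throughput(throughputs_below):
    """Throughput at a layer: the sum of the throughputs reported below it."""
    return sum(throughputs_below)


# Hypothetical measurements from three I/O servers serving one client call.
server_latencies = [0.8, 1.1, 0.9]          # seconds
server_throughputs = [100.0, 120.0, 80.0]   # MB/s

print(layer_latency(server_latencies))
print(layer_throughput(server_throughputs))
```

The maximum reflects that a parallel operation finishes only when its slowest component does, while the sum reflects that the servers move data concurrently.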

Evaluation
- Hardware:
  - Breadboard cluster at Argonne National Laboratory
  - 8 quad-core processors per node: supports 32 MPI processes
  - 16 GB main memory
- I/O stack configuration:
  - Application: S3D I/O
  - PnetCDF (pnetcdf-1.2.0), mpich2-1.4, pvfs-2.8.2
- PVFS configuration:
  - 1 metadata server
  - 8 I/O servers
  - 256 MPI processes

Evaluation: S3D-IO
- S3D-IO
  - I/O kernel of the S3D application
  - S3D is a parallel turbulent combustion application using a direct numerical simulation solver developed at SNL.
- A checkpoint is performed at regular intervals.
  - At each checkpoint, four global arrays, representing the variables of mass, velocity, pressure, and temperature, are written to files.
- We maintain the block size of the partitioned X-Y-Z dimensions at 200 × 200 × 200.
- It generates three checkpoint files, 976.6 MB each.

Evaluation: Comparison of S3D I/O Execution Time (figure)

Evaluation: Detailed Execution Time of S3D I/O (figure)

Evaluation: I/O Throughput of S3D I/O (figure)

Conclusion & Future Work
- Understanding I/O behavior is one of the most important steps toward efficient execution of parallel scientific applications.
- IOPin provides dynamic instrumentation to understand I/O behavior without significantly affecting performance:
  - no source code modification or recompilation
  - a hierarchical view of the I/O call from the MPI library to the PVFS server
  - metrics: latency of each layer, number of fragmented I/O calls, number of disk accesses, I/O throughput
  - ~7% overhead
- Work is underway (1) to test IOPin at very large process counts and (2) to employ it for runtime I/O optimizations.

Questions?
