TOWARDS A UNIFIED TELEMETRY SERVICE FRAMEWORK FOR HPC ENVIRONMENTS
Ole Weidner (School of Informatics, University of Edinburgh) ole.weidner@ed.ac.uk
Adam Barker (School of Computer Science, University of St Andrews) adam.barker@st-andrews.ac.uk
Malcolm Atkinson (School of Informatics, University of Edinburgh) malcolm.atkinson@ed.ac.uk
International Workshop on Runtime and Operating Systems for Supercomputers
Washington, D.C., USA, June 27, 2017
OUTLINE
1. Application Challenges and Motivation
2. Telemetry as HPC Platform Service
3. Context Graph Model
4. Interaction and Interface
5. Prototype
6. Discussion
DEFINITION
HPC Telemetry Data: any data that describes the state of an HPC platform and the state of the process-based representation of the applications running on it.
1 APPLICATION CHALLENGES & MOTIVATION
A NORMAL DAY AT THE OFFICE
Strange runtime distribution of homogeneous tasks
FINDING THE CULPRIT
Added logging to the application to understand where time is spent
Some tasks spent 10x longer downloading the input dataset
A faulty edge switch caused external connectivity issues on some nodes
Introduced helper tasks that collect process-level metrics (see the sketch below)
Some tasks spent a huge amount of time in I/O wait
A strange problem with Lustre caused slow filesystem I/O on a small set of nodes
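A minimal sketch of what such a metric-collecting helper task might look like, assuming the psutil Python library; the actual instrumentation used here is not described in the talk, and the metric names and output format are illustrative.

```python
# Sketch of a helper task that samples process-level metrics for one task.
# Assumes psutil is available; metric names and output format are illustrative.
import json
import time

import psutil


def sample_process(pid, interval=5.0):
    """Periodically record CPU I/O wait and memory usage for one task process."""
    proc = psutil.Process(pid)
    while proc.is_running():
        cpu = psutil.cpu_times_percent(interval=interval)  # blocks for `interval`
        record = {
            "timestamp": time.time(),
            "pid": pid,
            "cpu_iowait_pct": getattr(cpu, "iowait", 0.0),  # Linux only
            "rss_bytes": proc.memory_info().rss,
            "num_threads": proc.num_threads(),
        }
        print(json.dumps(record))  # or append to a per-task log file
```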
ANOTHER INTERESTING CASE
Again, an unexpected runtime distribution of supposedly homogeneous simulation tasks
FINDING THE CULPRIT
Used the same instrumentation strategy
Outlier tasks ran out of memory and stalled
Specific structural properties of the input data caused the algorithm to take a different trajectory
CONSEQUENCES
We encountered unexpected "dynamic behavior", both on the system and on the application side
Knowing that these are not edge cases, we made our "debugging" approach a more integral part of the application framework:
Collecting process- and OS-level information during all runs
Applying simple adaptive strategies to mitigate issues at runtime (sketched below):
Blacklisting 'weird' nodes
Reducing task packing (preempting other tasks on the node) when memory usage exceeds a threshold
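The adaptive strategies above could be wired up roughly as follows. This is a hypothetical sketch: the scheduler hooks (avoid_node, preempt_one_task), metric names, and thresholds are assumptions rather than part of the framework described here.

```python
# Hypothetical sketch of the runtime mitigation strategies listed above.
# Metric names, thresholds and scheduler hooks are illustrative assumptions.
MEM_THRESHOLD = 0.9   # fraction of node memory considered critical
blacklist = set()


def on_node_metrics(node, metrics, scheduler):
    """React to one periodic metrics sample for a compute node."""
    # Blacklist nodes that show anomalous I/O or connectivity behaviour.
    if metrics["io_wait_pct"] > 50.0 or metrics["net_errors"] > 0:
        blacklist.add(node)
        scheduler.avoid_node(node)              # assumed scheduler hook
    # Reduce task packing when memory usage exceeds the threshold.
    if metrics["mem_used_frac"] > MEM_THRESHOLD:
        scheduler.preempt_one_task(node)        # assumed scheduler hook
```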
EXPERIENCE & LESSONS LEARNED
Instrumentation requires a lot of effort
Collecting and analysing data (at scale) is non-trivial
Interpreting and feeding the data back to the application is difficult
Existing tooling is sparse and mostly geared toward post-mortem, parallel code debugging
Without knowing and understanding the platform "anatomy" and context, data can be difficult to interpret, e.g., what is considered "poor" I/O? What is the spatial layout of processes across nodes?
EXPERIENCE & LESSONS LEARNED CONT.
Application-specific instrumentation is a widespread technique to mitigate heterogeneity, dynamic behavior, etc.
Addressing the issue is expensive, but ignoring it can be expensive, too.
2 TELEMETRY AS HPC PLATFORM SERVICE
STATUS QUO: APPLICATION-DRIVEN
Application-level collection and processing of telemetry data can cause a lot of overhead.
PLATFORM SERVICE APPROACH
A telemetry service takes over data collection and provides data access and higher-level functions to applications.
REQUIREMENTS
Captures the time-variant physical anatomy and properties of applications
Captures the time-variant anatomy and properties of the HPC platform
Describes the mapping between the two (context!)
Allows for arbitrary levels of detail
Provides programmatic access to the data
Allows offloading data analytics, e.g., extracting trends from streams of raw data (see the sketch after this list)
Has notification capabilities
REQUIREMENTS CONT.
Keeps historic data (possibly in condensed form)
Is deployable at scale (think exascale!)
Is consistent across platforms
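To illustrate the "offloading data analytics" requirement above, here is a minimal sketch of a server-side derived metric that condenses a stream of raw samples into a trend; the function names and the registration idea are assumptions, not the service's actual API.

```python
# Illustrative derived-metric function: condenses a raw sample stream into a
# (mean, slope) trend over a sliding window. Names are assumptions.
from collections import deque


def make_trend_metric(window=60):
    samples = deque(maxlen=window)

    def update(value):
        """Feed one raw sample; return (mean, simple slope) over the window."""
        samples.append(value)
        mean = sum(samples) / len(samples)
        slope = (samples[-1] - samples[0]) / max(len(samples) - 1, 1)
        return mean, slope

    return update


# e.g. registered with the telemetry service so that only the condensed trend,
# not every raw sample, is delivered to the application
iowait_trend = make_trend_metric(window=120)
```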
3 CONTEXT GRAPH MODEL
GRAPH-BASED MODEL
Provides the context in which time series can be embedded
We use attributed graphs to describe entities and their relationships
Graphs provide an intuitive way to model arbitrary levels of complexity
A single context graph (CG) captures the connections between the platform anatomy (sub-)graph (PAG) and the application anatomy (sub-)graphs (AAG)
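A minimal sketch of such a context graph, using networkx as a stand-in attributed-graph library (the talk does not prescribe an implementation); node names, attributes, and relation labels are illustrative.

```python
# Sketch of a context graph (CG) combining a platform anatomy graph (PAG)
# and an application anatomy graph (AAG), using networkx for illustration.
import networkx as nx

cg = nx.DiGraph()

# Platform anatomy: a compute node and one of its filesystems.
cg.add_node("node042", kind="compute_node", cores=24, mem_gb=128)
cg.add_node("lustre:/scratch", kind="filesystem")
cg.add_edge("node042", "lustre:/scratch", relation="mounts")

# Application anatomy: a task and the process it spawned.
cg.add_node("task:17", kind="task", application="sim-run-3")
cg.add_node("pid:90211", kind="process", cpu_iowait=42.0)
cg.add_edge("task:17", "pid:90211", relation="spawned")

# The mapping between the two sub-graphs provides the context.
cg.add_edge("pid:90211", "node042", relation="runs_on")
```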
SPATIAL-TEMPORAL DYNAMICS
The anatomy and structure of the platform and applications are not static:
Application processes start and stop
Nodes appear and disappear
Hardware (e.g., GPUs or FPGAs) is added
...
All nodes and edges have timestamps that qualify their existence
To get a snapshot of the platform and applications at a specific point in time, the graph can be queried for a specific time or time range
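Building on the sketch above, a snapshot query could look roughly like this; the valid_from / valid_to attribute names are assumptions used only for illustration.

```python
# Sketch of the temporal qualification described above: every node and edge
# carries validity timestamps, and a snapshot is obtained by filtering.
def snapshot(graph, t):
    """Return the sub-graph of nodes and edges that existed at time t."""
    alive = [
        n for n, d in graph.nodes(data=True)
        if d.get("valid_from", 0) <= t <= d.get("valid_to", float("inf"))
    ]
    view = graph.subgraph(alive).copy()
    view.remove_edges_from([
        (u, v) for u, v, d in view.edges(data=True)
        if not (d.get("valid_from", 0) <= t <= d.get("valid_to", float("inf")))
    ])
    return view
```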
4 INTERACTION AND INTERFACE
USER- / APPLICATION-FACING API
A language-agnostic HTTP/REST API allows applications to:
Explore / traverse the context graph
Register simple "server-side" "derived metrics" functions
Define and register callbacks (WebSockets)
GraphQL for complex graph queries:

    {
      process(id: 1) {
        siblings {
          processes {
            cpu_iowait
            memory_uses
          }
        }
      }
    }
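A hypothetical client-side interaction with such an API might look like the following; the endpoint URL and response layout are assumptions, and only the GraphQL query mirrors the example above.

```python
# Hypothetical client-side use of the HTTP/REST + GraphQL interface.
# The endpoint URL and response layout are assumptions.
import requests

QUERY = """
{
  process(id: 1) {
    siblings {
      processes {
        cpu_iowait
        memory_uses
      }
    }
  }
}
"""

resp = requests.post("http://telemetry.local/graphql",   # assumed endpoint
                     json={"query": QUERY}, timeout=10)
resp.raise_for_status()
for proc in resp.json()["data"]["process"]["siblings"]["processes"]:
    print(proc["cpu_iowait"], proc["memory_uses"])
```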
5 PROTOTYPE
SYSTEM COMPONENTS
6 DISCUSSION
This is how we envision an ideal system from the application developer's / user's perspective
THANK YOU
Slides available online: https://oweidner.github.io/ross-2017-talk