Efficient Monitoring and Root Cause Analysis in Complex Systems

Efficient Monitoring and Root Cause Analysis in Complex Systems
Witek Bedyk

Agenda
● Benefits of robust monitoring
● Measurements vs. Alarms
● Importance of Alarms Correlation
● Effective Alerting
● Self-healing

Why is Monitoring useful?
● Improve system / application uptime
● Reduce administration burden
● Resource optimization
● Prevent bottlenecks
● Make use of collected data (e.g. billing)

Use Case
Customer escalation: “We have cloud outage! Keystone is flapping up and down continuously and many requests get 503 service unavailable error.”

Healthcheck
Simple up-or-down checks on the HTTP endpoints of services.
● http_status [0, 1]
● http_response_time
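A minimal sketch of such a check, using only the Python standard library. The function name `http_check` and the injectable `opener` parameter are hypothetical. The convention that 0 means up and 1 means down is an assumption inferred from the alarm expression `http_status > 0` on the Alarm definitions slide.

```python
import time
import urllib.error
import urllib.request


def http_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe one endpoint and return the two healthcheck metrics.

    Assumed convention: http_status is 0 when the endpoint is up, 1 when down.
    `opener` is injectable so the check can be exercised without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        up = False
    elapsed = time.monotonic() - start
    return {"http_status": 0 if up else 1, "http_response_time": elapsed}
```

It could be called as `http_check("http://keystone.example.com:5000/v3")`, where the URL is a placeholder rather than an endpoint from the talk.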

Metrics
● Metrics measure and report on quantifiable data from your system
● cpu, memory, network, filesystem, disk IO
● Services
  ○ MySQL, RabbitMQ, Apache, MemcacheD, etc.
● LibVirt, Open vSwitch
● Applications:
  ○ StatsD, Prometheus
● Custom checks

Dimensions
● Dimensions are a dictionary of key-value pairs used to describe metrics.
● hostname
● service
● component
● url
● device
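As an illustration, a metric sample can carry its dimensions as a plain dictionary, which makes it easy to select samples by any combination of keys. The `Measurement` class, the `select` helper, and the host names below are hypothetical and do not come from any monitoring library.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    """One metric sample; the field names are illustrative only."""
    name: str
    value: float
    dimensions: dict = field(default_factory=dict)


def select(measurements, name, **wanted):
    """Return samples of `name` whose dimensions contain all `wanted` pairs."""
    return [
        m for m in measurements
        if m.name == name
        and all(m.dimensions.get(k) == v for k, v in wanted.items())
    ]


samples = [
    Measurement("http_status", 0, {"hostname": "node-a", "service": "keystone"}),
    Measurement("http_status", 1, {"hostname": "node-b", "service": "keystone"}),
    Measurement("cpu.idle_perc", 85.0, {"hostname": "node-a"}),
]

# Only the http_status sample reported by node-a.
node_a = select(samples, "http_status", hostname="node-a")
```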

Transaction-level vs. System-level metrics
● Transaction-level: end user perspective
  ○ Is Horizon working correctly?
● System-level: administrator perspective
  ○ Reveals failures of service components

Dependencies
[Diagram: Apache, MemcacheD, Keystone, MySQL]

Gathered metrics
● http_status
● http_response_time
● apache.net.hits
● apache.performance.idle_worker_count
● mysql.performance.open_files
● mysql.net.connections
● memcache.curr_connections
● memcache.get_misses_rate
● process.cpu_perc
● process.open_file_descriptors

Dashboards

Alarms
An alarm signals that the status of the system or a resource meets criteria indicating that an action is required.

Alarm definitions
● Alarm definitions are templates specifying how alarms should be created.
● grouping: http_status > 0, match_by: ["service", "component", "hostname", "url"]
● filtering: avg(cpu.idle_perc{service=monitoring}) < 20
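The two expressions on this slide can be read as follows: grouping creates one alarm per unique combination of the `match_by` dimension values, while filtering restricts the samples that enter an aggregate such as `avg`. The sketch below illustrates that reading in plain Python. The evaluation semantics, helper names, and sample data are assumptions for illustration, not the behaviour of a specific alarm engine.

```python
from collections import defaultdict
from statistics import mean


def group_by(measurements, match_by):
    """Group (dimensions, value) samples by their match_by dimension values."""
    groups = defaultdict(list)
    for dims, value in measurements:
        key = tuple(dims.get(d) for d in match_by)
        groups[key].append(value)
    return dict(groups)


def grouped_alarms(measurements, match_by, threshold=0):
    """One state per group: ALARM if the latest value exceeds the threshold."""
    return {
        key: "ALARM" if values[-1] > threshold else "OK"
        for key, values in group_by(measurements, match_by).items()
    }


def filtered_avg_below(measurements, dim_filter, limit):
    """Average only samples matching dim_filter, then compare to the limit."""
    values = [
        value for dims, value in measurements
        if all(dims.get(k) == v for k, v in dim_filter.items())
    ]
    return bool(values) and mean(values) < limit


# Hypothetical samples: (dimensions, value)
http_status = [
    ({"service": "keystone", "component": "keystone-api",
      "hostname": "node-a", "url": "http://node-a:5000"}, 1),
    ({"service": "keystone", "component": "keystone-api",
      "hostname": "node-b", "url": "http://node-b:5000"}, 0),
]
cpu_idle = [
    ({"service": "monitoring", "hostname": "node-a"}, 10.0),
    ({"service": "monitoring", "hostname": "node-b"}, 20.0),
    ({"service": "compute", "hostname": "node-c"}, 90.0),
]

# http_status > 0, one alarm per (service, component, hostname, url)
states = grouped_alarms(http_status, ["service", "component", "hostname", "url"])

# avg(cpu.idle_perc{service=monitoring}) < 20
cpu_alarm = filtered_avg_below(cpu_idle, {"service": "monitoring"}, 20)
```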

Use case (alarms)
● MemcacheD number of connections is high on node A.
● MemcacheD hit rate is low on node A.
● Keystone API is down on node A.
● Keystone API is up on node A.
● Keystone API is down on node A.
● Keystone API is up on node A.
● Keystone API is down on node A.
● Keystone API is up on node A.
● Keystone API is up on node A.
● Keystone API is down on node A.

Alarms correlation
● “80% of the mean time to repair is wasted on trying to locate the issue” (Gartner)
● Remove noise from the environment
● Alerts should be:
  ○ meaningful
  ○ actionable
  ○ indicate the point of failure
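One simple way to remove noise is to collapse a stream of alternating up/down notifications for the same resource into a single "flapping" alert. The sketch below is an illustration only and not a feature of any tool named in this talk. The function names and the threshold of three state changes are assumptions. The input replays the Keystone API notifications from the use case.

```python
def count_transitions(states):
    """Count how often two consecutive states differ."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def summarize(events, flap_threshold=3):
    """Collapse per-resource up/down events into at most one alert each.

    events: list of (resource, state) tuples in arrival order.
    """
    history = {}
    for resource, state in events:
        history.setdefault(resource, []).append(state)

    alerts = []
    for resource, states in history.items():
        transitions = count_transitions(states)
        if transitions >= flap_threshold:
            alerts.append(f"{resource} is flapping ({transitions} state changes)")
        elif states[-1] == "down":
            alerts.append(f"{resource} is down")
    return alerts


keystone = "Keystone API on node A"
events = [(keystone, s) for s in
          ["down", "up", "down", "up", "down", "up", "up", "down"]]

alerts = summarize(events)
# alerts == ["Keystone API on node A is flapping (6 state changes)"]
```

Eight notifications become one actionable alert; identifying the point of failure still requires the dependency information discussed on the following slides.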

Vitrage
● OpenStack Root Cause Analysis service
● organize alarms
  ○ define relationships between alarms
  ○ represent as an entity graph
● analyze
  ○ represent system health
● find root cause
  ○ graphical visualization

Dependencies
[Diagram: Apache, MemcacheD, Keystone, MySQL]

Dependencies
[Diagram: MemcacheD, Keystone instances, Keystone cluster]
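With dependencies represented as a graph, a root cause candidate is an alarmed entity that does not depend, directly or indirectly, on any other alarmed entity. The sketch below assumes the direction Keystone cluster → Keystone instance → MemcacheD, which is an interpretation of the diagram. The entity names and the `root_causes` function are hypothetical, and this is not how Vitrage itself is implemented.

```python
def root_causes(depends_on, alarmed):
    """Return alarmed entities none of whose dependencies are alarmed.

    depends_on maps an entity to the entities it relies on.
    """
    def has_alarmed_dependency(entity, seen):
        for dep in depends_on.get(entity, ()):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alarmed or has_alarmed_dependency(dep, seen):
                return True
        return False

    return sorted(e for e in alarmed if not has_alarmed_dependency(e, {e}))


# Assumed dependency direction: cluster -> instance -> memcached
depends_on = {
    "keystone-cluster": ["keystone-node-a"],
    "keystone-node-a": ["memcached-node-a"],
}
alarmed = {"keystone-cluster", "keystone-node-a", "memcached-node-a"}

candidates = root_causes(depends_on, alarmed)
# candidates == ["memcached-node-a"]
```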

Monitor Analyze Plan Execute (MAPE)
[Diagram: the Monitor, Analyze, Plan, Execute loop, connected to the Managed Resource through Sensors and Effectors]
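One iteration of the loop can be sketched as four small functions wired between a sensor and an effector. Everything specific here is an assumption for illustration: the connection threshold, the symptom name, and the remediation "restart_memcached" are not taken from the talk.

```python
def monitor(sensor):
    """Monitor: collect the latest readings from the managed resource."""
    return sensor()


def analyze(readings, max_connections=1000):
    """Analyze: turn raw readings into symptoms (threshold is assumed)."""
    symptoms = []
    if readings.get("memcache.curr_connections", 0) > max_connections:
        symptoms.append("memcached_connections_high")
    return symptoms


def plan(symptoms):
    """Plan: map symptoms to remediation actions (mapping is assumed)."""
    actions = []
    if "memcached_connections_high" in symptoms:
        actions.append("restart_memcached")
    return actions


def execute(actions, effector):
    """Execute: hand each planned action to the effector."""
    for action in actions:
        effector(action)


def mape_iteration(sensor, effector):
    """Run Monitor, Analyze, Plan, Execute once and return the actions."""
    readings = monitor(sensor)
    actions = plan(analyze(readings))
    execute(actions, effector)
    return actions


performed = []  # stands in for a real effector
mape_iteration(lambda: {"memcache.curr_connections": 5000}, performed.append)
# performed == ["restart_memcached"]
```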

Vitrage Templates
● Vitrage Templates are used to express Condition → Action scenarios.
● if <condition> then raise deduced alarm
● if <condition> then set deduced state
● if <condition> then add causal relationship (used for RCA capability)
● if <condition> then execute Mistral workflow
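The Condition → Action idea can be illustrated with a small rule table in Python. This is not Vitrage template syntax; the alarm names, the deduced alarm "keystone_degraded", and all function names are hypothetical.

```python
def memcached_overloaded(alarms):
    """Condition: the MemcacheD connection alarm is active."""
    return "memcached_connections_high" in alarms


def keystone_down_with_memcached(alarms):
    """Condition: both the MemcacheD and the Keystone alarms are active."""
    return {"memcached_connections_high", "keystone_api_down"} <= alarms


def raise_deduced_alarm(state):
    """Action: raise a deduced alarm on the dependent service."""
    state["deduced_alarms"].append("keystone_degraded")


def add_causal_relationship(state):
    """Action: record (cause, effect) for root cause analysis."""
    state["causes"].append(("memcached_connections_high", "keystone_api_down"))


RULES = [
    (memcached_overloaded, raise_deduced_alarm),
    (keystone_down_with_memcached, add_causal_relationship),
]


def apply_rules(rules, alarms):
    """Evaluate every rule once against the set of active alarms."""
    state = {"deduced_alarms": [], "causes": []}
    for condition, action in rules:
        if condition(alarms):
            action(state)
    return state


result = apply_rules(RULES, {"memcached_connections_high", "keystone_api_down"})
```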

Self-healing
[Diagram: MemcacheD, Keystone instances, Keystone cluster]

OpenStack Healthcheck APIs
● more detailed checks would be useful for most OpenStack services
● common middleware should get implemented in Oslo
● existing old effort:
  ○ https://storyboard.openstack.org/#!/story/2001439
  ○ https://review.opendev.org/617924

Summary
● Robust monitoring is essential
● Measurements vs. Alarms
● Importance of Alarms Correlation
● Self-healing

Thank You 谢谢 Questions and Answers
