Migrating from Nagios to Prometheus
Nov 07, 2019
Runtastic Infrastructure
● Base: Linux (Ubuntu), Hybrid SDN (Cisco)
● Virtualization: Linux KVM, OpenNebula, physical
● Core DBs: really a lot, open source
● Technologies: Chef, Terraform
● Big: 3600 CPU Cores, 20 TB Memory, 100 TB Storage
Our Monitoring back in 2017...
● Nagios
  ○ Many Checks for all Servers
  ○ Checks for NewRelic
● Pingdom
  ○ External HTTP Checks
  ○ Specific Nagios Alerts
  ○ Alerting via SMS
● NewRelic
  ○ Error Rate
  ○ Response Time
Configuration hell…
Alert overflow…
Goals for our new Monitoring system
● Make On Call as comfortable as possible
● Automate as much as possible
● Make use of graphs
● Rework our alerting
● Make it scalable!
Starting with Prometheus...
Prometheus
Our Prometheus Setup
● 2x Bare Metal
● 8 Core CPU
● Ubuntu Linux
● 7.5 TB of Storage
● 7 months of Retention time
● Internal TSDB
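The retention window is set with launch flags rather than in prometheus.yml; as a minimal sketch of what such a setup implies, assuming Prometheus 2.x and hypothetical paths (not taken from the talk):

  # Hypothetical launch flags for ~7 months of retention (paths assumed)
  /usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/data/prometheus \
    --storage.tsdb.retention.time=210d \
    --web.enable-lifecycle   # allows reloading config via POST /-/reload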
Automation
Our Goals for Automation
● Roll out Exporters on new servers automatically
  ○ using Chef
● Use Service Discovery in Prometheus
  ○ using Consul
● Add HTTP Healthcheck for a new Microservice
  ○ using Terraform
● Add Silences with 30d duration
  ○ using Terraform
Consul
● Consul for our Terraform State
● Agent Rollout via Chef
● One Service definition per Exporter on each Server
What Labels do we need?
● What's the Load of all workers of our Newsfeed service?
  node_load1{service="newsfeed", role="workers"}
● What's the Load of a specific Leaderboard server?
  node_load1{hostname="prd-leaderboard-server-001"}
...and how we implemented them in Consul

  {
    "service": {
      "name": "prd-sharing-server-001-mongodbexporter",
      "tags": [
        "prometheus",
        "role:trinidad",
        "service:sharing",
        "exporter:mongodb"
      ],
      "port": 9216
    }
  }
Scrape Configuration

  - job_name: prd
    consul_sd_configs:
      - server: 'prd-consul:8500'
        token: 'ourconsultoken'
        datacenter: 'lnz'
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep
      - source_labels: [__meta_consul_node]
        target_label: hostname
      - source_labels: [__meta_consul_tags]
        regex: .*,service:([^,]+),.*
        replacement: '${1}'
        target_label: service
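To make the relabeling concrete: Prometheus joins the Consul tags with commas and pads them with a leading and trailing comma, which is why the regexes match ,prometheus, and ,service:…, . For the sharing-server service definition shown above, the keep rule matches and a scraped series ends up with labels roughly like this (metric name illustrative):

  mongodb_up{hostname="prd-sharing-server-001", service="sharing"}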
External Health Checks
● 3x Blackbox Exporters
● Accessing SSL Endpoints
● Checks for
  ○ HTTP Response Code
  ○ SSL Certificate
  ○ Duration
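Each of those checks maps to a blackbox exporter metric (probe_http_status_code, probe_ssl_earliest_cert_expiry, probe_duration_seconds). As one hedged illustration, not from the talk, a certificate-expiry rule in the same style as the alerting rules later in this deck could look like:

  - alert: SSLCertExpiresSoon
    # probe_ssl_earliest_cert_expiry is a Unix timestamp; warn two weeks ahead
    expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
    for: 10m
    labels:
      priority: "P3"
    annotations:
      title: "SSL certificate for {{ $labels.instance }} expires soon"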
Add Healthcheck via Terraform

  resource "consul_service" "health_check" {
    name = "${var.srv_name}-healthcheck"
    node = "blackbox_aws"

    tags = [
      "healthcheck",
      "url:https://status.runtastic.com/${var.srv_name}",
      "service:${var.srv_name}",
    ]
  }
Job Config for Blackbox Exporters

  - job_name: blackbox_aws
    metrics_path: /probe
    params:
      module: [http_health_monitor]
    consul_sd_configs:
      - server: 'prd-consul:8500'
        token: 'ourconsultoken'
        datacenter: 'lnz'
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,healthcheck,.*
        action: keep
      - source_labels: [__meta_consul_tags]
        regex: .*,url:([^,]+),.*
        replacement: '${1}'
        target_label: __param_target
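The http_health_monitor module referenced in params is defined in the blackbox exporter's own config file, which the slides don't show; a minimal sketch with assumed values (only the module name comes from the slides):

  # blackbox.yml (contents assumed)
  modules:
    http_health_monitor:
      prober: http
      timeout: 10s
      http:
        method: GET
        valid_status_codes: [200]
        fail_if_not_ssl: true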
Add Silence via Terraform

  resource "null_resource" "prometheus_silence" {
    provisioner "local-exec" {
      command = <<EOF
${var.amtool_path} silence add 'service=~SERVICENAME' \
  --duration='30d' \
  --comment='Silence for the newly deployed service' \
  --alertmanager.url='http://prd-alertmanager:9093'
EOF
    }
  }
OpsGenie
Our Initial Alerting Plan
● Alerts with Low Priority
  ○ Slack Integration
● Alerts with High Priority (OnCall)
  ○ Slack Integration
  ○ OpsGenie
...why not forward all Alerts to OpsGenie?
Define OpsGenie Alert Routing
● Prometheus OnCall Integration
  ○ High Priority Alerts (e.g. Service DOWN)
  ○ Call the poor On Call Person
  ○ Post Alerts to Slack #topic-alerts
● Prometheus Ops Integration
  ○ Low Priority Alerts (e.g. Chef-Client failed runs)
  ○ Disable Notifications
  ○ Post Alerts to Slack #prometheus-alerts
Setup Alertmanager Config

  - receiver: 'opsgenie_oncall'
    group_wait: 10s
    group_by: ['...']
    match:
      oncall: 'true'

  - receiver: 'opsgenie'
    group_by: ['...']
    group_wait: 10s
...and its receivers

  - name: "opsgenie_oncall"
    opsgenie_configs:
      - api_url: "https://api.eu.opsgenie.com/"
        api_key: "ourapitoken"
        priority: "{{ range .Alerts }}{{ .Labels.priority }}{{ end }}"
        message: "{{ range .Alerts }}{{ .Annotations.title }}{{ end }}"
        description: "{{ range .Alerts }}\n{{ .Annotations.summary }}\n\n{{ if ne .Annotations.dashboard \"\" -}}\nDashboard:\n{{ .Annotations.dashboard }}\n{{- end }}{{ end }}"
        tags: "{{ range .Alerts }}{{ .Annotations.instance }}{{ end }}"
Why we use group_by: ['...']
● Grouping by all labels ('...') effectively disables grouping, so every alert is delivered individually
● Alert deduplication is handled by OpsGenie instead of the Alertmanager
● Grouped alerts are easy to overlook
Example Alerting Rule for On Call

  - alert: HTTPProbeFailedMajor
    expr: max by(instance, service)(probe_success) < 1
    for: 1m
    labels:
      oncall: "true"
      priority: "P1"
    annotations:
      title: "{{ $labels.service }} DOWN"
      summary: "HTTP Probe for {{ $labels.service }} FAILED.\nHealth Check URL: {{ $labels.instance }}"
Example Alerting Rule with Low Priority

  - alert: MongoDB-ScannedObjects
    expr: max by(hostname, service)(rate(mongodb_mongod_metrics_query_executor_total[30m])) > 500000
    for: 1m
    labels:
      priority: "P3"
    annotations:
      title: "MongoDB - Scanned Objects detected on {{ $labels.service }}"
      summary: "High value of scanned objects on {{ $labels.hostname }} for service {{ $labels.service }}"
      dashboard: "https://prd-prometheus.runtastic.com/d/oCziI1Wmk/mongodb"
Alert Management via Slack
Setting up the Heartbeat
The rule below always fires (vector(1) is constantly 1), which makes it a dead man's switch: as long as OpsGenie keeps receiving the resulting pings, the whole Prometheus/Alertmanager pipeline is known to be alive.

  groups:
    - name: opsgenie.rules
      rules:
        - alert: OpsGenieHeartBeat
          expr: vector(1)
          for: 5m
          labels:
            heartbeat: "true"
          annotations:
            summary: "Heartbeat for OpsGenie"
...and its Alertmanager Configuration

  # route
  - receiver: 'opsgenie_heartbeat'
    repeat_interval: 5m
    group_wait: 10s
    match:
      heartbeat: 'true'

  # receiver: every repeat pings the OpsGenie heartbeat endpoint
  - name: "opsgenie_heartbeat"
    webhook_configs:
      - url: 'https://api.eu.opsgenie.com/v2/heartbeats/prd_prometheus/ping'
        send_resolved: false
        http_config:
          basic_auth:
            password: "opsgenieAPIkey"
CI/CD Pipeline
Goals for our Pipeline
● Put all Alerting and Recording Rules into a Git Repository
● Automatically test for syntax errors
● Deploy master branch on all Prometheus servers
● Merge to master -> Deploy on Prometheus
How it works
● Jenkins
  ○ running promtool against each .yml file (sketched below)
● Bitbucket sending HTTP calls when the master branch changes
● Ruby-based HTTP Handler on the Prometheus Servers
  ○ Accepting HTTP calls from Bitbucket
  ○ Git pull
  ○ Prometheus reload
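A minimal sketch of those two steps, with repository layout, paths, and ports assumed (the actual Jenkins job and Ruby handler aren't shown in the talk):

  # Jenkins: lint every rule file before it can reach master
  for f in rules/*.yml; do
    promtool check rules "$f" || exit 1
  done

  # What the handler on each Prometheus server effectively runs on a push
  git -C /etc/prometheus/rules pull
  curl -X POST http://localhost:9090/-/reload   # needs --web.enable-lifecycle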
Verify Builds for each Branch
THANK YOU
Niko Dominkowitsch
Infrastructure Engineer
niko.dominkowitsch@runtastic.com
runtastic.com