PromCon 2019: Fun and profit with Alertmanager
Simon Pasquier (@SimonHiker), November 7, 2019

Who am I?
● Software engineer working at Red Hat
● Alertmanager & consul_exporter maintainer

Alerting craft

Guidelines
● Think about which labels to propagate.
● “Complex” alerts can be harmful.
● Spend some time learning the template language.

When will I be notified that something’s broken?

expr: foo > 0
for: 2m

Timeline:
● foo.set(1)
● scrape (Prometheus), within the scrape interval
● evaluation (pending), within the evaluation interval
● evaluation (pending)
● evaluation (firing), at least 2m after the alert became pending
● notification (Alertmanager), after group_wait

Time to notification: scrape interval + evaluation interval + for + group_wait
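The sum on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the interval values are example settings, not values taken from the talk. It treats the sum as a rough upper bound on the time between foo.set(1) and the first notification.

```python
from datetime import timedelta


def time_to_notification(scrape_interval, evaluation_interval, for_duration, group_wait):
    """Rough upper bound on the delay before the first notification.

    Follows the slide's formula:
    scrape interval + evaluation interval + for + group_wait.
    """
    return scrape_interval + evaluation_interval + for_duration + group_wait


# Example settings (assumed for illustration only).
delay = time_to_notification(
    scrape_interval=timedelta(seconds=30),
    evaluation_interval=timedelta(seconds=30),
    for_duration=timedelta(minutes=2),  # "for: 2m" from the slide
    group_wait=timedelta(seconds=30),
)
print(delay)  # 0:03:30
```

With these example values, a "for: 2m" alert can take about three and a half minutes to reach a receiver, which is worth keeping in mind when choosing the "for" duration.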

Things to know
● Use “for” to avoid flapping alerts.
● group_interval for subsequent updates (including resolution).
● repeat_interval for reminders.
  ○ `--data.retention` flag (#1806).

Routing

Guidelines
● Keep it simple.
● First-level routes to match services/teams.
● Use amtool or the routing tree editor to test/validate.

To continue or not?
Fictional scenario:
● All notifications should go to Slack.
● Alerts with job=app should email the app team.
  ○ severity=critical should page the app team too.
● Alerts with severity=critical should page the ops team.
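One way to reason about the scenario above is to simulate how a routing tree is walked. The following is a minimal sketch, assuming the usual Alertmanager semantics: an alert descends into every matching child route, stops at the first matching child unless that child sets continue, and is handled by the parent only when no child matches. The Route class, the flat tree layout, and the receiver names (slack, app-email, app-pager, ops-pager) are hypothetical and chosen for illustration; this is one possible layout, not the configuration shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)  # label name -> required value
    continue_: bool = False                    # mirrors the "continue" setting
    routes: list = field(default_factory=list)

    def resolve(self, labels):
        """Return the receivers that would be notified for an alert's labels."""
        if any(labels.get(name) != value for name, value in self.match.items()):
            return []
        receivers = []
        for child in self.routes:
            hits = child.resolve(labels)
            receivers.extend(hits)
            # Stop at the first matching child unless it asks to continue.
            if hits and not child.continue_:
                break
        # If no child matched, this route handles the alert itself.
        return receivers or [self.receiver]


# Hypothetical flat tree for the fictional scenario.
root = Route("slack", routes=[
    Route("slack", continue_=True),
    Route("app-email", {"job": "app"}, continue_=True),
    Route("app-pager", {"job": "app", "severity": "critical"}, continue_=True),
    Route("ops-pager", {"severity": "critical"}),
])

print(root.resolve({"job": "app", "severity": "critical"}))
# ['slack', 'app-email', 'app-pager', 'ops-pager']
print(root.resolve({"job": "db", "severity": "warning"}))
# ['slack']
```

Removing continue_=True from any of the first three routes would stop matching alerts there, so later routes (for example the ops pager) would never see them. That is the kind of mistake amtool or the routing tree editor can catch before it reaches production.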

Silences, inhibitions, oh my!

Inhibition rule

Gotchas
● Pick the appropriate silence duration (#1639).
● Corner cases with incident management systems.
● Inhibiting alerts can’t inhibit themselves (#666).
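The last gotcha is easier to see with a small model of inhibition. The sketch below assumes the behaviour described in the Alertmanager documentation: a firing alert matching the source side mutes alerts matching the target side when the "equal" labels agree, and an alert matching both sides cannot be inhibited by another alert that also matches both sides (including itself). The rule, alert labels, and function names are hypothetical examples, and matchers are simplified to exact label equality.

```python
def matches(labels, matchers):
    """True if every matcher label is present with the expected value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by any firing alert under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    target_is_also_source = matches(target, rule["source_match"])
    for source in firing_alerts:
        if not matches(source, rule["source_match"]):
            continue
        # Labels listed in "equal" must agree; a missing label counts as empty.
        if any(source.get(l, "") != target.get(l, "") for l in rule["equal"]):
            continue
        # An alert on both sides of the rule is not muted by an alert
        # that is also on both sides (this covers self-inhibition).
        if target_is_also_source and matches(source, rule["target_match"]):
            continue
        return True
    return False


# Hypothetical rule: critical alerts mute job=app alerts in the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"job": "app"},
    "equal": ["cluster"],
}
node_down = {"alertname": "NodeDown", "job": "node", "severity": "critical", "cluster": "eu"}
app_slow = {"alertname": "AppSlow", "job": "app", "severity": "warning", "cluster": "eu"}
app_down = {"alertname": "AppDown", "job": "app", "severity": "critical", "cluster": "eu"}

print(is_inhibited(app_slow, [node_down, app_slow], rule))  # True
print(is_inhibited(app_down, [app_down], rule))             # False
```

Here app_down matches both sides of the rule, so on its own it keeps notifying; it is only muted once an alert that matches the source side alone, such as node_down, is firing in the same cluster.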

High availability
● Broadcast silences and notification logs.
● Based on the hashicorp/memberlist library.
● Requires a dedicated TCP/UDP port.
  ○ UDP for small messages (⩽ 700 bytes)
  ○ TCP otherwise

alertmanager-0: --cluster.peer=""

alertmanager-1: --cluster.peer=alertmanager-0:9094

alertmanager-2: --cluster.peer=alertmanager-0:9094 --cluster.peer=alertmanager-0:9094

Peer positions (diagram): Position: 0, Position: 2, Position: 1

High availability flags
● Server
  ○ --cluster.listen-address
  ○ --cluster.advertise-address
● Peering
  ○ --cluster.peer
  ○ --cluster.peer-timeout (15s)
  ○ --cluster.settle-timeout (1m)

High availability flags (continued)
● Data exchange
  ○ --cluster.gossip-interval (250ms)
  ○ --cluster.pushpull-interval (1m)
  ○ --cluster.tcp-timeout (10s)
● Probes
  ○ --cluster.probe-timeout (500ms)
  ○ --cluster.probe-interval (1s)
● Reconnection
  ○ --cluster.reconnect-interval (10s)
  ○ --cluster.reconnect-timeout (6h)

Hidden stuff
● Peer names refreshed every 15 seconds.
● Messages gossiped to half of the nodes (min. 3).
● Gossip queue size of 4096 messages.
● Settle phase stops after 3 “stable” iterations.

Future work
● Encryption & authentication using mTLS (#1819).
● Better support for advertised address (#1909).

Conclusion
● Test all the things.
● Keep it simple.
● We ❤ contributions!

Thanks!
Simon Pasquier
pasquier.simon@gmail.com
@SimonHiker

Psst, we’re hiring!
