@snehainguva

prometheus everything, observing kubernetes in the cloud digitalocean.com

about me
● software engineer @DigitalOcean
● formerly delivery, currently observability
● kubernetes, prometheus

Some stats

● 15 kubernetes clusters
● 12 data centers
● 300+ production applications

● 2 promethei + 1 alertmanager per cluster
● 1.5 million+ timeseries
● 99218 samples/sec (note: data-center wide scraping is at 550k samples/sec)

the plan:
● the pre-kubernetes days
● kubernetes at DigitalOcean (aka docc)
● prometheus + alertmanager and kubernetes
● alerting in action: examples
● potential pitfalls
● next steps

pre-kubernetes:
● service owners write an application
● provision a server with chef or ansible
● use a CI/CD pipeline, bash scripts, or other tools to deploy and update the application on a VM

pre-kubernetes:
● use nagios + various plugins to monitor
● use collectd + application metrics + statsd + graphite
● push data to openTSDB

pre-kubernetes:
● it took longer to provision a host than to write the actual service
● blackbox monitoring was NOT insightful
● whitebox monitoring of services was NOT easily queryable

docc: DigitalOcean Command Center
● a tool for deploying containerized, stateless applications

What is kubernetes?
● container orchestration tool from Google

What is docc?
● an abstraction layer on top of kubernetes
[diagram: CLI, DOCCSERVER, deployment → pods, service]

post-docc:
● service owner writes an application
● service owner dockerizes the application
● describe the application in a json manifest file
● deploy!

post-docc:
● deployments and updates take minutes, not hours
● view running applications
● get application logs
● easily scale, update, or restart applications

But what about monitoring?

Let’s use prometheus + alertmanager

[diagram: docc, deployment → pods, service, promconfig, alertmanager, alertconfig]

1. instrument your application
● use the prometheus golang client
● expose a metrics endpoint
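To make the idea of a metrics endpoint concrete, here is a minimal stdlib-only Python sketch of what such an endpoint returns. It is a stand-in for a real client library (the talk used the Go client), and the metric name `myapp_requests_total` and its help text are hypothetical.

```python
# Stdlib-only stand-in for a Prometheus client library.
# The metric name and help text below are hypothetical.

class Counter:
    """A cumulative metric that can only increase."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


requests_total = Counter("myapp_requests_total", "Total requests handled.")
requests_total.inc()
requests_total.inc()

# This string is what a GET /metrics handler would send back to the scraper.
metrics_body = requests_total.render()
```

In practice a client library does this bookkeeping and serves the text over HTTP; the sketch only shows the shape of the payload that prometheus scrapes.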

2. specify metrics, ports, and alerts in your manifest file
● Which metrics endpoint should be scraped?
● Which container port needs to be exposed?
● Specify the alerting rule, duration interval, and channel.
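The deck does not show the manifest itself, so the following is only a guess at its shape: a small Python sketch that loads a hypothetical json manifest and checks that it answers the three questions above. Every field name (`metrics`, `path`, `port`, `alerts`, `rule`, `for`, `channel`) is an assumption, not docc's real schema.

```python
import json

# Hypothetical manifest; the field names are assumptions, not docc's real schema.
MANIFEST = """
{
  "name": "myapp",
  "metrics": {"path": "/metrics", "port": 8080},
  "alerts": [
    {
      "name": "MyAppHighErrorRate",
      "rule": "sum(rate(myapp_errors_total[5m])) > 1",
      "for": "5m",
      "channel": "slack"
    }
  ]
}
"""


def check_manifest(manifest):
    """Return a list of problems; an empty list means nothing is missing."""
    problems = []
    metrics = manifest.get("metrics", {})
    if "path" not in metrics:
        problems.append("no metrics endpoint to scrape")
    if "port" not in metrics:
        problems.append("no container port to expose")
    for alert in manifest.get("alerts", []):
        for field in ("rule", "for", "channel"):
            if field not in alert:
                problems.append(f"alert {alert.get('name', '?')} is missing {field}")
    return problems


manifest = json.loads(MANIFEST)
problems = check_manifest(manifest)
```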

3. use the docc CLI to deploy your application
● $ docc deploy manifest.json
● annotations contain rules and receiver info
[diagram: CLI, doccserver, deployment → pods, service]

4. prometheus talks to the kubernetes api and grabs the metrics endpoint and port information
[diagram: service, promconfig]

5. promconfig grabs alert information and rewrites the prometheus rules file
[diagram: service, promconfig]
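As a rough illustration of this step, the Python sketch below turns alert information into rule text using the `ALERT ... IF ...` rule syntax that appears later in this deck (the Prometheus 1.x format). The function `render_rule`, the alert name, and the label names are hypothetical; promconfig's actual implementation is not shown in the talk.

```python
def render_rule(name, expr, duration, labels):
    """Render one alerting rule in the Prometheus 1.x rule syntax."""
    label_text = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return (
        f"ALERT {name}\n"
        f"  IF {expr}\n"
        f"  FOR {duration}\n"
        f"  LABELS {{ {label_text} }}\n"
    )


# Hypothetical alert info, as it might be read from a service's annotations.
alert = {
    "name": "DoccserverDown",
    "expr": 'sum(up{kubernetes_name="doccserver"}) == 0',
    "for": "5m",
    "labels": {"receiver": "slack", "team": "delivery"},
}

rules_file = render_rule(alert["name"], alert["expr"], alert["for"], alert["labels"])
```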

6. alertconfig grabs alert routes and rewrites the alertmanager configuration file
[diagram: service, alertmanager, alertconfig]
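A minimal Python sketch of the routing side, assuming each team maps to one receiver. It builds a routing tree with alertmanager's `receiver`, `routes`, and `match` fields; the `team` label, the receiver names, and the function `build_routes` are assumptions, and a complete configuration would also need the matching `receivers` definitions.

```python
import json


def build_routes(default_receiver, team_receivers):
    """Build an alertmanager-style routing tree from {team: receiver} pairs."""
    return {
        "receiver": default_receiver,
        "routes": [
            {"match": {"team": team}, "receiver": receiver}
            for team, receiver in sorted(team_receivers.items())
        ],
    }


# Hypothetical teams and receiver names.
route = build_routes(
    "slack-default",
    {"storage": "pagerduty-storage", "delivery": "slack-delivery"},
)

# Serialized as JSON here to stay stdlib-only; JSON is also valid YAML.
route_config = json.dumps({"route": route}, indent=2)
```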

What should we monitor?

4 Golden Signals
● latency
● traffic
● errors
● saturation
request-based system metrics (RED)
● Request
● Errors
● Duration

Brendan Gregg’s USE-ful metrics
“Solves 80% of server issues with 5% of the effort.”
● Utilization
● Saturation
● Errors

prom metrics types
● counters: cumulative, increasing metric
● gauges: single metric that goes up or down
● histograms: sample observations and count them in buckets
● summaries: sample observations and report the specified quantiles
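The histogram type is the least obvious of the four, so here is a small Python sketch of how observations land in cumulative buckets, which is how prometheus histograms expose them (each bucket counts observations less than or equal to its upper bound). The bucket bounds and observed values are made up.

```python
def bucket_counts(observations, upper_bounds):
    """Count observations into cumulative buckets, plus a catch-all +Inf bucket."""
    counts = {bound: 0 for bound in upper_bounds}
    counts[float("inf")] = 0
    for value in observations:
        for bound in counts:
            if value <= bound:
                counts[bound] += 1
    return counts


# Hypothetical request durations in seconds and bucket upper bounds.
buckets = bucket_counts([0.05, 0.2, 0.2, 1.5], [0.1, 0.5, 1.0])
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total number of observations.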

Putting it all together...

service metric: traffic
● how much demand is placed on the system
● example: loadbalancer backend traffic
● fxn: rate() and sum()
● metric type: counter
● labels: kubernetes_name, backend
sum(rate(haproxy_backend_bytes_out_total{kubernetes_name="loadbalancer", backend="tls_default_neptune_nyc3_internal_digitalocean_com"}[1m])) BY (backend)
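What `sum(rate(...)) BY (backend)` computes can be sketched in a few lines of Python. This is a simplification: PromQL's `rate()` also handles counter resets and extrapolates to the window edges, which the helper `simple_rate` below does not. The sample values and label values are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples over a 60 second window, one series per pod.
series = [
    ({"backend": "a", "pod": "lb-1"}, [(0, 1000.0), (60, 7000.0)]),
    ({"backend": "a", "pod": "lb-2"}, [(0, 500.0), (60, 3500.0)]),
    ({"backend": "b", "pod": "lb-1"}, [(0, 0.0), (60, 1200.0)]),
]

# sum(...) BY (backend): add up the per-series rates that share a backend label.
by_backend = defaultdict(float)
for labels, samples in series:
    by_backend[labels["backend"]] += simple_rate(samples)
```

Taking the rate first and summing second matters: each series is a separate counter, so rates are computed per series before they are aggregated.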

cluster metric: utilization
● average time resource is busy servicing work
● example: cluster CPU utilization
● fxn: sum() and rate()
● metric type: counter
sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores)

How should we alert?

Threshold alerts
● Do any of the aforementioned metrics fall below a lower bound or exceed an upper bound?

Threshold alerts
● Are more than 80% of cluster CPU cores being utilized?
(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores)) * 100 > 80
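The arithmetic behind this threshold can be worked through in Python. The rate of a CPU-seconds counter is "CPU-seconds used per second", which is the number of cores kept busy; dividing by the total core count gives utilization. The numbers below (52 busy cores on a 64-core cluster) are hypothetical.

```python
def cpu_utilization_percent(cpu_seconds_per_second, total_cores):
    """Share of cluster cores kept busy, as a percentage."""
    return cpu_seconds_per_second / total_cores * 100


# Hypothetical: containers burn 52 CPU-seconds per second on a 64-core cluster.
utilization = cpu_utilization_percent(52.0, 64)

# The alert condition from the slide: fire when utilization is above 80%.
firing = utilization > 80
```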

State-based alerts
● Is there a divergence between the expected state and the actual state of a service?

State-based alerts
● Is my service up and/or scrape-able?
absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0
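The two halves of this expression cover different failures, which a short Python sketch can make explicit: `absent()` catches the case where no `up` series exists at all (nothing was discovered), and `sum(up) == 0` catches the case where targets exist but every scrape failed. The function name `service_down` is hypothetical.

```python
def service_down(up_samples):
    """up_samples holds one 0/1 value per scrape target of the service."""
    if not up_samples:
        # absent(up{...}): no series at all, so there is nothing to sum.
        return True
    # sum(up{...}) == 0: targets exist, but every one failed its scrape.
    return sum(up_samples) == 0


no_targets = service_down([])
all_failing = service_down([0, 0])
partly_up = service_down([1, 0])
```

Without the `absent()` half, a service whose targets disappeared entirely would produce no series, the comparison would return nothing, and the alert would stay silent.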

Common pitfalls

Pitfall #1: Alerting fatigue

Solution: Slack and/or Pagerduty
● send only the most urgent production alerts to pagerduty
● try out different promQL queries to get less spiky metrics

Pitfall #2: Who owns what?

Solution: opinionated manifest file
● service owners must include maintainer information
● alerts themselves include descriptions and summaries with several labels
● alerts must include team-specific receivers

Pitfall #3: Meta-monitoring

Solution: Duplicate promethei and HA alertmanager
[diagram: three alertmanager instances]

Solution: Deadman’s switch
● ALERT JustKeepSwimming IF vector(1)
[diagram: elastalert]
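The rule above always fires, because `vector(1)` always returns a value. The useful signal is therefore its absence: if an external watcher (elastalert on the slide) stops seeing the alert, the monitoring pipeline itself is broken. A minimal Python sketch of the watcher's check, with hypothetical timestamps and an arbitrary 300 second tolerance:

```python
def heartbeat_missing(last_heartbeat, now, max_silence_seconds):
    """True when the always-firing alert has not been seen recently enough."""
    return (now - last_heartbeat) > max_silence_seconds


# Hypothetical timestamps in seconds.
healthy = heartbeat_missing(last_heartbeat=1000, now=1060, max_silence_seconds=300)
broken = heartbeat_missing(last_heartbeat=1000, now=1400, max_silence_seconds=300)
```

The watcher has to run outside the prometheus and alertmanager setup it is checking, otherwise it fails together with them.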

Next steps

#1: Automated alerts
● utilize user-defined memory and cpu limits for threshold alerts
● automatic state-based alerts

#2: Leverage metrics for autopilot
● user trusts in our custom controllers and schedulers
● collect metrics and build a model of resource usage over time
● adjust limits and alerts accordingly

#3: Leverage metrics for autoscaling
● services based on resource usage, # connections, etc.
● loadbalancers based on # of frontend and backend connections
● # of worker nodes based on memory and cpu capacity metrics

a brave new world of container orchestration
● prometheus + alertmanager are awesome!
● extensibility

thanks! @snehainguva
● The best prometheus tutorials you will ever read, Julius Volz
● Actual Prometheus Website
● Kubernetes Project
