
Monitoring In Motion: Challenges in monitoring Kubernetes, containers, and dynamic infrastructure

Ilan Rabinovitch, Director, Technical Community, Datadog. ContainerCon Toronto, Aug 24, 2016.


  1. Monitoring In Motion: Challenges in monitoring Kubernetes, containers, and dynamic infrastructure. Ilan Rabinovitch, Director, Technical Community, Datadog. ContainerCon Toronto, Aug 24, 2016

  2. $ finger ilan@datadog
 [datadoghq.com]
 Name: Ilan Rabinovitch
 Role: Director, Technical Community
 Interests:
 * Monitoring and Metrics
 * Large scale web operations
 * FL/OSS Community Events

  3. Datadog Overview • SaaS-based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting • We’re hiring! (www.datadoghq.com/careers/)

  4. Monitor Everything: Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

  5. $ cat ~/.plan 1. Intro: The Importance of Monitoring 2. The Challenge : Monitoring Dynamic Infrastructure 3. Finding the Signal: How do we know what to monitor? 4. Implementation: Applying it to Containerized Workloads

  6. Our Focus Area: Culture, Automation, Metrics, Sharing. Damon Edwards and John Willis, DevOps Day LA

  7. Culture “ organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations ” - Melvin E. Conway

  8. Follow @honest_update on Twitter

  9. Collecting data is cheap; 
 not having it when you need it can be expensive

  10. Instrument all the things!

  11. Sharing: Looping Back on Culture • Describe the problem as your “enemy”, not each other • Learn together

  12. Sharing: Using and sharing the same metrics and measurements across teams is key to avoiding misunderstandings.

  13. Source: http://bit.ly/1SvvbuP

  14. Source: http://bit.ly/1RQRsXW

  15. Operational Complexity Increases with... • Number of things to measure 
 • Velocity of change

  16. https://www.datadoghq.com/docker-adoption/

  17. How much do we measure?
 1 instance: 10 metrics from cloud providers
 1 operating system (e.g., Linux): 100 metrics
 ~50 metrics per application

  18. How much do we measure?
 1 instance: 10 metrics from cloud providers
 1 operating system (e.g., Linux): 100 metrics
 ~50 metrics per application
 N containers: 150*N metrics

  19. Operational Complexity: 100 instances, 500 containers

  20. Operational Complexity: Scale • 160 metrics per host • 800 metrics per host, assuming 5 containers per host

  21. Operational Complexity: Scale • 100 instances • 80,000 metrics, assuming 5 containers per host
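
The totals on these slides can be reproduced with a few lines of arithmetic. This is a sketch under one assumption the slides do not state outright: that the 800 figure comes from multiplying the 160 per-host metrics (10 cloud + 100 OS + 50 application) by 5 containers. The function and parameter names are illustrative only.

```python
def metrics_per_unit(cloud=10, os_metrics=100, app=50):
    """Metrics for one instance, one operating system, and one application."""
    return cloud + os_metrics + app


def fleet_metrics(instances, containers_per_host, per_unit):
    # Assumption: each container contributes roughly one full set of
    # per-unit metrics, which is how 160 becomes 800 at 5 containers per host.
    return instances * containers_per_host * per_unit


base = metrics_per_unit()                # per host, before containers
per_host = 5 * base                      # per host with 5 containers
total = fleet_metrics(100, 5, base)      # across 100 instances

print(base, per_host, total)
```

Changing `containers_per_host` shows how quickly the count grows with container density, which is the point of the "Metrics Overload!" slide.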

  22. How much do we measure? Metrics Overload!
 1 instance: 10 metrics from cloud providers
 1 operating system (e.g., Linux): 100 metrics
 ~50 metrics per application
 N containers: 150*N metrics

  23. Operational Complexity Increases with... • Number of things to measure 
 • Velocity of change

  24. Source: Datadog

  25. Source: http://bit.ly/1qFylWK

  26. Operational Complexity Increases with... • Number of things to measure 
 • Velocity of change

  27. Open Questions • Where is my container running? • What is the capacity of my cluster? • What port is my app running on? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx errors per container?

  28. Source: http://bit.ly/1YxJ7Jy

  29. More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

  30. Monitoring 101

  31. Finding Signal - Categorizing Your Metrics

  32. Examples: NGINX - Metrics
 Work Metrics: • Requests Per Second • Request Time • Error Rates (4xx or 5xx) • Success (2xx)
 Resource Metrics: • Disk I/O • Memory • CPU • Queue Length

  33. Examples: NGINX - Events • Configuration Change • Code Deployment • Service Started / Stopped

  34. Examples: Events

  35. When to let a sleeping engineer lie?

  36. When to alert?

  37. Recurse until you find root cause

  38. What to demand from our monitoring tooling?

  39. Cryptic Alerts: WHAT?

  40. EVERY ALERT MUST BE ACTIONABLE

  41. Host Centric

  42. Service Centric

  43. Static vs Dynamic: Static configurations tracking dynamic infrastructure are not a recipe for success.

  44. Query Based Monitoring “What’s the average throughput of application:nginx per version?” “Alert me when one of my pods from replication controller:foo is not behaving like the others” “Show me the rate of HTTP 500 responses from nginx” “… across all data centers” “… running my app version 2….”
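
A query such as “average throughput of application:nginx per version” amounts to filtering points by tag and grouping by another tag. The sketch below illustrates the idea with an in-memory list; the metric name, tag keys, sample values, and the `query` helper are all hypothetical and do not represent any particular monitoring product's API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: each point carries a value plus free-form tags.
points = [
    {"metric": "requests_per_s", "value": 120.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "us-east"}},
    {"metric": "requests_per_s", "value": 80.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "eu-west"}},
    {"metric": "requests_per_s", "value": 200.0,
     "tags": {"application": "nginx", "version": "2", "datacenter": "us-east"}},
    {"metric": "requests_per_s", "value": 999.0,
     "tags": {"application": "redis", "version": "1", "datacenter": "us-east"}},
]


def query(points, metric, group_by, **filters):
    """Average `metric` per value of the `group_by` tag.

    Only points whose tags match every key/value in `filters` are kept.
    """
    groups = defaultdict(list)
    for point in points:
        tags = point["tags"]
        if point["metric"] != metric or group_by not in tags:
            continue
        if any(tags.get(key) != wanted for key, wanted in filters.items()):
            continue
        groups[tags[group_by]].append(point["value"])
    return {key: mean(values) for key, values in groups.items()}


per_version = query(points, "requests_per_s", group_by="version",
                    application="nginx")
print(per_version)
```

Because the query names tags rather than hosts, containers that start or stop are picked up automatically as long as they report the same tags.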

  45. Getting at the metrics…

  46. Resource Metrics
 Utilization: • CPU (user + system) • memory • i/o • network traffic
 Saturation: • throttling • swap
 Error: • Network Errors (receive vs transmit)

  47. Container Events • Starting / Stopping Containers • Scaling Events for Underlying Instances • Deploying a new container build

  48. How do we get at the upper layers?

  49. Getting at the Metrics
                  CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
 pseudo-files     Yes           Yes              Some          Yes, in 1.6.1+
 stats command    Basic         Basic            No            Basic
 API              Yes           Yes              Some          Yes

  50. Pseudo-files • Provide visibility into container metrics via the file system. • Generally under: 
 /cgroup/<resource>/docker/$CONTAINER_ID/ 
 or 
 /sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/ 


  51. Pseudo-files: CPU Metrics
 $ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
 > user 2451 # time spent running processes since boot
 > system 966 # time spent executing system calls since boot
 Pseudo-files: CPU Throttling
 $ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
 > nr_periods 565 # Number of enforcement intervals that have elapsed
 > nr_throttled 559 # Number of times the group has been throttled
 > throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)
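
The pseudo-files are plain "key value" text, so they are easy to parse. The following is a minimal sketch that uses the sample values from this slide; the helper names are hypothetical, and `read_container_stat` assumes the `/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/` layout shown above, which may differ on your host.

```python
from pathlib import Path


def parse_stat(text):
    """Parse 'key value' lines as found in cpuacct.stat and cpu.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_container_stat(container_id, resource="cpu", filename="cpu.stat",
                        root="/sys/fs/cgroup"):
    # Assumed layout: <root>/<resource>/docker/<container_id>/<filename>
    path = Path(root) / resource / "docker" / container_id / filename
    return parse_stat(path.read_text())


# Sample cpu.stat content copied from the slide.
sample_cpu_stat = """nr_periods 565
nr_throttled 559
throttled_time 12119585961
"""

cpu = parse_stat(sample_cpu_stat)
throttled_ratio = cpu["nr_throttled"] / cpu["nr_periods"]
throttled_seconds = cpu["throttled_time"] / 1e9  # value is in nanoseconds

print(f"throttled in {throttled_ratio:.1%} of periods, "
      f"{throttled_seconds:.2f} s in total")
```

A ratio this close to 1 means the container hit its CPU limit in nearly every enforcement interval, which is a saturation signal rather than a utilization one.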

  52. Docker API • Detailed streaming metrics as JSON over an HTTP socket 
 $ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats 
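
Each JSON object streamed by the stats endpoint can be reduced to the percentages that the stats command prints. A minimal sketch, assuming the payload carries `cpu_stats`, `precpu_stats`, and `memory_stats` objects with the field names used below; check them against your Docker API version. The sample values are invented for illustration and were not captured from a real container.

```python
import json

# Hypothetical, heavily trimmed stats payload with made-up values.
sample = json.loads("""
{
  "cpu_stats": {
    "cpu_usage": {"total_usage": 200000000, "percpu_usage": [120000000, 80000000]},
    "system_cpu_usage": 10000000000
  },
  "precpu_stats": {
    "cpu_usage": {"total_usage": 100000000},
    "system_cpu_usage": 9000000000
  },
  "memory_stats": {"usage": 75000000, "limit": 500000000}
}
""")


def cpu_percent(stats):
    """CPU usage between the previous and current sample, as a percentage."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    # Assumption: newer payloads expose online_cpus; otherwise count per-CPU entries.
    ncpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage", [])) or 1
    if system_delta <= 0 or cpu_delta < 0:
        return 0.0
    return cpu_delta * ncpus * 100.0 / system_delta


def memory_percent(stats):
    """Memory usage as a percentage of the container's limit."""
    mem = stats["memory_stats"]
    return mem["usage"] * 100.0 / mem["limit"]


cpu_pct = cpu_percent(sample)
mem_pct = memory_percent(sample)
print(f"CPU {cpu_pct:.2f}%  MEM {mem_pct:.2f}%")
```

Working from deltas between two samples is what makes the streaming form of the endpoint useful: a single cumulative counter says little on its own.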


  53. STATS Command
 # Usage: docker stats CONTAINER [CONTAINER...]
 $ docker stats $CONTAINER_ID
 CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
 ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

  54. Side Car Containers

  55. Aren’t we still missing a layer?

  56. Open Questions • What is the capacity of my cluster? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx errors per container? • Where is my container running? On what port?

  57. Service Discovery [Diagram; recoverable labels: Host Level, Monitoring Agent, Metrics, Container, Config Backend, Integration Configurations, Docker API, Containers List & Metadata, Orchestrator, Additional Metadata (Tags, etc)]

  58. Custom Metrics • Instrument custom applications 
 • You know your key transactions best. 
 • Use async protocols like Etsy’s StatsD or 
 DogStatsD
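
StatsD-style instrumentation sends small text datagrams over UDP, so the application never waits on the monitoring backend. The sketch below builds a datagram in the plain StatsD form `name:value|type` and appends the `|#tag,tag` suffix used by DogStatsD. The metric names and tags are hypothetical, and it assumes an agent listening on the conventional port 8125 of localhost.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram; the '|#' tag suffix is the DogStatsD extension."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: sendto does not wait for the server to respond,
    # so a slow or absent agent does not block the application.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


# Hypothetical key transaction: count completed checkouts, tagged by app and version.
datagram = format_metric("checkout.completed", 1, "c",
                         tags=["app:web", "version:2"])
print(datagram)

# To emit it, call send_metric(datagram) with an agent listening locally.
```

Tagging at the point of instrumentation is what lets the custom metric be queried per app or per version later, in the same way as the container metrics.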

  59. Source: http://bit.ly/1NoW6aj

  60. Resources
 Monitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/
 Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/
 Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/
 The Power of Tagged Metrics https://www.datadoghq.com/blog/the-docker-monitoring-problem/
 How to Collect Docker Metrics https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
 8 surprising facts about Docker Adoption https://www.datadoghq.com/docker-adoption/
