monitors distributed systems a long time ago in a galaxy


  1. monitors distributed systems

  2. a long time ago in a galaxy far far away...

  3. Distributed architectures are hard

  4. Monitoring distributed systems
     - Time windows
     - Rates
     - Percentiles
     - Cluster monitoring
     - Correlation between metrics
     - State transitions (OK => KO)
     - Alerts (mail, Slack, PagerDuty…)
     - Flexibility
     - ...
     (Diagram labels: 100k write/sec, 105k write/sec, 101k write/sec, 1k write/sec)

  6. Riemann
     - Created by Kyle Kingsbury (Aphyr)
     - Event processing
     - Clojure
     - Monitoring

  7. An immutable event
     :host "foo.bar.com"
     :service "df_percent_bytes_used_root"
     :state "critical"
     :time 1493243041
     :metric 90
     :description "Disk is full"
     :tags ["disk"]
     :ttl 60
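To make "immutable" concrete: an event behaves like an ordinary Clojure map, so "changing" a field returns a new map and leaves the original untouched. The following is a minimal sketch in plain Clojure (no Riemann required); the names `event` and `updated` are illustrative only.

```clojure
;; The event from the slide, written as a plain Clojure map.
(def event
  {:host        "foo.bar.com"
   :service     "df_percent_bytes_used_root"
   :state       "critical"
   :time        1493243041
   :metric      90
   :description "Disk is full"
   :tags        ["disk"]
   :ttl         60})

;; assoc returns a new map; `event` itself is unchanged.
(def updated (assoc event :state "ok"))

(println (:state event))   ; critical
(println (:state updated)) ; ok
```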

  8. (Diagram) Clients and tools sending events: Java, Collectd, Syslog-ng, Kafka, Haskell, Telegraf, Logstash, Nagios check, Go, K8s/Heapster, Fluentd, Chef, Python, Statsd, Perl, Graphite, ... Transports: HTTP, TCP, UDP, Graphite, TLS, OpenTSDB. Diagram annotations: Good, Drop packets, Slow, Compat.

  9. Streams

  10. where = service "api_rate"
      - Input: three events from :host "foo1.com", all with :metric 90
        - :service "api_rate", :time 1493243041
        - :service "foobar", :time 1493243041
        - :service "api_rate", :time 1493243044
      - Output: the two "api_rate" events (:time 1493243041 and :time 1493243044)

  11. fixed-time-window 10
      - Input: the two "api_rate" events (:time 1493243041 and :time 1493243044, :metric 90 each)
      - Output: the same two events, grouped into one 10-second window

  12. smap sum
      - Input: the window of two "api_rate" events (:metric 90 each)
      - Output: one event, :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

  13. where < metric 200
      - Input: :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180
      - Output: the same event (180 < 200)

  14. email "ops@riemann.io"
      - Input: :host "foo1.com" :service "api_rate" :time 1493243044 :metric 180

  15. where = service "api_rate"
      fixed-time-window 10
      smap sum
      where < metric 200
      email "ops@riemann.io"

  16. (where (= service "api_rate")
        (fixed-time-window 10
          (smap sum
            (where (< metric 200)
              (email "ops@riemann.io")))))

  17. (where (= service "api_rate")
        (fixed-time-window 10
          (smap sum                      ; slide annotation: "Use map!"
            (where (< metric 200)
              (email "ops@riemann.io")))))
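One way to read the nested form above: each stream is a function of one event (or one window of events) that decides what to pass on to its children. The sketch below models this in plain Clojure. `where-stream`, `smap-stream`, `sum-events` and `collect!` are hypothetical stand-ins, not Riemann's implementation, and the time-window step is replaced by a hand-built window for brevity.

```clojure
;; Hypothetical stand-ins for streams: a stream is a function of one event
;; (or one window of events) that forwards something to its children.
(defn where-stream [pred & children]
  (fn [event]
    (when (pred event)
      (doseq [child children] (child event)))))

(defn smap-stream [f & children]
  (fn [event]
    (doseq [child children] (child (f event)))))

;; Fold a window of events into one: keep the last event, sum the metrics.
(defn sum-events [events]
  (assoc (last events) :metric (reduce + (map :metric events))))

;; Stand-in for the email stream: record what would have been sent.
(def sent (atom []))
(defn collect! [event] (swap! sent conj event))

(def pipeline
  (smap-stream sum-events
    (where-stream #(< (:metric %) 200)
      collect!)))

;; The window from the slides: two api_rate events with metric 90 each.
(pipeline [{:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
           {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}])

;; One event with :metric 180 reaches the last stage.
(println @sent)
```

Because every stage is just a function that takes its children as arguments, nesting the forms is the same thing as wiring the pipeline.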

  18. Configuration as code (your config is 100% Clojure)

  19. (with {:description "Disk is full" :state "critical"} child)
      - Input: :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90
      - Output: :host "foo.bar.com" :service "df_home_mathieu" :state "critical" :time 1493243041 :metric 90 :description "Disk is full"

  20. (where (service "foo")
        (with {:description "cat"}       ; first child of where
          (email "ops@riemann.io"))
        (with {:description "dog"}       ; second child of where
          (email "dev@riemann.io")))
      where has 2 children.

  21. (where (service "foo")
        (with {:description "cat"}       ; first child of where
          (email "ops@riemann.io"))
        (with {:description "dog"}       ; second child of where
          (email "dev@riemann.io")))
      where has 2 children; each with has 1 child.

  22. - Clojure data structures
      - Immutability
      - No side effects between streams

  23. (where (service "df_percent_bytes_used_var_log"))
      (where (service #"^df_percent_bytes_used_"))
      (where (and (service #"^df_percent_bytes_used_")
                  (> (:metric event) 80)))

  24. (default :ttl 60 child)
      - Input: :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90
      - Output: :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60

  25. (smap (fn [event] (assoc event :ttl 60)) child)
      - Input: :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90
      - Output: :host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60

  26. (fixed-time-window 60 child1 child2)
      (Figure: time axis t = 0, 60, 120, 180, 240)

  27. (moving-time-window 60 child)
      (Figure: time axis t = 0, 60, 120, 180, 240)

  28. (fixed-event-window 3 child)
      (Figure: events on a time axis t)

  29. (moving-event-window 3 child)
      (Figure: events on a time axis t)
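The difference between the fixed and moving event windows can be sketched on a plain sequence. This is an illustration of the idea only, under the assumption that a fixed window is emitted once it holds n events and a moving window is emitted on every event with the last n events seen so far; `fixed-windows` and `moving-windows` are hypothetical helper names.

```clojure
;; Fixed windows: consecutive, non-overlapping groups of exactly n items.
;; A trailing incomplete group is not emitted.
(defn fixed-windows [n xs]
  (partition n xs))

;; Moving windows: after each item, the last n items seen so far.
(defn moving-windows [n xs]
  (rest (reductions (fn [window x] (vec (take-last n (conj window x))))
                    []
                    xs)))

(println (fixed-windows 3 [1 2 3 4 5 6 7]))
;; prints ((1 2 3) (4 5 6))
(println (moving-windows 3 [1 2 3 4]))
;; prints ([1] [1 2] [1 2 3] [2 3 4])
```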

  30. (rate 5 child)
      (Figure: time axis t = 0, 5, 10, 15. Incoming metrics: 5, 1, 1, then 2, 1, then 9, 4, 5. Emitted rates: 1.4, 0.6, 3.6.)
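The numbers on the rate slide can be checked by hand: each emitted value is the sum of the metrics received in one 5-second interval, divided by 5. A small plain-Clojure check follows, assuming the grouping 5, 1, 1 then 2, 1 then 9, 4, 5 (inferred from the figure labels, not stated on the slide); `rate-of` is a hypothetical helper.

```clojure
;; Sum of the metrics seen in one interval, divided by the interval length
;; in seconds.
(defn rate-of [interval metrics]
  (/ (reduce + metrics) (double interval)))

(println (map #(rate-of 5 %) [[5 1 1] [2 1] [9 4 5]]))
;; prints (1.4 0.6 3.6)
```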

  31. (scale (/ 1 1024 1024 1024) child)
      (Figure: bytes over t become gigabytes over t)

  32. (ddt child)
      (Figure labels: 4.75, 2, 15, 19, 38; time axis t = 0, 5, 25, 65)

  33. (changed :state {:init "ok"} child)
      (Figure: states over time from t = 0: ok, ok, ko, ko, ko, ok, ko. Sent to child: ko, ok, ko.)
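The behaviour in the figure, where only transitions reach the child, can be reproduced on a plain sequence of states. This is a sketch of the idea rather than Riemann's code; `changes` is a hypothetical helper that takes the assumed initial value (the slide uses `{:init "ok"}`).

```clojure
;; Emit a value only when it differs from the previous one, starting from an
;; assumed initial value.
(defn changes [init values]
  (->> (cons init values)
       (partition 2 1)
       (filter (fn [[prev curr]] (not= prev curr)))
       (map second)))

(println (changes "ok" ["ok" "ok" "ko" "ko" "ko" "ok" "ko"]))
;; prints (ko ok ko)
```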

  34. (by [:host :service]
        (changed :state {:init "ok"} child))
      (Figure: one timeline per [host service] pair, each starting at t = 0)
      - :host "foo.com" :service "kafka lag": ok, ko, ko, ok
      - :host "foo.com" :service "disk /root %": ko, ko, ok, ok, ok

  35. (where (service "api request")
        (percentiles 60 [0.5 0.99] child))
      (Figure: metrics 10, 20, 30 arrive between t = 0 and t = 60)
      - Output: :host "riemann.io" :service "api request 0.5" :metric 20
      - Output: :host "riemann.io" :service "api request 0.99" :metric 30
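The two outputs on the percentiles slide can be reproduced with a simple rank-based pick from the sorted sample. The convention below (index floor(n * p), capped at the last element) is an assumption chosen because it matches the slide's numbers; `percentile` is a hypothetical helper and expects a non-empty collection.

```clojure
;; Pick the value at index floor(n * p) of the sorted sample, capped at the
;; last index. Assumed convention, for illustration only.
(defn percentile [p values]
  (let [sorted (vec (sort values))
        n      (count sorted)
        idx    (min (dec n) (int (Math/floor (* n p))))]
    (nth sorted idx)))

(println (percentile 0.5 [10 20 30]))  ; 20
(println (percentile 0.99 [10 20 30])) ; 30
```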

  36. (where (state "critical")
        (throttle 2 3600
          (email "foo@riemann.io")))
      (Figure: time axis t = 0 to 3600; incoming events: critical, critical, ok, critical, critical, critical, critical; label: child)
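The effect of throttling can be sketched on a list of events: keep at most n events per period of dt seconds and drop the rest. In this simplified model the periods are aligned on multiples of dt, which is an assumption; `throttle-events` and the event times are hypothetical.

```clojure
;; Keep at most n events per dt-second period, dropping the rest.
;; Periods are aligned on multiples of dt here, which is a simplification.
(defn throttle-events [n dt events]
  (->> events
       (group-by #(quot (:time %) dt))
       (sort-by key)
       (mapcat (fn [[_ evs]] (take n evs)))))

;; Five critical events in the first hour, one more in the second hour.
(def criticals
  (map (fn [t] {:state "critical" :time t}) [10 20 40 50 60 3700]))

(println (map :time (throttle-events 2 3600 criticals)))
;; prints (10 20 3700)
```

With `2` and `3600` as on the slide, a burst of critical events produces at most two emails per hour.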

  37. (where (service "cpu %")
        (coalesce 10
          (smap max)))
      (Figure: every 10s, the latest "cpu %" event of each host, here :host "host 1" :metric 10 and :host "host 2" :metric 20, is passed on and the max is taken)

  38. Pagerduty, InfluxDB, Email, Nagios, VictorOps, Elasticsearch, Hipchat, Shinken, Twilio, Graphite, Slack, Alerta, Kafka, Mailgun, Logstash, Datadog, Cloudwatch, Riemann, ...

  39. (batch 100 1                          ;; batch size = 100 every 1 sec
        (async-queue! :influxdb             ;; create a threadpool
          {:queue-size 10000
           :core-pool-size 4}
          (influxdb {:host "127.0.0.1"      ;; forward to influx
                     :db "riemann"})))

  40. (exception-stream (email "alert@riemann.io")
        (influxdb {:host "127.0.0.1"
                   :db "riemann"}))

  41. Configuration as code: split your configuration

  42. (def check-critical-state   ;; a var containing a stream
        (where (state "critical")
          (email "admin@riemann.io")))

      (defn check-state           ;; a function returning a stream
        [s email-addr]
        (where (state s)
          (email email-addr)))

      (streams
        check-critical-state
        (check-state "critical" "admin@riemann.io"))

  43. /etc/riemann/riemann.config
      /mycorp/app/elasticsearch.clj
      /mycorp/output/mail.clj
      /mycorp/system/disk.clj
      /mycorp/system/ram.clj
      …
      + A plugin system

  44. Configuration as code: tests

  45. (scale (/ 1 1024 1024 1024)
        (tap :scale-tap)
        child)

      (tests
        (deftest foo-test
          (is (= (:scale-tap (inject! [{:metric 1000}]))
                 [{:metric (/ 1000 1024 1024 1024)}]))))

  46. The index
      - In-memory data structure (hashmap)
      - Key: [host service]
      - Value: an event
      - The index stream adds events to the index
      (where (service "ram_percent")
        (index))

  47. Index at time 3 (rows: service, columns: host)
      - ram_percent, foo.bar: :metric 65 :ttl 120 :time 2
      - ram_percent, fizz.buzz: :metric 80 :ttl 120 :time 2
      - cpu_percent, foo.bar: :metric 40 :ttl 60 :time 1
      - cpu_percent, fizz.buzz: :metric 90 :ttl 60 :time 3

  48. Index at time 10 (rows: service, columns: host)
      - ram_percent, foo.bar: :metric 65 :ttl 120 :time 2
      - ram_percent, fizz.buzz: :metric 80 :ttl 120 :time 2
      - cpu_percent, foo.bar: :metric 45 :ttl 60 :time 10
      - cpu_percent, fizz.buzz: :metric 90 :ttl 60 :time 3

  49. Index at time 64 (rows: service, columns: host)
      - ram_percent, foo.bar: :metric 65 :ttl 120 :time 2
      - ram_percent, fizz.buzz: :metric 80 :ttl 120 :time 2
      - cpu_percent, foo.bar: :metric 45 :ttl 60 :time 10
      - cpu_percent, fizz.buzz: :metric 90 :ttl 60 :time 3
      At time 64, the fizz.buzz cpu_percent entry (:time 3, :ttl 60) is the only one older than its TTL.
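The three index snapshots can be replayed with a toy model: a map from [host service] to the latest event, plus a TTL check. This is a sketch under assumptions, not Riemann's index: `index-event` and `expired?` are hypothetical helpers, and the expiry rule (more than :ttl seconds since :time) is assumed.

```clojure
;; A toy index: a map from [host service] to the latest event for that key.
(defn index-event [index event]
  (assoc index [(:host event) (:service event)] event))

;; Assumed rule: an event is expired once more than :ttl seconds have passed
;; since its :time.
(defn expired? [now event]
  (> (- now (:time event)) (:ttl event)))

(def index
  (reduce index-event {}
          [{:host "foo.bar"   :service "ram_percent" :metric 65 :ttl 120 :time 2}
           {:host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2}
           {:host "foo.bar"   :service "cpu_percent" :metric 40 :ttl 60  :time 1}
           {:host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60  :time 3}
           ;; A newer event for an existing key replaces the old one.
           {:host "foo.bar"   :service "cpu_percent" :metric 45 :ttl 60  :time 10}]))

(println (count index))
;; prints 4
(println (map key (filter #(expired? 64 (val %)) index)))
;; prints ([fizz.buzz cpu_percent])
```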
