Riemann monitors distributed systems
a long time ago in a galaxy far far away...
Distributed architectures are hard
Monitoring distributed systems
- Time windows
- Rates
- Percentiles
- Cluster monitoring
- Correlation between metrics
- State transitions (OK => KO)
- Alerts (mail, slack, pagerduty…)
- Flexibility
- ...
[Diagram: cluster nodes labelled 100k write/sec, 105k write/sec, 101k write/sec and 1k write/sec]
Riemann
- Created by Kyle Kingsbury (Aphyr)
- Event processing
- Clojure
- Monitoring
An immutable event
  {:host "foo.bar.com"
   :service "df_percent_bytes_used_root"
   :state "critical"
   :time 1493243041
   :metric 90
   :description "Disk is full"
   :tags ["disk"]
   :ttl 60}
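As a rough illustration of what "immutable" means here, the event can be modelled as a plain Clojure map. This is only a sketch using `clojure.core`; the names `event` and `updated` are made up for the example and it does not depend on Riemann itself.

```clojure
;; Illustrative only: a plain map standing in for a Riemann event.
(def event
  {:host "foo.bar.com"
   :service "df_percent_bytes_used_root"
   :state "critical"
   :time 1493243041
   :metric 90
   :description "Disk is full"
   :tags ["disk"]
   :ttl 60})

;; assoc returns a new map; the original is left untouched.
(def updated (assoc event :state "ok" :metric 42))

(println (:state event))   ; the original still says "critical"
(println (:state updated)) ; the copy says "ok"
```

Because every change produces a new value, two streams that receive the same event cannot interfere with each other.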
[Diagram: event sources and inputs]
- Clients and tools: Java, Haskell, Go, Python, Perl, Collectd, Syslog-ng, Kafka, Telegraf, Logstash, Nagios check, K8s/Heapster, Fluentd, Chef, Statsd, Graphite, ...
- Inputs: HTTP, TCP, UDP, TLS, Graphite, OpenTSDB
- Labels on the inputs: Good, Drop packets, Slow, Compat
Streams
Events in:
  {:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
  {:host "foo1.com" :service "foobar" :time 1493243041 :metric 90}
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}
Stream: where = service "api_rate"
Events out:
  {:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}
Events in:
  {:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}
Stream: fixed-time-window 10
Out, as one window:
  [{:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
   {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}]
Window in:
  [{:host "foo1.com" :service "api_rate" :time 1493243041 :metric 90}
   {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 90}]
Stream: smap sum
Event out:
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 180}
Event in:
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 180}
Stream: where < metric 200
Event out:
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 180}
Event in:
  {:host "foo1.com" :service "api_rate" :time 1493243044 :metric 180}
Stream: email "ops@riemann.io"
where = service "api_rate"
  -> fixed-time-window 10
    -> smap sum
      -> where < metric 200
        -> email "ops@riemann.io"
(where (= service "api_rate")
  (fixed-time-window 10
    (smap sum                        ; Use map!
      (where (< metric 200)
        (email "ops@riemann.io")))))
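The stream above is slide shorthand. A fuller version of the same pipeline, as it might appear in a `riemann.config`, could look like the sketch below. It assumes an SMTP relay reachable with `mailer`'s default settings and uses `folds/sum` for the summing step; the `:from` address is a made-up placeholder.

```clojure
;; Sketch of a riemann.config fragment; it only runs inside Riemann.
;; The :from address is a placeholder, not taken from the slides.
(def email (mailer {:from "riemann@example.com"}))

(streams
  (where (= service "api_rate")
    (fixed-time-window 10
      ;; folds/sum receives the window (a vector of events) and returns
      ;; one event whose :metric is the sum of the window's metrics
      (smap folds/sum
        (where (< metric 200)
          (email "ops@riemann.io"))))))
```

Each nested form is a child stream of the form that encloses it, so the indentation mirrors the path an event takes.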
Configuration as code
(Your config is 100% Clojure)
(with {:description "Disk is full" :state "critical"} child)
Event in:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90}
Event passed to child:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "critical" :time 1493243041 :metric 90 :description "Disk is full"}
(where (service "foo")               ; where has 2 children
  (with {:description "cat"}         ; first child; with has 1 child
    (email "ops@riemann.io"))
  (with {:description "dog"}         ; second child; with has 1 child
    (email "dev@riemann.io")))
Clojure datastructures
Immutability
No side effects between streams
(where (service "df_percent_bytes_used_var_log"))

(where (service #"^df_percent_bytes_used_"))

(where (and (service #"^df_percent_bytes_used_")
            (> (:metric event) 80)))
(default :ttl 60 child)
Event in:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90}
Event passed to child:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60}
(smap (fn [event] (assoc event :ttl 60)) child)
Event in:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90}
Event passed to child:
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :time 1493243041 :metric 90 :ttl 60}
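The difference between overwriting a field and supplying a fallback can be sketched in plain Clojure. The helpers `with-fields` and `default-fields` below are hypothetical stand-ins written for this example; they are not Riemann's own `with` and `default`, only an approximation of the behaviour shown on the slides.

```clojure
;; Hypothetical helpers, written only to illustrate the semantics.
(defn with-fields
  "Overwrite the given fields unconditionally."
  [event fields]
  (merge event fields))

(defn default-fields
  "Fill in the given fields only where the event has no value yet."
  [event fields]
  (merge fields (into {} (remove (comp nil? val) event))))

(def event
  {:host "foo.bar.com" :service "df_home_mathieu" :state "ok" :metric 90})

(with-fields event {:state "critical"})     ; :state becomes "critical"
(default-fields event {:state "critical"})  ; :state stays "ok"
(default-fields event {:ttl 60})            ; :ttl 60 is added
```

The `smap` form on the last slide is the general case: it applies any function to each event, so it can express both behaviours.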
(fixed-time-window 60 child1 child2)
[Timeline: t = 0, 60, 120, 180, 240]
(moving-time-window 60 child)
[Timeline: t = 0, 60, 120, 180, 240]
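The two kinds of time window can be approximated over a finished list of events in plain Clojure. This is a sketch under simplifying assumptions: windows are aligned to multiples of `n`, all events are already available, and `fixed-windows` and `moving-window` are names invented for the example. Riemann itself works on a live stream and may align its windows differently.

```clojure
;; Hypothetical helpers; a batch approximation of streaming windows.
(defn fixed-windows
  "Group events into consecutive, non-overlapping windows of n seconds,
  bucketing on :time. Returns the windows in time order."
  [n events]
  (->> events
       (sort-by :time)
       (partition-by #(quot (:time %) n))
       (map vec)))

(defn moving-window
  "Events from the last n seconds, measured from the newest event."
  [n events]
  (let [latest (apply max (map :time events))]
    (filterv #(> (:time %) (- latest n)) events)))

(def events
  [{:time 5 :metric 1} {:time 50 :metric 2}
   {:time 70 :metric 3} {:time 130 :metric 4}])

(fixed-windows 60 events)          ; three windows: t=5 and 50, t=70, t=130
(moving-window 60 (take 3 events)) ; keeps the events at t=50 and t=70
```

A fixed window hands over each event exactly once, while a moving window can include the same event in several successive windows.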
(fixed-event-window 3 child)
[Timeline along t]
(moving-event-window 3 child)
[Timeline along t]
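Event windows count events instead of seconds. The following plain-Clojure sketch shows the difference on a finished sequence; `fixed-event-windows` and `moving-event-windows` are hypothetical names, and the numbers stand in for events.

```clojure
;; Hypothetical helpers; numbers stand in for events.
(defn fixed-event-windows
  "Non-overlapping windows of exactly n events; an incomplete final
  window is not emitted."
  [n events]
  (map vec (partition n events)))

(defn moving-event-windows
  "One window per incoming event, holding at most the last n events."
  [n events]
  (map (fn [i] (vec (take-last n (take i events))))
       (range 1 (inc (count events)))))

(fixed-event-windows 3 [1 2 3 4 5 6 7]) ; => ([1 2 3] [4 5 6])
(moving-event-windows 3 [1 2 3 4])      ; => ([1] [1 2] [1 2 3] [2 3 4])
```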
(rate 5 child)
  t = 0 to 5:   metrics 5, 1, 1  -> rate 1.4
  t = 5 to 10:  metrics 2, 1     -> rate 0.6
  t = 10 to 15: metrics 9, 4, 5  -> rate 3.6
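The arithmetic behind these numbers is the sum of the metrics in each interval divided by the interval length. A plain-Clojure sketch follows; the function `rates` and the individual event times are assumptions made for the example, since the slide only shows which interval each metric falls in.

```clojure
;; Hypothetical helper; event times are invented, chosen only so that
;; each metric lands in the interval shown on the slide.
(defn rates
  "Sum the metrics in each interval-second bucket and divide by the
  interval length, giving one rate per bucket in time order."
  [interval events]
  (->> events
       (group-by #(quot (:time %) interval))
       (sort-by key)
       (mapv (fn [[_ evs]]
               (/ (reduce + (map :metric evs)) (double interval))))))

(def events
  [{:time 0 :metric 5} {:time 2 :metric 1} {:time 4 :metric 1}
   {:time 6 :metric 2} {:time 8 :metric 1}
   {:time 11 :metric 9} {:time 12 :metric 4} {:time 14 :metric 5}])

(rates 5 events) ; => [1.4 0.6 3.6]
```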
(scale (/ 1 1024 1024 1024) child)
[Diagram: the metric over time t, in bytes before scale and in Gigabytes after]
(ddt child)
[Diagram: timeline with t axis ticks 0, 5, 25, 65 and value labels 4.75, 2, 15, 19, 38]
(changed :state {:init "ok"} child)
States received over time (from t = 0): ok, ok, ko, ko, ko, ok, ko
Sent to child: ko, ok, ko
(by [:host :service]
  (changed :state {:init "ok"} child))
{:host "foo.com" :service "kafka lag"}, states from t = 0: ok, ko, ko, ok
{:host "foo.com" :service "disk /root %"}, states from t = 0: ko, ko, ok, ok, ok
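The transition detection on the two slides above can be sketched in plain Clojure. The helpers `changes`, `changes-by` and `states` are hypothetical names written for this example and work on a finished list of events, whereas Riemann's `changed` and `by` operate on a live stream.

```clojure
;; Hypothetical helpers; a batch approximation of `changed` and `by`.
(defn changes
  "Events whose field differs from the previous event's value,
  with init as the assumed starting value."
  [field init events]
  (->> events
       (reductions (fn [[prev _] e]
                     (let [v (get e field)]
                       [v (when (not= prev v) e)]))
                   [init nil])
       (keep second)))

(defn changes-by
  "Track changes separately for each combination of the given fields."
  [fields field init events]
  (into {}
        (map (fn [[k evs]] [k (map field (changes field init evs))]))
        (group-by (apply juxt fields) events)))

(defn states
  "Build one event per state string for the given host and service."
  [host service ss]
  (map (fn [s] {:host host :service service :state s}) ss))

(map :state
     (changes :state "ok"
              (states "foo.com" "kafka lag"
                      ["ok" "ok" "ko" "ko" "ko" "ok" "ko"])))
;; => ("ko" "ok" "ko")
```

Without the grouping step, state changes from one host and service pair would be compared against events from another, which is why `by` wraps `changed` on the slide.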
(where (service "api request")
  (percentiles 60 [0.5 0.99] child))
Metrics received between t = 0 and t = 60: 10, 20, 30
Events passed to child:
  {:host "riemann.io" :service "api request 0.5" :metric 20}
  {:host "riemann.io" :service "api request 0.99" :metric 30}
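One simple way to arrive at the values on this slide is to sort the metrics and pick the element at index floor(p * n). The sketch below uses that rule as an assumption; `percentile` is a hypothetical helper, it expects a non-empty collection, and other percentile definitions would give different results on small samples.

```clojure
;; Hypothetical helper; assumes `values` is non-empty.
(defn percentile
  "Pick the value at point p (between 0 and 1) of the sorted values,
  using index floor(p * n), capped at the last element."
  [p values]
  (let [sorted (vec (sort values))
        n      (count sorted)
        idx    (min (dec n) (int (Math/floor (* p n))))]
    (nth sorted idx)))

(percentile 0.5 [10 20 30])  ; => 20
(percentile 0.99 [30 10 20]) ; => 30
```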
(where (state "critical")
  (throttle 2 3600
    (email "foo@riemann.io")))
[Timeline from t = 0 to 3600, incoming states: critical, critical, ok, critical, critical, critical, critical]
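The idea of letting through at most 2 events per 3600 seconds can be sketched in plain Clojure as follows. `throttle-events` is a hypothetical helper; it assumes `n` is at least 1 and that a new period starts with the first event arriving after the previous period has ended, which may not match how Riemann places its period boundaries.

```clojure
;; Hypothetical helper; a batch approximation of throttling.
(defn throttle-events
  "Keep at most n events per dt-second period. A period starts at the
  first event seen after the previous period has ended."
  [n dt events]
  (:out
   (reduce (fn [{:keys [start cnt out] :as acc} e]
             (let [t (:time e)]
               (cond
                 ;; first event, or the current period is over
                 (or (nil? start) (>= t (+ start dt)))
                 {:start t :cnt 1 :out (conj out e)}

                 ;; still room in the current period
                 (< cnt n)
                 (assoc acc :cnt (inc cnt) :out (conj out e))

                 ;; period is full: drop the event
                 :else acc)))
           {:start nil :cnt 0 :out []}
           events)))

(def criticals
  (map (fn [t] {:state "critical" :time t}) [0 10 20 30 3600 3700]))

(map :time (throttle-events 2 3600 criticals)) ; => (0 10 3600 3700)
```

On the slide the throttle sits behind the `where`, so only critical events count against the limit.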
(where (service "cpu %")
  (coalesce 10
    (smap max)))
[Diagram: every 10s the latest "cpu %" event of each host is passed on together, e.g. {:host "host 1" :service "cpu %" :metric 10} and {:host "host 2" :service "cpu %" :metric 20}, and smap max keeps the maximum]
Outputs and integrations: Pagerduty, InfluxDB, Email, Nagios, VictorOps, Elasticsearch, Hipchat, Shinken, Twilio, Graphite, Slack, Alerta, Kafka, Mailgun, Logstash, Datadog, Cloudwatch, Riemann, ...
(batch 100 1                          ;; batch size = 100, every 1 sec
  (async-queue! :influxdb             ;; create a threadpool
    {:queue-size 10000
     :core-pool-size 4}
    (influxdb {:host "127.0.0.1"      ;; forward to influx
               :db "riemann"})))
(exception-stream (email "alert@riemann.io")
  (influxdb {:host "127.0.0.1"
             :db "riemann"}))
Configuration as code
Split your configuration
(def check-critical-state             ;; a var containing a stream
  (where (state "critical")
    (email "admin@riemann.io")))

(defn check-state                     ;; a function returning a stream
  [s email-addr]
  (where (state s)
    (email email-addr)))

(streams
  check-critical-state
  (check-state "critical" "admin@riemann.io"))
/etc/riemann/riemann.config
/mycorp/app/elasticsearch.clj
/mycorp/output/mail.clj
/mycorp/system/disk.clj
/mycorp/system/ram.clj
…
+ A plugin system
Configuration as code
Tests
(scale (/ 1 1024 1024 1024)
  (tap :scale-tap
    child))

(tests
  (deftest foo-test
    (is (= (:scale-tap (inject! [{:metric 1000}]))
           [{:metric (/ 1000 1024 1024 1024)}]))))
The index
- In-memory data structure (hashmap)
- Key: [host service]
- Value: an event
- The index stream adds events to the index

(where (service "ram_percent")
  (index))
Index at Time 3 (rows: service, columns: host)

              foo.bar                        fizz.buzz
ram_percent   :metric 65 :ttl 120 :time 2    :metric 80 :ttl 120 :time 2
cpu_percent   :metric 40 :ttl 60 :time 1     :metric 90 :ttl 60 :time 3
Index at Time 10 (rows: service, columns: host)

              foo.bar                        fizz.buzz
ram_percent   :metric 65 :ttl 120 :time 2    :metric 80 :ttl 120 :time 2
cpu_percent   :metric 45 :ttl 60 :time 10    :metric 90 :ttl 60 :time 3
Index at Time 64 (rows: service, columns: host)

              foo.bar                        fizz.buzz
ram_percent   :metric 65 :ttl 120 :time 2    :metric 80 :ttl 120 :time 2
cpu_percent   :metric 45 :ttl 60 :time 10    :metric 90 :ttl 60 :time 3

The fizz.buzz cpu_percent entry is now past its TTL (time 3 + ttl 60 < 64).
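The three index snapshots can be replayed with a plain Clojure map keyed by `[host service]`. This is only a sketch: `index-event`, `expired?` and `live-entries` are hypothetical helpers, and the expiry rule used (an entry is expired once `time + ttl` is less than the current time) is an assumption for the example, not a statement of Riemann's exact rule.

```clojure
;; Hypothetical helpers modelling the index as a plain map.
(defn index-event
  "Store event under its [host service] key, replacing any older entry."
  [index event]
  (assoc index [(:host event) (:service event)] event))

(defn expired?
  "Assumed rule: an event is expired once time + ttl lies before now."
  [now {:keys [time ttl]}]
  (< (+ time ttl) now))

(defn live-entries
  "The index entries that have not expired at time now."
  [index now]
  (into {} (remove (fn [[_ event]] (expired? now event))) index))

(def idx
  (reduce index-event {}
          [{:host "foo.bar" :service "cpu_percent" :metric 40 :ttl 60 :time 1}
           {:host "foo.bar" :service "ram_percent" :metric 65 :ttl 120 :time 2}
           {:host "fizz.buzz" :service "ram_percent" :metric 80 :ttl 120 :time 2}
           {:host "fizz.buzz" :service "cpu_percent" :metric 90 :ttl 60 :time 3}
           ;; the update at time 10 replaces the earlier foo.bar cpu_percent entry
           {:host "foo.bar" :service "cpu_percent" :metric 45 :ttl 60 :time 10}]))

(count (live-entries idx 10)) ; all four entries are still live
(keys (live-entries idx 64))  ; ["fizz.buzz" "cpu_percent"] is no longer included
```

Keeping one entry per key means the index always holds the latest known state of each host and service pair, and the TTL decides how long that state is trusted without a refresh.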