Expectations on Remote Data: Supporting the Prometheus Remote Storage API

  1. Expectations on Remote Data: Supporting the Prometheus Remote Storage API. Alfred Landrum, Engineer @ Sysdig. Twitter: @alfred-landrum. Github: alfred-landrum

  2. [Architecture diagram] Sysdig Backend: Sysdig Dashboards/Alerts/Topology; Sysdig Data Engine and Store (● Distributed Datastore ● Time/Group Aggregation ● RBAC ● Downsampling). Host/Node based Agents: Status Data (● Orchestrator State ● Service Topology ● Application Checks); Time Series Data (● StatsD ● JMX ● Prometheus ● ...).

  3. [Architecture diagram] Grafana. Sysdig Backend: Prometheus HTTP API; PromQL Evaluation; PromQL Dashboards/Alerts; Sysdig Dashboards/Alerts/Topology; Sysdig Data API; Sysdig Data Engine and Store. Host/Node based Agents: Status Data (● Orchestrator State ● Service Topology ● Application Checks); Time Series Data (● StatsD ● JMX ● Prometheus ● ...).

  4. Columns: API, PromQL, Storage.
     API: range query from t0 to t1, step 10s: up{env="prod"} > 1
     Storage: labels: __name__ = "up", env = "prod"; time: start: (t0 - 5m), end: t1
     …
     1. Why ask for an extra 5 minutes?
     API: range query from t0 to t1, step 10s: rate(alerts_total[1m])
     Storage: labels: __name__ = "alerts_total"; time: start: t0, end: t1; read hints: func: "rate"
     2. What’s a “func” hint?
     …
     3. What does “rate” mean?

  5. Storage data model: a set of time series, identified by metric name and labels. No alignment guarantees. [Plot: value vs. time]

  6. Instant Query: evaluate an expression at a particular time. [Plot: value vs. time]

  7. Range Query: logically, a repeated instant query on [start, end] every step. [Plot: value vs. time; start, end, and step marked]

  8. What if there’s no sample at the evaluation time? [Plot: value vs. time]

  9. Range Vector Selector. PromQL: avg_over_time(queue_depth[1m]) [Plot: value vs. time; 1m window marked]
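
A minimal sketch of what a range vector selector hands to avg_over_time, assuming samples are (timestamp in seconds, value) pairs. The function names mirror PromQL but are illustrative Python, not Prometheus code. Treating the window as (t - 1m, t] is an assumption; the slides do not say which edge is inclusive.

```python
def range_vector(samples, t, window):
    """Return the values whose timestamps fall in (t - window, t].

    samples: list of (timestamp_seconds, value) pairs.
    The half-open boundary is an assumption of this sketch.
    """
    return [v for ts, v in samples if t - window < ts <= t]


def avg_over_time(samples, t, window):
    """Average of the raw samples inside the window, or None if it is empty."""
    values = range_vector(samples, t, window)
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical queue_depth samples, scraped every 15 seconds.
queue_depth = [(0, 4.0), (15, 2.0), (30, 6.0), (45, 4.0), (60, 8.0)]
result = avg_over_time(queue_depth, t=60, window=60)
```

Unlike the instant vector selector on the next slide, every raw sample in the window contributes, so no alignment step is needed first.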

  10. Instant vector selector. PromQL: queue_depth. The most recent value found at or before the evaluation time. [Plot: value vs. time]

  11. Same for range queries, applied at each step. [Plot: value vs. time; start, end, and step marked]

  12. PromQL now has aligned values, for calculations, comparisons, etc. [Plot: value vs. time; start, end, and step marked]

  13. How long do you see the value of the last sample? Controlled in 2 different ways: [Plot: value vs. time; last scraped datapoint and evaluation time marked, gap between them labeled "?????"]

  14. First way is via a configuration setting: lookbackDelta. Default is 5 minutes. [Plot: value vs. time; last scraped datapoint, evaluation time, and lookbackDelta marked]

  15. Same for range queries, applied at each step. [Plot: value vs. time; last scraped datapoint, evaluation time, lookbackDelta, start, and step marked]
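
The alignment rule in slides 10 to 15 can be sketched as follows, assuming sorted (timestamp in seconds, value) samples. `instant_value` and `range_query` are hypothetical names, and counting a sample that is exactly lookbackDelta old as still visible is an assumption of this sketch. It also suggests an answer to question 1 on slide 4: the first step at t0 may need a sample up to 5 minutes older than t0, so the storage read starts at t0 - 5m.

```python
from bisect import bisect_right

LOOKBACK_DELTA = 5 * 60  # seconds; the 5 minute default named on slide 14


def instant_value(samples, t, lookback=LOOKBACK_DELTA):
    """Most recent value at or before t, or None if it is older than lookback.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t)  # count of samples with ts <= t
    if i == 0:
        return None
    ts, value = samples[i - 1]
    if t - ts > lookback:  # boundary handling is an assumption
        return None
    return value


def range_query(samples, start, end, step, lookback=LOOKBACK_DELTA):
    """A range query as a repeated instant query on [start, end] every step."""
    out = []
    t = start
    while t <= end:
        out.append((t, instant_value(samples, t, lookback)))
        t += step
    return out


# Hypothetical unaligned samples.
series = [(0, 1.0), (70, 2.0)]
aligned = range_query(series, start=0, end=120, step=60)
```

Each step now carries a value at the same timestamp for every series, which is what lets PromQL compare or combine them.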

  16. Consider an alert that should fire if there’s no value. [Plot: value vs. time; last scraped datapoint, evaluation time, and lookbackDelta marked]

  17. The second way: stale markers. Scraping logic adds them 1-2 intervals after the last sample. [Plot: value vs. time; legend: scraped datapoint, stale marker, evaluated value, no value]
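
A minimal sketch of how a stale marker shortens the visibility of the last sample, under the same (timestamp in seconds, value) assumption as before. `STALE` is a hypothetical sentinel used only for illustration; Prometheus itself encodes the marker as a special NaN value.

```python
from bisect import bisect_right

STALE = object()  # hypothetical stand-in for a stale marker


def instant_value(samples, t, lookback=300):
    """Most recent value at or before t; None if it is stale or too old.

    samples: list of (timestamp_seconds, value or STALE), sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t)  # count of samples with ts <= t
    if i == 0:
        return None
    ts, value = samples[i - 1]
    if value is STALE or t - ts > lookback:
        return None
    return value


# The target disappears after t=60; the marker is written at t=120.
with_marker = [(0, 1.0), (60, 1.0), (120, STALE)]
without_marker = [(0, 1.0), (60, 1.0)]
```

With the marker, an "absent value" alert evaluated at t=130 sees no value right away; without it, the old sample stays visible until lookbackDelta expires.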

  18. [Architecture diagram] Grafana. Sysdig Backend: Prometheus HTTP API; PromQL Evaluation; PromQL Dashboards/Alerts; Sysdig Dashboards/Alerts/Topology; Sysdig Data API; Sysdig Data Engine and Store. Host/Node based Agents: Status Data (● Orchestrator State ● Service Topology ● Application Checks); Time Series Data (● StatsD ● JMX ● Prometheus ● ...).

  19. Columns: API, PromQL, Storage.
     API: range query from t0 to t1, step 10s: up{env="prod"} > 1
     Storage: labels: __name__ = "up", env = "prod"; time: start: (t0 - 5m), end: t1
     …
     ✔ Sample Alignment
     API: range query from t0 to t1, step 10s: rate(alerts_total[1m])
     Storage: labels: __name__ = "alerts_total"; time: start: t0, end: t1; read hints: func: "rate"
     2. What’s a “func” hint?
     …
     3. What does “rate” mean?

  20. Scraping intervals are typically on the order of 1 minute. A query for a month’s data would return ~45k samples per series. That’s likely more than the pixel width of your display. [Plot: time axis]
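
The ~45k figure follows from simple arithmetic, sketched here for a single series scraped every 60 seconds. `samples_in` is a hypothetical helper name.

```python
SCRAPE_INTERVAL_S = 60  # "on the order of 1 minute"


def samples_in(days, interval_s=SCRAPE_INTERVAL_S):
    """Number of samples one series accumulates over the given number of days."""
    return days * 24 * 60 * 60 // interval_s


month_30 = samples_in(30)  # 43200
month_31 = samples_in(31)  # 44640
```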

  21. Store an aggregation of many samples within some fixed resolution. What representative value should you store? [Plot: time axis]

  22. Average [Plot: time axis]

  23. Maximum [Plot: time axis]

  24. Sum [Plot: time axis; values labeled 05, 05, 12, 16, 06, 14]

  25. Not limited to a single aggregation - store several. How to select the best one for a query? [Plot: time axis; values labeled 05, 05, 12, 16, 06, 14]
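
A minimal sketch of storing several aggregations per fixed-resolution window and picking one from the read hint, assuming integer second timestamps. The `HINT_TO_AGGREGATION` table, the fallback to the average, and the function names are all assumptions made for illustration; the slides do not specify how the mapping is done.

```python
from collections import defaultdict


def downsample(samples, resolution):
    """Group samples into fixed windows and keep avg, max, and sum for each.

    samples: list of (timestamp_seconds, value) with integer timestamps.
    Returns {window_start: {"avg": ..., "max": ..., "sum": ...}}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % resolution].append(value)
    return {
        start: {"avg": sum(vs) / len(vs), "max": max(vs), "sum": sum(vs)}
        for start, vs in sorted(buckets.items())
    }


# Hypothetical mapping from the read hint's func to a stored aggregation.
HINT_TO_AGGREGATION = {
    "avg_over_time": "avg",
    "max_over_time": "max",
    "sum_over_time": "sum",
}


def select_series(downsampled, func_hint, default="avg"):
    """Pick the stored aggregation that best matches the func hint."""
    agg = HINT_TO_AGGREGATION.get(func_hint, default)
    return [(start, aggs[agg]) for start, aggs in downsampled.items()]


raw = [(0, 1.0), (10, 3.0), (20, 2.0), (60, 5.0), (70, 7.0)]
stored = downsample(raw, resolution=60)
for_max = select_series(stored, "max_over_time")
```

Counters queried through rate() need a different treatment, which the following slides work through.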

  26. Columns: API, PromQL, Storage.
     API: range query from t0 to t1, step 10s: up{env="prod"} > 1
     Storage: labels: __name__ = "up", env = "prod"; time: start: (t0 - 5m), end: t1
     …
     ✔ Sample Alignment
     API: range query from t0 to t1, step 10s: rate(alerts_total[1m])
     Storage: labels: __name__ = "alerts_total"; time: start: t0, end: t1; read hints: func: "rate"
     ✔ Aggregation Selection
     …
     3. What does “rate” mean?

  27. A decrease in value indicates a reset occurred. A common reason for a reset is a restarted instance. [Plot: time axis]

  28. rate(): divide the difference in events by a time duration. [Plot: time axis]

  29. No resets? Sum the deltas between samples. [Plot: time axis]

  30. Reset? Add post-reset value. [Plot: time axis]

  31. Effectively: slide everything up after each reset. [Plot: time axis]
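
The reset handling in slides 27 to 31 can be sketched as below, assuming (timestamp in seconds, value) samples from a single counter. This is a simplified illustration with hypothetical function names: it divides by the span between the first and last sample and omits the extrapolation to window boundaries that Prometheus' own rate() performs.

```python
def increase(values):
    """Total number of events across the values, treating any decrease as a reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev  # no reset: add the delta
        else:
            total += cur  # reset: the counter restarted from zero
    return total


def rate(samples):
    """Events per second between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return None
    return increase([v for _, v in samples]) / duration


# Hypothetical alerts_total samples with a restart between t=30 and t=60.
alerts_total = [(0, 1.0), (30, 3.0), (60, 1.0), (90, 4.0)]
events = increase([v for _, v in alerts_total])  # 2 + 1 + 3
```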

  32. What kind of aggregation would you need for rate? [Plot: time axis]

  33. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  34. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  35. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  36. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  37. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  38. How many events occurred between t1 and t2? [Plot: time axis; t1, t2 marked]

  39. How many events occurred between t1 and t2? In this case, the difference between: ● the sum of events in the t2 window ● the last value before t1 [Plot: time axis; t1, t2 marked]

  40. How many events occurred between t3 and t4? [Plot: time axis; t3, t4 marked]

  41. How many events occurred between t3 and t4? [Plot: time axis; t3, t4 marked]

  42. How many events occurred between t3 and t4? Can we just take the difference between the last raw value and the sum? [Plot: time axis; t3, t4 marked]

  43. No: a border reset means values in t3 don’t matter: just the t4 event sum. Need to store the first and last raw values to detect border resets. [Plot: time axis; t3, t4 marked; boundary-aligned reset highlighted]

  44. Store the first and last raw values, and the sum of events. [Plot: time axis]

  45. How to turn this into a response for PromQL remote read? [Plot: time axis]

  46. Generate a response sequence for each query. [Plot: time axis; t0, t1, t2, t3, t4 marked]

  47. PromQL sees a monotonically increasing value, with no resets. [Plot: time axis; t0, t1, t2, t3, t4 marked]
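
One possible reading of slides 44 to 47, sketched under stated assumptions: each window keeps its first and last raw values plus the reset-adjusted event count between them, and the response is a running total emitted at each window's end. The `Window` type, the function names, the integer second timestamps, and the choice of the window end as the response timestamp are all hypothetical; the slides do not describe Sysdig's actual storage layout.

```python
from dataclasses import dataclass


@dataclass
class Window:
    end: int       # timestamp of the window's right edge
    first: float   # first raw value seen in the window
    last: float    # last raw value seen in the window
    events: float  # reset-adjusted increase from first to last


def downsample_counter(samples, resolution):
    """Fold sorted (timestamp, value) counter samples into fixed windows."""
    windows = []
    for ts, value in samples:
        end = ts - ts % resolution + resolution
        if windows and windows[-1].end == end:
            w = windows[-1]
            # A decrease inside the window is a reset: count the new value.
            w.events += value - w.last if value >= w.last else value
            w.last = value
        else:
            windows.append(Window(end, value, value, 0.0))
    return windows


def response(windows):
    """Build a monotonically increasing sequence, one point per window."""
    out = []
    total = 0.0
    prev = None
    for w in windows:
        if prev is None:
            total = w.first
        elif w.first >= prev.last:
            total += w.first - prev.last  # ordinary growth across the border
        else:
            total += w.first  # border reset: the previous window's values don't matter
        total += w.events
        out.append((w.end, total))
        prev = w
    return out


# Hypothetical counter: border reset at t=60, in-window reset at t=150.
raw = [(0, 5.0), (30, 8.0), (60, 2.0), (90, 4.0), (120, 6.0), (150, 1.0)]
sequence = response(downsample_counter(raw, resolution=60))
```

Because the emitted values never decrease, rate() evaluated over this response needs no reset handling of its own.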

  48. Columns: API, PromQL, Storage.
     API: range query from t0 to t1, step 10s: up{env="prod"} > 1
     Storage: labels: __name__ = "up", env = "prod"; time: start: (t0 - 5m), end: t1
     …
     ✔ Sample Alignment
     API: range query from t0 to t1, step 10s: rate(alerts_total[1m])
     Storage: labels: __name__ = "alerts_total"; time: start: t0, end: t1; read hints: func: "rate"
     ✔ Aggregation Selection
     …
     ✔ Counter Downsampling

  49. [Architecture diagram] Grafana. Sysdig Backend: Prometheus HTTP API; PromQL Evaluation; PromQL Dashboards/Alerts; Sysdig Dashboards/Alerts/Topology; Sysdig Data API; Sysdig Data Engine and Store. Host/Node based Agents: Status Data (● Orchestrator State ● Service Topology ● Application Checks); Time Series Data (● StatsD ● JMX ● Prometheus ● ...).

  50. Thanks!
