
Cortex: Prometheus as a Service, One Year On (Tom Wilkie, PromCon)



  1. Cortex: Prometheus as a Service, One Year On Tom Wilkie, PromCon 2017 tom.wilkie@gmail.com https://github.com/weaveworks/cortex

  2. https://www.youtube.com/watch?v=3Tb4Wc0kfCM

  3. Cortex: Prometheus as a Service • Natively multi-tenant; isolate different customers in the same services. • Different story around scaling & HA • “Virtually infinite” retention and durability • Opportunities for performance enhancements [Diagram labels: Your Jobs, Prometheus HA, Cortex, Grafana, Alertmanager]

  4. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Consul, Ingester, DynamoDB, S3, Memcache. Legend: write requests, read requests, control requests]

  5. A Year’s Evolution

  6. Problem #1: DynamoDB Write Throughput

  7. https://github.com/weaveworks/cortex/issues/254

  8. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Consul, Ingester, Table Manager, DynamoDB, S3, Memcache. Legend: write requests, read requests, control requests]

  9. Problem #2: DynamoDB Write Throughput, again

  10. Original schema: • Hash Key: <user ID>:<hour>:<metric name> • Range Key: <label name>:<label value>:<chunk ID> New schema: • Hash Key: <user ID>:<day>:<metric name>:<label name> • Range Key: <chunk ID>:<chunk end time> https://github.com/weaveworks/cortex/pull/262
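The two key layouts on this slide can be sketched in a few lines of Python. This is only an illustration of the key formats as written on the slide, not Cortex's actual code: the function names, the `:` separator handling, the use of Unix seconds, and the choice to bucket by a single timestamp are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# An index entry is a (hash key, range key) pair.
Entry = Tuple[str, str]


def original_entries(user_id: str, metric: str, labels: Dict[str, str],
                     chunk_id: str, ts: int) -> List[Entry]:
    """Original schema: one hash key per user, hour and metric name.

    Every label of the chunk becomes a range key under that same hash key.
    """
    hour = ts // SECONDS_PER_HOUR  # hypothetical bucketing: Unix seconds -> hour number
    hash_key = f"{user_id}:{hour}:{metric}"
    return [(hash_key, f"{name}:{value}:{chunk_id}")
            for name, value in sorted(labels.items())]


def new_entries(user_id: str, metric: str, labels: Dict[str, str],
                chunk_id: str, ts: int, chunk_end: int) -> List[Entry]:
    """New schema: the label name moves into the hash key and the bucket is a day."""
    day = ts // SECONDS_PER_DAY  # hypothetical bucketing: Unix seconds -> day number
    return [(f"{user_id}:{day}:{metric}:{name}", f"{chunk_id}:{chunk_end}")
            for name in sorted(labels)]


if __name__ == "__main__":
    labels = {"job": "api", "instance": "a"}
    for entry in original_entries("user1", "http_requests_total", labels,
                                  "chunk42", 1500000000):
        print("original:", entry)
    for entry in new_entries("user1", "http_requests_total", labels,
                             "chunk42", 1500000000, 1500003600):
        print("new:     ", entry)
```

In this sketch, the original layout puts every label entry for a metric in a given hour under one hash key, while the new layout produces a separate hash key per label name, so writes for one metric are spread over more hash keys.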

  11. Problem #3: Queries of Death

  12. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Querier, Consul, Ingester, Table Manager, DynamoDB, S3, Memcache. Legend: write requests, read requests, control requests]

  13. Problem #3: Recording rules and alerts

  14. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, DynamoDB, S3, Memcache. Legend: write requests, read requests, control requests]

  15. Problem #4: Long tail

  16. https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

  17. Problem #5: Cost

  18. S3 vs DynamoDB • IOP cost ($/IOP): S3 5x10^-6, DynamoDB 2x10^-7 • Storage cost ($/GB/month): S3 0.023, DynamoDB 0.250 https://github.com/weaveworks/cortex/issues/141
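The prices on this slide imply a break-even object size, which the following sketch works out. It rests on a simplified cost model that is an assumption of this example, not something stated in the talk: each object costs exactly one IOP to write and is then stored for one month. The function names are hypothetical.

```python
# Prices taken from the slide.
S3_IOP_COST = 5e-6           # $/IOP
DYNAMO_IOP_COST = 2e-7       # $/IOP
S3_STORAGE_COST = 0.023      # $/GB/month
DYNAMO_STORAGE_COST = 0.250  # $/GB/month


def monthly_cost(size_gb: float, iop_cost: float, storage_cost: float) -> float:
    """Cost of writing one object once and keeping it for a month (assumed model)."""
    return iop_cost + storage_cost * size_gb


def break_even_size_gb() -> float:
    """Object size at which both stores cost the same under the assumed model.

    Solves: S3_IOP + S3_STORAGE * s == DYNAMO_IOP + DYNAMO_STORAGE * s
    """
    return (S3_IOP_COST - DYNAMO_IOP_COST) / (DYNAMO_STORAGE_COST - S3_STORAGE_COST)


if __name__ == "__main__":
    size = break_even_size_gb()
    print(f"break-even object size: {size:.3e} GB")
    for s in (1e-6, 0.05):
        dynamo = monthly_cost(s, DYNAMO_IOP_COST, DYNAMO_STORAGE_COST)
        s3 = monthly_cost(s, S3_IOP_COST, S3_STORAGE_COST)
        print(f"size {s} GB: DynamoDB ${dynamo:.3e}, S3 ${s3:.3e}")
```

Under these assumptions the break-even point is roughly 2.1x10^-5 GB (about 21 KB): below that size the per-operation price dominates and DynamoDB is cheaper, and above it the storage price dominates and S3 is cheaper.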

  19. [Chart: cost ($) against object size (GB), with one line each for DynamoDB and S3]

  20. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, DynamoDB, Memcache. Legend: write requests, read requests, control requests]

  21. Problem #6: DynamoDB, again

  22. Cortex Architecture [Diagram components: Frontend, Prometheus, Your Jobs, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, BigTable, Memcache. Legend: write requests, read requests, control requests]

  23. DynamoDB vs BigTable • 99th percentile write latency (ms): DynamoDB 70-100, BigTable 50-150 • 99th percentile read latency (ms): DynamoDB 100-2500, BigTable ~250 • LOC: DynamoDB ~2000, BigTable ~400. DynamoDB numbers courtesy of Weaveworks

  24. Closing thoughts

  25. 1. DynamoDB Write Throughput 2. DynamoDB Write Throughput, again 3. Recording rules and alerts 4. Long tail 5. Cost 6. DynamoDB, again

  26. Running for >12 months • Availability: querier unavailable for <12 hrs (~99.9%) • Durability: lost <2 days of data (>99.5%) • 99th percentile write performance ~60 ms • 99th percentile query performance <200 ms
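The availability and durability percentages can be checked with simple arithmetic. The sketch below assumes a running period of exactly 365 days and takes the slide's bounds (12 hours unavailable, 2 days of data lost) at their limits; the function name is hypothetical.

```python
HOURS_PER_DAY = 24
PERIOD_DAYS = 365  # assumption: exactly one year of operation


def percent_ok(bad_hours: float, period_hours: float) -> float:
    """Percentage of the period that was not affected by the bad hours."""
    return 100.0 * (1.0 - bad_hours / period_hours)


period_hours = PERIOD_DAYS * HOURS_PER_DAY

# Querier unavailable for at most 12 hours over the period.
availability = percent_ok(12, period_hours)

# At most 2 days of data lost over the period.
durability = percent_ok(2 * HOURS_PER_DAY, period_hours)

if __name__ == "__main__":
    print(f"availability: {availability:.2f}%")
    print(f"durability:   {durability:.2f}%")
```

Rounded to one decimal place, these limits give 99.9% and 99.5%. The real figures would be somewhat higher, since the service ran for more than 12 months and the outage and data loss were below the stated bounds.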

  27. Future • Direct chunk writes from Prometheus to Cortex Chunk Store • Separate ingester index for better load balancing • Use prometheus/tsdb for the ingesters • Etcd & gossip for ring storage • Chunks in Google Cloud Storage

  28. One more thing…

  29. I left Weaveworks at the beginning of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing. We’re available for Prometheus hosting, consulting, training and support. email: hello@kausal.co

  30. Metrics

  31. Logs

  32. Traces

  33. Thank you! Questions?
