Cortex: Prometheus as a Service, One Year On
Tom Wilkie, PromCon 2017
tom.wilkie@gmail.com
https://github.com/weaveworks/cortex
https://www.youtube.com/watch?v=3Tb4Wc0kfCM
Cortex: Prometheus as a Service
[Diagram labels: Your Jobs, Prometheus HA, Grafana, Alertmanager, Cortex]
• Natively multi-tenant; isolates different customers in the same services.
• Different story around scaling & HA
• “Virtually infinite” retention and durability
• Opportunities for performance enhancements
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Consul, Ingester, DynamoDB, S3, Memcache; arrows show write, read, and control requests]
A Year’s Evolution
Problem #1: DynamoDB Write Throughput
https://github.com/weaveworks/cortex/issues/254
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Consul, Ingester, Table Manager, DynamoDB, S3, Memcache; arrows show write, read, and control requests]
Problem #2: DynamoDB Write Throughput, again
Original schema:
• Hash Key: <user ID>:<hour>:<metric name>
• Range Key: <label name>:<label value>:<chunk ID>
New schema:
• Hash Key: <user ID>:<day>:<metric name>:<label name>
• Range Key: <chunk ID>:<chunk end time>
https://github.com/weaveworks/cortex/pull/262
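The two key layouts above can be sketched as follows. This is a minimal illustration, not the Cortex implementation: the function names, the ":" joining, and the use of integer hour and day buckets derived from a Unix timestamp are all assumptions made for the example. It shows that, for one metric, the original layout puts every label entry under a single hash key, while the new layout yields one hash key per label name.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def original_keys(user_id, ts, metric, label_name, label_value, chunk_id):
    """Original schema: hash key is bucketed by hour and metric name only."""
    hour = ts // SECONDS_PER_HOUR  # assumed bucket encoding
    hash_key = f"{user_id}:{hour}:{metric}"
    range_key = f"{label_name}:{label_value}:{chunk_id}"
    return hash_key, range_key


def new_keys(user_id, ts, metric, label_name, chunk_id, chunk_end):
    """New schema: hash key is bucketed by day and also includes the label name."""
    day = ts // SECONDS_PER_DAY  # assumed bucket encoding
    hash_key = f"{user_id}:{day}:{metric}:{label_name}"
    range_key = f"{chunk_id}:{chunk_end}"
    return hash_key, range_key


def distinct_hash_keys(entries):
    """Count how many different hash keys a batch of (hash, range) entries touches."""
    return len({hash_key for hash_key, _ in entries})


# Hypothetical chunk with three labels, written at a fixed timestamp.
labels = {"job": "api", "instance": "host-1", "status": "200"}
ts = 1502928000
old_entries = [original_keys("user1", ts, "http_requests_total", name, value, "chunk42")
               for name, value in labels.items()]
new_entries = [new_keys("user1", ts, "http_requests_total", name, "chunk42", ts + 3600)
               for name in labels]
```

With these hypothetical inputs, `distinct_hash_keys(old_entries)` is 1 and `distinct_hash_keys(new_entries)` is 3, so the same index writes are spread over more hash keys under the new layout.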
Problem #3: Queries of Death
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Querier, Consul, Ingester, Table Manager, DynamoDB, S3, Memcache; arrows show write, read, and control requests]
Problem #3: Recording rules and alerts
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, DynamoDB, S3, Memcache; arrows show write, read, and control requests]
Problem #4: Long tail
https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/
Problem #5: Cost
                            S3         DynamoDB
IOP Cost ($/IOP)            5x10^-6    2x10^-7
Storage Cost ($/GB/Month)   0.023      0.250
https://github.com/weaveworks/cortex/issues/141
[Chart: Cost ($) against object size (GB), one line each for DynamoDB and S3]
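The trade-off between per-operation cost and storage cost can be worked through with the prices quoted on the cost slide. The model below is an assumption made for illustration: each object costs one IOP to write plus one month of storage, and nothing else (reads, item size limits, and per-size write units are ignored). Function names are hypothetical.

```python
# Prices as quoted on the slide.
S3_IOP_COST = 5e-6             # $ per IOP
DYNAMODB_IOP_COST = 2e-7       # $ per IOP
S3_STORAGE_COST = 0.023        # $ per GB per month
DYNAMODB_STORAGE_COST = 0.250  # $ per GB per month


def object_cost(iop_cost, storage_cost, size_gb, months=1):
    """Assumed model: one write operation plus storage for the given number of months."""
    return iop_cost + storage_cost * size_gb * months


def break_even_size_gb():
    """Object size at which one write plus one month of storage costs the same in both stores."""
    return (S3_IOP_COST - DYNAMODB_IOP_COST) / (DYNAMODB_STORAGE_COST - S3_STORAGE_COST)


def cheaper_store(size_gb):
    """Name of the cheaper store for an object of this size under the assumed model."""
    s3 = object_cost(S3_IOP_COST, S3_STORAGE_COST, size_gb)
    dynamodb = object_cost(DYNAMODB_IOP_COST, DYNAMODB_STORAGE_COST, size_gb)
    return "DynamoDB" if dynamodb < s3 else "S3"
```

Under this model the break-even size is roughly 2.1x10^-5 GB (about 21 KB): smaller objects are cheaper in DynamoDB because the per-operation price dominates, and larger objects are cheaper in S3 because the storage price dominates.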
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, DynamoDB, Memcache; arrows show write, read, and control requests]
Problem #6: DynamoDB, again
Cortex Architecture
[Diagram: Your Jobs, Prometheus, Frontend, Distributor, Querier, Consul, Ingester, Ruler, Table Manager, BigTable, Memcache; arrows show write, read, and control requests]
                                      DynamoDB    BigTable
99th Percentile Write Latency (ms)    70-100      50-150
99th Percentile Read Latency (ms)     100-2500    ~250
LOC                                   ~2000       ~400
DynamoDB numbers courtesy of Weaveworks
Closing thoughts
1. DynamoDB Write Throughput
2. DynamoDB Write Throughput, again
3. Recording rules and alerts
4. Long tail
5. Cost
6. DynamoDB, again
Running for >12 months
• Availability: querier unavailable for <12hrs, ~99.9%
• Durability: lost <2 days of data, >99.5%
• 99th percentile write performance ~60ms
• 99th percentile query performance <200ms
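The availability and durability percentages can be checked with simple arithmetic. The sketch below assumes a 365-day year as the running period and takes the slide's limits (12 hours unavailable, 2 days of data lost) as worst cases; since the slide says ">12 months", "<12hrs" and "<2 days", the real figures sit above these bounds. Names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumed running period: one 365-day year
DAYS_PER_YEAR = 365


def fraction_ok(lost, total):
    """Fraction of the period that was not lost, given lost and total in the same unit."""
    return 1 - lost / total


# Worst-case bounds from the slide's limits.
availability_bound = fraction_ok(12, HOURS_PER_YEAR)  # 12 hours of querier downtime
durability_bound = fraction_ok(2, DAYS_PER_YEAR)      # 2 days of lost data

availability_pct = round(availability_bound * 100, 1)
durability_pct = round(durability_bound * 100, 1)
```

Rounded to one decimal place, the bounds come out at 99.9% availability and 99.5% durability, in line with the figures on the slide.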
Future
• Direct chunk writes from Prometheus to Cortex Chunk Store
• Separate ingester index for better load balancing
• Use prometheus/tsdb for the ingesters
• Etcd & gossip for ring storage
• Chunks in Google Cloud Storage
One more thing…
I left Weaveworks at the beginning of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing. We’re available for Prometheus hosting, consulting, training and support. email: hello@kausal.co
Metrics
Logs
Traces
Thank you! Questions?