Time Series Schemas @ Percona Live 2017
Who Am I?
Chris Larsen
● Maintainer and author for OpenTSDB since 2013
● Software Engineer @ Yahoo
● Central Monitoring Team
Who I’m not:
• A marketer
• A sales person
What are Time Series?
Time Series: A sequence of discrete data points (values), ordered and indexed by time, associated with an identity. E.g.:
web01.sys.cpu.busy.pct   45%   1/1/2017 12:01:00
web01.sys.cpu.busy.pct   52%   1/1/2017 12:02:00
web01.sys.cpu.busy.pct   35%   1/1/2017 12:03:00
^ Identity               ^ Value ^ Timestamp
What are Time Series?
Data Point: Metric + Tags + Value (42) + Timestamp (1234567890)
sys.cpu.user 1234567890 42 host=web01 cpu=0
^ a data point
● Payload could also be a string, a blob, a histogram, etc.
Choose Your Own Adventure!
● You’re developing a new app and want to see how long it takes to call that backend service.
● A web server is super slow and you want to track connections and latencies without parsing logs.
● You’re running a lab experiment and want to count cell divisions per second.
In the Beginning… Flat Files
• Slap in some code to append to a file.
• Import CSVs to Excel and graph it!
• PLUS:
  - Easy to share
  - Easy to parse with code
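A minimal sketch of the flat-file approach in Python; the file name, metric name, and helper function are invented for illustration:

import csv
import time

def record(value, path="cpu_busy.csv"):
    # Append one data point per line: identity, value, timestamp.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(["web01.sys.cpu.busy.pct", value, int(time.time())])

record(45)
record(52)

The resulting CSV opens directly in Excel, which is exactly the appeal at this stage.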
Choose Your Own Adventure!
Co-workers: “I like your instrumentation and graphs! We have more (apps | servers | experiments) for you to instrument. Can you do it? And give us a UI and CLI and (etc, etc, etc)”
You: “... Sure!”
In the Beginning… Flat Files
Now you see some problems:
• Many series == many files.
• How do you query lots of files?
• What if you grow to the point you’re thrashing the disk IO?
• Roll your own query and join code between files.
• Roll your own graphing server, CLI, etc.
RDBMS to the rescue!
Pros:
• Industry standard APIs and tools.
• Standard query language with transforms, filtering, etc.
• Replication, backups, high availability.
• Lots of vendors (OSS and paid) to choose from.
• Just have to create a UI.
First Schema:
• Index on metric and timestamp.
• Easy to query for time ranges and specific metrics.

SELECT max(value)
FROM timeseriesTable
WHERE metric = 'web01.sys.cpu.busy.pct'
  AND timestamp BETWEEN '2011-05-07' AND '2011-05-07 23:59:59.999'
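A runnable sketch of this first schema, using Python’s built-in sqlite3 as a stand-in database; the table and column names follow the query above, the index name and sample data are invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE timeseriesTable (
    metric    TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    value     REAL NOT NULL)""")
# The index that keeps per-metric range scans cheap -- for a while.
conn.execute("CREATE INDEX idx_metric_time ON timeseriesTable (metric, timestamp)")

conn.execute("INSERT INTO timeseriesTable VALUES (?, ?, ?)",
             ("web01.sys.cpu.busy.pct", "2011-05-07 12:01:00", 45.0))

row = conn.execute("""SELECT max(value) FROM timeseriesTable
                      WHERE metric = ?
                        AND timestamp BETWEEN '2011-05-07' AND '2011-05-07 23:59:59.999'""",
                   ("web01.sys.cpu.busy.pct",)).fetchone()
print(row)  # (45.0,)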
Choose Your Own Adventure!
Co-workers: “I love SQL so much! Thank you! By the way, we’re going to push 1000 new metrics per second in an hour. Have a great lunch break.”
You: “...”
First Schema:
Cons:
• More metrics and/or more frequent data means:
  • Bigger and bigger indices
  • Slower queries as the data set grows
  • Deleting data to clean up huge tables takes longer
Second Schema:
• Shard tables by month (later on by day, then hour…).
• Join across tables in the DB or in app.
• Delete old data by dropping a table.
• Room to grow.

SELECT max(value)
FROM timeseriesTable_2011_05_07
WHERE metric = 'web01.sys.cpu.busy.pct'
  AND timestamp BETWEEN '2011-05-07' AND '2011-05-07 23:59:59.999'
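A rough sketch of the sharding idea in Python/sqlite3: writes are routed to a per-day table and old data is expired by dropping whole shards. The table naming mirrors the query above; the helper functions are invented for illustration:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")

def shard_for(ts: datetime) -> str:
    # One table per day, e.g. timeseriesTable_2011_05_07.
    return "timeseriesTable_" + ts.strftime("%Y_%m_%d")

def write(metric, ts, value):
    table = shard_for(ts)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (metric TEXT, timestamp TEXT, value REAL)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (metric, ts.isoformat(" "), value))

def drop_older_than(cutoff: datetime):
    # Expiring old data becomes a cheap DROP TABLE instead of a huge DELETE.
    tables = [n for (n,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE name LIKE 'timeseriesTable_%'")]
    for name in tables:
        if name < shard_for(cutoff):
            conn.execute(f"DROP TABLE {name}")

write("web01.sys.cpu.busy.pct", datetime(2011, 5, 7, 12, 1), 45.0)
drop_older_than(datetime(2011, 5, 7) - timedelta(days=30))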
Choose Your Own Adventure!
Co-workers: “Thanks for bringing the DB back up but it’s down again. I think it could be because the __?__ group started pushing 100,000 metrics per second and are now sending metrics like host.system.cpu.core.busy.pct.”
You: “... oh.”
Second Schema:
Cons:
• While it helps buy some time, with continued growth you still have the problems of V1.
• One abuser can easily take down your system.
Third Schema:
• Shard tables by time and group (even by server).
• Reduce storage by using UID tables.

SELECT max(ts.value), m.metric, h.host, dc.dataCenter
FROM groupA_2011_05_07 ts
JOIN dataCenters dc ON ts.dataCenterId = dc.dataCenterId
JOIN metrics m ON ts.metricId = m.metricId
JOIN hosts h ON ts.hostId = h.hostId
WHERE m.metric = 'web01.sys.cpu.busy.pct'
  AND h.host REGEXP 'web.*'
  AND dc.dataCenter IN ('lga', 'phx')
  AND ts.timestamp BETWEEN '2011-05-07' AND '2011-05-07 23:59:59.999'
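A compact sketch of the UID idea in Python/sqlite3: dimension strings are interned once into small lookup tables and the sharded fact table stores only integer IDs. Table and column names follow the query above; the uid() helper is invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics     (metricId     INTEGER PRIMARY KEY, metric     TEXT UNIQUE);
CREATE TABLE hosts       (hostId       INTEGER PRIMARY KEY, host       TEXT UNIQUE);
CREATE TABLE dataCenters (dataCenterId INTEGER PRIMARY KEY, dataCenter TEXT UNIQUE);
-- One narrow fact table per time/group shard; each string is stored only once.
CREATE TABLE groupA_2011_05_07 (metricId INT, hostId INT, dataCenterId INT,
                                timestamp TEXT, value REAL);
""")

def uid(table, column, name):
    # Insert-if-missing, then return the integer ID for the string.
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (name,))
    return conn.execute(f"SELECT rowid FROM {table} WHERE {column} = ?", (name,)).fetchone()[0]

conn.execute("INSERT INTO groupA_2011_05_07 VALUES (?, ?, ?, ?, ?)",
             (uid("metrics", "metric", "web01.sys.cpu.busy.pct"),
              uid("hosts", "host", "web01"),
              uid("dataCenters", "dataCenter", "lga"),
              "2011-05-07 12:01:00", 45.0))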
Choose Your Own Adventure!
Co-workers: “Great work on the schema! Those queries are so much faster. Now we need more dimensions like X, Y, Z, Z', etc. Can we also store JSON events, Git commits, strings, histograms and get some alerting?”
You: sigh “Your wish is my command.”
Third Schema:
Cons:
• Doesn’t allow for unbounded dimensions (tags).
• Requires complex shard routing code.
• Different columns or tables per data type, or stored procedures to encode/decode blobs.
Explore Dedicated Time Series Systems!
Problems to Solve:
• Handle unbounded metrics and dimensions.
• Handle high cardinality dimensions.
  • E.g. userId=? where unique(userId) >= 1M
• Query wide time ranges at lower resolution (see the rollup sketch below).
  • E.g. use time rollups for 1 year queries.
• Aggregate multiple time series into single views.
  • E.g. sum(sys.if.traffic_in) where datacenter = phx.
• Perform transformations and extract useful analytics.
  • E.g. Top 10 highest traffic hosts. 99th percentile query latency.
• Replication, High Availability, Write and Read throughput.
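To make the rollup and aggregation requirements concrete, here is a tiny Python sketch (not from the talk) that downsamples raw points to hourly averages and then sums the series across hosts; the sample data and function names are invented:

from collections import defaultdict
from statistics import mean

# (metric, host, unix_timestamp, value) raw points -- invented sample data.
points = [
    ("sys.if.traffic_in", "web01", 1234567890, 10.0),
    ("sys.if.traffic_in", "web01", 1234568890, 14.0),
    ("sys.if.traffic_in", "web02", 1234567950, 7.0),
]

def hourly_rollup(points):
    # Lower the resolution: one averaged value per (metric, host, hour).
    buckets = defaultdict(list)
    for metric, host, ts, value in points:
        buckets[(metric, host, ts - ts % 3600)].append(value)
    return {k: mean(v) for k, v in buckets.items()}

def aggregate_across_hosts(rollups):
    # Collapse the host dimension: one summed view per (metric, hour).
    totals = defaultdict(float)
    for (metric, host, hour), value in rollups.items():
        totals[(metric, hour)] += value
    return dict(totals)

print(aggregate_across_hosts(hourly_rollup(points)))
# {('sys.if.traffic_in', 1234566000): 19.0}

A dedicated TSDB does this server-side, over billions of series, which is exactly what the rest of the list demands.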
1990s - MRTG and RRDTool
1990s - MRTG and RRDTool
Schema:
• Circular buffer, fixed time interval and numeric data.
Pros:
• Fixed file sizes with lower resolution storage.
• Built-in graphing and simple methods.
• Portable, backup-able.
Cons:
• Many series == many files == IO thrashing.
• No replication/HA.
• Manual aggregation, no dimensions.
1990s - KDB+, Informix
Schema:
• ? Proprietary.
Pros:
• Designed for time series.
• Complex analysis.
• Commercial support.
Cons:
• Commercial fees.
• Little integration with open-source projects.
2000s - Graphite
Schema:
• Circular buffer, fixed time interval and numeric data.
Pros:
• Aggregations and rollups available.
• Transform functions and dashboarding.
• Working on distributed stores.
Cons:
• Lack of replication/HA.
• Same as RRDTool.
2010 - OpenTSDB
• Open Source Time Series Database based on Google’s in-house time series DB.
• Store trillions of data points at millions of writes per second.
• Keeps raw data at the original timestamp and precise value.
• Keep it forever or TTL it out.
• Scales using HBase or Bigtable.
• Provides multi-series analysis.
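As an illustration of how data typically gets in, a minimal write sketch against OpenTSDB’s HTTP /api/put endpoint, using only the Python standard library; it assumes a TSD is already running on localhost at the default port:

import json
import time
import urllib.request

# One data point in OpenTSDB's JSON put format: metric + tags + value + timestamp.
point = {
    "metric": "sys.cpu.user",
    "timestamp": int(time.time()),
    "value": 42,
    "tags": {"host": "web01", "cpu": "0"},
}

# Assumes a TSD listening on localhost:4242 (the default port).
req = urllib.request.Request(
    "http://localhost:4242/api/put",
    data=json.dumps(point).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req, timeout=5)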
What are HBase and Bigtable?
● HBase is an OSS distributed LSM-backed hash table based on Google’s Bigtable.
● Key-value, row-based column store.
● Sorted by row, columns and cell versions.
● Supports:
  o Scans across rows with filters.
  o Get specific row and/or columns.
  o Atomic operations.
● CP from CAP theorem.
OpenTSDB Schema
● Row key is a concatenation of UIDs and time:
  salt + metric + timestamp + tagk1 + tagv1 … + tagkN + tagvN
  sys.cpu.user 1234567890 42 host=web01 cpu=0
  → \x01\x00\x00\x01\x49\x95\xFB\x70\x00\x00\x01\x00\x00\x01\x00\x00\x02\x00\x00\x02
● Timestamp normalized on hour or daily boundaries.
● All data points for an hour or day are stored in one row.
● Data: VLE 64-bit signed integers or single/double precision signed floats, strings and raw histograms.
● Saves storage space but requires UID conversion.
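A small Python sketch of how such a row key could be assembled, assuming a 1-byte salt, 3-byte UIDs and a 4-byte hour-aligned base timestamp; the UID values mirror the example bytes above, and the assemble function itself is illustrative, not OpenTSDB code:

import struct

def row_key(salt, metric_uid, timestamp, tags):
    # Normalize the timestamp to the start of its hour.
    base_time = timestamp - (timestamp % 3600)
    key = bytes([salt]) + metric_uid + struct.pack(">I", base_time)
    # Tag pairs are appended in sorted order: tagk1 + tagv1 ... tagkN + tagvN.
    for tagk_uid, tagv_uid in sorted(tags):
        key += tagk_uid + tagv_uid
    return key

# sys.cpu.user 1234567890 host=web01 cpu=0, with assumed UID assignments.
key = row_key(
    salt=0x01,
    metric_uid=b"\x00\x00\x01",                 # sys.cpu.user
    timestamp=1234567890,
    tags=[(b"\x00\x00\x01", b"\x00\x00\x01"),   # host=web01
          (b"\x00\x00\x02", b"\x00\x00\x02")],  # cpu=0
)
print(key.hex())  # 010000014995fb70000001000001000002000002

Because all points in the same hour share this key prefix, one HBase scan over a key range pulls back a whole series for a time span.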
OpenTSDB Schema
Row Key                          Columns (qualifier/value)
m t1 tagk1 tagv1                 o1/v1  o2/v2  o3/v3
m t1 tagk1 tagv2                 o1/v1  o2/v2
m t1 tagk1 tagv1 tagk2 tagv3     o1/v1  o2/v2  o3/v3
m t1 tagk1 tagv2 tagk2 tagv4     o1/v1  o3/v3
m t1 tagk3 tagv5                 o1/v1  o2/v2  o3/v3
m t1 tagk3 tagv6                 o2/v2
m t2 tagk1 tagv1                 o1/v1  o3/v3
m t2 tagk1 tagv2                 o1/v1  o2/v2
OpenTSDB Use Cases
● Backing store for Argus: open source monitoring and alerting system.
● 50M writes per minute.
● ~4M writes per TSD per minute.
● 23k queries per minute.
● https://github.com/salesforce/Argus
OpenTSDB Use Cases
● Monitoring system, network and application performance and statistics.
● Single cluster: 10M to 18M writes/s, ~3PB.
● Multi-tenant and Kerberos-secured HBase.
● ~200k writes per second per TSD.
● Central monitoring for all Yahoo properties.
● Over 1 billion active time series served.
● Leading committer to OpenTSDB.
Other Users
OpenTSDB
Pros:
• Scalable with HBase/HDFS or hosted Google Bigtable, including replication.
• Annotations and distributed histograms (digests).
• Rollup and pre-aggregate support.
• Built-in graphing and analytics, or use OSS tools (Grafana).
Cons:
• Distributed HBase is complex (hosted Bigtable is easy).
• UID resolution and current lack of metadata.
• High cardinality or dense time series still an issue.
OpenTSDB
For version 3.0:
• New query engine with:
  • Distributed queries.
  • Time-based caching.
  • Write-through caching using Facebook Beringei.
• Pluggable storage engines.
• Anomaly detection via machine learning.