TimescaleDB: Re-engineering PostgreSQL as a time-series database Michael J. Freedman Co-founder / CTO, Timescale Professor of Computer Science, Princeton mike@timescale.com · github.com/timescale · Apache 2 License
Financial & Datacenter Marketing & DevOps Time-series Data is Emerging Web / mobile Events Everywhere Transportation & Logistics Industrial Machines
So, what is time-series data ?
What you have been told Name CPU Tags Host=Name,Region=West Data 1990-01-01 01:02:00 70 1990-01-01 01:03:00 71 1990-01-01 01:04:00 72 1990-01-01 01:04:00 73 1990-01-01 01:04:00 100
What you have been told Name CPU FreeMem Tags Host=Name,Region=West Host=Name,Region=West Data 1990-01-01 01:02:00 70 1990-01-01 01:02:00 800M 1990-01-01 01:03:00 71 1990-01-01 01:03:00 600M 1990-01-01 01:04:00 72 1990-01-01 01:04:00 400M 1990-01-01 01:04:00 73 1990-01-01 01:04:00 200M 1990-01-01 01:04:00 100 1990-01-01 01:04:00 0 2 time-series?
This is wrong
Time-series data has a richer structure Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120
Fewer queries Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 select * where time = x
Complex filters Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 select * where temp > 100
Complex aggregates Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 avg(mem_free) group by (cpu/10)
Correlations Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 How does temp correlate with memfree?
Leverage relations CPU Host Region Data 1990-01-01 01:02:00 70 1 91 1990-01-01 01:03:00 71 2 91 1990-01-01 01:04:00 72 3 93 1990-01-01 01:04:00 73 4 93 1990-01-01 01:04:00 100 5 95 Region stored in separate host metadata table
How is this different than having a time field?
Treat changes as inserts , not overwrites.
You probably have time-series data and don’t even realize it
What database for time-series data?
What database for time-series data? Relational 32% NoSQL 68% https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/
Why so much NoSQL?
Why not use a simple relational database table ?
Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage) Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics)
Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage) Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics)
B-tree Insert Pain Insert batch: 8 17 5 1 10 20 1 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK
B-tree Insert Pain Insert batch: 8 17 5 1 10 20 1 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK
B-tree Insert Pain Insert batch: 8 17 1 10 20 1 5 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK
B-tree Insert Pain Insert batch: 8 1 10 20 1 5 10 13 17 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK
Challenge in scaling up • Indexes write to random parts of B-tree Device, Time DESC – Example: BTREE(device_id, time DESC) • As table grows large – Indexes no longer fit in memory – Writes to random locations in B-tree Device: A Device: Z – Causes swapping Time: 01:01:01 Time: 01:01:01
Why not just use a NoSQL DB ?
NoSQL champion: Log-Structured Merge Trees • Compressed data storage Key-value store with indexed key lookup at • Common approach for time series: high-write rates use key <name, tags, field, time> +
Log-Structured Merge Tree Insert batch: 8 17 5 1 1 2 3 3 4 5 5 7 7 9 9 10 10 11 Older Memtable sstables
Log-Structured Merge Tree Insert batch: 1 1 2 3 3 8 4 5 17 5 7 5 7 9 9 10 10 11 Older Memtable sstables
Log-Structured Merge Tree Insert batch: 1 1 2 3 3 5 4 5 8 5 7 17 7 9 9 10 10 11 Older Memtable sstables
Problems with non-KV queries If querying for key 5: stop after finding it in first sstable If querying for data in time range 5-10: 1 1 2 3 have to scan all sstables 3 5 4 5 8 5 7 Requires (in-memory) data structs to 17 7 9 9 efficiently map tags to keys 10 10 11 ⇒ high-cardinality issue Older No secondary index support ⇒ “tag lock-in” sstables
NoSQL + LSMTs Come at a Cost • Less powerful queries • Weaker consistency (no ACID) • No JOINS • Loss of SQL ecosystem +
Is there a better way ?
TimescaleDB: Scalable time-series database, full SQL Packaged as a PostgreSQL extension Apache 2 Licensed
TimescaleDB : Scalable time-series database, full SQL Packaged as a PostgreSQL extension Easy to Use Scalable Reliable • Inherits 20+ years of Supports full SQL High write rates • • PostgreSQL reliability Time-oriented features Time-oriented optimizations • • • Streaming replication, Easy to manage: looks Fast complex queries • • backups, HA clustering like a regular table 100s billions rows / node • One DB for relational • & time-series data
How?
Time-series workloads are different.
OLTP Time-series • Primarily UPDATEs • Primarily INSERTs • Writes randomly distributed • Writes to recent time interval • Transactions to multiple • Writes associated with a primary keys timestamp and primary key
How it works
Time (older)
Time-space partitioning (for both scaling up & out) Time (older) Intervals 1) manually specified 2) automatically adjusted
Time-space partitioning (for both scaling up & out) Time (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)
Time-space partitioning (for both scaling up & out) Time Chunk (sub-table) (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)
Time-space partitioning (for both scaling up & out) Hypertable Time Chunk (sub-table) (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)
The Hypertable Abstraction Hypertable Indexes • Triggers • Constraints • Foreign keys • UPSERTs • Chunks Table mgmt •
PostgreSQL 10 doesn’t solve the problem (more later)
Not well supported in PG10 Hypertable Indexes • Triggers • Constraints • Foreign keys • UPSERTs • Chunks Table mgmt •
Automatic Space-time Partitioning
Automatic Space-time Partitioning
Chunking benefits
Chunks are “right-sized” Recent (hot) chunks fit in memory
Single node: Scaling up via adding disks How Benefit Faster inserts • Chunks spread across many disks (elastically!) Parallelized queries either RAIDed or via distinct tablespaces •
Under development Multi-node: Scaling out via existing mechanisms • Chunks spread across servers • Insert/query to any server D e v e l o p m e n t • Distributed query optimizations (push-down LIMITs and aggregates, etc.)
Avoid querying chunks via constraint exclusion SELECT time, temp FROM data WHERE time > now() - interval ‘7 days’ AND device_id = ‘12345’
Avoid querying chunks via constraint exclusion SELECT time, device_id, temp FROM data WHERE time > ‘2017-08-22 18:18:00+00’
Avoid querying chunks via constraint exclusion SELECT time, device_id, temp FROM data Plain Postgres WHERE time > now() - interval ’24 hours’ won’t exclude chunks
Efficient retention policies SELECT time, device_id, temp FROM data Drop chunks, don’t delete rows WHERE time > now() - interval ’24 hours’ ⇒ avoids vacuuming
Is it just about performance?
Simplify your stack Application Application VS RDBMS NoSQL TimescaleDB (with JOINS)
Rich Time Analytics
Geospatial Temporal Analysis (with PostGIS)
Data Retention + Aggregations Granularity raw 15 min day 1 week 1 month forever Retention
Enjoy the entire PostgreSQL ecosystem
More recommend