timescaledb
play

TimescaleDB: Re-engineering PostgreSQL as a time-series database - PowerPoint PPT Presentation

TimescaleDB: Re-engineering PostgreSQL as a time-series database Michael J. Freedman Co-founder / CTO, Timescale Professor of Computer Science, Princeton mike@timescale.com github.com/timescale Apache 2 License Financial


  1. TimescaleDB: Re-engineering PostgreSQL as a time-series database Michael J. Freedman Co-founder / CTO, Timescale 
 Professor of Computer Science, Princeton mike@timescale.com · github.com/timescale · Apache 2 License

  2. Financial & Datacenter Marketing & DevOps Time-series Data is Emerging Web / mobile Events Everywhere Transportation & Logistics Industrial Machines

  3. So, what is time-series data ?

  4. What you have been told Name CPU Tags Host=Name,Region=West Data 1990-01-01 01:02:00 70 
 1990-01-01 01:03:00 71 1990-01-01 01:04:00 72 1990-01-01 01:04:00 73 1990-01-01 01:04:00 100

  5. What you have been told Name CPU FreeMem Tags Host=Name,Region=West Host=Name,Region=West Data 1990-01-01 01:02:00 70 
 1990-01-01 01:02:00 800M 
 1990-01-01 01:03:00 71 1990-01-01 01:03:00 600M 1990-01-01 01:04:00 72 1990-01-01 01:04:00 400M 1990-01-01 01:04:00 73 1990-01-01 01:04:00 200M 1990-01-01 01:04:00 100 1990-01-01 01:04:00 0 2 time-series?

  6. This is wrong

  7. Time-series data has a richer structure Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 
 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120

  8. Fewer queries Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 
 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 select * where time = x

  9. Complex filters Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 
 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 select * where temp > 100

  10. Complex aggregates Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 
 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 avg(mem_free) group by (cpu/10)

  11. Correlations Tags Host=Name,Region=West CPU MemFree Temp Data 1990-01-01 01:02:00 
 70 800M 80 1990-01-01 01:03:00 71 600M 81 1990-01-01 01:04:00 72 400M 82 1990-01-01 01:04:00 73 200M 83 1990-01-01 01:04:00 100 0 120 How does temp correlate with memfree?

  12. Leverage relations CPU Host Region Data 1990-01-01 01:02:00 
 70 1 91 1990-01-01 01:03:00 71 2 91 1990-01-01 01:04:00 72 3 93 1990-01-01 01:04:00 73 4 93 1990-01-01 01:04:00 100 5 95 Region stored in separate host metadata table

  13. How is this different than having a time field?

  14. Treat changes as inserts , not overwrites.

  15. You probably have time-series data and don’t even realize it

  16. What database for time-series data?

  17. What database for time-series data? Relational 32% NoSQL 68% https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/

  18. Why so much NoSQL?

  19. Why not use a simple relational database table ?

  20. Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage) Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics)

  21. Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage) Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics)

  22. B-tree Insert Pain Insert batch: 8 17 5 1 10 20 1 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK

  23. B-tree Insert Pain Insert batch: 8 17 5 1 10 20 1 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK

  24. B-tree Insert Pain Insert batch: 8 17 1 10 20 1 5 10 13 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK

  25. B-tree Insert Pain Insert batch: 8 1 10 20 1 5 10 13 17 24 25 29 IN MEMORY Memory Capacity: 2 NODES WRITE TO DISK

  26. Challenge in scaling up • Indexes write to random parts of B-tree Device, Time DESC – Example: BTREE(device_id, time DESC) • As table grows large – Indexes no longer fit in memory – Writes to random locations in B-tree Device: A Device: Z – Causes swapping Time: 01:01:01 Time: 01:01:01

  27. Why not just use a NoSQL DB ?

  28. NoSQL champion: Log-Structured Merge Trees • Compressed data storage Key-value store with indexed key lookup at • Common approach for time series: high-write rates use key <name, tags, field, time> +

  29. Log-Structured Merge Tree Insert batch: 8 17 5 1 1 2 3 3 4 5 5 7 7 9 9 10 10 11 Older Memtable sstables

  30. Log-Structured Merge Tree Insert batch: 1 1 2 3 3 8 4 5 17 5 7 5 7 9 9 10 10 11 Older Memtable sstables

  31. Log-Structured Merge Tree Insert batch: 1 1 2 3 3 5 4 5 8 5 7 17 7 9 9 10 10 11 Older Memtable sstables

  32. Problems with non-KV queries If querying for key 5: stop after finding it in first sstable If querying for data in time range 5-10: 1 1 2 3 have to scan all sstables 3 5 4 5 8 5 7 Requires (in-memory) data structs to 17 7 9 9 efficiently map tags to keys 
 10 10 11 ⇒ high-cardinality issue Older No secondary index support 
 ⇒ “tag lock-in” sstables

  33. NoSQL + LSMTs Come at a Cost • Less powerful queries • Weaker consistency (no ACID) • No JOINS • Loss of SQL ecosystem +

  34. Is there a better way ?

  35. TimescaleDB: 
 Scalable time-series database, full SQL Packaged as a PostgreSQL extension Apache 2 Licensed

  36. TimescaleDB : 
 Scalable time-series database, full SQL Packaged as a PostgreSQL extension Easy to Use Scalable Reliable • Inherits 20+ years of Supports full SQL High write rates • • PostgreSQL reliability Time-oriented features Time-oriented optimizations • • • Streaming replication, Easy to manage: looks Fast complex queries • • backups, HA clustering like a regular table 100s billions rows / node • One DB for relational 
 • & time-series data

  37. How?

  38. Time-series workloads are different.

  39. OLTP Time-series • Primarily UPDATEs • Primarily INSERTs • Writes randomly distributed • Writes to recent time interval • Transactions to multiple 
 • Writes associated with a 
 primary keys timestamp and primary key

  40. How it works

  41. Time 
 (older)

  42. Time-space partitioning 
 (for both scaling up & out) Time 
 (older) Intervals 1) manually specified 2) automatically adjusted

  43. Time-space partitioning 
 (for both scaling up & out) Time 
 (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)

  44. Time-space partitioning 
 (for both scaling up & out) Time 
 Chunk (sub-table) (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)

  45. Time-space partitioning 
 (for both scaling up & out) Hypertable Time 
 Chunk (sub-table) (older) Intervals 1) manually specified 2) automatically adjusted Space (hash partitioning)

  46. The Hypertable Abstraction Hypertable Indexes • Triggers • Constraints • Foreign keys • UPSERTs • Chunks Table mgmt •

  47. PostgreSQL 10 doesn’t solve the problem 
 (more later)

  48. Not well supported in PG10 Hypertable Indexes • Triggers • Constraints • Foreign keys • UPSERTs • Chunks Table mgmt •

  49. Automatic Space-time Partitioning

  50. Automatic Space-time Partitioning

  51. Chunking benefits

  52. Chunks are “right-sized” Recent (hot) chunks fit in memory

  53. Single node: Scaling up via adding disks How Benefit Faster inserts • Chunks spread across many disks (elastically!) Parallelized queries either RAIDed or via distinct tablespaces •

  54. Under development Multi-node: Scaling out via existing mechanisms • Chunks spread across servers • Insert/query to any server D e v e l o p m e n t • Distributed query optimizations (push-down LIMITs and aggregates, etc.)

  55. Avoid querying chunks via constraint exclusion SELECT time, temp FROM data 
 WHERE time > now() - interval ‘7 days’ 
 AND device_id = ‘12345’

  56. Avoid querying chunks via constraint exclusion SELECT time, device_id, temp FROM data 
 WHERE time > ‘2017-08-22 18:18:00+00’

  57. Avoid querying chunks via constraint exclusion SELECT time, device_id, temp FROM data 
 Plain Postgres WHERE time > now() - interval ’24 hours’ won’t exclude chunks

  58. Efficient retention policies SELECT time, device_id, temp FROM data 
 Drop chunks, don’t delete rows WHERE time > now() - interval ’24 hours’ ⇒ avoids vacuuming

  59. Is it just about performance?

  60. Simplify your stack Application Application VS RDBMS NoSQL TimescaleDB 
 (with JOINS)

  61. Rich Time Analytics

  62. Geospatial Temporal Analysis (with PostGIS)

  63. Data Retention + Aggregations Granularity raw 15 min day 1 week 1 month forever Retention

  64. Enjoy the entire PostgreSQL ecosystem

More recommend