Time-Series Data in MongoDB on a Budget
Peter Schwaller – Senior Director Server Engineering, Percona
Santa Clara, California | April 23rd – 25th, 2018
TIME SERIES DATA in MongoDB on a Budget
What is Time-Series Data?
Characteristics:
• Arriving data is stored as a new value as opposed to overwriting existing values
• Usually arrives in time order
• Accumulated data size grows over time
• Time is the primary means of organizing/accessing the data
Time Series Data in MONGODB on a Budget
Why MongoDB?
• General purpose database
• Specialized Time-Series DBs do exist
• Do not use the MMAPv1 (mmap) storage engine
Data Retention Options
• Purge old entries
  • Set up MongoDB index with TTL option (be careful if this index is your shard key)
• Aggregate data and store summaries
  • Create summary document, delete original raw data
  • Huge compression possible (seconds->minutes->hours->days->months->years)
• Measurement buckets
  • Store all entries for a time window in a single document
  • Avoids storing duplicate metadata
• Individual Documents for Each Measurement
  • Useful when data is sparse or intermittent (e.g., events rather than sensors)
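The measurement-bucket and summary options can be sketched in a few lines of Python. This is only an illustration under assumptions: the reading layout (sensor, timestamp, value), the field names, the one-minute window and the `sensor:window` form of `_id` are all hypothetical and not taken from the slides.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_measurements(readings, window_seconds=60):
    """Group (sensor_id, timestamp, value) readings into one document
    per sensor per time window, with summary statistics attached."""
    buckets = defaultdict(list)
    for sensor_id, ts, value in readings:
        epoch = int(ts.timestamp())
        window_start = epoch - epoch % window_seconds
        # Store only the offset inside the window, not the full timestamp.
        buckets[(sensor_id, window_start)].append((epoch - window_start, value))

    docs = []
    for (sensor_id, window_start), entries in sorted(buckets.items()):
        values = [v for _, v in entries]
        docs.append({
            "_id": f"{sensor_id}:{window_start}",  # hypothetical key layout
            "sensor": sensor_id,                    # metadata stored once per bucket
            "time": datetime.fromtimestamp(window_start, tz=timezone.utc),
            "samples": [{"offset": o, "value": v} for o, v in sorted(entries)],
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        })
    return docs
```

Dropping the `samples` array and keeping only the statistics would turn each bucket into a summary document, which is the step that allows the raw data to be deleted.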
Potential Problems with Data Collection
• Duplicate entries
  • Utilize unique index in MongoDB to reject duplicate entries
• Delayed
• Out of order
Problems with Delayed and Out-of-Order Entries
• Alert/Event generation
• Incremental Backup
Enable Streaming of Data
• Add recordedTime field (in addition to existing field with timestamp)
• Utilize $currentDate feature of db.collection.update()
    $currentDate: { recordedTime: true }
  • You cannot use this field as a shard key!
  • Requires use of update instead of insert
    - Which in turn requires specification of _id field
    - Consider constructing your _id to solve the duplicate entries issue at the same time
Allows applications to reliably process each document once and only once.
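One way to combine the constructed _id with $currentDate is sketched below. The choice of key fields (`sensor`, `time`) and the SHA-1 hashing are assumptions made for illustration; the slides do not say how the _id should be built. The function only builds the filter and update documents, so it runs without a server.

```python
import hashlib


def build_upsert(doc, key_fields=("sensor", "time")):
    """Return (filter, update) for an upsert keyed on a deterministic _id.

    key_fields is a hypothetical choice: any set of fields that uniquely
    identifies a measurement would do.
    """
    raw = "|".join(str(doc[f]) for f in key_fields)
    doc_id = hashlib.sha1(raw.encode("utf-8")).hexdigest()
    # _id comes from the filter, so it is left out of $set.
    fields = {k: v for k, v in doc.items() if k != "_id"}
    update = {
        "$set": fields,
        # Server-side timestamp of when the document was written.
        "$currentDate": {"recordedTime": True},
    }
    return {"_id": doc_id}, update


# Usage with pymongo (not executed here):
#   flt, upd = build_upsert(doc)
#   database[collection].update_one(flt, upd, upsert=True)
```

Because the same measurement always maps to the same _id, a re-sent entry updates the existing document rather than creating a second one.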
Accessing Your Data
It’s only *mostly* write-only.
Create Appropriate Indexes
• Avoid collection scans!
  • Consider using:
      db.adminCommand( { setParameter: 1, notablescan: 1 } )
  • Avoid queries that might as well be collection scans
• Create the indexes you need (but no more)
  • Don’t depend on index intersection
• Don’t over index
  • Each index can take up a lot of disk/memory
  • Consider using partial indexes
      { partialFilterExpression: { speed: { $gt: 75.0 } } }
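As a sketch of the partial-index suggestion, the helper below builds a `createIndexes` command document around the `partialFilterExpression` shown on the slide. The collection name, index name and single-field key are assumptions for illustration only.

```python
def partial_index_command(collection, field, threshold):
    """Build a createIndexes command document for a partial index that
    only covers documents where `field` is greater than `threshold`."""
    return {
        "createIndexes": collection,  # command name must be the first key
        "indexes": [{
            "key": {field: 1},
            "name": f"{field}_partial",  # hypothetical index name
            "partialFilterExpression": {field: {"$gt": threshold}},
        }],
    }


# Usage with pymongo (not executed here):
#   database.command(partial_index_command("TimeSeries", "speed", 75.0))
```

A query generally has to include a condition that falls within the filter expression (for example `speed > 75.0` or stricter) before the planner will consider a partial index, so it is worth confirming with .explain().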
Check Your Indexes
• Use .explain() liberally
• Check which indexes are actually used:
    db.collection.aggregate( [ { $indexStats: {} } ] )
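The $indexStats output can be filtered for indexes that have never been used. The sketch below assumes each result document carries `name` and `accesses.ops`; the sample index names in the usage comment are made up.

```python
def unused_indexes(index_stats):
    """Return the names of indexes with zero recorded accesses.

    index_stats is the list of documents produced by the $indexStats
    stage. The mandatory _id index is skipped because it cannot be dropped.
    """
    return sorted(
        stat["name"]
        for stat in index_stats
        if stat["name"] != "_id_" and int(stat["accesses"]["ops"]) == 0
    )


# Usage with pymongo (not executed here):
#   stats = list(database[collection].aggregate([{"$indexStats": {}}]))
#   print(unused_indexes(stats))
```

The counters are kept per server and start over when mongod restarts, so a zero count is only meaningful after the server has seen a representative workload.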
Adding Data
Getting the Speed You Need
API Methods
• Insert array
    database[collection].insert(doc_array)
• Insert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.insert(doc)  # loop here
    bulk.execute()
• Upsert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
    bulk.execute()
• Insert single
    database[collection].insert(doc)
• Upsert single
    database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
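The "loop here" comments imply splitting the incoming stream into batches before each bulk.execute(). A minimal batching helper might look like the following; the default batch size of 1000 is a tunable assumption rather than a recommendation.

```python
from itertools import islice


def batches(docs, size=1000):
    """Yield successive lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Usage with the upsert unordered bulk method above (not executed here):
#   for chunk in batches(incoming_docs):
#       bulk = database[collection].initialize_unordered_bulk_op()
#       for doc in chunk:
#           bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})
#       bulk.execute()
```

Working from an iterator keeps memory use bounded by one batch even when the source is a long-running feed.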
Relative Performance Comparison of API Methods
[Bar chart: Docs/Sec (axis 0 to 40,000) for Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, Update Single]
Benchmarks
… and other lies.
Answering, “Why can’t I just use a gigantic HDD RAID array?”
Benchmark Environment
• VMs
  • 4 core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
  • 8 GB RAM
  • Sandisk Ultra II 960GB SSD
  • WD 5TB 7200rpm HDD
• MongoDB
  • 3.4.13
  • WiredTiger
  • 4GB Cache
  • Snappy collection compression
  • Standalone server (no replica set, no mongos)
• Data
  • 178 bytes per document in 6 fields
  • 3 indexes (2 compound)
  • Disk usage: 40% storage, 60% indexes
  • Using update unordered bulk method, 1000 docs per bulk.execute()
Benchmark SSD vs. HDD
[Bar chart: Inserts/Sec (axis 0 to 10,000), SSD vs. HDD]
SSD Benchmark 60 Minutes
SSD Benchmark 0:30-1:00
HDD Benchmark 0:30-1:30
HDD Benchmark 0:30-8:45 (42M documents)
HDD Benchmark Last Hour
SSD Benchmark 0:30-2:10 (42M documents)
Benchmark SSD vs. HDD Last Hour
[Bar chart: Inserts/Sec (axis 0 to 10,000), SSD vs. HDD]
96 Hour Test
TL;DR
• Don’t trust someone else’s benchmarks (especially mine!)
• Benchmark using your own “schema” and indexes
• Artificially accelerate the point at which index size exceeds available memory
Time Series Data in MongoDB on a BUDGET
Replica Set Rollout Options
• Follow standard advice
  • 3 server replica sets (Primary, Secondary, Secondary)
  • Every replica set server on its own hardware
  • Disk mirroring
• Cost cutting options
  • Primary, Secondary, Arbiter
  • Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
  • No disk mirroring (how many copies do you really need?)
• “I love downtime and don’t care about my data”
  • Single instance servers instead of replica sets
  • RAID0 (“no wasted disk space!”)
  • No backups
Storing Lots of Data
“Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.”
Conventional Sharding
• Non-sharded data kept in default replica set
• Shard key hashed on timestamp to evenly distribute data
• Pros:
  • Increases insert rate
  • Arbitrarily large data storage
• Cons:
  • All shard replica sets should have comparable hardware
  • All shards start thrashing at the same time
  • Expanding means a LOT of rebalancing
Data Access Patterns
• New writes are always very recent
• Reads are almost always of recent data
• Reads of old data are “intuitively” slower
… let’s take advantage of that.
Sharding by Zone
• Non-sharded data kept in default replica set
• Most recent time-series data stored in “fast” replica set
• Older time-series data stored in “slow” replica sets
• Pros:
  • Pay for speed where we need it
  • Swap “fast” to “slow” before thrashing kills performance
  • “Infinite” data size
• Cons:
  • Ceiling on insert speed
Prerequisites for Zone Sharding
• Sharded cluster configured (config replica set, mongos, etc.)
• Existing replica set rsmain (primary shard) contains your normal (not time-series) data
• TimeSeries collection with an index on “time”
• New replica set for time-series data (e.g., rs001) added as a shard
Initial Zone Ranges
• Run on mongos:
    use admin
    sh.enableSharding('DBName')
    sh.shardCollection('DBName.TimeSeries', { time: 1 })
    sh.addShardTag('rsmain', 'future')
    sh.addShardTag('rs001', 'ts001')
    sh.addTagRange('DBName.TimeSeries', {time: new Date("2099-01-01")}, {time: MaxKey}, 'future')
    sh.addTagRange('DBName.TimeSeries', {time: MinKey}, {time: new Date("2099-01-01")}, 'ts001')
    // sh.splitAt('DBName.TimeSeries', {"time": new Date("2099-01-01")})
Adding a New Time-Series Replica Set
Step 1 – Create new Replica Set
• When?
  • Well before you run out of available fast storage
  • Before your input capacity is lowered too close to your needs
• Where?
  • On the same server with fast storage as the current time-series replica set
• Run on mongos:
    use admin
    db.runCommand({addShard: "rs002/hostname:port", name: "rs002"})
    sh.addShardTag('rs002', 'ts002')
    var configdb = db.getSiblingDB("config");
    configdb.tags.update({tag: "ts001"}, {$set: {'max.time': new ISODate("2018-04-26")}})
    sh.addTagRange('DBName.TimeSeries', {time: new Date("2018-04-26")}, {time: new Date("2099-01-01")}, 'ts002')
    // sh.splitAt('DBName.TimeSeries', {"time": new ISODate("2018-04-26")})
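The bookkeeping behind these tag ranges can be sketched as a small planning function: each new replica set takes over from a cutover date, the previous zone's upper bound shrinks to that date, and the `future` zone stays fixed at 2099-01-01. The `"MinKey"`/`"MaxKey"` strings are stand-ins for the real BSON values, and the `tsNNN` naming simply follows the slides; this is a planning aid under those assumptions, not a replacement for the shell commands.

```python
from datetime import datetime

MIN_KEY, MAX_KEY = "MinKey", "MaxKey"  # placeholders for the BSON MinKey/MaxKey
FUTURE = datetime(2099, 1, 1)          # boundary used on the slides


def zone_ranges(cutovers):
    """Return contiguous zone ranges on the `time` shard key.

    cutovers is a sorted list of datetimes at which ts002, ts003, ...
    start receiving data. Each range is min-inclusive, max-exclusive.
    """
    bounds = [MIN_KEY] + list(cutovers) + [FUTURE]
    ranges = [
        {"zone": f"ts{i + 1:03d}", "min": bounds[i], "max": bounds[i + 1]}
        for i in range(len(bounds) - 1)
    ]
    ranges.append({"zone": "future", "min": FUTURE, "max": MAX_KEY})
    return ranges
```

For example, `zone_ranges([datetime(2018, 4, 26)])` yields ts001 up to 2018-04-26, ts002 from 2018-04-26 to 2099-01-01, and `future` beyond that, matching the ranges set up in Step 1.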
Adding a New Time-Series Replica Set
Step 2 – Wait before Relocation
• Initially nothing changes – all data is added into previous replica set
• Eventually, new entries match the min.time of the new replica set and will be stored there
• How long to wait before relocation?
  • Make sure you don’t fill up your fast storage
  • How far back in time do “normal” queries go?
    - Queries to previous replica set will get slower after relocation
Adding a New Time-Series Replica Set
Step 3 – Relocate to Slow Storage
• Follow standard procedure for moving replica set
• Multiple server instances can share same server/storage
  • Use unique ports
  • Set --wiredTigerCacheSizeGB appropriately
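To illustrate several instances sharing one "slow" server, the sketch below generates a mongod argument list per replica set with a unique port and a share of the cache budget. The base port, data directory layout and the even split of cache are assumptions for illustration; an appropriate cache size depends on the actual workload.

```python
def mongod_commands(replica_sets, base_port=27018, total_cache_gb=4.0,
                    data_root="/data"):
    """Build one mongod argument list per replica set on a shared host.

    Ports are assigned sequentially from base_port and the cache budget
    is split evenly; both choices are hypothetical defaults.
    """
    cache_each = round(total_cache_gb / len(replica_sets), 2)
    commands = []
    for offset, name in enumerate(replica_sets):
        commands.append([
            "mongod", "--shardsvr",
            "--replSet", name,
            "--port", str(base_port + offset),       # unique port per instance
            "--dbpath", f"{data_root}/{name}",       # separate data directory
            "--wiredTigerCacheSizeGB", str(cache_each),
        ])
    return commands


# Usage:
#   for cmd in mongod_commands(["rs001", "rs002"]):
#       print(" ".join(cmd))
```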
Pause for Questions