Time-Series Data in MongoDB on a Budget

Peter Schwaller, Senior Director Server Engineering, Percona
Santa Clara, California | April 23rd-25th, 2018

TIME SERIES DATA in MongoDB on a Budget

What is Time-Series Data?
Characteristics:
• Arriving data is stored as a new value as opposed to overwriting existing values
• Usually arrives in time order
• Accumulated data size grows over time
• Time is the primary means of organizing/accessing the data

Time Series Data in MONGODB on a Budget

Why MongoDB?
• General purpose database
• Specialized Time-Series DBs do exist
• Do not use the MMAPv1 (mmap) storage engine

Data Retention Options
• Purge old entries
  • Set up MongoDB index with TTL option (be careful if this index is your shard key)
• Aggregate data and store summaries
  • Create summary document, delete original raw data
  • Huge compression possible (seconds->minutes->hours->days->months->years)
• Measurement buckets
  • Store all entries for a time window in a single document
  • Avoids storing duplicate metadata
• Individual documents for each measurement
  • Useful when data is sparse or intermittent (e.g., events rather than sensors)
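The "aggregate data and store summaries" option can be sketched in plain Python. This is a minimal illustration, not the speaker's implementation: the field names `sensor`, `time`, and `value`, and the helpers `bucket_start` and `summarize`, are hypothetical. It rolls raw measurements up into one summary document per sensor and time window, which is the seconds-to-minutes step of the compression chain.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_start(ts: datetime, window_seconds: int) -> datetime:
    """Round a timezone-aware timestamp down to the start of its window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)


def summarize(measurements, window_seconds=60):
    """Group raw measurements into one summary document per sensor and window.

    Each measurement is a dict with the hypothetical fields
    "sensor", "time" (timezone-aware datetime) and "value".
    """
    groups = defaultdict(list)
    for m in measurements:
        key = (m["sensor"], bucket_start(m["time"], window_seconds))
        groups[key].append(m["value"])

    summaries = []
    for (sensor, start), values in sorted(groups.items()):
        summaries.append({
            "sensor": sensor,  # metadata is stored once per window
            "time": start,     # start of the window
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        })
    return summaries
```

Running the same function over the minute summaries with a larger window would need a weighted average (using `count`), so treat this as a single-level sketch only.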

Potential Problems with Data Collection
• Duplicate entries
  • Utilize unique index in MongoDB to reject duplicate entries
• Delayed
• Out of order

Problems with Delayed and Out-of-Order Entries
• Alert/Event generation
• Incremental Backup

Enable Streaming of Data
• Add recordedTime field (in addition to existing field with timestamp)
• Utilize $currentDate feature of db.collection.update()
    $currentDate: { recordedTime: true }
  • You cannot use this field as a shard key!
  • Requires use of update instead of insert
    • Which in turn requires specification of _id field
    • Consider constructing your _id to solve the duplicate entries issue at the same time
Allows applications to reliably process each document once and only once.
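One way to combine the constructed `_id` with `$currentDate` is sketched here in Python. The document fields `source` and `time`, the SHA-1 scheme, and the helper names `make_id` and `build_upsert` are assumptions for illustration; the slides do not say how the `_id` is built. The functions only build the filter and update documents, so they run without a server.

```python
import hashlib
from datetime import datetime, timezone


def make_id(source: str, measured_at: datetime) -> str:
    """Derive a deterministic _id from the fields that define uniqueness.

    Assumption: a measurement is unique per (source, time).
    """
    raw = f"{source}|{measured_at.isoformat()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def build_upsert(doc: dict) -> tuple:
    """Return (filter, update) arguments for an upsert that stamps recordedTime."""
    update = {
        "$set": doc,
        # The server fills in its own clock value when the update is applied.
        "$currentDate": {"recordedTime": True},
    }
    return {"_id": make_id(doc["source"], doc["time"])}, update
```

With PyMongo the pair could be passed as `collection.update_one(flt, update, upsert=True)`. A re-sent duplicate then overwrites the same document instead of creating a second one. It also refreshes `recordedTime`, so a consumer that streams by `recordedTime` would see that document again.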

Accessing Your Data
It’s only *mostly* write-only.

Create Appropriate Indexes
• Avoid collection scans!
  • Consider using:
      db.adminCommand( { setParameter: 1, notablescan: 1 } )
  • Avoid queries that might as well be collection scans
• Create the indexes you need (but no more)
  • Don’t depend on index intersection
• Don’t over index
  • Each index can take up a lot of disk/memory
  • Consider using partial indexes
      { partialFilterExpression: { speed: { $gt: 75.0 } } }

Check Your Indexes
• Use .explain() liberally
• Check which indexes are actually used:
    db.collection.aggregate( [ { $indexStats: {} } ] )
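The output of `$indexStats` can be scanned for indexes that are never used. The sketch below assumes the result documents have already been fetched into a list of dicts; it relies only on the `name` and `accesses.ops` fields that `$indexStats` reports. The helper name `unused_indexes` is hypothetical.

```python
def unused_indexes(index_stats):
    """Return names of indexes with zero recorded accesses.

    The mandatory _id_ index is skipped because it cannot be dropped.
    """
    return [
        stat["name"]
        for stat in index_stats
        if stat["name"] != "_id_" and int(stat["accesses"]["ops"]) == 0
    ]
```

The counters cover only the period since `accesses.since`, so an index that looks unused shortly after a server restart may still be needed by infrequent queries.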

Adding Data
Getting the Speed You Need

API Methods
• Insert array
    database[collection].insert(doc_array)
• Insert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.insert(doc)  # loop here
    bulk.execute()
• Upsert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
    bulk.execute()
• Insert single
    database[collection].insert(doc)
• Upsert single
    database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
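The "loop here" step of the bulk methods amounts to feeding documents to the server in fixed-size batches. The following is a driver-independent sketch of that loop; `batches`, `load`, and the `execute` callback are hypothetical names, and the default batch size of 1000 simply mirrors the figure quoted for the benchmark.

```python
from itertools import islice


def batches(docs, size=1000):
    """Yield lists of up to `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load(docs, execute, size=1000):
    """Send documents in batches; `execute` is called once per batch.

    Returns the number of batches sent.
    """
    sent = 0
    for chunk in batches(docs, size):
        execute(chunk)
        sent += 1
    return sent
```

With a current PyMongo release, `execute` could be something like `lambda chunk: collection.insert_many(chunk, ordered=False)`. That is an assumption about how the pieces would be wired together, not the code used in the talk.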

Relative Performance Comparison of API Methods
[Bar chart: Docs/Sec (axis 0 to 40,000) for Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, Update Single]

Benchmarks … and other lies.
Answering, “Why can’t I just use a gigantic HDD RAID array?”

Benchmark Environment
• VMs
  • 4 core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
  • 8 GB RAM
  • Sandisk Ultra II 960GB SSD
  • WD 5TB 7200rpm HDD
• MongoDB
  • 3.4.13
  • WiredTiger
  • 4GB Cache
  • Snappy collection compression
  • Standalone server (no replica set, no mongos)
• Data
  • 178 bytes per document in 6 fields
  • 3 indexes (2 compound)
  • Disk usage: 40% storage, 60% indexes
  • Using update unordered bulk method, 1000 docs per bulk.execute()

Benchmark SSD vs. HDD
[Bar chart: Inserts/Sec (axis 0 to 10,000), SSD vs. HDD]

SSD Benchmark 60 Minutes

SSD Benchmark 0:30-1:00

HDD Benchmark 0:30-1:30

HDD Benchmark 0:30-8:45 (42M documents)

HDD Benchmark Last Hour

SSD Benchmark 0:30-2:10 (42M documents)

Benchmark SSD vs. HDD Last Hour
[Bar chart: Inserts/Sec (axis 0 to 10,000), SSD vs. HDD]

96 Hour Test

TL;DR
• Don’t trust someone else’s benchmarks (especially mine!)
• Benchmark using your own “schema” and indexes
• Artificially accelerate the point at which index size exceeds available memory

Time Series Data in MongoDB on a BUDGET

Replica Set Rollout Options
• Follow standard advice
  • 3 server replica sets (Primary, Secondary, Secondary)
  • Every replica set server on its own hardware
  • Disk mirroring
• Cost cutting options
  • Primary, Secondary, Arbiter
  • Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
  • No disk mirroring (how many copies do you really need?)
• “I love downtime and don’t care about my data”
  • Single instance servers instead of replica sets
  • RAID0 (“no wasted disk space!”)
  • No backups

Storing Lots of Data
“Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.”

Conventional Sharding
• Non-sharded data kept in default replica set
• Shard key hashed on timestamp to evenly distribute data
• Pros:
  • Increases insert rate
  • Arbitrarily large data storage
• Cons:
  • All shard replica sets should have comparable hardware
  • All shards start thrashing at the same time
  • Expanding means a LOT of rebalancing

Data Access Patterns
• New writes are always very recent
• Reads are almost always of recent data
• Reads of old data are “intuitively” slower
… let’s take advantage of that.

Sharding by Zone
• Non-sharded data kept in default replica set
• Most recent time-series data stored in “fast” replica set
• Older time-series data stored in “slow” replica sets
• Pros:
  • Pay for speed where we need it
  • Swap “fast” to “slow” before thrashing kills performance
  • “Infinite” data size
• Cons:
  • Ceiling on insert speed

Prerequisites for Zone Sharding
• Sharded cluster configured (config replica set, mongos, etc.)
• Existing replica set rsmain (primary shard) contains your normal (not time-series) data
• TimeSeries collection with an index on “time”
• New replica set for time-series data (e.g., rs001) added as a shard

Initial Zone Ranges
• Run on mongos:
    use admin
    sh.enableSharding('DBName')
    sh.shardCollection('DBName.TimeSeries', { time: 1 })
    sh.addShardTag('rsmain', 'future')
    sh.addShardTag('rs001', 'ts001')
    sh.addTagRange('DBName.TimeSeries', { time: new Date("2099-01-01") }, { time: MaxKey }, 'future')
    sh.addTagRange('DBName.TimeSeries', { time: MinKey }, { time: new Date("2099-01-01") }, 'ts001')
    // sh.splitAt('DBName.TimeSeries', { "time": new Date("2099-01-01") })

Adding a New Time-Series Replica Set
Step 1: Create New Replica Set
• When?
  • Well before you run out of available fast storage
  • Before your input capacity is lowered too close to your needs
• Where?
  • On the same server with fast storage as the current time-series replica set
• Run on mongos:
    use admin
    db.runCommand({ addShard: "rs002/hostname:port", name: "rs002" })
    sh.addShardTag('rs002', 'ts002')
    var configdb = db.getSiblingDB("config");
    configdb.tags.update({ tag: "ts001" }, { $set: { 'max.time': new ISODate("2018-04-26") } })
    sh.addTagRange('DBName.TimeSeries', { time: new Date("2018-04-26") }, { time: new Date("2099-01-01") }, 'ts002')
    // sh.splitAt('DBName.TimeSeries', { "time": new ISODate("2018-04-26") })
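The effect of these commands on the zone ranges can be modelled in a few lines of Python. This is only a bookkeeping sketch with hypothetical helper names (`add_zone`, `zone_for`); it does not talk to a cluster. It assumes ranges are half-open (minimum inclusive, maximum exclusive), which is how MongoDB tag ranges behave, and uses `None` to stand in for `MinKey`.

```python
from datetime import datetime

FAR_FUTURE = datetime(2099, 1, 1)  # same sentinel date as the slides


def add_zone(ranges, new_tag, start):
    """Close the newest time-series range at `start` and append a new one.

    `ranges` is a list of (tag, min_time, max_time) tuples ordered by time;
    a min_time of None stands for MinKey. Returns a new list.
    """
    tag, lo, hi = ranges[-1]
    if hi != FAR_FUTURE:
        raise ValueError("last range must end at the far-future sentinel")
    if lo is not None and start <= lo:
        raise ValueError("new range must start after the current one")
    return ranges[:-1] + [(tag, lo, start), (new_tag, start, FAR_FUTURE)]


def zone_for(ranges, ts):
    """Return the tag whose [min, max) range contains ts."""
    for tag, lo, hi in ranges:
        if (lo is None or lo <= ts) and ts < hi:
            return tag
    return "future"  # at or beyond the sentinel: the 'future' zone on rsmain
```

Starting from `[("ts001", None, FAR_FUTURE)]`, calling `add_zone(ranges, "ts002", datetime(2018, 4, 26))` mirrors the update of `max.time` for ts001 followed by the new range for ts002. Documents timestamped before the boundary keep mapping to ts001.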

Adding a New Time-Series Replica Set
Step 2: Wait before Relocation
• Initially nothing changes: all data is added into the previous replica set
• Eventually, new entries match the min.time of the new replica set and will be stored there
• How long to wait before relocation?
  • Make sure you don’t fill up your fast storage
  • How far back in time do “normal” queries go?
    - Queries to previous replica set will get slower after relocation

Adding a New Time-Series Replica Set
Step 3: Relocate to Slow Storage
• Follow standard procedure for moving replica set
• Multiple server instances can share same server/storage
  • Use unique ports
  • Set --wiredTigerCacheSizeGB appropriately

Pause for Questions
