Improvements to MongoRocks in 2017
Mark Callaghan, Member of Technical Staff, Facebook
MongoDB Engines
• mmapv1 - update-in-place B-Tree
• WiredTiger - copy-on-write B-Tree
• MongoRocks - log-structured merge tree (LSM)
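A quick way to confirm which engine a given mongod is running is to read it from serverStatus. The sketch below is an illustration, not part of the slides; it assumes a local mongod and the pymongo driver, and the connection string is an assumption.

```python
# Minimal sketch: report the active storage engine ("mmapv1", "wiredTiger",
# "rocksdb", ...) of a running mongod via the serverStatus command.
# The localhost connection string is an assumption for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")
print(status["storageEngine"]["name"])
```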
Why MongoRocks?
• Best space efficiency
• Better write efficiency
• Good read efficiency
• Effective with SSD & disk

RUM = Read, Update, Memory
• An algorithm can't be optimal for all 3
• See Designing Access Methods: The RUM Conjecture
Making MongoRocks better
We make MongoRocks better by making RocksDB better:
• Smarter compaction
• Support for database >> RAM
• Less response time variance
• Much more done and in progress
Things to make better in MongoRocks
• RocksDB has too many options
• CPU-bound on IO-heavy tests
• Percona Server isn't using recent RocksDB releases
MongoDB versions tested
Percona Server for MongoDB:
• 3.0.15 with RocksDB 4.1.0 (October 2015)
• 3.2.15 with RocksDB 4.4.1 (February 2016)
• 3.4.6 with RocksDB 4.13.5 (December 2016)
Compiled from source:
• MongoDB 3.4.7 with RocksDB 5.7.3 (August 2017)
Insert benchmark
Finding database stalls for 10 years
• N collections (N is 1 or 16)
• 1 PK, 3 secondary indexes per collection
• Inserts are in PK order, random for secondary indexes
• Queries are short range scans
• Supports MongoDB and MySQL
• In Python, need to rewrite with something faster

Runs in 4 phases (a sketch of the load phase follows below):
1. Insert-only - load X million rows
2. Scan PK, secondary indexes in sequence
3. N clients do queries, N clients each insert 1000/s
4. N clients do queries, N clients each insert 100/s

Configuration options:
• Database in memory vs IO-bound
• Client per collection
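To make phase 1 concrete, here is a hedged Python sketch of the load for a single collection, not the real benchmark client: one PK (_id) plus three secondary indexes, rows inserted in PK order, random values for the secondary-index fields. Field names, index definitions, batch size and the connection string are illustrative assumptions, and only the MongoDB side is shown.

```python
# Hedged sketch of the load phase (phase 1) for one collection, not the real
# benchmark client: one PK (_id) plus 3 secondary indexes, inserts in PK order,
# random values for the secondary-index fields. Field names, index definitions,
# batch size and the connection string are illustrative assumptions.
import random
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
coll = client.ibench.purchases_index

# 3 secondary indexes per collection (the PK is _id).
coll.create_index([("price", pymongo.ASCENDING), ("customerid", pymongo.ASCENDING)])
coll.create_index([("cashregisterid", pymongo.ASCENDING), ("price", pymongo.ASCENDING)])
coll.create_index([("price", pymongo.ASCENDING), ("dateandtime", pymongo.ASCENDING)])

def load(num_rows, batch_size=1000):
    docs = []
    for pk in range(num_rows):                         # inserts in PK order
        docs.append({
            "_id": pk,
            "dateandtime": random.randint(0, 2**31),   # random for secondary indexes
            "cashregisterid": random.randint(0, 1000),
            "customerid": random.randint(0, 100000),
            "price": round(random.random() * 500, 2),
            "data": "x" * 10,                          # small filler payload
        })
        if len(docs) == batch_size:
            coll.insert_many(docs, ordered=True)
            docs = []
    if docs:
        coll.insert_many(docs, ordered=True)

load(1000 * 1000)                                      # "load X million rows"
```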
Load throughput
Large server, 16 clients; inserts/second

Engine       Version   IO-bound   In-memory
mmapv1       3.0         13,157      17,778
mmapv1       3.2         13,983      21,122
mmapv1       3.4         14,446      24,286
WiredTiger   3.0         17,286     113,533
WiredTiger   3.2         14,179      94,841
WiredTiger   3.4         14,234      93,284
MongoRocks   3.0          7,827       7,931
MongoRocks   3.2         47,990      50,241
MongoRocks   3.4         51,943      54,933
IO-bound load efficiency
Large server, 16 clients, database >> RAM, v3.4.6

Engine       Avg IPS   IO read/insert   IO KB read/insert   IO KB write/insert   CPU/insert   Size (GB)
mmapv1        14,446             3.95               22.72                25.53          595        11xx
WiredTiger    14,234             1.81               26.26                39.31         2082         577
MongoRocks    51,493             0.05                2.83                10.32          461         487
Scan secondary indexes
Large server, 16 clients; scan time in seconds (lower is better)

Engine       Version   IO-bound   In-memory
mmapv1       3.0          1,334         103
mmapv1       3.2          1,461         116
mmapv1       3.4          1,474         121
WiredTiger   3.0          2,078         159
WiredTiger   3.2          2,149         231
WiredTiger   3.4          2,182         230
MongoRocks   3.0         13,904       1,932
MongoRocks   3.2          3,022         340
MongoRocks   3.4          3,538         363
IO-bound: read-write
Large server, 16 clients, database >> RAM

Engine       Version   Queries/second   Inserts/second
mmapv1       3.0                  103           11,944
mmapv1       3.2                  120           12,277
mmapv1       3.4                  175           12,253
WiredTiger   3.0                4,025           12,812
WiredTiger   3.2                  484           13,930
WiredTiger   3.4                  630           13,221
MongoRocks   3.0                  432            7,119
MongoRocks   3.2                5,596           15,845
MongoRocks   3.4                5,390           15,845
IO-bound: read-write
Large server, 16 clients, database >> RAM, v3.4.6

Engine       Avg IPS   Avg QPS   IO MB read/second   IO MB write/second   CPU
mmapv1        12,253       175                 416                  324    8.3
WiredTiger    13,221       630                 631                  567   42.4
MongoRocks    15,845     5,390                 914                  504   21.2
In-memory: read-write
Large server, 16 clients, in-memory database

Engine       Version   Queries/second   Inserts/second
mmapv1       3.0                1,010           15,276
mmapv1       3.2                7,218           15,352
mmapv1       3.4                9,822           15,754
WiredTiger   3.0               32,256           15,364
WiredTiger   3.2               29,379           15,845
WiredTiger   3.4               29,214           15,845
MongoRocks   3.0               12,820            7,029
MongoRocks   3.2               23,349           15,845
MongoRocks   3.4               21,272           15,845
Load: oplog impact
Large server, 16 clients, v3.4.6; inserts/second

Engine       Oplog?     IO-bound   In-memory
mmapv1       oplog        14,446      24,286
mmapv1       no oplog     14,720      25,649
WiredTiger   oplog        14,234      93,284
WiredTiger   no oplog     14,976     145,180
MongoRocks   oplog        51,493      54,933
MongoRocks   no oplog     71,395      77,906
Load: benefit from new features
Small server, 1 client, database >> RAM, v3.4.7

Load           Avg IPS   IO read/insert   IO KB read/insert   IO KB write/insert   CPU/insert
old features      9319             0.04               4.42                18.68          6124
new features     10806             0.02               1.91                14.17          5283
Thank you rocksdb.org mongorocks.org smalldatum.blogspot.com twitter.com/markcallaghan
An LSM in one slide
In memory: memtable + write ahead log
Level-0: 1000.sst [0:999], 999.sst [0:999], 998.sst [0:999], 997.sst [0:999]
Level-1: 993.sst [0:249], 994.sst [250:499], 995.sst [500:749], 996.sst [750:999]
Level-2: 802.sst [0:24], 610.sst [25:50], ..., 471.sst [950:974], 480.sst [975:999]
Level-3: 49.sst [0:1], 1001.sst [2:4], ..., 2.sst [995:996], 1.sst [997:999]
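The toy sketch below (an assumption-laden illustration, not RocksDB code) shows the write and read paths implied by the picture: writes go to the write-ahead log and the memtable, a full memtable is flushed as a sorted run into level 0, and reads check the memtable and then the levels from newest to oldest. Compaction, which merges runs into the next level, is omitted to keep it short.

```python
# Toy LSM sketch for illustration only; sizes and names are assumptions.
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}                     # newest data, mutable, sorted on flush
        self.levels = [[] for _ in range(4)]   # each level holds sorted runs ("SSTs")
        self.memtable_limit = memtable_limit
        self.wal = []                          # stands in for the write ahead log

    def put(self, key, value):
        self.wal.append((key, value))          # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())    # one new sorted run in level 0
        self.levels[0].append(run)
        self.memtable.clear()
        self.wal.clear()

    def get(self, key):
        if key in self.memtable:               # newest data wins
            return self.memtable[key]
        for level in self.levels:              # then level 0, 1, 2, ...
            for run in reversed(level):        # newer runs before older ones
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
        return None

db = ToyLSM()
for k in range(10):
    db.put(f"key{k:03d}", k)
print(db.get("key003"))                        # found in a level-0 run
```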
Efficiency: RocksDB vs a B-Tree

Space efficiency
• Fragmentation
• Fixed page size
• Per-row metadata
• Key prefix encoding

Write efficiency
• Uses more space = more data to write
• sizeof(page) / sizeof(row)
• Large writes vs write delta

Read efficiency
• More data in cache & less data to cache
• Bloom filter
• Spend less on writes, use more for reads
• Read-free index maintenance
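As one concrete example of the read-efficiency bullets, here is a toy Bloom filter sketch (assumptions for illustration, not RocksDB's implementation): before reading an SST file the engine probes a small per-file bit array, and a negative answer proves the key is absent, so most files can be skipped without any IO.

```python
# Toy Bloom filter: bit-array size and hash construction are assumptions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:1", "user:2", "user:3"):
    bf.add(k)
print(bf.might_contain("user:2"))    # True
print(bf.might_contain("user:999"))  # almost certainly False -> skip the SST file
```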