Towards 0-Latency Durability
Sang-Won Lee (swlee@skku.edu)
Ack.: Moon, Yang, Oh and SKKU VLDB Lab. members
NVRAMOS 2014
NVRAM is for 0-Latency Durability
(DB) Transaction and ACID
• E.g., a 100$ transfer from account A to account B
• ACID
  – Atomicity
  – Consistency
  – Isolation
  – Durability
• Durability latency under the force policy
  – 20 ms @ HDD
  – < 1 ms @ SSD
  – 0 latency @ NVDRAM
[Figure: buffer pool in volatile main memory; database on the non-volatile disk]
Transaction and ACID
• Durability latency under the force policy
  – The atomicity devil
    • Redundant writes are inevitable: {RBJ, WAL} @ SQLite, metadata journaling @ file systems, DWB @ MySQL, FPW @ Postgres, …
    • Thus, even worse latency
  – 0 latency @ NVDRAM??
    • What about UNDO for atomicity?
WAL for Durability and Atomicity
• Durability latency with WAL
  – 2 ms @ HDD
  – 0.2 ms @ SSD
  – 0 latency @ NVDRAM??
[Figure: Begin_tx1 … Commit_tx1 records appended to the log buffer next to the buffer pool in volatile main memory, then forced to the LOG file on the non-volatile disk alongside the DB]
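A minimal sketch of the force-at-commit WAL path the figure describes, assuming a plain file named wal.log and a hypothetical redo record; a real DBMS batches records in a log buffer and group-commits, but the commit latency is paid at the same fsync():

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical redo record for the 100$ transfer example. */
static const char redo[] = "tx1: A -= 100, B += 100, COMMIT";

int main(void)
{
    int log_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log_fd < 0) { perror("open"); return 1; }

    /* 1. Append the redo record to the log file (the log buffer in a real system). */
    if (write(log_fd, redo, sizeof(redo)) != (ssize_t)sizeof(redo)) { perror("write"); return 1; }

    /* 2. Force the log: the transaction is durable only after this returns.
     *    This fsync() is where the 2 ms (HDD) / 0.2 ms (SSD) commit latency is paid. */
    if (fsync(log_fd) < 0) { perror("fsync"); return 1; }

    /* 3. Dirty data pages can be flushed lazily later (no-force for data). */
    close(log_fd);
    return 0;
}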
Durable and Ordered Write in Transactional Databases
• In addition to the ACID properties at the logical transaction level, a few I/O properties are critical for a transactional database.
  – A page write should be durable and atomic.
  – In some cases, the ordering between two writes must be preserved.
Contents
• DuraSSD [SIGMOD 2014]
• Latency in the WAL log
  – The WAL paradigm is ubiquitous!!!
  – DuraSSD vs. the ideal case in TPC-B
  – DuraSSD vs. the ideal case in NoSQL YCSB
• Future directions
Native SSD Performance
• Random write performance
  – $> fio 4KB_random_write
[Chart: 4 KB random write IOPS over time — about 90K IOPS while the drive is clean, dropping to a steady state of 15~20K IOPS once GC/wear leveling kicks in]
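For illustration, a minimal C sketch of the same measurement (4 KB random writes with O_DIRECT); the target path, region size, and duration are assumptions, and unlike fio it issues one synchronous write at a time (queue depth 1), so absolute IOPS will differ:

#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define PAGE      4096
#define REGION    (1024L * 1024 * 1024)   /* 1 GiB target region (assumption) */
#define DURATION  10                      /* run length in seconds (assumption) */

int main(int argc, char **argv)
{
    /* Target is a hypothetical raw device or a pre-allocated 1 GiB file. */
    const char *path = argc > 1 ? argv[1] : "testfile";
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, PAGE, PAGE)) return 1;   /* O_DIRECT needs alignment */
    memset(buf, 0xAB, PAGE);

    srand(1);
    long ops = 0;
    time_t end = time(NULL) + DURATION;
    while (time(NULL) < end) {
        off_t off = (off_t)(rand() % (REGION / PAGE)) * PAGE;   /* random 4 KiB slot */
        if (pwrite(fd, buf, PAGE, off) != PAGE) { perror("pwrite"); break; }
        ops++;
    }
    printf("approx. %ld random-write IOPS\n", ops / DURATION);
    free(buf); close(fd);
    return 0;
}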
SSD Performance with MySQL
• Running MySQL on top of the SSD
  – $> run LinkBench
[Chart: MySQL read and write IOPS over time — combined read + write IOPS settles around 1,000, roughly a 1/20 degradation versus the native device]
MySQL/InnoDB I/O Scenario
[Figure: buffer pool with free buffer list and LRU list (dirty pages D toward the tail), database pages, and the double write buffer on the flash SSD]
1. Search the free buffer list; scan the LRU list from the tail
2. Flush the dirty pages (through the double write buffer on the flash SSD)
3. Read the requested page from the database into the freed frame

Issue       | Technique         | Problem
Latency     | Buffer pool       | A read is blocked until the dirty pages are written to storage
Atomicity   | Redundant writes  | One write to the double write buffer, the other to the data pages
Durability  | Write barrier     | Flush dirty pages from the OS to the device, and then from the write cache to the media
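A simplified sketch of step 2's double write, assuming hypothetical files doublewrite.ibd and tablespace.ibd and a single page per batch (real InnoDB batches many pages per doublewrite flush):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE 16384   /* InnoDB default page size */

/* dw_fd: double write area, data_fd: tablespace (both hypothetical files). */
static int flush_page(int dw_fd, int data_fd, const void *page, off_t page_no)
{
    /* 1. The first copy goes to the double write buffer and is made durable. */
    if (pwrite(dw_fd, page, PAGE, 0) != PAGE) return -1;
    if (fsync(dw_fd) < 0) return -1;

    /* 2. Only then is the page written in place; a torn in-place write can be
     *    repaired from the double write copy during recovery. Two writes and
     *    two fsyncs per page is exactly the redundancy listed in the table. */
    if (pwrite(data_fd, page, PAGE, page_no * PAGE) != PAGE) return -1;
    if (fsync(data_fd) < 0) return -1;
    return 0;
}

int main(void)
{
    int dw_fd   = open("doublewrite.ibd", O_WRONLY | O_CREAT, 0644);
    int data_fd = open("tablespace.ibd",  O_WRONLY | O_CREAT, 0644);
    if (dw_fd < 0 || data_fd < 0) { perror("open"); return 1; }

    char *page = calloc(1, PAGE);
    if (flush_page(dw_fd, data_fd, page, 3) < 0) perror("flush_page");

    free(page); close(dw_fd); close(data_fd);
    return 0;
}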
Persistency by WRITE_BARRIER
• fsync() – "ordering and durability"
  – Flushes dirty pages from the OS to the device
  – If WRITE_BARRIER is enabled, the OS also sends a FLUSH_CACHE command to the storage device, flushing its write cache (volatile, 16 MB ~ 512 MB when caching is enabled) to the persistent media
[Figure: DBMS buffer manager issues write(P1, …, Pn) + fsync() and is blocked; the OS flushes data plus file metadata and, with the write barrier enabled, FLUSH_CACHE; the storage's write buffer and FTL address mapping sit in front of the persistent media (flash memory, magnetic disk)]
Impact of fsync with Barrier
• High performance degradation due to fsync
  – SSD: up to ~70x slower than the ideal (no-fsync) case; 13x~68x across fsync intervals
  – HDD (15K rpm): ~7x slower
  – DuraSSD tracks the no-barrier case
[Chart: write IOPS (log scale) vs. number of write pages per fsync (1, 4, 8, …, 256, and no fsync) for DuraSSD-NoBarrier, DuraSSD, SSD-A, SSD-B, and a 15K rpm HDD]
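The chart's x-axis can be reproduced with a small microbenchmark like the following sketch (file name, total page count, and buffered-write pattern are assumptions): write N 4 KB pages, fsync once per N pages, and report write IOPS:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define PAGE        4096
#define TOTAL_PAGES 4096          /* assumption: small run for illustration */

int main(int argc, char **argv)
{
    int pages_per_fsync = argc > 1 ? atoi(argv[1]) : 16;   /* the x-axis of the chart */
    int fd = open("fsynctest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[PAGE];
    memset(buf, 0xCD, PAGE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < TOTAL_PAGES; i++) {
        if (write(fd, buf, PAGE) != PAGE) { perror("write"); return 1; }
        /* With barriers on, each fsync() also flushes the device write cache,
         * so fewer pages per fsync means a larger per-page penalty. */
        if ((i + 1) % pages_per_fsync == 0 && fsync(fd) < 0) { perror("fsync"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d pages/fsync: %.0f write IOPS\n", pages_per_fsync, TOTAL_PAGES / sec);
    close(fd);
    return 0;
}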
DuraSSD
• DuraSSD
  – Samsung SM843T with a durable write cache
  – An economical solution
    • DRAM cache backed by tantalum capacitors
    • cf. an HDD with a battery-backed cache??

Issue       | Existing technique | Solution
Latency     | Buffer pool        | Fast writes into the write cache
Atomicity   | Redundant writes   | A single atomic write for small pages (4 KB or 8 KB)
Durability  | Write barrier      | Durability: battery-backed write cache without WRITE_BARRIER; Ordering: NOOP scheduler and in-order command queue
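A sketch of the write path this enables, under the assumption (as claimed for DuraSSD) that a single aligned 4/8 KB write is atomic and durable once it reaches the device's capacitor-backed cache; the file name is hypothetical:

#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE 8192   /* 4 KB or 8 KB pages are assumed to be written atomically */

int main(void)
{
    /* O_DIRECT bypasses the OS page cache, so the page goes straight to the
     * device's write cache; no double write and no fsync barrier are issued. */
    int fd = open("tablespace.ibd", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *page;
    if (posix_memalign(&page, PAGE, PAGE)) return 1;
    memset(page, 0, PAGE);

    /* One in-place write per dirty page: atomicity comes from the device,
     * durability from the non-volatile write cache (DuraSSD assumption). */
    if (pwrite(fd, page, PAGE, 0) != PAGE) perror("pwrite");

    free(page); close(fd);
    return 0;
}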
Experiment Setup
• System configuration
  – Linux kernel 3.5.10
  – Intel Xeon E5-4620 × 4 sockets (64 cores with HT)
  – 384 GB DDR3 DRAM (96 GB/socket)
  – Two Samsung 843T 480 GB DuraSSDs (data and log)
• Workloads
  – LinkBench: social network graph benchmark (MySQL)
  – TPC-C: OLTP workload (Oracle DBMS)
  – YCSB: key-value store NoSQL (Couchbase), Workload A
LinkBench: Storage Options
• Impact of the double write buffer and WRITE_BARRIER
  – 100 GB DB, 128 clients, 16 KB pages, 10 GB buffer pool
  – 6.4 million transactions (50K transactions per client)
[Chart: TPS by Write Barrier/Double Write Buffer setting — from 1,346 TPS with both ON up to 13,090 TPS with both OFF (about 10x), with the single-OFF settings (5,809 and 10,034 TPS) in between at roughly 4x and 7x over ON/ON]
Page Size Tuning
LinkBench: Page Size
• Benefits of a small page size
  – Better read/write IOPS (exploits the SSD's internal parallelism)
  – Better buffer-pool hit ratio
  – vs. [SIGMOD09]: without the write optimizations, page-size tuning has less effect
[Charts: LinkBench TPS (OFF/OFF) by page size — 13,090 @ 16 KB, 22,253 @ 8 KB, 29,974 @ 4 KB (2.3x); MySQL buffer hit ratio (92~97%) improves with smaller pages across 2~10 GB buffer pools]
LinkBench: All Options Combined
• Transaction latency (mean)
  – The write optimizations also improve read latency
  – OFF/OFF with 4 KB pages vs. ON/ON with 16 KB pages: reads up to 50x faster, writes up to 20x faster
[Chart: mean latency (ms) per LinkBench operation — Get Node, Cnt Link, Get Link_List, Multiget Link (reads); Add/Del/Upd Node, Add/Del/Upd Link (writes) — for the two configurations]
Database Benchmark
• TPC-C for MySQL: up to 23x
• YCSB for Couchbase: up to 10x
[Charts: TPC-C (relational database; 8 KB pages, 2 GB buffer, 100 GB DB) — 4,845 TpmC with the barrier ON vs. 110,400 TpmC with the barrier OFF (23x); YCSB on Couchbase — operations per second with the barrier ON vs. OFF across batch sizes 1, 2, 5, 10, 100]
Conclusions
• DuraSSD
  – An SSD with a battery-backed write cache
    • ~10$ for a 20~30x performance improvement
  – Guarantees atomicity and durability of small pages
• Benefits
  – Avoids the redundant database writes needed for atomicity
  – Provides durability without costly fsync operations
  – Utilizes the SSD's internal parallelism through buffering
  – Exploits the potential of the SSD
    • 10~20x performance improvement
    • Prolonged device lifetime
Conclusions
• DuraCache in DuraSSD
  – A gap filler between durability latency and bandwidth
    • One DuraSSD can saturate a 32-core Dell machine (when running LinkBench)
  – Is the IOPS crisis solved?
  – NVMe = excessive IOPS/GB?
• MMDBMS vs. all-flash DBMS: who wins?
  – The 5-minute rule (Jim Gray)
    • ~3-hour rule with HDD @ 2014 → MMDBMS
    • ~10-second rule with NVMe @ 2014 → all-flash DBMS with less DRAM
Contents
• DuraSSD
• Latency in the WAL log
  – The WAL paradigm is ubiquitous!!!
  – DuraSSD vs. the ideal case in TPC-B
  – DuraSSD vs. the ideal case in NoSQL YCSB
• Future directions
Ubiquitous WAL Paradigm
• OLTP DB
  – Buffer pool and log buffer in volatile main memory; DB and LOG on the non-volatile disk
• NoSQL and KV stores
  – WAL logs in BigTable, MongoDB, Cassandra, Amazon Dynamo, Netflix Blitz4j, Yahoo WALNUT, Facebook, Twitter
• Distributed database
  – Two-phase commit
  – SAP HANA, Hekaton
• Distributed system
  – Eventual consistency
  – Replication
Ubiquitous WAL Paradigm
• Append-only write pattern
  – Redo log file written in 512-byte blocks (including wastage)
• Trade-off between performance and durability
  – DBMS, NoSQL: sync vs. async commit mode
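A sketch of this append-only, 512-byte-block log append (the record layout and file name are hypothetical), including the sync vs. async commit knob:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LOG_BLOCK 512   /* the write unit shown on the slide, including wastage */

/* Append one log record, padded to a whole 512-byte block (hypothetical format). */
static int append_log_block(int fd, const void *rec, size_t len, int sync_commit)
{
    char block[LOG_BLOCK] = {0};
    if (len > LOG_BLOCK) return -1;          /* real WALs span blocks; omitted here */
    memcpy(block, rec, len);                 /* LOG_BLOCK - len bytes are wastage   */

    if (write(fd, block, LOG_BLOCK) != LOG_BLOCK) return -1;

    /* sync commit: force now and pay the device latency per commit;
     * async commit: a background flusher calls fsync later (small durability window). */
    if (sync_commit && fsync(fd) < 0) return -1;
    return 0;
}

int main(void)
{
    int fd = open("redo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    const char rec[] = "commit tx42";
    if (append_log_block(fd, rec, sizeof(rec), 1) < 0) perror("append_log_block");
    close(fd);
    return 0;
}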
TPC-B: Various WAL Devices
• Intel Xeon E7-4850
  – 40 cores: 4 sockets, 10 cores/socket, 2 GHz/core
  – 32 GB 1333 MHz DDR3 DRAM
• 15K rpm HDD vs. MLC SSD vs. DuraSSD
TPC-B: Various WAL Devices
• Async commit vs. RamDisk vs. DuraSSD
• Polling vs. interrupt
Distributed Main Memory DBMS
• Two-phase commit in distributed DBMSs
[State diagram: coordinator and participant each move Active → Prepared → Committing → Committed. The coordinator writes its prepare record lazily; each participant forces a prepare record to its log before voting yes; the coordinator forces the commit record to its log before sending Commit; local commit work and the completion records are written lazily, and each participant acks once its commit is durable.]
• "High Performance Transaction Processing in SAP HANA", IEEE Data Engineering Bulletin, June 2013
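A sketch of the coordinator's commit path from the diagram; the network and logging calls are hypothetical stubs, and the point is which log writes are forced versus lazy:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stubs standing in for real log and network calls. */
static bool send_prepare_and_collect_votes(void) { return true; }       /* all "yes"? */
static void force_log(const char *rec) { printf("force %s\n", rec); }   /* fsync'ed   */
static void lazy_log(const char *rec)  { printf("lazy  %s\n", rec); }   /* buffered   */
static void send_commit_to_participants(void) { puts("send COMMIT"); }
static void send_abort_to_participants(void)  { puts("send ABORT"); }

int main(void)
{
    /* Phase 1: participants force their prepare records, then vote. */
    bool all_yes = send_prepare_and_collect_votes();

    if (all_yes) {
        /* Phase 2: the commit record must be forced before telling anyone the
         * transaction committed; this log force is the durability latency that
         * a 0-latency NVRAM log would remove from the distributed commit path. */
        force_log("COMMIT tx1");
        send_commit_to_participants();
        lazy_log("COMPLETION tx1");   /* written lazily after the acks arrive */
    } else {
        send_abort_to_participants();
    }
    return 0;
}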
The Effect of Fast Durability on Concurrency in DBMS
• Other transactions wait for the locks held by a committing transaction, so a shorter commit (log force) latency also shortens lock hold time and improves concurrency
• Source: Aether [VLDB 2011, VLDB J. 2013]
YCSB@RocksDB
• Random updates against 1M KV documents
  – Each document: 10 B key + 800 B value
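A sketch of this workload against RocksDB's C API (rocksdb/c.h), scaled down to a few updates; the database path and loop count are assumptions, and the sync write option is what puts the WAL force on the commit path:

#include <rocksdb/c.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *err = NULL;
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_t *db = rocksdb_open(opts, "/tmp/ycsb_demo", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* sync = 1 forces the WAL on every update, which is the commit latency the
     * talk is about; sync = 0 is the async-commit mode of the previous slides. */
    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts, 1);

    char key[10], value[800];
    memset(value, 'v', sizeof(value));              /* 800-byte value as on the slide */
    for (int i = 0; i < 1000; i++) {                /* 1M documents in the real run   */
        snprintf(key, sizeof(key), "%09d", rand() % 1000000);   /* ~10-byte key */
        rocksdb_put(db, wopts, key, sizeof(key), value, sizeof(value), &err);
        if (err) { fprintf(stderr, "put: %s\n", err); break; }
    }

    rocksdb_writeoptions_destroy(wopts);
    rocksdb_close(db);
    rocksdb_options_destroy(opts);
    return 0;
}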