High Performance Transactions in Deuteronomy
Justin Levandoski, David Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang
Microsoft Research
Overview
- Deuteronomy: a componentized DB stack
  - Separates transaction, record, and storage management
  - Deployment flexibility; reusable in many systems and applications
- Conventional wisdom: layering is incompatible with performance
- Build from the ground up for modern hardware
  - Lock/latch-freedom, multiversion concurrency control, cache-coherence-friendly techniques
- Result: 1.5M TPS
  - Performance rivaling in-memory database systems, but with clean separation, and works even without in-memory data
The Deuteronomy Database Architecture
- Transactional Component (TC) guarantees ACID
  - Logical concurrency control and logical recovery
  - No knowledge of physical data storage
- Data Component (DC) provides record storage
  - Physical data storage; atomic record modifications
  - No knowledge of transactions or multiversioning
[Figure: TC above DC; record operations (~CRUD) and control operations (exactly once, WAL, checkpointing) flow between them]
Deployment Flexibility
[Figure: possible deployments — embeddable key-value store (DC only), embeddable transactional store (TC + DC), networked transactional store, scale-out transactional store (one TC over many DCs), and fault-tolerant scale-out transactional store (TC quorum over many DCs)]
The First Implementation
- TC: lock manager, log manager, record manager; DC: Bw-tree
- Bottlenecked on locking and remote operations
[Chart: operations per second (log scale, 1 to 10,000,000); roughly a 250x gap between the TC and the Bw-tree DC]
The New Transactional Component
Key Mechanisms for Millions of TPS
- Eliminate blocking
  - Multiversion concurrency control (MVCC): transactions never block one another; multiversioning limited to the TC only
  - Lock and latch freedom throughout: buffer management, concurrency control, caches, allocators, …
- Mitigate latency
  - In-memory recovery log buffers as version cache; redo-only recovery doubles in-memory cache density
  - Only committed versions sent to the DC, shipped in log-buffer units
  - TC and DC run on separate sockets (or machines)
- Maximize concurrency
  - Task parallelism/pipelining to gain performance
  - Data parallel when possible, but not at the expense of the user
TC Overview
- MVCC enforces serializability
- The recovery log acts as the version cache
- Log buffers batch updates to the DC
- Parallel log replay engine at the DC
[Figure: MVCC table and version manager at the TC; the recovery log's volatile and stable buffers plus a read cache hold versions; log buffers carry updates to the DC, which also serves data reads]
Latch-free Multiversion Concurrency Control
Timestamp MVCC
- Each transaction is assigned a timestamp at begin
- Transactions read, write, and commit at that timestamp
- Each version is marked with a create timestamp and a last-read timestamp
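A minimal sketch of this timestamp discipline, assuming timestamps are drawn from a single atomic counter at begin; the counter and names are illustrative, not taken from the paper:

```cpp
// Illustrative only: a global logical clock hands each transaction one
// timestamp at begin; the transaction reads, writes, and commits at it.
#include <atomic>
#include <cstdint>

static std::atomic<uint64_t> g_clock{1};

struct Transaction {
    uint64_t ts;   // used as read timestamp, write timestamp, and commit timestamp
};

inline Transaction Begin() {
    return Transaction{g_clock.fetch_add(1, std::memory_order_relaxed)};
}
```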
Latch-free MVCC Table
- Records are chained in hash table buckets
[Figure: hash table of record entries (key, version list, read time); each version carries a create TxID and an offset into the in-memory recovery log buffers, which double as the cache; older data lives at the DC]
Latch-free MVCC Table
- Ordered version lists are chained off each record
Latch-free MVCC Table
- The TxID gives each version's status and create timestamp
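Taken together, the slides above describe the table layout. A minimal sketch, assuming 64-byte cache lines; the field names, and the choice to keep the read time in the record entry (as the diagram shows) rather than per version, are assumptions of the sketch, not the paper's exact layout:

```cpp
// Illustrative layout for the latch-free MVCC table: hash buckets chain
// record entries; each record entry chains a newest-first version list;
// each version carries its creating TxID and an offset into the in-memory
// recovery-log buffers where the payload lives. Cache-line alignment keeps
// each metadata entry to roughly one miss.
#include <atomic>
#include <cstdint>

struct alignas(64) VersionEntry {
    uint64_t createTxId;              // creator; resolves to a commit timestamp
    uint64_t logOffset;               // payload location in the recovery-log buffer
    std::atomic<VersionEntry*> next;  // next-older version
};

struct alignas(64) RecordEntry {
    uint64_t key;                           // simplified: real keys vary in length
    std::atomic<uint64_t> lastReadTime;     // advanced by readers via compare-and-swap
    std::atomic<VersionEntry*> versions;    // newest-first version list
    std::atomic<RecordEntry*> nextInBucket; // hash-bucket chain
};
```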
Latch-free MVCC Table: Reads
- Read: find a visible, committed version; compare-and-swap the read timestamp
Latch-free MVCC Table: Reads
- Data is pointed to directly in the in-memory recovery log buffers
Latch-free MVCC Table: Reads
- All metadata entries are cache-line sized
- About 6 cache misses in the common case
- The work of indexing is done by the concurrency-control table
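Reusing the illustrative structs from the sketch above, the read path described on these Reads slides might look like the following; IsCommitted is a hypothetical lookup into the transaction table, and conflict handling, aborts, and reads of a transaction's own uncommitted writes are omitted:

```cpp
// Sketch of the read path: bump the record's read time with a compare-and-swap,
// then walk the newest-first version list for a committed version visible at
// the reader's timestamp. The payload itself is read directly out of the
// in-memory recovery-log buffer at v->logOffset.
bool IsCommitted(uint64_t txId, uint64_t* commitTs);  // hypothetical helper

const VersionEntry* Read(RecordEntry& rec, uint64_t readTs) {
    uint64_t observed = rec.lastReadTime.load(std::memory_order_acquire);
    while (observed < readTs &&
           !rec.lastReadTime.compare_exchange_weak(observed, readTs)) {
        // CAS failure reloads 'observed'; stop once it is already >= readTs.
    }
    for (const VersionEntry* v = rec.versions.load(std::memory_order_acquire);
         v != nullptr; v = v->next.load(std::memory_order_acquire)) {
        uint64_t commitTs;
        if (IsCommitted(v->createTxId, &commitTs) && commitTs <= readTs) {
            return v;   // visible, committed version
        }
    }
    return nullptr;     // nothing cached at the TC: fetch the record from the DC
}
```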
Latch-free MVCC Table: Writes
- Append the new version to the in-memory log
Latch-free MVCC Table: Writes
- Create new version metadata that points to it
Latch-free MVCC Table: Writes
- Install the version atomically with a compare-and-swap
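Again reusing the structs from the earlier sketch, the three write steps on these slides might look like this; AppendToLogBuffer is a hypothetical stand-in for the version manager's log append, and write-write conflict checks are omitted:

```cpp
// Sketch of the write path: (1) append the new payload to the in-memory
// recovery log, (2) build a cache-line-sized version entry pointing at it,
// (3) install the entry at the head of the version list with a CAS retry loop.
uint64_t AppendToLogBuffer(const void* payload, uint64_t len);  // hypothetical

void Write(RecordEntry& rec, uint64_t txId, const void* payload, uint64_t len) {
    uint64_t offset = AppendToLogBuffer(payload, len);   // step 1
    auto* v = new VersionEntry{};                        // step 2
    v->createTxId = txId;
    v->logOffset = offset;
    VersionEntry* head = rec.versions.load(std::memory_order_acquire);
    do {
        v->next.store(head, std::memory_order_relaxed);  // link to current newest
    } while (!rec.versions.compare_exchange_weak(        // step 3: atomic install
        head, v, std::memory_order_release, std::memory_order_acquire));
}
```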
MVCC Garbage Collection
- Track the oldest active transaction (OAT) and version-application progress at the DC
- Remove versions that are older than the OAT and already applied at the DC
- Later requests for the most recent version of the record go to the DC
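The reclamation rule above amounts to a simple predicate. A hedged sketch, assuming the DC's apply progress is tracked as a recovery-log offset; the names are illustrative:

```cpp
// A version can be dropped from the TC's cache once (a) it is older than the
// oldest active transaction, so no current reader can need it, and (b) the DC
// has already applied the log buffer that contains it, so the record can still
// be served from the DC afterwards.
#include <cstdint>

bool CanReclaim(uint64_t versionCommitTs,
                uint64_t oldestActiveTxTs,      // timestamp of the OAT
                uint64_t versionLogOffset,
                uint64_t dcAppliedThroughOffset) {
    return versionCommitTs < oldestActiveTxTs &&
           versionLogOffset <= dcAppliedThroughOffset;
}
```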
Latch-free Log Buffer Allocation
Serialized Log Allocation, Parallel Filling
- Only allocation is serialized, not data copying
[Figure: a log buffer with unallocated, allocated-and-filling, and filled regions; tail = 80]
Fast Atomic Operations for Log Allocation
- Compare-and-swap:
  - Thread 1: CompareAndSwap(&tail, 80, 90) → ok
  - Thread 2: CompareAndSwap(&tail, 80, 85) → fail
  - Wasted shared-mode load for the ‘pre-image’; a dilated conflict window creates retries
- Atomic add:
  - Thread 1: AtomicAdd(&tail, 10) → 90
  - Thread 2: AtomicAdd(&tail, 5) → 95
  - No load of the ‘pre-image’ needed; order is non-deterministic, but both succeed
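A minimal sketch of the atomic-add scheme the slide favors: one fetch_add reserves a contiguous slice of the buffer, and the copy into that slice proceeds without further synchronization. Buffer sealing and rollover are omitted, and the type is illustrative:

```cpp
// Only the tail bump is serialized; data copying into reserved regions runs
// in parallel. Unlike a CAS loop, fetch_add needs no prior load of the tail
// ('pre-image') and never retries: concurrent reservations land in a
// non-deterministic order, but all of them succeed.
#include <atomic>
#include <cstdint>
#include <cstring>

struct LogBuffer {
    static constexpr uint64_t kSize = 1 << 20;   // illustrative buffer size
    std::atomic<uint64_t> tail{0};
    uint8_t bytes[kSize];

    // Returns the offset of the appended record, or kSize if it did not fit
    // (a full buffer would be sealed and a fresh one installed, omitted here).
    uint64_t Append(const void* payload, uint64_t len) {
        uint64_t start = tail.fetch_add(len, std::memory_order_relaxed);
        if (start + len > kSize) return kSize;   // buffer full
        std::memcpy(bytes + start, payload, len);
        return start;
    }
};
```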
TC Proxy: DC-side multicore parallel redo replay
Multicore Replay at the DC
- Each received log buffer is replayed by a dedicated hardware thread
- Fixed-size thread pool; backpressure if the entire socket is busy
- “Blind writes” of versions to the DC; Bw-tree “delta chains” avoid the read cost for writes
- Out-of-order and redo-only replay is safe: LSNs, only committed entries replayed, shadow transaction table
[Figure: incoming log buffers from the TC fan out over the TC Proxy's hardware threads into the Data Component (Bw-tree)]
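A heavily simplified sketch of the replay loop described above; the buffer and DC types below are stand-ins (a plain map instead of the Bw-tree), not the real interfaces:

```cpp
// One worker replays one whole log buffer shipped from the TC. Replay is
// redo-only and writes are blind, so buffers can be applied out of order
// across workers; uncommitted entries are skipped using commit information
// the TC Proxy tracks (its shadow transaction table).
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LogEntry {
    uint64_t lsn;
    bool committed;          // resolved from the TC Proxy's transaction table
    std::string key;
    std::string payload;
};

struct ReplayBuffer {        // one log buffer as received from the TC
    std::vector<LogEntry> entries;
    uint64_t endOffset;      // recovery-log offset covered by this buffer
};

struct DataComponentStub {   // stand-in for the Bw-tree DC
    std::map<std::string, std::string> records;
    uint64_t appliedThrough = 0;
};

void ReplayAtDC(const ReplayBuffer& buf, DataComponentStub& dc) {
    for (const LogEntry& e : buf.entries) {
        if (!e.committed) continue;          // redo-only: skip uncommitted work
        dc.records[e.key] = e.payload;       // blind write: no read before write
    }
    dc.appliedThrough = buf.endOffset;       // lets the TC garbage-collect versions
}
```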
Evaluation
Hardware for Experiments
- 4x Intel Xeon @ 2.8 GHz, 64 total hardware threads
- Commodity SSD, ~450 MB/s
- TC and the TC Proxy + DC (Bw-tree) placed on separate sockets
[Figure: four-socket machine; TC on one socket, TC Proxy + DC (Bw-tree) on another]
Experimental Workload
- YCSB-like: 50 million 100-byte values; 4 ops/transaction; ~“80-20” Zipfian access skew
- More than half of all records are accessed every 20 seconds
- Heavily stresses concurrency control and logging overheads
- DC on a separate NUMA socket; also running periodic checkpoints
Evaluation: Transaction Throughput
- Workload: 84% reads, 50% read-only transactions
- 1.5M TPS, competitive with in-memory systems
Evaluation: Impact of Writes
- ~350,000 TPS with 100% writes; disk close to saturation at 90% bandwidth utilization
- DRAM latency limits write-heavy loads: more misses for a DC update than for an “at TC” read
For lack of time: fun stuff in the paper
- Unapologetically racy log-structured read cache
- Fast commit with a read-only transaction optimization
- Fast async pattern: recovery log as a queue for durable commit notification; eliminates context-switch and memory-allocation overhead
- Thread management & NUMA details
- Lightweight pointer stability: epoch protection for latch-free data structures, free of atomic ops on the fast path
Related Work
- Modern in-memory database engines: Hekaton [Diaconu et al.], HANA, HyPer [Kemper and Neumann], Silo [Tu et al.]
- Multiversion timestamp order [Bernstein, Hadzilacos, Goodman]
- Strict timestamp-order CC: HyPer [Wolf et al.]
Future Directions
- Dealing with ranges: timestamp concurrency control may be fragile
- More performance work
- More functionality
- Evaluating scale-out
Conclusions
- Deuteronomy: clean DB kernel separation needn't be costly
  - Separated transaction, record, and storage management
  - Flexible deployment allows reuse in many scenarios: embedded, classic stateless apps, large-scale fault-tolerant
- Integrate the lessons of in-memory databases
  - Eliminate all blocking, locking, and latching
  - MVCC, cache-coherence-friendly techniques
- 1.5M TPS rivals in-memory database systems, but with clean separation, and works even without in-memory data