sled and rio Rust DB + io_uring = @sadisticsystems sled.rs

who am I
❖ building Rust databases since 2014
❖ previously worked at some social media & infrastructure companies
❖ for fun, I build and destroy distributed databases
❖ also for fun, I teach Rust workshops
❖ lol work

I like databases because they often involve many interesting engineering techniques

common database techniques
❖ lock-free programming
❖ replication, consensus, eventual consistency
❖ correctness testing
❖ self-tuning systems
❖ performance work

I started sled to have a single project where I could implement papers I read

sled acts like a concurrent BTreeMap that saves data on disk

Rust is the best DB language
1. Rust will approach Fortran performance in many cases. C/C++ is really limited by aliasing. More compile-time info => better optimizations.
2. Correctness. When there's a segfault, I have a very small set of unsafe blocks to audit to quickly narrow my search down.
3. Compatibility with the great C/C++ perf/debugging tools
4. I can accept code in pull requests with a small fraction of the mental energy I would need to put into auditing C/C++, due to the compiler's strictness

fast to compile, low friction dev

built-in profiler
● easy to answer “why is this slow?”

heavy use of flamegraph crate
github.com/flamegraph-rs/flamegraph

1 billion operations in 57 seconds @ 95% reads / 5% writes / small working set

seriously though, it’s beta

never use a database less than 5 years old
- site reliability engineering proverb

sled turns 5 this year, so 2020 will be an exciting year for the project

let’s see how it works!

sled architecture
❖ lock-free index loosely based on the Bw-Tree
❖ lock-free pagecache loosely based on LLAMA
❖ log structured storage loosely based on Sprite LFS
❖ io_uring on huge buffers for writes
  ➢ io_uring functionality exported as rio crate
❖ cache based on W-TinyLFU
  ➢ exported (soon!) as berghain crate

we avoid blocking while reading and writing

setting a key to a new value
1. traverse tree to find the key’s leaf
2. modify the leaf to store the new key-value pair

but, we can’t block readers or writers while updating

latency

we use a technique called RCU

Read-Copy-Update (RCU)
1. read the old value through an AtomicPtr
2. make a local copy
3. modify the local copy with the desired changes
4. use the compare_and_swap method to install the new version. goto #1 if we fail.
5. use crossbeam_epoch to delay garbage collection until all threads that may have witnessed the old version are finished
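
A minimal sketch of the RCU loop above, using only the standard library. The names `RcuCell`, `snapshot`, `push`, and `concurrent_pushes` are hypothetical and are not sled's API. Two assumptions to flag: it uses `compare_exchange`, the current std name for the CAS step (`compare_and_swap` on std atomics is deprecated), and it skips step 5 entirely by leaking old versions, which is exactly the gap an epoch-based collector such as crossbeam_epoch is meant to fill.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical RCU cell: readers see an immutable version behind a pointer.
struct RcuCell {
    ptr: AtomicPtr<Vec<u64>>,
}

impl RcuCell {
    fn new(initial: Vec<u64>) -> Self {
        RcuCell { ptr: AtomicPtr::new(Box::into_raw(Box::new(initial))) }
    }

    /// Readers never wait for writers: load the pointer, copy what it points to.
    fn snapshot(&self) -> Vec<u64> {
        let p = self.ptr.load(Ordering::Acquire);
        // SAFETY: this sketch never frees a published version, so `p` stays valid.
        unsafe { (&*p).clone() }
    }

    /// Writers proceed optimistically and retry if another writer got there first.
    fn push(&self, value: u64) {
        loop {
            // 1. read the old version through the AtomicPtr
            let old = self.ptr.load(Ordering::Acquire);
            // 2. make a local copy (SAFETY: same argument as in `snapshot`)
            let mut copy = unsafe { (&*old).clone() };
            // 3. modify the local copy
            copy.push(value);
            let new = Box::into_raw(Box::new(copy));
            // 4. try to install the new version; go back to step 1 on failure
            match self.ptr.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire) {
                // 5. (assumption) the old version is deliberately leaked here;
                //    a real implementation would defer its destruction.
                Ok(_) => return,
                // Our copy was never published, so it is safe to free it now.
                Err(_) => unsafe { drop(Box::from_raw(new)) },
            }
        }
    }
}

/// Spawns `threads` writers that each push `per_thread` distinct values.
fn concurrent_pushes(threads: u64, per_thread: u64) -> Vec<u64> {
    let cell = Arc::new(RcuCell::new(Vec::new()));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || {
                for i in 0..per_thread {
                    cell.push(t * per_thread + i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    cell.snapshot()
}
```

Calling `concurrent_pushes(4, 100)` should return all 400 values exactly once, in whatever order the CAS races resolved. No update is lost even though no thread ever takes a lock.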

readers don’t wait for writers
writers proceed optimistically

however, we need to also guarantee that our atomic operations are saved to disk in the same order

buggy solution
1. read
2. mutate local copy
3. CAS
   (thread descheduled here)
4. log to disk

if the log message is delayed, other threads may perform their updates between 3 & 4. if the database crashes, it will load the last item in the log. we have to guarantee our log order matches our in-memory order

data loss

good solution (LLAMA trick)
1. read
2. mutate local copy
3. reserve log slot
4. CAS
5. only fill log reservation if CAS succeeded

by ordering our log reservations between the read and the CAS, we guarantee that the order on-disk will match what actually happened in memory, without using any locks.
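
A toy model of the reservation ordering above, not sled's actual log. Everything here is an assumption made for illustration: the `Store` type, the in-memory `Vec<AtomicU64>` standing in for the on-disk log, and the `ABORTED` marker for reservations whose CAS lost. The in-memory state is just a counter, so the log order is easy to check.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering::SeqCst};
use std::sync::Arc;
use std::thread;

const EMPTY: u64 = 0;          // slot reserved but not yet filled
const ABORTED: u64 = u64::MAX; // slot whose CAS failed; skipped on recovery

/// Hypothetical store: one atomic value plus a pre-sized log of slots.
struct Store {
    value: AtomicU64,       // in-memory state
    next_slot: AtomicUsize, // next free log position
    log: Vec<AtomicU64>,    // stand-in for the on-disk log
}

impl Store {
    fn new(slots: usize) -> Self {
        Store {
            value: AtomicU64::new(0),
            next_slot: AtomicUsize::new(0),
            log: (0..slots).map(|_| AtomicU64::new(EMPTY)).collect(),
        }
    }

    fn increment(&self) {
        loop {
            let old = self.value.load(SeqCst);              // 1. read
            let new = old + 1;                              // 2. mutate local copy
            let slot = self.next_slot.fetch_add(1, SeqCst); // 3. reserve log slot
            let won = self
                .value
                .compare_exchange(old, new, SeqCst, SeqCst) // 4. CAS
                .is_ok();
            // 5. fill the reservation only if the CAS succeeded, else cancel it
            self.log[slot].store(if won { new } else { ABORTED }, SeqCst);
            if won {
                return;
            }
        }
    }

    /// What recovery would replay: filled slots, in log order.
    fn committed(&self) -> Vec<u64> {
        let used = self.next_slot.load(SeqCst);
        self.log[..used]
            .iter()
            .map(|s| s.load(SeqCst))
            .filter(|&v| v != ABORTED)
            .collect()
    }
}

fn run(threads: usize, per_thread: usize) -> Vec<u64> {
    // Each success can make at most `threads - 1` other attempts fail,
    // so `threads * threads * per_thread` slots is always enough.
    let store = Arc::new(Store::new(threads * threads * per_thread));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let s = Arc::clone(&store);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    s.increment();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    store.committed()
}
```

With this ordering, `run(4, 1000)` should yield the values 1 through 4000 in strictly increasing order: a thread can only win the CAS against a value it read after the previous winner's CAS, and it reserves its slot after that read, so its slot comes later in the log. Moving the reservation after the CAS (the buggy ordering) removes that guarantee.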

how do we get fast io?
● we only write when we have 8mb of data to write sequentially
● we support out-of-order writes
● io_uring
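
A rough sketch of the first bullet, batching small records until a threshold is reached and then issuing one large sequential write. `BatchingLog` and its fields are hypothetical names, the 8 MiB constant simply mirrors the number on the slide, and any `std::io::Write` sink stands in for the real file.

```rust
use std::io::{self, Write};

/// Threshold taken from the slide; assumed here to mean 8 MiB.
const FLUSH_THRESHOLD: usize = 8 * 1024 * 1024;

/// Hypothetical log writer that turns many small appends into few large writes.
struct BatchingLog<W: Write> {
    sink: W,
    buf: Vec<u8>,
    threshold: usize,
    flushes: usize, // number of large writes issued so far
}

impl<W: Write> BatchingLog<W> {
    fn new(sink: W) -> Self {
        Self::with_threshold(sink, FLUSH_THRESHOLD)
    }

    fn with_threshold(sink: W, threshold: usize) -> Self {
        BatchingLog { sink, buf: Vec::new(), threshold, flushes: 0 }
    }

    /// Buffers the record; writes only once enough data has accumulated.
    fn append(&mut self, record: &[u8]) -> io::Result<()> {
        self.buf.extend_from_slice(record);
        if self.buf.len() >= self.threshold {
            self.flush()?;
        }
        Ok(())
    }

    /// Writes everything buffered so far as one sequential write.
    fn flush(&mut self) -> io::Result<()> {
        if self.buf.is_empty() {
            return Ok(());
        }
        self.sink.write_all(&self.buf)?;
        self.sink.flush()?;
        self.buf.clear();
        self.flushes += 1;
        Ok(())
    }
}
```

The trade-off in this sketch is the usual one: data sitting in `buf` is not durable until the next flush, so a caller that needs durability has to call `flush` explicitly.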

io_uring is an interface for fully asynchronous linux syscalls

the old AIO interface forces O_DIRECT, isn’t actually async sometimes, etc...

io_uring began as a response to that, but is far more ambitious

it’s 2 ring buffers
● submission
● completion

after setup, it can be run with 0 syscalls (SQPOLL)

io_uring is provided via the rio crate

operations are executed out-of-order
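
A userspace toy model of the two queues, to show why out-of-order completion is workable. This is not the rio API and makes no syscalls. `ToyRing`, `Sqe`, `Cqe`, and `kernel_step` are made up for illustration, and the "kernel" here is assumed to finish shorter operations first. The one detail borrowed from the real interface is the `user_data` tag, which io_uring copies from each submission entry to its completion entry so the caller can match them up.

```rust
use std::collections::VecDeque;

/// Submission queue entry: what we ask for, plus a caller-chosen tag.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sqe {
    user_data: u64,
    len: u32,
}

/// Completion queue entry: the result, carrying the same tag back.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Cqe {
    user_data: u64,
    result: i32,
}

/// Hypothetical stand-in for the pair of rings shared with the kernel.
struct ToyRing {
    sq: VecDeque<Sqe>,
    cq: VecDeque<Cqe>,
    capacity: usize,
}

impl ToyRing {
    fn new(capacity: usize) -> Self {
        ToyRing { sq: VecDeque::new(), cq: VecDeque::new(), capacity }
    }

    /// Refuses the entry instead of overflowing the submission ring.
    fn submit(&mut self, sqe: Sqe) -> Result<(), Sqe> {
        if self.sq.len() == self.capacity {
            return Err(sqe);
        }
        self.sq.push_back(sqe);
        Ok(())
    }

    /// Pretend kernel: drains submissions and completes the shortest ones first.
    fn kernel_step(&mut self) {
        let mut pending: Vec<Sqe> = self.sq.drain(..).collect();
        pending.sort_by_key(|s| s.len);
        for s in pending {
            self.cq.push_back(Cqe { user_data: s.user_data, result: s.len as i32 });
        }
    }

    fn reap(&mut self) -> Option<Cqe> {
        self.cq.pop_front()
    }
}

/// Submits one operation per length, tagged by index, and returns the tags
/// in the order their completions arrive.
fn completion_order(lens: &[u32]) -> Vec<u64> {
    let mut ring = ToyRing::new(lens.len());
    for (i, &len) in lens.iter().enumerate() {
        ring.submit(Sqe { user_data: i as u64, len }).unwrap();
    }
    ring.kernel_step();
    std::iter::from_fn(|| ring.reap()).map(|c| c.user_data).collect()
}
```

For example, `completion_order(&[300, 100, 200])` returns `[1, 2, 0]`: the submissions went in as 0, 1, 2, but the completions come back in a different order, and the tag is what identifies each one.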

chained operations

connect + send + recv

PLs are DSLs for syscalls

io_uring changes this conversation

over time, BPF may be used to execute logic between chained calls, eg: accept -> read -> write

userspace: control plane
kernel: data plane

rio is misuse resistant
● guarantees Completion events don’t outlive the ring, the buffers, or the files involved.
● automatically handles submissions
● prevents ring overflows that can happen by submitting too many items
● on Drop, the Completion waits for the backing operation to complete, to guarantee no use-after-frees.

Basically all performance-conscious projects are getting ready to migrate to it, and they are measuring impressive results.

Try them out :)
docs.rs/rio
docs.rs/sled

Our Results To Date
● pure-rust io_uring functionality
● Modified Bw-Tree lock-free architecture (lock-free, log-structured)
● Millions of reads + writes per second (1 billion/minute)
● Minimal configuration
● Multiple keyspace support
● Reactive prefix subscription, replication-friendly
● Merge operators, CRDT-friendly
● Serializable transactions

Where We Want To Go
❖ Support for all io_uring operations
❖ Typed trees: cutting deserialization costs for hot keys
❖ Replication
❖ Make it more efficient
  ➢ sled is currently a bit disk-hungry, we can dramatically improve this!
❖ Make it safer! This is the main point before 1.0
  ➢ SQLite-style formal requirements specification & corresponding testing

Help Us Get There!
● Sponsorship allows me to focus all of my time on open source:
  ○ https://github.com/sponsors/spacejam
● Want to contribute to a cutting-edge and industry-relevant DB?
  ○ https://github.com/spacejam/sled
  ○ We love to mentor and teach people about databases!
  ○ Also check out our active discord channel

I also run Rust trainings!

Thank you :)
