
Providing Atomic Sector Updates in Software for Persistent Memory

Providing Atomic Sector Updates in Software for Persistent Memory. Vishal Verma, vishal.l.verma@intel.com, Vault 2015. Outline: Introduction; The Block Translation Table; Read and Write Flows; Synchronization; Performance/Efficiency; BTT vs. DAX.


  1. Providing Atomic Sector Updates in Software for Persistent Memory
     Vishal Verma, vishal.l.verma@intel.com
     Vault 2015

  2. Introduction; The Block Translation Table; Read and Write Flows; Synchronization; Performance/Efficiency; BTT vs. DAX

  3. NVDIMMs and Persistent Memory
     [Figure: storage hierarchy ordered by speed and capacity: CPU caches, DRAM, Persistent Memory, Traditional Storage]
     ● NVDIMMs are byte-addressable
     ● We won't talk about “Total System Persistence”
     ● But about using persistent memory DIMMs for storage
     ● Drivers present this as a block device: “pmem”

  4. Problem Statement
     • Byte addressability is great
       – But not for writing a sector atomically
     [Figure: a userspace write() reaches the 'pmem' driver (/dev/pmem0), which uses memcpy() to copy the data into sectors 0, 1, 2, 3, ... on the NVDIMM]

  5. Problem Statement
     • On a power failure, there are three possibilities
       1. No blocks are torn (common on modern drives)
       2. A block was torn, but reads back with an ECC error
       3. A block was torn, but reads back without an ECC error (very rare on modern drives)
     • With pmem, we use memcpy()
       – ECC is correct between two stores
       – Torn sectors will almost never trigger ECC on the NVDIMM
       – Case 3 becomes most common!
       – Only file systems with data checksums will survive this case
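
Case 3 can be illustrated with a minimal Python sketch. It assumes a 512-byte sector copied in 8-byte stores and a simulated power failure partway through; `torn_write`, the store size, and the use of CRC-32 are illustrative assumptions, not details of the pmem driver or of any particular file system.

```python
import zlib

SECTOR_SIZE = 512   # bytes per sector (assumed for this sketch)
STORE_SIZE = 8      # assumption: each store persists 8 bytes atomically


def torn_write(media: bytearray, new_data: bytes, stores_completed: int) -> None:
    """Copy new_data over media in 8-byte stores, stopping early to mimic power loss."""
    for i in range(stores_completed):
        start = i * STORE_SIZE
        media[start:start + STORE_SIZE] = new_data[start:start + STORE_SIZE]


old_sector = bytes([0xAA]) * SECTOR_SIZE
new_sector = bytes([0x55]) * SECTOR_SIZE

media = bytearray(old_sector)
# Power fails after 10 of the 64 stores needed for the full sector.
torn_write(media, new_sector, stores_completed=10)

# The sector is now a mix of old and new data. Every individual store is
# intact, so nothing at the media level flags the sector as bad.
is_torn = bytes(media) not in (old_sector, new_sector)

# A checksum kept by the file system alongside the data would notice.
expected_crc = zlib.crc32(new_sector)
checksum_detects = zlib.crc32(bytes(media)) != expected_crc
```

In this model the sector reads back cleanly but holds 80 new bytes followed by 432 old ones, which is why only a data checksum catches it.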

  6. Naive solution
     • Full Data Journaling
     • Write every block to the journal first
     • 2x latency
     • 2x media wear

  7. Slightly better solution
     • Maintain an 'on-disk' indirection table and an in-memory free block list
     • The map/indirection table has LBA -> actual block offset mappings
     • New writes grab a block from free list
     • On completing the write, atomically swap the free list entry and map entry
     [Figure: Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 3. NVDIMM blocks: 0 - Free, 3 - LBA 3, 42 - LBA 0, 314 - LBA 2. Free List: 0, 2, 12]

  8. Slightly better solution
     (Same bullets as slide 7.)
     [Figure: a write( to LBA 3 ) arrives. Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 3. NVDIMM blocks: 0 - Free, 3 - LBA 3, 42 - LBA 0, 314 - LBA 2. Free List: 0, 2, 12]

  9. Slightly better solution
     (Same bullets as slide 7.)
     [Figure: after the write. Map (LBA -> Actual): 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 0. NVDIMM blocks: 0 - LBA 3, 3 - Free, 42 - LBA 0, 314 - LBA 2. Free List: 3, 2, 12]
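
The indirection scheme on these slides can be sketched as a toy Python model. The class and method names are hypothetical, and the initial state is the map and free list as far as they can be read from the figure; on real media the final swap has to be atomic, which this sketch does not attempt to model.

```python
class RemappedVolume:
    """Toy model of an LBA -> block indirection table with a free list."""

    def __init__(self, lba_map: dict, free_list: list, num_blocks: int):
        self.map = dict(lba_map)            # 'on-disk' table: LBA -> actual block
        self.free = list(free_list)         # in-memory list of unused blocks
        self.blocks = [None] * num_blocks   # stand-in for the NVDIMM data area

    def write(self, lba: int, data: bytes) -> None:
        new_block = self.free.pop(0)        # grab a block from the free list
        self.blocks[new_block] = data       # write the data out of place
        old_block = self.map[lba]
        # On real media this swap must be atomic; here it is two assignments.
        self.map[lba] = new_block
        self.free.insert(0, old_block)

    def read(self, lba: int) -> bytes:
        return self.blocks[self.map[lba]]


# State from the figure: map 0->42, 1->5050, 2->314, 3->3; free list 0, 2, 12.
vol = RemappedVolume({0: 42, 1: 5050, 2: 314, 3: 3}, [0, 2, 12], num_blocks=5051)
vol.write(3, b"new contents of LBA 3")
```

After the write, `vol.map[3]` is 0 and the free list is `[3, 2, 12]`, matching the last frame of the figure. The old contents of LBA 3 stay untouched in block 3 until that block is reused.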

  10. Slightly better solution
      • Easy enough to implement
      • Should be performant
      • Caveat:
        – The only way to recreate the free list is to read the entire map
        – Consider a 512GB volume, bs=512 => reading 1073741824 map entries
        – Map entries have to be 64-bit, so we end up reading 8GB at startup
        – Could save the free list to media on clean shutdown
        – But...clunky at best
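
The startup cost in the caveat can be checked with a few lines of Python. The `rebuild_free_list` helper is a hypothetical illustration of why the whole map has to be read: a block is free only if no map entry points at it.

```python
GIB = 1 << 30

volume_bytes = 512 * GIB      # 512GB volume from the slide
block_size = 512              # bs=512
map_entry_bytes = 8           # 64-bit map entries

map_entries = volume_bytes // block_size
map_bytes = map_entries * map_entry_bytes


def rebuild_free_list(lba_map: list, total_blocks: int) -> list:
    """Scan every map entry; any block not referenced by the map is free."""
    in_use = set(lba_map)
    return [blk for blk in range(total_blocks) if blk not in in_use]


# Tiny example: 4 LBAs mapped onto 6 physical blocks.
free = rebuild_free_list([4, 0, 5, 2], total_blocks=6)
```

Here `map_entries` is 1073741824 and `map_bytes` is 8 GiB, agreeing with the slide. The tiny example leaves blocks 1 and 3 free.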

  11. Introduction; The Block Translation Table; Read and Write Flows; Synchronization; Performance/Efficiency; BTT vs. DAX

  12. The Block Translation Table
      [Figure: the backing store is divided into arenas (Arena 0: 512G, Arena 1: 512G, ...). Each arena holds, in order: Arena Info Block (4K), Data Blocks, nfree reserved blocks, BTT Map, BTT Flog (8K), Info Block Copy (4K)]
      • nfree: The number of free blocks in reserve.
      • Flog: Portmanteau of free list + log
        – Has nfree entries.
        – Each entry has two 'slots' that 'flip-flop'
        – Each slot has: Block being written, Old mapping, New mapping, Sequence num
      • Info block: Info about arena - offsets, lbasizes etc.
      • External LBA: LBA as visible to upper layers
      • ABA: Arena Block Address - Block offset within an arena
      • Premap/Postmap ABA: The block offset into the data area as seen prior to/post indirection from the map

  13. What's in a lane?
      CPU / lane                      Flog slot 0 (LBA old new seq)   Flog slot 1 (LBA` old` new` seq`)   Free List (blk seq slot)
      CPU 0, get_lane() = 0, Lane 0   5 32 2 0b10                     XX XX XX XX                         2 0b10 0
      CPU 1, get_lane() = 1, Lane 1   XX XX XX XX                     8 38 6 0b10                         6 0b10 1
      CPU 2, get_lane() = 2, Lane 2   42 42 14 0b01                   XX XX XX XX                         14 0b01 0
      Map: 5 → 2, 8 → 6, 42 → 14
      • The idea of “lanes” is purely logical
      • num_lanes = min(num_cpus, nfree)
      • lane = cpu % num_lanes
      • If num_cpus > num_lanes, we need locking on lanes
        – But if not, we can simply preempt_disable() and need not take a lock
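
The lane rules above can be sketched in Python. The `Lanes` class and its method names are hypothetical; a userspace sketch cannot model `preempt_disable()`, so the lock-free case simply skips the lock.

```python
import threading


class Lanes:
    """Maps CPUs onto lanes; names here are illustrative, not the kernel's."""

    def __init__(self, num_cpus: int, nfree: int):
        self.num_lanes = min(num_cpus, nfree)
        # Locks are only needed when several CPUs can share one lane.
        self.needs_lock = num_cpus > self.num_lanes
        self.locks = [threading.Lock() for _ in range(self.num_lanes)]

    def get_lane(self, cpu: int) -> int:
        lane = cpu % self.num_lanes
        if self.needs_lock:
            self.locks[lane].acquire()
        return lane

    def release_lane(self, lane: int) -> None:
        if self.needs_lock:
            self.locks[lane].release()


shared = Lanes(num_cpus=8, nfree=4)      # CPUs 1 and 5 both map to lane 1
private = Lanes(num_cpus=3, nfree=256)   # one lane per CPU, no locking
```

With 8 CPUs and `nfree` of 4, CPUs 1 and 5 contend for lane 1 and must serialize on its lock. With 3 CPUs and 256 free blocks, every CPU owns its lane outright.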

  14. Introduction; The Block Translation Table; Read and Write Flows; Synchronization; Performance/Efficiency; BTT vs. DAX

  15. BTT – Reading a block
      • Convert external LBA to Arena number + pre-map ABA
      • Get a lane (and take lane_lock if needed)
      • Read map to get the mapping
      • If ZERO flag is set, return zeroes
      • If ERROR flag is set, return an error
      • Read data from the block that the map points to
      • Release lane (and lane_lock)
      [Figure: read() LBA 5 on CPU 0 takes Lane 0; the Map gives pre 5 → post 10; read data from 10; release Lane 0]
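
The read steps above can be sketched as follows. This is a single-threaded Python model under stated assumptions: the `Arena` class, the flag encodings, and the `(postmap_aba, flag)` map entry layout are hypothetical and do not reflect the on-media format, and lane handling is left out.

```python
from dataclasses import dataclass, field

ZERO_FLAG = "zero"     # hypothetical encodings of the map flags
ERROR_FLAG = "error"


@dataclass
class Arena:
    nlba: int                                   # external LBAs served by this arena
    block_size: int
    map: list = field(default_factory=list)     # entries: (postmap_aba, flag)
    data: dict = field(default_factory=dict)    # postmap ABA -> bytes


def lba_to_arena(arenas: list, lba: int):
    """Convert an external LBA to (arena, pre-map ABA)."""
    for arena in arenas:
        if lba < arena.nlba:
            return arena, lba
        lba -= arena.nlba
    raise ValueError("LBA out of range")


def btt_read(arenas: list, lba: int) -> bytes:
    arena, premap = lba_to_arena(arenas, lba)
    # Lane acquisition/release is omitted in this single-threaded sketch.
    postmap, flag = arena.map[premap]
    if flag == ZERO_FLAG:
        return bytes(arena.block_size)
    if flag == ERROR_FLAG:
        raise IOError("map entry has ERROR flag set")
    return arena.data[postmap]


arena0 = Arena(nlba=8, block_size=512, map=[(i, ZERO_FLAG) for i in range(8)])
arena0.map[5] = (10, None)          # pre-map 5 -> post-map 10, as in the figure
arena0.data[10] = b"x" * 512
block = btt_read([arena0], 5)
```

Reading LBA 5 follows the map to post-map block 10. Reading any other LBA in this arena returns zeroes, because its map entry still carries the ZERO flag.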

  16. BTT – Writing a block
      • Convert external LBA to Arena number + pre-map ABA
      • Get a lane (and take lane_lock if needed)
      • Use lane to index into free list, write data to this free block
      • Read map to get the existing mapping
      • Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]
      • Write new post-map ABA into map.
      • Write old post-map entry into the free list
      • Calculate next sequence number and write into the free list entry
      • Release lane (and lane_lock)
      [Figure: write() LBA 5 on CPU 0, Lane 0. Before: Map (old) pre 5 → post 10; Free List[0] = {blk 2, seq 0b10, slot 0}. Steps: write data to 2; flog[0][0] = {5, 10, 2, 0b10}; map[5] = 2; free[0] = {10, 0b11, 1}; release Lane 0. After: Map pre 5 → post 2]
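
The write steps can be sketched in Python using the values from the figure. The dictionary layout and the name `btt_write` are assumptions made for illustration; arena lookup and lane locking are omitted, and nothing here models the ordering or flushing guarantees real media would need.

```python
NEXT_SEQ = {0b01: 0b10, 0b10: 0b11, 0b11: 0b01}   # flog.seq cycle


def btt_write(arena: dict, lane: int, premap: int, data: bytes) -> None:
    free = arena["free"][lane]
    new_block = free["blk"]
    arena["data"][new_block] = data                # 1. write data to the free block
    old_block = arena["map"][premap]               # 2. read the existing mapping
    arena["flog"][lane][free["slot"]] = (          # 3. write the flog entry
        premap, old_block, new_block, free["seq"])
    arena["map"][premap] = new_block               # 4. write new post-map ABA into map
    arena["free"][lane] = {                        # 5. old block becomes the free block
        "blk": old_block,
        "seq": NEXT_SEQ[free["seq"]],
        "slot": 1 - free["slot"],
    }


arena = {
    "map": {5: 10},
    "data": {},
    "flog": [[None, None]],
    "free": [{"blk": 2, "seq": 0b10, "slot": 0}],
}
btt_write(arena, lane=0, premap=5, data=b"payload")
```

After the call, `flog[0][0]` is `(5, 10, 2, 0b10)`, `map[5]` is 2, and `free[0]` is `{blk 10, seq 0b11, slot 1}`, the same values the figure shows.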

  17. BTT – Analysis of a write
      [Figure: timeline of write() LBA 5 on CPU 0, Lane 0]
      Before: Free List[0] = {blk 2, seq 0b10, slot 0}; Map (old): 5 → 10
      Steps: write data to 2; flog[0][0] = {5, 10, 2, 0b10}; map[5] = 2; free[0] = {10, 0b11, 1}; release Lane 0
      After: Map: 5 → 2
      Opportunities for interruption/power failure

  18. BTT – Analysis of a write
      [Figure: write timeline as on slide 17]
      • On reboot:
        – No on-disk change had happened, everything comes back up as normal

  19. BTT – Analysis of a write
      [Figure: write timeline as on slide 17]
      • On reboot:
        – Map hasn't been updated
        – Reads will continue to get the 5 → 10 mapping
        – Flog will still show '2' as free and ready to be written to

  20. BTT – Analysis of a write
      [Figure: write timeline as on slide 17]
      • On reboot:
        – Read flog[0][0] = {5, 10, 2, 0b10}
        – Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
        – Since flog and map disagree, recovery routine detects an incomplete transaction
        – Flog is assumed to be “true” since it is always written before the map
        – Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
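
The recovery rule on this slide can be sketched as a small Python function. `recover_lane` is a hypothetical name, and the sketch only covers the two states discussed here: the map still holds the old mapping, or it already holds the new one.

```python
def recover_lane(arena_map: dict, flog_slot: tuple) -> int:
    """Replay the newest flog slot for one lane; return the lane's free block."""
    lba, old, new, _seq = flog_slot
    if arena_map[lba] == old:
        # The flog was written but the map was not: finish the transaction.
        arena_map[lba] = new
    # Either way, the old block is the one that is now free.
    return old


crashed_map = {5: 10}
free_block = recover_lane(crashed_map, (5, 10, 2, 0b10))
```

For the interrupted write, recovery sets `map[5]` to 2 and reports block 10 as free. Running the same function after a completed write leaves the map unchanged and yields the same free block, so replay is harmless.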

  21. BTT – Analysis of a write
      [Figure: write timeline as on slide 17]
      • Special case, the flog write is torn:
        – Bit sequence for flog.seq: 01->10->11->01 (left is older, right is newer)
      • On reboot:
        – Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
        – Since seq is written last, the half-written flog entry does not show up as “new”
        – Free list is reconstructed using the newest non-torn flog entry, flog[0][1] in this case
        – map[5] remains '10', and '2' remains free.
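
Picking the newer of the two slots from the cyclic sequence numbers can be sketched as below. The function name is hypothetical, and treating `0b00` as "never written" is an assumption of this sketch, since the slide lists only the 01->10->11->01 cycle.

```python
NEXT_SEQ = {0b01: 0b10, 0b10: 0b11, 0b11: 0b01}   # flog.seq cycle from the slide


def newest_slot(seq0: int, seq1: int) -> int:
    """Return the index (0 or 1) of the newer of two flog slots."""
    if seq0 == seq1:
        raise ValueError("flog slots cannot share a sequence number")
    if seq0 == 0b00:        # assumption: 0b00 marks a slot never written
        return 1
    if seq1 == 0b00:
        return 0
    if NEXT_SEQ[seq1] == seq0:
        return 0            # slot 0 follows slot 1 in the cycle
    if NEXT_SEQ[seq0] == seq1:
        return 1            # slot 1 follows slot 0 in the cycle
    raise ValueError("flog slots have inconsistent sequence numbers")


# Torn-write case from the slide: slot 0 still carries 0b11, slot 1 has 0b01.
winner = newest_slot(0b11, 0b01)
```

Because 0b01 follows 0b11 in the cycle, slot 1 wins, and the half-written slot 0 is ignored until a complete write gives it a newer sequence number.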

  22. BTT – Analysis of a write
      [Figure: write timeline as on slide 17]
      • On reboot:
        – Since both flog and map were updated, free list reconstruction will happen as usual

  23. Introduction; The Block Translation Table; Read and Write Flows; Synchronization; Performance/Efficiency; BTT vs. DAX

  24. Let's Race! Write vs. Write
      CPU 1                          CPU 2
      write LBA 0                    write LBA 0
      get-free[1] = 5                get-free[2] = 6
      write data - postmap ABA 5     write data - postmap ABA 6
      ...                            ...
      read old_map[0] = 10           read old_map[0] = 10
      write log 0/10/5/xx            write log 0/10/6/xx
      write map = 5                  write map = 6
      write free[1] = 10             write free[2] = 10

  25. Let's Race! Write vs. Write
      [Figure: same CPU 1 / CPU 2 interleaving as slide 24]
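
One reading of why this interleaving is a problem can be shown by replaying it step by step. The Python sketch below is a deterministic simulation, not concurrent code; `run_interleaved` and the state layout are hypothetical, and the interpretation of the outcome is an inference from the figure rather than something the slide states.

```python
def run_interleaved() -> dict:
    state = {"map": {0: 10}, "free": {1: 5, 2: 6}, "log": []}

    # Both CPUs pick their free block and read the map before either updates it.
    new1, new2 = state["free"][1], state["free"][2]
    old1 = state["map"][0]                  # CPU 1: read old_map[0] = 10
    old2 = state["map"][0]                  # CPU 2: read old_map[0] = 10

    state["log"].append((0, old1, new1))    # CPU 1: write log 0/10/5
    state["log"].append((0, old2, new2))    # CPU 2: write log 0/10/6
    state["map"][0] = new1                  # CPU 1: write map = 5
    state["map"][0] = new2                  # CPU 2: write map = 6
    state["free"][1] = old1                 # CPU 1: write free[1] = 10
    state["free"][2] = old2                 # CPU 2: write free[2] = 10
    return state


result = run_interleaved()
```

In the final state the map points LBA 0 at block 6, both lanes list block 10 as their free block, and block 5 is neither mapped nor free.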
