

  1. Persistent Memory and the Rise of Universal Constructions Eurosys 2020 Andreia Correia – University of Neuchâtel Pascal Felber – University of Neuchâtel Pedro Ramalhete – Cisco Systems

  2. Persistent Memory
     Persistent Memory (or Non-Volatile Main Memory) is a durable medium that can be accessed through load and store instructions. Physically, it fits into a DIMM slot.
     Solutions from HPE, Micron and Viking have existed for several years, but all of these are battery backed:
     https://www.vikingtechnology.com/products/nvdimm/
     https://www.hpe.com/nl/en/servers/persistent-memory.html
     https://www.micron.com/campaigns/persistent-memory
     In 2019, Intel released the Optane DC Persistent Memory, which does not require a battery. Capacities go up to 512 GiB per module and 3 TB per CPU socket.
     https://arxiv.org/pdf/1903.05714.pdf

  3. Persistent data structures
     Some of the reasons that make persistent data structures a difficult topic are:
     • Where to place the flushes (CLWBs) and fences (SFENCE)
     • How to write a correct recovery procedure
     • How to allocate and de-allocate persistent objects efficiently, without leaking
     • How to modify existing persistent data structures to suit novel business needs

  4. How to make a concurrent and persistent data structure
     • Make a data structure by hand
       • Locks + cow/undo/redo log (many different papers): blocking; complex to design and modify
       • Lock-free persistent queue (Friedman et al, PPoPP 2018): lock-free; difficult to make other ADTs
     • Use a technique that transforms existing Lock-Free data structures
       • pwb/pfence/psync recipe (Izraelevitz et al, DISC 2018): lock-free; slow; easy to deploy
       • Capsules (Blelloch et al, SPAA 2018): lock-free; difficult to deploy
       • NVTraverse (Ben-David et al, PLDI 2020): lock-free; fast for reads; difficult to deploy
     • Use a PTM that transforms Sequential data structures
       • Mnemosyne (Volos et al, ASPLOS 2016): blocking; scales for writes; unstable; easy to deploy
       • libpmemobj (PMDK) (Intel): no concurrency; slow; easy to deploy
       • OneFile (Ramalhete et al, DSN 2019): wait-free; writes don't scale; easy to deploy

  5. How to make a concurrent and persistent data structure
     Three routes to a Concurrent and Persistent DS:
     • Make a data structure by hand
     • Use a technique that transforms existing Lock-Free data structures. Starting from a Lock-Free DS:
       1. Precede every CAS with a flush
       2. Flush and fence after every load
       3. …
     • Use a PTM that transforms Sequential data structures. Starting from a Sequential DS, apply the PTM.

  6. Redo-PTM
     Diagram of the Redo-PTM components:
     • states[][], announce[] and toggle[] arrays
     • Ring Queue, with entries such as T0|0, T1|0, T2|0, T0|1, T1|1, T0|2, T1|2, T0|3
     • Combined instances, each holding an rwlock, a CL log (Cache Line log) and a head
     • Persistent Memory: one Replica per Combined instance, plus the curComb pointer
     struct State { SeqTidIdx ticket; bool applied[t]; uint64_t results[t]; RedoLog wlog; }

  7. Redo-PTM write transaction
     Threads T0, T1 and T2 announce mutations such as remove(a), add(b) and "if (add(a)) remove(c);". The data structures are those of the previous slide (states[][], announce[], toggle[], Ring Queue, Combined instances, Replicas, curComb pointer).
     Flowchart steps:
     • Announce mutation (std::functions)
     • _curC = curComb
     • Populate _newST.head
     • Help append _curC.head to Ring Queue
     • Acquire write-lock on _c
     • Apply redo-log
     • Simulate all announced mutations on _c
     • Downgrade to read-lock
     • Flush all Cache Lines
     • 1st CAS and 2nd CAS, with yes/no branches leading to: Apply undo, Append _c.head to Ring, Return result

  8. Wait-Free PTM Comparison table

     |                                    | OneFile PTM                                    | CX PTM                                    | Redo PTM                                    |
     |------------------------------------|------------------------------------------------|-------------------------------------------|---------------------------------------------|
     | Maximum number of instances in use | 1                                              | 2t                                        | t + 1                                       |
     | Wait-Free Consensus                | Herlihy's Combining Consensus (DCAS variant)   | Turn Consensus (in Turn Queue)            | Herlihy's Combining Consensus (Ring Queue)  |
     | Access                             | DCAS                                           | Shared multi-instances + strong tryrwlock | Shared multi-instances + strong tryrwlock   |
     | Memory Reclamation Scheme          | Hazard Eras                                    | HP + ref count                            | Hazard Pointers                             |
     | Non-abortable reads                | no                                             | yes                                       | yes                                         |
     | Logging                            | Persistent Physical Log                        | Volatile Logical Log                      | Volatile Physical Log                       |

     t = total number of threads in the system

  9. What makes Redo-PTM fast
     1. Volatile physical logging
     2. Store aggregation
     3. Flush aggregation
     4. Flush deferral
     5. Replica copies with non-temporal stores

  10. What makes Redo-PTM fast: 1. Volatile Physical Logging
     In Redo-PTM, the curComb variable and the instances (replicas) associated with each Combined are located in persistent memory (PM).
     All other components are in volatile memory (DRAM), which is much faster than PM:
     • Ring Queue and combining consensus
     • Physical log of modifications (and intrusive hashmap)
     • Combined instances: log of modified cache lines, reader-writer lock, root pointer, head pointer (which points to an entry in the Ring Queue)

  11. What makes Redo-PTM fast: 2. Store aggregation
     Classic redo-log PTMs (like Mnemosyne) transform the transaction from each thread into a physical redo log.
     In Redo-PTM, we use the combining consensus to aggregate the operations from multiple in-flight threads into a single redo/undo log.
     With a large number of threads, the likelihood increases that many operations will touch the same addresses. Each address is written a single time, reducing write amplification.
     Also, in classic redo-log PTMs the log is persistent. In Redo-PTM the redo-log is volatile.
     Figure: threads T0 to T3 run add(a), remove(b), add(c) and remove(c); each user transaction writes three addresses in the range 0x1111 to 0x5555. After the combining consensus orders them as remove(b); add(c); remove(c); add(a);, the single redo/undo log holds 0x1111, 0x2222, 0x3333, 0x4444 and 0x5555 once each.
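     The effect of store aggregation can be illustrated with a minimal sketch. This is not the Redo-PTM code: the class name AggregatedRedoLog and its methods are hypothetical, and std::unordered_map stands in for the intrusive hashmap mentioned on the slides. Each distinct address gets one log entry, and a later store to the same address overwrites the logged value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical volatile redo log (illustration only, not the Redo-PTM source).
// It keeps at most one entry per address, so applying the log writes each
// address a single time no matter how many stores were recorded for it.
class AggregatedRedoLog {
public:
    // Record a store of `value` to `addr` instead of writing memory directly.
    void store(std::uintptr_t addr, std::uint64_t value) {
        auto it = index_.find(addr);
        if (it == index_.end()) {
            index_.emplace(addr, entries_.size());
            entries_.push_back({addr, value});
        } else {
            entries_[it->second].value = value;  // aggregate: keep the last value
        }
        ++totalStores_;
    }

    // Number of stores recorded, including repeats to the same address.
    std::size_t totalStores() const { return totalStores_; }

    // Number of writes that applying the log will actually perform.
    std::size_t distinctAddresses() const { return entries_.size(); }

    // Apply the log to a simulated memory (address -> value map).
    void applyTo(std::unordered_map<std::uintptr_t, std::uint64_t>& memory) const {
        for (const Entry& e : entries_) memory[e.addr] = e.value;
    }

private:
    struct Entry {
        std::uintptr_t addr;
        std::uint64_t value;
    };
    std::vector<Entry> entries_;                             // log in first-touch order
    std::unordered_map<std::uintptr_t, std::size_t> index_;  // addr -> position in entries_
    std::size_t totalStores_ = 0;
};
```

     Feeding the twelve addresses from the slide's figure into store() records twelve stores but only five log entries, which is the reduction in write amplification the slide describes.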

  12. What makes Redo-PTM fast: 3. Flush aggregation
     Classic redo-log PTMs (like Mnemosyne and OneFile) flush the persistent redo log, and later flush each modified cache line in memory.
     In Redo-PTM, the combining consensus aggregates the operations from multiple in-flight threads, and the Redo PTM creates a volatile redo log and a volatile cache line log.
     With a large number of threads, the likelihood that many operations will touch the same cache lines is higher. This is particularly true for allocator metadata modifications.
     Each cache line is flushed a single time, improving performance.
     Figure: threads T0 to T3 run add(a), remove(b), add(c) and remove(c), writing addresses such as 0x2004, 0x1001, 0x2002, 0x1003, 0x4004, 0x3002, 0x3003, 0x2001, 0x5004, 0x5001, 0x4002 and 0x4003. After the combining consensus, the Cache Line log holds 0x1000, 0x2000, 0x3000, 0x4000 and 0x5000.
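     A cache line log of this kind can be sketched as a set of line-aligned addresses. The sketch below is an assumption-laden illustration, not the Redo-PTM implementation: CacheLineLog and its methods are hypothetical names, the line size is assumed to be 64 bytes, and the flush itself is a caller-supplied callback where a real system would issue CLWB followed by a fence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Assumption: 64-byte cache lines, as on current x86 processors.
constexpr std::uintptr_t kCacheLineSize = 64;

// Hypothetical volatile cache line log (illustration only).
// It remembers which cache lines were modified, not which addresses.
class CacheLineLog {
public:
    // Record that `addr` was written; only its cache line is remembered.
    void recordStore(std::uintptr_t addr) {
        lines_.insert(addr & ~(kCacheLineSize - 1));
        ++stores_;
    }

    std::size_t stores() const { return stores_; }
    std::size_t dirtyLines() const { return lines_.size(); }

    // Call `flushFn(line)` once per dirty line, then clear the log.
    // In a real system flushFn would issue CLWB on the line.
    // Returns the number of flushes issued.
    template <typename FlushFn>
    std::size_t flushAll(FlushFn flushFn) {
        std::size_t issued = 0;
        for (std::uintptr_t line : lines_) {
            flushFn(line);
            ++issued;
        }
        lines_.clear();
        return issued;
    }

private:
    std::unordered_set<std::uintptr_t> lines_;  // line-aligned addresses
    std::size_t stores_ = 0;
};
```

     With the twelve addresses from the slide's figure, recordStore() sees twelve stores while flushAll() issues five flushes, one per line.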

  13. What makes Redo-PTM fast: 4. Flush deferral
     In Redo-PTM, a thread executes modifications on its own private instance and only issues the flushes immediately before attempting to change curComb with a CAS.
     If another thread has in the meantime changed curComb, then no flushes are issued. The Cache Line log remains associated with a replica, for another thread to later aggregate further modifications.
     This technique allows Redo-PTM to aggregate flushes across consecutive transactions.
     If the Cache Line log grows beyond 1/10 of the number of cache lines in the replica, we clear the log and set a flag to flush the entire replica (before becoming the next curComb).
     Figure: three consecutive rounds of transactions from threads T0 to T3 pass through the combining consensus; their Physical Redo Logs are kept in the Ring Queue, and the modified lines accumulate in a single Cache Line log (0x1000, 0x2000, 0x3000, 0x4000, 0x5000).
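     The deferral rule and the 1/10 threshold can be sketched in a single-threaded simulation. Everything here is a simplified assumption: Replica, CommitResult and tryCommit are hypothetical names, curComb is modelled as an integer index rather than a pointer, flushes are only counted instead of issued, and the locking and consensus steps of the real algorithm are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical replica bookkeeping (illustration only).
struct Replica {
    explicit Replica(std::size_t lines) : numCacheLines(lines) {}

    std::size_t numCacheLines;                 // replica size in cache lines
    std::unordered_set<std::uintptr_t> clLog;  // dirty lines not yet flushed
    bool flushWhole = false;                   // set once the log got too large

    // Record a modified line; past 1/10 of the replica, drop the log and
    // fall back to flushing the entire replica.
    void markDirty(std::uintptr_t line) {
        if (flushWhole) return;
        clLog.insert(line);
        if (clLog.size() > numCacheLines / 10) {
            clLog.clear();
            flushWhole = true;
        }
    }
};

struct CommitResult {
    bool committed;       // did the CAS on curComb succeed?
    std::size_t flushes;  // cache line flushes that would have been issued
};

// Try to make replica `mine` the current one. If curComb already moved away
// from `expected`, issue no flushes and keep the log for a later transaction.
CommitResult tryCommit(std::atomic<int>& curComb, int expected, int mine, Replica& r) {
    if (curComb.load() != expected) {
        return {false, 0};  // deferred: the Cache Line log stays with the replica
    }
    std::size_t flushes = r.flushWhole ? r.numCacheLines : r.clLog.size();
    // A real implementation would issue CLWB per line and a fence here.
    r.clLog.clear();
    r.flushWhole = false;
    bool ok = curComb.compare_exchange_strong(expected, mine);
    return {ok, flushes};
}
```

     In this sketch a commit attempt that finds curComb changed returns with zero flushes, and the lines it had logged are flushed together with those of the next transaction on the same replica.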

  14. What makes Redo-PTM fast: 5. Replica copy with non-temporal stores
     In Redo-PTM, when a full copy of the replica needs to be made, instead of doing a memcpy() and then flushing the entire range, we use non-temporal stores (movntq) to execute the copy and forego the need to issue CLWB instructions.
     This approach provides an extra improvement in performance for such (rare) large copies.
     Figure: a stale Replica is overwritten with movntq stores from the Replica referenced by the curComb pointer.
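     A copy loop built on non-temporal stores might look like the sketch below. It is an illustration under assumptions, not the authors' code: copyNonTemporal is a hypothetical helper, and it swaps in the SSE2 intrinsic _mm_stream_si128 (MOVNTDQ, 16 bytes per store) for the movntq instruction named on the slide. Non-temporal stores still need a store fence before the copy can be considered ordered, so the sketch ends with _mm_sfence(). On targets without SSE2 it falls back to memcpy(), which would then need explicit flushes on real persistent memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si128, _mm_loadu_si128, _mm_sfence
#endif

// Hypothetical helper (illustration only): copy `bytes` from src to dst
// using non-temporal stores that bypass the cache hierarchy.
// Preconditions: dst is 16-byte aligned and bytes is a multiple of 16.
inline void copyNonTemporal(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i) {
        // Unaligned load from the source, streaming store to the destination.
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    }
    _mm_sfence();  // order the streaming stores before any later store
#else
    // Fallback without SSE2: a plain copy (no non-temporal behaviour).
    std::memcpy(dst, src, bytes);
#endif
}
```

     Because the data does not pass through the cache, there are no dirty cache lines left to write back afterwards, which is why the slide can skip the CLWB pass over the copied range.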

  15. Sequential Linked List Queue transformed into a Wait-Free Persistent Queue
     Even though Redo-PTM serializes write transactions, it is able to scale for writes in certain situations, due to the previously mentioned optimizations.
     FHMP: Friedman et al, PPoPP 2018
     NormOpt: Ben-David et al, SPAA 2019

  16. Tree set and hash set
     Top plots show a transactional Red-Black Tree with three different PTMs, for 100%, 10% and 1% updates.
     Bottom plots show a transactional resizable hash set.

  17. Two queue implementations side by side
     • Handmade queue (lock-free and persistent)
     • Sequential queue annotated to be used with Redo-PTM (wait-free and persistent)

  18. How KV stores are made today…
     • KV Store components: Two-Phase Locking (+ MVCC) (blocking), Concurrent Indexing Data Structure, Redo Log
     • Built by Expert Developers in Concurrency, Durability and DBs
     • Years of development

  19. How we made a KV store with Redo-PTM
     • Redo DB components: Redo-PTM (PTM), Sequential Indexing Data Structure
     • Built by an Expert DB Developer
     • Months of development
     Properties:
     • Wait-Free progress
     • Null recovery
     • Non-abortable reads
     • ACID transactions
