Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage
Kevin Beineke, Florian Klein, Michael Schöttner
Institut für Informatik, Heinrich-Heine-Universität Düsseldorf
Outline
• Motivation
• The In-Memory Storage DXRAM
• Asynchronous Logging
• Fast Recovery
• Reorganization
• Conclusion
Motivation 1/13
• Large-scale interactive applications and online graph computations:
  • Billions of small data objects
  • Dynamically expanding data sets
  • Read accesses dominate over write accesses
  • Low latency required
• Example: Facebook
  • More than one billion users
  • More than 150 TB of data (2011)
  • 70% of all data objects are smaller than 64 bytes (2011)
• Traditional databases are at their limits
Motivation 2/13
• Common approach to meet these requirements: RAM caches
  • Must be synchronized with secondary storage
  • Refilling after a failure is very time consuming (Facebook outage 2011 -> 2.5 h)
  • Cache misses are expensive
• Another approach: keeping all objects in RAM at all times
  • RAMCloud:
    • Table-based data model
    • 64-bit global ID mapping via a hash table
    • Log-structured memory design
    • Optimized for large files
The In-Memory Storage DXRAM
The In-Memory Storage DXRAM 3/13: Overview
• DXRAM is a distributed in-memory system:
  • Optimized to handle billions of small objects
  • Key-value data model with a name service
  • Transparent backup to SSD (HDD)
• Core Services:
  • Management, storage, and transfer of key-value tuples (chunks)
  • Minimal interface
• Extended Data Services:
  • General services and extended data models
The In-Memory Storage DXRAM 4/13: Chunks
• Variable sizes
• Every chunk is initially stored on its creator, but can be migrated (e.g., to resolve hot spots)
• Every chunk has a 64-bit globally unique chunk ID (CID); see the sketch below
  • First 16 bits: NodeID (NID) of the creator node
  • Last 48 bits: locally unique sequential LocalID
• Impact:
  • Locality: chunks created at the same location adjacent in time have similar CIDs
  • The initial location is encoded in the CID: no lookup needed if a chunk has not been migrated
  • After migration: the new location must be stored elsewhere
  • Applications cannot specify their own IDs
• Migrated CIDs are stored as ranges in a B-tree on dedicated nodes
  • No entry -> the chunk is still stored on its creator
• Support for user-defined keys:
  • Name service with a Patricia-trie structure
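The following is a minimal, illustrative sketch of how such a 64-bit CID can be packed and unpacked with plain bit operations. The class and method names are hypothetical and not taken from the DXRAM code base; only the 16-bit/48-bit split comes from the slide.

```java
// Hypothetical helper illustrating the CID layout described above:
// 16-bit creator NodeID (NID) in the upper bits, 48-bit LocalID in the lower bits.
public final class ChunkID {

    private static final long LOCALID_BITS = 48;
    private static final long LOCALID_MASK = (1L << LOCALID_BITS) - 1; // 0x0000FFFFFFFFFFFF

    private ChunkID() {}

    // Combine a creator NodeID and a locally unique sequential ID into one 64-bit CID.
    // Masking with 0xFFFF handles Java's signed short correctly.
    public static long create(short nodeID, long localID) {
        return (((long) nodeID & 0xFFFFL) << LOCALID_BITS) | (localID & LOCALID_MASK);
    }

    // Extract the creator NodeID (upper 16 bits).
    public static short getNodeID(long chunkID) {
        return (short) (chunkID >>> LOCALID_BITS);
    }

    // Extract the LocalID (lower 48 bits).
    public static long getLocalID(long chunkID) {
        return chunkID & LOCALID_MASK;
    }
}
```

For example, `ChunkID.create((short) 0x0042, 7)` yields a CID whose upper 16 bits identify node 0x0042, so the initial location can be derived from the CID alone, without any lookup.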
The In-Memory Storage DXRAM 5/13: Global meta-data management
• Fast node lookup with a custom Chord-like super-peer overlay
  • 8 to 10% of all nodes are super-peers
  • Super-peers do not store data, only meta-data
  • Meta-data is replicated on successors
  • Every super-peer knows every other super-peer -> lookup with constant time complexity O(1)
  • Every peer is assigned to one super-peer
• Fast node recovery
  • Super-peers also store backup locations
  • Distributed failure detection
  • Super-peer-coordinated recovery with multiple peers
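A rough sketch of the O(1) lookup idea: because every super-peer keeps a complete view of all other super-peers, the super-peer responsible for a given peer can be determined locally, without multi-hop routing. The ring assignment via successor and the class below are illustrative assumptions, not DXRAM internals (the local table lookup itself is logarithmic in the small number of super-peers; "O(1)" refers to the absence of multi-hop routing).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, fully replicated super-peer table: the responsible super-peer
// for a peer NodeID is its successor on the ring, found with a single local lookup.
public class SuperPeerTable {

    // Sorted ring positions of all known super-peers (NodeID -> network address).
    private final TreeMap<Short, String> superPeers = new TreeMap<>();

    public void addSuperPeer(short nodeID, String address) {
        superPeers.put(nodeID, address);
    }

    // Successor of the given peer NodeID on the ring; wraps around if necessary.
    public String responsibleSuperPeer(short peerNodeID) {
        Map.Entry<Short, String> entry = superPeers.ceilingEntry(peerNodeID);
        if (entry == null) {
            entry = superPeers.firstEntry(); // wrap around the ring
        }
        return entry.getValue();
    }
}
```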
Asynchronous Logging
Asynchronous Logging 6/13: SSD Utilization
• Characteristics of SSDs:
  • SSDs write at least one page (4 KB); pages are clustered to be accessed in parallel
  • SSDs cannot overwrite a single flash page; they erase a whole block (64 to 128 pages) and write the page elsewhere
  • Writing sequentially is faster than writing randomly
  • Mixing write and read accesses slows the SSD down
  • Life span: limited number of program-erase cycles
• Consequences:
  • Buffer write accesses
  • Use a log to avoid in-place updates and to write sequentially
  • Only read the log during recovery
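To make these consequences concrete, here is a minimal sketch of an append-only log writer that buffers entries and flushes them sequentially in multiples of the 4 KB flash page size. Class name, buffer size, and framing are assumptions for illustration, not DXRAM code; entries are assumed to be much smaller than the buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative append-only log: entries are collected in RAM and
// written sequentially to SSD, padded to the 4 KB page boundary.
public class AppendOnlyLog implements AutoCloseable {

    private static final int PAGE_SIZE = 4096;

    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);

    public AppendOnlyLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Buffer one backup entry; the caller is responsible for framing (length, CID, ...).
    public synchronized void append(byte[] entry) throws IOException {
        if (buffer.remaining() < entry.length) {
            flush();
        }
        buffer.put(entry);
    }

    // Pad with zero bytes to the next 4 KB page boundary and write sequentially.
    public synchronized void flush() throws IOException {
        int padding = (PAGE_SIZE - buffer.position() % PAGE_SIZE) % PAGE_SIZE;
        buffer.put(new byte[padding]);
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override
    public synchronized void close() throws IOException {
        flush();
        channel.close();
    }
}
```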
Asynchronous Logging 7/13: Architecture
• Two-level log organization: one primary log and one secondary log for every node requesting backups
• Idea: store incoming backup requests on SSD as soon as possible to avoid data loss, while writing as much data as possible at once
• No need to store meta-data in RAM, because every entry is self-describing
[Figure: backup requests -> write buffer -> (time-out / threshold) -> primary log; sorted by NID -> secondary logs 1..X]
Asynchronous Logging 8/13: Architecture
• Write buffer:
  • Stores chunks from potentially every node: is filled frequently
  • Bundles backup requests (4 KB)
  • Decouples network threads (synchronous operation possible)
  • Parallel access to the write buffer: X producers (network threads), 1 consumer (writer thread)
• Writer thread:
  • Flushes the write buffer to the primary log after a time-out (e.g. 0.5 s) or if a threshold is reached (e.g. 16 MB); see the sketch below
  • Two-bucket approach
• Problem: to recover all data of one node, the whole primary log must be processed
[Figure: backup requests -> write buffer (RAM) -> (time-out / threshold) -> primary log (SSD)]
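A minimal sketch of the flush policy described above: a single writer thread drains a shared buffer either when a size threshold is exceeded or when a time-out elapses. The 0.5 s and 16 MB defaults come from the slide; the blocking queue is a simplification of DXRAM's two-bucket write buffer, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// One consumer (writer thread), many producers (network threads).
// Flushes when the buffered volume exceeds the threshold or the time-out elapses.
public class WriteBufferFlusher implements Runnable {

    private static final long FLUSH_TIMEOUT_MS = 500;            // e.g. 0.5 s
    private static final long FLUSH_THRESHOLD_BYTES = 16L << 20; // e.g. 16 MB

    private final BlockingQueue<byte[]> writeBuffer = new ArrayBlockingQueue<>(1 << 16);
    private volatile boolean running = true;

    // Called by network threads (producers).
    public void offerBackup(byte[] entry) throws InterruptedException {
        writeBuffer.put(entry);
    }

    @Override
    public void run() {
        long bufferedBytes = 0;
        long lastFlush = System.nanoTime();
        List<byte[]> batch = new ArrayList<>();

        while (running) {
            try {
                // Wait at most until the time-out would expire.
                long elapsedMs = (System.nanoTime() - lastFlush) / 1_000_000;
                byte[] entry = writeBuffer.poll(
                        Math.max(1, FLUSH_TIMEOUT_MS - elapsedMs), TimeUnit.MILLISECONDS);
                if (entry != null) {
                    batch.add(entry);
                    bufferedBytes += entry.length;
                }

                boolean timedOut =
                        (System.nanoTime() - lastFlush) / 1_000_000 >= FLUSH_TIMEOUT_MS;
                if (bufferedBytes >= FLUSH_THRESHOLD_BYTES || (timedOut && !batch.isEmpty())) {
                    flushToPrimaryLog(batch); // sorting by NID etc. would happen here
                    batch.clear();
                    bufferedBytes = 0;
                    lastFlush = System.nanoTime();
                } else if (timedOut) {
                    lastFlush = System.nanoTime(); // empty window, start a new time-out
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                running = false;
            }
        }
    }

    private void flushToPrimaryLog(List<byte[]> batch) {
        // Placeholder: append the batch sequentially to the primary log on SSD.
    }

    public void shutdown() {
        running = false;
    }
}
```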
Asynchronous Logging 9/13: Architecture
[Figure: backup requests -> write buffer (RAM) -> (time-out / threshold) -> primary log (SSD); per-node secondary log buffers 1..X (RAM) -> secondary logs 1..X (SSD)]
Asynchronous Logging 10/13: Optimizations
• The write buffer is sorted by NID before writing to SSD
• If there are more than 4 KB for one node, the data is written directly to the corresponding secondary log (see the sketch below)
  • Method: combination of hashing and monitoring
• Clearing the primary log:
  • Flush all secondary log buffers
  • Set the read pointer to the write pointer
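The sketch below illustrates the routing decision described above: entries of a flushed batch are bucketed per NID (hashing); groups of at least one flash page (4 KB) bypass the primary log and go straight to that node's secondary log. The class, the record type, and both placeholder write methods are assumptions for illustration (records require Java 16+), not DXRAM code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group the flushed batch by creator NodeID; large per-node groups (>= 4 KB)
// are written directly to that node's secondary log.
public class BatchRouter {

    private static final int DIRECT_WRITE_THRESHOLD = 4096; // one flash page

    // A buffered backup entry: destination NodeID plus serialized payload.
    public record BackupEntry(short nodeID, byte[] payload) {}

    public void route(List<BackupEntry> batch) {
        // "Sort by NID": bucket the entries per node.
        Map<Short, List<BackupEntry>> perNode = new HashMap<>();
        Map<Short, Integer> perNodeBytes = new HashMap<>();
        for (BackupEntry e : batch) {
            perNode.computeIfAbsent(e.nodeID(), k -> new ArrayList<>()).add(e);
            perNodeBytes.merge(e.nodeID(), e.payload().length, Integer::sum);
        }

        for (Map.Entry<Short, List<BackupEntry>> group : perNode.entrySet()) {
            if (perNodeBytes.get(group.getKey()) >= DIRECT_WRITE_THRESHOLD) {
                writeToSecondaryLog(group.getKey(), group.getValue()); // skip the primary log
            } else {
                writeToPrimaryLog(group.getValue());                   // small groups stay bundled
            }
        }
    }

    private void writeToSecondaryLog(short nodeID, List<BackupEntry> entries) {
        // Placeholder: sequential append to the secondary log of this node.
    }

    private void writeToPrimaryLog(List<BackupEntry> entries) {
        // Placeholder: sequential append to the shared primary log
        // (and to the node's secondary log buffer in RAM).
    }
}
```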
Fast Recovery
Fast Recovery 11/13
• Super-peer overlay:
  • Fast and distributed failure detection (hierarchical heartbeat protocol)
  • Coordinated and targeted peer recovery (the super-peer knows all corresponding backup locations)
• Recovery modes:
  1. Every contacted backup peer recovers its chunks locally (fastest, no data transfer)
  2. All chunks are recovered and sent to one peer (1:1)
  3. All chunks are recovered and sent to several peers (faster, but less locality; used by RAMCloud)
  4. 1 and 2 combined: recover locally and rebuild the failed peer gradually
Reorganization
Reorganization 12/13
• Write buffers and the primary log are cleared periodically
• Secondary logs fill up continuously
• To free the space of deleted or outdated entries, the secondary logs have to be reorganized
• Every peer reorganizes its logs independently
• Demands:
  • Space efficiency
  • As little disruption as possible
  • Incremental operation to guarantee fast recovery
• Idea (inspired by LFS, the log-structured file system):
  • Divide each log into segments of fixed size
  • Reorganize one segment after another
  • Distinguish segments by access frequency (hot and cold zones)
  • Decide which segment to reorganize by a cost-benefit ratio (see the sketch below)
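The slide does not give the cost-benefit formula, and the conclusion mentions an adapted one as future work. As an illustration only, the sketch below uses the classic LFS heuristic, benefit/cost = (1 - u) * age / (1 + u), where u is the fraction of still-live bytes in a segment and age approximates how long its data has been stable (cold segments are worth cleaning even at higher utilization). This is not DXRAM's formula.

```java
import java.util.List;

// Illustrative segment selection using the classic LFS cost-benefit heuristic.
public class SegmentSelector {

    public record Segment(int id, double utilization /* 0..1 */, long ageMillis) {}

    // Pick the segment with the highest cost-benefit score for reorganization.
    public static Segment selectVictim(List<Segment> segments) {
        Segment best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Segment s : segments) {
            double score = (1.0 - s.utilization()) * s.ageMillis() / (1.0 + s.utilization());
            if (score > bestScore) {
                bestScore = score;
                best = s;
            }
        }
        return best;
    }
}
```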
Conclusion 13/13
• Current status:
  • DXRAM memory management tested on a cluster with more than 5 billion objects
  • Small-object processing faster than RAMCloud
  • Multi-threaded write buffer implemented and examined under a worst-case scenario
  • Logs fully functional with a simpler reorganization scheme
  • Node failure detection and initiation of the recovery process tested
• Outlook:
  • Implementation of an LFS-like reorganization scheme with an adapted cost-benefit formula
  • Replica placement (Copysets)
  • Evaluation of the complete recovery process
Backup Slides
The In-Memory Storage DXRAM 14/13: In-memory data management
• Paging-like translation of LocalIDs to local addresses instead of a hash table
• Space-efficient and fast
• Minimized internal fragmentation
• Small overhead: only 7 bytes per chunk for chunks smaller than 256 bytes
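To illustrate the paging-like translation, here is a rough sketch that splits the 48-bit LocalID into table indices like a multi-level page table, instead of hashing the full 64-bit CID. The three-level split of 16 bits each and the array-based tables are assumptions chosen for readability; DXRAM's actual table layout and the 7-byte overhead breakdown are not shown on the slide.

```java
// Hypothetical multi-level translation table: LocalID -> local memory address.
// Sequential LocalIDs cluster in the same leaf tables, which keeps the
// structure compact and avoids per-entry hash-table overhead.
public class PagingTranslation {

    private static final int LEVEL_BITS = 16;
    private static final int LEVEL_SIZE = 1 << LEVEL_BITS;
    private static final int LEVEL_MASK = LEVEL_SIZE - 1;

    // Root of a three-level table; leaves hold the local address of a chunk.
    private final long[][][] root = new long[LEVEL_SIZE][][];

    public void put(long localID, long address) {
        int i0 = (int) ((localID >>> (2 * LEVEL_BITS)) & LEVEL_MASK);
        int i1 = (int) ((localID >>> LEVEL_BITS) & LEVEL_MASK);
        int i2 = (int) (localID & LEVEL_MASK);
        if (root[i0] == null) {
            root[i0] = new long[LEVEL_SIZE][];
        }
        if (root[i0][i1] == null) {
            root[i0][i1] = new long[LEVEL_SIZE];
        }
        root[i0][i1][i2] = address;
    }

    public long get(long localID) {
        int i0 = (int) ((localID >>> (2 * LEVEL_BITS)) & LEVEL_MASK);
        int i1 = (int) ((localID >>> LEVEL_BITS) & LEVEL_MASK);
        int i2 = (int) (localID & LEVEL_MASK);
        long[][] level1 = root[i0];
        if (level1 == null || level1[i1] == null) {
            return 0; // not present
        }
        return level1[i1][i2];
    }
}
```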