BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang, and Yafei Dai (Peking University)
Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation
In-memory KV-Store: a crucial building block for many systems, e.g., data caches (Memcached and Redis at Facebook and Twitter) and in-memory databases. Availability is important for in-memory KV-Stores: Facebook reports that it takes 2.5-3 hours to recover 120 GB of an in-memory database from disk to memory. Data redundancy in distributed memory is therefore essential for fast failover.
Two redundancy schemes: Replication is a classical way to provide data availability (e.g., Repcached, Redis). [Diagram: a client's write request reaches the data node and is mirrored to backup nodes: high update bandwidth cost and high memory cost.]
Two redundancy schemes: Erasure coding is a space-efficient redundancy scheme, and the increase of CPU speed enables fast data recovery: encoding/decoding rates can reach 40 Gb/s on a single core [1]. [Diagram: a client's write request spans data nodes and parity nodes: high update bandwidth cost but low memory cost.] [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16
In-place Update: a traditional mechanism for encoding small objects. [Diagram: updating an object, e.g., obj4 to obj4', overwrites the data block and sends a delta(obj4, obj4') to every parity node; with two parity nodes, the bandwidth cost per update is the same as 3-replication.] Our goal: both memory efficiency and bandwidth efficiency.
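To make the cost concrete, here is a minimal sketch of the delta-based in-place update path. Single XOR parity stands in for a full RS code (RS would scale the delta by a per-parity Galois-field coefficient, but the traffic pattern is identical); all names are illustrative, not BCStore's API.

```python
# Minimal sketch of delta-based in-place update with single XOR parity.
# A real RS(k, m) code multiplies the delta by a per-parity Galois-field
# coefficient; the network traffic pattern is the same.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class ParityNode:
    def __init__(self, parity: bytes):
        self.parity = parity
    def apply_delta(self, delta: bytes):
        self.parity = xor_bytes(self.parity, delta)  # parity ^= delta

def inplace_update(old_block: bytes, new_block: bytes, parity_nodes):
    """One data write plus one delta per parity node: with m = 2
    parities, 3 block transfers per update, same as 3-replication."""
    delta = xor_bytes(old_block, new_block)
    for node in parity_nodes:          # m delta transfers
        node.apply_delta(delta)

# Example: p = d1 ^ d2; after updating d1, p covers the new d1.
d1, d2 = b"\x0f", b"\xf0"
p = ParityNode(xor_bytes(d1, d2))
inplace_update(d1, b"\xff", [p])
assert p.parity == xor_bytes(b"\xff", d2)
```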
Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation
Our Design: aggregate write requests and encode the objects into a new coding stripe. [Diagram: updated objects obj4', obj8', obj3' are buffered on a batch node, batch-coded into a new stripe with fresh parity blocks, and appended; the old blocks obj4, obj8, obj3 are marked invalid.]
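A minimal sketch of batch coding under the same XOR-parity simplification (the real system uses RS(k, m); class and function names are hypothetical): buffer k writes, encode them as one fresh stripe, and append it. Amortized over the batch, each object carries m/k parity blocks of traffic instead of m deltas.

```python
# Sketch: aggregate k writes into a fresh stripe and append it; old
# versions of updated objects simply become invalid. XOR single parity
# stands in for RS(k, m); names are illustrative.

class BatchCoder:
    def __init__(self, k: int, block_size: int):
        self.k, self.block_size = k, block_size
        self.pending = []                       # buffered writes

    def put(self, value: bytes):
        self.pending.append(value.ljust(self.block_size, b"\0"))
        if len(self.pending) == self.k:         # stripe is full
            self._flush()

    def _flush(self):
        parity = bytes(self.block_size)
        for blk in self.pending:
            parity = bytes(x ^ y for x, y in zip(parity, blk))
        append_stripe(self.pending, parity)     # append, never overwrite
        self.pending = []

def append_stripe(data_blocks, parity):
    print(f"appended stripe: {len(data_blocks)} data blocks + 1 parity")

coder = BatchCoder(k=3, block_size=8)
for v in (b"obj4'", b"obj8'", b"obj3'"):        # three updates, one stripe
    coder.put(v)
```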
Latency Analysis: batch coding induces extra request waiting time. We formalize the waiting time as W = f(T, k), where T is the request throughput and k is the number of data nodes, subject to a latency bound ε. [Plot: waiting time versus throughput for k = 3.]
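A rough back-of-envelope form of W (a reading of the slide, not the paper's exact derivation): if requests arrive at a steady rate T and a stripe closes after k objects, the i-th request in a batch waits for the remaining arrivals.

```latex
% Hedged sketch, assuming steady arrivals at rate T and stripes of k
% objects; the paper's exact f(T, k) may differ.
\[
  W_i \approx \frac{k - i}{T}, \qquad
  \overline{W} \;=\; \frac{1}{k} \sum_{i=1}^{k} \frac{k - i}{T}
               \;=\; \frac{k - 1}{2T}.
\]
% A latency bound \varepsilon can then be enforced by flushing a
% partially filled stripe once its oldest request has waited \varepsilon.
```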
Garbage Collection: recycle updated or deleted blocks and release the extra parity blocks. Move-based garbage collection: [Diagram: valid blocks are moved from original stripes into batched stripes on the data nodes, incurring much bandwidth cost for updating parity blocks.]
Garbage Collection: how to reduce the GC bandwidth cost? Intuition: GC the stripes with the most invalid blocks. Greedy block moving: [Diagram: two block moves release two coding stripes.] A sketch of this heuristic follows.
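A minimal sketch of the greedy heuristic (the stripe bookkeeping is illustrative, not BCStore's data structures): always collect the stripes with the fewest valid blocks, so each released stripe costs the fewest moves.

```python
import heapq

# Sketch of greedy GC: pick stripes with the most invalid blocks,
# i.e., the fewest valid blocks to move per stripe released.

def greedy_gc(stripes, n_to_release, rewrite_block):
    """Release n_to_release stripes, cheapest first.
    stripes: list of (stripe_id, valid_blocks) pairs."""
    heap = [(len(valid), sid, valid) for sid, valid in stripes]
    heapq.heapify(heap)                 # fewest valid blocks pop first
    moves = 0
    for _ in range(min(n_to_release, len(heap))):
        _, sid, valid = heapq.heappop(heap)
        for block in valid:             # survivors re-enter the batch coder
            rewrite_block(block)
            moves += 1
    return moves

# Example: releasing the two emptiest stripes costs only two moves.
stripes = [("s1", ["b3"]), ("s2", ["b1", "b2", "b7"]), ("s3", ["b9"])]
print(greedy_gc(stripes, 2, lambda b: None))   # -> 2
```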
Garbage Collection: how to further reduce block moves? Intuition: make updates focus on a few stripes. Popularity-based data arrangement: [Diagram: hot and cold objects are segregated into separate stripes; only one block move releases two coding stripes.]
Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth ≤ in-place update bandwidth. The detailed proof can be found in our paper.
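A back-of-envelope accounting of why the theorem is plausible (a rough bookkeeping under uniform block sizes and full stripes, not the paper's proof): in-place update transfers 1 + m blocks per object, while batch coding amortizes the parity over k objects, leaving slack for GC moves.

```latex
% Per updated object, with k data + m parity blocks per stripe:
%   in-place update: 1 data write + m parity deltas = 1 + m transfers
%   batch coding:    1 data write + m/k amortized parity writes
\[
  \underbrace{1 + \tfrac{m}{k}}_{\text{coding}} \;+\; B_{\mathrm{GC}}
  \;\le\; \underbrace{1 + m}_{\text{in-place}}
  \;\Longleftrightarrow\;
  B_{\mathrm{GC}} \;\le\; \frac{m(k-1)}{k}
  \ \text{transfers per object for GC.}
\]
```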
Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation
System Architecture: [Diagram: clients send requests to the batch process, which handles preprocessing, batch coding, garbage collection, and metadata management; the storage group consists of data processes and parity processes.]
Handle Write Requests: [Diagram: clients issue set(k1, v1), set(k2, v2), set(k3, v3); the batch process buffers v1, v2, v3, batch-codes them into stripe b1 with parity blocks P1 and P2, distributes the values to data processes 1-3 and the parities to parity processes 1-2, and updates the hash table and stripe index.] A sketch of this path appears after the next slide.
Handle Read Requests: [Diagram: on get(k1), the batch process looks up k1 in the hash table and stripe index to find its stripe id (k1, k2, and k3 all map to stripe b1) and issues get(b1) to the data process holding the block.]
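A minimal sketch of the metadata path shown on the last two slides (plain dicts stand in for the hash table and stripe index; all names are illustrative): set() buffers values until a stripe seals, then records key-to-stripe mappings; get() resolves a key to its stripe and block.

```python
# Sketch of the write/read metadata path: a hash table maps each key
# to its stripe id, and a stripe index records where each block of a
# stripe lives. Plain dicts stand in for BCStore's structures.

class BatchProcess:
    def __init__(self, k: int):
        self.k = k
        self.hash_table = {}      # key -> stripe id
        self.stripe_index = {}    # stripe id -> {key: data process}
        self.pending = []         # (key, value) awaiting batch coding
        self.next_stripe = 0

    def set(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) == self.k:      # batch full: seal a stripe
            sid = f"b{self.next_stripe}"
            self.next_stripe += 1
            placement = {}
            for node, (k_, v_) in enumerate(self.pending):
                send_to_data_process(node, sid, v_)   # plus parity writes
                placement[k_] = node
                self.hash_table[k_] = sid
            self.stripe_index[sid] = placement
            self.pending = []

    def get(self, key):
        sid = self.hash_table[key]                    # key -> stripe id
        node = self.stripe_index[sid][key]            # stripe -> node
        return fetch_from_data_process(node, sid)

def send_to_data_process(node, sid, value):
    print(f"data process {node}: store block of stripe {sid}")

def fetch_from_data_process(node, sid):
    print(f"data process {node}: read block of stripe {sid}")

bp = BatchProcess(k=3)
for kv in [("k1", b"v1"), ("k2", b"v2"), ("k3", b"v3")]:
    bp.set(*kv)               # third set seals stripe b0
bp.get("k1")                  # resolves k1 -> b0 -> data process 0
```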
Recovery: recover the requested data first. [Diagram: on get(k1) after a failure, (1) the batch process gets the blocks of the stripe, according to the stripe id, from any k storage processes, and (2) the decoder recovers the lost blocks.]
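A sketch of the degraded read in the XOR simplification (RS(k, m) reconstructs from any k of the k + m blocks; with single parity, the k survivors of a k + 1 stripe suffice):

```python
# Sketch of degraded read: recover a lost block from the survivors.
# With single XOR parity, the lost block is the XOR of all others;
# RS(k, m) generalizes this to any k of the k + m blocks.

from functools import reduce

def recover_block(surviving_blocks) -> bytes:
    """XOR of the k surviving blocks of a (k+1)-block XOR stripe."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  surviving_blocks)

# Example: stripe (v1, v2, v3, parity); the node holding v1 fails.
v1, v2, v3 = b"aaaa", b"bbbb", b"cccc"
parity = recover_block([v1, v2, v3])          # encode: p = v1^v2^v3
assert recover_block([v2, v3, parity]) == v1  # decode the lost v1
```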
Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation
Evaluation. Cluster configuration: 10 machines running SUSE Linux 11, each containing 12 AMD Opteron 4180 CPUs, connected by 1 Gb/s Ethernet. Targets of comparison: in-place-update erasure coding (Cocytus [1]) and replication (Rep). Workload: YCSB with different key distributions and a 50%:50% read/write ratio. [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16
Bandwidth Cost: BCStore saves up to 51% bandwidth cost. [Figure: bandwidth cost for different coding schemes.]
Throughput: up to 2.4x improvement. [Figure: throughput for different coding schemes.]
Memory: BCStore saves up to 41% memory cost. [Figure: memory consumption for different redundancy schemes.]
Latency: [Figures: read latency and write latency.]
Conclusion: Efficiency and availability are two crucial features for in-memory KV-Stores. We build BCStore, an in-memory KV-Store that applies erasure coding for data availability. We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads. We propose a heuristic garbage collection algorithm to improve memory efficiency.
Thanks! Q&A
Severity of Bandwidth Cost: write requests are prevalent in large-scale web services, and peak load can easily exhaust network bandwidth and degrade service performance. The monetary cost of bandwidth becomes several times higher, especially under the commonly used peak-load pricing model, and bandwidth amplification grows more serious as m (the number of parity servers) increases. The bandwidth budget is also usually limited in a workload-sharing cluster. Our goal: high memory efficiency and bandwidth efficiency.
Challenges: (1) Recycle the memory space of data blocks that are deleted or updated: data blocks and parity blocks are appended to storage, so updated blocks cannot be deleted directly. (2) Encode variable-sized data efficiently: variable-sized data cannot simply be appended into previously allocated storage space.
Garbage Collection: popularity-based data arrangement. [Diagram: objects on each data node are sorted by popularity, and cold objects are batched together into their own stripes.]
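A minimal sketch of popularity-based arrangement (the access counters and stripe grouping are illustrative assumptions): rank objects by access frequency before striping, so hot objects, which absorb most updates, share stripes, and cold stripes stay largely valid.

```python
# Sketch: sort objects by popularity before forming stripes, so that
# hot objects share stripes (which absorb most updates) and cold
# objects share stripes (which stay valid). Names are illustrative.

def arrange_by_popularity(objects, access_count, k):
    """objects: list of keys; access_count: key -> hit counter.
    Returns stripes of k keys each, hottest stripes first."""
    ranked = sorted(objects, key=lambda o: access_count[o], reverse=True)
    return [ranked[i:i + k] for i in range(0, len(ranked), k)]

counts = {"a": 90, "b": 85, "c": 3, "d": 80, "e": 2, "f": 1}
print(arrange_by_popularity(list(counts), counts, k=3))
# -> [['a', 'b', 'd'], ['c', 'e', 'f']]  (hot stripe, cold stripe)
```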
Encoding Variable-size Data: virtual coding stripes (vcs). Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space. [Diagram: the physical space of each data and parity node is mapped into aligned virtual coding stripes vcs1, vcs2, vcs3.]
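A minimal sketch of how a virtual coding stripe could pack variable-size values (the fixed region size and addressing scheme are assumptions, not the paper's exact layout): each stripe reserves a fixed-length virtual region, values are packed by offset, and a block is addressed as (stripe, offset, length), keeping stripes aligned across nodes.

```python
# Sketch: a virtual coding stripe reserves a fixed-length region per
# node; variable-size values are packed by offset within it, so all
# nodes stay aligned on stripe boundaries. Sizes are illustrative.

VCS_SIZE = 4096          # fixed virtual length of one stripe region

class VirtualStripe:
    def __init__(self, stripe_id: int):
        self.stripe_id = stripe_id
        self.buf = bytearray(VCS_SIZE)
        self.offset = 0                   # next free byte

    def append(self, value: bytes):
        """Pack a variable-size value; return its (stripe, off, len)
        address, or None if it does not fit in this stripe."""
        if self.offset + len(value) > VCS_SIZE:
            return None                   # caller opens the next stripe
        off = self.offset
        self.buf[off:off + len(value)] = value
        self.offset += len(value)
        return (self.stripe_id, off, len(value))

vcs = VirtualStripe(1)
print(vcs.append(b"variable-size value"))   # -> (1, 0, 19)
```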
Bandwidth Cost: [Figure: bandwidth cost for a moderately skewed Zipfian workload, RS(3, 2).]
Throughput: [Figure: throughput for a moderately skewed Zipfian workload.]
Throughput: [Figure: throughput during recovery.]
Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth ≤ in-place update bandwidth. [Diagram: the worst case of GC bandwidth, where every valid block of the original stripes must be moved into batched stripes.]
Bandwidth Cost: [Figure: bandwidth cost at different throughput levels, RS(5, 4).]
Recovery: [Diagram: the batch process metadata M is replicated to a backup batch process; on failure, the backup (1) gets the latest batch id from the storage processes, (2) updates the latest stable batch id and reconstructs the metadata, and (3) serves requests.]
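A sketch of the failover sequence on this slide (method names and the stable-id rule are assumptions, not BCStore's protocol): the standby batch process learns the newest batch id acknowledged by the storage group, settles on the latest stable one, rebuilds its key-to-stripe metadata, and only then serves requests.

```python
# Sketch of batch-process failover, following the slide's three steps.
# Method and field names are illustrative, not BCStore's protocol.

def batch_process_failover(storage_processes):
    # 1. Ask every reachable storage process for its latest batch id.
    ids = [p.latest_batch_id() for p in storage_processes]

    # 2. Treat the newest id held by *all* processes as the latest
    #    stable batch; later, partially written batches are discarded.
    stable = min(ids)
    metadata = {}
    for p in storage_processes:
        metadata.update(p.read_metadata(up_to=stable))  # key -> stripe id

    # 3. Only now start serving client requests.
    return stable, metadata
```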