PBLCACHE: A client-side persistent block cache for the data center. Vault Boston 2015 - Luis Pabón - Red Hat
ABOUT ME: Luis Pabón, Principal Software Engineer, Red Hat Storage. IRC, GitHub: lpabon
QUESTIONS: What are the benefits of client-side persistent caching? How to effectively use the SSD? (diagram: Compute Node, SSD, Storage)
MERCURY*: Use in-memory data structures to handle cache misses as quickly as possible. Write sequentially to the SSD. Increase storage backend availability by reducing read requests. The cache must be persistent since warming could be time consuming. * S. Byan, et al., Mercury: Host-side flash caching for the data center
MERCURY QEMU INTEGRATION
PBLCACHE
PBLCACHE: Persistent BLock Cache. Persistent, block-based, look-aside cache for QEMU. User space library/application. Based on ideas described in the Mercury paper. Requires exclusive access to mutable objects.
GOAL: QEMU SHARED CACHE
PBLCACHE ARCHITECTURE (diagram: PBL Application, Cache Map, Log, SSD)
PBL APPLICATION: Sets up the cache map and log. Decides how to use the cache (writethrough, read-miss). Inserts, retrieves, or invalidates blocks in the cache. (diagram: Pbl App, Msg Queue, Cache Map, Log)
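A minimal sketch in Go of the look-aside pattern the slide describes. The Cache and Backend interfaces and their method names are assumptions for illustration only, not the actual pblcache API.

package example

// Cache is a hypothetical look-aside cache interface.
type Cache interface {
	Get(obj, block uint64, buf []byte) (hit bool, err error)
	Put(obj, block uint64, buf []byte) error
	Invalidate(obj, block uint64) error
}

// Backend is a hypothetical storage backend interface.
type Backend interface {
	ReadBlock(obj, block uint64, buf []byte) error
	WriteBlock(obj, block uint64, buf []byte) error
}

// ReadMiss policy: check the cache first, fall back to the backend on a
// miss, then insert the block so later reads are served from the SSD.
func ReadMiss(c Cache, b Backend, obj, block uint64, buf []byte) error {
	if hit, err := c.Get(obj, block, buf); err == nil && hit {
		return nil // served from the cache
	}
	if err := b.ReadBlock(obj, block, buf); err != nil {
		return err
	}
	return c.Put(obj, block, buf) // warm the cache for future reads
}

// Writethrough policy: write the backend and update the cache so the
// cached copy never goes stale.
func Writethrough(c Cache, b Backend, obj, block uint64, buf []byte) error {
	if err := b.WriteBlock(obj, block, buf); err != nil {
		return err
	}
	return c.Put(obj, block, buf)
}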
CACHE MAP: Composed of two data structures. Maintains all block metadata. (diagram: Address Map, Block Descriptor Array)
ADDRESS MAP: Implemented as a hash table. Translates object blocks to Block Descriptor Array (BDA) indices. Cache misses are detected extremely quickly. (diagram: Address Map, Block Descriptor Array)
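A minimal sketch of such an address map in Go, assuming a composite (object, block) key; the types and names are illustrative rather than the pblcache implementation.

package example

// Address identifies a block within an object (e.g. a virtual disk image).
type Address struct {
	ObjID   uint64
	BlockNo uint64
}

// AddressMap translates object blocks to indexes in the Block Descriptor
// Array.  A plain Go map gives O(1) lookups, so a miss is detected with a
// single hash probe.
type AddressMap struct {
	index map[Address]uint32 // value is a BDA index
}

func NewAddressMap() *AddressMap {
	return &AddressMap{index: make(map[Address]uint32)}
}

// Lookup returns the BDA index for the block, or ok=false on a cache miss.
func (m *AddressMap) Lookup(a Address) (bdaIndex uint32, ok bool) {
	bdaIndex, ok = m.index[a]
	return
}

func (m *AddressMap) Set(a Address, bdaIndex uint32) { m.index[a] = bdaIndex }
func (m *AddressMap) Delete(a Address)               { delete(m.index, a) }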
BLOCK DESCRIPTOR ARRAY: Contains metadata for blocks stored in the log. Length is equal to the maximum number of blocks stored in the log. Handles CLOCK evictions. Invalidations are extremely fast. Insertions always append. (diagram: Address Map, Block Descriptor Array)
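A sketch of a block descriptor entry and an append-style CLOCK insertion, continuing the illustrative package above; the field names and eviction details are assumptions, not the pblcache code.

package example

// BlockDescriptor holds the metadata kept per log slot.  The slot's index
// in the array also determines the block's location in the log.
type BlockDescriptor struct {
	Addr  Address // which object block currently occupies this slot
	Used  bool    // slot holds a valid block
	Clock bool    // CLOCK reference bit, set on cache hits
}

// BDA is sized to the maximum number of blocks the log can hold.
type BDA struct {
	entries []BlockDescriptor
	hand    uint32 // next slot considered for insertion/eviction
}

func NewBDA(blocks uint32) *BDA {
	return &BDA{entries: make([]BlockDescriptor, blocks)}
}

// Insert appends at the CLOCK hand, evicting the current occupant if its
// reference bit is clear, otherwise clearing the bit and moving on.
// It returns the chosen slot and the evicted address, if any.
func (b *BDA) Insert(a Address) (slot uint32, evicted *Address) {
	for {
		e := &b.entries[b.hand]
		slot = b.hand
		b.hand = (b.hand + 1) % uint32(len(b.entries))
		if e.Used && e.Clock {
			e.Clock = false // give the block a second chance
			continue
		}
		if e.Used {
			old := e.Addr
			evicted = &old
		}
		*e = BlockDescriptor{Addr: a, Used: true}
		return slot, evicted
	}
}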
CACHE MAP I/O FLOW (diagram: Block Descriptor Array)
CACHE MAP I/O FLOW: Get. In address map? No: miss. Yes: hit, set the CLOCK bit in the BDA and read the block from the log.
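A sketch of that Get path, continuing the same illustrative types; the real pblcache code path may differ.

package example

// CacheMap ties the address map and the BDA together.
type CacheMap struct {
	addr *AddressMap
	bda  *BDA
}

// Get follows the flow on the slide: consult the address map; on a hit,
// set the CLOCK reference bit and return the BDA index, which fixes the
// block's location in the log for the subsequent read.
func (c *CacheMap) Get(a Address) (bdaIndex uint32, hit bool) {
	idx, ok := c.addr.Lookup(a)
	if !ok {
		return 0, false // miss: caller falls through to the storage backend
	}
	c.bda.entries[idx].Clock = true // recently used
	return idx, true
}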
CACHE MAP I/O FLOW: Invalidate. Free the BDA index and delete the entry from the address map.
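The invalidate path from the same flow, again as an illustrative sketch.

package example

// Invalidate frees the BDA slot and deletes the entry from the address
// map.  No I/O to the log is needed, which is why invalidations are fast.
func (c *CacheMap) Invalidate(a Address) bool {
	idx, ok := c.addr.Lookup(a)
	if !ok {
		return false // nothing cached for this block
	}
	c.bda.entries[idx] = BlockDescriptor{} // mark the slot free
	c.addr.Delete(a)
	return true
}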
LOG: Block location determined by the BDA index. CLOCK optimized with segment read-ahead. Segment pool with buffered writes. Contiguous block support. (diagram: Segments, SSD)
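Because the BDA index fixes where a block lives in the log, the offset calculation is simple arithmetic. A sketch, with block and segment sizes chosen only for illustration:

package example

const (
	BlockSize        = 4 * 1024 // illustrative 4 KiB cache block
	BlocksPerSegment = 256      // illustrative 1 MiB segment of 256 blocks
)

// Offset maps a BDA index to its segment and byte offset in the log on the
// SSD.  Because insertions always append, writes fill one segment at a
// time and can be buffered, then flushed sequentially.
func Offset(bdaIndex uint32) (segment uint32, byteOffset int64) {
	segment = bdaIndex / BlocksPerSegment
	byteOffset = int64(bdaIndex) * BlockSize
	return segment, byteOffset
}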
LOG SEGMENT STATE MACHINE (diagram)
LOG READ I/O FLOW: Read. In a segment? Yes: read from the segment. No: read from the SSD.
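A sketch of that read decision, continuing the illustrative package; the segment bookkeeping shown here is assumed, not taken from the pblcache code.

package example

import "os"

// Log keeps a small pool of in-memory segments; writes are buffered there
// before being flushed sequentially to the SSD.
type Log struct {
	ssd      *os.File
	buffered map[uint32][]byte // segment number -> segment buffer still in RAM
}

// ReadBlock serves a cached block: if the segment holding it is still
// buffered, copy from the buffer, otherwise read from the SSD.  buf must
// be at least BlockSize bytes.
func (l *Log) ReadBlock(bdaIndex uint32, buf []byte) error {
	segment, off := Offset(bdaIndex)
	if seg, ok := l.buffered[segment]; ok {
		inSeg := int64(bdaIndex%BlocksPerSegment) * BlockSize
		copy(buf, seg[inSeg:inSeg+BlockSize])
		return nil
	}
	_, err := l.ssd.ReadAt(buf[:BlockSize], off)
	return err
}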
PERSISTENT METADATA: Save the address map to a file on application shutdown. Cache is warm on application restart. Not designed to be durable: a system crash means the metadata file is never created.
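One way such metadata could be persisted in Go is with encoding/gob on clean shutdown; a sketch under that assumption, not a description of pblcache's on-disk format.

package example

import (
	"encoding/gob"
	"os"
)

// SaveMetadata writes the address map on clean shutdown so the next start
// can warm the cache without re-reading the backend.
func SaveMetadata(path string, m *AddressMap) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(m.index)
}

// LoadMetadata restores the map; if the file is missing (e.g. after a
// crash), the caller simply starts with a cold cache.
func LoadMetadata(path string, m *AddressMap) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewDecoder(f).Decode(&m.index)
}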
PBLIO BENCHMARK: A PBL APPLICATION
PBLIO: Benchmark tool. Uses an enterprise workload generator from NetApp*. Cache is set up as writethrough. Can be used with or without pblcache. Documentation: https://github.com/pblcache/pblcache/wiki/Pblio * S. Daniel et al., A portable, open-source implementation of the SPC-1 workload * https://github.com/lpabon/goioworkload
ENTERPRISE WORKLOAD: Synthetic OLTP enterprise workload generator. Tests for the maximum number of IOPS before latency exceeds 30 ms. Divides the storage system into three logical storage units: ASU1 - Data Store - 45% of total storage - RW; ASU2 - User Store - 45% of total storage - RW; ASU3 - Log - 10% of total storage - Write Only. BSU - Business Scaling Units: 1 BSU = 50 IOPS.
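A small sketch of the sizing arithmetic on this slide (45% / 45% / 10% split, 1 BSU = 50 IOPS); the total capacity passed in is just an illustrative parameter.

package example

import "fmt"

// workloadSizing prints the ASU capacities and the offered load implied by
// a BSU count, using the ratios from the slide.
func workloadSizing(totalGB float64, bsu int) {
	fmt.Printf("ASU1 (Data Store): %.1f GB\n", totalGB*0.45)
	fmt.Printf("ASU2 (User Store): %.1f GB\n", totalGB*0.45)
	fmt.Printf("ASU3 (Log):        %.1f GB\n", totalGB*0.10)
	fmt.Printf("Offered load:      %d IOPS\n", bsu*50)
}

// Example: workloadSizing(1500, 31) gives 675/675/150 GB and 1550 IOPS,
// in line with the figures quoted later in the deck.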
SIMPLE EXAMPLE
$ fallocate -l 45MiB file1
$ fallocate -l 45MiB file2
$ fallocate -l 10MiB file3
$
$ ./pblio -asu1=file1 \
    -asu2=file2 \
    -asu3=file3 \
    -runlen=30 -bsu=2
----- pblio -----
Cache : None
ASU1 : 0.04 GB
ASU2 : 0.04 GB
ASU3 : 0.01 GB
BSUs : 2
Contexts: 1
Run time: 30 s
-----
Avg IOPS:98.63 Avg Latency:0.2895 ms
RAW DEVICES EXAMPLE
$ ./pblio -asu1=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde \
    -asu2=/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi \
    -asu3=/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm \
    -runlen=30 -bsu=2
CACHE EXAMPLE
$ fallocate -l 10MiB mycache
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
    -runlen=30 -bsu=2 -cache=mycache
----- pblio -----
Cache : mycache (New)
C Size : 0.01 GB
ASU1 : 0.04 GB
ASU2 : 0.04 GB
ASU3 : 0.01 GB
BSUs : 2
Contexts: 1
Run time: 30 s
-----
Avg IOPS:98.63 Avg Latency:0.2573 ms
Read Hit Rate: 0.4457
Invalidate Hit Rate: 0.6764
Read hits: 1120
Invalidate hits: 347
Reads: 2513
Insertions: 1906
Evictions: 0
Invalidations: 513
== Log Information ==
Ram Hit Rate: 1.0000
Ram Hits: 1120
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 0
Wraps: 1
Segments Skipped: 0
Mean Read Latency: 0.00 usec
Mean Segment Read Latency: 4396.77 usec
Mean Write Latency: 1162.58 usec
LATENCY OVER 30MS
----- pblio -----
Cache : /dev/sdg (Loaded)
C Size : 185.75 GB
ASU1 : 673.83 GB
ASU2 : 673.83 GB
ASU3 : 149.74 GB
BSUs : 32
Contexts: 1
Run time: 600 s
-----
Avg IOPS:1514.92 Avg Latency:112.1096 ms
Read Hit Rate: 0.7004
Invalidate Hit Rate: 0.7905
Read hits: 528539
Invalidate hits: 120189
Reads: 754593
Insertions: 378093
Evictions: 303616
Invalidations: 152039
== Log Information ==
Ram Hit Rate: 0.0002
Ram Hits: 75
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 445638
Wraps: 0
Segments Skipped: 0
Mean Read Latency: 850.89 usec
Mean Segment Read Latency: 2856.16 usec
Mean Write Latency: 6472.74 usec
EVALUATION
TEST SETUP: Client using a 180 GB SAS SSD (about 10% of the workload size). GlusterFS 6x2 cluster. 100 files for each ASU. pblio v0.1 compiled with go1.4.1. Each system has: Fedora 20, 6 Intel Xeon E5-2620 @ 2 GHz, 64 GB RAM, 5 x 300 GB SAS drives, 10 Gbit network.
CACHE WARMUP IS TIME CONSUMING (chart: 16 hours)
INCREASED RESPONSE TIME (chart: 73% increase)
STORAGE BACKEND IOPS REDUCTION (chart: BSU = 31 or 1550 IOPS, ~75% IOPS reduction)
CURRENT STATUS
MILESTONES
1. Create Cache Map - COMPLETED
2. Create Log - COMPLETED
3. Create benchmark application - COMPLETED
4. Design pblcached architecture - IN PROGRESS
NEXT: QEMU SHARED CACHE. Work with the community to bring this technology to QEMU. Possible architecture (diagram). Some conditions to think about: VM migration, volume deletion, VM crash.
FUTURE: Hyperconvergence. Peer-cache. Writeback. Shared cache. QoS using mClock*. Possible integrations with Ceph and GlusterFS backends. * A. Gulati et al., mClock: Handling Throughput Variability for Hypervisor IO Scheduling
JOIN! GitHub: https://github.com/pblcache/pblcache IRC Freenode: #pblcache Google Group: https://groups.google.com/forum/#!forum/pblcache Mailing list: pblcache@googlegroups.com
FROM THIS...
TO THIS