BLUESTORE: A NEW STORAGE BACKEND FOR CEPH – ONE YEAR IN
Sage Weil – 2017.03.23
OUTLINE
● Ceph background and context
  – FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Future
● Status and availability
● Summary
MOTIVATION
CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”
CEPH COMPONENTS
● OBJECT – RGW: a web services gateway for object storage, compatible with S3 and Swift
● BLOCK – RBD: a reliable, fully-distributed block device with cloud platform integration
● FILE – CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: monitors (M) alongside a row of OSD daemons; each OSD sits on a local file system (xfs, btrfs, or ext4) on its own disk]
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: the same stack, with FileStore shown as the layer between each OSD and its local file system (xfs, btrfs, ext4)]
OBJECTSTORE AND DATA MODEL
● ObjectStore
  – abstract interface for storing local data
  – implementations: EBOFS, FileStore
● EBOFS
  – a user-space extent-based object file system
  – deprecated in favor of FileStore on btrfs in 2009
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – placement group shard (slice of the RADOS pool)
● All writes are transactions (see the sketch below)
  – Atomic + Consistent + Durable
  – Isolation provided by the OSD
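To make the data model concrete, here is a minimal sketch of a transaction that touches all three kinds of object state (byte data, xattrs, omap). The class names, field types, and the queue_transaction call are simplified assumptions for illustration, not the actual Ceph ObjectStore API.

// Illustrative sketch only: a simplified transaction interface in the spirit
// of Ceph's ObjectStore. It shows that one transaction can mutate object
// data, attributes, and omap keys, and is applied atomically and durably.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Transaction {
  struct Op {
    std::string op_name;                    // "write", "setattr", "omap_setkeys"
    std::string collection;                 // placement group shard
    std::string oid;                        // object name
    uint64_t offset = 0;
    std::vector<char> data;                 // byte payload for writes
    std::map<std::string, std::string> kv;  // xattrs or omap keys
  };
  std::vector<Op> ops;

  void write(const std::string& c, const std::string& o,
             uint64_t off, std::vector<char> bytes) {
    ops.push_back({"write", c, o, off, std::move(bytes), {}});
  }
  void setattr(const std::string& c, const std::string& o,
               const std::string& name, const std::string& val) {
    ops.push_back({"setattr", c, o, 0, {}, {{name, val}}});
  }
  void omap_setkeys(const std::string& c, const std::string& o,
                    std::map<std::string, std::string> kv) {
    ops.push_back({"omap_setkeys", c, o, 0, {}, std::move(kv)});
  }
};

int main() {
  // A typical RADOS write: object data + attribute update + log append,
  // all applied together or not at all (hypothetical object names).
  Transaction t;
  t.write("0.6_head", "benchmark_object", 0, std::vector<char>(4096, 'x'));
  t.setattr("0.6_head", "benchmark_object", "_", "object_info...");
  t.omap_setkeys("0.6_head", "pg_meta", {{"log_entry_006", "pg log entry"}});
  // store->queue_transaction(t);  // hypothetical submission call
  return 0;
}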
FILESTORE
● FileStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Originally just for development...
  – later, only supported backend (on XFS)

On-disk layout:
/var/lib/ceph/osd/ceph-123/
  current/
    meta/
      osdmap123
      osdmap124
    0.1_head/
      object1
      object12
    0.7_head/
      object3
      object5
    0.a_head/
      object4
      object6
    omap/
      <leveldb files>
POSIX FAILS: TRANSACTIONS
● Most transactions are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (kv insert)
  ...but others are arbitrarily large/complex
● Serialize and write-ahead txn to journal for atomicity
  – We double-write everything!
  – Lots of ugly hackery to make replayed events idempotent

Example transaction dump:
[
    {
        "op_name": "write",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "length": 4194304,
        "offset": 0,
        "bufferlist length": 4194304
    },
    {
        "op_name": "setattrs",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "attr_lens": {
            "_": 269,
            "snapset": 31
        }
    },
    {
        "op_name": "omap_setkeys",
        "collection": "0.6_head",
        "oid": "#0:60000000::::head#",
        "attr_lens": {
            "0000000005.00000000000000000006": 178,
            "_info": 847
        }
    }
]
POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – “backfill” (data rebalancing, recovery)
  – enumeration via librados client API
● POSIX readdir is not well-ordered
  – And even if it were, it would be a different hash
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch below)
  – split any directory when size > ~100 files
  – merge when size < ~50 files
  – read entire directory, sort in-memory

Example on-disk layout:
…
DIR_A/
  DIR_A/A03224D3_qwer
  DIR_A/A247233E_zxcv
…
DIR_B/
  DIR_B/DIR_8/
    DIR_B/DIR_8/B823032D_foo
    DIR_B/DIR_8/B8474342_bar
  DIR_B/DIR_9/
    DIR_B/DIR_9/B924273B_baz
  DIR_B/DIR_A/
    DIR_B/DIR_A/BA4328D2_asdf
…
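The sketch below illustrates why hash-ordered enumeration is painful on POSIX: readdir gives no useful order, so a backend like FileStore must read a whole directory and sort in memory. The file-name format (8 hex digits of hash, then the object name) follows the layout shown on the slide; the helper names are assumptions for illustration, not FileStore code.

// Illustrative sketch only: hash-ordered enumeration over a directory whose
// entries carry the object's 32-bit hash as a hex prefix, e.g. "A03224D3_qwer".
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Parse the leading 8 hex digits into the object's 32-bit hash.
static uint32_t hash_of(const std::string& name) {
  return static_cast<uint32_t>(std::stoul(name.substr(0, 8), nullptr, 16));
}

// readdir() order is arbitrary, so list everything, then sort by hash.
std::vector<std::string> enumerate_in_hash_order(std::vector<std::string> entries) {
  std::sort(entries.begin(), entries.end(),
            [](const std::string& a, const std::string& b) {
              return hash_of(a) < hash_of(b);
            });
  return entries;
}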
THE HEADACHES CONTINUE
● New FileStore problems continue to surface as we approach the switch to BlueStore
  – Recently discovered bug in the FileStore omap implementation, revealed by new CephFS scrubbing
  – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison
  – Read/modify/write workloads perform horribly
    ● RGW index objects
    ● RBD object bitmaps
  – QoS efforts thwarted by deep queues and periodicity in FileStore throughput
  – Cannot bound deferred writeback work, even with fsync(2)
  – {RBD, CephFS} snapshots trigger inefficient 4MB object copies to create object clones
BLUESTORE
BLUESTORE
● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable block Allocator (policy)
  – pluggable compression
  – checksums, ponies, ...
● We must share the block device with RocksDB
[Diagram: ObjectStore → BlueStore; metadata flows through RocksDB → BlueRocksEnv → BlueFS → BlockDevice, while data goes directly from BlueStore to the BlockDevice]
ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes “file” operations to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata lives in the journal
  – all metadata loaded in RAM on start/mount
  – no need to store block free list
  – coarse allocation unit (1 MB blocks)
  – journal rewritten/compacted when it gets large
● Map “directories” to different block devices (see the sketch below)
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
[Diagram: BlueFS on-disk layout – superblock, journal, data extents for file 10, file 11, file 12, file 13, and more journal entries including “rm file 12”, ...]
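As a rough idea of the directory-to-device mapping, the sketch below routes RocksDB “file” paths by prefix, in the spirit of BlueRocksEnv/BlueFS. The enum and routing function are assumptions for illustration, not BlueFS internals.

// Illustrative sketch only: route RocksDB files to devices by directory prefix.
#include <string>

enum class Device { WAL_DEV, DB_DEV, SLOW_DEV };

// db.wal/* -> fast WAL device, db.slow/* -> HDD, everything else -> SSD db device.
Device device_for(const std::string& path) {
  if (path.rfind("db.wal/", 0) == 0)  return Device::WAL_DEV;
  if (path.rfind("db.slow/", 0) == 0) return Device::SLOW_DEV;
  return Device::DB_DEV;
}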
MULTI-DEVICE SUPPORT
● Single device
  – HDD or SSD
    ● bluefs db.wal/ + db/ (wal and sst files)
    ● object data blobs
● Two devices
  – 512MB of SSD or NVRAM
    ● bluefs db.wal/ (rocksdb wal)
  – big device
    ● bluefs db/ (sst files, spillover)
    ● object data blobs
● Two devices
  – a few GB of SSD
    ● bluefs db.wal/ (rocksdb wal)
    ● bluefs db/ (warm sst files)
  – big device
    ● bluefs db.slow/ (cold sst files)
    ● object data blobs
● Three devices
  – 512MB NVRAM
    ● bluefs db.wal/ (rocksdb wal)
  – a few GB SSD
    ● bluefs db/ (warm sst files)
  – big device
    ● bluefs db.slow/ (cold sst files)
    ● object data blobs
METADATA
BLUESTORE METADATA
● Everything in a flat kv database (rocksdb)
● Partition namespace for different metadata (see the sketch below)
  – S* – “superblock” properties for the entire store
  – B* – block allocation metadata (free block bitmap)
  – T* – stats (bytes used, compressed, etc.)
  – C* – collection name → cnode_t
  – O* – object name → onode_t or bnode_t
  – X* – shared blobs
  – L* – deferred writes (promises of future IO)
  – M* – omap (user key/value data, stored in objects)
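The sketch below shows the general idea of partitioning one flat keyspace with single-letter prefixes. The prefix letters follow the slide; the key encoding itself is a simplified assumption, not BlueStore's real key format.

// Illustrative sketch only: prefix-partitioned keys in a flat kv store.
#include <string>

std::string collection_key(const std::string& cname) { return "C" + cname; }
std::string object_key(const std::string& oname)     { return "O" + oname; }
std::string omap_key(const std::string& obj_id, const std::string& user_key) {
  return "M" + obj_id + "." + user_key;  // omap keys sort grouped per object
}

// All object metadata can then be walked with one prefix iteration:
// seek the iterator to "O" and stop when keys no longer start with "O".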
CNODE
● Collection metadata
  – Interval of object namespace

struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};
struct bluestore_cnode_t {
  uint32_t bits;
};

Collection key (shard, pool, hash, name, bits):
  C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>
Object keys (shard, pool, hash, name, snap, gen):
  O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
  O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …

● Nice properties (see the sketch below)
  – Ordered enumeration of objects
  – We can “split” collections by adjusting collection metadata only
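The following sketch gives the intuition for why (hash prefix, bits) is enough: objects are keyed by hash, so a collection is a contiguous slice of the keyspace, and splitting only bumps `bits`. The bit layout and helpers below are simplified assumptions, not BlueStore's exact key encoding.

// Illustrative sketch only: collection membership and split by hash prefix.
#include <cstdint>

struct cnode {
  uint32_t hash_prefix;  // e.g. 0x3d3e0000
  uint32_t bits;         // e.g. 19: collection owns hashes sharing the top 19 bits
};

// Does an object hash land inside this collection's interval of the hash space?
bool contains(const cnode& c, uint32_t object_hash) {
  uint32_t mask = c.bits ? ~uint32_t(0) << (32 - c.bits) : 0;
  return (object_hash & mask) == (c.hash_prefix & mask);
}

// Splitting touches only collection metadata: each child keeps the same prefix
// plus one more bit, and the object keys themselves never move.
void split(const cnode& parent, cnode& child0, cnode& child1) {
  child0 = {parent.hash_prefix, parent.bits + 1};
  child1 = {parent.hash_prefix | (uint32_t(1) << (31 - parent.bits)),
            parent.bits + 1};
}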
ONODE
● Per-object metadata
  – Lives directly in a key/value pair
  – Serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe; larger maps are sharded – see the sketch below)

struct bluestore_onode_t {
  uint64_t size;
  map<string,bufferptr> attrs;
  uint64_t flags;

  struct shard_info {
    uint32_t offset;
    uint32_t bytes;
  };
  vector<shard_info> shards;

  bufferlist inline_extents;
  bufferlist spanning_blobs;
};
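As a rough illustration of why the shard_info vector keeps the onode small: only the extent-map shard covering the requested logical offset has to be loaded and decoded. The lookup below is a sketch under that assumption, not BlueStore's actual extent-map code.

// Illustrative sketch only: pick the extent-map shard for a logical offset.
#include <cstdint>
#include <vector>

struct shard_info {
  uint32_t offset;  // logical offset where this shard of the extent map begins
  uint32_t bytes;   // serialized size of that shard in the kv store
};

// Shards are sorted by offset; shard i covers [offset_i, offset_{i+1}).
int shard_for(const std::vector<shard_info>& shards, uint32_t logical_offset) {
  for (size_t i = 0; i < shards.size(); ++i) {
    bool last = (i + 1 == shards.size());
    if (logical_offset >= shards[i].offset &&
        (last || logical_offset < shards[i + 1].offset))
      return static_cast<int>(i);
  }
  return -1;  // no shards: the extent map is inline in the onode
}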
BLOBS
● Blob
  – Extent(s) on device
  – Lump of data originating from same object
  – May later be referenced by multiple objects
  – Normally checksummed (see the sketch below)
  – May be compressed
● SharedBlob
  – Extent ref count on cloned blobs
  – In-memory buffer cache

struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;
  uint32_t compressed_length_orig = 0;
  uint32_t compressed_length = 0;
  uint32_t flags = 0;
  uint16_t unused = 0;  // bitmap

  uint8_t csum_type = CSUM_NONE;
  uint8_t csum_chunk_order = 0;
  bufferptr csum_data;
};

struct bluestore_shared_blob_t {
  uint64_t sbid;
  bluestore_extent_ref_map_t ref_map;
};
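To show how csum_chunk_order is meant to be used (chunk size = 1 << order, one checksum value per chunk), here is a minimal verification sketch. The xor-based checksum is a stand-in for a real algorithm such as crc32c, and the types are assumptions, not BlueStore internals.

// Illustrative sketch only: verify per-chunk checksums for a blob's data.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy checksum over one chunk; a real store would use crc32c or similar.
static uint32_t toy_csum(const uint8_t* p, size_t len) {
  uint32_t v = 0;
  for (size_t i = 0; i < len; ++i) v = (v << 1) ^ p[i];
  return v;
}

// Check every chunk of the blob against its stored checksum value.
bool verify_blob(const std::vector<uint8_t>& data,
                 const std::vector<uint32_t>& csum_values,
                 uint8_t csum_chunk_order) {
  const size_t chunk = size_t(1) << csum_chunk_order;
  for (size_t i = 0; i * chunk < data.size(); ++i) {
    size_t len = std::min(chunk, data.size() - i * chunk);
    if (i >= csum_values.size() ||
        toy_csum(data.data() + i * chunk, len) != csum_values[i])
      return false;  // read error: surface EIO / repair from another replica
  }
  return true;
}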