Ceph & RocksDB
Ilsoo Byun (변일수), Cloud Storage Team
Ceph Basics
Placement Group
[Diagram: an object is mapped to a PG by hashing its name and reducing it modulo the number of PGs in the pool. Example: myobject in mypool, hash(myobject) = 4, 4 % 3 (# of PGs) = 1, so the target PG is PG#1]
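The mapping in the example above can be sketched in a few lines of C++. This is a simplification for illustration only: real Ceph hashes object names with rjenkins and uses a stable-mod/PG-mask scheme rather than a plain modulo, and the object and PG counts here are just the slide's example values.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Simplified illustration of the slide's example: the object name is hashed
// and reduced modulo the number of PGs to pick the target PG.
// (Real Ceph uses rjenkins hashing and a stable-mod / PG-mask scheme.)
uint32_t target_pg(const std::string& object_name, uint32_t pg_num) {
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(object_name));
    return h % pg_num;  // slide: hash(myobject) = 4, 4 % 3 = 1 -> PG#1
}

int main() {
    std::cout << "myobject -> PG#" << target_pg("myobject", 3) << "\n";
}
```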
CRUSH
[Diagram: CRUSH maps the PGs of mypool (PG#1, PG#2, PG#3) onto OSDs (OSD#1, OSD#3, OSD#12)]
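To give a feel for how a PG-to-OSD mapping can be computed rather than looked up in a table, here is a toy placement function. It is explicitly not CRUSH: it uses rendezvous-style hashing over a flat OSD list, whereas CRUSH walks a weighted hierarchy of buckets (straw2 and friends); the OSD IDs and replica count are placeholders.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for CRUSH (NOT the real algorithm): deterministically pick
// `replicas` OSDs for a PG by scoring each candidate OSD with a hash of
// (pg_id, osd_id) and taking the highest scores (rendezvous hashing).
std::vector<int> place_pg(uint32_t pg_id, const std::vector<int>& osds,
                          size_t replicas) {
    std::vector<std::pair<uint64_t, int>> scored;
    for (int osd : osds) {
        uint64_t score = std::hash<std::string>{}(
            std::to_string(pg_id) + ":" + std::to_string(osd));
        scored.emplace_back(score, osd);
    }
    std::sort(scored.rbegin(), scored.rend());   // highest score first
    std::vector<int> acting;
    for (size_t i = 0; i < replicas && i < scored.size(); ++i)
        acting.push_back(scored[i].second);
    return acting;                               // e.g. {1, 3, 12}
}
```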
Recovery
[Diagram: the same mypool PG-to-OSD mapping (OSD#1, OSD#3, OSD#12), illustrating how PGs are re-mapped and recovered when an OSD fails]
OSD
• Peering, Replication, Heartbeat, …
• Each OSD persists its data through an ObjectStore backend (FileStore or BlueStore).
[HDD image source: https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem]
ObjectStore
[Diagram source: https://ceph.com/community/new-luminous-bluestore/]
OSD Transaction
CRUSH
[Diagram, annotated: "Consistency is enforced here!"]
BlueStore Transaction
• To maintain ordering within each PG, ordering within each shard should be guaranteed (see the sketch after this slide).
[Diagram: a request enters op_wq, is handled by a shard, written as a sync transaction to RocksDB (kv_committing) and flushed to the SSD, then handed to the finisher queue and acknowledged through out_q / pipe]
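The per-shard ordering argument can be illustrated with a small sketch: because every request for a given PG is routed to the same shard, FIFO processing inside each shard is enough to keep each PG's requests in order. The shard count and queue layout below are illustrative only, not Ceph's actual ShardedOpWQ implementation.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative sketch (not Ceph's ShardedOpWQ): requests for a PG always land
// in the same shard, so draining each shard's queue in FIFO order preserves
// the ordering of every PG that hashes to it.
struct Op { uint32_t pg_id; uint64_t seq; };

class ShardedQueue {
public:
    explicit ShardedQueue(size_t num_shards) : shards_(num_shards) {}

    void enqueue(const Op& op) {
        // Same PG -> same shard, so per-shard FIFO gives per-PG ordering.
        shards_[op.pg_id % shards_.size()].push_back(op);
    }

    // Each shard is drained by its own worker thread, strictly in order.
    bool dequeue(size_t shard, Op* out) {
        auto& q = shards_[shard];
        if (q.empty()) return false;
        *out = q.front();
        q.pop_front();
        return true;
    }

private:
    std::vector<std::deque<Op>> shards_;
};
```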
RocksDB Group Commit
• Metadata is stored in RocksDB.
• After the metadata is stored atomically, the data becomes available to users.
[Diagram: multiple writer threads call JoinBatchGroup; one becomes the group-commit leader (PreprocessWrite, WriteToWAL to the log file, MarkLogsSynced, LaunchParallelFollower, ExitAsBatchGroupLeader) while the others AwaitState and then write to the memtable concurrently (CompleteParallelWorker, ExitAsBatchGroupFollower); the memtable is later flushed to an SST file]
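From the caller's point of view, the group-commit machinery above is hidden behind an ordinary synchronous write. The sketch below shows what such an atomic, WAL-synced metadata write looks like with the RocksDB C++ API; the database path and keys are placeholders, and the comments about leader/follower batching describe the internal flow from the slide rather than anything the caller controls.

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    // "/tmp/kv" is a placeholder path for this sketch.
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv", &db);
    assert(s.ok());

    // All metadata updates for one request go into a single WriteBatch,
    // so they are applied atomically.
    rocksdb::WriteBatch batch;
    batch.Put("onode:myobject", "metadata-blob");
    batch.Put("alloc:0x1000", "extent-state");

    // sync=true: the write is acknowledged only after the WAL reaches disk.
    // Concurrent writers calling Write() are grouped: one thread becomes the
    // batch-group leader, writes and syncs the WAL once for the whole group,
    // and followers wait or apply their memtable entries in parallel.
    rocksdb::WriteOptions wopts;
    wopts.sync = true;
    s = db->Write(wopts, &batch);
    assert(s.ok());

    delete db;
}
```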
Thread Scalability / Shard Scalability
[Chart: RocksDB PUT throughput in IOPS (0 to 60,000), with WAL vs. disableWAL, for 1 shard vs. 10 shards]
RadosGW
RadosGW
• RadosGW is an application of RADOS.
[Diagram: RadosGW and CephFS built on top of RADOS (OSD, Mon, Mgr)]
RadosGW Transaction
• All atomic operations depend on RocksDB (a minimal sketch follows below).
[Diagram: Put Object = Prepare Index on the index object, Write Data to the data object, Complete Index; the key/value index entries are stored via RADOS in RocksDB]
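A rough sketch of that three-step flow is shown below. The index_* and data_write helpers are hypothetical stand-ins, not the real RGW or librados calls; the point is only that the object becomes visible atomically when the index "complete" step lands, and that each index update is itself an atomic RocksDB transaction on the OSD hosting the index object.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the RADOS operations on the slide; each index
// update is an atomic transaction in the RocksDB instance of the OSD that
// hosts the bucket-index object. These are NOT the real RGW/librados APIs.
void index_prepare(const std::string& key)  { std::cout << "prepare "  << key << "\n"; }
void data_write(const std::string& key, const std::string& data) {
    std::cout << "write " << key << " (" << data.size() << " bytes)\n";
}
void index_complete(const std::string& key) { std::cout << "complete " << key << "\n"; }
void index_cancel(const std::string& key)   { std::cout << "cancel "   << key << "\n"; }

// Put Object as sketched on the slide: the object appears in bucket listings
// only after the "complete" marker replaces the "prepare" marker.
void put_object(const std::string& key, const std::string& data) {
    index_prepare(key);            // 1. mark the key as pending in the index object
    try {
        data_write(key, data);     // 2. write the data object
    } catch (const std::exception&) {
        index_cancel(key);         // undo the pending marker on failure
        throw;
    }
    index_complete(key);           // 3. make the object visible
}

int main() { put_object("myobject", "payload"); }
```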
BlueStore Transaction
• To maintain ordering within each PG, ordering within each shard should be guaranteed.
• bstore_shard_finishers = true
[Diagram: the same request flow as before (op_wq, shard, RocksDB sync transaction / kv_committing, SSD flush, finisher queue, out_q / pipe, Ack)]
Performance Issue
Tail Latency
Performance Metrics
RocksDB Compaction Overhead
• "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores" (USENIX ATC '19)
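SILK proposes its own I/O scheduler; as a simpler point of comparison, RocksDB ships a built-in rate limiter that caps flush/compaction bandwidth so background work competes less with foreground writes. The sketch below shows how it is wired into Options; the 64 MB/s limit, background-job count, and database path are placeholder values, and this is a generic mitigation knob, not SILK itself.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // Cap the bandwidth used by flushes and compactions so background work
    // steals less I/O from client writes. 64 MB/s is a placeholder value.
    options.rate_limiter.reset(
        rocksdb::NewGenericRateLimiter(64 << 20 /* bytes per second */));

    // Fewer concurrent background jobs also bounds compaction interference.
    options.max_background_jobs = 2;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv-rl", &db);
    if (s.ok()) delete db;
}
```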
Conclusions
• Ceph depends heavily on RocksDB.
• Ceph's strong consistency is implemented using RocksDB transactions.
• The performance of Ceph also depends on RocksDB
  • Especially for small I/O
• But RocksDB has some performance issues
  • Flushing the WAL
  • Compaction
ilsoobyun@linecorp.com
THANK YOU