Ceph & RocksDB (Cloud Storage)


  1. Ceph & RocksDB, Ilsoo Byun (변일수), Cloud Storage Team

  2. Ceph Basics

  3. Placement Group: an object is mapped to a PG in its pool (mypool has PG#1, PG#2, PG#3) by hashing the object name and taking it modulo the number of PGs, e.g. hash(myobject) = 4, and 4 % 3 (# of PGs) = 1 selects the target PG.
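
A minimal sketch of this object-to-PG mapping, using std::hash as a stand-in for Ceph's actual rjenkins hash and "stable mod"; the object name and PG count are the ones from the slide:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Illustration only: Ceph really uses its rjenkins1 hash plus a "stable mod"
// so that objects do not all move when pg_num changes.
uint32_t object_to_pg(const std::string& object_name, uint32_t pg_num) {
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(object_name));
    return h % pg_num;  // e.g. hash = 4, 4 % 3 = 1 -> target PG index 1
}

int main() {
    std::cout << "myobject -> PG index " << object_to_pg("myobject", 3) << "\n";
}
```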

  4. CRUSH: CRUSH maps each PG of mypool (PG#1, PG#2, PG#3) to a set of OSDs, e.g. OSD#1, OSD#3, OSD#12.
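
The real CRUSH algorithm walks a weighted, hierarchical cluster map; the sketch below only illustrates the underlying idea of deterministic, rendezvous-style placement (every OSD draws a pseudo-random score for a PG and the top scorers form the acting set). The OSD ids are illustrative and the scoring ignores weights and failure domains:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Deterministic placement in the spirit of CRUSH's straw2 bucket: no lookup
// table is needed, any client can recompute the same mapping from the map.
std::vector<int> place_pg(uint32_t pg, const std::vector<int>& osds, size_t replicas) {
    std::vector<std::pair<uint64_t, int>> draws;
    for (int osd : osds) {
        uint64_t score = std::hash<std::string>{}(
            std::to_string(pg) + "/" + std::to_string(osd));
        draws.emplace_back(score, osd);
    }
    std::sort(draws.rbegin(), draws.rend());  // highest score first
    std::vector<int> acting;
    for (size_t i = 0; i < replicas && i < draws.size(); ++i)
        acting.push_back(draws[i].second);
    return acting;
}

int main() {
    const std::vector<int> osds = {1, 3, 5, 7, 12};  // illustrative OSD ids
    for (uint32_t pg = 1; pg <= 3; ++pg) {
        std::cout << "PG#" << pg << " ->";
        for (int osd : place_pg(pg, osds, 3)) std::cout << " OSD#" << osd;
        std::cout << "\n";
    }
}
```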

  5. Recovery: when an OSD fails, the PGs it held (e.g. PG#1, PG#2, PG#3 of mypool on OSD#1, OSD#3, OSD#12) are remapped and their objects are recovered onto the replacement OSDs.

  6. OSD: each OSD daemon handles peering, replication, heartbeats, and more, and persists its objects through an ObjectStore backend, either FileStore or BlueStore. (Disk photo: https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem)
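
A simplified stand-in for the ObjectStore boundary named here, not Ceph's actual interface (the real ObjectStore::Transaction carries many more op types and addresses objects by collection and ghobject): the OSD builds a transaction and queues it, and the backend must apply it atomically and in order:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-in for Ceph's ObjectStore abstraction.
struct Transaction {
    struct Write { std::string object, data; };
    std::vector<Write> writes;  // data writes plus metadata updates
};

class ObjectStore {
public:
    virtual ~ObjectStore() = default;
    // The backend must apply the whole transaction atomically, in order.
    virtual void queue_transaction(Transaction t) = 0;
};

class BlueStoreLike : public ObjectStore {
public:
    void queue_transaction(Transaction t) override {
        // BlueStore writes data to the raw device and commits the metadata
        // (and small overwrites) through a RocksDB transaction.
        for (const auto& w : t.writes)
            std::cout << "write " << w.object << " (" << w.data.size() << " bytes)\n";
    }
};

int main() {
    BlueStoreLike store;
    store.queue_transaction({{{"myobject", "payload"}}});
}
```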

  7. ObjectStore (diagram from https://ceph.com/community/new-luminous-bluestore/)

  8. OSD Transaction

  9. CRUSH: consistency is enforced here!

  10. BlueStore Transaction • To maintain ordering within each PG, ordering within each shard must be guaranteed. (Diagram: a request enters a shard's op_wq; the shard issues a sync transaction to RocksDB (kv_committing) and flushes to the SSD; the completion then travels through the finisher queue and out_q/pipe back to the client as the write ack.)
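
A minimal sketch of that ordering rule, not BlueStore's actual op_wq code: requests of the same PG always hash to the same shard, and each shard drains its queue strictly FIFO, so per-PG ordering follows from per-shard ordering (the shard count and request type are made up):

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

struct Request { uint32_t pg; uint64_t seq; };

class ShardedQueue {
public:
    explicit ShardedQueue(size_t num_shards) : shards_(num_shards) {}

    // A PG always maps to the same shard, so its requests never interleave
    // across shards.
    void enqueue(const Request& r) { shards_[r.pg % shards_.size()].push_back(r); }

    void drain() {
        for (size_t s = 0; s < shards_.size(); ++s)
            while (!shards_[s].empty()) {
                Request r = shards_[s].front();
                shards_[s].pop_front();  // strict FIFO within the shard
                std::cout << "shard " << s << ": PG#" << r.pg
                          << " seq " << r.seq << "\n";
            }
    }

private:
    std::vector<std::deque<Request>> shards_;
};

int main() {
    ShardedQueue q(2);
    q.enqueue({1, 1}); q.enqueue({2, 1}); q.enqueue({1, 2});
    q.drain();  // PG#1 seq 1 is always applied before PG#1 seq 2
}
```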

  11. RocksDB Group Commit • Metadata is stored in RocksDB. • Only after the metadata has been stored atomically does the data become available to users. (Diagram: writer threads call JoinBatchGroup; one becomes the group-commit leader and runs PreprocessWrite, WriteToWAL to the log file, MarkLogsSynced, LaunchParallelFollower, and ExitAsBatchGroupLeader; the others wait in AwaitState and, as followers, write to the memtable concurrently before CompleteParallelWorker and ExitAsBatchGroupFollower. Memtables are later flushed to SST files.)
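
A hedged sketch of the commit step this slide describes: all metadata updates of one transaction go into a single rocksdb::WriteBatch and are committed with a synced WAL write, so they become durable and visible atomically; when several threads do this concurrently, RocksDB merges their batches via the leader/follower group commit shown above. The database path and keys are illustrative, not BlueStore's real key schema:

```cpp
#include <iostream>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bluestore-sketch", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    // All metadata of one BlueStore-style transaction goes into one batch.
    rocksdb::WriteBatch batch;
    batch.Put("O::myobject::onode", "extent map, size, attrs ...");
    batch.Put("S::stats", "updated per-pool statistics ...");

    rocksdb::WriteOptions wo;
    wo.sync = true;  // fsync the WAL: the whole batch is durable and atomic
    s = db->Write(wo, &batch);
    std::cout << (s.ok() ? "committed" : s.ToString()) << "\n";

    delete db;
    return 0;
}
```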

  12. Thread Scalability / Shard Scalability (chart: RocksDB PUT throughput in IOPS, 0 to 60,000, comparing 1 shard vs. 10 shards, each with the WAL enabled and with disableWAL).
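
The WAL vs. disableWAL comparison in the chart boils down to a single WriteOptions flag; a minimal sketch of the two PUT paths being benchmarked (database path and keys are illustrative):

```cpp
#include <iostream>
#include <rocksdb/db.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal-sketch", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    rocksdb::WriteOptions with_wal;   // default: every Put goes through the WAL
    rocksdb::WriteOptions without_wal;
    without_wal.disableWAL = true;    // skip the WAL: faster, but lost on crash

    db->Put(with_wal, "key-wal", "value");
    db->Put(without_wal, "key-nowal", "value");

    delete db;
    return 0;
}
```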

  13. RadosGW

  14. RadosGW • RadosGW is an application built on top of RADOS. (Diagram: RadosGW and CephFS sit on top of RADOS, which consists of OSD, Mon, and Mgr daemons.)
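
RadosGW and CephFS consume the same RADOS object API that any application can use directly through librados; a minimal sketch, assuming a reachable cluster, client.admin credentials, a ceph.conf in a default location, and a pool named mypool:

```cpp
#include <iostream>
#include <rados/librados.hpp>

int main() {
    librados::Rados cluster;
    if (cluster.init("admin") < 0 ||            // client.admin, illustrative
        cluster.conf_read_file(nullptr) < 0 ||  // default ceph.conf locations
        cluster.connect() < 0) {
        std::cerr << "could not connect to the cluster\n";
        return 1;
    }

    librados::IoCtx io;
    if (cluster.ioctx_create("mypool", io) < 0) {  // illustrative pool name
        std::cerr << "pool mypool not found\n";
        cluster.shutdown();
        return 1;
    }

    librados::bufferlist bl;
    bl.append("hello rados");
    io.write_full("myobject", bl);  // store one RADOS object

    cluster.shutdown();
    return 0;
}
```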

  15. RadosGW Transaction • All atomic operations depend on RocksDB. (Diagram: Put Object runs Prepare Index on the bucket index object, Write Data on the data object in RADOS, then Complete Index; the index entries are key/value pairs ultimately stored in RocksDB.)
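
The three steps in the slide can be sketched as below; prepare_index, write_data, and complete_index are hypothetical placeholders for the RADOS operations RadosGW actually issues against the index and data objects, and each index update ends up as a key/value entry in RocksDB on the OSD:

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the RADOS calls RadosGW performs; the bucket
// index entries they touch are persisted as RocksDB key/value pairs.
void prepare_index(const std::string& bucket, const std::string& key) {
    std::cout << "index[" << bucket << "/" << key << "] = pending\n";
}
void write_data(const std::string& key, const std::string& data) {
    std::cout << "data object " << key << " <- " << data.size() << " bytes\n";
}
void complete_index(const std::string& bucket, const std::string& key) {
    std::cout << "index[" << bucket << "/" << key << "] = complete\n";
}

// Put Object: the object only shows up in listings once its index entry
// has been atomically marked complete.
void put_object(const std::string& bucket, const std::string& key,
                const std::string& data) {
    prepare_index(bucket, key);   // 1. record an in-flight index entry
    write_data(key, data);        // 2. write the data object
    complete_index(bucket, key);  // 3. commit the index entry
}

int main() {
    put_object("mybucket", "myobject", "hello");
}
```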

  16. BlueStore Transaction • To maintain ordering within each PG, ordering within each shard must be guaranteed. (Same diagram as slide 10, now with bstore_shard_finishers = true, i.e. a finisher queue per shard.)

  17. Performance Issue

  18. Tail Latency

  19. Performance Metrics

  20. RocksDB Compaction Overhead • "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores" (ATC'19)
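
Compaction competes with foreground writes for device bandwidth, which is the source of the latency spikes SILK targets; one knob RocksDB itself exposes for this is a rate limiter on background I/O, sketched below (the numbers are illustrative, not recommendations):

```cpp
#include <iostream>
#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/rate_limiter.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // Cap flush/compaction I/O at ~32 MB/s so background work cannot starve
    // foreground writes (illustrative value only).
    options.rate_limiter.reset(
        rocksdb::NewGenericRateLimiter(32LL * 1024 * 1024));
    options.max_background_jobs = 4;  // bound concurrent flush/compaction work

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compaction-sketch", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    delete db;
    return 0;
}
```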

  21. Conclusions • Ceph depends heavily on RocksDB. • Ceph's strong consistency is implemented using RocksDB transactions. • The performance of Ceph also depends on RocksDB, especially for small I/O. • But RocksDB has some performance issues of its own: WAL flushing and compaction. • ilsoobyun@linecorp.com

  22. THANK YOU
