CEPH DATA SERVICES IN A MULTI- AND HYBRID CLOUD WORLD
Sage Weil - Red Hat
FOSDEM - 2019.02.02
OUTLINE
● Ceph
● Data services
● Block
● File
● Object
● Edge
● Future
UNIFIED STORAGE PLATFORM
● OBJECT: RGW - S3 and Swift object storage with robust feature set
● BLOCK: RBD - virtual block device
● FILE: CEPHFS - distributed network file system
● LIBRADOS: low-level storage API (see the sketch below)
● RADOS: reliable, elastic, highly-available distributed storage layer with replication and erasure coding
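As a rough illustration of the LIBRADOS layer, the sketch below stores and reads back one object with the python-rados binding; the conffile path and the pool name 'rbd' are assumptions, not part of the slides.

# LIBRADOS sketch (python-rados). Assumes a reachable cluster, a readable
# /etc/ceph/ceph.conf, and an existing pool named 'rbd'.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')         # I/O context for one pool
    ioctx.write_full('hello-object', b'hi')   # store an entire object
    print(ioctx.read('hello-object'))         # -> b'hi'
    ioctx.close()
finally:
    cluster.shutdown()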
RELEASE SCHEDULE
● Luminous - 12.2.z - Aug 2017
● Mimic - 13.2.z - May 2018
● Nautilus - 14.2.z - Feb 2019 ← WE ARE HERE
● Octopus - 15.2.z - Nov 2019
● Stable, named release every 9 months
● Backports for 2 releases
● Upgrade up to 2 releases at a time (e.g., Luminous → Nautilus, Mimic → Octopus)
FOUR CEPH PRIORITIES
● Usability and management
● Container ecosystem
● Performance
● Multi- and hybrid cloud
MOTIVATION - DATA SERVICES
A CLOUDY FUTURE
● IT organizations today
  ○ Multiple private data centers
  ○ Multiple public cloud services
● It’s getting cloudier
  ○ “On premise” → private cloud
  ○ Self-service IT resources, provisioned on demand by developers and business units
● Next generation of cloud-native applications will span clouds
● “Stateless microservices” are great, but real applications have state
● Managing moving or replicated state is hard
“DATA SERVICES”
● Data placement and portability
  ○ Where should I store this data?
  ○ How can I move this data set to a new tier or new site?
  ○ Seamlessly, without interrupting applications?
● Introspection
  ○ What data am I storing? For whom? Where? For how long?
  ○ Search, metrics, insights
● Policy-driven data management
  ○ Lifecycle management
  ○ Compliance: constrain placement, retention, etc. (e.g., HIPAA, GDPR)
  ○ Optimize placement based on cost or performance
  ○ Automation
MORE THAN JUST DATA
● Data sets are tied to applications
  ○ When the data moves, the application often should (or must) move too
● Container platforms are key
  ○ Automated application (re)provisioning
  ○ “Operators” to manage coordinated migration of state and the applications that consume it
DATA USE SCENARIOS
● Multi-tier
  ○ Different storage for different data
● Mobility
  ○ Move an application and its data between sites with minimal (or no) availability interruption
  ○ Maybe an entire site, but usually a small piece of a site (e.g., a single app)
● Disaster recovery
  ○ Tolerate a complete site failure; reinstantiate data and app in a secondary site quickly
  ○ Point-in-time consistency with bounded latency (bounded data loss on failover)
● Stretch
  ○ Tolerate site outage without compromising data availability
  ○ Synchronous replication (no data loss) or async replication (different consistency model)
● Edge
  ○ Small satellite sites (e.g., telco POP) and/or semi-connected sites (e.g., autonomous vehicle)
SYNC VS ASYNC
● Synchronous replication
  ○ Application initiates a write
  ○ Storage writes to all replicas
  ○ Application write completes
  ○ Write latency may be high since we wait for all replicas
  ○ All replicas always reflect applications’ completed writes
● Asynchronous replication
  ○ Application initiates a write
  ○ Storage writes to one (or some) replicas
  ○ Application write completes
  ○ Storage writes to remaining (usually remote) replicas later
  ○ Write latency can be kept low
  ○ If initial replicas are lost, application write may be lost
  ○ Remote replicas may always be somewhat stale
BLOCK STORAGE
HOW WE USE BLOCK
● Virtual disk device
● Exclusive access by nature (with few exceptions)
● Strong consistency required
● Performance sensitive
● Basic feature set (see the sketch below)
  ○ Read, write, flush, maybe resize
  ○ Snapshots (read-only) or clones (read/write)
    ■ Point-in-time consistent
● Often self-service provisioning
  ○ via Cinder in OpenStack
  ○ via Persistent Volume (PV) abstraction in Kubernetes
[Diagram: applications → file system (XFS, ext4, whatever) → block device]
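To make that basic feature set concrete, here is a minimal python-rbd sketch that provisions an image and takes a point-in-time snapshot; the pool name, image name, and size are assumptions.

# Minimal RBD sketch (python-rados + python-rbd). Assumes an existing
# pool named 'rbd'; image and snapshot names are arbitrary examples.
import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                       # pool name is an assumption
try:
    rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)  # 10 GiB image
    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.write(b'hello', 0)                # write at offset 0
        image.create_snap('before-upgrade')     # read-only point-in-time snapshot
finally:
    ioctx.close()
    cluster.shutdown()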
RBD - TIERING WITH RADOS POOLS
Scenarios: Multi-tier ✓  Mobility ❏  DR ❏  Stretch ❏  Edge ❏
[Diagram: KVM guests (FS + librbd) and hosts (FS + KRBD) backed by multiple RADOS pools - SSD 2x pool, HDD 3x pool, SSD EC 6+3 pool - in one Ceph storage cluster]
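A hedged sketch of how the tiers above are selected in practice: the RADOS pool an image is created in determines its replication/EC scheme and device class. The pool names 'rbd-ssd' and 'rbd-hdd' are assumptions.

# Tiering sketch (python-rbd): the pool chosen at create time decides the
# image's tier. Pool names are assumptions.
import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    for pool, image, size in [('rbd-ssd', 'db-disk',   20 * 1024**3),
                              ('rbd-hdd', 'bulk-disk', 500 * 1024**3)]:
        ioctx = cluster.open_ioctx(pool)
        rbd.RBD().create(ioctx, image, size)   # placement follows the pool
        ioctx.close()
finally:
    cluster.shutdown()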
RBD - LIVE IMAGE MIGRATION
Scenarios: Multi-tier ✓  Mobility ✓  DR ❏  Stretch ❏  Edge ❏
● New in Nautilus
● librbd only (sketch below)
[Diagram: KVM guests (FS + librbd) and hosts (FS + KRBD); an image is migrated live between the SSD 2x, HDD 3x, and SSD EC 6+3 pools within one Ceph storage cluster]
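A sketch of that flow using the migration_* calls exposed by the Nautilus python-rbd binding (the counterpart of rbd migration prepare / execute / commit); pool and image names are assumptions, and the exact binding signatures should be checked against your Ceph version. Clients must keep accessing the image through librbd while the migration is in flight.

# Live image migration sketch (Nautilus python-rbd). Moves 'db-disk'
# from the 'rbd-hdd' pool to the 'rbd-ssd' pool; names are assumptions.
import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
src = cluster.open_ioctx('rbd-hdd')
dst = cluster.open_ioctx('rbd-ssd')
try:
    r = rbd.RBD()
    r.migration_prepare(src, 'db-disk', dst, 'db-disk')  # link source to target
    r.migration_execute(src, 'db-disk')                  # copy blocks in the background
    r.migration_commit(src, 'db-disk')                   # remove the source when done
finally:
    src.close()
    dst.close()
    cluster.shutdown()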
RBD - STRETCH
Scenarios: Multi-tier ❏  Mobility ❏  DR ✓  Stretch ✓  Edge ❏
● Apps can move
● Data can’t - it’s already everywhere
● Performance is usually compromised
  ○ Need fat and low latency pipes
[Diagram: hosts (FS + KRBD) in SITE A and SITE B sharing a stretch pool in a single stretch Ceph storage cluster over a WAN link]
RBD - STRETCH WITH TIERS
Scenarios: Multi-tier ✓  Mobility ❏  DR ✓  Stretch ✓  Edge ❏
● Create site-local pools for performance-sensitive apps
[Diagram: SITE A and SITE B each with a site-local pool (A POOL, B POOL) plus a shared stretch pool in a single stretch Ceph storage cluster over a WAN link]
RBD - STRETCH WITH MIGRATION
Scenarios: Multi-tier ✓  Mobility ✓  DR ✓  Stretch ✓  Edge ❏
● Live migrate images between pools
● Maybe even live migrate your app VM?
[Diagram: a KVM guest (FS + librbd) whose image migrates between the site-local pools (A POOL, B POOL) and the stretch pool in a single stretch Ceph storage cluster over a WAN link]
STRETCH IS SKETCH
● Network latency is critical
  ○ Want low latency for performance
  ○ Stretch requires nearby sites, limiting usefulness
● Bandwidth too
  ○ Must be able to sustain rebuild data rates
● Relatively inflexible
  ○ Single cluster spans all locations; maybe ok for 2 datacenters but not 10?
  ○ Cannot “join” existing clusters
● High level of coupling
  ○ Single (software) failure domain for all sites
● Proceed with caution!
RBD ASYNC MIRRORING
● Asynchronously mirror all writes (sketch below)
● Some performance overhead at primary
  ○ Mitigate with SSD pool for RBD journal
● Configurable time delay for backup
● Supported since Luminous
[Diagram: a KVM guest (FS + librbd) writes to the PRIMARY SSD 3x pool in CEPH CLUSTER A, which is asynchronously mirrored over a WAN link to the BACKUP HDD 3x pool in CEPH CLUSTER B]
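A minimal sketch of enabling journal-based mirroring for a single image on the primary cluster with python-rbd; it assumes a peer cluster and an rbd-mirror daemon are already configured, and the pool/image names are placeholders.

# Sketch: enable journal-based mirroring for one image on the primary
# cluster. Pool and image names are assumptions; the peer cluster and
# rbd-mirror daemon must already be set up.
import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd-ssd')
try:
    rbd.RBD().mirror_mode_set(ioctx, rbd.RBD_MIRROR_MODE_IMAGE)   # per-image mirroring
    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.update_features(rbd.RBD_FEATURE_JOURNALING, True)   # journal all writes
        image.mirror_image_enable()                               # start mirroring
finally:
    ioctx.close()
    cluster.shutdown()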
RBD ASYNC MIRRORING
Scenarios: Multi-tier ❏  Mobility ❏  DR ✓  Stretch ❏  Edge ❏
● On primary failure
  ○ Backup is point-in-time consistent
  ○ Lose only last few seconds of writes
  ○ VM/pod/whatever can restart in new site (sketch below)
● If primary recovers,
  ○ Option to resync and “fail back”
[Diagram: the KVM guest (FS + librbd) now runs against the promoted pool in CEPH CLUSTER B; the old SSD 3x pool in CEPH CLUSTER A is marked DIVERGENT until it is resynced over the WAN link]
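And a hedged sketch of the failover step on the backup cluster: force-promote the mirrored image so it becomes writable and the workload can restart against it. The conffile path, pool, and image names are assumptions.

# Failover sketch on the backup cluster (python-rbd). force=True is
# needed when the old primary is unreachable.
import rados, rbd

backup = rados.Rados(conffile='/etc/ceph/ceph-b.conf')   # backup cluster's conf (assumption)
backup.connect()
ioctx = backup.open_ioctx('rbd-hdd')
try:
    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.mirror_image_promote(True)   # make the backup copy primary/writable
finally:
    ioctx.close()
    backup.shutdown()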
RBD MIRRORING IN OPENSTACK CINDER
● Ocata
  ○ Cinder RBD replication driver
● Queens
  ○ ceph-ansible deployment of rbd-mirror via TripleO
● Rocky
  ○ Failover and fail-back operations
● Gaps
  ○ Deployment and configuration tooling
  ○ Cannot replicate multi-attach volumes
  ○ Nova attachments are lost on failover
MISSING LINK: APPLICATION ORCHESTRATION
● Hard for IaaS layer to reprovision app in new site
● Storage layer can’t solve it on its own either
● Need automated, declarative, structured specification for entire app stack...
FILE STORAGE
CEPHFS STATUS
Scenarios: Multi-tier ✓  Mobility ❏  DR ❏  Stretch ❏  Edge ❏
● Stable since Kraken
● Multi-MDS stable since Luminous
● Snapshots stable since Mimic
● Support for multiple RADOS data pools
  ○ Per-directory subtree policies for placement, striping, etc. (see the sketch below)
● Fast, highly scalable
● Quota, multi-volumes, multi-subvolume
● Provisioning via OpenStack Manila and Kubernetes
● Fully awesome
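A small sketch of those per-directory subtree policies, set through CephFS virtual xattrs from a client with the file system kernel-mounted; the mount point, data pool name, and quota value are assumptions.

# Per-directory policy sketch via CephFS virtual xattrs, run on a client
# with the file system mounted. The target data pool must already have
# been added to the file system.
import os

d = '/mnt/cephfs/projects/fast'          # mount point and path are assumptions
os.makedirs(d, exist_ok=True)

# Place new files under this directory in an SSD-backed data pool.
os.setxattr(d, 'ceph.dir.layout.pool', b'cephfs-ssd-data')

# Cap the subtree at 100 GiB.
os.setxattr(d, 'ceph.quota.max_bytes', str(100 * 1024**3).encode())

# Recursive change time (rctime), useful for incremental mirroring later.
print(os.getxattr(d, 'ceph.dir.rctime'))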
[Diagram: a CephFS client host (Ceph kernel module, or ceph-fuse / Samba / nfs-ganesha gateways) sends metadata operations to the MDSs and data I/O directly to the RADOS cluster]
CEPHFS - STRETCH?
Scenarios: Multi-tier ❏  Mobility ❏  DR ✓  Stretch ✓  Edge ❏
● We can stretch CephFS just like RBD pools
● It has the same limitations as RBD
  ○ Latency → lower performance
  ○ Limited by geography
  ○ Big (software) failure domain
● Also,
  ○ MDS latency is critical for file workloads
  ○ ceph-mds daemons will run in one site; clients in other sites will see higher latency
CEPHFS - FUTURE OPTIONS
● What can we do with CephFS across sites and clusters?
CEPHFS - SNAP MIRRORING?
Scenarios: Multi-tier ❏  Mobility ❏  DR ✓  Stretch ❏  Edge ❏
● CephFS snapshots provide
  ○ point-in-time consistency
  ○ granularity (any directory in the system)
● CephFS rstats provide
  ○ rctime = recursive ctime on any directory
  ○ We can efficiently find changes (sketch below)
● rsync provides
  ○ efficient file transfer
● Time bounds on order of minutes
● Gaps and TODO
  ○ “rstat flush” coming in Nautilus
    ■ Xuehan Xu @ Qihoo 360
  ○ rsync support for CephFS rctime
  ○ scripting / tooling
  ○ easy rollback interface
● Matches enterprise storage feature sets
[Timeline, SITE A → SITE B: 1. A: create snap S1; 2. rsync A→B; 3. B: create snap S1; 4. A: create snap S2; 5. rsync A→B; 6. B: create snap S2; 7. A: create snap S3; 8. rsync A→B; 9. B: create snap S3]
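A rough sketch of one round of that loop, run from a host with both sites' file systems mounted; the mount points, snapshot name, and rsync options are assumptions, not the tooling the slide refers to.

# One round of snapshot-based mirroring. CephFS snapshots are taken by
# creating a directory under the magic ".snap" directory.
import os
import subprocess

SRC = '/mnt/cephfs-a/projects'   # SITE A (primary), assumption
DST = '/mnt/cephfs-b/projects'   # SITE B (backup), assumption

def mirror_round(snap_name: str) -> None:
    # 1. A: create snap
    os.mkdir(os.path.join(SRC, '.snap', snap_name))
    # 2. rsync A→B from the frozen snapshot, so B receives a consistent view
    subprocess.run(
        ['rsync', '-a', '--delete',
         os.path.join(SRC, '.snap', snap_name) + '/', DST + '/'],
        check=True)
    # 3. B: create the matching snap, marking a consistent recovery point
    os.mkdir(os.path.join(DST, '.snap', snap_name))

mirror_round('S1')
# rctime can be used to skip unchanged subtrees on later rounds:
print(os.getxattr(SRC, 'ceph.dir.rctime'))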