Geo replication and disaster recovery for cloud object storage with Ceph Rados Gateway


  1. Geo replication and disaster recovery for cloud object storage with Ceph Rados Gateway
     Orit Wasserman, Senior Software Engineer, owasserm@redhat.com
     Vault 2017

  2. AGENDA
     • What is Ceph?
     • Rados Gateway (radosgw) architecture
     • Geo replication in radosgw
     • Questions

  3. Ceph architecture

  4. Ceph
     • Open source
     • Software-defined storage
     • Distributed
     • No single point of failure
     • Massively scalable
     • Self healing
     • Unified storage: object, block and file
     • IRC: OFTC #ceph, #ceph-devel
     • Mailing lists: ceph-users@ceph.com and ceph-devel@ceph.com

  5. Ceph
     [Stack diagram]
     • RGW (for APPs): a web services gateway for object storage, compatible with S3 and Swift
     • RBD (for HOSTs/VMs): a reliable, fully-distributed block device with cloud platform integration
     • CEPHFS (for CLIENTs): a distributed file system with POSIX semantics and scale-out metadata management
     • LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
     • RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

  6. RADOS (Reliable Autonomic Distributed Object Store)
     • Replication
     • Erasure coding
     • Flat object namespace within each pool
     • Different placement rules
     • Strong consistency (CP system)
     • Infrastructure aware, dynamic topology
     • Hash-based placement (CRUSH)
     • Direct client-to-server data path
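
Hash-based placement means a client can compute where an object lives without asking a central server, which is what enables the direct client-to-server data path. A toy Python sketch of the idea; the hash function and the PG-to-OSD mapping below are stand-ins for illustration, not the real CRUSH algorithm:

    # Toy illustration of hash-based placement: object name -> placement group -> OSDs.
    # Stand-in hashing and mapping; the real algorithm is CRUSH.
    import hashlib

    PG_NUM = 64          # placement groups in the pool (example value)
    NUM_OSDS = 12        # toy cluster size
    REPLICAS = 3

    def object_to_pg(name):
        digest = int(hashlib.sha1(name.encode()).hexdigest(), 16)
        return digest % PG_NUM

    def pg_to_osds(pg):
        # Stand-in for CRUSH: deterministically pick REPLICAS distinct OSDs.
        return [(pg + i * (NUM_OSDS // REPLICAS)) % NUM_OSDS for i in range(REPLICAS)]

    pg = object_to_pg("my-object")
    print("my-object -> pg", pg, "-> osds", pg_to_osds(pg))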

  7. OSD node
     • 10s to 1000s in a cluster
     • One per disk (or one per SSD, RAID group…)
     • Serve stored objects to clients
     • Intelligently peer for replication & recovery

  8. Monitor node
     • Maintain cluster membership and state
     • Provide consensus for distributed decision-making
     • Small, odd number
     • These do not serve stored objects to clients

  9. Librados API
     • Efficient key/value storage inside an object
     • Atomic single-object transactions
       • update data, attrs, keys together
       • atomic compare-and-swap
     • Object-granularity snapshot infrastructure
     • Partial overwrite of existing data
     • Single-object compound atomic operations
     • RADOS classes (stored procedures)
     • Watch/Notify on an object
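
A minimal sketch of direct RADOS access through the python-rados bindings, exercising a few of the features above (whole-object write, per-object xattrs, omap key/value storage). The pool name "mypool" and the object name are placeholders; it assumes a reachable cluster and /etc/ceph/ceph.conf:

    # Minimal python-rados sketch: an application talking directly to RADOS.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('mypool')                # pool name is an example
        ioctx.write_full('demo-object', b'hello rados')     # whole-object write
        ioctx.set_xattr('demo-object', 'owner', b'orit')    # per-object attribute
        # Key/value (omap) storage inside the same object, applied atomically:
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, ('color',), ('blue',))
            ioctx.operate_write_op(op, 'demo-object')
        print(ioctx.read('demo-object'))
        ioctx.close()
    finally:
        cluster.shutdown()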

  10. Rados Gateway

  11. Rados Gateway
      [Stack diagram, RGW highlighted]
      • RGW (for APPs): a web services gateway for object storage, compatible with S3 and Swift
      • RBD (for HOSTs/VMs): a reliable, fully-distributed block device with cloud platform integration
      • CEPHFS (for CLIENTs): a distributed file system with POSIX semantics and scale-out metadata management
      • LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
      • RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

  12. Rados Gateway
      [Diagram: applications speak REST over a socket to RADOSGW instances; each radosgw uses LIBRADOS to talk to the RADOS cluster (monitors and OSDs)]

  13. RESTful object storage
      Data:
      • Users
      • Buckets
      • Objects
      • ACLs
      • Authentication
      APIs:
      • S3 (REST)
      • Swift (REST)
      • Librgw (used for NFS)
      [Diagram: applications use the S3 and Swift REST APIs against RADOSGW, which stores everything in the RADOS cluster via librados]
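
Because radosgw exposes an S3-compatible API, any S3 client can talk to it. A small boto3 sketch that creates a bucket and uploads an object; the endpoint URL, credentials and bucket name are placeholders for illustration:

    # Talking to radosgw through its S3-compatible API with boto3.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',   # your radosgw endpoint
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello rgw')
    for obj in s3.list_objects_v2(Bucket='demo-bucket').get('Contents', []):
        print(obj['Key'], obj['Size'])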

  14. RGW vs RADOS objects
      RADOS:
      • Limited object sizes
      • Mutable objects
      • Not indexed
      • No per-object ACLs
      RGW:
      • Large objects (up to a few TB per object)
      • Immutable objects
      • Sorted bucket listing
      • Permissions

  15. RGW objects
      Head:
      • Single rados object
      • Object metadata (ACLs, user attributes, manifest)
      • Optional start of data
      Tail:
      • Striped data
      • 0 or more rados objects
      [Diagram: an RGW OBJECT is stored as a HEAD plus a TAIL]

  16. RGW objects
      [Diagram: object "foo" in bucket "boo" (bucket ID 123) maps to a head rados object "123_foo" plus tail rados objects such as "123_28faPd3Z.1" and "123_28faPd3Z.2"]
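
To make the head/tail layout concrete, here is a conceptual Python sketch of splitting an uploaded object into a head plus striped tail rados objects. This is an illustration only, not radosgw's code: the 4 MB stripe size and the exact naming scheme are simplified from the example names on the slide:

    # Conceptual sketch: split an RGW object into head + striped tail rados objects.
    STRIPE_SIZE = 4 * 1024 * 1024   # example stripe size

    def split_rgw_object(bucket_id, name, prefix, data):
        """Return {rados_object_name: bytes} for the head and tail parts."""
        parts = {"%s_%s" % (bucket_id, name): data[:STRIPE_SIZE]}   # head: metadata + first stripe
        tail = data[STRIPE_SIZE:]
        for i in range(0, len(tail), STRIPE_SIZE):
            parts["%s_%s.%d" % (bucket_id, prefix, i // STRIPE_SIZE + 1)] = tail[i:i + STRIPE_SIZE]
        return parts

    # e.g. object "foo" in bucket id 123 with tail prefix "28faPd3Z"
    layout = split_rgw_object("123", "foo", "28faPd3Z", b"x" * (9 * 1024 * 1024))
    print(sorted(layout))   # ['123_28faPd3Z.1', '123_28faPd3Z.2', '123_foo']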

  17. RGW bucket index
      [Diagram: the bucket index is split into shards; e.g. Shard 1 holds the keys aaa, abc, def (v2), def (v1), zzz and Shard 2 holds the keys aab, bbb, eee, fff, zzz]
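
Objects are assigned to index shards by hashing the object name. A tiny conceptual sketch of that idea; the hash choice and shard count here are illustrative, not radosgw's exact scheme or defaults:

    # Conceptual bucket-index shard selection: hash the object name, mod the shard count.
    import zlib

    NUM_SHARDS = 2

    def index_shard(object_name, num_shards=NUM_SHARDS):
        return zlib.crc32(object_name.encode()) % num_shards

    for name in ("aaa", "aab", "abc", "bbb"):
        print(name, "-> shard", index_shard(name))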

  18. RGW object creation
      When creating a new object we need to:
      • Update the bucket index
      • Create the head object
      • Create the tail objects
      All of those operations need to be consistent.

  19. RGW object creation
      [Diagram: the tail objects are written, a "prepare" entry is recorded in the bucket index shard, the head object is written, then a "complete" entry makes the new key visible in the index listing]
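
A conceptual sketch of the prepare/complete pattern shown above, using plain Python containers to stand in for the bucket index and the rados data pool; this is not radosgw's implementation, only the shape of the transaction:

    # Conceptual prepare/complete flow for keeping the bucket index consistent
    # with the head/tail writes. Stand-in code; radosgw does this inside RADOS.
    index = {}            # bucket index shard: key -> state
    data_store = {}       # rados objects

    def create_object(key, head, tails):
        index[key] = "prepare"                          # 1. mark the index entry as pending
        for i, t in enumerate(tails, 1):                # 2. write the tail objects
            data_store["%s.tail.%d" % (key, i)] = t
        data_store["%s.head" % key] = head              # 3. write the head object
        index[key] = "complete"                         # 4. finalize the index entry

    create_object("aab", b"head-bytes", [b"tail-1", b"tail-2"])
    print(index)          # {'aab': 'complete'} -- only completed entries are listed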

  20. Geo replication

  21. Geo replication
      • Data is replicated across different physical locations
      • High and unpredictable latency between those locations

  22. Why do we need geo replication?
      • Disaster recovery
      • Distribute data across geographical locations

  23. Geo replication
      [Diagram: sites such as us-east, us-west, europe, brazil, singapore and aus spread across a world map, with each primary site paired with a DR backup site]

  24. Sync agent (old implementation)
      [Diagram: an external SYNC AGENT sits between two Ceph object gateways (RGW), one per Ceph storage cluster (US-EAST-1 and US-EAST-2)]

  25. Sync agent (old implementation)
      • External Python implementation
      • No Active/Active support
      • Hard to configure
      • Complicated failover mechanism
      • No clear sync status indication
      • A single bucket's synchronization could dominate the entire sync process
      • Configuration updates require a restart of the gateways

  26. New implementation
      • Part of radosgw (written in C++)
      • Active/Active support for data replication
      • Simpler configuration
      • Simplified failover/failback
      • Dynamic reconfiguration
      • Backward compatibility with the sync agent

  27. Multisite configuration
      Realm:
      • Namespace
      • Contains the multisite configuration and status
      • Allows running different configurations in the same cluster
      Zonegroup:
      • Group of zones
      • Used to be called a region in the old multisite
      • Each realm has a single master zonegroup
      Zone:
      • One or more radosgw instances, all running on the same RADOS cluster
      • Each zonegroup has a single master zone
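
A sketch of creating this realm/zonegroup/zone hierarchy on the master site, driving the radosgw-admin CLI from Python. The realm, zonegroup and zone names mirror the talk's example; the endpoint URL is a placeholder, and exact flags (e.g. system-user credentials) can vary between Ceph releases:

    # Sketch: build the multisite hierarchy on the master site via radosgw-admin.
    import subprocess

    def rgw_admin(*args):
        subprocess.run(("radosgw-admin",) + args, check=True)

    rgw_admin("realm", "create", "--rgw-realm=gold", "--default")
    rgw_admin("zonegroup", "create", "--rgw-zonegroup=us",
              "--endpoints=http://rgw-us-east:8080", "--master", "--default")
    rgw_admin("zone", "create", "--rgw-zonegroup=us", "--rgw-zone=us-east",
              "--endpoints=http://rgw-us-east:8080", "--master", "--default")
    # Commit the staged changes as a new period (see the period slides below):
    rgw_admin("period", "update", "--commit")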

  28. Disaster recovery example
      [Diagram: Realm "gold" with a single zonegroup "us" (master); zone "us-east" (master) on the US-EAST Ceph storage cluster and zone "us-west" (secondary) on the US-WEST Ceph storage cluster, each behind its own radosgw]

  29. Multisite environment example
      [Diagram: Realm "gold" with two zonegroups: "us" (master) containing zone "us-east" (master, US-EAST cluster) and zone "us-west" (secondary, US-WEST cluster), and "eu" (secondary) containing zone "eu-west" (master, EU-WEST cluster), each zone behind its own radosgw]

  30. Configuration change
      Period:
      • Each period has a unique id
      • Contains all the realm configuration, an epoch and its predecessor period id (except for the first period)
      • Every realm has an associated current period and a chronological list of periods
      Git-like mechanism:
      • User configuration changes are stored locally
      • Configuration updates are stored in a staging period (using the radosgw-admin period update command)
      • Changes are applied only when the period is committed (using the radosgw-admin period commit command)

  31. Configuration change – no master change
      • Each zone can pull the period information (using the radosgw-admin period pull command)
      • A period commit will only increment the period epoch
      • The new period information will be pushed to all the other zones
      • We use watch/notify on the realm rados object to detect changes on the local radosgw

  32. Changing the master zone
      Period commit results in the following actions:
      • A new period is generated with a new period id and an epoch of 1
      • The realm's current period is updated to point to the newly generated period id
      • The realm's epoch is incremented
      • The new period is pushed to all other zones by the new master
      We use watch/notify on the realm rados object to detect changes and apply them on the local radosgw.
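
A conceptual Python sketch of the period semantics described on the last three slides: a period carries the realm configuration, an epoch and its predecessor, and a commit either bumps the epoch (no master change) or creates a brand-new period (master change). The class and field names are invented for illustration; the real logic lives inside radosgw:

    # Conceptual model of periods: a git-like chain of realm configurations.
    import uuid
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Period:
        period_id: str
        epoch: int
        predecessor_id: Optional[str]
        config: dict                      # realm configuration snapshot

    @dataclass
    class Realm:
        current: Period
        epoch: int = 1
        history: list = field(default_factory=list)

        def commit(self, staged_config, master_changed):
            if master_changed:
                # New master: new period id, period epoch restarts at 1,
                # realm epoch is incremented, and the new master pushes the period.
                self.history.append(self.current)
                self.current = Period(uuid.uuid4().hex, 1,
                                      self.current.period_id, staged_config)
                self.epoch += 1
            else:
                # Same master: only the period epoch is bumped.
                self.current.epoch += 1
                self.current.config = staged_config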

  33. Sync process
      Metadata change operations:
      • Bucket ops (create, delete, enable/disable versioning, change ACLs, ...)
      • User ops
      Data change operations:
      • Create/delete objects
      • Metadata changes have a wide system effect
      • Metadata changes are rare
      • Data changes are large

  34. Metadata sync
      • Metadata changes are replicated synchronously across the realm
      • A log of metadata changes, ordered chronologically, is stored locally in each Ceph cluster
      • Each realm has a single meta master: the master zone in the master zonegroup
      • Only the meta master can execute metadata changes
      • If the meta master is down, the user cannot perform metadata updates until a new meta master is assigned (cannot create/delete buckets or users, but can still read/write objects)

  35. Metadata sync
      Metadata update:
      • Update the metadata log with the pending operation
      • Execute
      • Update the metadata log on operation completion
      • Push the metadata log changes to all the remote zones
      • Each zone will pull the updated metadata log and apply the changes locally
      • Updates to metadata originating from a different zone are forwarded to the meta master
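
A conceptual sketch of that pending/complete metadata-log flow on the meta master; helper names are invented for illustration and the push is a stand-in for the HTTP traffic between gateways, not radosgw's actual C++ code:

    # Conceptual metadata-log flow on the meta master:
    # log pending, execute, log complete, push to remote zones.
    mdlog = []                       # chronologically ordered metadata log
    remote_zones = ["us-west", "eu-west"]

    def push_to_zone(zone, entries):
        # Stand-in for pushing/pulling log entries between gateways over HTTP.
        print("push %d mdlog entries to %s" % (len(entries), zone))

    def apply_metadata_change(op, execute):
        mdlog.append((op, "pending"))         # 1. log the pending operation
        execute()                             # 2. execute it locally
        mdlog.append((op, "complete"))        # 3. log completion
        for zone in remote_zones:             # 4. push the changes to remote zones
            push_to_zone(zone, mdlog[-2:])

    apply_metadata_change("create bucket boo", lambda: None)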

  36. Data sync
      • Data changes are replicated asynchronously (eventual consistency)
      • Default replication is Active/Active
      • The user can configure a zone to be read-only for Active/Passive
      • Data changes are executed locally and logged chronologically to a data log
      • We first complete a full sync and then continue doing an incremental sync
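
A conceptual sketch of the full-sync-then-incremental pattern: existing objects are copied wholesale, after which the chronological data log is replayed from a saved marker. The containers, marker handling and function names are a plain-Python model for illustration, not radosgw's implementation:

    # Conceptual data sync: full sync first, then incremental sync from the data log.
    def full_sync(source_objects, dest_objects):
        # Copy everything that already exists in the source zone.
        for key, data in source_objects.items():
            dest_objects[key] = data

    def incremental_sync(data_log, marker, source_objects, dest_objects):
        # Replay data-log entries past our last-seen marker (eventual consistency).
        for op, key in data_log[marker:]:
            if op == "write":
                dest_objects[key] = source_objects[key]
            elif op == "delete":
                dest_objects.pop(key, None)
        return len(data_log)              # new marker position

    src = {"boo/foo": b"v1"}
    dst = {}
    log = []
    full_sync(src, dst)
    src["boo/bar"] = b"v1"; log.append(("write", "boo/bar"))
    marker = incremental_sync(log, 0, src, dst)
    print(sorted(dst), "marker =", marker)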
