

  1. Ceph: A Scalable, High-Performance Distributed File System. Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long. Presenter: Md Rajib Hossen

  2. Ceph - a single, open, and unified platform
     - Horizontally scalable
     - Interoperable
     - No single point of failure
     - Workloads include tens of thousands of clients concurrently reading and writing the same file or directory
     - Handles allocation and mapping with a dynamic algorithm, CRUSH
     - Enhances local disks with intelligent object storage devices (OSDs)

  3. Architecture Components
     - MDS (metadata server): performs file operations (open, rename), manages the namespace, and ensures consistency, security, and safety
     - OSD (object storage device): stores file data; maintains replication, data serialization, and recovery
     - Client: supports three different client types - object, block, and POSIX file system
     - Monitors: keep track of active and failed cluster nodes
     - Files are striped into several objects and stored as objects at the storage level; object size, stripe width, and stripe count are configurable
     - CRUSH: removes allocation tables, dynamically maps objects (stripes of files) to storage devices, retrieves object locations, and load-balances across nodes

  4. Design Features
     Ceph provides scalability as well as high performance, reliability, and availability. To achieve these, Ceph has three design features:
     - Decoupled data and metadata: metadata operations (open, rename) are managed by the MDS cluster while OSDs perform file I/O; moreover, CRUSH distributes file objects to storage devices algorithmically
     - Dynamic distributed metadata management: uses dynamic subtree partitioning to distribute responsibility among several MDSs; the dynamic hierarchical partition preserves locality, and the distribution adapts to the access pattern
     - Reliable autonomic distributed object storage: delegates responsibilities to the OSDs and gives them the intelligence to utilize their memory and CPU

  5. Q3. "Ceph directly addresses the issue of scalability while simultaneously achieving high performance, reliability and availability through three fundamental design features: ..." What are Ceph's design features? Compare Figure 1 with "Figure 1: GFS Architecture" in the GFS paper, read Section 2, and indicate the fundamental differences between them. [Hint: "...Figure 1: GFS Architecture", "Ceph utilizes a novel metadata cluster architecture...", "Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs..."]

  6. Q3 (continued) - the fundamental differences between Ceph and GFS:
     - GFS has a single master that coordinates and maintains all metadata work, whereas Ceph has a metadata server (MDS) cluster
     - Ceph delegates replication and failure detection to the OSDs, whereas the GFS master manages these tasks
     - GFS uses a fixed chunk size (64 MB), whereas Ceph uses variable object sizes
     - GFS keeps file-to-chunk mapping tables in the master's memory, whereas Ceph computes placement with CRUSH
     - GFS depends on the local file system, whereas Ceph builds intelligent OSDs on top of a local or customized file system
     - Ceph does not require metadata locks or leases to be issued to clients, whereas GFS does

  7. OSD
     - Replaces the traditional hard disk with an intelligent object storage device
     - Clients can read and write byte ranges continuously on an OSD, which a traditional HDD block interface does not offer
     - Clients can perform continuous reading and writing of large, variably sized objects
     - Object sizes are configurable (2 MB, 4 MB, etc.)
     - Low-level block allocation decisions are distributed to the devices themselves
     - Moreover, reliance on traditional file system principles such as allocation lists and inode tables limits scalability and performance; the intelligence present in OSDs can exploit the CPU and memory of the storage nodes
     - Q1. "OSDs replace the traditional block-level interface with one in which clients can read or write byte ranges to much larger (and often variably sized) named objects, distributing low-level block allocation decisions to the devices themselves." What are the major differences between an OSD (object storage device) and a conventional hard disk?
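To make the contrast concrete, here is a minimal sketch of the two interfaces. The `BlockDevice` and `ObjectStore` classes and their methods are illustrative assumptions for this slide, not Ceph's or any real OSD's API.

```python
# Hypothetical sketch: block interface vs. OSD-style object interface.
# Names and methods are illustrative, not Ceph's actual API.

class BlockDevice:
    """Traditional disk: fixed-size blocks addressed by block number;
    the host file system does all allocation bookkeeping."""
    BLOCK_SIZE = 4096

    def __init__(self):
        self.blocks = {}

    def read_block(self, lba):
        return self.blocks.get(lba, b"\x00" * self.BLOCK_SIZE)

    def write_block(self, lba, data):
        assert len(data) == self.BLOCK_SIZE   # caller must manage allocation
        self.blocks[lba] = data


class ObjectStore:
    """OSD-style store: named, variably sized objects with byte-range I/O;
    low-level block allocation happens inside the device."""

    def __init__(self):
        self.objects = {}

    def write(self, oid, offset, data):
        buf = bytearray(self.objects.get(oid, b""))
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        self.objects[oid] = bytes(buf)

    def read(self, oid, offset, length):
        return self.objects.get(oid, b"")[offset:offset + length]
```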

  8. Distributed Workload
     - Ceph delegates some responsibility to the OSDs and reduces the dependency on the MDS
     - The MDS cluster manages the file system namespace and file operations
     - OSDs perform data access, data serialization, replication, and reliability tasks
     - Ceph also removes the allocation table and uses the CRUSH algorithm to dynamically map objects to storage devices
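A rough sketch of this split of responsibilities, using hypothetical `MDS` and `OSD` interfaces (not Ceph's actual client library): the client contacts the MDS only for the namespace operation, then computes which object to touch and talks to an OSD directly.

```python
# Assumed, minimal interfaces to illustrate the decoupled metadata/data path.

class MDS:
    def open(self, path):
        # Returns the inode number and striping layout; no block list is needed.
        return {"ino": 0x1234, "stripe_size": 4 * 2**20}

class OSD:
    def write(self, oid, offset, data): ...
    def read(self, oid, offset, length): ...

def client_write(mds, osd_for, path, offset, data):
    meta = mds.open(path)                    # metadata operation goes to the MDS
    ono = offset // meta["stripe_size"]      # which stripe object in the file
    oid = (meta["ino"], ono)
    # data I/O goes directly to the OSD chosen by a placement function
    osd_for(oid).write(oid, offset % meta["stripe_size"], data)

# Usage with a trivial placement function (always the same OSD):
client_write(MDS(), lambda oid: OSD(), "/foo/bar", 6 * 2**20, b"hello")
```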

  9. Distributed Workload
     - GFS, on the other hand, stores file mapping information in the memory of the master
     - The master maintains the file system metadata: namespaces, file-to-chunk mappings, and the locations of replicas
     - It also performs chunk lease management, garbage collection, and chunk replication and migration
     - Q2. "Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery." Does GFS have a file allocation table? Who is responsible for managing "data access, update serialization, replication and reliability, failure detection, and recovery" in GFS?

  10. Distributed Object Storage
     - To store a named object, the client only needs the pool and the object ID
     - Ceph hashes the object ID
     - Ceph takes the hash modulo the number of PGs (e.g. 58) to get the PG number
     - Ceph looks up the pool ID for the given pool name (e.g. "juventus" = 4)
     - Ceph prepends the pool ID to get the full PG ID (e.g. 4.58)
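A minimal sketch of this calculation. The hash function (MD5 here), the pool table, and the PG count are stand-ins chosen for illustration; Ceph's real implementation uses its own hash and pool metadata.

```python
import hashlib

# Illustrative object -> placement-group mapping, following the steps on the
# slide. Pool table, PG count, and hash are assumptions, not Ceph's internals.

POOLS = {"juventus": 4}     # pool name -> pool id (example value from the slide)
PG_NUM = 128                # number of PGs in the pool

def pg_id(pool_name, object_name):
    pool_id = POOLS[pool_name]                                 # "juventus" -> 4
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16) # hash the object id
    pg = h % PG_NUM                                            # modulo the # of PGs
    return f"{pool_id}.{pg}"                                   # prepend pool id, e.g. "4.58"

print(pg_id("juventus", "match-highlights.mp4"))
```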

  11. Find Data
     - First, the file is striped into several objects
     - Objects are mapped into PGs using a hash function and an adjustable bit mask that controls the number of PGs
     - Each OSD holds on the order of 100 PGs to balance OSD utilization
     - PGs are then mapped to OSDs via CRUSH
     - To locate an object, CRUSH requires only the PG ID and the cluster map
     - Q4. "Figure 3: Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs via CRUSH, a specialized replica placement function." Describe how to find the data associated with an inode and an in-file object number ("ino, ono").
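The full lookup path from (ino, ono) to a list of OSDs can be sketched as below. The stripe size, PG count, hash functions, and the stand-in for CRUSH (a simple rendezvous-style ranking) are illustrative assumptions, not Ceph's actual algorithms.

```python
import hashlib

# Sketch of the Q4 lookup path: file offset -> (ino, ono) -> PG -> OSDs.

STRIPE_SIZE = 4 * 2**20      # bytes per object (configurable in Ceph)
PG_NUM = 128                 # number of PGs (power of two, so a bit mask works)

def file_offset_to_oid(ino, offset):
    ono = offset // STRIPE_SIZE          # which stripe object holds this byte
    return (ino, ono)                    # oid = (ino, ono)

def oid_to_pg(oid):
    ino, ono = oid
    h = int(hashlib.md5(f"{ino}.{ono}".encode()).hexdigest(), 16)
    return h & (PG_NUM - 1)              # hash + adjustable bit mask -> PG number

def pg_to_osds(pgid, cluster_map, num_replicas=3):
    # Stand-in for CRUSH: deterministic, driven only by the PG id and the
    # cluster map, so any party can compute it without a lookup table.
    ranked = sorted(cluster_map,
                    key=lambda osd: hashlib.md5(f"{pgid}.{osd}".encode()).digest())
    return ranked[:num_replicas]

oid = file_offset_to_oid(ino=0x10000001234, offset=9 * 2**20)   # -> (ino, 2)
pg = oid_to_pg(oid)
print(pg_to_osds(pg, cluster_map=[f"osd.{i}" for i in range(10)]))
```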

  12. Distributed Object Storage
     - CRUSH is introduced to remove the mapping table, which would require significant memory and overhead to keep consistent
     - With CRUSH, any entity can calculate an object's location, and the map needs to be updated only infrequently
     - A mapping that relies on block or object list metadata has several drawbacks:
       - Distribution-related metadata must be exchanged among the parties
       - Upon removal of a node, the block/object list must be made consistent again, requiring many changes
     - With CRUSH, the PG is simply remapped to a new OSD, and the PG ID is found dynamically
     - The same approach also helps with data rebalancing and with adding new OSD nodes
     - This removes the dependency on per-object metadata about the underlying storage nodes
     - Q5. Does a mapping method (from an object number to its hosting storage server) relying on "block or object list metadata" (a table listing all object-server mappings) work as well? What is its drawback?
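A back-of-the-envelope illustration of the memory drawback of an explicit object list; the object count and bytes-per-entry are assumptions for the sake of the example, not figures from the paper.

```python
# Why a per-object location table scales poorly (numbers are illustrative).

objects = 10**9              # assume a billion objects in the cluster
bytes_per_entry = 32         # assume object id + replica list per table entry

table_size_gib = objects * bytes_per_entry / 2**30
print(f"lookup table: ~{table_size_gib:.0f} GiB that must be stored, "
      "distributed, and kept consistent on every cluster change")

# With a computed mapping (CRUSH), clients only need the small cluster map;
# locations are recalculated on demand, so only the map must be distributed.
```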

  13. Placement Group
     - A PG aggregates a series of objects into a group and maps the group to a series of OSDs
     - Tracking per-object placement and metadata is highly expensive
     - PGs reduce the number of processes and the amount of per-object metadata to track when storing and retrieving data
     - There are other advantages to having a logical placement-group layer on top of the OSD cluster:
       - Placement rules can be applied to specific PGs belonging to a pool
       - It is easy to express distribution policies such as SSD group / HDD group, or same rack / different rack
       - OSDs can self-report and monitor peers within the same PG, which reduces the load on the monitors
     - Mapping an oid directly to OSDs would lose these benefits
     - Q6. Why are placement groups (PGs) introduced? Can we construct a hash function mapping an object ("oid") directly to a list of OSDs?
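One way to see why a direct oid-to-OSD hash is problematic: with `hash(oid) % num_osds`, nearly every object is remapped when the OSD count changes, whereas the PG indirection keeps each object's group fixed and moves only whole PGs. The hash and object count in the sketch below are illustrative.

```python
import hashlib

# Sketch for Q6: cost of hashing objects straight to OSDs.

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

objects = [f"obj-{i}" for i in range(10_000)]

def direct(oid, num_osds):
    return h(oid) % num_osds          # naive oid -> OSD mapping

moved = sum(direct(o, 10) != direct(o, 11) for o in objects)
print(f"direct hashing: {moved / len(objects):.0%} of objects move "
      "when growing from 10 to 11 OSDs")

# With PGs, an object's PG never changes (hash(oid) % pg_num); only the
# PG -> OSD mapping is recomputed, and a placement function such as CRUSH
# moves only the PGs that actually need to rebalance.
```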

  14. CRUSH Function
     - Determines how to store and retrieve data by computing data storage locations
     - CRUSH takes a placement group, the cluster map, and placement rules as input
     - It produces the list of OSDs to which each PG is mapped
     - The cluster map can also encode placement constraints, e.g. placing each PG's replicas so as to reduce inter-row replication traffic and minimize exposure to power or switch failures
     - Q7. What are the inputs of the CRUSH hash function? What can be included in an OSD cluster map? [Hint: read the last paragraph of Section 5.1 for the second question.]
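A simplified, illustrative placement function in the spirit of CRUSH (this is not the real CRUSH algorithm, which uses weighted hierarchical buckets): it takes a PG ID, a cluster map, and a replica-spreading rule, and deterministically returns a list of OSDs, one per rack.

```python
import hashlib

# Illustrative CRUSH-like placement: inputs are a PG id, a cluster map
# (rack -> OSDs; weights omitted), and the rule "one replica per rack".

CLUSTER_MAP = {
    "rack-a": ["osd.0", "osd.1", "osd.2"],
    "rack-b": ["osd.3", "osd.4", "osd.5"],
    "rack-c": ["osd.6", "osd.7", "osd.8"],
}

def score(pgid, item):
    # Deterministic pseudo-random score so every party computes the same result.
    return int(hashlib.md5(f"{pgid}/{item}".encode()).hexdigest(), 16)

def place(pgid, cluster_map, replicas=3):
    # Spread replicas across racks to limit exposure to a rack/switch failure,
    # then pick one OSD deterministically inside each chosen rack.
    racks = sorted(cluster_map, key=lambda r: score(pgid, r))[:replicas]
    return [max(cluster_map[r], key=lambda o: score(pgid, o)) for r in racks]

print(place("4.58", CLUSTER_MAP))
```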
