  1. Scalable File Storage. Jeff Chase, Duke University.

  2. Why a shared network file service? • Data sharing across people and their apps – common name space (/usr/project/stuff … ) • Resource sharing – fluid mapping of data to storage resources – incremental scalability – diversity of demands, not predictable in advance – statistical multiplexing, central limit theorem • Obvious? – how is this different from opendht?

  3. Network File System (NFS) [Figure: client and server kernel stacks (user programs, syscall layer, VFS), with the NFS client and local UFS under the client’s VFS and the NFS server and UFS under the server’s, connected by the network.] Virtual File System (VFS) enables pluggable file system implementations as OS kernel modules (“drivers”).
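A minimal sketch of the pluggable-filesystem idea on this slide, in Python rather than kernel C; the class and mount-table names are made up for illustration and are not the Linux VFS API:

```python
# Sketch of a VFS-style pluggable file system layer (illustrative names only).
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Common interface the syscall layer programs against."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalFS(FileSystem):
    """Stands in for a local on-disk file system (e.g., UFS)."""
    def __init__(self):
        self.files = {}
    def read(self, path): return self.files.get(path, b"")
    def write(self, path, data): self.files[path] = data

class RemoteFS(FileSystem):
    """Stands in for an NFS-like client that forwards operations to a server."""
    def __init__(self, server): self.server = server
    def read(self, path): return self.server.read(path)          # would be an RPC
    def write(self, path, data): self.server.write(path, data)   # would be an RPC

# "Mount table": the VFS picks an implementation by path prefix.
mount_table = {"/local": LocalFS(), "/nfs": RemoteFS(LocalFS())}

def vfs_lookup(path: str) -> FileSystem:
    for prefix, fs in mount_table.items():
        if path.startswith(prefix):
            return fs
    raise FileNotFoundError(path)

vfs_lookup("/nfs/home/alice").write("/nfs/home/alice", b"hello")
print(vfs_lookup("/nfs/home/alice").read("/nfs/home/alice"))
```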

  4. Google File System (GFS) • SOSP 2003 • Foundation for data-intensive parallel cluster computing at Google – MapReduce OSDI 2004, 2000+ cites • Client access by RPC library, through kernel system calls (via FUSE) • Uses Chubby lock service for consensus – e.g., Master election • Hadoop HDFS is an “open-source GFS”

  5. Google File System (GFS) Similar : Hadoop HDFS, p-NFS, many other parallel file systems. A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
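The control/data split on this slide can be sketched as follows; all names are illustrative, since the real GFS client library is not public:

```python
# Sketch of a GFS-like read path: small metadata RPC to the master,
# bulk data fetched directly from a chunkserver.
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as in GFS

class Master:
    """Holds metadata only: per-file chunk maps and replica locations."""
    def __init__(self, chunk_map, replica_map):
        self.chunk_map = chunk_map        # path -> [chunk handles]
        self.replica_map = replica_map    # chunk handle -> [chunkserver addresses]

    def lookup(self, path, chunk_index):
        handle = self.chunk_map[path][chunk_index]
        return handle, self.replica_map[handle]

def read(master, chunkservers, path, offset, length):
    """Client: ask the master where the chunk lives, then read data directly.
    Assumes the read stays within a single chunk."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(path, chunk_index)   # metadata RPC
    data = chunkservers[replicas[0]][handle]               # bulk data from one replica
    start = offset % CHUNK_SIZE
    return data[start:start + length]

master = Master({"/big/file": ["c1"]}, {"c1": ["cs-a", "cs-b", "cs-c"]})
chunkservers = {addr: {"c1": b"hello world"} for addr in ["cs-a", "cs-b", "cs-c"]}
print(read(master, chunkservers, "/big/file", 6, 5))   # b'world'
```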

  6. GFS (or HDFS) and MapReduce • Large files • Streaming access (sequential) • Parallel access • Append-mode writes • Record-oriented • (Sorting.)

  7. MapReduce: Example. Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of a task so a slow node does not limit the job.
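A toy, single-process word count illustrates the programming model the slide refers to; the real framework adds partitioning, distribution, retries, and speculative (backup) tasks:

```python
# Tiny in-process MapReduce sketch (word count).
from collections import defaultdict

def map_fn(_, line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):   # reduce phase
            out[rk] = rv
    return out

print(mapreduce([(0, "the quick brown fox"), (1, "the lazy dog")], map_fn, reduce_fn))
```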

  8. HDFS Architecture

  9. GFS Architecture Separate data (chunks) from metadata (names etc.). Centralize the metadata; spread the chunks around.

  10. Chunks • Variable size, up to 64MB • Stored as a file, named by a handle • Replicated on multiple nodes, e.g., x3 – chunkserver == datanode • Master caches chunk maps – per-file chunk map: what chunks make up a file – chunk replica map: which nodes store each chunk
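A small sketch of the two maps mentioned above, assuming plain Python dicts; the actual in-memory structures at the master are more elaborate:

```python
# Sketch of the master's per-file chunk map and chunk replica map.
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # per-file chunk map: file name -> ordered list of chunk handles
    file_chunks: dict = field(default_factory=dict)
    # chunk replica map: chunk handle -> set of chunkservers holding a replica
    chunk_replicas: dict = field(default_factory=dict)

    def chunks_of(self, path):
        return self.file_chunks.get(path, [])

    def replicas_of(self, handle):
        return self.chunk_replicas.get(handle, set())

md = MasterMetadata()
md.file_chunks["/logs/a"] = ["c17", "c18"]
md.chunk_replicas["c17"] = {"cs-1", "cs-4", "cs-9"}   # 3x replication
print(md.replicas_of(md.chunks_of("/logs/a")[0]))
```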

  11. GFS Architecture • Single master, multiple chunkservers What could go wrong?

  12. Single master • From distributed systems we know this is a: – Single point of failure – Scalability bottleneck • GFS solutions: – Shadow masters – Minimize master involvement • never move data through it, use only for metadata – and cache metadata at clients • large chunk size • master delegates authority to primary replicas in data mutations (chunk leases) • Simple, and good enough!

  13. GFS Read

  14. GFS Read

  15. GFS Write The client asks the master for a list of replicas, and which replica holds the lease to act as primary. If no one has a lease, the master grants a lease to a replica it chooses. ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed).
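A sketch of just the lease-lookup step quoted above; names and the choice of primary are illustrative:

```python
# The client asks the master for the replica list and the current primary
# (lease holder); if no replica holds a lease, the master grants one.
import time

LEASE_SECONDS = 60

class Master:
    def __init__(self, replica_map):
        self.replica_map = replica_map     # chunk handle -> [chunkservers]
        self.leases = {}                   # chunk handle -> (primary, expiry)

    def primary_for(self, handle):
        replicas = self.replica_map[handle]
        primary, expiry = self.leases.get(handle, (None, 0))
        if primary is None or time.time() >= expiry:
            primary = replicas[0]          # master grants a lease to a replica it chooses
            self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary, replicas

master = Master({"c42": ["cs-2", "cs-5", "cs-7"]})
primary, replicas = master.primary_for("c42")
print(primary, replicas)   # client then pushes data to all replicas and
                           # sends the write request to the primary
```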

  16. GFS Write

  17. GFS Write

  18. GFS writes: control and data flow

  19. Google File System (GFS) Similar : Hadoop HDFS, p-NFS, many other parallel file systems. A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.

  20. GFS Scale

  21. GFS: leases • Primary must hold a “lock” on its chunks. • Use leased locks to tolerate primary failures. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. … Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
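A sketch of the timing described in the excerpt (60-second initial timeout, extensions while mutations continue); the extension here is a direct method call standing in for the piggybacked HeartBeat exchange:

```python
# Sketch of lease grant, extension, and expiry at the master.
import time

LEASE_TIMEOUT = 60.0

def _now(now):
    return time.time() if now is None else now

class Lease:
    def __init__(self, holder, now=None):
        self.holder = holder
        self.expiry = _now(now) + LEASE_TIMEOUT

    def valid(self, now=None):
        return _now(now) < self.expiry

    def extend(self, now=None):
        """Master extends the lease when a heartbeat carries an extension request."""
        self.expiry = _now(now) + LEASE_TIMEOUT

lease = Lease("cs-5", now=0.0)       # granted at t=0, expires at t=60
lease.extend(now=45.0)               # heartbeat at t=45 while mutations continue
print(lease.valid(now=100.0))        # True: extended to t=105
print(lease.valid(now=110.0))        # False: expired; master may grant a new lease
```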

  22. Leases (leased locks) • A lease is a grant of ownership or control for a limited time. • The owner/holder can renew or extend the lease. • If the owner fails, the lease expires and is free again. • The lease might end early. – lock service may recall or evict – holder may release or relinquish

  23. A lease service in the real world [Figure: clients A and B each acquire the lease in turn; the service grants it to A, which performs x=x+1 and releases, then grants it to B, which performs x=x+1.]

  24. Leases and time • The lease holder and lease service must agree when a lease has expired. – i.e., that its expiration time is in the past – Even if they can’t communicate! • We all have our clocks, but do they agree? – synchronized clocks • For leases, it is sufficient for the clocks to have a known bound on clock drift. – |T(Ci) – T(Cj)| < ε – Build slack time > ε into the lease protocols as a safety margin.
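One way to sketch the safety margin, assuming a drift bound ε and a chosen slack larger than ε (both values here are made up):

```python
# The holder stops using the lease a bit before its own clock says expiry,
# so even with clock drift bounded by EPSILON the service never regrants
# the lease while the old holder is still acting on it.
EPSILON = 0.5      # assumed bound on clock drift between holder and service (seconds)
SLACK = 1.0        # built-in slack, chosen > EPSILON

def holder_may_act(holder_now, lease_expiry):
    # Holder stops SLACK seconds before expiry on its own clock.
    return holder_now < lease_expiry - SLACK

def service_may_regrant(service_now, lease_expiry):
    # Service waits until its own clock is past expiry before reissuing.
    return service_now >= lease_expiry

# Even if the holder's clock lags the service's by EPSILON, the holder has
# already stopped (SLACK > EPSILON) by the time the service regrants.
print(holder_may_act(holder_now=99.6, lease_expiry=100.0))          # False
print(service_may_regrant(service_now=100.1, lease_expiry=100.0))   # True
```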

  25. A network partition [Figure: a crashed router cuts the network in two.] A network partition is any event that blocks all message traffic between subsets of nodes.

  26. Never two kings at once [Figure: the same acquire/grant/release sequence as before, now asking what goes wrong if A and B could both hold the lease and update x at the same time.]

  27. Lease callbacks/recalls • GFS master recalls primary leases to give the master control for metadata operations – rename – snapshots ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires … . ...When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.....

  28. GFS: who is the primary? • The master tells clients which chunkserver is the primary for a chunk. • The primary is the current lease owner for the chunk. • What if the primary fails? – Master gives lease to a new primary. – The client’s answer may be cached and may be stale. Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file.
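A sketch of such a client-side location cache with a timeout and a purge on open; the TTL value and method names are illustrative:

```python
# Client-side chunk location cache: entries expire after a timeout,
# and reopening a file purges that file's entries.
import time

CACHE_TTL = 60.0   # illustrative timeout

class ChunkLocationCache:
    def __init__(self):
        self.entries = {}   # (path, chunk_index) -> (replica locations, expiry time)

    def put(self, path, chunk_index, locations):
        self.entries[(path, chunk_index)] = (locations, time.time() + CACHE_TTL)

    def get(self, path, chunk_index):
        entry = self.entries.get((path, chunk_index))
        if entry is None:
            return None                    # miss: ask the master
        locations, expiry = entry
        if time.time() >= expiry:
            del self.entries[(path, chunk_index)]
            return None                    # timed out: ask the master again
        return locations                   # may still point at a stale replica

    def purge_file(self, path):
        """Called on open(): drop all cached chunk info for this file."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != path}

cache = ChunkLocationCache()
cache.put("/big/file", 0, ["cs-a", "cs-b", "cs-c"])
print(cache.get("/big/file", 0))
cache.purge_file("/big/file")
print(cache.get("/big/file", 0))   # None: purged on reopen
```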

  29. Lease sequence number • Each lease has a lease sequence number. – Master increments it when it issues a lease. – Client and replicas get it from the master. • Use it to validate that a replica is up to date before accessing the replica. – If a replica fails/disconnects, its lease number lags. – Easy to detect by comparing lease numbers. • The lease sequence number is a common technique. In GFS it is called the chunk version number.

  30. GFS chunk version number • In GFS, the sequence number for a chunk lease acts as a chunk version number. • Master passes it to the replicas after issuing a lease, and to the client in the chunk handle. – If a replica misses updates to a chunk, its version falls behind. – Client checks for stale chunks on reads. – Replicas report chunk versions to master: master reclaims any stale chunks, creates new replicas. Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas … before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks...
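A sketch of the staleness check, assuming each replica stores a (version, data) pair per chunk and the client carries the version it got from the master:

```python
# The client refuses to read from a replica whose chunk version lags
# behind the version the master handed out with the chunk handle.
def read_from_replica(replica, handle, expected_version):
    version, data = replica[handle]
    if version < expected_version:
        raise RuntimeError("stale replica: missed mutations while unavailable")
    return data

up_to_date = {"c7": (4, b"fresh data")}
stale      = {"c7": (3, b"old data")}     # missed a lease grant while offline

print(read_from_replica(up_to_date, "c7", expected_version=4))
try:
    read_from_replica(stale, "c7", expected_version=4)
except RuntimeError as e:
    print(e)
```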

  31. Consistency

  32. GFS consistency model Records appended atomically at least once. Easy. … but file may contain duplicates and/or padding. Primary chooses an arbitrary total order for concurrent writes. Writes that cross chunk boundaries are not atomic: the two chunk primaries may choose different orders. Anything can happen if writes fail: a failed write may succeed at some replicas but not others. Reads from different replicas may return different results.
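One common application-level response to at-least-once append (the GFS paper suggests writers include unique identifiers in records) is for readers to skip padding and drop duplicates; this sketch assumes records are (id, payload) pairs and padding shows up as None:

```python
# Reader-side handling of "at least once" record append:
# skip padding left by failed/retried appends, drop duplicate records by id.
def read_records(chunk_records):
    seen, out = set(), []
    for rec in chunk_records:
        if rec is None:               # padding from a failed append
            continue
        rec_id, payload = rec
        if rec_id in seen:            # duplicate from a retried append
            continue
        seen.add(rec_id)
        out.append(payload)
    return out

appended = [("r1", b"alpha"), None, ("r2", b"beta"), ("r2", b"beta"), ("r3", b"gamma")]
print(read_records(appended))   # [b'alpha', b'beta', b'gamma']
```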

  33. PSM: a closer look • The following slides are from Roxana Geambasu – Summer internship at Microsoft Research – Now at Columbia • Goal: specify consistency and failure formally • For primary/secondary/master (PSM) systems – GFS – BlueFS (MSR scalable storage behind Hotmail etc.) • The study is useful to understand where PSM protocols differ, and the implications.

  34. GFS Master: • Maintains replica group config. • Monitoring • Reconfiguration • Recovery [Figure: the client issues reads and writes; writes go to the primary, which forwards them to the other replicas and returns ACK or error; reads return a value or error.] [Roxana Geambasu]
