The Google File System


  1. The Google File System Presented by: Alexa Leal

  2. Architecture – the basic idea
     Question:
     1. GFS does not cache file data. Why does this design choice not lead to performance loss?

  3. Single Master
     • Clients never read or write file data through the master; they contact it only for metadata
     • Metadata is kept in memory
     • The master records metadata changes in an operation log
     • The master communicates with chunkservers through HeartBeat messages
     Questions:
     1. What is the benefit of having only a single master? What is its potential performance risk? How does GFS minimize that risk?
     2. Why is the GFS master able to keep the metadata in memory?

  4. Chunks & Chunkservers
     • Chunks are 64 MB
     • Chunkservers exchange file data directly with clients
     • Chunkservers keep track of their own chunks and report them to the master (HeartBeat)
     • New chunks are allocated with lazy space allocation
     Questions:
     1. How does GFS work with the chunkserver's local file system to store file chunks? What is lazy space allocation, and what is its benefit?
     2. How does chunkserver communication help improve the system's performance?
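
Because every chunk has the same fixed size, a client can work out which chunk a byte offset falls in before it contacts anyone. The sketch below illustrates that arithmetic only; the function names `locate` and `chunks_touched` are hypothetical and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def locate(offset):
    """Return (chunk index, offset inside that chunk) for a file byte offset."""
    return divmod(offset, CHUNK_SIZE)


def chunks_touched(offset, length):
    """Return the range of chunk indexes covered by a read of `length` bytes."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

A read that straddles a chunk boundary touches two chunks, so the client would need the locations of both.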

  5. Chunk Leases & Mutations
     • A mutation changes the contents of a chunk, such as a write or an append
     • A lease, with an initial timeout of 60 seconds, lets one replica (the primary) keep a consistent mutation order across all replicas of a chunk
     • Slide figure: example of a write
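
The lease idea can be sketched as two small pieces: a lease that expires after 60 seconds, and a primary that hands out serial numbers so every replica applies mutations in the same order. This is a minimal illustration under those assumptions; the class and function names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field

LEASE_SECONDS = 60.0  # initial lease timeout from the slide


@dataclass
class Lease:
    primary: str        # chunkserver currently holding the lease
    granted_at: float   # time the master granted it, in seconds

    def is_valid(self, now):
        return now - self.granted_at < LEASE_SECONDS


@dataclass
class Primary:
    next_serial: int = 0
    log: list = field(default_factory=list)

    def order(self, mutation):
        """Assign the next serial number to a mutation and record it."""
        serial = self.next_serial
        self.next_serial += 1
        self.log.append((serial, mutation))
        return serial


def apply_in_order(mutations):
    """Replica side: apply (serial, mutation) pairs in serial-number order."""
    return [mutation for _, mutation in sorted(mutations)]
```

Even if two mutations reach a secondary replica in a different order, sorting by the primary's serial numbers gives the same result on every replica.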

  6. Atomic Record Appends
     • The client specifies only the data; GFS chooses the offset, appends the data, and returns that offset to the client
     • A record cannot cross a chunk boundary, so an append cannot exceed the chunk size
     • If the append fails, the client has to retry the operation
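
A rough sketch of the append rule follows: the chunk, not the caller, picks the offset, and a record that does not fit in the remaining space is rejected so the client can retry. Padding the rest of the chunk before asking for a retry follows the GFS paper's description rather than the slide; `Chunk` and `ChunkFull` are hypothetical names.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


class ChunkFull(Exception):
    """Tells the client to retry the append on the next chunk."""


class Chunk:
    def __init__(self, size=CHUNK_SIZE):
        self.size = size
        self.used = 0  # bytes already occupied in this chunk

    def record_append(self, data):
        """Append `data` at an offset chosen here and return that offset."""
        if len(data) > self.size:
            raise ValueError("record is larger than a chunk")
        if self.used + len(data) > self.size:
            self.used = self.size  # pad the remainder of the chunk
            raise ChunkFull
        offset = self.used
        self.used += len(data)
        return offset
```

The caller never passes an offset, which is what lets many clients append to the same file concurrently without coordinating among themselves.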

  7. Snapshot
     • Makes a copy almost instantaneously
     • The master duplicates its metadata for the source files
     • The snapshot points to the same chunks as the source files
     • Used to make branch copies
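
One way to picture why a snapshot is nearly instantaneous is copy-on-write: only the metadata is duplicated, and a chunk is copied later, when someone first writes to it. The sketch below assumes reference counting per chunk, as the GFS paper describes; the `Master` class and its method names are hypothetical.

```python
class Master:
    def __init__(self):
        self.files = {}       # file name -> list of chunk handles
        self.refcount = {}    # chunk handle -> number of files pointing at it
        self.next_handle = 0

    def _new_handle(self):
        handle = self.next_handle
        self.next_handle += 1
        self.refcount[handle] = 1
        return handle

    def create(self, name, n_chunks):
        self.files[name] = [self._new_handle() for _ in range(n_chunks)]

    def snapshot(self, src, dst):
        """Duplicate metadata only; both files share the same chunks."""
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def prepare_write(self, name, index):
        """Give `name` a private copy of a shared chunk before it is written."""
        handle = self.files[name][index]
        if self.refcount[handle] > 1:
            self.refcount[handle] -= 1
            handle = self._new_handle()  # stands in for copying the chunk data
            self.files[name][index] = handle
        return handle
```

After a snapshot, the first write to a shared chunk triggers the copy, so the snapshot keeps pointing at the old data.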

  8. Chunk creation, re-replication, rebalancing
     • The master creates chunk replicas for these three reasons
     • Creation: the master chooses where to place the initially empty replicas
     • Re-replication: the master re-replicates a chunk when the number of available replicas falls below a specified goal
     • Rebalancing: the master periodically moves replicas to balance disk space and load
     Questions:
     1. What are the criteria for choosing where to place the initially empty replicas?
     2. When a new chunkserver is added to the system, the master mostly uses chunk rebalancing rather than filling it up with newly written chunks. Why?
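
Re-replication can be sketched as a priority queue: chunks that are furthest below their replication goal are handled first. The default goal of three replicas and the priority rule follow the GFS paper; the function name and the tie-break by chunk name are assumptions made for this example.

```python
REPLICATION_GOAL = 3  # default number of replicas per chunk in the GFS paper


def rereplication_queue(live_replicas, goal=REPLICATION_GOAL):
    """Return under-replicated chunks, most urgent first.

    `live_replicas` maps a chunk name to its number of available replicas.
    """
    missing = {c: goal - n for c, n in live_replicas.items() if n < goal}
    # Sort by how many replicas are missing (descending), then by name.
    return sorted(missing, key=lambda c: (-missing[c], c))
```

A chunk with one live replica is closer to being lost than a chunk with two, so it goes to the front of the queue.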

  9. Garbage Collection
     • Any replica not known to the master is garbage
     • A deleted file is renamed to a hidden name; the master removes hidden files that have existed for more than 3 days
     Question:
     1. How are files and chunks deleted? What are the advantages of delayed space reclamation (garbage collection) over eager deletion?
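
The two garbage-collection rules above can be written as two small checks. This is only a sketch: the function names are hypothetical, timestamps are plain seconds, and the three-day grace period is hard-coded here for simplicity.

```python
GRACE_SECONDS = 3 * 24 * 3600  # hidden files are kept for 3 days


def expired_hidden_files(hidden, now):
    """Return hidden files whose grace period has passed.

    `hidden` maps a hidden file name to the time it was deleted.
    """
    return sorted(name for name, deleted_at in hidden.items()
                  if now - deleted_at > GRACE_SECONDS)


def orphaned_chunks(reported, known):
    """Return chunks a chunkserver reports that the master does not know."""
    return set(reported) - set(known)
```

Until the grace period ends, a hidden file is still present in the metadata, which is what makes an accidental deletion reversible.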

  10. Stale Replica Detection
      • A chunk replica becomes stale if its chunkserver fails and misses mutations while it is down
      • The master removes stale replicas during garbage collection, when chunk version numbers do not match
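
Version numbers make staleness easy to detect. In the GFS paper the master increases a chunk's version number each time it grants a new lease and informs the replicas that are up; a replica that was down keeps the old number. The sketch below assumes that scheme, and `ChunkRecord` is a hypothetical name.

```python
class ChunkRecord:
    """Master-side bookkeeping for one chunk."""

    def __init__(self, servers):
        self.version = 1
        # chunkserver name -> version number that replica holds
        self.replicas = {server: self.version for server in servers}

    def grant_lease(self, live_servers):
        """Bump the version and update only the replicas that are reachable."""
        self.version += 1
        for server in live_servers:
            self.replicas[server] = self.version

    def stale(self):
        """Return replicas whose version is behind the master's."""
        return sorted(s for s, v in self.replicas.items() if v < self.version)
```

A chunkserver that was down during `grant_lease` reports the old version when it restarts, and the mismatch marks its replica for garbage collection.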

  11. Fault tolerance & Diagnosis
      • Fast recovery and replication
      • Data integrity through checksums
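
Checksum-based integrity checking can be illustrated by splitting a chunk into blocks and storing one checksum per block. The GFS paper uses 64 KB blocks with a 32-bit checksum each; using CRC-32 from Python's `zlib` module is an assumption made here, since the slide does not name the checksum function.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the GFS paper


def checksums(data, block_size=BLOCK_SIZE):
    """Return one CRC-32 value per block of `data`."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def corrupted_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return the indexes of blocks whose checksum differs from `expected`."""
    actual = checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

Checking per block rather than per chunk means a reader only has to verify the blocks it actually reads, and a mismatch pinpoints which part of the replica is damaged.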

  12. Conclusion
      • Optimized for huge files that are mostly appended to and then read sequentially
      • Component failures are treated as the norm
      • Fault tolerance through constant monitoring, replication, and recovery

  13. Questions?
