  1. File Systems and Storage Marco Serafini COMPSCI 532 Lecture 14

  2.

  3. Why GFS?
     • Store “the web” and other very large datasets
     • Peculiar requirements
       • Huge files
       • Files can span multiple servers
       • Coarse-granularity blocks to keep metadata manageable
     • Failures
       • Many servers → many failures
     • Workload
       • Concurrent append-only writes, reads mostly sequential
     • Q: Why is this workload common in a search engine?

  4. Design Choices
     • Focus on analytics
       • Optimized for bandwidth, not latency
     • Weak consistency
       • Supports multiple concurrent appends to a file
       • Best-effort attempt to guarantee atomicity of each append
       • Minimal attempts to “fix” state after failures
       • No locks
     • How to deal with weak consistency
       • Application-level mechanisms to deal with inconsistent data
       • Clients cache only metadata

  5. Implementation
     • Distributed layer on top of Linux servers
     • Use the local Linux file system to actually store data

  6. Master-Slave Architecture
     • Master
       • Keeps file and chunk metadata (e.g. mapping to chunkservers)
       • Failure detection of chunkservers
     • Procedure (see the sketch below)
       • Client contacts the master to get metadata (small size)
       • Client contacts chunkserver(s) to get data (large size)
       • The master is not a bottleneck
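A minimal sketch of this read procedure, assuming hypothetical RPC stubs (`master.lookup_chunk`, `read_chunk`) and the 64 MB chunk size; the real GFS client library is more involved (metadata caching, retries, reads spanning chunks).

```python
CHUNK_SIZE = 64 * 1024 * 1024   # GFS chunks are large and fixed-size (64 MB)

def gfs_read(master, chunkservers, filename, offset, length):
    """Read `length` bytes of `filename` starting at byte `offset`."""
    # 1. Small metadata request to the master: which chunk, on which servers?
    chunk_index = offset // CHUNK_SIZE
    chunk_handle, replica_addrs = master.lookup_chunk(filename, chunk_index)
    # (A real client caches this mapping so later reads skip the master.)

    # 2. Large data request goes directly to a chunkserver, not to the master.
    chunk_offset = offset % CHUNK_SIZE
    for addr in replica_addrs:                 # try replicas in turn
        try:
            return chunkservers[addr].read_chunk(chunk_handle, chunk_offset, length)
        except ConnectionError:
            continue                           # unreachable replica: try the next
    raise IOError(f"no reachable replica for chunk {chunk_handle!r}")
```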

  7. Architecture
     • [Architecture diagram]

  8. Advantages of Large Chunks
     • Small metadata
       • All metadata fits in memory at the master → no bottleneck
       • Clients cache lots of metadata → low load on the master
     • Batching when transferring data

  9. Master Metadata
     • Persisted data
       • File and chunk namespaces
       • File-to-chunks mapping
       • Operation log
         • Stored externally for fault tolerance
       • Q: Why not simply restart the master from scratch?
         • This is what MapReduce does, after all
     • Non-persisted data: location of chunks
       • Fetched at startup from the chunkservers
       • Updated periodically
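A rough sketch of this split; the class and method names are invented for illustration. The point from the slide: the namespace and the file-to-chunks mapping must survive via the operation log, while chunk locations can simply be re-asked of the chunkservers.

```python
class MasterState:
    def __init__(self):
        # Persisted via the operation log (survives master restarts):
        self.namespace = {}        # path -> file metadata
        self.file_to_chunks = {}   # path -> list of chunk handles

        # NOT persisted: rebuilt by asking chunkservers what they store,
        # then refreshed by periodic heartbeats.
        self.chunk_locations = {}  # chunk handle -> set of chunkserver addresses

    def recover(self, log, chunkservers):
        for record in log.replay():            # redo namespace/mapping mutations
            record.apply(self)
        for cs in chunkservers:                # chunk locations: just ask
            for handle in cs.list_chunks():
                self.chunk_locations.setdefault(handle, set()).add(cs.address)
```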

  10. Operation Log
      • Persists the master’s state
      • Memory-mapped file
      • The log is a WAL (write-ahead log) – we will discuss WALs later
      • Trimmed using checkpoints

  11. Chunkserver Replication
      • Mutations are sent to all replicas
      • One replica is the primary for a lease (a time interval)
        • Within that lease, it totally orders mutations and sends the order to the backups
        • After the old lease expires, the master assigns a new primary
      • Separation of data and control flow
        • Data dissemination to all replicas (data flow)
        • Ordering through the primary (control flow)

  12. Replication Protocol
      • Client
        • Finds the replicas and the primary (1, 2)
        • Disseminates data to the chunkservers (3)
        • Contacts the primary replica for ordering (4)
      • Primary
        • Determines the write offset and persists the write to disk
        • Sends the offset to the backups (5)
      • Backups
        • Apply the write and ack back to the primary (6)
      • Primary
        • Acks back to the client (7)
      • Q: Quorums?
      • Q: Primary election and recovery?
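A simplified sketch of steps 4–7 from the primary’s point of view; the data push (step 3) and the pipelining between chunkservers are omitted, and the class and method names are assumptions, not GFS code.

```python
class PrimaryChunkserver:
    """Toy primary replica: picks the append offset and imposes it on backups."""

    def __init__(self, backups):
        self.backups = backups          # backup chunkserver stubs
        self.chunk = bytearray()        # local copy of the chunk

    def record_append(self, data):
        # Steps 4-5: the primary chooses the offset, applies the write locally,
        # then tells every backup to apply the write at the SAME offset.
        offset = len(self.chunk)
        self.chunk.extend(data)
        acks = [b.apply_at(offset, data) for b in self.backups]   # step 6
        if all(acks):
            return ("ok", offset)       # step 7: ack the client
        # A failed backup leaves the replicas inconsistent at this offset;
        # GFS does not repair it here -- the client simply retries the append.
        return ("retry", offset)
```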

  13. Weak Consistency
      • In the presence of failures
        • There can be inconsistencies (e.g. a failed backup)
        • The client simply retries the write
      • A successful write (acknowledged back to the client) is
        • Atomic: all data written
        • Consistent: same offset at all replicas
        • This is because the primary proposes a specific offset
      • A file contains
        • Stretches of “good” data from successful writes
        • Stretches of “dirty” data: inconsistent and/or duplicated data

  14. Implications for Applications
      • Applications must deal with inconsistency (see the sketch below)
        • Add checksums to data to detect dirty writes
        • Add unique record ids to detect duplication
        • Atomic file renaming after finishing a write (single writer)
      • More difficult to program!
      • But “good enough” for this use case
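As a concrete illustration of the first two bullets, here is a sketch of record framing with a length, a unique id, and a checksum, so that a reader can skip dirty regions and drop duplicated appends. The byte layout is invented for this example.

```python
import struct, zlib

def encode_record(record_id: int, payload: bytes) -> bytes:
    header = struct.pack(">QI", record_id, len(payload))       # id + length
    checksum = zlib.crc32(header + payload)
    return header + payload + struct.pack(">I", checksum)

def scan_records(buf: bytes):
    """Yield (record_id, payload) for valid records, skipping dirty bytes."""
    seen, pos = set(), 0
    while pos + 12 <= len(buf):
        record_id, length = struct.unpack_from(">QI", buf, pos)
        end = pos + 12 + length + 4
        if end <= len(buf):
            payload = buf[pos + 12:pos + 12 + length]
            (checksum,) = struct.unpack_from(">I", buf, pos + 12 + length)
            if checksum == zlib.crc32(buf[pos:pos + 12 + length]):
                if record_id not in seen:          # drop duplicated appends
                    seen.add(record_id)
                    yield record_id, payload
                pos = end
                continue
        pos += 1   # dirty or truncated data: resynchronize byte by byte
```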

  15. Other Semantics Beyond FSs
      • Object store (e.g. AWS S3)
        • Originally conceived for web objects
        • Write-once objects
        • Offset reads
        • Often offers data replication
      • Block store (e.g. AWS EBS)
        • Mounted locally like a remote volume
        • Typically accessed using a file system
        • Not replicated

  16. Data Structures for Storage

  17. Storing Tables
      • How good are B+ trees?
        • Q: Are they good for reading? Why?
        • Q: Are they good for writing? Why?

  18. Log-Structured Merge Trees
      • Popular data structure for key-value stores
        • Bigtable, HBase, RocksDB, LevelDB
      • Goals
        • Fast data ingestion
        • Leverage large memory for caching
      • Problems
        • Write and read amplification

  19. LSMT Data Structures
      • Memtable
        • Binary tree or skiplist → sorted by key
        • Receives writes and serves reads
        • Persistence through a Write-Ahead Log (WAL)
      • Log files (runs) arranged over multiple levels
        • L0: dump of the memtable
        • Li: merge of multiple Li-1 runs
      • Goal: make disk accesses sequential
        • Writes are sequential
        • Merges of sorted data are sequential
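A minimal sketch of the memtable/WAL pair, assuming a plain dict in place of the skiplist or binary tree and a one-JSON-record-per-line WAL format (both are simplifications).

```python
import json, os

class MemTable:
    def __init__(self, wal_path):
        self.table = {}                       # key -> value (in memory)
        self.wal = open(wal_path, "a")        # write-ahead log on disk

    def put(self, key, value):
        # 1. Persist the update in the WAL first (durability)...
        self.wal.write(json.dumps({"op": "put", "k": key, "v": value}) + "\n")
        self.wal.flush(); os.fsync(self.wal.fileno())
        # 2. ...then apply it in memory (serves subsequent reads).
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

    def flush_to_l0(self, run_path):
        # Dump the memtable in sorted key order: this becomes an L0 run.
        with open(run_path, "w") as run:
            for key in sorted(self.table):
                run.write(json.dumps([key, self.table[key]]) + "\n")
        self.table.clear()
```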

  20. Write Operations
      • Store updates instead of modifying in place
        • New writes go to the memtable
        • Periodically write the memtable to L0 in sorted key order
      • When level Li becomes too large, merge its runs
        • Take two Li runs and merge them (sequential)
        • Create a new Li+1 run
        • Iterate if needed (Li+1 full)
      • Runs at each level store overlapping keys
      • Each level has fewer and larger runs
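The merge step is sketched below: two sorted Li runs are combined into one Li+1 run with a single sequential pass. Runs are modeled as in-memory lists of (key, value) pairs sorted by key, and the first argument is assumed to be the newer run.

```python
def merge_runs(newer, older):
    """Merge two runs sorted by key; on duplicate keys the newer value wins."""
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i]); i += 1
        elif newer[i][0] > older[j][0]:
            out.append(older[j]); j += 1
        else:                              # same key: keep the newer value
            out.append(newer[i]); i += 1; j += 1
    out.extend(newer[i:]); out.extend(older[j:])
    return out

# merge_runs([("a", 2), ("c", 9)], [("a", 1), ("b", 5)])
# -> [("a", 2), ("b", 5), ("c", 9)]
```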

  21. Read Operations
      • Search the memtables and read caches (if available)
      • If not found, search the runs level by level
        • Bloom filters and indices in each run
        • Binary search in each run or index
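A sketch of this read path, walking the runs from newest to oldest. The Bloom filter is a toy two-hash-function version and the runs are in-memory sorted lists; both stand in for the on-disk structures.

```python
import bisect, hashlib

class BloomFilter:
    def __init__(self, nbits=1024):
        self.nbits, self.bits = nbits, 0

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(digest[i:i + 4], "big") % self.nbits for i in (0, 4)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def lsm_get(key, memtable, runs):
    """`runs`: newest-to-oldest list of (bloom, sorted list of (key, value))."""
    if key in memtable:
        return memtable[key]
    for bloom, entries in runs:
        if not bloom.might_contain(key):
            continue                                   # definitely not in this run
        i = bisect.bisect_left(entries, (key,))        # binary search within the run
        if i < len(entries) and entries[i][0] == key:
            return entries[i][1]                       # newest version wins
    return None
```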

  22. Leveled LSMTs (e.g. RocksDB)
      • Differences with the standard LSMT
        • Fixed number of runs per level, increasing for lower levels
        • From L1 downwards, every run stores a partition of the keys
      • Goals
        • Split the cost of merging
        • Reads only need to access one run per level
      • New merge process
        • Take two Li runs and merge them with the relevant Li+1 runs
        • Create new Li+1 runs to replace the merged ones
        • If a new run is too large, split it and create a new Li+1 run
        • Iterate if needed (Li+1 full)
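A sketch of this leveled merge process, reusing `merge_runs` from the previous sketch: a run is pushed down, merged only with the overlapping Li+1 runs, and the result is re-split into fixed-size runs. Pushing a single run at a time and the toy run size are simplifications.

```python
MAX_RUN_LEN = 4   # toy run size (entries); real systems use size limits in MB

def key_range(run):
    return run[0][0], run[-1][0]            # (smallest key, largest key)

def compact_into_next_level(run, next_level):
    """Merge `run` (from Li) into `next_level`, a list of non-overlapping runs."""
    lo, hi = key_range(run)
    overlapping = [r for r in next_level
                   if not (key_range(r)[1] < lo or hi < key_range(r)[0])]
    untouched = [r for r in next_level if r not in overlapping]

    merged = run                             # entries coming from Li are newer
    for other in sorted(overlapping, key=key_range):
        merged = merge_runs(merged, other)

    # Re-split into fixed-size runs; together they still partition the key space.
    new_runs = [merged[i:i + MAX_RUN_LEN] for i in range(0, len(merged), MAX_RUN_LEN)]
    return sorted(untouched + new_runs, key=key_range)
```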

  23. Providing Durability

  24. Write-Ahead Log
      • Goals
        • Atomicity: transactions are all or nothing
        • Durability (persistence): completed transactions are not lost
      • Principle
        • Append modifications to a log on disk
        • Then apply them
      • After a crash
        • Can redo transactions that committed
        • Can undo transactions that did not commit
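A toy redo/undo recovery pass over such a log, assuming each update record carries the transaction id, the key, and both the old and new values, and that a separate "commit" record marks completed transactions. This is a deliberately simplified version of what a real WAL protocol does.

```python
def recover(log_records, store):
    """Leave `store` reflecting exactly the transactions that committed."""
    committed = {r["txn"] for r in log_records if r["type"] == "commit"}

    # Redo pass: reapply the updates of committed transactions in log order.
    for r in log_records:
        if r["type"] == "update" and r["txn"] in committed:
            store[r["key"]] = r["new"]

    # Undo pass: roll back uncommitted updates in reverse log order.
    for r in reversed(log_records):
        if r["type"] == "update" and r["txn"] not in committed:
            store[r["key"]] = r["old"]
    return store

log = [
    {"type": "update", "txn": 1, "key": "x", "old": 0, "new": 5},
    {"type": "commit", "txn": 1},
    {"type": "update", "txn": 2, "key": "y", "old": 7, "new": 9},  # never committed
]
print(recover(log, {"x": 0, "y": 7}))   # -> {'x': 5, 'y': 7}
```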

  25. Example: WAL in LSMTs
      • Transactions
        • Create/Read/Update/Delete (CRUD) on a key-value pair
        • Append C/U/D operations to the WAL
      • Trimming the WAL?
        • Execute a checkpoint
        • All operations reflected in the checkpoint are removed from the WAL
      • Recovery?
        • Read from the checkpoint, then re-execute the operations in the WAL
      • ARIES: WAL in DBMSs (more complex than this)
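A sketch of checkpointing and recovery for the MemTable/WAL sketch from slide 19: flushing the memtable to a run acts as the checkpoint, after which the WAL can be truncated, and recovery re-executes only the operations logged after the last checkpoint. File handling is simplified.

```python
import json

def checkpoint(memtable, run_path):
    memtable.flush_to_l0(run_path)   # all logged operations are now in the run...
    memtable.wal.truncate(0)         # ...so the WAL can be trimmed
    memtable.wal.seek(0)

def recover_memtable(memtable, wal_path):
    """Re-execute the C/U/D operations logged after the last checkpoint."""
    with open(wal_path) as wal:
        for line in wal:
            op = json.loads(line)
            if op["op"] == "delete":
                memtable.table.pop(op["k"], None)
            else:                    # create / update / put
                memtable.table[op["k"]] = op["v"]
    return memtable                  # older data is served from the on-disk runs
```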
