File Systems and Storage
Marco Serafini
COMPSCI 532, Lecture 14
Why GFS?
• Store “the web” and other very large datasets
• Peculiar requirements
  • Huge files
    • Files can span multiple servers
    • Coarse-granularity blocks to keep metadata manageable
  • Failures
    • Many servers → many failures
  • Workload
    • Concurrent append-only writes, reads mostly sequential
    • Q: Why is this workload common in a search engine?
Design Choices
• Focus on analytics
  • Optimized for bandwidth, not latency
• Weak consistency
  • Supports multiple concurrent appends to a file
  • Best-effort attempt to guarantee atomicity of each append
  • Minimal attempts to “fix” state after failures
  • No locks
• How to deal with weak consistency
  • Application-level mechanisms to deal with inconsistent data
  • Clients cache only metadata
Implementation
• Distributed layer on top of Linux servers
• Uses the local Linux file system to actually store data
Master-Slave Architecture
• Master
  • Keeps file chunk metadata (e.g. mapping to chunkservers)
  • Failure detection of chunkservers
• Procedure (sketched below)
  • Client contacts the master to get metadata (small size)
  • Client contacts chunkserver(s) to get data (large size)
  • Master is not a bottleneck
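A minimal sketch of this procedure, using hypothetical in-process classes (`Master`, `Chunkserver`, `lookup`, `read`) rather than the real GFS RPC interface. The point is that the master answers only with small metadata, and the client fetches bulk data directly from a chunkserver.

```python
class Master:
    def __init__(self):
        # file name -> list of (chunk handle, replica chunkserver ids)
        self.files = {"/logs/web.0": [("chunk-42", ["cs1", "cs2", "cs3"])]}

    def lookup(self, filename, chunk_index):
        return self.files[filename][chunk_index]   # small metadata answer

class Chunkserver:
    def __init__(self):
        # chunk handle -> chunk bytes, stored on the local Linux file system
        self.chunks = {"chunk-42": b"example chunk contents"}

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

master = Master()
chunkservers = {cs: Chunkserver() for cs in ["cs1", "cs2", "cs3"]}

# Client: (1) ask the master for metadata, (2) read data from one replica.
handle, replicas = master.lookup("/logs/web.0", 0)
data = chunkservers[replicas[0]].read(handle, 0, 1024)
print(data)
```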
Architecture
[Figure: GFS architecture, showing the client, the master, and the chunkservers]
Advantages of Large Chunks
• Small metadata
  • All metadata fits in memory at the master → no bottleneck
  • Clients cache lots of metadata → low load on the master
• Batching when transferring data
Master Metadata
• Persisted data
  • File and chunk namespaces
  • File-to-chunks mapping
  • Operation log
    • Stored externally for fault tolerance
  • Q: Why not simply restart the master from scratch?
    • This is what MapReduce does, after all
• Non-persisted data: location of chunks (see the sketch below)
  • Fetched at startup from chunkservers
  • Updated periodically
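A sketch of how the master could rebuild the non-persisted chunk-location map at startup, assuming each chunkserver simply reports the chunk handles it stores. The data shapes and function name are made up for illustration, not the real protocol.

```python
def rebuild_chunk_locations(chunkservers):
    """chunkservers: mapping of chunkserver id -> iterable of chunk handles."""
    locations = {}  # chunk handle -> list of chunkserver ids holding it
    for cs_id, handles in chunkservers.items():
        for handle in handles:          # each chunkserver reports what it stores
            locations.setdefault(handle, []).append(cs_id)
    return locations

# Example: three chunkservers report their local chunks at startup.
print(rebuild_chunk_locations({
    "cs1": ["chunk-42", "chunk-43"],
    "cs2": ["chunk-42"],
    "cs3": ["chunk-42", "chunk-43"],
}))
# {'chunk-42': ['cs1', 'cs2', 'cs3'], 'chunk-43': ['cs1', 'cs3']}
```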
Operation Log
• Persists state
• Memory-mapped file
• The log is a WAL (write-ahead log); we will discuss WALs later in this lecture
• Trimmed using checkpoints
Chunkserver Replication
• Mutations are sent to all replicas
• One replica is primary for a lease (a time interval)
  • Within that lease, it totally orders mutations and sends them to the backups
  • After the old lease expires, the master assigns a new primary (sketch below)
• Separation of data and control flow
  • Data dissemination to all replicas (data flow)
  • Ordering through the primary (control flow)
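A sketch of the lease bookkeeping the master might keep. The 60-second lease length and all class and method names are illustrative assumptions of this sketch.

```python
import time

LEASE_SECONDS = 60   # illustrative lease length

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary id, lease expiry time)

    def grant(self, handle, replica_ids):
        primary = replica_ids[0]       # pick any live replica as primary
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary

    def current_primary(self, handle, replica_ids):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if time.time() < expiry:
            return primary                        # lease still valid
        return self.grant(handle, replica_ids)    # expired: assign a new primary

leases = LeaseTable()
print(leases.current_primary("chunk-42", ["cs1", "cs2", "cs3"]))   # cs1
```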
Replication Protocol (steps sketched below)
• Client
  • Finds replicas and primary (1, 2)
  • Disseminates data to chunkservers (3)
  • Contacts the primary replica for ordering (4)
• Primary
  • Determines the write offset and persists it to disk
  • Sends the offset to the backups (5)
• Backups
  • Apply the write and ack back to the primary (6)
• Primary
  • Acks to the client (7)
• Q: Quorums?
• Q: Primary election and recovery?
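A sketch of steps (3) to (7): the client pushes data to every replica (data flow), then the primary picks the write offset and imposes that order on the backups (control flow). In-process objects replace RPCs, disks and failures are ignored, and all names are illustrative assumptions.

```python
class Replica:
    def __init__(self):
        self.buffered = None          # data pushed by the client (step 3)
        self.chunk = bytearray()      # this replica's copy of the chunk

    def push_data(self, data):        # step 3: data flow to all replicas
        self.buffered = data

    def apply(self, offset, data):    # steps 5-6: order imposed by the primary
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        return "ack"

def record_append(primary, backups, data):
    for replica in [primary] + backups:
        replica.push_data(data)                            # step 3
    offset = len(primary.chunk)                            # step 4: primary picks the offset
    primary.apply(offset, data)
    acks = [b.apply(offset, b.buffered) for b in backups]  # steps 5-6
    # Step 7: ack the client only if the primary and all backups succeeded.
    return offset if all(a == "ack" for a in acks) else None

primary, backups = Replica(), [Replica(), Replica()]
print(record_append(primary, backups, b"log record"))      # 0 (offset of the append)
```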
Weak Consistency
• In the presence of failures
  • There can be inconsistencies (e.g. a failed backup)
  • The client simply retries the write
• A successful write (acknowledged back to the client) is
  • Atomic: all data is written
  • Consistent: same offset at all replicas
    • This is because the primary proposes a specific offset
• A file contains
  • Stretches of “good” data from successful writes
  • Stretches of “dirty” data: inconsistent and/or duplicate data
Implications for Applications
• Applications must deal with inconsistency (sketch below)
  • Add checksums to data to detect dirty writes
  • Add unique record ids to detect duplication
  • Atomic file renaming after finishing a write (single writer)
• More difficult to program!
• But “good enough” for this use case
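A sketch of such application-level framing: each record carries a unique id (so duplicates from client retries can be dropped) and a checksum (so dirty or partially written regions can be skipped). The record format here is an assumption for illustration, not a GFS format.

```python
import struct, uuid, zlib

# Each record: 4-byte length, 4-byte CRC32, 16-byte unique id, payload.
def encode_record(payload: bytes) -> bytes:
    rid = uuid.uuid4().bytes                       # unique record id
    crc = zlib.crc32(rid + payload)
    return struct.pack(">II", len(payload), crc) + rid + payload

def decode_records(data: bytes):
    seen, pos = set(), 0
    while pos + 24 <= len(data):
        length, crc = struct.unpack(">II", data[pos:pos + 8])
        rid = data[pos + 8:pos + 24]
        payload = data[pos + 24:pos + 24 + length]
        if len(payload) == length and zlib.crc32(rid + payload) == crc:
            if rid not in seen:                    # drop duplicates from retries
                seen.add(rid)
                yield payload
            pos += 24 + length                     # good record: skip past it
        else:
            pos += 1                               # dirty bytes: resynchronize

# Example: a retried (duplicate) append and some dirty bytes are filtered out.
rec1 = encode_record(b"rec1")
blob = rec1 + rec1 + b"\xff" * 7 + encode_record(b"rec2")
print(list(decode_records(blob)))                  # [b'rec1', b'rec2']
```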
Other Semantics Beyond FSs
• Object store (e.g. AWS S3)
  • Originally conceived for web objects
  • Write-once objects
  • Offset reads
  • Often offers data replication
• Block store (e.g. AWS EBS)
  • Mounted locally like a remote volume
  • Typically accessed using a file system
  • Not replicated
Data Structures for Storage
Storing Tables
• How good are B+ trees?
  • Q: Are they good for reading? Why?
  • Q: Are they good for writing? Why?
Log-Structured Merge Trees
• Popular data structure for key-value stores
  • Bigtable, HBase, RocksDB, LevelDB
• Goals
  • Fast data ingestion
  • Leverage large memory for caching
• Problems
  • Write and read amplification
LSMT Data Structures
• Memtable (sketched below)
  • Binary tree or skiplist → sorted by key
  • Receives writes and serves reads
  • Persistence through a write-ahead log (WAL)
• Log files (runs) arranged over multiple levels
  • L0: dump of the memtable
  • Li: merge of multiple Li-1 runs
• Goal: make disk accesses sequential
  • Writes are sequential
  • Merges of sorted data are sequential
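A minimal memtable-plus-WAL sketch. A plain dict stands in for the skiplist (it is sorted only at flush time), and the file layout and names are illustrative assumptions.

```python
import json, os

class Memtable:
    def __init__(self, wal_path="wal.log"):
        self.table = {}                 # key -> value (sorted when flushed)
        self.wal = open(wal_path, "a")

    def put(self, key, value):
        # Write-ahead: persist to the log first, then apply in memory.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

    def flush_to_l0(self, run_path):
        # Dump the memtable as a sorted run: an L0 file, written sequentially.
        with open(run_path, "w") as f:
            for k in sorted(self.table):
                f.write(json.dumps({"k": k, "v": self.table[k]}) + "\n")
        self.table.clear()

mt = Memtable()
mt.put("user:1", "alice")
mt.put("user:2", "bob")
mt.flush_to_l0("L0-run-0.json")
```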
Write Operations
• Store updates instead of modifying in place
• New writes go to the memtable
• Periodically write the memtable to L0 in sorted key order
• When level Li becomes too large, merge its runs (sketch below)
  • Take two Li runs and merge them (sequential)
  • Create a new Li+1 run
  • Iterate if needed (Li+1 full)
• Runs at each level store overlapping keys
• Each level has fewer, larger runs
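A sketch of the merge step, assuming runs are in-memory lists of (key, value) pairs sorted by key. The convention that the newer run wins on duplicate keys is an assumption of this sketch.

```python
import heapq

def merge_runs(newer, older):
    merged, last_key = [], object()
    # heapq.merge streams both sorted inputs sequentially; with a key
    # function it is stable, so on ties the entry from `newer` comes first.
    for key, value in heapq.merge(newer, older, key=lambda kv: kv[0]):
        if key != last_key:            # keep only the newest version of a key
            merged.append((key, value))
            last_key = key
    return merged

# Example: the newer value of "b" survives the merge.
print(merge_runs([("a", 1), ("b", 9)], [("b", 2), ("c", 3)]))
# [('a', 1), ('b', 9), ('c', 3)]
```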
Read Operations
• Search memtables and read caches (if available)
• If not found, search runs level by level (sketch below)
  • Bloom filters and indices in each run
  • Binary search in each run or index
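A sketch of the per-run read path: a small Bloom filter to skip runs that cannot contain the key, then binary search in the sorted run. The filter parameters and class names are illustrative assumptions.

```python
import bisect, hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                              # bitmap packed into an int

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

class Run:
    def __init__(self, sorted_pairs):
        self.keys = [k for k, _ in sorted_pairs]
        self.values = [v for _, v in sorted_pairs]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):      # definitely not in this run
            return None
        i = bisect.bisect_left(self.keys, key)     # binary search in the run
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

def lsm_get(memtable, levels, key):
    # Search the memtable first, then runs level by level (newest first).
    if key in memtable:
        return memtable[key]
    for level in levels:
        for run in level:
            value = run.get(key)
            if value is not None:
                return value
    return None

# Example: the key is found in an L1 run after missing the memtable and L0.
levels = [[Run([("b", 2)])], [Run([("a", 1), ("c", 3)])]]
print(lsm_get({}, levels, "c"))                    # 3
```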
Leveled LSMTs (e.g. RocksDB)
• Differences from the standard LSMT
  • Fixed number of runs per level, increasing for lower levels
  • From L1 downwards, every run stores a partition of the key space
• Goals
  • Split the cost of merging
  • Reads only need to access one run per level
• New merge process (sketch below)
  • Take an Li run and merge it with the relevant Li+1 runs
  • Create new Li+1 runs to replace the merged ones
  • If a new run is too large, split it and create a new Li+1 run
  • Iterate if needed (Li+1 full)
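A sketch of one leveled-compaction step, reusing `merge_runs` from the previous sketch. The per-run size limit and the in-memory data shapes are illustrative assumptions; real systems bound runs by bytes and work on files.

```python
MAX_RUN_ENTRIES = 4   # toy per-run size limit

def compact_into_next_level(li_run, li_plus_1_runs):
    lo, hi = li_run[0][0], li_run[-1][0]
    # Li+1 runs are key-partitioned: only those overlapping [lo, hi] take part.
    overlapping = [r for r in li_plus_1_runs if r[0][0] <= hi and r[-1][0] >= lo]
    untouched = [r for r in li_plus_1_runs if r not in overlapping]
    # Merge the Li run (newer) with all overlapping Li+1 data (older).
    older = sorted(kv for r in overlapping for kv in r)
    merged = merge_runs(li_run, older)
    # Re-split the merged data into bounded, key-partitioned Li+1 runs.
    new_runs = [merged[i:i + MAX_RUN_ENTRIES]
                for i in range(0, len(merged), MAX_RUN_ENTRIES)]
    return sorted(untouched + new_runs, key=lambda r: r[0][0])

# Example: the L1 run overlaps only the first L2 run; the second is untouched.
l2 = [[("a", 0), ("b", 0)], [("x", 0), ("z", 0)]]
print(compact_into_next_level([("b", 1), ("c", 1)], l2))
# [[('a', 0), ('b', 1), ('c', 1)], [('x', 0), ('z', 0)]]
```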
Providing Durability
Write-Ahead Log
• Goals
  • Atomicity: transactions are all-or-nothing
  • Durability (persistence): completed transactions are not lost
• Principle (sketch below)
  • Append modifications to a log on disk
  • Then apply them
• After a crash
  • Can redo transactions that committed
  • Can undo transactions that did not commit
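A minimal WAL sketch for a key-value store: every update is appended and fsync'd before being applied, a commit marker makes the transaction durable, and recovery redoes only committed transactions (uncommitted ones are simply not replayed). The JSON-lines format and all names are assumptions of this sketch; real recovery protocols such as ARIES are far more involved.

```python
import json, os

class WriteAheadLog:
    def __init__(self, path="kv.wal"):
        self.path = path
        self.f = open(path, "a")

    def append(self, entry):
        self.f.write(json.dumps(entry) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())      # entry is on disk before we proceed

def commit(wal, store, txn_id, updates):
    for key, value in updates.items():
        wal.append({"txn": txn_id, "op": "put", "k": key, "v": value})
    wal.append({"txn": txn_id, "op": "commit"})    # the commit point
    store.update(updates)                          # apply only after logging

def recover(path="kv.wal"):
    entries = [json.loads(line) for line in open(path)]
    committed = {e["txn"] for e in entries if e["op"] == "commit"}
    store = {}
    for e in entries:      # redo committed transactions, skip the rest
        if e["op"] == "put" and e["txn"] in committed:
            store[e["k"]] = e["v"]
    return store

wal, store = WriteAheadLog(), {}
commit(wal, store, txn_id=1, updates={"x": 1, "y": 2})
print(recover())                                   # {'x': 1, 'y': 2}
```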
Example: WAL in LSMTs
• Transactions
  • Create/Read/Update/Delete (CRUD) on a key-value pair
  • Append CUD operations to the WAL
• Trimming the WAL?
  • Execute a checkpoint
  • All operations reflected in the checkpoint are removed from the WAL
• Recovery? (sketch below)
  • Read from the checkpoint, re-execute the operations in the WAL
• ARIES: WAL in DBMSs (more complex than this)
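A sketch of checkpoint-based trimming and of recovery from a checkpoint plus the WAL tail, reusing the JSON-lines format from the previous sketch; the file names are illustrative assumptions.

```python
import json

def checkpoint_and_trim(store, wal_path="kv.wal", ckpt_path="kv.ckpt"):
    with open(ckpt_path, "w") as f:
        json.dump(store, f)            # 1) persist the current state
    open(wal_path, "w").close()        # 2) all logged ops are now in the
                                       #    checkpoint, so truncate the WAL

def recover_with_checkpoint(wal_path="kv.wal", ckpt_path="kv.ckpt"):
    with open(ckpt_path) as f:
        store = json.load(f)           # start from the checkpoint
    entries = [json.loads(line) for line in open(wal_path)]
    committed = {e["txn"] for e in entries if e["op"] == "commit"}
    for e in entries:                  # re-execute committed ops logged after it
        if e["op"] == "put" and e["txn"] in committed:
            store[e["k"]] = e["v"]
    return store
```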