The Design and Implementation of the Warp Transactional Filesystem

Robert Escriva, Emin Gün Sirer, Cornell University. Symposium on Networked Systems Design and Implementation, March 18, 2016.


  1. The Design and Implementation of the Warp Transactional Filesystem
     Robert Escriva, Emin Gün Sirer, Cornell University. Symposium on Networked Systems Design and Implementation, March 18, 2016.

  2. Common Trends in Distributed Filesystems
     Compromises or limitations are often introduced in search of higher performance:
     - Weak guarantees: eventual consistency; “consistent, but undefined”
     - Narrow interfaces: writes must be sequential; concurrent writes prohibited
     - Unscalable design: full-bisection bandwidth; large “master” server

  3. Warp Transactional Filesystem (WTF)
     WTF represents a new design point in the space of distributed filesystems. It employs the file slicing abstraction to provide applications with strong guarantees and zero-copy filesystem interfaces.
     - Strong guarantees: transactionally access and modify the filesystem
     - Expanded interface: traditional POSIX APIs and new zero-copy APIs
     - Scalable design: avoids a centralized master or expensive network bottlenecks

  4. Zero-Copy File Slicing APIs
     Traditional APIs transfer bytes back and forth through the filesystem interface. File-slicing APIs deal in references to data already in the filesystem.
     - yank: obtain references to data in the filesystem; analogous to read
     - paste: write referenced data back to the filesystem; analogous to write
     - append: append referenced data to the end of a file; optimized for concurrency
     - concat: merge one or more files to create a new file; does not read or write data from the input files
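The four operations can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: `Ref`, `STORE`, `FILES`, and every function signature below are hypothetical names invented for illustration, not the actual WTF client API. The point it tries to show is that, after the initial write, yank, paste, append, and concat only rearrange references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference to bytes [start, end) of one immutable slice."""
    slice_id: str
    start: int
    end: int

    @property
    def length(self):
        return self.end - self.start


STORE = {}  # slice_id -> bytes; stands in for the storage servers
FILES = {}  # path -> ordered list of Ref; stands in for the metadata store


def write(path, data):
    """Traditional write: the only operation here that stores new bytes."""
    slice_id = f"slice{len(STORE)}"
    STORE[slice_id] = bytes(data)
    FILES.setdefault(path, []).append(Ref(slice_id, 0, len(data)))


def read(path):
    """Traditional read: follow each reference and copy the bytes out."""
    return b"".join(STORE[r.slice_id][r.start:r.end] for r in FILES[path])


def size(path):
    return sum(r.length for r in FILES[path])


def yank(path, offset, length):
    """Return references covering [offset, offset + length); no data is read."""
    refs, pos = [], 0
    for r in FILES[path]:
        lo, hi = max(offset, pos), min(offset + length, pos + r.length)
        if lo < hi:
            refs.append(Ref(r.slice_id, r.start + lo - pos, r.start + hi - pos))
        pos += r.length
    return refs


def paste(path, offset, refs):
    """Overwrite the file at `offset` with referenced data (offset <= size)."""
    n = sum(r.length for r in refs)
    head = yank(path, 0, offset)
    tail = yank(path, offset + n, max(0, size(path) - offset - n))
    FILES[path] = head + list(refs) + tail


def append(path, refs):
    """Add referenced data to the end of a file."""
    FILES.setdefault(path, []).extend(refs)


def concat(dst, sources):
    """Create `dst` from the references of the source files, in order."""
    FILES[dst] = [r for src in sources for r in FILES[src]]


write("/movie", b"AAAABBBBCCCC")
scene = yank("/movie", 8, 4)            # references to "CCCC"
paste("/movie", 0, scene)               # overwrite the first scene
append("/movie", yank("/movie", 4, 4))  # re-append "BBBB"
concat("/copy", ["/movie", "/movie"])
print(read("/movie"))  # b'CCCCBBBBCCCCBBBB'
print(len(STORE))      # 1: no new slice was stored after the first write
```

In this model only `write` and `read` touch `STORE`; the other four calls operate purely on the reference lists.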

  5. The File Slicing Abstraction
     The central abstraction is a slice: an immutable, byte-addressable, arbitrarily sized sequence of bytes. A file is represented by a sequence of slices that, when overlaid, comprise the file’s contents.
     [Figure: overlaid slices and the resulting file contents.]
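A minimal sketch of what "overlaid" means, assuming slices are recorded as (file offset, data) pairs in write order and that later slices cover earlier ones. The function name `overlay` and the zero-filled gaps are assumptions of this sketch. The three writes mirror the A, B, C example that the deck walks through on later slides, with one byte standing in for 1 MB.

```python
def overlay(entries):
    """Flatten slices, given as (file_offset, data) in write order, into bytes.

    Later entries are laid over earlier ones. Unwritten gaps read as zeros,
    which is an assumption of this sketch.
    """
    size = max((off + len(data) for off, data in entries), default=0)
    contents = bytearray(size)
    for off, data in entries:
        contents[off:off + len(data)] = data
    return bytes(contents)


# One byte stands in for 1 MB: A at 0 MB, B at 2 MB, then C at 1 MB.
writes = [(0, b"AA"), (2, b"BB"), (1, b"CC")]
print(overlay(writes))  # b'ACCB'
```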

  6. WTF Architecture
     [Diagram: end user, client application and library, metadata storage, storage servers.]

  7. WTF Architecture
     [Same diagram.] The metadata storage provides transactional operations over the metadata.

  8. WTF Architecture
     [Same diagram.] The client library extends these transactional guarantees to the end user.

  9. Slices and Slice Pointers
     [Diagram: slices A and B; chunks c1 and c2 on storage server s0, chunks c3 and c4 on storage server s1.]
     Slices reside on storage servers, while pointers to slices reside in HyperDex.

  10. Slices and Slice Pointers
      - Slice Pointer A: server s0, chunk c1, start 1,073,816,936, end 8,589,788,476
      - Slice Pointer B: server s1, chunk c4, start 10,737,389,932, end 13,958,442,063
      Slice pointers directly indicate a slice’s location in the system.
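As a rough illustration, a slice pointer can be modeled as a small record with the four fields shown on the slide. The class name, the `fetch` helper, and the reading of `start` and `end` as byte offsets into the chunk with an exclusive `end` are assumptions of this sketch; the slide does not spell them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlicePointer:
    server: str
    chunk: str
    start: int
    end: int  # assumed exclusive; the slide does not say

    @property
    def length(self):
        return self.end - self.start


def fetch(pointer, servers):
    """Resolve a pointer against hypothetical in-memory 'storage servers'."""
    chunk = servers[pointer.server][pointer.chunk]
    return chunk[pointer.start:pointer.end]


# The two pointers shown on the slide.
a = SlicePointer("s0", "c1", 1_073_816_936, 8_589_788_476)
b = SlicePointer("s1", "c4", 10_737_389_932, 13_958_442_063)
print(a.length, b.length)  # 7515971540 3221052131

# A tiny chunk, so the lookup can actually run.
servers = {"s0": {"c1": b"....hello...."}}
small = SlicePointer("s0", "c1", 4, 9)
print(fetch(small, servers))  # b'hello'
```

Because the pointer names the server, chunk, and byte range outright, resolving it needs no further lookup.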

  11. [Figure: a file with offsets 0 to 4 MB and a write cursor; storage server s0 with offsets 0 to 6 MB.]
      An empty file has no metadata and occupies no space on storage servers.

  12. [Figure: slice A covers file offsets 0 to 2 MB; metadata list: A @ 0 MB; A is stored on s0.]
      A 2 MB write writes to the storage servers and metadata.

  13. [Figure: A covers 0 to 2 MB and B covers 2 to 4 MB; metadata list: A @ 0 MB, B @ 2 MB; A and B are stored on s0.]
      Another 2 MB write.

  14. [Same figure.]
      WTF supports writes at arbitrary offsets within files.

  15. [Figure: C is laid over A and B at 1 to 3 MB; metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB; A, B, and C are stored on s0.]
      A 2 MB write that overwrites part of both prior writes.

  16. Metadata Compaction
      Compaction reduces the size of the metadata list by removing references to unused portions of slices. Because slice pointers directly reference the location of files, they can be modified in the metadata list using local computation. Consequently, compaction occurs entirely at the metadata level.

  17. [Figure: before compaction; metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB; A, B, and C are stored on s0.]

  18. [Figure: the same file after compaction.]
      Compaction eliminates references to overwritten or erased data.
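One way compaction could work on the metadata list alone is sketched below. It assumes the list holds (file offset, pointer) pairs in write order and that writes are non-empty; `compact` and the field names are hypothetical, and this is not presented as WTF's actual algorithm. Each pointer is trimmed by adjusting its `start` and `end`, so no storage server is contacted.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SlicePointer:
    server: str
    chunk: str
    start: int
    end: int  # assumed exclusive

    @property
    def length(self):
        return self.end - self.start


def compact(entries):
    """Reduce a write-ordered list of (file_offset, pointer) to visible extents."""
    visible = []
    for off, ptr in entries:
        end = off + ptr.length
        kept = []
        for o, p in visible:
            e = o + p.length
            if o < off:   # the part before the new write survives
                kept.append((o, replace(p, end=p.start + min(e, off) - o)))
            if e > end:   # the part after the new write survives
                s = max(o, end)
                kept.append((s, replace(p, start=p.start + s - o)))
        kept.append((off, ptr))
        visible = sorted(kept, key=lambda entry: entry[0])
    return visible


# Units are MB. A @ 0, B @ 2, C @ 1, stored back to back in chunk "c" on s0.
log = [
    (0, SlicePointer("s0", "c", 0, 2)),  # A
    (2, SlicePointer("s0", "c", 2, 4)),  # B
    (1, SlicePointer("s0", "c", 4, 6)),  # C
]
for off, p in compact(log):
    print(off, p.start, p.end)
# 0 0 1
# 1 4 6
# 3 3 4
```

After compaction, the pointers for A and B cover only the megabyte of each that is still visible; the overwritten ranges are no longer referenced.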

  19. Garbage Collection
      Garbage collection cleans up the slices no longer referenced by any slice pointer. WTF periodically scans the filesystem and collects all slice pointers. Storage servers use the scan, along with their local data, to determine which data is garbage.

  20. [Figure: the compacted file before garbage collection; A, B, and C are stored on s0.]

  21. [Figure: the same file after garbage collection.]
      Garbage is freed from the underlying filesystem.
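The storage-server side of garbage collection might look like the following sketch: given the scanned slice pointers and the size of a local chunk, every byte range that no pointer covers is garbage. The function name and arguments are hypothetical, and the sketch ignores practical concerns such as writes that arrive while the scan is running.

```python
from collections import namedtuple

SlicePointer = namedtuple("SlicePointer", "server chunk start end")


def garbage_ranges(server, chunk, chunk_size, scanned):
    """Return the [start, end) ranges of a chunk that no scanned pointer covers."""
    live = sorted(
        (p.start, p.end)
        for p in scanned
        if (p.server, p.chunk) == (server, chunk)
    )
    garbage, cursor = [], 0
    for start, end in live:
        if start > cursor:
            garbage.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chunk_size:
        garbage.append((cursor, chunk_size))
    return garbage


# Units are MB. A 6 MB chunk holds A, B, and C back to back; after compaction
# only [0, 1), [3, 4), and [4, 6) are still referenced.
scan = [
    SlicePointer("s0", "c", 0, 1),
    SlicePointer("s0", "c", 4, 6),
    SlicePointer("s0", "c", 3, 4),
    SlicePointer("s1", "c", 0, 6),  # belongs to another server
]
print(garbage_ranges("s0", "c", 6, scan))  # [(1, 3)]
```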

  22. Locality-Aware Slice Placement
      Locality-aware slice placement prevents fragmentation when writing sequentially. Slices placed contiguously on storage servers improve locality when reading files. Consistent hashing across storage servers in the system on a per-file basis increases the probability that sequentially written slices are adjacent. The metadata for adjacent slices may be represented in a more compact form.
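A generic consistent-hashing ring keyed by file path gives a feel for the per-file placement idea. This is a textbook sketch, not WTF's placement code: the `Ring` class, the use of SHA-256, and the 64 virtual nodes per server are all assumptions. Hashing the path rather than each individual write means successive writes to one file are directed to the same server, where they have a chance of landing next to each other.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Textbook consistent-hashing ring with virtual nodes."""

    def __init__(self, servers, vnodes=64):
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._points]

    def server_for(self, path):
        """Pick the first ring point clockwise from the hash of the path."""
        i = bisect.bisect(self._keys, _hash(path)) % len(self._keys)
        return self._points[i][1]


ring = Ring(["s0", "s1", "s2", "s3"])
first = ring.server_for("/logs/app.log")
second = ring.server_for("/logs/app.log")
print(first == second)  # True: every write to this file targets one server
```

A property of this construction is that removing one server only remaps the files that were assigned to it; all other files keep their server.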

  23. [Figure: A covers file offsets 0 to 2 MB and B covers 2 to 4 MB; metadata list: A @ 0 MB, B @ 2 MB; A and B are stored contiguously on s0.]
      Locality-aware slice placement reduces fragmentation.

  24. [Same figure, with slice pointers.]
      - Slice Pointer A (@ 0 MB): server s0, chunk c, start 0 MB, end 2 MB
      - Slice Pointer B (@ 2 MB): server s0, chunk c, start 2 MB, end 4 MB
      - Slice Pointer D (@ 0 MB): server s0, chunk c, start 0 MB, end 4 MB
      Adjacent slices may be represented by a new, merged slice pointer.

  25. [Figure: metadata list: D @ 0 MB; D covers file offsets 0 to 4 MB on s0.]
      The new slice pointer represents the contiguous range on the storage servers.
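The merge of A and B into D can be sketched as follows, assuming a metadata list of (file offset, pointer) pairs sorted by offset with no overlaps. The `merge_adjacent` helper is hypothetical. Two entries are merged only when they sit in the same chunk, are contiguous on the storage server, and are contiguous in the file.

```python
from collections import namedtuple

SlicePointer = namedtuple("SlicePointer", "server chunk start end")
MB = 2**20


def merge_adjacent(entries):
    """Merge neighbouring entries that are adjacent both in the file and on disk."""
    merged = []
    for off, p in entries:
        if merged:
            prev_off, q = merged[-1]
            same_chunk = (q.server, q.chunk) == (p.server, p.chunk)
            contiguous_on_disk = q.end == p.start
            contiguous_in_file = prev_off + (q.end - q.start) == off
            if same_chunk and contiguous_on_disk and contiguous_in_file:
                merged[-1] = (prev_off, q._replace(end=p.end))
                continue
        merged.append((off, p))
    return merged


a = SlicePointer("s0", "c", 0 * MB, 2 * MB)  # A @ 0 MB
b = SlicePointer("s0", "c", 2 * MB, 4 * MB)  # B @ 2 MB
d = merge_adjacent([(0, a), (2 * MB, b)])
print(d == [(0, SlicePointer("s0", "c", 0, 4 * MB))])  # True: this is D @ 0 MB
```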

  26. WTF Applications
      - MapReduce sort: concat enables an efficient bucket-based merge sort
      - Work queue: with append, units of work are appended to the file; all contention happens in the metadata layer
      - Video editor: yank and paste enable the editor to reorder scenes without rewriting the movie
      - FUSE bindings: transactional behavior exposed to the user for easy data exploration

  27. Application: MapReduce Sort
      [Diagram: input file, buckets, sorted buckets, output file.]

  28. Application: MapReduce Sort
      [Same diagram, annotated with WTF concat.]
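The pipeline in the diagram can be mimicked in a few lines. This is a single-process sketch with hypothetical helper names, in which in-memory lists stand in for bucket files: records are split into buckets by key range, each bucket is sorted on its own, and the output is the buckets joined in order. In WTF that last step is where concat would apply, since concat does not read or write data from the input files.

```python
def bucket_of(record, num_buckets):
    """Assign a non-empty record to a bucket by the range of its first byte."""
    return record[0] * num_buckets // 256


def concat(files):
    """Stand-in for a metadata-only concat: join the buckets in order."""
    return [item for f in files for item in f]


def bucket_sort(records, num_buckets=16):
    buckets = [[] for _ in range(num_buckets)]
    for record in records:                         # bucket phase
        buckets[bucket_of(record, num_buckets)].append(record)
    sorted_buckets = [sorted(b) for b in buckets]  # sort phase
    return concat(sorted_buckets)                  # merge phase


records = [b"pear", b"apple", b"zebra", b"Mango", b"fig"]
print(bucket_sort(records))
# [b'Mango', b'apple', b'fig', b'pear', b'zebra']
```

Because bucket boundaries follow key order, joining the sorted buckets in order already yields a globally sorted output, so no record-by-record merge is needed.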

  29. Application: MapReduce Sort
      [Bar chart: execution time (minutes), HDFS vs. WTF.]

  30. Application: MapReduce Sort
      [Bar chart: execution time (s) of the Bucket, Sort, and Merge phases, HDFS vs. WTF.]

  31. Application: Work Queue
      [Bar chart: throughput (ops/s), HDFS vs. WTF.]

  32. Application: Video Editor
      [Diagram: scenes in chronological order, rearranged into the final cut.]

  33. Application: Video Editor
      [Bar chart, logarithmic scale: execution time (s), HDFS vs. WTF.]
      WTF can rewrite 377 GB of raw movie footage in 16 s using file slicing, effectively 23 GB/s, as opposed to rewriting the footage using traditional APIs, which requires approximately three hours.
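The quoted rate follows from the two figures on the slide. Since 16 s and "approximately three hours" are rounded, the values below are approximate; the rate is "effective" because file slicing rearranges references instead of moving the footage.

```latex
\[
\frac{377\ \text{GB}}{16\ \text{s}} \approx 23.6\ \text{GB/s},
\qquad
\frac{377\ \text{GB}}{3 \times 3600\ \text{s}} \approx 0.035\ \text{GB/s} \approx 35\ \text{MB/s},
\qquad
\frac{3 \times 3600\ \text{s}}{16\ \text{s}} = 675 .
\]
```

On these rounded numbers, the file-slicing rewrite is several hundred times faster than the traditional one.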

  34. Application: Interactive Transactions
          # wtf begin-transaction
          # ls
          ./data.0000 ./data.0001 ./data.0002 ./data.0003 ....
          # rm -rf *
          # ls
          # wtf abort-transaction
          # ls
          ./data.0000 ./data.0001 ./data.0002 ./data.0003 ....
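The behavior in the session above (changes visible inside the transaction, then discarded on abort) can be imitated with a toy namespace that stages changes on a private copy. The `Namespace` class is hypothetical and says nothing about how WTF implements transactions; per the architecture slides, the metadata storage provides the transactional operations.

```python
class Namespace:
    """Toy directory listing with begin/abort/commit semantics."""

    def __init__(self, names):
        self._committed = set(names)
        self._staged = None  # private copy while a transaction is open

    def begin(self):
        self._staged = set(self._committed)

    def _view(self):
        return self._committed if self._staged is None else self._staged

    def ls(self):
        return sorted(self._view())

    def rm_all(self):
        self._view().clear()

    def abort(self):
        self._staged = None  # drop every staged change

    def commit(self):
        if self._staged is not None:
            self._committed, self._staged = self._staged, None


fs = Namespace(["data.0000", "data.0001", "data.0002", "data.0003"])
fs.begin()
fs.rm_all()
print(fs.ls())  # []
fs.abort()
print(fs.ls())  # ['data.0000', 'data.0001', 'data.0002', 'data.0003']
```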

  35. Microbenchmark: Baseline Performance
      [Bar chart: throughput (MB/s) of POSIX, HDFS, and WTF for Write, Read Seq., and Read Rand.]

  36. Microbenchmark: Write Sequential
      [Plot: throughput (MB/s) of HDFS and WTF versus block size, 64 B to 64 MB.]
