  1. DADI Block-Level Image Service for Agile and Elastic Application Deployment Huiba Li, Yifan Yuan, Rui Du, Kai Ma, Lanzheng Liu and Windsor Hsu Alibaba Group

  2. The Problem
• Container deployment (cold startup) is slow
  • Long-tail latency reaches 10s of minutes
  • The essential reasons are image downloading and unpacking
  • Only 6.4% [Slacker] of the image is used for startup
  • A regression to a decade ago, when VM images were also downloaded to hosts
• P2P downloading [Dragonfly, Kraken, Borg, Tupperware, FID] is not enough
  • Addresses only one of the two causes (downloading, not unpacking)
  • Little effect for small clusters
• Slimming the images [DockerSlim, Cntr] is not universal
  • Hard to automatically find all dependencies for all applications
  • Hard to support ad-hoc operations

  3. Remote Image
• Remote image is the trend
  • [CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS, Cider]
  • Optionally with P2P transfer for large clusters
• The container image (tarball) is, however, NOT viable as a remote image
  • Designed for unpacking, not seekable
  • Hard to support advanced features, such as xattr, cross-layer reference, etc.
• We had better design a new format
• Type of image
  • File-system-based image?
  • Block-device-based image?

  4. Type of Image: Block!
• Block-Device-Based
  • Features: works together with a regular file system, such as ext4, and packs it into the image as a dependence; viable for containers, secure containers, and virtual machines
  • Existing systems: Cider (based on Ceph; almost no layering or advanced format features); layering is TODO
  • Complexity: low (stability ↑, optimization ↑, advanced features ↑)
  • Universality: the app can choose a best-match file system, e.g. NTFS for a Windows container on a Linux host
  • Security: small attack surface
  • Overall: needs the courage to walk alone
• File-System-Based
  • Features: provides a file-system interface directly; a "natural" extension of the container image; less mental friction (due to inertia and following the crowd)
  • Existing systems: CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS
  • Complexity: high (stability ↓, optimization ↓, advanced features ↓)
  • Universality: fixed features; may not match all applications
  • Security: large attack surface
  • Overall: technical advantage is insignificant

  5. Background: Layered Image of Container
(figure: image layers flow from the docker registry onto the host via download and untar)

  6. Background: Layered Image of Container
Each layer is a change set compared to the previous state (files added, modified, deleted); layers are read-only and shared.

  7. Background: Layered Image of Container
The container layer is a change set compared to the image (files added, modified, deleted); it is read-write and private. Each image layer is a change set compared to the previous state (files added, modified, deleted); image layers are read-only and shared.

  8. Background: Layered Image of Container
Usually the layers are stored in separate directories, and a merged view is created with the kernel module overlayfs. The container layer is a read-write, private change set compared to the image; each image layer is a read-only, shared change set compared to the previous state (files added, modified, deleted).

  9. Background: I/O Path
(figure: App Processes in the container read through overlayfs in kernel space, which merges the layer directories; Docker downloads, ungzips, and untars layers from the Registry in user space)

  10-13. DADI Remote Image
• A layered image format → Overlay Block Device
  • based on a virtual block device
  • works together with a regular file system, e.g. ext4
  • a general solution for the container ecosystem
• Compression → ZFile
  • with seekable, online decompression
• Scalability → P2P on-demand read in a tree-structured topology
  • peer-to-peer (P2P) transfer

  14. DADI I/O Path
(figure: App Processes in the container read through a regular file system (ext4, etc.) on a virtual block device, OverlayBD, in kernel space; the lsmd daemon in user space serves new layers, reading ZFile layer blobs via P2P RPC, or from downloaded layers stored on a local regular file system)

  15. Overlay Block Device
• Each layer is a change set of overwritten blocks
  • no concept of file or file system
  • 512-byte block size (granularity)
• An index for fast reading
  • variable-length (offset, length) entries to save memory by combining
  • non-overlapping entries sorted by logical offsets
  • range query by binary search
(figure: a pread(offset, length) is resolved against the index segments; covered ranges map to raw data in the layer blobs, and uncovered ranges are holes)
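
The range query described above can be sketched as follows. This is a minimal illustration, not DADI's actual implementation: `Segment` and `lookup` are hypothetical names, and `moffset` stands in for wherever a segment's data lives in a layer blob.

```python
import bisect
from typing import List, NamedTuple

class Segment(NamedTuple):
    offset: int   # logical offset on the block device
    length: int   # length of the range covered
    moffset: int  # mapped offset of the data inside the layer blob

def lookup(index: List[Segment], offset: int, length: int):
    """Resolve a pread(offset, length) against a sorted, non-overlapping index.

    Yields (offset, length, moffset) for mapped ranges and
    (offset, length, None) for holes (unwritten ranges that read as zeros).
    """
    end = offset + length
    # Binary search for the first segment that could overlap the request.
    i = bisect.bisect_right([s.offset for s in index], offset) - 1
    if i < 0 or index[i].offset + index[i].length <= offset:
        i += 1
    pos = offset
    while pos < end and i < len(index):
        s = index[i]
        if s.offset > pos:                      # gap before next segment: a hole
            n = min(s.offset, end) - pos
            yield (pos, n, None)
            pos += n
            continue
        skip = pos - s.offset                   # partial overlap at the front
        n = min(s.length - skip, end - pos)
        yield (pos, n, s.moffset + skip)
        pos += n
        i += 1
    if pos < end:                               # trailing hole
        yield (pos, end - pos, None)
```

Because the entries are non-overlapping and sorted, one binary search plus a linear walk covers any request.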

  16. Index Merge
Merging the indices of stacked layers combines their segments into a single sorted, non-overlapping index; the upper layer's entries take precedence, and adjacent entries are combined.
(chart: merged index size of production images grows with layer depth; at depth ~45 the merged index holds ~4.5K segments, i.e. 4.5K × 16 bytes = 72KB)
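
A two-layer merge, where the upper layer's segments win and only the uncovered parts of lower segments survive, can be sketched like this (illustrative names and a simplified `(offset, length, layer)` segment shape, not DADI's on-disk format):

```python
def subtract(seg, covers):
    """Return the parts of seg=(off, len) not covered by any (off, len) in covers.

    `covers` must be sorted by offset and non-overlapping.
    """
    off, length = seg
    end = off + length
    parts, pos = [], off
    for coff, clen in covers:
        cend = coff + clen
        if cend <= pos or coff >= end:
            continue                      # no overlap with the request
        if coff > pos:
            parts.append((pos, coff - pos))
        pos = max(pos, cend)
    if pos < end:
        parts.append((pos, end - pos))
    return parts

def merge_index(lower, upper):
    """Merge an upper layer's index over a lower layer's index.

    Segments are (offset, length, layer_id) triples, sorted and
    non-overlapping within each layer. Upper segments take precedence.
    """
    upper_ranges = [(o, n) for o, n, _ in upper]
    merged = list(upper)
    for off, length, layer in lower:
        for o, n in subtract((off, length), upper_ranges):
            merged.append((o, n, layer))
    return sorted(merged)
```

Applying the merge layer by layer, from the bottom up, yields the flat index whose size the chart above measures.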

  17. Index Performance
• > 6M queries per second for production images
(charts: index queries per second vs index size (# of segments, up to 10K); IOPS (bs=8KB, non-cached) vs I/O queue depth (1-256) for Thin LVM, DADI without compression, and DADI-ZFile)

  18. Writable Layer
• Log-structured design
  • appending index and raw data to separate logs
• Maintaining an in-memory index
  • red-black tree
• Commit only useful data blocks (in offset order)
  • combine index entries
(figure: the R/W layer appends to separate index and raw-data logs; commit produces a read-only layer blob of header, data, index, and trailer)
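
The append-then-commit flow can be sketched as below. This is a simplification under stated assumptions: writes do not overlap (real overlapping writes would need index splitting), a plain dict stands in for the red-black tree, and `WritableLayer`, `pwrite`, and `commit` are illustrative names.

```python
import io

class WritableLayer:
    """Sketch of a log-structured writable layer: raw data is appended to a
    data log while an in-memory index maps logical offsets to log positions."""

    def __init__(self):
        self.data_log = io.BytesIO()
        self.index = {}          # logical offset -> (log position, length)

    def pwrite(self, offset, buf):
        pos = self.data_log.tell()
        self.data_log.write(buf)               # append-only raw-data log
        self.index[offset] = (pos, len(buf))   # newest write wins

    def commit(self):
        """Emit only the useful data blocks, in offset order, plus a sorted
        index of (logical offset, length, blob offset) entries."""
        blob, idx, cur = io.BytesIO(), [], 0
        for off in sorted(self.index):
            pos, n = self.index[off]
            self.data_log.seek(pos)
            blob.write(self.data_log.read(n))
            idx.append((off, n, cur))
            cur += n
        return blob.getvalue(), idx
```

Note that commit reorders the log into offset order, so the resulting read-only blob can be served with the sorted-index lookup from slide 15.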

  19. ZFile
• A seekable compression format
  • random reading, and online decompression
• Compressed by fixed-sized chunks
  • only the needed chunks are decompressed
• Not tied to DADI
(figure: a ZFile holds a header, compressed chunks, an optional dictionary, an index, and a trailer; the underlay file is the DADI layer blob of header, raw data, index, and trailer)
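
The seekable-compression idea can be sketched as follows: compress fixed-size chunks independently, record each compressed chunk's position, and on a random read decompress only the chunks that cover the requested range. This uses zlib and a 64 KiB chunk size for illustration; the actual ZFile format, codec, and parameters may differ.

```python
import zlib

CHUNK = 64 * 1024  # fixed chunk size (illustrative)

def zfile_compress(raw: bytes):
    """Compress fixed-size chunks independently and build an index of
    (blob offset, compressed length) per chunk."""
    blob, index, pos = bytearray(), [], 0
    for i in range(0, len(raw), CHUNK):
        c = zlib.compress(raw[i:i + CHUNK])
        index.append((pos, len(c)))
        blob += c
        pos += len(c)
    return bytes(blob), index

def zfile_pread(blob, index, offset, length):
    """Random read: decompress only the chunks covering [offset, offset+length)."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    out = bytearray()
    for i in range(first, last + 1):
        pos, n = index[i]
        out += zlib.decompress(blob[pos:pos + n])
    skip = offset - first * CHUNK
    return bytes(out[skip:skip + length])
```

The trade-off is the usual one for seekable compression: smaller chunks mean less wasted decompression per random read, but a worse compression ratio.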

  20. On-Demand P2P Transfer
• In a tree-structured topology
• Each P2P node caches recently used data blocks
• A request is likely to hit the parent's cache,
  • or the parent will forward the request upward, recursively
(figure: in each datacenter, DADI-Agents form a tree under a DADI-Root; the roots make HTTP(S) requests to the Registry, while agents make DADI requests to their parents)
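
The cache-or-forward behavior can be sketched as below. Names and the fetch interface are illustrative, not DADI's actual API; a bounded `OrderedDict` stands in for whatever cache policy the real agents use.

```python
from collections import OrderedDict

class P2PNode:
    """A node in the tree-structured transfer: serve from the local cache,
    otherwise ask the parent, which recurses up to the root; only the root
    talks to the registry."""

    def __init__(self, parent=None, fetch_from_registry=None, cache_size=1024):
        self.parent = parent
        self.fetch = fetch_from_registry   # set on the root only
        self.cache = OrderedDict()
        self.cache_size = cache_size

    def read_block(self, blob_id, block_no):
        key = (blob_id, block_no)
        if key in self.cache:
            self.cache.move_to_end(key)            # cache hit: refresh recency
            return self.cache[key]
        if self.parent is not None:                # miss: forward upward
            data = self.parent.read_block(blob_id, block_no)
        else:
            data = self.fetch(blob_id, block_no)   # root: HTTP(S) to the registry
        self.cache[key] = data                     # cache on the way down
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)         # evict least recently used
        return data
```

Because every node on the path caches the block, a popular block is fetched from the registry once per tree, not once per host.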

  21. Evaluations

  22. Startup Latency with DADI
(charts: cold start latency (0-20s), split into image pull and app launch, for .tgz + overlay2, Thin LVM (device mapper), CRFS, pseudo-Slacker, DADI from Registry, and DADI from P2P Root; warm startup latency (0-2.4s) for overlay2 and DADI, each on NVMe SSD and cloud disk)

  23. Startup Latency with DADI
(charts: cold startup latency (0-3.0s) vs # of hosts (and containers) (0-40) for pseudo-Slacker and DADI; startup latency (0-2.4s) with warm cache vs cold cache, with app launch and app launch with prefetch broken out)

  24. Scalability with DADI
(charts: large-scale startup of Agility on 1,000 hosts, # of container instances started (0-10K) vs time (0-4s) for three cold startups and a warm startup; projected hyper-scale startup, estimated startup latencies (1.5-3.5s) vs # of containers (0-100K) for 2-, 3-, 4-, and 5-ary P2P trees, obtained by evaluating a single branch of the P2P tree)
(Agility is a small application specifically written in Python to assist the test)

  25. I/O Performance
(charts: time to du all files of an image (0-1.6s) and time to tar all files (0-12s), for overlay2, Thin LVM, and DADI, each on NVMe SSD and cloud disk)

  26. Thanks!
