PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE


1. PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE
Qiumin Xu (USC), Manu Awasthi (IIT Gandhinagar), Krishna T. Malladi (Samsung), Janki Bhimani (Northeastern), Jingpei Yang (Samsung), Murali Annavaram (USC)

2. Docker Becomes Very Popular
- Software container platform with many desirable features: ease of deployment, developer friendliness, and lightweight virtualization
- A mainstay in cloud platforms: Google Cloud Platform, Amazon EC2, Microsoft Azure
- The storage hierarchy is a key component: high-performance SSDs, NVMe, NVMe over Fabrics

3. Agenda
- Docker, NVMe, and NVMe over Fabrics (NVMf)
- How to best utilize NVMe SSDs for a single container? The best configuration performs similarly to the raw device; where do the performance anomalies come from?
- Do Docker containers scale well on NVMe SSDs? Exemplified using Cassandra; what is the best strategy to divide the resources?
- Scaling Docker containers on NVMe over Fabrics

4. What is a Docker Container?
- Each virtualized application (VM) includes an entire guest OS (tens of GB)
- A Docker container comprises just the application and its bins/libs
- Containers share the kernel with other containers
- Much more portable and efficient
(figure from https://docs.docker.com)

5. Non-Volatile Memory Express (NVMe)
- A storage protocol standard on top of PCIe
- NVMe SSDs connect through PCIe and support the standard; available since 2014 (Intel, Samsung) in enterprise and consumer variants
- NVMe SSDs leverage the interface to deliver superior performance: 5x to 10x over SATA SSDs [1]
[1] Qiumin Xu et al., "Performance Analysis of NVMe SSDs and Their Implication on Real World Databases," SYSTOR '15

6. Why NVMe over Fabrics (NVMf)?
- Retains NVMe performance over network fabrics
- Eliminates unnecessary protocol translations
- Enables low-latency, high-IOPS remote storage
(J. Metz and D. Minturn, "Under the Hood with NVMe over Fabrics," SNIA Ethernet Storage Forum)

7. Storage Architecture in Docker
Storage options:
1. Through the Docker filesystem (Aufs, Btrfs, Overlayfs), placed on the NVMe SSD via the -g option and the storage driver
2. Through virtual block devices (devicemapper: 2.a loop-lvm, 2.b direct-lvm)
3. Through a Docker data volume (-v option)
[Diagram: container read/write operations reach the NVMe SSDs either through a storage driver (Aufs, Btrfs, Overlayfs, or devicemapper with thin pools built on sparse files for loop-lvm or on a base device for direct-lvm) or through a data volume on the host backing filesystem (EXT4, XFS, etc.)]

8. Optimize Storage Configuration for a Single Container
Experimental environment:
- Dual-socket Xeon E5-2670 v3 server (12 HT cores)
- Enterprise-class NVMe SSD: Samsung XS1715
- Linux kernel v4.6.0, Docker v1.11.2
- fio used for traffic generation: asynchronous I/O engine (libaio), 32 concurrent jobs, iodepth of 32
- Steady-state performance measured
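The fio settings above can be reproduced with a short driver script. A minimal sketch, assuming a hypothetical test-file path and a 4 KB block size with a 120 s runtime (neither is stated on the slide):

```python
import subprocess

# Hypothetical path for the test file on the filesystem under test.
TEST_FILE = "/mnt/nvme/fio-testfile"

# Random-read job mirroring the slide's settings: libaio engine, 32 concurrent
# jobs, iodepth of 32, direct I/O so the page cache is bypassed.
fio_cmd = [
    "fio", "--name=randread",
    "--filename=" + TEST_FILE,
    "--ioengine=libaio", "--rw=randread", "--bs=4k",
    "--numjobs=32", "--iodepth=32",
    "--direct=1", "--size=10g",
    "--time_based", "--runtime=120",   # run long enough to reach steady state
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```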

9. Performance Comparison: Host Backing Filesystems
[Charts: average IOPS for random read/write and average bandwidth in MB/s for sequential read/write, comparing RAW, EXT4, and XFS (RR = random read, RW = random write, SR = sequential read, SW = sequential write)]
- EXT4 performs 25% worse than RAW for RR
- XFS closely resembles RAW for all workloads except RW

10. Tuning the Performance Gap: Random Reads
[Chart: EXT4 random-read IOPS with the default mount options vs. with dioread_nolock; the tuned configuration reaches about 700K IOPS]
- XFS allows multiple processes to read a file at once: it uses allocation groups which can be accessed independently
- EXT4 requires mutex locks even for read operations; the dioread_nolock mount option avoids them for direct-I/O reads
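As a sketch, the tuned EXT4 configuration is just a mount option (device and mount point are placeholders):

```python
import subprocess

DEV = "/dev/nvme0n1"   # placeholder NVMe device holding the EXT4 filesystem
MNT = "/mnt/nvme"      # placeholder mount point

# Mount EXT4 with dioread_nolock so direct-I/O reads no longer take the
# per-inode mutex, letting the 32 concurrent fio readers proceed in parallel.
subprocess.run(["mount", "-t", "ext4", "-o", "dioread_nolock", DEV, MNT], check=True)
```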

11. Tuning the Performance Gap: Random Writes
[Chart: random-write IOPS (up to 250K) for RAW, EXT4, and XFS as the number of fio jobs increases from 1 to 64]
- XFS performs poorly at high thread counts
- Contention on an exclusive lock, used for extent lookups and write checks, kills write performance
- A patch is available, but not for Linux 4.6 [1]
[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/
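A sketch of the kind of sweep behind this plot: only the fio job count changes between runs (the path and the 4 KB block size are assumptions):

```python
import subprocess

TEST_FILE = "/mnt/nvme/fio-testfile"   # placeholder path on the filesystem under test

# Sweep the fio job count for random writes; the XFS exclusive-lock contention
# described above shows up as flat or falling IOPS at the higher counts.
for jobs in (1, 2, 4, 8, 16, 24, 28, 32, 48, 64):
    subprocess.run([
        "fio", "--name=randwrite-" + str(jobs),
        "--filename=" + TEST_FILE,
        "--ioengine=libaio", "--rw=randwrite", "--bs=4k",
        "--numjobs=" + str(jobs), "--iodepth=32",
        "--direct=1", "--size=10g",
        "--time_based", "--runtime=60", "--group_reporting",
    ], check=True)
```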

12. Storage Architecture in Docker (recap of the three storage options from slide 7, before examining option 1: the Docker filesystem)

13. Docker Storage Options, Option 1: Through the Docker File System
- Aufs (Advanced multi-layered Unification FileSystem): a fast, reliable unification filesystem
- Btrfs (B-tree filesystem): a modern copy-on-write (CoW) filesystem which implements many advanced features for fault tolerance, repair, and easy administration
- Overlayfs: another modern unification filesystem with a simpler design, potentially faster than Aufs
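A minimal sketch of selecting one of these drivers, assuming a hypothetical graph-root path on the NVMe SSD (flag names follow the Docker 1.11-era CLI used in this work):

```python
import subprocess
import time

GRAPH_ROOT = "/mnt/nvme/docker"   # placeholder graph root on the NVMe-backed filesystem

# Choose the graph driver when the daemon starts (aufs and btrfs are selected
# the same way); -g points the graph root at the NVMe SSD. On Docker 1.11 the
# daemon is launched as "docker daemon" rather than "dockerd".
daemon = subprocess.Popen(["dockerd", "--storage-driver=overlay", "-g", GRAPH_ROOT])

time.sleep(5)                                   # give the daemon time to come up
subprocess.run(["docker", "info"], check=True)  # reports "Storage Driver: overlay"
```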

14. Performance Comparison, Option 1: Through the Docker File System
[Charts: average IOPS and average bandwidth in MB/s for the raw device, Aufs, Btrfs, and Overlayfs across SR, SW, RR, and RW]
- Aufs and Overlayfs perform close to the raw block device in most cases
- Btrfs has the worst performance for random workloads

15. Tuning the Performance Gap of Btrfs: Random Reads
[Chart: random-read bandwidth in MB/s (up to 4000) for RAW, EXT4, and Btrfs across block sizes]
- Btrfs does not yet work well for small block sizes
- Btrfs must read the file extent before reading the file data; a large block size reduces how often this metadata is read

16. Tuning the Performance Gap of Btrfs: Random Writes
[Chart: random-write IOPS (up to 200K) for Btrfs with default mount options vs. nodatacow]
- Btrfs does not work well for random writes due to copy-on-write (CoW) overhead; the nodatacow mount option disables CoW for data
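As a sketch, the nodatacow tuning from the chart is again just a mount option (device and mount point are placeholders):

```python
import subprocess

DEV = "/dev/nvme0n1"   # placeholder NVMe device formatted with Btrfs
MNT = "/mnt/btrfs"     # placeholder mount point

# nodatacow turns off copy-on-write (and with it data checksumming) for newly
# created files, removing the CoW overhead that hurts random writes.
subprocess.run(["mount", "-t", "btrfs", "-o", "nodatacow", DEV, MNT], check=True)
```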

17. Storage Architecture in Docker (recap of the three storage options from slide 7, before examining options 2 and 3: virtual block devices and data volumes)

18. Docker Storage Configurations, Option 2: Through a Virtual Block Device
- The devicemapper storage driver leverages the thin-provisioning and snapshotting capabilities of the kernel's Device Mapper framework
- Loop-lvm uses sparse files to build the thin-provisioned pools
- Direct-lvm uses a block device to create the thin pools directly (recommended by Docker)
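A sketch of the direct-lvm setup, assuming an LVM thin pool has already been created on the NVMe SSD (the pool name is a placeholder):

```python
import subprocess

# Placeholder: an LVM thin pool created beforehand on the NVMe SSD
# (e.g. with vgcreate and lvcreate --type thin-pool).
THIN_POOL = "/dev/mapper/docker-thinpool"

# direct-lvm: hand devicemapper a real thin-pool device instead of letting it
# fall back to loop-mounted sparse files (loop-lvm). On Docker 1.11 the daemon
# is launched as "docker daemon" rather than "dockerd".
subprocess.Popen([
    "dockerd",
    "--storage-driver=devicemapper",
    "--storage-opt", "dm.thinpooldev=" + THIN_POOL,
])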

19. Docker Storage Configurations, Option 3: Through a Docker Data Volume (-v)
- Data persists beyond the lifetime of the container and can be shared with and accessed from other containers
(figure from https://github.com/libopenstorage/openstorage)
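A minimal sketch of option 3, with a hypothetical host directory and container name:

```python
import subprocess

HOST_DIR = "/mnt/nvme/appdata"   # placeholder directory on the NVMe-backed host filesystem

# -v bind-mounts the host directory into the container; the application's I/O
# bypasses the storage driver and the data outlives the container.
subprocess.run([
    "docker", "run", "-d", "--name", "app",
    "-v", HOST_DIR + ":/data",
    "ubuntu", "sleep", "infinity",
], check=True)
```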

20. Performance Comparison: Options 2 & 3
[Charts: average IOPS and average bandwidth in MB/s for RAW, direct-lvm, loop-lvm, and data-volume (-v) configurations with Aufs and Overlayfs across SR, SW, RR, and RW]
- Direct-lvm has worse performance for RR/RW
- LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit I/O-intensive workloads

21. Application Performance: the Cassandra Database
- NoSQL database; theoretically scales linearly with the number of nodes in the cluster [1]
- Requires data persistence, so a Docker data volume is used to store its data
[1] Rabl, Tilmann et al., "Solving Big Data Challenges for Enterprise Application Performance Management," VLDB '13

22. Scaling Docker Containers on NVMe: Multiple Containerized Cassandra Databases
Experiment setup:
- Dual-socket Xeon E5 server, 10 Gb Ethernet
- N = 1, 2, 3, ..., 8 containers; each container is driven by its own YCSB client
- Record count: 100M records, 100 GB in each database
- Client thread count: 16
Workloads:
- Workload A: 50% read, 50% update, Zipfian distribution
- Workload D: 95% read, 5% insert, normal distribution
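A sketch of this setup. The image tag, data paths, YCSB binding name, and container address are assumptions, and in the actual experiments each container is driven by its own YCSB client:

```python
import subprocess

N = 4                              # number of Cassandra containers to launch
DATA_ROOT = "/mnt/nvme/cassandra"  # placeholder data directory on the NVMe SSD

# One container per instance, each with its own NVMe-backed data volume.
for i in range(N):
    subprocess.run([
        "docker", "run", "-d", "--name", "cass{}".format(i),
        "-v", "{}/{}:/var/lib/cassandra".format(DATA_ROOT, i),
        "cassandra:3.0",           # assumed image tag
    ], check=True)

# Drive one instance with YCSB workload A (50% read / 50% update) and 16 client
# threads. The "cassandra-cql" binding name and the container IP are assumptions.
subprocess.run([
    "./bin/ycsb", "run", "cassandra-cql",
    "-P", "workloads/workloada",
    "-p", "hosts=172.17.0.2",
    "-p", "recordcount=100000000",
    "-threads", "16",
], check=True)
```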

23. Results: Throughput (Workload D, directly attached SSD)
[Chart: aggregate throughput in ops/sec, broken down by container C1-C8, vs. the number of Cassandra containers, with and without cgroup limits]
- Aggregated throughput peaks at 4 containers
- Cgroups configuration: 6 CPU cores, 6 GB memory, and 400 MB/s bandwidth per container

24. Strategies for Dividing Resources
[Chart: aggregate throughput in ops/sec vs. the number of Cassandra containers when limiting CPU, MEM, CPU+MEM, BW, all resources, or leaving everything uncontrolled]
- Memory limits have the most significant impact on throughput
- Best strategy for dividing resources using cgroups: assign 6 CPU cores to each container and leave the other resources uncontrolled
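A sketch of how these per-container limits map onto docker run flags, which apply them through cgroups. The device path, core ranges, and the split of the 400 MB/s cap into read and write throttles are assumptions:

```python
import subprocess

NVME_DEV = "/dev/nvme0n1"   # placeholder NVMe block device

# Fully partitioned variant: 6 CPU cores, 6 GB of memory, and a 400 MB/s I/O
# cap per container.
subprocess.run([
    "docker", "run", "-d", "--name", "cass-all-limits",
    "--cpuset-cpus", "0-5",
    "--memory", "6g",
    "--device-read-bps", NVME_DEV + ":400mb",
    "--device-write-bps", NVME_DEV + ":400mb",
    "-v", "/mnt/nvme/cassandra/0:/var/lib/cassandra",
    "cassandra:3.0",
], check=True)

# Recommended variant from this slide: pin 6 cores, leave memory and I/O alone.
subprocess.run([
    "docker", "run", "-d", "--name", "cass-cpu-only",
    "--cpuset-cpus", "6-11",
    "-v", "/mnt/nvme/cassandra/1:/var/lib/cassandra",
    "cassandra:3.0",
], check=True)
```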

25. Scaling Containerized Cassandra using NVMf
Experiment setup:
- Application server: Cassandra + Docker, connected to the YCSB clients over 10 GbE
- Storage server: NVMf target, connected to the application server over 40 GbE
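A sketch of attaching the remote SSD on the application server with nvme-cli, assuming an RDMA transport and placeholder NQN/address (4420 is the conventional NVMf port). Once connected, the namespace appears as a local NVMe block device and the containerized Cassandra setup above is unchanged:

```python
import subprocess

# Placeholders: subsystem NQN and address exported by the storage server.
TARGET_NQN = "nqn.2016-06.com.example:cassandra-ssd"
TARGET_ADDR = "192.168.1.10"

# Attach the remote namespace over RDMA; it then shows up as /dev/nvmeXnY.
subprocess.run([
    "nvme", "connect",
    "-t", "rdma",
    "-n", TARGET_NQN,
    "-a", TARGET_ADDR,
    "-s", "4420",
], check=True)

subprocess.run(["nvme", "list"], check=True)   # verify the new NVMe device
```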

26. Results: Throughput
[Chart: throughput relative to a single instance (relative TPS) for DAS and NVMf under workloads A and D, vs. the number of Cassandra instances]
- NVMf throughput is within 6% to 12% of directly attached SSDs

27. Results: Latency
[Chart: relative latency for DAS and NVMf under workloads A and D, vs. the number of Cassandra instances]
- NVMf incurs only 2% to 15% higher latency than directly attached SSDs

28. Results: CPU Utilization
- NVMf incurs less than 1.8% CPU utilization on the target machine

29. SUMMARY
- Best option in Docker for NVMe drive performance: Overlayfs + XFS + data volume
- Best strategy for dividing resources using cgroups: control only the CPU resources
- Scaling Docker containers on NVMf: throughput within 6% to 12% of DAS; latency 2% to 15% longer than DAS
THANK YOU! QIUMIN@USC.EDU
