Virtualization-based Bandwidth Management for Parallel Storage Systems
Yiqi Xu, Lixi Wang, Dulcardo Arteaga, Ming Zhao (School of Computing and Information Sciences, Florida International University, Miami, FL)
Yonggang Liu, Renato Figueiredo (Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL)
Motivation
The lack of QoS differentiation in HPC storage systems:
- Unable to recognize different application I/O workloads
- Unable to satisfy users' different I/O performance needs
[Figure: applications APP1..APPn on compute nodes issue parallel I/Os to shared, generic storage nodes]
Motivation
HPC applications need different I/O QoS: their I/O demands and performance requirements are diverse. Examples:
- WRF: hundreds of MBs of inputs and outputs
- mpiBLAST: GBs of input databases
- S3D: GBs of restart files written on a regular basis
The mismatch between these diverse demands and one-size-fits-all storage will become even more serious in future ultra-scale HPC systems.
Objective
Problem: lack of per-application I/O bandwidth allocation
- Static partitioning of storage nodes is inflexible
- Partitioning by compute nodes is insufficient
Proposed solution: per-application storage resource allocation
- Parallel file system (PFS) virtualization
- Per-application virtual PFSes
Outline
- Background
- Design
- Implementation
- Evaluation
Proxy-based PFS Virtualization (Before)
[Figure: applications 1 and 2 on compute nodes issue parallel I/Os directly to a generic PFS on the data servers]
- Storage nodes are shared without any isolation
- No distinction of I/Os from different applications
- Lack of native scheduling support for bandwidth allocation
Proxy-based PFS Virtualization (After)
[Figure: a PFS proxy on each data server places application 1's I/Os in Queue 1 (Virtual PFS1) and application 2's I/Os in Queue 2 (Virtual PFS2)]
- Indirection of application I/O access
- Creation of per-application virtual PFSes, dynamically spawned on the server (see the sketch below)
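To make the indirection concrete, here is a minimal sketch of how a proxy might classify an intercepted request into its application's queue before forwarding it; the slides do not show the proxy's code, so all names here are assumptions:

```c
/* Hypothetical sketch of per-application queuing in the proxy;
 * illustrative only, not the actual PVFS2 proxy implementation. */
#include <stdint.h>

#define MAX_APPS 16

typedef struct request {
    uint32_t client_id;       /* identifies the issuing compute node */
    struct request *next;
} request_t;

typedef struct {
    request_t *head, *tail;   /* FIFO queue of one virtual PFS */
} app_queue_t;

static app_queue_t vpfs_queue[MAX_APPS];

/* Assumed helper: map a client to its application, e.g., using the
 * proxy's configuration file. */
extern int app_of_client(uint32_t client_id);

/* Intercept: route the request to its application's virtual-PFS queue
 * instead of passing it straight through to the data server. */
void proxy_enqueue(request_t *req)
{
    app_queue_t *q = &vpfs_queue[app_of_client(req->client_id)];

    req->next = NULL;
    if (q->tail) q->tail->next = req;
    else         q->head = req;
    q->tail = req;
}
```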
Virtualization Benefits and Costs
Benefits:
- Enables various scheduling algorithms
  - SFQ(D): proportional-sharing algorithm
  - EDF: deadline-based scheduling
- Transparent to existing PFSes: no change to existing implementations needed
- Supports different parallel storage systems
Costs:
- Overhead of the user-level proxy
- Extra processing of data and communication
Prototype Implementation
A PVFS 2.8.2 (Parallel Virtual File System) proxy:
- Deployed on every data server
- Intercepts and forwards PVFS2 messages, using asynchronous I/O to reduce overhead
- Identifies I/Os from different applications, dynamically configured by a configuration file
- Implements SFQ(D) scheduling and supports a generic scheduling interface for other algorithms (sketched below)
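The slides do not show the proxy's actual API, but a "generic scheduling interface" of the kind described could plausibly look like the following, where a scheduler (SFQ(D), EDF, ...) plugs in by supplying callbacks invoked from the proxy's I/O path; all signatures are assumptions:

```c
/* Hypothetical generic scheduling interface for the proxy. */
typedef struct io_request io_request_t;   /* opaque proxy request type */

typedef struct scheduler_ops {
    /* called when an I/O message from application app_id arrives */
    void          (*enqueue)(io_request_t *req, int app_id);
    /* called whenever the proxy may issue a request to the data
     * server; returning NULL holds requests back */
    io_request_t *(*dispatch)(void);
    /* called when the data server completes a request */
    void          (*complete)(io_request_t *req);
} scheduler_ops_t;
```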
PVFS2 Background
[Figure: PVFS2 architecture. Client side: the application, via MPI-IO or the Linux VFS, calls the System Interface, which uses the Job Interface and BMI/Flows. Server side: PVFS2 server state machines use the Job Interface and BMI/Flows/Trove. Client and server communicate over the interconnection network (TCP).]
PVFS2 Proxy
[Figure: I/Os from applications 1-3 on the compute nodes pass through the proxy, which maintains Virtual PFS1, PFS2, and PFS3; data and metadata messages flow on to the PFS on the data servers]
- Non-I/O messages are not scheduled
- I/O messages incur extra processing at the proxy
SFQ: Start-Time Fair Queuing
SFQ:
- Originally designed for network packet scheduling
- Work-conserving, proportional-sharing scheduling
- Adapts to variation in server capacity
SFQ(D):
- Extends SFQ with a depth parameter D to allow and control concurrency
- Multiplexes storage bandwidth and enhances utilization
(A minimal sketch follows.)
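The following sketch shows the core of SFQ(D) under assumed structures; it is not the proxy's actual implementation. Each flow (application) i has weight w_i, a request's cost is its size, and D bounds the requests outstanding at the data server:

```c
#include <stddef.h>

#define NFLOWS 2
#define DEPTH  4                        /* the "D" in SFQ(D) */

typedef struct {
    double start, finish;               /* SFQ start/finish tags */
    size_t cost;                        /* e.g., request bytes */
    int    flow;
} request_t;

static double vtime;                    /* virtual time v(t) */
static double last_finish[NFLOWS];      /* finish tag of each flow's
                                           most recent request */
static double weight[NFLOWS] = {4, 1};  /* e.g., a 4:1 sharing ratio */
static int    outstanding;

/* On arrival: start tag = max(v, previous finish tag of this flow);
 * finish tag = start + cost / weight. */
void sfq_tag(request_t *r)
{
    double prev = last_finish[r->flow];
    r->start  = (vtime > prev) ? vtime : prev;
    r->finish = r->start + (double)r->cost / weight[r->flow];
    last_finish[r->flow] = r->finish;
}

/* queue_min() is an assumed helper that pops the queued request with
 * the minimum start tag. */
extern request_t *queue_min(void);

/* Dispatch in increasing start-tag order, but never keep more than
 * DEPTH requests outstanding: this is how SFQ(D) exploits the
 * concurrency of the storage server while bounding it. */
request_t *sfq_dispatch(void)
{
    request_t *r;
    if (outstanding >= DEPTH || (r = queue_min()) == NULL)
        return NULL;
    vtime = r->start;   /* v(t) tracks the last dispatched start tag */
    outstanding++;
    return r;
}

void sfq_complete(void) { outstanding--; }  /* then retry dispatch */
```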
Evaluation
A virtual-machine-based testbed:
- A cluster of 8 Dell PowerEdge 2970 servers (Ubuntu 8.04, kernel 2.6.18.8)
- Up to 64 PVFS2 clients and 4 PVFS2 servers (version 2.8.2)
- 2 competing parallel applications
- Benchmark: IOR version 2.10.3 with sequential writes (an illustrative invocation follows)
- The proxy implements performance monitoring
Experiments: virtualization overhead; effectiveness of proportional sharing
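The exact IOR flags are not given in the slides, but a sequential-write run of this shape might be launched roughly as follows (hypothetical parameters; with 4 servers, a 1 MB transfer stripes about 256 KB per server):

```sh
# Illustrative only: 32 MPI clients, write-only, sequential access,
# 1 MB transfers within a 64 MB block per task.
mpirun -np 32 ./IOR -a MPIIO -w -t 1m -b 64m -o /mnt/pvfs2/ior.dat
```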
Virtualization Overhead
[Figure: native vs. virtualized throughput (MB/s) for 16:1, 32:2, and 64:4 client:server configurations; measured overheads of 1.6%, 2.3%, and 2.9%]
- Throughput overhead is small: about 3% at most
- Small proxy CPU usage and about 1 MB of RAM on each node
Proportional Sharing in a Symmetric Setup
[Figure: two applications with 32 clients each; the proxy maps them to Queue 1 (Virtual PFS1) and Queue 2 (Virtual PFS2) on 4 data servers]
Proportional Sharing with Varying Ratios
[Figure: app1 and app2 throughput (MB/s) for desired sharing ratios of 2:1, 4:1, 8:1, and 16:1; achieved ratios of 1.90:1, 3.32:1, 4.33:1, and 5.84:1]
- Good proportional sharing can be achieved
- The achieved ratio falls short when the desired ratio is high
Proportional Sharing with Smaller I/Os
[Figure: same experiment with the request size reduced from 256 KB/server to 64 KB/server; achieved ratios of 1.96:1, 3.74:1, 6.33:1, and 10.13:1 for desired ratios of 2:1, 4:1, 8:1, and 16:1]
- The increased request rate improves the achieved ratio
- It can be further improved by increasing the number of clients
Proportional Sharing in an Asymmetric Setup
[Figure: application 1 with m clients and application 2 with n clients; the proxy maps them to Queue 1 (Virtual PFS1) and Queue 2 (Virtual PFS2) on 4 data servers]
Proportional Sharing in an Asymmetric Setup
[Figure: app1 and app2 throughput (MB/s) for client counts m:n of 40:20, 48:12, 56:7, and 48:3; achieved sharing ratios range from 1.01:1 to 1.23:1]
- Almost perfect fairness (1:1 sharing) is achieved regardless of the client counts
Conclusions and Future Work
Conclusions:
- Proxy-based PFS virtualization is feasible, with negligible overhead
- SFQ(D) achieves effective proportional bandwidth sharing
Future work:
- More scheduling algorithms: global proportional sharing, deadline-based scheduling
- Evaluation with more diverse application access patterns and I/O requirements
References
[1] PVFS2. http://www.pvfs.org/pvfs2/.
[2] P. Goyal, H. M. Vin, and H. Cheng, "Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks," IEEE/ACM Trans. Networking, vol. 5, no. 5, 1997.
[3] Y. Wang and A. Merchant, "Proportional-Share Scheduling for Distributed Storage Systems," Proc. 5th USENIX Conference on File and Storage Technologies (FAST '07), 2007.
[4] W. Jin, J. S. Chase, and J. Kaur, "Interposed Proportional Sharing for a Storage Service Utility," SIGMETRICS, 2004.
[5] IOR HPC Benchmark. http://sourceforge.net/projects/ior-sio/.
[6] P. Welsh and P. Bogenschutz, "Weather Research and Forecast (WRF) Model: Precipitation Prognostics from the WRF Model during Recent Tropical Cyclones," Interdepartmental Hurricane Conference, 2005.
[7] A. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," ClusterWorld Conference and Expo, 2003.
[8] R. Sankaran et al., "Direct Numerical Simulations of Turbulent Lean Premixed Combustion," Journal of Physics: Conference Series, 2006.
Acknowledgements
Research team:
- VISA Lab at FIU: Yiqi Xu, Dulcardo Clavijo, Lixi Wang, Dr. Ming Zhao
- ACIS Lab at UF: Yonggang Liu, Dr. Renato Figueiredo
Industry collaborator: Dr. Seetharami Seelam (IBM T.J. Watson)
Sponsor: NSF HECURA CCF-0937973/CCF-0938045
More information: http://visa.cis.fiu.edu/hecura
Thank You!