Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
Jiannan Ouyang, John Lange (University of Pittsburgh)
Haoqiang Zheng (VMware Inc.)
VEE’16, 04/02/2016
CPU Consolidation in the Cloud
- CPU consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU).
- Motivation: improve datacenter utilization.
Figure 1. Average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].
Problems with Preempted vCPUs
[Figure: four pCPUs (P), each shared by one vCPU of VM-A (A) and one vCPU of VM-B (B). On each pCPU one vCPU is running and the other is preempted.]
Performance problems: busy-waiting based kernel synchronization operations
- Lock Holder Preemption problem
- Lock Waiter Preemption problem
- TLB Shootdown Preemption problem
Lock Holder Preemption
- Lock holder preemption [Uhlig 04, Friebel 08]
  - A preempted vCPU is holding a spinlock
  - Causes dramatically longer lock waiting time: context switch latency + CPU shares allocated to other vCPUs
- Scheduling techniques
  - Co-scheduling, relaxed co-scheduling [VMware 10]
  - Adaptive co-scheduling [Weng HPDC11]
  - Balanced scheduling [Sukwong EuroSys11]
  - Demand-based coordinated scheduling [Kim ASPLOS13]
- Hardware-assisted techniques
  - Intel Pause-Loop Exiting (PLE) [Riel 11]
Lock Waiter Preemption [Ouyang VEE13]
- Linux uses a FIFO-order fair spinlock, named ticket spinlock
- Lock waiter preemption
  - A lock waiter is preempted and blocks the queue
  - P(waiter preemption) > P(holder preemption)
- Preemptable Ticket Spinlock
  - Key idea: proportional timeout
[Figure: lock queue with tickets i, i+1, i+2, i+3 and timeouts 0, T, 2T, 3T.]
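The proportional timeout idea can be sketched as a small helper. This is only an illustration under assumptions, not the Preemptable Ticket Spinlock code: the function name `pmt_timeout`, the 16-bit ticket width, and the parameter names are all hypothetical. A waiter that is n positions behind the ticket currently being served waits n times the unit timeout T.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: timeout for a waiter in a ticket-lock queue.
 * head         - ticket currently being served
 * my_ticket    - ticket drawn by the calling waiter
 * unit_timeout - the unit timeout T
 * Tickets are assumed to be 16-bit counters that wrap around, so the
 * queue distance is computed modulo 2^16. */
uint64_t pmt_timeout(uint16_t head, uint16_t my_ticket, uint64_t unit_timeout)
{
    /* Guard against overflow of distance * unit_timeout. */
    assert(unit_timeout <= UINT64_MAX / UINT16_MAX);

    uint16_t distance = (uint16_t)(my_ticket - head);
    return (uint64_t)distance * unit_timeout;
}
```

With head ticket i, the waiters holding i+1, i+2 and i+3 get timeouts T, 2T and 3T, so a waiter further back in the queue tolerates a longer delay before it gives up waiting in FIFO order.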
TLB Shootdown Preemption
- KVM Paravirt Remote Flush TLB [kvmtlb 12]
  - The VMM maintains vCPU preemption states and shares them with the guest.
  - Use the conventional approach if the remote vCPU is running.
  - Defer the TLB flush if the remote vCPU is preempted.
  - Cons: the preemption state may change after checking.
- TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS13]
- Shoot4U
  - Goal: eliminate the problem
  - Key idea: invalidate guest TLB entries from the VMM
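The paravirtual remote flush decision listed above can be sketched as follows. This is a user-space illustration under assumptions, not the KVM patch itself: `struct vcpu_state`, its fields, and `pv_flush_select` are hypothetical names. Running targets are collected into an IPI mask, and preempted targets only get a pending-flush mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_VCPUS 64

/* Hypothetical per-vCPU state shared between the VMM and the guest. */
struct vcpu_state {
    bool preempted;     /* maintained by the VMM */
    bool flush_pending; /* set by the guest, honored when the vCPU resumes */
};

/* Returns the bitmap of target vCPUs that still need a conventional
 * IPI-based shootdown. Preempted targets are marked for a deferred flush. */
uint64_t pv_flush_select(struct vcpu_state *vcpus, int nr_vcpus, uint64_t targets)
{
    uint64_t ipi_mask = 0;

    assert(nr_vcpus >= 0 && nr_vcpus <= MAX_VCPUS);
    for (int i = 0; i < nr_vcpus; i++) {
        if (!(targets & (UINT64_C(1) << i)))
            continue;                        /* not a shootdown target */
        if (vcpus[i].preempted)
            vcpus[i].flush_pending = true;   /* defer the flush */
        else
            ipi_mask |= UINT64_C(1) << i;    /* running: send IPI and wait */
    }
    return ipi_mask;
}
```

The weakness noted on the slide is visible here: `preempted` is read once, so a vCPU that is preempted right after the check still ends up in the IPI mask and the initiator busy-waits for it.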
Contributions
- An analysis of the impact that various low-level synchronization operations have on system benchmark performance.
- Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.
- An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.
Performance Analysis
Overhead of CPU Consolidation
[Figure: bar chart of slowdown (y-axis 0 to 20) for blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264. Plot labels include "max", "ideal" and an off-scale value of 70.6.]
PARSEC runtime with a co-located VM relative to running alone (12-core VMs, measured on Linux/KVM, with PLE disabled)
CPU Usage Profiling (perf)
[Figure: stacked bars of CPU time percentage (0 to 100%) for blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264, split into k:lock, k:tlb, k:other and u:*. Two panels: 1VM and 2VM.]
CDF of TLB Shootdown Latency (ktap)
[Figure: cumulative percent (0 to 1) versus latency in us (log scale, 10^0 to 10^6). Series: 1VM, 2VM.]
How TLB Shootdown Works in VMs
- TLB (Translation Lookaside Buffer): a per-core hardware cache for page table translation results
- TLB coherence is managed by the OS
  - TLB shootdown operations: IPI + invlpg
  - Linux TLB shootdown is busy-waiting based
[Figure: shootdown path across the vCPUs of VM-A (A) and VM-B (B), the VMM, and four pCPUs (P).]
1. vIPIs
2. trap (into the VMM)
3. pIPIs
4. inject virtual interrupts
5. vCPU is scheduled (TLB Shootdown Preemption)
6. invalidation & ACK
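Because the initiator busy-waits for every target to acknowledge, the shootdown latency is set by the slowest target. The sketch below is a simplified model under assumptions, not a measurement tool: `struct target`, `shootdown_latency_us`, and the idea of a single fixed IPI cost are hypothetical simplifications. A running vCPU has a scheduling wait of zero, while a preempted vCPU first has to wait until it is scheduled again (step 5).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one remote vCPU targeted by a shootdown. */
struct target {
    uint64_t wait_until_scheduled_us; /* 0 if the vCPU is currently running */
};

/* Simplified model: each target acknowledges after its scheduling wait
 * plus a fixed IPI handling cost; the initiator spins until the last ACK. */
uint64_t shootdown_latency_us(const struct target *targets, int n,
                              uint64_t ipi_cost_us)
{
    uint64_t worst = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        uint64_t delay = targets[i].wait_until_scheduled_us + ipi_cost_us;
        if (delay > worst)
            worst = delay;
    }
    return worst;
}
```

In this model a single preempted target is enough to stretch the whole operation from the IPI cost to a full scheduling delay, which is the TLB Shootdown Preemption problem.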
Shoot4U
Shoot4U
- Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid)
- Key idea: invalidate guest TLB entries from the VMM
  - Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
  - The VMM invalidates and returns; no interrupt injection and no waiting
[Figure: shootdown path across the vCPUs of VM-A (A) and VM-B (B), the VMM, and four pCPUs (P).]
1. hypercall <vcpu set, addr range>
2. pIPIs
3. invalidation and ACK
Implementation
- Shoot4U API:
    kvm_hypercall3(unsigned long KVM_HC_SHOOT4U,
                   unsigned long vcpu_bitmap,
                   unsigned long start_addr,
                   unsigned long end_addr);
- KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
- https://github.com/ouyangjn/shoot4u
- The guest uses the hypercall for TLB shootdowns
- VMM hypercall handler: vCPU set => pCPU set, then send IPIs
- IPI handler: invalidate guest TLB entries with invvpid
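The "vCPU set => pCPU set" step of the hypercall handler can be sketched in user space as below. This is a minimal illustration under assumptions, not the Shoot4U source: the function name `vcpu_set_to_pcpu_set` and the `vcpu_to_pcpu` lookup table are hypothetical, and the slides do not say how the real handler tracks where a vCPU's translations may be cached.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_VCPUS 64
#define MAX_PCPUS 64

/* Translate the guest-supplied vCPU bitmap into the bitmap of pCPUs that
 * should receive a physical IPI. vcpu_to_pcpu[i] is a hypothetical table
 * giving the pCPU associated with vCPU i. */
uint64_t vcpu_set_to_pcpu_set(uint64_t vcpu_bitmap, const int *vcpu_to_pcpu,
                              int nr_vcpus)
{
    uint64_t pcpu_mask = 0;

    assert(nr_vcpus >= 0 && nr_vcpus <= MAX_VCPUS);
    for (int i = 0; i < nr_vcpus; i++) {
        if (!(vcpu_bitmap & (UINT64_C(1) << i)))
            continue;                                 /* vCPU not targeted */
        assert(vcpu_to_pcpu[i] >= 0 && vcpu_to_pcpu[i] < MAX_PCPUS);
        pcpu_mask |= UINT64_C(1) << vcpu_to_pcpu[i];  /* duplicates collapse */
    }
    return pcpu_mask;
}
```

Under consolidation several targeted vCPUs can map to the same pCPU, so the resulting mask may contain fewer bits than the vCPU bitmap and correspondingly fewer IPIs are needed.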
Evaluation
- Dual-socket Dell R450 server
  - 6-core Intel “Ivy-Bridge” Xeon processors with hyperthreading
  - 24 GB RAM split across two NUMA domains
  - CentOS 7 (Linux 3.16)
- Virtual machines
  - 12 vCPUs, 4 GB RAM on the same socket
  - Fedora 19 (Linux 4.0)
  - VM1: PARSEC Benchmark Suite; VM2: sysbench CPU test
- Schemes
  - baseline: unmodified Linux kernel
  - kvmtlb [kvmtlb 12]
  - Shoot4U
  - Pause-Loop Exiting (PLE) [Riel 11]
  - Preemptable Ticket Spinlock (PMT) [Ouyang VEE13]
TLB Shootdown Latency (Cycles)
Order of magnitude lower latency
TLB Shootdown Latency (CDF)
[Figure: cumulative percent (0 to 1) versus latency in us (log scale, 10^0 to 10^6). Series: shoot4u-1VM, shoot4u-2VM, kvmtlb-1VM, kvmtlb-2VM, baseline-1VM, baseline-2VM.]
PARSEC Performance (2 VMs)
[Figure: normalized execution time (y-axis 0 to 1.1) for blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264. Series: baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, ple+pmt+shoot4u.]
Revisiting Performance Slowdown
[Figure: bar chart of slowdown (y-axis 0 to 20, with an off-scale value of 70.6) for blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264. Series: baseline, ple+pmt+shoot4u.]
Revisiting CPU Usage Profiling
[Figure: stacked bars of CPU time percentage (0 to 100%) for blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264, split into k:lock, k:tlb, k:other and u:*. Two panels: baseline 2VM and ple+pmt+shoot4u 2VM.]
Conclusions
- We conducted a set of experiments to break down the overheads caused by preempted virtual CPU cores, showing that TLB operations can have a significant impact on performance with certain workloads.
- We presented Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest’s vCPUs.
- Our evaluation demonstrates the effectiveness of our approach and illustrates how, under certain workloads, it is dramatically better than state-of-the-art techniques.
https://github.com/ouyangjn/shoot4u
Q & A
Jiannan Ouyang, Ph.D. Candidate, University of Pittsburgh
ouyang@cs.pitt.edu
http://www.cs.pitt.edu/~ouyang/
The Prognostic Lab, University of Pittsburgh
http://www.prognosticlab.org
Kitten Lightweight Kernel, Pisces Co-Kernel, Palacios VMM
References
[Ouyang VEE13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.
[Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards scalable multiprocessor virtual machines. In Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3, VM’04, 2004.
[Friebel 08] Thomas Friebel. How to deal with lock-holder preemption. Presented at the Xen Summit North America, 2008.
[Kim ASPLOS13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[VMware 10] VMware(r) vSphere(tm): The CPU scheduler in VMware ESX(r) 4.1. Technical report, VMware, Inc., 2010.
[Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.
[Weng HPDC11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011.
[Sukwong EuroSys11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.
[Riel 11] R. v. Riel. Directed yield for pause loop exiting, 2011. URL http://lwn.net/Articles/424960/.
[kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.