
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs


  1. Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

     Jiannan Ouyang, John Lange (University of Pittsburgh); Haoqiang Zheng (VMware Inc.)
     VEE’16, 04/02/2016

  2. CPU Consolidation in the Cloud

     CPU consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU).
     Motivation: improve datacenter utilization.

     Figure 1. Average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].

  3. Problems with Preempted vCPUs

     [Figure: vCPUs of VM-A and VM-B multiplexed on four pCPUs; at any moment some vCPUs are running and the others are preempted.]

     Performance problems come from busy-waiting based kernel synchronization operations:
     - Lock Holder Preemption problem
     - Lock Waiter Preemption problem
     - TLB Shootdown Preemption problem

  4. Lock Holder Preemption

     Lock holder preemption [Uhlig 04, Friebel 08]
     - A preempted vCPU is holding a spinlock
     - Causes dramatically longer lock waiting time
     - Added wait: context switch latency + CPU shares allocated to other vCPUs

     Scheduling techniques
     - Co-scheduling, relaxed co-scheduling [VMware 10]
     - Adaptive co-scheduling [Weng HPDC’11]
     - Balanced scheduling [Sukwong EuroSys’11]
     - Demand-based coordinated scheduling [Kim ASPLOS’13]

     Hardware-assisted techniques
     - Intel Pause-Loop Exiting (PLE) [Riel 11]

  5. Lock Waiter Preemption [Ouyang VEE’13]

     Linux uses a FIFO-ordered fair spinlock, the ticket spinlock.

     [Figure: ticket queue i, i+1, i+2, i+3 with timeout labels 0, T, 2T, 3T.]

     Lock waiter preemption
     - A lock waiter is preempted and blocks the queue
     - P(waiter preemption) > P(holder preemption)

     Preemptable Ticket Spinlock
     - Key idea: proportional timeout
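
The "proportional timeout" idea on this slide can be illustrated with a small user-space sketch. It assumes that a waiter's timeout grows linearly with its distance from the ticket currently being served, which is one reading of the 0, T, 2T, 3T labels in the figure. The names `BASE_TIMEOUT`, `queue_distance` and `proportional_timeout` are hypothetical, and the slides give no value for T.

```c
#include <stdint.h>

/* Hypothetical base timeout T, in spin iterations (not given in the slides). */
#define BASE_TIMEOUT 1024u

/* Distance of a waiter from the ticket being served.
 * Unsigned subtraction stays correct when the ticket counter wraps. */
static inline uint32_t queue_distance(uint32_t my_ticket, uint32_t now_serving)
{
    return (uint32_t)(my_ticket - now_serving);
}

/* Assumed proportional timeout: distance 0 -> 0, 1 -> T, 2 -> 2T, ... */
static inline uint64_t proportional_timeout(uint32_t my_ticket,
                                            uint32_t now_serving)
{
    return (uint64_t)queue_distance(my_ticket, now_serving) * BASE_TIMEOUT;
}
```

Under this assumption, waiters further back in the queue wait longer before giving up their FIFO position, so ordering is mostly preserved while a preempted waiter can no longer block everyone behind it indefinitely.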

  6. TLB Shootdown Preemption

     KVM Paravirt Remote Flush TLB [kvmtlb 12]
     - The VMM maintains vCPU preemption states and shares them with the guest.
     - Use the conventional approach if the remote vCPU is running.
     - Defer the TLB flush if the remote vCPU is preempted.
     - Cons: the preemption state may change after it is checked.

     TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS’13]

     Shoot4U
     - Goal: eliminate the problem
     - Key idea: invalidate guest TLB entries from the VMM
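
The kvmtlb policy listed on this slide (IPI the running vCPUs, defer the flush for preempted ones) can be sketched as a split of the target set. This is a simplified user-space model, not the KVM patch itself: the `flush_plan` struct and `plan_remote_flush` are hypothetical names, and vCPU sets are assumed to fit in a 64-bit mask.

```c
#include <stdint.h>

/* Result of the split: which vCPUs get an IPI now, which are deferred. */
struct flush_plan {
    uint64_t ipi_mask;      /* running targets: conventional IPI + flush */
    uint64_t deferred_mask; /* preempted targets: flush when they run again */
};

/* 'preempted' is a snapshot of the state the VMM shares with the guest.
 * It may already be stale when the IPIs are sent, which is the drawback
 * noted on the slide: a vCPU seen as running can be preempted right after. */
static struct flush_plan plan_remote_flush(uint64_t targets, uint64_t preempted)
{
    struct flush_plan plan;
    plan.ipi_mask = targets & ~preempted;
    plan.deferred_mask = targets & preempted;
    return plan;
}
```

For example, with targets `0xF` and a preempted snapshot of `0x5`, vCPUs 1 and 3 receive IPIs and vCPUs 0 and 2 are deferred.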

  7. Contributions

     - An analysis of the impact that various low-level synchronization operations have on system benchmark performance.
     - Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.
     - An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.

  8. Performance Analysis

  9. Overhead of CPU Consolidation

     [Figure: bar chart of slowdown per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264). Y-axis: Slowdown, 0 to 20, with one off-scale value labeled 70.6. Legend label: "max ideal slowdown".]

     PARSEC runtime with a co-located VM over running alone (12-core VMs, measured on Linux/KVM, with PLE disabled).

  10. CPU Usage Profiling (perf)

     [Figure: stacked bars of CPU time per PARSEC benchmark, one panel for 1VM and one for 2VM. Y-axis: Percentage (%). Categories: k:lock, k:tlb, k:other, u:*.]

  11. CDF of TLB Shootdown Latency (ktap)

     [Figure: CDF of TLB shootdown latency for 1VM and 2VM. X-axis: Latency (us), log scale from 10^0 to 10^6. Y-axis: Cumulative Percent.]

  12. How TLB Shootdown Works in VMs

     - TLB (Translation Lookaside Buffer): a per-core hardware cache for page table translation results
     - TLB coherence is managed by the OS
     - TLB shootdown operations: IPI + invlpg
     - Linux TLB shootdown is busy-waiting based

     [Figure: shootdown path between vCPUs of VM-A and VM-B, the VMM, and four pCPUs]
     (1) vIPIs
     (2) trap to the VMM
     (3) pIPIs
     (4) inject virtual interrupts
     (5) vCPU is scheduled (TLB Shootdown Preemption)
     (6) invalidation & ACK
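
Because the initiator busy-waits until every target has acknowledged, its latency is set by the slowest target. The toy model below makes that explicit; it is an illustration under stated assumptions, not a measurement. The `target` struct, its fields, and all delay values are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-target delays in microseconds (hypothetical values supplied by caller). */
struct target {
    uint64_t ipi_us;     /* IPI delivery + invlpg + ACK on a running vCPU */
    uint64_t resched_us; /* extra wait until a preempted vCPU is scheduled;
                            0 if the vCPU is already running */
};

/* The initiator spins until the last ACK arrives, so the shootdown
 * latency is the maximum over all targets. */
static uint64_t shootdown_latency_us(const struct target *targets, size_t n)
{
    uint64_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t delay = targets[i].ipi_us + targets[i].resched_us;
        if (delay > worst)
            worst = delay;
    }
    return worst;
}
```

In this model a single preempted target is enough to stretch the whole operation from microseconds to a scheduling delay, which matches step (5) in the figure.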

  13. Shoot4U

  14. Shoot4U

     Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid).

     Key idea: invalidate guest TLB entries from the VMM
     - Tell the VMM which TLB entries and vCPUs to invalidate (hypercall)
     - The VMM invalidates and returns, with no interrupt injection and no waiting

     [Figure: Shoot4U path between vCPUs of VM-A and VM-B, the VMM, and four pCPUs]
     (1) hypercall <vcpu set, addr range>
     (2) pIPIs
     (3) invalidation and ACK

  15. Implementation

     Shoot4U API

         kvm_hypercall3(unsigned long KVM_HC_SHOOT4U,
                        unsigned long vcpu_bitmap,
                        unsigned long start_addr,
                        unsigned long end_addr);

     KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
     - https://github.com/ouyangjn/shoot4u

     Guest
     - Use the hypercall for TLB shootdowns

     VMM
     - Hypercall handler: vCPU set => pCPU set, then send IPIs
     - IPI handler: invalidate guest TLB entries with invvpid
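
The "vCPU set => pCPU set" step of the hypercall handler can be sketched in user space as a bitmap translation. This is a minimal sketch, not the code from the repository: it assumes the VMM keeps a table mapping each vCPU to a pCPU, and the names `vcpu_set_to_pcpu_set`, `vcpu_to_pcpu` and `MAX_CPUS` are hypothetical.

```c
#include <stdint.h>

/* Sets are modeled as 64-bit masks, so at most 64 vCPUs and 64 pCPUs. */
#define MAX_CPUS 64

/* Translate the guest-supplied vCPU bitmap into the set of pCPUs that
 * should receive an IPI. vcpu_to_pcpu[v] is the pCPU associated with
 * vCPU v (hypothetical table); out-of-range entries are skipped. */
static uint64_t vcpu_set_to_pcpu_set(uint64_t vcpu_bitmap,
                                     const int *vcpu_to_pcpu,
                                     int nr_vcpus)
{
    uint64_t pcpu_mask = 0;
    for (int v = 0; v < nr_vcpus && v < MAX_CPUS; v++) {
        if (!(vcpu_bitmap & (1ULL << v)))
            continue;
        int p = vcpu_to_pcpu[v];
        if (p >= 0 && p < MAX_CPUS)
            pcpu_mask |= 1ULL << p;
    }
    return pcpu_mask;
}
```

Several vCPUs mapped to the same pCPU collapse into a single bit, so each pCPU would be sent at most one IPI per hypercall in this model.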

  16. Evaluation

     Dual-socket Dell R450 server
     - 6-core Intel "Ivy-Bridge" Xeon processors with hyperthreading
     - 24 GB RAM split across two NUMA domains
     - CentOS 7 (Linux 3.16)

     Virtual machines
     - 12 vCPUs, 4 GB RAM on the same socket
     - Fedora 19 (Linux 4.0)
     - VM1: PARSEC Benchmark Suite; VM2: sysbench CPU test

     Schemes
     - baseline: unmodified Linux kernel
     - kvmtlb [kvmtlb 12]
     - Shoot4U
     - Pause-Loop Exiting (PLE) [Riel 11]
     - Preemptable Ticket Spinlock (PMT) [Ouyang VEE’13]

  17. TLB Shootdown Latency (Cycles)

     Order of magnitude lower latency.

  18. TLB Shootdown Latency (CDF)

     [Figure: CDF of TLB shootdown latency. Series: shoot4u-1VM, shoot4u-2VM, kvmtlb-1VM, kvmtlb-2VM, baseline-1VM, baseline-2VM. X-axis: Latency (us), log scale from 10^0 to 10^6. Y-axis: Cumulative Percent.]

  19. PARSEC Performance (2 VMs)

     [Figure: normalized execution time per PARSEC benchmark (blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, x264). Series: baseline, pmt, ple, pmt+kvmtlb, pmt+shoot4u, ple+pmt+shoot4u. Y-axis: Normalized Execution Time, 0 to 1.1.]

  20. Revisiting Performance Slowdown

     [Figure: slowdown per PARSEC benchmark for baseline and ple+pmt+shoot4u. Y-axis: Slowdown, 0 to 20, with one off-scale value labeled 70.6.]

  21. Revisiting CPU Usage Profiling

     [Figure: stacked bars of CPU time per PARSEC benchmark, one panel for baseline 2VM and one for ple+pmt+shoot4u 2VM. Y-axis: Percentage (%). Categories: k:lock, k:tlb, k:other, u:*.]

  22. Conclusions

     - We conducted a set of experiments to provide a breakdown of the overheads caused by preempted virtual CPU cores, showing that TLB operations can have a significant impact on performance with certain workloads.
     - We presented Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest’s vCPUs.
     - Our evaluation demonstrates the effectiveness of our approach and illustrates how, under certain workloads, it is dramatically better than state-of-the-art techniques.

  23. https://github.com/ouyangjn/shoot4u

  24. Q & A

     Jiannan Ouyang, Ph.D. Candidate, University of Pittsburgh
     ouyang@cs.pitt.edu
     http://www.cs.pitt.edu/~ouyang/

     The Prognostic Lab, University of Pittsburgh
     http://www.prognosticlab.org

     Projects shown: Kitten Lightweight Kernel, Pisces Co-Kernel, Palacios VMM

  25. References

     - [Ouyang VEE’13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.
     - [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards scalable multiprocessor virtual machines. In Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3, VM’04, 2004.
     - [Friebel 08] Thomas Friebel. How to deal with lock-holder preemption. Presented at the Xen Summit North America, 2008.
     - [Kim ASPLOS’13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

  26. References (continued)

     - [VMware 10] VMware(r) vSphere(tm): The CPU scheduler in VMware ESX(r) 4.1. Technical report, VMware, Inc, 2010.
     - [Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.
     - [Weng HPDC’11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011.
     - [Sukwong EuroSys’11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.

  27. References (continued)

     - [Riel 11] R. v. Riel. Directed yield for pause loop exiting, 2011. URL http://lwn.net/Articles/424960/.
     - [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.
