SOFT CONTAINER: TOWARDS 100% RESOURCE UTILIZATION
Accela Zhao, Layne Peng
WHO ARE THOSE GUYS …
• Accela Zhao, Technologist at EMC OCTO, active OpenStack community contributor, experienced in cloud scheduling and container technologies. Mail: accela.zhao@emc.com
• Layne Peng, Principal Technologist at EMC OCTO, experienced cloud architect, one of the earliest contributors to Cloud Foundry in China, holder of 9 patents and author of a book. Mail: layne.peng@emc.com Twitter: @layne_peng
WHAT IS RESOURCE UTILIZATION?
[Figure: the gap between the resources we buy and the resources we actually use is $$$ wasted]
ENERGY AND RESOURCE UTILIZATION
• Real-world resource utilization is usually low: around 20% or less
• An idle server consumes as much as 70% of the energy of one running at full speed
• Energy-related costs are 42% of the total (including buying new machines)
• Low resource utilization is energy-inefficient: wasted energy, wasted money
A CLOSER LOOK AT CLOUD
• The key advantage: cloud consolidation
• Improved resource utilization: fewer machines, more apps. Energy-efficient, and saves money.
RESOURCE UTILIZATION ON CLOUD
• Scheduling - choose the best resource placement when the app starts
  – Examples: Green Cloud, Paragon. And the schedulers in OpenStack, Kubernetes, Mesos, …
• Migration - continuously optimize resource placement while the app is running
  – Examples: OpenStack Watcher, VMware DRS
• Soft Container - elastically and dynamically adjust a container's resource constraints in response to co-located apps
  – Related: Google Heracles
RESOURCE UTILIZATION ON CLOUD
• Scheduler - manages resource utilization at app kick-off
• Migration - manages resource utilization across hosts while the app is running
• Soft Container - manages resource utilization at fine granularity inside a host
RESOURCE UTILIZATION ON CLOUD
A battle between putting more apps on each host and guaranteeing app SLAs
The key problem: resource interference
THE KEY PROBLEM: RESOURCE INTERFERENCE
• What is resource interference?
  – Apps co-located on one host share resources like CPU, cache, memory, …
  – They interfere with each other, resulting in poor performance compared to running standalone
  – Resource interference makes SLAs unenforceable
• Related readings
  – Google Heracles: an analysis of resource interference
  – Paragon: resource interference-aware scheduling
  – Bubble-up: how to measure resource interference
RESOURCE INTERFERENCE: WHAT DOES IT LOOK LIKE?
[Figure: MySQL running standalone vs. co-located with a CPU- and disk-hungry task]
RESOURCE INTERFERENCE: HOW TO MEASURE IT?
• Bubble-up
  – The setup
    • Run the app co-located with resource benchmarks; each benchmark stresses one type of resource
  – The app's tolerated resource interference
    • Slowly increase the benchmark stress until the app fails its SLA
    • The critical point shows how much resource interference the app can tolerate
  – The app's caused resource interference
    • Run the app at what its SLA requires
    • The stress it places on each type of resource is the app's caused resource interference
• Where to use it?
  – Better resource utilization management
  – Scheduling, Migration, Soft Container, …
RESOURCE INTERFERENCE: HOW TO MEASURE IT?
[Figure: MySQL running standalone, vs. co-located with CPU stress, vs. disk stress. In this case, MySQL is much more sensitive to CPU interference.]
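The tolerated-interference measurement above can be sketched as a simple search loop. This is a minimal sketch, assuming a hypothetical `measure_latency` probe (a stand-in for e.g. real MySQL query latency) and made-up SLA numbers, not the actual Bubble-up implementation:

```python
# Sketch of the bubble-up "tolerated interference" measurement:
# raise the co-located benchmark's stress step by step until the
# app misses its SLA; the last passing level is the bubble size.

SLA_LATENCY_MS = 10.0   # the app's latency SLA (invented for the sketch)
STEP = 5                # stress increment per iteration (arbitrary units)

def measure_latency(stress):
    """Hypothetical probe: latency grows with co-located stress."""
    return 2.0 + 0.01 * stress ** 1.5

def tolerated_interference(sla_ms=SLA_LATENCY_MS, step=STEP, max_stress=1000):
    """Slowly increase the stress; return the last level at which
    the app still met its SLA (its tolerated interference)."""
    last_ok = 0
    for stress in range(0, max_stress + 1, step):
        if measure_latency(stress) > sla_ms:
            return last_ok
        last_ok = stress
    return last_ok

print(tolerated_interference())  # → 85 with the fake model above
```

In a real setup, `measure_latency` would drive the application's own benchmark, and one such loop would be run per resource type (CPU, cache, disk, …).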
INTRODUCING SOFT CONTAINER
• Motivations
  – Increase resource utilization by co-locating more apps
    • E.g. the business service is critical but may not use all resources on the host. Add low-priority Hadoop batch tasks to fill what is left.
  – Respond to the dynamic nature of time-varying workloads
    • E.g. the business service may become more idle at lunch time; Hadoop tasks can then expand their resource bubble and utilize the leftover.
  – Guarantee the SLA of critical apps
    • E.g. when the business service suddenly requires more resources, Hadoop tasks shrink instantly to give resources back.
• Challenges
  – Resource control and isolation of interference
  – Responding to dynamic workload change
INTRODUCING SOFT CONTAINER
• What does "Soft" mean?
  – Varying a container's resources based on its neighbors and SLAs (the container becomes elastic)
  – Expanding (bubbling up) resources when idle resources exist
  – Shrinking resources on a specific container when another, critical app demands more
[Figure: a container's resource bubble expanding and shrinking over time]
THE FEEDBACK CONTROL LOOP
[Figure: the Soft Container loop — a Watcher observes the containers, a Controller decides new limits, and a Limiter applies them back to the containers]
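A minimal sketch of that loop, with the Watcher, Controller and Limiter stubbed out. The component names follow the diagram; the headroom policy and capacity numbers are invented for illustration:

```python
# Minimal sketch of the Watcher -> Controller -> Limiter feedback loop.
# The real components would read cgroup stats and write cgroup limits;
# here they are stubbed so the control logic itself is visible.

def watcher(stats):
    """Observe the critical app's current resource usage (stubbed)."""
    return stats["critical_usage"]

def controller(critical_usage, capacity):
    """Decide the batch container's new limit: whatever the critical
    app leaves over, minus a safety headroom."""
    headroom = int(0.1 * capacity)
    return max(capacity - critical_usage - headroom, 0)

def limiter(limits, new_limit):
    """Apply the limit (in reality: write the batch app's cgroup files)."""
    limits["batch_limit"] = new_limit

def control_step(stats, limits, capacity=100):
    usage = watcher(stats)
    limiter(limits, controller(usage, capacity))

limits = {}
control_step({"critical_usage": 30}, limits)   # critical app fairly idle
print(limits["batch_limit"])                   # batch expands → 60
control_step({"critical_usage": 85}, limits)   # critical app under load
print(limits["batch_limit"])                   # batch shrinks → 5
```

Running `control_step` periodically gives the elastic expand/shrink behavior the previous slide describes.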
RESOURCES TO LIMIT
• CPU: core, time quota, …
• Memory: size, bandwidth, …
• Disk I/O: IOPS, throughput, …
RESOURCES TO LIMIT - MISSING
• CPU: core, time quota, …
• Memory: size, bandwidth*, …
• Disk I/O: IOPS, throughput, …
• Cache: LLC, …
• Network: ulimit, bandwidth, …
• GPU: …
• Device*: …
As of kernel 3.6, most of this support can be found in the community…
ISOLATING THE RESOURCES - NAMESPACES
• clone(): create a new process attached to a new namespace
• unshare(): create a new namespace and attach it to an existing process
• setns(): move a process into an existing namespace
• /proc/<pid>/ns:
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 ipc -> ipc:[4026532509]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 mnt -> mnt:[4026532507]
  lrwxrwxrwx 1 root root 0 Jun 16 18:24 net -> net:[4026532512]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 pid -> pid:[4026532510]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 user -> user:[4026531837]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 uts -> uts:[4026532508]
• We are still waiting for…
  – security namespace
  – security keys namespace
  – device namespace
  – time namespace
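On a Linux host, the namespace listing above can be reproduced programmatically by reading the /proc/self/ns symlinks; a small sketch:

```python
# List the namespaces of the current process, assuming a Linux host
# with /proc mounted: each entry in /proc/self/ns is a symlink of the
# form "<type>:[<inode>]". Two processes sharing a namespace see the
# same inode number.

import os

def list_namespaces(pid="self"):
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

for name, target in list_namespaces().items():
    print(f"{name} -> {target}")
```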
LIMITING THE RESOURCES - CGROUPS
• Task, control group & hierarchy
• Subsystems (control options): blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio, ns
• Usage
  – Create a cgroup under a subsystem
  – Change the limit:
    # echo 524288000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
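The shell one-liner above can also be done programmatically; a sketch, with the cgroup mount point as a parameter so the logic can be exercised against any directory (on a real host it is /sys/fs/cgroup/memory, and writing there needs root; `set_memory_limit` is our own helper name):

```python
# Sketch of setting a cgroup-v1 memory limit by writing the control
# file, as the shell example above does. On a real cgroupfs, creating
# the directory creates the cgroup and its control files appear
# automatically; against a plain directory (as here) the file is
# simply created, which is enough to show the mechanics.

import os

def set_memory_limit(group, limit_bytes, root="/sys/fs/cgroup/memory"):
    group_dir = os.path.join(root, group)
    os.makedirs(group_dir, exist_ok=True)       # mkdir == create the cgroup
    path = os.path.join(group_dir, "memory.limit_in_bytes")
    with open(path, "w") as f:                  # echo N > memory.limit_in_bytes
        f.write(str(limit_bytes))
    return path

print(set_memory_limit("foo", 524288000, root="/tmp/fake-memcg"))
```

To actually enforce the limit, tasks would then be attached by writing their PIDs to the cgroup's `tasks` file.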
MISSING - NETWORK
Isolation does not mean resource control
Suppose two containers on a machine with 100 Gbps total bandwidth
[Figure: two containers sharing 100 Gbps, using 10 and 80 Gbps]
MISSING - NETWORK
Isolation does not mean resource control
If the GREEN container consumes the majority of the bandwidth, it may have a negative impact on the BLUE one…
How can we prevent this from happening?
[Figure: the GREEN container consuming 95 of the 100 Gbps]
MISSING - NETWORK
A nightmare for PaaS providers…
Community attempts: based on Traffic Control (tc)
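A sketch of what such a tc-based attempt looks like: capping egress on a container's veth interface with a token-bucket (tbf) qdisc. The interface name and rate are placeholders, and the commands are only printed here, since actually applying them needs root:

```python
# Sketch of rate-limiting a container's network bandwidth with tc,
# the approach the community attempts build on. tbf (token bucket
# filter) shapes egress traffic to the given rate.

import subprocess

def tc_rate_limit_cmds(iface, rate="10gbit", burst="32kbit", latency="400ms"):
    """tc commands that cap egress on `iface` with a TBF qdisc."""
    return [
        ["tc", "qdisc", "del", "dev", iface, "root"],           # clear old qdisc
        ["tc", "qdisc", "add", "dev", iface, "root", "tbf",
         "rate", rate, "burst", burst, "latency", latency],
    ]

def apply(cmds, dry_run=True):
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))       # show what would run
        else:
            subprocess.run(cmd, check=False)  # needs root on a real host

apply(tc_rate_limit_cmds("veth0"))
```

The catch the slides point at: this must be wired up per container veth by the platform itself, since cgroups alone offer no bandwidth cap.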
MISSING - GPU
Nvidia's efforts:
a. GPUs exposed as separate normal devices in /dev
b. devices cgroup: Allow/Deny/List
   • Access: r (read), w (write), m (mknod)
Ref: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation
MISSING - GPU
Usable, but insufficient…
1. Launch multiple jobs in parallel, each using a subset of the available GPUs
2. How about sharing a GPU between jobs with proper isolation? Can we share a GPU like we can a CPU?
Ref: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation
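A sketch of the devices-cgroup mechanism above: NVIDIA GPUs are char devices with major number 195 (/dev/nvidia0 is 195:0), and writing an entry like "c 195:N rwm" to devices.deny revokes access. The cgroup root is a parameter here so the file-writing logic can be tried without root; `deny_gpu` is our own helper name:

```python
# Sketch of per-job GPU access control via the devices cgroup, the
# mechanism nvidia-docker builds on: deny a job's cgroup all access
# (read/write/mknod) to one GPU's device node.

import os

NVIDIA_MAJOR = 195  # major number of the /dev/nvidiaN char devices

def deny_gpu(group, minor, root="/sys/fs/cgroup/devices"):
    entry = f"c {NVIDIA_MAJOR}:{minor} rwm"     # char dev, r/w/mknod
    path = os.path.join(root, group, "devices.deny")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(entry)
    return entry

print(deny_gpu("job1", 0, root="/tmp/fake-devcg"))  # → c 195:0 rwm
```

This gives whole-GPU allow/deny per job, which is exactly the point of the slide: it partitions GPUs between jobs but cannot share one GPU with proper isolation.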
MISSING - CACHE
Intel's efforts:
• Cache Allocation Technology (CAT)
  – The ability to enumerate the CAT capability and the associated LLC allocation support via CPUID
  – Interfaces for the OS/hypervisor to group applications into classes of service (CLOS) and indicate the amount of last-level cache available to each CLOS. These interfaces are based on MSRs (Model-Specific Registers).
• Cache Monitoring Technology (CMT)
  – For an OS or VMM to indicate a software-defined ID for each application or VM scheduled to run on a core: the Resource Monitoring ID (RMID)
  – To monitor cache occupancy on a per-RMID basis
  – For an OS or VMM to read LLC occupancy for a given RMID at any time
• Code and Data Prioritization (CDP), an extension to CAT
  – A new CPUID feature flag is added within the CAT sub-leaves at CPUID.0x10.[ResID=1]:ECX[bit 2] to indicate support
MISSING – MEMORY BANDWIDTH
Monitor: Memory Bandwidth Monitoring (MBM)
• Mechanisms in hardware to monitor cache occupancy and bandwidth statistics, as applicable to a given product generation, on a per-software-ID basis
• Mechanisms for the OS or hypervisor to read back the collected metrics, such as L3 occupancy or memory bandwidth, for a given software ID at any point during runtime
Control:
• Ref: Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platforms: http://pertsserver.cs.uiuc.edu/~mcaccamo/papers/private/IEEE_TC_journal_submitted_C.pdf
• Code: https://github.com/heechul/memguard