Bringing Security and Multi-tenancy to Kubernetes Lei (Harry) Zhang
About Me • Lei (Harry) Zhang @resouer #CNCF member, #Microsoft MVP • Previous: VMware, Baidu • Feature Maintainer of Kubernetes • HyperCrew: https://hyper.sh/ • Publication: Docker & Kubernetes Under the Hood • PhD Candidate #Large-scale cluster scheduling and management
A survey about “boundary” • Are you comfortable with Linux containers as an effective boundary? • Yes, I use containers in my private/safe environment • No, I use containers to serve the public cloud
As long as we care about security… • We have to wrap containers inside full-blown virtual machines • But we lose cloud-native deployment (the dream vs. the reality) • Slow startup time • Huge waste of resources • Memory tax for every container • …
Revisit container • Container Runtime (namespace, cgroups) • The dynamic view and boundary of your running process • Container Image • The static view of your program, data, dependencies, files and directories • Dockerfile: FROM busybox; ADD temp.txt /; VOLUME /data; CMD ["echo hello"] • [Figure: Docker Container. The image supplies the read-only layers (FROM busybox, ADD temp.txt /, VOLUME /data, CMD ["echo hello"], with json metadata); the runtime adds an init layer (/etc/hosts, /etc/hostname, /etc/resolv.conf), the read-write layer and /data, and runs "echo hello". Resulting root filesystem: /bin /dev /etc /home /lib /lib64 /media /mnt /opt /proc /root /run /sbin /sys /tmp /usr /var /data /temp.txt]
HyperContainer Secure Kubernetes from runtime level
HyperContainer • Container Runtime • RunV https://github.com/hyperhq/runv • An OCI-compatible, hypervisor-based runtime implementation • Control daemon https://github.com/hyperhq/hyperd • Container Image • Docker Image Spec
Combine the best parts • Portable and behaves like a Linux container • $ hyperctl run -t busybox echo helloworld • sub-second startup time*, ~12MB memory cost • Fully isolated sandbox with an independent guest kernel • $ hyperctl exec -t busybox uname -r (prints 4.4.12-hyper, or your provided kernel) • security, backward compatibility, maturity • See: http://hypercontainer.io/why-hyper.html
HyperContainer is a Pod • That’s how HyperContainer fits into the Kubernetes philosophy • Wait, why is the Pod so important?
Pod: lesson learned from Borg • Should sample.war be packaged with Tomcat?
Pod: lesson learned from Borg • InitContainers: one or more containers started in sequence before the pod's normal containers are started. • Share volumes, perform network operations, and perform computation prior to the app containers.
So, Pod is • The group of super-affinity containers • The atomic scheduling unit • The process group in container cloud • Do right things without modifying your container image • Kubernetes = Spring Framework • Pod = IoC • [Figure: a Pod containing the app and log containers, an infra container, an init container and a volume]
Pod is not easy to simulate • app and log have super affinity • Requirement: app: 1G, log: 0.5G • Available: Node_A: 1.25G, Node_B: 2G • What happens if app is scheduled to Node_A?
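The numbers on this slide can be worked through in a few lines. This is only an illustrative sketch, not how the Kubernetes scheduler is implemented; the helper names `place_container_by_container` and `place_pod` are hypothetical.

```python
# Memory figures from the slide, in GB.
REQUESTS = {"app": 1.0, "log": 0.5}
NODES = {"Node_A": 1.25, "Node_B": 2.0}


def place_container_by_container(requests, nodes, first_choice):
    """Place 'app' on first_choice, then try to co-locate 'log' (super affinity).

    Returns the node name if both fit, or None if 'log' cannot follow 'app'.
    """
    free = dict(nodes)
    free[first_choice] -= requests["app"]
    if free[first_choice] >= requests["log"]:
        return first_choice
    return None


def place_pod(requests, nodes):
    """Treat app + log as one atomic unit and list the nodes that can hold it."""
    total = sum(requests.values())
    return [name for name, free in nodes.items() if free >= total]


print(place_container_by_container(REQUESTS, NODES, "Node_A"))  # log is stranded
print(place_pod(REQUESTS, NODES))  # only nodes with room for the whole Pod
```

Scheduling app alone onto Node_A leaves 0.25G, so log can never join it; scheduling the Pod as a whole (1.5G) rules out Node_A up front.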
HyperContainer is a Pod • Linux container based runtimes • wrap and encapsulate several app containers into a logical group • Hypervisor based container runtime • the hypervisor serves as a natural boundary of the Pod
HyperContainer is a Pod • Container Runtime Interface • create sandbox Foo --> create container C --> start container C • stop container C --> remove container C --> delete sandbox Foo • Sandbox • Normally: the infra container • HyperContainer: hypervisor with HyperKernel • a HyperStart process as PID 1 • sets up the mnt namespace, launches apps from the images, etc.
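The sandbox/container ordering above can be sketched with an in-memory stand-in. `FakeRuntime` and `run_lifecycle` are hypothetical names, and the method names follow the slide's wording rather than the actual CRI RPC names (which include `RunPodSandbox` and `CreateContainer`).

```python
class FakeRuntime:
    """In-memory stand-in for a CRI runtime (hypothetical, for illustration only)."""

    def __init__(self):
        self.sandboxes = {}  # sandbox name -> {container name: state}
        self.calls = []      # ordered record of lifecycle calls

    def create_sandbox(self, sandbox):
        self.sandboxes[sandbox] = {}
        self.calls.append(f"create sandbox {sandbox}")

    def create_container(self, sandbox, name):
        self.sandboxes[sandbox][name] = "created"
        self.calls.append(f"create container {name}")

    def start_container(self, sandbox, name):
        self.sandboxes[sandbox][name] = "running"
        self.calls.append(f"start container {name}")

    def stop_container(self, sandbox, name):
        self.sandboxes[sandbox][name] = "stopped"
        self.calls.append(f"stop container {name}")

    def remove_container(self, sandbox, name):
        if self.sandboxes[sandbox][name] == "running":
            raise RuntimeError("stop the container before removing it")
        del self.sandboxes[sandbox][name]
        self.calls.append(f"remove container {name}")

    def delete_sandbox(self, sandbox):
        if self.sandboxes[sandbox]:
            raise RuntimeError("sandbox still holds containers")
        del self.sandboxes[sandbox]
        self.calls.append(f"delete sandbox {sandbox}")


def run_lifecycle(runtime, sandbox="Foo", container="C"):
    """Walk through the create and teardown sequences from the slide."""
    runtime.create_sandbox(sandbox)
    runtime.create_container(sandbox, container)
    runtime.start_container(sandbox, container)
    runtime.stop_container(sandbox, container)
    runtime.remove_container(sandbox, container)
    runtime.delete_sandbox(sandbox)
    return runtime.calls
```

The sandbox is always created first and deleted last, which is why a hypervisor can play that role as naturally as an infra container.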
Hypernetes Kubernetes with HyperContainer Runtime
Hypernetes • Also: h8s • Kubernetes + HyperContainer runtime • officially supported via kubernetes/frakti • Multi-tenant network and persistent volumes • battle-tested Neutron + Cinder plugins
Multi-tenant Network • Goal: • leveraging the tenant-aware Neutron network for Kubernetes • following the network plugin workflow • Non-goal: • breaking the k8s network model or hacking k8s code
Define the Network • Network • a top-level API object • each tenant (created by Keystone) has its own Network • a Network maps to a Neutron “net” • a Network Controller is responsible for managing the Network lifecycle
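Example • [Figure: the controller-manager runs a ControlLoop per object type (pod, replica, namespace, service, job, deployment, volume, petset, network, …) that compares the Desired World with the Real World; for Network objects it calls Neutron to create/delete the network. Also shown: etcd, api-server, scheduler, and nodes each running kubelet (SyncLoop) and proxy]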
Kubernetes Network Model • Container reach container • all containers can communicate with all other containers without NAT • Node reach container • all nodes can communicate with all containers (and vice-versa) without NAT • IP addressing • Pod in cluster can be addressed by its IP
How does h8s fit that? • A Network can be assigned to one or more Namespaces • Pods belonging to the same Network can reach each other directly through IP • a Pod’s network maps to a Neutron “port” • the kubelet network plugin is responsible for Pod network setup
Example • 1 Pod created • [Figure: api-server, etcd, scheduler, and two nodes each running kubelet (SyncLoop) and proxy]
Example • 2 Pod object added
Example • 3.1 New pod object detected • 3.2 Bind pod with node
Example • 4.1 Detected pod bound to me • 4.2 Start containers in pod
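Steps 1 to 4 can be walked through with a toy model. Everything here is a hypothetical simplification: `ApiServer` is a plain dict standing in for api-server plus etcd, and `scheduler_pass` uses a placeholder round-robin policy rather than real scheduling logic.

```python
class ApiServer:
    """Toy stand-in for api-server + etcd: a dict of pod objects."""

    def __init__(self):
        self.pods = {}

    def create_pod(self, name, containers):
        # Steps 1-2: the pod is created and its object is stored, not yet bound.
        self.pods[name] = {"containers": containers, "node": None, "phase": "Pending"}


def scheduler_pass(api, nodes):
    """Steps 3.1-3.2: detect unbound pods and bind each one to a node."""
    for i, (name, pod) in enumerate(sorted(api.pods.items())):
        if pod["node"] is None:
            pod["node"] = nodes[i % len(nodes)]  # placeholder policy only


def kubelet_pass(api, node):
    """Steps 4.1-4.2: a kubelet starts the containers of pods bound to its node."""
    started = []
    for name, pod in sorted(api.pods.items()):
        if pod["node"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"
            started.extend(f"{name}/{c}" for c in pod["containers"])
    return started


api = ApiServer()
api.create_pod("web", ["app", "log"])
scheduler_pass(api, ["node-1", "node-2"])
print(kubelet_pass(api, "node-2"))  # nothing bound here
print(kubelet_pass(api, "node-1"))  # starts both containers of "web"
```

Each component only reacts to the state of the stored object, which is the pattern the SyncLoop labels in the figure refer to.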
Design of kubelet • Choose Runtime: docker, rkt, hyper/remote • InitNetworkPlugin • Components: NodeStatus, Network Status, status Manager, PLEG, volume Manager, image Manager • SyncLoop • PodUpdate { Add, Update, Remove, Delete, … } goes to HandlePods • Pod Update Worker (e.g. ADD) • generate Pod status • check volume status (talk later) • call runtime to start containers • set up Pod network (see next slide)
Set Up Pod Network
kubestack • A standalone gRPC daemon that 1. “translates” the SetUpPod request to the Neutron network API 2. handles the multi-tenant Service proxy
Service • $ iptables-save | grep my-service
OnServiceUpdate:
-A KUBE-SERVICES -d 10.0.0.116/32 -p tcp -m comment --comment "default/my-service: cluster IP" -m tcp --dport 8001 -j KUBE-SVC-KEAUNL7HVWWSEZA6
-A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" --mode random -j KUBE-SEP-6XXFWO3KTRMPKCHZ
-A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" --mode random -j KUBE-SEP-57KPRZ3JQVENLNBRZ
OnEndpointsUpdate:
-A KUBE-SEP-6XXFWO3KTRMPKCHZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.2:80
-A KUBE-SEP-57KPRZ3JQVENLNBRZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.3:80
portal 10.0.0.116:8001 • random mode rules • backend rule_1 172.17.0.2:80 • backend rule_2 172.17.0.3:80
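The random mode rules on this slide are shown without their probabilities. Assuming the scheme kube-proxy commonly uses in iptables mode (an assumption, since the slide's output does not show it), rule i of n matches with probability 1/(n-i) and the last rule always matches. The sketch below checks that this gives every backend an equal overall share; the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for n chained rules: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def overall_probabilities(n):
    """Chance that a connection ends at each backend when rules are tried in order."""
    remaining = Fraction(1)  # share of traffic that reached this rule
    result = []
    for p in rule_probabilities(n):
        result.append(remaining * p)
        remaining *= 1 - p
    return result


backends = ["172.17.0.2:80", "172.17.0.3:80"]
for backend, share in zip(backends, overall_probabilities(len(backends))):
    print(backend, share)
```

With the two backends from the slide, the first rule would match half of the connections and the second rule would take the rest.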
Multi-tenant Service • The default iptables-based kube-proxy is not tenant aware • Endpoint Pods and Nodes with iptables rules are isolated into different networks • Hypernetes uses a built-in HAProxy as the Service portal • to proxy all Service instances within the same namespace • the same OnServiceUpdate and OnEndpointsUpdate process • ExternalProvider • an OpenStack LB will be created as the Service • e.g. curl 58.215.33.98:8078
Kubernetes Persistent Volume • Volume Manager reconcile against the desired World: • Get mountedVolume from actualStateOfWorld • Unmount volumes in mountedVolume but not in desiredStateOfWorld • AttachVolume() if vol in desiredStateOfWorld and not attached • MountVolume() if vol in desiredStateOfWorld and not in mountedVolume • Verify devices that should be detached/unmounted are detached/unmounted • Tips: 1. -v host:path 2. attach VS mount 3. Totally independent from container management • [Figure: on the Host, the Cinder volume plugin attaches the volume to a host path, which is mounted into each Pod's mountPath; the Volume Manager reconciles the desired World]
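The reconcile steps listed above can be sketched as one pass over sets of volume names. This is a simplified illustration, not the kubelet's volume manager code; `reconcile` is a hypothetical function, and the final detach step is an assumption standing in for the slide's "verify devices are detached" bullet.

```python
def reconcile(desired, attached, mounted):
    """One pass of a simplified volume reconcile loop.

    desired, attached and mounted are sets of volume names. Returns the ordered
    list of actions plus the resulting (attached, mounted) sets.
    """
    actions = []
    attached, mounted = set(attached), set(mounted)

    # Unmount volumes that are mounted but no longer desired.
    for vol in sorted(mounted - desired):
        actions.append(("unmount", vol))
        mounted.discard(vol)

    # Attach volumes that are desired but not yet attached.
    for vol in sorted(desired - attached):
        actions.append(("attach", vol))
        attached.add(vol)

    # Mount volumes that are desired but not yet mounted.
    for vol in sorted(desired - mounted):
        actions.append(("mount", vol))
        mounted.add(vol)

    # Assumption: detach whatever is still attached but no longer desired.
    for vol in sorted(attached - desired):
        actions.append(("detach", vol))
        attached.discard(vol)

    return actions, attached, mounted


actions, attached, mounted = reconcile(
    desired={"vol-a", "vol-b"}, attached={"vol-a", "vol-c"}, mounted={"vol-c"}
)
for action in actions:
    print(action)
```

Running the pass again on its own output yields no actions, which is the property a reconcile loop relies on.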
Persistent Volume with HyperContainer • Enhanced Cinder volume plugin • Linux container: 1. full OpenStack cluster 2. query Nova to find node 3. attach Cinder volume to host path 4. bind mount host path to Pod containers • HyperContainer: • directly attach block devices to Pod • thanks to the hypervisor based Pod boundary • eliminates extra time to query Nova • [Figure: on the Host, the Enhanced Cinder volume plugin attaches each vol directly to a Pod's mountPath instead of going through a host path; the Volume Manager reconciles the desired World]
PV Example • Create a Cinder volume • Claim the volume by referencing its volumeID
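The slide's manifest is not reproduced in this text, so the following is only a guess at its shape: a Pod whose volume uses the `cinder` volume source that Kubernetes releases of that era offered, with `volumeID` and `fsType` fields. The pod name, image, mount path and the volume ID placeholder are all hypothetical.

```python
import json


def pod_with_cinder_volume(volume_id, mount_path="/data"):
    """Build a Pod manifest that references an existing Cinder volume by ID."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "cinder-demo"},  # hypothetical name
        "spec": {
            "containers": [
                {
                    "name": "app",
                    "image": "busybox",  # hypothetical image
                    "volumeMounts": [{"name": "data", "mountPath": mount_path}],
                }
            ],
            "volumes": [
                {"name": "data", "cinder": {"volumeID": volume_id, "fsType": "ext4"}}
            ],
        },
    }


# Placeholder: use the ID reported when the Cinder volume was created.
manifest = pod_with_cinder_volume("<cinder-volume-id>")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with kubectl; the only link between the Pod and the Cinder volume is the ID.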
Container Runtime Interface
Future of CRI • Keep Docker as the only default container runtime • ocid, rktlet, hyperd • Frakti: the Remote Container Runtime Kit • https://github.com/kubernetes/frakti • welcome to try it out, star and fork
“if image becomes non-standard” • e.g. the Docker image becomes somehow Docker specific • Don’t worry, kubelet.imageManager is becoming runtime specific • but then k8s will probably choose • NO DEFAULT runtime