
Container mechanics in Linux and rkt FOSDEM 2016 Alban Crequy



  1. Container mechanics in Linux and rkt FOSDEM 2016

  2. Alban Crequy github.com/alban Jonathan Boulle github.com/jonboulle @baronboulle

  3. a modern, secure, composable container runtime

  4. an implementation of appc (image format, execution environment)

  5. rkt simple CLI tool golang + Linux self-contained

  6. simple CLI tool no (mandatory) daemon: apps run directly under spawning process

  7. bash/runit/systemd rkt application(s)

  8. rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

  9. bash/runit/systemd rkt application(s)

  10. bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

  11. bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

  12. stage0 ( rkt binary) discover, fetch, manage application images set up pod filesystems commands to manage pod lifecycle

  13. bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

  14. bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

  15. stage1 "the container" execution environment for pods process lifecycle management resource constraints (isolators)

  16. stage1 (swappable) ● binary ABI with stage0 ● rkt's stage0 calls exec(stage1, args...) ● default implementation ○ based on systemd-nspawn + systemd ○ Linux namespaces + cgroups for isolation ● kvm implementation ○ based on lkvm + systemd ○ hardware virtualisation for isolation
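The exec() handoff between stages can be illustrated with a minimal sketch. It only shows the property that matters here: exec() replaces the program but keeps the process, so no daemon sits in between. The function name is hypothetical, and the Python interpreter stands in for a stage1 binary; this is not rkt's code.

```python
import os
import sys


def pid_survives_exec() -> bool:
    """Fork a child, exec a new program in it, and check the PID is unchanged."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.dup2(write_fd, 1)  # the exec'd program reports on stdout
            # Stand-in for exec(stage1, args...): the process image is replaced.
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            os._exit(127)  # only reached if execv() failed
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        text = pipe.read().strip()
    os.waitpid(child, 0)
    return text.isdigit() and int(text) == child
```

The exec'd program reports the same PID that fork() returned to the parent, which is why the invoking process (bash, runit, systemd) stays the direct parent of the pod.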

  17. bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

  18. bash/runit/systemd/... (invoking process) rkt (stage0) systemd-nspawn (stage1) systemd cached (stage2) workerd (stage2)

  19. bash/runit/systemd/... (invoking process) rkt (stage0) systemd-nspawn (stage1) systemd cached (stage2) workerd (stage2) container

  20. Containers on Linux namespaces cgroups chroot

  21. Linux namespaces

  22. containers: hostname “thunderstorm”, hostname “sunshine”; host: hostname “rainbow”

  23. Containers: no guest kernel ✤ Stack, top to bottom: apps; rkt; system calls (kernel API), example: sethostname(); host Linux kernel; hardware

  24. Containers with an example: getting and setting the hostname ✤ The system calls for getting and setting the hostname are older than containers: int uname(struct utsname *buf); int gethostname(char *name, size_t len); int sethostname(const char *name, size_t len);
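As a small illustration of these calls, the sketch below reads the hostname through Python's wrappers for gethostname() and uname(). The function name is made up for this example.

```python
import os
import socket


def current_hostname() -> tuple[str, str]:
    """Return the hostname as reported by gethostname() and by uname()."""
    return socket.gethostname(), os.uname().nodename


via_gethostname, via_uname = current_hostname()
print(via_gethostname, via_uname)
```

Both values come from the UTS namespace of the calling process, so a process in a container and a process on the host can get different answers from the same calls.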

  25. Processes in namespaces 1 3 6 2 9 gethostname() -> “rainbow” gethostname() -> “thunderstorm”

  26. Linux Namespaces Several independent namespaces ✤ uts (Unix Timesharing System) namespace ✤ mount namespace ✤ pid namespace ✤ network namespace ✤ user namespace
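One way to see which namespaces a process belongs to is to read the symbolic links under /proc/<pid>/ns (a kernel interface not shown on the slide). The sketch below is illustrative; the function names are hypothetical.

```python
import os
import re


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a link target such as 'uts:[4026531838]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result
```

Two processes are in the same namespace of a given type when the inode numbers match, for example when `namespaces_of("self")["uts"]` equals `namespaces_of("1")["uts"]`.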

  27. Creating new namespaces unshare(CLONE_NEWUTS); 1 6 2 “rainbow”

  28. Creating new namespaces 1 6 6 2 “rainbow” “rainbow”
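The unshare(CLONE_NEWUTS) step can be tried with a short sketch that calls libc through ctypes. It runs in a forked child so the caller's own namespace is untouched, and it assumes a Linux system; without CAP_SYS_ADMIN the call is normally refused and the function returns None. The function name is hypothetical.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # value from <sched.h>
libc = ctypes.CDLL(None, use_errno=True)


def hostname_in_new_uts_namespace(new_name: str):
    """Rename the host inside a private UTS namespace; return what the child saw."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        seen = ""
        try:
            os.close(read_fd)
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)  # affects only the new namespace
                seen = socket.gethostname()
            os.write(write_fd, seen.encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        seen = pipe.read()
    os.waitpid(pid, 0)
    return seen or None  # None: unshare() or sethostname() was refused
```

Run as root, `hostname_in_new_uts_namespace("thunderstorm")` returns the new name while the parent's hostname stays unchanged. Right after unshare() the new namespace still carries a copy of the old hostname, as on the slide.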

  29. PID namespace

  30. Hiding processes and PID translation ✤ the host sees all processes ✤ the container sees only its own processes

  31. Hiding processes and PID translation ✤ the host sees all processes ✤ the container sees only its own processes Actually pid 30920

  32. Initial PID namespace 1 2 6 7

  33. Creating a new namespace clone(CLONE_NEWPID, ...); 1 2 6 7

  34. Creating a new namespace 1 2 6 1 7
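The effect of a new PID namespace can be sketched from Python. Note the swap: the slide uses clone(CLONE_NEWPID, ...), while this sketch uses unshare(CLONE_NEWPID) followed by fork(), which is easier to call through ctypes and gives the same result for the first child. It assumes Linux and returns None when the call is refused (typically without CAP_SYS_ADMIN). The function name is hypothetical.

```python
import ctypes
import os

CLONE_NEWPID = 0x20000000  # value from <sched.h>
libc = ctypes.CDLL(None, use_errno=True)


def pid_of_first_process_in_new_namespace():
    """Return the PID seen by the first process of a new PID namespace."""
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        try:
            os.close(read_fd)
            if libc.unshare(CLONE_NEWPID) == 0:
                # The helper keeps its PID; only its children enter the namespace.
                first = os.fork()
                if first == 0:
                    os.write(write_fd, str(os.getpid()).encode())
                    os._exit(0)
                os.waitpid(first, 0)
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        seen = pipe.read()
    os.waitpid(helper, 0)
    return int(seen) if seen else None
```

With sufficient privileges the function returns 1: the first process in the new namespace sees itself as PID 1, while the host still knows it under an ordinary PID.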

  35. rkt rkt run ... ✤ uses clone() to start the first process in the container with a new pid namespace ✤ uses unshare() to create a new network namespace rkt enter ... ✤ uses setns() to enter an existing namespace

  36. Joining an existing namespace setns(..., CLONE_NEWPID); 1 2 6 1 7

  37. Joining an existing namespace 1 2 6 1 4
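Joining an existing namespace with setns() can be sketched as follows. The namespace is identified by a file descriptor opened on /proc/<pid>/ns/<type>; the helper names are hypothetical and the call normally needs CAP_SYS_ADMIN. This is an illustration of the system call, not rkt's implementation of `rkt enter`.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # values from <sched.h>
CLONE_NEWPID = 0x20000000
libc = ctypes.CDLL(None, use_errno=True)


def ns_path(pid: int, kind: str) -> str:
    """Path of the namespace file for one namespace type of a process."""
    return f"/proc/{pid}/ns/{kind}"


def join_namespace(pid: int, kind: str, nstype: int) -> bool:
    """Move the caller into a namespace of process `pid`; True on success."""
    fd = os.open(ns_path(pid, kind), os.O_RDONLY)
    try:
        return libc.setns(fd, nstype) == 0
    finally:
        os.close(fd)
```

A call such as `join_namespace(30920, "uts", CLONE_NEWUTS)` would switch the caller to that process's hostname view. For the PID namespace, setns() changes the namespace of children created afterwards, not of the caller itself.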

  38. Mount namespaces

  39. / container /my-app / /etc /var /home host user

  40. Storing the container data (copy-on-write) ✤ Container filesystem: overlay fs ✤ “upper” directory: /var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512-.../upper/ ✤ Application Container Image: /var/lib/rkt/cas/tree/sha512-...

  41. rkt directories
      /var/lib/rkt
      ├─ cas
      │  └─ tree
      │     ├─ deps-sha512-19bf...
      │     └─ deps-sha512-a5c2...
      └─ pods
         └─ run
            └─ e0ccc8d8
               ├─ overlay/sha512-19bf.../upper
               └─ stage1/rootfs/
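The copy-on-write layout maps onto the options of an overlay mount: a read-only lower directory (the image tree) and a writable upper directory (per pod). The sketch below only builds the option string. The work directory is required by the kernel's overlay filesystem and is not shown on the slides, and the paths in the usage example are placeholders rather than rkt's real ones.

```python
def overlay_mount_options(lower: str, upper: str, work: str) -> str:
    """Build the data string passed to mount(2) for an overlay filesystem."""
    return f"lowerdir={lower},upperdir={upper},workdir={work}"


# Placeholder paths, chosen for illustration only.
options = overlay_mount_options("/image/tree", "/pod/upper", "/pod/work")
print(f"mount -t overlay overlay -o {options} /pod/rootfs")
```

Writes inside the container land in the upper directory, so the image tree under the lower directory can be shared by several pods.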

  42. unshare(..., CLONE_NEWNS); 1 7 3 / /etc /var /home user

  43. 7 1 7 3 / / /etc /var /home /etc /var /home user user

  44. Changing root with MS_MOVE ✤ mount($ROOTFS, "/", MS_MOVE) ✤ $ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs ✤ Diagram: before, / contains ... rootfs, which contains my-app; after, / is the former rootfs and contains my-app

  45. Mount propagation events ✤ Relationship between the two mounts: shared; master / slave; private ✤ Diagram: / /etc /var /home user, shown twice

  46. Mount propagation events /home /home /home /home /home /home Private Shared Master and slave
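The propagation type of each mount is visible in /proc/<pid>/mountinfo, in the optional fields before the "-" separator. The parser below is a minimal sketch with a hypothetical function name; it ignores the less common `propagate_from:` tag.

```python
def propagation_of(mountinfo_line: str) -> set[str]:
    """Return the propagation types recorded on one mountinfo line."""
    fields = mountinfo_line.split()
    # Optional fields sit between the sixth field and the lone "-" separator.
    optional = fields[6:fields.index("-")]
    kinds = set()
    for tag in optional:
        if tag.startswith("shared:"):
            kinds.add("shared")
        elif tag.startswith("master:"):
            kinds.add("slave")
        elif tag == "unbindable":
            kinds.add("unbindable")
    return kinds or {"private"}
```

A mount with no optional fields is private, and a mount can be both shared and slave at the same time.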

  47. How rkt uses mount propagation events ✤ / in the container namespace is recursively set as slave: mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL) ✤ Diagram: / /etc /var /home user
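The same mount() call can be issued from Python through ctypes. The sketch below first creates a private mount namespace in a forked child, so the caller's mounts are never touched; it assumes Linux and returns False when the kernel refuses (typically without CAP_SYS_ADMIN). The function name is hypothetical.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000  # value from <sched.h>
MS_REC = 0x4000           # values from <sys/mount.h>
MS_SLAVE = 1 << 19

libc = ctypes.CDLL(None, use_errno=True)
libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_void_p]


def make_root_rslave_in_child() -> bool:
    """In a child with its own mount namespace, mark / recursively as slave."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            if (libc.unshare(CLONE_NEWNS) == 0 and
                    libc.mount(None, b"/", None, MS_SLAVE | MS_REC, None) == 0):
                status = 0
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status) == 0
```

With / set as slave, mount events from the host still propagate into the container, but mounts made inside the container do not propagate back.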

  48. Network namespace

  49. Network isolation ✤ Goal: each container has its own network interfaces ✤ Cannot see the network traffic outside the container (e.g. tcpdump) ✤ Diagram: container1 (eth0), container2 (eth0), host (eth0)
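A quick way to observe this isolation is to list the interfaces visible to the current process, which depends on the network namespace it runs in. A minimal sketch with a hypothetical function name:

```python
import socket


def visible_interfaces() -> list[str]:
    """Names of the network interfaces in the caller's network namespace."""
    return [name for _index, name in socket.if_nameindex()]


print(visible_interfaces())
```

Run on the host and inside a container, the same code returns different lists, even when both contain an interface called eth0.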

  50. Network tooling ✤ Linux can create pairs of virtual net interfaces ✤ Can be linked in a bridge ✤ Diagram: container1 (eth0) paired with veth1, container2 (eth0) paired with veth2; veth1 and veth2 joined in a bridge; host eth0; IP masquerading via iptables

  51. rkt networking ✤ plugin based ✤ Container Network Interface (CNI) ✣ rkt ✣ Kubernetes ✣ Calico

  52. Container Runtime (e.g. rkt) Container Networking Interface (CNI) veth macvlan ipvlan OVS

  53. How does rkt do it? ✤ rkt uses the network plugins implemented by the Container Network Interface (CNI, https://github.com/appc/cni) ✤ Diagram labels: rkt creates and joins the network namespace (/var/lib/rkt/pods/run/$POD_UUID/netns); rkt execs the network plugins, which configure the namespace via setns + netlink; rkt exec()s systemd-nspawn

  54. User namespaces

  55. History of Linux namespaces 1991: Linux ✓ 2002: namespaces in Linux 2.4.19 ✓ 2008: LXC ✓ 2011: systemd-nspawn ✓ 2013: user namespaces in Linux 3.8 ✓ 2013: Docker ✓ 2014: rkt ✓ … development still active

  56. Why user namespaces? ✤ Better isolation ✤ Run applications which would need more capabilities ✤ Per user limits ✤ Future: ✣ Unprivileged containers: possibility to have containers without root

  57. User ID ranges ✤ host: 0 to 4,294,967,295 (32-bit range) ✤ container 1: 0 to 65535 ✤ container 2: 0 to 65535

  58. User ID mapping ✤ /proc/$PID/uid_map: “0 1048576 65536” ✤ container UIDs 0 to 65535 map to host UIDs 1048576 to 1114111 (65536 IDs) ✤ IDs outside these ranges are unmapped
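The mapping line reads as "first UID inside, first UID outside, length". The translation can be worked through with a small sketch; the function names are hypothetical.

```python
def parse_uid_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map content into (inside_start, outside_start, length) entries."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_uid(entries, container_uid: int):
    """Translate a container UID to the host UID, or None if it is unmapped."""
    for inside, outside, length in entries:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


uid_map = parse_uid_map("0 1048576 65536\n")
print(to_host_uid(uid_map, 0), to_host_uid(uid_map, 65535))
```

With the slide's mapping, root in the container (UID 0) is UID 1048576 on the host, and container UID 65536 has no host equivalent.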

  59. Problems with container images ✤ Diagram: the Application Container Image (ACI) is downloaded from a web server; container 1 and container 2 each have a container filesystem built from the ACI plus their own overlayfs “upper” directory

  60. Problems with container images ✤ Files UID / GID ✤ rkt currently only supports user namespaces without overlayfs ✣ Performance loss: no COW from overlayfs ✣ “chown -R” for every file in each container

  61. Problems with volumes ✤ bind mount (rw / ro) ✤ mounted in several containers ✤ No UID translation ✤ Dynamic UID maps ✤ Diagram: /data on the host bind-mounted as /data in the containers

  62. User namespace and filesystem problem ✤ Possible solution: add options to mount() to apply a UID mapping ✤ rkt would use it when mounting: ✣ the overlay rootfs ✣ volumes

  63. Isolators

  64. Isolators in rkt ✤ specified in an image manifest ✤ limiting capabilities or resources

  65. Isolators in rkt ✤ Currently implemented: capabilities, cpu, memory ✤ Possible additions: block-bandwidth, block-iops, network-bandwidth, disk-space

  66. cgroups

  67. What’s a control group (cgroup)? ✤ groups processes together ✤ organised in trees ✤ applies limits to them as a group

  68. cgroup API /sys/fs/cgroup/*/ /proc/cgroups /proc/$PID/cgroup
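Each line of /proc/$PID/cgroup has the form hierarchy-ID:controller-list:cgroup-path. A minimal parser, with a hypothetical function name and made-up sample content:

```python
def parse_proc_cgroup(text: str) -> list[tuple[int, list[str], str]]:
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


# Illustrative content, not taken from a real rkt pod.
sample = "4:memory:/machine.slice\n3:cpu,cpuacct:/\n1:name=systemd:/user.slice\n"
for entry in parse_proc_cgroup(sample):
    print(entry)
```

On a live system, pass the text read from `/proc/self/cgroup` to see which cgroup the current process belongs to in each hierarchy.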

  69. cgroups

  70. List of cgroup controllers /sys/fs/cgroup/ ├─ cpu ├─ devices ├─ freezer ├─ memory ├─ ... └─ systemd

  71. Memory isolator ✤ Application Image Manifest: “limit”: “500M” ✤ systemd service file: [Service] ExecStart= MemoryLimit=500M ✤ systemd action: write to memory.limit_in_bytes
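The last step of this chain ends in a plain byte count. The sketch below converts a value such as 500M the way systemd documents its size suffixes (K, M, G, T with base 1024). It covers only the systemd side of the chain; how rkt itself interprets the manifest value is not shown here. The function name is hypothetical.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def systemd_size_to_bytes(value: str) -> int:
    """Convert a size such as '500M' to bytes, using base-1024 suffixes."""
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)


print(systemd_size_to_bytes("500M"))
```

Under this reading, MemoryLimit=500M corresponds to 524288000 bytes in memory.limit_in_bytes.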

  72. CPU isolator ✤ Application Image Manifest: “limit”: “500m” ✤ systemd service file: [Service] ExecStart= CPUShares=512 ✤ systemd action: write to cpu.shares
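The step from 500m to CPUShares=512 is consistent with scaling millicores so that one full CPU (1000m) equals 1024 shares. The sketch below works through that arithmetic; the formula is an assumption inferred from the slide's numbers, not confirmed as rkt's exact conversion, and the function name is hypothetical.

```python
def millicores_to_cpu_shares(limit: str) -> int:
    """Convert a CPU limit such as '500m' (millicores) to cgroup CPU shares.

    Assumes 1000 millicores correspond to 1024 shares.
    """
    if limit.endswith("m"):
        millicores = int(limit[:-1])
    else:
        millicores = int(limit) * 1000  # whole CPUs
    return millicores * 1024 // 1000


print(millicores_to_cpu_shares("500m"))
```

Half a CPU therefore becomes 512 shares, the value that ends up in cpu.shares.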

  73. Unified cgroup hierarchy ✤ Multiple hierarchies: ✣ one cgroup mount point for each controller (memory, cpu...) ✣ flexible but complex ✣ cannot remount with a different set of controllers ✣ difficult to give to containers in a safe way ✤ Unified hierarchy: ✣ cgroup filesystem mounted only one time ✣ soon to be stable in Linux (mount option “ __DEVEL__sane_behavior ” being removed) ✣ initial implementation in systemd-v226 (September 2015) ✣ no support in rkt yet

  74. Questions? Join us! github.com/coreos/rkt
