Rethinking the Linux kernel
Thomas Graf
Cilium Project, Co-Founder & CTO, Isovalent

Remember GeoCities?
Cameron Askin: Cameron’s World

What enabled this evolution?
Markup Only (HTML) → Programmable Platform
Programmability Essentials

Safety
Untrusted code runs in the browser of the user.
→ Sandboxing

Continuous Delivery
Allow evolution of the logic without having to constantly ship new browser versions.
→ Deploy anytime with seamless upgrades

Performance
Programmability must be provided with minimal overhead.
→ Native Execution (JIT compiler)
Kernel Architecture
[Diagram: In user space, a process issues write()/read() and sendmsg()/recvmsg() system calls, and an admin process changes configuration (sysfs, netlink, procfs, ...). In the Linux kernel, file descriptors lead through VFS to the block device, and sockets lead through TCP/IP to the network device. Below the kernel sit the storage hardware and the network hardware.]
Kernel Development 101

Option 1: Native Support
● Change kernel source code
● Expose configuration API
● Wait 5 years for your users to upgrade

Option 2: Kernel Module
● Write kernel module
● Every kernel release will break it
Cons:
● You likely need to ship a different module for each kernel version
● Might crash your kernel
How about we add JavaScript-like capabilities to the Linux Kernel?
[Diagram: a process calls execve(); the system call enters the Linux kernel and reaches the scheduler.]
eBPF Runtime
[Diagram: a controller process loads BPF program bytecode through the bpf() system call. In the Linux kernel the verifier approves the program, the JIT compiler translates it to x86_64, and the resulting BPF program is attached on the path between sockets, TCP/IP and the network device, where it runs when a process calls sendmsg()/recvmsg().]

Safety & Security
The verifier will reject any unsafe program and provides a sandbox.

Continuous Delivery
Programs can be exchanged without disrupting workloads.

Performance
The JIT compiler ensures native execution performance.
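To make "the verifier will reject any unsafe program" a little more concrete, here is a toy sketch in plain Python. It is not the kernel verifier, which tracks register and memory state in far more detail. It only checks two structural properties on a made-up instruction format: jumps must go forward and stay in bounds, and the program must end with an exit. The tuple encoding, the opcode names and the `MAX_INSNS` value are assumptions chosen for illustration.

```python
MAX_INSNS = 4096  # assumption: illustrative size limit, not a statement about any kernel version

# Toy opcodes (hypothetical encoding): the last tuple field of a jump is its offset.
JUMPS = {"ja", "jeq", "jne"}


def verify(program, max_insns=MAX_INSNS):
    """Return (True, "ok") or (False, reason) for a list of instruction tuples."""
    n = len(program)
    if n == 0:
        return False, "empty program"
    if n > max_insns:
        return False, "program too large"
    if program[-1][0] != "exit":
        return False, "program does not end with exit"
    for pc, insn in enumerate(program):
        if insn[0] in JUMPS:
            # Offsets are relative to the instruction after the jump.
            target = pc + 1 + insn[-1]
            if target <= pc:
                return False, f"backward jump at {pc}"
            if target >= n:
                return False, f"jump out of bounds at {pc}"
    return True, "ok"


safe = [("mov", "r0", 0), ("jeq", "r1", 0, 1), ("mov", "r0", 1), ("exit",)]
looping = [("mov", "r0", 0), ("ja", -2), ("exit",)]

print(verify(safe))
print(verify(looping))
```

The point of the sketch is the workflow: the check happens once at load time, so a rejected program never runs and an accepted one needs no runtime guard for these properties.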
eBPF Hooks
[Diagram: processes in user space issue write()/read() and sendmsg()/recvmsg() system calls into the Linux kernel; sockets and TCP/IP lead to the network device and network hardware, file descriptors and VFS lead to the block device and storage hardware.]

Where can you hook?
● kernel functions (kprobes)
● userspace functions (uprobes)
● system calls
● fentry/fexit
● tracepoints
● network devices (tc/xdp)
● network routes
● TCP congestion algorithms
● sockets (data level)
eBPF Maps
[Diagram: a controller and an admin process in user space reach a BPF map in the Linux kernel through system calls; the BPF program on the sockets, TCP/IP and network device path uses the same map while a process calls sendmsg()/recvmsg().]

What are Maps used for?
● Program state
● Program configuration
● Share data between programs
● Share state, metrics, and statistics with user space

Map Types:
- Hash tables, Arrays
- LRU (Least Recently Used)
- Ring Buffer
- Stack Trace
- LPM (Longest Prefix match)
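Of the map types listed, LPM is the least self-explanatory, so here is a minimal user-space model of what a longest-prefix-match lookup returns. It is a sketch in plain Python using the standard `ipaddress` module, not the kernel's data structure. The `LpmMap` class and the "allow"/"deny" values are hypothetical names for illustration.

```python
import ipaddress


class LpmMap:
    """User-space model of longest-prefix-match semantics (linear scan, for clarity)."""

    def __init__(self):
        self._entries = {}  # ip_network -> value

    def update(self, cidr, value):
        self._entries[ipaddress.ip_network(cidr)] = value

    def lookup(self, addr):
        ip = ipaddress.ip_address(addr)
        best = None
        for net, value in self._entries.items():
            if net.version == ip.version and ip in net:
                # Among all matching prefixes, the most specific one wins.
                if best is None or net.prefixlen > best[0].prefixlen:
                    best = (net, value)
        return None if best is None else best[1]


policy = LpmMap()
policy.update("10.0.0.0/8", "allow")
policy.update("10.1.0.0/16", "deny")

print(policy.lookup("10.1.2.3"))     # matches both prefixes, /16 is more specific
print(policy.lookup("10.2.0.1"))     # matches only the /8
print(policy.lookup("192.168.0.1"))  # no match
```

This is the lookup behaviour that makes LPM maps a natural fit for CIDR-based policy and routing decisions.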
eBPF Helpers
[Diagram: a BPF program attached in the Linux kernel between sockets, TCP/IP and the network device calls a helper:]

    [...]
    num = bpf_get_prandom_u32();
    [...]

What helpers exist?
● Random numbers
● Get current time
● Map access
● Get process/cgroup context
● Manipulate network packets and forwarding
● Access socket data
● Perform tail call
● Access process stack
● Access syscall arguments
● ...
eBPF Tail and Function Calls

What are Tail Calls used for?
● Chain programs together
● Split programs into independent logical components
● Make BPF programs composable

What are Function Calls used for?
● Reuse functionality inside of a program
● Reduce program size (avoid inlining)
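The chaining idea behind tail calls can be modelled in a few lines of plain Python. This is a sketch of the control flow only, under stated assumptions: programs are ordinary functions, the program array is a dict, and a failed tail call (empty slot or chain limit reached) makes the caller continue, which is modelled here by returning a fallback verdict. The names `run`, `parse`, `policy`, `count` and the value of `MAX_TAIL_CALLS` are illustrative, not kernel APIs.

```python
MAX_TAIL_CALLS = 33  # assumption: stands in for the kernel's cap on chained tail calls

PARSE, POLICY, COUNT = 0, 1, 2  # hypothetical slots in the program array


def run(prog_array, start, ctx, limit=MAX_TAIL_CALLS):
    """Run prog_array[start]; programs return ("return", value)
    or ("tail_call", index, fallback)."""
    prog = prog_array[start]
    tail_calls = 0
    while True:
        action = prog(ctx)
        if action[0] == "return":
            return action[1]
        _, index, fallback = action
        target = prog_array.get(index)
        if target is None or tail_calls >= limit:
            # Failed tail call: the caller carries on, modelled as its fallback verdict.
            return fallback
        tail_calls += 1
        prog = target  # the callee replaces the caller; control never comes back


def parse(ctx):
    ctx["proto"] = "tcp" if ctx["raw"].startswith("T") else "other"
    return ("tail_call", POLICY, "drop")


def policy(ctx):
    if ctx["proto"] != "tcp":
        return ("return", "drop")
    return ("tail_call", COUNT, "pass")


def count(ctx):
    ctx["counted"] = True
    return ("return", "pass")


progs = {PARSE: parse, POLICY: policy, COUNT: count}
print(run(progs, PARSE, {"raw": "TCP payload"}))

# Removing one component leaves the rest of the chain working.
without_count = {k: v for k, v in progs.items() if k != COUNT}
print(run(without_count, PARSE, {"raw": "TCP payload"}))
```

Because each stage is looked up by slot at call time, one stage can be swapped or removed without touching the others, which is the composability the slide refers to.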
Community
287 contributors (Jan 2016 to Jan 2020):
● 466 Daniel Borkmann (Cilium; maintainer)
● 290 Andrii Nakryiko (Facebook)
● 279 Alexei Starovoitov (Facebook; maintainer)
● 217 Jakub Kicinski (Facebook)
● 173 Yonghong Song (Facebook)
● 168 Martin KaFai Lau (Facebook)
● 159 Stanislav Fomichev (Google)
● 148 Quentin Monnet (Cilium)
● 148 John Fastabend (Cilium)
● 118 Jesper Dangaard Brouer (Red Hat)
● [...]
eBPF Projects

Cilium
Networking, security and load-balancing for k8s
cilium/cilium

facebookincubator/katran
High-performance L4 Loadbalancer

bcc, bpftrace et al.
Performance troubleshooting & profiling
iovisor/bcc

Traffic Optimization
DDoS mitigation, QoS, traffic optimization, load balancer
cloudflare/bpftools

Falco
Container runtime security, behavior analysis
falcosecurity/falco

Android & Security
Kernel runtime security instrumentation (KRSI), Android BPF loader, eBPF traffic monitor
Tracing & Profiling with Python
[Diagram: BCC, running as a user-space process, loads a BPF program through a system call; the program passes the verifier and the JIT compiler, attaches to the sockets and TCP/IP path used by a process calling sendmsg()/recvmsg(), and reports back through BPF maps.]

BCC: github.com/iovisor/bcc

    # tcptop
    Tracing... Output every 1 secs. Hit Ctrl-C to end
    <screen clears>
    19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681

    PID    COMM   LADDR            RADDR                 RX_KB  TX_KB
    16648  16648  100.66.3.172:22  100.127.69.165:6684       1      0
    16647  sshd   100.66.3.172:22  100.127.69.165:6684       0   2149
    14374  sshd   100.66.3.172:22  100.127.69.165:25219      0      0
    14458  sshd   100.66.3.172:22  100.127.69.165:7165       0      0
bpftrace - DTrace for Linux
[Diagram: bpftrace, running as a user-space process, loads a bpftrace program through a system call; the program passes the verifier and the JIT compiler, attaches to the file descriptor and VFS path used by a process calling open(), and reports back through BPF maps.]

bpftrace: github.com/iovisor/bpftrace

    # bpftrace -e 'kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)) }'
    Attaching 1 probe...
    git: .git/objects/da
    git: .git/objects/pack
    git: /etc/localtime
    systemd-journal: /var/log/journal/72d0774c88dc4943ae3d34ac356125dd
    DNS Res~ver #15: /etc/hosts
    ^C
Networking, load-balancing and security for Kubernetes
[Diagram: Kubernetes runs containers next to Cilium. Cilium loads BPF programs through system calls; they pass the verifier and the JIT compiler and use BPF maps. Each container has its own sockets, TCP/IP stack and network device inside the Linux kernel, all above the shared network hardware.]
Container Networking
● Highly efficient and flexible networking
● Routing, Overlay, Cloud-provider native
● IPv4, IPv6, NAT46
● Multi cluster routing

Container Security
● Identity-based network security
● API-aware security (HTTP, gRPC, Kafka, Cassandra, memcached, ..)
● DNS-aware policies
● Encryption
● SSL data visibility via kTLS

Service Load balancing:
● Highly scalable L3-L4 load balancing
● Kubernetes services (replaces kube-proxy)
● Multi-cluster
● Service affinity (prefer zones)

Visibility
● Service topology map & live visualization
● Advanced network metrics & alerting

Servicemesh:
● Minimize overhead when injecting servicemesh sidecar proxies
● Istio integration
Hubble: eBPF Visibility for Kubernetes

    # hubble observe --since=1m -t l7 -j \
        | jq 'select(.l7.dns.rcode==3) | .destination.namespace + "/" + .destination.pod_name' \
        | sort | uniq -c | sort -r
         42 "starwars/jar-jar-binks-6f5847c97c-qmggv"
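The jq filter above selects flows whose DNS response code is 3 (NXDOMAIN) and counts them per destination pod. The same logic can be sketched in Python over JSON lines. The field names (`l7.dns.rcode`, `destination.namespace`, `destination.pod_name`) are taken from the jq expression; the sample flows and the `example-pod` name are made up for illustration and are not real Hubble output.

```python
import json
from collections import Counter

NXDOMAIN = 3  # DNS response code 3: the queried name does not exist


def count_nxdomain(lines):
    """Count NXDOMAIN flows per "namespace/pod_name", most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line)
        dns = (flow.get("l7") or {}).get("dns") or {}
        if dns.get("rcode") == NXDOMAIN:
            dest = flow.get("destination") or {}
            key = (dest.get("namespace") or "") + "/" + (dest.get("pod_name") or "")
            counts[key] += 1
    return counts.most_common()


# Hypothetical sample input, shaped after the fields used in the jq filter.
sample = [
    '{"l7": {"dns": {"rcode": 3}}, "destination": {"namespace": "starwars", "pod_name": "jar-jar-binks-6f5847c97c-qmggv"}}',
    '{"l7": {"dns": {"rcode": 0}}, "destination": {"namespace": "starwars", "pod_name": "example-pod"}}',
    '{"l7": {"dns": {"rcode": 3}}, "destination": {"namespace": "starwars", "pod_name": "jar-jar-binks-6f5847c97c-qmggv"}}',
    '{"destination": {"namespace": "starwars", "pod_name": "example-pod"}}',
]

for key, n in count_nxdomain(sample):
    print(n, key)
```

To run it against live data, the lines could be read from standard input instead of `sample`, for example by piping `hubble observe --since=1m -t l7 -j` into the script.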
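Go Development Toolchain
[Diagram, development: C source is compiled with clang -target bpf into BPF program bytecode. Diagram, runtime: a process using the Go library loads the BPF program and its maps through system calls; the program passes the verifier and the JIT compiler, attaches to the sockets and TCP/IP path used by a process calling sendmsg()/recvmsg(), and shares a BPF map with the Go process.]

Go Library: https://github.com/cilium/ebpf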
Outlook: Future of eBPF

eBPF is turning the Linux kernel into a microkernel.
● An increasing amount of new kernel functionality is implemented with eBPF.
● 100% modular and composable.
● New additions can evolve at a rapid pace, much quicker than normal kernel development.
Example: The Linux kernel is not aware of containers and microservices (it only knows about namespaces). Cilium is making the Linux kernel container and Kubernetes aware.

eBPF could enable the Linux kernel hotpatching we always dreamed about.
Problem:
● A Linux kernel vulnerability requires patching the kernel.
● Rebooting 20’000 servers without risking extensive downtime takes a very long time.
[Diagram: the Linux kernel as a set of functions, one of which is replaced by a hotfix.]
Thank You

eBPF Maintainers
Daniel Borkmann, Alexei Starovoitov

Cilium Team
André Martins, Jarno Rajahalme, Joe Stringer, John Fastabend, Maciej Kwiek, Martynas Pumputis, Paul Chaignon, Quentin Monnet, Ray Bejjani, Tobias Klauser

Facebook Team
Andrii Nakryiko, Andrey Ignatov, Jakub Kicinski, Martin KaFai Lau, Roman Gushchin, Song Liu, Yonghong Song

Google Team
Chenbo Feng, KP Singh, Lorenzo Colitti, Maciej Żenczykowski, Stanislav Fomichev

BCC & bpftrace
Alastair Robertson, Brendan Gregg, Brenden Blanco

Kernel Team
Björn Töpel, David S. Miller, Edward Cree, Jesper Brouer, Toke Høiland-Jørgensen

● BPF Getting Started Guide
● BPF and XDP Reference Guide
● Cilium: github.com/cilium/cilium
● Twitter: @ciliumproject
● Contact the speaker: @tgraf__

All images: Pixabay