Replacing iptables with eBPF in Kubernetes with Cilium
Cilium, eBPF, Envoy, Istio, Hubble
Michal Rostecki, Software Engineer (mrostecki@suse.com, mrostecki@opensuse.org)
Swaminathan Vasudevan, Software Engineer (svasudevan@suse.com)
What’s wrong with iptables?
What’s wrong with legacy iptables?
iptables runs into several significant problems:
● iptables updates must be made by recreating and updating all rules in a single transaction.
● It implements chains of rules as a linked list, so all operations are O(n).
● The standard practice for implementing access control lists (ACLs) with iptables is a sequential list of rules.
● It matches on IPs and ports and is not aware of L7 protocols.
● Every time there is a new IP or port to match, rules need to be added and the chain changed.
● It consumes a lot of resources on Kubernetes nodes.
Because of these issues, performance degrades under heavy traffic or when the iptables rule set changes frequently. Measurements show unpredictable latency and reduced performance as the number of services grows.
Kubernetes uses iptables for...
● kube-proxy, the component which implements Services and load balancing with DNAT iptables rules
● most CNI plugins, which use iptables to implement Network Policies
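To make the kube-proxy point concrete, here is a minimal sketch of a ClusterIP Service; the name, selector and ports are made up for the example, not taken from the slides. In iptables mode, kube-proxy renders every such Service into per-service and per-endpoint DNAT rules (the KUBE-SERVICES/KUBE-SVC-*/KUBE-SEP-* chains), so the rule set grows with the number of Services and endpoints.

# Illustrative ClusterIP Service (names and ports are assumptions for this sketch).
# kube-proxy in iptables mode translates it into DNAT rules that rewrite the
# ClusterIP:port to one of the backend pod IPs.
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    role: backend
  ports:
  - protocol: TCP
    port: 80          # ClusterIP port
    targetPort: 8080  # port on the backend pods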
And it ends up like this
What is BPF?
Linux Network Stack
[Diagram: the Linux network stack layers, from processes through the System Call Interface, Sockets, TCP/UDP/Raw, Netfilter, IPv4/IPv6, Ethernet, Traffic Shaping, down to Netdevice/Drivers and the OVS/HW bridge]
● The Linux kernel network stack is split into multiple abstraction layers.
● Linux has maintained strong userspace API compatibility for years.
● This shows how complex the Linux kernel is and how many years of evolution it carries.
● It cannot be replaced in the short term.
● It is very hard to bypass the layers.
● The netfilter module has been part of Linux for more than two decades, and packet filtering has to be applied to packets as they move up and down the stack.
BPF kernel hooks
[Diagram: the same network stack annotated with the points where BPF programs can attach: BPF system calls at the syscall interface, BPF sockmap and sockops at the socket layer, BPF cgroup hooks, BPF TC hooks at traffic shaping, and BPF XDP at the netdevice/driver level]
[Chart: packet-processing performance in Mpps (millions of packets per second)]
BPF replaces iptables
[Diagram: side-by-side packet paths. On the iptables side, packets traverse the netfilter PREROUTING, INPUT, FORWARD, OUTPUT and POSTROUTING chains, with FILTER and NAT processing around the routing decisions. On the BPF side, eBPF code attached at XDP and TC hooks on the netdevices (physical or virtual) performs the same filtering and NAT before and after the routing decision]
BPF based filtering architecture
[Diagram: packets enter through the TC/XDP ingress hook and leave through the TC egress hook on the netdevice (physical or virtual). Chain selectors steer them to the ingress, forward or egress chains depending on whether the source and destination are local or remote, mirroring the netfilter INGRESS, FORWARD and OUTPUT chains. Connection tracking labels packets and stores/updates per-session state along the way]
BPF based tail calls
● LBVS classification is implemented with a chain of eBPF programs connected through tail calls.
● Each eBPF program is injected only if there are rules operating on that exact field.
● Each eBPF program can exploit a different matching algorithm (e.g., exact match, longest prefix match, etc.).
● Header parsing is done once, and the results (packet header offsets) are kept in a shared map for performance reasons.
● The bit vector with the temporary matching result lives in a per-CPU array shared across the entire program chain.
[Diagram: packet in → header parsing → per-field lookup programs (e.g., IP.dst, IP.proto) chained via tail calls, each producing a bit vector → bitwise AND of the bit vectors → first matching rule is selected, counters are updated, and the action (drop/accept) is applied → packet out]
BPF goes into...
● Load balancers - katran
● perf
● systemd
● Suricata
● Open vSwitch - AF_XDP
● And many, many others
BPF is used by...
Cilium
What is Cilium?
CNI Functionality
CNI is a CNCF (Cloud Native Computing Foundation) project for Linux containers. It consists of a specification and libraries for writing plugins. It only cares about the network connectivity of containers and exposes two main operations: ADD and DEL.
General container runtime considerations for CNI: the container runtime must
● create a new network namespace for the container before invoking any plugins
● determine the networks for the container and add the container to each network by calling the corresponding plugin
● not invoke parallel operations for the same container
● order ADD and DEL operations for a container, such that ADD is always eventually followed by a corresponding DEL
● not call ADD twice (without a corresponding DEL) for the same (network name, container ID, name of the interface inside the container)
When the CNI ADD call is invoked, the plugin adds the network to the container by creating the veth pair and assigning an IP address from the respective IPAM plugin or using the host scope. When the CNI DEL call is invoked, the plugin removes the container network, releases the IP address back to the IPAM manager, and cleans up the veth pair.
Cilium CNI Plugin Control Flow
[Diagram: kubectl talks to the Kubernetes API server, which drives the kubelet. The kubelet asks the container runtime (CRI-containerd) to create the pod, and the runtime invokes the Cilium CNI plugin (cni-add()). The plugin talks to the Cilium agent, which uses BPF syscalls to populate BPF maps and attach BPF programs to hooks in the kernel network stack (e.g., on eth0)]
Cilium Components with BPF Hook Points and BPF Maps in the Linux Stack
[Diagram: in userspace, the Cilium pod (control plane) with the Cilium agent daemon, CLI, monitor, health, operator and CNI plugin, alongside the application VMs and containers. In the kernel, per-container BPF programs and BPF maps are attached at the socket layer (SO_ATTACH_BPF, sockmap/sockops), at TC on the virtual devices, and at XDP on the device driver; the agent manages them through bpf() syscalls (e.g., map create and lookup commands)]
Cilium as CNI Plugin
[Diagram: a Kubernetes cluster with two nodes. Each pod's eth0 interface is connected to a host-side lxc* veth device managed by Cilium networking, which in turn connects to the node's eth0]
Networking modes
Encapsulation
● Use case: Cilium handles routing between nodes itself.
● Nodes are connected with VXLAN tunnels.
Direct routing
● Use case: routing is delegated to cloud provider routers or a BGP routing daemon.
[Diagram: Node A, Node B and Node C connected via VXLAN tunnels in encapsulation mode, or via cloud/BGP routing in direct routing mode]
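As a hedged sketch of how the mode is selected, the snippet below assumes configuration through the cilium-config ConfigMap; the option names are illustrative and differ between Cilium releases, so consult the documentation for the version you run.

# Sketch only: selecting the networking mode via the cilium-config ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Encapsulation mode: Cilium routes pod traffic between nodes over VXLAN.
  tunnel: "vxlan"
  # Direct routing mode would instead disable the tunnel and let the cloud
  # router or a BGP daemon carry the pod routes, e.g.:
  # tunnel: "disabled"
  # auto-direct-node-routes: "true"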
Pod IP Routing - Overlay Routing (Tunneling mode)
Pod IP Routing - Direct Routing Mode
L3 filtering – label based, ingress
[Diagram: ingress to the pod labeled role=backend (10.0.0.3) is allowed from pods labeled role=frontend (e.g., 10.0.0.1, 10.0.0.2) and denied for other pods (e.g., 10.0.0.5)]
L3 filtering – label based, ingress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow frontends to access backends"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
L3 filtering – CIDR based, egress
[Diagram: the pod labeled role=backend (10.0.0.1) is allowed egress to cluster A's subnet 10.0.1.0/24 (e.g., 10.0.1.1) and denied egress to any IP outside 10.0.1.0/24, such as 10.0.2.1 in subnet 10.0.2.0/24]
L3 filtering – CIDR based, egress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow backends to access 10.0.1.0/24"
  endpointSelector:
    matchLabels:
      role: backend
  egress:
  - toCIDR:
    - "10.0.1.0/24"
L4 filtering

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow to access backends only on TCP/80"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
L4 filtering
[Diagram: traffic to the pod labeled role=backend (10.0.0.1) is allowed on TCP/80 and denied on any other port]
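The L4 policy above matches any source endpoint as long as it connects on TCP/80. As a sketch of how L3 and L4 constraints compose in a single ingress rule, the example below reuses the role=frontend label from the earlier L3 example; the policy name is an assumption made for illustration.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend-l3-l4"   # illustrative name
spec:
  description: "Allow frontends to reach backends, but only on TCP/80"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"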
L7 filtering – API Aware Security
[Diagram: requests from the pod 10.0.0.5 to the pod labeled role=api (10.0.0.1): GET /articles/{id} is allowed, GET /private is denied]
L7 filtering – API Aware Security

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "L7 policy to restrict access to specific HTTP endpoints"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
      rules:
        http:
        - method: "GET"
          path: "/article/$"