20,000 Upgrades Later
Lessons From a Year of Managed Kubernetes Upgrades
Adam Wolfe Gordon
DigitalOcean
do.co/doks
@maybeawg

This Talk Started One(ish) Year Ago...
Me, in Barcelona
DO, in Barcelona

Generally Available? UPGRADES!

Disclaimers!
- Lessons from our upgrade process.
  - You might upgrade differently!
- Upgrades of our customers' clusters.
  - Your workloads might be different!

How to Upgrade Kubernetes
1. Upgrade the control plane.
2. Upgrade the worker nodes.
3. ???
4. Profit!

How to Upgrade Kubernetes
1. Upgrade the control plane.
   a. Update any resources that aren't supported in the target version.
   b. Upgrade etcd (if needed).
   c. Upgrade kube-apiserver.
   d. Upgrade kube-controller-manager.
   e. Upgrade kube-scheduler.
   f. Upgrade your CNI plugin (if needed).
   g. Upgrade provider-specific components (e.g. cloud-controller-manager, CSI controller).
   h. Upgrade kubelet and kubectl.
2. Upgrade the worker nodes.
   a. Cordon and drain a worker node.
   b. Update kubelet configuration (if needed).
   c. Upgrade the kubelet.
   d. Uncordon the node.
   e. Repeat for each node in the cluster.

Shortcut: Upgrade via Node Replacement
1. Upgrade the control plane.
   a. Update any resources that aren't supported in the target version.
   b. Destroy the original control plane node.
   c. Provision a new control plane node.
   (Replaces the in-place steps: upgrade etcd, kube-apiserver, kube-controller-manager, kube-scheduler, the CNI plugin, provider-specific components, and kubelet and kubectl.)
2. Upgrade the worker nodes.
   a. Cordon and drain a worker node.
   b. Destroy the node.
   c. Provision a new node.
   d. Repeat for each node in the cluster.
   (Replaces the in-place steps: update kubelet configuration, upgrade the kubelet, uncordon the node.)

Advantages of Node Replacement
- Clean slate - no chance for configuration drift.
- Fewer steps to manage - good for automation.
- Same process works for all release types.
  - (Mostly)

Things We Got Right: Upgrades via Node Replacement

Problems: Ch-ch-changes
- Custom node configuration is reset.
- Node names change.
- Node IPs change.
- Node labels and taints lost.

Lessons for Operators: Managing Change
- Re-use node names and IPs if possible.
- Retain labels or provide a good alternative.
- Retain taints or provide a good alternative.
- Provide simple ingress/load balancing.

Lessons for Developers: Tolerating Change
- Use Kubernetes to do node customization.
  - DaemonSets
  - Init containers
- Don't use node names for scheduling.
- Use provider-supported label/taint settings.
- Use provider-supported load balancing.

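As a minimal sketch of the DaemonSet-plus-init-container approach listed above, the manifest below reapplies a kernel setting on every node, including nodes created during an upgrade. The name node-tuning, the sysctl and its value, and the images are illustrative assumptions, not details from the talk. The privileged init container is what allows a node-level change, so review it against your own security policy.

```yaml
# Hypothetical example: re-apply a node-level setting on every node,
# including replacement nodes that join during an upgrade.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-tuning            # illustrative name
spec:
  selector:
    matchLabels:
      app: node-tuning
  template:
    metadata:
      labels:
        app: node-tuning
    spec:
      initContainers:
        - name: set-sysctl
          image: busybox:1.36
          securityContext:
            privileged: true   # required to change a non-namespaced sysctl
          # Example setting only; substitute whatever customization you need.
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
      containers:
        # A do-nothing container keeps the pod running after the init step.
        - name: pause
          image: registry.k8s.io/pause:3.9
```

Because the DaemonSet controller schedules a pod onto each new node, the customization follows the cluster through node replacement instead of living on any one machine.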
Things We Got Wrong: Break Before Make

Problems: Drain to Nowhere
- Insufficient capacity to drain nodes.
- Downtime in single-node clusters.
- Extra churn for workloads.
  - Might be drained to a node that's about to be deleted.

Lessons for Operators: Drainage Capacity
- Add nodes before deleting nodes if possible.
- Consider reserving capacity.

Lessons for Developers: Expect to be Drained
- Leave capacity for a node to be drained.

Things We Got Wrong: Replacing Nodes One by One

Problems: Ants Go Marching
- Replacing nodes one-by-one is slow.
- Workloads can get stuck draining.
  - Making replacement even slower.
- Upgrades need to be expedient.

Lessons for Operators: Rapid Replacement
- Drain and replace multiple nodes at once.
  - This usually requires make-before-break.
- Set reasonable drain timeouts.

Lessons for Developers: Unclog Your Drains
- Make sure your workloads can be evicted.
  - Safely: Use PodDisruptionBudgets.
  - Quickly: Respond to signals.
- Test this!

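The two eviction points above can be sketched as a pair of manifests: a PodDisruptionBudget so that a drain evicts pods gradually, and a bounded termination grace period so that a pod that ignores signals cannot hold up a drain indefinitely. The app label, names, replica count, image, and grace period are illustrative assumptions. policy/v1 requires Kubernetes 1.21 or newer; older clusters, including those current when this talk was given, use policy/v1beta1.

```yaml
# Hypothetical example: allow at most one "web" pod to be evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
---
# The matching workload. With three replicas and maxUnavailable: 1,
# a drain can always make progress without taking the service down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # On eviction the container receives SIGTERM; if it has not exited
      # after this many seconds it is killed with SIGKILL.
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: nginx:1.25    # illustrative image
```

A budget that can never be satisfied (for example maxUnavailable: 0, or minAvailable equal to the replica count) blocks eviction entirely, which is one way workloads end up stuck draining.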
Things We Got Wrong (but felt so right): Minor Version Upgrades are Easy

Lessons for Operators: Don't Worry, Be Happy
- Minor version upgrades aren't that scary.
- Try to use the same process for all upgrades.

Things We Got Right: Disabling Alpha Features

Lessons for Operators: Wait for Beta
- Alpha features are disabled by default.
- Alpha features are likely to change/break.
- Beta features are less likely to change.
- Consider the upgrade tradeoff.

Lessons for Developers: Alpha as a Last Resort
- Avoid using alpha features if possible.
- Read release notes before upgrading.

Common Problems: Container Storage Interface (CSI)

CSI Problems: Beta
- CSI was promoted to beta in Kubernetes 1.10.
- Supporting components were relatively new.
- CSI drivers were relatively new.
- Out-of-sync state.
- Far fewer problems in recent releases.

CSI Problems: Driver Names
- In early CSI specs, com.example.csi.
- In later CSI specs, csi.example.com.
- The name is immutable in Kubernetes!
- Solution: detect and persist old naming.

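To illustrate where the driver name gets pinned, the sketch below shows two objects that record it, using the placeholder names from the slide. The object names, size, and volume handle are hypothetical. To my understanding, a StorageClass's provisioner and a PersistentVolume's volume source cannot be edited after creation, which is why objects created under the old name keep it after an upgrade.

```yaml
# Hypothetical objects created while the driver used the early naming style.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-block-storage
provisioner: com.example.csi     # old-style name; cannot be changed in place
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: example-block-storage
  csi:
    driver: com.example.csi      # must keep matching the name the driver registers
    volumeHandle: vol-0001       # hypothetical volume ID
```

If an upgraded driver registered only as csi.example.com, these existing objects would refer to a driver that no longer exists under that name, so the driver has to keep answering to the old one.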
Lessons: Beware the CSI
- If you're using CSI, carefully test upgrades.
- Watch for workloads that get stuck.
- Use Kubernetes 1.14+ if possible.

Common Problems: Admission Control Webhooks

Admission Control Webhooks: Overview

Admission Control Webhooks: Overview

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: webhook.example.com
    webhooks:
    - name: webhook.example.com
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
        scope: "Namespaced"
      clientConfig:
        service:
          namespace: "webhook-namespace"
          name: "webhook-service"
      admissionReviewVersions: ["v1", "v1beta1"]
      sideEffects: None
      timeoutSeconds: 30
      failurePolicy: Fail

The slide shows this configuration twice, side by side. The second copy differs only in its last line: failurePolicy: Ignore.

Admission Control Webhooks: Trouble for Upgrades
- Upgrades update system components.
- Some of these components run as workloads.
  - Usually in the kube-system namespace.
- Webhooks can prevent these updates.
- Webhooks can also affect their own services!

Admission Control Webhooks: Problems

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: webhook.example.com
    webhooks:
    - name: webhook.example.com
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
        scope: "Namespaced"
      clientConfig:
        service:
          namespace: "webhook-namespace"
          name: "webhook-service"
      admissionReviewVersions: ["v1", "v1beta1"]
      sideEffects: None
      timeoutSeconds: 30
      failurePolicy: Fail

Labels on the slide: webhook-service, kube-proxy, cilium.

Admission Control Webhooks: Solutions

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: webhook.example.com
    webhooks:
    - name: webhook.example.com
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
        scope: "Namespaced"
      clientConfig:
        service:
          namespace: "webhook-namespace"
          name: "webhook-service"
      admissionReviewVersions: ["v1", "v1beta1"]
      sideEffects: None
      timeoutSeconds: 30
      failurePolicy: Ignore

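A related mitigation, not shown on the slide, is to keep the webhook from ever seeing system pods or its own pods. The sketch below adds a namespaceSelector to the configuration above. It assumes namespaces carry the kubernetes.io/metadata.name label, which recent Kubernetes versions (1.21 and later) set automatically; on older clusters you would need to label namespaces yourself. The shorter timeout is also an assumption rather than a recommendation from the talk.

```yaml
# Hypothetical variant: skip kube-system and the webhook's own namespace.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webhook.example.com
webhooks:
- name: webhook.example.com
  namespaceSelector:
    matchExpressions:
    # Only namespaces NOT in this list are sent to the webhook.
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "webhook-namespace"]
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: "Namespaced"
  clientConfig:
    service:
      namespace: "webhook-namespace"
      name: "webhook-service"
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  timeoutSeconds: 5        # assumed value; fail fast if the service is down
  failurePolicy: Ignore
```

Excluding the webhook's own namespace addresses the case where the webhook service is unavailable and its replacement pods would otherwise need the webhook's approval to start.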