kubernetes · upgrades · platform · reliability

Kubernetes Upgrades Are a Discipline, Not a Project

Most teams I audit are two minor versions behind on k8s and treat each upgrade like a small migration. That is the wrong shape. Upgrades are a habit.

2 September 2024 · 4 min read

Roughly half the platforms I audit are running a Kubernetes minor version that left support six months ago. The reasons are always the same. Last upgrade was painful. Nobody wants to own the next one. And the cluster works, so leave it alone.

This is the wrong shape. Upgrades are not a project. They are a discipline. Run them as one and the pain goes away.

Why teams fall behind

The mechanics of falling behind are predictable:

  • The team upgrades from 1.24 to 1.25 in a "project" that takes six weeks.
  • The project ends. Everyone celebrates.
  • 1.26 ships. Nobody schedules the next project.
  • Six months later 1.27 is current and the team is on 1.25.
  • Someone notices the gap, sizes the work as "two upgrades, twelve weeks", and quietly de-prioritises it.
  • One year later you are on a version with known CVEs and your CNI driver no longer publishes images for it.

The trap is treating each upgrade as a discrete event. Kubernetes ships three minor releases a year, roughly one every four months. If your upgrade cadence is slower than that, you accumulate debt by definition.

What a discipline looks like

Three rules. Boring. Effective.

1. Track n-1, not n

Do not chase the head of the train. The latest minor lands with rough edges, and the ecosystem (CNI, CSI, ingress, operators) takes a release or two to catch up. Track one minor behind the latest stable. That is your target. Always.
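To make the target concrete rather than tribal knowledge, resolve it mechanically. A minimal sketch, assuming kubectl access to the cluster; it reads the latest stable tag from dl.k8s.io (the same endpoint the kubectl install docs use) and subtracts one minor:

```python
#!/usr/bin/env python3
"""Resolve the n-1 upgrade target and compare it to the running cluster."""
import json
import subprocess
import urllib.request

STABLE_URL = "https://dl.k8s.io/release/stable.txt"  # returns a tag like "v1.31.2"

def latest_stable_minor() -> int:
    with urllib.request.urlopen(STABLE_URL) as resp:
        tag = resp.read().decode().strip()            # "v1.31.2"
    return int(tag.lstrip("v").split(".")[1])

def cluster_minor() -> int:
    out = subprocess.run(
        ["kubectl", "version", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Managed clusters report minors with a suffix, e.g. "29+"; keep the digits.
    raw = json.loads(out)["serverVersion"]["minor"]
    return int("".join(ch for ch in raw if ch.isdigit()))

if __name__ == "__main__":
    target = latest_stable_minor() - 1                # the n-1 policy
    current = cluster_minor()
    print(f"target: 1.{target}, cluster: 1.{current}")
    if current < target:
        print(f"behind by {target - current} minor(s); schedule the next upgrade slot")
```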

2. Upgrade every 12 weeks, on a calendar

Pick a date every quarter. Block it. The work happens whether or not anything has changed. If the upstream version did not move, you still run the drill: validate the upgrade in staging, run the conformance suite, exercise the runbook. Muscle memory is the goal.

3. Pre-flight is automation, not a checklist

The pre-flight should be a pipeline:

  • Lint manifests against the target version's removed APIs (kubent, or the apiserver's own deprecation warnings and metrics).
  • Diff the cluster's current admission webhooks against the target's behaviour changes.
  • Run a synthetic load profile and capture latency p50/p95/p99 before and after.
  • Snapshot etcd. Verify restore. Yes, before every upgrade. Yes, even though nothing has changed since last quarter.

If any of those steps require a human running a command, you have built a project, not a discipline.
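A minimal sketch of two of those stages as pipeline steps, assuming kubectl and etcdctl are on the path and the etcd endpoint and TLS flags arrive via ETCDCTL_* environment variables. The deprecated-API check reads the apiserver's own apiserver_requested_deprecated_apis metric; kubent covers the manifest-lint half, and treating any deprecated-API traffic as a hard failure is stricter than you may want:

```python
#!/usr/bin/env python3
"""Pre-flight gate: fail the pipeline when the cluster is not safe to upgrade."""
import subprocess
import sys
import tempfile

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def deprecated_api_traffic() -> list[str]:
    # The apiserver increments this counter for every request that still hits a
    # deprecated API; anything listed here is migration work, not upgrade work.
    metrics = run(["kubectl", "get", "--raw", "/metrics"])
    return [line for line in metrics.splitlines()
            if line.startswith("apiserver_requested_deprecated_apis")]

def verify_etcd_snapshot(snapshot_path: str = "preflight-snapshot.db") -> None:
    # A snapshot you have never restored is not a backup.
    run(["etcdctl", "snapshot", "save", snapshot_path])
    with tempfile.TemporaryDirectory() as scratch:
        # On etcd >= 3.5 the restore subcommand also lives in etcdutl.
        run(["etcdctl", "snapshot", "restore", snapshot_path,
             "--data-dir", f"{scratch}/restore"])

if __name__ == "__main__":
    offenders = deprecated_api_traffic()
    if offenders:
        print("deprecated API traffic still observed:", *offenders, sep="\n  ")
        sys.exit(1)
    verify_etcd_snapshot()
    print("pre-flight passed")
```

Wire it into the same pipeline that performs the upgrade and make the upgrade job depend on it; the point is that nobody types these commands by hand.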

The upgrade itself

Two patterns work. Pick one and commit:

  • Surge upgrade in place. Drain nodes, replace with nodes on the new version, repeat. Cheap, simple, slow. Good for clusters with capacity headroom.
  • Blue-green clusters. Stand up a new cluster on the target version. Migrate workloads via traffic shift. Tear down the old cluster. Expensive, fast, safe. Good for clusters where in-place would be a single point of failure during the operation.

For most teams I work with, the answer is surge in place, with blue-green reserved for upgrades that cross major API boundaries.
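A skeleton of the surge pattern, assuming kubectl access; the node-replacement hook is deliberately a placeholder, because how the new node arrives depends entirely on your provisioner (ASG instance refresh, Cluster API, a managed node pool):

```python
#!/usr/bin/env python3
"""Surge-in-place skeleton: drain one node at a time, wait for its replacement."""
import json
import subprocess

def kubectl(*args: str) -> str:
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def nodes_behind(target_prefix: str) -> list[str]:
    items = json.loads(kubectl("get", "nodes", "-o", "json"))["items"]
    return [n["metadata"]["name"] for n in items
            if not n["status"]["nodeInfo"]["kubeletVersion"].startswith(target_prefix)]

def provision_replacement(node: str) -> None:
    # Placeholder, not a real API: an ASG instance refresh, a Cluster API
    # MachineDeployment rollout, or a managed node-pool surge goes here.
    raise NotImplementedError(f"replace {node} via your node provisioner")

def surge_replace(node: str) -> None:
    kubectl("cordon", node)
    kubectl("drain", node, "--ignore-daemonsets", "--delete-emptydir-data",
            "--timeout=10m")
    provision_replacement(node)

if __name__ == "__main__":
    for node in nodes_behind("v1.30"):   # illustrative target version
        surge_replace(node)
```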

What kills upgrades

Three classes of issue, in order of frequency:

  1. Custom controllers using removed APIs. Someone shipped an operator three years ago that uses v1beta1 of an API that no longer exists. The author left. Nobody knows what it does. Find these now, not on upgrade day.
  2. CNI / CSI version skew. Your CNI vendor lags upstream by one or two minors. You will hit the wall when your target version drops support for the CNI's pinned API. Track your data plane's compatibility matrix in the same calendar.
  3. In-tree volume plugins. If you still have any flexVolume or in-tree cloud provider stuff, you are on a clock. Migrate to CSI before the clock runs out, not when it does.
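The third class is the easiest to detect mechanically. A minimal sketch, assuming kubectl access, that flags any PersistentVolume still backed by an in-tree or flexVolume source:

```python
#!/usr/bin/env python3
"""Flag PersistentVolumes still backed by in-tree or flexVolume drivers."""
import json
import subprocess

# In-tree volume sources that have been (or are being) migrated to CSI.
LEGACY_SOURCES = {
    "awsElasticBlockStore", "gcePersistentDisk", "azureDisk", "azureFile",
    "cinder", "vsphereVolume", "flexVolume",
}

def list_pvs() -> list[dict]:
    out = subprocess.run(["kubectl", "get", "pv", "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)["items"]

if __name__ == "__main__":
    offenders = [
        (pv["metadata"]["name"], key)
        for pv in list_pvs()
        for key in LEGACY_SOURCES & pv["spec"].keys()
    ]
    for name, source in offenders:
        print(f"{name}: still uses in-tree source '{source}'; migrate to CSI first")
    if not offenders:
        print("no in-tree volume sources found")
```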

The cultural part

The hardest part of this is not technical. It is convincing leadership that "we upgrade every quarter and nothing breaks" is the same achievement as "we shipped a feature". It looks like nothing happened. That is the point.

The way I sell it to clients is to keep a simple metric: weeks behind the n-1 target. If the number is zero, the platform team is doing their job. If the number drifts above eight, escalate. It is the cleanest reliability KPI you can give a non-technical exec, and it correlates with everything else that matters.
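One way to make that number concrete, assuming the two minor versions are resolved the way the n-1 check earlier does it: count from the day the current target's .0 release was published on GitHub, which is the strictest reading of "behind":

```python
#!/usr/bin/env python3
"""Emit the KPI: weeks behind the n-1 target."""
import datetime as dt
import json
import urllib.request

RELEASES = "https://api.github.com/repos/kubernetes/kubernetes/releases/tags/v1.{minor}.0"

def release_date(minor: int) -> dt.datetime:
    with urllib.request.urlopen(RELEASES.format(minor=minor)) as resp:
        published = json.load(resp)["published_at"]     # e.g. "2024-04-17T17:37:23Z"
    return dt.datetime.fromisoformat(published.replace("Z", "+00:00"))

def weeks_behind(cluster_minor: int, target_minor: int) -> int:
    if cluster_minor >= target_minor:
        return 0
    age = dt.datetime.now(dt.timezone.utc) - release_date(target_minor)
    return age.days // 7

if __name__ == "__main__":
    # Illustrative values; in practice both minors come from the n-1 check earlier.
    print("weeks_behind_n_minus_1 =", weeks_behind(cluster_minor=29, target_minor=30))
```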

Upgrade as a habit. The day you treat it like a project is the day you have already lost.