Kubernetes is lovely at a thousand pods. It is reasonable at ten thousand. At fifty thousand pods across a thousand-odd nodes, it stops being a friendly abstraction and starts being a system you must understand in detail. My team at Amazon got there in 2022. Here is what we learned.
The control plane is the bottleneck
At small scale your pods are the thing you're worried about. At large scale, the control plane is. Specifically:
- etcd becomes the single most load-bearing component in the cluster. Every object in Kubernetes is a key in etcd. Every change is a watch event. By fifty thousand pods, etcd was the pacemaker for everything.
- kube-apiserver starts spending real CPU on serialisation. List-and-watch calls from controllers and operators multiply the load faster than the object count does.
- kube-scheduler latency climbs. A pod that took 100 ms to schedule in a small cluster can take seconds when there are a thousand candidate nodes (a measurement sketch follows below).
The control-plane problem is not theoretical. It manifests as a cluster that is, to outside observers, "slow" — pods taking ages to start, deployments that appear stuck, events arriving late. You will chase the wrong things for weeks before realising the fault is not in any application.
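If you want a number rather than a feeling, scheduling latency can be approximated from the pods themselves. Here is a minimal client-go sketch that measures the gap between each pod's creation and its PodScheduled condition; the kubeconfig loading and the 500-pod sample size are illustrative, and the scheduler's own metrics endpoint is the more precise source if you already scrape it.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work the same way.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(
		context.TODO(), metav1.ListOptions{Limit: 500})
	if err != nil {
		panic(err)
	}

	var total time.Duration
	var n int
	for _, p := range pods.Items {
		for _, c := range p.Status.Conditions {
			// PodScheduled flips to True when the scheduler binds the pod;
			// the gap from creation to that transition approximates
			// scheduling latency (second granularity, rough but useful).
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionTrue {
				total += c.LastTransitionTime.Sub(p.CreationTimestamp.Time)
				n++
			}
		}
	}
	if n > 0 {
		fmt.Printf("mean scheduling latency over %d pods: %s\n", n, total/time.Duration(n))
	}
}
```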
The specific remediations that helped
1. Shard by cluster, not by namespace
We tried to scale a single cluster to all our workloads. It was the wrong choice. At a certain size, the right answer is multiple clusters, each at a manageable scale, with a federation layer above them. We landed on clusters of ~20k pods and ~500 nodes as the comfortable ceiling. Smaller than we'd originally hoped; much easier to operate.
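The placement logic above such a fleet does not need to be clever. The sketch below is hypothetical (the types, names, and thresholds are mine for illustration, not our federation layer), but it captures the rule: pick the least-loaded cluster with headroom, and provision a new one when none has any.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterState is a hypothetical record kept by the federation layer;
// the field names and ceilings are illustrative, not a Kubernetes API.
type clusterState struct {
	Name  string
	Pods  int
	Nodes int
}

// The ceilings we settled on: ~20k pods and ~500 nodes per cluster.
const (
	maxPods  = 20000
	maxNodes = 500
)

// pickCluster returns the least-loaded cluster that still has headroom
// for the requested pods. Beyond the ceiling, provision a new cluster
// rather than stretch an existing one.
func pickCluster(clusters []clusterState, podsNeeded int) (string, error) {
	best := -1
	for i, c := range clusters {
		if c.Pods+podsNeeded > maxPods || c.Nodes > maxNodes {
			continue
		}
		if best == -1 || c.Pods < clusters[best].Pods {
			best = i
		}
	}
	if best == -1 {
		return "", errors.New("no cluster with headroom: provision a new one")
	}
	return clusters[best].Name, nil
}

func main() {
	fleet := []clusterState{
		{Name: "cluster-a", Pods: 19500, Nodes: 480},
		{Name: "cluster-b", Pods: 12000, Nodes: 300},
	}
	target, err := pickCluster(fleet, 800)
	if err != nil {
		panic(err)
	}
	fmt.Println("placing workload on", target) // cluster-b
}
```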
2. Tune etcd mercilessly
Dedicated hosts with local NVMe. No co-location with anything else. Frequent compaction. Aggressive monitoring of the backend quota (quota-backend-bytes). The default etcd configuration is fine at small scale and catastrophic at large. We eventually had runbooks specifically for "etcd is slow" that did not involve any other component.
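As a sketch of what that monitoring looks like, here is the shape of the quota check using the etcd v3 client. The endpoint and the 8 GiB quota are assumptions, and the TLS setup is elided; swap in your own.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoint := "https://etcd-0.internal:2379" // hypothetical endpoint; TLS config elided
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status reports the backend database size, which is what the
	// quota-backend-bytes limit is enforced against.
	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		panic(err)
	}

	const quota = 8 << 30 // 8 GiB: an assumed quota-backend-bytes setting
	used := float64(st.DbSize) / float64(quota)
	fmt.Printf("etcd db size: %d bytes (%.0f%% of quota)\n", st.DbSize, used*100)
	if used > 0.8 {
		fmt.Println("WARNING: approaching quota; compact and defragment")
	}
}
```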
3. Cut the object cardinality
Every ConfigMap and every Secret is a key in etcd, and every annotation fattens the value stored under one. We had workloads creating thousands of per-pod ConfigMaps. Normalising them to shared ConfigMaps reduced the etcd footprint by a third overnight. The lesson: treat etcd like a database, not like a notebook.
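Finding consolidation candidates is mechanical. Here is a sketch of the kind of audit that surfaces them, assuming a hypothetical workloads namespace: fingerprint each ConfigMap's contents and flag the groups that are byte-for-byte identical.

```go
package main

import (
	"context"
	"crypto/sha256"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// hashConfigMapData produces a stable fingerprint of a ConfigMap's data,
// so ConfigMaps with identical content hash to the same value.
// (Ignores BinaryData for brevity.)
func hashConfigMapData(data map[string]string) [32]byte {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(data[k]))
		h.Write([]byte{0})
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	cms, err := client.CoreV1().ConfigMaps("workloads").List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Group ConfigMaps by content fingerprint; any group larger than one
	// is a candidate for consolidation into a single shared ConfigMap.
	groups := map[[32]byte][]string{}
	for _, cm := range cms.Items {
		key := hashConfigMapData(cm.Data)
		groups[key] = append(groups[key], cm.Name)
	}
	for _, names := range groups {
		if len(names) > 1 {
			fmt.Printf("%d identical ConfigMaps: %v\n", len(names), names)
		}
	}
}
```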
4. Cap watch churn from operators
An operator that does list-and-watch across all pods in the cluster is a weapon against the API server. We audited every operator we had installed and required that each one either scope its watches to a label selector or explain why it couldn't. Several operators we'd adopted turned out to have cluster-wide watches for no good reason.
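The fix is usually a one-line change to how the informer is constructed. A minimal client-go sketch of a scoped watch, with a hypothetical label selector:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Scope the informer's list-and-watch with a label selector so the
	// API server only streams events for the pods this operator owns,
	// instead of every pod in the cluster.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Minute, // resync period
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "app.kubernetes.io/managed-by=example-operator" // hypothetical
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { fmt.Println("pod added") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever; a real operator would run a workqueue here
}
```

WithTweakListOptions applies the selector to both the initial list and the subsequent watch, so the filtering happens on the API server rather than in the operator.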
5. Use priority classes, not good intentions
When scheduling pressure hits, you want the scheduler to make deterministic choices about what to evict. Priority classes give it the information; annotations and "we trust the team to behave" do not. Every workload we ran had a declared priority, documented in the catalogue, visible on its dashboards.
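Declaring a priority class is a small amount of API machinery. A sketch with client-go, with the class name, value, and description as hypothetical placeholders; in practice you would likely apply this as a manifest rather than imperatively:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	preempt := corev1.PreemptLowerPriority
	pc := &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "batch-low"}, // hypothetical name
		Value:            1000,                                 // lower than serving tiers
		GlobalDefault:    false,
		PreemptionPolicy: &preempt,
		Description:      "Best-effort batch work; first to be evicted under pressure.",
	}
	if _, err := client.SchedulingV1().PriorityClasses().Create(
		context.TODO(), pc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// Workloads then set spec.priorityClassName: batch-low in their pod
	// template, which gives the scheduler a deterministic eviction order.
}
```

The point is less the mechanism than the discipline: if the priority isn't declared in the API, the scheduler can't act on it.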
The things that are quietly expensive
Three surprises, in rough order of how much they cost us:
- Kubelet image pulls. At scale, a 200 MB container image downloaded by a few thousand nodes simultaneously will saturate the image registry: at 3,000 nodes that is roughly 600 GB pulled in a single burst. Pre-warm nodes, use layered images aggressively, and run a local registry mirror per availability zone. In-cluster caching saved us more than any scheduler tweak.
- Probes. Readiness and liveness probes run per pod. At 50k pods with a 10-second probe interval, that's 5,000 probes per second per probe type flowing through kubelet and hitting your services, and double that once both kinds are configured. Misconfigured probes can easily add 10% to the cluster's load all by themselves.
- Finalisers and webhooks. Admission webhooks and deletion finalisers sit on the hot path. A slow webhook will slow down every object change in the cluster. We held every webhook to a 50 ms p99 budget.
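The p99 budget itself has to be enforced by measurement, but the API gives you a hard cap: TimeoutSeconds on the webhook configuration bounds the worst case. A sketch, with hypothetical names throughout; note that a failure policy of Ignore fails open, which is a real trade-off for anything security-sensitive:

```go
package main

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	timeout := int32(1) // seconds; the API minimum. The 50 ms p99 budget still needs measurement.
	failurePolicy := admissionregistrationv1.Ignore
	sideEffects := admissionregistrationv1.SideEffectClassNone

	cfgObj := &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "example-policy"}, // hypothetical
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "pods.example.internal", // hypothetical
			AdmissionReviewVersions: []string{"v1"},
			SideEffects:             &sideEffects,
			TimeoutSeconds:          &timeout,       // slow webhook fails fast instead of stalling writes
			FailurePolicy:           &failurePolicy, // Ignore = fail open; Fail = fail closed
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "platform", Name: "example-policy", // hypothetical
				},
			},
		}},
	}
	if _, err := client.AdmissionregistrationV1().ValidatingWebhookConfigurations().Create(
		context.TODO(), cfgObj, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```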
The awkward question: do you still want Kubernetes?
At our scale, yes, but only because we had a platform team of eight people dedicated to running it. If we'd had two people it would have been a disaster.
The Kubernetes ecosystem is enormous and the abstractions are good. But operating it at scale is a specialist skill, and there are workloads — long-running stateful services, simple REST APIs, cron jobs — where a well-run EC2 auto-scaling group or a serverless runtime is cheaper, more reliable, and easier to reason about. The question is never "Kubernetes or not". The question is always "is the complexity of running Kubernetes worth what Kubernetes gives you, for this workload, today?" For a lot of teams the answer is no.
If you're about to go from 10k to 50k pods
A short list of things to do before, not during:
- Commit to multi-cluster early. Don't try to scale a single cluster past about 20k pods.
- Measure your etcd object count and rate of change weekly. It's your early-warning system (a counting sketch follows this list).
- Build cluster-specific dashboards for the control plane components. Applications are not the story any more.
- Establish a webhook performance policy before you have a dozen of them.
- Write the "etcd is on fire" runbook before etcd is on fire.
— Nivaan