Kubernetes at Scale: Lessons Learned

Kubernetes is the dominant container orchestration platform, but running it at scale — thousands of nodes, hundreds of clusters, millions of pod starts per day — is a fundamentally different challenge than running a development cluster. The operational complexity does not scale linearly; it compounds, and the organizations that thrive at Kubernetes scale are the ones that build platform abstractions to hide that complexity from application teams.

01. Cluster Strategy: One vs. Many

The first architectural decision at scale is cluster topology. Should you run one large cluster or many smaller clusters? The answer for most large organizations is: many smaller clusters, organized by environment (production, staging, dev), region, and sometimes by business unit or compliance boundary. Large monolithic clusters create blast radius risks — a misconfigured admission controller or an apiserver overload event can impact all workloads simultaneously.

Fleet management platforms like Rancher and the Cluster API project help manage dozens or hundreds of clusters as a coherent fleet. GitOps principles applied at the fleet level — using tools like Flux or Argo CD, often in a hub-and-spoke topology where one control cluster pushes configuration to the rest — ensure cluster configuration is version controlled and auditable.
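As a concrete illustration of fleet-level GitOps, Argo CD's ApplicationSet controller can stamp out the same baseline configuration onto every cluster registered with the hub. This is a minimal sketch; the repository URL, path, and resource names are placeholders, not references to a real setup:

```yaml
# Deploys the "baseline" directory of a config repo to every
# cluster registered with this Argo CD instance.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-baseline
  namespace: argocd
spec:
  generators:
    - clusters: {}          # one Application per registered cluster
  template:
    metadata:
      name: '{{name}}-baseline'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/fleet-config  # placeholder
        targetRevision: main
        path: baseline
      destination:
        server: '{{server}}'  # filled in per cluster by the generator
        namespace: kube-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Because the generator enumerates clusters dynamically, onboarding a new cluster to the fleet is just a matter of registering it with Argo CD; the baseline follows automatically.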

02. Observability at Kubernetes Scale

Default Kubernetes observability does not scale past a few dozen nodes. Running Prometheus with default configuration against a 500-node cluster will produce a Prometheus instance that consumes 64GB of RAM and cannot keep up with scrape volume. The solution is a tiered observability architecture: per-cluster lightweight collectors (VictoriaMetrics agent, Prometheus with aggressive scrape interval tuning), federation to a central long-term storage layer (Thanos, Cortex, or managed services like Grafana Cloud), and a centralized Grafana instance with multi-cluster dashboards.
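The per-cluster tier of that architecture might look like the following Prometheus configuration: a relaxed scrape interval to keep local resource usage down, plus remote-write federation to a central long-term store. The endpoint URL and external labels are illustrative assumptions, not a prescribed setup:

```yaml
# Per-cluster Prometheus: lightweight local scraping,
# long-term storage delegated to a central backend.
global:
  scrape_interval: 60s        # relaxed from the 15s default to cut load
  external_labels:
    cluster: prod-us-east-1   # placeholder; lets the central tier
    tier: production          # distinguish series by origin cluster

remote_write:
  # Placeholder endpoint for a Thanos Receive / Cortex / Mimir-style
  # central store; swap in your actual remote-write URL.
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    queue_config:
      max_samples_per_send: 5000
```

The `external_labels` are what make multi-cluster dashboards possible downstream: every series arriving at the central store carries its cluster of origin.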

Distributed tracing at Kubernetes scale requires a sampling strategy. 100% trace sampling is cost-prohibitive past a few hundred RPS. Tail-based sampling (collecting traces for requests that errored or exceeded latency thresholds) provides dramatically better signal-to-noise ratio at a fraction of the storage cost.
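Tail-based sampling as described above can be implemented with the OpenTelemetry Collector's `tail_sampling` processor (from the collector-contrib distribution). This sketch keeps every errored trace, every trace slower than 500 ms, and 5% of the rest; the thresholds are illustrative, not recommendations:

```yaml
# OpenTelemetry Collector (contrib) tail-based sampling sketch.
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans this long before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

The `decision_wait` buffer is the cost of this approach: the collector must hold spans in memory until the whole trace can be judged, so memory sizing matters at high span volume.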

03. Cost Management Is an Engineering Discipline

Cloud Kubernetes costs spiral without deliberate engineering controls. Workloads that do not set resource requests and limits prevent the scheduler from making optimal placement decisions, causing over-provisioned nodes and underutilized capacity. At scale, this translates to millions of dollars in wasted spend annually.
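Setting requests and limits is a per-container concern. A minimal sketch of what every workload spec should carry (the specific values here are placeholders to be replaced by measured usage):

```yaml
# Fragment of a Deployment's container spec. Requests drive
# scheduling and bin-packing; limits cap runaway consumption.
spec:
  containers:
    - name: api           # placeholder container name
      image: example/api:1.0
      resources:
        requests:
          cpu: 250m       # what the scheduler reserves for placement
          memory: 256Mi
        limits:
          memory: 512Mi   # hard cap; exceeding it OOM-kills the pod
```

Omitting a CPU limit while setting a memory limit, as above, is a common pattern: CPU is compressible (throttled, not killed), so many teams prefer to let bursts borrow idle cycles.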

Implement a cost visibility layer (Kubecost, OpenCost) that allocates spend to teams and services. Enforce resource quotas at the namespace level. Implement cluster autoscaler with intelligent node group configuration. Use spot/preemptible instances for stateless, fault-tolerant workloads. At large scale, rightsize recommendations from tools like VPA (Vertical Pod Autoscaler) can generate 30-40% cost reduction with minimal engineering effort.
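The namespace-level quota enforcement mentioned above is a standard ResourceQuota object. Namespace and values below are illustrative placeholders for a per-team budget:

```yaml
# Caps the total resources a team's namespace can request,
# forcing cost conversations before capacity conversations.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a       # placeholder team namespace
spec:
  hard:
    requests.cpu: "20"    # total CPU all pods may request
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "200"           # also caps raw object counts
```

A useful side effect: once a quota exists, pods without resource requests are rejected in that namespace, which closes the "no requests set" gap described above.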

04. Platform Teams Are the Force Multiplier

The secret to successful Kubernetes at scale is not Kubernetes expertise — it is platform product thinking. The platform team's job is to make it so easy for application teams to deploy, observe, and operate their services on Kubernetes that they never need to think about the underlying infrastructure.

This means Golden Path templates, Internal Developer Portals (Backstage is the dominant open-source choice), self-service cluster onboarding, automated security baseline enforcement (Pod Security Standards, OPA Gatekeeper), and comprehensive runbooks for common operational scenarios. The best platform teams measure themselves by developer experience metrics: time to deploy, mean time to confidence in a deployment, and percentage of incidents that require platform team escalation.
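Of the baseline controls listed above, Pod Security Standards are the simplest to enforce: they are driven entirely by namespace labels understood by the built-in admission controller. A sketch, with a placeholder namespace name:

```yaml
# Namespace with the built-in Pod Security admission controller
# enforcing the "restricted" profile. Pods violating the profile
# are rejected at admission time; "warn" surfaces violations in
# kubectl output without blocking.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a            # placeholder
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```

A platform team would typically stamp these labels onto every namespace at creation time through the self-service onboarding flow, so application teams never opt in manually.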

Key Takeaway

"Running Kubernetes at scale is a platform engineering challenge, not just a DevOps challenge. It requires product thinking, a dedicated platform team with a clear charter, and a commitment to developer experience as a first-class metric. The clusters are the implementation detail. The platform is the product."

Topics

Kubernetes, DevOps, Platform Engineering, Multi-region, SRE