How Salesforce Migrated from Cluster Autoscaler to Karpenter Across Their Fleet of 1,000 EKS Clusters

These articles are AI-generated summaries. Please check the original sources for full details.

Salesforce operates one of the world’s most complex Kubernetes platforms, managing over 1,000 Amazon EKS clusters. Facing scalability and efficiency limits with its previous autoscaling setup, which was built on Cluster Autoscaler, Salesforce migrated to Karpenter, an open-source Kubernetes node autoscaler originally built by AWS. The migration reduced scaling latency from minutes to seconds and improved node utilization.

Why This Matters

Traditional Kubernetes cluster scaling often relies on manually configured node groups and autoscaling settings, which become unsustainable to maintain across a large fleet. Inefficient bin-packing and slow response to demand spikes lead to wasted resources and degraded performance. Salesforce’s previous system suffered from these inefficiencies: they created operational bottlenecks, slowed innovation, and risked significant cost overruns.

Key Insights

  • 1,000+ EKS clusters: Salesforce manages over 1,000 Amazon EKS clusters.
  • Karpenter transition tool: Salesforce developed an in-house tool for safe and consistent migration to Karpenter.
  • 5% cost savings: Salesforce achieved 5% cost savings in FY2026 through improved bin-packing and reduced idle capacity.

Working Example

metadata:
  name: m5.8xlarge-min-300-max-2500
data:
  k8s_instance_type: m6i.8xlarge
  k8s_root_volume_size: '100'
  k8s_root_volume_iops: '3000'
  k8s_root_volume_type: 'gp3'
  k8s_root_volume_throughput: '125'
  k8s_min_node_number: '300'
  k8s_max_node_number: '2500'
  multi_az_provisioned_workers: 'false'
  asg_launch_type: 'launch_template'
  gpu_enabled: 'false'
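
For comparison, the node group configuration above could be expressed in Karpenter roughly as a NodePool plus an EC2NodeClass. This is a minimal sketch, not Salesforce's actual mapping: the resource names, IAM role, AMI alias, and discovery tags are hypothetical placeholders, the root volume size is assumed to be in GiB, and the CPU limit assumes 32 vCPUs per m6i.8xlarge (2,500 nodes × 32 = 80,000).

```yaml
# Hypothetical Karpenter equivalent of the node group config above.
# All names, the role, and the discovery tags are illustrative placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: m6i-8xlarge-general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: m6i-8xlarge-general
      requirements:
        # Mirrors k8s_instance_type
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.8xlarge"]
  limits:
    # Approximates k8s_max_node_number: 2500 nodes x 32 vCPUs (assumed)
    cpu: "80000"
  disruption:
    # Lets Karpenter remove or replace underused nodes to improve bin-packing
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: m6i-8xlarge-general
spec:
  role: KarpenterNodeRole-example          # placeholder IAM role
  amiSelectorTerms:
    - alias: al2023@latest                 # assumed AMI family
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi                  # k8s_root_volume_size (GiB assumed)
        volumeType: gp3                    # k8s_root_volume_type
        iops: 3000                         # k8s_root_volume_iops
        throughput: 125                    # k8s_root_volume_throughput
```

Nothing in this sketch reproduces k8s_min_node_number: the NodePool limits cap total capacity, while the actual node count follows pending pod demand rather than a fixed floor.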

Practical Applications

  • Use Case: Salesforce enabled developers to self-define node pool requirements, accelerating infrastructure provisioning.
  • Pitfall: Overly restrictive Pod Disruption Budgets (PDBs) can block node replacements during migration; proper PDB configuration is essential.
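
To illustrate the PDB pitfall: a budget with maxUnavailable set to 0, or minAvailable equal to the replica count, permits no voluntary evictions, so a node hosting those pods cannot be drained. The sketch below shows a budget that still allows one pod at a time to be evicted; the name and label are hypothetical, and the right value depends on the workload.

```yaml
# Hypothetical PDB that leaves room for node replacement.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  # Allow one pod to be evicted at a time; 0 would block every node drain
  maxUnavailable: 1
  selector:
    matchLabels:
      app: example-app
```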
