Karpenter NodePools for Stateful Workload Scheduling
Context and Problem Statement
Traditional VMSS-based AKS node pools pin a single VM SKU (e.g., Standard_D4as_v5) across all specified availability zones. When any zone lacks capacity for that exact SKU, az aks create fails with OverconstrainedZonalAllocationRequest. This is especially problematic for stateful workloads (Elasticsearch, PostgreSQL, RabbitMQ) that need premium storage-capable VMs and cross-zone spread.
Decision Drivers
- Eliminate
OverconstrainedZonalAllocationRequestfailures on cluster creation and scaling - Maintain cross-zone topology spread for HA stateful workloads
- Support premium storage (Premium_LRS) for Elasticsearch and PostgreSQL PVCs
- Keep the same workload targeting mechanism (
agentpool: statefullabel,workload=stateful:NoScheduletaint)
Considered Options
- Karpenter/NAP with dynamic VM SKU selection
- Multiple VMSS node pools (one per zone)
- Single-zone deployment
Decision Outcome
Chosen option: "Karpenter/NAP with dynamic VM SKU selection", because it dynamically selects from multiple D-series VM SKUs (4-8 vCPU, premium storage-capable) per zone, eliminating capacity failures while maintaining the same scheduling labels and taints.
Consequences
- Good, because eliminates
OverconstrainedZonalAllocationRequest. Karpenter selects any available D-series SKU per zone - Good, because automatic scale-to-zero when no stateful pods are pending (cost savings)
- Good, because consolidation policy (
WhenEmpty, 5 min) removes idle nodes automatically - Good, because workloads use the same
agentpool: statefullabel andworkload=stateful:NoScheduletoleration. No migration needed - Bad, because Karpenter NodePool/AKSNodeClass CRDs are deployed in the platform layer, creating a dependency for all stateful workloads
- Bad, because VM SKU selection is less predictable. Exact SKU varies by zone capacity at scheduling time