Node auto-provisioning (NAP) in AKS is designed to complement or replace cluster autoscaler (CA) in scenarios where more flexible, workload‑driven node provisioning is needed.
Key benefits of NAP vs. Cluster Autoscaler
- More optimal VM selection
  - NAP uses pending pod resource requirements (CPU, memory, GPU, etc.) to pick the "best fit" VM SKU and quantity at runtime, instead of only scaling existing node pools.
  - Through `NodePool` requirements, NAP can:
    - Target specific SKU families (D, F, E, L, N) via `karpenter.azure.com/sku-family`.
    - Target exact SKUs via `karpenter.azure.com/sku-name`.
    - Control generations via `karpenter.azure.com/sku-version`.
    - Express constraints like CPU, memory, GPU, accelerated networking, premium storage, etc.
  - This can reduce over‑provisioning compared to CA, which typically scales a fixed set of node pool SKUs.
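As an illustrative sketch of these selectors (the pool name is hypothetical, and the `apiVersion` and exact schema may differ by NAP release, so check the AKS docs for your cluster), a `NodePool` constraining SKU selection might look like:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose   # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        # Restrict provisioning to D- and E-family SKUs...
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        # ...of generation 5 or newer
        - key: karpenter.azure.com/sku-version
          operator: Gt
          values: ["4"]
```

With constraints like these in place, NAP still picks the size and count of VMs from pending pod requests; the requirements only narrow its search space.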
- Faster, more granular scaling
- NAP provisions, scales, and manages VMs in response to pending pod pressure, similar to Karpenter behavior.
- In practice this often results in:
- Faster scale‑out when new workloads appear (because NAP can create new nodes tailored to the pods instead of only adding replicas of existing pools).
- Faster scale‑in at the node level when workloads drain.
- Your observation that NAP scaled faster than CA is consistent with its design.
- Flexible node pool policies per workload
  - NAP introduces the Karpenter `NodePool` and `AKSNodeClass` CRDs:
    - `NodePool` defines provisioning policies and constraints for workloads (SKU selectors, zones, capacity type, etc.).
    - `AKSNodeClass` defines Azure‑specific settings (image, OS disk size, max pods per node, kubelet config, subnet, etc.).
  - Multiple `NodePool` resources can point to the same `AKSNodeClass`, so common Azure configuration can be shared while still tailoring scheduling and SKU rules per workload.
  - NAP works best when `NodePool` resources are mutually exclusive; if multiple pools match, NAP uses the one with the highest weight.
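A minimal sketch of this split (names are hypothetical, and the `AKSNodeClass` `apiVersion` and `nodeClassRef` schema vary across NAP releases, so verify against your cluster's CRDs):

```yaml
apiVersion: karpenter.azure.com/v1beta1   # API version may differ on your cluster
kind: AKSNodeClass
metadata:
  name: shared            # common Azure settings, reusable across pools
spec:
  osDiskSizeGB: 128
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-priority
spec:
  weight: 100             # preferred when multiple NodePools match a pod
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: shared      # several NodePools may reference this one class
```

Keeping Azure-level settings in the `AKSNodeClass` and scheduling rules in the `NodePool` is what lets you tune policy per workload without duplicating infrastructure configuration.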
- Better cost control options
  - With `NodePool` requirements, NAP can:
    - Prefer Spot or On‑demand capacity via `karpenter.sh/capacity-type`.
    - Constrain maximum CPU/memory per node and per pool, which helps cap spend.
  - Because NAP chooses SKUs based on pod requirements, it can reduce the number of oversized nodes that CA might keep around.
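A hedged sketch of these cost controls as a `NodePool` fragment (limit values are illustrative assumptions, not recommendations):

```yaml
# Fragment of a NodePool spec; limit values are illustrative only
spec:
  limits:
    cpu: "64"             # cap total CPU this pool may provision
    memory: 256Gi         # cap total memory across the pool's nodes
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # permit Spot, with on-demand as fallback
```

When the pool reaches its `limits`, NAP stops provisioning new nodes for it even if pods remain pending, which gives a hard ceiling on spend per pool.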
- Networking and topology flexibility
  - `AKSNodeClass` lets NAP place nodes into specific subnets via `vnetSubnetID`, while other pools can use the cluster's default subnet.
  - `NodePool` requirements can target specific availability zones via `topology.kubernetes.io/zone`.
  - This allows:
    - Different subnets for different workloads (e.g., isolation or different routing/security rules).
    - Multi‑AZ strategies per workload.
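A sketch of both knobs (the subscription, resource group, VNet, and zone values are placeholders you must substitute, and the `AKSNodeClass` `apiVersion` may differ on your cluster):

```yaml
apiVersion: karpenter.azure.com/v1beta1   # verify the API version for your cluster
kind: AKSNodeClass
metadata:
  name: isolated
spec:
  # Place nodes from pools using this class into a dedicated subnet
  vnetSubnetID: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
---
# NodePool requirements fragment pinning nodes to specific zones
requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["eastus-1", "eastus-2"]   # example zones; use your region's zones
```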
Typical workloads where NAP stands out
Based on the documented behavior, NAP tends to be most beneficial for:
- Highly heterogeneous workloads
- Many different services with different CPU/memory/GPU/storage needs.
- Workloads that would otherwise require many static node pools with carefully chosen SKUs.
- NAP simplifies this by letting workloads drive node shape selection.
- Event‑driven or bursty workloads
- When combined with KEDA (for pod scaling), NAP can:
- Scale pods based on events (KEDA) and
- Provision nodes dynamically for those pods (NAP).
- This is useful for workloads with unpredictable spikes where pre‑scaling is either too expensive or too complex.
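To illustrate the KEDA side of this pairing, a sketch of a `ScaledObject` (workload name, queue, and thresholds are hypothetical; NAP then provisions nodes for whatever pods KEDA creates):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker            # hypothetical workload
spec:
  scaleTargetRef:
    name: queue-worker          # Deployment scaled by KEDA
  minReplicaCount: 0            # scale to zero when idle
  maxReplicaCount: 200          # burst ceiling on pod replicas
  triggers:
    - type: azure-servicebus    # scale on Service Bus queue depth
      metadata:
        queueName: jobs
        messageCount: "50"      # target messages per replica
      authenticationRef:
        name: servicebus-auth   # assumed TriggerAuthentication resource
```

KEDA handles the pod count; NAP independently reacts to the resulting pending-pod pressure by provisioning appropriately sized nodes, so neither layer needs pre-scaled capacity.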
- GPU or specialized hardware workloads
  - NAP can target GPU SKUs and capabilities via selectors like `karpenter.azure.com/sku-gpu-name`, `karpenter.azure.com/sku-gpu-count`, and `karpenter.azure.com/sku-gpu-manufacturer`.
  - This is more flexible than maintaining dedicated GPU node pools sized for peak demand.
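A sketch of a GPU-targeting requirements fragment (the exact accepted values for these labels depend on your region's SKUs; treat the values below as assumptions to verify):

```yaml
# NodePool requirements fragment for GPU workloads
requirements:
  - key: karpenter.azure.com/sku-gpu-manufacturer
    operator: In
    values: ["nvidia"]          # example manufacturer value
  - key: karpenter.azure.com/sku-gpu-count
    operator: In
    values: ["1"]               # provision single-GPU SKUs only
```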
- Multi‑tenant or mixed‑priority environments
  - Different `NodePool` resources can:
    - Use different capacity types (Spot vs. On‑demand).
    - Use different SKUs and zones.
    - Apply taints/tolerations to isolate workloads.
  - This allows cost‑optimized pools for lower‑priority workloads and more reliable pools for critical services.
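As a sketch of such an isolated low-priority pool (the pool name and taint key are hypothetical, and the `apiVersion` may differ by release):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot              # hypothetical low-priority pool
spec:
  template:
    spec:
      # Only pods tolerating this taint land on these nodes
      taints:
        - key: workload-tier    # hypothetical taint key
          value: batch
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]      # Spot-only for cost-optimized batch work
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
```

Critical services would use a separate on-demand pool without the taint, keeping the two tiers on disjoint nodes.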
About the higher error rate you observed
The provided documentation focuses on how to troubleshoot NAP itself rather than comparing error rates with CA. It highlights that:
- NAP is a managed add‑on based on Karpenter.
- Misconfiguration of infrastructure autoscaler components or of `NodePool`/`AKSNodeClass` resources can cause issues.
- There is a dedicated troubleshooting guide for NAP that should be followed when errors occur.
If NAP shows more errors than CA in a given environment, the documented next step is to:
- Use the NAP troubleshooting guidance to identify misconfigurations or unsupported scenarios.
- Verify networking configuration (CNI mode, subnet, unsupported features like Calico network policy or dynamic IP allocation).
- Ensure at least one valid `NodePool` is configured and that taints/tolerations and requirements align with pod specs.
How NAP fits with pre‑scaling strategies
For environments where pre‑scaling to hundreds of nodes on peak days is a strict requirement, CA plus manual/pre‑scheduled scaling may still be preferred for production until NAP behavior is fully validated.
However, for lower environments and smaller apps, NAP can:
- Reduce the need for manual pre‑scaling.
- Optimize VM selection and cost.
- Provide faster, more flexible scaling behavior.
In such cases, NAP is most valuable when workloads are:
- Variable or bursty.
- Diverse in resource requirements.
- Able to tolerate dynamic node shapes and placements defined via `NodePool` and `AKSNodeClass`.
References:
- Overview of node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
- Configure node pools for node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
- Configure AKSNodeClass resources for node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
- Overview of networking configurations for node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
- Scaling options for applications in Azure Kubernetes Service (AKS)
- Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)