AKS Node Auto Provisioning - NAP

Subin 0 Reputation points
2026-04-02T05:23:43.03+00:00

Hello. We have an app deployed on AKS that is pre-scaled ahead of our peak business days, when it can grow to hundreds of nodes. There are multiple node pools for different services. Since NAP is now GA, we decided to test it out. NAP is not an option in our production environment, since pre-scaling is a process we follow to stay on the safe side. The objective of our test was to find out whether NAP could be beneficial, both cost-wise and scaling-wise, for lower environments and other smaller apps. We ran a basic test of scaling time, and NAP scaled out and in faster than the Cluster Autoscaler. However, we saw a significantly higher number of errors with NAP than with CA. If anyone is using NAP for scaling, can you explain the benefits, which workloads you use it on, and whether any benefits stand out compared to CA?

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


1 answer

  1. Q&A Assist
    2026-04-02T05:24:07.35+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Node auto-provisioning (NAP) in AKS is designed to complement or replace cluster autoscaler (CA) in scenarios where more flexible, workload‑driven node provisioning is needed.

    Key benefits of NAP vs. Cluster Autoscaler

    1. More optimal VM selection
    • NAP uses pending pod resource requirements (CPU, memory, GPU, etc.) to pick the “best fit” VM SKU and quantity at runtime, instead of only scaling existing node pools.
    • Through NodePool requirements, NAP can:
      • Target specific SKU families (D, F, E, L, N) via karpenter.azure.com/sku-family.
      • Target exact SKUs via karpenter.azure.com/sku-name.
      • Control generations via karpenter.azure.com/sku-version.
      • Express constraints like CPU, memory, GPU, accelerated networking, premium storage, etc.
    • This can reduce over‑provisioning compared to CA, which typically scales a fixed set of node pool SKUs.
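    As an illustration, a minimal NodePool that constrains NAP's SKU choice by family might look like the sketch below. Field names follow the Karpenter v1 API that NAP is built on; the exact `apiVersion` and the name of the referenced AKSNodeClass (`default` here) depend on your NAP release and setup, so treat this as a sketch rather than a definitive manifest.

    ```yaml
    # Sketch: restrict NAP to amd64 nodes from D- and E-family SKUs.
    # apiVersion may differ depending on the NAP/Karpenter release in use.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: general-purpose
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default          # assumes an AKSNodeClass named "default" exists
          requirements:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
            - key: karpenter.azure.com/sku-family
              operator: In
              values: ["D", "E"]   # only D- and E-family VM SKUs
    ```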
    2. Faster, more granular scaling
    • NAP provisions, scales, and manages VMs in response to pending pod pressure, similar to Karpenter behavior.
    • In practice this often results in:
      • Faster scale‑out when new workloads appear (because NAP can create new nodes tailored to the pods instead of only adding replicas of existing pools).
      • Faster scale‑in at the node level when workloads drain.
    • Your observation that NAP scaled faster than CA is consistent with its design.
    3. Flexible node pool policies per workload
    • NAP introduces Karpenter NodePool and AKSNodeClass CRDs:
      • NodePool defines provisioning policies and constraints for workloads (SKU selectors, zones, capacity type, etc.).
      • AKSNodeClass defines Azure‑specific settings (image, OS disk size, max pods per node, kubelet config, subnet, etc.).
    • Multiple NodePools can point to the same AKSNodeClass, so common Azure configuration can be shared while still tailoring scheduling and SKU rules per workload.
    • NAP works best when NodePools are mutually exclusive; if multiple pools match, NAP uses the one with the highest weight.
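    To sketch how the two CRDs relate, here is an AKSNodeClass shared by a weighted NodePool. Names like `shared-class` and `critical-services` are illustrative, and API versions vary by NAP release.

    ```yaml
    # Azure-specific node settings, shareable across NodePools.
    apiVersion: karpenter.azure.com/v1beta1   # version may differ per NAP release
    kind: AKSNodeClass
    metadata:
      name: shared-class
    spec:
      imageFamily: Ubuntu2204
      osDiskSizeGB: 128
    ---
    # Provisioning policy; a higher weight wins when multiple NodePools match.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: critical-services
    spec:
      weight: 10                    # preferred over lower-weight pools
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: shared-class
          requirements:
            - key: kubernetes.io/os
              operator: In
              values: ["linux"]
    ```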
    4. Better cost control options
    • With NodePool requirements, NAP can:
      • Prefer Spot or On‑demand via karpenter.sh/capacity-type.
      • Constrain max CPU/memory per node and per pool, which helps cap spend.
    • Because NAP chooses SKUs based on pod requirements, it can reduce the number of oversized nodes that CA might keep around.
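    A sketch of both levers together: allowing Spot with on-demand fallback, plus pool-wide resource limits that cap spend. Limit values here are arbitrary examples.

    ```yaml
    # Sketch: Spot-preferred capacity with a hard cap on total pool resources.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: cost-optimized
    spec:
      limits:
        cpu: "100"        # pool-wide cap: at most 100 vCPUs of provisioned nodes
        memory: 400Gi
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default           # assumes a "default" AKSNodeClass
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
    ```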
    5. Networking and topology flexibility
    • AKSNodeClass lets NAP place nodes into a specific subnet via vnetSubnetID; NodePools without an explicit subnet use the cluster’s default subnet.
    • NodePool requirements can target specific availability zones via topology.kubernetes.io/zone.
    • This allows:
      • Different subnets for different workloads (e.g., isolation or different routing/security).
      • Multi‑AZ strategies per workload.
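    A sketch of both controls together. The subnet resource ID segments and zone names are placeholders to fill in for your environment, and API versions vary by NAP release.

    ```yaml
    # Sketch: pin nodes to a dedicated subnet and specific availability zones.
    apiVersion: karpenter.azure.com/v1beta1   # version may differ per NAP release
    kind: AKSNodeClass
    metadata:
      name: isolated-subnet
    spec:
      # Placeholder ARM resource ID; substitute your own values.
      vnetSubnetID: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
    ---
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: zonal
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: isolated-subnet
          requirements:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["<region>-1", "<region>-2"]   # placeholder zone names
    ```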

    Typical workloads where NAP stands out

    Based on the documented behavior, NAP tends to be most beneficial for:

    1. Highly heterogeneous workloads
    • Many different services with different CPU/memory/GPU/storage needs.
    • Workloads that would otherwise require many static node pools with carefully chosen SKUs.
    • NAP simplifies this by letting workloads drive node shape selection.
    2. Event‑driven or bursty workloads
    • When combined with KEDA (for pod scaling), NAP can:
      • Scale pods based on events (KEDA) and
      • Provision nodes dynamically for those pods (NAP).
    • This is useful for workloads with unpredictable spikes where pre‑scaling is either too expensive or too complex.
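    A rough sketch of the KEDA side of this pairing. The Deployment name, queue name, and TriggerAuthentication reference are all placeholders; KEDA scales the pods, and NAP then provisions nodes for whatever pods appear.

    ```yaml
    # Sketch: KEDA scales a worker Deployment on queue depth;
    # NAP provisions nodes for the resulting pending pods.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: orders-worker
    spec:
      scaleTargetRef:
        name: orders-worker          # placeholder Deployment name
      minReplicaCount: 0
      maxReplicaCount: 200
      triggers:
        - type: azure-servicebus     # example event source
          metadata:
            queueName: orders        # placeholder queue name
            messageCount: "5"        # target messages per replica
          authenticationRef:
            name: servicebus-auth    # placeholder TriggerAuthentication
    ```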
    3. GPU or specialized hardware workloads
    • NAP can target GPU SKUs and capabilities via selectors like karpenter.azure.com/sku-gpu-name, sku-gpu-count, and sku-gpu-manufacturer.
    • This is more flexible than maintaining dedicated GPU node pools sized for peak.
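    A sketch of a GPU-targeted NodePool using those selectors, tainted so only GPU workloads land on it. The GPU model and count values are illustrative.

    ```yaml
    # Sketch: provision single-A100 NVIDIA nodes only for pods
    # that tolerate the GPU taint. Values are illustrative.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: gpu
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default            # assumes a "default" AKSNodeClass
          taints:
            - key: nvidia.com/gpu
              value: "true"
              effect: NoSchedule
          requirements:
            - key: karpenter.azure.com/sku-gpu-manufacturer
              operator: In
              values: ["nvidia"]
            - key: karpenter.azure.com/sku-gpu-name
              operator: In
              values: ["A100"]
            - key: karpenter.azure.com/sku-gpu-count
              operator: In
              values: ["1"]
    ```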
    4. Multi‑tenant or mixed priority environments
    • Different NodePools can:
      • Use different capacity types (Spot vs On‑demand).
      • Use different SKUs and zones.
      • Apply taints/tolerations to isolate workloads.
    • This allows cost‑optimized pools for lower‑priority workloads and more reliable pools for critical services.
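    As a sketch of that isolation pattern: a Spot-backed pool tainted for low-priority batch work, which only pods carrying a matching toleration can use. The taint key and pool name are illustrative.

    ```yaml
    # Sketch: Spot-only pool isolated for batch workloads via a custom taint.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: batch-spot
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default            # assumes a "default" AKSNodeClass
          taints:
            - key: workload-tier     # illustrative taint key
              value: batch
              effect: NoSchedule
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
    ```

    Pods for the batch tier would then declare a matching toleration, while critical services schedule onto separately defined on-demand pools.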

    About the higher error rate you observed

    The provided documentation focuses on how to troubleshoot NAP itself rather than comparing error rates with CA. It highlights that:

    • NAP is a managed add‑on based on Karpenter.
    • Misconfiguration of infrastructure autoscaler components or NodePool/AKSNodeClass can cause issues.
    • There is a dedicated troubleshooting guide for NAP that should be followed when errors occur.

    If NAP shows more errors than CA in a given environment, the documented next step is to:

    • Use the NAP troubleshooting guidance to identify misconfigurations or unsupported scenarios.
    • Verify networking configuration (CNI mode, subnet, unsupported features like Calico network policy or dynamic IP allocation).
    • Ensure at least one valid NodePool is configured and that taints/tolerations and requirements align with pod specs.

    How NAP fits with pre‑scaling strategies

    For environments where pre‑scaling to hundreds of nodes on peak days is a strict requirement, CA plus manual/pre‑scheduled scaling may still be preferred for production until NAP behavior is fully validated.

    However, for lower environments and smaller apps, NAP can:

    • Reduce the need for manual pre‑scaling.
    • Optimize VM selection and cost.
    • Provide faster, more flexible scaling behavior.

    In such cases, NAP is most valuable when workloads are:

    • Variable or bursty.
    • Diverse in resource requirements.
    • Able to tolerate dynamic node shapes and placements defined via NodePool and AKSNodeClass.

