AKS Node Auto Provisioning - NAP

Subin 0 Reputation points
2026-04-02T05:23:43.03+00:00

Hello. We have an app deployed on AKS that is pre-scaled ahead of our peak business days, when it can grow to hundreds of nodes. There are multiple node pools for different services. Since NAP is now GA, we decided to test it out. NAP is not an option in our production environment, since pre-scaling is a process we follow to stay on the safe side. The objective of our test was to find out whether NAP could be beneficial, both cost-wise and scaling-wise, for lower environments and other smaller apps. We ran a basic test of scaling time, and NAP scaled out and in faster than the Cluster Autoscaler. However, we saw a significantly higher number of errors with NAP than with CA. If anyone is using NAP for scaling, can you explain the benefits, which workloads you use it on, and whether any benefits stand out compared to CA?

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


1 answer

  1. Q&A Assist
    2026-04-02T05:24:07.35+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Node auto-provisioning (NAP) in AKS is designed to complement or replace cluster autoscaler (CA) in scenarios where more flexible, workload‑driven node provisioning is needed.

    Key benefits of NAP vs. Cluster Autoscaler

    1. More optimal VM selection
    • NAP uses pending pod resource requirements (CPU, memory, GPU, etc.) to pick the “best fit” VM SKU and quantity at runtime, instead of only scaling existing node pools.
    • Through NodePool requirements, NAP can:
      • Target specific SKU families (D, F, E, L, N) via karpenter.azure.com/sku-family.
      • Target exact SKUs via karpenter.azure.com/sku-name.
      • Control generations via karpenter.azure.com/sku-version.
      • Express constraints like CPU, memory, GPU, accelerated networking, premium storage, etc.
    • This can reduce over‑provisioning compared to CA, which typically scales a fixed set of node pool SKUs.
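    As an illustration, a minimal NodePool that constrains NAP's SKU choice by family might look like the sketch below. Field names follow the Karpenter v1 API that NAP is built on; the exact `apiVersion` and the name of the referenced AKSNodeClass (`default` here) depend on your NAP release and setup, so treat this as a sketch rather than a definitive manifest.

    ```yaml
    # Sketch: restrict NAP to amd64 nodes from D- and E-family SKUs.
    # apiVersion may differ depending on the NAP/Karpenter release in use.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: general-purpose
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default          # assumes an AKSNodeClass named "default" exists
          requirements:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
            - key: karpenter.azure.com/sku-family
              operator: In
              values: ["D", "E"]   # only D- and E-family VM SKUs
    ```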
    2. Faster, more granular scaling
    • NAP provisions, scales, and manages VMs in response to pending pod pressure, similar to Karpenter behavior.
    • In practice this often results in:
      • Faster scale‑out when new workloads appear (because NAP can create new nodes tailored to the pods instead of only adding replicas of existing pools).
      • Faster scale‑in at the node level when workloads drain.
    • Your observation that NAP scaled faster than CA is consistent with its design.
    3. Flexible node pool policies per workload
    • NAP introduces Karpenter NodePool and AKSNodeClass CRDs:
      • NodePool defines provisioning policies and constraints for workloads (SKU selectors, zones, capacity type, etc.).
      • AKSNodeClass defines Azure‑specific settings (image, OS disk size, max pods per node, kubelet config, subnet, etc.).
    • Multiple NodePools can point to the same AKSNodeClass, so common Azure configuration can be shared while still tailoring scheduling and SKU rules per workload.
    • NAP works best when NodePools are mutually exclusive; if multiple pools match, NAP uses the one with the highest weight.
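    To sketch how the two CRDs relate, here is an AKSNodeClass shared by a weighted NodePool. Names like `shared-class` and `critical-services` are illustrative, and API versions vary by NAP release.

    ```yaml
    # Azure-specific node settings, shareable across NodePools.
    apiVersion: karpenter.azure.com/v1beta1   # version may differ per NAP release
    kind: AKSNodeClass
    metadata:
      name: shared-class
    spec:
      imageFamily: Ubuntu2204
      osDiskSizeGB: 128
    ---
    # Provisioning policy; a higher weight wins when multiple NodePools match.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: critical-services
    spec:
      weight: 10                    # preferred over lower-weight pools
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: shared-class
          requirements:
            - key: kubernetes.io/os
              operator: In
              values: ["linux"]
    ```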
    4. Better cost control options
    • With NodePool requirements, NAP can:
      • Prefer Spot or On‑demand via karpenter.sh/capacity-type.
      • Constrain max CPU/memory per node and per pool, which helps cap spend.
    • Because NAP chooses SKUs based on pod requirements, it can reduce the number of oversized nodes that CA might keep around.
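    A sketch of both levers together: allowing Spot with on-demand fallback, plus pool-wide resource limits that cap spend. Limit values here are arbitrary examples.

    ```yaml
    # Sketch: Spot-preferred capacity with a hard cap on total pool resources.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: cost-optimized
    spec:
      limits:
        cpu: "100"        # pool-wide cap: at most 100 vCPUs of provisioned nodes
        memory: 400Gi
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default           # assumes a "default" AKSNodeClass
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
    ```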
    5. Networking and topology flexibility
    • AKSNodeClass lets NAP place nodes into a specific subnet via vnetSubnetID; NodePools without an explicit subnet use the cluster’s default subnet.
    • NodePool requirements can target specific availability zones via topology.kubernetes.io/zone.
    • This allows:
      • Different subnets for different workloads (e.g., isolation or different routing/security).
      • Multi‑AZ strategies per workload.
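    A sketch of both controls together. The subnet resource ID segments and zone names are placeholders to fill in for your environment, and API versions vary by NAP release.

    ```yaml
    # Sketch: pin nodes to a dedicated subnet and specific availability zones.
    apiVersion: karpenter.azure.com/v1beta1   # version may differ per NAP release
    kind: AKSNodeClass
    metadata:
      name: isolated-subnet
    spec:
      # Placeholder ARM resource ID; substitute your own values.
      vnetSubnetID: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
    ---
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: zonal
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: isolated-subnet
          requirements:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["<region>-1", "<region>-2"]   # placeholder zone names
    ```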

    Typical workloads where NAP stands out

    Based on the documented behavior, NAP tends to be most beneficial for:

    1. Highly heterogeneous workloads
    • Many different services with different CPU/memory/GPU/storage needs.
    • Workloads that would otherwise require many static node pools with carefully chosen SKUs.
    • NAP simplifies this by letting workloads drive node shape selection.
    2. Event‑driven or bursty workloads
    • When combined with KEDA (for pod scaling), NAP can:
      • Scale pods based on events (KEDA) and
      • Provision nodes dynamically for those pods (NAP).
    • This is useful for workloads with unpredictable spikes where pre‑scaling is either too expensive or too complex.
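    A rough sketch of the KEDA side of this pairing. The Deployment name, queue name, and TriggerAuthentication reference are all placeholders; KEDA scales the pods, and NAP then provisions nodes for whatever pods appear.

    ```yaml
    # Sketch: KEDA scales a worker Deployment on queue depth;
    # NAP provisions nodes for the resulting pending pods.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: orders-worker
    spec:
      scaleTargetRef:
        name: orders-worker          # placeholder Deployment name
      minReplicaCount: 0
      maxReplicaCount: 200
      triggers:
        - type: azure-servicebus     # example event source
          metadata:
            queueName: orders        # placeholder queue name
            messageCount: "5"        # target messages per replica
          authenticationRef:
            name: servicebus-auth    # placeholder TriggerAuthentication
    ```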
    3. GPU or specialized hardware workloads
    • NAP can target GPU SKUs and capabilities via selectors like karpenter.azure.com/sku-gpu-name, sku-gpu-count, and sku-gpu-manufacturer.
    • This is more flexible than maintaining dedicated GPU node pools sized for peak.
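    A sketch of a GPU-targeted NodePool using those selectors, tainted so only GPU workloads land on it. The GPU model and count values are illustrative.

    ```yaml
    # Sketch: provision single-A100 NVIDIA nodes only for pods
    # that tolerate the GPU taint. Values are illustrative.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: gpu
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default            # assumes a "default" AKSNodeClass
          taints:
            - key: nvidia.com/gpu
              value: "true"
              effect: NoSchedule
          requirements:
            - key: karpenter.azure.com/sku-gpu-manufacturer
              operator: In
              values: ["nvidia"]
            - key: karpenter.azure.com/sku-gpu-name
              operator: In
              values: ["A100"]
            - key: karpenter.azure.com/sku-gpu-count
              operator: In
              values: ["1"]
    ```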
    4. Multi‑tenant or mixed priority environments
    • Different NodePools can:
      • Use different capacity types (Spot vs On‑demand).
      • Use different SKUs and zones.
      • Apply taints/tolerations to isolate workloads.
    • This allows cost‑optimized pools for lower‑priority workloads and more reliable pools for critical services.
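    As a sketch of that isolation pattern: a Spot-backed pool tainted for low-priority batch work, which only pods carrying a matching toleration can use. The taint key and pool name are illustrative.

    ```yaml
    # Sketch: Spot-only pool isolated for batch workloads via a custom taint.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: batch-spot
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.azure.com
            kind: AKSNodeClass
            name: default            # assumes a "default" AKSNodeClass
          taints:
            - key: workload-tier     # illustrative taint key
              value: batch
              effect: NoSchedule
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
    ```

    Pods for the batch tier would then declare a matching toleration, while critical services schedule onto separately defined on-demand pools.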

    About the higher error rate you observed

    The provided documentation focuses on how to troubleshoot NAP itself rather than comparing error rates with CA. It highlights that:

    • NAP is a managed add‑on based on Karpenter.
    • Misconfiguration of infrastructure autoscaler components or NodePool/AKSNodeClass can cause issues.
    • There is a dedicated troubleshooting guide for NAP that should be followed when errors occur.

    If NAP shows more errors than CA in a given environment, the documented next step is to:

    • Use the NAP troubleshooting guidance to identify misconfigurations or unsupported scenarios.
    • Verify networking configuration (CNI mode, subnet, unsupported features like Calico network policy or dynamic IP allocation).
    • Ensure at least one valid NodePool is configured and that taints/tolerations and requirements align with pod specs.

    How NAP fits with pre‑scaling strategies

    For environments where pre‑scaling to hundreds of nodes on peak days is a strict requirement, CA plus manual/pre‑scheduled scaling may still be preferred for production until NAP behavior is fully validated.

    However, for lower environments and smaller apps, NAP can:

    • Reduce the need for manual pre‑scaling.
    • Optimize VM selection and cost.
    • Provide faster, more flexible scaling behavior.

    In such cases, NAP is most valuable when workloads are:

    • Variable or bursty.
    • Diverse in resource requirements.
    • Able to tolerate dynamic node shapes and placements defined via NodePool and AKSNodeClass.

