
In-Place Migration Failed for an AKS Cluster

Aniket Pandey · 0 Reputation points
2026-03-27T04:12:54.5066667+00:00

I am performing a migration of an existing AKS cluster from the kubenet network plugin to Azure CNI overlay, as the cluster uses Calico for network policy. Based on Microsoft guidance, the migration involves the following steps:

  1. Remove Calico
  2. Update network plugin to Azure CNI (overlay mode)
  3. Reinstall Calico
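
For reference, these steps map to CLI calls along these lines (a sketch only; flag names are assumed from the current `az aks update` documentation, so verify them before running):

```shell
# Step 1: remove Calico network policy
az aks update -g <resource-group> -n <cluster-name> --network-policy none

# Step 2: switch the network plugin to Azure CNI in overlay mode
az aks update -g <resource-group> -n <cluster-name> \
  --network-plugin azure --network-plugin-mode overlay

# Step 3: reinstall Calico
az aks update -g <resource-group> -n <cluster-name> --network-policy calico
```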

My cluster has 4 nodes.

To streamline the process, I combined steps 1 and 2 into a single ARM template deployment and performed step 3 in a second deployment:

Deployment 1 (successful):

Network policy: none

Network plugin: azure

Network plugin mode: overlay

Deployment 2 (failed):

Network policy: calico

Network plugin: azure

Network plugin mode: overlay
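
In the managedCluster resource, these settings map to the `networkProfile` section roughly as follows (a sketch assuming the standard schema, not the exact template; deployment 1 used `"networkPolicy": "none"`):

```json
"networkProfile": {
  "networkPlugin": "azure",
  "networkPluginMode": "overlay",
  "networkPolicy": "calico"
}
```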

The second ARM deployment failed during node drain with the following error:

Drain node failed when evicting pod coredns-* due to "Too Many Requests". Cannot evict pod as it would violate the pod's disruption budget. PDB debug info: coredns-pdb (MinAvailable: 1, CurrentHealthy: 1, DesiredHealthy: 1, ExpectedPods: 2) One pod is in Pending state.

Current PDB configuration:

NAMESPACE       NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
calico-system   calico-typha         N/A             1                 1
kube-system     coredns-pdb          1               N/A               0
kube-system     konnectivity-agent   1               N/A               0
kube-system     metrics-server-pdb   1               N/A               0

However, the initial state of my cluster before starting the migration was:

NAMESPACE       NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
calico-system   calico-typha         N/A             1                 1                     14m
kube-system     coredns-pdb          1               N/A               1                     17m
kube-system     konnectivity-agent   1               N/A               1                     17m
kube-system     metrics-server-pdb   1               N/A               1                     17m

Questions:

These PDBs were not explicitly configured by us — are they managed by AKS by default? If so, what could cause the change in ALLOWED DISRUPTIONS from 1 to 0?

Is the failure caused by the Pending CoreDNS pod blocking eviction due to the PDB constraint?

What is the recommended approach to handle this scenario during migration (e.g., temporarily modifying/removing PDBs, scaling replicas, or other best practices)?

Are there any AKS-specific considerations when performing network plugin migration with Calico that could help avoid this issue?

Any guidance on root cause and recommended resolution would be greatly appreciated.


1 answer

  1. Q&A Assist
    2026-03-27T04:13:38.61+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    The failure is caused by PodDisruptionBudgets (PDBs) blocking eviction of critical system pods (CoreDNS, konnectivity-agent, metrics-server) during the node drain that is part of the AKS update operation. When one replica is Pending and minAvailable=1, allowedDisruptions becomes 0, so the drain cannot evict the remaining Running pod.

    From the error details and the current PDB state:

    • coredns-pdb (MinAvailable: 1, CurrentHealthy: 1, DesiredHealthy: 1, ExpectedPods: 2) and one pod in Pending means:
      • 2 pods are expected.
      • Only 1 pod is actually Running (CurrentHealthy=1).
      • Because minAvailable=1, the disruption controller calculates allowedDisruptions=0 until the second pod becomes Ready.
      • Any attempt to evict the only healthy pod would violate the PDB, so the drain fails with Too Many Requests.

    This also explains why the pre-migration snapshot showed ALLOWED DISRUPTIONS=1: at that point both replicas were Running and Ready, so with minAvailable=1 the controller allowed 1 disruption. Now that one pod is Pending, no disruption is allowed.
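
    The arithmetic the disruption controller applies for a minAvailable PDB can be sketched as follows (a simplification that ignores maxUnavailable-style PDBs such as calico-typha):

```shell
# Allowed disruptions for a minAvailable PDB: healthy pods minus the floor,
# clamped at zero. Pending pods do not count as healthy, so a single Pending
# replica is enough to drop the result from 1 to 0.
allowed_disruptions() {
  local current_healthy=$1 min_available=$2
  local allowed=$(( current_healthy - min_available ))
  (( allowed < 0 )) && allowed=0
  echo "$allowed"
}

allowed_disruptions 2 1   # both CoreDNS replicas Ready  -> prints 1
allowed_disruptions 1 1   # one replica Pending          -> prints 0
```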

    Regarding the specific questions:

    1. Ownership of these PDBs and change in ALLOWED DISRUPTIONS
      • The PDBs for coredns, konnectivity-agent, and metrics-server are part of the system components managed by AKS. They are not user-created in this scenario.
      • ALLOWED DISRUPTIONS is not a static configuration; it is computed dynamically by the Kubernetes disruption controller based on minAvailable, the number of Ready pods, and the expected replicas. When one of the replicas is Pending or NotReady, allowedDisruptions can drop from 1 to 0 even though the PDB spec has not changed.
    2. Is the failure caused by the Pending CoreDNS pod?
      Yes. With minAvailable=1 and only 1 healthy CoreDNS pod, the PDB does not allow eviction of that pod. The Pending pod prevents the controller from counting it as healthy, so the node drain cannot proceed.
    3. Recommended approach to handle this during migration
      The general pattern for PDB-related upgrade failures in AKS is:
      • Ensure all replicas of system pods are Running and Ready before starting the operation. In this case, confirm that both CoreDNS pods (and konnectivity-agent, metrics-server) are healthy after the first deployment and before enabling Calico.
      • If an upgrade or update is already blocked by PDBs, remove or relax the conflicting PDBs, then retrigger the operation. The documented pattern for retriggering reconciliation in AKS is:
        • Run az aks update -g <resource-group> -n <cluster-name> with no changes to issue an empty PutManagedCluster and re-run backend reconciliation logic.
        • Alternatively, use az resource update on the managed cluster resource, or perform a nodepool scale operation to retrigger reconciliation if nodes are left in a bad state.
        az aks update -g <resource-group> -n <cluster-name>

        az resource update --ids /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>

        az aks nodepool scale -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> --node-count <desired-count>
      • For a one-time migration, a practical sequence is:
        1. Temporarily delete or edit the PDBs for coredns, konnectivity-agent, and metrics-server to allow disruptions (for example, lower minAvailable or remove the PDBs).
        2. Run the network plugin migration (the ARM deployment that enables Calico).
        3. After the cluster is stable and all system pods are healthy, restore the original PDBs.
      • Another mitigation is to temporarily increase the replica count for CoreDNS and other critical components so that minAvailable can still be satisfied during drain, but this only works if the additional replicas can actually schedule and become Ready.
    4. AKS-specific considerations for kubenet → Azure CNI overlay with Calico
      From the AKS networking and network policy documentation:
      • Migration to Azure CNI Overlay is a one-way, cluster-wide operation; node pools are reimaged simultaneously during the update, similar to a Kubernetes version upgrade. This makes PDB behavior particularly important, because any PDB that cannot be satisfied can block node drain and thus the entire operation.
      • When moving from kubenet with Calico to Azure CNI Overlay with Calico, Calico changes role: in kubenet it acts as both CNI and network policy engine, while in Azure CNI it is only the network policy engine. This can introduce a short delay between pod start and when Calico allows outbound traffic. Ensuring that system pods (CoreDNS, metrics-server, konnectivity-agent) are fully Ready before the migration step that re-enables Calico helps avoid transient Pending/NotReady states that interact badly with strict PDBs.
      • The AKS guidance for failed upgrades due to conflicting PDBs is to remove or relax the PDBs, then retrigger the upgrade or reconciliation using az aks update, az resource update, or a nodepool scale operation.
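
    Putting items 3 and 4 together, a sketch of the one-time mitigation sequence (the PDB names come from the table in the question; the `k8s-app=kube-dns` label and the placeholder names are assumptions to verify against your cluster, and AKS may recreate its managed PDBs during reconciliation):

```shell
# 1. Back up the conflicting system PDBs so they can be restored later.
kubectl get pdb coredns-pdb konnectivity-agent metrics-server-pdb \
  -n kube-system -o yaml > system-pdbs-backup.yaml

# 2. Verify the system pods are Ready before the migration step; a Pending
#    CoreDNS replica is exactly what drives ALLOWED DISRUPTIONS to 0.
kubectl wait pod -n kube-system -l k8s-app=kube-dns \
  --for=condition=Ready --timeout=120s

# 3. If the drain is still blocked, temporarily delete the conflicting PDBs.
kubectl delete pdb coredns-pdb konnectivity-agent metrics-server-pdb \
  -n kube-system

# 4. Retrigger reconciliation with an empty update, then restore the PDBs
#    once the cluster is stable again.
az aks update -g <resource-group> -n <cluster-name>
kubectl apply -f system-pdbs-backup.yaml
```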

    In summary, the root cause is the interaction between strict system PDBs and a Pending CoreDNS pod during node drain. The recommended resolution is to ensure system pods are healthy before the migration, temporarily relax or remove the conflicting PDBs if necessary, then retrigger the AKS update so the migration can complete, and finally restore the PDBs once the cluster is stable.


