The failure is caused by PodDisruptionBudgets (PDBs) blocking eviction of critical system pods (CoreDNS, konnectivity-agent, metrics-server) during the node drain that is part of the AKS update operation. When one replica is Pending and minAvailable=1, allowedDisruptions becomes 0, so the drain cannot evict the remaining Running pod.
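This state can be confirmed on the cluster before retrying the operation. A sketch of the inspection commands, assuming the system PDBs live in `kube-system` and CoreDNS carries the standard `k8s-app=kube-dns` label (the PDB name `coredns-pdb` is taken from the error details above):

```shell
# List PDBs and their computed ALLOWED DISRUPTIONS
kubectl get pdb -n kube-system

# Check whether both CoreDNS replicas are Running and Ready
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

# Inspect the PDB status fields (currentHealthy, desiredHealthy, disruptionsAllowed)
kubectl describe pdb coredns-pdb -n kube-system
```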
From the error details and the current PDB state:
- `coredns-pdb` (MinAvailable: 1, CurrentHealthy: 1, DesiredHealthy: 1, ExpectedPods: 2) together with one pod in `Pending` means:
  - 2 pods are expected.
  - Only 1 pod is actually Running (`CurrentHealthy=1`).
  - Because `minAvailable=1`, the disruption controller calculates `allowedDisruptions=0` until the second pod becomes Ready.
  - Any attempt to evict the only healthy pod would violate the PDB, so the drain fails with `Too Many Requests`.
This explains why earlier runs showed `ALLOWED DISRUPTIONS=1`: at that time, both replicas were Running and Ready, so with `minAvailable=1` the controller allowed 1 disruption. In the current run, the extra pod is Pending, so no disruption is allowed.
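The controller's arithmetic can be sketched in shell. This is a simplified model for an integer `minAvailable` (the real controller also handles `maxUnavailable` and percentage values); the function name is just for illustration:

```shell
# Sketch: allowed disruptions for an integer-minAvailable PDB is
# currentHealthy - minAvailable, floored at zero.
allowed_disruptions() {
  local current_healthy=$1 min_available=$2
  local allowed=$(( current_healthy - min_available ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

allowed_disruptions 2 1   # both CoreDNS replicas Ready -> prints 1
allowed_disruptions 1 1   # one replica Pending         -> prints 0
```

This is exactly the flip observed here: the PDB spec never changed, only the number of healthy pods did.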
Regarding the specific questions:
- Ownership of these PDBs and change in `ALLOWED DISRUPTIONS`
  - The PDBs for `coredns`, `konnectivity-agent`, and `metrics-server` are part of the system components managed by AKS. They are not user-created in this scenario.
  - `ALLOWED DISRUPTIONS` is not a static configuration; it is computed dynamically by the Kubernetes disruption controller based on `minAvailable`, the number of Ready pods, and the expected replicas. When one of the replicas is Pending or NotReady, `allowedDisruptions` can drop from 1 to 0 even though the PDB spec has not changed.
- Is the failure caused by the Pending CoreDNS pod?
  Yes. With `minAvailable=1` and only 1 healthy CoreDNS pod, the PDB does not allow eviction of that pod. The Pending pod is not counted as healthy, so the node drain cannot proceed.
- Recommended approach to handle this during migration
  The general pattern for PDB-related upgrade failures in AKS is:
  - Ensure all replicas of system pods are Running and Ready before starting the operation. In this case, confirm that both CoreDNS pods (and konnectivity-agent, metrics-server) are healthy after the first deployment and before enabling Calico.
  - If an upgrade or update is already blocked by PDBs, remove or relax the conflicting PDBs, then retrigger the operation. The documented pattern for retriggering reconciliation in AKS is:
    - Run `az aks update -g <resource-group> -n <cluster-name>` with no changes to issue an empty `PutManagedCluster` and re-run backend reconciliation logic.
    - Alternatively, use `az resource update --ids /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>` on the managed cluster resource, or perform a nodepool scale operation (`az aks nodepool scale -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> --node-count <desired-count>`) to retrigger reconciliation if nodes are left in a bad state.
  - For a one-time migration, a practical sequence is:
    - Temporarily delete or edit the PDBs for `coredns`, `konnectivity-agent`, and `metrics-server` to allow disruptions (for example, lower `minAvailable` or remove the PDBs).
    - Run the network plugin migration (the ARM deployment that enables Calico).
    - After the cluster is stable and all system pods are healthy, restore the original PDBs.
  - Another mitigation is to temporarily increase the replica count for CoreDNS and other critical components so that `minAvailable` can still be satisfied during drain, but this only works if the additional replicas can actually schedule and become Ready.
- AKS-specific considerations for kubenet → Azure CNI Overlay with Calico
  From the AKS networking and network policy documentation:
  - Migration to Azure CNI Overlay is a one-way, cluster-wide operation; node pools are reimaged simultaneously during the update, similar to a Kubernetes version upgrade. This makes PDB behavior particularly important, because any PDB that cannot be satisfied can block node drain and thus the entire operation.
  - When moving from kubenet with Calico to Azure CNI Overlay with Calico, Calico changes role: in kubenet it acts as both the CNI and the network policy engine, while in Azure CNI Overlay it is only the network policy engine. This can introduce a short delay between pod start and the moment Calico allows outbound traffic. Ensuring that system pods (CoreDNS, metrics-server, konnectivity-agent) are fully Ready before the migration step that re-enables Calico helps avoid transient Pending/NotReady states that interact badly with strict PDBs.
  - The AKS guidance for failed upgrades due to conflicting PDBs is to remove or relax the PDBs, then retrigger the upgrade or reconciliation using `az aks update`, `az resource update`, or a nodepool scale operation.
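The temporary-PDB sequence described above can be sketched as a shell script. The resource group, cluster name, and all PDB names except `coredns-pdb` are placeholders/assumptions; confirm the real names with `kubectl get pdb -n kube-system` before running anything, and treat this as a sketch rather than a tested runbook:

```shell
#!/usr/bin/env bash
# Sketch: back up and remove the blocking system PDBs, retrigger the AKS
# operation, then restore the PDBs once the cluster is healthy again.
set -euo pipefail

RG="<resource-group>"
CLUSTER="<cluster-name>"
PDBS=(coredns-pdb konnectivity-agent metrics-server)   # assumed names -- verify first

for pdb in "${PDBS[@]}"; do
  # Back up each PDB, stripping server-generated fields so it can be re-created.
  kubectl get pdb "$pdb" -n kube-system -o yaml \
    | grep -vE '^ *(resourceVersion|uid|creationTimestamp):' > "backup-${pdb}.yaml"
  kubectl delete pdb "$pdb" -n kube-system
done

# Retrigger the blocked operation (an empty update re-runs reconciliation).
az aks update -g "$RG" -n "$CLUSTER"

# Wait for CoreDNS to be fully Ready again, then restore the original PDBs.
kubectl wait --for=condition=Ready pod -l k8s-app=kube-dns -n kube-system --timeout=10m
for pdb in "${PDBS[@]}"; do
  kubectl apply -f "backup-${pdb}.yaml"
done
```

Keeping the backups on disk (rather than editing in place) makes it easy to prove afterwards that the restored PDBs match what AKS originally deployed.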
In summary, the root cause is the interaction between strict system PDBs and a Pending CoreDNS pod during node drain. The recommended resolution is to ensure system pods are healthy before the migration, temporarily relax or remove the conflicting PDBs if necessary, then retrigger the AKS update so the migration can complete, and finally restore the PDBs once the cluster is stable.
References:
- Troubleshoot failed upgrades from conflicting PodDisruptionBudgets in AKS
- Update Azure CNI IPAM mode and data plane technology for Azure Kubernetes Service (AKS) clusters
- Secure traffic between pods with network policies in Azure Kubernetes Service (AKS)
- Azure Kubernetes Service (AKS) CNI networking overview