
In-Place Migration Failed for an AKS Cluster

Aniket Pandey · 0 Reputation points
2026-03-27T04:12:54.5066667+00:00

I am performing a migration of an existing AKS cluster from the kubenet network plugin to Azure CNI overlay, as the cluster uses Calico for network policy. Based on Microsoft guidance, the migration involves the following steps:

  1. Remove Calico
  2. Update network plugin to Azure CNI (overlay mode)
  3. Reinstall Calico
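
For reference, these steps map to CLI calls along these lines (a sketch only; flag names are assumed from the current `az aks update` documentation, so verify them before running):

```shell
# Step 1: remove Calico network policy
az aks update -g <resource-group> -n <cluster-name> --network-policy none

# Step 2: switch the network plugin to Azure CNI in overlay mode
az aks update -g <resource-group> -n <cluster-name> \
  --network-plugin azure --network-plugin-mode overlay

# Step 3: reinstall Calico
az aks update -g <resource-group> -n <cluster-name> --network-policy calico
```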

My cluster has 4 nodes.

To streamline the process, I combined steps 1 and 2 into a single ARM template deployment and performed step 3 in a second deployment:

Deployment 1 (successful):

Network policy: none

Network plugin: azure

Network plugin mode: overlay

Deployment 2 (failed):

Network policy: calico

Network plugin: azure

Network plugin mode: overlay
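
In the managedCluster resource, these settings map to the `networkProfile` section roughly as follows (a sketch assuming the standard schema, not the exact template; deployment 1 used `"networkPolicy": "none"`):

```json
"networkProfile": {
  "networkPlugin": "azure",
  "networkPluginMode": "overlay",
  "networkPolicy": "calico"
}
```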

The second ARM deployment failed during node drain with the following error:

Drain node failed when evicting pod coredns-* due to "Too Many Requests". Cannot evict pod as it would violate the pod's disruption budget. PDB debug info: coredns-pdb (MinAvailable: 1, CurrentHealthy: 1, DesiredHealthy: 1, ExpectedPods: 2) One pod is in Pending state.

Current PDB configuration:

NAMESPACE       NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
calico-system   calico-typha         N/A             1                 1
kube-system     coredns-pdb          1               N/A               0
kube-system     konnectivity-agent   1               N/A               0
kube-system     metrics-server-pdb   1               N/A               0

However, the initial state of my cluster before starting the migration was:

NAMESPACE       NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
calico-system   calico-typha         N/A             1                 1                     14m
kube-system     coredns-pdb          1               N/A               1                     17m
kube-system     konnectivity-agent   1               N/A               1                     17m
kube-system     metrics-server-pdb   1               N/A               1                     17m

Questions:

These PDBs were not explicitly configured by us — are they managed by AKS by default? If so, what could cause the change in ALLOWED DISRUPTIONS from 1 to 0?

Is the failure caused by the Pending CoreDNS pod blocking eviction due to the PDB constraint?

What is the recommended approach to handle this scenario during migration (e.g., temporarily modifying/removing PDBs, scaling replicas, or other best practices)?

Are there any AKS-specific considerations when performing network plugin migration with Calico that could help avoid this issue?

Any guidance on root cause and recommended resolution would be greatly appreciated.


1 answer

  1. Q&A Assist
    2026-03-27T04:13:38.61+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    The failure is caused by PodDisruptionBudgets (PDBs) blocking eviction of critical system pods (CoreDNS, konnectivity-agent, metrics-server) during the node drain that is part of the AKS update operation. When one replica is Pending and minAvailable=1, allowedDisruptions becomes 0, so the drain cannot evict the remaining Running pod.

    From the error details and the current PDB state:

    • coredns-pdb (MinAvailable: 1, CurrentHealthy: 1, DesiredHealthy: 1, ExpectedPods: 2) and one pod in Pending means:
      • 2 pods are expected.
      • Only 1 pod is actually Running (CurrentHealthy=1).
      • Because minAvailable=1, the disruption controller calculates allowedDisruptions=0 until the second pod becomes Ready.
      • Any attempt to evict the only healthy pod would violate the PDB, so the drain fails with Too Many Requests.

    This also explains why the pre-migration snapshot showed ALLOWED DISRUPTIONS=1: at that point both replicas were Running and Ready, so with minAvailable=1 the controller allowed 1 disruption. Now that one pod is Pending, no disruption is allowed.
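
    The arithmetic the disruption controller applies for a minAvailable PDB can be sketched as follows (a simplification that ignores maxUnavailable-style PDBs such as calico-typha):

```shell
# Allowed disruptions for a minAvailable PDB: healthy pods minus the floor,
# clamped at zero. Pending pods do not count as healthy, so a single Pending
# replica is enough to drop the result from 1 to 0.
allowed_disruptions() {
  local current_healthy=$1 min_available=$2
  local allowed=$(( current_healthy - min_available ))
  (( allowed < 0 )) && allowed=0
  echo "$allowed"
}

allowed_disruptions 2 1   # both CoreDNS replicas Ready  -> prints 1
allowed_disruptions 1 1   # one replica Pending          -> prints 0
```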

    Regarding the specific questions:

    1. Ownership of these PDBs and change in ALLOWED DISRUPTIONS
      • The PDBs for coredns, konnectivity-agent, and metrics-server are part of the system components managed by AKS. They are not user-created in this scenario.
      • ALLOWED DISRUPTIONS is not a static configuration; it is computed dynamically by the Kubernetes disruption controller based on minAvailable, the number of Ready pods, and the expected replicas. When one of the replicas is Pending or NotReady, allowedDisruptions can drop from 1 to 0 even though the PDB spec has not changed.
    2. Is the failure caused by the Pending CoreDNS pod?
      Yes. With minAvailable=1 and only 1 healthy CoreDNS pod, the PDB does not allow eviction of that pod. The Pending pod prevents the controller from counting it as healthy, so the node drain cannot proceed.
    3. Recommended approach to handle this during migration
      The general pattern for PDB-related upgrade failures in AKS is:
      • Ensure all replicas of system pods are Running and Ready before starting the operation. In this case, confirm that both CoreDNS pods (and konnectivity-agent, metrics-server) are healthy after the first deployment and before enabling Calico.
      • If an upgrade or update is already blocked by PDBs, remove or relax the conflicting PDBs, then retrigger the operation. The documented pattern for retriggering reconciliation in AKS is:
        • Run az aks update -g <resource-group> -n <cluster-name> with no changes to issue an empty PutManagedCluster and re-run backend reconciliation logic.
        • Alternatively, use az resource update on the managed cluster resource, or perform a nodepool scale operation to retrigger reconciliation if nodes are left in a bad state.
        az aks update -g <resource-group> -n <cluster-name>

        az resource update --ids /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>

        az aks nodepool scale -g <resource-group> --cluster-name <cluster-name> -n <nodepool-name> --node-count <desired-count>
      • For a one-time migration, a practical sequence is:
        1. Temporarily delete or edit the PDBs for coredns, konnectivity-agent, and metrics-server to allow disruptions (for example, lower minAvailable or remove the PDBs).
        2. Run the network plugin migration (the ARM deployment that enables Calico).
        3. After the cluster is stable and all system pods are healthy, restore the original PDBs.
      • Another mitigation is to temporarily increase the replica count for CoreDNS and other critical components so that minAvailable can still be satisfied during drain, but this only works if the additional replicas can actually schedule and become Ready.
    4. AKS-specific considerations for kubenet → Azure CNI overlay with Calico
      From the AKS networking and network policy documentation:
      • Migration to Azure CNI Overlay is a one-way, cluster-wide operation; node pools are reimaged simultaneously during the update, similar to a Kubernetes version upgrade. This makes PDB behavior particularly important, because any PDB that cannot be satisfied can block node drain and thus the entire operation.
      • When moving from kubenet with Calico to Azure CNI Overlay with Calico, Calico changes role: in kubenet it acts as both CNI and network policy engine, while in Azure CNI it is only the network policy engine. This can introduce a short delay between pod start and when Calico allows outbound traffic. Ensuring that system pods (CoreDNS, metrics-server, konnectivity-agent) are fully Ready before the migration step that re-enables Calico helps avoid transient Pending/NotReady states that interact badly with strict PDBs.
      • The AKS guidance for failed upgrades due to conflicting PDBs is to remove or relax the PDBs, then retrigger the upgrade or reconciliation using az aks update, az resource update, or a nodepool scale operation.
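
    Putting items 3 and 4 together, a sketch of the one-time mitigation sequence (the PDB names come from the table in the question; the `k8s-app=kube-dns` label and the placeholder names are assumptions to verify against your cluster, and AKS may recreate its managed PDBs during reconciliation):

```shell
# 1. Back up the conflicting system PDBs so they can be restored later.
kubectl get pdb coredns-pdb konnectivity-agent metrics-server-pdb \
  -n kube-system -o yaml > system-pdbs-backup.yaml

# 2. Verify the system pods are Ready before the migration step; a Pending
#    CoreDNS replica is exactly what drives ALLOWED DISRUPTIONS to 0.
kubectl wait pod -n kube-system -l k8s-app=kube-dns \
  --for=condition=Ready --timeout=120s

# 3. If the drain is still blocked, temporarily delete the conflicting PDBs.
kubectl delete pdb coredns-pdb konnectivity-agent metrics-server-pdb \
  -n kube-system

# 4. Retrigger reconciliation with an empty update, then restore the PDBs
#    once the cluster is stable again.
az aks update -g <resource-group> -n <cluster-name>
kubectl apply -f system-pdbs-backup.yaml
```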

    In summary, the root cause is the interaction between strict system PDBs and a Pending CoreDNS pod during node drain. The recommended resolution is to ensure system pods are healthy before the migration, temporarily relax or remove the conflicting PDBs if necessary, then retrigger the AKS update so the migration can complete, and finally restore the PDBs once the cluster is stable.


