AKS Node Image Upgrade Failures

Question

Andreea Stoenescu 20

Hello Support Team,

We are experiencing issues with node image upgrades across our Azure Kubernetes Service (AKS) clusters.

On 24th March, the scheduled node image upgrades failed across all Kubernetes clusters. The logs show the following message: “Upgrade Failed with status Unspecified, error: Unknown error”

[Screenshot of the failed upgrade message]

Could you please investigate the root cause of these upgrade failures?

Thank you,

Andreea Stoenescu

Ankit Yadav 13,135 Reputation points Microsoft External Staff Moderator

2026-03-25T16:58:37.44+00:00

Hello @Andreea Stoenescu,

Thank you for reaching out regarding the AKS node image upgrade failure. It appears to be an underlying infrastructure issue, and we are continuing to investigate the case. I have also contacted you via email and private message. Could you please review them and let us know your availability for a call to discuss further details?

Ankit Yadav 13,135 Reputation points Microsoft External Staff Moderator

2026-03-30T15:09:26.0233333+00:00

Hello Andreea Stoenescu,

Good day!

Please share your availability for a call via private message so that we can gather more details about the issue.

Answer 1

The activity log message “Upgrade Failed with status Unspecified, error: Unknown error” indicates that the node image upgrade PUT operation on the agent pool failed, but the platform didn’t surface a specific error code in that view. To identify the root cause and remediate, use the standard AKS failed-operation troubleshooting flow and then check for the common failure conditions that affect upgrades.

In the Azure portal, open the AKS cluster → Activity log and filter Status = Failed and the time range around 24 March. Open the failed Create or Update Agent Pool event and review the Status section and the JSON payload for any error code and message. This is the primary place where the underlying failure reason is exposed.
If an error code is present (for example QuotaExceeded, RequestDisallowedByPolicy, PublicIPCountLimitReached, SubnetIsFull, OverconstrainedAllocationRequest, OutboundConnFailVMExtensionError, or a drain/PDB-related error), use the corresponding remediation:
- RequestDisallowedByPolicy: a policy is blocking the operation. Review the policy assignment on the subscription/resource group and modify or exclude it so the agent pool update is allowed.
- QuotaExceeded: the region or SKU quota is exhausted. Check current usage and either delete unused resources or request a quota increase before retrying the node image upgrade.
- PublicIPCountLimitReached: the subscription/region has hit the public IP limit. Free unused public IPs or request a higher limit.
- OverconstrainedAllocationRequest or allocation failures: the requested VM size/zone has no capacity. Choose a different VM size or availability zone for the node pool, or retry once capacity is available.
- OutboundConnFailVMExtensionError: the VM extension cannot reach required endpoints. Verify that NSG and firewall rules allow the outbound access described at https://aka.ms/aks/outbound, https://aka.ms/aks/ssh, https://aka.ms/aks/agent, and https://aka.ms/aks/containerinsights. These are short links to guidance, not the endpoints to allow-list.
- Drain/PDB-related errors: pods cannot be drained from nodes. Use kubectl get pods --all-namespaces -o wide and kubectl describe pod to find stuck pods; temporarily relax or remove PodDisruptionBudgets, or force-delete problematic pods, then retry the upgrade.
- Subnet too small or SubnetIsFull: the subnet cannot allocate more IPs for upgraded nodes. Move the node pool to a larger subnet or recreate it with a larger address space.
If the cluster itself is marked as Failed after the upgrade attempt, follow the failed-cluster guidance: use the activity log and the Diagnose and solve problems blade on the AKS resource to detect and fix common issues, including network, DNS, or infrastructure problems.
After correcting the identified issue (quota, policy, networking, PDB, subnet, or capacity), re-run the node image upgrade for the affected node pools.
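
For the drain/PDB case in the list above, one way to see which budgets currently block eviction is to inspect the output of kubectl get pdb --all-namespaces -o json. The Python sketch below is illustrative only. It assumes that JSON has already been loaded into a dictionary, and the namespaces and PDB names in the sample are hypothetical. It reads the status.disruptionsAllowed field of each PodDisruptionBudget and treats a missing value as zero, which is an assumption made here for safety.

```python
def blocking_pdbs(pdb_list):
    """Return (namespace, name) for each PDB that currently allows zero disruptions.

    pdb_list is the parsed output of: kubectl get pdb --all-namespaces -o json
    """
    blocked = []
    for item in pdb_list.get("items", []):
        meta = item.get("metadata", {})
        # Assumption: a missing status is treated as "no disruptions allowed".
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            blocked.append((meta.get("namespace", "default"), meta.get("name", "")))
    return blocked


# Hypothetical sample data, shaped like the kubectl JSON output.
sample_pdbs = {
    "items": [
        {"metadata": {"namespace": "payments", "name": "api-pdb"},
         "status": {"disruptionsAllowed": 0}},
        {"metadata": {"namespace": "web", "name": "frontend-pdb"},
         "status": {"disruptionsAllowed": 1}},
    ]
}

blocked = blocking_pdbs(sample_pdbs)
for namespace, name in blocked:
    print(f"PDB {namespace}/{name} allows no disruptions and may block node drain")
```

Any PDB listed this way is a candidate for being temporarily relaxed before the node image upgrade is retried.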

Because the screenshot reports the failure only as “Unknown error”, the concrete root cause should be one of the underlying errors surfaced in the detailed activity log JSON or in the Diagnose and solve problems output. Use those tools to pinpoint which of the documented failure patterns applied to the environment on 24 March.
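
If the activity log events are exported as JSON (for example from the portal or with az monitor activity-log list), the error code can be pulled out with a short script. The following Python sketch is a rough illustration, not an official tool. It assumes each event carries eventTimestamp, status, operationName, and a properties.statusMessage string containing a JSON document with an error object; verify those field names against your own export. The sample event, its error code, and the year 2026 (inferred from the thread dates) are hypothetical.

```python
import json
from datetime import datetime, timezone


def _value(field):
    """Return field['value'] when the field is an object, otherwise the field itself."""
    return field.get("value") if isinstance(field, dict) else field


def failed_agent_pool_errors(events, start, end):
    """Yield (timestamp, code, message) for failed agent pool operations in [start, end)."""
    for event in events:
        if _value(event.get("status")) != "Failed":
            continue
        operation = _value(event.get("operationName")) or ""
        if "agentpools" not in operation.lower():
            continue
        # Keep only whole seconds; activity log timestamps are assumed to be UTC.
        when = datetime.strptime(
            event["eventTimestamp"][:19], "%Y-%m-%dT%H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        if not (start <= when < end):
            continue
        raw = (event.get("properties") or {}).get("statusMessage") or ""
        try:
            detail = json.loads(raw)
        except (TypeError, ValueError):
            detail = {}
        error = detail.get("error", {}) if isinstance(detail, dict) else {}
        yield when, error.get("code", "Unknown"), error.get("message", raw)


# Hypothetical events, shaped like an activity log export.
sample_events = [
    {
        "eventTimestamp": "2026-03-24T02:15:00.1234567Z",
        "status": {"value": "Failed"},
        "operationName": {
            "value": "Microsoft.ContainerService/managedClusters/agentPools/write"
        },
        "properties": {
            "statusMessage": json.dumps(
                {"status": "Failed",
                 "error": {"code": "SubnetIsFull", "message": "illustrative message"}}
            )
        },
    },
    {
        "eventTimestamp": "2026-03-24T03:00:00Z",
        "status": {"value": "Succeeded"},
        "operationName": {
            "value": "Microsoft.ContainerService/managedClusters/agentPools/write"
        },
        "properties": {},
    },
]

window_start = datetime(2026, 3, 24, tzinfo=timezone.utc)
window_end = datetime(2026, 3, 25, tzinfo=timezone.utc)
results = list(failed_agent_pool_errors(sample_events, window_start, window_end))
for when, code, message in results:
    print(f"{when.isoformat()} {code}: {message}")
```

An event whose code still comes back as Unknown has no structured error in its status message, in which case the Diagnose and solve problems output is the next place to look.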
