
AKS Node OS Image Update caused production outage

Sophie Yavuz 0 Reputation points
2026-04-01T09:20:12.1366667+00:00

AKS Node OS Image Update causes production outage – Eraser Controller issue & Maintenance Window too long

Cluster: AAE | Region: Switzerland North | K8s Version: 1.32.10

Incident

On 01.04.2026, the automatic Node OS Image Update caused a complete production outage lasting 2.5 hours. After the node upgrade, the AKS-managed eraser-controller-manager created imagejobs using the deprecated API eraser.sh/v1alpha1. This caused Gatekeeper to crash (OOMKill), leaving 17 Eraser pods stuck with UnexpectedAdmissionError. The resulting resource exhaustion on the nodes brought down our entire production environment.

Issue 1: Eraser Controller
The eraser-controller-manager uses an API that Azure itself flags as deprecated, causing instability in combination with Gatekeeper. After we manually scale it to 0 replicas, AKS automatically scales it back up. Is there a supported way to either disable the AKS Image Cleaner permanently or update it to use a stable API version?
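As a possible workaround (a sketch, not verified against this cluster; the flag may require a recent Azure CLI or the aks-preview extension), the Image Cleaner add-on can be disabled on an existing cluster:

```shell
# Disable the AKS Image Cleaner (Eraser) add-on.
# Resource group and cluster names are placeholders.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --disable-image-cleaner
```

With the add-on disabled, AKS should stop reconciling the eraser-controller-manager deployment, so scaling it to 0 would no longer be reverted.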

Issue 2: Maintenance Window
The minimum maintenance window of 4 hours is not feasible for us. Our deployment window follows a varying schedule and is usually limited to 1 or 2 hours at night. Is it possible to reduce the maintenance window so that any downtime is confined to those times?
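For reference, the node OS upgrade window is configured per cluster via a maintenance configuration; a sketch of pinning it to a weekly night-time slot (resource names and schedule are placeholders, and the documented minimum duration for `aksManagedNodeOSUpgradeSchedule` is the 4 hours mentioned above):

```shell
# Define the node OS upgrade maintenance window:
# weekly, starting 01:00, with the minimum allowed duration of 4 hours.
az aks maintenanceconfiguration add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name aksManagedNodeOSUpgradeSchedule \
  --schedule-type Weekly \
  --day-of-week Sunday \
  --interval-weeks 1 \
  --start-time 01:00 \
  --duration 4
```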

We have temporarily disabled auto-upgrades.
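For completeness, this is roughly how auto-upgrades can be turned off by setting both upgrade channels (a sketch with placeholder names; accepted channel values and their casing may vary by CLI version):

```shell
# Turn off automatic Kubernetes upgrades and automatic node OS image upgrades.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --auto-upgrade-channel none \
  --node-os-upgrade-channel None
```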

Thank you in advance for your feedback!

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


1 answer

  1. Sophie Yavuz 0 Reputation points
    2026-04-02T11:40:26.6333333+00:00

    Hello Himanshu Shekhar,

Thank you for your response regarding the maintenance window. The system runs in public transportation infrastructure, and the outage directly impacted passenger services. Given the 4-hour minimum requirement, how would you recommend we handle this situation?

Moreover, my primary concern, Issue 1, was not addressed: the AKS-managed eraser-controller-manager (v1.4.0) uses the API eraser.sh/v1alpha1, which Azure's own diagnostics flag as deprecated. This caused a cascading failure:

1. Eraser created imagejobs via the deprecated API
    2. Gatekeeper crashed trying to validate them (OOMKill)
    3. 17 Eraser pods stuck with UnexpectedAdmissionError
    4. Resource exhaustion → 2.5 hour production outage
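For anyone debugging a similar cascade, these inspection commands are roughly what we used to reconstruct the timeline (a sketch; Eraser resources live in kube-system on AKS, and the grep pattern and deployment name are assumptions):

    ```shell
    # List Eraser worker pods and their status (UnexpectedAdmissionError shows here).
    kubectl get pods -n kube-system -o wide | grep -i eraser

    # Show the ImageJob objects created via the deprecated eraser.sh/v1alpha1 API.
    kubectl get imagejobs.eraser.sh -o yaml

    # Events emitted during the failure, filtered by reason.
    kubectl get events -n kube-system --field-selector reason=UnexpectedAdmissionError

    # Controller logs from around the time the jobs were created.
    kubectl logs -n kube-system deploy/eraser-controller-manager
    ```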

How can we prevent this Eraser issue when we re-enable automatic Node OS image or Kubernetes upgrades?

    Greetings, Sophie Yavuz

