
AKS Node OS Image Update caused production outage

Sophie Yavuz 0 Reputation points
2026-04-01T09:20:12.1366667+00:00

AKS Node OS Image Update causes production outage – Eraser Controller issue & Maintenance Window too long

Cluster: AAE | Region: Switzerland North | K8s Version: 1.32.10

Incident

On 01.04.2026, the automatic Node OS Image Update caused a complete production outage lasting 2.5 hours. After the node upgrade, the AKS-managed eraser-controller-manager created imagejobs using the deprecated API eraser.sh/v1alpha1. This caused Gatekeeper to crash (OOMKill), leaving 17 Eraser pods stuck with UnexpectedAdmissionError. The resulting resource exhaustion on the nodes brought down our entire production environment.

Issue 1: Eraser Controller
The eraser-controller-manager uses an API that Azure itself flags as deprecated, causing instability in combination with Gatekeeper. After we manually scale it to 0 replicas, AKS automatically scales it back up. Is there a supported way to either disable the AKS Image Cleaner permanently or update it to use a stable API version?
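As a possible workaround (a sketch, not verified against this cluster; the flag may require a recent Azure CLI or the aks-preview extension), the Image Cleaner add-on can be disabled on an existing cluster:

```shell
# Disable the AKS Image Cleaner (Eraser) add-on.
# Resource group and cluster names are placeholders.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --disable-image-cleaner
```

With the add-on disabled, AKS should stop reconciling the eraser-controller-manager deployment, so scaling it to 0 would no longer be reverted.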

Issue 2: Maintenance Window
The minimum maintenance window of 4 hours is not feasible for us. Our deployment window follows a varying schedule and is usually limited to 1 or 2 hours at night. Is it possible to reduce the maintenance window so that any downtime is confined to those times?
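For reference, the node OS upgrade window is configured per cluster via a maintenance configuration; a sketch of pinning it to a weekly night-time slot (resource names and schedule are placeholders, and the documented minimum duration for `aksManagedNodeOSUpgradeSchedule` is the 4 hours mentioned above):

```shell
# Define the node OS upgrade maintenance window:
# weekly, starting 01:00, with the minimum allowed duration of 4 hours.
az aks maintenanceconfiguration add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name aksManagedNodeOSUpgradeSchedule \
  --schedule-type Weekly \
  --day-of-week Sunday \
  --interval-weeks 1 \
  --start-time 01:00 \
  --duration 4
```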

We have temporarily disabled auto-upgrades.
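For completeness, this is roughly how auto-upgrades can be turned off by setting both upgrade channels (a sketch with placeholder names; accepted channel values and their casing may vary by CLI version):

```shell
# Turn off automatic Kubernetes upgrades and automatic node OS image upgrades.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --auto-upgrade-channel none \
  --node-os-upgrade-channel None
```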

Thank you in advance for your feedback!

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


1 answer

  1. Sophie Yavuz 0 Reputation points
    2026-04-02T11:40:26.6333333+00:00

    Hello Himanshu Shekhar,

Thank you for your response regarding the maintenance window. The system runs in public transportation infrastructure, and the outage directly impacted passenger services. Given the 4-hour minimum requirement, how would you recommend we handle this situation?

Moreover, my primary concern, Issue 1, was not addressed: the AKS-managed eraser-controller-manager (v1.4.0) uses the API eraser.sh/v1alpha1, which Azure's own diagnostics flag as deprecated. This caused a cascading failure:

1. Eraser created imagejobs via the deprecated API
    2. Gatekeeper crashed trying to validate them (OOMKill)
    3. 17 Eraser pods stuck with UnexpectedAdmissionError
    4. Resource exhaustion → 2.5 hour production outage
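For anyone debugging a similar cascade, these inspection commands are roughly what we used to reconstruct the timeline (a sketch; Eraser resources live in kube-system on AKS, and the grep pattern and deployment name are assumptions):

    ```shell
    # List Eraser worker pods and their status (UnexpectedAdmissionError shows here).
    kubectl get pods -n kube-system -o wide | grep -i eraser

    # Show the ImageJob objects created via the deprecated eraser.sh/v1alpha1 API.
    kubectl get imagejobs.eraser.sh -o yaml

    # Events emitted during the failure, filtered by reason.
    kubectl get events -n kube-system --field-selector reason=UnexpectedAdmissionError

    # Controller logs from around the time the jobs were created.
    kubectl logs -n kube-system deploy/eraser-controller-manager
    ```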

How can we prevent this Eraser issue when we re-enable automatic Node OS image or Kubernetes upgrades?

    Greetings, Sophie Yavuz

