
Node restarted after being marked as NotReady

Dimitris Bratsos 0 Reputation points
2026-04-08T15:41:20.7333333+00:00

Our node was suddenly marked as NotReady, which then caused it to restart. It had been running with no issues over the past few weeks, and no changes have been made to its configuration.

Our node is running on version v1.34.2

We have disabled automatic OS upgrades and have a manual upgrade policy set, so (in theory) this should not be an automatic update of some sort.

From journalctl, viewed through a debug pod on that node, we can see the following before the restart:

"Node became not ready" node="[VMSS_NAME REMOVED]" condition={"type":"Ready","status":"False","lastHeartbeatTime":"2026-04-08T13:20:44Z","lastTransitionTime":"2026-04-08T13:20:44Z","reason":"KubeletNotReady","message":"container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}

We are trying to find the root cause and are not sure whether this is it. Another thing we noticed was a spike in our PVC usage, which caused the volumes to fill up, but we would not expect this to be connected: we would expect only the pods to be affected, not the node itself.

Is there something else we might have missed, or any reason why the node restarted/got marked as NotReady in the first place?

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.

2 answers

  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.



  2. Ankit Yadav 13,600 Reputation points Microsoft External Staff Moderator
    2026-04-08T16:39:57.0166667+00:00

Hello Dimitris,

Thanks for sharing the query in such a detailed manner.

    Root Cause:

    Based on the kubelet log (NetworkPluginNotReady: cni plugin not initialized), the likely causes are:

    • Network plugin initialization failure – The kubelet marked the node NotReady because the CNI/network plugin was not ready. When networking cannot be initialized, the kubelet stops reporting heartbeats and the node transitions to NotReady.
    • Node‑level resource pressure (especially disk pressure) – Even though PVCs are pod‑scoped, a spike in storage usage can indirectly impact the node’s OS disk or ephemeral storage (for example under /var/lib/containerd or /var/lib/cni). Disk pressure is a documented cause of kubelet/container runtime instability.
    • Platform repair action after prolonged NotReady – If a node remains unhealthy for some time, AKS may automatically initiate a repair action (such as a reboot or reimage). This is independent of OS auto‑upgrade settings and is not considered a configuration change.  
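
    As a quick sanity check on the first point, the failure reason can be read straight out of the condition JSON embedded in the kubelet log line quoted in the question. A minimal sketch (the log line below is an abbreviated copy of the one you posted):

```shell
# Abbreviated copy of the kubelet log line from the question.
line='"Node became not ready" condition={"type":"Ready","status":"False","reason":"KubeletNotReady","message":"container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}'

# Pull out the top-level (quoted) reason and message fields.
echo "$line" | grep -o '"reason":"[^"]*"' | head -n 1
echo "$line" | grep -o '"message":"[^"]*"' | head -n 1
```

    Here the top-level `reason` is `KubeletNotReady`, and the nested message points at the CNI plugin, which is what makes the network-plugin explanation the most likely one.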

    Recommendations (to avoid similar impact in future):

    • Monitor node‑level disk and ephemeral storage usage, not only PVC capacity.
    • Set ephemeral‑storage requests and limits on pods to prevent runaway writes from affecting node storage.
    • Ensure sufficient OS disk capacity and headroom, especially on GPU nodes where container images and logs are larger.
    • Keep the cluster on the latest supported patch version for the chosen AKS minor release to avoid known kubelet/CNI issues.
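
    For the ephemeral-storage point, a minimal pod spec sketch could look like the following (all names here are illustrative, not taken from your cluster):

```shell
# Write an illustrative pod manifest that caps per-container ephemeral
# storage, so a runaway writer is evicted before the node OS disk fills up.
cat > pod-ephemeral.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        ephemeral-storage: "1Gi"   # scheduler reserves this much node disk
      limits:
        ephemeral-storage: "2Gi"   # kubelet evicts the pod above this
EOF

# Apply with: kubectl apply -f pod-ephemeral.yaml
```

    With a limit set, the kubelet evicts only the offending pod instead of letting its writes push the whole node into disk pressure.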

    Troubleshooting suggestions (in case this happens again):

    • If a node transitions to NotReady again, check node conditions and events:
        kubectl describe node <node-name>
      
      Look for DiskPressure, MemoryPressure, or NetworkUnavailable around the timestamp.
    • Verify node disk usage (via debug pod)
        df -h
        du -sh /var/lib/containerd /var/lib/cni /var/log
      
    • Review kubelet and container runtime logs
        journalctl -u kubelet
        journalctl -u containerd
      
      Look for disk, OOM, or CNI initialization errors in them.
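
    When scanning those logs, a keyword filter narrows things down quickly. A sketch that assumes the logs were first exported from the debug pod (the sample lines below are illustrative stand-ins for real journalctl output):

```shell
# In the debug pod, export first, e.g.:
#   journalctl -u kubelet --since "2026-04-08 13:00" > kubelet.log
# Illustrative stand-in for exported kubelet logs:
printf '%s\n' \
  'Apr 08 13:20:44 kubelet: Node became not ready reason=KubeletNotReady' \
  'Apr 08 13:20:44 kubelet: cni plugin not initialized' \
  'Apr 08 13:19:10 kubelet: volume mounted successfully' > kubelet.log

# Case-insensitive count of lines hitting the usual failure keywords.
grep -icE 'cni|disk|evict|oom|not ready' kubelet.log   # prints 2
```

    The same pattern works against the containerd logs; drop the `-c` flag to see the matching lines themselves.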

    Hope this helps answer your query!
