Hello Dimitris,

Thanks for sharing the query in such detail.
Root Cause:
Based on the kubelet log (NetworkPluginNotReady: cni plugin not initialized), the likely causes are:
- Network plugin initialization failure – The kubelet marked the node NotReady because the CNI/network plugin was not ready. When networking cannot be initialized, the kubelet stops reporting heartbeats and the node transitions to NotReady.
- Node‑level resource pressure (especially disk pressure) – Even though PVCs are pod‑scoped, a spike in storage usage can indirectly impact the node’s OS disk or ephemeral storage (for example under /var/lib/containerd or /var/lib/cni). Disk pressure is a documented cause of kubelet/container runtime instability.
- Platform repair action after prolonged NotReady – If a node remains unhealthy for some time, AKS may automatically initiate a repair action (such as a reboot or reimage). This is independent of OS auto‑upgrade settings and is not considered a configuration change.
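As a quick way to confirm the CNI symptom described above, the kubelet log can be grepped for the NetworkPluginNotReady message. The sketch below wraps that in a small helper and runs it against a fabricated sample log line (the sample text and the helper name are illustrative, not real AKS output):

```shell
#!/bin/sh
# Hypothetical helper: count CNI-not-ready messages in a kubelet log stream.
# On a real node you would feed it live logs: journalctl -u kubelet | count_cni_errors
count_cni_errors() {
  # grep -c prints the match count; "|| true" keeps the pipeline from
  # failing when the count is zero (grep exits non-zero on no match)
  grep -c "NetworkPluginNotReady" || true
}

# Fabricated two-line log excerpt for illustration only
log_sample='kubelet: E0101 "NetworkPluginNotReady" message:"Network plugin returns error: cni plugin not initialized"
kubelet: I0101 node status updated'
printf '%s\n' "$log_sample" | count_cni_errors
```

A non-zero count around the NotReady timestamp supports the CNI-initialization root cause; a zero count points back at resource pressure or a platform repair action instead.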
Recommendations (to avoid similar impact in future):
- Monitor node‑level disk and ephemeral storage usage, not only PVC capacity.
- Set ephemeral‑storage requests and limits on pods to prevent runaway writes from affecting node storage.
- Ensure sufficient OS disk capacity and headroom, especially on GPU nodes where container images and logs are larger.
- Keep the cluster on the latest supported patch version for the chosen AKS minor release to avoid known kubelet/CNI issues.
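The ephemeral‑storage requests and limits recommended above can be sketched in a pod spec as follows. This is a minimal illustration; the pod name, image, and sizes are placeholders to adapt to your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-storage-limits   # illustrative name
spec:
  containers:
  - name: app
    image: example.azurecr.io/app:latest   # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"   # scheduler reserves this much node storage
      limits:
        ephemeral-storage: "4Gi"   # kubelet evicts the pod if it exceeds this
```

With a limit in place, a runaway writer is evicted by the kubelet instead of filling the node's OS disk and destabilizing every pod on the node.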
Troubleshooting suggestions (in case this happens again):
- If a node transitions to NotReady in the future, check node conditions and events. Look for DiskPressure, MemoryPressure, or NetworkUnavailable around the timestamp:
  kubectl describe node <node-name>
- Verify node disk usage (via a debug pod):
  df -h
  du -sh /var/lib/containerd /var/lib/cni /var/log
- Review the kubelet and container runtime logs, looking for disk, OOM, or CNI initialization errors:
  journalctl -u kubelet
  journalctl -u containerd
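The disk-usage check above can be automated with a small threshold filter. The sketch below (helper name and sample output are illustrative) parses df -P-style output and flags any filesystem at or above a given usage percentage; on a real node you would pipe live output into it instead of the sample:

```shell
#!/bin/sh
# Hypothetical helper: flag filesystems at or above a usage threshold.
# Real-node usage: df -P | check_disk_usage 85
check_disk_usage() {
  threshold="$1"
  # Skip the header row; field 5 is Capacity (e.g. "90%"), field 6 the mount point
  awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6 " at " use "%" }'
}

# Fabricated df -P output for illustration only
df_sample="Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 102400 92160 10240 90% /
tmpfs 51200 1024 50176 2% /dev/shm"
printf '%s\n' "$df_sample" | check_disk_usage 85
```

Any mount it prints (here the OS disk at 90%) is a candidate for the disk-pressure scenario described in the root cause.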
Hope this helps answer your query!