Hello Dimitris,

Thanks for sharing the query in such detail.
Root Cause:
Based on the kubelet log (NetworkPluginNotReady: cni plugin not initialized), the likely causes are:
- Network plugin initialization failure – The kubelet marked the node NotReady because the CNI/network plugin was not ready. When networking cannot be initialized, the kubelet stops reporting heartbeats and the node transitions to NotReady.
- Node‑level resource pressure (especially disk pressure) – Even though PVCs are pod‑scoped, a spike in storage usage can indirectly impact the node’s OS disk or ephemeral storage (for example under /var/lib/containerd or /var/lib/cni). Disk pressure is a documented cause of kubelet/container runtime instability.
- Platform repair action after prolonged NotReady – If a node remains unhealthy for some time, AKS may automatically initiate a repair action (such as a reboot or reimage). This is independent of OS auto‑upgrade settings and is not considered a configuration change.
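As a quick way to confirm the CNI symptom described above, the kubelet log can be grepped for the NetworkPluginNotReady message. The sketch below wraps that in a small helper and runs it against a fabricated sample log line (the sample text and the helper name are illustrative, not real AKS output):

```shell
#!/bin/sh
# Hypothetical helper: count CNI-not-ready messages in a kubelet log stream.
# On a real node you would feed it live logs: journalctl -u kubelet | count_cni_errors
count_cni_errors() {
  # grep -c prints the match count; "|| true" keeps the pipeline from
  # failing when the count is zero (grep exits non-zero on no match)
  grep -c "NetworkPluginNotReady" || true
}

# Fabricated two-line log excerpt for illustration only
log_sample='kubelet: E0101 "NetworkPluginNotReady" message:"Network plugin returns error: cni plugin not initialized"
kubelet: I0101 node status updated'
printf '%s\n' "$log_sample" | count_cni_errors
```

A non-zero count around the NotReady timestamp supports the CNI-initialization root cause; a zero count points back at resource pressure or a platform repair action instead.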
Recommendations (to avoid similar impact in future):
- Monitor node‑level disk and ephemeral storage usage, not only PVC capacity.
- Set ephemeral‑storage requests and limits on pods to prevent runaway writes from affecting node storage.
- Ensure sufficient OS disk capacity and headroom, especially on GPU nodes where container images and logs are larger.
- Keep the cluster on the latest supported patch version for the chosen AKS minor release to avoid known kubelet/CNI issues.
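The ephemeral‑storage requests and limits recommended above can be sketched in a pod spec as follows. This is a minimal illustration; the pod name, image, and sizes are placeholders to adapt to your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-storage-limits   # illustrative name
spec:
  containers:
  - name: app
    image: example.azurecr.io/app:latest   # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"   # scheduler reserves this much node storage
      limits:
        ephemeral-storage: "4Gi"   # kubelet evicts the pod if it exceeds this
```

With a limit in place, a runaway writer is evicted by the kubelet instead of filling the node's OS disk and destabilizing every pod on the node.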
Troubleshooting suggestions (in case this happens again):
- If a node transitions to NotReady in the future, check node conditions and events. Look for DiskPressure, MemoryPressure, or NetworkUnavailable around the timestamp:
  kubectl describe node <node-name>
- Verify node disk usage (via a debug pod):
  df -h
  du -sh /var/lib/containerd /var/lib/cni /var/log
- Review the kubelet and container runtime logs, looking for disk, OOM, or CNI initialization errors:
  journalctl -u kubelet
  journalctl -u containerd
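The disk-usage check above can be automated with a small threshold filter. The sketch below (helper name and sample output are illustrative) parses df -P-style output and flags any filesystem at or above a given usage percentage; on a real node you would pipe live output into it instead of the sample:

```shell
#!/bin/sh
# Hypothetical helper: flag filesystems at or above a usage threshold.
# Real-node usage: df -P | check_disk_usage 85
check_disk_usage() {
  threshold="$1"
  # Skip the header row; field 5 is Capacity (e.g. "90%"), field 6 the mount point
  awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6 " at " use "%" }'
}

# Fabricated df -P output for illustration only
df_sample="Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 102400 92160 10240 90% /
tmpfs 51200 1024 50176 2% /dev/shm"
printf '%s\n' "$df_sample" | check_disk_usage 85
```

Any mount it prints (here the OS disk at 90%) is a candidate for the disk-pressure scenario described in the root cause.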
Hope this helps answer your query!