Yep. To enable GPU monitoring for Azure VMs (NC, ND, NV series), you need a guest-based collection strategy, because GPU metrics are not available as standard host-level platform metrics.
- Primary documentation/approach
The current official guidance for monitoring NVIDIA GPUs on Azure involves using the Azure Monitor Agent (AMA) and NVIDIA DCGM Exporter.
- Linux VMs: The standard recommended path is to use the NVIDIA DCGM Exporter to expose metrics, which can then be scraped by the Azure Monitor Agent. Alternatively, Microsoft provides a comprehensive guide on using Telegraf with the Azure Monitor output plugin.
- Windows VMs: Configure Data Collection Rules (DCRs) to ingest the specific GPU performance counters you need. This works only if the drivers expose those counters to Windows Performance Monitor.
- Log Analytics tables
Depending on your collection method, metrics will populate different tables:
- Perf table: Standard performance counters (like CPU and Memory) and custom counters collected via DCRs appear here.
- InsightsMetrics table: While used by VM Insights for standard metrics, custom GPU metrics often require a separate namespace (e.g., Telegraf/nvidia-smi).
- Azure Monitor Metrics: Metrics sent via the Telegraf plugin can be viewed in the Metrics Explorer under the telegraf/nvidia-smi namespace.
- Implementation steps
- Driver Installation: Ensure the latest NVIDIA GPU Drivers are installed. Using the NVIDIA GPU-Optimized VMI is often the easiest starting point.
- Enable VM Insights: This installs the Azure Monitor Agent and creates a default Data Collection Rule.
- Deploy DCGM Exporter (Linux): Run the exporter as a service to translate GPU telemetry into a format the agent can read.
- Configure Custom DCR: For specific counters not in the default set, create a new Data Collection Rule to capture additional performance metrics.
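To make the custom DCR step more concrete, here is a minimal Python sketch that builds the JSON body of a Data Collection Rule routing performance counters to the Perf table. Treat it as an illustration under assumptions, not the official template: the function name `build_gpu_dcr`, the names `gpuPerfCounters` and `gpuWorkspace`, the workspace ID, and the region are all hypothetical. The two Windows counter paths are also an assumption; check in Performance Monitor that your driver actually exposes them.

```python
import json

# Assumed Windows counter paths; they only exist if the GPU driver
# publishes them to Performance Monitor on your VM.
EXAMPLE_GPU_COUNTERS = [
    r"\GPU Engine(*)\Utilization Percentage",
    r"\GPU Adapter Memory(*)\Dedicated Usage",
]


def build_gpu_dcr(workspace_resource_id, counters, location, sampling_seconds=60):
    """Return a DCR body (as a dict) that sends the given counters to the Perf table."""
    destination_name = "gpuWorkspace"  # hypothetical label, referenced by dataFlows
    return {
        "location": location,
        "properties": {
            "dataSources": {
                "performanceCounters": [
                    {
                        "name": "gpuPerfCounters",  # hypothetical data source name
                        "streams": ["Microsoft-Perf"],  # lands in the Perf table
                        "samplingFrequencyInSeconds": sampling_seconds,
                        "counterSpecifiers": list(counters),
                    }
                ]
            },
            "destinations": {
                "logAnalytics": [
                    {
                        "name": destination_name,
                        "workspaceResourceId": workspace_resource_id,
                    }
                ]
            },
            "dataFlows": [
                {
                    "streams": ["Microsoft-Perf"],
                    "destinations": [destination_name],
                }
            ],
        },
    }


# Hypothetical workspace ID and region, for illustration only.
EXAMPLE_DCR = build_gpu_dcr(
    workspace_resource_id=(
        "/subscriptions/00000000-0000-0000-0000-000000000000"
        "/resourceGroups/rg-hypothetical/providers"
        "/Microsoft.OperationalInsights/workspaces/ws-hypothetical"
    ),
    counters=EXAMPLE_GPU_COUNTERS,
    location="eastus",
)
EXAMPLE_DCR_JSON = json.dumps(EXAMPLE_DCR, indent=2)
```

You can save `EXAMPLE_DCR_JSON` to a file and use it as the rule definition when creating the DCR. Remember to associate the rule with the VM afterwards, otherwise the agent will not pick it up.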
- Key metric names
Using the NVIDIA DCGM/Telegraf method, you can track:
- gpu_utilization: Percentage of time the kernels were active.
- gpu_memory_used: Current framebuffer memory in use.
- gpu_temperature: Current core temperature.
- gpu_power_usage: Real-time power draw in Watts.
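If you go the DCGM Exporter route, the metrics arrive in Prometheus text format under DCGM field names rather than the friendly names above. The sketch below parses a sample of that output and maps it onto those four names. The mapping table, the function name `parse_dcgm_metrics`, and the sample values are my own assumptions for illustration; verify the field names against what your exporter actually emits.

```python
import re

# Assumed mapping from DCGM exporter field names to the friendly names above.
DCGM_TO_FRIENDLY = {
    "DCGM_FI_DEV_GPU_UTIL": "gpu_utilization",     # percent
    "DCGM_FI_DEV_FB_USED": "gpu_memory_used",      # MiB of framebuffer
    "DCGM_FI_DEV_GPU_TEMP": "gpu_temperature",     # degrees Celsius
    "DCGM_FI_DEV_POWER_USAGE": "gpu_power_usage",  # Watts
}

# metric_name{label="value",...} numeric_value
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_dcgm_metrics(text):
    """Extract the four GPU metrics from Prometheus-format exporter output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") not in DCGM_TO_FRIENDLY:
            continue  # ignore metrics we do not track
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append(
            {
                "metric": DCGM_TO_FRIENDLY[match.group("name")],
                "gpu": labels.get("gpu", "unknown"),
                "value": float(match.group("value")),
            }
        )
    return samples


# Made-up sample values; the UUID is a placeholder.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-hypothetical"} 87
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-hypothetical"} 10240
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-hypothetical"} 64
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-hypothetical"} 212.5
"""

for sample in parse_dcgm_metrics(SAMPLE):
    print(sample["metric"], "gpu", sample["gpu"], "=", sample["value"])
```

This is handy as a quick sanity check on the VM itself: if the exporter output does not contain these fields, nothing downstream in Azure Monitor will show them either.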
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin