Tip: If you want to monitor A4 or A3 Ultra machine types that are deployed using the features provided by AI Hypercomputer, see Monitor VMs and clusters in the AI Hypercomputer documentation instead.
You can track metrics such as GPU utilization and GPU memory from your virtual machine (VM) instances by using the Ops Agent, which is Google's recommended telemetry collection solution for Compute Engine. By using the Ops Agent, you can manage your GPU VMs as follows:
This document covers the procedures for monitoring GPUs on Linux VMs by using the Ops Agent. Alternatively, a reporting script that can be set up to monitor GPU usage on Linux VMs is available on GitHub; see the compute-gpu-monitoring monitoring script. This script is not actively maintained.
For monitoring GPUs on Windows VMs, see Monitoring GPU performance (Windows).
Overview
The Ops Agent, version 2.38.0 or later, can automatically track GPU utilization and GPU memory usage rates on your Linux VMs that have the agent installed. These metrics, obtained from the NVIDIA Management Library (NVML), are tracked per GPU and per process for any process that uses GPUs. To view the metrics that are monitored by the Ops Agent, see Agent metrics: gpu.
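To spot-check the per-GPU utilization and memory figures described above directly on a VM, the following is a minimal sketch that reads them through `nvidia-smi`, which is itself built on NVML. It is not how the Ops Agent collects metrics; the function names and the dictionary keys are hypothetical and chosen for illustration only.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes every field is numeric; some configurations can report
    non-numeric placeholders, which this sketch does not handle.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "utilization_pct": float(util),
            "memory_used_mib": float(used),
            "memory_total_mib": float(total),
            "memory_used_pct": 100.0 * float(used) / float(total),
        })
    return rows


def read_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `read_gpus()` on a VM with the NVIDIA driver installed returns one entry per GPU; on a machine without `nvidia-smi` it returns an empty list.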
You can also set up the NVIDIA Data Center GPU Manager (DCGM) integration with the Ops Agent. This integration allows the Ops Agent to track metrics using the hardware counters on the GPU. DCGM provides access to the GPU device-level metrics. These include Streaming Multiprocessor (SM) block utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate. To view the metrics monitored by the Ops Agent, see Third-party application metrics: NVIDIA Data Center GPU Manager (DCGM).
To review GPU metrics by using the Ops Agent, complete the following steps:
On each of your VMs, check that you meet the following requirements:
sudo access to each VM.
To install the Ops Agent, complete the following steps:
If you were previously using the compute-gpu-monitoring monitoring script to track GPU utilization, disable the service before installing the Ops Agent. To disable the monitoring script, run the following command:
sudo systemctl --no-reload --now disable google_gpu_monitoring_agent
Install the latest version of the Ops Agent. For detailed instructions, see Installing the Ops Agent.
After you have installed the Ops Agent, if you need to install or upgrade your GPU drivers by using the installation scripts provided by Compute Engine, review the limitations section.
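Because GPU metrics require Ops Agent version 2.38.0 or later, it can help to confirm the installed version after the installation steps above. The following is a minimal sketch of that comparison. It assumes you have already obtained the version string from your package manager and that the string starts with a `major.minor.patch` triple, possibly followed by a distribution suffix; the function names are hypothetical.

```python
import re

# Minimum Ops Agent version that tracks GPU metrics, per this document.
MINIMUM_VERSION = (2, 38, 0)


def parse_version(version_string):
    """Extract the leading major.minor.patch triple as a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def supports_gpu_metrics(version_string, minimum=MINIMUM_VERSION):
    """Return True if the version is at least the minimum.

    Tuples compare numerically, so 2.100.0 correctly ranks above 2.38.0,
    which a plain string comparison would get wrong.
    """
    return parse_version(version_string) >= minimum
```

For example, `supports_gpu_metrics("2.37.1")` returns `False`, so that VM would need an upgrade before GPU metrics appear.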
You can review the NVML metrics that the Ops Agent collects from the Observability tabs for Compute Engine Linux VM instances.
To view the metrics for a single VM, do the following:
In the Google Cloud console, go to the VM instances page.
Select a VM to open the Details page.
Click the Observability tab to display information about the VM.
Select the GPU quick filter.
To view the metrics for multiple VMs, do the following:
In the Google Cloud console, go to the VM instances page.
Click the Observability tab.
Select the GPU quick filter.
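If you prefer to query these metrics outside the console, Cloud Monitoring accepts filter strings that select a metric type and resource. The following is a minimal sketch that only builds such a filter. The metric suffix `gpu/utilization` under the `agent.googleapis.com` prefix is an assumption here; confirm the exact metric names in Agent metrics: gpu. The function name is hypothetical.

```python
def gpu_metric_filter(metric_suffix, instance_id=None):
    """Build a Cloud Monitoring filter string for an Ops Agent metric.

    metric_suffix: for example "gpu/utilization" (assumed name; verify
        it against the agent metrics list before relying on it).
    instance_id: optional Compute Engine instance ID to narrow the
        filter to a single VM.
    """
    clauses = [
        f'metric.type = "agent.googleapis.com/{metric_suffix}"',
        'resource.type = "gce_instance"',
    ]
    if instance_id is not None:
        clauses.append(f'resource.labels.instance_id = "{instance_id}"')
    return " AND ".join(clauses)
```

The returned string can be pasted into a tool that accepts Monitoring filters or passed to a client library call that lists time series.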
The Ops Agent also provides integration for NVIDIA Data Center GPU Manager (DCGM) to collect key advanced GPU metrics such as Streaming Multiprocessor (SM) block utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate.
These advanced GPU metrics are not collected from NVIDIA P100 and P4 models.
For detailed instructions on how to set up and use this integration on each VM, see NVIDIA Data Center GPU Manager (DCGM).
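When you manage a mixed fleet, you may want to flag VMs whose GPUs fall under the P100 and P4 exception noted above before expecting advanced metrics from them. The following is a minimal sketch that checks a GPU model name for those two models. The function name is hypothetical, and the model strings in the comments are illustrative of what a driver tool might report, not an exhaustive list.

```python
import re

# Matches "P100" or "P4" as a whole token, so "P40" does not match.
_NO_ADVANCED_METRICS = re.compile(r"\bP(?:100|4)\b", re.IGNORECASE)


def collects_advanced_metrics(model_name):
    """Return False for NVIDIA P100 and P4 models, True otherwise.

    This encodes only the exception stated in this document; it does
    not verify that the DCGM integration is installed or configured.
    """
    return _NO_ADVANCED_METRICS.search(model_name) is None
```

For example, `collects_advanced_metrics("Tesla P4")` returns `False`, while a name such as `"Tesla T4"` returns `True`.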
Review DCGM metrics in Cloud Monitoring
In the Google Cloud console, go to the Monitoring > Dashboards page.
Select the Sample Library tab.
In the Filter field, type NVIDIA. The NVIDIA GPU Monitoring Overview (GCE and GKE) dashboard displays.
If you have set up the NVIDIA Data Center GPU Manager (DCGM) integration, the NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only) dashboard also displays.
For the required dashboard, click Preview. The Sample dashboard preview page displays.
From the Sample dashboard preview page, click Import sample dashboard.
The NVIDIA GPU Monitoring Overview (GCE and GKE) dashboard displays the GPU metrics such as GPU utilization, NIC traffic rate, and GPU memory usage.
Your GPU utilization display is similar to the following output:
The NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only) dashboard displays key advanced metrics such as SM utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate.
Your Advanced DCGM Metric display is similar to the following output:
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-10-13 UTC.