Metrics and Telemetry

Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time at the moment it is emitted from the NVIDIA Run:ai cluster.

NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at the levels described below (see the query sketch after the list).

- Cluster: A cluster is a set of node pools and nodes. With cluster metrics, data is aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.
- Node: Data is aggregated at the node level.
- Node pool: Data is aggregated at the node pool level.
- Workload: Data is aggregated at the workload level. For some workloads, e.g. distributed workloads, these metrics aggregate data from all of the worker pods.
- Pod: The basic unit of execution.
- Project: The basic organizational unit. Projects are the tool to implement resource allocation policies as well as the segregation between different initiatives.
- Department: A grouping of projects.
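These levels map onto the control-plane API mentioned above. As a rough illustration, the sketch below queries workload-level metrics over a one-hour window. The endpoint path, query parameters, and response handling are assumptions made for illustration, not the documented interface; check the NVIDIA Run:ai API reference for the exact routes and fields in your control-plane version.

```python
# Minimal sketch: fetching workload-level metrics from the NVIDIA Run:ai control plane.
# The endpoint path, query parameters, and sampling parameter are assumptions for
# illustration; consult the API reference for your control-plane version.
import datetime
import requests

BASE_URL = "https://example-runai-control-plane.local"  # assumption: your control-plane URL
API_TOKEN = "..."                                        # assumption: a bearer/application token
WORKLOAD_ID = "..."                                      # assumption: the workload's ID

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=1)

response = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",  # hypothetical route
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": "GPU_MEMORY_UTILIZATION_PER_GPU",  # an API metric name from the tables below
        "start": start.isoformat(),
        "end": end.isoformat(),
        "numberOfSamples": 60,                           # hypothetical sampling parameter
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```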

| Metric name in API | Metric name in UI per grid |
| --- | --- |
| GPU_MEMORY_USAGE_BYTES_PER_GPU | |
| GPU_MEMORY_UTILIZATION_PER_GPU | GPU memory utilization per GPU |
| GPU_UTILIZATION_DISTRIBUTION | GPU utilization distribution |
| CPU_ALLOCATION_MILLICORES | |
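The values in the "Metric name in API" column are what you pass when requesting metrics programmatically, while the UI names are what appear in the corresponding grids. As a rough illustration, the snippet below summarizes a returned time series; the response layout shown (measurements with timestamp/value pairs) is an assumed example shape, not the documented schema.

```python
# Sketch: summarizing a metrics response. The response layout below (measurements with
# timestamp/value pairs) is an assumed example shape, not the documented schema.
from statistics import mean

example_response = {
    "measurements": [
        {
            "type": "GPU_MEMORY_UTILIZATION_PER_GPU",  # an API metric name from the table above
            "values": [
                {"timestamp": "2024-01-01T00:00:00Z", "value": "41.0"},
                {"timestamp": "2024-01-01T00:01:00Z", "value": "57.5"},
                {"timestamp": "2024-01-01T00:02:00Z", "value": "63.2"},
            ],
        }
    ]
}

for measurement in example_response["measurements"]:
    samples = [float(v["value"]) for v in measurement["values"]]
    print(
        f"{measurement['type']}: "
        f"avg={mean(samples):.1f}%, peak={max(samples):.1f}% over {len(samples)} samples"
    )
```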

NVIDIA provides extended metrics as shown here. To enable these metrics, please contact NVIDIA Run:ai customer support.

| Metric name in API | Metric name in UI per grid |
| --- | --- |
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | Memory bandwidth utilization |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | NVLink received bandwidth |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | NVLink transmitted bandwidth |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | PCIe transmitted bandwidth |
| GPU_TENSOR_ACTIVITY_PER_GPU | |
| GPU_OOMKILL_SWAP_OUT_OF_RAM_COUNT_PER_GPU | OOMKill swap out of RAM count |
| GPU_OOMKILL_BURST_COUNT_PER_GPU | |
| GPU_OOMKILL_IDLE_COUNT_PER_GPU | |
| GPU_SWAP_MEMORY_BYTES_PER_GPU | |

| Metric name in API | Metric name in UI per grid |
| --- | --- |
| | Ready / Total GPU devices |
| | Ready / Total GPU devices |
| | Idle allocated GPU devices |
| ALLOCATED_CPU_MEMORY_BYTES | |
| GPU_ALLOCATION_NON_PREEMPTIBLE | |
| CPU_ALLOCATION_NON_PREEMPTIBLE | |
| MEMORY_ALLOCATION_NON_PREEMPTIBLE | |
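Unlike the time-series metrics above, telemetry values are point-in-time snapshots, so a request carries no time range. The sketch below is illustrative only; the endpoint path and the "telemetryType" parameter are assumptions, and the actual routes are listed in the NVIDIA Run:ai API reference.

```python
# Sketch: reading a current (point-in-time) telemetry value from the control plane.
# The endpoint path and the "telemetryType" parameter are assumptions for illustration.
import requests

BASE_URL = "https://example-runai-control-plane.local"  # assumption: your control-plane URL
API_TOKEN = "..."                                        # assumption: a bearer/application token

response = requests.get(
    f"{BASE_URL}/api/v1/telemetry",                      # hypothetical route
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"telemetryType": "GPU_ALLOCATION_NON_PREEMPTIBLE"},  # an API name from the table above
    timeout=30,
)
response.raise_for_status()
print(response.json())  # current value(s); no start/end window as with metrics
```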

