Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time, at the moment it is emitted from the NVIDIA Run:ai cluster.
NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at various levels.
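To make the distinction between the two concrete, the sketch below models a metric as a series of timestamped samples and telemetry as a single current reading. The class names `MetricSeries` and `TelemetryReading`, their fields, the placeholder telemetry type, and the sample values are all hypothetical illustrations. They are not the shape of the NVIDIA Run:ai API responses.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MetricSeries:
    """Hypothetical container: one metric type sampled over a time range."""
    metric_type: str
    # Each sample is (unix_seconds, value); the layout is an assumption.
    samples: List[Tuple[int, float]] = field(default_factory=list)

    def average(self) -> float:
        """Mean of all sampled values in the series."""
        if not self.samples:
            raise ValueError("no samples")
        return sum(value for _, value in self.samples) / len(self.samples)


@dataclass
class TelemetryReading:
    """Hypothetical container: one value captured at request time."""
    telemetry_type: str
    value: float


# Made-up sample data, one minute apart.
series = MetricSeries(
    "GPU_MEMORY_UTILIZATION_PER_GPU",
    [(1700000000, 40.0), (1700000060, 60.0)],
)
# Placeholder type name; not a real identifier.
reading = TelemetryReading("EXAMPLE_TELEMETRY_TYPE", 8.0)
```

A series supports operations over a time range, such as `series.average()`, whereas a telemetry reading is just the single value available now.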
Cluster: A cluster is a set of node pools and nodes. Cluster metrics are aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.
Node: Data is aggregated at the node level.
Node pool: Data is aggregated at the node pool level.
Workload: Data is aggregated at the workload level. In some workloads, for example distributed workloads, these metrics aggregate data from all worker pods.
Pod: The basic unit of execution.
Project: The basic organizational unit. Projects are the tool for implementing resource allocation policies and for separating different initiatives.
Department: Departments are a grouping of projects.
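The nesting of these levels can be sketched as a roll-up of per-pod readings. This is a minimal illustration only: the `PodSample` type, the `roll_up` helper, the entity names, and the byte values are hypothetical, and summation is an assumed aggregation (the actual aggregation may differ per metric). Node and node pool levels are left out to keep the sketch short.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PodSample(NamedTuple):
    """Hypothetical per-pod GPU memory reading, in bytes."""
    department: str
    project: str
    workload: str
    pod: str
    gpu_memory_bytes: int


def roll_up(samples: Iterable[PodSample], level: str) -> Dict[str, int]:
    """Sum pod readings by 'pod', 'workload', 'project', or 'department'."""
    if level not in ("pod", "workload", "project", "department"):
        raise ValueError(f"unknown level: {level}")
    totals: Dict[str, int] = defaultdict(int)
    for sample in samples:
        # The level name doubles as the field name on PodSample.
        totals[getattr(sample, level)] += sample.gpu_memory_bytes
    return dict(totals)


# Made-up data: one distributed workload with two worker pods, plus a notebook.
samples = [
    PodSample("research", "vision", "train-job", "train-job-worker-0", 4_000),
    PodSample("research", "vision", "train-job", "train-job-worker-1", 6_000),
    PodSample("research", "nlp", "notebook", "notebook-0", 1_000),
]

per_workload = roll_up(samples, "workload")
per_project = roll_up(samples, "project")
per_department = roll_up(samples, "department")
```

Here the distributed workload `train-job` combines both of its worker pods, and the department total combines both of its projects.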
Metric name in UI per grid
GPU_MEMORY_USAGE_BYTES_PER_GPU
GPU_MEMORY_UTILIZATION_PER_GPU
GPU memory utilization per GPU
GPU_UTILIZATION_DISTRIBUTION
GPU utilization distribution
GPU devices (unallocated)
CPU_ALLOCATION_MILLICORES
NVIDIA provides extended metrics as shown here. To enable these metrics, please contact NVIDIA Run:ai customer support.
GPU_FP16_ENGINE_ACTIVITY_PER_GPU
GPU_FP32_ENGINE_ACTIVITY_PER_GPU
GPU_FP64_ENGINE_ACTIVITY_PER_GPU
GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU
GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU
Memory bandwidth utilization
GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU
NVLink received bandwidth
GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU
NVLink transmitted bandwidth
GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU
GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU
PCIe transmitted bandwidth
GPU_TENSOR_ACTIVITY_PER_GPU
GPU_OOMKILL_SWAP_OUT_OF_RAM_COUNT_PER_GPU
OOMKill swap out of RAM count
GPU_OOMKILL_BURST_COUNT_PER_GPU
GPU_OOMKILL_IDLE_COUNT_PER_GPU
GPU_SWAP_MEMORY_BYTES_PER_GPU
Ready / Total GPU devices
Ready / Total GPU devices
Idle allocated GPU devices
ALLOCATED_CPU_MEMORY_BYTES
GPU_ALLOCATION_NON_PREEMPTIBLE
CPU_ALLOCATION_NON_PREEMPTIBLE
MEMORY_ALLOCATION_NON_PREEMPTIBLE
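Many of the identifiers listed above end in `_PER_GPU`. The sketch below separates such names from the rest, assuming (from the naming alone, not from a documented rule) that the suffix marks metrics reported per GPU device. The helper `split_by_scope` is hypothetical, and the list holds only a few of the identifiers from this page.

```python
from typing import Dict, List

# A few identifiers copied from the lists above.
METRIC_NAMES = [
    "GPU_MEMORY_USAGE_BYTES_PER_GPU",
    "GPU_UTILIZATION_DISTRIBUTION",
    "CPU_ALLOCATION_MILLICORES",
    "GPU_TENSOR_ACTIVITY_PER_GPU",
    "ALLOCATED_CPU_MEMORY_BYTES",
]


def split_by_scope(names: List[str]) -> Dict[str, List[str]]:
    """Group names by whether they end in the _PER_GPU suffix."""
    groups: Dict[str, List[str]] = {"per_gpu": [], "other": []}
    for name in names:
        key = "per_gpu" if name.endswith("_PER_GPU") else "other"
        groups[key].append(name)
    return groups


groups = split_by_scope(METRIC_NAMES)
```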