Workloads | Run:ai Documentation

Workloads

This section explains the procedure for managing workloads.

The Workloads table can be found under Workload manager in the NVIDIA Run:ai platform.

The Workloads table provides a list of all the workloads scheduled on the NVIDIA Run:ai Scheduler and allows you to manage them.

The Workloads table consists of the following columns:

The purpose of the workload - Default: Build, Train, or Deploy

The scheduling priority assigned to the workload within its project

The different phases in a workload lifecycle

The project in which the workload runs

The department that the workload is associated with. This column is visible only if the department toggle is enabled by your administrator.

The user who created the workload

The node pools utilized by the workload

The number of running pods out of the number requested

The requested number of runs the workload must finish to be considered complete

The timestamp of when the workload was created

The timestamp when the workload reached a terminal state (failed or completed)

The total time a workload spent in Pending state

The method by which you can access and interact with the running workload. It is essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL). Click one of the values in the column to view the list of connections and their parameters.

Data resources used by the workload

Standard or distributed training:

Number of GPU devices requested

Number of GPU devices allocated

Amount of GPU memory requested

Amount of GPU memory allocated

The number of allocated GPU devices that have been idle for more than 5 minutes

Number of CPU cores requested

Number of CPU cores allocated

Amount of CPU memory requested

Amount of CPU memory allocated

The cluster that the workload is associated with

The following table describes the different phases in a workload lifecycle. The UI provides additional details for some of the workload statuses below, which can be viewed by clicking the icon next to the status.

Workload setup is initiated in the cluster. Resources and pods are now provisioning.

A multi-pod group is created

Workload is queued and awaiting resource allocation

Workload is retrieving images, starting containers, and preparing pods

All pods are initialized or a failure to initialize is detected

Workload is currently in progress with all pods operational

All pods initialized (all containers in pods are ready)

Workload completion or failure

Workload is transitioning back to active after being suspended

A previously suspended workload is resumed by the user

Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details.

Workload and its associated resources are being decommissioned from the cluster

Resources are fully deleted

Workload is being suspended and its pods are being deleted

User initiates a suspend action

All pods are terminated and the workload is no longer active

Workload is on hold and resources are intact but inactive

Stopping the workload without deleting resources

Transitioning back to the initializing phase or proceeding to deleting the workload

Image retrieval failed or containers experienced a crash. Check your logs for specific details.

An error occurs preventing the successful completion of the workload

Workload has successfully finished its execution

The workload has finished processing without errors

Pods Associated with the Workload

Click one of the values in the Running/requested pods column to view the list of pods and their parameters.

The node on which the pod resides

The node pool in which the pod resides (applicable if node pools are enabled)

Number of GPU devices allocated for the pod

Amount of GPU memory allocated for the pod

Connections Associated with the Workload

A connection refers to the method by which you can access and interact with the running workloads. It is essentially the "doorway" through which you can reach and use the applications (tools) these workloads provide.

Click one of the values in the Connection(s) column to view the list of connections and their parameters. Connections are network interfaces that communicate with the application running in the workload. A connection is either the URL the application exposes or the IP and port of the node that the workload is running on.

The name of the application running on the workload

The network connection type selected for the workload

Who is authorized to use this connection (everyone, specific groups/users)

Enabled only for supported tools
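If you want to verify a connection from your workstation, the connection details (node IP and node port, or the external URL) are enough for a quick reachability check. The following is a minimal sketch using only the Python standard library; the host, port, and URL values are placeholders rather than values from your cluster.

```python
# Minimal reachability check for a workload connection.
# The host, port, and URL below are placeholders; substitute the values
# shown in the Connections list for your workload.
import socket
import urllib.request

NODE_IP = "10.0.0.12"       # placeholder: node IP from the connection details
NODE_PORT = 30080           # placeholder: node port from the connection details
EXTERNAL_URL = "https://example.internal/jupyter"  # placeholder external URL


def check_node_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_external_url(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code returned by the URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print("node port reachable:", check_node_port(NODE_IP, NODE_PORT))
    # Uncomment once EXTERNAL_URL points at a real connection:
    # print("external URL status:", check_external_url(EXTERNAL_URL))
```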

Data Sources Associated with the Workload

Click one of the values in the Data source(s) column to view the list of data sources and their parameters.

The name of the data source mounted to the workload

Customizing the Table View

Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the following tabs:

Displays the workload status over time, including events that describe the workload lifecycle and alerts on notable events. Use the filter to search the history for specific events.

The Metrics screen contains a dropdown allowing you to switch between Default and Advanced metrics views:

Note

Advanced metrics are disabled by default. If they are unavailable, your Administrator must enable them under General settings → Analytics → Advanced metrics.

Select Advanced from the dropdown to view extended GPU device-level metrics such as memory bandwidth, SM occupancy, and other data.

Workload events are listed in chronological order. The logs contain events from the workload’s lifecycle to help you monitor and debug issues.

Note

Logs are available only while the workload is in a non-terminal state. Once the workload completes or fails, logs are no longer accessible.

Before starting, make sure you have created a project (or had one created for you) to work with workloads.

To create a new workload:

  1. Select a workload type - Follow the links below to view the step-by-step guide for each workload type:

Stopping one or more workloads kills their pods and releases the associated resources.

  1. Select one or more workloads from the list

Running one or more workloads spins up new pods and resumes the workload execution from where it was stopped.

  1. Select one or more workloads that you want to resume

To connect to an application running in the workload (for example, Jupyter Notebook):

  1. Select the workload you want to connect to

  2. Select the tool from the drop-down list

  3. The selected tool opens in a new tab in your browser

  1. Select the workload you want to copy

  2. Enter a name for the workload. The name must be unique.

  3. Update the workload and click CREATE WORKLOAD

  1. Select the workload you want to delete

  2. In the dialog, click DELETE to confirm the deletion

Note

Once a workload is deleted, you can view it in the Deleted tab in the Workloads view. This tab is displayed only if enabled by your Administrator under General settings → Workloads → Deleted workloads.

Managing Workload Properties

Administrators can change the default priority and category assigned to a workload type by updating the mapping using the NVIDIA Run:ai API:

Go to the Workloads API reference to view the available actions.
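As a rough illustration of the request pattern (an authenticated JSON call against the NVIDIA Run:ai REST API), the sketch below updates a workload-type mapping. The endpoint path and payload field names are assumptions for illustration only; use the actual endpoint and schema documented in the Workloads API reference.

```python
# Hedged sketch only: the endpoint path and payload fields below are
# assumptions for illustration; confirm the real ones in the Workloads
# API reference before use.
import requests

BASE_URL = "https://<company>.run.ai"  # placeholder for your NVIDIA Run:ai URL
TOKEN = "<api-token>"                  # placeholder application token

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# Assumed shape of a priority/category mapping update for a workload type.
payload = {
    "workloadType": "Training",  # assumed field name
    "priority": "high",          # assumed field name
    "category": "Train",         # assumed field name
}

resp = requests.put(
    f"{BASE_URL}/api/v1/workloads/type-mappings",  # assumed endpoint path
    headers=headers,
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```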

To understand the condition of the workload, review the workload status in the Workloads table. For more information, check the workload’s event history.
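If you prefer to check workload status programmatically instead of in the Workloads table, the same phase information can in principle be read from the workloads API. This is a sketch under assumptions: the endpoint path, filter syntax, and response field names shown here are illustrative and should be verified against the Workloads API reference.

```python
# Hedged sketch: endpoint path, filter syntax, and response field names
# are assumptions; verify them against the Workloads API reference.
import requests

BASE_URL = "https://<company>.run.ai"  # placeholder for your NVIDIA Run:ai URL
TOKEN = "<api-token>"                  # placeholder application token

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads",                   # assumed list endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"filterBy": "projectName==my-project"},   # assumed filter syntax
)
resp.raise_for_status()

for wl in resp.json().get("workloads", []):           # assumed response shape
    print(wl.get("name"), wl.get("phase"))            # assumed field names
```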

Listed below are a number of known issues when working with workloads and how to fix them:

Cluster connectivity issues (the “there are issues with your connection to the cluster” error message)

Workload in “Initializing” status for some time

Workload has been pending for some time

PVCs created using the K8s API or kubectl are not visible or mountable in NVIDIA Run:ai

This is by design.

  1. Create a new data source of type PVC in the NVIDIA Run:ai UI

  2. In the Data mount section, select Existing PVC

  3. Select the PVC you created via the K8s API

You are now able to select and mount this PVC in your NVIDIA Run:ai submitted workloads.
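For reference, a PVC of the kind this workaround applies to (one created directly through the Kubernetes API rather than through NVIDIA Run:ai) might be created with the official Kubernetes Python client along the following lines; the PVC name, namespace, storage class, and size are placeholders.

```python
# Illustrative only: creates a PVC directly via the Kubernetes API,
# which is the scenario the workaround above applies to.
# The name, namespace, storage class, and size are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="my-data-pvc"),        # placeholder name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="standard",                        # placeholder class
        resources=client.V1ResourceRequirements(
            requests={"storage": "10Gi"}                      # placeholder size
        ),
    ),
)

core.create_namespaced_persistent_volume_claim(namespace="my-project", body=pvc)
```

After creating the PVC this way, follow the UI steps above to expose it as a NVIDIA Run:ai data source so it can be mounted in submitted workloads.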

Workload is not visible in the UI

