Deploy Inference Workloads with NVIDIA NIM

This section explains how to deploy a GenAI model from NVIDIA NIM as an inference workload via the NVIDIA Run:ai UI.

An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.

The inference workload is assigned to a project and is affected by the project’s quota.

To learn more about the inference workload type in NVIDIA Run:ai and determine whether it is the most suitable workload type for your goals, see Workload types.

Note

The Inference workload type is disabled by default. If you cannot see it in the menu, ask your administrator to enable it under General settings → Workloads → Models.

By default, inference workloads in NVIDIA Run:ai are assigned a priority of very-high, which is non-preemptible. This behavior ensures that inference workloads, which often serve real-time or latency-sensitive traffic, are guaranteed the resources they need and will not be disrupted by other workloads. For more details on the available options, see Workload priority control.

Note

Changing the priority is not supported for NVIDIA NIM workloads.

Creating a NIM Inference Workload

To add a new inference workload:

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference. Within the new inference form:

  3. Select under which cluster to create the inference workload

  4. Select the project in which your inference will run

  5. Select NIM from Inference type

  6. Enter a unique name for the inference workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE. In the next step:

  8. Select the NIM model and set how to access it

  9. Select the compute resource for your inference workload

  10. Select the data source that will serve as the model store. Choose a data source where the model is already cached to reduce loading time, or click +NEW DATA SOURCE to add a new data source to the gallery; the new data source will cache the model and reduce loading time for future use. If there are connectivity issues with the cluster, or issues while creating the data source, it will not be available for selection. For a step-by-step guide on adding data sources to the gallery, see data sources. Once created, the new data source is automatically selected.

  11. Optional - General settings:

After the inference workload is created, it is added to the Workloads table, where it can be managed and monitored.

Accessing the Inference Workload

You can programmatically consume an inference workload via API by making direct calls to the serving endpoint, typically from other workloads or external integrations.

Once an inference workload is deployed, the serving endpoint URL appears in the Connections column of the inference workloads grid.

Note

If the serving endpoint URL ends with .svc.cluster.local, it is accessible only within the cluster. To enable external access, your administrator must configure the cluster as described in the inference requirements section.
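
For a quick reachability check of a cluster-internal endpoint, you can run a throwaway curl pod inside the cluster. The sketch below is illustrative only: it assumes you have kubectl access to the cluster, that the NIM container exposes the standard /v1/health/ready route, and uses an arbitrary pod name; substitute the internal URL shown in the Connections column.

    # Minimal in-cluster reachability check (illustrative sketch).
    # <serving-endpoint-url> is the internal URL from the Connections column;
    # /v1/health/ready assumes the standard NIM health route is exposed.
    kubectl run nim-endpoint-check --rm -it --restart=Never \
      --image=curlimages/curl --command -- \
      curl -s <serving-endpoint-url>/v1/health/ready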

Access to the inference serving API depends on how the serving endpoint access was configured when submitting the inference workload:

Follow the steps below to obtain a token:

  1. Use the Tokens API with:

  2. Use the obtained token to make API calls to the inference serving endpoint. For example:

    # Replace <serving-endpoint-url>, <model-name> (e.g. "meta-llama/Llama-3.1-8B-Instruct") and <your-access-token> with your values
    curl <serving-endpoint-url>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <your-access-token>" \
      -d '{
        "model": "<model-name>",
        "messages": [{
          "role": "user",
          "content": "Write a short poem on AI"
        }]
      }'
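
If you only want the generated text, you can pipe the response through jq. This is a minimal sketch that assumes the endpoint returns the OpenAI-compatible chat completions schema (which NIM models generally follow) and that jq is installed locally; the placeholders are the same as in the example above.

    # Same request as above, keeping only the generated message text.
    # Assumes the OpenAI-compatible response schema and a local jq install.
    curl -s <serving-endpoint-url>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <your-access-token>" \
      -d '{
        "model": "<model-name>",
        "messages": [{"role": "user", "content": "Write a short poem on AI"}]
      }' | jq -r '.choices[0].message.content'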

To view the available actions, see the inference workload CLI v2 reference.

To view the available actions for creating an inference workload, see the Inferences API reference.

