Cluster System Requirements

The NVIDIA Run:ai cluster is a Kubernetes application. This section describes the hardware and software requirements for the NVIDIA Run:ai cluster.

The system requirements depend on where the control plane and cluster are installed. The following applies to Kubernetes only:

The following hardware requirements are for the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, you may want to set node roles to separate system nodes from worker nodes, reduce downtime, and save CPU cycles on expensive GPU machines.

NVIDIA Run:ai Cluster - System Nodes

This configuration is the minimum requirement for installing and using the NVIDIA Run:ai cluster.

Note

To designate nodes for NVIDIA Run:ai system services, follow the instructions in System nodes.

NVIDIA Run:ai Cluster - Worker Nodes

The NVIDIA Run:ai cluster supports x86 and ARM CPUs (see the note below), and NVIDIA GPUs from the T, V, A, L, H, B, GH and GB architecture families. For the list of supported GPU models, see Supported NVIDIA Data Center GPUs and Systems.

The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

Note

To designate nodes for NVIDIA Run:ai workloads, follow the instructions in Worker nodes.

NVIDIA Run:ai workloads must be able to access data uniformly from any worker node, both to read training data and code and to save checkpoints, weights, and other machine-learning artifacts.

Typical protocols are Network File System (NFS) and network-attached storage (NAS). The NVIDIA Run:ai cluster supports both; for more information, see Shared storage.
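For illustration only, here is a minimal sketch of exposing an existing NFS export to workloads as a statically provisioned PersistentVolume with a matching claim. The server address, export path, namespace, and size are placeholders, not values from this guide; actual setup should follow Shared storage.

# nfs-shared-storage.yaml (illustrative only)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-training-data
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteMany              # many workloads can mount the same share
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.mycorp.local   # placeholder NFS server
    path: /exports/training    # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-data
  namespace: team-a            # placeholder namespace
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""         # bind to the statically provisioned PV above
  resources:
    requests:
      storage: 500Gi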

The following software requirements must be fulfilled on the Kubernetes cluster.

NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

Note

For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

Supported Kubernetes versions
Supported OpenShift versions

For managed Kubernetes, consult your Kubernetes service provider's release notes to confirm which version of the underlying Kubernetes platform is supported and to ensure compatibility with NVIDIA Run:ai. For up-to-date end-of-life statements, see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.
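To quickly check which Kubernetes version your cluster is running before comparing it against the support matrix above, you can query the API server and nodes:

kubectl version      # client and API server versions
kubectl get nodes    # per-node kubelet versions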

NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:
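To verify which container runtime each node is configured with, inspect the nodes; the CONTAINER-RUNTIME column of the wide output shows the runtime and its version:

kubectl get nodes -o wide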

Kubernetes Pod Security Admission

NVIDIA Run:ai supports the restricted policy for Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are supported only with the privileged policy.

For NVIDIA Run:ai on OpenShift to run with the PSA restricted policy, the NVIDIA Run:ai namespace must carry the following labels:

pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/warn=privileged
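As a sketch, assuming these labels belong on the runai namespace created in the next step, they can be applied with oc:

oc label namespace runai \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  --overwrite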

NVIDIA Run:ai must be installed in a namespace (or project on OpenShift) called runai. Use the following to create the namespace/project:
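A minimal example, using kubectl on Kubernetes and oc on OpenShift:

kubectl create namespace runai    # Kubernetes
oc new-project runai              # OpenShift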

Kubernetes Ingress Controller

NVIDIA Run:ai cluster requires a Kubernetes ingress controller to be installed on the Kubernetes cluster.

There are many ways to install and configure an ingress controller. A simple example of installing and configuring the NGINX ingress controller using Helm:

Vanilla Kubernetes

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace
Oracle Kubernetes Engine (OKE)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
    --set controller.service.externalTrafficPolicy=Local \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster's subnets
Fully Qualified Domain Name (FQDN)

Note

The Fully Qualified Domain Name requirement applies to Kubernetes only.

You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai cluster (for example, runai.mycorp.local). This cannot be an IP address. The domain name must be accessible inside the organization's private network.

Wildcard FQDN for Inference (Optional)

To make inference serving endpoints available outside the cluster, configure a wildcard DNS record (*.runai-inference.mycorp.local) that resolves to the cluster's public IP address, or to the cluster's load balancer IP address in on-premises environments. This ensures that each inference workload receives a unique subdomain under the wildcard domain.
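To sanity-check DNS from inside the organization's network, both the FQDN and an arbitrary label under the wildcard should resolve; the hostnames below are placeholders matching the examples above:

nslookup runai.mycorp.local
nslookup test.runai-inference.mycorp.local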

You must have a TLS certificate associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-cluster-domain-tls-secret in the runai namespace, passing the path to the TLS certificate (--cert) and its corresponding private key (--key), by running the following:

# Replace /path/to/fullchain.pem and /path/to/private.pem with the actual paths
# to your TLS certificate and private key
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert /path/to/fullchain.pem \
  --key /path/to/private.pem
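To confirm the secret was created with the expected kubernetes.io/tls type:

kubectl get secret runai-cluster-domain-tls-secret -n runai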
Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority. Follow the steps below to configure the local certificate authority.

In air-gapped environments, you must configure and install the local CA's public key in the Kubernetes cluster. This is required for the installation to succeed:

  1. Add the public key to the required namespace(s). On Kubernetes:

kubectl -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
kubectl label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite

On OpenShift:

oc -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc -n openshift-monitoring create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite

  2. When installing the cluster, make sure the following flag is added to the helm command: --set global.customCA.enabled=true. See Install cluster.

NVIDIA Run:ai cluster requires NVIDIA GPU Operator to be installed on the Kubernetes cluster. GPU Operator versions 22.9 to 25.3 are supported.

Note

For Multi-Node NVLink support (e.g. GB200), GPU Operator 25.3 and above is required.

For air-gapped installation, follow the instructions in Install NVIDIA GPU Operator in Air-Gapped Environments.

See Installing the NVIDIA GPU Operator, followed by the notes below:
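As a sketch, a default Helm installation of the GPU Operator typically looks like the following; the distribution-specific notes below may require additional flags, and the release name and namespace here are only examples:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace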

OpenShift Container Platform (OCP)

The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

Elastic Kubernetes Service (EKS)

For GPU nodes, EKS uses an AMI that already contains the NVIDIA drivers. As such, you must install the GPU Operator with the flag --set driver.enabled=false.
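For example, building on the Helm sketch above:

helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false    # EKS AMIs already ship the NVIDIA driver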

Google Kubernetes Engine (GKE)

Before installing the GPU Operator:

  1. Create the gpu-operator namespace by running:

kubectl create ns gpu-operator
  2. Create the following file:

# resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical

  3. Apply the file by running:

kubectl apply -f resourcequota.yaml
Rancher Kubernetes Engine 2 (RKE2)

Make sure to specify the CONTAINERD_CONFIG option exactly as outlined in the documentation and custom configuration guide, using the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl. Do not create the file manually if it does not already exist. The GPU Operator will handle this configuration during deployment.
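As a sketch of how this option is commonly passed to the GPU Operator Helm chart on RKE2; the CONTAINERD_SOCKET value is an assumption based on RKE2's default containerd socket path, so verify both values against the guides referenced above:

helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock'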

Oracle Kubernetes Engine (OKE)

For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

When deploying on clusters with RDMA or Multi Node NVLink‑capable nodes (e.g. B200, GB200), the NVIDIA Network Operator is required to enable high-performance networking features such as GPUDirect RDMA in Kubernetes. Network Operator versions v24.4 and above are supported.

The Network Operator works alongside the NVIDIA GPU Operator to provide:

The Network Operator must be installed and configured as follows:

For air-gapped installation, follow the instructions in Network Operator Deployment in an Air-gapped Environment.

NVIDIA Dynamic Resource Allocation (DRA) Driver

When deploying on clusters with Multi-Node NVLink (e.g. GB200), the NVIDIA DRA driver is essential to enable Dynamic Resource Allocation at the Kubernetes level. To install, follow the instructions in Configure and Helm-install the driver.

After installation, update runaiconfig using the GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.

Note

Installing Prometheus applies to Kubernetes only.

NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

There are many ways to install Prometheus. A simple example is to install the community Kube-Prometheus Stack using Helm by running the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false
Additional Software Requirements

Additional NVIDIA Run:ai capabilities, such as Distributed Training and Inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.

Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported:

There are several ways to install each framework. A simple installation method is the Kubeflow Training Operator, which includes support for TensorFlow, PyTorch, XGBoost, and JAX.

It is recommended to use Kubeflow Training Operator v1.9.2 and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as Stopping a workload and Scheduling rules.

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml

Note

If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

# Restrict the Kubeflow Training Operator to the non-MPI job kinds, leaving MPIJobs to the MPI Operator
kubectl patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=jaxjob"]}]'
# Remove the MPIJob CRD installed by the Training Operator so the MPI Operator can manage its own
kubectl delete crd mpijobs.kubeflow.org

Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster; Knative versions 1.11 to 1.18 are supported. Follow the Installing Knative instructions (a minimal install is sketched below). Once installed, follow the steps below.
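As an illustration only, a YAML-based install of the Knative Serving core components might look like the following, assuming version v1.15.0 (inside the supported 1.11 to 1.18 range) and leaving the networking layer (for example, Kourier or Istio) to the Knative instructions:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.0/serving-core.yaml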

  1. Configure Knative to use the NVIDIA Run:ai Scheduler and other features using the following command:

    kubectl patch configmap/config-autoscaler \
      --namespace knative-serving \
      --type merge \
      --patch '{"data":{"enable-scale-to-zero":"true"}}' && \
    kubectl patch configmap/config-features \
      --namespace knative-serving \
      --type merge \
      --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-nodeselector": "enabled","kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled","kubernetes.podspec-volumes-emptydir":"enabled","kubernetes.podspec-securitycontext":"enabled","kubernetes.containerspec-addcapabilities":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled","kubernetes.podspec-persistent-volume-write":"enabled","multi-container":"enabled","kubernetes.podspec-init-containers":"enabled","kubernetes.podspec-fieldref":"enabled"}}'
  2. Optional: If inference serving endpoints should be accessible outside the cluster:

    1. Patch the Knative service and assign the DNS for inference workloads to the Knative ingress service:

      # Replace <runai-inference.mycorp.local> with your FQDN for Inference (without the wildcard)
      kubectl patch configmap/config-domain \
         --namespace knative-serving \
         --type merge \
         --patch '{"data":{"<runai-inference.mycorp.local>":""}}'

NVIDIA Run:ai allows autoscaling a deployment according to the metrics below:

Using a custom metric (for example, latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). Use the following command to install it, making sure to replace {VERSION} with a supported Knative version.

kubectl apply -f https://github.com/knative/serving/releases/download/knative-{VERSION}/serving-hpa.yaml

NVIDIA Run:ai supports distributed inference (multi-node) deployments using the Leader Worker Set (LWS). To enable this capability, you must install the LWS Helm chart on your cluster:

CHART_VERSION=0.6.2
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=$CHART_VERSION \
  --namespace lws-system \
  --create-namespace \
  --wait --timeout 300s
