The NVIDIA Run:ai control plane is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai control plane. Before you start, make sure to review the Installation overview.
The machine running the installation script (typically the Kubernetes master) must have:
At least 50GB of free space
Note
If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai software artifacts include the Helm binary.
The following hardware requirements are for the control plane system nodes. By default, all NVIDIA Run:ai control plane services run on all available nodes.
x86 - Supported for both Kubernetes and OpenShift deployments.
ARM - Supported for Kubernetes only. ARM is currently not supported for OpenShift.
This configuration is the minimum requirement you need to install and use NVIDIA Run:ai control plane:
Note
To dedicate nodes to NVIDIA Run:ai system services, follow the instructions in System nodes.
If the NVIDIA Run:ai control plane will be installed on the same Kubernetes cluster as the NVIDIA Run:ai cluster, make sure the cluster Hardware requirements are considered in addition to the NVIDIA Run:ai control plane hardware requirements.
The following software requirements must be fulfilled.
Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator
Internal tests are performed on Ubuntu 22.04 and, for OpenShift, on CoreOS.
Nodes must be time-synchronized using NTP (Network Time Protocol) for proper system functionality.
NVIDIA Run:ai control plane requires Kubernetes. The following Kubernetes distributions are supported:
OpenShift Container Platform (OCP)
NVIDIA Base Command Manager (BCM)
Elastic Kubernetes Engine (EKS)
Google Kubernetes Engine (GKE)
Azure Kubernetes Service (AKS)
Oracle Kubernetes Engine (OKE)
Rancher Kubernetes Engine (RKE1)
Rancher Kubernetes Engine 2 (RKE2)
Note
The latest release of the NVIDIA Run:ai control plane supports Kubernetes 1.31 to 1.33 and OpenShift 4.15 to 4.19.
See the following Kubernetes version support matrix for the latest NVIDIA Run:ai releases:
Supported Kubernetes versions | Supported OpenShift versions
For information on supported versions of managed Kubernetes, consult the release notes provided by your Kubernetes service provider to confirm the specific version of the underlying Kubernetes platform it supports, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement, see the Kubernetes Release History or the OpenShift Container Platform Life Cycle Policy.
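An install script can gate on the supported range before proceeding. The sketch below is illustrative only and not part of the NVIDIA Run:ai tooling; the helper name `k8s_supported` is hypothetical, and the 1.31 to 1.33 range reflects the latest release noted above (it covers Kubernetes only, not OpenShift):

```shell
# Hypothetical helper: check whether a Kubernetes version string (e.g. "1.32"
# or "1.32.4") falls inside the supported 1.31-1.33 range for the latest
# NVIDIA Run:ai control plane release.
k8s_supported() {
  minor=$(printf '%s' "$1" | cut -d. -f2)   # extract the minor version
  [ "$minor" -ge 31 ] && [ "$minor" -le 33 ]
}

k8s_supported "1.32" && echo "supported"
k8s_supported "1.28" || echo "unsupported: upgrade the cluster first"
```

The same pattern can be wired into a pre-flight script that reads the live version from your cluster tooling.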
The NVIDIA Run:ai control plane uses a namespace, or project on OpenShift, called runai-backend. Use the following to create the namespace/project:
Kubernetes:
kubectl create namespace runai-backend
OpenShift:
oc new-project runai-backend
Note
The default storage class requirement applies to Kubernetes only.
The NVIDIA Run:ai control plane requires a default storage class to create persistent volume claims for NVIDIA Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior, whether the NVIDIA Run:ai persistent data is saved or deleted when the NVIDIA Run:ai control plane is deleted.
Note
For a simple (non-production) storage class example, see Kubernetes Local Storage Class. This storage class uses the directory /opt/local-path-provisioner on every node as the path for provisioning persistent volumes. Then set the new storage class as the default:
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Kubernetes Ingress Controller
Note
Installing an ingress controller applies to Kubernetes only.
The NVIDIA Run:ai control plane requires a Kubernetes ingress controller to be installed.
OpenShift, RKE and RKE2 come with a pre-installed ingress controller.
Internal tests are performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.
Make sure that a default ingress controller is set.
There are many ways to install and configure different ingress controllers. The following shows a simple example of installing and configuring the NGINX ingress controller using Helm:
Vanilla Kubernetes
Run the following commands:
For cloud deployments, both the internal IP and external IP are required.
For on-prem deployments, only the external IP is needed.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
--namespace nginx-ingress --create-namespace \
--set controller.kind=DaemonSet \
--set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
Managed Kubernetes (EKS, GKE, AKS)
Run the following commands:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
--namespace nginx-ingress --create-namespace
Oracle Kubernetes Engine (OKE)
Run the following commands:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
--namespace ingress-nginx --create-namespace \
--set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
--set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
--set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
--set controller.service.externalTrafficPolicy=Local \
--set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the ID of one of your cluster's subnets
Fully Qualified Domain Name (FQDN)
Note
The Fully Qualified Domain Name requirement applies to Kubernetes only.
You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai control plane (for example, runai.mycorp.local). This cannot be an IP address. The FQDN must be resolvable within the organization's private network.
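A pre-flight script can sanity-check the chosen value before the install. The sketch below is illustrative only and not part of the NVIDIA Run:ai installer; the helper name `is_valid_fqdn` is hypothetical. It rejects bare IPv4 addresses and accepts dotted hostnames (IPv6 and DNS resolvability checks are out of scope here):

```shell
# Hypothetical helper: reject IPv4 addresses and require a dotted hostname.
is_valid_fqdn() {
  case "$1" in
    *[!0-9.]*) : ;;   # contains a non-digit, non-dot character: keep checking
    *) return 1 ;;    # only digits and dots: looks like an IP address, reject
  esac
  # Require at least one dot and only hostname-safe characters per label
  printf '%s' "$1" | grep -Eq '^([a-zA-Z0-9-]+\.)+[a-zA-Z0-9-]+$'
}

is_valid_fqdn "runai.mycorp.local" && echo "ok: FQDN"
is_valid_fqdn "192.168.1.10" || echo "rejected: IP address"
```

This does not verify that the name actually resolves on the private network; a follow-up lookup with your DNS tooling covers that.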
You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-backend-tls in the runai-backend namespace, providing the path to the TLS certificate (--cert) and its corresponding private key (--key), by running the following:
kubectl create secret tls runai-backend-tls -n runai-backend \
--cert /path/to/fullchain.pem \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
--key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
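For a non-production test environment where a CA-issued certificate is not yet available, a self-signed certificate and key pair can be generated with openssl to back the secret above. This is an illustrative sketch only, reusing the example FQDN runai.mycorp.local from above; production installs should use a certificate issued by a trusted CA:

```shell
# Generate a self-signed certificate and private key for the example FQDN
# (non-production use only). Requires OpenSSL 1.1.1+ for -addext.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout private.pem -out fullchain.pem \
  -subj "/CN=runai.mycorp.local" \
  -addext "subjectAltName=DNS:runai.mycorp.local"
```

The resulting fullchain.pem and private.pem can then be passed to the kubectl create secret tls command shown above.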
NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.
Local Certificate Authority
A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority. Follow the steps below to configure the local certificate authority.
In air-gapped environments, you must configure and install the local CA's public key in the Kubernetes cluster. This is required for the installation to succeed:
Add the public key to the runai-backend namespace:
Kubernetes:
kubectl -n runai-backend create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
OpenShift:
oc -n runai-backend create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
When installing the control plane, make sure the flag --set global.customCA.enabled=true is added to the helm command. See Install control plane.
The NVIDIA Run:ai control plane installation includes a default PostgreSQL database. However, you may opt to use an existing PostgreSQL database if you have specific requirements or preferences as detailed in External Postgres database configuration. Please ensure that your PostgreSQL database is version 16 or higher.
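When bringing your own database, a pre-install script can enforce the version floor. The sketch below is illustrative only and not part of NVIDIA Run:ai; the helper name `check_pg_major` is hypothetical, and it assumes a version string in the usual "PostgreSQL 16.2" form:

```shell
# Hypothetical helper: extract the major version from a PostgreSQL version
# string and verify it meets the >= 16 requirement stated above.
check_pg_major() {
  major=$(printf '%s' "$1" | grep -Eo '[0-9]+' | head -n1)
  [ "${major:-0}" -ge 16 ]
}

check_pg_major "PostgreSQL 16.2" && echo "supported"
check_pg_major "PostgreSQL 14.9" || echo "too old: upgrade required"
```

In practice the version string would come from querying the server, for example via your database client's version command.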