Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.
Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:
Test the below requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking
Look at additional components installed and analyze their relevance to a successful installation
For more information, see preinstall diagnostics. To run the preinstall diagnostics tool, download the latest version, and run:
# The last two flags are needed only if the diagnostics image is hosted in a private registry
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
--domain "${CONTROL_PLANE_FQDN}" \
--cluster-domain "${CLUSTER_FQDN}" \
--image-pull-secret "${IMAGE_PULL_SECRET_NAME}" \
--image "${PRIVATE_REGISTRY_IMAGE_URL}"
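The <platform> placeholder above has to be replaced by hand. As a minimal sketch, the helper below derives a suffix from `uname` output. It assumes the downloaded binaries follow the `<os>-<arch>` naming seen in the `darwin-arm64` example later in this guide. The function name `diagnostics_platform` is hypothetical and not part of the tool.

```shell
# Map `uname -s` and `uname -m` output to a "<os>-<arch>" suffix such as "darwin-arm64".
# Assumption: the binary names follow this pattern; check the actual download names.
diagnostics_platform() {
  os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$2" in
    x86_64|amd64) arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    *) echo "unsupported architecture: $2" >&2; return 1 ;;
  esac
  printf '%s-%s\n' "$os" "$arch"
}

# Usage: PLATFORM=$(diagnostics_platform "$(uname -s)" "$(uname -m)")
```

The resulting value can then be substituted into the binary name, for example `./preinstall-diagnostics-${PLATFORM}`.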
In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.
#Save the image locally
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
#Load the image to the organization's registry
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG}
docker push ${CLIENT_IMAGE_AND_TAG}
Run the binary with the --image parameter to specify the diagnostics image to use:
chmod +x ./preinstall-diagnostics-darwin-arm64 && \
./preinstall-diagnostics-darwin-arm64 \
--domain ${CONTROL_PLANE_FQDN} \
--cluster-domain ${CLUSTER_FQDN} \
--image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
--image ${PRIVATE_REGISTRY_IMAGE_URL}
NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see Installing Helm. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the Helm binary.
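The version requirement can be checked before starting the installation. The sketch below is one way to do it. The function name `helm_version_ok` is hypothetical, and the function only compares the major and minor parts of a version string such as the one printed by `helm version --short`.

```shell
# Return 0 when a version string such as "v3.14.2+g1234567" is 3.14 or later.
helm_version_ok() {
  ver=${1#v}            # drop a leading "v"
  ver=${ver%%[+-]*}     # drop build metadata or pre-release suffixes
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }
}

# Usage: helm_version_ok "$(helm version --short)" || echo "Helm 3.14 or later is required"
```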
To ensure a successful installation, it is recommended to use a Kubernetes user with the cluster-admin role. For more information, see Using RBAC authorization.
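A quick way to approximate this check is to ask Kubernetes whether the current user may perform any action on any resource. The sketch below wraps `kubectl auth can-i` in a hypothetical helper, `has_cluster_wide_access`. A "yes" answer indicates cluster-wide permissions, but it does not prove that the user is bound to the cluster-admin role specifically.

```shell
# Return 0 when the current kubeconfig user may perform any verb on any resource.
# Note: this approximates cluster-admin; it does not inspect role bindings.
has_cluster_wide_access() {
  answer=$(kubectl auth can-i '*' '*' --all-namespaces 2>/dev/null) || true
  [ "$answer" = "yes" ]
}

# Usage: has_cluster_wide_access || echo "Current user lacks cluster-wide permissions"
```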
Follow the steps below to add a new cluster.
Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.
If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.
In the NVIDIA Run:ai platform, go to Resources
Enter a unique name for your cluster
Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
Installing NVIDIA Run:ai cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
Follow the installation instructions and run the commands provided on your Kubernetes cluster.
The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.
Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
Follow the steps below to add a new cluster.
Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.
If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.
In the NVIDIA Run:ai platform, go to Resources
Enter a unique name for your cluster
Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
Installing NVIDIA Run:ai cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
Follow the installation instructions and run the commands provided on your Kubernetes cluster.
On the second tab of the cluster wizard, when copying the helm command for installation, use the pre-provided installation file instead of the helm repositories:
Do not add the helm repository and do not run helm repo update.
Instead, edit the helm upgrade command.
Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.
Add --set global.image.registry=<DOCKER REGISTRY ADDRESS>, where the registry address is as entered in the preparation section.
Add --set global.customCA.enabled=true, as described here.
The command should look like the following:
helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
--set controlPlane.url=... \
--set controlPlane.clientSecret=... \
--set cluster.uid=... \
--set cluster.url=... --create-namespace \
--set global.image.registry=registry.mycompany.local \
--set global.customCA.enabled=true
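The edits described above can also be applied mechanically. The sketch below is a hypothetical helper, `airgap_helm_command`, that takes the `helm upgrade` command copied from the wizard (without the `helm repo` lines), a chart version, and a registry address, and prints the adjusted command. It assumes the copied command references the chart as `runai/runai-cluster` exactly once and does not end with a line continuation.

```shell
# Rewrite a copied "helm upgrade" command for an air-gapped installation.
# $1: helm upgrade command from the wizard, $2: chart version, $3: registry address
airgap_helm_command() {
  # Point the chart reference at the local .tgz file instead of the helm repository.
  rewritten=$(printf '%s\n' "$1" | sed "s|runai/runai-cluster|runai-cluster-$2.tgz|")
  # Append the registry and custom CA settings.
  printf '%s --set global.image.registry=%s --set global.customCA.enabled=true\n' \
    "$rewritten" "$3"
}

# Usage: airgap_helm_command "$COPIED_COMMAND" "$VERSION" registry.mycompany.local
```

The function only prints the command, so the result can be reviewed before it is run.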
The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.
Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
Follow the steps below to add a new cluster.
Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.
If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.
In the NVIDIA Run:ai platform, go to Resources
Enter a unique name for your cluster
Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
Installing NVIDIA Run:ai cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
Follow the installation instructions and run the commands provided on your Kubernetes cluster.
The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.
Air-gapped
When creating a new cluster, select the OpenShift target platform.
Follow the steps below to add a new cluster.
Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.
If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.
In the NVIDIA Run:ai platform, go to Resources
Enter a unique name for your cluster
Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
Installing NVIDIA Run:ai cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
Follow the installation instructions and run the commands provided on your Kubernetes cluster.
On the second tab of the cluster wizard, when copying the helm command for installation, use the pre-provided installation file instead of the helm repositories:
Do not add the helm repository and do not run helm repo update.
Instead, edit the helm upgrade command.
Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.
Add --set global.image.registry=<DOCKER REGISTRY ADDRESS>, where the registry address is as entered in the preparation section.
Add --set global.customCA.enabled=true, as described here.
The command should look like the following:
helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
--set controlPlane.url=... \
--set controlPlane.clientSecret=... \
--set cluster.uid=... \
--set cluster.url=... --create-namespace \
--set global.image.registry=registry.mycompany.local \
--set global.customCA.enabled=true
The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.
If you encounter an issue with the installation, try the troubleshooting scenario below.
If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
If the NVIDIA Run:ai cluster installation completed, but the cluster status did not change to Connected, check the cluster troubleshooting scenarios.